1 Introduction

The Sloan Digital Sky Survey (SDSS) provides vast information about the universe, including three-dimensional maps, deep multi-colour images of one-third of the sky, spectra for more than three million astronomical objects, and records of all survey phases in the past, present, and future [9]. Data Release 15 is the third data release of the fourth phase of the SDSS and includes data taken through June 2017 [10]. Whilst many kinds of questions about the universe can be answered by querying the vast SDSS database, the scope of this paper is to propose Cognitive Pairwise Comparison Forward Feature Selection (CPC-FFS), an expert-knowledge judgment approach combined with the FFS algorithm, to rank and select features; Deep Learning (DL) is then applied to train models for astronomical object classification.

Machine learning techniques are increasingly applied to SDSS data. Feigelson and Babu [4] demonstrated various statistical and machine learning methods for astronomical studies with R applications. Bazarghan and Gupta [2] described the use of a Probabilistic Neural Network (PNN) for the automatic classification of about 5000 SDSS spectra into 158 spectral types of a reference library ranging from O-type to M-type stars. Buisson et al. [3] described the use of principal component analysis for feature extraction, and of Random Forest, k-Nearest Neighbours, SkyNet, Support Vector Machine (SVM), Minimum Error Classification (MEC), and Naïve Bayes (NB) to predict whether an object is real, based on a dataset of 27,480 objects, each consisting of three 51 × 51-pixel difference images (in the g, r, and i colour bands). Kim and Brunner [6] presented the use of a convolutional neural network for star-galaxy classification based on five photometric bands: u, g, r, i, and z. There appears to be a lack of research on human-centered methods, in which human decision-making is incorporated into machine learning, even though expert users and designers play significant roles in data modelling during the human-in-the-loop processes of data science projects.

As different combinations of features lead to different performance of machine learning methods, this study utilizes CPC as a human-centered method to select the features. Cognitive pairwise comparison (CPC) is the core technique of the PCNP [12,13,14] for determining the priorities of the feature choices based on experts' knowledge. The PCNP method was proposed to address the problems of the Analytic Hierarchy Process (AHP)'s paired ratio scale [8], which potentially produces misapplications; the use of the AHP remains controversial [7]. An overview of the proposed framework of the Cognitive Pairwise Comparison Forward Feature Selection with Deep Learning (CPC-FFS-DL) approach is presented in Fig. 1. The general steps are described in the rest of this paper and organized below.

  1. Data preparation: the data obtained from different tables of SDSS are divided into two parts: training data and test data, used respectively to build training models and to test the models previously constructed. The details are presented in Sect. 2.1.

  2. Data exploration: whilst insufficient features can produce an under-fit model, irrelevant features can bias the model, leading to low accuracy. The relevant features must be selected to construct a learning model after the data is explored and visualized. Some outliers may be identified and removed. The details are discussed in Sect. 2.2.

  3. CPC process: whilst the motivation for using CPC is presented in Sect. 3.1, the details of CPC are offered in Sect. 3.2. To elicit expert knowledge, the experts conduct a CPC evaluation, in which the CPC scores are stored in the form of a Pairwise Opposite Matrix (POM) to produce the ranks of the features.

  4. DL configuration process: parameter setting is a complex process in DL. After several trials, a good-enough setting is proposed to simplify the illustration of the use of CPC-FFS-DL. The details of the DL setting are illustrated in Sect. 3.3.

  5. CPC-FFS-DL process: after the POM and DL configuration are established, the CPC-FFS-DL algorithm, which applies the deep learning algorithm to the data by evaluating additional features based on the CPC ranks, is used to produce a promising DL training model. The details are discussed in Sect. 3.4.

  6. Prediction process: the testing data, i.e., data without target or class labels, are fed to the DL training model to produce the predicted results. If the testing data come with ground-truth values (or target labels), the prediction accuracy can be calculated. The details are discussed in Sect. 4.

  7. The conclusions and prospects of the proposed approach are presented in Sect. 5.

Fig. 1 Overview of the hybrid CPC-FFS-DL framework

The following sections present the details of the CPC-FFS-DL framework as applied to the astronomical object classification problem.

2 SDSS astronomical data

2.1 SDSS source data

The online platform of the Sloan Digital Sky Survey (SDSS) SkyServer DR15 offers a substantial amount of open astronomical data. SDSS provides an SQL search on the web to query the dataset, and the SQL database structures are documented on the schema browser website. Users can refer to the API and Tools information in the Programmer's Reference to use the SDSS web services to access the data.

Figure 2 presents the R code for an SQL query using the SDSS API web services to obtain the data used in this paper. Whilst there are three classes for the target variable, the value of the parameter s.class = ‘STAR’ in Fig. 2 is changed to ‘GALAXY’ and ‘QSO’ respectively. The ‘STAR’ and ‘GALAXY’ classes have 3500 records each, whilst the ‘QSO’ class has 3000 records; therefore, the sample data comprise 10,000 records. According to the hints on searching SkyServer, the “TOP <n>” SQL construct does not impose a deterministic ordering, meaning the “TOP <n>” objects may differ if the same query is performed again.

Fig. 2 SQL query in R for ‘STAR’ class
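To illustrate the retrieval step, a minimal R sketch of such a query is given below. The SkyServerWS SqlSearch endpoint, the CSV format flag, and the exact SELECT list are assumptions for illustration and may differ from the exact code in Fig. 2.

```r
# Hedged sketch: query SkyServer DR15 via its SQL search web service and
# read the CSV result. The endpoint URL and SELECT list are assumptions,
# not the verbatim code of Fig. 2.
base <- "http://skyserver.sdss.org/dr15/SkyServerWS/SearchTools/SqlSearch"
sql <- paste(
  "SELECT TOP 3500 p.objid, p.ra, p.dec, p.u, p.g, p.r, p.i, p.z,",
  "p.run, p.rerun, p.camcol, p.field, s.specobjid, s.class,",
  "s.z AS redshift, s.plate, s.mjd, s.fiberid",
  "FROM PhotoObj AS p JOIN SpecObj AS s ON s.bestobjid = p.objid",
  "WHERE s.class = 'STAR'"  # changed to 'GALAXY' and 'QSO' in turn
)
# Some service versions prepend a '#Table1' line; add skip = 1 if so.
star <- read.csv(paste0(base, "?cmd=", URLencode(sql, reserved = TRUE),
                        "&format=csv"))
```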

According to the SQL query in Fig. 2, 18 features are selected. The feature definitions are shown in Table 1. The variable class is the target label comprising GALAXY, QSO, and STAR. The rest are features potentially related to class prediction. Two identity variables, objid and specobjid, are not used for building a learning model; therefore, 15 features are used for creating a learning model.

Table 1 Definitions of data features

2.2 Data exploration and visualization

To understand the 15 features related to the astronomical object classification problem, the summary statistics and data visualizations are presented in Table 2 and Figs. 3 and 4 respectively. Regarding outliers and missing values, eight observations whose five filter-band values equal −9999 are removed. The descriptive statistics for the selected features are presented in Table 2. The rerun feature is not related to the classification problem as it takes a single constant value, 301.

Table 2 Descriptive statistics of sample data
Fig. 3 SDSS coverage for sample data

Fig. 4 Image instances of GALAXY, QSO, and STAR

To understand the GALAXY, QSO, and STAR classes, the coverage of the two features ra and dec in the sample data, produced with R ggplot2 [11], is presented in Fig. 3. The features ra and dec are related to the label class of astronomical object classification, as objects of the same class tend to be grouped together. Image instances for the three classes are presented in Fig. 4; the images are obtained from the getjpeg web method in the SkyServer web services. Browsing the images suggests that the five filter-band features are highly related to the astronomical object classification label class.

3 Cognitive pairwise comparison forward feature selection with deep learning

3.1 Feature selection challenges

Whilst there are several heuristic ways to combine features, domain experts may prefer to introduce the features one by one (or batch by batch) to build the machine learning models and observe how the results change accordingly. The motivation includes their preference to see how adding and evaluating new feature(s) changes the results, and their confidence in building a suitable model with fewer features but higher prediction accuracy. The Pairwise Opposite Matrix (POM) is a promising method to encode this implicit knowledge explicitly, rather than merely producing a ranking.

Whilst there are 15 features, the number of combinations of the features, N, is calculated below.

$$N = \sum_{i = 1}^{15} C_{i}^{15} = 32,767$$
(1)

The domain expert may consider u, g, r, i, and z to be used together as a batch of filters, and ra and dec to be used together as a pair of positions. Therefore, the number of features to be ranked is reduced to 10, and the number of combinations is calculated below.

$$N = \sum_{i = 1}^{10} C_{i}^{10} = 1,023$$
(2)

Although the number of combinations is significantly reduced, 1023 is still a large number. The proposed CPC-FFS for DL can address the problem of selecting features for a machine learning model such as DL.
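As a quick check, the subset counts in Eqs. (1) and (2) can be reproduced in R with the built-in choose function:

```r
sum(choose(15, 1:15))  # 32767 non-empty subsets of 15 features
sum(choose(10, 1:10))  # 1023 after grouping the filters and the position pair
```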

3.2 Cognitive pairwise comparison

The Cognitive Pairwise Comparison is a human-centered method serving as an interface for humans to interact with machine learning algorithms. The CPC is the core technique of the PCNP [12,13,14] to determine the ranks of features. The Pairwise Opposite Matrix (POM) is used to interpret the priority of each feature. Let an ideal utility set be \(V=\left\{{v}_{1},\dots ,{v}_{n}\right\}\), and let a comparison score representing the difference between two feature priorities be \({b}_{ij}\cong {v}_{i}-{v}_{j}\). The ideal pairwise opposite matrix is \(\widetilde{B}=\left[{v}_{i}-{v}_{j}\right]\), whilst a subjective judgmental pairwise opposite matrix using paired interval scales is \(B=\left[{b}_{ij}\right]\). \(\widetilde{B}\) is estimated from \(B\) as follows:

$$\tilde{B} = \left[ {\tilde{b}_{ij} } \right] = \left[ {v_{i} - v_{j} } \right] \cong \left[ {b_{ij} } \right] = B$$
(3)

The \({b}_{ij}\) is chosen from the paired rating scale \(\left\{-\frac{8}{\kappa },\dots ,-\frac{1}{\kappa },0,\frac{1}{\kappa },\dots ,\frac{8}{\kappa }\right\}\), representing {“extremely less important than”, …, “weakly less important than”, “equal to”, “weakly more important than”, …, “extremely more important than”}. The normal utility \(\kappa\) represents the mean of the priorities of the features. By default, \(\kappa\) is set to 8.

In this case, after the domain expert explores the data described in Sect. 2.2, a POM for comparing the astronomical features is constructed on the basis of the expert’s CPC evaluation. The POM results are presented in Table 3. For example, the score 5 for (ra, dec) vs. rerun means that (ra, dec) is 5 units more important than rerun; mathematically, \({v}_{(ra, dec) }-{v}_{rerun}=5\). A cognitive pairwise matrix B is verified by the Accordance Index (AI) as below.

Table 3 POM, priorities, and ranks of astronomical features for astronomical object classification
$$AI=\frac{1}{{n}^{2}}{\sum }_{i=1}^{n}{\sum }_{j=1}^{n}\sqrt{\frac{1}{n}{\sum }_{p=1}^{n}{\left(\frac{\left({b}_{ip}+{b}_{pj}-{b}_{ij}\right)}{\kappa }\right)}^{2}}$$
(4)

\(AI\ge 0\), and the normal utility \(\kappa\) is equal to 8 by default. If \(AI=0\), then B is perfectly accordant; if \(0<AI\le 0.1\), B is acceptable; if \(AI>0.1\), B may need to be revised. The accordance index for the POM in Table 3 is 0.095, which is within the acceptable range.

To derive the priorities, the Row Average plus the normal Utility (RAU) operator below is used for prioritization.

$${v}_{i}=\left(\frac{1}{n}{\sum }_{j=1}^{n}{b}_{ij}\right)+\kappa ,\forall i\in \left\{1,\dots ,n\right\}$$
(5)

The individual priorities \({v}_{i}\) from the POM are rescaled into a normalized priority vector by the rescale (normalization) function below.

$${w}_{i}=\frac{{v}_{i}}{n\kappa },\forall i\in \left\{1,\dots ,n\right\}$$
(6)

The priorities and ranks for all features are presented in Table 3. To illustrate the calculation, the priority of (ra, dec) is calculated as below.

$${v}_{1}=\frac{\left(0+\left(-6\right)+1+5+3+2+\left(-3\right)+2+3+3\right)}{10}+8=9$$

Thus, the relative priority of (ra, dec) is calculated as below.

$${w}_{1}=\frac{9}{10\times 8}=0.1125$$

The priorities of the rest of the features are calculated by the same methods as demonstrated above.
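To make the computation concrete, the R sketch below implements Eqs. (4)–(6). The 10 × 10 POM of Table 3 is not reproduced here, so a small hypothetical POM is used purely for illustration.

```r
# Minimal sketch of the CPC prioritization pipeline, assuming a POM is
# supplied as a square numeric matrix B with b_ij = -b_ji.

rau_priority <- function(B, kappa = 8) {
  # Eq. (5): row average plus the normal utility
  rowMeans(B) + kappa
}

normalize_priority <- function(v, kappa = 8) {
  # Eq. (6): rescale to a normalized priority vector
  v / (length(v) * kappa)
}

accordance_index <- function(B, kappa = 8) {
  # Eq. (4): mean root-mean-square deviation over all triads (i, p, j)
  n <- nrow(B)
  total <- 0
  for (i in 1:n) for (j in 1:n) {
    dev <- (B[i, ] + B[, j] - B[i, j]) / kappa  # b_ip + b_pj - b_ij over all p
    total <- total + sqrt(mean(dev^2))
  }
  total / n^2
}

# Hypothetical 3-feature POM (not the Table 3 values)
B <- matrix(c( 0,  2,  5,
              -2,  0,  3,
              -5, -3,  0), nrow = 3, byrow = TRUE)
v  <- rau_priority(B)        # ideal utilities
w  <- normalize_priority(v)  # normalized priorities
ai <- accordance_index(B)    # accordance check: acceptable if ai <= 0.1
rank(-w)                     # higher priority -> better rank
```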

3.3 Deep learning configuration

The CPC-FFS-DL algorithm (Algorithm 1) includes two major components: the feature ranks evaluated by an expert using CPC, presented in the previous subsection, and the deep learning model configured, built, and tested according to the feature ranks. Deep learning is an Artificial Neural Network (ANN) with deep layers, where each layer consists of neurons. The word “deep” is not related to a deeper understanding of the mind in psychology or in the common sense; it simply refers to the depth of the ANN mathematical model.

Keras is a Python deep learning library that runs on top of TensorFlow, CNTK, or Theano and enables fast experimentation [5]. This research applies the Keras package (Version 2.15.0, 2024-04-19) in the R language (Version 4.4.0, 2024-04-24) [1] as an interface to Keras (Version 3.3.3) on top of TensorFlow (Version 2.16.1) in the Python language (Version 3.12.3, 2024-04-09). Different versions may yield slightly different performance. Whilst there are many possible configurations of deep learning models, a multi-layer perceptron is a good-enough starting point for CPC-FFS-DL to address the proposed astronomical object classification problem in this study.

Figure 5 presents the R code implementing the fully connected multi-layer perceptron, with the configuration obtained after several trials of manual parameter tuning. Figure 6 demonstrates the depth of the DL model with eight inputs: u, g, r, i, z, redshift, ra, and dec. As the number of features increases, the number of neurons in the first layer of the network increases accordingly. There are 11,587 parameters to be tuned in the model; compared with a classical neural network with few parameters, this network is called “deep”. The definitions of the functions and configurations can be found in the Keras documentation [5].

Fig. 5 Configuration of multi-layer perceptron model using Keras in R code
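A minimal sketch of such a configuration is given below. The layer widths, dropout rates, and optimizer are illustrative assumptions and do not necessarily match the tuned values shown in Fig. 5.

```r
# Minimal multi-layer perceptron in R Keras for the three-class problem.
# Layer widths, dropout rates, and the optimizer are illustrative, not
# the tuned configuration of Fig. 5.
library(keras)

n_features <- 8L  # e.g., u, g, r, i, z, redshift, ra, dec

model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu",
              input_shape = n_features) %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 3, activation = "softmax")  # GALAXY, QSO, STAR

model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(),
  metrics = "accuracy"
)
# history <- model %>% fit(x_trn, y_trn, epochs = 500, validation_split = 0.1)
```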

Fig. 6 Summary of a DL instance of eight inputs (u, g, r, i, z, redshift, ra, and dec)

3.4 CPC-FFS-DL algorithm

Forward Feature Selection (FFS) is a greedy approach that iteratively evaluates a new feature against a set of selected features and adds the new feature if there is an improvement. The problem with basic FFS is that the order of the features has a significant impact on model construction and prediction performance. Incorporating human knowledge via CPC is a promising way to approach the optimal model performance. The CPC approach is applied to rank the features for use in FFS; the domain expert is assumed to have sufficient knowledge to rank them. The CPC encodes the expert's preferences in the POM data structure, denoted by B, and reveals how the features are compared and ranked.

Algorithm 1 CPC-FFS-DL

Algorithm 1 presents the calculation steps of the cognitive pairwise comparison forward feature selection with DL, denoted by \(CFD\left(X,B,\kappa ,DL,\psi ,\varrho \right)\). Whilst the CPC component is shown in Steps 1–2, FFS with DL is shown in Steps 4–5. X is the data table with m rows by n + 1 columns, including one column for the target variable. X is split into training data \({X}_{trn}\) and testing data \({X}_{tst}\). B is a POM and \(\kappa\) is the normal utility. DL is the configured deep learning model presented in Sect. 3.3. \(DL\left({X}_{trn}\left[{\beta }^{+}\right]\right)\) is the DL training model built from the training data \({X}_{trn}\) with the selected feature index set \({\beta }^{+}\). \(\psi\) is a measure function to evaluate the performance of the predicted results, e.g., accuracy. \(\varrho\) is the baseline for a significant improvement. \(\psi \left(DL\left({X}_{trn}\left[{\beta }^{+}\right]\right),{X}_{tst}\right)\) yields the performance score of the classification results on the testing data \({X}_{tst}\) under the DL training model \(DL\left({X}_{trn}\left[{\beta }^{+}\right]\right)\). The demonstration of the use of Algorithm 1 is presented in Sect. 4.
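As an illustration of Steps 4–5, the R sketch below outlines the selection loop. The helpers train_dl and psi are assumed wrappers around the configured DL model of Sect. 3.3 and the measure function; this is a sketch, not the verbatim code of Algorithm 1.

```r
# Hedged sketch of the FFS loop (Steps 4-5 of Algorithm 1). `ranked` holds
# the feature indices ordered by the CPC priorities (Steps 1-2, Table 3);
# train_dl() and psi() are assumed, hypothetical wrappers.
cpc_ffs_dl <- function(X_trn, X_tst, ranked, rho = 0.01) {
  beta_plus <- integer(0)  # selected feature index set
  best <- 0
  for (f in ranked) {
    candidate <- c(beta_plus, f)
    model <- train_dl(X_trn, candidate)    # DL(X_trn[beta+]) -- assumed wrapper
    score <- psi(model, X_tst, candidate)  # performance on the testing data
    if (score - best >= rho) {             # keep f only on significant improvement
      beta_plus <- candidate
      best <- score
    }
  }
  list(features = beta_plus, performance = best)
}
```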

4 Simulations and discussions

4.1 Background and random selections

As described in Sect. 2, 10,000 records have been obtained from SDSS as the sample data. After the 8 outliers or missing values are removed, 9992 records are used for this simulation. A random number function, runif, with a seed of 999 is used to randomize the order of the sample data retrieved by the SQL query. The first 9000 records are chosen as the training data (\({X}_{trn}\)), whilst the remaining 992 records are chosen as the testing data (\({X}_{tst}\)). The data consist of 15 features, whose indices are defined in Table 4 according to the ranks in Table 3. The scaling method is standardization. As all instances of the rerun feature have the same value, its rescaled values are NA owing to division by zero (zero standard deviation). However, the R Keras package on top of TensorFlow can still handle the NA values using the default settings, which may lead to lower testing accuracy.
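The following R sketch outlines the split and scaling, under the assumption that X is the 9992-record data frame with a class column; the variable names are illustrative.

```r
# Hedged sketch of the split and standardization described above.
set.seed(999)
X <- X[order(runif(nrow(X))), ]  # randomize the row order
X_trn <- X[1:9000, ]
X_tst <- X[9001:nrow(X), ]       # the remaining 992 records

feat <- setdiff(names(X), "class")
mu  <- sapply(X_trn[feat], mean)
sdv <- sapply(X_trn[feat], sd)   # sd(rerun) = 0, so rerun rescales to NaN
X_trn[feat] <- scale(X_trn[feat], center = mu, scale = sdv)
X_tst[feat] <- scale(X_tst[feat], center = mu, scale = sdv)
```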

Table 4 Simulation results using random feature selection

To illustrate the challenges of feature selection, Table 4 presents seven cases with random feature selections over raw or scaled training data used to train the deep learning model, with accuracy evaluated on the testing data. Accuracy is calculated as the number of successfully classified cases over the total number of classification cases. It is observed that some features contribute a significant accuracy increment, whilst others have a negative impact; the combinations influence the overall accuracy. As indicated by Eq. (1), it is challenging to find the best or optimal feature combination among 32,767 possibilities. Although it has limitations, forward feature selection is one possible way to address this issue.

4.2 Basic forward feature selection with DL

Forward feature selection is one of the popular feature selection methods in feature engineering. The major steps of basic FFS without CPC are Steps 4–5 of Algorithm 1. The training data with the selected features are used to train the DL model, whilst the testing data with the selected features are used for benchmarking. Table 5 shows that different feature orders in the dataset lead to different combinations of selected features and hence different accuracy values, with whether the data is scaled acting as a further factor.

Table 5 Simulation results using forward features selection without CPC order

To understand how features are selected from sequences with different feature orders, Table 6 shows how the accuracy (%) changes with FFS on raw data, whilst Table 7 shows the accuracy changes on scaled data. The significant improvement \(\varrho\) is set to 1%, meaning that a new feature is included only if it contributes at least 1% to the accuracy. For example, Feature Sequence Index (FSI) 1 for Case 1 in Table 6 corresponds to Feature 15 for Case 1 in Table 5, which refers to rerun in Table 4. If FSI 2, i.e., mjd (Feature 14), is added to the DL training model, the testing accuracy does not improve, and thus mjd is not included. Similarly, when FSI 3 is added to the DL training model, the testing accuracy shows no improvement, and FSI 3 is also not included. When FSI 4 (fiberid) is added and yields more than a 1% improvement, the new feature is included.

Table 6 Accuracy (%) changes using forward feature selection with raw data
Table 7 Accuracy (%) changes using forward feature selection with scaled data

The order of the feature sequence and the data transformation have significant impacts on the feature selection results, so the problem of how to order the features must be addressed. By utilizing expert knowledge, the human-centered forward feature selection with CPC, introduced in the next subsection, is a promising solution.

4.3 Human-centered forward feature selection with CPC and DL

For the CPC-FFS-DL algorithm shown in Algorithm 1, the objective of the simulations is to observe changes in the accuracy when adding one feature or a batch of features according to the feature ranks evaluated by an expert using CPC. Table 8 presents the simulation results for the forward feature selection based on the CPC ranks shown in Table 3. The features with higher preference are selected first.

Table 8 Simulation results using CPC batch forward features selection

For the first six simulation cases in Table 8, the data for the same setting of the deep learning model come in two forms: raw data and scaled data. Regarding the data transformation, standardization is applied to obtain the scaled data. Different configurations are very likely to result in different training times and accuracies. After several trials, the tuned configuration of the DL model is presented in Fig. 5. When the accuracy and loss have converged, the DL training model is good enough to be used. Figure 7 presents the convergence of an instance of the DL training model. Whilst the maximum number of iterations is set to 500 in this experiment, the loss and accuracy start to converge after 300 iterations.

Fig. 7 Convergence of training iterations

Table 9 presents the selection results with \(\varrho\) of 0.2% and a pure element-by-element approach, instead of combining related features into batches such as (u, g, r, i, z) and (ra, dec). The results in Table 9 are quite similar to those in Table 8. If the DL model code is run again using Keras, the results may be slightly different even with the same seeds, as DL is a heuristic approach.

Table 9 Accuracy (%) changes using CPC element-by-element forward feature selection

To conclude the simulation results in Table 8: when the features of the raw data are added one by one according to the CPC rank order, the accuracy increases whilst redshift and (ra, dec) are added to (u, g, r, i, z), but decreases whilst the rest of the features are continuously added. This suggests that the ranking results from the CPC are reasonable. If the significant improvement \(\varrho\) is set to 1% for the CPC-FFS-DL algorithm with batch selection, eight features are chosen to build a robust DL model regardless of whether the raw data are scaled, although there may be a slight improvement when an additional feature is added, whilst some additional features may lead to lower accuracy.

On the other hand, if the data is scaled by standardization, the accuracy increases further as more scaled features are included, until rerun is added. The main reason is that rerun is a single value of 301 (a constant) in the sample data (Table 2) and becomes a null value after standardization. When a feature with NA values is fed into the proposed DL model, the performance of the model may be distorted. In summary, all features except rerun can be used to build the DL model when the raw data are standardized.

Concerning the random feature selections in Table 4 and the forward feature selection with initial random feature sequences in Table 5, it appears that none of the combinations with scaled data achieves better accuracy than the CPC method shown in Table 8 (98.9%). It is concluded that the order of the feature sequence influences the selected combinations, which in turn determines the performance of the deep learning model, and that expert evaluation of the selection order using CPC can benefit the model's prediction performance. If all features are chosen blindly, inappropriate features are included, and the accuracy could be only 37.1% (subject also to the DL design), as indicated in Table 8. To address this issue, the proposed CPC can be a promising method to support domain experts in making better feature selection decisions.

4.4 Comparisons

To compare with existing approaches, the Recursive Feature Elimination (RFE) function of the caret R package [15] is selected. RFE is a feature selection method built on existing ML algorithms. Whilst a complete and exhaustive comparison is not the focus of this study, several popular machine learning algorithms are selected for RFE: C5.0 using the C5.0 function, k-Nearest Neighbors using the knn function, Naive Bayes using the nb function, Classification and Regression Trees (CART) using the rpart function, Regularized Logistic Regression using the glmnet function, and Support Vector Machine with a linear kernel using the svmLinear function. The caret package depends on other packages, for which readers may refer to the related documentation [15]. As some ML functions cannot handle NA values, the scaled value of the rerun feature is set to 1.

The train and rfe functions in the caret R package automatically tune the parameters for each selected machine learning method. Compared with the proposed CPC-FFS, which benefits from proper expert involvement, the RFE method requires more computational resources, as it performs a more exhaustive search over more feature subsets and more data partitions to achieve higher accuracy. Five-fold cross-validation is used to train the models. Table 10 presents the simulation results of feature selection and accuracy using recursive feature elimination with the five machine learning models. Generally, the accuracy is slightly lower than that of the proposed method shown in Case 3 of Table 8.
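For reference, a minimal sketch of the RFE setup is given below, assuming the standardized training data X_trn with target column class; the control settings follow the five-fold cross-validation described above, whilst the other options are illustrative.

```r
# Hedged sketch of recursive feature elimination with caret.
library(caret)

ctrl <- rfeControl(functions = caretFuncs, method = "cv", number = 5)
rfe_fit <- rfe(
  x = X_trn[, setdiff(names(X_trn), "class")],
  y = factor(X_trn$class),
  sizes = 1:15,        # candidate feature-subset sizes
  rfeControl = ctrl,
  method = "rpart"     # e.g., CART; swap in "C5.0", "knn", "nb", "glmnet", "svmLinear"
)
predictors(rfe_fit)    # the selected features
```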

Table 10 Simulation results using recursive feature elimination with five machine learning models

5 Conclusions and future study

Whilst selecting the best combination of features among many possible combinations is challenging, this paper proposes Cognitive Pairwise Comparison Forward Feature Selection with Deep Learning (CPC-FFS-DL) for astronomical object classification. The data are obtained from the SDSS SkyServer SQL web service. The CPC approach is used to rank the feature choices according to the domain expert's preferences, and the DL models are built according to the feature choices ranked by the expert using CPC. The DL model achieves very high accuracy for astronomical object classification: 97% using eight features (u, g, r, i, z, redshift, ra, dec) with raw data, and 98.9% using six additional features (run, field, plate, fiberid, camcol, mjd) with scaled data. Without proper expert involvement, the accuracy could be as low as 37.1%. The comparisons demonstrate that the proposed method is promising.

As the rating scores from the domain experts significantly influence the results, the CPC is demonstrated to be a promising tool for explicitly capturing the implicit knowledge of domain experts; the limitation is that the model results depend on how knowledgeable the engaged domain expert is. If the data is standardized, adding more features may increase the accuracy further, but at the cost of increased computational workload.

This study provides an initial demonstration of the use of CPC for feature selection applied to deep learning for insights in astrophysics. Forward Feature Selection with CPC can be applied not only to the deep learning model but also to other supervised classification approaches, such as C5.0, CART, Random Forest, Support Vector Machine, and k-Nearest Neighbors, which will be investigated in future studies. Building on the application of CPC-FFS-DL to astronomical object classification, the proposed approach can be applied to other astronomical classification problems with other features from the SDSS or other open platforms. More broadly, CPC-FFS-DL can be extended to other areas of human-centered data science.