1 Introduction

ALICE (A Large Ion Collider Experiment) [1] is one of the four major detectors located at the Large Hadron Collider at CERN [2]. The main goal of ALICE is to study the properties of the quark–gluon plasma (QGP), a hot and dense state of matter, and the strong force that binds quarks together inside hadrons.

2 Related work

2.1 Missing data sources

The incomplete data problem in statistical and machine learning (ML) models has been widely studied over decades [13, 14]. In [15], the authors highlight three main categories of missing data according to the source of the information gaps. The first case occurs when values are missing completely at random (MCAR), i.e., the probability of a value being missing is independent of all parameters. If this probability depends only on the values of the observed attributes, the data is missing at random (MAR). Finally, if this probability depends on the value of the missing parameter itself, we refer to this case as missing not at random (MNAR).

In the case of the data used for particle identification at the ALICE experiment, we can distinguish two main sources of missing data. For some examples, the lack of a particular value – the signal from one of the detectors – is caused by its temporary malfunction, independent of the measured particle. In this case, the signal is missing completely at random (MCAR). For other examples, we might lack measurements of values that fall outside the effective region of a particular detector. For example, the efficiency of the TOF detector measurements drops significantly for particles with transverse momentum below 0.5 \(\mathrm {Ge\hspace{-1.00006pt}V/}c \). In such cases, the data is missing not at random (MNAR).

2.2 Machine learning with incomplete data

Several popular machine learning algorithms, such as neural networks, cannot be directly trained with missing data. Therefore, different solutions to this problem exist, which can be roughly divided into two groups: (1) those that modify the training dataset [16, 17] and (2) those that adapt ML architectures [18,19,20,21]. Data treatment methods transform datasets with missing values into complete ones that can be used with standard machine learning architectures. Adapted architectures are capable of processing incomplete data without any alteration.

The simplest method of data transformation is known as case deletion, where examples without full data availability are simply removed from the training set. This limits the potential training capabilities of the model and can introduce bias, since part of the original data distribution is missing. The alternative technique, known as imputation, fills the missing data with artificial values calculated as a mean, median, or a value predicted with a simple model, e.g. linear regression. This idea is further extended in [22], where the imputation model is optimized jointly with the regressor. Similarly to the case deletion approach, imputation methods can significantly disturb the predictions of the neural model, which might result in unrealistic behavior.
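To make the two imputation baselines concrete, the following is a minimal sketch using scikit-learn; the toy matrix is illustrative and does not correspond to the actual detector features:

```python
import numpy as np
from sklearn.impute import SimpleImputer
# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy feature matrix; NaN marks a missing detector signal.
X = np.array([
    [0.42, np.nan, 1.30],
    [0.51, 0.98, np.nan],
    [0.35, 1.02, 1.10],
])

# Mean imputation: replace each NaN with the column mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Regression-based imputation: model each feature from the others.
X_reg = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)
```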

To overcome those shortcomings, methods based on adjusting the model were proposed. The most straightforward approach, suitable when missing data affects only a few attributes, is an ensemble of different classifiers. In such a case, different models are trained on subsets of the training dataset without missing data. In particular, in neural network reduction [21], the authors propose to split the dataset into the largest possible complete subsets.

There are several domains where machine learning with incomplete data is already employed. In [23], the authors use a recurrent neural network to model healthcare data with missing values. A similar idea was employed in medical applications such as breast cancer prediction [24]. Machine learning algorithms with missing data are also used in other domains, such as traffic prediction [25, 26].

2.3 Particle identification techniques

As mentioned in the introduction, the currently employed PID techniques depend on human-defined selection criteria on the response signals from the detectors used in the analysis. In most cases, the so-called “\(\textrm{n}_{\sigma }\)” method is used, whose main parameter is the number of standard deviations by which the signal a particle left in the detector differs from the value expected for a given particle species. For example, if both the TPC and TOF detectors are used, a common approach is to define a PID selection as

$$\begin{aligned} \sqrt{\hbox {n}_{\sigma ,\textrm{TPC}}^{2}+\hbox {n}_{\sigma ,\textrm{TOF}}^{2}}<\lambda . \end{aligned}$$
(1)

The actual cut-off value \(\lambda \), however, depends on the specific analysis and its requirements for the purity and efficiency of the sample. Typically, this value is set between 2 and 3. This method can be further modified by adding additional conditions, for instance, on the rejection of other types of particles. In such a case, if one analyzes, e.g., charged kaons, one may provide the acceptance criterion for the kaon hypothesis (i.e. how far in terms of \(\mathrm n_{\sigma }\) the signal can be from the expected value for kaons) as well as rejection criteria for the pion, proton, and electron hypotheses (i.e. the minimum \(\mathrm n_{\sigma }\) that the signal must have compared to the expected values for pions, protons, and electrons).
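For illustration, the selection of Eq. (1), together with a rejection-style kaon cut, could be sketched as follows; the function names and thresholds are assumptions for illustration, not the actual analysis code:

```python
import numpy as np

def tpc_tof_selection(n_sigma_tpc, n_sigma_tof, cut=3.0):
    """Combined TPC+TOF n-sigma selection, Eq. (1)."""
    return np.sqrt(n_sigma_tpc**2 + n_sigma_tof**2) < cut

def kaon_selection(ns_kaon, ns_pion, ns_proton, ns_electron,
                   accept=3.0, reject=3.0):
    """Accept the kaon hypothesis while rejecting competing hypotheses."""
    return ((np.abs(ns_kaon) < accept)
            & (np.abs(ns_pion) > reject)
            & (np.abs(ns_proton) > reject)
            & (np.abs(ns_electron) > reject))
```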

Outside of the ALICE experiment, several attempts have been made to use neural networks for the PID. In the LHCb experiment [9], shallow neural networks are used to classify calorimeter signals. Similarly, in ATLAS, neural models are used to identify electrons [10], while the CMS Collaboration identifies \(\tau \) lepton decays with deep neural networks [11].

3 Attention-based neural network architecture for particle identification

Fig. 2 The proposed model architecture. Layered blocks are applied separately to each vector in a set; single blocks are applied to their input as a whole

In this work, we focus on the problem of PID in which a type of particle is assigned to a data sample based on its characteristics recorded by a set of detectors. We treat particle identification as a set of binary classification tasks in a so-called one vs. all approach. Given a set of examples, each labeled with one of \(k\) particle types, the task is to find \(k\) binary classifiers corresponding to each particle type. Each classifier is tasked with identifying whether or not a particle, represented by a given example, belongs to the specific particle type. Every example consists of a set of real-valued features, based on measurements from a detector, out of which some might be missing.

With this formulation, we introduce a novel method for PID that makes use of examples with missing data. We base our approach on the attention mechanism, similar to the method introduced for a medical use case in AMI-Net [12]. Our system is composed of several steps. We first encode all data examples into sets of feature-value pairs [27], regardless of which values are missing, and transform them into embeddings. These embeddings are further processed by the Transformer’s [28] encoder module with multi-head attention. The encoder output is combined into a single feature vector that serves as the input to the final classifier. An overview of our system is shown in Fig. 2; in the following subsections, we describe its building blocks in detail.

3.1 Feature-value pairs

We prepare the incomplete data for the model by creating sets of feature-value pairs from the incomplete vectors. Each pair consists of a non-missing value in the vector and the index of that value. We apply one-hot encoding to these indices to aid the model in processing them. An example is shown in Table 1.

Table 1 An example of the preprocessing of data samples into sets of feature-value pairs
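A minimal sketch of this preprocessing, assuming missing values are encoded as NaN (the toy vector is illustrative):

```python
import numpy as np

def to_feature_value_pairs(x):
    """Turn an incomplete vector into a set of [one-hot(index), value]
    vectors, keeping only the observed (non-NaN) entries."""
    d = len(x)
    pairs = []
    for i, v in enumerate(x):
        if not np.isnan(v):
            one_hot = np.zeros(d)
            one_hot[i] = 1.0
            pairs.append(np.concatenate([one_hot, [v]]))
    return np.stack(pairs)  # shape: (n_observed, d + 1)

pairs = to_feature_value_pairs(np.array([0.42, np.nan, 1.30]))
# -> [[1, 0, 0, 0.42], [0, 0, 1, 1.30]]
```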

3.2 Embedding

An embedding is a continuous vector representation of discrete data. Embedding the feature indices lets the model exploit relations between features more effectively by placing similar features close to each other in the embedding space. We create embeddings by applying a neural network with a single hidden layer to the feature-value vectors.
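A PyTorch sketch of such an embedding network; the hidden and output sizes below are placeholders, the actual values being hyperparameters (Table 3):

```python
import torch.nn as nn

class FeatureValueEmbedding(nn.Module):
    """Single-hidden-layer network mapping a feature-value vector
    (one-hot index concatenated with the value) to an embedding."""
    def __init__(self, n_features, d_hidden=64, d_model=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features + 1, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, pairs):   # pairs: (n_observed, n_features + 1)
        return self.net(pairs)  # (n_observed, d_model)
```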

3.3 Transformer encoder

The Transformer is an attention-based architecture commonly used for sequence transduction. As the Transformer processes sequences of arbitrary length, it can naturally be adapted to learn relations between the available features regardless of the number of missing values. We apply the Transformer’s encoding module to the set of embedding vectors to connect the different features each vector represents.

The encoder consists of a stack of N identical layers. Each layer is made of two sub-layers: a multi-head attention layer and a dense neural network. The multi-head attention is applied to the input set as a whole, while the NN is applied to each vector in a set separately. To ease the training of the encoder, we apply a residual connection around each sub-layer, followed by layer normalization.
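This structure matches the standard Transformer encoder, so a sketch can rely on PyTorch's built-in module; the sizes below are placeholders rather than the values from Table 3:

```python
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=32, nhead=4, dim_feedforward=64,
    dropout=0.1, batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)  # N = 2

# embeddings: (batch, n_observed, d_model); a key padding mask can flag
# positions that do not correspond to observed features:
# encoded = encoder(embeddings, src_key_padding_mask=pad_mask)
```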

3.3.1 Multi-head attention

To connect pairs of vectors in the input set, each input vector is linearly transformed into \(h\) query, key, and value vectors, and scaled dot-product attention is applied to each of the \(h\) triples of query, key, and value sets, where \(h\) is the number of heads. This allows specific patterns to be found in pairs of input vectors. For example, a measurement from a specific detector could be used only if the momentum is within a particular range. As each of the \(N\) layers in the encoder contains a multi-head attention sub-layer, and each sub-layer connects pairs of vectors, the full encoder can theoretically connect subsets of up to \(2^N\) vectors.

3.3.2 Scaled dot-product attention

Given sets of query, key, and value vectors of dimension \(d_k\), scaled dot-product attention calculates a set of weighted averages of the value vectors based on the similarities between the corresponding keys and queries. The similarities are obtained by computing the dot product of the query and key vectors. The dot products are scaled by a factor of \(\frac{1}{\sqrt{d_k}}\) to counteract the increase of the dot product with an increasing dimension of the query, key, and value vectors. Finally, the softmax function is applied to the scaled dot products to obtain the weights used to calculate the averages of value vectors. The whole computation can be described as

$$\begin{aligned} \textrm{Attention}(Q, K, V) = \textrm{softmax}\left( \frac{QK^T}{\sqrt{d_k}}\right) V, \end{aligned}$$
(2)

where \(Q,K,V \in \mathbb {R}^{n \times d_k}\) are sets of \(n\) query, key and value vectors, respectively.
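Equation (2) translates directly into a few lines of PyTorch (a sketch; in practice this computation is wrapped by the multi-head module of Sect. 3.3.1):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Eq. (2): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5  # pairwise similarities
    weights = F.softmax(scores, dim=-1)          # each row sums to 1
    return weights @ V                           # weighted averages of values
```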

3.4 Attention pooling

As the classifier neural network cannot process an unordered, variable-size set of vectors, merging the vectors obtained from the Transformer’s encoder into a single vector is necessary. Therefore, we pool the output vectors together using the self-attention technique also employed in the Transformer, where each vector is assigned weights based on its own values. These weights are then used to calculate the weighted average of all the vectors in the set.

3.4.1 Self attention

Given a set of vectors \(\{v_1, v_2,..., v_n\}\), where \(v_i \in \mathbb {R}^{d_{model}}\), we obtain the pooled vector as follows:

$$\begin{aligned} E_{i, \star }&= \textrm{NN}(v_i) \quad \forall i \in [1,n] \end{aligned}$$
(3)
$$\begin{aligned} A_{{\star }, j}&= \textrm{softmax}(E_{{\star }, j}) \quad \forall j \in [1,d_{model}] \end{aligned}$$
(4)
$$\begin{aligned} o_j&= \sum _{k=1}^{n} A_{k, j}v_{k,j} \quad \forall j \in [1,d_{model}] \end{aligned}$$
(5)

Here, \(E, A \in \mathbb {R}^{n \times d_{model} }\) are, respectively, the self-attention values and weights, \(o \in \mathbb {R}^{d_{model}} \) is the pooled output vector, and \(\textrm{NN}\) is a neural network whose input and output dimensions are both equal to \(d_{model}\).
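A PyTorch sketch of Eqs. (3)–(5); a single linear layer stands in for the network \(\textrm{NN}\), whose actual depth is a hyperparameter:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Pool a variable-size set of vectors into a single vector."""
    def __init__(self, d_model):
        super().__init__()
        self.nn = nn.Linear(d_model, d_model)  # stand-in for NN in Eq. (3)

    def forward(self, v):            # v: (n, d_model)
        e = self.nn(v)               # Eq. (3): per-vector attention values
        a = torch.softmax(e, dim=0)  # Eq. (4): softmax over the set, per dim
        return (a * v).sum(dim=0)    # Eq. (5): weighted average, per dim
```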

3.5 Classifier

We obtain the final prediction score for the PID task by applying a simple neural network with a single-value output to the pooled vector. We normalize the score to the range \((0,1)\) by applying the logistic function:

$$\begin{aligned} f(x) = \frac{1}{1+e^{-x}}. \end{aligned}$$
(6)

The obtained score approximates the probability of the example corresponding to a specific particle type.
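Putting the pieces together, a minimal end-to-end sketch of the pipeline, reusing the FeatureValueEmbedding and AttentionPooling sketches above (all sizes are placeholders, not the Table 3 values):

```python
import torch
import torch.nn as nn

class PIDModel(nn.Module):
    """Feature-value pairs -> embedding -> encoder -> pooling -> score."""
    def __init__(self, n_features, d_model=32):
        super().__init__()
        self.embed = FeatureValueEmbedding(n_features, d_model=d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=64, dropout=0.1)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.pool = AttentionPooling(d_model)
        self.head = nn.Linear(d_model, 1)

    def forward(self, pairs):  # pairs: (n_observed, n_features + 1)
        z = self.embed(pairs)                        # (n_observed, d_model)
        z = self.encoder(z.unsqueeze(1)).squeeze(1)  # batch of one set
        return torch.sigmoid(self.head(self.pool(z)))  # Eq. (6)
```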

4 Evaluation

We evaluate our model on the PID task by comparing it against a standard \(n_\sigma \)-based selection technique, an ensemble of neural networks [21], as well as two standard techniques for ML with missing data: mean imputation and linear regression imputation. Additionally, in the scenario with no missing data in the test set, we compare our approach to the case deletion procedure.

4.1 Dataset

The data comes from a Monte Carlo simulation of proton–proton collisions at \(\sqrt{s}=13\) TeV with a realistic simulation of the time evolution of the detector conditions in the LHC Run 2 data-taking period. The simulation was performed with PYTHIA 8 [29], the GEANT [30] particle transport model, and general-purpose settings.

The dataset consists of 2,751,934 examples with track transverse momentum \(p_\textrm{T} \ge 0.1\) \(\mathrm {Ge\hspace{-1.00006pt}V/}c \); 95% of the examples fall into the range [0.12, 1.76] \(\mathrm {Ge\hspace{-1.00006pt}V/}c \). Each example contains 19 features:

  • detector signals: one TPC feature and two features each from the TRD and TOF,

  • number of shared TPC clusters,

  • spatial coordinates of a track reconstruction starting point (x, y, z in the local coordinate system, and the rotation angle \(\alpha \) between local and global coordinate systems),

  • track momentum p and its \(p_\textrm{T}\), \(p_x\), \(p_y\), \(p_z\) components,

  • track charge and type (propagated/non-propagated Run 3 track, Run 2 track),

  • the distances of closest approach (DCA) of the track trajectory to the collision primary vertex, measured in the xy plane (\(d_{xy}\)) and the z direction (\(d_{z}\)).

The examples are labeled with ten different particle types. The particle species distribution is shown in Table 2.

There are four combinations of missing values, as some examples’ measurements from the TOF and TRD detectors are missing. Figure 3 shows the missing value distribution.

Table 2 Particle type distribution. Approximately 97.8% of the examples belong to the 6 most populous particle types
Fig. 3 Missing data distribution. Over 62.8% of the examples are missing at least one value

4.2 Models

For each compared method, we train identical binary classification models for the six most populous particle types: pions, protons, kaons, and their respective antiparticles.

For the imputation methods and the neural network ensemble, all trained models share the same architecture. They have three hidden layers of sizes 64, 32, and 16, and a single output. Between the layers, we use the Rectified Linear Unit (ReLU) activation function \(f(x) = \max ({0, x})\). Dropout regularization with a rate of 0.1 is applied after each activation layer. The input dimension depends on the method used. For imputation techniques, all models process inputs of size 19, as all missing features are imputed. For the neural network ensemble, the four networks have input sizes 19, 17, 17, and 15, depending on the combination of missing values from the TRD and TOF detectors (each detector has two features associated with it).
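A sketch of this baseline classifier in PyTorch, following the layer sizes given above:

```python
import torch.nn as nn

def make_baseline_mlp(input_dim):
    """Baseline MLP: hidden layers 64/32/16, ReLU, dropout 0.1, one output.
    input_dim is 19 for the imputation methods and 19/17/17/15 for the
    ensemble members, depending on the available TRD/TOF features."""
    return nn.Sequential(
        nn.Linear(input_dim, 64), nn.ReLU(), nn.Dropout(0.1),
        nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.1),
        nn.Linear(32, 16), nn.ReLU(), nn.Dropout(0.1),
        nn.Linear(16, 1),
    )
```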

Our proposed architecture consists of the embedding neural network, the Transformer’s encoder, the self-attention neural network, and the classifier network. ReLU activation is also used between neural network layers, and dropout regularization with a rate of 0.1 is applied to the output of the embedding layer and the output of each sub-layer in the encoder. The parameters of all the layers are shown in Table 3.

Table 3 Training hyperparameters

Model parameters are selected using a hyperparameter sweep to ensure a fair comparison of the different network architectures. Given a set of parameters, various combinations of their values are used to train different models, from which the model achieving the best results on the validation dataset is chosen. In this way, each of the compared methods may achieve the best possible results. The hyperparameter sweep is a computationally expensive procedure; hence, we perform it with a model trained to detect kaons – the most challenging particles to identify in our dataset. This might introduce a small bias of our results towards kaons, so before the full integration of our solution with the ALICE system, we will perform independent sweeps for all particles.

4.3 Results

Each model is trained using all complete and incomplete examples and tested in two cases: (1) with all available test examples and (2) with the complete ones only. For case deletion, no results are available when testing on data with missing values, since the corresponding models cannot be directly applied to such data. We evaluate our models on the test set with standard metrics – precision (purity) and recall (efficiency). Since precision and recall trade off against each other, we also include the \(F_1\) metric, which combines the two according to the formula:

$$\begin{aligned} F_1=\frac{2 \cdot \textrm{precision} \cdot \textrm{recall}}{\textrm{precision} + \textrm{recall}}. \end{aligned}$$
(7)
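For reference, these metrics can be computed with scikit-learn; the labels below are toy values, and in practice the model score is thresholded to obtain the predictions:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels and predictions for one one-vs-all classifier.
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)  # equals 2*p*r / (p + r), Eq. (7)
```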

Numerical results for pions, protons, and kaons are shown in Table 4, and for their antiparticles in Table 5. We highlight the best-performing method for each particle in bold.

Table 4 Classification results for the three most common particle species
Table 5 Classification results for the three most common antiparticle species

For the case of incomplete examples, we compare our machine learning solutions with the standard technique described in Sect. 2.3. For this comparison we used the following selections: \(|n_\mathrm{\sigma ,TPC}|<3\) for particles with transverse momenta below 0.5 \(\mathrm {Ge\hspace{-1.00006pt}V/}c \) and \(\sqrt{n_\mathrm{\sigma ,TPC}^{2}+n_\mathrm{\sigma ,TOF}^{2}}<3\) for particles with \(p_\textrm{T} \ge 0.5\) \(\mathrm {Ge\hspace{-1.00006pt}V/}c \) (in this case, the TOF signal was required). Tables 4 and 5 show that machine learning approaches, in general, outperform standard \(n_\sigma \)-based techniques, providing significantly higher recall at similar or higher precision. Moreover, the proposed architecture is comparable with the other tested techniques. We can also observe that training with additional examples with missing data still yields good PID quality on complete examples, as measured by \(F_1\) scores. On the other hand, synthetic imputation of mean or predicted missing values can disturb the learned function and might result in lower performance. The cases of pion and kaon identification on complete examples are the only two where the ensemble achieves slightly better \(F_1\) results than the proposed architecture.

4.4 Detailed analysis of the results

To further highlight the benefits of our approach, we provide a detailed analysis of its performance compared to the baseline approaches. In Fig. 4, we present precision-recall curves of the different methods applied to all available test data (including incomplete examples) in the most challenging task of kaon detection. This plot provides a detailed comparison between methods without the need for threshold selection. Additionally, in Fig. 5, we analyze the differences in performance for different ranges of particle transverse momentum \(p_\textrm{T}\). We can observe that our approach yields a significantly higher area under the curve and degrades less for particles with high momentum values.

Fig. 4 Precision-recall curve for different ML-based approaches with missing data

Fig. 5 Performance of different PID methods in the kaon selection task with missing data as a function of particle momentum

In Figs. 6 and 7, we perform a similar analysis using only complete test examples. Detailed evaluations for all other particles are included in the supplementary material.

Fig. 6 Precision-recall curve for different ML-based approaches without missing data

Fig. 7 Performance of different PID methods in the kaon selection task without missing data as a function of particle momentum

4.5 Computational time comparison

In Tables 6 and 7, we present the computational overhead of calculating predictions with our method compared to the other baselines. For the training time, we report the average time (in milliseconds) needed to update the model’s weights, from loading the data, through forward and backward propagation, up to the final update. For the inference time, we report the average forward pass through the network (in microseconds). Since our model requires calculating the attention between features coming from different detectors, both its training and inference times are around 3 times longer than for the standard methods. However, all methods can be easily parallelized (up to the GPU memory size), which can be seen in the nearly constant times across different batch sizes. All of the calculations are performed on a small local machine with an Nvidia GTX 1660 Ti 6 GB GPU, an Intel Core i7-9750H CPU, and 16 GB of 2400 MT/s RAM.

Table 6 Average training time of our method compared to the baselines [ms]
Table 7 Average inference time of our method compared to the baselines [\(\upmu \)s]

5 Conclusion

This work considers a real-world scenario of classification from incomplete data in the particle identification task, where, due to the nature of the physical processes, not all of the information is always recorded by all of the detectors. To solve this problem, we propose a novel method based on the attention mechanism. We verify that our approach is able to learn from both complete and incomplete data, improving on the performance of models trained solely on the complete examples. Moreover, our method performs no worse than the other techniques while avoiding their drawbacks, such as the insertion of artificial data (imputation methods) and a potentially overly complex architecture (neural network ensemble).