1 Introduction

ALICE (A Large Ion Collider Experiment) [1] is one of the four major detectors located at the Large Hadron Collider at CERN [2]. The main goal of ALICE is to study the properties of the quark–gluon plasma (QGP), a hot and dense state of matter, and the strong force that binds quarks together inside hadrons.

2 Related work

2.1 Missing data sources

The incomplete data problem in statistical and machine learning (ML) models has been widely studied over decades [13, 14]. In [15], the authors highlight three main categories of missing data according to the source of the information gaps. The first case occurs when values are missing completely at random (MCAR), i.e., the probability of a value being missing is independent of all parameters. If this probability depends only on the values of the observed attributes, the data is missing at random (MAR). Finally, if this probability depends on the value of the missing parameter itself, we refer to this case as missing not at random (MNAR).

In the case of the data used for particle identification at the ALICE experiment, we can distinguish two main sources of missing data. For some examples, the lack of a particular value – the signal from one of the detectors – is caused by its temporary malfunction, independent of the measured particle. In this case, the signal is missing completely at random (MCAR). For other examples, we might lack measurements of values that fall outside the effective region of a particular detector. For example, the efficiency of the TOF detector measurements drops significantly for particles with transverse momentum below 0.5 \(\mathrm {Ge\hspace{-1.00006pt}V/}c \). In such cases, the data is missing not at random (MNAR).

2.2 Machine learning with incomplete data

Several popular machine learning algorithms, such as neural networks, cannot be directly trained with missing data. Therefore, different solutions to this problem exist, which can be roughly divided into two groups: (1) those that modify the training dataset [16, 17] and (2) those that adapt ML architectures [18,19,20,21]. Data treatment methods transform datasets with missing values into complete ones that can be used with standard machine learning architectures. Adapted architectures are capable of processing incomplete data without any alteration.

The simplest method of data transformation is known as case deletion, where examples without full data availability are simply removed from the training set. This limits the potential training capabilities of the model and can introduce bias, since part of the original data distribution is missing. The alternative technique, known as imputation, fills the missing data with artificial values calculated as a mean, median, or a value predicted with a simple model, e.g. linear regression. This idea is further extended in [22], where the imputation model is optimized jointly with the regressor. Similarly to the case deletion approach, imputation methods can significantly disturb the predictions of the neural model, which might result in unrealistic behavior.
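To make the two imputation baselines concrete, the following is a minimal sketch using scikit-learn; the toy matrix is illustrative and does not correspond to the actual detector features:

```python
import numpy as np
from sklearn.impute import SimpleImputer
# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy feature matrix; NaN marks a missing detector signal.
X = np.array([
    [0.42, np.nan, 1.30],
    [0.51, 0.98, np.nan],
    [0.35, 1.02, 1.10],
])

# Mean imputation: replace each NaN with the column mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Regression-based imputation: model each feature from the others.
X_reg = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)
```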

To overcome those shortcomings, methods based on adjusting the model were proposed. The most straightforward approach, suitable when missing data affects only a few attributes, is an ensemble of different classifiers. In such a case, different models are trained on subsets of the training dataset without missing data. In particular, in neural network reduction [21], the authors propose to split the dataset into the largest possible complete subsets.

There are several domains where machine learning with incomplete data is already employed. In [23], the authors use a recurrent neural network to model healthcare data with missing values. A similar idea was employed in medical applications such as breast cancer prediction [24]. Machine learning algorithms with missing data are also used in other domains, such as traffic prediction [25, 26].

2.3 Particle identification techniques

As mentioned in the introduction, the currently employed PID techniques depend on human-defined selection criteria on the response signals from the detectors used in the analysis. In most cases, the so-called “\(\textrm{n}_{\sigma }\)” method is used, whose main parameter is the number of standard deviations by which the signal a particle left in the detector differs from the value expected for a given particle species. For example, if both the TPC and TOF detectors are used, a common approach is to define a PID selection as

$$\begin{aligned} \sqrt{\hbox {n}_{\sigma ,\textrm{TPC}}^{2}+\hbox {n}_{\sigma ,\textrm{TOF}}^{2}}<\lambda . \end{aligned}$$
(1)

The actual cut-off value \(\lambda \), however, depends on the specific analysis and its requirements for the purity and efficiency of the sample. Typically, this value is set between 2 and 3. This method can be further modified by adding additional conditions, for instance, on the rejection of other types of particles. In such a case, if one analyzes, e.g., charged kaons, one may provide the acceptance criterion for the kaon hypothesis (i.e. how far in terms of \(\mathrm n_{\sigma }\) the signal can be from the expected value for kaons) as well as rejection criteria for the pion, proton, and electron hypotheses (i.e. the minimum \(\mathrm n_{\sigma }\) that the signal must have compared to the expected values for pions, protons, and electrons).
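For illustration, the selection of Eq. (1), together with a rejection-style kaon cut, could be sketched as follows; the function names and thresholds are assumptions for illustration, not the actual analysis code:

```python
import numpy as np

def tpc_tof_selection(n_sigma_tpc, n_sigma_tof, cut=3.0):
    """Combined TPC+TOF n-sigma selection, Eq. (1)."""
    return np.sqrt(n_sigma_tpc**2 + n_sigma_tof**2) < cut

def kaon_selection(ns_kaon, ns_pion, ns_proton, ns_electron,
                   accept=3.0, reject=3.0):
    """Accept the kaon hypothesis while rejecting competing hypotheses."""
    return ((np.abs(ns_kaon) < accept)
            & (np.abs(ns_pion) > reject)
            & (np.abs(ns_proton) > reject)
            & (np.abs(ns_electron) > reject))
```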

Outside of the ALICE experiment, several attempts have been made to use neural networks for the PID. In the LHCb experiment [9], shallow neural networks are used to classify calorimeter signals. Similarly, in ATLAS, neural models are used to identify electrons [10], while the CMS Collaboration identifies \(\tau \) lepton decays with deep neural networks [11].

3 Attention-based neural network architecture for particle identification

Fig. 2 The proposed model architecture. Layered blocks are applied separately to each vector in a set; single blocks are applied to their input as a whole

In this work, we focus on the problem of PID in which a type of particle is assigned to a data sample based on its characteristics recorded by a set of detectors. We treat particle identification as a set of binary classification tasks in a so-called one vs. all approach. Given a set of examples, each labeled with one of \(k\) particle types, the task is to find \(k\) binary classifiers corresponding to each particle type. Each classifier is tasked with identifying whether or not a particle, represented by a given example, belongs to the specific particle type. Every example consists of a set of real-valued features, based on measurements from a detector, out of which some might be missing.

With this formulation, we introduce a novel method for PID that makes use of examples with missing data. We base our approach on the attention mechanism, similar to the method introduced for a medical use case in AMI-Net [12]. Our system is composed of several steps. We first encode all data examples into sets of feature-value pairs [27], regardless of which values are missing, and transform them into embeddings. These embeddings are further processed by the Transformer’s [28] encoder module with multi-head attention. The encoder output is combined into a single feature vector that serves as the input to the final classifier. An overview of our system is shown in Fig. 2; in the following subsections, we describe its building blocks in detail.

3.1 Feature-value pairs

We prepare the incomplete data for the model by creating sets of feature-value pairs from the incomplete vectors. Each pair consists of a non-missing value in the vector and the index of that value. We apply one-hot encoding to these indices to aid the model in processing them. An example is shown in Table 1.

Table 1 An example of the preprocessing of data samples into sets of feature-value pairs
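A minimal sketch of this preprocessing, assuming missing values are encoded as NaN (the toy vector is illustrative):

```python
import numpy as np

def to_feature_value_pairs(x):
    """Turn an incomplete vector into a set of [one-hot(index), value]
    vectors, keeping only the observed (non-NaN) entries."""
    d = len(x)
    pairs = []
    for i, v in enumerate(x):
        if not np.isnan(v):
            one_hot = np.zeros(d)
            one_hot[i] = 1.0
            pairs.append(np.concatenate([one_hot, [v]]))
    return np.stack(pairs)  # shape: (n_observed, d + 1)

pairs = to_feature_value_pairs(np.array([0.42, np.nan, 1.30]))
# -> [[1, 0, 0, 0.42], [0, 0, 1, 1.30]]
```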

3.2 Embedding

An embedding is a continuous vector representation of discrete data. Embedding the feature indices lets the model exploit relations between features more effectively by placing similar features close to each other in the embedding space. We create embeddings by applying a neural network with a single hidden layer to the feature-value vectors.
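A PyTorch sketch of such an embedding network; the hidden and output sizes below are placeholders, the actual values being hyperparameters (Table 3):

```python
import torch.nn as nn

class FeatureValueEmbedding(nn.Module):
    """Single-hidden-layer network mapping a feature-value vector
    (one-hot index concatenated with the value) to an embedding."""
    def __init__(self, n_features, d_hidden=64, d_model=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features + 1, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, pairs):   # pairs: (n_observed, n_features + 1)
        return self.net(pairs)  # (n_observed, d_model)
```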

3.3 Transformer encoder

The Transformer is an attention-based architecture commonly used for sequence transduction. As the Transformer processes sequences of arbitrary length, it can naturally be adapted to learn relations between the available features regardless of the number of missing values. We apply the Transformer’s encoding module to the set of embedding vectors to connect the different features each vector represents.

The encoder consists of a stack of N identical layers. Each layer is made of two sub-layers: a multi-head attention layer and a dense neural network. The multi-head attention is applied to the input set as a whole, while the NN is applied to each vector in a set separately. To ease the training of the encoder, we apply a residual connection around each sub-layer, followed by layer normalization.
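This structure matches the standard Transformer encoder, so a sketch can rely on PyTorch's built-in module; the sizes below are placeholders rather than the values from Table 3:

```python
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=32, nhead=4, dim_feedforward=64,
    dropout=0.1, batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)  # N = 2

# embeddings: (batch, n_observed, d_model); a key padding mask can flag
# positions that do not correspond to observed features:
# encoded = encoder(embeddings, src_key_padding_mask=pad_mask)
```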

3.3.1 Multi-head attention

To connect pairs of vectors in the input set, each input vector is linearly transformed into \(h\) query, key, and value vectors, and scaled dot-product attention is applied to each of the \(h\) triples of query, key, and value sets, where \(h\) is the number of heads. This allows specific patterns to be found in pairs of input vectors. For example, a measurement from a specific detector could be used only if the momentum is within a particular range. As each of the \(N\) layers in the encoder contains a multi-head attention sub-layer, and each sub-layer connects pairs of vectors, the full encoder can theoretically connect subsets of up to \(2^N\) vectors.

3.3.2 Scaled dot-product attention

Given sets of query, key, and value vectors of dimension \(d_k\), scaled dot-product attention calculates a set of weighted averages of the value vectors based on the similarities between the corresponding keys and queries. The similarities are obtained by computing the dot product of the query and key vectors. The dot products are scaled by a factor of \(\frac{1}{\sqrt{d_k}}\) to counteract the increase of the dot product with an increasing dimension of the query, key, and value vectors. Finally, the softmax function is applied to the scaled dot products to obtain the weights used to calculate the averages of value vectors. The whole computation can be described as

$$\begin{aligned} \textrm{Attention}(Q, K, V) = \textrm{softmax}\left( \frac{QK^T}{\sqrt{d_k}}\right) V, \end{aligned}$$
(2)

where \(Q,K,V \in \mathbb {R}^{n \times d_k}\) are sets of \(n\) query, key and value vectors, respectively.
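Equation (2) translates directly into a few lines of PyTorch (a sketch; in practice this computation is wrapped by the multi-head module of Sect. 3.3.1):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Eq. (2): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5  # pairwise similarities
    weights = F.softmax(scores, dim=-1)          # each row sums to 1
    return weights @ V                           # weighted averages of values
```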

3.4 Attention pooling

As the classifier neural network cannot process an unordered, variable-size set of vectors, merging the vectors obtained from the Transformer’s encoder into a single vector is necessary. Therefore, we pool the output vectors together using the self-attention technique also employed in the Transformer, where each vector is assigned weights based on its own values. These weights are then used to calculate the weighted average of all the vectors in the set.

3.4.1 Self attention

Given a set of vectors \(\{v_1, v_2,..., v_n\}\), where \(v_i \in \mathbb {R}^{d_{model}}\), we obtain the pooled vector as follows:

$$\begin{aligned} E_{i, \star }&= \textrm{NN}(v_i) \quad \forall i \in [1,n] \end{aligned}$$
(3)
$$\begin{aligned} A_{{\star }, j}&= \textrm{softmax}(E_{{\star }, j}) \quad \forall j \in [1,d_{model}] \end{aligned}$$
(4)
$$\begin{aligned} o_j&= \sum _{k=1}^{n} A_{k, j}v_{k,j} \quad \forall j \in [1,d_{model}] \end{aligned}$$
(5)

Here, \(E, A \in \mathbb {R}^{n \times d_{model} }\) are, respectively, the self-attention values and weights, \(o \in \mathbb {R}^{d_{model}} \) is the pooled output vector, and \(\textrm{NN}\) is a neural network whose input and output dimensions are both equal to \(d_{model}\).
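A PyTorch sketch of Eqs. (3)–(5); a single linear layer stands in for the network \(\textrm{NN}\), whose actual depth is a hyperparameter:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Pool a variable-size set of vectors into a single vector."""
    def __init__(self, d_model):
        super().__init__()
        self.nn = nn.Linear(d_model, d_model)  # stand-in for NN in Eq. (3)

    def forward(self, v):            # v: (n, d_model)
        e = self.nn(v)               # Eq. (3): per-vector attention values
        a = torch.softmax(e, dim=0)  # Eq. (4): softmax over the set, per dim
        return (a * v).sum(dim=0)    # Eq. (5): weighted average, per dim
```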

3.5 Classifier

We obtain the final prediction score for the PID task by applying a simple neural network with a single-value output to the pooled vector. We normalize the score to the range \((0,1)\) by applying the logistic function:

$$\begin{aligned} f(x) = \frac{1}{1+e^{-x}}. \end{aligned}$$
(6)

The obtained score approximates the probability of the example corresponding to a specific particle type.
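Putting the pieces together, a minimal end-to-end sketch of the pipeline, reusing the FeatureValueEmbedding and AttentionPooling sketches above (all sizes are placeholders, not the Table 3 values):

```python
import torch
import torch.nn as nn

class PIDModel(nn.Module):
    """Feature-value pairs -> embedding -> encoder -> pooling -> score."""
    def __init__(self, n_features, d_model=32):
        super().__init__()
        self.embed = FeatureValueEmbedding(n_features, d_model=d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=64, dropout=0.1)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.pool = AttentionPooling(d_model)
        self.head = nn.Linear(d_model, 1)

    def forward(self, pairs):  # pairs: (n_observed, n_features + 1)
        z = self.embed(pairs)                        # (n_observed, d_model)
        z = self.encoder(z.unsqueeze(1)).squeeze(1)  # batch of one set
        return torch.sigmoid(self.head(self.pool(z)))  # Eq. (6)
```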

4 Evaluation

We evaluate our model on the PID task by comparing it against a standard \(n_\sigma \)-based selection technique, an ensemble of neural networks [21], as well as two standard techniques for ML with missing data: mean imputation and linear regression imputation. Additionally, in the scenario with no missing data in the test set, we compare our approach to the case deletion procedure.

4.1 Dataset

The data comes from a Monte Carlo simulation of proton–proton collisions at \(\sqrt{s}=13\) TeV with a realistic simulation of the time evolution of the detector conditions in the LHC Run 2 data-taking period. The simulation was performed with PYTHIA 8 [29], the GEANT [30] particle transport model, and general-purpose settings.

The dataset consists of 2,751,934 examples with track transverse momentum \(p_\textrm{T} \ge 0.1\) \(\mathrm {Ge\hspace{-1.00006pt}V/}c \); 95% of the examples fall into the range [0.12, 1.76] \(\mathrm {Ge\hspace{-1.00006pt}V/}c \). Each example contains 19 features:

  • detector signals: one TPC feature and two features each from the TRD and TOF,

  • number of shared TPC clusters,

  • spatial coordinates of a track reconstruction starting point (x, y, z in the local coordinate system, and the rotation angle \(\alpha \) between local and global coordinate systems),

  • track momentum p and its \(p_\textrm{T}\), \(p_x\), \(p_y\), \(p_z\) components,

  • track charge and type (propagated/non-propagated Run 3 track, Run 2 track),

  • the distances of closest approach (DCA) of the track trajectory to the collision primary vertex, measured in the xy plane (\(d_{xy}\)) and the z direction (\(d_{z}\)).

The examples are labeled with ten different particle types. The particle species distribution is shown in Table 2.

There are four combinations of missing values, as some examples’ measurements from the TOF and TRD detectors are missing. Figure 3 shows the missing value distribution.

Table 2 Particle type distribution. Approximately 97.8% of the examples belong to the 6 most populous particle types
Fig. 3 Missing data distribution. Over 62.8% of the examples are missing at least one value

4.2 Models

For each compared method, we train identical binary classification models for the six most populous particle types: pions, protons, kaons, and their respective antiparticles.

For the imputation methods and the neural network ensemble, all trained models share the same architecture. They have three hidden layers of sizes 64, 32, and 16, and a single output. Between the layers, we use the Rectified Linear Unit (ReLU) activation function \(f(x) = \max ({0, x})\). Dropout regularization with a rate of 0.1 is applied after each activation layer. The input dimension depends on the method used. For imputation techniques, all models process inputs of size 19, as all missing features are imputed. For the neural network ensemble, the four networks have input sizes 19, 17, 17, and 15, depending on the combination of missing values from the TRD and TOF detectors (each detector has two features associated with it).
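A sketch of this baseline classifier in PyTorch, following the layer sizes given above:

```python
import torch.nn as nn

def make_baseline_mlp(input_dim):
    """Baseline MLP: hidden layers 64/32/16, ReLU, dropout 0.1, one output.
    input_dim is 19 for the imputation methods and 19/17/17/15 for the
    ensemble members, depending on the available TRD/TOF features."""
    return nn.Sequential(
        nn.Linear(input_dim, 64), nn.ReLU(), nn.Dropout(0.1),
        nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.1),
        nn.Linear(32, 16), nn.ReLU(), nn.Dropout(0.1),
        nn.Linear(16, 1),
    )
```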

Our proposed architecture consists of the embedding neural network, the Transformer’s encoder, the self-attention neural network, and the classifier network. ReLU activation is also used between neural network layers, and dropout regularization with a rate of 0.1 is applied to the output of the embedding layer and the output of each sub-layer in the encoder. The parameters of all the layers are shown in Table 3.

Table 3 Training hyperparameters

Model parameters are selected using a hyperparameter sweep to ensure a fair comparison of the different network architectures. Given a set of parameters, various combinations of their values are used to train different models, from which the model achieving the best results on the validation dataset is chosen. In this way, each of the compared methods may achieve the best possible results. The hyperparameter sweep is a computationally expensive procedure; hence, we perform it with a model trained to detect kaons – the most challenging particles to identify in our dataset. This might introduce a small bias of our results towards kaons, so before the full integration of our solution with the ALICE system, we will perform independent sweeps for all particles.

4.3 Results

Each model is trained using all complete and incomplete examples and tested in two cases: (1) with all available test examples and (2) with the complete ones only. For case deletion, no results are available when testing on data with missing values, since the corresponding models cannot be directly applied to such data. We evaluate our models on the test set with standard metrics – precision (purity) and recall (efficiency). Since precision and recall trade off against each other, we also include the \(F_1\) metric, which combines the two according to the formula:

$$\begin{aligned} F_1=\frac{2 \cdot \textrm{precision} \cdot \textrm{recall}}{\textrm{precision} + \textrm{recall}}. \end{aligned}$$
(7)
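For reference, these metrics can be computed with scikit-learn; the labels below are toy values, and in practice the model score is thresholded to obtain the predictions:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels and predictions for one one-vs-all classifier.
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)  # equals 2*p*r / (p + r), Eq. (7)
```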

Numerical results for pions, protons, and kaons are shown in Table 4, and for their antiparticles in Table 5. We highlight the best-performing method for each particle in bold.

Table 4 Classification results for the three most common particle species
Table 5 Classification results for the three most common antiparticle species

For the case of incomplete examples, we compare our machine learning solutions with the standard technique described in Sect. 2.3. For this comparison we used the following selections: \(|n_\mathrm{\sigma ,TPC}|<3\) for particles with transverse momenta below 0.5 \(\mathrm {Ge\hspace{-1.00006pt}V/}c \) and \(\sqrt{n_\mathrm{\sigma ,TPC}^{2}+n_\mathrm{\sigma ,TOF}^{2}}<3\) for particles with \(p_\textrm{T} \ge 0.5\) \(\mathrm {Ge\hspace{-1.00006pt}V/}c \) (in this case, the TOF signal was required). Tables 4 and 5 show that machine learning approaches, in general, outperform standard \(n_\sigma \)-based techniques, providing significantly higher recall at similar or higher precision. Moreover, the proposed architecture is comparable with the other tested techniques. We can also observe that training with additional examples with missing data still yields good PID quality on complete examples, as measured by \(F_1\) scores. On the other hand, synthetic imputation of mean or predicted missing values can disturb the learned function and might result in lower performance. The cases of pion and kaon identification on complete examples are the only two where the ensemble achieves slightly better \(F_1\) results than the proposed architecture.

4.4 Detailed analysis of the results

To further highlight the benefits of our approach, we provide a detailed analysis of its performance compared to the baseline approaches. In Fig. 4, we present precision-recall curves of the different methods applied to all available test data (including incomplete examples) in the most challenging task of kaon detection. This plot provides a detailed comparison between methods without the need for threshold selection. Additionally, in Fig. 5, we analyze the differences in performance for different ranges of particle transverse momentum \(p_\textrm{T}\). We can observe that our approach yields a significantly higher area under the curve and degrades less for particles with high momentum values.

Fig. 4 Precision-recall curve for different ML-based approaches with missing data

Fig. 5 Performance of different PID methods in the kaon selection task with missing data as a function of particle momentum

In Figs. 6 and 7, we perform a similar analysis using only complete test examples. Detailed evaluations for all other particles are included in the supplementary material.

Fig. 6 Precision-recall curve for different ML-based approaches without missing data

Fig. 7 Performance of different PID methods in the kaon selection task without missing data as a function of particle momentum

4.5 Computational time comparison

In Tables 6 and 7, we present the computational overhead of calculating predictions with our method compared to the other baselines. For the training time, we report the average time (in milliseconds) needed to update the model’s weights, from loading the data, through forward and backward propagation, up to the final update. For the inference time, we report the average forward pass through the network (in microseconds). Since our model requires calculating the attention between features coming from different detectors, both its training and inference times are around 3 times longer than for the standard methods. However, all methods can be easily parallelized (up to the GPU memory size), which can be seen in the nearly constant times across different batch sizes. All of the calculations are performed on a small local machine with an Nvidia GTX 1660 Ti 6 GB GPU, an Intel Core i7-9750H CPU, and 16 GB of 2400 MT/s RAM.

Table 6 Average training time of our method compared to the baselines [ms]
Table 7 Average inference time of our method compared to the baselines [\(\upmu \)s]

5 Conclusion

This work considers a real-world scenario of classification from incomplete data in the particle identification task, where, due to the nature of the physical processes, not all of the information is always recorded by all of the detectors. To solve this problem, we propose a novel method based on the attention mechanism. We verify that our approach is able to learn from both complete and incomplete data, improving on the performance of models trained solely on the complete examples. Moreover, our method performs no worse than the other techniques while avoiding their drawbacks, such as the insertion of artificial data (imputation methods) and a potentially overly complex architecture (neural network ensemble).