1 Introduction

Electronic Health Records (EHRs) systems accumulate massive and wide-ranging medical data concerning different aspects of healthcare. The explosive growth of EHRs in recent years gives researchers the opportunity to access valuable medical information, which plays a significant role in describing a patient’s condition and in predicting a patient’s mortality and future morbidity. At present, utilizing existing medical big data to provide better and more personalized medical services is a promising trend in the development of the healthcare industry. Nevertheless, due to the temporality, high dimensionality, irregularity and complexity of EHRs, research on EHRs is challenging.

Representation learning, which is regarded as a key step before any further applications, provides opportunities for research on EHRs. It aims to represent the semantic information of the research objects as dense, low-dimensional, real-valued vectors using machine learning techniques [15, 21, 37]. Representation learning based on these techniques has also attracted great attention, since the learned vectors are able to capture implicit regularities and patterns [38].

Recurrent Neural Networks (RNNs) are deep learning models designed to handle time series data [12]. Although traditional RNNs are not good at capturing long-term dependencies in data, many variants are effective in addressing this issue [27]. Long Short-Term Memory (LSTM) [14] not only processes lengthy temporal data more effectively, but also overcomes the vanishing gradient problem [18] with a gating mechanism. Meanwhile, the attention mechanism is an effective method whose development renders model outputs more interpretable. Recently, attention mechanisms have often been combined with deep learning methods and successfully applied to multiple fields [22, 33, 35, 40].

EHRs data for each patient consist of a sequence of patient visits, where each visit contains a subset of diagnosis codes. However, the visits of a patient are sequentially related, and this relationship cannot be captured by simply aggregating code vectors. Effective representations therefore need to be derived from hierarchical learning over diagnosis codes and patient visits. In this paper, we propose a Multi-Layer Representation Learning method (MLRL) for patients’ EHRs. MLRL is implemented from two aspects: (1) Diagnosis code-level representation. We use the multi-head attention mechanism to explore the potential interactions and associations among the diagnosis codes. Then, non-negative real-valued code representations are obtained by a linear transformation; (2) Visit-level representation. This part utilizes Bidirectional Long Short-Term Memory (BiLSTM) to explore the temporal relationships among patient visits. Furthermore, because patient visits are unlikely to contribute equally to the prediction of the target outcome, we incorporate a self-attention mechanism to learn weighted visit vectors, which are aggregated to form the patient representation.

Main contributions of this paper are as follows:

  1. We propose a multi-layer representation learning method called MLRL to learn more efficient and robust patient representations from raw EHRs data. MLRL utilizes a multi-level structure to explore the different relational information provided by the hierarchical characteristics of EHRs, namely diagnosis code-level and visit-level information.

  2. We evaluate MLRL on a real EHRs dataset and conduct experiments on patient mortality prediction. Experimental results demonstrate the superior prediction performance achieved by MLRL. MLRL achieves around 0.915 in Area Under Curve (AUC) while baselines are in the range of 0.8–0.9.

  3. We apply the learned data representation to various classifiers for prediction tasks. The experimental results with the representation learned by MLRL consistently and significantly outperform those achieved with representations based on raw data and baseline methods.

The rest of the paper is arranged as follows: In Sect. 2, we review the related work, including research on EHRs and the applications of representation learning in the medical field. Section 3 presents the overall architecture of the proposed method, and describes the relevant theory and processing details. Section 4 reports the experiments on real EHRs, and the experimental results are analyzed and discussed in detail. Finally, Sect. 5 concludes the study and points out future work.

2 Related Work

2.1 EHRs

Mining EHRs is a hot research topic in healthcare informatics, and the massive amount of EHRs data motivates researchers to extract valuable clinical information for advanced analysis [31]. In recent years, EHRs data with different structure types, such as clinical text records and structured medical concepts, have been increasingly applied in medical research.

According to the research objectives and applications, we classify the research contents of EHRs into three types: (1) Disease risk level prediction and classification. Li et al. [19] proposed a feature encoding algorithm based on a stacked sparse auto-encoder (SSAE). The proposed SSAE can be trained effectively on small-scale data and learns significant feature representations for PD diagnosis. Razavian et al. [7] also did not learn code representations by utilizing the code characteristics. Ashfaq et al. [2] leveraged the Paragraph Vector for Distributed Bag of Words (PV-DBOW) to generate simple numerical vectors of codes. Nevertheless, their process of obtaining representations did not consider the importance of the codes in the current visit, that is, the code weight. Besides, the multiple visits of a patient also play important and different roles in the prediction of the target outcome. Wang et al. [34] proposed a representation learning model for patient medical records. They aimed to capture the co-occurrence information and long-term dependence between clinical events, but ignored the visit sequentiality and the differences in the contribution of patient visits to the prediction task. Miotto et al. [25] proposed an unsupervised method, called DeepPatient, which generated patient representations from the original clinical information via a stack of denoising autoencoders (SDA). In their research, the classified diagnosis codes were used as training labels to evaluate the predictive results of diseases. However, in the vector learning process of patient EHRs, they did not consider the sequentiality and temporality of patient visits.

In our research, the proposed method is based on EHRs concept representation [30] and introduces a multi-layer structure, which aims to learn the patient representation by taking into account the different relational information existing in diagnosis codes and patient visits.

3 Multi-layer Representation Learning Method

In this section, we first present an overview of MLRL and then we describe the components of our proposed method in detail.

Fig. 1 The structure of MLRL

3.1 Overview of MLRL

MLRL consists of the following parts: diagnosis code-level representation layer, visit-level representation layer and prediction layer.

As shown in Fig. 1, in the diagnosis code-level representation layer, we first embed the discrete diagnosis codes into vectors using an embedding matrix. Then, the initial code representations are obtained by utilizing the multi-head attention mechanism to explore the potential connections among the codes. Finally, a linear transformation and the rectified linear unit (ReLU) activation function are applied to map every code to a non-negative real-valued representation. The visit-level representation layer combines BiLSTM with a self-attention mechanism. Based on the initial visit vectors, which are aggregated from the learned code representations, the patient representation is obtained as the weighted sum of the learned visit vectors. The last layer is a fully connected layer with a softmax classifier for patient mortality prediction.

Fig. 2 The visits for a patient

Fig. 3 Multi-head attention mechanism

3.2 Diagnosis Code-Level Representation

Assume that there are N patients, each patient has T visits and each visit contains M diagnosis codes. The visits for a patient are illustrated in Fig. 2. \(c_{n}^{t m}\) represents the mth diagnosis code that occurred in the tth visit of the nth patient.

The unordered diagnosis codes within each visit carry valuable, implicit interrelated information, and a specific method is required to explore these connections and learn the vector representations.

Given the tth visit of the nth patient \(s_{n}^{t}=\left\{ c_{n}^{t 1}, \ldots , c_{n}^{t m}, \ldots , c_{n}^{t M}\right\} \), \(n \in [1, N]\), \(t \in [1, T]\), \(m \in [1, M]\), the diagnosis codes are embedded into vectors with an embedding matrix.

$$\begin{aligned} s_{n}^{t^{\prime }}=W_{e m b} s_{n}^{t} \end{aligned}$$
(1)

where \(s_{n}^{t^{\prime }}=\left\{ e_{n}^{t 1}, \ldots , e_{n}^{t m}, \ldots , e_{n}^{t M}\right\} \).
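
As a minimal illustration of Eq. (1), the sketch below treats the embedding as a row lookup in \(W_{emb}\), which is equivalent to multiplying one-hot code vectors by the matrix. The vocabulary size, embedding dimension and code numbers are hypothetical values chosen for illustration; they are not the settings of Table 3.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 6, 4                  # hypothetical sizes
w_emb = rng.normal(size=(vocab_size, emb_dim))

visit = np.array([1, 3, 4])                 # numbered diagnosis codes of one visit (made up)

# Row lookup: one embedding vector per diagnosis code in the visit.
embedded = w_emb[visit]

# Equivalent one-hot formulation of Eq. (1).
one_hot = np.eye(vocab_size)[visit]
embedded_via_matmul = one_hot @ w_emb
```

Both formulations give the same \(M \times d\) matrix of code embeddings; the lookup form simply avoids building the one-hot vectors.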

Then, the multi-head attention mechanism [32] is used to learn the initial code representations. Every head in the attention mechanism represents an attention layer, i.e. Scaled Dot-Product Attention. The attention function is defined in Eq. 2.

$$\begin{aligned} \begin{array}{l} \text{ Attention } (Q, K, V)={\text {softmax}}\left( \frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \\ \text{ where } Q, K, V=s_{n}^{t^{\prime }} W_{Q}, s_{n}^{t^{\prime }} W_{K}, s_{n}^{t^{\prime }} W_{V} \end{array} \end{aligned}$$
(2)

where Q, K and V are matrices representing queries, keys and values in the attention mechanism respectively; \(W_{Q}\), \(W_{K}\) and \(W_{V}\) are trainable parameter matrices.

Multi-head attention mechanism in Fig. 3 executes the attention function in parallel to produce the different output values which are then concatenated and linearly converted, yielding the final outputs \(u_{n}^{t}=\left\{ u_{n}^{t 1}, \ldots , u_{n}^{t m}, \ldots , u_{n}^{t M}\right\} \).

$$\begin{aligned} \begin{array}{l} u_{n}^{t}=\text{MultiHeadAttention}\left( s_{n}^{t^{\prime }}\right) \\ \quad \,=\text{concat}\left( \text{head}_{1}\left( s_{n}^{t^{\prime }}\right) , \ldots , \text{head}_{i}\left( s_{n}^{t^{\prime }}\right) , \ldots , \text{head}_{h}\left( s_{n}^{t^{\prime }}\right) \right) W_{O} \\ \text{where } \text{head}_{i}=\text{Attention}\left( Q_{i}, K_{i}, V_{i}\right) \end{array} \end{aligned}$$
(3)

where h represents the number of parallel attention layers, or heads, and \(W_{O}\) is a parameter matrix.
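
Equations (2) and (3) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than our TensorFlow implementation: the embedded codes are stored as rows, each head has its own projection matrices, and all sizes and random weights are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # Eq. (2): softmax(Q K^T / sqrt(d_k)) V
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k))
    return weights @ v, weights

def multi_head_attention(s, w_q, w_k, w_v, w_o):
    # s: (M, d) embedded codes of one visit.
    # w_q, w_k, w_v: lists of h per-head matrices, each of shape (d, d_k).
    # w_o: (h * d_k, d) output projection, as in Eq. (3).
    heads = [scaled_dot_product_attention(s @ wq, s @ wk, s @ wv)[0]
             for wq, wk, wv in zip(w_q, w_k, w_v)]
    return np.concatenate(heads, axis=-1) @ w_o

rng = np.random.default_rng(0)
M, d, h = 5, 8, 2                      # hypothetical: codes per visit, embedding size, heads
d_k = d // h
s = rng.normal(size=(M, d))
w_q = [rng.normal(size=(d, d_k)) for _ in range(h)]
w_k = [rng.normal(size=(d, d_k)) for _ in range(h)]
w_v = [rng.normal(size=(d, d_k)) for _ in range(h)]
w_o = rng.normal(size=(h * d_k, d))

u = multi_head_attention(s, w_q, w_k, w_v, w_o)
_, attn = scaled_dot_product_attention(s @ w_q[0], s @ w_k[0], s @ w_v[0])
```

Each row of `attn` sums to one and describes how strongly one diagnosis code attends to the other codes of the same visit.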

Finally, the non-negative real-valued code representations \(z_{n}^{t}=\left\{ z_{n}^{t 1}, \ldots , z_{n}^{t m}, \ldots , z_{n}^{t M}\right\} \) are obtained by the following formula.

$$\begin{aligned} z_{n}^{t}=\text{ReLU}\left( W_{z} u_{n}^{t}+b_{z}\right) \end{aligned}$$
(4)

where \(W_{z}\) and \(b_{z}\) are the trainable weight matrix and bias vector of the linear transformation.

In order to obtain the initial visit vectors, the representations of diagnosis codes included in each visit are aggregated as follows.

$$\begin{aligned} v_{n}^{t}=\sum _{m} z_{n}^{t m} \end{aligned}$$
(5)

Therefore, we can obtain a sequence of visit vectors for a patient \(v_{n}=\left\{ v_{n}^{1}, v_{n}^{2}, \ldots , v_{n}^{t}, \ldots , v_{n}^{T}\right\} \).
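
A small worked example of Eqs. (4) and (5) is given below. It is a sketch that assumes a row-vector convention (codes stored as rows, so the transformation is written `u @ w_z`); the numbers are made up so that the result can be checked by hand.

```python
import numpy as np

def code_to_visit_vector(u, w_z, b_z):
    # u: (M, d) attention outputs of one visit; w_z: (d, d_out); b_z: (d_out,)
    z = np.maximum(0.0, u @ w_z + b_z)   # Eq. (4): linear transformation followed by ReLU
    v = z.sum(axis=0)                    # Eq. (5): sum the code vectors of the visit
    return z, v

u = np.array([[1.0, -2.0],
              [3.0,  4.0]])              # two codes, two features (made-up values)
w_z = np.eye(2)                          # identity weights keep the example readable
b_z = np.zeros(2)

z, v = code_to_visit_vector(u, w_z, b_z)
# z is [[1, 0], [3, 4]] because ReLU clips -2 to 0; v is [4, 4].
```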

3.3 Visit-Level Representation

We describe the details of the visit representations in the following. Because the visits of patients are temporally ordered and interrelated, BiLSTM, which is good at capturing long-term dependencies in both the forward and backward directions, is introduced to process the data and exploit the sequential information. LSTM has three gates of different types: the forget gate \(f_{t}\), the input gate \(i_{t}\) and the output gate \(o_{t}\). Together they control how information is updated in the cell state. At time t, the forget gate \(f_{t}\) determines how much past information is discarded, and it is updated as follows.

$$\begin{aligned} f_{t}=\sigma \left( W_{f} x_{t}+U_{f} h_{t-1}+b_{f}\right) \end{aligned}$$
(6)

where \(x_{t}\) is used as input to the memory cell at time t.

\(i_{t}\) represents the input gate and determines what information is currently retained.

$$\begin{aligned} i_{t}=\sigma \left( W_{i} x_{t}+U_{i} h_{t-1}+b_{i}\right) \end{aligned}$$
(7)

The candidate state \(\tilde{C}_{t}\) is calculated in a way similar to that of traditional RNNs.

$$\begin{aligned} \tilde{C}_{t}=\tanh \left( W_{c} x_{t}+U_{c} h_{t-1}+b_{c}\right) \end{aligned}$$
(8)

\(C_{t}\) is the updated cell state.

$$\begin{aligned} C_{t}=f_{t} * C_{t-1}+i_{t} * \tilde{C}_{t} \end{aligned}$$
(9)

\(o_{t}\) is the output gate and \(h_{t}\) is the output value.

$$\begin{aligned} o_{t}= & {} \sigma \left( W_{o} x_{t}+U_{o} h_{t-1}+b_{o}\right) \end{aligned}$$
(10)
$$\begin{aligned} h_{t}= & {} o_{t} * \tanh \left( C_{t}\right) \end{aligned}$$
(11)
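
One LSTM update following Eqs. (6)–(11) can be sketched as below. The parameter dictionary, the initialization and the sizes are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update; p maps parameter names to arrays."""
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])      # Eq. (6)
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])      # Eq. (7)
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])  # Eq. (8)
    c_t = f_t * c_prev + i_t * c_tilde                                # Eq. (9)
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])      # Eq. (10)
    h_t = o_t * np.tanh(c_t)                                          # Eq. (11)
    return h_t, c_t

def init_params(input_dim, hidden_dim, rng):
    # Small random weights and zero biases (an arbitrary choice for this sketch).
    p = {}
    for gate in "fico":
        p[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        p[f"U_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        p[f"b_{gate}"] = np.zeros(hidden_dim)
    return p

rng = np.random.default_rng(0)
input_dim, hidden_dim, T = 3, 4, 5          # hypothetical sizes
params = init_params(input_dim, hidden_dim, rng)
visits = rng.normal(size=(T, input_dim))    # stand-in for the visit vectors of one patient

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in visits:
    h, c = lstm_step(x_t, h, c, params)
```

With all parameters set to zero, every gate equals 0.5 and the candidate state is 0, so the cell state is simply halved at each step; this gives a quick sanity check of the update rule.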

Based on the initial visit vectors, the learning process of the patient representation is as follows. First of all, BiLSTM is used to encode the vectors and generate the hidden states \(h_{n}=\left\{ h_{n}^{1}, h_{n}^{2}, \ldots , h_{n}^{t}, \ldots , h_{n}^{T}\right\} \), which are formed by concatenating the forward output \(\overrightarrow{h_{n}}\) and the backward output \(\overleftarrow{h_{n}}\).

$$\begin{aligned} \overrightarrow{h_{n}}= & {} \overrightarrow{L S T M}\left( v_{n}\right) \end{aligned}$$
(12)
$$\begin{aligned} \overleftarrow{h_{n}}= & {} \overleftarrow{L S T M}\left( v_{n}\right) \end{aligned}$$
(13)
$$\begin{aligned} h_{n}= & {} \left[ \overrightarrow{h_{n}}, \overleftarrow{h_{n}}\right] \end{aligned}$$
(14)
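
The bidirectional encoding of Eqs. (12)–(14) can be sketched with a generic recurrent step function. To keep the example short, a plain tanh recurrent cell stands in for the LSTM cell; this substitution, the helper names and all sizes are assumptions for illustration.

```python
import numpy as np

def run_direction(v, step, h0, reverse=False):
    # Apply a recurrent step over the visit sequence in the chosen direction
    # and store each hidden state at the position of its visit.
    order = range(len(v) - 1, -1, -1) if reverse else range(len(v))
    h, outputs = h0, [None] * len(v)
    for t in order:
        h = step(v[t], h)
        outputs[t] = h
    return np.stack(outputs)

def bidirectional_encode(v, fwd_step, bwd_step, h0):
    fwd = run_direction(v, fwd_step, h0)                  # Eq. (12)
    bwd = run_direction(v, bwd_step, h0, reverse=True)    # Eq. (13)
    return np.concatenate([fwd, bwd], axis=-1)            # Eq. (14)

rng = np.random.default_rng(0)
T, in_dim, hid = 4, 3, 2                                  # hypothetical sizes
w_f, u_f = rng.normal(size=(hid, in_dim)), rng.normal(size=(hid, hid))
w_b, u_b = rng.normal(size=(hid, in_dim)), rng.normal(size=(hid, hid))

def fwd_step(x, h):
    # Simplified tanh cell used in place of the forward LSTM.
    return np.tanh(w_f @ x + u_f @ h)

def bwd_step(x, h):
    # Simplified tanh cell used in place of the backward LSTM.
    return np.tanh(w_b @ x + u_b @ h)

v = rng.normal(size=(T, in_dim))
h0 = np.zeros(hid)
h_n = bidirectional_encode(v, fwd_step, bwd_step, h0)     # shape (T, 2 * hid)
```

The first `hid` entries of each row summarize the visits up to and including visit t, and the remaining entries summarize visit t and the visits after it.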

Furthermore, an attention mechanism [20] is applied to reward the patient visits that provide clues for correctly predicting the patient’s mortality, and we compute the patient representation \(x_{n}\) as a weighted sum of the hidden state vectors based on the learned weights.

$$\begin{aligned} \alpha _{n}= & {} {\text {softmax}}\left( W_{2} \tanh \left( W_{1} h_{n}^{\mathrm {T}}\right) \right) \end{aligned}$$
(15)
$$\begin{aligned} x_{n}= & {} \sum _{t} \alpha _{n}^{t} h_{n}^{t} \end{aligned}$$
(16)

where \(\alpha _{n}=\left\{ \alpha _{n}^{1}, \alpha _{n}^{2}, \ldots , \alpha _{n}^{t}, \ldots , \alpha _{n}^{T}\right\} \), \(W_{1}\) and \(W_{2}\) are parameter matrices, and \(h_{n}^{\mathrm {T}}\) in Eq. (15) denotes the transpose of the hidden state matrix \(h_{n}\) rather than the hidden state of the Tth visit.

Therefore, the set of patient representations \(x=\left\{ x_{1}, x_{2}, \ldots , x_{n}, \ldots , x_{N}\right\} \) is obtained.
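
The attention pooling of Eqs. (15) and (16) can be sketched as follows. The sketch assumes a single attention vector per patient, so that \(W_{2}\) reduces to a vector `w2`, and it reads the superscript on \(h_{n}\) in Eq. (15) as a transpose; the sizes and random values are hypothetical.

```python
import numpy as np

def softmax(x):
    exp = np.exp(x - x.max())
    return exp / exp.sum()

def attention_pool(h, w1, w2):
    # h: (T, 2u) BiLSTM hidden states; w1: (d_a, 2u); w2: (d_a,)
    scores = w2 @ np.tanh(w1 @ h.T)      # one score per visit, shape (T,)
    alpha = softmax(scores)              # Eq. (15)
    x = alpha @ h                        # Eq. (16): weighted sum of hidden states
    return x, alpha

rng = np.random.default_rng(0)
T, two_u, d_a = 4, 6, 5                  # hypothetical sizes
h = rng.normal(size=(T, two_u))
w1 = rng.normal(size=(d_a, two_u))
w2 = rng.normal(size=d_a)

x, alpha = attention_pool(h, w1, w2)

# With zero scoring weights every visit gets the same weight,
# so the pooled vector reduces to the mean of the hidden states.
x_uniform, alpha_uniform = attention_pool(h, w1, np.zeros(d_a))
```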

3.4 Patient’s Mortality Prediction

The results obtained by the representation learning method are high-level representations of patients, which can be used as the features for patient mortality prediction. We add a fully connected layer with a softmax classifier for the final outcome prediction as follows,

$$\begin{aligned} \text{ pre } ={\text {softmax}}\left( W_{p r e} x+b_{p r e}\right) \end{aligned}$$
(17)

where \(W_{p r e}\) is a parameter matrix and \(b_{p r e}\) is a bias vector.

The cross-entropy is introduced to calculate the prediction loss as follows,

$$\begin{aligned} L=-\frac{1}{N} \sum _{n}^{N}\left[ y_{n} \log \left( pre_{n}\right) +\left( 1-y_{n}\right) \log \left( 1-pre_{n}\right) \right] +\frac{1}{N} \sum _{n}^{N}\left\| \alpha \alpha ^{\mathrm {T}}-\mathrm {I}\right\| _{F}^{2} \end{aligned}$$
(18)

where \(y_{n}\) is a binary variable in prediction problems. We use the dot product of \(\alpha \) and its transpose, subtracted by an identity matrix, as a penalization term to focus attention on multiple diverse areas instead of just being limited to a certain aspect.
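
The loss in Eq. (18) can be sketched as below. The sketch assumes that \(pre_{n}\) is the predicted probability of death for patient n and that the attention weights of each patient are stored as an array of shape (r, T), with r = 1 when a single attention vector is used; the function names and the example values are hypothetical.

```python
import numpy as np

def attention_penalty(alpha):
    # Squared Frobenius norm of (alpha alpha^T - I) for one patient.
    alpha = np.atleast_2d(alpha)
    gram = alpha @ alpha.T
    return np.sum((gram - np.eye(gram.shape[0])) ** 2)

def penalized_cross_entropy(y, pre, alphas, eps=1e-12):
    y = np.asarray(y, dtype=float)
    # Clip the probabilities so that the logarithms stay finite.
    pre = np.clip(np.asarray(pre, dtype=float), eps, 1.0 - eps)
    bce = -np.mean(y * np.log(pre) + (1.0 - y) * np.log(1.0 - pre))
    penalty = np.mean([attention_penalty(a) for a in alphas])
    return bce + penalty

y = [1, 0]
pre = [0.5, 0.5]
alphas = [np.array([0.0, 1.0, 0.0]), np.array([1.0, 0.0, 0.0])]
loss = penalized_cross_entropy(y, pre, alphas)   # penalty is 0 here, so loss = ln 2

uniform_penalty = attention_penalty(np.full(4, 0.25))   # (0.25 - 1)^2 = 0.5625
```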

4 Experimental Results and Analysis

4.1 Data Set

Medical Information Mart for Intensive Care (MIMIC-III) [16] is a large, single-center database, which was jointly released by the Computational Physiology Laboratory of the Massachusetts Institute of Technology, the Beth Israel Deaconess Medical Center (BIDMC) and Philips Healthcare in 2006. The database has 26 data tables covering hospitalization, patient information, diagnosis, medication, and so on [6].

We extract the patients who have more than one visit and use the diagnosis information, in terms of the first three digits of the International Classification of Diseases-9 (ICD-9) codes, to construct the EHRs sequences. We number the extracted and classified diagnosis codes in order to carry out code embedding. Specifically, each visit sequence in Fig. 2 is composed of a series of numbers representing diagnosis codes. The patient’s mortality \(y_{i} \in \{0,1\}\) (0 means survival, 1 means death) is extracted as the training label. The basic information of the database is shown in Table 1.

The reasons for selecting diagnosis codes as the features of clinical prediction tasks are as follows. On the one hand, the diagnosis codes reflect the patient’s illness and condition during hospitalization, which play an important role in predicting the patient’s mortality. On the other hand, there are valuable implicit associations between the codes. For example, diabetic patients are likely to also suffer from diabetes-related complications (e.g. cardiovascular disease), which indicates that there may be valuable potential correlations between diabetes and its complications.

The codes follow a hierarchical pattern in which the classification granularity of the disease becomes gradually finer. For example, the diagnosis code 250.00 represents diabetes without mention of complications, 250.10 indicates diabetes with ketoacidosis and 250.20 indicates diabetes with hyperosmolarity. In this study, we group the codes into higher-order categories by selecting their first three digits, which reduces information overload and gives a generalized specificity level. The operation also makes the codes more generalized and hierarchical. Such classified codes have also been widely used in previous studies [7, 8, 10, 25, 26].
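
The grouping and numbering of diagnosis codes described above can be sketched as follows. The helper names are hypothetical, and two details are assumptions that the text does not specify: the category is taken as the first three characters after removing the decimal point, and numbering starts at 1.

```python
def icd9_category(code: str) -> str:
    # Keep the first three characters, e.g. "250.10" -> "250".
    return code.strip().replace(".", "")[:3]

def build_code_index(patients):
    # Assign consecutive numbers to categories in order of first appearance.
    index = {}
    for visits in patients:
        for visit in visits:
            for code in visit:
                category = icd9_category(code)
                if category not in index:
                    index[category] = len(index) + 1
    return index

def encode_patient(visits, index):
    # Replace every diagnosis code by the number of its category.
    return [[index[icd9_category(code)] for code in visit] for visit in visits]

# One made-up patient with two visits.
patients = [[["250.00", "401.9"], ["250.10", "428.0"]]]
index = build_code_index(patients)
encoded = encode_patient(patients[0], index)
# 250.00 and 250.10 share the category 250 and therefore the same number.
```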

Table 1 Basic statistics of the MIMIC-III database

4.2 Evaluation Metrics

We use AUC, accuracy, recall and F1 score as the evaluation metrics. The ROC curve is a plot of true positive rate (TPR) versus false positive rate (FPR), which are defined in Eqs. (19) and (20) respectively. AUC is computed by integrating the ROC curve.

$$\begin{aligned} T P R= & {} \frac{T P}{T P+F N} \end{aligned}$$
(19)
$$\begin{aligned} F P R= & {} \frac{F P}{F P+T N} \end{aligned}$$
(20)
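
As a sketch of how AUC follows from Eqs. (19) and (20), the function below sweeps the decision threshold over the predicted scores, traces the ROC curve and integrates it with the trapezoidal rule. The function name and the example data are made up, and both classes are assumed to be present.

```python
def roc_auc(y_true, scores):
    pos = sum(y_true)
    neg = len(y_true) - pos
    # Visit the samples from the highest to the lowest score.
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    auc = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Samples with tied scores cross the threshold together.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        tpr, fpr = tp / pos, fp / neg            # Eqs. (19) and (20)
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
        i = j
    return auc

auc_example = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])   # 0.75
auc_perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])    # 1.0
auc_tied = roc_auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])       # 0.5
```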

Accuracy refers to the proportion of correctly classified samples among the total number of samples, as follows.

$$\begin{aligned} \text{ Accuracy }=\frac{T P+T N}{T P+T N+F P+F N} \end{aligned}$$
(21)

Recall means the number of correct positive results divided by the number of positive results that should have been returned.

$$\begin{aligned} \text{ Recall }=\frac{T P}{T P+F N} \end{aligned}$$
(22)

F1 score is the harmonic mean of classification precision and recall. The formula is as follows.

$$\begin{aligned} F1=\frac{2 \times T P}{2 \times T P+F P+F N} \end{aligned}$$
(23)

The meanings of TP, FP, TN and FN are shown in Table 2.
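
The threshold-based metrics can be computed from the confusion-matrix counts as in the sketch below. Recall uses the standard definition TP/(TP+FN), which coincides with TPR. The helper names and the example labels are hypothetical, and the denominators are assumed to be non-zero.

```python
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def classification_metrics(tp, fp, tn, fn):
    return {
        "tpr": tp / (tp + fn),                        # Eq. (19)
        "fpr": fp / (fp + tn),                        # Eq. (20)
        "accuracy": (tp + tn) / (tp + tn + fp + fn),  # Eq. (21)
        "recall": tp / (tp + fn),                     # standard recall, equal to TPR
        "f1": 2 * tp / (2 * tp + fp + fn),            # Eq. (23)
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)             # (3, 1, 3, 1)
metrics = classification_metrics(*counts)
```

In this example there are 3 true positives, 1 false positive, 3 true negatives and 1 false negative, so accuracy, recall and F1 all equal 0.75 and FPR equals 0.25.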

Table 2 Meaning of TP, FP, TN and FN in the confusion matrix

4.3 Comparative Algorithms

In order to evaluate the performance of MLRL in patient’s mortality prediction as well as its effectiveness of feature learning, we compare MLRL with baseline methods as follows.

  (1) Logistic regression (LR)

The inputs of LR [10] are the aggregated vectors formed by the visits of patients. Specifically, without vector learning, the patient vector as input is directly constructed from the original visit sequences.

  (2) Multi-layer perceptron (MLP)

MLP [10] uses the same inputs as LR, and introduces a hidden layer of size 400 between the input and the output.

  (3) Deep Patient

Deep Patient [25], an unsupervised representation learning method, aims to learn the patient representation from raw clinical data by a stack of denoising autoencoders (SDA). In this paper, we use the patient’s visits as input and train a three-layer stacked autoencoder to minimize the reconstruction error. The number of hidden units per layer is set to 400. This setting makes the dimension of the output representation consistent with other methods. Similarly, this method also directly learns the visit vectors and constructs the patient representations based on them.

  (4) Med2Vec

Med2Vec [9] is a scalable two-layer neural network for learning lower dimensional representations of medical concepts. This method follows the idea of skip-gram to learn the code representations, and predicts the codes appearing in the following visit based on the current visit information. The hidden layer size of the network is set to 400. Since the original Med2Vec is used for multiple variable prediction, we change the final softmax function to implement binary prediction task (i.e. patient’s mortality prediction).

  (5) BiLSTM-Soft (BiLSTM-Softmax)

BiLSTM-Soft [3] utilizes BiLSTM to process the patient visits and learn their representations. Then, the patient representation, formed by the aggregated visit representations, is used as the features to train the softmax classifier for prediction task. The inputs of this method are the original patient visit sequences shown in Fig. 2. Both forward LSTM and backward LSTM with 200 hidden units constitute BiLSTM.

  (6) BiLSTM-Att-Soft (BiLSTM-Attention-Softmax)

The BiLSTM-Att-Soft method [36] performs the same process as BiLSTM-Soft and keeps the parameter settings consistent, but adds an attention mechanism to learn the weights of the patient visits. Neither of these BiLSTM-based methods learns representations of the diagnosis codes occurring in patient visits, and both directly process the original visit sequences.

Table 3 MLRL parameter settings
Table 4 The prediction performance of MLRL and baseline methods
Fig. 4 The training process of MLRL and baseline methods

4.4 Experimental Results

4.4.1 Experimental Results and Analysis for Patient’s Mortality Prediction

We randomly divide the dataset into ten mutually exclusive subsets with the same mortality rate, of which eight subsets are used to train the models and the remaining two subsets are used for validation and testing respectively. The training set, which contains most of the data, is used to fit the model; the validation set is used to check the generalization ability of the model and to detect over-fitting in time; and the test set is used to verify the model performance.
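
The split described above can be sketched as a stratified partition into ten folds. The function name, the seed and the toy labels are hypothetical; the sketch only illustrates the idea of keeping the mortality rate equal across subsets.

```python
import random

def stratified_folds(labels, n_folds=10, seed=0):
    # Shuffle the indices of each class separately and deal them out
    # round-robin, so that every fold keeps the same class ratio.
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds

labels = [1] * 20 + [0] * 80                # toy data: 20% mortality
folds = stratified_folds(labels)

train_idx = [i for fold in folds[:8] for i in fold]   # eight folds for training
val_idx, test_idx = folds[8], folds[9]                # one fold each for validation and testing
```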

Fig. 5 Comparison of the results for different data representations

The Adam optimizer [17] with a learning rate of 0.0001 is used to minimize the loss of the task, and all methods are implemented in TensorFlow. The detailed parameter settings of MLRL are shown in Table 3. In the process of parameter selection, we refer to the parameter settings of the attention mechanism and BiLSTM in [10, 32], and make appropriate adjustments according to our data dimensions. In addition, we train the model until the loss reaches a threshold of 0.15, which determines the number of training epochs. In order to ensure the validity of the comparison results, the network parameters of the baseline methods (such as the optimizer and learning rate) are consistent with those of MLRL, and the dimension of the output representations is kept the same.

The predictive performance of MLRL and baselines is presented in Table 4, and the model performance is evaluated with AUC, accuracy, recall, and F1 score.

According to Table 4, compared with other baselines, Deep Patient reports better accuracy, recall and F1 score with its unsupervised deep learning network. Med2Vec plays a certain role in exploiting the potential connections of medical concepts, and its performance metrics are slightly higher than those of other baselines. Furthermore, the BiLSTM-based methods (BiLSTM-Soft and BiLSTM-Att-Soft) achieve better prediction performance. They achieve an AUC close to 0.9, which is 7% higher than that obtained by common classifiers such as LR and MLP. This is because these methods are good at processing patient visits with chronological characteristics. In addition, it is worth mentioning that the results improve after an attention mechanism is combined with BiLSTM, which indicates that the attention mechanism plays a significant role in improving the model quality. Finally, MLRL with its multi-layer structure significantly outperforms all baseline methods: it achieves an AUC of 0.915 while the baselines reach 0.8–0.9 (i.e., 1–10% improvement). In sum, MLRL yields clear improvements for the prediction task.

Table 5 Mortality prediction results of different data representations

Figure 4 shows the training process of MLRL and the baseline methods. An obvious and important aspect of these results is the overfitting of the methods. Constantly increasing the number of training epochs degrades the performance of all of the methods, as it leads to overfitting. For example, overtraining makes the results of BiLSTM-Soft show a downward trend. Similar behavior can be seen as we train BiLSTM-Att-Soft for more epochs, which suggests that appropriate model training is necessary and that early stopping should be applied to representation learning in the medical field.
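
One common form of early stopping is a patience rule on the validation loss, sketched below. This is a generic illustration and not the stopping criterion used in our experiments; the function name, the patience value and the loss curve are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    # Return the index of the best epoch seen before training is stopped.
    best = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Validation loss improved: remember it and reset the counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break   # no improvement for `patience` consecutive epochs
    return best_epoch

# Made-up validation losses: improvement stalls after epoch 2.
val_losses = [0.9, 0.7, 0.6, 0.62, 0.65, 0.7, 0.5]
best_epoch = early_stopping_epoch(val_losses, patience=3)
```

Here training stops after epoch 5 and the model from epoch 2 is kept; the later value of 0.5 is never reached, which is the usual trade-off of a patience rule.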

4.4.2 Analysis of Different Data Representations

To measure the quality of the data representation learned by MLRL and evaluate how well it performs in prediction task on different classifiers, we conduct experiments for several classifiers with different representations. Specifically, the experiment is to obtain the patient representations learned by all methods, and then input them as features to different classifiers for prediction tasks and compare the results. The classifiers we used include LR, MLP, LSTM, random forest (RF) and support vector machine (SVM), and the baseline methods include BiLSTM, BiLSTM-Att (BiLSTM-Attention), Deep Patient and Med2Vec.

Table 5 presents the prediction performance of different representations in terms of AUC, accuracy, recall and F1 score. From Table 5, we can observe that all the representation learning methods improve on raw data, which demonstrates their applicability to the prediction tasks, and that MLRL shows superior predictive performance. These experimental results also show that all the methods can capture effective information and learn different patient representations, which contribute differently to the final outcome prediction. However, because the neural network-based baseline methods do not explore the hierarchical structure of EHRs, or do not focus on the valuable relational information in the sequences, their prediction performance is not as good as that achieved by MLRL, which learns the patient representation by hierarchical learning of code and visit information. In Table 5, the results of our learned representation are superior to those obtained with raw data. Particularly for the SVM classifier, MLRL achieves an AUC of 0.837 while raw data reach only 0.748, and the other evaluation metrics also improve by about 7–12%. Meanwhile, MLRL consistently outperforms all other feature learning baseline methods. Taking the LSTM classifier as an example, MLRL improves on the other baseline methods by 3.9%, 2.9%, 4.3% and 1.9% respectively in terms of AUC. In Fig. 5, we present a more intuitive comparison of all the methods in terms of AUC, accuracy, recall and F1 score.

In sum, the performance of MLRL on patient mortality prediction is better than that of the baseline methods, which shows that taking advantage of the structural characteristics of EHRs to hierarchically exploit the significant information embedded in them helps to learn more effective representations.

5 Conclusion

In this paper, we propose MLRL to learn an effective deep representation of EHRs based on RNNs and attention mechanisms. MLRL learns the patient representation by hierarchically mining the valuable and effective information existing in diagnosis codes and patient visits. We then apply the proposed method to patient mortality prediction with real EHRs data. The experimental results demonstrate that MLRL achieves more accurate prediction and improves the prediction performance of the task. In addition, the data representation learned by MLRL significantly outperforms raw EHRs data and other learned representations in the evaluation.

In future work, we are going to study a wider range of medical events, such as the various physical indicators and clinical notes, to further explore the valuable information. In addition, we also refer to the description of emotion in [41] and plan to explore the relationships between patient’s emotions and diseases.