1 Introduction

Human activity recognition in intelligent environments is a highly dynamic research area that has gained considerable attention due to its varied applications. The applications of activity recognition systems are categorized as: active and assisted living systems for smart homes (SH), monitoring and surveillance systems for indoor and outdoor environments, health care monitoring, and tele-immersion applications [1,2,3]. Among these, SH plays an important role, especially in user behavior analysis, health monitoring, and assistance. Most of the research on activity recognition in SH has investigated single-resident activity monitoring [4,5,6,7]. However, in real-life scenarios, a home is not always occupied by a single resident but is often occupied by more than one. Therefore, developing an SH solution from the perspective of multiple residents is crucial.

In recent years, there has been an increase in multiple-occupancy research related to activity modelling and data association. However, various challenges remain to be addressed in multiple-occupancy settings, such as finding suitable models for data association, i.e., identifying which resident triggered each sensor, and capturing interactions between the occupants [8]. Another major challenge when developing real-life applications is the class imbalance problem. Activity recognition is mainly treated as a classification problem, where system performance depends on the model selection, the features involved, the number of classes, and the size of the datasets available for training. Most SH datasets lack uniformity across the residents' daily living activities, which is to be expected: in real life, some activities are performed far more often than others.

Although several studies have addressed class imbalance, there remains a lack of empirical work on class imbalance in multi-resident activity recognition. In this work, we report an empirical study of both data-driven and algorithm-driven techniques for handling class imbalance. Data-driven approaches modify the original dataset: oversampling increases the minority samples and can provide a balanced distribution without losing information on the majority class, while undersampling removes samples from the majority class. The main advantage of undersampling lies in the reduction of training time, which is significant for highly imbalanced large datasets [9]. At the algorithm level, we applied cost-sensitive learning to deep learning models, which has performed well on class imbalance problems in previous works. However, the majority of prior works use statistical methods such as SVM and naive Bayes as the base classifier in cost-sensitive learning [10]. Other works on activity recognition rely on machine learning methods with feature extraction techniques, including time-frequency transformations and statistical approaches. In such methods, the extracted features are carefully engineered and heuristic; there is no universal feature extraction method that can effectively capture distinguishable features of human activities. Consequently, we selected the Long Short-Term Memory (LSTM) network, as it extracts highly discriminative non-linear feature representations while modeling temporal sequences by learning long-term dependencies. In addition, LSTM and 1D convolutional neural networks have outperformed other statistical machine learning models on single-resident activity recognition [11].

To summarize, the main contributions of this paper are:

  i. a review of approaches to handling the class imbalance problem with deep learning and explainable AI (XAI);

  ii. employing LSTM and BiLSTM networks for multi-resident activity recognition;

  iii. evaluating model performance for each resident separately and with combined activity labels of the residents;

  iv. conducting extensive experiments using both data-level and algorithm-level class imbalance techniques; and

  v. investigating model performance at different sampling ratios and cost coefficients on three benchmark datasets.

The paper is structured as follows: Sect. 2 reports the related work. Section 3 introduces the SH datasets, the LSTM and BiLSTM networks, and the class imbalance methods used in the paper, and Sect. 4 describes the experiments performed. The results are presented and discussed in the next section, followed by a concluding section highlighting the major findings.

2 Related work

In this section, we review related work on multi-resident activity recognition and imbalanced data classification, which lays the foundation for the current work.

2.1 Multiple resident activity recognition

Activity recognition approaches fall into two main categories: vision-based [12,13,14] and pervasive sensing-based [15,16,17]. Vision-based activity recognition can provide good results but raises privacy concerns among residents because cameras must be installed in their private spaces [18, 19], whereas pervasive sensing-based approaches use data from wearable sensors and non-intrusive environment sensors [20]. A significant amount of work has been performed on activity recognition using wearable sensors. Body Sensor Networks (BSN) have emerged, consisting of different wearable sensors that capture and process physiological signals on the human body; BSNs collect data from the wearable sensors and process them to extract useful information [21, 22]. A major issue with wearable sensors is that wearing or carrying a tag is often not feasible, especially for older people, who may forget to wear tags or be unwilling to wear them at all. There have been efforts to create adaptive solutions for user adoption and integration; nonetheless, the challenge of usability persists among older individuals [23,24,25]. Pervasive sensing using environment sensors has the advantage of being non-intrusive: the inhabitants are not required to carry any tag or device. The sensors are deployed in the environment and capture the residents' activities, which can then be used for activity recognition. This approach has its own challenges, however. Recognizing human activities from environment sensors is difficult because the data captured by the sensors can be disturbed by the surroundings, making it noisy, and because human activities are inherently complex. In such a setting, sensor deployment, sensor configuration, and the choice of classification model play an important role in identifying both the residents' activities and the residents themselves [26].

In previous works, diverse computational models have been applied to single-resident activity recognition, including standard data mining approaches, probabilistic models, and machine learning models such as neural networks, support vector machines, decision trees, and ontologies. For multi-resident activity recognition, however, such a diversity of models has not yet been explored. The core problem in multi-resident activity recognition with non-intrusive sensors is data association: the sensors cannot directly identify residents or the interactions between them, whereas in a single-resident setting the sensors' states directly reflect the activity of the sole resident. Multi-resident activities can follow different scenarios: the same activity can be performed by two or more residents (e.g., eating a meal or watching TV together), or multiple residents can perform different activities independently (e.g., one resident watches TV while the other prepares a meal). Evidently, a model is needed that can capture the complex nature of both joint and independent activities. Previous works have addressed multi-resident activity recognition using wearable sensors such as RFID [17] and accelerometers [27], as well as video [28]. Machine learning approaches previously applied to multi-resident activity recognition include naive Bayes and Markov model classifiers [29] and conditional random fields (CRF) [30] on the CASAS dataset [31], in which the data association problem was investigated. In [32], the authors proposed a two-stage activity recognition method to exploit more knowledge in multi-resident activities. The two stages are a model-building phase and an activity recognition phase; the method converts the multi-label problem into a single-label problem by treating the residents' activities as a combined label state, using HMM (Hidden Markov Model) and CRF classifiers.

In recent works, deep learning models have shown impressive performance in various fields. The LSTM network, a variant of the recurrent neural network (RNN), is well suited to time series problems because its design enables gradients to flow readily through time [33]. A deep residual bidirectional LSTM network has been used for activity recognition with wearable sensors on the UCI dataset (which uses data from a smartphone) and the Opportunity dataset (data from wearable, object, and ambient sensors) [34]. Most of the works on handling imbalanced classes [69] focus on vision and text classification problems, and very little work has been done on class imbalance in multi-resident activity recognition. In addition, existing works lack comparative studies of different class imbalance approaches.

Therefore, this paper presents a comprehensive study of both data-level and algorithm-level class imbalance approaches in multi-resident activity recognition. Since temporal deep learning methods have shown promising results on raw sensor data in single-resident activity recognition, we used LSTM and BiLSTM networks as classifiers while addressing class imbalance.

3 Methodology

3.1 Smart home datasets

In this work we used the publicly available ARAS [70] and CASAS-Kyoto Multiresident ADL Activities datasets (the fourth dataset in the CASAS dataset list: http://casas.wsu.edu/datasets/) [16, 71]. ARAS is widely used in activity recognition research, whereas the CASAS-Kyoto Multiresident ADL Activities dataset has not been used much in previous works. As the collection of real SH data is time-consuming, costly, and difficult to annotate, these publicly available datasets provide a baseline for comparison.

3.1.1 ARAS multi-resident ADL dataset

The ARAS dataset uses ambient sensors such as contact sensors, temperature sensors, sonar distance sensors, force sensors, photocells, resistors, and infrared receivers in the SH setting. It provides 20 different types of sensor signals as features, together with the activity labels of two residents, for two houses termed House A and House B. Each house has 30 days of data in 30 separate files (one per day), and every file contains 86,400 instances (one per second). The dataset covers 27 different activity types for each resident. The distribution of activities in House A and House B of the ARAS dataset is shown in Fig. 1.

Fig. 1 Activity distribution of both residents (R1 and R2) in the ARAS dataset

As Fig. 1 shows, the data for both residents in the two houses are highly imbalanced: only a few activities exceed 35% of the distribution, while most activities account for less than 10% of the whole dataset.

3.1.2 CASAS-Kyoto multiresident ADL activities dataset

The CASAS-Kyoto Multiresident ADL Activities dataset was collected in a smart apartment testbed at Washington State University (WSU). The sensors used include motion, item, cabinet, water, burner, phone, and temperature sensors. The smart space was occupied by two residents at the same time, performing daily living tasks concurrently. The collected sensor events were labeled with activity and person identifiers. The dataset contains 15 different daily living activities performed by the two residents, of which a few (moving furniture, playing checkers, paying bills, and packing picnic supplies) were accomplished jointly. Because some activities were performed jointly and some individually, when an activity was performed by only one resident there is no label for the other resident's activity. Since both residents were present in the apartment, we assigned the label "Other" to the unknown activity of the second resident, giving 16 activity labels for each resident. In many cases, however, there were sensor readings and activity labels for both residents. The frequency distribution of activities in the dataset is shown in Fig. 2.

Fig. 2 Frequency count of activities in the CASAS-Kyoto Multiresident ADL Activities dataset

3.2 LSTM models

LSTM networks [72] are a successful extension of RNNs designed to avoid the long-term dependency problem associated with RNNs. LSTM models introduce a new state, called the cell state, together with a constant error carousel (CEC) that allows constant propagation of error signals over time, thus solving the vanishing gradient problem. In addition, LSTM uses a gating mechanism over an internal memory cell to control access to the CEC and to learn a more complex representation of long-term dependencies. This makes LSTM well suited to classifying, processing, and predicting time series with time lags of unknown size. An LSTM block consists of input, output, and forget gates, which are responsible for the write, read, and reset operations of the memory cell, respectively. The main component of LSTM is the memory cell, which remembers states for short or long periods over arbitrary time intervals.

The forget gate receives the new time step \(X_{t}\) and the previous output \(h_{t-1}\) as input and, through a sigmoid activation function, decides which information is kept or deleted: information is deleted if the sigmoid output is 0 and kept if it is 1. The forget gate computation is shown in Eq. (1). The next step decides what new information is stored in the cell state and has two parts: first, the input gate decides which new information from the current input (\(X_{t}, h_{t-1}\)) is written to the cell state; second, a tanh activation function generates a new candidate value \(\tilde{C_{t}}\) that may be appended to the cell state. The product of these two parts is added to the product of the forget gate (\(f_{t}\)) and the previous cell state (\(C_{t-1}\)) to generate the new cell state (\(C_{t}\)): multiplying \(f_{t}\) with \(C_{t-1}\) forgets the information marked for deletion, and adding \(i_{t} * \tilde{C_{t}}\) writes the new candidate value, scaled by how much the cell state is to be updated. The computations of the input gate, new candidate value, and cell state are shown in Eqs. (2)–(4). In the final step, the output gate produces a filtered version of the cell state: the previous hidden state and the current input time step are passed through a sigmoid activation function, the new cell state is put through a \(\tanh\) function, and the two outputs are multiplied to generate the next hidden state. The updated cell state and new hidden state forward the information to the next time step. Equations (5) and (6) show the computation of the output gate and hidden state (\(h_{t}\)).

$$\begin{aligned} f_{t}&= \sigma (W_{f}\cdot [h_{t-1}, x_{t}] + b_{f}) \end{aligned}$$
(1)
$$\begin{aligned} i_{t}&= \sigma (W_{i}\cdot [h_{t-1}, x_{t}] + b_{i}) \end{aligned}$$
(2)
$$\begin{aligned} \tilde{C_{t}}&= \tanh (W_{C}\cdot [h_{t-1}, x_{t}] + b_{c}) \end{aligned}$$
(3)
$$\begin{aligned} C_{t}&= f_{t} * C_{t-1} + i_{t} * \tilde{C_{t}} \end{aligned}$$
(4)
$$\begin{aligned} o_{t}&= \sigma (W_{o}\cdot [h_{t-1}, x_{t}] + b_{o}) \end{aligned}$$
(5)
$$\begin{aligned} h_{t}&= o_{t} * \tanh (C_{t}) \end{aligned}$$
(6)

where \(\sigma\) is the sigmoid activation function, \(\tanh\) is the hyperbolic tangent function, x is the input data, and W is the weight matrix. The LSTM equations are adapted from [73].
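To make the gate computations concrete, the following minimal sketch implements Eqs. (1)–(6) for a single time step in NumPy. The function name, the dictionary layout of the weights, and the shapes are illustrative assumptions, not the implementation used in our experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following Eqs. (1)-(6). `W` and `b` are dicts
    keyed by gate ('f', 'i', 'C', 'o'); each W[k] has shape
    (hidden, hidden + input). Names and layout are illustrative."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W['f'] @ z + b['f'])       # Eq. (1): forget gate
    i_t = sigmoid(W['i'] @ z + b['i'])       # Eq. (2): input gate
    c_tilde = np.tanh(W['C'] @ z + b['C'])   # Eq. (3): candidate cell value
    c_t = f_t * c_prev + i_t * c_tilde       # Eq. (4): new cell state
    o_t = sigmoid(W['o'] @ z + b['o'])       # Eq. (5): output gate
    h_t = o_t * np.tanh(c_t)                 # Eq. (6): new hidden state
    return h_t, c_t
```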

Fig. 3 LSTM and Bidirectional LSTM

The architectures of the LSTM and BiLSTM networks are shown in Fig. 3. The input layer comprises an embedded vector containing a sequence of sensor events; n LSTM cells are fully connected to the inputs and have recurrent connections with the other LSTM cells.

Finally, a dense output layer performs the classification task. In the BiLSTM network, two parallel LSTM layers are used for the forward and backward passes, extracting patterns from past and future events: the forward layer reads the input from left to right, whereas the backward layer reads it from right to left.

The output prediction is the weighted sum of the prediction scores from the forward and backward layers. In both networks, the Adam optimizer is used to train the network by minimizing the softmax cross-entropy loss.
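A minimal Keras sketch of the two classifiers follows; the sequence length (30) and hidden units (128) mirror the hyperparameters reported in Sect. 4, but the input encoding and the helper name are illustrative assumptions rather than the exact experimental code.

```python
from tensorflow.keras import layers, models

def build_model(n_features, n_classes, bidirectional=False,
                seq_len=30, hidden_units=128):
    """Sketch of the LSTM/BiLSTM classifiers described above."""
    recurrent = layers.LSTM(hidden_units)
    if bidirectional:
        # Forward and backward passes over the sequence (BiLSTM)
        recurrent = layers.Bidirectional(recurrent)
    model = models.Sequential([
        layers.Input(shape=(seq_len, n_features)),      # sequence of sensor events
        recurrent,
        layers.Dense(n_classes, activation="softmax"),  # dense output layer
    ])
    # Adam optimizer minimizing the softmax cross-entropy loss
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```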

3.3 Handling class imbalance with LSTM and BiLSTM networks

In this paper, the following three methods are used with LSTM and BiLSTM networks.

3.3.1 Oversampling

Oversampling is a data-level approach that aims to balance the class distribution by increasing the number of minority class samples. It is performed by computing the sampling ratio (also known as the class imbalance ratio) between the minority and majority classes. We selected the most frequent activity and reduced the imbalance of less frequent activities in the training set. We oversampled less frequent activities with varying sampling ratios, but never to the point where a less frequent activity became more frequent than the actual most frequent activity. For example, suppose an activity has 1,000 samples for Resident 1 and 5,000 for Resident 2, and the most frequent activity has 10,000 samples; in this case we cap the oversampling ratio at 2 (limited by Resident 2), even though a ratio of 10 could be applied if only Resident 1 were considered. We experimented with sampling ratios from 1 to 10; the best model performance was observed at sampling ratios of 2 and 5. A sketch of the capping rule is given below.
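The following sketch illustrates the capping rule for a single label vector; in the experiments the cap is applied jointly over both residents' activity distributions, so treat this as a simplified illustration rather than the exact procedure.

```python
import numpy as np
from collections import Counter

def capped_oversample(X, y, ratio, seed=0):
    """Oversample each minority class by `ratio`, capped so that no
    class ever exceeds the most frequent class (the rule described
    above). A single-label sketch."""
    X, y = np.asarray(X), np.asarray(y)
    counts = Counter(y.tolist())
    max_count = max(counts.values())
    rng = np.random.default_rng(seed)
    idx = []
    for cls, n in counts.items():
        cls_idx = np.flatnonzero(y == cls)
        target = min(int(n * ratio), max_count)   # never exceed the majority class
        idx.extend(cls_idx)
        if target > n:                            # draw the extra samples with replacement
            idx.extend(rng.choice(cls_idx, size=target - n, replace=True))
    idx = np.asarray(idx)
    rng.shuffle(idx)
    return X[idx], y[idx]
```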

3.3.2 Undersampling

Similar to oversampling, undersampling is a data-level approach performed by computing a sampling ratio; here we reduced the samples of the residents' most frequent activities. We limited the undersampling ratio so that the most frequent activity remains the most frequent even after undersampling. If an activity's count for either Resident 1 or Resident 2 is below the average (the count each activity would have under a uniform distribution), we keep those instances and do not undersample; we only undersample when both residents' activities are over-represented, again keeping the average in mind when thresholding the undersampling ratio. We tried sampling ratios from 0.25 to 1.0 and conducted experiments over this range; the best results were observed at undersampling ratios of 0.25 and 0.5.
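A simplified single-label sketch of this thresholded undersampling follows; it uses the per-point keep probability described later in Sect. 3.3, and the uniform-average threshold reflects our reading of the rule above.

```python
import numpy as np

def thresholded_undersample(X, y, ratio, seed=0):
    """Keep each sample of an over-represented class with probability
    `ratio`; classes at or below the uniform-average count are left
    untouched. A single-label sketch of the rule described above."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    counts = np.bincount(y)
    avg = len(y) / len(np.unique(y))      # per-class count under a uniform distribution
    keep = np.ones(len(y), dtype=bool)
    for cls, n in enumerate(counts):
        if n > avg:                       # only undersample over-represented classes
            cls_idx = np.flatnonzero(y == cls)
            keep[cls_idx] = rng.random(len(cls_idx)) < ratio
    return X[keep], y[keep]
```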

Data-level approaches are classifier-independent: they avoid modifying the learning model by reducing the effect of imbalanced data in a preprocessing step. This makes them more versatile.

3.3.3 Cost-sensitive learning

Cost-sensitive learning lies between the data-level and algorithm-level approaches, as it combines data-level processing (adding costs to samples) with algorithm-level modifications of the learning process [74]. The method evaluates the cost associated with misclassifying samples. It does not create a balanced data distribution; rather, it assigns the training samples of different classes different weights, in proportion to their misclassification costs. In our cost-sensitive version, we scaled the loss according to the cost coefficients, limiting each class's coefficient to below the ratio of the most frequent activity's count to the given activity's count. We conducted experiments with cost coefficients from 1 to 10, and the best model performance was observed at cost coefficients of 2 and 5.
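The sketch below expresses this weighting scheme, under our reading of the capping rule, as a per-class weight dictionary; the function name is illustrative.

```python
import numpy as np

def cost_sensitive_weights(y, cost_coeff):
    """Per-class loss weights: rare classes get weight `cost_coeff`,
    capped below the ratio of the most frequent class count to the
    class's own count (so the majority class keeps weight 1). A sketch
    under our reading of the rule described above."""
    counts = np.bincount(np.asarray(y))
    caps = counts.max() / np.maximum(counts, 1)   # most-frequent / given-activity ratio
    weights = np.minimum(cost_coeff, caps)
    return {cls: float(w) for cls, w in enumerate(weights)}

# Usage with the Keras models sketched in Sect. 3.2:
# model.fit(X_train, y_train, class_weight=cost_sensitive_weights(y_train, 2))
```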

Since most activities in the datasets are performed separately by each resident but some are performed together, we considered both how often each resident performs each activity individually and how often they perform activities together. Figure 4 depicts the LSTM model for multi-resident activity recognition with activity labels \(a^{1,1}a^{2,1}\), ..., \(a^{1,T}a^{2,T}\), where \(a^{1,1}\) is the activity label of the first resident and \(a^{2,1}\) that of the second, and similarly for all labels. Figure 4a shows the LSTM model with each resident's activity treated separately, and Fig. 4b shows the model with the residents' combined activities. In the case of separate activity labels, we selected, for example, activity 1 and activity 3 separately for the different residents and applied the class imbalance methods per resident, always keeping the most frequent activity's samples more numerous than any other activity's. In the case of combined activities, we used a tuple of activities, such as (1, 3), computed the frequency of each tuple, and applied the class imbalance methods to these tuple activities, as sketched below.
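The tuple encoding of Fig. 4b can be expressed as the following sketch; the function name and the index mapping are illustrative.

```python
def combine_labels(y_r1, y_r2):
    """Encode the two residents' activities as joint tuple labels,
    e.g. (1, 3) when R1 performs activity 1 and R2 activity 3
    (a sketch of the combined-label setting in Fig. 4b)."""
    pairs = list(zip(y_r1, y_r2))
    label_map = {p: i for i, p in enumerate(sorted(set(pairs)))}
    return [label_map[p] for p in pairs], label_map
```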

Fig. 4 LSTM model for multi-resident activity recognition

Oversampling with a sampling ratio of 2 or 5 does not mean multiplying each resident's activity count by 2 or 5: increasing one resident's activity also changes the distribution of the other resident's activities, because the dataset contains sensor information for both residents together. Similarly, undersampling with a sampling ratio of 0.25 or 0.5 does not reduce the activity distribution to exactly one quarter or one half. Instead, when undersampling with a ratio of 0.25, each data point is kept with probability 0.25. The exact distribution of activities may therefore vary in each case.

4 Experiments

The experiments were performed on three SH datasets: two houses (House A and House B) from the ARAS dataset and a third house from the CASAS-Kyoto dataset. Both datasets contain sensor observations of two residents. The experiments are designed so that, for all three houses, classification of the residents' activities is performed using LSTM and BiLSTM networks, and for each model we explore oversampling, undersampling, and cost-sensitive learning to handle the class imbalance problem. Each house of the ARAS dataset consists of 30 days of human activity data, with 86,400 data points per day. The dataset is divided into training, validation, and test sets such that the first 18 days are used for training, the next six days for validation, and the last six days for testing. In the CASAS-Kyoto Multiresident ADL Activities dataset, the activities of the two residents were recorded over 26 days, and each file has a different number of data points. We followed the same approach: the first 16 days are used for training (10,572 instances), the next five days for validation (3,051 instances), and the last five days as the test set (3,608 instances of sensor readings). The experiments are first run on the original dataset (without applying class imbalance methods), and then twelve further experiments are conducted for each model by applying the class imbalance techniques to the training data.
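The chronological split amounts to a simple partition over the per-day files, as in the sketch below; the file naming is a hypothetical placeholder, not the actual ARAS file names.

```python
# Day-based split for one ARAS house: 18 training, 6 validation,
# and 6 test days (file naming is an assumption for illustration).
day_files = [f"DAY_{d}.txt" for d in range(1, 31)]
train_files = day_files[:18]
val_files = day_files[18:24]
test_files = day_files[24:]
```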

Evaluation metrics play an important role in measuring model performance when handling class imbalance in multi-resident activity recognition. Hence, we used the Exact Match Ratio (EMR), balanced accuracy, and the micro average of the F1-score to evaluate all models. The EMR indicates the percentage of samples that have all their labels classified correctly (Eq. 7). Balanced accuracy is used in multi-class classification to deal with imbalanced datasets and is based on two commonly used metrics: sensitivity (the true positive rate, or recall) and specificity (the true negative rate), shown in Eq. 8. We also used the micro average of the F1-score, the harmonic mean of precision and recall, shown in Eq. 9. The EMR over both residents, and the balanced accuracy and micro-averaged F1-score of each resident, are computed on the test set at the best validation accuracy for all models; a computational sketch follows Eqs. (7)–(9).

For both LSTM and BiLSTM networks, we explored sequence lengths from 10 to 100, batch sizes from 32 to 512, and epochs from 5 to 100 through a series of trial-and-error experiments. We found that epochs = 30, batch size = 64, sequence length = 30, and hidden units (n) = 128 were the optimal parameters to avoid overfitting and achieve low generalization error when training both models. The model parameters are kept the same for all datasets. The networks were trained on a single Quadro RTX 4000 8 GB GPU; the trained models can also be used for inference without losing much performance when no GPU is available. In addition, we repeated the experiments on a single NVIDIA 12 GB GeForce GTX 1080Ti GPU and observed the same results in both computing environments.

$$\begin{aligned} \mathrm{Exact\;Match\;Ratio\;(EMR)} = \frac{1}{n}\sum _{i=1}^{n}I(Y_i=Z_i) \end{aligned}$$
(7)

where I is the indicator function, \(Y_{i}\) is the target class, and \(Z_{i}\) is the predicted class.

$$\begin{aligned} Balanced\,Accuracy&= \frac{Sensitivity + Specificity}{2} \end{aligned}$$
(8)
$$\begin{aligned} F1-score&= \frac{2 * (precision * recall)}{(precision + recall)} \end{aligned}$$
(9)
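The three metrics can be computed as in the following sketch; `balanced_accuracy_score` from scikit-learn averages per-class recall, which coincides with Eq. (8) in the two-class case, and the EMR line implements Eq. (7) over the two residents' label sequences. Function and argument names are illustrative.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score

def evaluate(y_true_r1, y_pred_r1, y_true_r2, y_pred_r2):
    """Compute EMR (Eq. 7), balanced accuracy (Eq. 8) and micro
    F1-score (Eq. 9) for the two residents. A sketch."""
    y_true_r1, y_pred_r1 = np.asarray(y_true_r1), np.asarray(y_pred_r1)
    y_true_r2, y_pred_r2 = np.asarray(y_true_r2), np.asarray(y_pred_r2)
    # Eq. (7): a sample counts only if both residents' labels are correct
    emr = np.mean((y_true_r1 == y_pred_r1) & (y_true_r2 == y_pred_r2))
    return {
        "EMR": float(emr),
        "balanced_acc_R1": balanced_accuracy_score(y_true_r1, y_pred_r1),
        "balanced_acc_R2": balanced_accuracy_score(y_true_r2, y_pred_r2),
        "f1_micro_R1": f1_score(y_true_r1, y_pred_r1, average="micro"),
        "f1_micro_R2": f1_score(y_true_r2, y_pred_r2, average="micro"),
    }
```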

5 Results and discussion

In this section, the experimental results of the LSTM and BiLSTM networks with the different class imbalance approaches are presented and discussed in terms of exact match ratio, balanced accuracy, and micro-averaged F1-score. Figures 5, 6 and 7 present the balanced accuracy results for each resident of each dataset. Tables 1 and 2 report the results for House A (ARAS), Tables 3 and 4 for House B (ARAS), and Tables 5 and 6 for the CASAS-Kyoto Multiresident ADL Activities dataset, in terms of EMR and micro-averaged F1-score.

Fig. 5 ARAS House A balanced accuracy results

As discussed in the previous section, each table shows the results of the baseline model (without class imbalance techniques) followed by 12 experiments with data-level and algorithm-level techniques on the deep learning models. For all three approaches (oversampling, undersampling, and cost-sensitive learning), the term "single" denotes activity recognition for each resident separately and the term "multi" denotes combined activity recognition. The models are evaluated at different oversampling (2 and 5) and undersampling (0.25 and 0.5) ratios, together with different cost coefficient values (2 and 5), to enable a detailed comparison of the class imbalance approaches in a multi-resident setting.

Fig. 6 ARAS House B balanced accuracy results

Fig. 7 CASAS-Kyoto balanced accuracy results

The balanced accuracy results show that the single cost-sensitive learning approach outperforms all other class imbalance approaches in the majority of cases. In ARAS House A, the single cost-sensitive approach improves R1's balanced accuracy by 3% with LSTM and 1% with BiLSTM compared to the baseline model; for R2, it increases balanced accuracy by 1% with LSTM, whereas with BiLSTM, single undersampling improves balanced accuracy by 3%. In House B of the ARAS dataset, the cost-sensitive approach performs better in both the LSTM and BiLSTM models, except for the LSTM model of R2, where undersampling is slightly better. In the CASAS dataset, the single cost-sensitive approach clearly outperforms all other approaches, improving R1's balanced accuracy by 9% (LSTM) and 13% (BiLSTM) and R2's by 11% (LSTM) and 14% (BiLSTM) compared to the baseline model.

Table 1 LSTM-House A (ARAS)
Table 2 BiLSTM-House A (ARAS)

To summarize, the results show that cost-sensitive learning performs better in terms of balanced accuracy in almost all networks. For the EMR over both residents, no clear trend emerges: in House B the differences in EMR are minimal for both the LSTM and BiLSTM networks; in House A the baseline LSTM performed better than the other models, whereas for the BiLSTM network the EMR results for undersampling and the cost-sensitive approach are similar. In the CASAS-Kyoto dataset, EMR results are better for undersampling and the cost-sensitive approach. The F1-score of R2 is better than that of R1 in House A, whereas House B achieves higher F1-scores for both residents than House A. In the CASAS-Kyoto smart home, no significant difference is seen between the F1-scores of R1 and R2.

Table 3 LSTM-House B (ARAS)
Table 4 BiLSTM-House B (ARAS)

Since each SH dataset has a different configuration, sensor readings, activity labels, and degree of class imbalance, differences in model performance are observed across all three datasets. Computation on the CASAS-Kyoto dataset was much faster than on the ARAS dataset due to the smaller number of sensor observations per day. In terms of training time, the undersampling method was fastest, while multi-oversampling took considerably longer, as expected given the increased number of training samples. Among the deep learning models, the LSTM trained faster than the BiLSTM. Figure 8 shows the computation times of both models for all three datasets.

Table 5 LSTM (CASAS-Kyoto)
Table 6 BiLSTM (CASAS-Kyoto)
Fig. 8 Model execution time

5.1 Results on frequent activities

To analyze the different class imbalance approaches further, we extended our experiments by selecting the top five activities of each dataset and performing classification with the same LSTM and BiLSTM networks described above. Since the distribution remains imbalanced even after oversampling and undersampling, selecting the top five activities lets us analyze model performance on frequent activities. The model configurations are exactly the same as in the previous experiments; the ARAS results are shown in Tables 7, 8, 9 and 10. The model configurations for the CASAS frequent-activity experiments are likewise unchanged; Tables 11 and 12 present the results of the class imbalance techniques on the frequent activities of the CASAS-Kyoto dataset.

Table 7 LSTM-House A (ARAS)
Table 8 BiLSTM-House A (ARAS)
Table 9 LSTM-House B (ARAS)
Table 10 BiLSTM-House B (ARAS)
Table 11 LSTM-(CASAS-Kyoto)
Table 12 BiLSTM-(CASAS-Kyoto)

The EMR, balanced accuracy, and F1-scores of both House A and House B of the ARAS dataset improved considerably compared to the previous experiments when only frequent activities were used, which also makes the dataset more balanced and thus improves the performance of the LSTM and BiLSTM models. The EMR results improved considerably but are similar across all approaches for each dataset. For the balanced accuracy of frequent activities, the cost-sensitive approach again performed better than oversampling and undersampling in most cases. There were a few exceptions: in House B, for R2 activity classification, the oversampling approach with LSTM and the baseline BiLSTM model performed better than the other approaches. However, the cost-sensitive approach performed comparably in these cases; for example, the results of cost single (2) and single oversample (2) are almost equal in the LSTM model, and in the House B BiLSTM network the difference between the baseline and cost-multi (5) is small. In the CASAS-Kyoto dataset, multi-undersampling performed better with the BiLSTM network for both residents; however, in per-class F1-score results, the cost-sensitive method was better at classifying the minority classes.

The CASAS-Kyoto dataset showed improvement with the class imbalance techniques (Tables 11, 12) compared to the baseline models: in the LSTM model the cost-sensitive method outperformed all other methods, and in the BiLSTM model the undersampling approach performed better, although the per-class F1-scores of the cost-sensitive approach are almost identical to those of undersampling. The micro-averaged F1-scores of both House A and House B improved considerably in the frequent-activity experiments, whereas CASAS did not show much improvement. This may be due to the "curse of dimensionality" in SH datasets: not all sensors are relevant to the classification, and high dimensionality deteriorates classifier performance. Furthermore, the CASAS-Kyoto dataset showed clear differences in model performance across the class imbalance techniques, whereas no clear trend was observed in the ARAS dataset. This can be attributed to the fact that the CASAS-Kyoto dataset is fairly balanced whereas the ARAS dataset is highly imbalanced.

6 Conclusion

In the realm of multiple resident activity recognition, which is integral to the enhancement of smart technologies [75], elder care [76], and ambient assisted living systems [77], as well as safety and context-aware applications [78], the importance of explainability, retraceability, and human interpretability cannot be overstated. Explainability is paramount in the application of complex models such as LSTM and BiLSTM networks, as it fosters trust and acceptance among users and stakeholders. The ability to interpret model decisions is critical, especially in environments such as health care and ambient assisted living, where decisions must be transparent and justifiable. This study, through the lens of class imbalance techniques, has not only sought to enhance the accuracy of activity recognition systems but also contributed to the field of explainable AI by exploring how different techniques can influence model interpretability.

Retraceability, the ability to audit the data and processes that lead to a particular model decision, is essential for compliance with regulatory frameworks that govern AI systems, particularly in Europe where the right to explanation is an emerging requirement [79]. By meticulously documenting the experimental setup, including data processing, model configuration, and the application of class imbalance techniques, this study provides a blueprint for retraceability.

Human interpretability is inherently linked with the first two concepts, emphasizing the need for model predictions to be understandable by humans. This is especially pertinent when AI systems are used to support decision-making in critical settings. The study's findings suggest that cost-sensitive learning can improve performance metrics, such as balanced accuracy, which is a step toward making AI decisions more interpretable. The interpretability of such approaches must be further investigated to ensure that users can comprehend and trust the system's predictions.

The discussion on scalability also implicitly acknowledges the challenges of maintaining explainability and interpretability as the complexity of the environment increases. With more residents and potentially more complex class distributions, the importance of designing AI systems that are not only accurate but also explainable and interpretable becomes more pronounced. Thus, this research does not merely present a set of algorithms for activity recognition but also paves the way for future studies that must consider these critical dimensions of AI development in order to be deemed trustworthy and human-centered.

To elevate the performance of our model, particularly concerning the minority class, future endeavors will be directed towards the examination of alternative deep learning architectures and hybrid methodologies that are adept at negotiating class imbalance within the multi-resident milieu [80]. In the pursuit of universal access in the information society, it is critical to align these technological advances with the principles of explainable AI, especially in the context of graph neural networks [81, 82]. The ambition is to construct models that are not only effective but also human-interpretable, particularly for temporal Smart Home (SH) datasets.

Human interpretability engenders an understanding of the reasoning behind network decisions, which in turn cultivates trust in the system, a necessity for the universal adoption of such technologies [83]. Furthermore, the deployment of AI in settings with diverse and potentially vulnerable populations accentuates the need for transparent and accountable systems.

As we forge ahead, it is imperative to recognize that novel evaluation paradigms are required to adequately assess the efficacy of such models within the context of imbalanced datasets [84]. These new paradigms must address not only the technical accuracy of model predictions but also the explainability and fairness [85] of these predictions to ensure that AI systems contribute positively to the inclusivity and accessibility of the information society. Thus, our future research is poised to contribute to this critical discourse, ensuring that the advancements in AI are both technically sound and ethically responsible, facilitating a more equitable information society.