1 Introduction

In recent decades the demands of evaluating the usability of interactive web-based systems have produced several assessment procedures. Very often, during usability inspection, there is a tendency to overlook features of the users, aspects of the context and characteristics of the tasks. This tendency is partly explained by the lack of a model that unifies all of these aspects. Considering features of users is fundamental for the User Modeling community [1, 16]. Similarly, taking the context of use into consideration is extremely important for inferring reliable assessments of usability [3, 36]. Additionally, during the usability assessment process, accounting for the demands of the task executed is core for describing user experience [20]. Building a cohesive model is not trivial; however, we believe the construct of human mental workload (MWL) – often referred to as cognitive load – can significantly contribute to such a goal and inform interaction and web-design. MWL, with roots in Psychology, has been mainly applied within the fields of Ergonomics and Human Factors. Its assessment is key to measuring performance, which in turn is fundamental for describing user experience and engagement. A few studies have tried to employ the construct of MWL to explain usability [2, 24, 41, 46, 50]. Despite this interest, not much has yet been done to investigate their relationship empirically. The aim of this research is to empirically test the relationship between the subjective perception of usability and mental workload, as well as their impact on objective user performance, understood here as tangible, quantifiable facts (Fig. 1).

Fig. 1. Schematic overview of the empirical study

This paper is organised as follows. Firstly, notable definitions of usability and mental workload are provided, followed by an overview of the assessment techniques employed in Human-Computer Interaction (HCI). Related work is also presented, highlighting how the two constructs have been employed so far, distinctly and jointly. An experiment is subsequently designed in the context of human-web interaction, aimed at investigating the relationship between the perception of usability of three popular web-sites (Youtube, Wikipedia and Google) and the mental workload experienced by users after interacting with them. Results are presented and critically discussed, showing how these constructs interact and how they impact objective user performance. A summary concludes this paper pointing to future work and highlighting the contribution to knowledge.

2 Core Notions and Definitions

Widely employed in the broader field of HCI, usability and mental workload are two constructs from Ergonomics with no crystal-clear, generally applicable definitions. There is an acute debate on their assessment and measurement [4,5,6]. Although ill-defined, they remain extremely important for describing the user experience and improving interaction, interface and system design.

2.1 Definitions of Usability

The amount of literature covering definitions [21, 48], frameworks and methodologies for assessing usability is vast. The ISO (International Organisation for Standardisation) defines usability as ‘The extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use’. Usability, according to Nielsen [38], is a method for improving ease-of-use in the design of interactive systems and technologies. It embraces other concepts such as efficiency, learnability and satisfaction. It is often associated with the functionalities of a product rather than being merely a feature of the user interface [39].

2.2 Measures of Usability

Often when selecting an appropriate procedure in the context of interaction and web-design, it is desirable to consider the effort and expense that will be incurred in collecting and analysing data. For this reason, designers have tended to adopt subjective usability assessment techniques for collecting feedback from users [21]. On one hand, self-reporting techniques can only be administered post-task, which affects their reliability for long tasks. Meta-cognitive limitations can also diminish the accuracy of reporting, and it is difficult to perform comparisons among raters on an absolute scale. On the other hand, these techniques appear to be the most sensitive and diagnostic [21]. Nielsen’s principles, thanks to their simplicity in terms of effort and time, are frequently employed to evaluate the usability of interfaces [38]. The evaluation is done iteratively by systematically finding usability problems in an interface and judging them according to the principles [39]. The main problem associated with these principles is that they mainly focus on the user interface, neglecting contextual factors, the cognitive state of the users and the underlying tasks.

The System Usability Scale [9] is a questionnaire that consists of ten questions (Table 9). It is a highly cited usability assessment method and it has been widely applied [7]. It is a very easy scale to administer, demonstrating reliability in distinguishing usable from unusable systems, even with small sample sizes [54]. Alternatives include the Computer System Usability Questionnaire (CSUQ), developed at IBM, and the Questionnaire for User Interface Satisfaction (QUIS), developed at the HCI lab at the University of Maryland. The former is a survey that consists of 19 questions on a seven-point Likert scale from ‘strongly disagree’ to ‘strongly agree’ [25]. The latter was designed to assess users’ satisfaction with aspects of a computer interface [49]. It includes a demographic questionnaire, a measure of system satisfaction along six scales, and a hierarchy of measures of nine specific interface factors. Each of these factors relates to a user’s satisfaction with that particular aspect of an interface, as well as to the factors that make up that facet, on a 9-point scale. Although it is more complex than other instruments, QUIS has shown high reliability across several interfaces [19]. Many other usability inspection methods and techniques have been proposed in the literature [21, 54].

2.3 Definitions of Mental Workload

Human Mental Workload (MWL) is an important design concept and it is fundamental for exploring the interaction of people with technological devices [29, 31, 32]. It has a long history in Psychology, with applications in Ergonomics, especially in the transportation industry [14, 20]. The principal reason for MWL assessment is to quantify the cognitive cost associated with performing a task, in order to predict operator or system performance [10]. However, it has been widely reported that both mental underload and overload can negatively influence performance [2, 50]. One of these studies proposed a technique to identify sub-areas of a web-site in which end-users manifested a higher mental workload during interaction, allowing designers to modify those critical regions. Similarly, [15] investigated how the design of query interfaces influences stress, workload and performance during information search. Here stress was measured by physiological signals and a subjective assessment technique – the Short Stress State Questionnaire. Mental workload was assessed using the NASATLX, and log data was used as an objective indicator of performance to characterise search behaviour.

4 Design of Experiments

A study involving human participants executing typical tasks over 3 popular web-sites (Youtube, Google, Wikipedia) was set up to investigate the relationship between the perception of usability, mental workload and objective performance. One self-assessment procedure for measuring usability and two for measuring mental workload were employed:

  • the System Usability Scale (SUS) [9]

  • the Nasa Task Load Index (NASATLX), developed at NASA [20]

  • the Workload Profile (WP) [52], based on Multiple Resource Theory [56, 57].

Five classes of the objective performance of participants on tasks were set:

  1. the task was not completed as the user gave up
  2. the execution of the task was terminated because the available time was over
  3. the task was completed and no answer was required by the user
  4. the task was completed, the user provided an answer, but it was wrong
  5. the task was completed and the user provided the correct answer.

These are sometimes conditionally dependent (Fig. 2). The experimental hypotheses are defined in Table 1 and illustrated in Fig. 3.

Fig. 2. Partial dependencies of classes of objective performance

Table 1. Research hypotheses
Fig. 3. Illustration of research hypotheses

4.1 Details of Experimental Subjective Self-reporting Techniques

The System Usability Scale is a subjective usability assessment instrument that uses a Likert scale, bounded in the range 1 to 5 [9]. Questions can be found in Table 9. Individual scores are not meaningful on their own. For odd questions (\(SUS_i\) with \(i=\{1|3|5|7|9\}\)), the score contribution is the scale position (\(SUS_i\)) minus 1. For even questions (\(SUS_i\) with \(i=\{2|4|6|8|10\}\)), the contribution is 5 minus the scale position. For comparison purposes, the SUS value is converted to the range \([0..100] \in \mathfrak {R}\) with \(i_1=\{1,3,5,7,9\}, \ i_2=\{2,4,6,8,10\}\)

$$SUS= 2.5 \cdot \Bigg [ \sum _{i_1} (SUS_i - 1) + \sum _{i_2} (5 - SUS_i) \Bigg ]$$
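As an illustration of the scoring rule above, the SUS computation can be sketched in a few lines of Python (a minimal sketch; the function and variable names are ours and hypothetical, not part of the original instrument):

```python
# Minimal sketch of the SUS scoring rule described above (hypothetical names).
def sus_score(responses):
    """responses: the 10 Likert ratings (1-5), ordered as SUS items 1..10."""
    assert len(responses) == 10
    odd = sum(responses[i] - 1 for i in range(0, 10, 2))   # items 1,3,5,7,9
    even = sum(5 - responses[i] for i in range(1, 10, 2))  # items 2,4,6,8,10
    return 2.5 * (odd + even)                              # final value in [0, 100]

# Example: a neutral respondent (all 3s) obtains a SUS of 50.
print(sus_score([3] * 10))  # 50.0
```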

The NASA Task Load Index instrument [20] belongs to the category of self-assessment measures. It has been validated in the aviation industry and other contexts in Ergonomics [20, 45], with several applications in many socio-technical domains. It is a combination of six factors believed to influence MWL (questions of Table 10). Each factor is quantified with a subjective judgement coupled with a weight computed via a paired comparison procedure. Subjects are required to decide, for each possible pair (binomial coefficient, \(\left( {\begin{array}{c}6\\ 2\end{array}}\right) = 15\)) of the 6 factors, ‘which of the two contributed the most to mental workload during the task’, such as ‘Mental or Temporal Demand?’, and so forth. The weights w are the number of times each dimension was selected, ranging from 0 (not relevant) to 5 (more important than any other attribute). The final MWL score is computed as a weighted average, considering the subjective rating of each attribute \(d_i\) and the corresponding weights \(w_i\):

$$ NASATLX : [0..100] \in \mathfrak {R}\qquad NASATLX = \Biggl ( \sum _{i=1}^{6} d_i \times w_i\Biggr ) \frac{1 }{15} $$
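For concreteness, this weighted aggregation can be sketched as follows (a minimal sketch with hypothetical ratings and weights; it is not the official NASA software):

```python
# Minimal sketch of the NASA-TLX aggregation described above (hypothetical names).
def nasa_tlx(ratings, weights):
    """ratings: the 6 subjective ratings in [0, 100];
    weights: how often each factor was chosen in the 15 pairwise comparisons."""
    assert len(ratings) == 6 and sum(weights) == 15
    return sum(d * w for d, w in zip(ratings, weights)) / 15.0  # value in [0, 100]

# Example: mental demand dominates the pairwise comparisons.
print(nasa_tlx([70, 0, 40, 50, 30, 60], [5, 0, 3, 3, 2, 2]))  # ~53.3
```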

The Workload Profile (WP) assessment procedure [52] is built upon the Multiple Resource Theory proposed in [56, 57]. In this theory, individuals are seen as having different capacities or ‘resources’ related to: \(\bullet \) stage of information processing – perceptual/central processing and response selection/execution; \(\bullet \) code of information processing – spatial/verbal; \(\bullet \) input – visual and auditory processing; \(\bullet \) output – manual and speech output. Each dimension is quantified through subjective rates (questions of Table 11): after task completion, subjects are required to rate the proportion of attentional resources used for performing a given task with a value in the range \(0..1 \in \mathfrak {R}\). A rating of 0 means that the task placed no demand, while 1 indicates that it required maximum attention. The aggregation strategy is a simple sum of the 8 rates \(d\) (averaged here, and scaled to \([0..100] \in \mathfrak {R}\) for comparison purposes):

$$ WP : [0..100] \in \mathfrak {R}\qquad WP=\frac{1}{8}\sum _{i=1}^{8} d_i \times 100 $$
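A corresponding sketch of the WP aggregation (with hypothetical ratings) is:

```python
# Minimal sketch of the Workload Profile aggregation described above.
def workload_profile(ratings):
    """ratings: 8 proportions in [0, 1], one per attentional-resource dimension."""
    assert len(ratings) == 8
    return sum(ratings) / 8.0 * 100.0  # averaged and rescaled to [0, 100]

print(workload_profile([0.4, 0.2, 0.6, 0.3, 0.5, 0.1, 0.7, 0.2]))  # 37.5
```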

4.2 Participants and Procedure

A sample of 46 people fluent in English volunteered to participate in the study after signing a consent form. Subjects were divided into 2 groups of 23 each, with no overlap between group A and group B. Participants could not interact with instructors during the tasks and did not require training. Ages ranged from 20 to 35 years; 24 females and 22 males were evenly distributed across the 2 groups (Total - Avg.: 28.6, Std.: 3.98; g.A - Avg.: 28.35, Std.: 4.22; g.B - Avg.: 28.85, Std.: 3.70), all with a daily Internet usage of at least 2 hours. Participants were required to execute a set of 9 information-seeking web-based tasks (Table 13) as naturally as they could, over 2 or 3 sessions of approximately 45–70 min each, on different non-consecutive days. Tasks differed in terms of difficulty, time-pressure, time-limits, interference, interruptions and demands on different psychological modalities. Two groups were created because the tasks were executed on web-based interfaces, sometimes altered at run-time through a CSS/HTML manipulation (as in Table 12). This manipulation was implemented, as part of a larger study [27, 28, 34], to enable A/B testing of web-interfaces (not included here). The interface alteration was not extreme, such as making content very hard to read; rather, the goal was to alter the original interface in order to manipulate task difficulty and usability independently. The order of the tasks administered was the same for all the participants. Computerised versions of the SUS (Table 9), the NASATLX (Table 10) and the WP (Table 11) instruments were administered immediately after task completion. Note that the NASA-TLX question related to ‘physical load’ was set to 0, as was its weight; consequently, the pairwise comparison procedure was shorter. Some volunteers did not execute all the tasks and the final dataset contains 405 cases.

5 Results

Table 2 contains the means and standard deviations of the usability and the mental workload scores for each task, depicted also in Fig. 4.

Table 2. Mental workload and usability - Groups A, B (G.A/G.B)
Fig. 4. Summary statistics by task

5.1 Testing Hypothesis 1 - Difference Between Usability and Mental Workload

From an initial analysis of Fig. 5, it seems clear that there is no correlation between the usability scores (SUS) and the mental workload scores (NASATLX, WP). This is statistically confirmed in Table 3 by the Pearson and Spearman correlation coefficients computed over the full dataset (Groups A, B). Pearson was chosen for exploring linear correlation, while Spearman was chosen for monotonic relationships that are not necessarily linear.
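For reference, both correlation analyses can be reproduced with standard statistical routines; the sketch below assumes a pandas DataFrame loaded from a hypothetical file with one row per case and hypothetical column names SUS, NASATLX and WP:

```python
# Illustrative sketch of the correlation analysis (file and column names hypothetical).
import pandas as pd
from scipy.stats import pearsonr, spearmanr

df = pd.read_csv("scores.csv")  # one row per case: columns SUS, NASATLX, WP
for mwl in ["NASATLX", "WP"]:
    r, p_r = pearsonr(df[mwl], df["SUS"])        # linear correlation
    rho, p_rho = spearmanr(df[mwl], df["SUS"])   # monotonic correlation
    print(f"{mwl} vs SUS: Pearson r={r:.2f} (p={p_r:.3f}), "
          f"Spearman rho={rho:.2f} (p={p_rho:.3f})")
```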

Fig. 5. Scatterplots of NASATLX, WP vs SUS.

Table 3. Correlation coefficients

Although the perception of usability does not seem to correlate at all with mental workload, a further investigation of their relationship was performed on the scores obtained for each task. Table 4 lists the correlations between the MWL scores (NASATLX, WP) and the usability scores (SUS), and Fig. 6 their densities. Generally, in the behavioural/social sciences, there may be a greater contribution from complicating factors, as in the case of subjective ratings. Hence, correlations above 0.5 are regarded as very high, those within [0.3–0.5] as medium/moderate and those within [0.1–0.3] as small (symmetrically for negative values) [12, p. 82]. For this analysis, only medium/high coefficients are considered. Yet, a clearer picture does not emerge and only a few tasks show some form of correlation between mental workload and usability. Figure 7 provides further details, aiming at extracting further information and possible interpretations of why workload scores were moderately/highly correlated with usability.

Table 4. Correlations MWL vs usability. Groups A and B
Fig. 6. Density plots of the correlations by task - Group A, B
Fig. 7. Details of tasks with moderate/high correlation

  • task 1/A and task 4/B: WP is moderately negatively correlated with SUS. This suggests that when the proportion of attentional resources taxed by a task is moderate and decreases, the perception of good usability increases. In other words, when web-interfaces and the tasks executed over them require a moderate use of the different stages and codes of information processing and of the input and output modalities (Sect. 4.1), the usability of those interfaces is increasingly perceived as positive.

  • task 9/A and task 9/B: the NASATLX is highly and positively correlated with SUS. This suggests that, even when time pressure is imposed upon tasks, causing an increment in the workload experienced, and the perception of performance decreases because the task answer is not found, the perception of usability is not affected if the task is pleasant and amusing (like task 9). In other words, even if the experienced workload increases but is not excessive, and even if the interface is slightly altered (task 9, group B), the perception of good usability is strengthened if tasks are enjoyable.

  • tasks 1/B, 4/B, 5/B, 7/B: the NASATLX is highly negatively correlated with SUS. This suggests that when the MWL experienced by users increases, perhaps because tasks are not straightforward, the perception of usability can be negatively affected even with a slight alteration of the interface.

The above interpretations do not aim to be exhaustive; they are our own interpretations, they cannot be generalised and they are confined to this study. To further strengthen the data analysis, an investigation of the correlation between the MWL and the usability scores was performed by considering users on an individual basis (Table 5 and Fig. 8).

Table 5. Correlation MWL-usability by user
Fig. 8. Density plots of the correlations by user

As in the previous analysis (by task), only medium and high correlation coefficients (\({>}0.3\)) are considered for deeper investigation. Additionally, because the results of Tables 3 and 4 were not able to systematically show common trends, the analysis on an individual basis was reinforced by considering only those users for which both a medium/high linear relationship (Pearson) and a monotonic relationship (Spearman) were detected between the two MWL scores (NASA, WP) and the usability scores (SUS). Table 5 highlights these users (1, 5, 11, 12, 21, 22, 27, 39, 40, 46). The objective was to look for the presence of any particular pattern of user behaviour or a complex deterministic structure. Figure 9 depicts the scatterplots associated with these users, with a straight linear regression line and a local smoothing regression line (Lowess algorithm [11]). The former type of regression is parametric and relies on the assumption of normality, while the latter is non-parametric and is aimed at supporting exploration and the identification of patterns, enhancing the ability to see a line of best fit over data that is not necessarily normally distributed. Outliers are not removed from the scatterplots: this decision is justified by the limited number of points – at most 9, coinciding with the number of tasks.
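Such per-user panels can be reproduced with standard tools; the sketch below assumes matplotlib, numpy and the Lowess implementation in statsmodels, with hypothetical variable names and example values:

```python
# Sketch of one per-user panel: scatter, straight regression line, Lowess smoother.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

def plot_user(mwl, sus, xlabel, ax):
    ax.scatter(mwl, sus)
    slope, intercept = np.polyfit(mwl, sus, 1)          # parametric linear fit
    xs = np.linspace(min(mwl), max(mwl), 50)
    ax.plot(xs, intercept + slope * xs, label="linear")
    smooth = lowess(sus, mwl, frac=0.8)                 # non-parametric local fit
    ax.plot(smooth[:, 0], smooth[:, 1], label="lowess")
    ax.set_xlabel(xlabel); ax.set_ylabel("SUS"); ax.legend()

# Hypothetical example with at most 9 points per user (one per task), as in the study.
fig, ax = plt.subplots()
plot_user([35, 42, 50, 55, 60, 48, 52, 58, 45],
          [72, 70, 65, 60, 55, 68, 62, 58, 66], "NASATLX", ax)
plt.show()
```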

Fig. 9. Correlations MWL-usability for users with moderate/high Pearson and Spearman coefficients

No clear and consistent patterns emerge from Fig. 9. However, by analysing the mental workload scores (NASATLX and WP), it is possible to note that the 10 selected users all achieved, except for a few outliers, scores of optimal mental workload (on average between 20–72). In other words, these users did not perceive underload or overload while executing the nine tasks. From an analysis of the usability assessments, all these users produced scores higher than 40, indicating that no interface was perceived as completely unusable. This might indicate that when the mental workload experienced by users is within an optimal range, and usability is not bad, the combination of mental workload and usability in a joint model might not explain objective performance better than mental workload alone. In the other cases, where the correlation of mental workload and usability is almost non-existent, a joint model might better explain objective performance. The following section is devoted to testing this.

5.2 Testing Hypothesis 2 - Usability and Mental Workload Impact Performance More than Just Workload

From the previous analysis it appears that the perception of usability and the mental workload experienced by users are not related, except for a few cases in which mental workload was optimal and usability was not bad. Nonetheless, as previously reviewed, the literature suggests that these constructs are important for describing and exploring the user’s experience with an interactive system. For this reason, a further investigation of the impact of the perception of usability and mental workload on objective performance was conducted to test hypothesis 2 (Sect. 4). In this context, objective performance refers to objective indicators of the performance of the volunteers who participated in the user study, categorised into 5 classes (Sect. 4). During the experiment, the measurement of the objective performance of users was in some cases faulty. These cases were discarded and a new dataset with 390 valid cases was formed. The exploration of the impact of the perception of usability and mental workload on the 5 classes of objective performance was treated as a classification problem, employing supervised machine learning. In detail, 4 different classification methods were chosen to predict the objective performance classes, according to different types of learning (a minimal sketch follows the list):

  • information-based learning: decision trees (with Gini coefficient);

  • similarity-based learning: k-nearest neighbors;

  • probability-based learning: Naive Bayes;

  • error-based learning: support vector machine (with a radial kernel) [8, 23].
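The four learners above map onto standard implementations; the sketch below assumes scikit-learn (the paper does not state which implementation was used, so any parameter not named above is a library default):

```python
# Hypothetical instantiation of the four classifiers with scikit-learn.
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

classifiers = {
    "decision tree": DecisionTreeClassifier(criterion="gini"),  # information-based
    "k-nearest neighbors": KNeighborsClassifier(),              # similarity-based
    "naive Bayes": GaussianNB(),                                 # probability-based
    "SVM (radial kernel)": SVC(kernel="rbf"),                    # error-based
}
```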

The distribution of the 5 classes is depicted in Fig. 10 and Table 6:

Clearly, the above frequencies are unbalanced. For this reason a new dataset was formed through oversampling, a technique to adjust class distributions and to correct for a bias in the original dataset, aimed at reducing the negative impact of class imbalance on model fitting. The minority classes were randomly sampled (with replacement) until they matched the size of the majority class (Table 6). The two mental workload indexes (NASA and WP) and the usability index (SUS) were treated as independent variables (features) and they were used both individually and in combination to form models aimed at predicting the 5 classes of objective performance (Fig. 11).
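The oversampling step can be sketched as follows, assuming a pandas DataFrame with a hypothetical 'performance' column holding the class label:

```python
# Sketch of random oversampling with replacement to balance the five classes.
import pandas as pd

def oversample(df, target="performance", random_state=42):
    n_max = df[target].value_counts().max()        # size of the majority class
    balanced = [grp.sample(n=n_max, replace=True, random_state=random_state)
                for _, grp in df.groupby(target)]  # resample each class to n_max
    return pd.concat(balanced).reset_index(drop=True)
```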

Fig. 10. Distribution of performance classes - original dataset
Table 6. Frequencies of classes
Fig. 11. Independent features and classification techniques

The independent features were normalised to the range \([0..1] \in \mathfrak {R}\) to facilitate the training of models, and 10-fold stratified cross-validation was adopted in the training phase. In other words, the oversampled dataset was divided into 10 folds and, in each fold, the original ratio of the distribution of the objective performance classes (Fig. 10, Table 6) was preserved. 9 folds were used for training and the remaining fold for testing against accuracy; this was repeated 10 times, changing the testing fold. This generated 10 models and produced 10 classification accuracies for each learning technique and for each combination of independent features (Fig. 12, Table 7). It is important to note that the training sets (a combination of 9 folds) and test sets (the remaining holdout fold) were always the same across the classification techniques and the different combinations of independent features (paired 10-fold CV). This is critical for a fair comparison of the different trained models using the same training/test sets.
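A sketch of this paired, stratified 10-fold procedure is given below, assuming scikit-learn and synthetic placeholder data (the real features would be the normalised NASATLX, WP and SUS scores):

```python
# Sketch of paired 10-fold stratified cross-validation: the same folds are reused
# for every feature subset and classifier, so per-fold accuracies can be compared.
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((390, 3))     # hypothetical columns: NASATLX, WP, SUS (normalised)
y = rng.integers(0, 5, 390)  # hypothetical performance classes 0..4

def cv_accuracies(X, y, model, folds):
    """Per-fold test accuracy of a model on a fixed list of (train, test) splits."""
    accs = []
    for train_idx, test_idx in folds:
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        accs.append(accuracy_score(y[test_idx], fitted.predict(X[test_idx])))
    return np.array(accs)

folds = list(StratifiedKFold(n_splits=10, shuffle=True, random_state=42).split(X, y))
acc_mwl = cv_accuracies(X[:, [0]], y, SVC(kernel="rbf"), folds)         # MWL only
acc_mwl_sus = cv_accuracies(X[:, [0, 2]], y, SVC(kernel="rbf"), folds)  # MWL + SUS
```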

Fig. 12. Independent features, classification technique, distribution of accuracies with 10-fold stratified cross validation

To test hypothesis 2, the 10-fold cross-validated paired Wilcoxon statistical test was chosen to compare two matched accuracy distributions and to assess whether their population mean ranks differ (it is a paired difference test) [58]. This test is a non-parametric alternative to the paired Student’s t-test, selected because the population of accuracies (obtained by testing each holdout fold) was assumed not to be normally distributed. Table 8 lists these tests for the individual models (containing only the mental workload feature) against the combined models (containing both the mental workload and the usability features). Except in one case (k-nearest neighbor, using the NASA-TLX as feature), the addition of the usability measure (SUS) to the mental workload feature (NASA or WP) statistically significantly increased the classification accuracy of the induced models, trained with the 4 selected classifiers. This suggests that mental workload and usability can be jointly employed to explain objective performance, an extremely important dimension of user experience.
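The corresponding significance test can be sketched as below, with hypothetical per-fold accuracies standing in for those summarised in Table 7:

```python
# Sketch of the paired Wilcoxon signed-rank test on two matched accuracy distributions.
from scipy.stats import wilcoxon

# Hypothetical per-fold accuracies (10 folds) of the MWL-only and MWL+SUS models.
acc_mwl = [0.41, 0.39, 0.44, 0.40, 0.42, 0.38, 0.43, 0.40, 0.41, 0.39]
acc_mwl_sus = [0.47, 0.45, 0.49, 0.46, 0.48, 0.44, 0.47, 0.46, 0.45, 0.48]

stat, p = wilcoxon(acc_mwl, acc_mwl_sus)  # paired, non-parametric difference test
print(f"Wilcoxon statistic = {stat}, p-value = {p:.4f}")
```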

Table 7. Ordered distributions of accuracies of trained models
Table 8. Wilcoxon test of distributions of accuracies with different independent features and learning classifiers
Table 9. System Usability Scale (SUS)
Table 10. The NASA Task Load Index (NASA-TLX)
Table 11. Workload Profile (WP)
Table 12. Run-time manipulation of web-interfaces
Table 13. Experimental tasks (M = manipulated; g = Group)

5.3 Summary of Findings

In summary, from empirical evidence, the two hypotheses can be accepted.

  • \(H_{1}\): Usability and mental workload are two uncorrelated constructs (as measured with the selected self-reporting techniques: SUS, NASA-TLX, WP).

They capture different variance in experimental tasks. This was tested by a correlation analysis (both parametric and non-parametric) which confirmed that the two constructs are not correlated. The obtained Pearson coefficients suggest that there is no linear correlation between usability (SUS scale) and mental workload (NASA-TLX and WP scales). The Spearman coefficients confirmed that there is no tendency for usability to either increase or decrease when mental workload increases. The large variation in correlations across different tasks and different individuals is interesting and worthy of future investigation.

  • \(H_{2}\): A unified model incorporating a usability and a MWL measure can better explain objective performance than MWL alone.

This was tested by inducing combined and individual models, using four supervised machine learning classification techniques, to predict the objective performance of users (five classes of performance). According to the Wilcoxon non-parametric test, the combined models were in most cases able to predict objective user performance significantly better than the individual models.

6 Conclusion

This study investigated the correlation between the perception of usability and the mental workload imposed by typical tasks executed over three popular web-sites: Youtube, Wikipedia and Google. Prominent definitions of usability and mental workload were presented, with a particular focus on the latter. This is because usability is a central notion in human-computer interaction, with a plethora of definitions and applications existing in the literature, whereas the construct of mental workload has a background in Ergonomics and Human Factors but is less frequently mentioned in HCI. A well-known subjective instrument for assessing usability – the System Usability Scale – and two subjective mental workload assessment procedures – the NASA Task Load Index and the Workload Profile – were employed in a user study involving 46 subjects. Empirical evidence suggests that there is no relationship between the perception of usability of a set of web-interfaces and the mental workload imposed on users by a set of tasks executed on them. In turn, this suggests that the two constructs describe two non-overlapping phenomena. The implication is that they could be jointly used to better describe objective indicators of user performance, a dimension of user experience. Future work will be devoted to replicating this study with a set of different interfaces and tasks and with different usability and mental workload assessment instruments. The contributions of this research are to offer a new perspective on the application of mental workload to traditional usability inspection methods, and a richer approach to explaining human-system interaction and supporting its design.