1 Introduction

Monitoring systems have become increasingly prevalent in order to increase the safety of elderly people who live alone. These systems are designed to raise alerts when adverse events are detected, which in turn enables family and carers to take action in a timely manner. However, monitoring systems typically suffer from two problems: they may generate false alerts or miss true adverse events.

This motivates our first research question, focusing on a care-taking scenario: (RQ1) What is the effect of monitoring-system accuracy and error type on users’ trust in automation and behaviour? This question is addressed in our first user study, called the monitoring-system study. This study was conducted in a web-based game set in a retirement village, where elderly residents live in smart homes equipped with monitoring systems. Players, who “work” in the village, perform a primary task whereby they must ensure the welfare of the residents by attending to adverse events in a timely manner, and a secondary routine task that demands their attention.Footnote 1 These conditions are typical of settings where workers perform various duties in addition to keeping an eye on a monitoring system.

In our study, users interacted with different monitoring systems that committed two types of common errors: False Alerts (FAs) and Misses of true events. We investigated the influence of these errors on users’ trust in a monitoring system, compliance and reliance. Compliance refers to an operator responding in accordance with an alarm signal (in our case, an alert raised by the monitoring system), and reliance refers to an operator taking no action when no alarm is raised (in our case, assuming that the monitoring system has not missed any adverse events) [45].

Even though participants reacted to FAs and Misses in this study, most users did not make large changes to their general behaviour patterns throughout the study. This led us to posit our second research question: (RQ2) If we had a good advisor agent that makes recommendations about how users should interact with a monitoring system, what is the effect of its recommendations on users’ behaviour? This question is addressed in our second study, called the advisor-agent study. Here, we added to the game a “very good” agent that makes rational recommendations about how to act in light of a monitoring system’s alerts or lack thereof, and is endowed with features that are deemed influential according to the literature (Section 2.2); and we determined the training effect of this agent, i.e., do users change their compliance and reliance on the monitoring system as a result of their exposure to the agent? The idea is that if such an agent can affect user behaviour, then it is worth performing further studies that examine the impact of individual agent features. However, such an investigation is outside the scope of this research.

Our observations of users’ behaviour in both studies prompted our third research question: (RQ3) Can we identify factors that predict users’ trust and behaviour? One of the factors we considered in order to answer this question was user type. Specifically, we asked: Can we identify user types that shed light on behaviours of interest and have predictive value? To answer this question, we automatically grouped participants according to basic observable behaviours. We then checked whether the answers to research questions RQ1 and RQ2 differ for different groups, e.g., do Misses affect the behaviour of one group more than that of another?

Table 1 summarizes the research questions, the studies that address them, the platforms that support these studies, and the main results. We now describe the main findings obtained from our studies, followed by the methodological contributions pertaining to the design of the game and the advisor agent.

Table 1 Research questions, user studies, supporting platforms and results.

1.1 Main findings

Our main findings pertain to: (1) the identification of user types that shed light on users’ trust in automation and aspects of their behaviour; (2) the effect of monitoring-system accuracy and error type on users’ trust, compliance and reliance on the system; (3) the effect of the recommendations made by a good advisor agent on users’ compliance and reliance on the monitoring system, and learning outcomes; and (4) the identification of influential factors in models that predict users’ trust and aspects of their behaviour.

User types obtained from basic behaviours. We automatically identified different types of users according to their behaviour in various stages of the game. For instance, high-compliance / high-reliance users attend to a large proportion of alerts, and tend not to act in the absence of alerts; while low-compliance / low-reliance users ignore many alerts, but act often in the absence of alerts. These user types shed further light on the answers to our research questions, as described below.

Effect of monitoring-system accuracy and error type on user trust in automation and behaviour. Intuitively, one would expect high FAs to reduce trust and compliance, and high Misses to reduce trust and reliance. In addition, one would expect trust to be positively correlated with system performance, and with compliance and reliance (the more we trust a system, the more we comply with its advice and rely on it, i.e., the less we check on it).

As shown in Section 2.1, these intuitions were not always validated experimentally. Hoff and Bashir [27] address this point in their comprehensive survey paper by noting that “Overall, the specific impact that false alarms and misses have on trust likely depends on the negative consequences associated with each error type in a specific context”. Thus, it is worthwhile to conduct studies about trust and behaviour in different contexts — our elderly care-taking scenario, where the consequences of users’ actions affect elderly patients, has not been investigated to date. In addition, our study differs from most previous studies in that it considers FAs and Misses together, since both occur in real monitoring systems.

The results of our study agree with the first set of intuitions: (1) system performance (FAs and Misses) affects trust (FAs have a slightly stronger effect), (2) FAs affect compliance, and (3) Misses affect reliance. However, when we considered separately the types of users we identified, we found that they are affected differently by FAs and Misses. For instance, users of one type reacted to changes in FAs, but were largely unaffected by changes in Misses.

Interestingly, the second set of intuitions did not hold: we found no correlation between trust and reliance, and we found a very weak correlation between trust and compliance.

Effect of the recommendations made by a good advisor agent on users’ behaviour. This part of our work falls under the umbrella of choice architectures [66], in the sense that we compare two ways in which people interact with a monitoring system: (1) users make their own decisions about how to react to alerts raised by a monitoring system or lack thereof; and (2) an advisor agent makes suggestions to steer users towards making better decisions [46]. The results of our study indicate that the agent-based architecture leads to improved decision-making outcomes (in the direction encouraged by the agent), both while users work with the agent and when they work independently later on, which shows that the agent has a training effect. In addition, the agent’s advice attenuated the differences between different types of users, yielding larger improvements for those who did not perform the monitoring task well initially.

Influential factors in models that predict users’ trust in automation and aspects of their behaviour. We automatically derived models that make the following predictions: trust in the monitoring system, compliance and reliance; and trust in the advisor agent, conformity with its advice and behaviour improvement (when users no longer had the agent).Footnote 2 We found that a feature’s value is predictive of its value in the short term, e.g., previous trust and reliance predict future trust and reliance respectively – a phenomenon that is related to trust inertia [39]; and that user types identified in early stages of a game have predictive value for compliance, reliance and training outcomes.

1.2 Methodological contributions – game and agent

Our experiments are supported by a web-based game and an advisor agent grafted onto this game.

The game. Participants in our experiment “work” in a retirement village whose elderly residents live in smart homes. Each home is equipped with a monitoring system that raises alerts when it detects an adverse event, such as a fall, or a potential hazard, such as a tap left running for a long time. The participants’ “job” consists of performing a routine “administration task”, and ensuring the welfare of the residents, which can be done by attending to the monitoring system’s alerts or checking on the residents from time to time.

Our game supports the systematic investigation of the influence of the following factors, and combinations thereof, on user trust and behaviour: system performance (accuracy and error type), situation (risk, cognitive load and consequences), communication style (etiquette and speech versus text), transparency (explanations) and feedback (apologies and communication of capabilities). In our monitoring-system study, we investigate the influence of system performance on users’ trust and behaviour.

Our game design methodology offers a principled approach for setting rewards and penalties in order to emphasize factors of interest (Section 3.3). This is in contrast to [11], who provided no details about their reward structure, or [30, 37], who had an ad hoc structure.

The advisor agent. We offer an algorithm for generating advice, and propose guidelines for justifying recommendations and providing feedback to users. Our advisor agent supports the investigation of the influence of different aspects of such an agent (viz competence, appearance, communication style, transparency and feedback) on users’ behaviour. However, as mentioned above, the focus of our study is to investigate the training effect of an agent endowed with influential features.

1.3 Roadmap to the paper

This paper is organized as follows. In Section 2, we discuss related research, focusing on the effect of system performance and features of assistive agents on users’ trust and behaviour. In Section 3, we describe the experiment, and present our game, including our approach for calibrating the parameters of the game. Section 4 describes the advisor agent, and outlines the generation of its policy for interacting with the monitoring system. Our experimental results appear in Section 5, followed by our predictive models in Section 6, and concluding remarks in Section 7.

2 Related Research

When investigating aspects of human behaviour, there is a long-standing tradition of using games and scenarios that rely on a narrative in order to engage participants. The idea is that such tools will elicit more realistic responses than questionnaires — this idea was refined in [59] by associating user types with particular game designs, and in [47] by associating strategies employed in persuasive gamified systems with domains and personality traits.

Trust between people has been studied by means of economic games such as the trust game [6] and the ultimatum game [25]. Games have also been used to investigate the influence of different aspects of automated devices on subjective trust-related attributes (e.g., self-reported trust and perception of reliability) and user behaviour (notably compliance and reliance). Some of the aspects that have been studied are: system dependability [39, 73], error type [11], task criticality [52], anthropomorphism [29, 61, 70], and addressing the outcomes of actions (e.g., errors made) [5, 16, 19, 29, 32]. The cited studies showed that feedback increases trust in automation. However, the results regarding trust calibration are conflicting: according to [32, 61, 70], feedback leads to appropriate trust calibration, while the opposite was found in [16, 19]. Focusing on addressing outcomes, de Visser et al. [15] offered strategies to repair breaches in trust, such as apologizing (conveying regret and taking responsibility) [33], empathizing [8], explaining errors [19], and using an anthropomorphic channel [14, 50].

Level of control is the extent to which a user determines how a task is performed — it ranges from no control (full automation) to complete control (low-level automation) [62]. Verberne et al. [68] found that a mixed level of control between user and system was more trustworthy than no user control. In contrast, the experiments described in [55] yielded the lowest trust ratings for a medium level of automation, with a low automation level receiving the highest trust ratings, and a high automation level receiving intermediate ratings.

Summary. In short, increased anthropomorphism has been shown to increase users’ trust in automation, as well as its influence on users’ behaviour. In addition, trust is increased by ease of use of the automation, utilization of an older avatar and polite interaction, transparency and provision of feedback. However, the results regarding level of control are conflicting.

3 The Experiment and the Game

Our experiments most resemble Johnson’s [30], in that they consider FAs and Misses jointly, require users to take initiative in the absence of a stimulus, and comprise two tasks, one of which is interrupted by alerts (but in our case, this task is repetitive, rather than continuous). Section 3.1 provides an overview of our experiments, Section 3.2 presents the game, and Section 3.3 describes how the parameters of the game are determined.

3.1 Experiment overview

Our game was employed in two studies: monitoring system and advisor agent. In the first study, we investigated the effect of system accuracy and error type on user trust and behaviour, and in the second study, we investigated the training effect of a good advisor agent. Clearly, we had to select design features for the monitoring system and the advisor agent. For the monitoring system, we followed current industry standards whereby a disembodied system sends written alerts to carers (e.g., www.sofihub.com). For the advisor agent, we selected features that distinguish it from the monitoring system, and which, according to the findings reported in Section 2.2, increase trust.

At the beginning of our studies, participants filled in a demographic and technology-experience questionnaire, and a questionnaire about their propensity to trust devices, adapted from [43, 44] (Table 3). Participants were then shown the complete version of the abridged narrative in Figure 1 and a training video, and proceeded to the game.

Table 3 Trust propensity questionnaire: Agreement indicated on a 1-5 Likert scale.

Monitoring-system study. We considered six types of monitoring systems, where each system has a particular combination of % of FAs, {HighFA, LowFA}, and % of Misses, {HighMiss, LowMiss, NoMiss} (Table 4). The study was divided into two within-subject experiments, which were conducted as follows: Sona1, which featured four systems representing combinations of {HighFA, LowFA} \(\times \) {LowMiss, NoMiss}; and Sona2, where the combinations were {HighFA, LowFA} \(\times \) {HighMiss, LowMiss}. In each experiment, a participant played one or more trials and four games, each with a different type of monitoring system. A game comprises two stages, and a report of the performance of the monitoring system and the user is produced after each stage (Figure 3). Users entered their level of trust in a monitoring system (on a 1-5 Likert scale) after seeing the performance report.

In both experiments, we compensated for the effect of presentation order of different types of monitoring systems, but we found no order effect.

Upon completion of the game, participants were asked for their opinion about different aspects of the experiment, e.g., engagement, stress and difficulty of the experiment, importance of the administration and monitoring tasks, and reliability of the monitoring system.

Advisor-agent study. We compared two choice architectures: Agent and NoAgent – with and without an advisor agent respectively. The study was a mixed between-subjects and within-subject experiment, where we assigned one group of users to each choice architecture. In the between-subjects part, we compared the behaviours of the two cohorts, and in the within-subject part, we examined the behaviour of each cohort separately. Both arms of the experiment comprised a trial and two games, each with two stages. The experiment was conducted as follows: the NoAgent users decided by themselves how to interact with the monitoring system in both games; while the Agent users were shown by the advisor agent how to make decisions about their actions in the first game, and then played the second game by themselves. Prior to interacting with the advisor agent, users in the Agent group were given an introduction to the agent and its capabilities (Figure 6). After each stage of the game, a performance report is produced, which, for the Agent group, includes the agent’s performance. After seeing the report, all users entered their trust in the monitoring system, and users in the Agent group also entered their trust in the agent.

Fig. 1 Abridged game narrative.

As noted in Section 4.3, the notional accuracy of our advisor agent is 80%, which is considered high by Parasuraman and Miller [52]. For this study, we chose a HighFA/HighMiss monitoring system, where 67% of the alerts are true and 67% of the adverse events are detected by the system. This configuration, which is deemed trustable according to [50], was selected in order to clearly differentiate between the advisor agent and the monitoring system.

Upon completion of the game, participants in the NoAgent cohort were asked the same questions as those posed in the monitoring-system study. However, for the Agent group, we replaced most of the post-game questions with questions about the advisor agent, including a question about users’ preferences regarding whether and how to interact with the agent. Owing to the length of this paper, we report only on this last question.

3.2 The game

As stated above, our game comprises two tasks: a primary maintenance task of taking care of the elderly residents, and a secondary routine administration task, which keeps players occupied (a version of the monitoring-system study and the advisor-agent study can be accessed at MONITOR and ADVISOR respectively). The routine task is a one-back memory card game (a special case of the n-back task [35]), where participants are shown a sequence of stimuli – in our case, playing cards – and they must decide whether the current stimulus is the same as the previous one. Clearly, the card-playing task differs from the routine tasks performed in a real care-taking scenario. However, the presence of a secondary task that demands users’ attention enhances the ecological validity of our scenario. This type of simplification is often used in psychological experiments [7, 36, 54], and was employed in [10, 17] to manipulate cognitive load.
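
To make the routine task concrete, the following minimal Python sketch implements the one-back decision rule; the deck representation and the simulated responses are illustrative only, not taken from our implementation.

```python
import random

# Illustrative deck: rank + suit strings.
CARD_VALUES = [f"{rank}{suit}" for rank in "A23456789" for suit in "SHDC"]

def one_back_correct(previous_card, current_card, responded_same):
    """Return True if a 'same'/'different' response to the current card is correct."""
    is_same = previous_card is not None and current_card == previous_card
    return responded_same == is_same

# Score a short synthetic sequence with random responses.
sequence = [random.choice(CARD_VALUES) for _ in range(10)]
correct = sum(
    one_back_correct(prev, cur, random.choice([True, False]))
    for prev, cur in zip([None] + sequence[:-1], sequence)
)
print(f"{correct}/{len(sequence)} correct")
```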

Fig. 2 Sample screenshots of the game.

Fig. 3 Player and system performance report.

Figure 1 shows an abridged version of the narrative about the scenario and the game, and Figure 2 illustrates the game interface. The left-hand panel shows the card game (Figure 2(a)) — the remaining game time and current income appear at the top of the screen. The right-hand panel is used for the trust-relevant task, viz interacting with the monitoring system. This panel allows users to check on the residents, which yields an accurate eyewitness report that displays events that were missed by the system or true alerts that were ignored by the user (Figure 2(b)). Figure 2(c) shows a true low-risk alert generated by the system, and the feedback after the user clicks the attend button. Figure 2(d) depicts a high-risk alert that turns out to be false, and the feedback generated for attending to it. The outcome of decisions made in the card game (correct or wrong) and the feedback for attending to alerts are reinforced with auditory signals.

To give players timely general feedback, a performance report is issued half-way through a game with a monitoring system, and at the end of the game (Figure 3). In the left-hand panel of the report, participants are given feedback about their own performance, and in the right-hand panel, they are informed about the monitoring system’s performance and the events that remained unattended. In the bottom-right panel of the report, participants are asked to enter their level of trust in the monitoring system on a 1-5 Likert scale.

3.3 Determining the Parameters of the Game

The operating parameters of the game must satisfy the following conditions: (1) the length of the game should enable users to learn how to act and to form an opinion about the monitoring system, without becoming bored; (2) the frequency of the cards must be suitable for maintaining users’ interest, but not cause stress; (3) the number and distribution of alerts must be such that users can still play the card game; and (4) the rewards and penalties must be such that “good citizens” that attend promptly to most events, and play the card game reasonably well, end up with earnings above their initial salary.

Next, we discuss the first three factors, followed by the calibration of rewards and penalties.

3.3.1 Game configuration: length, card frequency and alert frequency

A game with a monitoring system lasts 520 seconds, and is divided into two stages (the game clock stops when users perform monitoring activities). A card appears every 4 seconds (65 cards per stage) — 96% of the users deemed this pace to be not stressful or a little stressful, according to our post-experiment questionnaire.

A monitoring system may generate a true alert (TA) for an event or miss an event (Miss), or it may generate a false alert (FA) in the absence of an event. Thus, the number and distribution of alerts depend on the number and distribution of adverse events and the performance profile of a monitoring system. To achieve a playable game with monitoring systems that are trustable and reliable according to [50, 52], we allocated 14 adverse events to each stage (i.e., the probability of an adverse event during a card is \(\hbox {Pr}(\textit{Event})=\frac{14}{65}\)).

Table 4 illustrates the profiles of the monitoring systems in our study. These profiles reflect two proportions of FAs (High and Low FAs) and three proportions of Misses (High, Low and No Misses). The probabilities associated with these rubrics were chosen to reflect system reliabilities in the range specified in [50, 52] — the fluctuations of probabilities of the same type (High or Low) are due to some stages in a game having an extra event.

Table 4 Performance profile (FAs and Misses) of our six monitoring systems.

3.3.2 Rewards and penalties

The game is monetized to reflect the notion that the main task, for which users earn a salary, is to take care of the residents. Hence, they do not receive additional rewards for performing this monitoring task correctly. However, players incur monitoring penalties when they respond to FAs or check on residents, which use company resources, or when they delay attendance to adverse events, which is bad for the residents. In addition, players earn “bonuses” for correct answers in the card game (secondary task), but they are penalized for giving wrong answers or skipping cards.

We first outline our formulation for calculating rewards and penalties (details appear in Appendix A), followed by the calibration of the parameters of the game.

Game formulation – rewards and penalties. A user’s expected net income combines expected earnings from the card game, which are positive, and expected losses from the monitoring task, which are negative:

$$E(\textit{NetIncome}) = E(\textit{CardIncome}) + E(\textit{MonitoringLoss})$$
(1)

where \(E(\textit{MonitoringLoss})\) includes losses from checking on residents (\(E(\textit{CheckLoss})\)), delayed responses to TAs and Misses (\(E(\textit{DelayLoss})\)), and attendance to FAs (\(E(\textit{FALoss})\)):

$$E(\textit{MonitoringLoss}) = E(\textit{CheckLoss}) + E(\textit{DelayLoss}) + E(\textit{FALoss})$$
(2)

Calibrating rewards and penalties. The reward structure of the game comprises (1) the monitoring-task penalties $Check, $PenaltyFA and $DelaySlope, which respectively specify losses from checks on residents, attendance to FAs and attendance delays; and (2) the administration task reward $Correct for a correct answer in the card game, and penalties $Skip for a skipped card and $Wrong for an incorrect answer — we set $Skip to −$1 and $Wrong to −$Correct.

Our objective is to calibrate these parameters in order to convey the idea that the monitoring task is more important than the administration task, and to ensure that the expected absolute and relative income of different types of users is commensurate with their behaviour and performance. To this effect, we first defined several hypothetical user types in terms of card-playing and monitoring parameters. Table 5 shows four of these user types, e.g., the Ordinary player / Best carer (second row) skips 10% of the cards, gives a correct answer for 80% of the remaining cards, checks on the residents four times per stage on average, and attends immediately to all alerts.

Table 5 Card-playing and monitoring characteristics of four hypothetical user types.
Fig. 4 Expected net income of different types of users for the LowFA/NoMiss and the HighFA/LowMiss monitoring systems.

To calibrate the parameters of the game, we compute the expected earnings of these user types for monitoring systems with different performance profiles. Specifically, we iterate over the following steps: (1) determine the value of $Correct for particular values of $Check, $PenaltyFA and $DelaySlope; and (2) adjust the values of these last three parameters if necessary. These steps, which are performed manually, are repeated until no more adjustments are deemed necessary.Footnote 3 Figure 4 displays the expected net income of the user types in Table 5 as a function of $Correct for the LowFA/NoMiss monitoring system (Figure 4(a)) and the HighFA/LowMiss system (Figure 4(b)). As seen in Figure 4, the expected net income of the user types who take good care of the residents and engage in the game (with varying levels of competence) is positive for \(\$\textit{Correct}\ge \$4\), which is the value we selected. In contrast, user types who do not take good care of the residents will make money for \(\$\textit{Correct}\ge \$6\) and \(\$\textit{Correct}\ge \$8\) for the LowFA/NoMiss system and the HighFA/LowMiss system respectively, if they play the card game very well.
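
The following Python sketch illustrates this calibration step: it computes a rough per-stage expected net income for a hypothetical user type as a function of $Correct. The loss terms are a simplified stand-in for the formulation in Appendix A, and all parameter values and names are illustrative.

```python
# Sketch: expected net income of a hypothetical user type as a function of $Correct.
# The simplified loss terms and all numbers are illustrative; the paper's exact
# formulation appears in Appendix A.
CARDS_PER_STAGE = 65
PR_EVENT = 14 / 65

def expected_net_income(correct_reward, user, system,
                        check_cost=-2.0, penalty_fa=-3.0, delay_slope=-1.0):
    """Very rough per-stage expectation; penalties are negative dollar amounts."""
    # Administration (card) task: reward correct answers, penalize wrong/skipped cards.
    played = CARDS_PER_STAGE * (1 - user["skip_rate"])
    card_income = (played * correct_reward * (2 * user["accuracy"] - 1)
                   - CARDS_PER_STAGE * user["skip_rate"] * 1.0)   # $Skip = -$1

    # Monitoring task: checks, attended FAs, and delayed attendance to missed events.
    expected_fas = CARDS_PER_STAGE * system["pr_fa"]
    expected_misses = CARDS_PER_STAGE * PR_EVENT * system["pr_miss"]
    check_loss = user["checks_per_stage"] * check_cost
    fa_loss = user["compliance"] * expected_fas * penalty_fa
    # Assume missed events wait, on average, half a checking interval before being caught.
    avg_delay = (CARDS_PER_STAGE / max(user["checks_per_stage"], 1)) / 2
    delay_loss = expected_misses * avg_delay * delay_slope

    return card_income + check_loss + fa_loss + delay_loss

# Hypothetical "Ordinary player / Best carer" type and an illustrative system profile.
best_carer = {"skip_rate": 0.1, "accuracy": 0.8, "checks_per_stage": 4, "compliance": 1.0}
system = {"pr_fa": 0.05, "pr_miss": 0.1}
for reward in (2, 4, 6, 8):
    print(f"$Correct = {reward}: expected income "
          f"{expected_net_income(reward, best_carer, system):.1f}")
```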

4 The Advisor Agent

Our advisor agent offers suggestions about how to act in light of a monitoring system’s alerts or lack thereof. As mentioned in Section 1, we wanted our agent to have the best chance to influence users’ behaviour. Therefore, we endowed it with design and performance features that promote trust and compliance, according to the findings reported in Section 2.2. Section 4.1 describes the agent’s design features, focusing on transparency and feedback, Section 4.2 outlines our calculation of the agent’s policy, and Section 4.3 describes the agent’s performance.

4.1 Agent design features

We first describe briefly our agent’s appearance/anthropomorphism, ease of use and communication style — these attributes were directly sourced from the literature; and then discuss in more detail the agent’s transparency and feedback.

4.1.1 Appearance/anthropomorphism, ease of use and communication style

  • Appearance/anthropomorphism – In line with [13, 23, 27, 50], our agent has a human appearance and oral communication — speech is considered an anthropomorphic channel [14], plus we wanted to distinguish between agent communications and monitoring-system notifications, which are textual. The only textual instruction accompanying the agent is written advice about how to interact with the agent. Figure 5 shows how our agent is incorporated into the game: its image and the written advice are displayed at the top of the interface.

  • Ease of use – Our agent gives spoken advice when an alert is raised by a monitoring system and when a periodic check is required. In addition, the agent or the monitoring system provides feedback after a user’s actions.

  • Communication style – Based on the results in [38, 51], we chose a middle-aged male appearance.Footnote 4 To address the findings in [15] regarding variation in expression, we pre-recorded three spoken messages, with different levels of verboseness, for each type of justification and feedback given by the agent (Sections 4.1.2 and 4.1.3 respectively), and randomly select one of them (the components of these messages were generated manually on the basis of the concepts described in Sections 4.1.2 and 4.1.3, but they can be easily generated using programmable templates). Finally, our agent is proactive, i.e., it offers advice without being asked, owing to the training efficacy of this mode of interaction [34].

Fig. 5 Agent and general instructions (above the game panels), and eyewitness report after the user checks on the residents.

4.1.2 Transparency

In line with the findings in [19, 48, 71], the agent justifies its recommendations for specific actions — these explanations must be brief, since the game is fast paced, and they are given orally.

An advice message has the following components: (1) recommended action – obtained from the agent’s policy (Section 4.2), (2) alert risk (high or low) – optional; and (3) rationale for the action (Table 6). Our agent may recommend to Check on the residents in the absence of an alert, and it may recommend one of three actions when an alert has been raised: Attend, Check or Check\(+1\); Check\(+1\) differs from Check in that the user is advised to ignore the alert for one card, and then check on the residents in the next card.Footnote 5 Check or Check\(+1\) may be suggested for an alert, instead of Attend, in order to catch previous Misses or avoid attending to an FA. These ideas, together with information about the performance profile of the monitoring system and the reward structure of the game, make up our manually-derived rationales for the actions recommended by the agent.Footnote 6

Our agent’s rationales are backward looking (Section 2.2), as future rewards, which are used in forward-looking explanations, are not informative in our case. Specifically, our rationales mention the following “teachable” applicability conditions: (a) recency of the last check on the residents, (b) accuracy (error rate) of the monitoring system, (c) risk of the event flagged in an alert – optional, and (d) need for a periodic check.

Table 6 displays the components of advice messages and sample messages of different levels of verboseness. For instance, the row marked with an “\(*\)” illustrates the message generated for a Check\(+1\) advice given for a low risk alert, where the rationale is that checking can be delayed because the alert is low risk.
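
The sketch below illustrates how such advice messages could be assembled from these components using programmable templates (as suggested in Section 4.1.1); the template wording and verbosity levels are illustrative placeholders, not the messages that were actually recorded for the agent.

```python
import random

# Illustrative templates at three verbosity levels (the agent's real messages
# were pre-recorded; this wording is a placeholder).
TEMPLATES = {
    "terse": "{action}.",
    "medium": "{action}. {rationale}",
    "verbose": "I suggest that you {action_lc} now. {rationale}",
}

def advice_message(action, rationale, risk=None):
    """Assemble a spoken-advice message from (action, optional risk, rationale)."""
    level = random.choice(list(TEMPLATES))
    prefix = f"This is a {risk}-risk alert. " if risk else ""
    return prefix + TEMPLATES[level].format(
        action=action, action_lc=action.lower(), rationale=rationale)

print(advice_message("Check+1",
                     "The alert is low risk, so checking can wait one card.",
                     risk="low"))
```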

Table 6 Agent’s recommendations and their justifications: factors that determine the components of a message (action, risk, rationale), and sample spoken advice message.
Fig. 6 Introduction to the advisor agent.

Table 7 Agent’s immediate performance feedback: factors that determine the components of a message (the agent’s advice, whether the advice was followed or not, alert status and eyewitness report), feedback components (apology or acknowledgment, reinforcement, and attend reminder), and sample spoken feedback message.

4.1.3 Feedback

Following [5, 16, 19, 29, 32], we present information about the agent’s capabilities when the agent is introduced (Figure 6), provide a summary feedback after each stage of the game with the agent (like that in Figure 3, with additional agent-related information), and give immediate outcome-related feedback for users’ actions. This feedback has the following components when the action is Check or Check\(+1\) (if a user’s action is Attend, the game gives automatic feedback, as shown in Figures 2(c) and 2(d)):

  • Apology or acknowledgment – As noted in [15, 33], apologies are instrumental in repairing trust breaches due to automation failure, which happens when the agent’s advice is incorrect, e.g., to Attend to an alert that turns out to be false. An acknowledgment is generated when both the agent’s advice and an alternative action are correct.

  • Reinforcement – This is the teachable component of the feedback, where positive aspects about an action recommended by the agent or taken by a user are noted, e.g., an FA or Misses were revealed by checking on the residents.

  • Attend advice – This advice is given to make sure users attend to adverse events seen in an eyewitness report.

Table 7 displays information about the agent’s immediate feedback for an action performed by a user after receiving the agent’s advice (Columns 1-4), and sample messages of different levels of verboseness (Column 5). Focusing on users’ actions Check or Check\(+1\), the components of the feedback appear in Column 4. The factors that determine these components are: the agent’s advice (gray rows spanning Table 7), whether the advice was followed (Column 1 indicates when the advice was not followed), whether the alert was True or False (Column 2), and whether the eyewitness report is empty (\(\emptyset \)) or contains Misses (Column 3). For example, the row marked with an “\(*\)” depicts a situation where the agent’s advice was Attend, but the user checked on the residents instead. This Check showed that the alert was True (i.e., the advice was correct), but also revealed previous Misses (i.e., the user’s action, which differed from the advice, was also correct). In this case, the agent acknowledges the accuracy of its advice, and reinforces the user’s correct action, which caught missed events.
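
The following sketch paraphrases the feedback-selection logic summarized in Table 7 for a Check or Check\(+1\) performed by the user; the component names and rules are an illustrative reading of the description above, not the exact rules of Table 7.

```python
# Sketch of the feedback-selection logic for a user Check / Check+1; the rules
# below paraphrase the description in the text and are illustrative only.
def feedback_components(advice, followed, alert_true, misses_revealed):
    """Return the feedback components for a Check or Check+1 performed by the user."""
    components = []
    advice_correct = (advice == "Attend" and alert_true) or \
                     (advice in ("Check", "Check+1") and (not alert_true or misses_revealed))
    if not advice_correct:
        components.append("apology")          # the agent's advice turned out to be wrong
    elif not followed and (not alert_true or misses_revealed):
        components.append("acknowledgment")   # both the advice and the user's action were correct
    if not alert_true or misses_revealed:
        components.append("reinforcement")    # teachable part: an FA or Misses were revealed
    if alert_true or misses_revealed:
        components.append("attend reminder")  # make sure revealed events get attended
    return components

# The "*" row of Table 7: advice was Attend, the user checked instead, the alert
# was True and previous Misses were revealed.
print(feedback_components("Attend", followed=False, alert_true=True, misses_revealed=True))
```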

4.1.4 Summary of design features

In short, our agent is a middle-aged male that proactively offers spoken, explained advice and feedback. Potential trust miscalibration due to increased anthropomorphism, male gender and feedback is prevented by our agent’s high expected accuracy (Section 4.3).

4.2 Calculating the agent’s policy

The problem of deriving a policy for the agent resembles a Partially Observable (high-order) Markov Decision Process (POMDP), in the sense that it has a probability distribution over all the possible states, and it combines the immediate reward for an action with future rewards [57] (we minimize expected costs instead of maximizing expected rewards). However, we have taken advantage of the features of our game to find a tailored solution: (1) periodic checks on residents are important in a care-taking scenario; and (2) when an alert is raised, an Attend does not affect the belief state about Misses, and a Check leaves no uncertainty about Misses or the status of the alert.

Thus, our agent’s policy has two main parts: (1) a sub-policy for periodically checking on the residents, i.e., the optimal interval that should elapse between successive checks (Section 4.2.1); and (2) a sub-policy for handling alerts, i.e., which action to perform when an alert is raised (Section 4.2.2).

The derivation of the policy takes into account the probability of adverse events and the performance profile of the monitoring system – probabilities of FAs and Misses (Section 3.3.1); and the reward structure of the game – cost of checking on residents ($Check), penalty for attending to FAs ($PenaltyFA) and penalty for attendance delay ($DelaySlope) (Section 3.3.2).

4.2.1 Sub-policy for performing periodic checks

The optimal checking interval \(N_{{\varvec{opt}}}\) is that which yields the minimal total expected cost for a particular configuration of our game, as follows (details appear in Appendix B.1):

$$N_{{\varvec{opt}}} = \mathop {\textrm{argmin}}\limits _{1 \le i+1 \le T} E(\textit{CumCheckCost}(i+1))$$
(3)

where \(T\) (=65) is the number of cards in one stage of our game, and \(E(\textit{CumCheckCost}(i+1))\) is the expected cumulative cost of checking every \(i+1\) cards throughout a stage of a game. This cost is determined by the cost of checking on the residents ($Check) plus the expected penalty for delayed event attendance over i cards, which in turn depends on the probability that an adverse event was missed by the monitoring system (\(\hbox {Pr}(\textit{Event}) \times \Pr (\textit{Miss}|\textit{Event})\)), the penalty for each card elapsed while an event remains unattended ($DelaySlope), and the elapsed number of cards for each event that could happen over i cards.

Thus, in the absence of an alert, the agent will remind a player to check on the residents after \(N_{{\varvec{opt}}}\) cards have elapsed. For our game configuration, \(N_{{\varvec{opt}}}=7\), i.e., if a user has not checked on the residents for the last 6 cards, the agent will remind them to do so in the 7th card.
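
The following Python sketch illustrates the computation of \(N_{{\varvec{opt}}}\); the per-interval cost model is a simplified reading of Appendix B.1, and the parameter values are illustrative.

```python
# Sketch of the periodic-check sub-policy: choose the checking interval with
# minimal expected cumulative cost over a stage. The cost model is a simplified
# reading of Appendix B.1; parameter values are illustrative.
T = 65                  # cards per stage
PR_EVENT = 14 / 65      # probability of an adverse event during a card
PR_MISS = 0.33          # Pr(Miss | Event), illustrative HighMiss-style value
CHECK_COST = 2.0        # $Check, expressed as a positive cost
DELAY_SLOPE = 1.0       # penalty per card an event remains unattended

def expected_cum_check_cost(n):
    """Expected cost of checking every n cards throughout a stage."""
    # A missed event occurring j cards after the last check waits n - j cards.
    expected_delay_penalty = sum(
        PR_EVENT * PR_MISS * DELAY_SLOPE * (n - j) for j in range(1, n)
    )
    cycles = T / n
    return cycles * (CHECK_COST + expected_delay_penalty)

n_opt = min(range(1, T + 1), key=expected_cum_check_cost)
print("N_opt =", n_opt)
```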

4.2.2 Sub-policy for handling alerts

Figure 7 illustrates a segment of the game timeline (in number of cards) comprising a previous Check at card number \(t_{c}\), an alert at card number \(t_a\), and future periodic checks. If the alert is attended, these checks are expected to happen according to the periodic-check policy. However, if a Check or Check\(+1\) is performed, the periodic checks shift, and the calculation of expected penalties for Misses is adjusted accordingly. Thus, the expected cumulative cost of a specific Action considered for an alert at card number \(t_a\) is calculated by the following equation (details appear in Appendix B.2):

$$\begin{aligned} E(\text{Cumulative cost of } Action \text{ at } t_a) = {} & E(\text{Cost of } Action \text{ at } t_a) + E(\text{Cost at } t_a + 1) \\ & + E(\text{Cumulative cost of events missed between } t_{c}+1 \text{ and the current check}) \\ & + E(\text{Cumulative cost of the periodic checks from the most recent check}) \end{aligned}$$
(4)
Fig. 7 Diagram of the calculation of the cost of actions.

In terms of POMDPs, the first two terms comprise the immediate reward for an action (shaded pink in Figure 7), and the last two terms represent its future reward. For example, when the action considered for card \(t_a\) is Attend, the immediate expected cost is derived from the probability of the alert being false and the penalty associated with attending to an FA (first term), plus the expected cost at card \(t_a + 1\) (second term), which is obtained from the probability of having another alert at \(t_a + 1\), the probability that this alert is false, and the penalty for attending to an FA or ignoring a TA. The future expected cost for an Attend comprises the expected cost of the sub-policy for periodic checks from \(t_c\) onwards (fourth term) — the third term is 0 for an Attend. The future expected cost due to alerts beyond card \(t_a + 1\) is the same for all the actions, and can therefore be omitted from the calculation.

Computing the alert-handling policy. Algorithm 1 uses Equation 4 to calculate the expected cumulative cost of each action for each card in a stage of the game, assuming that the most recent Check could have been done at any card since the beginning of the game (even though we have a periodic-check sub-policy, users may ignore the agent’s advice, and fail to check on the residents). The algorithm then selects the action with the minimal cost. The chosen actions are stored in a triangular matrix \({{\varvec{ACTION}}_{{\varvec{opt}}}}\), which embodies the alert-handling policy.

Algorithm 1 Computation of the alert-handling policy.
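
The following sketch illustrates the structure of Algorithm 1: a triangular matrix of minimal-cost actions indexed by the card of the alert and the card of the most recent check. The cost function below is a placeholder for Equation 4, and its numbers are illustrative.

```python
# Sketch of the alert-handling policy computation (Algorithm 1). The cost
# function is a placeholder for Equation 4; only the control structure of the
# triangular ACTION_opt matrix is illustrated.
T = 65
ACTIONS = ("Attend", "Check", "Check+1")

def expected_cumulative_cost(action, t_alert, t_last_check):
    """Placeholder for Equation 4: immediate plus future expected cost of `action`."""
    elapsed = t_alert - t_last_check
    base = {"Attend": 3.0, "Check": 2.0, "Check+1": 2.5}[action]   # illustrative
    return base - 0.1 * elapsed if action != "Attend" else base    # illustrative

# ACTION_opt[t_alert][t_last_check] = cheapest action for an alert at card t_alert,
# given that the most recent check happened at card t_last_check <= t_alert.
ACTION_opt = [
    [min(ACTIONS, key=lambda a: expected_cumulative_cost(a, t_alert, t_check))
     for t_check in range(t_alert + 1)]
    for t_alert in range(T)
]
print(ACTION_opt[10][4])  # recommended action for an alert at card 10, last check at card 4
```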

4.3 Agent performance

As seen in Table 2, most studies found a correlation between system performance and user trust and behaviour. Thus, in order to engender trust, we need a highly reliable advisor agent. The reliability of the agent’s advice depends on its policy, which in turn depends on the performance profile of the monitoring system. At the limit, if the monitoring system is 100% accurate, our agent will recommend attending to all the alerts and no periodic checks, and will be 100% reliable as well. For the HighFA/HighMiss system employed in our advisor-agent study, the interval for periodic checks is \(N_{{\varvec{opt}}}\) (=7) cards (so Misses may be attended with some delay), and the \({{\varvec{ACTION}}_{{\varvec{opt}}}}\) matrix usually recommends Check or Check\(+1\), rather than Attend; Attend may be suggested when an alert appears shortly after a check has been performed.

When the agent’s recommendations are followed, its notional accuracy (proportion of instances where the actions it recommends yield good outcomes) is 80%, which is defined as highly reliable in [52], compared to the 67% accuracy of the monitoring system. To further validate the agent’s policy, we compared its monetary reward for each stage of a game with that of all the players who had not been exposed to the agent — the agent’s reward was the highest.

5 Experimental Results

We first present demographic information, relevant experience and trust propensity for the participants in the monitoring-system study and the advisor-agent study (Table 8). Our findings are discussed next, followed by a characterization of our participants on the basis of their behaviour.

Table 8 Demographic and experience information (options with the most participants are shown), and tendency to trust machines: Monitoring-system and Advisor-agent studies.

5.1 Participant demographics, experience and trust propensity

Participants in the monitoring-system study were recruited from the SONA platform (www.sona-systems.com); as seen in Table 8, most of these participants were under 30 years of age. For the advisor-agent study, we decided to recruit an older cohort, who would be more likely to have ageing parents than the monitoring-system study participants. The advisor-agent study participants were recruited from CloudResearch (www.cloudresearch.com).Footnote 7

Ideally, an experiment should have participants who are personally engaged with the domain. However, it is extremely difficult to recruit participants who work in the aged-care sector. Hence, we recruited crowd workers – an accepted practice for user studies conducted in areas such as User Modeling, Language Technology and Trust. In our case, the problem of out-of-domain participants is mitigated by the following: (1) the narrative immersion we provided; (2) monitoring systems, such as that described in our study, have become increasingly prevalent (as indicated in Table 8, 59% of the participants in the monitoring-system study and 85% of the participants in the advisor-agent study reported medium-high experience with such systems); and (3) many people in the general population have ageing relatives.

As seen in Table 8, the participants in the monitoring-system study differ from the participants in the advisor-agent study in most demographic aspects. However, these differences do not affect the validity of our research for the following reasons: (1) the two studies address different research questions, and (2) the behaviour of the participants in the two studies was not affected by their demographic differences. This was ascertained by performing Wilcoxon rank-sum tests on the activities relevant to the monitoring task (% alerts attended and # checks on residents) to compare the behaviour of participants from the two studies in their initial exposure to the game, i.e., the (first) trial of the monitoring-system study and the first stage of the trial of the advisor-agent study (# checks was pro-rated to account for differences in trial length). The tests did not reject the null hypothesis that both groups exhibit the same behaviour (i.e., no statistically significant differences were found between the behaviours of the two groups).
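
As an illustration, such a comparison can be run with SciPy’s implementation of the Wilcoxon rank-sum (Mann-Whitney) test; the per-participant values below are synthetic placeholders.

```python
import numpy as np
from scipy.stats import mannwhitneyu  # Wilcoxon rank-sum test

# Placeholder per-participant measures for the two studies (values are synthetic).
rng = np.random.default_rng(0)
study1 = {"% alerts attended": rng.uniform(60, 100, 80), "# checks": rng.poisson(4, 80)}
study2 = {"% alerts attended": rng.uniform(60, 100, 85), "# checks": rng.poisson(4, 85)}

for measure in study1:
    stat, p = mannwhitneyu(study1[measure], study2[measure], alternative="two-sided")
    print(f"{measure}: U = {stat:.1f}, p = {p:.3f}")
```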

5.2 Monitoring-system study

This section describes our findings regarding the relationship between system performance, trust and user behaviour, focusing on compliance (% alerts attended) and reliance (# checks on residents, where more checks indicate lower reliance).

Influence of error type on self-reported trust, compliance and reliance. The calculations for trust involved both stages of a game with a monitoring system, while the calculations for compliance and reliance involved only the second stage of a game. This is because users enter their level of trust in the system only after they have played a stage of a game and seen the report (Figure 3). In contrast, users do not have explicit performance information while they are playing the first stage of a game with a monitoring system, and their behaviour may be influenced by the previous system they saw.

As mentioned in Section 1, one would expect high FAs to reduce trust and compliance, and high Misses to reduce trust and reliance. Our results match these intuitions: more errors decrease trust, more FAs decrease compliance, and more Misses decrease reliance (statistically significant with \(\textit{p}\hbox{-}\textit{value}\ll 0.01\) adjusted with Holm-Bonferroni correction for multiple comparisons [28]; the means and standard deviations of trust, compliance and reliance, and p-values prior to Holm-Bonferroni correction appear in Table 14 in Appendix C). In addition, FAs do not affect reliance and Misses do not affect compliance, which is consistent with the findings reported in [11, 64].

To determine which independent variable has a stronger effect on trust, we used three measures: ANOVA coefficients for standardized independent variables, change in \(R^2\) and Spearman correlation. All three measures agreed that FAs have a slightly stronger influence than Misses on trust, which lends support to the findings in [24, 30].
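
The following sketch illustrates the three measures on synthetic data; the variable names, values and the simple linear model are placeholders for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic placeholders: per-game FA proportion, Miss proportion and trust rating.
rng = np.random.default_rng(1)
fa_rate = rng.uniform(0, 0.3, 200)
miss_rate = rng.uniform(0, 0.3, 200)
trust = 4 - 4 * fa_rate - 3 * miss_rate + rng.normal(0, 0.3, 200)

def zscore(x):
    return (x - x.mean()) / x.std()

# (1) Coefficients of standardized predictors in a linear model.
X = np.column_stack([np.ones(len(trust)), zscore(fa_rate), zscore(miss_rate)])
coefs, *_ = np.linalg.lstsq(X, trust, rcond=None)
print("standardized coefficients (FA, Miss):", coefs[1:].round(3))

# (2) Change in R^2 when each predictor is dropped from the full model.
def r2(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - (y - X @ b).var() / y.var()

full = r2(X, trust)
print("delta R^2 FA:", round(full - r2(X[:, [0, 2]], trust), 3),
      " delta R^2 Miss:", round(full - r2(X[:, [0, 1]], trust), 3))

# (3) Spearman correlations between trust and each error rate.
rho_fa, _ = spearmanr(trust, fa_rate)
rho_miss, _ = spearmanr(trust, miss_rate)
print("Spearman (trust, FA):", round(rho_fa, 3), " (trust, Miss):", round(rho_miss, 3))
```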

Relationship between self-reported trust and behaviour. In line with our statistical significance tests, we used the second stage of each game to determine the relation between user behaviour and self-reported trust. Intuitively, one would expect trust in a system to be positively correlated with both compliance and reliance. However, similarly to [11, 58], we found no correlation between trust and reliance, and only a very weak correlation between trust and compliance.

5.3 Advisor-agent study

This section reports our findings about the views of participants in the Agent group about interacting with the agent, the relationship between trust in the agent and conformity with its advice, and the agent’s training effect.

Views of participants in the AGENT group about interacting with the advisor agent. Upon completion of the experiment, we asked participants “Which of these would more closely reflect your views about having the agent (Daniel) to help you use the monitoring system?”. The possible answers were “I prefer to always have the agent / I prefer to always NOT have the agent / I prefer to learn from the agent and work without the agent / Other [with an option to type free text]”. 29% of the participants preferred to always have the agent, and 55% preferred to learn from the agent and then work without the agent. That is, 84% of the users saw value in the agent.

Relationship between self-reported trust and behaviour. We calculated the correlation between trust in the agent and conformity with its advice for each stage of Game 1 — we deemed that the agent’s advice was followed if a user acted on it within six seconds.Footnote 8 We found no correlation between trust in the agent and conformity with its advice.

Training effect of the agent’s advice. As indicated in Section 4.3, in principle, the agent’s advice yields good outcomes 80% of the time, if its recommendations are followed. However, this is not always the case. In practice, the agent’s accuracy for different users varies according to their behaviour. Despite sometimes receiving inaccurate advice, we hoped that users would learn behavioural principles from the agent’s recommendations and their associated justifications (Section 4.1.2).

In order to determine the agent’s training effect, we first define what constitutes good behaviour. In Section 4.3, we noted that to obtain good performance when interacting with the monitoring system employed in this study, participants should perform relatively frequent checks on residents (low reliance) and should attend infrequently to alerts (low compliance). Thus, we define a “better behaviour” as an increase in # checks on residents and a reduction in % alerts attended. This definition is used to test the following hypotheses regarding the agent’s influence on users’ behaviour:

Hypothesis 1

(1) The agent has an immediate influence on users’ behaviour, i.e., users’ behaviour while being advised by the agent is better than their previous behaviour; and (2) this difference is more pronounced than the difference obtained by learning from experience.

Hypothesis 2

(1) The agent has a training effect, i.e., users’ behaviour after interacting with the agent is better than their pre-agent behaviour; and (2) this difference is more pronounced than the difference obtained by learning from experience.

To test Hypothesis 1, we compared the Game 1 behaviour of the Agent cohort (averaged over two stages) with their behaviour in the second stage of the trial, denoted Trial-Stage2, and the Game 1 behaviour of the NoAgent cohort with their Trial-Stage2 behaviour (the # checks in Trial-Stage2 was pro-rated to make them comparable to the # checks in the longer stages of the games). To test Hypothesis 2, we compared the behaviour of the two cohorts in Game 2 (averaged over two stages) with their (pro-rated) Trial-Stage2 behaviour. A Wilcoxon rank-sum test on # checks and % alerts attended in Trial-Stage2 did not reject the null hypothesis that the two cohorts exhibit the same behaviour for this stage.

Our results confirm both of the hypotheses postulated above (Wilcoxon signed-rank test, \(\textit{p}\hbox{-}\textit{value}< 0.01\) adjusted with Holm-Bonferroni correction for multiple comparisons [28]; the means and standard deviations of compliance and reliance in Trial-Stage2, Game 1 and Game 2 for the Agent and NoAgent cohorts, and the p-values of the comparisons prior to Holm-Bonferroni correction appear in Table 15 in Appendix C):

  • Hypothesis 1 (1) The # checks for the Agent group in Game 1 was significantly higher than the (pro-rated) # checks in Trial-Stage2, and the % alerts attended in Game 1 was significantly lower than in Trial-Stage2. Both behaviours are consistent with the agent’s advice, and are better in Game 1. (2) There was no statistically significant difference between the # checks or % alerts attended by the NoAgent group in Game 1 and the corresponding behaviour in Trial-Stage2.Footnote 9

  • Hypothesis 2 (1) The # checks for the Agent group in Game 2 was significantly higher than the (pro-rated) # checks in Trial-Stage2, and the % alerts attended in Game 2 was significantly lower than in Trial-Stage2. Further, in Game 2, the participants in the Agent group maintained the behaviours they learned from the agent in Game 1 (there was no statistically significant difference between Game 2 and Game 1). (2) There was no statistically significant difference between the # checks or % alerts attended in Game 2 by the NoAgent group and the corresponding behaviour in Trial-Stage2. In addition, there was no statistically significant difference between Game 2 and Game 1.\(^9\)

5.4 Characterizing user types

The behaviours that are directly observable in both studies are the card-playing behaviours (% cards correct and % cards skipped), and the monitoring behaviours (% alerts attended and # checks on residents).

In order to determine whether these behaviours represent different types of users, we used them as input features to a clustering algorithm – K-means. We then checked whether these user types are useful for inferring behaviour patterns pertaining to our research questions.
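
The following sketch illustrates this clustering step with scikit-learn; the behaviour matrix is a synthetic placeholder with the four features listed above, and the range of cluster numbers is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Placeholder behaviour matrix: one row per participant with the four features
# (% cards correct, % cards skipped, % alerts attended, # checks on residents).
rng = np.random.default_rng(2)
X = np.column_stack([rng.uniform(60, 95, 120),   # % cards correct
                     rng.uniform(0, 20, 120),    # % cards skipped
                     rng.uniform(40, 100, 120),  # % alerts attended
                     rng.poisson(4, 120)])       # # checks on residents

X_scaled = StandardScaler().fit_transform(X)
for k in range(2, 6):  # keep the configuration with the best Silhouette score
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    print(k, round(silhouette_score(X_scaled, labels), 2))
```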

5.4.1 Monitoring-system study

Behaviour throughout the experiment. Initially, we clustered the participants in our study in order to validate our hypothetical user types (Table 5) against user types derived from real behaviours. Specifically, we used as clustering features the average of each of the four basic behaviours throughout an experiment. The best configuration (Silhouette score = 0.64) comprises the three clusters in Table 9. The users in all the clusters had a similar card-playing behaviour; in fact, we obtained the same clusters without the card-playing features, which indicates that the users differed mainly in their monitoring behaviour. The users in the HighReliance-HighCompliance cluster (low # checks and high % alerts attended) fit the Ordinary Carer type, and those in the MediumReliance-HighCompliance cluster resemble the Best Carer type. The LowReliance-MediumCompliance cluster (high # checks and medium % alerts attended) is noteworthy because the users in this cluster traded alert attendance for checks on residents — a behaviour we did not anticipate.

Table 9 Monitoring-system study \(-\) Characteristics of clusters derived from participants’ behaviour throughout a game (Silhouette score = 0.64).

Behaviour in the initial stages of the experiment. The user types identified for the entire experiment led us to ask whether user types derived from behaviour in the initial stages of the experiment shed light on users’ trust, compliance and reliance on the monitoring system. This question is addressed below. In Section 6.1, we determine whether these user types have predictive value with respect to trust and behaviour.

To answer the first question, we clustered users according to their monitoring behaviour in the first game of the experiment (averaged over two stages).Footnote 10 Table 10 shows the best configuration (Silhouette score = 0.68). Overall, the user types based on their initial monitoring behaviour resemble the types obtained from users’ behaviour in the entire experiment (Table 9), but there was some migration between clusters throughout the experiment.

The following results were obtained for the three user types in Table 10 (the means, standard deviations and p-values prior to Holm-Bonferroni correction, and Spearman correlations between trust and compliance and between trust and reliance, appear in Table 16 in Appendix C):

  • HighReliance-HighCompliance – The users in this group did not significantly change their trust and reliance on the monitoring system due to changes in the number of Misses. However, they significantly increased their trust in the monitoring system when the proportion of FAs was reduced, and also increased their compliance (trend after Holm-Bonferroni correction). This indicates that users of this type are reactive, and do not engage in the proactive reasoning required to take Misses into account. In terms of correlations, we only found a very weak correlation between trust and compliance for these users.

  • MediumReliance-HighCompliance – The users in this cohort are the most susceptible to changes in FAs and Misses. Their trust in the monitoring system and their reliance on it increased significantly as the number of Misses decreased. They also significantly increased their trust in the monitoring system when the proportion of FAs was reduced, and also increased their compliance (trend after Holm-Bonferroni correction). However, we found no correlation between trust and compliance or reliance for these users.

  • LowReliance-MediumCompliance – The participants in this cluster did not change their behaviour significantly as a result of changes in FAs, but they significantly increased their trust in the monitoring system when the proportion of FAs was reduced. They also increased their trust and reliance on the monitoring system when the proportion of Misses decreased (trend after Holm-Bonferroni correction). In terms of correlations, we found a moderate correlation between trust and compliance and between trust and reliance for these users.

Table 10 Monitoring-system study \(-\) Characteristics of clusters derived from participants’ monitoring behaviour in the first game (Silhouette score = 0.68).

5.4.2 Advisor-agent study

Our experience with the monitoring-system study led us to identify user types in the advisor-agent study. We clustered all the participants according to their behaviour in Trial-Stage2, and clustered separately the participants in the Agent and the NoAgent cohorts according to their Game 1 behaviour (averaged over two stages). Here too, only the monitoring-task features influenced the resulting clusters.

Table 11 shows the monitoring characteristics of the users in the discovered clusters, and the Silhouette scores of these clusters. The clusters found in Trial-Stage2 for both cohorts and in Game 1 for the NoAgent cohort resemble the HighReliance and LowReliance clusters found in the first game of the monitoring-system study (Table 10). In contrast, the Game 1 clusters found for the Agent cohort reflect the influence of the advisor agent, with most participants exhibiting medium-to-low compliance and low to very-low reliance on the monitoring system.

As in the monitoring-system study, we discuss below whether the user types obtained from Trial-Stage2 shed light on users’ trust in the agent, conformity with the agent’s advice, and compliance and reliance on the monitoring system. In Section 6.2, we determine whether the user types obtained from Trial-Stage2 and Game 1 have predictive value with respect to trust and behaviour.

Table 11 Advisor-agent study \(-\) Characteristics and Silhouette scores of the clusters derived from participants’ behaviour in Trial-Stage2 (# checks are pro-rated) and Game 1 for the Agent and NoAgent groups.

The following results were obtained for the two user types identified in Trial-Stage2 for the participants in the Agent cohort (the means, standard deviations and p-values prior to Holm-Bonferroni correction appear in Table 17 in Appendix C).

  • Trust in the advisor agent and conformity with its advice – There is no statistically significant difference between the two clusters in terms of trust in the agent and conformity with its advice. In addition, for both clusters, we found no correlation between trust and conformity.

  • Change in compliance and reliance on the monitoring system due to the advisor agent – Both types of users significantly decreased their compliance and their reliance on the monitoring system (the latter reflected in an increased # checks on residents) in Game 1 relative to Trial-Stage2 (\(\textit{p}\hbox{-}\textit{value}\ll 0.01\)), which is consistent with our results for Hypothesis 1(1). This indicates that both types of users are susceptible to the agent’s advice. Indeed, the decrease in compliance was similar for both groups (and there was no statistically significant difference between their compliance), but the HighReliance group increased their # checks on residents more than the LowReliance group. Still, the # checks on residents of the latter group remained significantly higher than those of the former (\(\textit{p}\hbox{-}\textit{value}< 0.05\)).

6 Predicting Trust and Behaviour

The prediction of trust and behaviour is essential for anticipating the adoption of devices and for assisting people in using them. In Section 2, we highlighted two pieces of research that predict trust-related aspects: Lee and Moray [39] built time-series models of the influence of system performance on trust, but did not consider behaviour; and Xu and Dudek [72] employed Dynamic Bayesian Networks to predict trust ratings and user interventions (to override the decisions of automated agents) from system performance and previous trust ratings.

We consider predictive models for the following aspects in our two studies:

  • Monitoring system – a user’s trust score in a monitoring system, compliance (% alerts attended) and reliance (# checks on residents).

  • Advisor agent – a user’s trust score in the advisor agent, conformity (% advice followed) and the user’s learning outcomes.

For both studies, we used three types of input features: System – characteristics of the system and the game; User – demographic information, experience and trust propensity; and User game experience – trust and behaviour in previous stages of the game (details about the input features are provided in Appendix D.1 and D.2 for the monitoring-system study and the advisor-agent study respectively).

We considered several predictive models, namely random forests, naïve Bayes, support vector machines and linear/logistic regression (we do not have enough data for deep-learning models), and performed 10-fold cross validation. Random forests outperformed the other models in most cases, and performed stably overall (Tables 18 and 19 in Appendix D compare the performance of these models for the monitoring-system study and the advisor-agent study respectively). However, the main significance of our results pertains to the features that have predictive value.
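For illustration, the sketch below compares regressors of the kind listed above on a single numeric target (e.g., a trust score) using 10-fold cross-validated RMSE; it relies on scikit-learn defaults rather than the feature encodings and hyper-parameters used in our experiments, and naïve Bayes, being a classifier, is omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

def compare_models(X, y, random_state=0):
    """Return the mean 10-fold cross-validated RMSE of each candidate model.

    X holds the System, User and User-game-experience features;
    y is one target, e.g. the trust score in a given stage.
    """
    models = {
        "random forest": RandomForestRegressor(random_state=random_state),
        "linear regression": LinearRegression(),
        "support vector machine": SVR(),
    }
    cv = KFold(n_splits=10, shuffle=True, random_state=random_state)
    results = {}
    for name, model in models.items():
        neg_mse = cross_val_score(model, X, y, cv=cv,
                                  scoring="neg_mean_squared_error")
        results[name] = float(np.sqrt(-neg_mse).mean())  # mean RMSE over folds
    return results
```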

In the following sections, we report the predictive performance of random forests, and describe the most influential features.

6.1 Monitoring-system study

We predict a user’s trust in a monitoring system, compliance and reliance for each of the last five stages played by a user (the first stage of the second game provides the data used to make predictions for the following stage; as mentioned in Section 5.4.1, the first game, which comprises two stages, was used to cluster the users). Predictive performance was measured with Root Mean Square Error (RMSE), as the target variables are numerical (trust scores are ordinal).

Table 12 Monitoring-system study \(-\) RMSE of predictions of trust, compliance and reliance.

Table 12 shows the RMSEs of the predictions for trust, compliance and reliance obtained with random forests. These errors are between 7% and 15% of the available range, which may be sufficiently informative for practical purposes. The most influential features for predicting trust, compliance and reliance are:

  • Trust score [1-5]: System feature # FAs in the current stage, User features Ethnicity (which is consistent with the findings reported in [4]) and Average trust propensity, and User game experience feature Trust in the monitoring system in the preceding stage.

  • Compliance (% alerts attended) [0-1]: User feature Ethnicity, and User game experience features % alerts attended, # checks and % checks attended in the preceding stage and User Type based on the first game.

  • Reliance (# checks on residents) [0-20]: User game experience features % alerts attended, # checks and % checks attended in the preceding stage and User Type based on the first game.

Noteworthy aspects of these features are: (1) a feature value in the preceding stage is predictive of its value in the next stage; (2) user type based on the first game is influential for predicting compliance and reliance (using the individual features that gave rise to the clusters reduces performance); and (3) the only demographic feature that has a high predictive power (for trust and compliance) is ethnicity.
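The influential features listed above can be identified, for instance, from the impurity-based importances of a fitted random forest. The sketch below illustrates this for a single target, assuming a feature matrix and a list of feature names; it is an illustration rather than the exact procedure used in our analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def top_features(X, y, feature_names, n_top=5, random_state=0):
    """Fit a random forest on one target (e.g., the trust score) and
    return the n_top features ranked by impurity-based importance."""
    rf = RandomForestRegressor(random_state=random_state).fit(X, y)
    ranked = np.argsort(rf.feature_importances_)[::-1][:n_top]
    return [(feature_names[i], float(rf.feature_importances_[i])) for i in ranked]
```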

6.2 Advisor-agent study

We predict a user’s trust in the agent and conformity with its advice (% advice followed) for Game 1, and learning outcomes (whether the user has learned a better checking and alert-attendance policy from the agent or from experience) for Game 2. As in the monitoring-system study, predictive performance for trust and conformity was measured using RMSE, but we used accuracy (percentage correct) for learning outcomes, as they are binary – the classes are {Yes, No}.

In order to predict learning outcomes, we must provide an operational definition of learning. To this end, we draw on Hypothesis 2, which posits that there is a training effect if a user’s behaviour in Game 2 is better than their initial behaviour, i.e., they increased the # checks and reduced the % alerts attended (Section 5.3). Specifically, we define learning as follows:

  • A user U has learned to check on residents if the # checks performed by U in each stage of Game 2 is higher than their (pro-rated) # checks in Trial-Stage2, and the average # checks performed by U in Game 2 is higher than their (pro-rated) # checks in Trial-Stage2 by at least half a standard deviation of the # checks performed by all the users in Game 2.                                                                       

  • A user U has learned to attend to alerts if the % alerts attended by U in each stage of Game 2 is lower than their % alerts attended in Trial-Stage2, and the average % alerts attended by U in Game 2 is lower than their % alerts attended in Trial-Stage2 by at least half a standard deviation of the % alerts attended by all the users in Game 2.                                                     

Although the requirement of half a standard deviation is somewhat arbitrary, it reflects a substantial difference between users’ pre-agent and post-agent behaviour, which is indicative of learning.Footnote 11
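A minimal sketch of these two labelling rules appears below, assuming per-user values of the (pro-rated) # checks and % alerts attended in Trial-Stage2, the corresponding per-stage values in Game 2, and the Game-2 values of all users (used for the half-standard-deviation threshold); function and variable names are illustrative.

```python
import numpy as np

def learning_labels(checks_trial, checks_game2, attended_trial, attended_game2,
                    checks_game2_all, attended_game2_all):
    """Label a user's learning outcomes according to the definitions above.

    checks_trial / attended_trial: the user's (pro-rated) # checks and
        % alerts attended in Trial-Stage2.
    checks_game2 / attended_game2: the user's per-stage values in Game 2.
    checks_game2_all / attended_game2_all: Game-2 values of all users,
        from which the half-standard-deviation threshold is computed.
    """
    checks_game2 = np.asarray(checks_game2, dtype=float)
    attended_game2 = np.asarray(attended_game2, dtype=float)

    learned_checks = (
        np.all(checks_game2 > checks_trial) and
        checks_game2.mean() >= checks_trial + 0.5 * np.std(checks_game2_all)
    )
    learned_attendance = (
        np.all(attended_game2 < attended_trial) and
        attended_game2.mean() <= attended_trial - 0.5 * np.std(attended_game2_all)
    )
    return {"learned to check": bool(learned_checks),
            "learned to attend": bool(learned_attendance)}
```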

Table 13 Advisor-agent study \(-\) RMSE of predictions of trust in the agent and conformity with its advice for the Agent cohort in Game 1, Stage 2; and accuracy of predictions of learning outcomes for the Agent and NoAgent cohorts in Game 2.

Table 13 shows the RMSEs of the predictions of trust and conformity and the accuracy of the predictions of learning outcomes obtained with random forests. The RMSE for trust is similar to that obtained for the monitoring-system study, and the accuracy of the learning-outcome predictions is quite creditable, but the RMSE for conformity is 23% of the range, which leaves something to be desired. Nonetheless, the features with predictive power remain noteworthy. The following features alone yield the best performance for predicting trust, conformity with the agent’s advice and learning outcomes:

  • Trust score agent (Game 1, Stage 2) [1-5]: User game experience feature Trust in the agent in the preceding stage of Game 1.

  • Conformity with the agent’s advice (Game 1, Stage 2) [0-1]: User game experience features # checks and conformity with the agent’s advice in the preceding stage of Game 1.

  • Learning checks and alert attendances (Game 2) {Yes, No}:

    For the Agent cohort, the majority class was Yes for learning checks (56.66%) and learning alert attendances (53.33%). The best predictors for learning to check are: Agent feature Accuracy in Game 1, and User game experience features # checks in Game 1 and User Type according to Trial-Stage2. The best predictors for learning to attend are: Agent feature Accuracy in Game 1, and User game experience feature User Type according to Game 1.

    For the NoAgent cohort, the majority class was No for learning checks (68.97%) and learning alert attendances (89.66%). The best predictors for learning to check are: User game experience features # checks in Game 1 and User Type according to Trial-Stage2. However, the majority class is the best predictor for learning to attend.

Similarly to the monitoring-system study, (1) a feature value in a preceding stage is predictive of its value in the next stage; and (2) user type according to Trial-Stage2 or Game 1 is an influential behaviour predictor (here too, using the individual features that gave rise to the clusters reduces performance). In addition, (3) the agent’s Game 1 accuracy for a user is an influential predictor of the user’s learning outcomes for checks on residents and alert attendance (recall that the agent’s accuracy varies according to users’ behaviour).

7 Conclusion

In this paper, we have (1) identified user types that shed light on users’ trust in automation and aspects of their behaviour; (2) ascertained the effect of monitoring-system accuracy and two common error types (false alerts and missed events) on users’ trust, compliance and reliance on the system; (3) determined the effect of the recommendations made by a good advisor agent on users’ compliance and reliance on the monitoring system, and learning outcomes; and (4) identified influential factors for predicting users’ trust and aspects of their behaviour.

User types obtained from basic behaviours. We characterized participants based on two aspects of the behaviour they exhibited early in an experiment: % alerts attended and # checks on residents. What is noteworthy is that the user types obtained from this basic information have predictive value, and shed light on other behaviours. Specifically, users of different types react differently to changes in FAs and Misses. This indicates that user characterizations obtained from a few basic information items may be of value — an insight that extends beyond reactions to FAs and Misses.

Relationship between system performance, trust and behaviour. Our results are similar to those in [11, 64], i.e., FAs and Misses influence trust, FAs affect compliance and Misses affect reliance. Like [11, 58] and unlike [3, 30, 64, 72, 73], we found no correlation between users’ self-reported trust in the monitoring system and reliance. In addition, similarly to [11, 64], we found only a very weak correlation between trust in the monitoring system and compliance, and found no correlation between trust in the advisor agent and conformity with its advice.

As mentioned above, the effects of FAs and Misses differ for different types of users identified in the monitoring-system study (Section 5.4): one group was highly reactive (to alerts), one group was highly deliberative (performing many checks on the residents), and a third group sat between these extremes. This finding extends Hoff and Bashir’s hypothesis regarding factors that influence the relationship between system performance and users’ trust and behaviour [27] (Section 2), and as above, is applicable beyond responses to FAs and Misses.

Effect of the recommendations made by an advisor agent on users’ behaviour. The results of our study indicate that the advisor agent’s recommendations lead to improved decision making, both while users work with the agent and when they subsequently work independently, thus showing that the agent has a training effect. In addition, even though our agent’s advice was not tailored to particular types of users, it led to larger changes in behaviour for the reactive users than for the deliberative users, attenuating the difference between these user types. This finding is in line with the results of Buçinca et al. [9], whereby users who conform excessively with an AI’s recommendations benefit from cognitive forcing – interventions that disrupt heuristic reasoning and encourage users to engage in analytical thinking.

Predicting trust and behaviour. Our predictive models generate usable predictions of trust scores, compliance and reliance on the monitoring system, conformity with the advisor agent’s advice, and learning outcomes. They also complement our findings about features that influence users’ trust and aspects of behaviour. Specifically, in both user studies, (1) the value of a feature in a preceding stage is predictive of its value in the next stage, i.e., in the short term, which agrees with the findings in [49]; and (2) user type based on a user’s early behaviour is influential for predicting compliance and reliance on the monitoring system, conformity with the agent’s advice and learning outcomes. A possible explanation for the better predictive performance when clusters (i.e., user types) were used as features (instead of the features from which the clusters were derived) is that the individual features may lead to over-fitting, while the clusters reflect meaningful abstractions. Finally, the results of the advisor-agent study also indicate that (3) the agent’s accuracy for individual users is a predictor of their learning outcomes. Although this finding is not surprising, it suggests that it is worth taking into account users’ personal experience with a system, which may differ from general system accuracy.

Methodological contributions. The studies that yield these findings are enabled by two methodological contributions: the game itself, which supports experimentation with various factors, and a version of the game augmented with an advisor agent; and techniques for calibrating the parameters of the game and determining the recommendations of the advisor agent.

Limitations and future work. Our studies have the following limitations: (1) our routine administration task is not related to the care-taking scenario; and (2) our participants were crowd workers recruited from the SONA and CloudResearch platforms, rather than aged-care workers. Regarding the first limitation, we posit that the presence of a task that occupies participants’ attention enhances the ecological validity of our user studies, as it represents a common situation where people are responsible for patients’ welfare while engaged in activities that demand their attention. Although not ideal, the recruitment of crowd workers has become an accepted standard in studies conducted in User Modeling and related areas. Mitigating factors for the second limitation are: the immersive narrative of the game, the observation that most of our participants had experience with smart homes or monitoring devices, and the fact that having ageing relatives is a common experience.

As mentioned in Section 1, Hoff and Bashir [27] noted that “Overall, the specific impact that false alarms and misses have on trust likely depends on the negative consequences associated with each error type in a specific context”. Our monitoring-system study is set in a care-taking scenario, where the consequences pertain to patients — a setting that has not been investigated to date. Since our focus was on the influence of error type on trust and behaviour, the consequences remained constant across different configurations of FAs and Misses. Going forward, it would be worthwhile to conduct studies which vary the consequences of users’ actions or lack thereof.

In terms of our design methodology, the process for setting the parameters of the game is manual (Section 3.3.2). An interesting avenue of investigation involves using an optimization algorithm to calibrate the game parameters together, e.g., to maximize the income gap between different types of users, while ensuring an appropriate level of income.

Our advisor agent was endowed with attributes which, according to the literature, are conducive to influencing users’ behaviour (Section 4). Since this agent was successful in modifying the behaviour of our participants, it is worth performing further studies that examine the impact of individual agent features on trust, conformity and learning outcomes. For example, it would be interesting to compare our agent with an agent that simply recommends actions without explaining them, or with agents that have different ages and genders. Another avenue of investigation would involve generating advice and explanations tailored to different types of users. For instance, the deliberative group could receive advice without explanations, while the reactive group could receive advice with explanations that promote analytical thinking [9]. Finally, transparency was achieved by offering manually-generated justifications of the agent’s advice, based on inspection of the agent’s policy. In the future, it would be interesting to automatically learn justifications from features of this policy.