1 Introduction

Following promises of economic and societal benefits across industries and public domains (European Commission, 2021), artificial intelligence (AI) tools and functions are being rapidly adopted in high-stakes social domains, reshaping many public, professional, and personal practices (Whittaker et al., 2018). While AI tools often have the potential to increase efficiency and improve decision-making, they can also lead to harms and violations of fundamental rights related to non-discrimination or privacy (Balayn & Gürses, 2021). Other emerging harms include physical dangers related to new robotic systems such as autonomous vehicles, and digital welfare systems leading to grave financial and mental harm (Dobbe et al., 2021).

In response, many efforts have emerged to anticipate and address the implications of AI through appropriate governance strategies. These included a first wave of ethical principles and guidelines (Jobin et al., 2019), as well as technical tools for addressing issues of bias, fairness, accountability and transparency (Whittaker et al., 2018). While these guidelines and tools helped develop broader awareness of the governance challenges, little is still known about how to situate and operationalize these principles and tools in the practice of developing, using and governing AI systems. On the contrary, critical scholars have argued that these instruments are often pushed as forms of self-regulation by industry to prevent more stringent forms of regulation (Wagner, 2018; Whittaker et al., 2018).

In technical fields, harms imposed by AI systems are primarily characterised as ‘bias’ or ‘safety’ flaws that can be addressed in the design of the technical system, leading to a focus on technical solutions (Balayn & Gürses, 2021). This way, the broader social and normative complexity of harms and their relation to design choices is naively narrowed down to a problem in the technical design of AI systems, and thus placed in the hands of technology companies or internal developers, thereby foregoing normative deliberation and accountability (Green, 2021; Nouws et al., 2022). However, problems such as discrimination cannot be tackled by technology specialists alone, but require a more holistic specification and evaluation of AI systems in their sociotechnical context (Dobbe et al., 2021).

Based on a structured review of the scholarly literature on AI and public governance, Zuiderwijk et al. (2021) list various knowledge gaps that motivate the need for a comprehensive sociotechnical system perspective on the governance of AI systems. Firstly, AI is mostly addressed generically, and there is great need for more domain-specific studies. In every domain there are different actors, legacy practices and infrastructures that an AI system operates in. Understanding the broader system that an AI technology operates in requires a mix of methods that can capture complex interactions across stakeholders and technological features (Ackerman, 2000). A sociotechnical system lens can comprehensively describe such complexity and allow for meta-analysis and cross-domain comparison (de Bruijn & Herder, 2009). Furthermore, there is little empirical testing of AI systems in practice: “[a]s AI implementations start to bear fruit (or cause harm) [...], there is an urgent need to pursue explanatory research designs that adopt expanded empirical methods to generate operational definitions, extract meanings, and explain outcomes specifically within public governance contexts” (Zuiderwijk et al., 2021).

In this paper we pursue empirical research to understand the extent to which existing design, use and governance practices for machine learning (ML) models are able to address the sociotechnical vulnerabilities of ML applications through which safety hazards may emerge. To map these vulnerabilities, we perform an integrative literature review on sources of vulnerability of a sociotechnical nature, based on recent literature on ML as well as lessons from system safety and other sociotechnical systems engineering disciplines that have long dealt with sociotechnical vulnerabilities in software-based automation (de Bruijn & Herder, 2009; Dobbe et al., 2021). In Sect. 1.1, we first cover related work. It is relevant to note that this study was performed in late 2021 and early 2022, and hence precedes the quick rise of generative AI tools since the end of 2022. Nonetheless, the findings in this paper largely apply, but may be nuanced or extended for more recent AI applications.

1.1 Related Work

The efforts to develop new practices for responsible AI system development are broad. Here, we put particular focus on efforts and critiques that explicitly mention and are informed by sociotechnical systems theory and engineering. The key findings in this related work echo two gaps identified by Zuiderwijk et al. (2021), namely a lack of empirical as well as conceptual accounts to better understand and describe the sociotechnical complexity of AI systems in practice and bridge technology, ethics and policy. In the following, we cover the most relevant papers, highlighting their affordances and limitations towards this aim.

Selbst et al. (2019) introduced the notion of sociotechnical systems in the discourse on fairness in machine learning. They mainly point out how fairness-aware research up until then abstracted away most context that surrounds the machine learning model, conceptualizing five traps that may contribute to undesirable narrowing of abstraction. They provide some high-level takeaways but do not offer empirical grounding or engage with the ontological challenges inherent in defining ML as a sociotechnical system. Green (2021) critiques the tech ethics landscape, pointing out the need for sociotechnical systems thinking to overcome false assumptions of technology’s neutrality, solutionism and determinism, without further elaborating how to ground such thinking in design practice. Winby and Mohrman (2018) develop a high-level organizational design approach for digital systems incorporating sociotechnical analysis, but do not address the technical and sociotechnical dimensions of ML models themselves. Behymer and Flach (2016) point out the flaws in the dominant thinking of building autonomous systems separate from their human and social environment, proposing an alternative lens where the goal of design is a seamless integration of human and technological capabilities into a well-functioning sociotechnical system, based on Rasmussen’s Skills, Rules, Knowledge (SRK) framework. While SRK leans on decades of experience in safety engineering, it lacks the conceptual and empirical depth to address the particular nature of ML models and their interactions with context. Similarly, Oosthuizen and Van’t Wout (2019) perform a study based on Cognitive Work Analysis (CWA) to understand the impact of AI technologies on users. This lens is relevant but mostly focused on the human agent with an eye for adoption, rather than understanding more integrally what kinds of vulnerabilities emerge in the broader human-AI system. Makarius et al. (2020) adopt an organizational approach, arguing that employees need to be socialized to develop so-called “sociotechnical capital” when working with AI. Such a lens overlooks the risks and emergent hazards of adopting AI at a large scale. Martin et al. (2021). The MLOps process is further introduced in Sect. 2.

The research is both descriptive and design-oriented. On the one hand we want to understand and contribute to existing practices that are understudied. On the other hand, we do see valuable lessons in traditional sociotechnical systems disciplines, in particular in system safety (Dobbe et al., 2021).

2 Mapping Existing Practices for MLOps

In this section we investigate currently existing practices for managing ML applications under the umbrella of MLOps. MLOps is an approach that aims to ensure reliable and efficient ML development, deployment and operations (Ruf et al., 2021). MLOps is a combination of ML, DevOps and Data Engineering. It is a practice to automate, manage and speed up the operationalisation of ML models (build, test, and release) by integrating DevOps practices into ML (Ruf et al., 2021). DevOps is a development methodology for software aimed at bridging the gap between Development and Operations practices, emphasizing communication and collaboration, continuous integration of software updates, quality assurance of software systems, and delivery with automated deployment utilizing a set of development practices (Jabbari et al., 2016). At its core, MLOps is the standardisation and streamlining of ML lifecycle management, and the general desire in MLOps is to automate the ML lifecycle as far as possible to speed up the deployment and operations processes (Treveil, 2020).

2.1 Empirical Mapping of the Machine Learning Lifecycle

As MLOps practices may vary from organization to organization, we sketch the processes we mapped in the empirical research performed for various use cases in the financial industry. All these case studies rely on one external company for developing MLOps practices, which allows us to work towards one mapping.

The mapping was based on insights drawn from a series of interviews with stakeholders active in the design and management of ML models, both within the organizations as well as for a vendor company offering services in the design and deployment of the ML models. All actors were asked to draw the key steps in MLOps and their interdependencies. This resulted in the map presented in Fig. 1.

The process of bringing a ML model into practice is generally conceptualised as a ML lifecycle. The ML Lifecycle can be divided into three stages: experimental stage, deployment stage and operations stage. The experimental stage involves all steps that lead to the construction of a ML model as well as activities to improve, correct or enrich an existing ML model deployment. The deployment stage includes the steps to integrate the model in an organisation’s operational processes and infrastructure, so that it can be used to make predictions that then form an input to various business processes. The operations stage comprises the monitoring of the model and application and may trigger reasons to revisit the design and training of a model based on certain performance indicators.
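To make the three stages concrete, the sketch below outlines them in Python. It is a minimal illustration under our own assumptions, not the pipeline used in the studied use cases; all names, data and thresholds are hypothetical.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def experimental_stage():
    # Construct and validate a candidate model on (synthetic) historical data.
    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    baseline_accuracy = accuracy_score(y_test, model.predict(X_test))
    return model, baseline_accuracy


def deployment_stage(model):
    # Integrate the model into operational processes as a prediction service.
    def predict(batch):
        return model.predict(batch)
    return predict


def operations_stage(predict, live_batch, live_labels, baseline_accuracy, tolerance=0.05):
    # Monitor the deployed model; a drop against the baseline is a trigger to
    # revisit the experimental stage.
    live_accuracy = accuracy_score(live_labels, predict(live_batch))
    return live_accuracy < baseline_accuracy - tolerance  # True -> revisit training


model, baseline_accuracy = experimental_stage()
predict = deployment_stage(model)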

Fig. 1

MLOps practices mapped through interviews, sometimes referred to as the ML lifecycle. Under each activity, we list the roles of the professionals interviewed for the use cases in Sect. 4

2.2 Lack of Context in MLOps

The ML lifecycle presented above reflects the dominant view of how ML applications are developed. This view focuses on hardware, software, algorithms, mechanical linkages and inputs/outputs, and is thus narrowly focused on primarily technical components and factors. However, the ML application is embedded in an operational process and part of a broader sociotechnical system, which also includes other technical systems, stakeholders, decision-making logics, institutions and the final outcomes, as presented in Fig. 3 and further elaborated in Sect. 3.2.

A sociotechnical system consists of technological, social, and institutional elements and is mostly defined by the interactions between these elements (Van de Poel, 2020). The primary emphasis on the ML model and various technical metrics in MLOps practices leaves us with a gap between the technical conceptual framework, the ML lifecycle, and the needed sociotechnical conceptualisation of ML-based applications in context (Alter, 2010), which we need in order to identify and address vulnerabilities that emerge in context (Leveson, 2012). For example, MLOps practices do not offer a lens to understand how, in the construction of ML models and applications, various dimensions of social context are abstracted away, as critiqued in the context of fairness in ML (Selbst et al., 2019) or debiasing practices (Balayn & Gürses, 2021).

We can draw this point further by looking at the key practices in MLOps. Continuous Integration (CI) enables automated model validation, after which the Continuous Delivery (CD) pipeline automatically delivers the model to be deployed. By definition, i.e. by virtue of their automatic nature, CI and CD practices do not consider the validation of the model’s interactions with its sociotechnical context, including users and other technical systems that it depends on or which depend on the model’s outputs. As such, a new model version validated in the CI pipeline may be valid from a technical perspective, yet fail to meet the requirements of stakeholders in the sociotechnical system. This way, new model versions are released that may require changes in, for example, the design of the decision-making process or the interpretation of the model output by end-users.
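As an illustration of this narrowness, the following hypothetical sketch shows what an automated CI validation gate typically checks; it assumes a fitted scikit-learn classifier and held-out validation data, and the metric thresholds are invented for the example.

from sklearn.metrics import accuracy_score, f1_score


def ci_validation_gate(model, X_val, y_val, min_accuracy=0.85, min_f1=0.80):
    # Decide purely on technical metrics whether the candidate model
    # may be handed to the CD pipeline for deployment.
    predictions = model.predict(X_val)
    checks = {
        "accuracy": accuracy_score(y_val, predictions) >= min_accuracy,
        "f1": f1_score(y_val, predictions) >= min_f1,
    }
    # Nothing here validates the model's interactions with end-users,
    # decision-making processes or dependent systems.
    return all(checks.values())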

If left unaddressed, these changes can cause new hazards to emerge at the level of sociotechnical interactions. In system safety, a field that has grappled with hazards in software-based automation for many decades, it is known that significant changes in the design of an automatic function require a management-of-change procedure to catch any such emergent hazards and prevent them from seeping into the operational process; this is a central procedure of safety management systems (Leveson, 2012).

The history of safety in software-based automation also tells us that vulnerabilities emerge from the interactions between social, technical and institutional components of the broader sociotechnical systems, including aspects like maintenance, oversight and management. Therefore, the identified technocentric view in MLOps, while being able to address technical or mathematical vulnerabilities in the ML application itself, is not sufficient to understand a variety of related and additional vulnerabilities emerging when using ML in context. In the next section, we adopt a sociotechnical systems lens to identify vulnerabilities that may emerge in the development and use of ML models.

3 A Sociotechnical Systems Lens for Machine Learning in Context

3.1 Scoping the Study

In this paper, we make a first step towards a sociotechnical specification approach for ML applications that can do justice to the emergent nature of key values such as safety or fairness. We particularly focus on understanding what kinds of vulnerabilities emerge in the development and use of ML applications in their social and institutional context that may impact such values. The aim is to bring relevant vulnerabilities into view, to inform actors involved in specifying and designing ML applications to take these into account and work towards safer and better functioning system designs. This way of designing is inspired by system safety (Dobbe, 2002). The theory evolves through continuous interplay between analysis and data collection. In this research, the data collected is data on vulnerabilities that emerge in the sociotechnical system context, as identified in scientific literature. The analysis comprises the interpretation of these vulnerabilities by combining them with knowledge of the ML lifecycle and sociotechnical specification, as mapped and corroborated in the earlier sections. Figure 3 provides a conceptual overview and visualization of the dimensions. It is important to note that there is no one-to-one mapping from vulnerabilities (as listed in Appendix 1) to the dimensions. Instead, the dimensions serve as a meta-categorization with which each vulnerability can be characterized as emerging through different dimensions. For example, the occurrence of a false positive that was not detected by a human user or interpreter of the machine learning model output is an expression of both misinterpretation and error. Likewise, a model that does not consider important elements of the context and therefore makes mistakes combines the dimensions of misspecification and bias and error. In the following subsections, we discuss each of the eight resulting dimensions.
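As a small illustration of this meta-categorization, the hypothetical sketch below tags each vulnerability with one or more of the eight dimensions; the example entries echo the two examples above and are not taken from Appendix 1.

DIMENSIONS = {
    "misspecification", "bias_and_error", "interpretation",
    "performative_behaviour", "adaptation", "dynamic_change",
    "downstream_impact", "accountability",
}

# Hypothetical entries; each vulnerability maps to a set of dimensions,
# not to a single one.
vulnerabilities = [
    {"description": "False positive not detected by the human interpreter of the model output",
     "dimensions": {"interpretation", "bias_and_error"}},
    {"description": "Model ignores important elements of the context and therefore makes mistakes",
     "dimensions": {"misspecification", "bias_and_error"}},
]

assert all(v["dimensions"] <= DIMENSIONS for v in vulnerabilities)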

Fig. 3

Resulting dimensions of possible vulnerabilities emerging in the sociotechnical context of machine learning models and applications, based on the scope laid out in Fig. 2. The identified dimensions are represented by the bright pink color icons. The boxes and arrows denote the typical flows of digital information, in the form of data and model outputs, and their impact on decisions, other uses and outcomes

3.2.1 Misspecification

Vulnerabilities can be caused by misspecification, which entails mistakes or gaps in the specification of the broader sociotechnical system. As ML models are an integral part of larger sociotechnical systems, the specification should be done at the level of the operational process in which the ML model is deployed. Facilitating the interaction between the ML model and other technical, social and institutional components should be part of such specification. The absence of such specification may easily lead to ML applications that do not serve the needs of users and other impacted stakeholders, do not comply with various regulations, or cause new emergent forms of error, hazard and harm. Here we distinguish between cases of misspecification that arise because there is a lack of consensus on the outcomes and behavior of the intended system, and cases for which there is consensus, but the resulting specification is incorrect. For the former, there is the need to address the normative complexity that may arise from different stakeholders holding different values or interests, leading to conflicts in the specification, which may or may not lead to vulnerabilities in the eventual design and operation of the system (Dobbe et al., 2021; Van de Poel, 2015).

3.2.2 Bias and Error

Machine errors occur in the ML model, resulting in errors in the model output, with impact on people, processes and organisations. It is important to be aware of the types of machine error that could occur in the development of ML applications or in the ML model output, what the potential impact could be, and to whom. Machine error can be divided into two categories: machine incorrectness and machine bias. Machine incorrectness refers to a false negative or false positive output. Machine bias refers to forms of disproportionate differences in the error rates of a model across different groups or attributes of people subject to the system’s outputs. Both errors as well as the resulting biases may be pre-existing in the data and social practices from which these are derived, or they may be encoded in the technical design of the ML application or operating process, or they may emerge in the operation and maintenance of the system (Dobbe et al., 2021).

5.1.1.1 Misspecification in the Use Cases

In the financial crime detection use case, the manager of data science does realise that it is vital to take the role of the end-user into account during the development of the ML application, as the value of the system would be zero if the end-user did not understand what comes out of the system. While this realisation is there, an example of misspecification can be identified in this use case. The descriptions of the features used in the ML application were specified by the data science team, while the transaction monitoring analysts, being the end-users of the ML model output, have to use these descriptions for their analysis. This led to overly technical descriptions that were difficult to understand for the transaction monitoring analysts. In the specification of the features, the end-user component of the sociotechnical system was thus not sufficiently considered, leading to dysfunction. The transaction monitoring analysts had to create a translation document retrospectively to make the descriptions, and thus the ML model output, understandable for the analysts. This could have been prevented by involving the end-users in the specification of the features and their descriptions.

Comparing this example of misspecification with the email marketing use case sheds light on another approach to the specification of features, which prevents the misspecification described above. In this use case, the ML engineering/consulting firm intentionally chose not to use the most complex model and features. This choice was made to allow the marketing intelligence analysts to understand the features and to propagate this understanding to the rest of the organisation. The email marketing use case does contain another example of misspecification. The ML engineering/consulting firm initially scheduled the model runs and the transfers of the output files to the analysis system automatically. However, the marketing intelligence analyst highlighted the importance of his checking role, and it regularly happened that the transfer had failed. This led to a change in the process, where the marketing intelligence analyst now checks the model output before transferring it to the analysis program. This example shows that a specification meant to lead to an efficient process (no human checks) is not necessarily desirable or effective.

Moreover, in both use cases, the systems are specified for internal organisational objectives, not necessarily for fair outcomes for people affected by the systems. In the financial crime detection use case, the objective of one of the models is to reduce the false-positive rate of alerts being investigated. This reduction is seen as valuable for optimizing the alert investigation workflow. However, a false-positive alert means that a customer of the bank is investigated for financial crime, which was not mentioned as a problem. In the email marketing use case, the system is specified to increase the conversion rate towards investing products. This way, the system is optimized to select the most promising customers of the bank to consider investments. This selection might lead to unfair outcomes, such as privileging those who are wealthy already to become even wealthier.

5.1.1.2 General Insights on Misspecification

Besides the identification of misspecification within the two use cases, the broader selection of interviews shed light on how misspecification can be present in ML use cases in general. First, several representatives of civil society organisations and regulators pointed out that technology is chosen as a solution to problems, whereas it is not always the best solution. As the representative from Privacy First said: “What we see is a kind of love for technology, where goals are achieved with technological solutions, while solutions may need to be found in another area” (Personal Communication, January 22, 2022). Second, every model is a simplification of reality, and it is vital to understand that the world is more complex than the information ultimately present in a ML model, according to the representative of Bits of Freedom (Personal Communication, January 17, 2022). The fact that ML models are simplifications does not always penetrate the sociotechnical system, leading ML to be seen as a sort of holy grail, while it only captures something very specific and is not a replacement for critical thinking in an organisation (Personal Communication representative Waag, January 19, 2022). Therefore, it should be debated whether developing a ML application is an appropriate solution to a problem, to prevent misspecification. Furthermore, several representatives of civil society organisations and regulators see risks of misspecification in the usage of data. Data is often seen as something factual and objective, while in reality it is a translation of what is seen in the world. Data is often used without really thinking and reflecting on where it comes from and what social problems have influenced it (Personal Communication representative Bits of Freedom, January 17, 2022). If data is collected in one context, and later used in another context, this could lead to harmful consequences. As an example, a situation was shared in which there were two data sets about hours worked and invoiced by workers, collected by two different organisations, which were subsequently combined into one data set by a third party to search for fraudulent activity. While in one data set the invoiced hours only represented the on-site working hours, the other data set contained the on-site hours as well as the preparation time. This way, a lack of proper specification in the use of the combined data contributed to a ML model that made grave errors, unjustly accusing people of fraud (Personal Communication representative Platform Bescherming Burgerrechten, January 18, 2022). Another often mentioned set of consequences of misspecification is when a model is developed and put into production, but not used in practice. The reasons mentioned relate to misspecification of the sociotechnical context in which the model needs to operate. For example, the model was not needed, is not trusted by the end-user, or adjustments in the way of working of the end-user were not properly foreseen (Personal Communication Data Science Advisor, December 22, 2021).

5.1.2 Machine Error

Machine error can be roughly divided into two categories: machine incorrectness and machine bias, as explained in Sect. 3.2.2. The impact of decisions made based on the models in the two use cases differs considerably, which seems to influence how potential machine error in the models is dealt with. In the financial crime detection use case, the performance of the models’ internal workings is always evaluated before pushing the output to the analysts, whereas in the email marketing use case the model is trusted to be working as expected. Machine error is recognised by the civil society organisations and regulatory bodies as a relevant dimension.

5.1.2.1 Machine Incorrectness in the Use Cases

In the financial crime detection use case, the models are used in a quite sensitive context, which makes it important to prevent incorrectness in the models. The impact of incorrectness could be that customers of the bank are unjustly investigated by the transaction monitoring analysts, or that customers that should be detected by the model are not detected, which could lead to money laundering or terrorist financing going undetected. Because the latter is considered important to prevent, there is an acceptance of a less accurate model, which leads to more false positives, in order to be able to find the true positives. Further, there are multiple mechanisms in place to prevent incorrectness in this use case. First, data scientists work with a four-eyes principle, by which every piece of code is checked by another data scientist during development. Second, there is a dedicated independent model validation team that validates a model, including all code, before it is put into production. Lastly, there is monthly performance monitoring in place for the models, which run monthly, to check whether the model performance is comparable to its performance during training and whether the features’ distributions have changed, in order to detect potential machine incorrectness. After completion of the performance monitoring, the model output is directed to the transaction monitoring analysts. Although the bank has these mechanisms in place, the interviewed transaction monitoring analyst pointed out that in a third model, of which the first version is currently live, the testing of the model was not performed in the case management system the analysts are using. Once the first real output of the model appeared in the case management system, there was a lot of incorrectness; for example, generated alerts that did not contain transactions and features that were not visible in a customer’s account. As a result, the analysts are dealing with model output that has a lot of incorrectness, and a new version was still not live at the time the interview was conducted (Personal Communication, January 14, 2022).
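The sketch below gives a flavour of such a monthly check in Python: it compares current performance against a training-time baseline and tests each feature for distribution shift. It is a hedged illustration under our own assumptions (column-wise access to feature data, illustrative thresholds), not the bank's actual monitoring code.

from scipy.stats import ks_2samp


def monthly_monitoring(train_features, live_features, baseline_score, live_score,
                       score_drop_tolerance=0.05, drift_p_value=0.01):
    # Collect warnings before the model output is released to the analysts.
    warnings = []
    if live_score < baseline_score - score_drop_tolerance:
        warnings.append("performance dropped below the training-time baseline")
    for column in train_features:
        # Two-sample Kolmogorov-Smirnov test as a simple drift signal per feature.
        _, p_value = ks_2samp(train_features[column], live_features[column])
        if p_value < drift_p_value:
            warnings.append(f"distribution shift detected in feature '{column}'")
    return warnings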

On the contrary, every interviewee in the email marketing use case pointed out that the impact of machine error is relatively low. The model is not part of a mission-critical activity, and the worst-case impact to customers is that customers are accidentally emailed repeatedly by the bank, or are emailed even though they had opted out of certain emails. Despite this relatively low impact, the marketing intelligence manager mentions the importance of being transparent, for example if customers who had opted out of emails had nevertheless been emailed (Personal Communication, January 11, 2022). In the first live version of this model, there appeared to be overfitting on a certain feature. This was only detected once the model was already live, because the test set did not contain the needed exceptional cases. There was only 1.5 years of data, so most of the data had to be used for training, but if more data had been available for testing, this overfitting could have been detected earlier, before going live (Personal Communication external ML engineer/project manager, January 13, 2022). To prevent emails being sent based on incorrect model output, there is a human in the loop, the marketing intelligence analyst, who does manual checks on the model output, for example to check whether customers had opted out. That said, he only checks proposed customers on a narrow set of characteristics, otherwise trusting the model to be working as expected (Personal Communication marketing intelligence analyst, January 21, 2022).

5.1.2.2 Machine Biases in the Use Cases

To prevent biases in the models, in both use cases it was chosen not to use certain data. In the email marketing case, the external ML engineer/project manager pointed out that the bank finds it very important to use data wisely, well within the lines of the GDPR (General Data Protection Regulation). Therefore, the choice was made not to use gender data and postal codes (because those may be a proxy for ethnicity). The choices of what could not be included were made based on intuition. In the financial crime detection use case, the data science department wanted to detect potential bias in the ML models caused by proxies for gender, age, ethnicity and origin. However, the privacy officer prohibited this for ethnicity and origin, because those are sensitive personal data. This presents a notable paradox: to be able to detect a bias on a particular attribute, analysts require data on this particular attribute, for example ethnicity. However, the interviewees indicate that the GDPR prohibits this, as a ML model should adhere to privacy by design, which means data on ethnicity and origin cannot be processed without a clear and proportional motivation. Interpreting this as a limitation precludes the ability to check for biases.
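A minimal sketch of the kind of check the data science department reportedly wanted to run is given below: it compares false-positive rates across groups, and thereby also illustrates the paradox, since the group attribute itself is a required input. The helper is hypothetical and not taken from the use case.

from collections import defaultdict


def false_positive_rate_per_group(y_true, y_pred, group_labels):
    # Count false positives and actual negatives per group, e.g. per
    # (proxy for a) protected attribute such as age bracket.
    counts = defaultdict(lambda: {"fp": 0, "negatives": 0})
    for truth, pred, group in zip(y_true, y_pred, group_labels):
        if truth == 0:
            counts[group]["negatives"] += 1
            if pred == 1:
                counts[group]["fp"] += 1
    return {group: c["fp"] / c["negatives"]
            for group, c in counts.items() if c["negatives"] > 0}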

The handling of potential bias in the ML applications as surfaced by the interviewees in the use cases, i.e. not using certain features and detecting bias for certain factors by proxies, is known not to be a sufficient strategy for attaining fair outcomes. The found practices do not account for the diverse fairness requirements and needs of the stakeholders involved in or affected by the sociotechnical system (Balayn & Gürses, 2021).

5.1.2.3 General Insights on Machine Error

Machine error was recognized as a relevant dimension by the representatives of the civil society organisations and regulators. To illustrate, biases are seen as one of the largest problems in ML applications, in particular in the context of predicting social behaviour or crime. A large risk is the use of indicators that are actually proxies for protected grounds, as was also discussed within the use cases (Personal Communication representative Amnesty International, January 25, 2022). The representative of the AP also points out the need to guard against unwanted indirect inferences based on proxies, although it is increasingly difficult to completely prevent this from happening in contexts where ever-growing numbers of variables are used (Personal Communication, January 24, 2022). There is a lot of attention for the impact of machine error on people in the proposed EU AI Act (Personal Communication representative DNB, January 18, 2022). Society does not have a high acceptance of machine error, which disincentivizes organisations from being open about flaws in their systems. This non-acceptance is reflected in a statement made by the representative of Amnesty International, who stated that ML models with high error rates should not be used for any consequential decision-making (Personal Communication, January 25, 2022).

5.1.3 Interpretation

Interpretation of a ML application’s output by human decision-makers is recognised by many civil society organisations as a potential source of risks for the quality of the ultimate decisions. At the same time, under the GDPR, human intervention is mandatory if a decision affects a person to a significant degree. In the use cases, stakeholders seem to be less aware of the vulnerabilities of human intervention. In the marketing use case, human intervention is seen as a means to reduce risks by providing an extra checking and controlling mechanism, rather than a step that can introduce additional vulnerabilities. There is an increased risk of human noise in the financial crime detection use case, as the more complex system output increases the potential for deviations in interpretation among analysts. Further, the potential for human bias is not adequately considered in either use case, and the human interpretation step is not monitored or evaluated.

5.1.3.1 Interpretation in the Use Cases

In the financial crime detection use case, the analysts that use the model output in their work initially encountered challenges in using that output. The number of factors taken into account by the ML models, compared to the previous rule-based system, was greatly expanded, which made it difficult to understand how to interpret the output, where to look, and what to think about at all (Personal Communication transaction monitoring analyst, January 14, 2022). Meanwhile, the transaction monitoring analyst thinks the model output is well interpretable, in combination with the translation document on the model features. The analysts are supported in the model interpretation by means of explainability methods, consisting of highlighting the three most important features that contributed to the model output, as well as the most important transactions (Personal Communication Lead Data Scientist, January 12, 2022). Further, analysts that start working with the ML models’ output follow a training and receive a working instruction document. Nevertheless, the transaction monitoring analyst pointed out that the introduction of the ML models increases the risk of deviations in interpretation among analysts compared to the previous rule-based system, because the output is a lot more complex (Personal Communication, January 14, 2022). The data science advisor within the bank does see that bias could also emerge in the decision-making process surrounding the model, but notes that it is difficult to quantify this (Personal Communication, December 22, 2021). The manager data science points out that there are a lot of checks on the model itself, but potential human bias is not well considered, which could be improved. At the same time, human intervention is an important mitigation measure for automated decision-making, and completely automating the decision-making is not desired for these important decisions (Personal Communication manager data science, January 12, 2022).

In the marketing case, a human-in-the-loop, the marketing intelligence analyst, has a controlling and checking responsibility for the model output, after which the customers proposed by the model are sent an email. The model output only presents the customer IDs that are proposed, which makes it difficult to understand why the model chooses certain customers (Personal Communication marketing intelligence analyst, January 21, 2022). The marketing intelligence analyst feels that his checking function is very important; however, he does not have enough tools and guidance to perform this function adequately. He would like to have better reports on the customers that are selected, to get a picture of the customers (Personal Communication marketing intelligence analyst, January 21, 2022). The ML engineer/project manager does not see a great risk of human bias emerging, as the marketing intelligence analyst mainly has a checking role (Personal Communication, January 13, 2022). Additionally, the marketeer thinks the role of the marketing intelligence analyst decreases the risk of potential mistakes rather than introducing new ones (Personal Communication, January 14, 2022).

5.1.3.2 General Insights on Interpretation

The representatives of both Bits of Freedom and Amnesty International mention that a human-in-the-loop is seen by many organisations as a means to take away concerns about ML, while it is not the solution to the problem and much more is needed to prevent harmful outcomes (Personal Communication, January 17 and 25, 2022). Automation bias and limited time for the job can make humans rely overly on the model output, and a model can be used by humans as confirmation of their own bias (Personal Communication representatives Platform Bescherming Burgerrechten and Amnesty International, January 18 and 25, 2022). At the same time, human intervention in automated decision-making that can affect a person to a significant degree is mandatory under the GDPR (Personal Communication AP, January 24, 2022). The ML engineer within bank A points out that the chain of activities in a decision-making process around a ML model is more important for the quality of the ultimate decisions than the model itself (Personal Communication, January 24, 2022). It really depends on the mindset of the human decision-makers that are in the chain between the model output and the final decision. If the decision-making process is badly designed or not thought through, a very good model can lead to terrible outcomes (Personal Communication ML engineer bank A, January 24, 2022).

5.1.4 Performative Behaviour

Performative behaviour as a dimension did not show up frequently in the interview data about the use cases, which could be caused by unawareness of the dimension or by performative behaviour being of less relevance in these use cases. The civil society organisations and regulatory bodies did recognise performative behaviour as a relevant dimension. The representative of DNB pointed out that it is very relevant, yet this point of concern does not get much attention in the field, which could explain why the dimension was not extensively elaborated on in the use cases.

5.1.4.1 Performative Behaviour in the Use Cases

Considerations on performative behaviour have not been mentioned in the email marketing use case. In the financial crime detection use case, there was one comment on behaviour. The ML models required changes in the way of working of the analysts, to which they reacted that they could not or did not want to work with the model output (Personal Communication Manager Data Science, January 12, 2022).

5.1.4.2 General Insights on Performative Behaviour

The general insights on performative behaviour consist of insights on the change of behaviour among people about whom a ML model takes a decision, as well as the change of behaviour among human decision-makers that use the model output to take a final decision. The representative of DNB argued that a trade-off exists between transparency and black-box models. For example, if criminals gain insight into how a ML model that is used to detect money laundering works, it becomes easier for them to circumvent being detected as a money launderer. This point of concern gets little attention and does not play a big role in the discussion around explainability (Personal Communication, January 18, 2022). The representative of Amnesty International sheds a different light on this trade-off, arguing that it is not required to make the whole code public, but that people should be informed when personal characteristics such as nationality, age, postal code or salary are used. The potential for circumventing the system is not at stake, as people cannot change these characteristics, but they should be able to know on what basis a decision is made (Personal Communication, January 25, 2022). Behavioural change among human decision-makers that use model output was also mentioned by interviewees. The introduction of a ML model decreases the level of ownership a human decision-maker has, compared to a situation without a ML model in place, which can influence the outcome (Personal Communication representative of Waag, January 19, 2022). Lastly, if a human decision-maker has to adhere to a predefined target in terms of validating and rejecting model output, this influences the behaviour and, as such, the final decisions (Personal Communication AP, January 17, 2022).

5.1.5 Adaptation

Adaptation as a dimension can be recognised in the use cases, although it was not widely discussed in the interviews. In the financial crime detection use case, the adjustments in the way of working were very challenging, which was not an issue in the email marketing use case. In that use case, the marketing intelligence analyst has actively been given the possibility to adjust configurations to the environment’s needs. Only one representative of a civil society organisation had encountered examples of the adaptation dimension in practice. Whereas adaptation does not seem to be top of mind among the external stakeholders, ML applications in Bank A were developed but not used in practice due to a lack of adaptation by the designated end-users.

5.1.5.1 Adaptation in the Use Cases

In the financial crime detection use case, the transaction monitoring analyst pointed out that it had been a big challenge to start working with the ML models, as it required a whole new way of working (Personal Communication, January 14, 2022). For the transaction monitoring analysts, understanding what the models’ output meant in terms of an actual increased risk of money laundering or terrorism financing was a struggle at the beginning of using the models in practice. To address the difficulties of using the models’ output in practice, there have been intensive feedback sessions between the data science team and the transaction monitoring analysts (Personal Communication transaction monitoring analyst, January 14, 2022).

In the email marketing use case, the marketing intelligence analyst has been given the possibility to adjust some configurations of the model. For example, he can adjust the threshold on the model output, whereby a change leads to more or fewer customers being emailed based on the model output (Personal Communication external ML engineer/project manager, January 13, 2022). While this gives the marketing intelligence analyst the possibility to adjust the system to the needs of the environment, he has been instructed not to change configurations too often, because doing so makes the ML application hard to evaluate (Personal Communication external ML engineer/project manager, January 13, 2022).

5.1.5.2 General Insights on Adaptation

As the introduction of ML models often requires a change in the way of working among the human decision-makers that are part of the decision-making process, this should be accommodated in the development and integration of the ML application in its context. In bank A, there have been models that were developed and put into production but were not used much in practice, because this accommodation was lacking or there was no trust in the model (Personal Communication data science advisor, December 22, 2021). Another phenomenon within the adaptation dimension is function creep: it happens a lot that a ML application that is designed and developed for one purpose is over time used for other purposes as well (Personal Communication representative Amnesty International, January 25, 2022). Moreover, it can happen that the initial design of the system does not show risks of human rights violations, but the way in which the system is used in practice does, for example leading to groups of people being treated differently than intended in the design (Personal Communication, January 25, 2022).

5.1.6 Dynamic Change

Dynamic change is not widely recognised by the civil society organisations. Most representatives had not encountered an example of dynamic change in practice, but could imagine its relevance. On the other hand, stakeholders involved in the use cases do recognise this dimension, as external factors such as regulatory changes or internal changes in data could have a large impact on the working of the models over time, while the alignment of changes across the bank’s different departments involved is a challenge.

5.1.6.1 Dynamic Change in the Use Cases

In both use cases, the department in which the models are developed and run in production is not the owner of the underlying data sources. At the same time, changes in the underlying data can have a direct impact on the models. In the financial crime detection use case, the Manager Data Science pointed out that if a new version of a ML model is developed, this requires various governance checks, while such checks are not in place for changes in the data sources, which are owned and triggered by the IT department (Personal Communication, January 12, 2022). Therefore, the data science team needs to continuously engage with the IT department about data changes. Additionally, they monitor potential changes in data or output that can be indicative of problems in the models, tracking metrics such as feature distributions, alert volume, and false positive rates (Personal Communication Lead Data Scientist, January 12, 2022).
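One way to make such upstream data changes visible to a data science team is a lightweight schema check on incoming tables, sketched below. The expected columns and types are invented for the illustration and the check assumes the data arrives as a pandas DataFrame; it is not a description of the bank's actual governance process.

import pandas as pd

# Hypothetical expectation of what the upstream table should look like.
EXPECTED_SCHEMA = {
    "transaction_id": "int64",
    "amount": "float64",
    "counterparty_country": "object",
}


def check_upstream_schema(df: pd.DataFrame, expected=EXPECTED_SCHEMA):
    # Report deviations that an upstream change by the IT department may introduce.
    issues = []
    for column, dtype in expected.items():
        if column not in df.columns:
            issues.append(f"missing column '{column}'")
        elif str(df[column].dtype) != dtype:
            issues.append(f"column '{column}' changed type to {df[column].dtype}")
    return issues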

In the email marketing use case, changes in the underlying data are seen as a relevant dimension by all stakeholders involved. Changes are not always adequately communicated to the marketing intelligence department, while significant changes could result in the model ceasing to perform properly (Personal Communication Marketing Intelligence Analyst, January 21, 2022). Other developments within the bank, such as updated privacy guidelines or new products, could require adjustments to the model as well. Currently, the department depends on the external ML engineering/consulting firm to make adjustments to the model (Personal Communication Marketing Intelligence Analyst, January 21, 2022). Lastly, external factors can significantly impact model performance. For example, the stock exchange dip in March 2020 due to the start of the Covid-19 pandemic in The Netherlands was a scenario the model had not been trained on, and it led to unreliable outputs that could not be used (Personal Communication external ML engineer/project manager, January 13, 2022).

5.1.6.2 General Insights on Dynamic Change

Representatives of Waag, Bits of Freedom, Amnesty International and Platform Bescherming Burgerrechten have not encountered dynamic change as a dimension in practice, but can imagine that it is a source of vulnerabilities. On the other hand, the representative of DNB does recognise the problems for supervision that dynamic change could cause. The proposal for the new AI Act does require a conformity assessment for new high-risk applications of ML, but does not take the dynamic change dimension into account (Personal Communication representative DNB, January 18, 2022). Especially in the case of self-learning ML models, an investigation on one day could lead to the outcome that the model is compliant, while the next week it might no longer be compliant (Personal Communication representative DNB, January 18, 2022). This is a realistic problem that is challenging to assess for supervising bodies.

5.1.7 Downstream Impact

Among representatives of civil society organisations and regulatory bodies, downstream impact is seen as an important dimension, in which especially data quality and data selection can have severe impact on the outcomes of ML models. At the same time, the subject receives little attention in practice compared to the attention given to the development of the ML models themselves. This was noticeable in the interviews with stakeholders involved in the use cases, in which concerns about downstream impact did not seem to play a large role.

5.1.7.1 Downstream Impact in Use Cases

Within bank A, where the financial crime detection use case was developed, there are measures to ensure good data quality and compliance with the GDPR, as the bank mostly uses gold standard data sources for ML models, which are the most accurate and reliable of their kind (Personal Communication Privacy Officer, January 17, 2022). At the same time, the data sources are maintained and continuously improved or changed by the IT department, which can have downstream impact on the models (Personal Communication Manager Data Science, January 12, 2022). A large challenge is to keep a grip on where the ML model outputs are used within the organisation (Personal Communication Privacy Officer, January 17, 2022). To keep a grip on where model outputs are used, and thus limit downstream impact, employees or departments need either a data sharing agreement or authorization from the data owner, who is the data scientist who developed the model, to use the model output for different purposes (Personal Communication ML engineer, January 11, 2022). It is clear that measures have been taken in bank A to prevent vulnerabilities due to downstream impact. However, the changes in data sources remain important to monitor, as do the requirements for departments to be able to use model output in secondary decision-making processes.

In the email marketing use case, the occurrence of vulnerabilities within the downstream impact dimension seems limited, as the data quality was good in general and the output of the ML model is not used in secondary decision-making processes. However, the marketing intelligence analyst did see the possibility for this to become the case in the future (Personal Communication, January 21, 2022). Downstream impact is thus a dimension to keep in mind in bank B.

5.1.7.2 General Insights on Downstream Impact

Downstream impact is seen as an important source of vulnerabilities by civil society organisations and regulatory bodies. Both downstream impact caused by data issues and downstream impact caused by interconnected ML models have been mentioned. The representative of DNB called data management, governance and quality at least as important as the development of models, whereas it is a relatively small part of the discussion (Personal Communication, January 18, 2022). Representatives of Amnesty International and Bits of Freedom also highlighted that the data used can have severe impact on model output, which receives too little attention (Personal Communication, January 17 and 25, 2022). A ML model can in turn interact with other ML models, which can lead to losing control over wrong model output, or to cascading failures in which one failing model causes other models to fail, with large impact on the larger system (Personal Communication representatives DNB and Waag, January 18 and 19, 2022).

5.1.8 Accountability

That a lack of accountability can lead to large issues, as outcomes based on ML models can have a big impact on people, is a point of view shared by the civil society organisations and regulatory bodies. Within the use cases, vulnerabilities related to accountability can be identified. In the financial crime detection use case, retaining knowledge about the models within the bank is challenging, while such knowledge is key to being able to provide accountability. In the email marketing use case, reproducibility of the customers selected and dismissed is not easy to achieve, and responsibilities are not officially defined among the involved stakeholders.

5.1.8.1 Accountability in Use Cases

In the financial crime detection use case, accountability for the model, model output and outcome is divided among different stakeholders. The data science team is model owner and thus responsible for the model. The ML team carries responsibility for correct implementation of the model, and the transaction monitoring analysts are responsible for the decision whether a customer should be reviewed or not. Ultimately, the leader of the financial crime detection department carries ultimate responsibility for everything that happens in the department, and thus for the model, model output and final outcomes (Personal Communication Lead Data Scientist, January 12, 2022). To be able to provide accountability over the ML models used for certain output, reproducibility is in place for the model version, model output and features, and optionally for parameters, metrics and other metadata (Personal Communication ML engineer, January 24, 2022). Being able to explain an outcome to a customer is a challenge within bank A, because there is a continuous flow of employees leaving and joining the bank (Personal Communication Privacy Officer, January 17, 2022). The risk exists that at a certain point nobody knows how a ML model works any more. To prevent this, transparency on model development is key (Personal Communication Privacy Officer, January 17, 2022). As the model output is used by the transaction monitoring analysts in the use case, they have to work uniformly to make the final outcome reproducible as well (Personal Communication Transaction Monitoring Analyst, January 14, 2022). It is vital to be able to explain how a model works and how an outcome is achieved to supervisory agents such as the Data Protection Authority (Personal Communication Privacy Officer, January 17, 2022). As the model output is part of a larger decision-making process, making the model output explainable is not enough to be able to understand the final outcome (Personal Communication ML engineer, January 24, 2022).
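The sketch below illustrates the kind of reproducibility record described above, storing the model version, features and optionally parameters, metrics and other metadata for each scoring run. The field names and file-based storage are assumptions made for the illustration, not the bank's actual implementation.

import json
from datetime import datetime, timezone


def log_run_metadata(model_version, feature_names, parameters=None,
                     metrics=None, path="run_metadata.jsonl"):
    # Persist what is needed to later reproduce and account for a model output.
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "features": list(feature_names),
        "parameters": parameters or {},
        "metrics": metrics or {},
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record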

Being able to explain to a customer why he or she has been selected is seen as a great challenge in the use of ML among stakeholders in the email marketing use case (Personal Communication Manager marketing intelligence and Marketing Intelligence Analyst, January 11 and 21, 2022). If a customer were to ask, the bank cannot fully explain why he or she receives the email (Personal Communication Marketing Intelligence Analyst, January 21, 2022). At the same time, customers can always unsubscribe from receiving certain types of emails (Personal Communication Marketing Intelligence Analyst, January 21, 2022). Also, the model uses only seven features, so the stakeholders within the bank have insight into the data that have led to a model output, which makes the model explainable to a certain degree. The model has been developed by an external ML engineering/consulting firm, but the responsibility for using the model and the model output is carried by the bank (Personal Communication external ML engineer/project manager, January 13, 2022). Within the bank, the responsibilities for the model and model output are not officially defined (Personal Communication marketeer, January 14, 2022). Reproducibility of model output is covered by the external ML engineering/consulting firm. However, the final outcome, that is, which customers are selected and which are dismissed, is not easy to reproduce, because the final selection of customers is overwritten every week. If needed, the marketing intelligence analyst could match the model output data with the customers that have been emailed to trace back the final selection (Personal Communication Marketing Intelligence Analyst, January 21, 2022).

5.1.8.2 General Insights on Accountability

The lack of accountability of ML applications is seen as a big issue among civil society organisations and regulatory bodies. Issues related to accountability can be divided into internal and external accountability issues. Internal accountability issues are issues within organisations. In the financial sector, the board level is ultimately accountable for the use of ML models within the organisation, whereas board members often do not fully understand what the use of ML entails, due to a lack of knowledge and skills (Personal Communication representative DNB, January 18, 2022). There are few experts within and outside of banks that understand how ML models really work. This could lead to board members not being aware of the risks and the impact of ML models on the organisation and the larger financial system (Personal Communication representative Waag, January 19, 2022). External accountability includes situations where people affected by a faulty decision made using a ML application seek justice. Here, the lack of explanation of the model and the broader process, as well as the absence of channels through which people can defend themselves, are core issues. These problems are partly caused by the secretive way in which many such systems are used by organisations, outside of the awareness of subjects. While the GDPR requires organisations to be transparent about the ML models that are used, the AP notices a lack of transparency and proactive communication among such organisations (Personal Communication representative AP, January 24, 2022). When people are not informed about what is going on, it is impossible for them to detect errors in the outcome (Personal Communication representative Platform Bescherming Burgerrechten, January 18, 2022). As such outcomes can have a large impact on individuals or groups of people, the outcomes should be explainable, and the organisations deploying the systems should be held accountable, which is often not the case in practice (Personal Communication representative Bits of Freedom, January 17, 2022).

5.2 Results of an Inductive Analysis: Challenges in ML Practice

The inductive analysis resulted in seven challenges, which are discussed in the following subsections.

5.2.1 Challenge 1: Defining the System Boundaries for Design, Analysis and Governance

In the specification of the ML lifecycle, the system boundaries need to be defined. These boundaries determine which components and interactions are taken into account in designing and analyzing the ML model in its context, and which stakeholders should be involved. When the system boundary is defined too narrowly at the start, this has consequences for the efficiency and effectiveness of the system development and may be a source of vulnerabilities.

A ML model does not operate in isolation, but becomes part of a larger sociotechnical system. Not including this larger system within the system boundaries may lead to issues later in the ML lifecycle. This was observed in the financial crime detection use case. The end-users of the ML application, the transaction monitoring analysts, were not involved from the start; they were only brought in once the model had been developed and was ready to be tested and integrated into their work. For the end-users, the introduction of the ML model required a great shift in their way of working. Moreover, they were not able to interpret the model output, as the feature descriptions they had to use were defined in technical language. Therefore, a translation of the feature descriptions had to be specified ad hoc, and working instructions needed to be created by a delegation of the transaction monitoring analysts.

As illustrated, defining the boundaries too narrowly is a source of issues. On the other hand, it is not feasible to involve everyone and specify every detail within the sociotechnical system from the beginning. Defining adequate system boundaries thus involves inherent trade-offs.

5.2.2 Challenge 2: Dealing with Emergence in the Sociotechnical Context

Not every detail within the sociotechnical system can be specified at the beginning. This is especially the case because vulnerabilities may emerge over time, through interactions between different system components, and through dynamics within and beyond the sociotechnical context.

In the use case, several efforts are made to deal with emergence over time. For example, the ML models’ performance is monitored over time using metrics such as accuracy, prediction volume and feature distributions. Moreover, feedback from the end-users is collected to improve the ML models over time.
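As an illustration of such monitoring, the sketch below computes one common drift measure, the population stability index (PSI), between a reference sample and a recent sample of a feature or prediction score. This is a minimal example and not the bank's actual monitoring setup; the synthetic data and the 0.2 alert threshold are assumptions.

```python
import numpy as np

def population_stability_index(reference: np.ndarray,
                               current: np.ndarray,
                               bins: int = 10) -> float:
    """Illustrative drift metric: PSI between a reference and a current
    sample of one feature (or of the model's prediction scores)."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    # Avoid division by zero and log(0) for empty bins.
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Example: flag a feature for review if drift exceeds an assumed threshold of 0.2.
rng = np.random.default_rng(0)
reference = rng.normal(0, 1, 5000)   # e.g. feature values at training time
current = rng.normal(0.4, 1, 5000)   # e.g. feature values in the last week
if population_stability_index(reference, current) > 0.2:
    print("Feature distribution shifted - investigate before the next model update")
```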

However, blind spots on the emergent dimensions were also identified in the use case. First, the behaviour dimension is hardly addressed; this could, for example, lead to human decision-makers overly relying on ML model outputs due to time pressure. Second, the adaptation dimension is insufficiently considered, as the way of working was not part of the specification in the use case. Third, the feedback gathered from end-users is used to improve the ML applications, but this does not take into account that such feedback may be subject to human bias or noise, which can then be incorporated into technical updates of the model and application.

As illustrated, vulnerabilities may emerge after deployment, when a ML application is used in its sociotechnical context. Awareness of these emergent dimensions is needed, yet how to address them remains challenging.

5.2.3 Challenge 3: Understanding the Risks of Human-ML Interactions in Decision-Making Processes

A human intervention in a decision-making process is mandatory if automated decisions affect a person to a significant degree, as prescribed by the GDPR. However, a human-in-the-loop brings about new potential vulnerabilities that can lead to harmful outcomes of a ML application, such as automation bias, confirmation bias, disparate interactions, and limited time for the job.

In the financial crime detection use case, human decision-makers should be able to deliberately use the model output for their final decision. To do so, they have to understand the model output sufficiently. Initially, human decision-makers experienced difficulties in understanding the model output and using it in their work. This was eventually addressed using explainability techniques, which guide the human decision-makers in using the model output. This helps them, but the introduction of the ML application in their work has increased the risk of deviations in interpretation, as the complexity of the information has grown.

Furthermore, the introduction of a ML application can impose vulnerabilities in the behaviour and adaptation dimensions. On both dimensions, awareness is lacking in the use cases, as these dimensions did not come forward in the interviews.

Although the introduction of a human decision-maker in a ML decision-making process is often mandated by the GDPR, it should not be seen solely as a means to take away vulnerabilities. Rather, it may introduce new ones. As the current focus in the specification of use cases is mainly on the ML model itself, the specification and design of the surrounding decision-making process and its outcomes may receive less priority. As the ML engineer within the bank summarized: “you can have a very good model, but if the decision-making process around it is badly designed, a very good model can lead to terrible outcomes.”

5.2.4 Challenge 4: Recognising the Hazards Related to Data in the ML Lifecycle

As data is at the core of ML applications, it has a large influence on the final output of the model and on the potential for vulnerabilities to emerge. Biases in data, poor data quality, and changes in data are thus important vulnerabilities to consider.

As the representative of De Nederlandsche Bank (DNB), the Dutch central bank, pointed out: “data management, governance and quality is at least as important as the development of models, whereas it is a relatively small part of the discussion”. This statement is corroborated in the use cases. In both banks, data management lies in a different department than the one where the ML applications are developed. The data science department has to follow strict governance processes to be able to launch a new model or model version. However, the data sources are not covered by these strict governance processes. As such, changes can be made to the data that directly influence the ML models and their output. Keeping a grip on this requires continuous alignment between the departments. This organisational complexity can be a source of vulnerabilities when changes are not communicated within the organisations.
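To illustrate how such alignment could be partly supported technically, the sketch below checks an incoming data batch against expectations agreed with the data-owning department, so that upstream changes surface before they silently propagate into model output. It is a minimal sketch under assumed column names, dtypes, and value ranges; organisational alignment between departments remains necessary.

```python
import pandas as pd

# Hypothetical expectations for the input data, agreed with the data-owning department.
EXPECTED_SCHEMA = {"customer_id": "int64",
                   "transaction_amount": "float64",
                   "country_code": "object"}
VALUE_RANGES = {"transaction_amount": (0.0, 1_000_000.0)}

def validate_input(batch: pd.DataFrame) -> list[str]:
    """Return a list of issues if upstream data no longer matches expectations."""
    issues = []
    for column, dtype in EXPECTED_SCHEMA.items():
        if column not in batch.columns:
            issues.append(f"missing column: {column}")
        elif str(batch[column].dtype) != dtype:
            issues.append(f"dtype changed for {column}: {batch[column].dtype}")
    for column, (lo, hi) in VALUE_RANGES.items():
        if column in batch.columns and not batch[column].between(lo, hi).all():
            issues.append(f"values out of agreed range for {column}")
    return issues
```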

5.2.5 Challenge 5: Developing Knowledge and a Shared Language Across Different Actors

Different types of knowledge exist and are needed within the sociotechnical context of ML applications. These have to be developed, shared and maintained. Furthermore, stakeholders with different types of knowledge need a shared vocabulary to effectively communicate in developing, operating and governing such sociotechnical systems.

Firstly, there are few experts on ML within banks and in external organisations. This has consequences that could result in vulnerabilities. For instance, one of the banks does have ML experts in-house, but these experts often leave the bank and new ones are hired. This makes maintaining knowledge on models that run in production a challenge, and may even lead to models being discontinued due to a lack of knowledge of their operations and risks. To address this, extensive documentation on the models is created and kept up to date.

Secondly, the expertise of ML experts and the expertise of end-users on the process and context in which the ML model is used may be hard to connect. In both use cases, the people who ultimately have to work with the ML application output are not ML experts. Conversely, ML experts are not operational experts. As such, to develop a ML application that can be integrated into the way of working of operational experts, communication between the different stakeholders is needed for alignment. However, this communication is often lacking in organisations. To illustrate with the use case: the operational experts were only involved when the ML application development was finished. Furthermore, the model validation team only validated the models; it did not communicate with operational experts to validate whether the models could actually be integrated into the way of working, nor did it assess possible hazards or undesirable impacts in the operating or decision-making process.

5.2.6 Challenge 6: Providing Transparency of ML Models and Process Outcomes

Transparency of ML applications is challenging on multiple levels. First, ML models are known for their opacity and are therefore also referred to as ’black boxes’. It may be challenging or even impossible to understand why or how a ML model arrives at a certain output. To address this opacity, explainability techniques are used in the financial crime detection use case. These provide the end-users with some guidance in understanding the model output.
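As an illustration of such a technique, the sketch below applies SHAP, one widely used post-hoc explainability method, to a synthetic tree-based classifier standing in for an alert-scoring model. The interviews did not disclose which technique the bank uses, so the model, features, and library choice here are assumptions.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

# Illustrative only: a synthetic binary classification task standing in for an
# alert-scoring model; feature meanings are hypothetical.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 7))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# SHAP attributes each prediction to per-feature contributions that an
# end-user can inspect, partially opening up the 'black box'.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])   # contributions for five example alerts
print(np.shape(shap_values))                 # per-class, per-feature attribution array
```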

Interviews with civil society organisations, regulators, and other external organisations pointed out that organisations that use ML are often not transparent about it. As a result, civil society organisations struggle to gain insight into and address ML applications that impose vulnerabilities that can lead to harm for citizens. They therefore often only gain insight once harm has already been done, whereas it would be better to prevent harm. Furthermore, the non-transparency of organisations raises the question of why they would not be transparent: do they have something to hide? This strengthens the distrust of civil society organisations towards organisations that use ML. Governance approaches and design efforts to use ML responsibly thus also remain unseen and unacknowledged.

Lastly, there are reasons why organisations might be non-transparent about the use and the inner workings of ML applications. Being transparent may hamper their competitive position and compromise their intellectual property. Besides that, if people know how a ML model makes decisions, they could game the results by adapting their behaviour to obtain a certain outcome. This may impose risks, for example if the ML application is used to detect financial crime.

5.2.7 Challenge 7: Operationalizing Regulations Applicable to ML Applications

The financial sector is highly regulated, but what this means for the use of ML and what future regulations will require from organisations has not yet crystallised.

The enforcement of legislation lies in the hands of several regulatory bodies, among which the Dutch Data Protection Authority (Autoriteit Persoonsgegevens, AP) and the DNB. Furthermore, financial institutions have extensive responsibilities to organise internal supervision and compliance with the GDPR and other legislation. According to the interviews with the AP, organisations have to develop a particular level of maturity to be able to do so effectively. A privacy officer in one bank confirmed that the bank has organised its own internal supervision, but that there is little external supervision. This lack of external supervision could lead to undetected risks in the use of ML within the financial sector. Furthermore, the daily supervision of the DNB hardly looks at ML applications at the moment. This will change in the future with the EU AI Act, which gives the DNB a mandate to do so.

As illustrated, current legislation and the proposed EU AI Act are considered insufficient to enforce a responsible way of developing and using ML applications. Currently, banks are left free to organise this themselves, but the privacy officer of one bank points out that guidance from external higher authorities on how to deal with ML would be very welcome.

It is still too early to know what the final EU AI Act will look like and how it will affect organisations. The GDPR caused a substantial increase in awareness of data protection and privacy among organisations, so the question is whether the AI Act will have a similar effect on AI within organisations. How the AI Act is adopted in organisations will in turn influence the position regulatory bodies take.

6 Guidelines for Sociotechnical Specification

In the final stage of the Design Science Research (DSR) project, we developed an artefact that translates the validated dimensions and identified challenges into an updated practice for MLOps, extending the initially mapped practice (see Fig. 1). This practice contains (1) an iterative lifecycle that operationalizes a sociotechnical lens on ML applications across the ML lifecycle, adding a dedicated subpractice stage on sociotechnical specification, and (2) a set of ten guidelines for organizations to establish this practice.

We did not empirically test or validate the new practice and guidelines, and as such we do not cover these in great detail in this paper. Instead, we provide a list of the guidelines and a depiction of the lifecycle (in Fig. 4). The guidelines do form a pragmatic on-ramp for addressing vulnerabilities of ML models in their sociotechnical context, with a comprehensive set of implications for organizations building, deploying and governing these systems:

  • Guideline 1: Establish a multidisciplinary team at the beginning of the ML lifecycle

  • Guideline 2: Define the system boundaries as multidisciplinary team

  • Guideline 3: Enable the identification, addressing, and mitigation of vulnerabilities in the sociotechnical specification

  • Guideline 4: Formulate an initial specification of the sociotechnical system before starting the experimental stage of the sociotechnical ML lifecycle

  • Guideline 5: Create feedback channels for different stakeholders during the development and operations of the sociotechnical system

  • Guideline 6: Specify monitoring and evaluation mechanisms for the sociotechnical system in operation

  • Guideline 7: Verify and validate the sociotechnical system before operationalizing

  • Guideline 8: Establish transparency of the sociotechnical system, about its development, design, use and governance

  • Guideline 9: Create knowledge and communication between stakeholders in the sociotechnical system

  • Guideline 10: Establish a safe culture and adequate management within the organisation

Interested readers are referred to Wolters (2022) for a more detailed explanation of and reflection on these guidelines.

Figure 4 presents the sociotechnical ML lifecycle and the guidelines that relate to the different activities. The arrows represent activities that follow one another and feedback channels that may lead to changes in the deliverables of earlier stages. As can be seen, the activities in the sociotechnical specification are not linear, which is why they are not connected with arrows. The only requirement is that an initial specification should be formulated before moving to the experimental stage. Similarly, the monitoring and evaluation of the sociotechnical system in operation co-exist and thus also do not represent a linear process. Lastly, guidelines 8, 9, and 10 are presented separately, as they do not address a specific activity but should be taken into account across the organization and throughout the sociotechnical ML lifecycle.

Fig. 4

Visualisation of the sociotechnical ML lifecycle with the developed guidelines. The larger boxes are the different stages of the ML lifecycle. Within each stage, the smaller boxes denote activities. The blue-colored boxes numbered GL1-GL10 denote the activities and points in the lifecycle where the guidelines apply

7 Conclusions and Future Work

This paper provides an empirically-driven conceptualization and validation of vulnerabilities in the sociotechnical context of ML applications. Furthermore, our interview analysis identified a set of challenges that need to be addressed to properly account and design for the context-specific and emergent issues related to sociotechnical complexity. The design science research methodology allowed us to funnel these insights into a set of guidelines that offer a pragmatic framework for practitioners in development and MLOps, as well as for those involved in the domain of application as policy makers or administrators/managers, to address sociotechnical vulnerabilities in the design, development, use and governance of machine learning in its sociotechnical systems context.

Beyond practitioners, the results also contribute to civil society organisations and regulatory bodies, as the theoretical framework supports the articulation of vulnerabilities that are often encountered in the field but are still hard to express in a language understood by policymakers and ML application developers. In sharing our results with these interviewees, we received promising feedback that the vulnerability dimensions and guidelines may help to form a shared lexicon to build more effective bridges between the different communities addressing the risks and hazards of ML-based systems.

The research also provides a scientific knowledge base, which can be used in future research on vulnerabilities in sociotechnical systems. Moreover, the empirically validated vulnerability categories provide a useful theoretical lens for research addressing sociotechnical complexity of ML in practice.

Four directions to build upon this research are recommended. First, evaluation and demonstration of the guidelines in real-life ML use cases is recommended. This research did not involve an iterative step between design and evaluation, but concluded with a first design of the guidelines. Evaluation and demonstration are needed to research the value of using the guidelines in actual ML use cases and to iteratively improve them. Second, researching other ML use cases within different organisations in the financial sector and in other sectors would enrich the research output. As the financial sector is highly regulated and risk-averse, it would be insightful to include use cases from less regulated and more risk-seeking sectors to complement this research. Third, this research widened the technical view that dominates the ML field towards a sociotechnical systems view. The next step is to widen this view further by addressing the interactions with broader organisational and institutional mechanisms. Lastly, it is relevant to note that this study was performed in late 2021 and early 2022, and hence precedes the rapid rise of generative AI tools since the end of 2022. Nonetheless, the findings in this paper largely apply, although they may need to be nuanced or extended for more recent generative AI applications.