Abstract
Artificial Intelligence and data science are rapidly gaining importance as parts of decision support systems. As these systems improve, it becomes necessary to clarify humans’ roles in the decision-making processes. Humans may not be able to improve on the choices a good algorithm makes, they may not be able to adjust the parameters of the algorithm correctly, and their role in processes that use good algorithms may be limited. However, this does not mean human involvement in data-supported decision processes is unnecessary. A closer look at the analytical process reveals that each step entails human decisions, beginning with the data preparation through the choice of algorithms, the iterative analyses, and the display and interpretation of results. These decisions may affect the following steps in the process and may alter the resulting conclusions. Furthermore, the data for the analyses often result from recordings of human actions that do not necessarily reflect the actual recorded events. Data for certain events may often not be recorded, requiring a “big-data analysis of non-existing data.” Thus, adequate use of data-based decisions requires modeling relevant human behavior to understand the decision domains and available data to prevent possible systematic biases in the resulting decisions.
In our current “age of data”, Artificial Intelligence (AI), Machine Learning (ML), Data Science (DS), and analytics are becoming part of problem-solving and decision-making in many areas, ranging from recommendations for movies and music to medical diagnostics, the detection of cybercrime, investment decisions or the evaluation of military intelligence (e.g., McAfee & Brynjolfsson, 2012). These methods can be used because an abundance of information is collected and made available. Also, the tools for analyzing such information are becoming widely accessible, and their use has become easier with platforms such as BigML. While in the past, statisticians or data scientists were in charge of the analytics process, now anybody with some basic computing skills can conduct analyses with R or Python, using open-source tools and libraries.
These developments are the basis for new insights into, and a better understanding of, social and physical settings. They also alter the decision processes used by organizations and the information that is available to individuals. As such, they affect reality, its representation in digital records and the media, and the ways people interpret this reality and act in it. The dynamic interaction between the physical, digital, and social realms shapes current societies. Understanding and modeling it is a major challenge for both data science and the social sciences.
Data analytics, and the information one can gain from them, can be used in decision-making processes, in which they help to choose among possible alternatives. Algorithmic decisions can be advantageous in legal contexts, such as bail decisions (Kleinberg, Lakkaraju, Leskovec, Ludwig, & Mullainathan, 2018). In medical settings, the development of personalized evidence-based medicine for diagnostic or treatment decisions (Kent, Steyerberg, & van Klaveren, 2018) depends on analyzing electronic medical records with data science tools. AI-based analyses in medicine can indeed improve diagnostic or therapeutic decisions (Puaschunder, Mantl, & Plank, 2020). Similarly, algorithms in financial markets, implemented as algorithmic advisors or in algorithmic trading, can provide clear benefits (Tao, Su, […]
[…] (Mapping the data shadows of Hurricane Sandy: Uncovering the sociospatial dimensions of “big data”. Geoforum, 52, 167–179. https://doi.org/10.1016/j.geoforum.2014.01.006, 2014). This was the strongest hurricane that had hit the New York City area in recorded history. There is indeed a strong correlation between Twitter activities and the strength of a storm, but there were very few Twitter activities in the areas in which the storm was the strongest. Two causes can explain the nonmonotonic relation between Twitter activities and storm strength. Both are related to the physical realm. One is that, very often, people flee an area after they receive a hurricane warning and are told to evacuate a certain area, so they will not tweet anymore from this area. A second reason may be that storms tend to topple cellular towers. So even if people remained in the area, they may not have been able to communicate, causing a decrease in communication activity in these areas.
These are examples of nonexisting data of existing events that result from a biased or partial recording of data. They are due to the physical properties of the data collection process or of the events that generate the data in physical reality. However, the selectivity of the data does not only depend on the external statistics of the physical properties of the world. It may also result from specific human actions that may create a somewhat partial view of reality. For instance, a study of credit card data in a country in which there was social unrest showed that the effect of the localized unrest (which mainly involved large demonstrations in specific locations in a metropolitan area) diminished with distance from the demonstrations, as expressed in the number of purchases and the amounts of money spent on a purchase (Dong, Meyer, Shmueli, Bozkaya, & Pentland, 2018). This effect was not the same for all parts of this society. Some groups of the population showed a greater change than others. However, when interpreting these results, we need to keep in mind that we have only partial data on the economic activities in this country during this era of unrest because we only have credit card data. People in this country also use cash, and the decrease in credit card purchases may only reveal part of the picture.
Another factor that affects the digital records of behavior that can be analyzed is the fact that some behaviors will be more easily recorded while others are less so. For instance, on social media, socially desirable and high-prestige behavior will appear more often in posts than less desirable behavior. Viewers, consequently, may feel that others are more engaged in these positively valued behaviors than they themselves are (Chou & Edge, 2012). Also, the digital image of the world that may emerge from scraping social media data will present a biased view, possibly overrepresenting the behaviors people like to post about on the web. Any decisions made based on these data, for instance, concerning the public investment in different facilities for leisure activities or the development of product lines for after hours, may be biased and may be misled by people’s tendency to post about some things and not post about others.
Another example of the partial representation of the physical or social reality in data is demonstrated in Omer Miran’s master’s thesis (Miran, 2018). The study dealt with the analysis of policing activity in the UK, as expressed in the data the UK police uploaded to their website. Making police data openly available allows the public to monitor police activities. It also provides the basis for the assessment of the risk of crime in different areas. This can, for instance, help individuals in their decisions about where to live, rent or buy an apartment, and raise their kids.
The study aimed to determine the relative frequency of different types of crimes in different parts of the UK, where each part was defined by the specific police station that oversaw an area. The analysis combined information from four databases. The most important one is the UK police “crime cases database” for the years 2010–2015, in which all reported crime incidents are recorded with relatively rough geographical information. A second database covers police stop and search activities for the year 2014, also downloaded from the UK police site. Here, the location at which a person was stopped is also recorded. Two other databases were from the UK Office for National Statistics and included population size and the average weekly for different locations.
The analysis focused on two different types of crime—burglary and drug-related crime. In a burglary, one or more people enter a location (a house, business, etc.) without permission, usually with the intention of committing theft. One can assume that a burglary will almost always be reported to the police and will appear in the records. Therefore, the number of burglary incidents in police records likely reflects the actual frequency of burglaries in an area.
The second type of crime was drug-related crime, such as drug deals. Here, the people involved will usually not report the crime. Consequently, a drug-related crime will usually only appear in the police files if the police make an active effort to detect it. Hence, the data on drug-related activities do not really reflect the volume of such activities in an area but rather the police activity in the area.
The analyses of the data showed that there was no correlation between the amount of police activity in an area (as measured through the number of stop and search events in the area) and the number of burglary events (r = −0.047). However, there was a positive correlation between police activity and recorded drug-related crimes (r = 0.180). Thus, the two types of crime data indeed reflect somewhat different types of events, namely the activity of criminals (in the burglary data) and the activity of the police (in the drug-related crime data). These two types of activities can, of course, be correlated or can be related to other variables that characterize the location.
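The mechanism described above can be illustrated with a small simulation. This is only a sketch on invented data, not a reanalysis of the UK police records: the number of areas, the reporting rate of 0.95, the detection factor of 0.02, and all value ranges are assumptions chosen for illustration. Recorded burglaries are made to depend only on the true number of burglaries, while recorded drug crimes depend on both the true number of deals and the police activity in the area.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_areas = 2000          # hypothetical number of police areas

# Synthetic areas: every quantity below is invented for illustration.
police_activity = [rng.uniform(0.0, 1.0) for _ in range(n_areas)]  # detection effort
true_burglaries = [rng.uniform(50, 150) for _ in range(n_areas)]
true_drug_deals = [rng.uniform(1000, 5000) for _ in range(n_areas)]

# Burglaries are reported by victims, so records track the true count.
recorded_burglaries = [0.95 * b for b in true_burglaries]
# Drug deals enter the records only when the police detect them.
recorded_drug_crimes = [0.02 * p * d
                        for p, d in zip(police_activity, true_drug_deals)]

r_burglary = pearson(police_activity, recorded_burglaries)
r_drugs = pearson(police_activity, recorded_drug_crimes)
r_true_deals = pearson(police_activity, true_drug_deals)

print(f"police activity vs recorded burglaries: r = {r_burglary:.3f}")
print(f"police activity vs recorded drug crime: r = {r_drugs:.3f}")
print(f"police activity vs true drug deals:     r = {r_true_deals:.3f}")
```

In this simulation the true number of drug deals is unrelated to police activity by construction, yet the recorded drug crimes correlate with it. The simulated correlation is much stronger than the r = 0.180 reported in the study, because the sketch leaves out every other source of variation.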
The analysis of the police databases revealed additional clear differences between the picture of reality they provide and the actual reality. In the UK Home Office drug survey for 2013, 2.8%, or 280 out of 10,000, adults aged 16 to 59 reported using illicit drugs more than once a month in the last year. Assuming that these people purchased drugs once a month, they were involved in approximately 12 × 280 = 3,360 drug deals in a year. In the UK police data set, the yearly average of recorded drug-related crimes was about 28.7 per 10,000 people. Thus, fewer than 1% of the estimated drug deals (28.7 / 3,360 ≈ 0.85%) appear in police data. This demonstrates the large potential gap between the image of the reality that appears in the analysis of data and the actual reality this image is supposed to reflect.
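The back-of-the-envelope estimate above can be written out as a short calculation. It uses only the figures quoted in the text. The variable names are illustrative, and the comparison is rough because the survey rate refers to adults aged 16 to 59 while the police figure is per 10,000 people.

```python
# Figures quoted in the text, each per 10,000.
frequent_users = 280          # 2.8% of 10,000 adults aged 16 to 59
purchases_per_user = 12       # assumption from the text: one purchase a month
recorded_drug_crimes = 28.7   # yearly average in the UK police data

estimated_deals = frequent_users * purchases_per_user
recorded_share = recorded_drug_crimes / estimated_deals

print(f"estimated drug deals per year: {estimated_deals}")       # 3360
print(f"share appearing in police data: {recorded_share:.2%}")   # 0.85%
```

Because users who buy more than once a month would raise the number of deals, the share of deals that reaches the records is, if anything, smaller than this estimate.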
Conclusions
The availability of data can have great value for decision-making. For instance, data-based decisions may lower the effects of biases due to faulty preconceptions or naïve beliefs. Also, many processes, such as controlling large-scale networks or high-frequency trading in financial markets, are only possible with algorithms and must rely on data.
The use of data science and AI in decision-making can often provide valuable information, but the process is not without potential problems. One needs to keep in mind that the data analysis process is a human activity that involves numerous decisions along the way. Each of them impacts the following steps in the process and the eventual outcome. It is important to monitor these decisions and to test the sensitivity of the conclusions to specific changes in the decisions made along the process. Furthermore, the analytics process often concerns human activities. The records they generate depend on the decisions of those who do the recording and, to some extent, the people whose behavior is recorded.
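One way to test the sensitivity of a conclusion to decisions made along the process is to rerun the same analysis under several defensible variants and check whether the conclusion changes. The following is a minimal sketch on invented numbers. The two groups, the cutoff of 100, and the three variants are assumptions for illustration only.

```python
from statistics import mean, median

# Invented purchase amounts for two hypothetical customer groups.
group_a = [12, 15, 14, 13, 16, 15, 14, 250]   # contains one extreme value
group_b = [18, 20, 19, 21, 22, 20, 19, 21]

def drop_above(values, cutoff):
    """Preprocessing decision: discard observations above a cutoff."""
    return [v for v in values if v <= cutoff]

# Each entry is one defensible combination of analyst decisions.
analysis_variants = {
    "mean, keep all data": lambda v: mean(v),
    "mean, drop values > 100": lambda v: mean(drop_above(v, 100)),
    "median, keep all data": lambda v: median(v),
}

conclusions = {}
for name, summarize in analysis_variants.items():
    a, b = summarize(group_a), summarize(group_b)
    conclusions[name] = "A > B" if a > b else "B > A"
    print(f"{name:26s} A = {a:6.2f}  B = {b:6.2f}  ->  {conclusions[name]}")

# The conclusion is robust only if every variant agrees.
robust = len(set(conclusions.values())) == 1
print("conclusion robust to these decisions:", robust)
```

Here the comparison of the two groups flips depending on how the single extreme value is handled, so the conclusion would have to be reported as dependent on that preprocessing decision.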
The development of data-based decision-making or support tools requires a combined modeling effort. On the one hand, the usual analytics modeling process needs to proceed, aiming to generate models that can identify the preferable choices in different settings. A model in this context would be the output of the algorithm used for the analytics process, together with information about the quality of the output, compared to some criterion. Often this information comes from testing the model, which was computed on a training set of data, on a separate, independent data set, the test set. An additional output of the algorithmic process can be information on feature importance, identifying the relative importance of different variables for predicting the outcome variable.
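The two outputs mentioned above, performance on an independent test set and feature importance, can be sketched in a few lines. The example is illustrative only. The data are synthetic, the nearest-centroid classifier is a deliberately simple stand-in for whatever algorithm an analyst would actually use, and permutation importance is just one of several ways to quantify feature importance.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

def make_row():
    """One synthetic observation (features, label). Invented data."""
    label = rng.choice([0, 1])
    informative = rng.gauss(2.0 * label, 1.0)   # shifts with the label
    noise = rng.gauss(0.0, 1.0)                 # unrelated to the label
    return [informative, noise], label

data = [make_row() for _ in range(1000)]
train, test = data[:700], data[700:]            # test set is held out

def fit_centroids(rows):
    """Model: the mean feature vector of each class."""
    centroids = {}
    for cls in (0, 1):
        members = [x for x, y in rows if y == cls]
        centroids[cls] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def predict(centroids, x):
    """Assign the class whose centroid is closest (squared distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda cls: dist(centroids[cls]))

def accuracy(centroids, rows):
    return sum(predict(centroids, x) == y for x, y in rows) / len(rows)

def permutation_importance(centroids, rows, feature):
    """Drop in accuracy when one feature's values are shuffled across rows."""
    shuffled = [x[feature] for x, _ in rows]
    rng.shuffle(shuffled)
    permuted = [(x[:feature] + [v] + x[feature + 1:], y)
                for (x, y), v in zip(rows, shuffled)]
    return accuracy(centroids, rows) - accuracy(centroids, permuted)

model = fit_centroids(train)                    # fitted on training data only
test_accuracy = accuracy(model, test)           # evaluated on unseen data
importances = {name: permutation_importance(model, test, i)
               for i, name in enumerate(["informative", "noise"])}

print(f"test accuracy: {test_accuracy:.3f}")
print("permutation importance:", importances)
```

Shuffling the informative feature should reduce accuracy substantially, while shuffling the noise feature should leave it nearly unchanged. Such a table tells the analyst which variables the model relies on, but not why those variables carry information.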
This should be accompanied by a modeling effort that develops more traditional social sciences models based on psychological, sociological, economic, or other disciplines. These models can be used to model the behavior that is related to the analytics process (choices made regarding the questions asked, the selection of the data, the preprocessing of data, the choice of algorithms and their parameters, the presentation of results, the interpretation, the implementation of insights gained). The models can also be related to behaviors that generate the data that is analyzed, as shown in the examples of drug-related crimes or social media posts during emergencies.
Thus, traditional modeling techniques and data science methods should be combined. Such a combination has the potential to improve decisions and the utilization of data. One can take several steps to achieve this goal. First, data scientists (who often have computer science, mathematics, or engineering backgrounds) should be trained in social sciences. This would give them some critical analytical skills that will allow them to question assumptions behind the analyses and the behaviors that are represented in the data. The data scientists would detach themselves from the mechanistic process of taking input, running analyses, and interpreting the results only in terms of the input variables and the model output, with the feature importance tables and other output data. Analyses of results in view of theories in the social sciences can provide a deeper understanding of phenomena beyond what is possible with a-theoretical analyses.
Also, interdisciplinary teams should analyze, evaluate, or implement the results of data science processes that are used in decision-making. The output of these processes needs to be critically assessed, and the value of the insights gained through the process needs to be calculated. It is important to determine how the information can actually be implemented in the operation of the organization. This requires conducting sensitivity analyses that evaluate the procedures and their robustness.
A critical view of the analytics process and of the implementation of its results is particularly important because data-science-based decision support always depends on the particular data that served as input for the algorithm. Dynamic changes in the data may cause predictions to become less (or sometimes more) precise. The relevance of the data for the decisions may also change with time because options become more available or less expensive or because new alternatives arise.
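The dependence of predictions on the data that served as input can be made concrete with a small simulation. This is a sketch under stated assumptions. The linear relation, the slopes of 2.0 and 1.2, and the sample sizes are invented, and a simple least-squares line stands in for the predictive model.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is reproducible

def simulate(n, slope):
    """Invented data: the outcome depends linearly on x with the given slope."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [slope * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x, returned as (a, b)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def mean_abs_error(model, xs, ys):
    a, b = model
    return sum(abs(y - (a + b * x)) for x, y in zip(xs, ys)) / len(xs)

old_x, old_y = simulate(500, slope=2.0)     # period used to build the model
same_x, same_y = simulate(500, slope=2.0)   # later data, unchanged behavior
new_x, new_y = simulate(500, slope=1.2)     # later data, changed behavior

model = fit_line(old_x, old_y)
error_unchanged = mean_abs_error(model, same_x, same_y)
error_drifted = mean_abs_error(model, new_x, new_y)

print(f"error on unchanged data:          {error_unchanged:.2f}")
print(f"error after the relation changed: {error_drifted:.2f}")
```

Monitoring the prediction error on recent data in this way is one simple signal that the relation between the data and the decision may have changed and that the model should be re-examined.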
We need to combine traditional social science methods, such as methods in economics, political science, geography, sociology, and psychology, with the methods used in analytics and data science. There should be a dynamic interplay between the two approaches to phenomena. The combined use of the two has the potential to create a synergy that can lead to better decision-making processes and better decisions. It can also provide insights into the dynamic shaping of reality, following the use of data science, and the effects human behavior has on the data science process.