In our current “age of data”, Artificial Intelligence (AI), Machine Learning (ML), Data Science (DS), and analytics are becoming part of problem-solving and decision-making in many areas, ranging from recommendations for movies and music to medical diagnostics, the detection of cybercrime, investment decisions or the evaluation of military intelligence (e.g., McAfee & Brynjolfsson, 2012). These methods can be used because an abundance of information is collected and made available. Also, the tools for analyzing such information are becoming widely accessible, and their use has become easier with platforms such as BigML. While in the past, statisticians or data scientists were in charge of the analytics process, now anybody with some basic computing skills can conduct analyses with R or Python, using open-source tools and libraries.

These developments are the basis for new insights into, and a better understanding of, social and physical settings. They also alter the decision processes used by organizations and the information that is available to individuals. As such, they affect reality, its representation in digital records and the media, and the ways people interpret this reality and act in it. The dynamic interaction between the physical, digital, and social realms shapes current societies. Understanding and modeling it is a major challenge for both data science and the social sciences.

Data analytics, and the information one can gain from them, can be used in decision-making processes, in which they help to choose among possible alternatives. Algorithmic decisions can be advantageous in legal contexts, such as bail decisions (Kleinberg, Lakkaraju, Leskovec, Ludwig, & Mullainathan, 2018). In medical settings, the development of personalized evidence-based medicine for diagnostic or treatment decisions (Kent, Steyerberg, & van Klaveren, 2018) depends on analyzing electronic medical records with data science tools. AI-based analyses in medicine can indeed improve diagnostic or therapeutic decisions (Puaschunder, Mantl, & Plank, 2020). Similarly, algorithms in financial markets, implemented as algorithmic advisors or in algorithmic trading, can provide clear benefits (Tao, Su, […]

[…] (“[…] the data shadows of Hurricane Sandy: Uncovering the sociospatial dimensions of ‘big data’”, Geoforum, 52, 167–179, https://doi.org/10.1016/j.geoforum.2014.01.006; 2014). This was the strongest hurricane that had hit the New York City area in recorded history. There is indeed a strong correlation between Twitter activities and the strength of a storm, but there were very few Twitter activities in the areas in which the storm was the strongest. Two causes can explain the nonmonotonic relation between Twitter activities and storm strength. Both are related to the physical realm. One is that, very often, people flee an area after they receive a hurricane warning and are told to evacuate, so they no longer tweet from this area. A second reason may be that storms tend to topple cellular towers. So even if people remained in the area, they may not have been able to communicate, causing a decrease in communication activity in these areas.

These are examples of nonexisting data on existing events, resulting from a biased or partial recording of data. They are due to the physical properties of the data collection process or of the events that generate the data in physical reality. However, the selectivity of the data does not depend only on the external statistics of the physical properties of the world. It may also result from specific human actions that create a partial view of reality. For instance, a study of credit card data in a country in which there was social unrest showed that the effect of the localized unrest (which mainly involved large demonstrations in specific locations in a metropolitan area) diminished with distance from the demonstrations, as expressed in the number of purchases and the amount of money spent per purchase (Dong, Meyer, Shmueli, Bozkaya, & Pentland, 2018). This effect was not the same for all parts of this society. Some groups of the population showed a greater change than others. However, when interpreting these results, we need to keep in mind that we have only partial data on the economic activities in this country during this period of unrest because we only have credit card data. People in this country also use cash, and the decrease in credit card purchases may reveal only part of the picture.

Another factor that affects the digital records of behavior that can be analyzed is that some behaviors are more easily recorded than others. For instance, on social media, socially desirable and high-prestige behavior will appear more often in posts than less desirable behavior. Viewers, consequently, may feel that others are more engaged in these positively valued behaviors than they themselves are (Chou & Edge, 2012). Also, the digital image of the world that may emerge from scraping social media data will present a biased view, possibly overrepresenting the behaviors people like to post about on the web. Any decisions made based on these data, for instance, concerning the public investment in different facilities for leisure activities or the development of product lines for after hours, may be biased, misled by people’s tendency to post about some things and not about others.

Another example of the partial representation of the physical or social reality in data is demonstrated in Omer Miran’s master’s thesis (Miran, 2018). The study dealt with the analysis of policing activity in the UK, as expressed in the data the UK police uploaded to their website (see Footnote 1). Making police data openly available allows the public to monitor police activities. It also provides the basis for the assessment of the risk of crime in different areas. This can, for instance, help individuals in their decisions about where to live, rent or buy an apartment, and raise their kids.

The study aimed to determine the relative frequency of different types of crimes in different parts of the UK, where each part was defined by the specific police station that oversaw an area. The analysis combined information from four databases. The most important one is the UK police “crime cases database” for the years 2010–2015, which includes reports of crime incidents and their locations. In it, all crime events are recorded with relatively rough geographical information. A second database covers police stop and search activities for the year 2014 and was also downloaded from the UK police site. Here, the location at which a person was stopped is also recorded. The two other databases came from the UK Office for National Statistics and included population size and the average weekly […] for different locations.

The analysis focused on two different types of crime—burglary and drug-related crime. In a burglary, one or more people enter a location (a house, business, etc.) without permission, usually with the intention of committing theft. One can assume that a burglary will almost always be reported to the police and will appear in the records. Therefore, the number of burglary incidents in police records likely reflects the actual frequency of burglaries in an area.

The second type of crime was drug-related crime, such as drug deals. Here, the people involved in the crime will usually not report its occurrence. Consequently, a drug-related crime will usually only appear in the police files if the police make an active effort to detect it. Hence, the data on drug-related activities do not really reflect the volume of such activities in an area but rather the police activity in the area.

The analyses of the data showed that there was no correlation between the amount of police activity in an area (as measured through the number of stop and search events in the area) and the number of burglary events (r = −0.047). However, there was a positive correlation between police activity and recorded drug-related crimes (r = 0.180). Thus, the two types of crime data indeed reflect somewhat different types of events, namely the activity of criminals (in the burglary data) and the activity of the police (in the drug-related crime data). These two types of activities can, of course, be correlated or can be related to other variables that characterize the location.
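The mechanism behind these two correlations can be illustrated with a small simulation. The following Python sketch uses purely synthetic data, not the study's records. The function name `simulate_correlations`, the Poisson rates, the 95% burglary reporting rate, and the detection probability are all hypothetical assumptions chosen for illustration. It assumes that burglaries are reported regardless of police activity, whereas a drug deal is recorded only if the police detect it, with a detection probability proportional to the number of stop and search events. The size of the simulated correlations depends on these arbitrary parameters and is not meant to reproduce the values reported above.

```python
import numpy as np


def simulate_correlations(seed=0, n_areas=500):
    """Return (r_burglary, r_drug): Pearson correlations between police
    activity and recorded crime counts in synthetic areas."""
    rng = np.random.default_rng(seed)

    # Hypothetical per-area quantities (synthetic, not the study's data).
    police_activity = rng.poisson(lam=50, size=n_areas)    # stop and search events
    true_burglaries = rng.poisson(lam=30, size=n_areas)    # actual burglaries
    true_drug_deals = rng.poisson(lam=3000, size=n_areas)  # actual drug deals

    # Assumption: victims report almost every burglary, whatever the police do.
    recorded_burglaries = rng.binomial(true_burglaries, 0.95)

    # Assumption: a drug deal enters the records only if police detect it,
    # and the detection probability grows with police activity.
    detection_prob = 0.0002 * police_activity
    recorded_drug_crimes = rng.binomial(true_drug_deals, detection_prob)

    r_burglary = np.corrcoef(police_activity, recorded_burglaries)[0, 1]
    r_drug = np.corrcoef(police_activity, recorded_drug_crimes)[0, 1]
    return r_burglary, r_drug


r_burglary, r_drug = simulate_correlations()
print(f"police activity vs. recorded burglaries:  r = {r_burglary:.3f}")
print(f"police activity vs. recorded drug crimes: r = {r_drug:.3f}")
```

In this sketch, the actual number of drug deals is independent of police activity by construction, so any correlation in the recorded drug crimes stems entirely from the recording process.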

The analysis of the police databases revealed additional clear differences between the picture of reality they provide and the actual reality. In the UK Home Office drug survey for 2013, 2.8% of adults aged 16 to 59 (280 out of 10,000) reported using illicit drugs more than once a month in the last year. Assuming that each of these people purchased drugs once a month, they were involved in approximately 12 × 280 = 3,360 drug deals a year per 10,000 adults. In the UK police data set, the yearly average of recorded drug-related crimes was about 28.7 per 10,000 people. Thus, less than 1% of the estimated drug deals (28.7 / 3,360 ≈ 0.85%) appear in police data. This demonstrates the large potential gap between the image of reality that appears in the analysis of data and the actual reality this image is supposed to reflect.
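The back-of-the-envelope estimate above can be written out as a short calculation. This is a minimal sketch using only the figures quoted in the text. The function name `recorded_share` and its parameter names are hypothetical. Like the text, it treats the survey base (adults aged 16 to 59) and the police base (all people) as comparable, which is a simplification.

```python
def recorded_share(regular_users_per_10k, deals_per_user_per_year, recorded_per_10k):
    """Return the estimated yearly number of drug deals per 10,000 people
    and the fraction of those deals that appear in police records."""
    estimated_deals = regular_users_per_10k * deals_per_user_per_year
    return estimated_deals, recorded_per_10k / estimated_deals


# Figures from the text: 280 regular users per 10,000 adults, one purchase
# per month (an assumption), 28.7 recorded drug crimes per 10,000 people.
estimated_deals, share = recorded_share(280, 12, 28.7)
print(f"Estimated deals per 10,000 per year: {estimated_deals}")  # 3360
print(f"Share appearing in police data: {share:.2%}")             # 0.85%
```

If regular users buy more often than once a month, the estimated number of deals rises and the recorded share falls further, so the figure of under 1% is a generous upper bound under these assumptions.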