Introduction

Digital technology has transformed society. It underpins much of people's routine activity, from accessing and sharing information on social media and ordering goods for home delivery to using banking services on the move. However, the impact of digital technology is not limited to lawful applications. It has changed the opportunity structure for traditional crimes (Felson & Boba, 2010; Holt & Bossler, 2015). Indeed, stalking, bullying, and fraud can all be facilitated by the misuse of email, social media, geo-localisation, and online purchase services. Furthermore, cybercrime has emerged, encompassing both crimes that can only be committed using digital technology (e.g. hacking, phishing, smishing) and traditional crimes that are enabled by it (e.g. online harassment, online fraud, and online child sexual abuse material).

Extant efforts to measure and understand cybercrime typically rely on three sources of data: traditional police-recorded crime databases (Furnell & Dowling, 2019; Popham et al., 2020; van der Laan & Tollenaar, 2021), databases from recently established cybercrime bureaux (e.g. Action Fraud, the IC3), and cyber-victimisation surveys (DeVoe & Murphy, 2011; Reep-van den Bergh & Junger, 2018). However, these sources have significant limitations, outlined below, that the current study seeks to redress:

  • Police-recorded crime databases do not always label cybercrime incidents, and those that do rarely contain more than a few cybercrime sub-categories (Furnell & Dowling, 2019). This is because most traditional crime classifications are based on legal definitions of crime, which tend to be framed around the type of harm caused rather than the tools used (McGuire, 2019). Hence, in police databases that adhere to the legal definition of crime for processing criminal cases within the criminal justice system, most conventional crimes enabled by cyber technology are recorded as conventional crimes. As a result, the data mining methods (i.e. the 'incident filtering function') typically found in police database management software cannot be used to extract certain types of cybercrime incidents (e.g. cyber sexual harassment or specific types of cyber fraud). Extraction therefore requires selecting a broader set of crime incidents (e.g. sexual harassment or fraud) and manually identifying cybercrime incidents among them, a process that can be very time consuming.

  • Databases generated by cybercrime bureaux only contain cybercrimes that fit pre-defined categories. First, it may not be possible to report all types of cybercrime incidents. Second, non-reporting rates are high in these specialist agencies, mainly because victims are not aware that they can report cybercrime to them. In the UK, Furnell and Dowling (2019) found that the most commonly cited reason for not reporting cybercrime was that people ‘haven’t heard of Action Fraud’—the national self-reporting website for fraud and cybercrime. The same is likely to apply in South Korea.

  • Victimisation surveys are known to yield better reporting rates, but they do not tap into all facets of cybercrime. Due to time constraints, survey questionnaires usually focus on the most common types of cybercrime, such as online scams (Reep-van den Bergh & Junger, 2018; Whitty & Buchanan, 2012), cyberstalking (Fissel & Reyns, 2020), and online defamation (Näsi et al., 2015). Notwithstanding their strengths, these instruments cannot currently provide an overview of all cybercrime types.

Given these limitations, calls for police service (CPS) datasets may be considered a promising data source for understanding the scale and nature of cybercrime. CPS databases include records not only about crimes but also about anti-social behaviour and even simple consultations from citizens—all of which may have a cyber component (Boulton et al., 2017). The data also contain more crime events than official crime records datasets, which only include crimes classified under legislative categories. Hence, a CPS dataset may encompass many different types of cybercrime incidents. Moreover, a CPS dataset offers practical advantages, as analysts can readily access it. They should therefore be able to conduct timely investigations of cybercrime without requesting data from other bodies.

Unfortunately, extracting cybercrime records from CPS datasets is challenging because the records do not have built-in meta-data (e.g. labels, flags) for cybercrime and its sub-categories. Researchers have used a range of methods for extracting information and analysing cybercrime: manual investigation (Montoya et al., 2013), keyword query extraction (Domenie et al., 2009), and supervised machine learning classification (van der Laan & Tollenaar, 2021). Basit (2003) pointed out that the size of the dataset, the available resources (time and funding), and the expertise of the researchers are all critical factors in selecting an appropriate data extraction approach. Manual investigation of large text datasets is a time-consuming process that requires significant resources (van der Laan & Tollenaar, 2021). Keyword query extraction can be useful when the analyst has sufficient knowledge of the topic to suggest appropriate keywords (Sagayam et al., 2012), but it requires manual verification to filter out false positives (van der Laan & Tollenaar, 2021), which can also be time consuming.

In this context, supervised machine learning classification appears a promising approach for efficiently classifying texts from large volumes of documents (Kadhim, 2019). However, there are no publications describing in detail how a cybercrime classifier should be designed and what performance can be expected from it. This study aims to fill this gap and explain how a supervised machine learning method can be developed to identify cybercrime cases in CPS datasets. The classifier was developed and tested using data from the Korean CPS dataset (N = 15 million, 1 January–30 September 2020).

Background

Cybercrime in Police Crime Database

Crimes involving network technology are generally referred to as ‘cybercrime’. This category is often divided into two groups: cyber-dependent crime and cyber-enabled crime (McGuire & Dowling, 2013). Cyber-dependent crime (CDC) refers to offences that cannot be committed without computers, networks, or information communication technology (ICT). Examples include hacking, computer viruses, and DDoS attacks. In contrast, cyber-enabled crime (CEC) refers to traditional incidents in which computers, networks, or ICT play an important role, such as romance scams, cyberstalking, and online fraud (McGuire, 2019).

Researchers have attempted to measure the scale of cybercrime using police crime databases. Furnell and Dowling (2019) estimated that cybercrime represented about 1% of incidents recorded in UK police databases. van der Laan and Tollenaar (2021) investigated crime records from the Dutch police and reported that CDC and CEC constituted about 0.34% and 5.59% of all crime records in 2016, respectively. Interestingly, Mazowita and Vézina (2014) found that 55% of the cybercrime incidents (N = 9084) recorded in the Canadian Uniform Crime Report (UCR) dataset for 2012 were fraud cases. The low number of cybercrimes in the UCR implies that traditional crime datasets are unlikely to capture the scale of the problem or its evolving nature. Additionally, cybercrime victims may not see the police as the natural agency to deal with many forms of cybercrime.

Statistics have also been generated from cases reported on specialist cybercrime bureaux’s online platforms. The FBI’s Internet Crime Complaint Centre, for example, reported that an average of 552,000 cybercrime complaints a year were handled in the US between 2017 and 2021 (IC3, 2021: 7). Similarly, Action Fraud (2021a, 2021b) reported that 31,322 cyber-dependent crimes and 875,622 fraud cases were recorded in the UK in 2021. Additionally, the Korean National Police Agency (KNPA, 2020) reported that an average of 169,802 cybercrime cases a year were reported in South Korea between 2016 and 2020.

Automatic Extraction of Cases from Police-Recorded Crime Datasets

To date, only a few studies have attempted to automatically classify crime incidents in police datasets. Matos et al. (2022) built a classifier to automate tasks that were manually performed by the Assistant Secretariat for Intelligence and Criminal Analysis (SIAC) in Brazil. Trained on a manually labelled dataset, their convolutional neural network achieved an accuracy of 78% in classifying the 20,000 cases of the test dataset into 463 crime categories. This approach, however, cannot be replicated here because cybercrime cases are not labelled in police datasets.

Haleem et al. (2019) took a different approach in their attempt to classify mental illness cases from the calls for police service dataset of Greater Manchester Police, UK. To build the training set, two researchers and one police officer independently labelled 391 cases and finalised each label through quorum voting. Their approach, based on word-embedding techniques and deep learning (continuous bag of words and distributed bag of words), yielded 89% accuracy. However, the size of the test set (N = 78) and the lack of detail about the method used to build the training set (N = 391 call logs) raise concerns about the application of the method to large datasets. Additionally, all non-consensus samples were labelled according to the opinions of two out of three coders. Yet descriptions of individual cases in CPS data sometimes lack information regarding the situations, causes, or motives revealed at the scene that would indicate whether a case is associated with a mental health issue. Some samples may therefore fall into a grey zone and cannot be confidently assigned to either class. This limitation is discussed later in the paper.

Karystianis et al. (2018) used a rule-based approach to identify mental health disorder events in domestic violence (DV) cases in the Australian police database WebCOPS. Two hundred cases randomly selected from the DV records (N = 492 k) were manually labelled by two experts. Labelling was based on lexical patterns identified through manual reading, as well as a list of mental health disorders in the International Classification of Diseases (ICD, 2017). A lexical rule was created to distinguish whether the offender or the victim had a mental health disorder. The model performance obtained using a separate testing set (N = 100) showed F-scores of about 81% and 87% for victims and offenders, respectively. The final model was applied to the entire DV dataset, and it was estimated that mental disorder–related cases represented 16% (N = 78 k) of all cases. The researchers provided detailed information about the labelling process, but they did not explain how they resolved disagreements between coders. Furthermore, the testing set was too small to be confident that the performance level would hold with very large datasets.

van der Laan and Tollenaar (2021) developed a supervised classifier and applied it to cybercrime incidents recorded in the Dutch police registration system. Incidents were distributed between three classes of CDC (hacking, ransomware, and DDoS attack) and five classes of CEC (online threat, online stalking, cyber-libel, online identity theft, and online fraud). With 67% accuracy, the classifier exhibited moderate performance, leaving ample room for improvement before it could be reliably used by crime analysts. Furthermore, important details were not provided about the design of the classifier, which makes replication difficult. First, the authors did not explain how they selected the cybercrime-related keywords used in the study. Second, when creating the training set for the supervised classifier, the authors used cybercrime-related keywords to sample 5000 cases and categorised each incident into one of eight cybercrime types based on their criteria. Presumably, some of these cases did not correspond to any of the eight classes used in the research. The authors should therefore have included a separate class for cybercrime incidents that do not belong to any of the eight categories: cases related to online child sexual abuse material or the various phishing frauds, for instance, cannot be classified under any of the defined categories. It remains unclear how this affected the classification results. Third, without an ‘others’ class, non-cybercrime cases, if any, would have been incorrectly classified. It is not clear how the authors proposed to address this issue.

Data and Proposed Classification Method

This research aims to build a machine learning classifier based on the text information in call logs. The KCPS data were anonymised and employed for this study. The dataset comprises all the calls for police service made in South Korea between 1 January and 30 September 2020 (N = 15,743 k). The learning set (the combined training and testing sets) was used to develop the classifier, which was then applied to the whole KCPS dataset, and statistics were computed about the identified cybercrime incidents.

The flowchart in Fig. 1 represents the seven-step procedure used in this study. The central methodological task is building a high-quality learning set. First, all call logs in the KCPS dataset were pre-processed to identify individual event cases by removing redundant information. A learning dataset was then built through three procedures: generating a list of cybercrime-related terms, performing minority oversampling, and labelling incidents. Next, word features were selected using term frequency-inverse document frequency (TF-IDF) scores. Model performance was evaluated using tenfold cross-validation, with the training and testing sets rotated ten times to ensure good validity. The models were trained and tested with the learning set (N = 22,877 incidents). The model with the best overall performance was then selected and applied to the KCPS dataset.

Fig. 1 Flowchart representing the development process of the cybercrime classifier

Data Pre-processing

A number of calls were removed from the KCPS dataset. These include test calls made by police (N = 18 k), calls without text (N = 4651 k), duplicates (N = 1324 k), and transferred calls (N = 159 k). Duplicate cases were identified based on the content of the text column: ‘same as previous call’ (‘동일’ or ‘중첩’). Through this process, the size of the KCPS dataset was reduced by 39% to 9591 k incidents, which served as the population data to which the final model was applied.

As Korean sentences become shorter during morphological analysis (Han & Palmer, 2004), very short case descriptions may not retain enough features after lemmatisation for accurate classification. The primary researcher manually reviewed the text in the dataset and arbitrarily set a minimum of nine characters as necessary to convey meaningful information describing a crime case. Records containing fewer than nine characters (N = 1855 k) were therefore removed. The resulting dataset (N = 7736 k) was used to build the learning dataset.
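As a rough illustration, the pre-processing filters might be implemented in R as follows. This is a minimal sketch: the data frame `kcps` and its columns (`call_type`, `report`) are hypothetical stand-ins for the actual KCPS schema, which is not documented here.

```r
# Sketch of the pre-processing filters (illustrative column names).
library(dplyr)
library(stringr)

kcps_clean <- kcps %>%
  filter(call_type != "test",                    # test calls made by police
         call_type != "transferred",             # transferred calls
         !is.na(report),                         # calls without text
         str_trim(report) != "",
         !str_detect(report, "동일|중첩")) %>%   # duplicates ('same as previous call')
  filter(nchar(report) >= 9)                     # drop descriptions under nine characters
```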

Refining the Text

The dataset was refined by merging columns, removing stop words, and improving spelling. First, each incident was assigned a unique document number for the study. Then, the three columns that contain information about the cases (i.e. report, end comment, and reference comment) were merged into a single column. Populated by the KCPS call centre operators, the ‘report’ column summarises the issues raised by callers (including citizens or other agencies). The text in the ‘end comment’ column is recorded by the officer dispatched to the scene once they have finished dealing with the case; it contains a chronological record of events and actions taken during the incident. ‘Reference comments’ were recorded by the operator when they wanted to log further information deemed relevant to the case.

As there is no standard list of ‘stop words’ in Korean, we drew up a list of elements without specific lexical meaning and removed them from the dataset. These elements, comprising numbers (10), letters (26), and punctuation marks (29), can be found in the Appendix. English letters in the dataset (e.g. CCTV) were retained for the morphological analysis. As operators are expected to quickly elicit, understand, and synthesise information from a large number of calls, spelling accuracy is often secondary in their job. After manual reading, misspelt terms deemed to be related to cybercrime (N = 30) were corrected (Appendix Table 7).
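A hedged sketch of these refining steps is given below. The stop-element and misspelling lists shown are tiny placeholders for the full lists in the Appendix, and the column names remain illustrative.

```r
# Sketch of the text-refining steps (illustrative names and lists).
library(dplyr)
library(stringr)

stop_elements <- c("\\d", "[[:punct:]]")            # placeholder for the 65 Appendix elements
misspellings  <- c("인터냇" = "인터넷")             # hypothetical misspelling -> correction

kcps_refined <- kcps_clean %>%
  mutate(doc_id = row_number(),                     # unique document number
         text = str_c(report, end_comment, reference_comment, sep = " "),
         text = str_remove_all(text, str_c(stop_elements, collapse = "|")),
         text = str_replace_all(text, misspellings))  # correct misspelt cybercrime terms
```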

Building the Learning Dataset

Cybercrime-Related Terms

As previously noted, Furnell and Dowling (2019) estimated that cybercrime accounts for about 1% of all recorded crimes in UK police data. If the proportion of cybercrime is similarly low in the KCPS dataset, random sampling could be expected to yield a highly imbalanced learning dataset. To address this challenge, we carried out a keyword search for incidents involving digital technology. In total, 88 cybercrime-related terms (CRT) relating to cyber, online, computer, Internet, Internet of Things (IoT), ICT, and networks were selected (Appendix Table 8). Candidate terms were identified using three approaches and retained when found in the KCPS dataset:

  • A random number generator in the R software was used to select 20,000 incidents. The primary researcher identified 50 terms that met the above inclusion criterion.

  • The TF-IDF keyword selection method was applied to the dataset (Aizawa, 2003), and four additional CRT were selected from among the top 200 keywords.

  • A further 29 terms were identified through searches on two Korean online portals (Naver and Daum).

Although the R package NLP4kec (Namyoun, 2020) is a useful tool for performing morphological analysis on Korean text data, it has limitations in identifying specific terms that are not included in its built-in dictionary. To address this, we iteratively created a ‘domain dictionary’ consisting of 712 relevant acronyms and bigrams frequently found in the learning set, updating it continuously throughout the labelling process. For instance, the phrase ‘컴퓨터 (computer) + 조회 (search)’ is recognised as two separate words by a machine learning model because they are treated as a noun and a verb in the built-in dictionary. However, within the samples in the learning set, the bigram ‘컴퓨터조회 (computer search)’ was consistently observed when officers used it to refer to looking up information relevant to a case in the intranet system. This bigram conveys a different meaning (and context) within the Korean police domain than the two words do separately. To illustrate, consider the sentences ‘a man uses a computer to search for a victim online’ and ‘officers run a computer search for the case in their intranet’. Both contain the words ‘computer’ and ‘search’, but treating ‘computer search’ as a distinct term in the latter improves the features in the learning set. By listing such bigrams in the domain dictionary, which the morphological analyser automatically recognises as nouns, we retained non-standard terminologies in the dataset instead of discarding them during lemmatisation.
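Without relying on NLP4kec internals, the effect of the domain dictionary can be sketched by collapsing known domain bigrams into single tokens before morphological analysis; the dictionary entries below are illustrative examples, not the actual 712-entry dictionary.

```r
# Sketch: merge domain bigrams into single tokens ahead of lemmatisation.
library(stringr)

domain_dict <- c("컴퓨터 조회" = "컴퓨터조회",  # 'computer search' (police intranet lookup)
                 "몸 캠"      = "몸캠")         # hypothetical second entry

kcps_refined$text <- str_replace_all(kcps_refined$text, domain_dict)
```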

Minority Oversampling

As explained above, the dataset contains a relatively small number of cybercrime incidents. Our approach therefore involved ‘oversampling the minority class’ and ‘undersampling the majority class’ in the learning dataset (see Chawla et al., 2002). The CRT were used to partition the dataset into two groups: Group A (N = 423 k), whose elements contained a CRT, and Group B (N = 7313 k), whose elements did not. Oversampling from Group A enabled us to obtain a learning set with similar proportions of cybercrime and non-cybercrime cases. Datasets A’ (N = 13,000) and B’ (N = 10,000) were generated through random sampling from Groups A and B, respectively. Note that the sample size of Dataset A’ was purposely greater than that of Dataset B’ because some of the incidents in Dataset A’ may not actually be cybercrime. The dataset (N = 23,000) obtained by combining the two samples (A’ and B’) constitutes the learning dataset.
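A minimal sketch of this partition-and-oversample step is shown below, reusing `kcps_refined$text` from the earlier sketches; `crt` stands in for the 88 terms in Appendix Table 8, of which only two are shown.

```r
# Sketch of the CRT-based partition and minority oversampling.
library(stringr)

crt     <- c("해킹", "피싱")   # 'hacking', 'phishing' (illustrative subset of the 88 CRT)
has_crt <- str_detect(kcps_refined$text, str_c(crt, collapse = "|"))

group_a <- kcps_refined[has_crt, ]    # incidents containing a CRT (N = 423 k)
group_b <- kcps_refined[!has_crt, ]   # incidents without a CRT (N = 7313 k)

set.seed(2020)                        # arbitrary seed for reproducibility
learning <- rbind(group_a[sample(nrow(group_a), 13000), ],
                  group_b[sample(nrow(group_b), 10000), ])
```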

Labelling the Learning Dataset

To label the learning dataset, we recruited 16 Korean police officers (13 males, 3 females) with direct field experience. This ensured that the coders had a sufficient understanding of the Korean language, the KNP call recording system, and crime in Korea. The officers’ duty stations were spread across the country, and their average length of service was 12.7 years.

The labelling procedure consisted of two sessions. In the first session, the learning dataset (N = 23,000) was divided into 16 subsets. Each subset was independently labelled by two people: the primary researcher and an officer (a different officer for each subset). This reduced the burden on individual officers and the risk of systematic bias. The officers received guidelines (see Appendix Table 9) explaining the concepts of cyber-enabled and cyber-dependent crime. After the first session, the officers and the researcher disagreed on 820 incidents (3.56%).

In the second labelling session, these 820 incidents were split into eight subsets, and each subset was independently re-labelled by two more officers. Majority voting (Sheng et al., 2017) between the researcher and three officers was adopted to decide on the final label. After the second review, only 123 unresolved incidents (0.53%) remained for which there was no majority vote. These were discarded. The final learning dataset comprised 10,261 cybercrime incidents (44.85%) and 12,616 non-cybercrime incidents (55.15%).
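The voting logic can be sketched as follows for the 820 disputed incidents; `disputed` is a hypothetical data frame with one column of votes per coder (the researcher plus three officers), coded 1 for cybercrime and 0 for non-cybercrime.

```r
# Sketch of the majority-voting resolution (four votes per disputed incident).
votes <- rowSums(disputed[, c("researcher", "officer_a", "officer_b", "officer_c")])

disputed$label <- ifelse(votes >= 3, 1L,               # majority: cybercrime
                  ifelse(votes <= 1, 0L, NA_integer_)) # 2-2 ties remain NA

disputed <- disputed[!is.na(disputed$label), ]         # drops the 123 unresolved cases
```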

Extracting Features (TF-IDF)

The morphological analysis that lemmatises and extracts nouns, verbs, and adjectives from the dataset was performed using the R package NLP4kec (Namyoun, 2020). The resulting corpus was transformed into a document term matrix (DTM) with TF-IDF weighting. Infrequent terms, occurring in fewer than 1 in 1000 cases, were eliminated from the DTM. As a result, the number of terms in the DTM decreased from 30,436 to 2135.
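Assuming the lemmatised text is stored as space-separated tokens in `learning$text_lemma` (a hypothetical column name), the DTM construction can be sketched with the tm package:

```r
# Sketch: build a TF-IDF weighted DTM and drop very infrequent terms.
library(tm)

corpus <- VCorpus(VectorSource(learning$text_lemma))
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))

# Keep only terms appearing in at least ~1 in 1000 documents
dtm <- removeSparseTerms(dtm, sparse = 0.999)
dim(dtm)   # expected to shrink from 30,436 to 2135 terms per the text
```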

Training and Testing the Models

The learning dataset (N = 22,877) was randomly divided into training (N = 18,000) and testing (N = 4877) sets. The linear SVM, kernel SVM, and Naïve Bayes algorithms were employed to build the models, and tenfold cross-validation was used to validate and optimise the hyper-parameters of the final model. The model showing the best performance among the three algorithms was then applied to classify cybercrime in the KCPS dataset (N = 9591 k). The results are described in the next section.
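A sketch of the training step with the e1071 package is shown below, plugging in the hyper-parameters reported in the Results (gamma = 0.0005, cost = 1); the split indices and object names are illustrative.

```r
# Sketch: train a kernel (RBF) SVM with tenfold cross-validation.
library(e1071)

x <- as.matrix(dtm)                  # TF-IDF features
y <- factor(learning$label)          # 1 = cybercrime, 0 = non-cybercrime

set.seed(2020)
train_idx <- sample(nrow(x), 18000)  # training set; remainder is the testing set

svm_fit <- svm(x = x[train_idx, ], y = y[train_idx],
               kernel = "radial", gamma = 5e-4, cost = 1,
               cross = 10)           # tenfold cross-validation accuracy

pred <- predict(svm_fit, x[-train_idx, ])
table(predicted = pred, actual = y[-train_idx])   # confusion matrix
```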

Results

Distribution of Crime Categories in the Datasets

Table 1 illustrates how the cases in the KCPS, training, and testing sets are distributed across the seven crime categories. The sex crime category consists of calls regarding sexual violence, sexual harassment, dating violence, and stalking. Violent crime refers to violence, domestic violence, child abuse, extortion, blackmail, property damage, robbery, homicide, disputes, and aggressive behaviour in public. The fraud category consists of traditional fraud and phishing fraud. Juvenile delinquency comprises school violence and other misbehaviours typically carried out by young people; an example would be a group of students calling out one student on Facebook Messenger and harassing him while filming a video. Theft relates to larceny, burglary, and snatch theft. Traffic consists of traffic accidents, traffic violations, hit-and-runs, intoxicated driving, and traffic inconveniences. The ‘others’ field covers a wide range of situations. It includes criminal offences such as illegal drugs, illegal gambling, miscellaneous criminal offences, other minor offences, and wanted persons, as well as non-criminal situations that require police intervention for security or safety reasons, such as consultations, disasters, emergency bell alarms, fires, intoxicated persons, missing persons, noise complaints, preventing a dangerous situation, requests from private security companies, rescue operations, police service requests, suicide, and unconfirmed cases.

Table 1 Distribution of incidents by crime types in the KCPS, training, and testing datasets

As expected, the distribution of incidents varies between the three datasets due to minority oversampling. In particular, the proportion of incidents classified as ‘fraud’ is seven to eight times higher in the training (13.7%) and testing (14.9%) sets than in the KCPS dataset (1.9%). This implies that fraud incidents are much more likely to have a cyber component than other types of incidents. In contrast, the proportion of traffic incidents in the learning datasets (6.4% and 6.7%) is about 0.6 times that in the KCPS dataset (10.7%), reflecting the fact that traffic incidents are comparatively less likely to be related to cybercrime. Sex crime, violent crime, juvenile delinquency, and ‘others’ have relatively similar proportions across datasets, suggesting that cybercrime prevalence in these categories is neither markedly higher nor lower, unlike fraud and traffic. Notably, incidents classified under ‘others’ account for a substantial proportion of the datasets (65–72%).

The Proportion of Cybercrime in the Datasets

Table 2 shows the proportion of cybercrime incidents in the learning set (45%) and across the different crime categories. Two categories, fraud (~ 95%) and ‘others’ (46–47%), have higher proportions of cybercrime than the other five crime types (0.5–29.5%). Sex crime contains about 30% and 20% cybercrime in the training and testing sets, respectively. Given that sex crime represents only a small fraction of recorded incidents (~ 0.7%), a 10-percentage-point difference can be considered a coincidental consequence of random sampling: a proportion test found no statistically significant difference in the share of cybercrime within sex crime between the training and testing sets (p = 0.58). Cybercrime represents about 5–6% of violent crime and juvenile delinquency incidents, which is notably lower than its proportion in theft cases (20%). The proportions of cybercrime in both datasets fell within the 95% confidence interval (43.57–46.74%), as determined by the proportion test.
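The proportion test referred to here can be reproduced with R’s built-in `prop.test`; the counts below are illustrative placeholders, not the actual cell counts behind Table 2.

```r
# Sketch: two-sample proportion test for cybercrime within sex crime,
# training vs testing set (hypothetical counts).
prop.test(x = c(120, 22),    # cybercrime counts among sex crime cases
          n = c(400, 110))   # total sex crime cases in each set
```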

Table 2 Proportions of cybercrime in the training and testing sets by crime types

As explained above, traffic incidents do not typically include cyber elements. This is confirmed by the results in Table 2: only 0.5% of the traffic incidents in the training set were labelled as cybercrime. For this reason, the classification accuracy cannot be determined for this category.

Classifier Performance

Three classification methods (Naïve Bayes, linear SVM, and kernel SVM) were applied to the aforementioned learning datasets. The training sample size was selected by comparing learning curves across different sample sizes (see details in Appendix Table 6). As shown in Table 3, the kernel SVM classifier was identified as the best classifier, considering accuracy (93.45%), recall (88.35%), precision (96.94%), and F1-score (92.44%). The optimal parameters for the kernel SVM classifier were obtained using the e1071 package in R (Meyer et al., 2020). The average accuracy over tenfold cross-validation is above 90%, and the best optimised hyper-parameters for gamma and cost are 0.0005 and 1, respectively.
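For reference, the reported F1-score follows directly from the precision and recall above, as their harmonic mean:

F1 = 2 × (precision × recall) / (precision + recall) = 2 × (0.9694 × 0.8835) / (0.9694 + 0.8835) ≈ 0.9244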

Table 3 Confusion matrix for the three classifiers

Table 4 presents an overview of the kernel SVM classifier’s performance on various sub-crime categories in the testing set. The total accuracy and F1-score of the model are 93.5% and 92.44%, respectively. The recall rate for all crime categories shows that 88.4% of cybercrime cases in the testing set were correctly identified as such by the model. The precision rate of the model indicates that 96.9% of the predicted cybercrime cases were real cybercrime cases in the testing dataset.

Table 4 Performance of the kernel SVM classifier on the crime categories of the testing dataset

While the average performance of the classifier (93.5%) is promising, it is important to examine its performance for individual crime sub-types. With an accuracy above 90% for all crime types except theft (82%), the model should be suitable for many applications. It performs best with fraud incidents, with an accuracy, recall, precision, and F1-score of 98.7%, 98.8%, 99.8%, and 99.3%, respectively. The classifier also performs well in identifying cybercrime in the ‘others’ category, with similarly high scores: 93.5%, 90.7%, 96.2%, and 93.4%. As the ‘others’ field constitutes over two-thirds of the KCPS dataset, the overall performance of the classifier is strongly linked to its performance for this crime type. The excellent results are therefore encouraging, but they should be interpreted with care, as we cannot confirm that the ‘others’ cases in the testing set are perfectly representative of the ‘others’ cases in the original dataset.

The model’s performance is good but lower for sex crime, violent crime, juvenile delinquency, and theft incidents, with F1-scores of about 77%, 69%, 67%, and 33%, respectively. As shown in Table 4, the proportion of cybercrime is smaller than 20% for these four crime categories in the testing set. Because it takes into account the trade-off between false positives and false negatives, the F1-score is a useful benchmark for evaluating the classifier’s performance on imbalanced datasets (Huilgol, 2019).

Proportion of Cybercrime in the KCPS Dataset

This section concludes our demonstration with statistics about the classification results. Table 5 details the proportion of cybercrime incidents in the KCPS dataset. After removing traffic-related incidents (N = 1022 k), the classifier labelled about 394 k (4.6%) of the remaining 8.5 million incidents as cybercrime. The greatest proportion of cybercrime in the dataset was found among fraud incidents (76.3%). The lowest proportions concern juvenile delinquency and violent crime, at less than 1%. Although juveniles are usually considered more tech-savvy, the low percentage of cybercrime in juvenile delinquency is presumably because they have other channels for reporting cyber incidents (e.g. school teachers). Cybercrime cases were slightly more prevalent in the remaining three categories: theft (5.7%), sex crime (4%), and ‘others’ (3.2%).

Table 5 The proportion of cybercrime and non-cybercrime in the KCPS dataset (1 January–30 September 2020)

Discussion

The purpose of this study was to evaluate the potential value of using supervised classification techniques to identify cybercrime in calls for police service data. In this section, we return to some of the issues encountered during the development of the dataset and discuss future research directions.

Building a Quality Training Set

When using a supervised model with natural language processing (NLP), feature engineering, in which researchers use their domain knowledge to manually define and extract noteworthy features from the texts, is an essential technique for improving a training dataset (Liu et al., 2023). In particular, the text features in the training set should be representative of the population it aims to classify, mislabelled samples must be handled appropriately (Sánchez et al., 2003), and class balance must be maintained (Chawla et al., 2002).

In this context, the current study demonstrated how minority oversampling can be applied to police-recorded textual big data. From the perspective of classifying cybercrime, the KCPS dataset is highly imbalanced. Hence, this study sought to balance the classes by combining the CRT with the minority oversampling technique, which increased the likelihood of selecting cybercrime incidents when sampling the learning dataset from the two partitioned datasets. Although more incidents were sampled from the group containing the CRT (N = 13,000) than from the other group (N = 10,000), the ratio of cybercrime to non-cybercrime in the learning dataset is 45:55. This implies that relying solely on keyword queries is not suitable for document classification, because many incidents containing the keywords were not labelled as cybercrime in the learning dataset. Moreover, as shown in Table 1, the proportions of the crime categories can change after applying the minority oversampling technique. Researchers should therefore consider potential shifts in the crime categories’ proportions when sampling a dataset to build a learning set.

In the second labelling phase, 123 cases were discarded from the learning set because majority voting failed: two coders labelled the case as cybercrime, while the other two identified it as non-cybercrime. Lack of consensus can be due to the coders’ experience, level of practice, risk-taking tendencies, and the clarity of the guidelines (Demirhan & Yilmaz, 2023). Additionally, some cases are presumed to be inherently ambiguous. Individuals also differ in how they interpret information, infer omitted content (e.g. motives), and envision the actions that would have taken place based on the narratives (Borrion et al., 2017). Each text record in the KCPS dataset represents a crime event as a narrative following the flow of time, and given the limited amount of text, not every crime event can be described perfectly: some cases contain sufficient information, while others do not, and crime-type information may occasionally be omitted from the record. Therefore, despite efforts to prevent the mislabelling of samples in the training set, it is necessary to exclude samples on which coders do not agree in order to create a high-quality training dataset.

Machine Learning Model’s Performance

The models demonstrated very good overall performance (Table 3). This is partly due to the size and quality of the learning set. First, the labelling procedure involved 16 police officers (with considerable professional experience) independently labelling each case in the learning set, with non-consensus cases resolved by majority voting. This labelling process contributed to reducing biases in the learning set. Second, the classes used in this study are mutually exclusive and form a complete set: cybercrime and non-cybercrime. Most models in previous studies did not include a ‘non-cybercrime’ or ‘others’ class in their learning dataset, which may have resulted in lower performance.

We found that the kernel SVM model has the highest performance among the three classifiers (F1-score, 92.44%). However, the model performs differently across the seven crime categories. This empirical finding demonstrates that although a classifier may perform very well overall, it can perform poorly for specific sub-categories. Accordingly, when trying to improve the quality of the learning dataset, we should consider sampling more incidents from those sub-categories with a low F1-score. Here, the final model may not be useful for identifying cyber incidents in the theft and traffic categories (F1-scores, 33.3% and N/A). However, the classifier performs well for the ‘others’ category, which contains around 80% of the incidents in the KCPS dataset.

Previous studies that examined machine learning classifiers for crime datasets (Haleem et al., 2019; Karystianis et al., 2018; Matos et al., 2022; van der Laan & Tollenaar, 2021) were also primarily focused on methodological issues. While these studies were useful in explaining how to classify incidents from textual data, the performance of the classifiers was limited to around 60 to 80%.

Assessing Accuracy Acceptability of the Model

In addition to evaluating the accuracy of the machine learning model, we need to consider the ‘acceptability of accuracy’ based on the intended application of the model (Germini et al., 2022). Kay et al. (2015) proposed that accuracy acceptability can be assessed by examining the perceived usefulness and sensitivity for the intended application.

The function of the model is to classify cybercrime in the KCPS dataset for the purpose of analysing cybercrime. The data extracted by the model can be used to measure the scope, trends, and patterns of cybercrime, as well as the associated police workloads. Errors in classifying individual incidents are not as problematic as in other sectors, such as cancer detection. However, the model’s error rate (~ 6%) needs to be assessed in light of the specific applications for which it is used.

Limitations

This study has several limitations. First, it analyses a single dataset (the KCPS dataset), which is limited to police-recorded incidents in South Korea, so we cannot be sure to what extent the findings generalise to other regions and other datasets. Future studies could expand the analysis to other datasets to confirm the effectiveness of the proposed classification method. Second, since CPS data capture the results of calls and first responses by the police, without involving a criminal investigation, we cannot ascertain from CPS data alone whether cases are related to cross-border cybercrime. Consequently, the results cannot be used to draw conclusions about the global characteristics of cybercrime; for this, the study should be replicated using databases from other countries. Third, in the text data pre-processing phase, misspelt terms related to cybercrime were corrected (Appendix Table 7). Since the bag-of-words method was applied in this study, identifying more commonly misspelt variants of the same terms would improve the quality of the training set (Scott & Matwin, 1999). Fourth, the sampling of the training set could be enhanced to better represent the KCPS dataset by incorporating other strata in the dataset, such as call categories or call priority codes. Finally, given that the field of cybercrime is rapidly evolving and new types of cybercrime may emerge (McGuire, 2019), the proposed methodology and classifier should be regularly updated to keep up with the evolving nature of cybercrime and new technology.

Conclusion

The CPS is arguably one of the most valuable police datasets, containing reports about citizens’ daily experiences of crime and disorder problems. This study has presented a procedure for extracting unclassified crime types, especially cybercrime, from free-text CPS data using text mining and machine learning.

This study contributes to our understanding of how to build a good-quality learning dataset for developing a high-performance machine learning model. It interconnects three main methods: building the CRT as the basis for the keyword query technique and a domain dictionary for text mining, applying the minority oversampling technique to balance the learning dataset, and labelling with majority voting to ensure quality. The data imbalance issue was addressed by combining the CRT with the sampling technique. Labelling was conducted in two sessions involving 16 officers, with final labels decided by majority voting; non-consensus cases, which comprised 0.53% (N = 123) of the learning set, were discarded.

The framework presented in this study can aid police analysts in classifying and analysing specific crime phenomena regardless of the volume of data. Since the study illustrates a methodological approach to classifying unidentified crime events in police-recorded textual databases, the framework can be applied to any crime type that lacks an official label in police datasets. Furthermore, the methodology can be employed to classify a sub-crime type within a specific crime type. Hence, information that is easily overlooked by the pre-existing categorisation of police datasets can be analysed for crime prevention purposes.