
As the Internet continues to grow in both size and importance, the quantity and impact of online reviews continue to increase. Reviews can influence people across a broad spectrum of industries, but are particularly important in the realm of e-commerce, where comments and reviews regarding products and services are often the most convenient, if not the only, way for a buyer to decide whether or not to buy them. Online reviews may be generated for a variety of reasons. Often, in an effort to improve and enhance their businesses, online retailers and service providers ask their customers to provide feedback about their experience with the products or services they have bought, and whether they were satisfied or not. Customers may also feel inclined to review a product or service if they had an exceptionally good or bad experience with it. While online reviews can be helpful, blind trust in these reviews is dangerous for both the seller and the buyer. Many people look at online reviews before placing an online order; however, the reviews may be poisoned or faked for profit or gain, so any decision based on online reviews must be made cautiously. Furthermore, business owners might give incentives to whoever writes good reviews about their merchandise, or might pay someone to write bad reviews about their competitors’ products or services. These fake reviews are considered review spam and can have a great impact in the online marketplace due to the importance of reviews.

Review spam can also negatively impact businesses due to loss in consumer trust. The issue is severe enough to have attracted the attention of mainstream media and governments. For example, the BBC and New York Times have reported that “fake reviews are becoming a common problem on the Web, and a photography company was recently subjected to hundreds of defamatory consumer reviews”. Review spam can be divided into three categories [2]: (1) Untruthful Reviews -- the main concern of this paper, (2) Reviews on Brands -- where the comments are only concerned with the brand or the seller of the product and fail to review the product, and (3) Non-Reviews -- those reviews that contain either unrelated text or advertisements. The first category, untruthful reviews, is of most concern as these reviews undermine the integrity of the online review system. Detection of type 1 review spam is a challenging task as it is difficult, if not impossible, to distinguish between fake and real reviews by manually reading them. To illustrate the difficulty of this task, we consider a real and a fake review from the dataset created by Ott et al. [3]. As a human judge it is difficult to confidently ascertain which review is fake and which is authentic.

Review 1: Great Hotel This building has been fantastically converted into studios/suites. We only had a studio which was brilliant can’t imagine how the suite could have bettered what we had. The kitchen had everything cooker microwave dishwasher and fridge freezer. Bathroom was a good size and again had everything you need including good quality toiletries. Hotel also has a good gym and swimming pool and excellent laundry facilities if you need them. The complimentary breakfast each morning was also very good and had an excellent choice. The parking in the hotel was secure and reasonably priced. The location was pretty central and had easy access to the underground city. Would definitely stay here again.

Review 2: During my latest business trip, both me and my wife recently stayed at the Omni Chicago Hotel in Chicago, Illinois, at one of their Deluxe suites. Unfortunately, and I think I speak for both of us, we were not fully satisfied with the hotel. The hotel advertises luxury-level accommodations, and while the rooms resemble what one can see in the pictures, the service is certainly sub-par. When one plans a stay at such an establishment, they expect a service that goes beyond having fresh towels in the bathroom when they check in. First of all, the air-conditioning in the room seemed to be in need of a new filter and when it was first turned on, the air coming out seemed musty. Second of all, the fitness center was only open until 10:30 pm. For people who like to exercise after dinner, this can certainly be a problem. Especiaally considering that it does not take much to have the fitness center available around the clock or until midnight. For these, as well as other similar reasons, I would not recommend this hotel, if one is looking for luxury accommodations.

Nothing in the text of the two reviews clearly signals to the casual reader that the first review is real while the second is fake. Nevertheless, guides provided by the Consumerist and MoneyTalksNews websites offer tips to help consumers spot fake reviews. A computer scientist might seek to utilize this logic when training data mining and machine learning algorithms to find the features of a review that determine whether it is real or fake.

Over 18 million reviews were created on Yelp in 2014 and Trip Advisor currently has over 200 million reviews. Online reviews are constantly being generated on various web sites across the Internet. Consequently, Big Data techniques are needed to address the problem of review spam. Big Data, while an overused buzzword with an elusive definition, is often quantified with the Four V’s: (1) Volume -- the sheer size and scale of the data, (2) Velocity -- the rate at which new data is created and consumed by processing engines, (3) Variety -- the different formats that data may be stored in, and (4) Veracity -- the quality level of the data. The Volume and Velocity of online reviews are noted by merely visiting e-commerce and customer rating sites, such as Yelp and Amazon. There is great Variety across the possible industry sectors for reviews (such as hotels, restaurants, e-commerce, home services, etc.), along with the multiplicity of languages that reviews are written in. Veracity is a problem with online reviews, since the vast majority of reviews are unlabeled, which means it is not easily known whether the review is fake or not. Additionally, standard machine learning algorithms tend to break down and become ineffective when dealing with data of this size, which poses a problem when trying to apply these algorithms for review spam detection [4]. Thus, review spam detection is a Big Data problem, as there are numerous challenges when analyzing and classifying varying reviews from disconnected sources.

Data mining and machine learning techniques, primarily those for web and text mining, offer an exciting contribution to detecting fraudulent reviews. According to Liu [5], web mining is “the process for finding useful information and relations from the contents available on the web by largely relying on the available machine learning techniques and methods”. Web mining can be divided into three types of tasks: structure, content and usage mining. Content mining is concerned with knowledge and information extraction, and categorizing entities using data mining and machine learning approaches. A straightforward example of content mining is opinion mining. Opinion mining consists of attempting to ascertain the sentiment (i.e., positive or negative polarity) of a text passage by analyzing the features of that passage. A classifier can be trained to classify new instances by analyzing the text features associated with different opinions along with their sentiment. Review spam detection, like opinion mining, lies in the category of content mining, but also utilizes features not directly linked to the content [6]. Constructing features to describe the text of the review involves text mining and Natural Language Processing (NLP). Additionally, there may be features associated with the review’s writer, its post date/time and how the review deviates from other reviews for the same product or service.

It is important to mention that while most existing machine learning techniques are not sufficiently effective for review spam detection, they have been found to be more reliable than manual detection. The primary issue, as identified by Abbasi et al. [7], is the lack of any distinguishing words (features) that can give a definitive clue for classification of reviews as real or fake. A common approach in text mining is the bag of words approach, where the presence of individual words or small groups of words is used as features; however, several studies have found that this approach is not sufficient to train a classifier with adequate performance in review spam detection. Therefore, additional methods of feature engineering (extraction) must be explored in an effort to extract a more informative feature set that will improve review spam detection. In the literature, there are many studies that consider different sets of features for review spam detection utilizing a variety of machine learning techniques. Jindal et al. [8], Li et al. [9] and Mukherjee et al. [10] used individual words from the review text as the features, while Shojaee et al. [11] used syntactic and lexical features. An additional study by Ott et al. [12] used review characteristic features in addition to unigram and bigram term-frequencies.

Features associated with the behavior of the reviewer also merit further investigation. The study of writers of review spam differs from that of the review spam itself since features representing the characteristics and behaviors of reviewers cannot be extracted from the text of a single review. Examples of studying spammer behavior include spotting multiple User IDs for the same author [13] and identifying groups of spammers by studying their behavioral footprints [14–16]. Alternatively, graph-theory based methods can also be used to find relationships between the reviews and their corresponding authors and have shown promising results [17, 18]. Combining review spam detection through a review’s features, and spammer detection through analysis of their behavior may be a more effective approach for detecting review spam than either approach alone.

Before addressing the challenges associated with improving review spam detection, we must first address collection of data. Data is a major part of any machine learning based model, and while a massive volume of reviews are available on the Internet, collecting and labeling a sufficient number of them to train a review spam classifier is a difficult task. An alternative to collecting and labeling data is to artificially create review spam datasets by using synthetic review spamming, which takes existing truthful reviews and builds fake reviews from them. Sun et al. [19] used this approach to create a review spam dataset.

In this paper we discuss machine learning techniques that have been proposed for the detection of online review spam, with an emphasis on feature engineering and the impact of those features on the performance of the spam detectors. Additionally, the merits of supervised, unsupervised and semi-supervised learning methods are analyzed, and results of current research using each approach are presented along with a comparative analysis. Finally, we provide suggestions for aspects of review spam detection requiring further investigation, and best practices for conducting future research. To the best of our knowledge, this paper includes information about all of the datasets that have been used, or generated for use, in the reviewed literature.

The structure of this paper is as follows. The Feature Engineering for Review Spam Detection section provides an overview of feature engineering in this domain, both for review centric spam detection and reviewer centric spam detection. The Review Centric Review Spam Detection section discusses and analyzes current research using supervised, unsupervised and semi-supervised machine learning for review centric spam detection. The Reviewer Centric Review Spam Detection section provides an overview of studies using reviewer centric features. The Comparative Analysis and Suggestions section contains a discussion and comparison between the different methods proposed. The Conclusion summarizes our findings and reviews the importance of both past and future work.

Feature engineering for review spam detection

Feature engineering is the construction or extraction of features from data. In this section, we analyze and discuss some of the commonly used features in the domain of review spam detection. As briefly outlined in the introduction, previous studies have used several different types of features that can be extracted from reviews, the most common being words found in the review’s text. This is commonly implemented using the bag of words approach, where features for each review consist of either individual words or small groups of words found in the review’s text. Less frequently, researchers have used other characteristics of the reviews, reviewers and products, such as syntactical and lexical features [11] or features describing reviewer behavior. The features can be broken down into the two categories of review and reviewer centric features. Review centric features are features that are constructed using the information contained in a single review. Conversely, reviewer centric features take a holistic look at all of the reviews written by any particular author, along with information about the particular author.

It is possible to use multiple types of features from within a given category, such as bag-of-words with POS tags, or even create feature sets that take features from both the review centric and reviewer centric categories. Using an amalgam of features to train a classifier has generally yielded better performance than any single type of feature, as demonstrated in Jindal et al. [20], Jindal et al. [21], Li et al. [9], Fei et al. [22], Mukherjee et al. [23] and Hammad [24]. Li et al. [25] concluded that using more general features (e.g., LIWC and POS) in combination with bag-of-words is a more robust approach than bag-of-words alone. A study by Mukherjee et al. [23] found that using the abnormal behavioral features of the reviewers performed better than the linguistic features of the reviews themselves. The following subsections discuss and provide examples of some review centric and reviewer centric features.

Review centric features

We split review centric features into several categories. First, we have bag-of-words, and bag-of-words combined with term frequency features. Next, we have Linguistic Inquiry and Word Count (LIWC) output, parts of speech (POS) tag frequencies, Stylometric and Syntactic features. Finally, we have review characteristic features that refer to information about the review not extracted from the text.

Bag of words

In a bag of words approach, individual or small groups of words from the text are used as features. These features are called n-grams and are made by selecting n contiguous words from a given sequence, i.e., selecting one, two or three contiguous words from a text. These are denoted as a unigram, bigram, and trigram (n = 1, 2 and 3) respectively. These features are used by Jindal et al. [21], Li et al. [9] and Fei et al. [22]. However, Fei et al. observed that using n-gram features alone proved inadequate for supervised learning when learners were trained using synthetic fake reviews, since the features being created were not present in real-world fake reviews. An example of the unigram text features extracted from three sample reviews is shown in Table 1. Each word is represented by a “1” if it occurs in a given review and by a “0” otherwise.

  1. Review1: The hotel rooms were so great

  2. Review2: We had a great time at this hotel great stay

  3. Review3: The rooms service is bad

Table 1 Example of text features dataset structure, for reviews 1, 2 and 3
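As a rough illustration of the structure Table 1 describes, the sketch below extracts n-grams and builds a binary presence matrix for Reviews 1 to 3. The helper name `ngrams`, the lowercasing, and the whitespace tokenization are assumptions made for this example, not choices taken from the cited studies.

```python
def ngrams(text, n=1):
    """Return the n-grams of a text as space-joined strings (n contiguous words)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


reviews = {
    "Review1": "The hotel rooms were so great",
    "Review2": "We had a great time at this hotel great stay",
    "Review3": "The rooms service is bad",
}

# Vocabulary: every unigram seen in any review, in sorted order.
vocabulary = sorted({term for text in reviews.values() for term in ngrams(text, 1)})

# Binary presence matrix: 1 if the term occurs in the review, 0 otherwise.
presence = {
    name: [1 if term in set(ngrams(text, 1)) else 0 for term in vocabulary]
    for name, text in reviews.items()
}

print(vocabulary)
for name, row in presence.items():
    print(name, row)
```

Note that “great” appears twice in Review2 but is still encoded as a single 1, which is exactly the information the term frequency representation adds back.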

Term frequency

These features are similar to bag of words but also include term-frequencies. They have been used by Ott et al. [12] and Jindal et al. [8]. The structure of a dataset that uses the term frequencies is shown in Table 2, and is similar to that of the bag of words dataset; however, instead of simply being concerned with the presence or absence of a term, we are concerned with the frequency with which a term occurs in each review, so we include the count of occurrences of a term in the review.

  4. Review4: The hotel rooms were so great, were so comfort

  5. Review5: We had a great time at this hotel great stay

  6. Review6: The rooms service is bad so bad

Table 2 Example of text features frequencies dataset structure, for reviews 4, 5 and 6
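The term frequency structure described for Table 2 can be sketched in the same way for Reviews 4 to 6. The `tokenize` helper and its punctuation stripping are assumptions made for this illustration; the cited studies may have tokenized differently.

```python
from collections import Counter
import string


def tokenize(text):
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    stripped = (word.strip(string.punctuation) for word in text.lower().split())
    return [word for word in stripped if word]


reviews = {
    "Review4": "The hotel rooms were so great, were so comfort",
    "Review5": "We had a great time at this hotel great stay",
    "Review6": "The rooms service is bad so bad",
}

# One Counter per review: term -> number of occurrences in that review.
term_counts = {name: Counter(tokenize(text)) for name, text in reviews.items()}

# Shared vocabulary across all reviews, then one count vector per review.
vocabulary = sorted(set().union(*term_counts.values()))
tf_matrix = {
    name: [counts[term] for term in vocabulary]
    for name, counts in term_counts.items()
}

for name, row in tf_matrix.items():
    print(name, dict(zip(vocabulary, row)))
```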

LIWC output and POS tag frequencies

Linguistic Inquiry and Word Count (LIWC) is a text analysis software tool in which users can “build [their] own dictionaries to analyze dimensions of language specifically relevant to [their] interests.” Part of Speech (POS) tagging involves tagging each word with a part of speech based on its definition and its context within the sentence in which it is found [26]. Ott et al. [3] and Li et al. [25] achieved better results by also including these features than with bag of words alone. Table 3 shows the results from the LIWC program when applied to Review 7. Personal text refers to text associated with personal concerns such as work, home or leisure activities. Formal text refers to text disassociated from personal concerns, consisting of psychological processes, linguistic processes and spoken categories. Review 7 is shown below, followed by the same review with a POS tag attached to each word. Table 4 shows the meaning of each POS tag, while Table 5 presents the frequencies of these tags within the review.

  7. Review7: I like the hotel so much, the hotel rooms were so great, the room service was prompt, I will go back for this hotel next year. I love it so much. I recommend this hotel for all of my friends.

    Review7 (tagged): I_PRP like_VBP the_DT hotel_NN so_RB much_RB ,_, the_DT hotel_NN rooms_NNS were_VBD so_RB great_JJ ,_, the_DT room_NN service_NN was_VBD prompt_JJ ,_, I_PRP will_MD go_VB back_RB for_IN this_DT hotel_NN next_JJ year_NN ._. I_PRP love_VBP it_PRP so_RB much_RB ._. I_PRP recommend_VBP this_DT hotel_NN for_IN all_DT of_IN my_PRP$ friends_NNS ._.

Table 3 LIWC results when applying Review7 text
Table 4 POS tags abbreviation descriptions
Table 5 POS tagging frequencies for Review 7
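POS tag frequencies of the kind Table 5 reports can be counted directly from text that has already been tagged in the `word_TAG` format used above. The sketch below does not run a tagger; it only parses the last two tagged sentences of Review 7, and the function name `pos_tag_frequencies` is a hypothetical choice for this example.

```python
from collections import Counter

tagged_text = (
    "I_PRP love_VBP it_PRP so_RB much_RB ._. "
    "I_PRP recommend_VBP this_DT hotel_NN for_IN all_DT of_IN "
    "my_PRP$ friends_NNS ._."
)


def pos_tag_frequencies(tagged):
    """Count POS tags in whitespace-separated word_TAG tokens."""
    # rsplit on the last underscore so a token such as '._.' yields the tag '.'.
    tags = [token.rsplit("_", 1)[1] for token in tagged.split()]
    return Counter(tags)


frequencies = pos_tag_frequencies(tagged_text)
total = sum(frequencies.values())

for tag, count in frequencies.most_common():
    print(f"{tag:5s} {count:2d} {count / total:.2f}")
```

Relative frequencies (count divided by the number of tokens) are printed alongside raw counts, since reviews differ in length.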


Stylometric

Stylometric features were used by Shojaee et al. [11] and are either character and word-based lexical features or syntactic features. Lexical features give an indication of the types of words and characters that the writer likes to use and include features such as the number of upper case characters or the average word length. Syntactic features try to “represent the writing style of the reviewer” and include features like the amount of punctuation or the number of function words such as “a”, “the”, and “of”.
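A minimal sketch of a few such lexical and syntactic features follows. The feature names and the three-word function word list are illustrative assumptions drawn only from the examples in the text; they are not the feature set of Shojaee et al.

```python
import string

# Tiny illustrative list; a real study would use a much longer function word list.
FUNCTION_WORDS = {"a", "the", "of"}


def stylometric_features(text):
    """Compute a handful of lexical and syntactic features for one review."""
    words = [word.strip(string.punctuation) for word in text.split()]
    words = [word for word in words if word]
    return {
        # Lexical: character- and word-based statistics.
        "upper_case_chars": sum(ch.isupper() for ch in text),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        # Syntactic: punctuation and function word usage.
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "function_word_count": sum(w.lower() in FUNCTION_WORDS for w in words),
    }


features = stylometric_features("The hotel rooms were so great, were so comfort")
print(features)
```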


Semantic

Semantic features deal with the underlying meaning or concepts of the words and are used by Raymond et al. [1] to create semantic language models for detecting untruthful reviews. The rationale is that changing a word like “love” to “like” in a review should not affect the similarity of the reviews since the two words have similar meanings.

Review characteristic

These features contain metadata (information about the reviews) rather than information on the text content of the review and are seen in works by Li et al. [9] and Hammad [24]. These characteristics could be the review’s length, date, time, rating, reviewer id, review id, store id or feedback. An example of review characteristic features is presented in Table 6. Review characteristic features have been shown to be beneficial in review spam detection. Strange or anomalous reviews can be identified using this metadata, and once a reviewer has been identified as writing spam it is easy to label all reviews associated with their reviewer ID as spam. Some of these features may not be available for all sources of reviews, which limits their utility for detecting spam in many data sources.

Table 6 Reviews characteristics dataset structure

Reviewer centric features

As highlighted earlier, identifying spammers can improve detection of fake reviews, since many spammers share profile characteristics and activity patterns. Various combinations of features engineered from reviewer profile characteristics and behavioral patterns have been studied, including work by Jindal et al. [20], Jindal et al. [21], Li et al. [9], Fei et al. [22], Mayzlin et al. [27] and Mukherjee et al. [23]. Examples of reviewer centric features are presented in Table 7 and further elaboration on select features used in Mukherjee et al. [23] along with some of their observations follows:

Table 7 Reviewers characteristics dataset structure

Maximum number of reviews

It was observed that about 75 % of spammers write more than 5 reviews on any given day. Therefore, taking into account the number of reviews a user writes per day can help detect spammers since 90 % of legitimate reviewers never create more than one review on any given day.

Percentage of positive reviews

Approximately 85 % of spammers wrote more than 80 % of their reviews as positive reviews, thus a high percentage of positive reviews might be an indication of an untrustworthy reviewer.

Review length

The average review length may be an important indication of reviewers with questionable intentions since about 80 % of spammers have no reviews longer than 135 words while more than 92 % of reliable reviewers have an average review length of greater than 200 words.

Reviewer deviation

It was observed that spammers’ ratings tend to deviate from the average review rating at a far higher rate than legitimate reviewers, thus identifying user rating deviations may help in detection of dishonest reviewers.

Maximum content similarity

The presence of similar reviews for different products by the same reviewer has been shown to be a strong indication of a spammer. Mukherjee et al. [23] used cosine similarity; however, other more advanced similarity functions based upon word meanings versus the words themselves have shown promise [1].
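The five reviewer centric features above can be sketched for a single reviewer as follows. The record fields (`date`, `rating`, `product_avg`, `text`), the choice of four stars as the threshold for a positive review, and the use of mean absolute difference for rating deviation are all assumptions made for this example; Mukherjee et al. may define these quantities differently.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term count vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def reviewer_features(reviews, positive_threshold=4):
    """Behavioral features computed over all reviews written by one reviewer."""
    n = len(reviews)
    per_day = Counter(r["date"] for r in reviews)
    texts = [r["text"] for r in reviews]
    return {
        "max_reviews_per_day": max(per_day.values()),
        "pct_positive": 100.0 * sum(r["rating"] >= positive_threshold for r in reviews) / n,
        "avg_review_length": sum(len(t.split()) for t in texts) / n,
        "rating_deviation": sum(abs(r["rating"] - r["product_avg"]) for r in reviews) / n,
        "max_content_similarity": max(
            (cosine_similarity(x, y) for x, y in combinations(texts, 2)), default=0.0
        ),
    }


# Made-up records for one hypothetical reviewer.
reviews = [
    {"date": "2014-05-01", "rating": 5, "product_avg": 3.0, "text": "great hotel great stay"},
    {"date": "2014-05-01", "rating": 5, "product_avg": 4.0, "text": "great hotel great stay"},
    {"date": "2014-05-02", "rating": 1, "product_avg": 4.0, "text": "the rooms service is bad"},
]
features = reviewer_features(reviews)
print(features)
```

A similarity function based on word meanings, as suggested in [1], could be swapped in for `cosine_similarity` without changing the rest of the sketch.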

Review centric review spam detection

Review centric review spam detection is the most common form of review spam detection, which uses machine learning techniques to build models using the content and metadata of the reviews. Supervised learning refers to the task of learning from labeled data and is the most prevalent method used for review spam detection in the literature. Unfortunately, this method requires labeled data in order to train a classifier, presenting the challenge of needing methods to procure and accurately label a sufficient amount of data, which can be problematic in the field of review spam detection. Conversely, unsupervised learning uses unlabeled data to find unseen relationships between instances independent of a class attribute. An example of unsupervised learning is clustering, which is able to group instances of unlabeled data based upon some type of similarity function. Semi-supervised learning is a combination of the two and uses a few labeled instances in combination with a large number of unlabeled instances to train a classifier and has shown promise in the area of review spam detection. These methods are summarized in Table 8 and the following subsections outline research conducted using these different types of learning in the domain of review spam detection.

Table 8 Types of machine learning techniques

Supervised learning

Supervised learning can be used to detect review spam by looking at it as the classification problem of separating reviews into two classes: spam and non-spam reviews. To the best of our knowledge, the first researchers to have studied deceptive opinion spam using supervised learning were Jindal et al. [21]. They discuss the evolution of opinion mining, which had primarily focused on extracting or summarizing the opinions from text by using Natural Language Processing (NLP). Prior to their contribution, the content characteristics of the text that might indicate abnormal activities, such as creating review spam, had not been addressed. In an effort to investigate opinion spam in reviews and devise techniques for review spam detection, Jindal et al. collected 5.8 million reviews of products on Amazon generated by 2.14 million users. The authors categorized the reviews of class spam into three types: untruthful opinion, reviews on brand only, and non-review (labeled types 1, 2 and 3 respectively). They started by finding the near duplicate reviews, which they defined as reviews whose 2-grams have a Jaccard similarity score of over 90 %. This was done using a method known as w-shingling. An alternate method for detecting near duplicates using Semantic Language Models (SLM) was developed by Raymond et al. [35], to learn from a few positive examples and a set of unlabeled data. Montes-y-Gómez and Rosso adapt this approach for review spam detection in their work “Using PU-Learning to Detect Deceptive Opinion Spam” [36]. PU-learning is an iterative method which tries to identify a set of reliably negative instances in the unlabeled data. The model is trained and evaluated using all of the unlabeled data as the negative class and any instances that are classified as positive are removed. The process is repeated until some stop criterion is reached. For evaluation purposes, the dataset generated by Ott et al. [3] was used and the performance was evaluated using F-Measure. Classifiers were trained using both Naïve Bayes and SVM as learners. PU-learning achieved an F-measure of 83.7 % with NB, using only 100 positive examples. While this is better than the results achieved using 6000 labeled instances and co-training by Li F. et al. [9], it is difficult to make a conclusive statement as the methods use different datasets and, as previously discussed, the dataset created by Ott et al. may not provide an accurate indication of real world performance.
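The near duplicate criterion mentioned earlier in this section (Jaccard similarity over 2-gram shingles) can be sketched as follows. The function names, the whitespace tokenization and the handling of empty texts are assumptions made for this example rather than details taken from Jindal et al.

```python
def shingles(text, w=2):
    """Return the set of w-shingles (tuples of w contiguous words) in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def jaccard(set_a, set_b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two texts with no shingles
    return len(set_a & set_b) / len(set_a | set_b)


def is_near_duplicate(text_a, text_b, threshold=0.9, w=2):
    """Flag two reviews as near duplicates if their shingle sets overlap enough."""
    return jaccard(shingles(text_a, w), shingles(text_b, w)) > threshold


a = "The hotel rooms were so great"
b = "The hotel rooms were so bad"
print(jaccard(shingles(a), shingles(b)))   # 4 shared shingles out of 6 distinct
print(is_near_duplicate(a, a), is_near_duplicate(a, b))
```

Comparing every pair of reviews this way is quadratic in the number of reviews, so a collection of millions of reviews would need an approximate scheme on top of this basic definition.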

Although there is little research in the area of using semi-supervised learning for review spam detection, results obtained using this approach are promising and with additional research, may yield better performance than supervised learning while reducing the need to generate large labeled datasets.
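The iterative PU-learning loop described above can be sketched with a very small multinomial Naive Bayes classifier. Everything here is an illustrative assumption: the function names, the stopping rule (stop when no instance is removed, when removal would empty the negative set, or after `max_iterations`), and the toy documents. It is not the implementation evaluated in [36].

```python
from collections import Counter
from math import log


def train_nb(pos_docs, neg_docs):
    """Train a tiny multinomial Naive Bayes; returns doc -> True if classified positive."""
    vocab = {term for doc in pos_docs + neg_docs for term in doc}
    counts = {
        True: Counter(term for doc in pos_docs for term in doc),
        False: Counter(term for doc in neg_docs for term in doc),
    }
    totals = {label: sum(counts[label].values()) for label in counts}
    n_docs = len(pos_docs) + len(neg_docs)
    priors = {True: log(len(pos_docs) / n_docs), False: log(len(neg_docs) / n_docs)}

    def log_score(doc, label):
        # Laplace-smoothed log-likelihood; terms outside the vocabulary are ignored.
        return priors[label] + sum(
            log((counts[label][term] + 1) / (totals[label] + len(vocab)))
            for term in doc if term in vocab
        )

    return lambda doc: log_score(doc, True) > log_score(doc, False)


def reliable_negatives(positives, unlabeled, max_iterations=10):
    """Iteratively remove unlabeled instances that the classifier labels positive."""
    negatives = list(unlabeled)
    for _ in range(max_iterations):
        is_positive = train_nb(positives, negatives)
        kept = [doc for doc in negatives if not is_positive(doc)]
        # Stop if nothing was removed, or if removal would leave no negatives.
        if not kept or len(kept) == len(negatives):
            break
        negatives = kept
    return negatives


# Toy tokenized documents: positives are known spam, the rest are unlabeled.
positives = [["best", "hotel", "ever"], ["amazing", "best", "stay"]]
unlabeled = [
    ["best", "stay", "ever"],
    ["rooms", "were", "small"],
    ["parking", "was", "secure"],
    ["breakfast", "was", "good"],
]
negatives = reliable_negatives(positives, unlabeled)
print(negatives)
```

The instances that survive the loop are treated as reliably negative and can then be used, together with the labeled positives, to train the final classifier.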

Reviewer centric review spam detection

We mentioned earlier that recognizing reviewers who are writing fake reviews is important in the effort to detect review spam. Using reviewer centric features in combination with review centric features may be preferred over a review centric only approach for spam detection. Additionally, gathering behavioral evidence of spammers is easier than identifying review spam [37].

A thorough study of supervised learning approaches for deceptive review detection was conducted by Mukherjee et al. [23]. They studied how well existing research methods work for detecting real-world fake reviews on a commercial website. The authors tested their models both on the Amazon Mechanical Turk (AMT) synthetic fake reviews dataset and on a real-world fake reviews dataset procured from Yelp. In this study, they found similar results to previous studies, confirming that using n-gram features performs well on the AMT dataset; however, when used with the real world Yelp dataset it performed significantly worse. They observed that using behavioral features yields higher performance than linguistic features alone on the real world Yelp dataset. Three different feature sets were used in the experiment: LIWC, POS and bigrams. In addition, feature selection using Information Gain (IG) was applied to select the top 1 and 2 % of features. One of the main conclusions of the study was that the synthetic reviews are not necessarily representative of what is found in real world review spam. Additionally, they observed that using the abnormal behavioral features (i.e., higher percentage of positive reviews, high number of reviews, average review length, etc.) yields better results than the n-gram features in these more realistic datasets. A 5-fold cross validation experiment with an SVM classifier using bigram and POS features resulted in an accuracy of 68.1 % for the real-world fake reviews. This is far lower than the 90 % reported by Ott et al. when evaluating their model on synthetic data. From this, it appears that using AMT, one cannot effectively generate fake reviews consistent with real-world fake reviews, or at least consistent with the types of reviews that Yelp filters. The addition of behavioral features increases their accuracy to 86.1 % on Yelp's filtered reviews dataset. Feature selection was found to offer no improvement to classification performance, and actually decreased performance slightly; however, only a single combination of feature selection technique, learner and performance metric was considered.
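For readers unfamiliar with Information Gain based feature selection, the sketch below scores binary features by the reduction in class entropy they provide and keeps a top fraction of them. The function names and the toy data are assumptions made for this example, and the code is not taken from Mukherjee et al.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        result -= p * log2(p)
    return result


def information_gain(feature_values, labels):
    """Class entropy minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for feat, lab in zip(feature_values, labels) if feat == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def select_top_features(matrix, labels, feature_names, fraction=0.02):
    """Keep the top `fraction` of features ranked by information gain (at least one)."""
    scores = {
        name: information_gain([row[j] for row in matrix], labels)
        for j, name in enumerate(feature_names)
    }
    k = max(1, int(len(feature_names) * fraction))
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: rows are reviews, columns are binary presence features.
feature_names = ["best", "hotel"]
matrix = [[1, 1], [1, 0], [0, 1], [0, 0]]
labels = ["spam", "spam", "real", "real"]
print(select_top_features(matrix, labels, feature_names, fraction=0.5))
```

In this toy example “best” separates the classes perfectly (gain of 1 bit) while “hotel” carries no information about the class (gain of 0).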

In a later study, Mukherjee et al. [14] confirmed that the writers of review spam have different behaviors than truthful reviewers in a set of Amazon reviews as well. Jindal et al. [8] also studied the impact of reviewer centric features on review spam detection. They identified unusual review patterns and reviewer behaviors that were highly correlated with spam review activity. They found unexpected rules and rule groups using Class Association Rules (CAR), which proposes unexpectedness measures after a set of expectations has been defined. These unexpected rules and rule groups represent the unusual behaviors of spam reviewers, which in turn allow for identification of review spam activity. This technique itself is generic and can be applied to solve a variety of problems due to its domain independence.

A novel technique for detecting review spammers was proposed by Fei et al. [22], where they exploit the “bursty” nature of reviews generated by spammers to identify review spam. Bursty reviews are reviews that suddenly become popular and receive great attention from reviewers within a certain time period or certain area. The reviews and reviewers in those situations become suspected of being review spam and review spammers respectively. For burst detection, the authors used Kernel Density Estimation (KDE). KDE is a technique closely related to histograms, which has attributes that allow it to asymptotically converge to any density function. Behavioral features for spammers were created that combined the spammers’ behaviors with the features of review bursts. In addition, these features can be used in conjunction with review spam features in a hybrid approach to improve the classification results. The features listed below are examples of the features used in this study.

Ratio of Amazon Verified Purchase (RAVP)

This feature is the number of the Amazon verified purchases divided by the number of total reviews written by this user. Because verified purchase reviews most likely reflect a genuine review, a reviewer with a higher RAVP is considered more trustworthy.

Rating Deviation (RD)

This feature measures the average deviation of a reviewer’s reviews. Since the expected behavior of a reviewer is to give similar ratings as other users gave for the same product, spammers may exhibit a higher divergence in their rating behavior.

Burst Review Ratio (BRR)

This value is computed as the ratio of a reviewer’s reviews that occur in bursts to the total number of reviews that he/she wrote.

Review Content Similarity (RCS)

The average pairwise cosine similarity of all of a reviewer’s reviews. Higher scores may be an indication of a possible spammer.

Reviewer Burstiness (RB)

This measures the number of reviews that occur in both the reviewer’s and product’s bursts. The more that this occurs, the more likely the reviewer is a spammer.
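The sketch below illustrates how burst days might be found with a Gaussian KDE and how RAVP, RD and BRR could then be computed for one reviewer. The thresholding rule (density above `factor` times a uniform baseline), the bandwidth, the record fields and the use of mean absolute difference for RD are all assumptions made for this example; Fei et al. may detect bursts and define these features differently.

```python
from math import exp, pi, sqrt


def gaussian_kde(days, x, bandwidth=1.0):
    """Gaussian kernel density estimate at day x, given the days reviews were posted."""
    total = sum(exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in days)
    return total / (len(days) * bandwidth * sqrt(2 * pi))


def burst_days(product_review_days, factor=2.0, bandwidth=1.0):
    """Days whose estimated density exceeds `factor` times a uniform baseline.

    The thresholding rule and `factor` are assumptions made for this sketch.
    """
    first, last = min(product_review_days), max(product_review_days)
    baseline = 1.0 / (last - first + 1)
    return {
        day for day in range(first, last + 1)
        if gaussian_kde(product_review_days, day, bandwidth) > factor * baseline
    }


def reviewer_burst_features(reviews, bursts_by_product):
    """RAVP, RD and BRR for one reviewer, given per-product burst days."""
    n = len(reviews)
    return {
        "RAVP": sum(r["verified"] for r in reviews) / n,
        "RD": sum(abs(r["rating"] - r["product_avg"]) for r in reviews) / n,
        "BRR": sum(r["day"] in bursts_by_product[r["product"]] for r in reviews) / n,
    }


# Made-up data: days (as integers) on which a hypothetical product received reviews.
product_days = {"hotel_a": [1, 10, 10, 10, 11, 11, 20]}
bursts = {product: burst_days(days) for product, days in product_days.items()}

reviewer = [
    {"product": "hotel_a", "day": 10, "rating": 5, "product_avg": 3.0, "verified": False},
    {"product": "hotel_a", "day": 11, "rating": 5, "product_avg": 3.0, "verified": False},
    {"product": "hotel_a", "day": 20, "rating": 4, "product_avg": 3.0, "verified": True},
]
features = reviewer_burst_features(reviewer, bursts)
print(sorted(bursts["hotel_a"]), features)
```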

A Markov Random Field (MRF) model combined with a Loopy Belief Propagation algorithm was used to identify the spam reviewers in their proposed model. The dataset produced by Jindal et al. [21] was used for training and evaluation. Unigram features were used with SVM to classify the reviews for evaluation purposes, but were not used in the main model. Using only reviewer centric features, Fei et al. achieved an F-score of 75.4 % for burst reviews, and 68.7 % for all reviews. Earlier results by Jindal et al. [21] indicate similar performance can be achieved using text based features; however, work by Mukherjee et al. [14] shows that classifiers benefit from using both review centric and reviewer centric features.

Comparative analysis and suggestions

When developing a new review spam detection framework, it is important to understand what approaches and techniques have been used in prior studies. In previous sections, we presented an overview of machine learning techniques that have been used in the review spam domain and some of the important results of these studies. As this domain is young, relatively few studies on machine learning techniques and review spam detection have been conducted.

Based on our survey, most of the previous studies have focused on supervised learning techniques. However, in order to use supervised learning, one must have a labeled dataset, which can be difficult (if not impossible) to acquire in the area of review spam. From the literature we discussed, it can be observed that most of the available datasets used in the previous studies are synthetically created, most likely due to the lack of review spam examples and the difficulty of labeling them [19]. Building and evaluating classifiers based on these synthetic datasets can be problematic, as it has been observed that they are not necessarily representative of real world review spam. For example, when using the same framework to evaluate the artificial AMT dataset used in [3, 12, 25] and Yelp’s filtered reviews dataset, the extracted features and results differed greatly, especially when using n-gram text features [23]. Comparing classification performance across these datasets shows that the classifier achieved an accuracy of 87 % on the synthetic review dataset, but only 65 % on Yelp’s reviews. This drop of 22 percentage points in accuracy implies that synthetically created reviews have different distinguishing features than real-life fake reviews, and that the reviews produced by AMT do not accurately reflect real world spam reviews.

Feature engineering can have a significant impact on classifier performance. Different studies have used the same datasets, learners, and performance metrics but achieved different results due to different feature engineering methods; examples include [3] and [25], and [23] and [11]. Table 8 reports the performance for some of the studies discussed in this paper and the types of features used to achieve it. In studying the various sets of features used in the literature, one of the most notable conclusions is that performance increases through combining multiple types of features, and that using the most relevant and expressive features can make a predictive model more robust [25]. Jindal et al. [21] found that adding further features (both review centric and reviewer centric) to text features improved performance. It can also be observed in Table 8 that augmenting bigrams with LIWC yields a small performance improvement [3]. Several experiments used the same dataset (built by Ott et al. [3] using AMT) and show that for this dataset, the highest performance is achieved using bigrams and LIWC [3, 11, 12]. As other studies use unique datasets, or datasets that have been altered in some way, it is difficult to directly compare their results.

Although there are a large number of machine learning algorithms (learners) available, current research using supervised learning methods has been, for the most part, limited to three learners: Logistic Regression (LR), Naïve Bayes (NB) and Support Vector Machine (SVM). While SVM generally offered the best performance, it is occasionally beaten by NB or LR and has not been compared against many other available learners; thus it cannot be considered the best learner. The best learner found by each study is shown in Table 9, but this should not be considered conclusive, as the experiments did not thoroughly study multiple learners. Future research should test multiple learners across multiple datasets using many different feature engineering methods.

Table 9 Comparison of previous works and results for review spam detection along with the relative complexity of the approach (including feature extraction and learning methodology)

To the best of our knowledge, methods and tools for learning from Big Data have not been used in the literature even though real world datasets of only a single site (such as Trip Advisor) can contain upwards of 200 million reviews. New reviews are constantly being added to large repositories of reviews across various websites at a high rate, over 1.5 million per month in the case of Yelp. Consequently, distributed and streaming applications of machine learning algorithms across these datasets are of interest as traditional machine learning tools, such as R or Weka, cannot scale to datasets of this size. Tools such as Mahout, Spark (MLlib), H2O, and SAMOA should be explored to effectively model the large corpus of online reviews which exist in the real world [38]. Mahout has been used for large-scale recommendation systems [39], which would be useful to apply to review spam detection, as reviewers may be related to each other on different review websites. MLlib and SAMOA can perform large-scale online learning, where machine learning models are trained and tuned as new data flows in. This is especially desirable in the field of review spam detection, as reviews are constantly being added to the corpus. SAMOA has been used to analyze live Twitter streams [40], which involves similar text processing that can be applied to online reviews.
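The idea of online learning, in which a model is updated as each new review arrives rather than retrained from scratch, can be illustrated with a small single-machine sketch. The class below is an incrementally updated multinomial Naïve Bayes written in plain Python; it is not the API of MLlib, SAMOA, or any other tool named above, and the class name, whitespace tokenization, and smoothing constant `alpha` are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class OnlineNaiveBayes:
    """Multinomial Naive Bayes whose counts are updated one labeled review at a time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing constant
        self.class_counts = Counter()             # number of reviews seen per class
        self.token_counts = defaultdict(Counter)  # token frequencies per class
        self.vocab = set()

    def update(self, text, label):
        """Fold a single labeled review into the model."""
        tokens = text.lower().split()
        self.class_counts[label] += 1
        self.token_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, text):
        """Return the most probable class, or None if nothing has been seen yet."""
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            score += sum(math.log((counts[t] + self.alpha) / denom) for t in tokens)
            if score > best_score:
                best, best_score = label, score
        return best
```

Because the model stores only counts, each update costs time proportional to the length of the new review, which is the property that makes this style of learner attractive for continuously growing review corpora.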

Current research has largely ignored feature selection techniques in their experiments, even when using text features, which can potentially lead to highly dimensional feature sets. The experiment by Mukherjee et al. [23] is a notable exception, as they used Information Gain (IG) to select the top 1 % and 2 % of features. Though they found this had no impact on classifier performance, we believe that using feature selection techniques can potentially improve performance based on results from other domains. Feature selection also has the benefit of reducing the computational costs associated with training a classifier. This is highly desirable as review spam detection is a big data domain and datasets may have a very large number of instances and features. In order to ascertain the impact of feature selection, additional techniques should be tested while considering different features, feature subset sizes and datasets.
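As a rough illustration of filter-based feature selection with Information Gain, the sketch below ranks terms by the IG of the binary feature "term appears in the review" and keeps a chosen fraction of them. Representing each review as a set of tokens, and the function names, are assumptions made for this example; it is not the setup used by Mukherjee et al. [23].

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG of the binary feature 'term is present', where each doc is a set of tokens."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    # Conditional entropy of the class given presence or absence of the term.
    conditional = sum(len(part) / n * entropy(part) for part in (with_term, without) if part)
    return entropy(labels) - conditional


def select_top_terms(docs, labels, fraction):
    """Keep the given fraction of the vocabulary with the highest IG (at least one term)."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]
```

A call such as `select_top_terms(docs, labels, 0.01)` would correspond to keeping the top 1 % of terms; terms that appear in every review carry no information about the class and are ranked last.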

In addition, current research has ignored the use of ensemble learning techniques, such as Bagging or Boosting, to obtain better predictive performance than using the traditional learning algorithms. These techniques are especially useful for improving performance on noisy or imbalanced data [41, 42]. Noisy data is data with inaccuracies, or “noise”, in either the features or class attributes. For example, training data may contain review spam instances that have been mislabeled as true reviews or vice versa [43]. As classification performance on synthetic review datasets has been shown to be a poor indicator of performance on real world data, it is beneficial to use real world data. Unfortunately, it is difficult to accurately label training data. As seen in the study by Ott et al. [3], human judges have difficulty in accurately discriminating between, and thus labeling, spam and non-spam reviews. It is likely any labeled training data from real world sources would contain mislabeled instances. Due to this, ensemble techniques could be highly beneficial in this domain to mitigate the negative impact of noisy data.
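The mechanics of Bagging can be shown with a deliberately simple sketch: each base model is trained on a bootstrap sample of the (possibly mislabeled) training data and the ensemble predicts by majority vote. The base learner here is a one-feature threshold rule, and the feature (a single numeric score per review, such as a content similarity score), the class names, and the ensemble size are all assumptions made for illustration rather than a recommended configuration.

```python
import random
from collections import Counter


def train_stump(X, y):
    """Fit a one-feature rule: predict 'spam' when x >= threshold, else 'truthful'."""
    best_t, best_err = None, None
    for t in sorted(set(X)):
        err = sum(1 for x, label in zip(X, y)
                  if ("spam" if x >= t else "truthful") != label)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def bagging_fit(X, y, n_models=25, seed=0):
    """Train n_models stumps, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Majority vote over the thresholds learned by the individual stumps."""
    votes = Counter("spam" if x >= t else "truthful" for t in models)
    return votes.most_common(1)[0][0]
```

Because a mislabeled instance appears in only some of the bootstrap samples, and with varying multiplicity, its influence on the vote is diluted; whether this translates into gains on real review data remains an empirical question.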

Finally, there are a massive number of online reviews, and fake reviews are usually less frequent than truthful ones, resulting in highly imbalanced datasets [44]. Class imbalance can adversely affect classifier performance as the majority class may be favored, and must be taken into consideration when training a model. Two works have considered the class imbalance problem in this domain, [24] and [44]. Both used random undersampling and random oversampling to overcome imbalanced distributions and have promising but inconclusive results. Ensemble techniques can be used alongside, or in place of, data sampling as they have been shown to be more robust to the effects of class imbalance than single classifiers [41], but have yet to be used to address imbalanced data in this domain. Future work should include further investigation of the role of class imbalance in review spam data as well as mitigating its effects using ensemble learners and sampling techniques.
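Random undersampling and random oversampling are simple enough to sketch directly. In the hypothetical `resample` function below, undersampling discards randomly chosen instances until every class matches the size of the smallest class, while oversampling duplicates randomly chosen instances until every class matches the largest. The function name, the `mode` argument, and the choice to balance classes to an exact 1:1 ratio are assumptions for illustration; the studies cited above may have used other target ratios.

```python
import random
from collections import defaultdict


def resample(instances, labels, mode="under", seed=0):
    """Return a shuffled list of (instance, label) pairs with balanced classes.

    mode="under": shrink every class to the size of the smallest class.
    mode="over":  grow every class to the size of the largest class by duplication.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(instances, labels):
        by_class[y].append(x)
    sizes = [len(xs) for xs in by_class.values()]
    target = min(sizes) if mode == "under" else max(sizes)
    balanced = []
    for y, xs in by_class.items():
        if len(xs) >= target:
            chosen = rng.sample(xs, target)  # draw without replacement
        else:
            # Keep every original instance and add random duplicates.
            chosen = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in chosen)
    rng.shuffle(balanced)
    return balanced
```

Sampling should be applied to the training split only, so that evaluation still reflects the original class distribution.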


In recent years, review spam detection has received significant attention in both business and academia due to the potential impact fake reviews can have on consumer behavior and purchasing decisions. This survey covers machine learning techniques and approaches that have been proposed for the detection of online spam reviews. Supervised learning is the most frequent machine learning approach for performing review spam detection; however, obtaining labeled reviews for training is difficult and manual identification of fake reviews has poor accuracy. This has led to many experiments using synthetic or small datasets. Features extracted from review text (e.g., bag of words, POS tags) are often used to train spam detection classifiers. An alternative approach is to extract features related to the metadata of the review, or features associated with the behavior of users who write the reviews. Disparities in the performance of classifiers on different datasets suggest that review spam detection may benefit from additional cross domain experiments to help develop more robust classifiers. Multiple experiments have shown that incorporating multiple types of features can result in higher classifier performance than using any single type of feature.

One of the most notable observations of current research is that experiments should use real world data if possible. Despite being used in many studies, synthetic or artificially generated datasets have been shown to give a poor indication of performance on real world data [23]. As it is difficult to procure accurately labeled real-world datasets, unsupervised and semi-supervised methods are of interest. While unsupervised and semi-supervised methods are currently unable to match the performance of supervised learning methods, research is limited and results are inconclusive, warranting further investigation. A possibility for a less labor-intensive means of generating labeled training data is to find and label duplicate reviews as spam. Multiple studies have shown duplication, or near duplication, of review content is a strong indicator of review spam. Another data related concern is that real world data may be highly class imbalanced, as there are currently many more truthful than fake reviews online. This could be addressed through data sampling and ensemble learning techniques. A final concern related to quality of data is the presence of noise, particularly class noise due to mislabeled instances. Ensemble methods, and experiments with different levels of class noise, could be used to evaluate the impact of noise on performance and how its effects may be reduced.
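The suggestion of labeling duplicate or near-duplicate reviews as spam can be sketched with a simple similarity check. The example below compares reviews by the Jaccard similarity of their word bigram sets and reports pairs above a threshold. The shingle size, the whitespace tokenization, and the default threshold of 0.9 are assumptions for illustration, and the quadratic pairwise comparison would need to be replaced by an approximate method for corpora of realistic size.

```python
from itertools import combinations


def shingles(text, k=2):
    """Return the set of word k-grams (shingles) in a review."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(reviews, threshold=0.9, k=2):
    """Return index pairs (i, j), i < j, whose shingle sets are at least `threshold` similar."""
    sets = [shingles(r, k) for r in reviews]
    return [(i, j) for i, j in combinations(range(len(reviews)), 2)
            if jaccard(sets[i], sets[j]) >= threshold]
```

Reviews that appear in a returned pair could be given a provisional spam label for training, with the caveat that such labels capture only one kind of spam and may themselves be noisy.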

The studies discussed in this paper have primarily focused on feature engineering, but which combination of features is best remains unclear. Research by Jindal et al. [20, 21] shows that the addition of reviewer centric features yields higher classifier performance than the use of n-gram features alone, and other experiments support this conclusion [3, 9, 23]. The best observed performance was achieved by combining text and non-text features. Reviewer centric features have also been demonstrated to be important for accurate detection of review spam as seen in [9, 20, 21, 23, 24]. Despite many studies focusing on feature engineering, it is not possible to identify the best types of features since the experiments make use of different datasets; however, it has been shown that there is no silver bullet for review spam detection and multiple types of features are needed. Future work should evaluate different feature engineering methods across multiple datasets to determine which types of features are most useful for online review spam detection.

As review text is an important source of information and tens of thousands of text features can easily be generated based on this text, high dimensionality can be an issue. Additionally, millions of reviews are available to be used to train classifiers, and training classifiers from a large, highly dimensional dataset is computationally expensive and potentially impractical. Despite this, feature selection techniques have received little attention. Many experiments have avoided this issue by extracting only a small number of features, avoiding the use of n-grams, or limiting the number of features through alternative means such as using term frequencies to determine which n-grams are included as features. Further work needs to be conducted to establish how many features are required and what types of features are the most beneficial. Feature selection should not be considered optional when training a classifier in a big data domain with potential for high feature dimensionality. Additionally, we could find no studies that incorporated distributed or streaming implementations for learning from Big Data into their spam detection frameworks.