Introduction

There is a spectrum of techniques for using categorical variables in neural networks, and finding the right technique can have a significant impact on a model’s performance. At one end of the spectrum, we have determined encoding techniques. Determined techniques are the simplest: given the same dataset, a determined encoding produces the same encoding every time a practitioner employs it, and determined techniques have low running time complexity. In the middle of the spectrum, we find algorithmic techniques, such as Latent Dirichlet Allocation. Algorithmic techniques may or may not have deterministic outcomes, but we place them in a class separate from determined techniques because they are more complex in terms of running time. At the other end of the spectrum, we have automatic techniques, where neural networks dynamically generate their own data representations as a part of their training phase. This is the key difference between automatic and algorithmic techniques: automatic techniques encode categorical data as a side effect of a machine learning task, whereas algorithmic techniques encode categorical data before the machine learning task begins. In practice, we find encoding techniques are blended, as we depict in Fig. 1.

Fig. 1 Encoding techniques are determined, algorithmic, or automatic. Most works we review employ some blend of encoding techniques; compositions of up to all three varieties are possible. After selecting a method for representing data, practitioners can transfer it to another algorithm or use it directly in the next stage.

In this survey, we aim to give the reader a solid understanding of current methods for applying neural networks to qualitative data; therefore, we describe the three primary classes of techniques in detail. One thing all three techniques have in common is that one must represent categorical data with numbers in order to use it in a neural network. Determined techniques convert categorical data to vectors with low computational complexity. For example, the determined Label encoding technique simply replaces each categorical value with a distinct scalar numerical value. Algorithmic techniques are more sophisticated than determined techniques and require more intensive computation. They are not directly incorporated into the training of a neural network, but rather take the form of more involved pre-processing of data. Automatic techniques employ machine learning algorithms to discover a data representation; hence, they have an intrinsic dependence on the dataset the practitioner uses. In the literature, we find that authors sometimes employ techniques separately and sometimes compose them. Here we use the term “composition” in the mathematical sense of the word; i.e., to compose functions f and g is to apply f to the output of g, often denoted \(f \circ g\). Determined techniques are suitable for big data because they have low running time complexity. Algorithmic and automatic techniques require more resources to encode qualitative data for use in neural networks; however, researchers may be able to leverage transfer learning to sidestep having to generate encodings for their qualitative data.
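As a minimal illustration of a determined technique, the sketch below (plain Python, operating on a hypothetical toy column) assigns each distinct category a scalar integer, which is the essence of the Label encoding technique mentioned above.

```python
# Minimal sketch of Label encoding: replace each category with a distinct integer.
# The column values here are hypothetical toy data.
colors = ["red", "green", "blue", "green", "red"]

# Build the mapping from category to a distinct scalar value.
label_map = {category: index for index, category in enumerate(sorted(set(colors)))}

# Apply the mapping to produce a numerical column.
encoded = [label_map[value] for value in colors]

print(label_map)  # {'blue': 0, 'green': 1, 'red': 2}
print(encoded)    # [2, 1, 0, 1, 2]
```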

By definition, neural networks operate on vectors of real numbers [1]. Therefore, in order to apply neural networks to qualitative data, we must find a way to transform the categorical data into vectors of real numbers. One must begin with what we call a determined technique: a technique that transforms categorical values to vectors with minimal processing. The most common determined technique we find in this review is One-hot encoding. We may choose to use the result of a determined technique as the input for a neural network, or only as the first step for an algorithmic or automatic technique. Both algorithmic and automatic techniques may exploit dataset labels in the computations they perform to convert qualitative data to numerical form, and both are more computationally complex than determined techniques. What separates algorithmic from automatic techniques is that practitioners must apply algorithmic techniques before initiating the learning process in neural networks. With automatic techniques, one need only convert data with a simple determined technique, like One-hot encoding, and the neural network itself will learn a dense encoding of the categorical data. Latent Dirichlet Allocation (\(\text {LDA}\)) [2] is an example of an algorithmic technique, and entity embedding is an example of an automatic technique. Throughout this survey, we aim to show that researchers often employ combinations of all three techniques. If one aims to use qualitative data in neural networks, one should be open to using a combination of techniques. If the reader plans on extending another researcher’s work, understanding which sorts of encoding techniques are used is key to understanding overall research results.
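To make the contrast concrete, the sketch below (PyTorch, with hypothetical dimensions and data) shows the automatic end of the spectrum: categories are first One-hot encoded, a determined step, and a trainable linear layer then produces a dense representation whose weights the network learns as a side effect of training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical setting: 4 distinct categories embedded into 2 dimensions.
num_categories, embedding_dim = 4, 2

# Determined step: One-hot encode a batch of category indices.
indices = torch.tensor([0, 2, 3, 1])
one_hot = F.one_hot(indices, num_classes=num_categories).float()

# Automatic step: a linear layer applied to one-hot vectors acts as an
# embedding lookup; its weights are updated during network training.
to_dense = nn.Linear(num_categories, embedding_dim, bias=False)
dense_vectors = to_dense(one_hot)
print(dense_vectors.shape)  # torch.Size([4, 2])
```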

To help the reader understand the spectrum of techniques, in the "Definitions" section we give definitions for some terms used in this work. We then present our search methodology and related works. In addition, we provide a table of the works studied that summarizes them and includes the encoding technique category (determined, automatic, or algorithmic) for each work. After that, we proceed with an analysis of techniques. Our research for this study reveals that there is a spectrum of techniques. We separate works into sections according to the most notable encoding technique each employs; however, many works exhibit composition of techniques. Finally, we come to our conclusions on which encoding techniques show the most promise for future research.

Definitions

One term central to this paper is “deep learning algorithm”. Here, deep learning algorithms and feedforward networks are synonyms. We rely on Goodfellow et al.’s definition of the term “feedforward network” in [1]: “A feedforward network defines a mapping \({\mathbf {y}}=f\left( {\mathbf {x}} ; \theta \right)\) and learns the values of the parameters \(\theta\) that result in the best function approximation”. Hence, the term “learning” in what we call a deep learning algorithm. A deep learning algorithm is an algorithm that employs a composition of functions where each function equates conceptually to a layer of a \(k\)-partite directed graph. For further clarification on what a deep learning algorithm is, we refer the reader to chapter 6 of [1]. We use the terms “deep learning algorithm”, “neural network”, “feedforward network”, and “deep neural network” interchangeably.
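As a concrete counterpart to this definition, the sketch below (PyTorch, with hypothetical layer sizes) builds a feedforward network \({\mathbf {y}}=f({\mathbf {x}};\theta)\) as a composition of per-layer functions whose parameters \(\theta\) are learned from data.

```python
import torch
import torch.nn as nn

# A feedforward network y = f(x; theta) expressed as a composition of
# functions; each Linear/ReLU pair corresponds to one layer of the graph.
model = nn.Sequential(
    nn.Linear(10, 16),  # first layer
    nn.ReLU(),
    nn.Linear(16, 1),   # second layer
)

x = torch.randn(8, 10)  # a batch of 8 hypothetical input vectors
y = model(x)            # apply the composed layer functions
print(y.shape)          # torch.Size([8, 1])
```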

The next important term in this work is, “categorical data”. We use Lacey’s definition of categorical data, “Categorical variables represent types of data which may be divided into groups. Examples of categorical variables are race, sex, age group, and educational level [4]”. Furthermore, we use the terms “categorical data” and “qualitative data” interchangeably.

We find a more specific definition of categorical variables in Lane [5]. Lane writes about scales of measurement. He defines nominal scales, ordinal scales, interval scales, and ratio scales of measurement. Our notion of categorical variable includes variables that we measure using nominal or ordinal scales according to Lane’s definition of nominal and ordinal scales. A nominal variable is a variable whose values we cannot order. For example, the political party a person belongs to is a nominally scaled random variable. It does not make sense to say that Democrat comes before Republican. Variables we measure with ordinal scales have order, but we cannot associate a number with them. Lane gives the example of the variable “customer satisfaction” that may take levels such as “very dissatisfied” or “somewhat dissatisfied”.
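As a small illustration (pandas, with hypothetical values), a nominal variable can be stored as an unordered categorical, while an ordinal variable such as customer satisfaction carries an explicit order but no numeric distance between levels.

```python
import pandas as pd

# Nominal: no meaningful order among the levels.
party = pd.Categorical(["Democrat", "Republican", "Democrat"], ordered=False)

# Ordinal: the levels are ordered, but no number is attached to them.
satisfaction = pd.Categorical(
    ["somewhat dissatisfied", "very dissatisfied"],
    categories=["very dissatisfied", "somewhat dissatisfied"],
    ordered=True,
)

print(party.ordered)       # False
print(satisfaction.min())  # 'very dissatisfied'
```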

In the works we study here, we find techniques for using categorical data in deep learning algorithms. We find works using several terms to refer to these techniques; examples include “entity embeddings” and “distributed representations”. All of these terms refer to techniques for mapping categorical values to vectors \(v \in {\mathbb {R}}^n\) for the purpose of using those vectors as input to a deep learning algorithm.

We find that researchers have not reached a consensus on what to call what we refer to as entity embedding. We attempt to follow the succession of important works in deep learning research to find the terms the authors of these important works are using. Our conclusion is that machine learning researchers’ interest in deep learning starts with Krizhevsky, Sutskever, and Hinton’s 2012 success applying deep learning algorithms to the problem of classifying Imagenet data [8]. It is our belief that this success then inspired researchers to find ways to apply deep learning to data other than image data. The next milestone we see is Mikolov et al.’s publication, “Distributed representations of words and phrases and their compositionality” [9]. Interestingly, both the 2012 Imagenet paper and the 2013 paper on distributed representations of words have a common author: Ilya Sutskever. This fact may lead one to believe the community should settle on “distributed representations” as the term for mapping categorical data to vectors for deep learning. There is, however, another work we find important: Guo and Berkhahn’s “Entity embeddings of categorical variables”. In this survey, we use the term “entity embedding” for a mapping of categorical values to vectors in \({\mathbb {R}}^n\). We denote the mapping with the letter e. An entity embedding algorithm may change the behavior of e over time. One technique for updating e, which we find employed in Mikolov et al. [9], is to express e as a feedforward neural network and alter the behavior of e by updating the parameters of the network. We have many options for network structure and weight updating schemes to choose from, thanks to the work of previous researchers. We find many works where the authors use entity embeddings to produce input values for other deep learning algorithms. Guo and Berkhahn give a rigorous definition in their work; for the purposes of this survey, given a set S of categorical values, we define an entity embedding to be a mapping

$$\begin{aligned} e:S \rightarrow {\mathbb {R}}^{d} \end{aligned}$$
(1)

of the elements of S to vectors of real numbers. We define the range of an entity embedding as \({\mathbb {R}}^{d}\) to allow us the luxury of employing any theorems that hold for real numbers. In practice, the range of our embedding is a finite subset of vectors of rational numbers \({\mathbb {Q}}^{d}\) because computers are currently only capable of storing rational approximations of real numbers. We refer to this as “entity embedding”, “embedding categorical data”, “embedding categorical values”, or simply as “embedding” when the context is clear.
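A minimal sketch of such a mapping e (PyTorch, with a hypothetical set S and dimension d): an nn.Embedding table holds one d-dimensional vector per element of S, and those vectors are network parameters that can be updated over time, as in the scheme described above.

```python
import torch
import torch.nn as nn

# Hypothetical finite set S of categorical values and embedding dimension d.
S = ["cat", "dog", "bird"]
d = 3

# Index the elements of S so each one addresses a row of the embedding table.
index_of = {value: i for i, value in enumerate(S)}

# e: S -> R^d, realized as a trainable lookup table with |S| x d parameters.
e = nn.Embedding(num_embeddings=len(S), embedding_dim=d)

# Embed a batch of categorical values; gradients flowing through these rows
# update the behavior of e during training.
batch = torch.tensor([index_of["dog"], index_of["cat"]])
vectors = e(batch)
print(vectors.shape)  # torch.Size([2, 3])
```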

The terms we define in this section are key to understanding the works we cover here. In the next section we report the search method we use to find the works we include. This search process is what leads to our formulation of categories of embedding techniques.

Search method

We employ Google Scholar [10] and the Florida Atlantic University (FAU) online library database OneSearch [11] to search for terms synonymous with embedding. These are the synonymous terms: “entity embedding”, “tabular data”, “dense encoding”, “distributed representation”, and, “encoding categorical variables”.

We combine each of the five synonyms for entity embedding above with each of the following phrases: “deep learning”, “neural network”, “deep learning survey”, and, “neural network survey”.

Our research group is interested in entity embeddings as they relate to medical research with respect to the Healthcare Common Procedure Coding System (HCPCS), so we also search the databases listed above using the phrases “deep learning HCPCS” and “neural network HCPCS”.

Hence, we conduct a total of 22 searches in each of the two databases. We include the term “survey” in our searches to find related works. In addition, we search for the phrase “encoding categorical variables” by itself in order to find literature that covers techniques that have potential to be used in neural networks. Through the course of conducting these searches, we discover the spectrum of techniques for using qualitative data in neural networks.
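For clarity, the sketch below (plain Python) enumerates the query strings implied by the lists above: the five synonyms crossed with the four phrases, plus the two HCPCS queries, yield the searches run against each database.

```python
from itertools import product

synonyms = ["entity embedding", "tabular data", "dense encoding",
            "distributed representation", "encoding categorical variables"]
phrases = ["deep learning", "neural network",
           "deep learning survey", "neural network survey"]

# Cross each synonym with each phrase, then append the two HCPCS queries.
queries = [f"{s} {p}" for s, p in product(synonyms, phrases)]
queries += ["deep learning HCPCS", "neural network HCPCS"]

print(len(queries))  # 22 searches per database
```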

Related work

We find several existing studies that involve using categorical data in neural networks. However, we do not find work that identifies the spectrum of techniques we discover here. The related works cover some aspects of our subject, but none of them present techniques for using qualitative data in neural networks with our unified approach.

Interestingly, when we query the Google Scholar search engine with the phrase “encoding categorical variables neural networks”, we find “A comparative study of categorical variable encoding techniques for neural network classifiers” [12] as the second result. This three-page work is from October 2017. One positive aspect of this work is that it covers seven techniques for encoding categorical variables. On the other hand, the authors evaluate each technique on the UCI Cars dataset [13] with one neural network architecture. In our search for works to cover in this review, we do not find papers that employ all of these techniques. Furthermore, we find that each of the seven techniques the authors list is available in the Scikit-learn Category Encoders library. All the techniques covered in [12] fall into the determined category of techniques for using categorical data in neural networks. This work appears to be a report on an exercise where the authors applied various encoding techniques available in a popular machine learning library to a dataset commonly used for teaching purposes. Although the scope of the work is limited, we find it useful for discussing determined techniques for encoding categorical variables.
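As an illustration of how such library-provided determined techniques are applied, here is a brief sketch (Python, using the scikit-learn-contrib category_encoders package with a hypothetical toy frame); the other encoders the library offers follow the same fit/transform interface.

```python
import pandas as pd
import category_encoders as ce

# Hypothetical toy frame with one categorical column.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Binary encoding is one determined technique the library offers; encoders
# such as OneHotEncoder and OrdinalEncoder are used the same way.
encoder = ce.BinaryEncoder(cols=["color"])
encoded = encoder.fit_transform(df)
print(encoded)
```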

One work with a promising title for researchers interested in using categorical data as input to neural networks is “An overview on data representation learning: From traditional feature learning to recent deep learning” [14], by Zhong et al. However, this work traces the history of data representation starting with principal component analysis (\(\text {PCA}\)) and linear discriminant analysis (\(\text {LDA}\)) (the reader should be aware that the abbreviation LDA is used for two different techniques in the machine learning literature: Latent Dirichlet Allocation and linear discriminant analysis). Zhong et al. make a great contribution in describing PCA, LDA, and their descendants, but their work is not strictly dedicated to techniques for using categorical data in neural networks in the manner that this survey is. Another major theme in [14] is learned embeddings. Zhong et al. introduce learned embeddings by covering Mikolov et al.’s distributed representations [9], so there is some overlap between this work and [14] in that regard. However, Zhong et al. present a collection of techniques from a historical point of view, which could imply that some techniques are passé and others are in vogue. We do not take a historical perspective in this work; we find that there is a spectrum of techniques, and researchers often compose techniques.

Natural language is a common form of categorical data. On the subject of natural language processing, we also have the survey “Semantic text classification: A survey of past and recent advances” [15] by Altinel and Ganiz. Part of this work covers embeddings for natural language data. However, the focus of [15] is text classification, so it only covers techniques for using categorical data in neural networks for text classification. Their work is also not limited to neural networks; for example, it covers knowledge bases such as WordNet and non-parametric techniques for text processing such as k-Nearest Neighbors and Support Vector Machines. There is some overlap between Altinel and Ganiz’s coverage of deep learning based natural language processing techniques and the automatic class of embedding techniques we study here. However, Altinel and Ganiz do not write about embedding techniques from a general perspective, so they do not provide a unified description of embedding techniques.

“Graph Embedding Techniques, Applications, and Performance: A Survey” [16] is another survey of embedding techniques, albeit one devoted exclusively to graph embeddings. We feel this is an interesting, emerging subject in deep learning. Moreover, one may characterize a qualitative attribute of some data as connections between records that share the attribute’s value. However, in this work we focus on existing techniques for working with qualitative data in neural networks, and we consider applying graph embedding techniques to categorical data a subject for future research.

In “Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis” [17], Shickel et al. cover entity embeddings in the section entitled “Concept representation”. Their introduction mentions DeepCare and Med2Vec. However, the scope of their work is narrower than the scope of this work: they cover deep learning applications to electronic health record (\(\text {EHR}\)) data. This survey is interesting to us because it covers techniques for dealing with categorical health care data, such as International Statistical Classification of Diseases (\(\text {ICD}\)) codes, as input for deep learning algorithms. We aim to cover the same types of techniques in the context of qualitative data in general.