
1 Introduction

Conversational systems are among the most important advancements in the area of Artificial Intelligence (AI). In conversational AI, dialog systems can be either open-domain chit-chat models or task-specific goal-oriented models. Task-specific systems focus on particular tasks such as flight or hotel booking, providing technical support to users, and answering non-creative queries. These systems try to generate a response by maximizing an expected reward. In contrast, an open-domain dialog system operates in a non-goal-driven casual environment and responds to all kinds of questions. The realization of rewards is not straightforward in these cases, as there are many factors to model. Aspects such as understanding the dialog context, acknowledging the user’s personal preferences, and other external factors such as time, weather, and current events need consideration at each dialog step.

In recent times, there has been a trend towards building end-to-end dialog systems, such as chat-bots, which can easily mimic human conversations [12, 13, 17, 28]. Rule-based methods often require human experts to formulate rules for training the system, whereas learning-based methods learn from a specific algorithm, which makes them less flexible to adapt to other domains. Data from social media platforms such as Twitter, Reddit, and other community question-answering (CQA) platforms have provided us with a large number of human-to-human conversations. Data-driven approaches such as those developed by [6, 16] can be used to handle such problems. Retrieval-based methods [6] produce a suitable response from a predefined set of candidate responses by ranking them in order of similarity (e.g., by matching the number of common words) against the input sentence. Selecting responses from a fixed predefined set makes these systems static and repetitive. [16] builds a system based on phrase-based statistical machine translation to exploit single-turn conversations. [30] presented a deep learning-based method for retrieval-based systems. A brief review of these methods is presented in [2].

Lately, generation-based models have become quite popular. [19, 22, 23, 25] presented several generative models based on neural networks for building efficient conversational dialog systems. Moreover, several other techniques, for instance the generative adversarial network (GAN) [10, 29] and the conditional variational autoencoder (CVAE) [3, 7, 18, 20, 32, 33], have also been applied to dialog generation.

Conversations generated by retrieval-based methods are highly fluent, grammatically correct, and of good quality compared to dialogues generated by generative methods. However, their high quality is subject to the availability of an extensive repository of human-human interactions. Responses generated by neural generative models, in contrast, are more varied in nature but often lack grammatical correctness. Techniques that combine the strengths of both retrieval-based and generative methods can be adopted in such situations. On the whole, hybrid methods [21, 27, 31, 34] first find some relevant responses using retrieval techniques and then leverage them to generate contextually relevant responses in the next stage.

In this paper, we propose a novel method for building an efficient virtual assistant using single-turn open-domain conversational data. We use a self-attention-based transformer model, instead of RNN-based models, to obtain the representation of our input sequences. We observe that our method can generate more diverse and relevant responses.

3 Methodology

3.1 Problem Statement

Our goal is to generate contextually relevant responses for single-turn conversations. Given an input utterance \(\mathrm{U} = u_{1}, u_{2}, ..., u_{n}\) composed of n words, we aim to generate a target response \(\mathrm{Y} = y_{1}, y_{2}, ..., y_{m}\) of m words.

3.2 Word Embeddings

We use pre-trained GloVe [15] embeddings to initialize the word vectors. GloVe combines two main methods from the literature to build its vectors: global matrix factorization and local context window methods. The GloVe model is trained on the non-zero entries of a global word-word co-occurrence matrix, which records how frequently two words co-occur in a given corpus. The embeddings used in our model are trained on the Common Crawl dataset with 840B tokens and a 2.2M-word vocabulary. We use 300-dimensional vectors.
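To make the embedding step concrete, the following is a minimal sketch (in Python, with illustrative file and variable names) of how pre-trained GloVe vectors can be loaded and used to initialize an embedding matrix; words missing from GloVe fall back to small random vectors:

```python
# Sketch: building an embedding matrix from pre-trained GloVe vectors.
# File name and vocabulary object are illustrative assumptions.
import numpy as np

EMB_DIM = 300  # 300-dimensional Common Crawl vectors, as used in our model

def load_glove(path="glove.840B.300d.txt"):
    """Parse a GloVe text file into a word -> vector dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], np.asarray(parts[1:], dtype=np.float32)
            if vec.shape[0] == EMB_DIM:   # skip malformed lines
                vectors[word] = vec
    return vectors

def build_embedding_matrix(vocab, glove):
    """Initialize known words from GloVe; unknown words get small random vectors."""
    matrix = np.random.normal(scale=0.1, size=(len(vocab), EMB_DIM)).astype(np.float32)
    for word, idx in vocab.items():       # vocab: word -> integer id
        if word in glove:
            matrix[idx] = glove[word]
    return matrix
```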

3.3 Baseline Models

We formulate our task of response generation as a machine translation problem. We define two baseline models based on deep learning techniques for our experiments. First, we build a neural sequence-to-sequence model [23] based on Bi-Directional Long Short-Term Memory (Bi-LSTM) [5] cells. The second model utilizes the attention mechanism [1] to align input and output sequences. We train these models using the GloVe word embeddings as input features.

To build our first baseline, we use a neural encoder-decoder [23] model. The encoder, which contains RNN cells, converts the input sequence into a context vector, an abstract representation of the entire input sequence. The context vector forms the input to a second RNN-based decoder, which learns to output the target sequence one word at a time. Our second baseline adds an attention layer [1] between the encoder and decoder, which helps the decoder decide which words of the input sequence to focus on in order to predict the next word correctly.
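As a rough illustration, the sketch below shows how the attentional baseline could be realized in PyTorch; the layer sizes follow Sect. 4.2, but the class and variable names are our own, and details such as padding masks and encoder-state initialization are omitted for brevity:

```python
# Minimal sketch of the attentional seq2seq baseline (assumed PyTorch implementation).
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, emb, hidden=512, layers=2, dropout=0.1):
        super().__init__()
        self.emb = emb                                # pre-trained GloVe embedding layer
        self.rnn = nn.LSTM(emb.embedding_dim, hidden // 2, layers,
                           bidirectional=True, batch_first=True, dropout=dropout)

    def forward(self, src):                           # src: (B, n) word ids
        out, _ = self.rnn(self.emb(src))              # out: (B, n, hidden)
        return out

class AttnDecoder(nn.Module):
    """Decoder with a simple dot-product (global) attention over encoder outputs."""
    def __init__(self, emb, vocab_size, hidden=512, layers=2, dropout=0.1):
        super().__init__()
        self.emb = emb
        self.rnn = nn.LSTM(emb.embedding_dim, hidden, layers,
                           batch_first=True, dropout=dropout)
        self.out = nn.Linear(hidden * 2, vocab_size)

    def forward(self, tgt, enc_out):
        # For brevity the decoder starts from a zero state; the actual baseline
        # would initialize it from the encoder's final state.
        dec_out, _ = self.rnn(self.emb(tgt))                      # (B, m, hidden)
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))      # (B, m, n) attention scores
        context = torch.bmm(torch.softmax(scores, dim=-1), enc_out)
        return self.out(torch.cat([dec_out, context], dim=-1))    # (B, m, vocab)
```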

3.4 Proposed Model

The third model, which is our proposed method, is based on the transformer network architecture [24]. We use GloVe word embeddings as input features for our proposed model. We use the transformer encoder as described in [24] to obtain the representation of the input sequence and the transformer decoder to generate the target response. Figure 1 shows the proposed architecture. The input to the transformer encoder is the sum of the embedding \(e(u_n)\) of the current word and the positional encoding PE(n) of the n-th word:

$$\begin{aligned} {I_u = [{u}_{1},...,{u}_{n}]} \end{aligned}$$
(1)
$$\begin{aligned} {u_{n} = e(u_{n}) + PE(n)} \end{aligned}$$
(2)
Fig. 1. Proposed model architecture

The transformer encoder consists of a total of \(N_{x}\) identical layers. Each layer contains two sub-layers: a multi-head attention layer and a position-wise feed-forward layer. We encode the input utterances and target responses of our dataset using multi-head self-attention. The second sub-layer applies a position-wise feed-forward transformation to the outputs of the first sub-layer. A residual connection is applied around each of the two sub-layers, followed by layer normalization. The following equations represent the layers:

$$\begin{aligned} {M^{1} = MultiHead({I}_{u},{I}_{u},{I}_{u})} \end{aligned}$$
(3)
$$\begin{aligned} {F^{1} = FFN(M^{1}) } \end{aligned}$$
(4)
$$\begin{aligned} FFN(t) = max(0,tW_1+b_1)W_2+b_2 \end{aligned}$$
(5)

where \(M^{1}\) is the hidden state returned by the first multi-head attention sub-layer and \(F^{1}\) is the representation of the input utterance obtained after the first feed-forward sub-layer. The same steps are repeated for the remaining layers:

$$\begin{aligned} {M^{n} = MultiHead({F}^{n-1},{F}^{n-1},{F}^{n-1})} \end{aligned}$$
(6)
$$\begin{aligned} {F^{n} = FFN(M^{n})} \end{aligned}$$
(7)

where n = 2, ..., \(N_x\). We use c to denote the final representation of the input utterance obtained at the \(N_x\)-th layer:

$$\begin{aligned} {c = F^{(N_x)}} \end{aligned}$$
(8)

Similarly, for decoding the responses, we use the transformer decoder, which also consists of \(N_y\) identical layers. The encoder and decoder layers are quite similar to each other, except that each decoder layer has two multi-head attention layers to perform self-attention and encoder-decoder attention, respectively. The embedded target response \(R_y\), defined below, serves as the input to the first decoder layer, i.e., we set \(H^{0} = R_y\):

$$\begin{aligned} {R_y = [{y}_{1},...,{y}_{m}]} \end{aligned}$$
(9)
$$\begin{aligned} {y_{m} = e(y_{m}) + PE(m)} \end{aligned}$$
(10)
$$\begin{aligned} {P^{n} = MultiHead({H}^{n-1},{H}^{n-1},{H}^{n-1})} \end{aligned}$$
(11)
$$\begin{aligned} {G^{n} = FFN(P^{n})} \end{aligned}$$
(12)
$$\begin{aligned} {D^{n} = MultiHead({G}^{n},{c},{c})} \end{aligned}$$
(13)
$$\begin{aligned} {H^{n} = FFN(D^{n})} \end{aligned}$$
(14)

To predict the next word, we apply a softmax over the decoder output to obtain the word probabilities:

$$\begin{aligned} {\hat{y}_t = softmax(H^{(N_y)})} \end{aligned}$$
(15)
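For clarity, the following is a minimal PyTorch-style sketch of one encoder layer and the sinusoidal positional encoding corresponding to Eqs. (2)-(5); residual connections and layer normalization are applied as described above, and all names are illustrative rather than taken from our implementation:

```python
# Sketch of a single encoder layer and positional encoding (assumed PyTorch).
import math
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),   # Eq. (5)
                                 nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                 # x: (B, n, d_model)
        m, _ = self.attn(x, x, x)         # Eq. (3): multi-head self-attention
        x = self.norm1(x + m)             # residual connection + layer norm
        f = self.ffn(x)                   # Eq. (4): position-wise feed-forward
        return self.norm2(x + f)

def positional_encoding(max_len, d_model):
    """Sinusoidal PE(n) added to the word embedding, as in Eq. (2)."""
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe
```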

4 Datasets and Experiments

In this section, we present the details of the datasets used in our experiments, along with a detailed overview of the experimental settings.

4.1 Datasets

Our dataset comprises single-turn conversations from ten different domains: Data About User, Competitors, Emotion, Emergency, Greetings, About Bixby, Entertainment, Sensitive, Device, and Event. Professional annotators with a linguistics background and relevant expertise created this dataset. The full dataset contains 184,849 utterance-response pairs, with an average of 7.31 words per utterance and 14.44 words per response. We first split the data into a train and test set in a 95:5 ratio. We then use 5% of the training data as the validation set. The dataset details are given in Table 2. Some examples from the dataset are shown in Table 1.

Table 1. Examples of the original utterance and the associated response from the dataset
Table 2. Dataset statistics

4.2 Experimental Setup

We use two different types of models for our experiments: recurrent and transformer-based sequence-to-sequence generative models. All data loading, model implementation, and evaluation were done using OpenNMT [9] as the code framework.

Recurrent Models. We train a seq2seq model in which the encoder and decoder are parameterized as LSTMs [5]. We also experiment with a seq2seq model with an attention mechanism [1] between the decoder and the encoder outputs. The encoder and decoder LSTMs have 2 layers with 512-dimensional hidden states and a dropout rate of 0.1.

Transformer Model. Both the encoder and the decoder have 6 layers with 512-dimensional hidden states and a dropout of 0.1. There are 8 attention heads and 2048 nodes in the feed-forward hidden layers. The dimension of the word embeddings is empirically set to 512. We use Adam [8] for optimization. When decoding the responses, the beam size is set to 5.
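As an illustration only (our experiments are run through the OpenNMT framework), the stated hyper-parameters correspond to a configuration such as the following PyTorch sketch; the learning rate shown is an illustrative placeholder, not a value reported above:

```python
# Shape-level sketch of the transformer configuration (assumed PyTorch).
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       dim_feedforward=2048, dropout=0.1, batch_first=True)

# Adam optimizer [8]; the learning rate below is illustrative only.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# At inference time, responses are decoded with beam search (beam size 5),
# which OpenNMT's translation pipeline handles.
```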

4.3 Evaluation Metrics

Automatic Evaluation: We use standard metrics such as BLEU [14], ROUGE [11], and perplexity for the automatic evaluation of our models. Perplexity is reported on the generated responses from the validation set; lower perplexity indicates better performance. BLEU and ROUGE measure the n-gram overlap between a generated response and the gold response; higher BLEU and ROUGE scores indicate better performance.
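The sketch below indicates, under the assumption of standard open-source implementations (NLTK for BLEU and the rouge-score package for ROUGE), how these metrics can be computed; perplexity is taken as the exponential of the mean token-level negative log-likelihood:

```python
# Sketch of the automatic metrics; library choices are assumptions, not our exact setup.
import math
from nltk.translate.bleu_score import corpus_bleu
from rouge_score import rouge_scorer

def evaluate(references, hypotheses, mean_nll=None):
    # BLEU: each reference/hypothesis is a list of tokens
    bleu = corpus_bleu([[r] for r in references], hypotheses)
    # ROUGE-L F1 between detokenized strings, averaged over the test set
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = sum(scorer.score(" ".join(r), " ".join(h))["rougeL"].fmeasure
                  for r, h in zip(references, hypotheses)) / len(references)
    # Perplexity = exp(average negative log-likelihood per token), if available
    ppl = math.exp(mean_nll) if mean_nll is not None else None
    return {"BLEU": bleu, "ROUGE-L": rouge_l, "PPL": ppl}
```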

Human Evaluation: To qualitatively evaluate our models, we perform a human evaluation of the generated responses. We sample 200 random responses from our test set. Given an input utterance, the target response, and the predicted response, two experts with post-graduate exposure were asked to evaluate the predicted responses based on the following two criteria:

  1. Fluency: The predicted response is fluent in terms of grammar.

  2. Adequacy: The predicted response is contextually relevant to the given utterance.

We measure fluency and adequacy on a 0–2 scale, with ‘0’ indicating an incomplete or incorrect response, ‘1’ an acceptable response, and ‘2’ a perfect response. To measure the inter-annotator agreement, we compute the Fleiss’ kappa [4] score. We obtain a kappa score of 0.99 for fluency and 0.98 for adequacy, denoting “good agreement”.
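As an illustration, Fleiss’ kappa over either criterion can be computed with statsmodels as sketched below; the rating matrix shown is a toy example, not our annotation data:

```python
# Sketch of the agreement computation using the statsmodels implementation
# of Fleiss' kappa and the 0-2 rating scale described above.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# ratings: one row per response, one column per annotator, values in {0, 1, 2}
ratings = np.array([[2, 2], [1, 1], [0, 0], [2, 2]])   # toy example: 4 items x 2 experts
table, _ = aggregate_raters(ratings)                    # per-item counts for each category
kappa = fleiss_kappa(table)
print(f"Fleiss' kappa: {kappa:.2f}")
```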

5 Results and Analysis

In this section, we report the results of all our experiments. The first two experiments (seq2seq and seq2seq_attn) are conducted with our baseline models. The third experiment (cf. Fig. 1) uses our proposed model with word embeddings as the input features. Table 3 and Table 4 show the automatic and human evaluation results for both the baseline and the proposed models.

Automatic Evaluation Results: Our proposed model has lower perplexity and higher BLEU and ROUGE scores than the baselines. The improvement of our model is statistically significant compared to the other models. Among the baselines, seq2seq_attn obtains the highest score on all evaluation metrics, and our model outperforms it by a decent margin.

Human Evaluation Results: For adequacy, the seq2seq model achieves the highest score of 73.70 among the baseline models, and our proposed model outperforms the baselines with a score of 81.75. For fluency, we observe that the responses generated by all the models are quite fluent in general.

Table 3. Results (BLEU and ROUGE scores) for the baseline and proposed models using GloVe embeddings
Table 4. Results (Fluency and Adequacy scores) of different models (all values are in percentages)

5.1 Error Analysis

To analyze our results in more detail, we perform an error analysis on the predicted responses. In Table 5, we show the predicted responses of the various models used in our experiments, along with the input utterance and target response. Some of our observations are listed below:

  1. Our proposed model gives adequate responses for unseen utterances: For example, Utterance: What success did you achieve?; Predicted Response: I wonder if achieving world peace is too much to ask for. Utterance: What is your desired job?; Predicted Response: Those concepts don’t really apply to me. I am a digital entity after all. Even though these input utterances were not very similar to any of the utterances in our training dataset, the model was still able to generate coherent and contextually relevant responses. For instance, the input utterance in the first example was unseen, but our model is robust enough to produce the response that was paired with the utterance “Is there anything you want to achieve?” in the training conversations.

  2. Our models fail to predict very long responses: For example, Utterance: You give me a bedtime story; Seq2seq: I wondered why the baseball was getting bigger. Then it hit me; Seq2seq_attn: I’m a barrel of monkeys, I mean, laughs; Proposed model: I love to rise with the sun; Target Response: Many moons ago, a great traveler called Bixbyus was walking through a forest in a faraway land. Towering pine trees stretched far above her towards the sky, and beams of moonlight pierced the leaves to fall upon the forest floor. ... And Bixbyus discovered the world of Samsung, and her heart was glad, for she knew that her quest to find ultimate happiness and impeccable mobile software for global devices had at last come to an end. The End

  3. Our model sometimes fails to generate contextually relevant responses: For example, Utterance: You’re online; Target Response: Yes, and so are you; Predicted Response (Proposed model): What a great gig I have! As seen in this example, the predicted response is not a good fit for the utterance “You’re online”, as it falls out of context.

6 Conclusion and Future Work

In this paper, we propose an effective model for response generation using single-turn conversations. We first created a large single-turn conversational dataset and then built a transformer-based framework to model short-turn conversations effectively. Empirical evaluation, in terms of both automatic and human-based metrics, shows encouraging performance. In our qualitative and quantitative analyses of the generated responses, we observed the predicted responses to be highly relevant in terms of context, but we also observed some errors, as discussed in the results and analysis section. Overall, our proposed model attains improved performance compared with the baseline results.

In the future, apart from improving the architectural design and training methodology, we look forward to evaluating our models on a much larger dataset of single-turn conversations.

Table 5. Some sample responses generated by the baseline and proposed models on our test set