1 Introduction and Motivations

Email is one of the most popular tools for personal, business, and organizational communications. With billions of emails circulating every day [14], email inboxes are expanding at a staggering rate, making the task of searching for specific emails quite challenging for users. Sometimes, users vaguely remember the existence of an email but fail to recall the exact words in the content to use as search queries. Other times, their queries may be too narrow to retrieve the desired target, or too broad, resulting in an overwhelming number of mostly irrelevant emails. Query auto-completion (QAC) [5,6,7] is a key feature designed to address these issues, assisting users in formulating queries by suggesting a list of plausible candidates with each keystroke. A well-designed QAC system should be able to significantly cut down users’ typing, memory, and browsing efforts, enhancing the overall email search experience [18].

QAC is a ubiquitous feature in search systems. In a typical web-based QAC system, candidates are first retrieved from an index (e.g. a trie or inverted index), and the top-k suggestions are then ordered by ranking models. During this process, web-based QAC presumes that data and logs can be collected from a vast number of users and are constantly accessible for aggregation, feature generation, and model training. However, in personal search, this assumption is often invalid due to privacy constraints. In response to these restrictions, we propose an on-device QAC setting for email search. In on-device QAC, users can only access their own personal data from their own devices (e.g. mobile phones or laptops). Search and QAC are purely powered by on-device indices and algorithms, without relying on the availability of web-based search services. Users’ interaction logs are generated on their devices, and will not be collected, shared, or aggregated through web services.

On-device QAC for email search has distinct characteristics when compared to web-based QAC. First, due to corpus differences in email search, on-device QAC is inherently personalized. Given the sensitivity of the information in emails, collecting a global email dataset from numerous users is also prohibitive. Second, email search logs are generated on device and will not be collected by centralized services. Third, an on-device email index changes much more frequently than a web index. For example, email users may receive new emails or delete old ones at any time. An on-device email search system should be able to reflect these changes and users’ engagements in real time. Lastly, emails carry a lot of structural information that differs from web pages, and these structural differences can be leveraged to improve the quality of QAC algorithms. All of these characteristic differences make traditional log-based web search QAC algorithms not directly applicable, and motivate the design of our on-device QAC methods.

Our proposed on-device QAC method comprises two stages. In the retrieval stage, we use pseudo relevance feedback (PRF) to generate potential completion and suggestion candidates. During this stage, emails that are most relevant to the users’ queries are retrieved, from which QAC candidates are extracted. In the ranking stage, relevance signals based on the structural and textual information derived from the users’ personal corpora are employed for ranking candidates and post-processing. To evaluate our method, we propose a novel grader-based offline evaluation pipeline. Extensive experiments show that graders are more satisfied with the quality of our QAC results, and our method outperforms strong baselines.

2 Related Work

Web-based QAC systems collect user interaction data centrally across many users. For example, Most Popular Completion (MPC) [6] utilizes search logs and generates candidates from the most frequently issued queries. Contextual, time-sensitive, and engagement signals collected from user activities have been proposed for ranking QAC candidates [10, 28, 29, 30]. In order to improve personal search quality, [8, 17] aggregate non-private query-document associations from user interactions. When query logs are absent, generative QAC methods are used [15, 26]. Mitra et al. [20] generate candidates for rare prefixes using popular n-gram suffixes. Park et al. [21] further propose a character-level neural language model trained on query logs to generate QAC candidates. Dehghani et al. [12] then utilize the attention mechanism to capture the structure of the session context.

Emails, unlike web pages, have many unique characteristics. Wang et al. [27] classify enterprise email intents into 4 categories, showing that email topics are less diverse than those of web pages. Alrashed et al. [4] show a positive correlation between user interaction and the significance of emails. Ai et al. [3] study large-scale behavior logs in email search and find that, compared to web search, email queries tend to be shorter, more specific, and less repetitive. This difference makes it hard to directly apply web search techniques to the email domain. Some existing works exploit these specific characteristics to develop more effective email search systems. For example, Meng et al. [19] combine token-level sparse features and email-level dense features to better capture users’ intents. Carmel et al. [11] explore the importance of freshness signals in email search. Horovitz et al. [13] propose using freshness signals along with structural information to enhance query completion results. In our experiments, such additional features turn out to significantly boost performance.

In personalized search, there is typically a shared corpus, and users have their own individual profiles. Different users may expect different results for the same query. Teevan et al. [24] form users’ profiles from their historical search behaviors, and use such profiles to personalize search results. Zhou et al. [30] encode users’ search history using transformers [25] to build user profiles. To resolve word ambiguity across different users, Yao et al. [29] build a personalized model based on personal embeddings. In industry, Shokouhi et al. [23] propose a labelling strategy for generating offline training labels and train a personalized QAC model based on users’ search history. Aberdeen et al. [2] first train a global model, then apply transfer learning to adapt it to individual users.

In our on-device QAC setting, we keep users’ personal data on-device. Under this constraint, existing massive log-based training techniques (e.g. [8, 17]) cannot be applied. [9, 13] have the most similar setup to our work, hence serve as the baselines for our experiments. Both propose to first construct the candidate set by selecting all n-grams in the mailbox that contain the user’s input as a prefix, and then rank them based on term-frequency scores. Horovitz et al. [13] use multiple ranking signals, while Bhatia et al. [9] extend TF-IDF by taking into account the term-to-phrase probability. However, these works assume the best QAC result is a consecutive n-gram in the corpus, which is not always the case.

3 On-Device QAC

3.1 System and Settings

Our email search system mainly consists of a trie-based inverted index, a key-value store for metadata storage, and components for tiered retrieval, ranking, and QAC. All of these components are on-device. Users’ emails are indexed instantly upon receipt. Figure 1 illustrates our user interface for email search and QAC. When a user taps on the search bar and begins typing, the system receives an original query upon each keystroke. For each original query, a list of emails is retrieved and ranked. QAC then fetches metadata from the top ranked emails and generates QAC results from them. After the user views the QAC list and clicks a QAC result, a list of emails is retrieved and presented back to the user. There are interesting challenges in each of these components; the focus of this paper is the QAC method. Handling cases where the original query is misspelled or includes synonyms that do not appear in the email contents is left for future work.

Fig. 1. Left: Search and QAC user interface. Right: Completion and suggestion examples. The completion part completes the last prefix of the user’s original query to a word; the suggestion part adds relevant word(s). The final QAC result is a composition of the original query, completion, and suggestion parts.

3.2 Candidates Generation

We propose to generate QAC candidates from PRF [1, 16, 22]. The classic PRF paradigm assumes that the set of top-k ranked documents (emails) are relevant to the query. This assumption holds naturally in our system which includes an on-device email ranker. This email ranker leverages a few important signals such as textual match between query and various zones of emails, freshness of activities, and users’ historical engagements.

After the top ranked emails are returned from the ranker, their metadata such as subjects, body texts, senders and recipients are obtained from a key-value store. QAC then identifies matches against the original query and extracts candidates from these texts on-the-fly.

Compared with existing work which extracts terms and phrases from all emails (e.g. [9, 13]), retrieving candidates via PRF brings many benefits. We avoid aggregating candidates from irrelevant emails, hence QAC results are potentially more relevant to the query. When users engage with a QAC result, their chance of finding relevant emails is also higher. This idea is motivated by the fact that users’ goal is to find target emails that are relevant to them, while QAC serves as an intermediate interface that helps users formulate their queries. As described in Sect. 3.3, QAC can take advantage of the ranking signals that come along with the email results. Moreover, there is no need to maintain a phrase dictionary, which could be costly and non-trivial to keep up-to-date.

We propose to formulate a QAC candidate as the combination of the following parts: 1) the original query, excluding the last term; 2) an n-gram completion of the last query term, defined as the completion; 3) an optional suggested n-gram that can be anywhere in the text, defined as the suggestion. These concepts are illustrated in Fig. 1. For example, imagine the intent of the original query “apple st” is to retrieve an order confirmation email for a recent order placed on the Apple online store. The QAC candidate “apple store order” is constructed from the non-trailing part of the original query “apple”, the completion “store”, and the suggestion “order”. An important difference from previous work [9, 13] is that the QAC candidates do not have to be contiguous n-grams extracted from emails. For example, in the candidate “apple store order”, “store” and “order” may not occur consecutively in the email.

From the top ranked emails, QAC first extracts all n-grams as suggestion candidates. These n-grams do not contain punctuation marks and do not cross sentences, paragraphs, or zones. Following [9], common stop-words are “jumped over” when counting n-grams yet retained within the phrase, so that the resulting phrases do not start or end with stop-words; this prevents unintelligible candidates from being generated.

For each suggestion candidate within an email, QAC finds the nearest token that matches the trailing term of the original query. This matched token could be part of an n-gram. If we only use this token as the completion, a QAC result could be incomplete. For example, imagine the intent of the original query “americ” is to retrieve a recent annual statement email from American Express, and the suggestion is “statement”. A uni-gram completion could generate an unintelligible result “american statement”. If the completion is extended to n-grams, we can complete “americ” to “american express” and formulate a better result “american express statement”. Because of this, completions are generated as n-grams around the matched token. For each suggestion, completions could also come from multiple emails. For example, “american airlines statement” is another QAC result for “americ”. In this case, suggestion “statement” corresponds to two completions: “american express” and “american airlines”.

Our QAC system responds to users’ keystrokes instantly, hence the size of the candidate set has to be controlled for ranking and computational feasibility. We consider the following parameters: 1) k, the maximum number of top ranked emails we use to generate candidates; 2) \(N_c\) and \(N_s\), the maximum sizes of the completion and the suggestion n-grams. For example, when the original query is “apple st” and \(N_c=N_s=1\), “apple store” is a valid candidate because the completion size \(n_c=1=N_c\) and the suggestion size \(n_s=0 < N_s\); “apple store order” is also valid since \(n_c=1=N_c\) and \(n_s=1=N_s\); however, “apple store order shipment” is not a valid candidate since the suggestion size \(n_s=2 > N_s\).
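To make these constraints concrete, the following is a minimal Python sketch of the size checks described above; the helper names and tokenization are ours rather than the production implementation, and the stop-word, zone, and sentence-boundary handling of Sect. 3.2 is omitted.

    # Minimal sketch of the candidate-size constraints above (illustrative only).
    from typing import Iterator, List, Tuple

    def ngrams(tokens: List[str], max_n: int) -> Iterator[Tuple[str, ...]]:
        """Enumerate contiguous n-grams of up to max_n tokens."""
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                yield tuple(tokens[i:i + n])

    def is_valid_candidate(n_c: int, n_s: int, N_c: int, N_s: int) -> bool:
        """Keep a candidate only if its completion size n_c and suggestion
        size n_s stay within the configured maxima N_c and N_s."""
        return 1 <= n_c <= N_c and 0 <= n_s <= N_s

    # With N_c = N_s = 1 and the original query "apple st":
    assert is_valid_candidate(1, 0, 1, 1)        # "apple store"
    assert is_valid_candidate(1, 1, 1, 1)        # "apple store order"
    assert not is_valid_candidate(1, 2, 1, 1)    # "apple store order shipment"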

3.3 Candidates Ranking and Post-processing

In this section, we describe how the suggestion and completion candidates are ranked and combined. First, all the suggestion candidates are ranked. Then for each top ranked suggestion candidate, its corresponding completion candidates are ranked and the final QAC candidates are formulated.

To rank all the suggestion candidates generated from the top-k emails, we utilize the relevance features listed below. These features are efficient, explainable, and have been widely adopted in the ranking systems of modern search engines.

  • Term frequency (TF) and inverse document frequency (IDF). TF measures the popularity of suggestion candidates in the top-k documents, and IDF measures their popularity among all emails in the user’s mailbox.

  • Proximity: \(e^{1 / d}\), where d is the distance between a suggestion \(s_w\) and a completion \(c_w\). If a suggestion and a completion overlap, then \(d=0\); in this scenario, candidates are more conservative and likely to be relevant, so we assign a large value to the proximity. If there are multiple completion choices, the one most adjacent to the suggestion \(s_w\) is used.

  • Zone weights. Intuitively, tokens from sender or subject zones are likely to be more relevant than those from body zone. Therefore, candidates from the “sender” or “subject” zones have higher zone weights than those from “body”.

  • Document score. It captures the importance of the document where candidates are generated from. Candidates generated from more relevant emails are likely more valuable. Inappropriate candidates from unwanted emails are demoted due to low document scores. We set

    $$\begin{aligned} {DocScore} = e^{1/r} \cdot p(q|D_i)\end{aligned}$$
    (1)

    where r is the rank of the email provided by the email ranker. Note that the freshness of the email, an important signal [18], is incorporated in r. Then,

    $$\begin{aligned} p(q|D_i) = \prod _{tok\in q} \frac{tf(tok, D_i)}{|D_i|}\end{aligned}$$
    (2)

    measures the similarity between the original query q and email \(D_i\) following relevance models [16].

  • Completion cost. A result can be too aggressive if the suggestion is very long while the original query is short. To relieve this issue, candidates with shorter suggestions receive higher scores: \({CompCost} = 1 + (N_s - n_s)\). A minimal sketch of these feature computations is given below.
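The following Python sketch shows these features under our reading of Eqs. (1)-(2); the variable names are ours, the zone-weight table is an assumed example, and the fixed “large value” used when suggestion and completion overlap is an arbitrary choice for illustration.

    # Minimal sketch of the per-candidate features (illustrative assumptions noted).
    import math
    from collections import Counter
    from typing import List

    ZONE_WEIGHT = {"sender": 3.0, "subject": 2.0, "body": 1.0}   # assumed values

    def doc_score(rank: int, query_tokens: List[str], doc_tokens: List[str]) -> float:
        """DocScore = e^(1/r) * p(q|D_i), with r the 1-based rank from the
        email ranker and p(q|D_i) the term-frequency likelihood of Eq. (2)."""
        tf = Counter(doc_tokens)
        p_q_given_d = 1.0
        for tok in query_tokens:
            p_q_given_d *= tf[tok] / max(len(doc_tokens), 1)
        return math.exp(1.0 / rank) * p_q_given_d

    def proximity(distance: int, overlap_value: float = math.e ** 2) -> float:
        """e^(1/d) for d > 0; a large fixed value when suggestion and
        completion overlap (d = 0), since e^(1/d) is undefined there."""
        return overlap_value if distance == 0 else math.exp(1.0 / distance)

    def completion_cost(n_s: int, N_s: int) -> float:
        """Shorter suggestions receive a higher score: 1 + (N_s - n_s)."""
        return 1.0 + (N_s - n_s)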

The ranking score for a suggestion \(s_w\) and corresponding completion \(c_w\) is a weighted TF-IDF that aggregates all these features:

$$\begin{aligned} Score(s_w, c_w, q) = \sum _{D_i \in D} \; \sum _{s_w \in D_i \wedge c_w \in D_i} \Big ( \frac{1}{|D_i|} \cdot IDF(s_w) \cdot Proximity(s_w, c_w, D_i) \cdot ZoneWeight(s_w, D_i) \cdot DocScore(q, D_i) \cdot CompCost(s_w) \Big ), \end{aligned}$$
(3)

where D is the list of top-k relevant emails. The final ranking score for suggestion \(s_w\) is the sum of the scores over all completions \(c_w\):

$$\begin{aligned} & Score(s_w, q) = \sum _{c_w} Score(s_w, c_w, q). \end{aligned}$$
(4)

For each suggestion \(s_w\), the best \(\hat{c_w}\) is used to construct the final QAC result:

$$\begin{aligned} &\hat{c_w} = \mathop {\mathrm {arg\,max}}\limits _{c_w} Score(s_w, c_w, q). \end{aligned}$$
(5)

To construct the final QAC candidate for \(s_w\), the last step is to stitch the three parts together: the original query excluding the last term \(\tilde{q}\), \(\hat{c_w}\) and \(s_w\). In most cases, the candidate can be formulated as the concatenation of \(\langle \tilde{q}, \hat{c_w}, s_w\rangle \). However, if in the original email, completion \(\hat{c_w}\) appears consecutively after suggestion \(s_w\), the order of \(\hat{c_w}\) and \(s_w\) should be switched. For example, imagine a user is looking for their reservation confirmation email from the Hyatt Hotel. Given the original query “hyatt conf”, the suggestion \(s_w\) “reservation” has the best corresponding completion \(\hat{c_w}\) “confirmation”. The expected candidate “hyatt reservation confirmation” is actually the concatenation of \(\langle \tilde{q}, s_w, \hat{c_w}\rangle \), since \(s_w=\)“reservation” and \(\hat{c_w}=\)“confirmation” are consecutive in the email.

Lastly, it is possible that some of the final QAC candidates are similar to each other. To avoid duplication, we go through the ranked list of candidates: if a candidate does not contain any new tokens compared with all the candidates ranked above it, it is eliminated from the list.
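A minimal sketch of the stitching and de-duplication steps follows; whether the completion directly follows the suggestion in the source email is assumed to be known from the matching step, and the token-based de-duplication reflects one plausible reading of the rule above.

    # Minimal sketch of stitching (Sect. 3.3) and de-duplication.
    from typing import List

    def stitch(q_prefix: str, completion: str, suggestion: str,
               completion_follows_suggestion: bool) -> str:
        """Concatenate <q~, c_w, s_w>; swap c_w and s_w when the completion
        appears right after the suggestion in the email (e.g. "hyatt" +
        "confirmation"/"reservation" -> "hyatt reservation confirmation")."""
        if completion_follows_suggestion:
            parts = [q_prefix, suggestion, completion]
        else:
            parts = [q_prefix, completion, suggestion]
        return " ".join(p for p in parts if p)

    def deduplicate(ranked_candidates: List[str]) -> List[str]:
        """Drop a candidate that adds no new token relative to the
        higher-ranked candidates kept so far (one plausible reading)."""
        kept, seen = [], set()
        for cand in ranked_candidates:
            tokens = set(cand.split())
            if tokens - seen:               # contributes at least one new token
                kept.append(cand)
                seen |= tokens
        return kept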

4 Evaluation Method and Experiments

Evaluating personalized on-device search is non-trivial. Offline experiments must remain sensitive and insightful without breaking privacy promises, whilst online A/B experiments are challenged by on-device model deployment and minimal instrumentation under privacy-preserving conditions. To evaluate the efficacy of our QAC method, we propose a novel, grader-based offline evaluation that enables direct measurement of QAC quality without compromising grader privacy. This method also affords greater control over unintentional factors that can impact QAC quality, including on-device indexing status, query sampling, display position bias, and results scraping consistency.

Table 1. Fixed test set distribution by scenario.

4.1 Experiment Setup

To evaluate our method, we set up four double-blind offline experiments using 47 US-based graders. Each experiment evaluated the quality of a different QAC method using the same graders, the same email accounts, and the same query test sets. QAC results generated by each method were scraped on-device in quick succession in order to fix the state of each grader’s email account.

To thoroughly protect grader privacy and ensure that no sensitive information ever left a grader’s device, all QAC result scraping and grading happened solely on-device. Only quality grades, grader comments, and unpersonalized meta information collected during the experiment were sent to our server, along with optionally donated queries and QAC results.

The 47 US-based graders selected for these experiments were specialized in English annotation tasks and represented a diverse demographic pool with 47% identifying as women and 53% as men. Experiment eligibility required graders to be active email users with at least basic general technical skills. Advanced technical knowledge and skills were neither required nor sought. To prevent grader fatigue, the experiments were completed over a three-week period and each experiment restricted grading to no more than forty distinct query prefixes, which each generated no more than eight QAC results.

Table 2. QAC String Quality guidelines and examples. The intent of the original query “americ” is to retrieve a recent annual statement email from American Express.

Before beginning these experiments, graders synced their primary personal email accounts to their evaluation devices. These accounts each contained between approximately 5,000 and 50,000 emails. Sufficient time was left prior to beginning the experiments to allow these email accounts to completely index.

Evaluating personalized search requires personalized queries that are relevant to each grader’s email account. To collect personalized queries we presented six different email-search scenarios that prompted graders to think of an email in their inbox. If graders could think of a relevant email, they were then asked to provide search queries they would use to retrieve that email. The scenarios were selected from a query-traffic analysis that identified the most common use cases for email search.

QAC systems suggest results upon each typed keystroke. To replicate these keystrokes, each grader’s query was deconstructed into prefixes, which were then weighted by length and randomly sampled to create one test set per grader. These test sets were fixed such that each grader used the same prefixes to evaluate QAC results in all four experiments. The final test set across all graders included 1,854 prefixes, and the query distribution by scenario is shown in Table 1.
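The exact length-weighting scheme is not specified above; the following is a purely hypothetical Python sketch of how keystroke prefixes could be enumerated and length-weighted when sampling a per-grader test set.

    # Hypothetical sketch of prefix generation and length-weighted sampling;
    # the weighting here is an assumption for illustration only.
    import random
    from typing import List

    def prefixes(query: str) -> List[str]:
        """All keystroke prefixes of a query, e.g. "chase" -> "c", "ch", ..."""
        return [query[:i] for i in range(1, len(query) + 1)]

    def sample_prefixes(query: str, n: int, seed: int = 0) -> List[str]:
        """Weighted sampling without replacement (Efraimidis-Spirakis keys),
        assuming longer prefixes receive proportionally higher weight."""
        rng = random.Random(seed)
        cands = prefixes(query)
        keyed = sorted(cands, key=lambda p: rng.random() ** (1.0 / len(p)), reverse=True)
        return sorted(keyed[:n], key=len)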

4.2 Offline Evaluation and Metrics

QAC results are helpful shortcuts for users to arrive at their desired email with less effort. There are two dimensions that must be considered when evaluating the end-to-end quality of a QAC result: 1) QAC String Quality, the QAC result is intelligible and aligns with the search intent of the query; 2) QAC-to-Email Quality, clicking the QAC result returns the desired email. Both our evaluation and metrics are designed to capture these two dimensions.

Quality Evaluation. Each QAC result was graded independently of its display position but in the context of the prefix and the search intent of the original query.

To evaluate QAC String Quality, graders recorded a Helpfulness score on a 3-point scale based on the extent to which the QAC result aligned with the grader’s search intent. To summarize the grading criteria, score “0” denotes an “Unhelpful” result unaligned with user intent, score “1” denotes a “Slightly Helpful” result somewhat aligned with user intent, and score “2” denotes a “Helpful” result fully aligned with user intent. For QAC results with a “0” score, graders were also able to select one or more defect flags to help categorize the “Unhelpful” result. The flags “Unintelligible” and “Inappropriate” were further tagged as serious defect flags that represent the most critical product issues. See Table 2 for further guidelines and examples for evaluating QAC String Quality.

To evaluate QAC-to-Email Quality, graders were also asked to confirm whether the top six emails returned after selecting the QAC result contained their desired email(s). This step was completed independently of the string quality evaluation to prevent data peeking that might have influenced Helpfulness scores.

Metrics. To measure QAC String Quality, we first adopt the widely used Normalized Discounted Cumulative Gain to compute NDCG Helpfulness@k using the 3-point Helpfulness score. Additionally, to control for highly defective QAC results, we also compute a binary, secondary metric, unweighted Defect Rate@k, which measures the percentage of test set prefixes that generated one or more top-k QAC results with at least one serious defect. To measure QAC-to-Email Quality, we use unweighted Email Recall@k, which computes the percentage of top-k QAC results that successfully returned the desired email within the top 6 emails.

Metrics are measured @3 and @5, and for each metric we calculate the mean value across all test set prefixes. These prefix-averaged metrics most closely capture the experience of the end user, who sees a new list of QAC results for each newly typed prefix.
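Under this reading of the definitions, the per-prefix metrics could be computed roughly as in the sketch below (our own helper names); the prefix-level values are then averaged across all test set prefixes.

    # Sketch of the per-prefix metrics; grades and flags are ordered by display rank.
    import math
    from typing import List

    def ndcg_helpfulness(grades: List[int], k: int) -> float:
        """NDCG@k with the 3-point Helpfulness grade (0/1/2) as gain."""
        def dcg(g: List[int]) -> float:
            return sum(x / math.log2(i + 2) for i, x in enumerate(g[:k]))
        ideal = dcg(sorted(grades, reverse=True))
        return dcg(grades) / ideal if ideal > 0 else 0.0

    def defect_rate(serious_defect: List[bool], k: int) -> float:
        """1.0 if any top-k result of this prefix carries a serious defect flag."""
        return 1.0 if any(serious_defect[:k]) else 0.0

    def email_recall(found_desired: List[bool], k: int) -> float:
        """Fraction of top-k QAC results that surfaced the desired email
        within the top 6 retrieved emails."""
        top = found_desired[:k]
        return sum(top) / len(top) if top else 0.0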

4.3 Baseline Methods

Most existing QAC methods utilize query logs, and learn from public or shared corpora. To the best of our knowledge, [9, 13] are the only methods that do not solely rely on query logs or shared corpora, hence serving as our baselines. These methods generate n-gram candidates that contain the original query as a prefix. They are described as follows.

  • TF-IDF [13] used term frequency (TF) and inverse document frequency (IDF) features to calculate ranking scores for all n-gram candidates extracted from the email corpus.

  • TF-IDF\(\boldsymbol{+}\) [9] used TF-IDF to capture the word completion probability (e.g., the probability of completing “appl” to “apple”). TF-IDF is then multiplied with a word-to-phrase probability (e.g., the probability of completing “apple” to “apple store”).

  • Multiple ranking features [13] (denoted Mul) used document scores and zone weights as additional features on top of TF-IDF. Feature weights come from a model whose labels are derived from centrally collected logs. We have no access to massive logs due to the on-device setting, hence the weights are manually tuned with a labeled set.

4.4 Results

Table 3. Evaluation results.

In total, 19,584 QAC results were graded; the experimental results are summarized in Table 3. It can be observed that our proposed method consistently outperformed all of the baselines. It achieved the highest NDCG Helpfulness while maintaining the lowest Defect Rate. The gains show that the graders were generally more satisfied with the quality of our QAC results. Our proposed method also achieved the highest Email Recall, indicating that with our method it was much easier for the graders to locate their desired emails. The metrics gap between TF-IDF\(\boldsymbol{+}\) and TF-IDF suggests the effectiveness of separating completion and suggestion parts during candidate ranking. Mul takes advantage of the additional document features and exhibited slightly higher NDCG and Email Recall compared with TF-IDF.

The significant gains of our proposed method can be attributed to the following factors. First, our method is not limited to contiguous n-grams during candidates retrieval. By separating the completion and suggestion parts and progressively calculating the rank score, our method incorporates a larger candidates set, and produces higher NDCG Helpfulness and Email Recall at the same time. Second, our utilization of the structural ranking signals such as zone weight and proximity also helps boost NDCG Helpfulness and lower the Defect Rate. Third, the document score feature connects the email relevance and the QAC candidates ranking, which ensures that the generated candidates favor those top ranked documents that are supposed to be more relevant to the user, hence producing higher Email Recall. Fourth, the completion cost feature prevents a result from being too aggressive, hence lowers the Defect Rate.

Table 4. Case study examples. For “bm”, the 2nd results are listed. For “cha”, the 2nd and 3rd results are listed.

In addition to the metrics gains, we also carried out voluntary case studies with our graders to confirm that these gains are indeed meaningful. Two examples are presented in Table 4. In the first example, the grader’s query is “bm”, and the intent is to find a recent event email with subject “Registration is open for the BMW Ultimate Driving Experience”. “bmw” is ranked 1st by all methods. Our method is the only one that can retrieve and rank the result “bmw registration” to the top of the list. The token “registration” is not adjacent to “bmw” in any zone of the matched emails, hence the baseline methods do not even have a chance to retrieve it as a candidate. Mul takes the zone weights into consideration, hence it promotes the result “bmw santa monica”, which comes from the sender zone. TF-IDF and TF-IDF\(\boldsymbol{+}\) both surface “bmw santa”, which was graded as “Unintelligible” because it is incomplete and therefore the grader found it very difficult to understand.

In the second example, the grader’s query is “cha”, and the intent is to find emails from Chase Bank. All four methods rank “chase” to the 1st, hence it is not listed. TF-IDF’s 2nd and 3rd results are “chargers” and “chargers holders”, which come from a large number of promotional emails. These two results were graded as “Unintelligible” because the grader had never read these promotional emails and therefore found the results very difficult to understand. Our method and Mul were able to find a “Helpful” result, “chase credit journey”, which comes from a sender, and surfaced due to higher zone weights. Our method’s 3rd result, “charles schwab”, has a low TF in the grader’s mailbox, but comes from a recent email with a high document score. Although it is not aligned with the grader’s main intent, it is better than “change” which comes from an unintended sender “Change.org”. It is also better than “chargers holders” since it is potentially more useful and not defective.

Apart from quality evaluations, we also measured the latency of our system. Our implementation adds a low latency overhead of 50 ms on average, which is fast enough for instant search and is not perceivable by end users.

5 Conclusions and Future Work

In this paper, we propose an on-device QAC method for email search. Our QAC system is seamlessly integrated with the on-device email retrieval and ranking systems, hence it is personalized and adaptive to a user’s ever-changing email corpus. QAC candidates are generated from top ranked emails, and their retrieval is not limited to consecutive n-grams. Candidate ranking features are efficient and easy to implement. We also propose a novel, private-corpora-based offline evaluation method to measure on-device QAC quality. Experiments show that our method outperforms strong baseline algorithms.

There are promising directions for future work. Current ranking signals used in our method are mainly from lexical information, while rich semantic information can be extracted from email texts, entities, connections and user engagements. On-device personalized modeling or model fine-tuning based on a user’s personal data is a very interesting and challenging direction, where federated learning and transfer learning can be employed to protect user privacy. Another potential direction is to study how the QAC results can be diversified to cover multiple possible query intents, and to design evaluation methods to measure the diversity of the generated QAC results.