1 Introduction

1.1 NMR in metabolomics and compound identification

Metabolomics has become a key component of modern biological and biomedical studies, providing rich information on an organism’s biological status in health and disease. However, to exploit its full potential, the field must address the fundamental problem of metabolite identification (Edison et al., 2021; Garcia-Perez et al., 2020; Monge et al., 2019). Metabolomic assays capturing the widest range of metabolites (untargeted approaches) yield many unidentified features (spectral elements defined across samples), each of which reports on one or more small molecules. Without identification, it is extremely difficult to investigate and understand the biological mechanisms at work, either by expert interpretation or by bioinformatic approaches such as pathway analysis.

NMR is widely accepted as one of the most powerful structural assignment tools available to the analytical chemist, yielding structural information at many levels for a wide range of molecules. Here, we focus on small molecules, although lipid and protein signals are also commonly detected. Despite these strengths, NMR faces several challenges, illustrated in Fig. 1: overlap (two or more peaks occupying the same spectral region), peak shifting (due to pH or metal ions; Tredwell et al., 2016), relatively low sensitivity compared to mass spectrometry, spectral crowding, and complex peak shapes (discussed in more detail below). Further complications can arise from differences in resolution, field strength, and line shape across samples or studies. Full utilization of 1D 1H data requires expert consideration of these variables.

Fig. 1

Overlap, peak shifting, and spectral crowding are major issues for NMR annotation. A full 1H-NMR spectrum of human urine (Salek et al., 2007; lower panel) shows the great diversity in signal intensity and shape, as well as the crowded regions (e.g. ~3–4 ppm and 7–8 ppm) typical of biofluid data. An expansion of the red box (top panel) shows two doublet peaks (yellow and purple) shifting relative to each other, so that they overlap by different amounts in different samples. As a result, the observed peak shapes differ across samples, greatly complicating annotation and quantification. A collection of signals at 1.50 ppm exhibits even more complex peak shapes and overlap. Data are from study MTBLS1 at the MetaboLights repository (www.ebi.ac.uk/metabolights/MTBLS1; Haug et al., 2020).
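The effect of shifting on observed peak shape can be reproduced with a toy simulation. The following is a minimal sketch, not a model of the MTBLS1 data: it assumes ideal Lorentzian line shapes, and all names and parameter values (`simulate`, `count_peaks`, a 0.012 ppm splitting, a 0.002 ppm line width, the doublet positions) are illustrative choices rather than values taken from the figure.

```python
"""Toy simulation: two doublets whose relative shift changes the observed peak shape."""


def lorentzian(x, center, fwhm):
    """Unit-height Lorentzian line; fwhm is the full width at half maximum (ppm)."""
    half = fwhm / 2.0
    return half * half / ((x - center) ** 2 + half * half)


def doublet(x, center, j_ppm, fwhm):
    """Two equal lines split by the coupling j_ppm (in ppm) around center."""
    return (lorentzian(x, center - j_ppm / 2.0, fwhm)
            + lorentzian(x, center + j_ppm / 2.0, fwhm))


def simulate(center_b, center_a=1.470, j_ppm=0.012, fwhm=0.002,
             start=1.40, step=0.0002, n_points=751):
    """Return (axis, intensities) for a fixed doublet A plus a shifted doublet B.

    All default values are hypothetical and chosen only for illustration.
    """
    axis = [start + i * step for i in range(n_points)]
    signal = [doublet(x, center_a, j_ppm, fwhm) + doublet(x, center_b, j_ppm, fwhm)
              for x in axis]
    return axis, signal


def count_peaks(signal, min_height=0.5):
    """Count local maxima above min_height (a deliberately naive peak picker)."""
    return sum(
        1
        for i in range(1, len(signal) - 1)
        if signal[i] > min_height
        and signal[i] > signal[i - 1]
        and signal[i] >= signal[i + 1]
    )


if __name__ == "__main__":
    # Same two compounds; only the position of doublet B differs between samples.
    for label, center_b in [("sample 1", 1.500), ("sample 2", 1.482)]:
        _, signal = simulate(center_b)
        print(label, "peaks:", count_peaks(signal),
              "max height:", round(max(signal), 2))
```

With doublet B at 1.500 ppm the naive peak picker finds four resolved lines; at 1.482 ppm one line of each doublet coincides and only three maxima remain, one of them roughly twice as tall. The composition is identical in both cases, which is why a peak list alone can mislead automated annotation and quantification.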

As such, identification of compounds in metabolomic NMR spectra is a nontrivial process that is hard to automate. Unambiguous structural identification typically requires highly time-consuming examination by a field expert who can leverage the rich and nuanced theoretical concepts involved. In theory, any spectrum should be computable from first principles, and there is an excellent literature covering the identification of small molecules by NMR techniques for metabolomics and natural products research (Beniddir et al., 2021; Bingol et al.,

1.2 Annotation as a subprocess of identification

Annotation, on the other hand, is the assignment of putative candidates to an observed feature using physicochemical properties or spectral similarity together with metabolite databases, and it has become an important step towards final identification (Eghbalnia et al., 2017; Everett, 2015; Monge et al., 2019; Sumner et al., 2007; Ulrich et al., 2019). Annotations do not provide unique and certain identifications; they are a first step, and their confidence should be expressed on an appropriate scale (Joesten & Kennedy, 2019; Sumner et al., 2007). Moreover, the information obtained by annotation methods is often suitable for large-scale biological hypothesis generation. Since this process is less stringent and scale is important, it is sensible to automate annotation when an acceptable balance between confidence and scalability exists.
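As a concrete illustration of annotation by database matching, the sketch below assigns candidate compounds to a list of observed peak positions. It is a minimal example under stated assumptions, not any published tool's method: the names `REFERENCE_SHIFTS`, `Candidate`, and `annotate` are hypothetical, the reference shifts are approximate values included only for illustration (a real workflow would draw them from a curated database), and the score is a crude matched-fraction rather than one of the confidence scales cited above.

```python
from dataclasses import dataclass

# Approximate 1H chemical shifts (ppm), for illustration only.
# A real workflow would take these from a curated metabolite database.
REFERENCE_SHIFTS = {
    "lactate": [1.33, 4.11],
    "alanine": [1.48, 3.78],
    "creatinine": [3.04, 4.05],
}


@dataclass(frozen=True)
class Candidate:
    """A putative annotation with a crude, hypothetical confidence score."""
    name: str
    matched: int    # reference peaks found in the observed peak list
    expected: int   # reference peaks listed for the compound

    @property
    def score(self) -> float:
        """Fraction of reference peaks that were matched."""
        return self.matched / self.expected


def annotate(observed_ppm, reference, tolerance=0.02):
    """Rank compounds by the fraction of their reference peaks that lie
    within `tolerance` ppm of any observed peak."""
    candidates = []
    for name, ref_peaks in reference.items():
        matched = sum(
            1 for ref in ref_peaks
            if any(abs(obs - ref) <= tolerance for obs in observed_ppm)
        )
        if matched:
            candidates.append(Candidate(name, matched, len(ref_peaks)))
    # Highest score first; ties are broken alphabetically for reproducibility.
    return sorted(candidates, key=lambda c: (-c.score, c.name))


if __name__ == "__main__":
    observed = [1.33, 1.47, 3.05, 4.12]  # hypothetical picked peak positions
    for cand in annotate(observed, REFERENCE_SHIFTS):
        print(f"{cand.name}: {cand.matched}/{cand.expected} peaks matched")
```

Every compound with at least one matching peak is returned as a candidate, so the output is a ranked list rather than an identification. The fixed tolerance also shows the trade-off: widening it accommodates peak shifting but admits more false candidates in crowded regions.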

1.3 Challenges in the automation of annotation

Annotation, however, is also difficult to automate. Many difficulties can be traced to the complex nature of the mixtures analysed in metabolomics, as well as the relative lack of sensitivity and resolution of NMR compared to other analytical techniques. Once again, we point to excellent discussions of these issues in previous reviews (Beniddir et al., 2021). Rather than revisit them, we aim to differentiate between the various computational approaches to automating annotation. Numerous additional difficulties and ambiguities in automation arise because the seasoned spectroscopist applies knowledge and information to a spectrum implicitly. We therefore find it helpful to delineate these approaches by the type(s) of spectral features used, the underlying structural information they utilize, and their computational characteristics. Note that in discussing these types of information we do not intend to replace long-standing terms used by the NMR community; rather, our intent is to suggest nomenclature that refers to how these elements are computationally derived and used in practical annotation. Furthermore, we will not attempt a comprehensive and rigorous assessment of available annotation tools; we point the reader to existing reviews that document and carefully discuss them (Beniddir et al., 2021; Misra, 2021).