Background

Human impacts on the environment are responsible for a dramatic increase in habitat destruction and an ever-increasing list of species that are in decline. For example, most species of mammals have lost more than half of their original range since the 19th century [1] and for well-studied mammals such as primates over half of the species are listed as endangered [2]. Moreover, rapid habitat loss is responsible for extinction rates that have been estimated to be over 100 times higher than the background rates [3]. The decline of some species can be slowed through conservation measures such as habitat preservation, enhancement or ex-situ management, but such measures require natural history data on the fundamental aspects of the species’ biology, distribution, and genetic diversity. The need for such information is urgent [4], but coincides with the decline of natural history research [5, 6]. Furthermore, the limited resources available are disproportionately spent on a few charismatic species, thus leaving little funding for other species [7, 8]. Yet these are likely to represent the majority of the endangered species and populations [9]. There is thus a need to develop new techniques capable of rapidly expanding the data obtained in the limited field studies that are often applied to such species.

Valuable natural history information can be obtained by the in-depth study of non-invasive samples such as feces, even if available in small numbers only. Fecal samples are often collected opportunistically during routine field work or they can be obtained efficiently using detection dogs [10]. Such samples have the potential to simultaneously provide information on host genetics, diet, and intestinal parasites. Some of this information can be obtained by direct morphological examination of fecal samples, e.g. by studying diet remnants [11] and gut parasites [12]. Molecular methods have expanded the utility of fecal samples by allowing the analysis of host genetics [13], diet from various sources [14] and the detection of parasites [15]. However, molecular methods have been labour-intensive as the characterization of multiple species from complex samples involved cloning and Sanger sequencing [14]. The advent of high-throughput sequencing (HTS) has simplified the characterization of complex fecal DNA and now allows for simultaneous characterization of the different aspects of the ecology of a species [16]. For fecal samples, HTS can be employed in two ways, either by direct shotgun sequencing of DNA extracted from the fecal samples (metagenomics) or by PCR-based metabarcoding of target genes.

Currently, metabarcoding is more widely used [17, 18], and it has an advantage of lower cost where large numbers of samples have to be screened. Metagenomic shotgun sequencing, in contrast, remains largely unexplored for use in conservation biology [16, 19]. This is presumably due to the higher cost of sequencing and the greater bioinformatics effort required for analysing metagenomic data. But this approach has the potential advantage of rapidly yielding data on genetics, diet, parasites and microbiota from fecal samples, while also avoiding the need for a priori selection of amplification targets which limits the study to the sequencing of a specific subset of the genetic material [20]. This makes metagenomics attractive, but also raises practical challenges. Firstly, bioinformatic challenges arise from the need for a comprehensive reference database against which shotgun data can be queried [16]. Secondly, because of its costs shotgun sequencing is mostly suitable for studies requiring few samples, although with the expectation of cheaper DNA sequencing, one could argue that now is a good time to evaluate and develop the bioinformatic tools for metagenomic data.

The critically endangered population of banded leaf monkeys (Presbytis femoralis femoralis) in Singapore is one case where field observational data have been particularly difficult to obtain. Initially described from Singapore and common in the 19th century, the only remaining population now comprises ~40 individuals that are restricted to the Central Catchment Nature Reserve [21]. The forest is surrounded by urban areas and affected by further urban development, which creates conservation challenges including habitat loss, fragmentation and direct anthropogenic disturbance. The situation is exacerbated by the low genetic diversity within the population [22]. Studies of the species’ autecology, prior to developing conservation strategies, have been hampered by the difficulty of making direct observations; a 6-month study in the 1990s led to only 13 sightings [23]. Overall, our current understanding of the species’ biology is preliminary and here fecal samples can be useful in complementing the current research.

In this study we aim to characterize fecal samples of P. femoralis using metagenomics and metabarcoding, for comparisons with field observational data on feeding ecology. Our recent pilot study comparing these approaches for diet analyses in the red-shanked doucs (Pygathrix nemaeus) [16] in a controlled zoo environment suggested that shotgun sequencing yields better taxonomic resolution if utilizing multiple reference loci, as compared to single marker metabarcoding, but this was at the expense of lower detection probability of rare food plants in the sample. Here we increased the depth of shotgun sequencing to obtain high taxonomic resolution whilst also detecting rare diet items. We also test whether DNA based analyses are congruent with field observational data, given that this is the first study applying metagenomics to samples collected in the wild. The challenges are considerable because colobine primates have long digestion times that may cause high DNA degradation (Mean Retention Time >40 h [24]), plant barcodes are short and often not species-specific [25], the potential diet of banded leaf monkeys consists of >700 species of trees and lianas in the studied habitat [26], and the amount of target DNA is minute in comparison to the DNA of microbial origin in fecal material. Despite these difficulties, we show that fecal samples can yield a credible set of well-identified plant sequences that correlates with field observational data. In addition, shotgun sequencing provides data on population genetic structure and gut parasites of individual monkeys.

Results

Field observations

Two and a half years of field observations yielded 31 feeding observations and banded leaf monkeys were seen to feed on 27 plant species from 24 genera and 20 families during the surveys (Additional file 1: Table S2, Table 1). Diet was primarily comprised of fruits and leaves, and to a lesser degree of flowers. Of the 27 species, Fibraurea tinctoria, Xanthophyllum ellipticum, Prunus polystachya and Hevea brasiliensis had two feeding observations each, while feeding on all other species was observed only on a single occasion (Additional file 1: Table S2).

Table 1 Summary of plant identifications

Illumina sequencing

Illumina sequencing using HiSeq produced ~67 to ~108 million reads while MiSeq produced ~23 to ~29 million reads per end per sample. For metabarcoding 272,103 to 419,407 sequences per sample were generated for the widely-used marker, P6 loop of trnL. These sequences were subsequently filtered and subjected to variant calling and diet identification.

Diet analysis

BLAST searches of the HiSeq and MiSeq metagenomic data were conducted against plant barcode databases comprising rbcL, matK and trnL-F sequences from GenBank and newly sequenced data from the Nee Soon Swamp forest. These yielded between 2616 and 6416 sequence reads (0.004–0.008 %) per sample that could be used for taxonomic classification (Additional file 1: Table S4). A large proportion of the shotgun reads could be classified at least to family (87.0–96.2 %) and a substantial proportion had a genus name associated with them (45.0–56.5 %). A smaller fraction of the reads could be identified to species (27.0–39.5 %); the remaining reads had similarly high matches to multiple species in a genus or family and thus could be identified only to higher taxonomic ranks (Additional file 1: Table S5). For metabarcoding, after applying the different filtering criteria (FC1 [16], variant calling) we retained the following number of unique sequences per sample: 31 (BLM1), 40 (BLM2), 31 (BLM3), 19 (BLM4), 61 (BLM5) and 46 (BLM6). Here, 4.9–15.8 % of the unique sequences produced species-level identifications and 13.1–27.5 % were informative to genus level, while most contained only family-level information (60.7–73.7 %) (Additional file 1: Table S5).
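The rank-level percentages above follow from assigning each read to the most specific taxon shared by its best database matches. The following is a minimal sketch of that idea, not the readsidentifier implementation: the lineage tuples, the `tolerance` parameter and both function names are hypothetical and introduced only for illustration.

```python
from typing import List, Tuple

# Assumption: each reference sequence carries a (family, genus, species) lineage.
Lineage = Tuple[str, str, str]
RANKS = ("family", "genus", "species")


def lowest_common_rank(lineages: List[Lineage]) -> Tuple[str, str]:
    """Return (rank, name) for the most specific rank shared by all lineages."""
    best = ("unclassified", "")
    for i, rank in enumerate(RANKS):
        names = {lineage[i] for lineage in lineages}
        if len(names) != 1:
            break  # lineages disagree at this rank; keep the previous one
        best = (rank, names.pop())
    return best


def classify_read(hits: List[Tuple[Lineage, float]],
                  tolerance: float = 0.0) -> Tuple[str, str]:
    """Classify one read from its (lineage, percent identity) hits.

    Hits whose identity lies within `tolerance` of the top hit are treated
    as equally good; the read is assigned to their shared taxon.
    """
    if not hits:
        return ("unclassified", "")
    top = max(identity for _, identity in hits)
    kept = [lineage for lineage, identity in hits if identity >= top - tolerance]
    return lowest_common_rank(kept)
```

Under this scheme a read matching two congeneric species equally well receives only a genus name, which is the situation described for reads that could not be resolved to species.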

Comparison of metagenomics and metabarcoding

Family-level identifications were largely congruent between the metagenomic (MG) and metabarcoding (MB) analyses (Fig. 1). Metagenomics yielded identifications for 11–25 families per fecal sample (total number of family-level identifications = 99), while metabarcoding revealed 11–22 families per sample (total number of family-level identifications = 93). The use of 95 % or 90 % identity thresholds led to negligible differences for the metabarcoding results (91 vs. 93 identifications). The performance of the two approaches differed at the genus and species levels (Fig. 1), as metagenomics generated ~2–3 times more identifications at both taxonomic ranks (genus: MG total = 115, range = 11–36 vs MB total = 46, range = 4–11; species: MG total = 59, range = 3–21 vs MB total = 24, range = 2–7).

Fig. 1

Identifications at different taxonomic hierarchies using metagenomics (MG) and metabarcoding (MB). Colours represent average of proportion of identifications per sample that were made by both MG and MB (black), MG only (blue), and MB only (red)

In order to check for the reliability of these identifications, we compared the identified genera/species to the checklists of plants for Nee Soon Swamp forest and Singapore (see Methods). Of the 115 genus-level identifications made by metagenomics, 110 were consistent with the Nee Soon Swamp forest list, while two additional ones matched the Singapore checklist and only three were not known for Singapore (Additional file 1: Figure S1a). The corresponding numbers for metabarcoding were as follows: out of 46 identifications, 40 were for plant genera present in the Nee Soon checklist, one was present only in the Singapore checklist and five identifications were present in neither checklist. Overall both methods were reliable at genus level. At species level, both methods had higher mismatches with the Singapore database, as 13.6 % of metagenomics and 25 % of metabarcoding similarities had best matches to extraneous reference sequences. Note, however, that the comparison between metagenomics and metabarcoding at species level is affected by the small numbers of barcodes corresponding to the P6 loop of trnL in the plant database (Additional file 1: Figure S1b).

Congruence of DNA based techniques with field observations

We next tested to what degree the pools of diet species inferred by DNA from the six fecal samples overlapped with the field observations. We first excluded potential misidentifications and synonyms (Additional file 1: Figure S1, yellow/red) and then limited the analyses to genus/family level identifications due to greater uncertainty at species level. Using metagenomics we obtained a set of 53 distinct plant identifications from 33 families. Forty-nine of the 53 identifications were at genus level, while four identifications could be made only to family (Araceae sp., Primulaceae sp., Sapindaceae sp., and Sapotaceae sp.; Table 1) and could not be resolved further. Using metabarcoding we obtained 35 distinct plant identifications from 32 families. Twenty-one of the 35 identifications were to genus, while the remaining 14 distinct family-level identifications could not be resolved further.

Comparison of these results to the diet profile from field studies, comprising 27 diet species from 20 families and 24 genera, revealed that overall the identifications by metagenomics, metabarcoding and field observations corroborated each other, but the DNA-based analyses gave a larger number of plant identifications. When all three methods were compared, there was a high level of congruence for the family-level profile, with 16 of 20 families of plants from field observations also identified using metagenomics and metabarcoding (Fig. 2a). Due to the greater taxonomic resolution achieved by metagenomics, the overlap at genus level was better for metagenomics than for metabarcoding (MG: 15/24 genera, MB: 6/24 genera). Lastly, out of the 15 genera observed in HTS-based diet analyses and field observations, 11 were found in three or more samples (Table 1). Plants with multiple feeding observations were also present in multiple fecal samples: Fibraurea, Prunus, Hevea, Xanthophyllum, and Litsea were present in six, six, four, four and four samples respectively.

Fig. 2

Number of family (a) and genus (b) level identifications using metagenomics, metabarcoding and field observations

Effect of sequencing depth on diet analysis using metagenomics

The completeness of the HTS dietary profile may depend on the sequencing depth. Rarefaction of sequence reads indicated that four of six samples approached an asymptote at sequencing depth of 70–100 million reads while the two most diverse samples (BLM2 and BLM6) showed increasing species diversity at this sequencing depth (Fig. 3). Hence, sequencing ~70 million paired reads (~10 Gbp) would lead to identification of most of the diet items in most samples, although due to the variability in diet across individuals or feeding events, the current sequencing depth may not be sufficient to capture the full dietary breadth of an individual.
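The rarefaction reasoning can be illustrated with a short sketch that repeatedly subsamples classified reads without replacement and counts the distinct taxa recovered at each depth. This is an illustrative assumption about the general technique, not the procedure used in the study; the function name, the `replicates` and `seed` parameters, and the input format (one taxon label per classified read) are hypothetical.

```python
import random
from typing import Dict, Sequence


def rarefaction_curve(read_taxa: Sequence[str],
                      depths: Sequence[int],
                      replicates: int = 10,
                      seed: int = 0) -> Dict[int, float]:
    """Mean number of distinct taxa observed at each subsampling depth.

    read_taxa: one taxon label per classified read (hypothetical input).
    depths: numbers of reads to draw; each must not exceed len(read_taxa).
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same curve
    reads = list(read_taxa)
    curve: Dict[int, float] = {}
    for depth in depths:
        if depth > len(reads):
            raise ValueError("depth exceeds the number of available reads")
        # Draw `depth` reads without replacement and count distinct taxa.
        counts = [len(set(rng.sample(reads, depth))) for _ in range(replicates)]
        curve[depth] = sum(counts) / replicates
    return curve
```

A curve that flattens as depth increases suggests that most detectable diet items have been recovered, whereas a curve that is still rising, as for BLM2 and BLM6, suggests that deeper sequencing would reveal further taxa.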

Fig. 3

Rarefaction curves representing number of plants identified at varying sequencing depths per sample. Rarefaction of plant reads was extrapolated to estimate effect of rarefaction of all reads in the metagenome

Diet of banded leaf monkeys using metagenomics, metabarcoding and field observational data

We built a dietary profile of P. femoralis by combining the above species identifications made from HTS with those from field observations and thus obtained a profile consisting of 38 families. Thirty five of 38 family records could be further resolved to include 60 genera while three family records remained unresolved giving a total of at least 63 plant identifications. We could putatively assign 43 species names to 38 of these genus names (Table 1). They comprise 30 trees, 12 lianas and one shrub. Fibraurea tinctoria (Menispermaceae) and Prunus polystachya (Rosaceae) were found in all six samples, while Xanthophyllum ellipticum, Securidaca philippinensis (Polygalaceae), Hevea (Euphorbiaceae), Bauhinia, Dalbergia (Fabaceae), Litsea (Lauraceae), Strychnos (Loganiaceae), Artocarpus, Ficus (Moraceae), Knema (Myristicaceae) were present in four samples. The dominant families were Fabaceae, Moraceae, Menispermaceae, Rosaceae and Rubiaceae, which were present across all six samples, and Polygalaceae and Lauraceae present in five samples.

Low genetic variability in mitochondrial genomes

A complete mitochondrial genome sequence of 16,548 bp was reconstructed for one sample (BLM5) and used for read mapping of the remaining samples. The coverage for the six samples was 10.7–104.7X (HiSeq) and 7.4–41.3X (MiSeq). SNP calling using FreeBayes with ploidy = 1 led to the identification of only three variable sites in the mitochondrial genomes (Table 2). Four of the six samples showed polymorphism at position 7791 or 15,572 with good confidence of at least 5× coverage for both alleles. Overall, four different genotypes were recognized, separating the individuals for BLM2, BLM4, BLM6 and the three identical samples BLM1, BLM3 and BLM5 (Table 2).

Table 2 SNP calling for mitochondrial genomes at ploidy = 1

DNA from parasites and other Metazoa in the fecal material

BLAST searches against a parasite SSU rDNA database revealed presence of several protists and nematodes (Table 3). Sequences corresponding to Blastocystis and Entamoeba were present in varying amounts in four and five samples, respectively. Additionally, nematode identifications were made for Strongyloides sp. (3 samples) and Oesophagostomum sp. (one sample). Using the COI database, we recovered sequences mainly corresponding to plants, the primate host and arthropods. Arthropod sequences were found in three samples, and mostly in two (BLM3, BLM6; Additional file 1: Table S7), including Muscidae in both samples, and Sarcophagidae and Drosophilidae (BLM3) and Sepsidae and Lepidoptera (BLM6). At genus level the closest hits were to Dicranosepsis (Sepsidae, BLM6) and Phortica (Drosophilidae, BLM3).

Table 3 Parasite identifications made using SSU rDNA

Discussion

Comparison of metagenomics, metabarcoding, and field observational data

We demonstrate the power of metagenomic shotgun sequencing for the characterization of fecal samples and find that it can quickly yield important natural history data for endangered species based on few samples. Using metagenomics we document a diverse diet for the banded leaf monkey comprising at least 53 diet plants from 33 families. There was a good overlap between metagenomics and field observational data, with 15 of 24 genera of observed diet plants found in metagenomics data from merely six samples. Moreover, metagenomics recovered a similar number of plants to metabarcoding, as suggested by the comparison of family-level profiles, whilst providing greater taxonomic resolution by using multiple, longer genetic markers. In addition to a very diverse diet, the shotgun approach also detected previously uncharacterized parasites, and revealed low genetic diversity in mitochondrial genomes of P. femoralis.

There is good agreement between the diet reconstructed based on HTS data and field observations. Nearly half of the plant genera obtained from observational studies (11/24 genera) were also identified in at least three fecal samples. Researchers are more likely to observe feeding events involving frequently utilized diet species and these are also more likely to be present in multiple fecal samples (e.g., Fibraurea, Hevea, Prunus). We thus interpret the good overlap as indirect evidence for the reliability of the diet inferred by metagenomics but note that a dietary profile obtained from six samples is not comprehensive as, e.g., nine of 24 field-observed genera were not detected. Nonetheless, the diet profile obtained with metagenomics was much broader than the profile obtained by field observations. This is not unexpected given that the fieldwork only yielded 31 feeding observations while each fecal sample has the potential to cover ~ 48 h of feeding thus allowing for the identification of rare diet elements. Overall using HTS based methods, our analyses of only six samples added 39 plants to the observational data that had required ~30 months of field work. Field work was still necessary for sample collection, but in the future it can be aided greatly by use of dogs trained to detect feces from target species [10]. However, observational data still has some advantages. Firstly, it can provide information as to which specific individual and which parts of a plant are consumed although the latter can be difficult for food plants in a forest with >700 species of trees and lianas. Secondly, DNA based analyses may not necessarily represent preferred diet plants but also accidental ingestions, such as ingestion of pollen or any other material that may have been associated with the preferred diet items. The latter concern can be overcome by only considering diet items that are found in multiple samples.

When compared to metabarcoding based on one short amplicon, our results are similar to Srivathsan et al. [16] in that the main advantage of metagenomics is higher taxonomic resolution, which can be attributed to utilization of a combination of three barcodes and not limiting the analyses to the P6 loop of trnL. Our results for metabarcoding suggest the PCR-based approaches can amplify nearly all plant families revealed by metagenomics if the primers are universal enough, and thus the trnL approach is useful when a large number of samples have to be multiplexed and a family-level dietary profile is sufficient. To improve on identifications additional genetic markers can be included using methods such as the two-step approach involving group-specific primers proposed by De Barba et al. [18]. Here, the initial family-level identifications were further resolved using amplicons generated by family/taxon-specific nr ITS primers. This is feasible, but would require considerable effort for our samples because trnL sequences from 18 different families could not be assigned to genus/species suggesting that 18 new primer pairs may need to be designed and then used for each sample. Alternatively, metabarcoding could be based on multiplex PCRs using universal primers for multiple short barcodes. However, there is general consensus in the plant barcoding literature that no specific combination of currently used barcodes can be universally applied for species identification [27] and the multiplex PCRs would have to cover multiple markers. A third option may be anchored hybrid enrichment methods which allow for improving representation of multiple regions of interest in a sample [49] under default parameters and the mitochondrial genome of the related P. melalophos (GenBank: NC_008217) as reference. The genome was annotated using MITOS [50] and further manually curated prior to GenBank submission. 
A single contig was obtained against which reads were mapped back from all six HiSeq and MiSeq datasets using BWA mem [51]. The bam files generated in this process were further filtered to retain only sequences with a mapping quality of at least 30 and available paired-end reads. We used FreeBayes [52] with ploidy = 1, maximum read mismatch to reference set at 5 %, minimum coverage for the alternate allele of five and a variant quality score of at least 30 to identify the variant sites across the six samples. Results obtained using HiSeq and MiSeq datasets from the same sample were first cross-checked to ensure there were no differences in the calls for the two runs and then summarised together (see Additional file 1).
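The variant thresholds and the cross-check between sequencing runs can be sketched as follows. This is a simplified illustration rather than the pipeline used in the study: the `VariantCall` record, its field names and the `concordant_sites` helper are hypothetical, and the calls are assumed to have already been parsed from the variant caller's output.

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the text: variant quality >= 30, alternate allele coverage >= 5.
MIN_QUALITY = 30.0
MIN_ALT_DEPTH = 5


@dataclass(frozen=True)
class VariantCall:
    """Hypothetical minimal record for one called variant site."""
    position: int
    ref: str
    alt: str
    quality: float
    alt_depth: int


def passes_filters(call: VariantCall) -> bool:
    """True if the call meets both the quality and the coverage threshold."""
    return call.quality >= MIN_QUALITY and call.alt_depth >= MIN_ALT_DEPTH


def concordant_sites(hiseq: List[VariantCall],
                     miseq: List[VariantCall]) -> List[int]:
    """Positions where both runs support the same filtered alternate allele."""
    hiseq_alt = {c.position: c.alt for c in hiseq if passes_filters(c)}
    miseq_alt = {c.position: c.alt for c in miseq if passes_filters(c)}
    return sorted(p for p in hiseq_alt if miseq_alt.get(p) == hiseq_alt[p])
```

Requiring agreement between two independent runs of the same sample is one simple way to guard against run-specific sequencing errors being reported as variants.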

Parasites and other eukaryotes

In order to characterize other eukaryotes represented in the fecal samples, reads were matched against COI databases using settings identical to those in the diet analyses. All reads matching the COI database were retrieved and matched to the NT database of GenBank. Identifications were filtered using readsidentifier v1.0 at 95 % and 98 % identities, and only complete overlap between a read and COI sequences was considered. An initial survey of these results revealed matches to mostly plant, primate and insect sequences. Given that we were also interested in identifying potential parasites, we built a target rDNA database of common non-human parasites (Additional file 1: Table S1). SSU rDNA was selected as it has often been used to barcode single-cellular organisms and parasites such as nematodes. Similar to COI analyses, we matched the sequences using MEGABLAST and the retrieved hits were then matched to the NT database to validate the results. The reads were then classified at 98 % similarity and 50 bp overlap using readsidentifier v1.0.
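The similarity and overlap thresholds can be applied to BLAST tabular output with a short filter. The sketch below assumes the standard 12-column tabular format, in which the third column is percent identity and the fourth is alignment length; it is not the readsidentifier code, and the `Hit` type and function names are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """Subset of the columns in one BLAST tabular record."""
    query: str
    subject: str
    identity: float  # percent identity
    length: int      # alignment length in bp


def parse_tabular(lines: Iterable[str]) -> List[Hit]:
    """Parse tab-separated BLAST records, skipping blank and comment lines."""
    hits: List[Hit] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        hits.append(Hit(fields[0], fields[1], float(fields[2]), int(fields[3])))
    return hits


def filter_hits(hits: Iterable[Hit],
                min_identity: float = 98.0,
                min_overlap: int = 50) -> List[Hit]:
    """Keep hits meeting both thresholds (defaults follow the SSU rDNA settings)."""
    return [h for h in hits
            if h.identity >= min_identity and h.length >= min_overlap]
```

For the COI analyses the same filter could be called with `min_identity=95.0`, although the requirement for complete overlap between a read and the COI sequence would need the read length as an additional input.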

Availability of supporting data

Barcode sequences that matched metagenomics data have been submitted to GenBank with accession numbers KU853075-KU853258. Reference mitochondrial genome has been submitted to GenBank under accession number KU899140. Sequences corresponding to plants from the metagenomic data and the metabarcoding dataset, and plant databases have been archived in LabArchives doi:10.6070/H4000047. COI and parasite databases are available on request. Scripts written specifically for the study are included in the readsidentifier package https://github.com/asrivathsan/readsidentifier.