INTRODUCTION

The first genome of a fungus, Saccharomyces cerevisiae, was sequenced in 1996 (Goffeau et al. 1996). Subsequent developments in technology have made sequencing much more affordable, and the number of fungal genome and transcriptome sequencing projects has increased exponentially, resulting in 1886 genomes being available in 2020 (Grigoriev et al. 2014; Sharma 2015; NCBI 2021). Most of the early sequencing efforts focused on terrestrial fungi of ecological or economic significance, crop pathogens, or fungi related to human health (Sharma 2015). Despite these efforts, comparative genomics is still limited by the lack of available genomic data and by poor taxonomic representation of the known taxa (Naranjo-Ortiz and Gabaldón 2019; Lücking et al. 2020). This is especially noticeable among marine fungi, for which few genomes are available compared to terrestrial fungi. The 1000 Fungal Genomes (1KFG) project aims to address these issues and answer questions regarding ecologically and taxonomically overlooked fungi, such as marine fungi in poorly resolved taxa like Helotiales (Leotiomycetes). By making their genomes publicly available, 1KFG helps to elucidate the general features of marine fungi (Grigoriev et al. 2011; Grigoriev et al. 2014).

The marine environment is vastly different from the terrestrial environment, leading to distinct adaptations of the organisms living there. Such adaptations include unique enzymes that withstand low or high temperatures, pressure or salt concentrations, potent signaling molecules and sensitive receptors, specific pigments, and other unique metabolites (Van Noort et al. 2013; Kis-Papo et al. 2014; Rédou et al. 2015; Oey 2016; Fouillaud et al. 2017; Huang et al. 2017; Trincone 2018). Many substrates available in the marine environment differ from terrestrial substrates. These include polysaccharides such as laminarin, carrageenan, fucoidan, alginate, ulvan, galactans, porphyran, agarose and chitin, which either do not occur in terrestrial sources or carry different modifications such as sulfation (Barbosa et al. 2019). Fungal enzymes utilizing specific marine polysaccharides, such as glycoside hydrolase family 29 (GH29) linked to the degradation of algal fucoidan, GH107 linked to sulfated fucans, GH78 and GH105 linked to ulvan and GH18 and GH82 linked to carrageenan, are of interest for industrial processing. These enzymes can make sugars bioavailable in feed for aquaculture and agriculture, and can be used to produce specific polysaccharides for pharmaceutical purposes or carbon sources for bioenergy production. Marine microorganisms also communicate with each other and protect themselves using secondary metabolites. Because the water dilutes any secreted molecules, the secondary metabolites have to be potent and are therefore of special pharmaceutical interest as potential drugs (Berteau et al. 2002; Haefner 2003; Michel et al. 2006; Collén et al. 2014; Vickers et al. 2018; Reisky et al. 2019; Carroll et al. 2020; Dobrinčić et al. 2020).

Some of the fungi frequently observed in the marine environment are Acremonium-like fungi, a polyphyletic assembly of mostly indistinct, hyaline, simple, asexual fungi. These fungi are isolated from macroalgae, invertebrates and sediments (Zuccaro et al. 2008; Duc et al. 2009; Loque et al. 2010; Paz et al. 2010; Mouton et al. 2012; Zhang et al. 2013; Rédou et al. 2015; Zhang et al. 2015; Lee et al. 2019). Binomially named Acremonium fungi are found within Glomerellales, Hypocreales, Sordariales, Cephalothecales (Cephalothecaceae) and Leotiomycetes, showing how the name Acremonium is applied collectively to phylogenetically distinct, but often morphologically indistinct, fungi (Summerbell et al. 2011). Many of these fungi have close sequence similarity to the sexual morphs of described species and likely represent the asexual morphs of these species (Summerbell et al. 2011). Some of the Acremonium-like taxa within the Emericellopsis clade are marine, specifically those closely related to E. maritima and A. fuci, whereas terrestrial isolates form a distinct clade (Zuccaro et al. 2004). Alkali-tolerant soda soil fungi seem to have derived from the marine lineage and are nested in their own subclade within the marine clade (Grum-Grzhimaylo et al. 2013). This concept of three ecological clades is challenged by recent research based on phylogenies of the nuclear ribosomal DNA (nrDNA) ITS1–5.8S-ITS2 region (ITS) and β-tubulin (tub2), and should be retested with multilocus gene phylogenies when new species are described (Gonçalves et al. 2020). Despite frequent phylogenetic studies and descriptions of new species, relatively few Acremonium-like fungi have available genome sequences. For Emericellopsis, there are no reference genomes available (Grigoriev et al. 2014; NCBI Resource Coordinators 2018). From chemical studies, it is known that species within the genera Acremonium and Emericellopsis can produce a range of known bioactive metabolites (Argoudelis et al. 1974; Rogozhin et al. 2018; Hsiao et al. 2020). Despite evidence of secondary metabolite production, our understanding of the full biosynthetic potential of Emericellopsis species remains limited.

Calycina marina is a non-lichenized discomycetous fungus that is found exclusively on decaying seaweeds and has been collected across northern Europe (Baral and Rämä 2015; GBIF Secretariat 2021). Calycina marina is unique in habitat, substrate and morphology compared to its closest relatives in Calycina, which are terrestrial species (Baral and Rämä 2015). It is also peculiar in being one of the few marine discomycetes, whereas hundreds of discomycetous species are known from the terrestrial environment. Amylocarpus encephaloides is another strictly marine fungus that occurs on wood in the tidal zone (Prasannarai and Sridhar 2004). The fungus has a unique way of degrading wood that is similar to, but distinct from, brown rot and which may involve industrially interesting CAZymes (Prasannarai and Sridhar 2004). The fungus has been reported from the Atlantic, Pacific and Indian Oceans (Prasannarai and Sridhar 2004; GBIF Secretariat 2021).

Here, we provide a thorough taxonomic and genomic description of the first fully sequenced Emericellopsis species. To further contribute to the knowledge of marine fungi, we include a brief description of the genomes of two marine fungi, Calycina marina and Amylocarpus encephaloides (Helotiales, Ascomycota), and resolve their phylogeny based on multilocus data extracted from genome sequences.

MATERIALS AND METHODS

In this manuscript we adhere to italicizing Latin names of organisms and higher order taxonomic ranks as discussed in Thines et al. (2020). Several of the methods used have previously been published and are only briefly described here.

Sampling and isolate information

The isolation method for isolate TS7 was previously described in Batista-García et al. (2017). Emericellopsis sp. TS7 (Class Sordariomycetes, Order Hypocreales, Family Hypocreales incertae sedis) was obtained from the sponge Stelletta normani (Class Demospongiae, Order Astrophorida, Family Ancorinidae) collected on 16th June 2010 from 1350 m depth in the Atlantic Ocean (54.0613° N, 12.5518° W), off the west coast of Ireland, using the remotely operated vehicle Holland I on board the R.V. Explorer (Kennedy et al. 2014). Briefly, 1 mL of the macerated sponge material was serially diluted and 100 μL of each dilution was inoculated on agar plates with either malt extract agar-artificial seawater (ASW) or potato dextrose agar-ASW (DIFCO). Axenic cultures were obtained after two passages from the primary isolation. The fungus is accessible in the fungal collection of the School of Microbiology at University College Cork, under accession code TS7, and at the Westerdijk Fungal Biodiversity Institute (CBS-KNAW) under the accession CBS 147198. Emericellopsis sp. TS7 was selected for full genome sequencing in the 1KFG project due to the lack of sequenced Emericellopsis species, its marine origin, its promising antibacterial activity against gram-negative bacteria in initial bioactivity testing and its status as a putative novel species (Jackson et al. 2016).

Isolation of C. marina TRa3180A (Class Leotiomycetes, Order Helotiales, Family Pezizellaceae) was described in Baral and Rämä (2015). Spores from apothecia growing on decaying Ascophyllum nodosum (Class Phaeophyceae, Order Fucales, Family Fucaceae) at the entrance to Portsmouth Harbor, Portsmouth, Hampshire, England, were inoculated and isolated on 0.2SeaMEA (4 g/L malt extract agar with sterile filtered seawater) with antibiotics. The fungus was deposited at the Norwegian marine biobank (Marbank) with the accession number M16FUN0001.

Isolation of A. encephaloides TRa018bII (Class Leotiomycetes, Order Helotiales, Family Helotiaceae) was described in Rämä et al. (2014). Spores from a cleistothecium on decaying Betula sp. (Class Magnoliopsida, Order Fagales, Family Betulaceae) at 70.22874993° N, 19.68153674° E, Troms, Norway, were isolated on 0.2SeaMEA. The fungus was deposited at the Norwegian marine biobank (Marbank) with the accession number M15FUN0043.

Morphological study

Emericellopsis sp. TS7 was incubated on oatmeal agar (OA), potato dextrose agar (PDA) and malt extract agar (MEA) (recipes in Crous et al. (2019)) for 21 days at 25 °C. The cultures were then examined using a dissecting microscope and a compound light microscope equipped with differential interference contrast. Morphological characteristics were described and compared to those of closely related species.

Growth characterization

The growth requirements of Emericellopsis sp. TS7 were characterized in triplicate by incubation on four different substrates at three different salinities (distilled water, 50% seawater and seawater). The substrates were 0.4% malt extract, 0.3% chitin flakes (Sigma), 0.3% fucoidan-rich extracts from Ascophyllum and Fucus (non-commercial, Algaia, France) and 0.3% aqueous extract from Stelletta cf. normani (M15034-0-W01, Marbank, Norway), all on 1.5% agar (Sigma). For the sponge extract, freeze-dried sponge material was macerated and extracted with distilled water for 3 h, the mixture was centrifuged and the aqueous phase was freeze-dried; the resulting sample was then separated into six fractions and the most polar fraction was used for the agar. In addition, each medium was incubated at four different temperatures, 2 °C, 10 °C, 15 °C and 25 °C, to determine the optimum growth temperature on the different media. The plates were incubated for a total of 43 days, and growth was recorded on days 3, 5, 10, 15, 21, 27, 31, 38 and 43. Distilled water agar (1.5% agar) was used as a control medium.
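The scale of this experiment can be sketched by enumerating the conditions. The example below assumes a full-factorial layout (every substrate at every salinity and temperature), which the text implies but does not state outright; the variable names are illustrative, and the distilled water agar controls are not counted.

```python
from itertools import product

# Factor levels as listed in the text. The full-factorial layout is an
# assumption made for this sketch, not something the methods state explicitly.
substrates = ["malt extract", "chitin", "fucoidan-rich extract", "sponge extract"]
salinities = ["distilled water", "50% seawater", "seawater"]
temperatures_c = [2, 10, 15, 25]
replicates = 3
reading_days = [3, 5, 10, 15, 21, 27, 31, 38, 43]

# One condition per substrate x salinity x temperature combination.
conditions = list(product(substrates, salinities, temperatures_c))
n_plates = len(conditions) * replicates
n_readings = n_plates * len(reading_days)

print(f"{len(conditions)} conditions, {n_plates} plates, {n_readings} growth readings")
```

Under these assumptions the design comprises 48 conditions and 144 plates, excluding controls.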

Cultivation for nucleic acid extraction

For DNA and RNA extractions, mycelium from liquid seed cultures of Emericellopsis sp. TS7, A. encephaloides and C. marina in 0.2ASME medium (4 g/L malt extract, 40 g/L artificial sea salts (Sigma), MilliQ-water – hereafter MilliQ) were inoculated in 250 mL of the same medium in 1000-mL baffled culture flasks. The media constituents were dissolved in MilliQ. All media were autoclaved at 121 °C for 30 min before inoculation. Incubations were performed at 10–16 °C at 140 rpm (shaking for liquid cultures only). After 13 days the culture was harvested by vacuum filtration through Miracloth (Merck) and the mycelium was subsequently placed in aluminum foil and stored at −80 °C until processing.

Isolation of nucleic acids

Genomic DNA from Emericellopsis sp. TS7, A. encephaloides and C. marina mycelium was isolated using the Quick-DNA Fungal/Bacterial Miniprep Kit (Zymo Research) according to the supplier’s instructions. The DNA quality was checked by three methods. First, DNA degradation was checked using gel electrophoresis on a 1% TBE (Life Technologies) UltraPure agarose (Life Technologies) gel stained with GelRed (BioTium) that was run at 180 V for 20 min after loading the samples using Agarose gel loading dye (Amresco). Samples were compared to the GeneRuler High Range DNA ladder (ThermoFisher). Second, NanoVue Plus (GE Healthcare) measurement of wavelength ratios was used to check for contamination and estimate concentration. Finally, Qubit (Invitrogen) measurement using the Qubit dsDNA BR Assay Kit (Invitrogen) was used for accurate concentration determination. The DNA sample was stored at −80 °C.

Total RNA from Emericellopsis sp. TS7, A. encephaloides and C. marina mycelium was isolated using the Quick-RNA Fungal/Bacterial Miniprep Kit (Zymo Research) according to the supplier’s protocol. All MilliQ used for RNA extraction was treated with diethyl pyrocarbonate (DEPC, Sigma). Quality control was performed using the same methods as for DNA, except that the RiboRuler High Range RNA ladder (ThermoFisher) was used for gel electrophoresis and the Qubit RNA BR Assay Kit (Invitrogen) for concentration determination.

DNA sequencing and assembly

The draft genomes of Emericellopsis sp. TS7, C. marina and A. encephaloides were sequenced at the DOE Joint Genome Institute (JGI) using Illumina technology. For genome sequencing, 100 ng of DNA was sheared to 300 bp using the Covaris LE220 and size selected using SPRI beads (Beckman Coulter). The fragments were treated with end-repair, A-tailing, and ligation of Illumina compatible adapters (IDT, Inc) using the KAPA-Illumina library creation kit (KAPA biosystems). Illumina Regular Fragment, 300 bp, standard shotgun library (STD) and long insert, 3000 bp, mate pair library (sLMP) were constructed and sequenced using Illumina NovaSeq. All raw Illumina sequence data were filtered for artifact/process contamination using the JGI QC pipeline (Supplementary data 1). An automated attempt was made to reassemble any potential organelle (mitochondrion) from the filtered reads and remove any organelle-matching reads with kmer matching against the resulting contigs with an in-house tool. An assembly of the target genome was generated using the resulting non-organelle reads with SPAdes v3.12.0 (Bankevich et al. 2012) using the following parameters [--phred-offset 33 --cov-cutoff auto -t 16 -m 115 -k 25,55,95 --careful]. Similar methodology, employing the UNITE rDNA database (Kõljalg et al. 2013), was used to reassemble the ribosomal DNA from the filtered reads.
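For readers who want to reproduce the assembly step, the quoted SPAdes parameters can be assembled into a command line as sketched below. The input and output names are hypothetical placeholders, only a single paired-end library is shown, and the mate-pair library input used by JGI is omitted; this is an illustration of the stated options, not the JGI pipeline itself.

```python
import shlex


def spades_command(reads_1, reads_2, out_dir):
    """Build a SPAdes argument list using the parameters quoted in the text.

    The read files and output directory are hypothetical placeholders.
    """
    return [
        "spades.py",
        "--phred-offset", "33",
        "--cov-cutoff", "auto",
        "-t", "16",          # threads
        "-m", "115",         # memory limit in GB
        "-k", "25,55,95",    # k-mer sizes
        "--careful",
        "-1", reads_1,
        "-2", reads_2,
        "-o", out_dir,
    ]


cmd = spades_command("filtered_R1.fastq.gz", "filtered_R2.fastq.gz", "spades_out")
print(shlex.join(cmd))
```

Building the command as a list keeps each option and its value separate, so it can be passed directly to `subprocess.run` without shell quoting issues.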

Completeness of the euchromatic portion of the genome assemblies was assessed by aligning assembled consensus RNA sequence data with bbtools v38.31 bbmap.sh [k=13 maxindel=100000 customtag ordered nodisk] and bbest.sh [fraction=85] (Bushnell 2014). This was a routine test by JGI to determine whether significant portions of the genomes were missing.

RNA library creation, read processing and de novo assembly

For transcriptomics, plate-based RNA sample prep was performed on the PerkinElmer Sciclone NGS robotic liquid handling system using Illumina’s TruSeq Stranded mRNA HT sample prep kit utilizing poly-A selection of mRNA following the protocol outlined by Illumina in their user guide (https://support.illumina.com/sequencing/sequencing_kits/truseq-stranded-mrna.html), with the following conditions: total RNA starting material was 1 μg per sample and 8 cycles of PCR were used for library amplification. The prepared libraries were then quantified using KAPA Biosystem’s next-generation sequencing library qPCR kit and run on a Roche LightCycler 480 real-time PCR instrument. The quantified libraries were then multiplexed with other libraries, and the pool of libraries was then prepared for sequencing on the Illumina NovaSeq 6000 sequencing platform using NovaSeq XP v1 reagent kits, S4 flow cell, following a 2 × 150 indexed run recipe.

Raw reads were filtered and trimmed using the JGI QC pipeline resulting in the filtered fastq file (*.filter-RNA.fastq.gz files). Using BBDuk (Bushnell 2014), raw reads were evaluated for artifact sequence by kmer matching (kmer = 25), allowing 1 mismatch and detected artifact was trimmed from the 3′ end of the reads. RNA spike-in reads, PhiX reads and reads containing any Ns were removed. Quality trimming was performed using the phred trimming method set at Q6. Finally, following trimming, reads under the length threshold were removed (minimum length 25 bases or 1/3 of the original read length - whichever is longer).
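The length rule in the final filtering step can be expressed as a small function. This is a minimal sketch of the rule as worded above, not the BBDuk implementation; the function names are illustrative.

```python
def min_length_threshold(original_length, floor=25):
    """Return the minimum retained read length: 25 bases or one third of
    the original read length, whichever is longer."""
    return max(floor, original_length / 3)


def passes_length_filter(trimmed_length, original_length):
    """A trimmed read is kept only if it is not under the length threshold."""
    return trimmed_length >= min_length_threshold(original_length)


# For the 2 x 150 run described in the text, the threshold is 50 bases.
print(min_length_threshold(150))
print(passes_length_filter(49, 150), passes_length_filter(50, 150))
```

For reads shorter than 75 bases, the fixed floor of 25 bases becomes the limiting criterion.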

Filtered fastq files were used as input for de novo assembly of RNA contigs. Reads were assembled into consensus sequences using Trinity (v2.3.2) (Grabherr et al. 2011). Trinity was run with the --normalize_reads (In-silico normalization routine) and --jaccard_clip (Minimizing fusion transcripts derived from gene dense genomes) options.

Genome annotation and functional annotation

The genome was processed through the JGI Fungal Annotation Pipeline according to the Fungal Genome Annotation Standard Operating Procedure available at https://mycocosm.jgi.doe.gov/programs/fungi/FungalGenomeAnnotationSOP.pdf (Grigoriev et al. 2014). Briefly, gene models were iteratively improved using several gene-predicting tools and by comparing them to the RNA transcriptome. Functional annotation was performed using SignalP (Petersen et al. 2011), TMHMM (Krogh et al. 2001), InterProScan (Hunter et al. 2009), SwissProt (Uniprot Consortium 2013) and KOG (Koonin et al. 2004). Finally, KEGG (Kanehisa et al. 2012) hits were used to assign EC numbers and map to metabolic pathways, while InterPro and SwissProt were used to map gene ontology (GO) terms. The Core Eukaryotic Genes Mapping Approach (CEGMA) was used to make a set of reliable genes and determine the completeness of the gene annotation (Parra et al. 2007; Parra et al. 2009).

In addition to the annotations done by JGI, a functional annotation of the Carbohydrate Active Enzymes was performed using the dbCAN2 meta server (Zhang et al. 2018). Annotations were assigned using HMMER (Eddy 2020), Hotpep (Busk et al. 2017) and DIAMOND (Buchfink et al.

Table 1 Overview of genome assembly and gene statistics for Emericellopsis sp. TS7, Calycina marina and Amylocarpus encephaloides

Gene features and functional annotation of Emericellopsis sp. TS7

The 9964 predicted gene models gave a gene density of 365 genes/Mbp. CEGMA estimated that 99.34% of the core genes were present, which indicates a nearly complete genome. There were 162 tRNAs and a single complete nrDNA region in the assembly. A total of 4331 (43%) genes were generically annotated with hypothetical (3252) or expressed (1079) proteins. The MAT-1-1 mating locus associated with sexual reproduction was also identified via BLAST in the assembly.

A total of 5201 (52%) genes were recognized as orthologous genes based on hits in the KOG database (Table 1). Of these, 1317 (25%) received general functional predictions or were conserved genes with unknown functions (Supplementary data 4). This indicates that 4763 of the 9964 (47.8%) predicted genes do not have characterized orthologs or are lineage-specific genes. A small portion of these genes may be non-functional pseudogenes or genes that have been incorrectly predicted by the annotation pipeline. The largest group of identified orthologs belonged to the posttranslational modification, protein turnover and chaperones category (483). Signal transduction (377), energy production and conversion (323), carbohydrate transport and metabolism (318) and translation, ribosomal structure and biogenesis (317) were the next four most highly represented categories. Secondary metabolite biosynthesis, transport and catabolism (268) made up 2.5% of the functionally annotated orthologs.
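The proportions reported for the KOG annotation follow directly from the gene counts given in the text, as the short check below illustrates. The variable names are illustrative; only the counts themselves come from the results.

```python
# Gene counts as reported in the text.
total_genes = 9964
kog_hits = 5201
kog_general_or_unknown = 1317

# Genes without a KOG hit are the complement of the KOG-annotated set.
without_kog = total_genes - kog_hits

pct_kog = 100 * kog_hits / total_genes
pct_without_kog = 100 * without_kog / total_genes
pct_general_of_kog = 100 * kog_general_or_unknown / kog_hits

print(f"KOG hits: {kog_hits} ({pct_kog:.1f}% of predicted genes)")
print(f"No KOG hit: {without_kog} ({pct_without_kog:.1f}% of predicted genes)")
print(f"General or unknown function: {pct_general_of_kog:.1f}% of KOG hits")
```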

Of the 9964 genes, only 1969 were classified based on the KEGG database (Table 1). The largest group of these were enzymes with known functions but undetermined pathways (688) (Supplementary data 4). This was followed by enzymes involved in amino acid metabolism (618), carbohydrate metabolism (433), metabolism of complex carbohydrates (314), and biodegradation of xenobiotics (298). Pathways associated with biosynthesis of secondary metabolites had 99 enzymes assigned to them.

Phylogenetic placement of Emericellopsis sp. TS7

Preliminary ITS analysis and morphological characterization indicated that Emericellopsis sp. TS7 was likely a novel species, and for this reason a thorough multigene phylogenetic analysis was performed. A concatenation of nuclear nrDNA 18S, ITS and 28S, and the protein coding genes rpb2, tef1 and tub2 was made and run through MrBayes using 12 partitions with different models as suggested by PartitionFinder, and through PhyML using the smart model selection (Supplementary data 2). The Acremonium/Emericellopsis species split into three clades: terrestrial soil, marine, and alkaline or “soda soil” (Fig. 1), as previously reported by Grum-Grzhimaylo et al. (2013). Emericellopsis sp. TS7 was grouped in the marine clade as an early branch, closest to E. pallida and E. phycophila with maximum support values. All three major ecological clades have support in both Bayesian and maximum-likelihood models, while individual taxa and branches in some cases have different branching in Bayesian and maximum-likelihood trees. The terrestrial clade has long branches and polytomy, but it is also the clade with the largest portion of missing data (70.1%, missing 18S, rpb2 and tef1) compared to the marine and alkaline clades (20.1% missing data). The alkaline clade contains E. cladophorae that was isolated from marine algae. Emericellopsis donezkii and E. enteromorphae were isolated from fresh water and marine algae, respectively. The three species, E. cladophorae, E. donezkii and E. enteromorphae, were all isolated from marine sources, but they do not group in the marine clade. However, all three lack sequence information for 18S, 28S, rpb2 and tef1.

Fig. 1

Phylogenetic tree from MrBayes of the genus Emericellopsis based on a six gene multilocus alignment of available ex-type and representative sequences. Branch support values are from Bayesian posterior probability (top) and Maximum-likelihood aBayes support test (bottom). Branch length represents substitutions per sequence site. The taxon in bold is the studied fungus. The bold letter T denotes sequences of ex-type cultures. Accession numbers for each isolate are in Supplementary data 2; the PhyML tree can be seen in Supplementary data 3

Growth characterization of Emericellopsis sp. TS7

In order to examine the growth characteristics of Emericellopsis sp. TS7, the isolate was grown on different substrates and at different salinities and temperatures (Fig. 2). The fastest growth occurred at 25 °C for all substrates and salinities. The preferred substrates were MEA and sponge extract, prepared with seawater. The slowest growth occurred on MEA prepared with distilled water. Generally, growth on media prepared with distilled water was slower than on salt-containing media. Growth at 2 °C occurred at all salinities with sponge extract. On the control medium, Emericellopsis sp. TS7 reached full growth within 21 days at 25 °C and within 38 days at 10 °C and 15 °C, and showed no growth at 2 °C. Growth on 0.4MEA medium without salt and on chitin medium with salt was slower than on the control medium.

Fig. 2

Growth characterization of Emericellopsis sp. TS7 using four different substrates and three different salinities incubated at four different temperatures. Maximum growth was 86 mm. Max growth of growth control on distilled water agar is shown in the first panel with encircled symbols. The control for 10 °C and 15 °C is identical

CAZymes and other industrially relevant genes

The number of CAZymes in Emericellopsis sp. TS7 was 396 (3.97% of total genes), of which 149 (38% of CAZymes) possessed a secretory signal peptide, indicating that they are likely to be secreted into the external environment or across other membranes. A comparison of Emericellopsis sp. TS7, A. encephaloides and C. marina with three other fungal genomes, namely Aspergillus niger, Sarocladium strictum and Acremonium chrysogenum (two terrestrial/pathogens and one from sewage water outlet to the sea), indicated that Emericellopsis sp. TS7 had the second highest number of CAZyme genes (Fig. 3). A relatively high proportion of CAZymes in Emericellopsis sp. TS7 and Acremonium chrysogenum had a secretory signal compared to the other species (38% vs 23–33%). Emericellopsis sp. TS7 had a higher number of polysaccharide lyase (PL), glycosyl transferase (GT) and GH domains than the other marine isolates. Amylocarpus encephaloides, on the other hand, contained the highest number of carbohydrate esterases (CEs), carbohydrate binding modules (CBMs) and auxiliary activity (AA) domains of the marine fungi. Calycina marina contained two PL8 domains (absent in the other studied fungi), which act on uronic acid, a common constituent of seaweeds (Ponce et al. 2003; Sánchez-Machado et al. 2004). CAZyme genes are often modular, with many genes containing one or more enzymatic domains along with CBMs that bind to substrates and have no catalytic function. Examples of this are the putatively secreted CAZyme gene 217297 in Emericellopsis sp. TS7 with a GH18 and CBM18 domain (putative chitinase) or 546426 (putative cellulase) with a CBM1, AA3_1 and AA8 domain (Fig. 4).

Fig. 3

Overview of the distribution of CAZymes in Emericellopsis sp. TS7, Amylocarpus encephaloides, Calycina marina and three other fungi. The lines indicate the number of genes and the number of genes with a putative secretion signal, and use the secondary Y-axis

Fig. 4

Examples of putatively secreted modular CAZymes from Emericellopsis sp. TS7, Amylocarpus encephaloides and Calycina marina. The illustration is not to scale. SP – Secretion signal peptide, GH – Glycoside hydrolase, CBM – Carbohydrate binding module, AA – Auxiliary activity, CE – Carbohydrate esterase. Number indicates enzyme class. Number in brackets is protein identifier

The different classes of CAZymes followed a similar putative secretion signal pattern in the fungi compared here (Supplementary data 5). Generally, few genes (4–6%) with predicted GT activity contained a putative signal peptide for secretion, as these are often involved in intracellular synthesis. Genes with PL activity contained a secretion signal in 80–88% of cases, with the exception of C. marina and S. strictum, which had a signal in only 50 and 60% of genes, respectively. Amylocarpus encephaloides had the highest ratio of CBM-containing genes with a secretion signal (66.7%), and C. marina had the lowest ratio of genes with a secretion signal for all classes except GHs. For example, Emericellopsis sp. TS7 genes with AA had a secretion signal in 42.3% of cases, CBM in 55.0%, CE in 66.7%, GH in 42.5%, GT in 6.7% and PL in 88.2%.

The domains that occurred in the highest numbers across the six genomes analyzed were associated with cellulose, hemicellulose, xylan, mannose, fucose, pectate, and chitin. In the secreted enzymes mainly cellulose-, chitin- and xylan-interacting domains were abundant. The unclassified domain GH0 was found in Emericellopsis sp. TS7 (1), C. marina (1) and A. encephaloides (2). In total, Emericellopsis sp. TS7 had 176 different classes of CAZymes (Supplementary data 5).

Emericellopsis sp. TS7 does not appear to possess genes encoding polyphenol oxidases or fucoidanase, but does have genes encoding fucosidases (GH29 and GH95), a fucose transporter and a few GTs with potential fucose activity (GT1 and GT31). Emericellopsis sp. TS7 also contains seven potential sulfatase genes based on the sulfatase catalytic site pattern (Barbeyron et al. 2016), but none of these domains are on CAZymes.

The gene for the industrially relevant enzyme phytase (Lei et al. 2013) was also found in Emericellopsis sp. TS7, C. marina and A. encephaloides, along with histidine acid phosphatases that share the same enzyme classification (EC 3.1.3.8) as phytase.

Biosynthetic gene clusters of Emericellopsis sp. TS7

A total of 35 biosynthetic gene clusters (BGCs) were predicted using antiSMASH, 27 of which are shown in Fig. 5. Eight are not included in the figure because they were solitary core genes not surrounded by other tailoring, transport or transcription genes, or because they were likely precursor genes in sterol synthesis, such as the squalene and lanosterol synthases. The clusters contained a range of oxidoreductases, transcription factors, tailoring genes and transporters together with core biosynthetic gene(s). These BGCs included eight NRPS clusters, six NRPS-like clusters, nine terpene clusters, six polyketide synthase (PKS) clusters, three mixed NRPS-PKS clusters, one hybrid NRPS-PKS cluster, one phosphonate cluster and one indole cluster.
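The cluster counts listed above can be cross-checked against the reported total with a simple tally. The dictionary keys below are illustrative labels; the counts are taken from the text.

```python
# BGC counts per class as reported for Emericellopsis sp. TS7.
bgc_counts = {
    "NRPS": 8,
    "NRPS-like": 6,
    "terpene": 9,
    "PKS": 6,
    "mixed NRPS-PKS": 3,
    "hybrid NRPS-PKS": 1,
    "phosphonate": 1,
    "indole": 1,
}

total_bgcs = sum(bgc_counts.values())
shown_in_figure = 27
omitted_from_figure = total_bgcs - shown_in_figure

print(f"Total BGCs: {total_bgcs}; omitted from Fig. 5: {omitted_from_figure}")
```

The class counts sum to the 35 predicted clusters, and the difference from the 27 displayed clusters matches the eight omitted ones.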

Fig. 5

Overview of BGC structure of the predicted clusters in Emericellopsis sp. TS7 colored after function. Clusters marked in red were on the end of scaffolds and may be incomplete. The leucinostatin-like cluster was split in two, but is presented as one cluster with a gap. Helvolic acid, produced by the cluster in bold, was detected in MS analyses of fermentation broths

Several of the clusters had homology to known clusters according to KnownClusterBlast; these were further investigated by a synteny analysis using clinker (Gilchrist and Chooi 2021). Only the BGCs for ascochlorin (Araki et al. 2019), leucinostatin A/B (Wang et al. 2016), botrydial (Pinedo et al. 2008), cephalosporin C (Terfehr et al. 2014) and helvolic acid (Mitsuguchi et al. 2009) showed a high degree of conserved genes in the Emericellopsis clusters (Supplementary data 6). Several of the NRPS genes without homologous hits had a configuration of 4–13 modules according to antiSMASH.

Emericellopsis sp. TS7 was cultivated in several different media and the fermentation broths were extracted. The resulting fractions from the extracts showed antibacterial activity against Enterococcus faecalis, Streptococcus agalactiae and Staphylococcus epidermidis. No toxicity was detected against A2058 human melanoma cancer cells. Methods and details of bioactivity experiments can be found in Supplementary data 7.

Genome description of Calycina marina

The genome assembly of C. marina was more fragmented than that of Emericellopsis sp. TS7. The assembly statistics reveal an L50 of 173 and an N50 of 50 kbp, and the final assembly consisted of 1318 scaffolds with a total length of 34.2 Mbp. The number of predicted genes in C. marina was 9558, slightly fewer than in Emericellopsis sp. TS7 even though C. marina has a larger genome. Calycina marina distinguished itself from the other genomes analyzed in having comparatively few CAZyme genes, 217 in total, and the lowest proportion of potentially secreted CAZyme genes at 51 (24% of CAZymes). The genome contained 21 potential BGCs distributed as nine NRPS/NRPS-like, five PKS (including two type 3), three terpene, one indole, one hybrid, one aromatic prenyltransferase and one ribosomally synthesized and post-translationally modified peptide (RiPP).
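For readers less familiar with the fragmentation statistics used here, N50 is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly, and L50 is the number of scaffolds in that set. A minimal sketch using toy scaffold lengths (not the actual assembly data) is given below.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Add scaffolds from longest to shortest until half the assembly is covered.
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("no scaffold lengths given")


# Toy example: five scaffolds with a total length of 30 units.
toy_scaffolds = [10, 8, 6, 4, 2]
n50, l50 = n50_l50(toy_scaffolds)
print(f"N50 = {n50}, L50 = {l50}")
```

A lower N50 and a higher L50 both indicate a more fragmented assembly.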

Genome description of Amylocarpus encephaloides

The genome assembly of A. encephaloides was also more fragmented than that of Emericellopsis sp. TS7, with an L50 value of 168 and an N50 of 74 kbp. The genome assembly consisted of 2381 scaffolds with a total length of 46.3 Mbp, which was larger than that for Emericellopsis sp. TS7 and C. marina. The total number of predicted genes was 11,869, which is the highest number among the three sequenced strains. Despite being fragmented, the genome was complete in terms of core gene presence with a CEGMA value of 99.56%. Amylocarpus encephaloides had 356 CAZyme genes, of which 115 are potentially secreted. The genome showed a higher portion of CAZyme genes with CBM1 (cellulose binding) modules and secretion of these (15 genes, 80% secreted). Amylocarpus encephaloides also had the largest portion of CBM-containing CAZymes with a secretion signal in total (66.7%). A total of 34 BGCs were detected in the genome, distributed as 14 PKS (one type 3), 10 NRPS/NRPS-like, five terpene, four hybrid clusters and one RiPP.
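The proportion of putatively secreted CAZymes in the three marine genomes can be recomputed from the counts reported in the text. The data structure below is illustrative; only the counts are taken from the results.

```python
# (total CAZyme genes, CAZyme genes with a predicted secretion signal)
cazymes = {
    "Emericellopsis sp. TS7": (396, 149),
    "Calycina marina": (217, 51),
    "Amylocarpus encephaloides": (356, 115),
}

# Percentage of CAZyme genes carrying a predicted secretion signal.
secreted_pct = {
    name: 100 * secreted / total for name, (total, secreted) in cazymes.items()
}

for name, pct in secreted_pct.items():
    print(f"{name}: {pct:.1f}% of CAZymes putatively secreted")
```

The resulting values are consistent with the 38% reported for Emericellopsis sp. TS7 and the 23–33% range reported for the other species.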

Phylogenetic placement of Amylocarpus encephaloides and Calycina marina within Helotiales

A 15-gene multilocus phylogenetic analysis was performed using a slightly modified dataset of Johnston et al. (2019). Calycina marina was placed together with the rest of Calycina within Pezizellaceae, where it formed a monophyletic clade (Fig. 6). Amylocarpus encephaloides was placed within Helotiaceae on a branch with “Hymenoscyphus” repandus. “Hymenoscyphus” repandus was not placed together with the rest of the Hymenoscyphus that formed a distinct monophyletic clade. Both of these clades were within Helotiales, sensu Johnston et al. (2019).

Fig. 6

Phylogeny of Helotiales based on a 15-gene dataset. The support values are from the ultrafast bootstrap in IQ-TREE. The bold letter T denotes ex-type sequences. Xylaria hypoxylon was used as an outgroup