Background

During plant growth and development, information must be transmitted between cells to sustain basic life activities, and protein secretion is one of the important routes of cell–cell communication. Proteins are commonly secreted through two pathways. The classical ER-Golgi pathway targets proteins that carry an N-terminal signal peptide and depends on the endoplasmic reticulum (ER) and Golgi apparatus. The non-classical pathway targets proteins that lack a signal peptide and is independent of the ER and Golgi apparatus [1]. Proteolytic enzymes, also known as proteases, generally belong to these secreted proteins in plants. They break substrates into small peptide fragments by hydrolyzing peptide bonds and are divided into cysteine proteases, serine proteases, aspartate proteases, and metalloproteases according to their catalytic sites [2]. Of the 500–800 proteases encoded by the Arabidopsis thaliana genome, approximately 140 are cysteine proteases (CPs), which comprise five protein families: papain proteinases (family C1), vacuolar processing enzymes (family VPE), caspases (family C14), calcium-dependent proteinases (family C2), and other CP families [3]. CPs show different expression patterns in diverse tissues and organs, and previous reports have shown that they play important roles in biological processes such as seed germination, root growth, leaf senescence, and programmed cell death (PCD) of tracheary elements in stems and of tapetum cells in anthers [4,5,6,7,8]. At these specific sites, CPs degrade proteins to supply nutrients and components for the development of new cells, ensuring that the biological processes involved in plant growth proceed smoothly.

Epidermal patterning factor/-like (EPF/EPFL) proteins are a group of cysteine-rich, plant-specific secreted peptides that normally have 6 or 8 conserved cysteine residues at the C-terminus of the peptide chain. These cysteine residues can form disulfide bonds that affect the folding of the peptide chain and protein activity [9]. Besides the N-terminal signal peptide, one α-helix, two antiparallel β-strands, and one irregular loop region connecting the two β-strands form the core skeleton of EPF/EPFL peptides. The two β-strands constitute the active region of the molecule, while the irregular loop region is responsible for molecular specificity [10]. EPF/EPFL genes are ubiquitous in terrestrial plants, such as Physcomitrella patens, Selaginella moellendorffii, Picea glauca, Sorghum bicolor, Populus trichocarpa, Medicago truncatula, and Carica papaya, whereas genome-wide family analyses have been reported only for A. thaliana, Oryza sativa, and Malus domestica [11,12,13]. During specific stages of plant growth and development, EPF1 and EPF2 were found to bind three LRR-RLKs (leucine-rich repeat receptor-like kinases), namely ERECTA (ER), ERECTA-LIKE1 (ERL1), and ERL2, and to interact with one LRR receptor-like protein (LRR-RLP), namely TOO MANY MOUTHS (TMM), which together transmit extracellular signals specific to stomatal development [14,15,16,17]. In addition, the homologous EPFL1 was reported to induce awn elongation in rice [18]. Meanwhile, ectopic expression of wheat EPFL1 in A. thaliana resulted in shorter filaments and peduncles, implying a link with flower development [19]. Gene-editing experiments in Kasalath (a rice cultivar), in which 11 OsEPF/EPFL genes were knocked out by CRISPR/Cas9 (Clustered regularly interspaced short palindromic repeats), indicated that OsEPFL2 participates in the regulation of awn development [

Mapping EPF/EPFL genes in cotton chromosomes

TBtools was used to extract the chromosomal positions of the EPF/EPFL genes from the genome sequences and annotation files of the four representative cotton species. MapChart (https://help.salesforce.com/s/articleView?id=sf.bi_chart_intro_map.htm&type=5) was then used to draw and visualize their physical locations on the cotton chromosomes.
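The position-extraction step can be illustrated with a short Python sketch. It is not the TBtools implementation; it only assumes a GFF3-style annotation in which gene features carry an `ID=` attribute. The function name, gene IDs, and coordinates are invented for illustration.

```python
import io


def gene_positions(gff_lines, wanted_ids):
    """Return {gene_id: (chromosome, start, end, strand)} for the requested genes.

    Assumes tab-separated GFF3 records whose ninth column holds
    semicolon-separated key=value attributes, including ID.
    """
    positions = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # keep gene-level features only
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        gene_id = attrs.get("ID")
        if gene_id in wanted_ids:
            positions[gene_id] = (cols[0], int(cols[3]), int(cols[4]), cols[6])
    return positions


# Hypothetical annotation records, not taken from any cotton genome release.
EXAMPLE_GFF = (
    "##gff-version 3\n"
    "A05\tdemo\tgene\t1200\t2450\t.\t+\t.\tID=GeneA;Name=demo1\n"
    "A05\tdemo\tmRNA\t1200\t2450\t.\t+\t.\tID=GeneA.1;Parent=GeneA\n"
    "D09\tdemo\tgene\t88000\t89100\t.\t-\t.\tID=GeneB\n"
    "D09\tdemo\tgene\t95000\t96000\t.\t+\t.\tID=GeneC\n"
)

EXAMPLE_POSITIONS = gene_positions(io.StringIO(EXAMPLE_GFF), {"GeneA", "GeneB"})
```

A table of this kind (chromosome, start, end) is the sort of input a chromosome-map drawing tool needs.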

Analyses of gene structure and conserved protein motif of cotton EPF/EPFL genes

First, the annotation files of the cotton EPF/EPFL genes were submitted to the Gene Structure Display Server (GSDS, http://gsds.cbi.pku.edu.cn) to analyze their exon–intron structures [46]. Subsequently, the online MEME SUITE (http://meme-suite.org/memei) was used to identify conserved motifs in the cotton EPF/EPFL proteins [47]. Finally, TBtools was used to merge the phylogenetic tree, gene structures, and conserved protein motifs of all cotton EPF/EPFL family members into a single figure.

Collinearity analysis of cotton EPF/EPFL genes

The gene sequences of the cotton EPF/EPFL family were subjected to collinearity analysis with MCScanX [48], and the results were visualized with TBtools. The collinearity analyses comprised intraspecific and interspecific BLAST comparisons, which were conducted on the diploid genomes A2 (G. arboreum) and D5 (G. raimondii) and the allotetraploid genomes AD1 (G. hirsutum) and AD2 (G. barbadense). Duplicated gene pairs were then identified from the intraspecific collinearity as Ga-Ga, Gr-Gr, Gh-Gh, and Gb-Gb, and from the interspecific collinearity as Ga-Gr, Ga-Gh, Ga-Gb, Gr-Gh, Gr-Gb, and Gh-Gb. The resulting duplication events were finally presented as collinearity relationships in intraspecific and interspecific circle plots.
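The four intraspecific and six interspecific comparisons listed above are all the unordered pairings of the four genomes. A minimal Python sketch of that enumeration follows; the function name is illustrative and not part of MCScanX or TBtools.

```python
from itertools import combinations_with_replacement

# Genome abbreviations as used in the text.
GENOMES = ["Ga", "Gr", "Gh", "Gb"]


def comparison_labels(genomes):
    """Split all unordered genome pairings into intra- and interspecific labels."""
    intra, inter = [], []
    for first, second in combinations_with_replacement(genomes, 2):
        target = intra if first == second else inter
        target.append(f"{first}-{second}")
    return intra, inter


INTRA, INTER = comparison_labels(GENOMES)
```

With four genomes this gives 4 self-comparisons and 6 cross-comparisons, i.e. 10 BLAST/collinearity runs in total.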

Analyses of cis-regulatory elements of cotton EPF/EPFL genes

The 2000-bp DNA sequences upstream of the initiation codon (ATG) of all 132 cotton EPF/EPFL genes were downloaded as their promoter regions from the CottonFGD database (http://cottonfgd.net/), and the online tool PlantCARE (http://bioinformatics.psb.ugent.be/webtools/plantcare/html/) [49] was used to predict their cis-regulatory elements. The predicted cis-regulatory elements were visualized with TBtools, in which colored rectangles represent the different cis-regulatory elements and the genes are grouped by clade according to their evolutionary relationships.
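In this study the promoter sequences were downloaded directly from CottonFGD. For readers who need to derive such regions themselves, the following Python sketch shows one way to cut the upstream window from a chromosome sequence. It assumes 1-based coordinates and that `atg_pos` is the genomic position of the A of the start codon (for minus-strand genes, the highest coordinate of the codon). The function names and toy sequences are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(chrom_seq, atg_pos, strand, length=2000):
    """Return up to `length` bp immediately upstream of the start codon.

    chrom_seq : chromosome sequence given on the plus strand
    atg_pos   : 1-based genomic coordinate of the A of ATG
    strand    : "+" or "-"
    The result is oriented 5'->3' relative to the gene and is shorter than
    `length` when the gene lies near a chromosome end.
    """
    if strand == "+":
        start = max(0, atg_pos - 1 - length)
        return chrom_seq[start:atg_pos - 1]
    if strand == "-":
        # Upstream of a minus-strand gene lies at higher genomic coordinates.
        return reverse_complement(chrom_seq[atg_pos:atg_pos + length])
    raise ValueError("strand must be '+' or '-'")


# Toy sequences: a plus-strand ATG at position 7 and a minus-strand ATG
# (CAT on the plus strand) whose A sits at genomic position 6.
PLUS_EXAMPLE = upstream_region("GGGCCCATGAAA", 7, "+", length=3)
MINUS_EXAMPLE = upstream_region("TTTCATAAACCC", 6, "-", length=3)
```

Reverse-complementing minus-strand windows keeps every promoter in the same 5'-to-3' orientation before motif prediction.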

Analyses of expression patterns and quantitative Real-time PCR verification

Transcriptome data of G. hirsutum TM-1 and G. barbadense Hai7124 from different tissues/organs (root, stem, leaf, petal, torus, sepal, epicalyx, anther, and pistil) were downloaded from the SRA (Sequence Read Archive) database of the NCBI website (http://www.ncbi.nlm.nih.gov/) under the accession number PRJNA490626. Transcriptome data of G. hirsutum TM-1 in response to multiple abiotic stresses (low temperature at 4 ℃, high temperature at 37 ℃, salt treatment with 0.4 M NaCl, and drought treatment with 200 g/L PEG6000) were also obtained from the SRA database under the accession number PRJNA248163 [32]. The published RNA-seq data were first filtered with Trimmomatic [50], and the resulting clean reads were then mapped to the reference genome indexes built with HISAT [51]. Cufflinks was used to calculate the expression levels of the cotton EPF/EPFL genes as FPKM (fragments per kilobase of transcript per million mapped fragments) values [52], which were used to show whether each GhEPF/EPFL and GbEPF/EPFL gene was expressed, and at a high or low level, in the different tissues/organs. The FPKM values of all GhEPF/EPFL genes under the different stresses were normalized with the Z-score algorithm to investigate their up-regulated and down-regulated expression patterns over the course of each stress treatment [53]. The heat maps of the expression levels and patterns of cotton EPF/EPFL genes were finally drawn with TBtools.
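The two quantities used here can be written out in a few lines of Python. This is a sketch of the textbook definitions only: Cufflinks estimates FPKM with its own statistical model, and the text does not state whether the Z-score used the population or the sample standard deviation (the population form is assumed below). All numbers are invented.

```python
from statistics import mean, pstdev


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def row_zscores(values):
    """Z-score one gene's expression values across samples.

    Positive scores mark samples above the gene's own mean (up-regulated),
    negative scores mark samples below it (down-regulated).
    """
    mu = mean(values)
    sd = pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        return [0.0 for _ in values]  # constant expression carries no pattern
    return [(v - mu) / sd for v in values]


# Hypothetical gene: 500 fragments on a 1000-bp transcript, 20 million mapped.
EXAMPLE_FPKM = fpkm(500, 1000, 20_000_000)

# Hypothetical FPKM values of one gene at three time points of a stress course.
EXAMPLE_Z = row_zscores([1.0, 2.0, 3.0])
```

Because each gene is scaled against its own mean, the heat map compares temporal patterns rather than absolute expression levels between genes.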

Transcriptome data of G. hirsutum TM-1 and G. barbadense Hai7124 from developing ovules (0, 1, 3, 5, 10, and 20 days after anthesis, DPA) and fibers (10, 20, and 25 DPA) were downloaded from the SRA database under the accession number PRJNA490626. These data were processed in the same order as described above: filtering of low-quality data, mapping to the reference genome, calculation of expression levels, and Z-score normalization. Positive and negative values represented up-regulated and down-regulated expression patterns, respectively, during the development of ovules and fibers, and the corresponding heat map was also drawn with TBtools. In addition, two cultivated species, namely G. hirsutum CCRI36 and G. barbadense Hai1, were chosen for the quantitative real-time PCR (qRT-PCR) experiment. CCRI36 has high yield and wide adaptability but ordinary fiber quality and low resistance to Verticillium wilt (VW), whereas Hai1 has superior fiber quality and high VW resistance but low yield and poor adaptability [54, 56].

Analyses of protein–protein interaction and functional enrichment of GhEPF/EPFL genes

Because no protein data of cotton species are recorded in the STRING database (http://string-db.org), homologous alignment was first performed between the 44 GhEPF/EPFLs and the 11 AtEPF/EPFLs, and the protein sequences of the latter were submitted to STRING as protein models to construct the protein–protein interaction network [57]. The 44 GhEPF/EPFL genes were also subjected to functional enrichment with the GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) databases. GO is generally divided into three categories, namely biological process, cellular component, and molecular function [58], and KEGG comprises cellular processes, environmental information processing, genetic information processing, human diseases, metabolism, and organismal systems [59]. The GO and KEGG enrichment analyses were performed on the online platform OmicShare (https://www.omicshare.com/) as follows: the GO and KEGG IDs of all cotton genes were first extracted from the released annotation of G. hirsutum TM-1 [32] and then submitted as background files to the OmicShare tools GO Enrichment Analysis Advanced and Pathway Enrichment Analysis Advanced, respectively, together with the 44 GhEPF/EPFL IDs as queries, generating the classification results.
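The statistical test behind the OmicShare tools is not described in this section. Enrichment of a query list against a background is commonly assessed with a one-sided hypergeometric test, and the sketch below illustrates that general idea under this assumption. The function name and the example counts are hypothetical, and no multiple-testing correction is shown.

```python
from math import comb


def hypergeom_enrichment_p(background_total, background_hits, query_total, query_hits):
    """One-sided hypergeometric p-value, P(X >= query_hits).

    background_total : annotated genes in the background file
    background_hits  : background genes carrying the term or pathway
    query_total      : genes in the query list (e.g. 44 family members)
    query_hits       : query genes carrying the term or pathway
    """
    upper = min(background_hits, query_total)
    favourable = sum(
        comb(background_hits, k)
        * comb(background_total - background_hits, query_total - k)
        for k in range(query_hits, upper + 1)
    )
    return favourable / comb(background_total, query_total)


# Invented counts: 10 background genes, 5 with the term; a 5-gene query in
# which every gene carries the term.
EXAMPLE_P = hypergeom_enrichment_p(10, 5, 5, 5)
```

A small p-value means the term is more frequent in the query list than expected from the background alone.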

Results

Identification and physicochemical characteristic analyses of cotton EPF/EPFL genes

Twenty-four GaEPF/EPFL, 20 GrEPF/EPFL, 44 GhEPF/EPFL, and 44 GbEPF/EPFL genes were identified from the genomes of G. arboreum (A2), G. raimondii (D5), G. hirsutum (AD1), and G. barbadense (AD2), respectively [33, 36, 37], giving a total of 132 members (Additional file 1: Table S2). Analysis of the physicochemical properties of the cotton EPF/EPFL family showed that the proteins ranged in length from 87 aa (GrEPF20 and GhEPF35) to 161 aa (GhEPF36 and GbEPF35) and in molecular weight from 9760.13 Da (GrEPF20) to 18,049.99 Da (GhEPF36 and GbEPF35). The isoelectric points of cotton EPF/EPFLs ranged from 5.53 (GhEPF15 and GbEPF17) to 9.91 (GhEPF44), with an average value of 8.84. The instability index of cotton EPF/EPFLs ranged from 30.62 (GhEPF21) to 84.76 (GhEPF5), including 12 possibly stable members (≤ 40) and 120 possibly unstable members (> 40). Subcellular localization prediction showed that most cotton EPF/EPFL proteins were predicted to localize to the chloroplast (59 members) and the extracellular space (51 members), and a minority to mitochondria (9 members), endoplasmic reticulum (4 members), nucleus (4 members), vacuole (4 members), and plasma membrane (1 member).

Phylogenetic analysis of EPF/EPFL proteins in Gossypium and Arabidopsis

The amino acid sequences of EPF/EPFLs were first aligned among the four cotton species, A. thaliana, O. sativa, and S. moellendorffii (Additional file 1: Figure S1), and the results showed six cysteine residues conserved in the C-terminal mature peptide region in both clades. Another two cysteine residues were conserved in the loop region of the EPF1-EPF2-EPFL7 clade, distinguishing it from the EPFL9/Stomagen clade. Subsequently, 11 AtEPF/EPFL proteins, 13 OsEPF/EPFL proteins, and 17 SmEPF/EPFL proteins were used together with the 132 cotton EPF/EPFL proteins to construct an evolutionary tree by the neighbor-joining method, which resolved four clades (Fig. 1). The largest number of cotton EPF/EPFL proteins (71, 53.8% of the 132) clustered with AtEPFL1-3 of the model plant, so this clade was named EPFL1-3. The largest group of A. thaliana EPF/EPFL proteins, namely AtEPFL4, AtEPFL5, AtEPFL6, and AtEPFL8, and the second largest number of cotton EPF/EPFL proteins (24, 18.2%) fell into the same clade, named EPFL4-6-EPFL8. Eighteen cotton EPF/EPFL proteins (13.6%) and 3 AtEPF/EPFL proteins, namely AtEPF1, AtEPF2, and AtEPFL7, were classified into the EPF1-EPF2-EPFL7 clade, and 19 cotton EPF/EPFL proteins (14.4%) and AtEPFL9/Stomagen were assigned to the EPFL9/Stomagen clade.
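The clade proportions quoted above follow directly from the member counts. A minimal Python check, using only the counts stated in the text (the function name is illustrative):

```python
# Cotton EPF/EPFL members per clade, as reported in the text.
CLADE_COUNTS = {
    "EPFL1-3": 71,
    "EPFL4-6-EPFL8": 24,
    "EPF1-EPF2-EPFL7": 18,
    "EPFL9/Stomagen": 19,
}


def clade_shares(counts):
    """Return each clade's percentage of all members, rounded to one decimal."""
    total = sum(counts.values())
    return {clade: round(100 * n / total, 1) for clade, n in counts.items()}


SHARES = clade_shares(CLADE_COUNTS)
```

The four counts sum to the 132 identified members, so every cotton EPF/EPFL protein is assigned to exactly one clade.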

Fig. 1

Phylogenetic analysis of EPF/EPFL proteins from Gossypium, Arabidopsis thaliana, Oryza sativa, and Selaginella moellendorffii. The blue check mark represents the AtEPF/EPFL proteins, and the purple square represents the OsEPF/EPFL proteins. The red check mark represents the SmEPF/EPFL proteins, and the green triangle represents the GaEPF/EPFL proteins. The red square represents the GrEPF/EPFL proteins, and the yellow star and blue circle represent the GhEPF/EPFL and GbEPF/EPFL proteins, respectively

Chromosomal location and gene duplication of cotton EPF/EPFLs

As shown in Fig. 2, the 24 GaEPF/EPFL genes were unevenly distributed on 10 chromosomes and 1 scaffold of the A2 genome, with none on chromosomes 4, 8, and 13; the most members were found on chromosome 5 (5 GaEPF/EPFL genes) and chromosome 11 (4 GaEPF/EPFL genes). The 20 GrEPF/EPFL genes showed a similar distribution in the D5 genome, where chromosome 12 also lacked GrEPF/EPFL genes in addition to chromosomes 4, 8, and 13, and chromosome 9 harbored the most members (4 GrEPF/EPFL genes). As for the two AADD genomes, we observed similar EPF/EPFL members and chromosome distributions between G. hirsutum and G. barbadense. The main difference was that the AD1 genome had two more GhEPF/EPFL genes, located on chromosomes D05 and D07, while the AD2 genome had two more GbEPF/EPFL genes, located on chromosomes A10 and D06. Comparing the allotetraploid and diploid genomes, there were 21 GhEPF/EPFL genes and 22 GbEPF/EPFL genes in the A subgenomes, fewer than the 24 GaEPF/EPFL genes. In contrast, 24 GhEPF/EPFL genes and 22 GbEPF/EPFL genes were found in the D subgenomes, more than the 20 GrEPF/EPFL genes, although both allotetraploids had the same total of 44 EPF/EPFL genes. In addition, the chromosomes harboring EPF/EPFL genes in A2 corresponded one-to-one with those in AD1 or AD2, and one GhEPF/EPFL gene was located on chromosome D12 of the AD1 genome but had no counterpart in the D5 and AD2 genomes. These similarities and differences indicate that the EPF/EPFL gene family has shown both conservation and variability during long-term evolution.

Fig. 2

Chromosomal location of cotton EPF/EPFL genes

Gene structure and conserved motif prediction of cotton EPF/EPFLs

The evolutionary relationships among the cotton EPF/EPFLs were also investigated based on their protein sequences, and the resulting four clades (Fig. 3a) were consistent with the tree that additionally included the Arabidopsis EPF/EPFL proteins (Fig. 1). Prediction of conserved motifs in the cotton EPF/EPFL proteins (Fig. 3b) identified a total of 8 conserved motifs, named Motif 1 to Motif 8 in turn (Additional file 2: Figure S1). The number of conserved motifs per cotton EPF/EPFL protein ranged from 3 to 5, and most proteins had 4. Only Motif 1 was present in all EPF/EPFL proteins, so it was deemed the most conserved motif; it had the second lowest E-value. Motif 2, with the lowest E-value and the longest width, was common among the EPF/EPFL proteins of the EPFL1-3, EPF1-EPF2-EPFL7, and EPFL4-6-EPFL8 clades, and Motif 3, with the third lowest E-value and the shortest width, was found among the EPF/EPFL proteins of the EPFL1-3, EPFL4-6-EPFL8, and EPFL9/Stomagen clades. In addition, Motif 5 appeared only in the EPFL9/Stomagen clade, where it was common, implying its potential significance.

Fig. 3

Gene structure and conserved motif identification of cotton EPF/EPFLs. a represents the evolutionary relationships of cotton EPF/EPFL genes, and b and c represent the conserved motifs and gene structures of cotton EPF/EPFL genes, respectively

The gene structure analysis showed that the exon number of cotton EPF/EPFL genes ranged from 1 to 4 (Fig. 3c). The most common gene structure contained 3 exons and 2 introns (60/132), followed by structures with 2 exons and 1 intron, with 4 exons and 3 introns, and with 1 exon and no intron.

Collinearity analysis of EPF/EPFL genes in cotton

Gene duplication is regarded as the main force expanding gene families and generally consists of tandem duplication, segmental duplication, and whole-genome duplication [

Fig. 7

The expression patterns and qRT-PCR verification of cotton EPF/EPFL genes during fiber development. A presents the expression patterns of GhEPF/EPFL and GbEPF/EPFL genes in the developing ovule (0, 1, 3, 5, 10, and 20 DPA) and fiber (10, 20, and 25 DPA), and B presents the qRT-PCR verification of 15 highly expressed GhEPF/EPFL genes in CCRI36 (high yield and wide adaptability) and Hai1 (superior fiber quality and high VW resistance) at 10, 20, and 25 DPA

During the initial period (0 to 3 DPA) of ovule development, 8 of the 44 GhEPF/EPFL genes, namely GhEPF2, GhEPF4, GhEPF14, GhEPF15, GhEPF23, GhEPF24, GhEPF26, and GhEPF36, showed their highest up-regulated fold changes at 0 DPA, whereas GhEPF25, GhEPF35, GhEPF37, and GhEPF39 did so at 1 DPA, and GhEPF9, GhEPF18, and GhEPF28 at 3 DPA. During the elongation period (5 to 20 DPA) of ovule development, the highest up-regulated fold changes were observed for GhEPF6, GhEPF36, GhEPF40, and GhEPF43 at 5 DPA, for GhEPF3, GhEPF12, GhEPF13, GhEPF18, GhEPF19, GhEPF24, GhEPF27, and GhEPF42 at 10 DPA, and for GhEPF7, GhEPF8, GhEPF22, GhEPF30, and GhEPF31 at 20 DPA. During the elongation and secondary wall thickening periods (10 to 25 DPA) of fiber development, GhEPF20 and GhEPF43 at 10 DPA, GhEPF10, GhEPF16, GhEPF17, GhEPF29, GhEPF39, and GhEPF41 at 20 DPA, and GhEPF1, GhEPF33, and GhEPF41 at 25 DPA showed the highest up-regulated fold changes. Similarly, during ovule and fiber development of Hai7124, the highest up-regulated fold changes were identified for GbEPF1, GbEPF17, GbEPF30, and GbEPF42 in the ovule at 0 DPA, GbEPF10 and GbEPF21 in the ovule at 1 DPA, GbEPF6-8, GbEPF14, and GbEPF43 in the ovule at 5 DPA, GbEPF12, GbEPF13, GbEPF19, GbEPF21, GbEPF29, GbEPF33, GbEPF34, and GbEPF41 in the ovule at 10 DPA, GbEPF16 and GbEPF24 in the ovule at 20 DPA, GbEPF18 and GbEPF40 in the fiber at 10 DPA, GbEPF2, GbEPF15, and GbEPF38 in the fiber at 20 DPA, and GbEPF3, GbEPF4, GbEPF9, GbEPF25, GbEPF26, GbEPF28, GbEPF32, and GbEPF39 in the fiber at 25 DPA. These data indicate that cotton EPF/EPFL genes might play important roles in the development and growth of ovules and fibers.

Relative quantification of the 13 highly expressed GhEPF genes during the key periods of cotton fiber development (10, 20, and 25 DPA) showed that most of them changed dramatically in expression level either between the two varieties or among the three developmental periods, while only small changes were observed for GhEPF20, GhEPF26, and GhEPF43 between CCRI36 and Hai1 at 20 and 25 DPA. During the fiber-elongation period (10 DPA), GhEPF10, GhEPF26, and GhEPF43 showed higher expression levels in Hai1 than in CCRI36. During the secondary-wall thickening periods, GhEPF13, GhEPF28, GhEPF29, GhEPF39, GhEPF40, and GhEPF43 were more highly expressed in Hai1 than in CCRI36 at 20 DPA, and GhEPF13, GhEPF14, GhEPF23, GhEPF25, GhEPF26, GhEPF28, and GhEPF29 showed higher expression levels in Hai1 than in CCRI36 at 25 DPA. To sum up, although no GhEPF gene was consistently more highly expressed in Hai1 throughout the key periods of fiber elongation and secondary-wall thickening, GhEPF26 and GhEPF43 were more highly expressed in sea island cotton than in upland cotton at 10 and 25 DPA and at 10 and 20 DPA, respectively, while GhEPF13, GhEPF28, and GhEPF29 showed higher expression levels in sea island cotton than in upland cotton at 20 and 25 DPA. These highly expressed GhEPF genes might be candidate genes explaining why sea island cotton has superior fiber quality to upland cotton, although further genetic transformation experiments are required to verify their potential functions in fiber development.