Background

Regulation of gene expression is central to all organisms [1] and is imperative for determining the morphology, functional competence, and development of a multicellular organism [2]. This regulation is tightly coordinated by a number of mechanisms, such as DNA methylation [3]; chromatin organization [4]; dimerization; and sequence-specific DNA binding, which is executed primarily by transcription factors (TFs). Depending upon the combinatorial control of protein-protein interactions, a TF may simultaneously function as an activator of one set of genes and a repressor of others [5]. For example, TFs have been known to determine the identity of floral organs in plants [6]. These TFs, referred to as organ identity genes, control the transcriptional regulation of target genes, thereby triggering organ formation in sexual plant reproduction. Via their various actions, these modular proteins play a pivotal role in controlling the spatial and temporal expression patterns of genes in all living organisms.

Usually, TFs are composed of a DNA-binding domain (DBD) that interacts with the cis-regulatory elements of their target genes [7] and a protein-protein interaction domain that facilitates oligomerization between TFs and other regulators [8]. The majority of TFs may be grouped into a number of different families according to their structural features, i.e., the type of DBD that is present within their sequence [5]. Usually, each TF has only one type of DBD, occurring in either single or multiple copies.

Eukaryotes have a more sophisticated transcription regulation mechanism than prokaryotes. Multicellular eukaryotes must address cell differentiation and consequently employ a more intricate regulatory mechanism, which uses a large number of TFs [9–12]. Reports have also shown that TF families are strongly conserved across eukaryotic organisms, especially plants [13]. Approximately 45% of Arabidopsis TFs belong to families that are specific to plants [1]. As in animals, TF families have been considerably expanded in plant lineages, suggesting that they are involved in the regulation of clade-specific functions [1, 8, 14, 15]. Thus, plants have more TF genes than animals [13, 16]. A significant number of protein-encoding genes are dedicated to regulating the transcription machinery and gene expression [1]. In plants, ~7% of all genes encode TFs. For example, the genome of Arabidopsis thaliana includes 27,416 protein-coding genes (TAIR http://www.arabidopsis.org/), of which 6% (more than 1,700) encode TFs.

The completion of various genome sequencing projects has provided a unique opportunity for comparative studies of transcriptional regulatory networks. Distribution and sequence analyses suggested that TF genes in plants evolved via genome duplication [17], exon capture, translocation, and mutation. The retention of duplicated TF genes led to gene family expansions, which further complicated the genomes of higher plants [18]. TF families that have significantly expanded in the past 600–100 million years are mainly the MADS box proteins, basic-region leucine-zipper proteins (bZIP), and the MYB and bHLH families [8, 19, 20].

Plants and animals are known to have originated from a common ancestor. Structural conservation of TF DBDs among plants and animals suggests that these domains may have originated before these two eukaryotic kingdoms diverged. Little structural conservation has also been reported among different eukaryotic TFs. This suggests that eukaryotes use only a limited number of DBDs to achieve various regulatory purposes, in combination with other functional activation domains. Thus, TFs may be viewed as molecular switches that link signal transduction pathways to gene expression [7]. The function of a few TF families has remained conserved between plants and animals separated by over a billion years of evolution [1]; one example is the E2F family, which controls basic cell cycle functions [1, 21]. On the other hand, many TF families may exhibit altered or diverse functions due to minor sequence changes in different plant and animal lineages [11]. Thus, these evolutionary changes in sequences and TF functions may complicate the detection of paralogous/orthologous relationships between organisms.

Liverworts are among the earliest diverging plant lineages, thus constituting a sister group to all other land plants [22–27]. The bryophyte fossil record shows that liverworts are at least 475 million years old [28]. Marchantia polymorpha is a common liverwort with a wide distribution around the world and is one of the most intensively studied bryophytes. Because they belong to the clade of the most basal plant lineages, liverworts occupy a very important position with respect to understanding early land plant evolution [29]. No evolutionary study can be complete without data from Marchantia. Unfortunately, only minimal genomic information has been available for this bryophyte until now. Although some expressed sequence tags (ESTs) have been produced and some male and female gene-based markers have been developed, full-fledged functional genomics studies in liverworts have not been initiated. Since M. polymorpha is a dioecious plant, ESTs have been generated [30, 31] in an attempt to identify key genes involved in sex differentiation mechanisms and the development of male and female plants, but these ESTs are limited in coverage. Here, we present the entire repertoire of regulatory factors in this liverwort for the first time and predict a set of TF-encoding transcripts in M. polymorpha on the basis of stringent sequence similarity with known TF genes. Sequence comparisons alone would not have provided the appropriate information regarding the alterations of TF function during evolution; hence, we also examined the expression profiles of the TF-encoding transcripts in M. polymorpha. In this study, we also focused on the evolution of TF gene families based on a comprehensive comparison of TF gene distribution in liverworts, mosses, higher plants, and their algal ancestors.

Results and discussion

Identification of TF-encoding transcripts through transcriptome sequencing and de novo assembly

The transcriptome of M. polymorpha was sequenced from RNA isolated from six different male and female tissues, as described in the Methods section and shown in Additional file 1, using short reads on an Illumina HiSeq™ 2000 platform (Sharma et al., unpublished observations). The tissue samples chosen for RNA isolation and sequencing represented the most comprehensive repository of vegetative and reproductive stages of both male and female gametophytic tissues. The Marchantia transcriptome dataset generated in this study is a new source for the identification of novel regulatory transcripts and has provided a glimpse of their expression profiles in vegetative and reproductive tissues.

Approximately 80 million paired-end sequence reads, each 90 bp in length, were generated from RNA sequencing (Sharma et al., unpublished observations). Low-quality reads were filtered out before assembly. De novo transcriptome assembly was performed with Velvet [32] and Oases [33] using the same parameters used by Garg and colleagues for their transcriptome assembly [34]. De novo assembly of the Marchantia transcriptome resulted in a total of 46,533 non-redundant (NR) transcripts from 46,070 predicted loci. The sequence dataset generated is deposited at NCBI in the Short Read Archive (SRA) database under accession number SRP029610.

The total genome size of M. polymorpha was estimated to be 280 Mb based on flow cytometry, and the total number of genes was estimated to be ~20,000 [35]. In this study, 46,533 transcripts from 46,070 loci, potentially representing an estimated number of genes, were predicted from the transcriptome data of M. polymorpha. This number likely includes the alternatively spliced variants and non-coding transcripts. In fact, only 20,000 out of 46,533 transcripts generated BLASTX hits, with an E-value cut-off of 1e-05, against the protein sequences of embryophytes that were extracted from the NR NCBI database (Sharma et al., unpublished observations). Hence, we assume that most of the Marchantia genes, including TF genes, were detected by our RNA-Seq data. Our results indicate that the obtained transcript dataset may be fragmentary. Thus, the number of transcripts/genes encoding for TFs is likely to be fewer than what is presented in the data below. Further, the genome sequence information for Marchantia may provide more information about the fragmentation of transcripts in this liverwort.

The assembled NR transcripts of Marchantia were compared with known TF gene sequences of other sequenced plants listed in PlnTFDB [36] using BLASTX. In total, 3,471 putative Marchantia TF-encoding transcripts, distributed in at least 80 families, were identified, representing 7.4% of the total Marchantia transcripts detected in our study. Major TF gene families are depicted in Figure 1. The organization of TF families in Marchantia resembled that of Physcomitrella patens [36–39].

Figure 1

Distribution of Marchantia transcripts in different transcription factor families. A bar graph indicating the number of TF-encoding genes in Marchantia polymorpha and Physcomitrella patens distributed in various TF families. Families for which fewer than 12 genes/transcripts were found are grouped in the "others" category.

Hence, the description of TF-encoding transcripts from Marchantia provided insight into the organization and biological functions of TFs in lower plants as well as their evolution. From a biotechnological standpoint, TF identification is useful for studying the transcriptional regulatory switches involved in plant development and reproduction and in generating responses and sequential adaptations to the changing environment.

Comparison of TF-encoding genes in plants and their algal ancestors

In the present study, we first summarized the knowledge of TF-encoding genes in plants and algae, while updating the classification of Marchantia TF-encoding transcripts and their categorization in all 80 different TF families. PlnTFDB [40] includes 85 families of TFs and TRs from 20 sequenced plant species other than liverworts, ranging from unicellular red and green algae to highly complex angiosperms, thereby including >1.6 billion years of gene regulatory network evolution and encompassing 26,184 distinct proteins. The numbers of TF-encoding genes in red algae, green algae, Selaginella, Physcomitrella, Chlamydomonas, and other higher plants are listed in Table 1.

Table 1 Number of genes/transcripts encoding TFs for various organisms

Data presented in Table 1 show that the number of genes encoding TFs is the smallest for algae; the number increases from liverworts to mosses, and increases further in monocots and dicots. More complex organisms execute complex mechanisms to control gene expression by employing a greater number of TFs [2, 9–12, 15, 41]. In eukaryotes, an appreciable number of protein-coding genes encode TFs. The number of TF-encoding genes ranges from 2% to 9% of the total protein-coding genes of the 20 organisms considered. As expected, based on published reports, the smallest number of TF genes was found in the most primeval organisms, e.g., Chlamydomonas and Physcomitrella, where TF genes were found to be 2%–4% of the total genes annotated. In higher plants, the greater complexity of form and function presumably mandates an increased number of TF genes (e.g., monocot and dicot plants have 5–9% TF genes) [13]. This was clearly demonstrated in some earlier reports, which are summarized in Table 2. The number of total predicted protein-coding genes and the number of predicted TF genes identified are also indicated.

Table 2 TF gene percentages for various algae, liverworts, mosses and higher plants

Existing knowledge of plant TF genes was acquired from various studies conducted on an exemplar genetic model in plant biology—Arabidopsis thaliana. Despite Arabidopsis being an important and very useful plant model for studying various developmental processes and regulatory mechanisms common to all higher plants [13], it lacks certain traits that are concomitant with the evolutionary movement of plants from aquatic conditions to land, such as the loss of genes associated with an aquatic environment and acquisition of genes for tolerating terrestrial stresses. These traits are of immense value to lower plants, and this may support the concept of evolution of plants from their algal ancestors. Hence, it was of great interest to perform a more comprehensive comparative analysis of TF genes between alga, moss, spike moss, liverwort, and higher plants. We considered the identity of organisms when evaluating gene family sizes, as various organisms are reported to have different rates of gene duplication and retention, and differences in gene content may reflect species-specific adaptations [39].

Figure 2 shows 85 TF families, color-coded according to the lineage of land plants in which they were commonly found. A strikingly important observation made from the analyzed comparative dataset is that, out of the 85 gene families taken into consideration, 24 appear to originate as liverworts evolved (marked as orange blocks). These families are present in all land plants, including liverworts, but are absent in red algae (rhodophytes) and green algae (chlorophytes): Alfin-like, ARF, AUX/IAA, BBR/BPC, BES1, CAMTA, DBP, EIL, FAR1, GeBP, GRAS, GRF, HRT, LFY, LOB, LUG, NAC, NOZZLE, OFP, SRS, TCP, Tify, Trihelix and zf-HD. An initial report stated that these 21 TF families arose within the earliest land plants or in their aquatic ancestor [2]. However, taking Marchantia into consideration, given that it is the earliest diverging lineage, transcriptome sequencing provided us with new findings. The numbers of TF-encoding genes in all studied organisms are given in Additional file 2. Some TFs, which originated together with the evolution of liverworts, contribute to the stress tolerance capacity of plants: for example, CAMTA [54] and Alfin-like [55] regulate salt tolerance; ARF [56] and AUX/IAA [57] play roles in auxin regulation; EIL [58] is known for ethylene signaling in higher plants; and GRF [59], LFY [60], LOB [61], LUG [62], NAC [63], NOZZLE [64], OFP [65], and Tify [66] regulate meristem elongation, flowering initiation, and flowering organ development [6]. Trihelix TFs are known to be involved in diverse functions in seed plants, such as abiotic stress tolerance [103]. In Marchantia, WRKY proteins are transcriptional regulators that are proposed to play a role in proper cellular responses to internal and external stimuli. 
Other transcripts showing preferential expression pattern for reproductive stages code for AP2-EREBP – a regulator of floral organ identity [104], HB which is involved in cell differentiation and controls cell-growth [105], LOB which functions in plant development in lateral organs like the leaf or flower [61], MYB which controls cellular proliferation and the commitment to development [99], PHD which controls chromatin or transcription [106], SET which is involved in histone methylation [107], and TIG which is involved in DNA binding [75]. Thus, these TFs are proposed to play similar roles in Marchantia.

In plants, the manifestation of fundamental biological processes and proper development requires some genes to be expressed constitutively, while others are expressed in a specific spatio-temporal pattern (organ-limited, stimulus-responsive, development-dependent, and cell-cycle specific manners). Both patterns of expression rely on the interaction of TFs with cis-acting elements or with other TFs for the regulation of cell activities. Hence, any change in the expression profile of TF genes in tissues normally leads to dramatic changes in plant development, and structural changes to these genes may signify an important evolutionary force [95]. As a practical approach, studying the expression pattern of these TF-encoding transcripts in liverworts provides us with strong evolutionary support for models and emphasizes the importance of this model plant system.

Putative functions of TF-encoding transcripts

A total of 3,471 TF-encoding transcripts were subjected to a BLASTX search against the non-redundant (NR) database of the NCBI (National Center for Biotechnology Information). The BLASTX search used an E-value cut-off of 1e-05. Of the 3,471 transcripts, 3,395 (97.8%) returned hits, supporting that they are protein-coding. Of these 3,395 transcripts, 94.8% had hits to plant sequences. A list of BLASTX hits is provided in Additional file 8.

qPCR validation

qPCR analysis was used to compare the expression of selected variably expressing transcripts across a spectrum of tissues, including vegetative, immature, and mature reproductive stages. Transcripts displaying consistent expression across the spectrum of cells were taken as reference genes. Homologues of actin (MpACT1) and CDPK (MpCDPK) exhibited variable expression in six considered stages when checked by qPCR, as shown in Additional file 9. Hence, CDPK and actin were not taken as reference genes. Instead, based on the RPKM values, a transcript having consistent expression was selected as the reference gene and was cross-checked by qPCR as well (Additional file 9). qPCR results confirmed the in-silico calculations for the RPKM values of the dataset for most of the transcripts, as shown in Figure 4. The de novo assembled Marchantia TF expression data presented here will also be beneficial for performing other functional genomics and comparative genomic studies.

Figure 4

Real-time RT-PCR expression profiles of selected transcripts coding for transcription factors. VM – Vegetative thallus Male, VF – Vegetative thallus Female, IMM – Immature reproductive Male, IMF – Immature reproductive Female, MM – Mature reproductive Male and MF – Mature reproductive Female tissues. All reproductive stage tissues referred to antheridiophores and archegoniophores as described in materials and methods. Y-axis on the left side of graphs shows scale for qPCR values and on the right side shows scale for RPKM values.

Our in silico inspection of the expression patterns of these TF-encoding genes in different vegetative and reproductive tissues suggested tissue-specific and/or stress-responsive attributes in accordance with their expression patterns. The tissue-specific expression profile of a gene could also be used to discuss the combinatorial usage of TFs for dictating the transcriptional program of different tissues. Members of different TF gene families appear to differ in their time and level of expression as they responded to multiple environmental signals and different developmental signs. Consequently, specific lower-plant traits may derive from some unique TF gene expression patterns. Additionally, it is possible that the same TF gene family members variably express in different plants [95]. Hence, the differential expression of similar TF genes upon exposure to contrasting environmental stimuli could be due to cis-acting elements. Clearly, the regulation of TF gene expression and function involves a vital network of interrelated processes.

Statistical analysis

Analysis of variance showed highly significant differences among ranks (p < 0.0001) in terms of the number of genes coding for TFs, as depicted in Additional file 10. The number of TF-encoding genes appears to increase significantly with organism rank, and thus with the complexity of the organisms involved. The comparisons of ranks using Gabriel’s comparison limits revealed three major groups. The two most primitive organisms (ranks 1 and 2) had similarly low numbers of TF-encoding genes. Organisms classified as ranks 5 and 6 (most developed) exhibited similarly high numbers of TF-encoding genes. Organisms in ranks 3 and 4 showed intermediate numbers of genes and were placed between these two extremes, as shown in Figure 5. The variance analysis showed that nearly 59% of the total variation in the number of genes coding for TFs was between organisms. Differences between ranks contributed 39% of the variation, and only 2% of the variation existed between organisms grouped within a given rank.

Figure 5

Statistical analysis. Ranks 1 and 2 represent red and green algae, respectively; ranks 3 and 4 represent Marchantia and the group comprising moss, spike moss and Physcomitrella, respectively; and ranks 5 and 6 represent monocots and dicots, respectively.

Conclusions

Liverworts as the sister of all land plants represent the basal lineage of land plants, providing a unique perspective on the regulatory origin of TFs and the genetic complexity of terrestrial plants. Marchantia, among the liverworts, is particularly easy to grow, transformable, and may prove to be a crucial model for future study of the origin of regulatory genetic systems. The availability of the complete genomic sequences of an increasing diversity of important plant species has provided us with a unique opportunity for comparative studies on the expansion and contraction of TF families. The expansion of regulatory protein numbers and interactions, as well as changes to their spatial and temporal expression, constitute part of the evolutionary process that has led to increasingly complex organisms.

The comparison of Marchantia TF genes to other sequenced plant genomes reveals the emergence of new TF families within Marchantia that have been preferentially retained and have particularly diversified in higher plants. Among these, such TF families as GRAS, LFY, LUG, NOZZLE, Tify and Trihelix play important roles in sexual plant reproduction. Liverworts therefore appear as a critical lineage with respect to terrestrial trait development through the origin and diversification of TF genes regulating specialized functions in reproduction. The evolution of these TF families in Marchantia may allow the activation of gene expression during male/female reproductive organ formation and differentiation. However, two TF families present in lower plants and green and red algae did stop with Marchantia and were not inherited in higher plants.

This study identifies TF genes and provides a detailed analysis of TF gene expression as a means of understanding the impact of TF diversification on the evolution of liverworts and their importance in the origin of modern land plants from bryophytes to flowering plants. Thus, we have demonstrated the utility of short read sequence data to characterize TF-encoding transcripts using Marchantia as a basal lineage in the context of genetic change in a broad comparison of terrestrial plants with their charophytic and algal ancestors. Further chromosomal sequence analysis and reorganization studies are expected to increase our knowledge of organism diversification. In addition, the identification of cis- and trans-acting elements associated with plant TFs is expected to reveal additional mechanisms that regulate gene expression in a more tightly regulated genetic context. Future studies are expected to build on the current liverwort TF gene transcriptome through construction of a broader interactome (protein-protein interaction) and to elucidate the regulons controlling each TF. The establishment of such a TF interactome within a fairly short time span is a feasible and important goal. Such an interactome will encompass TF-TF interactions directly as well as TF-DNA interactions and will highlight the underlying complexity of gene regulation in liverworts.

Methods

Plant material and growth conditions

Male and female M. polymorpha plants were collected from local wild colonies growing in nurseries in Melbourne, Australia. Male and female lines for RT-PCR and real-time PCR experiments were established from a single gemma of the thallus. Plants were maintained and propagated in a growth cabinet at 20°C under continuous white light (60 μmol photon m⁻² s⁻¹) and far-red (FR) light (730 nm). Tissues were collected for RNA sequencing from male and female vegetative thallus (VM and VF), immature male and female reproductive structures (antheridial and archegonial discs) - 2 mm in height (IMM and IMF) and mature male and female reproductive structures (antheridial and archegonial discs) > 2 mm in height (MM and MF), as shown in Additional file 1.

RNA sequencing and assembly

Total RNA was extracted from the male and female vegetative thalli and immature and mature reproductive gametophytic tissues of M. polymorpha (obtained from nurseries across Melbourne) using an RNeasy extraction kit (Qiagen, Australia), according to the manufacturer’s recommendations. RNA samples were quantified using a Nanodrop ND-1000 spectrophotometer (Biolabgroup, Australia). RNA sequencing was performed by the Beijing Genome Institute (BGI), China. In total, six cDNA paired-end libraries were generated using the mRNA-Seq assay for transcriptome sequencing on the Illumina HiSeq™ 2000 platform.

Briefly, beads with Oligo(dT) were used to isolate poly(A) mRNA from the total RNA preparations. The mRNA was fragmented into short fragments, and with these fragments as templates, random hexamer primers were used to synthesize the first-strand cDNA. The second-strand cDNA was synthesized using dNTPs, RNaseH and DNA polymerase I. Short fragments were purified and resolved for end repair and poly(A) addition. Short fragments were then connected with sequencing adapters, and suitable fragments were selected using agarose gel electrophoresis as templates for PCR amplification. Finally, the library was sequenced using the Illumina HiSeq™ 2000.

Raw sequence reads were filtered to remove low-quality reads and trimmed of 3’ adaptor sequences. All short read assemblies were performed using publicly available programs: Velvet (version 1.1.05; http://www.ebi.ac.uk/~zerbino/velvet/), developed for de novo short read assembly using de Bruijn graphs [32], and Oases (version 0.1.22; http://www.ebi.ac.uk/~zerbino/oases/), a de novo transcriptome assembler for very short reads [33]. After Velvet assembly, the resulting contigs were clustered into small groups (loci) using Oases to produce transcript isoforms. Parameters of these programs (e.g., K-mer length = 49) were optimized, using metrics such as N50 length, to obtain the best assembly results with our dataset.
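The N50 length mentioned above is a standard summary of assembly contiguity: the length of the shortest transcript in the smallest set of longest transcripts that together cover at least half of the total assembled bases. The following Python function is a minimal sketch of that definition only; the function name and the toy lengths are illustrative assumptions and are not taken from the Marchantia assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig/transcript lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of all assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths supplied")


# Toy lengths (hypothetical, for illustration only).
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_lengths)
```

Comparing this value across assemblies built with different k-mer lengths is one common way to choose among them, although it says nothing by itself about assembly correctness.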

Similarity search and identification of TF-encoding transcripts

For the identification of TF-encoding transcripts in M. polymorpha, all of the assembled transcripts were subjected to a homology search (BLASTX) against known transcription factors (TFs) and other transcriptional regulators (TRs), as classified in the Plant Transcription Factor Database (PlnTFDB; version 3.0; http://plntfdb.bio.uni-potsdam.de/v3.0/) [40, 108], with an E-value cut-off of 1e-05 and otherwise default parameters. PlnTFDB is an integrative database that provides complete sets of TFs and TRs for the plant species listed in it, all of which have completely sequenced and annotated genomes.

Protein sequences for all of the genes from the 20 species listed in PlnTFDB were downloaded from http://plntfdb.bio.uni-potsdam.de/v3.0/downloads.php; the file contained 29,473 sequences. This file acted as the database for the local BLASTX search, and the query file contained all the assembled Marchantia transcript sequences. The BLASTX results were inspected for their top hits using an in-house Python script, and thus putative transcripts of M. polymorpha that coded for TFs were identified.
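The in-house script itself is not described further. As a rough illustration of the top-hit step, the sketch below assumes that the BLASTX results are available in the common 12-column tab-separated layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score) and keeps, for each query, the hit with the highest bit score among those passing the E-value cut-off. The function name, the ranking by bit score, and the example rows are assumptions made for illustration, not the authors' implementation.

```python
import csv


def top_hits(tabular_lines, max_evalue=1e-5):
    """Map each query ID to its best hit as (subject_id, evalue, bitscore).

    Assumes 12-column tab-separated BLAST output; hits with an E-value
    above max_evalue are ignored, and the best hit is the one with the
    highest bit score.
    """
    best = {}
    for row in csv.reader(tabular_lines, delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip blank, comment, or malformed lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue
        current = best.get(query)
        if current is None or bitscore > current[2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical example rows; the IDs and scores are invented.
EXAMPLE_ROWS = [
    "Locus_1\tTF_A\t48.0\t110\t57\t0\t1\t330\t5\t114\t2e-20\t95.0",
    "Locus_1\tTF_B\t61.0\t120\t47\t0\t1\t360\t3\t122\t1e-35\t140.0",
    "Locus_2\tTF_C\t30.0\t60\t42\t0\t1\t180\t8\t67\t0.5\t28.0",
]
hits = top_hits(EXAMPLE_ROWS)
```

In this toy input, Locus_1 is assigned to TF_B (the higher-scoring hit) and Locus_2 is dropped because its only hit fails the E-value cut-off.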

Comparison of TF-encoding genes in plants and their algal ancestors

To better understand the evolution of TFs, comparative studies of TF gene families were carried out across 21 algal and plant species: the 20 species listed in the Plant Transcription Factor Database (PlnTFDB) plus Marchantia, represented by its transcripts. We investigated TF gene evolution based on the phylogenetic positions of the plants listed in PlnTFDB and by comparing the number of genes coding for each TF family across the plant and algal species considered. The comparative analysis highlighted similarities and differences in TF gene populations among these organisms. The percentages of identified TF genes relative to the total number of protein-encoding genes in the genome were also analyzed for all species. We took into account the emergence, halt, expansion and contraction of particular TF gene families by considering the number of genes/transcripts that encoded a specific TF in the various species.
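The comparison described here reduces to two simple operations on a table of per-family gene counts: computing the percentage of TF genes in a genome, and flagging families that are present in one group of species but absent from another. The sketch below illustrates both under stated assumptions: the function names, the placeholder family and species labels, and all counts are hypothetical and do not reproduce Additional file 2.

```python
def tf_percentage(n_tf_genes, n_protein_coding_genes):
    """Percentage of protein-coding genes that encode TFs."""
    return 100.0 * n_tf_genes / n_protein_coding_genes


def families_gained(counts, ingroup, outgroup):
    """Families with at least one gene in every ingroup species
    and no genes in any outgroup species.

    counts maps family name -> {species name -> gene count}.
    """
    gained = []
    for family, per_species in counts.items():
        in_all = all(per_species.get(sp, 0) > 0 for sp in ingroup)
        in_none = all(per_species.get(sp, 0) == 0 for sp in outgroup)
        if in_all and in_none:
            gained.append(family)
    return sorted(gained)


# Invented counts for placeholder families and species.
TOY_COUNTS = {
    "family_A": {"red_alga": 0, "green_alga": 0, "liverwort": 3, "moss": 5, "dicot": 20},
    "family_B": {"red_alga": 1, "green_alga": 2, "liverwort": 4, "moss": 6, "dicot": 30},
    "family_C": {"red_alga": 0, "green_alga": 0, "liverwort": 0, "moss": 2, "dicot": 9},
}
gained = families_gained(
    TOY_COUNTS,
    ingroup=["liverwort", "moss", "dicot"],
    outgroup=["red_alga", "green_alga"],
)
```

With these toy counts only family_A qualifies: family_B is already present in algae and family_C is missing from the liverwort. Note that for transcriptome-derived counts, absence of a family may reflect lack of expression in the sampled tissues rather than absence from the genome.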

Expression patterns of TF-encoding transcripts of M. polymorpha

We mapped all of the reads from the six libraries onto the non-redundant set of assembled transcripts using Bowtie [109], allowing up to 3 mismatches per read, to quantify transcript abundance. Transcript expression in each tissue was calculated using the RPKM (reads per kilobase per million reads) method [110], giving an RPKM expression value for each transcript in all six tissues. TF-encoding transcripts were quantified by the formula:

\[ \mathrm{RPKM}(A) = \frac{10^{6}\, C}{N L / 10^{3}} \]

where RPKM(A) is the expression of transcript A, C is the number of reads that uniquely aligned to transcript A, N is the total number of reads that are uniquely aligned to all transcripts and L is the number of bases on transcript A. The RPKM method eliminated the influence of different gene lengths and sequencing levels on the calculation of gene expression. Therefore, the calculated gene expression could be directly used to compare the difference in gene expression between samples.
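As a worked illustration of the definition above, the following Python sketch evaluates RPKM from C, N and L as just defined. The function name and the example numbers are illustrative assumptions, not values from the dataset.

```python
def rpkm(unique_reads, total_mapped_reads, transcript_length):
    """RPKM for one transcript.

    unique_reads       -- C, reads uniquely aligned to the transcript
    total_mapped_reads -- N, reads uniquely aligned to all transcripts
    transcript_length  -- L, number of bases in the transcript
    """
    if total_mapped_reads <= 0 or transcript_length <= 0:
        raise ValueError("N and L must be positive")
    return (1e6 * unique_reads) / (total_mapped_reads * transcript_length / 1e3)


# Hypothetical example: 100 reads on a 1,000-base transcript
# in a library of 1,000,000 uniquely mapped reads.
example_rpkm = rpkm(100, 1_000_000, 1000)
```

Doubling the transcript length or the library size while holding C fixed halves the value, which is the normalization that makes values comparable between transcripts and between libraries.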

RT-PCR analysis

For the detection of transcripts that were expressed at specific stages, as revealed by the assembly and RPKM methods, RT-PCR was carried out. Reverse transcriptase (Superscript™ One step RT-PCR with Platinum® Taq, Invitrogen, Australia) reactions were performed using 20 ng of total RNA, according to the manufacturer’s instructions. The cDNA equivalent of 20 ng total RNA was amplified in 10 μl reactions for 45 min at 50°C. The reaction conditions were as follows: pre-denaturation for 2 min at 94°C, followed by 35 cycles of 94°C for 15 s and annealing/extension at 58°C for 30 s, then 72°C for 1 min, followed by a final extension of 1 cycle at 72°C for 5 min. PCR products were run on a 1% (w/v) agarose gel to confirm the size of the amplification products and to verify the presence of a unique PCR product. Total RNA used in the RT-PCR and real-time PCR experiments was extracted from clean cultures of Marchantia. These RNA preparations were entirely independent from the ones used in RNA sequencing. Two technical replicates were performed for each of the nine transcripts. Primers suitable for amplification of each transcript were designed using an online tool from Invitrogen, OligoPerfect™ Designer (http://tools.invitrogen.com/content.cfm?pageid=9716). A list of primers used is given in Additional file 11.

Real-time RT-PCR analysis

Real-time PCR for selected TF-encoding transcripts was performed in duplicate using Brilliant III Ultra-fast SYBR QPCR Master mix (Agilent Technologies, Mulgrave, Victoria, Australia) according to the manufacturer’s instructions, using a 3-step PCR cycle. Quantitative expression differences between samples were estimated using cDNA from male and female vegetative, immature and mature reproductive stages, obtained using the Invitrogen Superscript™ III First strand cDNA synthesis kit according to the manufacturer’s instructions. After purification and measurement, ~50 ng of cDNA from each of the 6 developmental stages was used as template for real-time PCR analysis using Brilliant III Ultra-fast SYBR QPCR Master mix. PCR amplifications were performed on the MX3000P real-time PCR instrument (Agilent Technologies, Mulgrave, Victoria, Australia). The data generated were analysed using MxPro software. All experiments were performed with two technical replicates. The RNA preparations were pooled mixtures from several rounds of isolation for each sample and were entirely independent from the ones used in RNA sequencing; hence, the preparations themselves contained multiple biological replicates. The quantity of cDNA in nanograms was calculated by the software for each sample and plotted for the reference transcripts, i.e., the actin and CDPK genes and the transcript that had uniform RPKM values in all six stages (Additional file 9). The starting concentration of each transcript in a sample was expressed relative to the starting concentration of the reference transcript. For each examined transcript, the ΔCt value between each tested sample and the reference gene was calculated and plotted. A list of primers used is given in Additional file 11.
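The ΔCt calculation can be illustrated with a short Python sketch. It takes ΔCt as the Ct of the target transcript minus the Ct of the reference transcript in the same sample. The conversion to a relative quantity as efficiency^(-ΔCt), with an efficiency of 2 (perfect doubling per cycle), is a common convention that is assumed here for illustration; the text does not state that this conversion was applied, and the function names and Ct values are hypothetical.

```python
def delta_ct(ct_target, ct_reference):
    """Delta Ct: target Ct minus reference Ct for the same sample."""
    return ct_target - ct_reference


def relative_expression(ct_target, ct_reference, efficiency=2.0):
    """Target abundance relative to the reference transcript.

    Assumes both amplicons amplify with the same per-cycle efficiency
    (2.0 means perfect doubling); this is an illustrative assumption.
    """
    return efficiency ** (-delta_ct(ct_target, ct_reference))


# Hypothetical Ct values for one tissue sample.
example_delta = delta_ct(25.0, 22.0)
example_ratio = relative_expression(25.0, 22.0)
```

In this toy case the target crosses threshold three cycles after the reference, which under the doubling assumption corresponds to one eighth of the reference abundance.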

Statistical analysis

The data given in Additional file 2 were divided into 6 ranks according to the group of organisms analysed and entered into Statistical Analysis Software (SAS) version 9.2. To test whether the number of genes encoding TFs differs significantly among organisms (as grouped in ranks), all data were subjected to analysis of variance using PROC GLM of SAS. The sub-rank nested within rank (i.e., rank (sub-rank)), which refers to individual organisms within a rank, was used as the error term for the significance test of ranks. Data were log-transformed prior to analysis to meet the assumptions of homogeneous and normally distributed residuals. Pair-wise comparisons between ranks were undertaken using Gabriel’s comparison intervals (95% confidence intervals). Further analysis was done using PROC NESTED (SAS) to determine the variance partitioning pattern among the different sources of variation (i.e., rank, sub-rank, genes). Tukey’s Studentized Range (HSD) Test also grouped the 6 ranks into groups A, B, C and D according to the similarity in the number of TF-encoding genes among organisms.
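The core of this analysis, an F-test for differences among ranks on log-transformed counts, can be sketched in plain Python. The version below is deliberately simplified and is not a reimplementation of the SAS procedures: it treats one value per organism as the unit of observation and performs a one-way analysis of variance across ranks, omitting the nested variance partitioning and the Gabriel and Tukey comparisons. The function name and all counts are invented for illustration.

```python
import math


def one_way_anova_f(groups):
    """One-way ANOVA F statistic.

    groups is a list of lists of observations, one list per rank.
    Returns (F, df_between, df_within).
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(x for g in groups for x in g) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Invented TF gene totals per organism, grouped by rank (three ranks shown).
toy_counts = [[150, 180, 170], [900, 1100, 1000], [2000, 2400, 2200]]
# Log-transform before the test, as described in the text.
toy_logged = [[math.log(x) for x in rank] for rank in toy_counts]
toy_f, toy_df_between, toy_df_within = one_way_anova_f(toy_logged)
```

The resulting F value would be compared against the F distribution with the returned degrees of freedom to obtain a p-value; that step is left out here to keep the sketch free of external dependencies.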

Availability of supporting data

The sequence datasets supporting the results of this article are available at NCBI in the Short Read Archive (SRA) database under accession number SRP029610.