A multi-omics dataset of human transcriptome and proteome stable reference

Lu, Shaohua; Lu, Hong; Zheng, Tingkai; Yuan, Huiming; Du, Hongli; Gao, Youhe; Liu, Yongtao; Pan, Xuanzhen; Zhang, Wenlu; Fu, Shuying; Sun, Zhenghua; **, **gjie; He, Qing-Yu; Chen, Yang; Zhang, Gong

doi:10.1038/s41597-023-02359-w

Data Descriptor
Open access
Published: 13 July 2023

Volume 10, article number 455, (2023)

Scientific Data

Shaohua Lu^1,2^na1,
Hong Lu¹^na1,
Tingkai Zheng¹^na1,
Huiming Yuan³,
Hongli Du ORCID: orcid.org/0000-0002-6419-0160⁴,
Youhe Gao⁵,
Yongtao Liu⁵,
Xuanzhen Pan⁵,
Wenlu Zhang⁴,
Shuying Fu⁴,
Zhenghua Sun¹,
**gjie **¹,
Qing-Yu He¹,
Yang Chen ORCID: orcid.org/0000-0002-1863-5544¹ &
Gong Zhang ORCID: orcid.org/0000-0003-0418-3433¹

Abstract

The development of high-throughput omics technology has greatly promoted the development of biomedicine. However, the poor reproducibility of omics techniques limits their application. Standard reference materials of complex RNAs or proteins are needed to test and calibrate the accuracy and reproducibility of omics workflows. The transcriptome and proteome of most cell lines shift during culturing, which limits their applicability as standard samples. In this study, we demonstrated that the human hepatocellular carcinoma cell line MHCC97H has a very stable transcriptome (r = 0.983~0.997) and proteome (r = 0.966~0.988 for data-dependent acquisition, r = 0.970~0.994 for data-independent acquisition) after 9 subculturing generations, which allows this stable standard sample to be produced consistently on an industrial scale over the long term. Moreover, this stability was maintained across labs and platforms. In sum, our study provides an omics standard reference material and reference datasets for transcriptomic and proteomic research. This helps to further standardize the workflow and data quality of omics techniques and thus promotes the application of omics technology in precision medicine.

Background & Summary

The booming applications of omics technologies provide unprecedented insights into biology and medicine. However, the reproducibility of omics technology has long been questioned. A genomic sequencing study showed zero sensitivity in finding pathogenic mutations using whole exome sequencing of 57 patients¹. The mutations detected in 40 circulating tumor DNA (ctDNA) samples, sequenced by two independent companies, showed only 12% congruence². In the field of RNA-seq, a wide variety of methodologies limits reproducibility³. In the field of proteomics, a study by the Human Proteome Organization (HUPO) tested a sample of 20 highly purified recombinant human proteins, each containing one or more unique tryptic peptides of 1250 Da. The samples were distributed to 27 laboratories for identification. Only 7 of the 27 laboratories reported all 20 proteins, and only 1 laboratory reported all of the 1250 Da tryptic peptides⁴. Another study found that, even under optimal conditions and a uniform standard operating procedure, the median protein repeatability for the same mixture sample in different laboratories was only 75%⁵. In addition, the repeatability of multiple quantitative tests on the same sample, both within the same laboratory and between different laboratories, is less than 80%⁶. Other studies show that the lack of reliability and repeatability in omics research is one of the biggest obstacles to narrowing the gap between personalized medicine research and practice^7,8.

Reference materials are needed to examine the accuracy and reproducibility of omics workflows. For genome (DNA) sequencing, reference genome samples, e.g., ΦX174 viral DNA and NA12878 human cell line genomic DNA, are widely used as standards. Mixed genomic standards are used for variant calling benchmarks⁹. DNA standards are easy to produce because of the high fidelity of DNA replication in normal genomes. However, the transcriptome (RNA) and proteome are qualitatively and quantitatively highly variable because of intrinsic factors (e.g., senescence, cell cycle, contact inhibition) and extrinsic factors (e.g., temperature, osmolarity, buffer content, oxidation), which makes it challenging to produce reference standard transcriptome and proteome samples. Since 2004, the Universal Human Reference RNA¹⁰ has been commercialized as a “standard” RNA sample for RNA-seq benchmarking^11,12,13,14,15. However, the RNA content of the Universal Human Reference RNA sample may shift during long-term production. Universal Human Reference RNA is a pooled RNA of 10 cancerous cell lines, including HeLa, which is well known for its instability^16,17,18,19,20,21. Indeed, in the lot-to-lot comparison of the Universal Human Reference RNA, Pearson’s correlation coefficient reached only 0.9736¹⁰, which demonstrated such instability even over a short production period. With such variation, which is expected to multiply over a long production period, the sample is not sufficient for evaluating the reproducibility of advancing next-generation sequencing techniques with increasing depth and resolution. The proteome is more variable than the transcriptome because of massive translational regulation²². Therefore, a stable proteome reference is more difficult to produce. Since 2003, the Proteomics Standards Initiative has standardized the data formats of mass spectrometry (MS)-based proteomics but has not planned to provide a human proteome standard sample^23,24. As of 2021, proteome standard material had not been considered in the quality standards of research facilities²⁵. Currently, a mixture of 18~48 recombinant proteins is used as a “test standard” or spike-in for proteomics^5,26,27,28,29. However, such a small number of proteins can hardly serve as a reference standard for complex proteome samples.

In this study, we collected cells from 8~12 generations of 5 cell lines: A549, HCCLM3, HCCLM6, HeLa, and MHCC97H. Their RNA was extracted to generate a transcriptome sequencing dataset (Fig. 1). Quantitative analysis showed that the MHCC97H transcriptome is highly stable. Subsequently, we generated the MHCC97H translatome sequencing dataset and protein mass spectrometry dataset. In comprehensive tests across multiple laboratories and multiple platforms, MHCC97H showed high stability in both transcriptome and proteome. In conclusion, we demonstrate that the MHCC97H cell line has a stable transcriptome and proteome that can be used as an omics standard to evaluate and calibrate omics workflows. We also provide transcriptome datasets of multiple cell lines, as well as MHCC97H translatome and mass spectrometry datasets, which offer a reliable reference for the quality control of omics data.

Methods

Cell culture and materials

HeLa and A549 cell lines were purchased from the American Type Culture Collection (ATCC, Rockville, MD, USA) and authenticated by short tandem repeat profiling. The human hepatoma MHCC97H, HCCLM3, and HCCLM6 cells were provided by Professor Yinkun Liu, Fudan University¹². The MHCC97H, HCCLM3, and HCCLM6 cell lines are derived from the parent MHCC97 cell line, and their metastatic potential increases successively³⁰. The MHCC97H cell line was isolated from the parent MHCC97 cell line and has high metastatic potential³¹. HCCLM3 was derived from MHCC97H and is characterized by high lung metastasis³². HCCLM6 was also derived from the parent MHCC97H cell line and is characterized by high lung metastasis and lymphatic metastasis³³. MHCC97H, HCCLM3, and HCCLM6 are typical hepatocellular carcinoma cell lines, and their characteristics, phenotypes, and representative disease characteristics have been reported in detail³⁰. All cells tested free of mycoplasma during maintenance and at the time of the experiments. The cells were cultured in DMEM medium with 10% FBS and 1% penicillin/streptomycin. Culture conditions for all cell lines were 37 °C, 5% CO₂.

Detection of mycoplasma contamination

We used the Mycoplasma Detection Kit (ExCellBio, MB000-1591, China) to detect mycoplasma contamination of the cells. The detailed steps were as follows: 1~1.5 mL of cell culture supernatant was transferred into a centrifuge tube and centrifuged at 13000 rpm for 5 min. The supernatant was then discarded and the precipitate was washed once with PBS. 100 µL of lysate was added to lyse the cells, and the tube was inverted to mix well and left at room temperature for 5 min. The cell lysate was then incubated at 95 °C for 5 min and centrifuged at 13000 rpm for 5 min, and the supernatant was transferred to a new centrifuge tube. The PCR reaction was carried out with 1~2 µL of supernatant as a template, and the amplified products were analyzed by electrophoresis on a 2% agarose gel.

RNA extraction

Each generation of cells was cultured to 80~90% confluency and washed twice with RNase-free PBS (LEAGENE, IH0142, China), and total RNA was then isolated using TRIzol RNA extraction reagent (Invitrogen, 15596026, USA). The detailed steps were as follows. Cells were collected in 1.5 mL EP tubes and centrifuged at 230 × g for 3 min; the supernatant was discarded and the cells were washed with PBS. In a fume hood, 1 mL of TRIzol was added, and the sample was mixed well and left at room temperature for 5 min. Then 200 µL of chloroform was added, and the mixture was vortexed vigorously for 15 s and left at room temperature for 3 min. After centrifugation at 12000 × g at 4 °C for 15 min, the sample separated into three layers (RNA in the upper aqueous phase), and the upper aqueous phase was transferred to a new EP tube. Then 800 µL of isopropyl alcohol was added, and the tube was inverted to mix and kept at -20 °C overnight. The next day, the sample was centrifuged at 12000 × g at 4 °C for 30 min and the supernatant was removed. The RNA pellet was washed with 1 mL of 75% ethanol (precooled at -20 °C) and centrifuged at 7500 × g at 4 °C for 5 min, the supernatant was removed, and this wash was repeated once. The pellet was redissolved in 20 µL of RNase-free water, and after the RNA concentration was measured, the RNA was run on a 1% agarose gel.

Ribosome-nascent chain complex-RNA (RNC-RNA) extraction

The RNC extraction was performed as we previously reported³⁴. In brief, cells from each generation were cultured to 80~90% confluence, pretreated with 100 mg/mL cycloheximide for 15 min, and washed with precooled PBS, and then 2 mL of cell lysis buffer (1% Triton X-100, 20 mM HEPES-KOH (pH 7.4), 15 mM MgCl₂, 200 mM KCl, 100 mg/mL cycloheximide and 2 mM dithiothreitol (DTT) (Solarbio, D8220, China)) was added. After 30 min in an ice bath, cell lysates were scraped and transferred to 1.5 mL RNase-free tubes. Cell debris was removed by centrifugation at 16200 × g for 10 min at 4 °C. Supernatants were layered onto the surface of 20 mL of sucrose buffer (30% sucrose, 20 mM HEPES-KOH (pH 7.4), 15 mM MgCl₂, 200 mM KCl, 100 mg/mL cycloheximide and 2 mM DTT). The RNC was pelleted by ultracentrifugation at 42500 rpm for 5 h at 4 °C. Subsequently, RNC-RNA was extracted from the RNC pellet using TRIzol RNA extraction reagent following the manufacturer’s instructions.

mRNA-seq and RNC-seq

For all mRNA-seq and RNC-seq, DNase I (Thermo Fisher Scientific, EN0525, USA) treatment was performed prior to RNA library construction, according to the manufacturer’s instructions, to remove DNA contamination. For mRNA-seq, we used two methods to construct the sequencing library: the Oligo-dT method (polyA+ mRNA strategy) and the rRNA depletion method (Ribominus strategy).

We used Library Preparation VAHTS mRNA Capture Beads (Vazyme, N401-02, China) to purify polyA+ mRNA from total RNA. Then, depending on the experimental design, we used the MGIEasy RNA Library Prep Kit (MGI, A0210, China) for the MGI platform or the VAHTS Universal V6 RNA-seq Library Prep Kit (Vazyme, NR604-01, China) for the Illumina platform to construct the sequencing libraries for the Oligo-dT method, according to each manufacturer’s instructions. Before the rRNA depletion sequencing libraries were constructed, rRNA was removed from total RNA by probe hybridization followed by RNase H degradation, as we previously reported^35,36. The rRNA depletion sequencing libraries were also constructed using the MGIEasy RNA Library Prep Kit according to the manufacturer’s instructions.

For RNC-seq, only the Oligo-dT method was used for library construction, in the same way as for mRNA-seq. Among all the mRNA-seq and RNC-seq data in this project, only the mRNA library constructed by Sagene Co. Ltd. was sequenced on a NovaSeq-6000 platform (Illumina, China) for 300 cycles; the rest were sequenced on a MGISEQ-2000 platform (MGI, China) for 210 cycles.

The high-quality reads were subjected to the subsequent bioinformatics analysis. The adapter sequences were trimmed from the reads. Then reads were mapped to transcripts using the hyper-accurate mapping algorithm FANSe3³⁸ in the next-generation sequencing analysis platform “Chi-Cloud” (http://www.chi-biotech.com). Gene expression levels were quantified using the RPKM (reads per kilobase per million reads) method³⁹. Genes with at least 10 reads were considered quantifiable genes⁴⁰.
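
The RPKM quantification and the 10-read threshold described above can be sketched as follows. This is a minimal illustration, not the Chi-Cloud or FANSe3 implementation: the function name `rpkm`, the gene names, and the toy counts are hypothetical, and the sketch assumes that the "per million" term uses the total number of mapped reads, including reads from genes below the threshold.

```python
def rpkm(read_counts, gene_lengths_bp, min_reads=10):
    """Return {gene: RPKM} for genes with at least `min_reads` mapped reads.

    read_counts     -- {gene: number of mapped reads}
    gene_lengths_bp -- {gene: transcript length in base pairs}
    """
    # Assumption: normalize by all mapped reads, not only the quantifiable ones.
    total_mapped = sum(read_counts.values())
    if total_mapped == 0:
        return {}
    per_million = total_mapped / 1_000_000
    result = {}
    for gene, reads in read_counts.items():
        if reads < min_reads:
            continue  # not a "quantifiable gene" under the 10-read rule
        length_kb = gene_lengths_bp[gene] / 1_000
        result[gene] = reads / (length_kb * per_million)
    return result


# Hypothetical toy data: three genes, 2000 mapped reads in total.
counts = {"GENE_A": 500, "GENE_B": 9, "GENE_C": 1491}
lengths = {"GENE_A": 2_000, "GENE_B": 1_000, "GENE_C": 3_000}
values = rpkm(counts, lengths)
for gene, value in values.items():
    print(gene, round(value, 1))
```

In this toy example GENE_B has only 9 reads and is therefore dropped, while its reads still count toward the library size.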

RNA degradation

We tested the RNA samples under degradation conditions to investigate how degradation affects their applicability as a reference, and to test whether our procedure could tolerate the degradation and provide results comparable to those of the non-degraded counterparts.

For RNA samples that were slightly degraded during RNA extraction, we used the Oligo-dT method and the rRNA depletion method for library construction. The probe sequences used in the rRNA depletion method are listed in supplementary Table 1. The library construction method for RNA samples degraded with RNase A was as follows. We randomly selected an intact, non-degraded total RNA sample from generations 1~9 of MHCC97H (Fig. 3a). For example, in the third generation, the concentration of extracted total RNA was 1299.16 ng/µL, the total volume was 20 µL, and 29 µg of total RNA was finally obtained. We performed a series of RNase A degradation experiments, each of which used 2 µg of total RNA from the third-generation cells as starting material. 1 ng of RNase A was added to each of the five experimental groups (but not to the non-degraded group), and the reactions were run for 30 s, 1 min, 2 min, 5 min, and 1 h, respectively. Finally, 0.5 U of RNaseOUT (Thermo Fisher Scientific, 10777019, USA) was added to terminate the reaction. All experiments were performed at room temperature. The library was constructed by the rRNA depletion method and evaluated in the same way as described above (paragraph “mRNA-seq and RNC-seq”).

Protein trypsin digestion

Each generation of cells was cultured to 80~90% confluency, then treated with 0.25% trypsin-EDTA (Gibco, 25200056, USA), centrifuged at 300 × g for 5 min at room temperature, and washed twice with PBS, and the supernatant was removed by centrifugation. Cells were dissolved in 1% SDS lysis buffer (Beyotime, P0013G, China) and the protein concentration was measured with a BCA quantification kit (Thermo Fisher Scientific, 23227, USA). Protein digestion was performed by the filter-aided sample preparation (FASP)⁴¹ method. In brief, protein samples were treated with 8 M urea (8 M urea in 0.1 M Tris-HCl, pH 8.5), resulting in a final urea concentration of ≥4 M. Next, DTT was added to a concentration of 50 mM and the samples were incubated at 37 °C for 1 h. Iodoacetamide (IAA) solution (Merck, I6125, USA) was added to a concentration of 120~150 mM, and the samples were incubated at room temperature for 30 min in the dark. Each solution was transferred into a 10 kDa ultrafiltration tube (Merck, UFC501096, USA) and centrifuged at 12000 × g for 15 min. The filter tube was washed twice with 8 M urea (200 µL each time) and then three times (200 µL each time) with 50 mM triethylammonium bicarbonate (TEAB) (Thermo Fisher Scientific, 90114, USA). Finally, trypsin (Promega, V5280, USA) was added at a ratio of 1:40 (trypsin:protein), and the samples were incubated at 37 °C overnight. After 16 hours, all peptides were collected by centrifugation at 12000 × g for 20 min. The filter tubes were then washed twice with 50 mM TEAB (200 µL each time), and all eluted peptides were collected and pooled. Their concentrations were determined using the Pierce Quantitative Fluorometric Peptide Assay kit (Thermo Fisher Scientific, 23290, USA). Finally, all peptides were lyophilized and stored at -80 °C.

The peptides were then redissolved in 200 µL of 0.5% trifluoroacetic acid (TFA) (Macklin, T818782, China) solution and desalted using Waters C18 columns (Waters, WAT054955, USA). The desalting procedure was as follows: the C18 columns were activated with 1 mL of acetonitrile (ACN) (Thermo Fisher Scientific, A955-4, USA) and then equilibrated twice with 1 mL of condition buffer (20% ACN with 0.1% TFA). All peptides were then loaded onto the C18 columns, and the loading was repeated 3 times. The C18 columns were then washed 5 times with 1 mL of washing buffer (0.1% TFA). Finally, all peptides were eluted with elution buffer (70% ACN with 0.1% TFA), lyophilized, and stored at -80 °C.

Data-dependent acquisition (DDA) mass spectrometry

For data-dependent acquisition analysis, data were collected on a Q Exactive Plus (QE+) mass spectrometer equipped with an EASY-nLC 1000 system (Thermo Fisher Scientific, USA) and on an Orbitrap Fusion Lumos mass spectrometer equipped with an EASY-nLC 1200 system (Thermo Fisher Scientific, USA).

QE+ parameter setting

Each injection consisted of 2 µg of peptides and 1 µL of standard peptides (iRT peptides) (Biognosys, Ki-3002-2, Switzerland). The samples were separated on a 100 µm × 20 mm, 5 µm C18 nano trap column (Thermo Fisher Scientific, AAA-164564, USA) followed by a 75 µm × 250 mm, 2 µm C18 analytical column (Thermo Fisher Scientific, 164941, USA). On the analytical column, the samples were eluted at a flow rate of 300 nL/min for 120 min, and the elution gradient was: 3~7% solvent B, 4 min; 7~18% solvent B, 70 min; 18~25% solvent B, 20 min; 25~35% solvent B, 16 min; 35~40% solvent B, 1 min; 40~90% solvent B, 9 min (solvent A: 98% H₂O, 2% ACN, 0.1% FA; solvent B: 98% ACN, 2% H₂O, 0.1% FA). The mass spectrometer parameters were set as follows: MS1 scan range: 400 to 1200 m/z, resolution: 70000, AGC (automatic gain control) target: 3e6, max injection time: 60 ms. The top 20 precursor ions from MS1 were selected for MS2 acquisition. MS2 scan resolution: 17500, isolation window: 1.6 m/z, HCD (higher-energy collisional dissociation): 32%, AGC target: 5e5, max injection time: 50 ms, NCE (normalized collision energy): 27%, dynamic exclusion: 30 s.

Orbitrap Fusion Lumos parameter setting

Each injection consisted of 2 µg of peptides and 1 µL of iRT peptides. The samples were separated on a 150 µm × 20 mm, 1.9 µm C18 nano trap column (homemade) followed by a 150 µm × 300 mm, 1.9 µm C18 analytical column (homemade). On the analytical column, the samples were eluted at a flow rate of 600 nL/min for 120 min, and the elution gradient was: 5~12% solvent B, 28 min; 12~24% solvent B, 58 min; 24~38% solvent B, 25 min; 38~95% solvent B, 1 min; 95% solvent B, 8 min. The mass spectrometer parameters were set as follows: MS1 scan range: 350 to 1500 m/z, resolution: 120 k, AGC: 4e5, max injection time: 50 ms; MS2 scan resolution: 15 k, isolation window: 1.6 m/z, HCD: 31%, AGC target: 5e4, max injection time: 50 ms, cycle time: 3 s, dynamic exclusion: 30 s.
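
As a quick consistency check on the two elution gradients listed above, the segment durations can be summed and compared with the stated 120 min run time. The sketch below is only an illustration; the tuple layout `(start %B, end %B, minutes)` and the helper name `check_gradient` are hypothetical, while the numbers are taken from the text.

```python
# Each segment: (start % solvent B, end % solvent B, duration in minutes).
qe_plus_gradient = [
    (3, 7, 4), (7, 18, 70), (18, 25, 20),
    (25, 35, 16), (35, 40, 1), (40, 90, 9),
]
lumos_gradient = [
    (5, 12, 28), (12, 24, 58), (24, 38, 25),
    (38, 95, 1), (95, 95, 8),  # the last segment holds at 95% B
]


def check_gradient(segments, expected_minutes):
    """True if the segments add up to the run time and join without gaps."""
    total = sum(minutes for _, _, minutes in segments)
    contiguous = all(prev[1] == nxt[0] for prev, nxt in zip(segments, segments[1:]))
    return total == expected_minutes and contiguous


print("QE+ gradient consistent:", check_gradient(qe_plus_gradient, 120))
print("Lumos gradient consistent:", check_gradient(lumos_gradient, 120))
```

Both gradients add up to 120 min, and each segment starts at the solvent B percentage where the previous one ends.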

Data-independent acquisition (DIA) mass spectrometry

The mass spectrometry data for data-independent acquisition analysis were collected on both the QE+ and the Orbitrap Fusion Lumos.

QE+ parameter setting

Each injection consisted of 2 µg of peptides and 1 µL of iRT peptides. Samples were analyzed in data-independent acquisition mode. The liquid chromatography conditions were the same as in the data-dependent acquisition method described above. The mass spectrometer parameters were set as follows: MS1 resolution: 70000; MS2 resolution: 17500, m/z range: 400 to 1200 m/z, variable acquisition windows: 30, AGC target: 3e6, injection time: 60 ms, NCE: 27%, AGC target: 1e6, max injection time: auto.

Orbitrap Fusion Lumos parameter setting

Each injection consisted of 2 µg of peptides and 1 µL of iRT peptides. Samples were analyzed in data-independent acquisition mode. The liquid chromatography conditions were the same as in the data-dependent acquisition method described above. The mass spectrometer parameters were set as follows: MS1 scan resolution: 120000, AGC target: 4e5, max injection time: 50 ms, mass range: 350 to 1250 m/z, followed by 40 data-independent acquisition scans with segment widths adjusted to the precursor density; MS2 scan resolution: 30 k, max injection time: 50 ms, AGC target: 5e5, HCD: 31%.

Database search

MaxQuant (version 1.5.7.4) was used to search the data-dependent acquisition data. The common search parameters were: Type: standard, multiplicity 1; Digestion: mode, specific; enzyme, trypsin/P; Fixed modification: carbamidomethyl (C); Variable modifications: oxidation (M), acetyl (protein N-term); Max number of modifications per peptide: 5; Missed cleavage sites allowed: 2; Label-free quantification: LFQ; LFQ minimum ratio count: 2; Fast LFQ was selected; LFQ minimum number of neighbors: 3; LFQ average number of neighbors: 6; Instrument: orbitrap. For confident identification, we required a false discovery rate (FDR) < 0.01 at both the peptide and protein levels.

The data-independent acquisition data were searched with the direct DIA module of Spectronaut Pulsar (version 14.2.200619.47784). The common search parameters were: Enzymes/Cleavage Rules: trypsin/P; XIC Extraction: default parameters; Modifications: Fixed modification: Carbamidomethyl (C); Variable modifications: Oxidation (M), Acetyl (Protein N-term); Calibration: default parameters; Identification FDR (false discovery rate) threshold: peptide level: 0.01, protein level: 0.01, and PSM level: 0.01; Identification: Machine Learning: Per Run, Precursor and Protein Qvalue Cutoff: 0.01, Probability Cutoff: 0.75, with the other settings left at their defaults; Quantification: Quantity MS-Level: MS2, with the other settings left at their defaults. The Workflow, Post Analysis, and Pipeline Mode settings were left at their defaults.

The Database of Uniprot-Human-Filtered-Reviewed-Yes -UP000005640_9606.fasta was used for all database searches.

RNA and protein quantification

For mRNA-seq and full-length translating mRNA-seq (RNC-seq) data, reads were mapped with the FANSe3 algorithm, using the human transcriptome database as the reference. The mRNA abundance was normalized using RPKM.

For protein quantification analysis, label-free mass spectrometry data were quantified with the iBAQ (intensity-based absolute quantification) algorithm as provided in MaxQuant. Missing values were removed from the protein quantification data before median normalization was performed.
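
The text does not spell out how missing values were removed or how the median normalization was computed, so the following is only a sketch under stated assumptions: proteins with a missing value in any sample are dropped, and each sample is then rescaled so that its median intensity equals the median of the per-sample medians. The function name `median_normalize`, the protein labels, and the toy intensities are hypothetical, and intensities are assumed to be positive.

```python
from statistics import median


def median_normalize(table):
    """Drop incomplete proteins, then equalize the per-sample medians.

    table -- {protein: [intensity per sample]}, with None marking a missing value.
    Assumption: a protein is removed if it is missing in any sample.
    """
    complete = {p: v for p, v in table.items() if all(x is not None for x in v)}
    if not complete:
        return {}
    n_samples = len(next(iter(complete.values())))
    sample_medians = [
        median(values[i] for values in complete.values()) for i in range(n_samples)
    ]
    target = median(sample_medians)  # common median after normalization
    return {
        protein: [x * target / sample_medians[i] for i, x in enumerate(values)]
        for protein, values in complete.items()
    }


# Hypothetical iBAQ-like intensities for two samples; P4 is missing in sample 2.
raw = {
    "P1": [10.0, 20.0],
    "P2": [30.0, 60.0],
    "P3": [50.0, 100.0],
    "P4": [5.0, None],
}
normalized = median_normalize(raw)
for protein, values in normalized.items():
    print(protein, values)
```

Other conventions (for example, normalizing log-transformed intensities by subtracting the median) are equally plausible; the choice should follow whatever the downstream comparison expects.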

Data Records

All the sequencing datasets are available at the NCBI Gene Expression Omnibus (GEO) with dataset identifier GSE234201⁴². All the mass spectrometry raw data are publicly available on iProX with the accession number PXD041292⁴³. Details of all omics data are shown in supplementary Tables 2, 6.

Technical Validation

Quality control of cells

To find a cell line with a stable transcriptome and proteome during long-term subculture, we tested 5 commonly used cell lines: MHCC97H, HCCLM6, HCCLM3, HeLa, and A549. To ensure the quality of the cell lines, we tested for mycoplasma at intervals and confirmed that all cell lines were mycoplasma negative (Fig. 2a~d).

RNA quality control

For each cell line, we cultured 8~12 generations and took samples from each generation. Total RNA was extracted from each sample and the RNA quality was examined by electrophoresis to verify that they were not degraded (Fig. 3a~e).

Omics data quality control

We conducted quality control of the sequencing data (including RNA-Seq and RNC-Seq) and generated a series of QC metrics. The overall quality of the sequencing dataset was satisfactory at the level of both raw and mapped data in the following respects: (1) the average read count of the raw sequencing data was more than 20 M; (2) the average mapping ratio of all samples was around 72%; (3) the average GC content of the data generated from all samples was around 52%; (4) the average rRNA contamination ratio for all samples was around 4.57%; (5) the average Q30 of all samples was around 89%. The detailed results of data quality control are shown in supplementary Table 7.

Reproducibility of transcriptome datasets

We used the polyA+ mRNA method to construct transcriptome libraries for sequencing and used the RPKM method for quantification. The mutual correlation of gene expression showed that MHCC97H has the most stable transcriptome, with Pearson r = 0.983~0.997 (Fig. 4a). The other two hepatocellular carcinoma cell lines showed lower consistency (r as low as 0.973 and 0.960, respectively, Fig. 4b,c). The HeLa and A549 cells showed even lower consistency over the generations (average r = 0.979 and 0.964, respectively, with the lowest values being 0.948 and 0.920, respectively, Fig. 4d~f).
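
The range of mutual correlations reported above (for example r = 0.983~0.997) corresponds to the minimum and maximum Pearson r over all pairs of generations. A minimal sketch of that calculation is given below; the function names, generation labels, and toy expression values are hypothetical, and whether the published values were computed on raw or log-transformed RPKM is not stated in the text, so the sketch simply uses the values as given.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_range(profiles):
    """Min and max pairwise r across generations.

    profiles -- {generation: [expression per gene]}, same gene order everywhere.
    """
    rs = [pearson_r(profiles[a], profiles[b]) for a, b in combinations(profiles, 2)]
    return min(rs), max(rs)


# Hypothetical expression values for four genes in three generations.
profiles = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.0],
    "G3": [1.0, 2.0, 3.0, 5.0],
}
low, high = correlation_range(profiles)
print(f"r = {low:.3f}~{high:.3f}")
```

In practice each profile would hold the RPKM values of all quantifiable genes shared between the compared samples.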

Subsequently, we performed a series of rigorous evaluations of the stability of MHCC97H at the transcriptome level. Firstly, we subcultured two batches of MHCC97H cells in May and December 2021, respectively. The mutual correlation of gene expression over generations was similar (r = 0.977~0.997, Fig. 5a). The correlation of the same generation between the two batches was consistently high (r = 0.998 ± 0.001, Fig. 5b). Secondly, we tested the robustness across experimenters and labs. The results (Fig. 5c~d) were almost identical to those of the former experimenter (Fig. 4a). We also sent 4 samples to two commercial sequencing service providers more than 1000 km away. Chi-Biotech Co. Ltd. was equipped with a MGISEQ-2000 platform, and Sagene Co. Ltd. was equipped with a NovaSeq-6000 platform. The Pearson r reached 0.979~0.991 and 0.970~0.991, respectively (Fig. 5e), and the mutual correlation of gene expression over generations was similar in both labs (Fig. 5f).

We then tested different mRNA enrichment strategies. Our standard protocol used oligo-dT to enrich polyA+ mRNA (mature mRNA), which is the approach applied in most studies. The other strategy was the rRNA depletion method, which removes rRNA by probe hybridization followed by bead extraction or RNase H degradation. Using the rRNA depletion method, the Pearson r was 0.962~0.986 (Fig. 5g), which was considerably lower than with the oligo-dT method. As expected, the correlation of gene expression between the two strategies was lower still (r = 0.864~0.896) (Fig. 5h), suggesting that data generated by the two different mRNA enrichment strategies should not be mixed for analysis.

RNA is vulnerable to ubiquitous RNases and environmental changes (e.g. freeze-thaw cycles). Therefore, minor or major degradation might be inevitable during the production, storage, and transport of the standard samples. We generated transcriptome sequencing datasets of RNA-degraded samples to investigate how degradation affects their applicability as a reference. Firstly, we created a scenario that mimics degradation due to environmental exposure: the RNA samples were exposed to the air at room temperature for a prolonged time (more than 2 hours), so that RNases in the environment could enter the tube and degrade the RNA. Then, the samples were frozen and thawed for 10 cycles. Agarose gel electrophoresis showed that the total RNA of MHCC97H was degraded to various extents (Fig. 6a). Surprisingly, with the oligo-dT method, the non-degraded RNA samples and their corresponding degraded samples still had high transcriptome correlation (Pearson r = 0.980~0.993, Fig. 6b). The rRNA depletion method also showed remarkable consistency between the non-degraded and degraded RNA samples (r = 0.965~0.983, Fig. 6c), although it was still lower than with the oligo-dT method.

Most environmental RNases are exonucleases, which may leave the 3’-end of the mRNAs intact. Endonucleases, in contrast, may degrade mRNA into smaller fragments. We added RNase A to the non-degraded MHCC97H total RNA and incubated it for 30 seconds to 1 hour to create a series of endonuclease-degraded samples (Fig. 6d). The endonuclease-degraded samples showed considerably lower consistency with their non-degraded counterparts (r = 0.872~0.941, Fig. 6e), but this was still much higher than the correlation reported in other literature (r² = 0.41~0.69)¹³.

Reproducibility of translatomic and proteomic datasets

It is generally known that translational regulation is the most significant regulatory level²². Therefore, we first tested the stability of the MHCC97H translatome over subculture generations. The RNC-seq of MHCC97H showed very high mutual consistency (Pearson r = 0.974~0.996, Fig. 7a). Mass spectrometry requires more steps than RNA-seq, making it difficult to achieve high reproducibility. However, the protein abundance detected using mass spectrometry was comparable across generations both in data-dependent acquisition mode (r = 0.966~0.988) (Fig. 7b) and in data-independent acquisition mode (r = 0.970~0.994) (Fig. 7c). To test the variability contributed by the experimental procedures, we started from the same trypsin-digested sample and made 3 independent mass spectrometry measurements (including LC-MS and data analysis). Such technical replicates yielded r² = 0.945~0.949 and r² = 0.975~0.990 in data-dependent acquisition (Fig. 7d) and data-independent acquisition (Fig. 7e) modes, respectively. These results indicated that the variance contributed by biological nature and trypsin digestion was negligible in the DDA mode, and only barely distinguishable in the DIA mode.
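
Note that the across-generation consistency above is reported as r, whereas the technical replicates are reported as r². To compare the two on the same scale, the r² values can be converted to r by taking the square root, assuming the correlations are positive. The short sketch below does this with the values quoted in the text; the variable and function names are hypothetical.

```python
from math import sqrt

# Values quoted in the text: (lower bound, upper bound).
technical_r2 = {"DDA": (0.945, 0.949), "DIA": (0.975, 0.990)}
across_generation_r = {"DDA": (0.966, 0.988), "DIA": (0.970, 0.994)}


def r_from_r2(r2):
    """Convert r squared to r, assuming a positive correlation."""
    return sqrt(r2)


technical_r = {
    mode: tuple(r_from_r2(value) for value in bounds)
    for mode, bounds in technical_r2.items()
}

for mode in technical_r:
    tech_low, tech_high = technical_r[mode]
    gen_low, gen_high = across_generation_r[mode]
    print(
        f"{mode}: technical r = {tech_low:.3f}~{tech_high:.3f}, "
        f"across generations r = {gen_low:.3f}~{gen_high:.3f}"
    )
```

On the r scale, the technical replicates fall at roughly 0.972~0.974 (DDA) and 0.987~0.995 (DIA), that is, within or close to the across-generation ranges, which is consistent with the statement that little additional variance comes from the biology and the digestion step.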

Next, we tested the robustness of the standard proteome sample across labs and instruments. We distributed the same batch of standard proteome samples to 4 labs (JNU, SCUT, DICP, and BNU), which were over 2000 km apart (Fig. 8a). The samples were shipped in ice boxes at 0 °C for 3 days. All labs followed the same protocol to process the samples. The only differences, which were in the hardware, are listed in Fig. 8b. A similar number of proteins was identified in the 4 labs (Fig. 8c~d). The JNU lab yielded slightly more proteins because of its longer column, which provides higher chromatographic resolution. The SCUT lab yielded fewer proteins because of the lower resolution and slower scanning speed of its mass spectrometer. However, the distribution of the isoelectric points (pI) of the identified proteins showed no significant differences among these labs (Fig. 8e). The protein abundance measured by these labs was highly comparable (r = 0.962~0.974, Fig. 8f left). In data-independent acquisition mode, the labs with the same instruments showed highly similar results (r = 0.962), while the SCUT lab, which was equipped with another model of mass spectrometer, showed markedly lower consistency with the other two labs (r = 0.912) (Fig. 8f right), demonstrating that instrument-specific bias and spectrum analysis software cannot be neglected.

Code availability

The data analysis methods, software, and associated parameters used in the present study were described in the section of Methods. If no detailed parameters were described for the software used in this study, default parameters were employed. No custom scripts were generated in this work.

References

1. Park, J. Y. et al. Clinical exome performance for reporting secondary genetic findings. Clin Chem 61, 213–220 (2015).
2. Torga, G. & Pienta, K. J. Patient-Paired Sample Congruence Between 2 Commercial Liquid Biopsy Tests. JAMA Oncol 4, 868–870 (2018).
3. Simoneau, J., Dumontier, S., Gosselin, R. & Scott, M. S. Current RNA-seq methodology reporting limits reproducibility. Brief Bioinform 22, 140–145 (2021).
4. Bell, A. W. et al. A HUPO test sample study reveals common problems in mass spectrometry-based proteomics. Nat Methods 6, 423–430 (2009).
5. Tabb, D. L. et al. Repeatability and reproducibility in proteomic identifications by liquid chromatography-tandem mass spectrometry. J Proteome Res 9, 761–776 (2010).
6. Xuan, Y. et al. Standardization and harmonization of distributed multi-center proteotype analysis supporting precision medicine studies. Nat Commun 11, 5248 (2020).
7. Alyass, A., Turcotte, M. & Meyre, D. From big data analysis to personalized medicine for all: challenges and opportunities. BMC Med Genomics 8, 33 (2015).
8. Karczewski, K. J. & Snyder, M. P. Integrative omics for health and disease. Nat Rev Genet 19, 299–310 (2018).
9. Chen, Z. et al. Systematic comparison of somatic variant calling performance among different sequencing depth and mutation frequency. Sci Rep 10, 3501 (2020).
10. Novoradovskaya, N. et al. Universal Reference RNA as a standard for microarray experiments. BMC Genomics 5 (2004).
11. SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol 32, 903–914 (2014).
12. Chang, C. et al. Systematic analyses of the transcriptome, translatome, and proteome provide a global view and potential strategy for the C-HPP. J Proteome Res 13, 38–49 (2014).
13. Schuierer, S. et al. A comprehensive assessment of RNA-seq protocols for degraded and low-quantity samples. BMC Genomics 18, 442 (2017).
14. Kitchen, R. R. et al. Correcting for intra-experiment variation in Illumina BeadChip data is necessary to generate robust gene-expression profiles. BMC Genomics 11 (2010).
15. Selitsky, S. R. et al. Virus expression detection reveals RNA-sequencing contamination in TCGA. BMC Genomics 21 (2020).
16. Nelson-Rees, W. A., Hunter, L., Darlington, G. J. & O’Brien, S. J. Characteristics of HeLa strains: permanent vs. variable features. Cytogenet Cell Genet 27, 216–231 (1980).
17. Gille, J. J. & Joenje, H. Chromosomal instability and progressive loss of chromosomes in HeLa cells during adaptation to hyperoxic growth conditions. Mutat Res 219, 225–230 (1989).
18. Chen, T. R. Re-evaluation of HeLa, HeLa S3, and HEp-2 karyotypes. Cytogenet Cell Genet 48, 19–24 (1988).
19. Macville, M. et al. Comprehensive and definitive molecular cytogenetic characterization of HeLa cells by spectral karyotyping. Cancer Res 59, 141–150 (1999).
20. Frattini, A. et al. High variability of genomic instability and gene expression profiling in different HeLa clones. Sci Rep 5, 15377 (2015).
21. Liu, Y. et al. Multi-omic measurements of heterogeneity in HeLa cells across laboratories. Nat Biotechnol 37, 314–322 (2019).
22. Schwanhäusser, B. et al. Global quantification of mammalian gene expression control. Nature 473, 337–342 (2011).
23. Orchard, S., Hermjakob, H. & Apweiler, R. The proteomics standards initiative. Proteomics 3, 1374–1376 (2003).
24. Bittremieux, W. et al. The Human Proteome Organization-Proteomics Standards Initiative Quality Control Working Group: Making Quality Control More Accessible for Biological Mass Spectrometry. Anal Chem 89, 4474–4479 (2017).
25. Chiva, C. et al. Quality standards in proteomics research facilities: Common standards and quality procedures are essential for proteomics facilities and their users. EMBO Rep 22 (2021).
26. Ramus, C. et al. Spiked proteomic standard dataset for testing label-free quantitative software and statistical methods. Data Brief 6, 286–294 (2016).
27. Gotti, C. et al. DIA proteomics data from a UPS1-spiked E. coli protein mixture processed with six software tools. Data Brief 41, 107829 (2022).
28. Ramus, C. et al. Benchmarking quantitative label-free LC-MS data processing workflows using a complex spiked proteomic standard dataset. J Proteomics 132, 51–62 (2016).
29. Gotti, C. et al. Extensive and Accurate Benchmarking of DIA Acquisition Methods and Software Tools Using a Complex Proteomic Standard. J Proteome Res 20, 4801–4814 (2021).
30. Tang, Z. Y. et al. A decade’s studies on metastasis of hepatocellular carcinoma. J Cancer Res Clin Oncol 130, 187–196 (2004).
31. Li, Y. et al. Establishment of cell clones with different metastatic potential from the metastatic hepatocellular carcinoma cell line MHCC97. World J Gastroenterol 7, 630–636 (2001).
32. Li, Y. et al. Establishment of a hepatocellular carcinoma cell line with unique metastatic characteristics through in vivo selection and screening for metastasis-related genes through cDNA microarray. J Cancer Res Clin Oncol 129, 43–51 (2003).
33. Li, Y. et al. Stepwise metastatic human hepatocellular carcinoma cell model system with multiple metastatic potentials established through consecutive in vivo selection and studies on metastatic characteristics. J Cancer Res Clin Oncol 130, 460–468 (2004).
34. Wang, T. et al. Translating mRNAs strongly correlate to proteins in a multivariate manner and their translation ratios are phenotype specific. Nucleic Acids Res 41, 4743–4754 (2013).
35. Lu, S. et al. A hidden human proteome encoded by ‘non-coding’ genes. Nucleic Acids Res 47, 8111–8125 (2019).
36. Lian, X. et al. Genome-Wide and Experimental Resolution of Relative Translation Elongation Speed at Individual Gene Level in Human Cells. PLoS Genet 12, e1005901 (2016).
37. Liu, W., Xiang, L., Zheng, T., Jin, J. & Zhang, G. TranslatomeDB: a comprehensive database and cloud-based analysis platform for translatome sequencing data. Nucleic Acids Res 46, D206–D212 (2018).
38. Zhang, G., Zhang, Y. & Jin, J. The Ultrafast and Accurate Mapping Algorithm FANSe3: Mapping a Human Whole-Genome Sequencing Dataset Within 30 Minutes. Phenomics 1, 22–30 (2021).
39. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5, 621–628 (2008).
40. Bloom, J. S., Khan, Z., Kruglyak, L., Singh, M. & Caudy, A. A. Measuring differential gene expression by short read sequencing: quantitative comparison to 2-channel gene expression microarrays. BMC Genomics 10, 221 (2009).
41. Wiśniewski, J. R., Zougman, A., Nagaraj, N. & Mann, M. Universal sample preparation method for proteome analysis. Nat Methods 6, 359–362 (2009).
42. Zhang, G. et al. GEO. https://identifiers.org/geo/GSE234201 (2023).
43. Zhang, G. et al. A multi-omics dataset of human transcriptome and proteome stable reference. iProX. http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD041292 (2023).

Acknowledgements

This work was supported by the Ministry of Science and Technology of China, the National Key Research and Development Program (Project No. 2017YFA0505001/2017YFA0505101/2018YFC0910201/2018YFC0910202), the National Natural Science Funds of China (Project No. 81802916/82002949), the National Natural Science Funds of Guangdong Province (Project No. 2023A1515010605), Guangdong Key R&D Program (Project No. 2019B020226001), State Key Laboratory of Respiratory Disease, Guangdong-Hong Kong-Macao Joint Laboratory of Respiratory Infectious Disease (Project No. GHMJLRID-Z-202103), Guangzhou Medical University Discipline Construction Funds (Basic Medicine) (Project No. JCXKJS2022A11) and the Fundamental Research Funds for the Central Universities. We also thank Chi-biotech Co. Ltd. and SAGENE Co. Ltd. for their contributions to the generation of the multi-platform and multi-center transcriptome data.

Author information

These authors contributed equally: Shaohua Lu, Hong Lu, Tingkai Zheng.

Authors and Affiliations

Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes and MOE Key Laboratory of Tumor Molecular Biology, Institute of Life and Health Engineering, Jinan University, Guangzhou, China
Shaohua Lu, Hong Lu, Tingkai Zheng, Zhenghua Sun, Jingjie Jin, Qing-Yu He, Yang Chen & Gong Zhang
Sino-French Hoffmann Institute, School of Basic Medical Sciences, State Key Laboratory of Respiratory Disease, Guangzhou Medical University, Guangzhou, China
Shaohua Lu
CAS Key Laboratory of Separation Science for Analytical Chemistry, National Chromatographic Research and Analysis Center, Dalian Institute of Chemical Physics, Chinese Academy of Science, Dalian, China
Huiming Yuan
School of Biology and Biological Engineering, South China University of Technology, Guangzhou, China
Hongli Du, Wenlu Zhang & Shuying Fu
Department of Biochemistry and Molecular Biology, Beijing Key Laboratory of Gene Engineering Drug and Biotechnology, Beijing Normal University, Beijing, China
Youhe Gao, Yongtao Liu & Xuanzhen Pan

Contributions

G.Z., S.H.L. and Y.C. conceived the project, designed the experiments, and co-supervised the study; S.H.L. and Y.C. performed cell culture; S.H.L. performed the mass spectrometry experiments with assistance from Z.H.S.; H.L. and T.K.Z. performed the mRNA-seq experiments with assistance from Y.C. and J.J.J.; H.L. performed the analysis of all omics data with assistance from S.H.L.; Y.C. performed the RNC-seq experiments; H.M.Y., H.L.D., Y.H.G., Y.T.L., X.Z.P., W.L.Z. and S.Y.F. jointly performed the multi-center and multi-platform mass spectrometry experiments; Q.Y.H. provided mass spectrometry resources; G.Z. and S.H.L. wrote the manuscript. Y.C., S.H.L. and H.L. revised the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Shaohua Lu, Yang Chen or Gong Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Table 2

Supplementary Table 3

Supplementary Table 4

Supplementary Table 5

Supplementary Table 6

Supplementary Table 7

Supplementary Table 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Lu, S., Lu, H., Zheng, T. et al. A multi-omics dataset of human transcriptome and proteome stable reference. Sci Data 10, 455 (2023). https://doi.org/10.1038/s41597-023-02359-w

Received: 31 January 2023
Accepted: 03 July 2023
Published: 13 July 2023
DOI: https://doi.org/10.1038/s41597-023-02359-w
Springer Nature Limited
