Background

China is rich in human biodiversity and is home to at least six language families: Altaic (Mongolic, Tungusic, and Turkic), Sino-Tibetan (ST; Sinitic and Tibeto-Burman (TB)), Hmong-Mien (HM), Tai-Kadai (TK), Austronesian (AN), and Austroasiatic (AA). The genetic patterns of modern Chinese populations have revealed population stratification among ethnolinguistically different groups that is strongly correlated with geography, culture, and language family [1,2,3]. Recent genetic cohorts from the China Metabolic Analytics Project (ChinaMAP) [4] and the NyuWa genome resource [5] have provided crucial genetic variation data from geographically different Chinese populations and offered new insights into the population structure and medical relevance of Chinese people. However, these genetic studies mainly included Han Chinese as their study subjects, which introduces a Han bias into Chinese population genetic studies and may worsen inequality in the health benefits of genomics in the era of genome-driven precision medicine [4,5,6]. China had two independent centers of agricultural innovation: the Yellow River Basin (YRB, millet agriculture) and the Yangtze River Basin (YZRB, rice agriculture). The rich history of social organization and technological innovation in the middle Holocene facilitated the formation of the ancient Yangshao and Dawenkou tribes in North China and of the Sanmiao tribe and the Liangzhu society in South China. Recent ancient DNA studies have identified genetic differentiation between Ancient Northern East Asians (ANEA) and Ancient Southern East Asians (ASEA) since the early Neolithic period, followed by extensive population admixture along different geographical corridors [7, 8].
The patterns of evolutionary history observed in East Asia differ from those in Europe and Oceania, which underwent large-scale population admixture and replacement [9]. Ancient gene flow from outside East Asia has had limited influence on the genetic background of East Asians [10]. In contrast, ancient DNA from spatiotemporally diverse East Asians has identified regionally restricted ancient founding lineages and contributed to the reconstruction of subsequent extensive population migration and admixture events [7, 15,16]. Nevertheless, these efforts have provided only foundational knowledge of the evolutionary and adaptive history of genetically different Chinese populations. The fine-scale genetic structure of ethnolinguistically different Chinese populations, and the patterns of genetic relationship and admixture among them, need to be explored further, especially for ethnolinguistically underrepresented groups in South China.

HM-speaking populations include those who speak Hmongic (Miao, She, and Hmong) and Mienic (Yao and Dao) languages in mountainous areas of South China, Vietnam, and Thailand [17]. The original homeland of HM people has been suggested to be in Central China, associated with the Neolithic Shijiahe, Qujialing, and Daxi cultures in the middle YZRB. Historical documents indicate that the expansion of proto-Han Chinese or other ANEA through Central China promoted the southward migration of ancient HM people [18]. The complex migration and admixture history of HM people and their interaction with other southern Chinese populations (ST, TK, AN, and AA) must be further explored. Recent findings based on uniparental markers and genome-wide evidence have identified different evolutionary processes between inland TK and HM people and between coastal AN and TK people [19,20,15, 30]. Whole-genome sequences of mitochondrial DNA and the Y chromosome could provide additional evolutionary traces based on shared or novel haplotype groups, also referred to as haplogroups. Poznik et al. analyzed 1244 worldwide Y-chromosome sequences from the 1000 Genomes Project (1KGP) to characterize the landscape of Y-chromosome diversity of 26 worldwide populations. Karmin et al. investigated 456 geographically diverse high-coverage Y-chromosome sequences to construct a revised phylogenetic topology with divergence time estimates for key mutation events [32, 33]. They reported punctuated bursts and population bottlenecks associated with cultural exchanges among worldwide populations [32, 33]. Maternal lineages can also trace the evolutionary past of different populations. Recent mitochondrial studies of modern and ancient Tibetan genomes have illuminated the Neolithic expansion of the YRB farmers and the Paleolithic peopling of the Tibetan Plateau [34,35,36]. Li et al. also reported that the maternal structure of Han Chinese was stratified along three main Chinese river boundaries [37]. Fine-scale and large-scale uniparental genetic studies are therefore needed to explore the evolutionary history of understudied Chinese populations [38, 51].

Other risk alleles possessed a low frequency in East Asians but a high frequency in other populations, or vice versa (Fig. 5a). The similar AFS patterns across some cancer types also suggested a comparable genetic basis or pleiotropy at cancer-risk loci. In addition, interpopulation differences in drug responses are generally recognized, and drugs such as clopidogrel, warfarin, carbamazepine, and peginterferon have been confirmed to show the greatest population differences in predicted adverse drug reactions [5, 27]. Thus, we assessed the AFS of 25 known pharmacogenomic variants from the ADME (absorption, distribution, metabolism, and excretion) core genes and found that some variants showed significant allele frequency differences between HM speakers, East Asians, and other intercontinental populations. One example is SLC15A2, which is associated with the absorption of β-lactam antibiotics and peptide-like drugs, suggesting the necessity of genomic testing for drug response phenotypes (Fig. 5b).

Fig. 5

Medical relevance and natural selection signatures among HM-speaking populations. a Allele frequency spectrum (AFS) of 106 previously reported cancer susceptibility variants among HM people and worldwide reference populations from the NyuWa, GnomAD, and 1KGP. b The AFS of 25 previously reported pharmacogenomic loci in our dataset and reference groups. GDX_Guizhou included Dongjia, Gejia, and **, and quality control

We collected 349 HM individuals from 25 ethnically or geographically diverse populations (Miao, Yao, and She) from Sichuan, Chongqing, Guizhou, and Fujian provinces in South China (Fig. 1a; Additional file 1: Table S1), of whom 38 She people (SYS: 14; PSS: 7; GSS: 17) from Fujian in coastal South China are reported here for the first time. We also sampled four AN-speaking Gaoshan people in Fujian to explore the genetic interaction between coastal HM and AN populations. We genotyped 661,134 autosomal, 28,320 X-chromosomal, 24,047 Y-chromosomal, and 3746 mitochondrial SNPs in all HM and Gaoshan people using the Infinium Global Screening Array (Illumina, CA, USA). We used PLINK v.1.90 [67] and KING [68] to identify close relatives within three generations. We estimated PI_HAT values using PLINK with the “--genome” parameter. For individual pairs with PI_HAT values larger than 0.15, kinship coefficients were further estimated using KING with the “--related --ibs” parameters. We used PLINK v.1.90 [67] to filter out variants with missing call rates exceeding 0.05 (--geno 0.05) and to remove samples with missing call rates exceeding 0.1 (--mind 0.1). Additionally, variants with minor allele frequencies below 0.05 (--maf 0.05) or deviating from Hardy–Weinberg equilibrium (--hwe 1e-6) were filtered out. The final HM-related Illumina dataset included 533,935 SNPs.
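The missingness and allele-frequency filters described above can be illustrated with a minimal Python sketch. This is not PLINK's implementation: the function names, the toy genotype matrix, and the order in which the filters are applied are assumptions made for illustration, and the Hardy–Weinberg test is omitted.

```python
from typing import List, Optional, Tuple

# 0, 1 or 2 copies of the alternate allele; None marks a missing call.
Genotype = Optional[int]


def missing_rate(calls: List[Genotype]) -> float:
    """Fraction of missing calls; an empty list counts as fully missing."""
    if not calls:
        return 1.0
    return sum(g is None for g in calls) / len(calls)


def minor_allele_frequency(calls: List[Genotype]) -> float:
    """Minor allele frequency over the non-missing calls."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    alt = sum(observed) / (2 * len(observed))
    return min(alt, 1.0 - alt)


def qc_filter(matrix: List[List[Genotype]], geno: float = 0.05,
              mind: float = 0.1, maf: float = 0.05) -> Tuple[List[int], List[int]]:
    """matrix[i][j] is the genotype of sample i at variant j.

    Assumed order (illustrative): variant missingness, then sample
    missingness on the surviving variants, then minor allele frequency.
    Returns the indices of kept samples and kept variants.
    """
    n_samples, n_variants = len(matrix), len(matrix[0])
    variants = [j for j in range(n_variants)
                if missing_rate([row[j] for row in matrix]) <= geno]
    samples = [i for i in range(n_samples)
               if missing_rate([matrix[i][j] for j in variants]) <= mind]
    variants = [j for j in variants
                if minor_allele_frequency([matrix[i][j] for i in samples]) >= maf]
    return samples, variants
```

With the thresholds from the text (geno=0.05, mind=0.1, maf=0.05) passed as defaults, the function returns which rows and columns of a small genotype table would survive; real datasets should of course be filtered with PLINK itself.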

Ethics approval

All included individuals signed written informed consent forms and were unrelated indigenous people in the sampling places. We also provided the necessary genetic counseling and genetic health reports to sample donors who were interested. The study protocol was approved by the Medical Ethics Committees of North Sichuan Medical College and West China Hospital of Sichuan University.

Dataset arrangement and reference populations

To present a fully resolved picture of the genetic diversity of HM people, we also collected 20 HM people (10 Miaos from Hunan and 10 Shes from Fujian) from the HGDP [69] and 71 HM individuals (12 Daos, 8 IuMiens, 12 PaThens, and 39 Hmongs) from previously investigated populations from Vietnam [41] and Thailand [42] that were genotyped using the Affymetrix Human Origins array (personal communication). These HM people were merged with the above HM-related Illumina dataset to generate an HM-specific dataset, which consisted of 56,814 SNPs and included 440 HM people from 33 populations belonging to seven ethnic groups. To explore the genetic structure of HM-speaking populations in the context of modern eastern Eurasian reference populations, we first merged our HM-related Illumina dataset with published genome-wide SNP data that was genotyped using the same Illumina array to generate the high-density Illumina dataset [2, 15, 16, 23, 77,78,79]. The Illumina dataset contained 533,935 SNPs and also included two AA-speaking Blang and Wa; nine Mongolic-speaking Baoan, Dongxiang, Mongolian, and Yugur; Sinitic-speaking Han and Hui populations from Guizhou, Sichuan, Fujian, Gansu, and Hainan provinces; six TB-speaking Pumi, Bai, Hani, Lahu, Tibetan, and Tujia; one Tungusic-speaking Manchu; and two Turkic-speaking Kazakh and Salar (Fig. 1d). The high-density dataset was mainly used to perform the haplotype-based analyses and phylogenetic reconstruction of uniparental lineages. We then merged the high-density Illumina dataset with modern and ancient populations genotyped via the Affymetrix Human Origins array from the AADR [80] to form the merged low-density HO dataset, including 56,814 SNPs, which was used to explore the general patterns of population structure as this dataset included more modern and ancient reference populations. 
We then imputed the low-density genome-wide SNP data of modern populations in the merged HO dataset using the WBBC (Westlake BioBank for Chinese) and 1KGP haplotype reference panels [31, 81], which generated the imputed merged HO dataset covering 458,786 SNPs. The HO modern reference populations included 343 TK people from 26 populations in China and Southeast Asia, 27 Han Chinese people from 6 populations, 276 TB people from 30 Chinese and Southeast Asian populations, 224 AA people from 18 populations, 115 AN people from 13 populations, 30 Japanese and 6 Korean, 140 Mongolic people from 18 populations, and 62 Tungusic people from 62 populations (Additional file 1: Table S1) [13, 40,41,42, 71, 82,83,84]. To analyze the comprehensive admixture and interaction landscape between HM people and other ancestral source groups, we merged the high-density Illumina dataset with ancient eastern Eurasians included in the 1240K dataset to form the merged middle-density 1240K dataset, including 146,802 SNPs. Ancient eastern Eurasians were included in both the merged HO and 1240K datasets, which included 47 ancient YRB farmers from 19 populations in Shandong, Henan, Shaanxi and Qinghai [7, 13, 85]; 30 ancient people from 13 populations in Amur River Basin or West Liao River Basin [51] to estimate Fst genetic distances among each population pair. Pairwise genetic distances were designed with two parameters (--within and --keep-cluster-names).
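To make the notion of a pairwise Fst distance concrete, the following Python sketch computes Fst for one population pair from per-SNP allele frequencies. It deliberately swaps in the Hudson ratio-of-averages estimator rather than the Weir and Cockerham-type estimator that PLINK's --fst is documented to report, so its values are not expected to match PLINK output exactly. The function name and inputs are hypothetical.

```python
from typing import Sequence


def hudson_fst(p1: Sequence[float], p2: Sequence[float], n1: int, n2: int) -> float:
    """Hudson-style Fst between two populations (ratio of averages over SNPs).

    p1, p2: allele frequencies of the same allele at the same SNPs.
    n1, n2: number of sampled chromosomes in each population (must be > 1).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(p1, p2):
        # Squared frequency difference, corrected for sampling noise.
        numerator += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        # Probability that one draw from each population differs.
        denominator += a * (1 - b) + b * (1 - a)
    return numerator / denominator if denominator > 0 else float("nan")
```

Applying the function to every population pair would fill the distance matrix that is later used for tree building; fixed differences give a value of 1, and identical frequencies give a value at or slightly below 0.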

Inference of population admixture events

To construct the phylogenetic relationship among these ethnolinguistically diverse populations, we performed phylogenetic reconstruction using TreeMix v.1.13 [95]. PLINK v.1.90 [67] was used to evaluate the allele frequency of each population, which was used as the input file in the TreeMix-based analysis. We adopted the French population from the 1KGP as the outgroup (-root French) and ran TreeMix with the migration edges ranging from 0 to 7 and five replications for each run to explore the possible gene flow events. We used the plotting_funcs.R script to visualize each model’s phylogenetic topology and corresponding residual matrix. We used the -k flag (-k 500) to group SNPs to account for the LD. Additional parameters (-bootstrap and -global) were also used to get the best-fitted model. We also ran MEGA (Molecular Evolutionary Genetics Analysis) [96] based on the Fst genetic matrix to validate the obtained phylogenetic topology, and we obtained the consistent pattern of the major clades.
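The TreeMix run grid described above (0 to 7 migration edges, five replicates each) can be laid out with a short Python sketch that only builds the argument lists. The input file name, output prefix, and output naming scheme are hypothetical, and combining all listed flags in every run is an assumption; the text does not state which runs used -bootstrap and -global.

```python
from typing import List


def treemix_commands(input_file: str, out_prefix: str, root: str = "French",
                     max_edges: int = 7, replicates: int = 5,
                     block_size: int = 500) -> List[List[str]]:
    """Build one TreeMix argument list per (migration edges, replicate) pair."""
    commands = []
    for m in range(max_edges + 1):          # 0, 1, ..., max_edges
        for rep in range(1, replicates + 1):
            commands.append([
                "treemix",
                "-i", input_file,           # allele counts per population (gzipped)
                "-root", root,              # outgroup population
                "-k", str(block_size),      # SNPs per block, to account for LD
                "-m", str(m),               # number of migration edges
                "-global",                  # global rearrangement step
                "-bootstrap",               # resample SNP blocks for this run
                "-o", f"{out_prefix}.m{m}.rep{rep}",
            ])
    return commands
```

Each returned list could be passed to `subprocess.run` on a machine where TreeMix is installed; the sketch itself executes nothing.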

Runs of homozygosity

We estimated genomic autozygosity using PLINK v.1.90 [67] based on the high-density Illumina dataset. We required each ROH to contain at least 50 SNPs and to span at least 500 kilobases (--homozyg-snp 50 and --homozyg-kb 500). Two consecutive SNPs more than 100 kb apart (--homozyg-gap 100) were not merged into the same ROH. The remaining parameters were kept at their defaults: at least one SNP per 50 kb on average (--homozyg-density 50), a scanning window of 50 SNPs (--homozyg-window-snp 50), at most one heterozygous call per scanning window hit (--homozyg-window-het 1), and a hit rate of at least 0.05 across all scanning windows containing the SNP (--homozyg-window-threshold 0.05). We visualized the ROH distribution of each studied population as box plots using R v.3.5.2.
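As a rough illustration of how the segment-level thresholds act, the Python sketch below filters a list of candidate segments by SNP count, length, and SNP density, and then sums the surviving length per individual (the quantity one would typically show in a box plot). The `Segment` type and function names are hypothetical, and the sketch does not reproduce PLINK's sliding-window scan.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Segment(NamedTuple):
    sample: str        # individual identifier
    length_kb: float   # segment length in kilobases
    n_snps: int        # number of SNPs inside the segment


def passes(seg: Segment, min_snps: int = 50, min_kb: float = 500.0,
           max_kb_per_snp: float = 50.0) -> bool:
    """Apply the SNP-count, length, and density thresholds to one segment."""
    return (seg.n_snps >= min_snps
            and seg.length_kb >= min_kb
            and seg.length_kb / seg.n_snps <= max_kb_per_snp)


def total_roh_kb(segments: Iterable[Segment]) -> Dict[str, float]:
    """Sum the length of all passing segments for each individual."""
    totals: Dict[str, float] = defaultdict(float)
    for seg in segments:
        if passes(seg):
            totals[seg.sample] += seg.length_kb
    return dict(totals)
```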

Shared genetic drift and admixture signal estimation based on shared alleles

To measure the genetic affinity within HM people and between HM people and other geographically close modern populations, we computed outgroup f3-statistics using the qp3Pop program in ADMIXTOOLS [44]. As the merged HO dataset included the most comprehensive set of modern and ancient reference populations, we used f3(HM people, modern Eurasian; Yoruba) to explore the shared genetic drift between HM people and modern reference populations and f3(HM people, ancient Eurasian; Yoruba) to measure their genetic relationship with ancient reference populations. We also conducted the three-population tests based on the merged Illumina and 1240K datasets. Similarly, we computed admixture f3-statistics of the form f3(ancestral source1, ancestral source2; HM people) on the three datasets to identify possible pairs of ancestral sources that produce statistically significant values. Here, a negative f3 value with a Z-score lower than −3 indicated that the targeted population was admixed and that the two tested populations could serve as proxies for its ancestral sources.
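The logic of the admixture f3 test can be sketched in a few lines of Python. The sketch uses the plain allele-frequency product averaged over SNPs and a delete-one-block jackknife over equally sized contiguous blocks. It omits the finite-sample bias correction and the genetic-distance-based blocks used by ADMIXTOOLS, so it is an illustration of the statistic rather than a replacement for qp3Pop; all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def f3(target: Sequence[float], src1: Sequence[float], src2: Sequence[float]) -> float:
    """f3(src1, src2; target) as the mean of (c - a) * (c - b) over SNPs."""
    terms = [(c - a) * (c - b) for c, a, b in zip(target, src1, src2)]
    return sum(terms) / len(terms)


def f3_with_z(target: Sequence[float], src1: Sequence[float],
              src2: Sequence[float], n_blocks: int = 5) -> Tuple[float, float]:
    """Return (f3, Z) with a delete-one-block jackknife standard error."""
    if n_blocks < 2:
        raise ValueError("n_blocks must be at least 2")
    n = len(target)
    full = f3(target, src1, src2)
    bounds = [round(i * n / n_blocks) for i in range(n_blocks + 1)]
    leave_one_out: List[float] = []
    for i in range(n_blocks):
        # Indices of all SNPs outside block i.
        keep = list(range(0, bounds[i])) + list(range(bounds[i + 1], n))
        leave_one_out.append(f3([target[j] for j in keep],
                                [src1[j] for j in keep],
                                [src2[j] for j in keep]))
    mean = sum(leave_one_out) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum((x - mean) ** 2 for x in leave_one_out)
    se = math.sqrt(variance)
    z = full / se if se > 0 else float("nan")
    return full, z
```

A target whose frequencies lie between those of the two sources yields a negative value, which is the signal of admixture described above; whether the Z-score passes −3 depends on the number of SNPs and blocks.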

Genome-wide admixture models based on the f4-statistic tests

We conducted four-population tests for the targeted HM people based on individual and merged populations. We used qpDstat in ADMIXTOOLS [44] to compute f4(HM1, HM2; reference populations, Mbuti), f4(reference population1, reference population2; studied populations, Mbuti), and f4(reference population1, studied populations; reference population2, Mbuti). The first form was used to explore the genetic homogeneity and heterogeneity between two included HM populations. The latter two forms were used to test for differentiated genetic ancestry between our targeted and reference populations. We also used qpWave to confirm the genetic homogeneity between two HM-speaking populations and used qpAdm [44] to estimate admixture proportions with the following outgroups: Mbuti, Russia_Ust_Ishim, Russia_Kostenki14, Papuan, Australia, Mixe, Russia_MA1_HG, Onge, Atayal, and China_Tianyuan. We next used qpGraph to test frequency-based admixture models with gene flow events and to select the best-fitting model among the alternatives [44].
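How the sign of an f4 value is read in the first test form can be shown with a minimal Python sketch. It computes the uncorrected mean of allele-frequency difference products and leaves out the block jackknife that qpDstat uses for significance testing; the function name and the toy frequencies in the usage note are hypothetical.

```python
from typing import Sequence


def f4(a: Sequence[float], b: Sequence[float],
       c: Sequence[float], d: Sequence[float]) -> float:
    """f4(A, B; C, D) as the mean of (a - b) * (c - d) over SNPs."""
    terms = [(pa - pb) * (pc - pd) for pa, pb, pc, pd in zip(a, b, c, d)]
    return sum(terms) / len(terms)
```

With D as an outgroup such as Mbuti, a positive f4(HM1, HM2; reference, Mbuti) suggests that the reference shares more alleles with HM1, a negative value suggests more sharing with HM2, and a value near zero is consistent with HM1 and HM2 forming a clade relative to the reference.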

Admixture time estimation based on the decay of LD

Population admixture generates LD that decays exponentially with genetic distance. We used MALDER to test for admixture LD decay and to estimate the possible admixture times of HM people [97]. We used multiple modern northern and southern East Asian populations as potential ancestral sources and tested all possible source combinations. The exponential curve fitting was run with a minimum distance between two SNP bins of 0.005 Morgans (mindis: 0.005) and with leave-one-chromosome-out jackknifing (jackknife: YES).
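The relationship between the decay rate and the admixture time can be sketched as follows. Under a single-pulse model, weighted LD at genetic distance d (in Morgans) behaves approximately as A·exp(−n·d), where n is the number of generations since admixture. The Python sketch below fits this curve by least squares on the log scale, starting at the minimum distance. It assumes a single exponential with no affine term, whereas MALDER fits a richer model, and all names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_decay(distances: Sequence[float], ld: Sequence[float],
              mindis: float = 0.005) -> Tuple[float, float]:
    """Fit ld = A * exp(-n * d) using bins with d >= mindis (in Morgans).

    Returns (n, A): the decay rate in generations and the amplitude.
    Bins with non-positive LD are skipped because the fit uses log(ld).
    """
    points = [(d, math.log(y)) for d, y in zip(distances, ld)
              if d >= mindis and y > 0]
    if len(points) < 2:
        raise ValueError("need at least two usable distance bins")
    k = len(points)
    mean_d = sum(d for d, _ in points) / k
    mean_y = sum(y for _, y in points) / k
    sxx = sum((d - mean_d) ** 2 for d, _ in points)
    sxy = sum((d - mean_d) * (y - mean_y) for d, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_d
    return -slope, math.exp(intercept)
```

Multiplying the fitted n by an assumed generation time (for example, a value chosen by the analyst) converts the estimate into years; the generation time is not specified in this section.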

Haplotype-based fine-scale population structure reconstruction

Segmented haplotype estimation

A stricter filtering strategy of missingness per SNP and missingness per individual was performed using PLINK v.1.90 [67] with two parameters (--geno 0.01 and --mind 0.01). We used the Segmented HAPlotype Estimation & Imputation Tool (SHAPEIT v2.r904) [98] to estimate haplotypes based on the high-density Illumina dataset and modern populations included in the merged HO dataset. Phased haplotypes were estimated with the following parameters to find a good starting point for the estimated haplotypes and get more parsimonious graphs: the number of burn-in iterations of 10 (--burn 10), the number of iterations of the pruning stage of 10 (--prune 10), and the number of main iterations of 30 (--main 30). We used the default settings of model parameters and HapMap phase II b37 as the genetic map in the haplotype estimates. The obtained haplotype data was used to explore the fine-scale population structure via fineSTRUCTURE, identify ancestral proximity and estimate their admixture proportion and time, and screen the natural selection signatures for local adaptation.

Admixture events inferred from ChromoPainter and fastGLOBETROTTER

To identify ancestral sources, date, and describe admixture events of our targeted HM people, we used ChromoPainterv2 [71] to paint the ancestral haplotype composition of our sampled HM populations. We merged our data with 929 lift-over high-coverage whole-genomes from 54 worldwide ethnolinguistically diverse populations and obtained haplotype data using SHAPEIT v2.r904 [98]. Han people from ** and quality control

We extracted 24,047 Y-chromosomal SNPs and 3746 mitochondrial SNPs from the merged Illumina dataset to explore the paternal and maternal population history based on shared haplogroups and coalescence processes. We used PLINK v.1.90 to conduct quality control based on the missing SNP and genotyping rates with two parameters (--geno 0.1 and --mind 0.1) [107]. In the final quality-controlled dataset, we retained 11,369 Y-SNPs in 203 individuals and 1428 mtDNA SNPs for uniparental evolutionary history reconstruction.

Haplogroup classification, haplogroup frequency spectrum estimation, and clustering analysis

For Y-chromosome haplogroup classification, we used the Python script hGrpr2.py implemented in HaploGrouper [108] and Y-LineageTracker [109]. Two additional reference files were used in the HaploGrouper-based analysis: treeFileNEW_isogg2019.txt and snpFile_b38_isogg2019.txt. The Chip version was used in the Y-LineageTracker-based analysis. We also used this software to estimate haplogroup frequencies at different levels of the focused terminal lineages. HaploGrouper and HaploGrep were used to classify the maternal haplogroups.
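Estimating haplogroup frequencies at different levels amounts to truncating each terminal label to a given depth and counting. The Python sketch below illustrates this under the assumption that labels follow the alternating letter and number ISOGG-style nomenclature, so that each character is one level; mutation-based names or labels with special suffixes are not handled. The function name and example labels are hypothetical and are not taken from the study's data.

```python
from collections import Counter
from typing import Dict, Sequence


def haplogroup_frequencies(labels: Sequence[str], level: int) -> Dict[str, float]:
    """Frequency of each haplogroup after truncating labels to `level` characters."""
    truncated = [label[:level] for label in labels]
    counts = Counter(truncated)
    total = len(truncated)
    return {haplogroup: n / total for haplogroup, n in counts.items()}
```

For example, calling the function with level 1 collapses all labels to their major clade letter, while level 3 separates sub-branches such as "O2a" and "O1b".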

Phylogeny analysis and network analysis

We used BEAST2.0 [110] and Y-LineageTracker [109] to reconstruct the phylogenetic topology, focusing on population divergence, expansion, and migration events. BEAUti, Tracer v1.7.2, and FigTree v1.4.4 were used to prepare the intermediate files for the BEAST-based analysis and to visualize the resulting phylogeny. BEAST2.0 was also used to reconstruct the maternal phylogeny. Finally, we used the median-joining network method implemented in PopART [111] to reconstruct the network relationships among different haplogroups and populations based on the obtained maternal and paternal genetic variations.