Introduction

Over four years ago, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) emerged, causing the coronavirus disease 2019 (COVID-19) outbreak1. Currently, the widespread variants of concern (VOCs) are derived from the Omicron sub-variant BA.2, including XBB, XBB.1.5, BQ.1, BQ.1.1, BA.5.2 and BF.7. The unusually high number of mutations in the spike (S) proteins of these variants results in a sizeable antigenic shift from previous VOCs2,3,4. While BA.5.2 and BF.7 caused widespread breakthrough infections (BTIs) in China5, XBB and XBB.1.5 spread rapidly worldwide and accounted for approximately 90% of global prevalence6. The frequent emergence of new VOCs and the gradual waning of vaccine-induced immunity against the prototype strain make the current vaccine strategy inadequate for protecting against VOCs with different antigenicity. Consequently, the development of vaccines, antibodies and other prophylactic measures remains a challenging and pressing concern.

Meanwhile, long coronavirus disease (long COVID, or post-COVID conditions) has attracted considerable global attention. It refers to a failure to return to a usual state of health following acute COVID-19 illness, and includes signs, symptoms, and conditions that continue or develop after acute infection, such as dysfunction of major organs, including the liver, kidneys and cardiovascular system7,8. Long COVID can occur in individuals regardless of vaccination status, symptom presentation, or infection with the wild-type strain, as determined primarily through questionnaires8,9. Omicron infections result in fewer hospitalizations, less severe illness, and a higher rate of asymptomatic cases, which makes the related long COVID challenging to evaluate10,11,12. A recently published study showed that approximately 70% of patients with Omicron BA.2-related long COVID recover within one year after infection13. However, little is known about its mechanism and possible hidden pathological features.

It has been reported that the severity of long COVID is negatively correlated with vaccination status14,13,43.

Single-cell data analysis

Single-cell data were integrated and clustered using the Seurat R package (version 4) (https://satijalab.org/seurat/). A total of 124,541 cells were obtained from single-cell sequencing of the nine samples, and 108,306 cells remained after quality control. Cell quality control was conducted as follows: cells with a mitochondrial gene ratio exceeding 10% were removed, and only cells with gene numbers ranging from 500 to 4,500 and UMI numbers ranging from 800 to 16,000 were retained. The DoubletFinder R package (https://github.com/chris-mcginnis-ucsf/DoubletFinder) was used to remove potential doublets, and remaining potential doublets and marginalized cells were then removed manually based on known classic markers. The filtered data were then normalized and scaled, and principal component analysis was performed on the top 2,000 genes with the highest coefficients of variation. The Harmony R package (https://github.com/immunogenomics/harmony) and the anchor module of Seurat were used to remove batch effects between samples and groups before cell clustering. Based on the elbow point and the significance of the principal components, the top 30 PCs were selected for subsequent cell clustering, and different resolutions were tested to determine the cell clusters. Dimensionality reduction and visualization of single cells were performed using Uniform Manifold Approximation and Projection (UMAP).
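The cell-level quality-control thresholds described above can be illustrated with a minimal Python sketch. This is not the authors' Seurat code: the `CellQC` record, the `passes_qc` helper, the example barcodes and the choice of inclusive range bounds are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    """Hypothetical per-cell QC record (names are illustrative)."""
    barcode: str
    n_genes: int          # number of detected genes
    n_umis: int           # total UMI count
    mito_fraction: float  # fraction of UMIs from mitochondrial genes (0 to 1)


def passes_qc(cell, max_mito=0.10, gene_range=(500, 4500), umi_range=(800, 16000)):
    """Return True if the cell meets the thresholds stated in the text.

    Assumption: range bounds are treated as inclusive.
    """
    if cell.mito_fraction > max_mito:
        return False
    if not (gene_range[0] <= cell.n_genes <= gene_range[1]):
        return False
    return umi_range[0] <= cell.n_umis <= umi_range[1]


# Made-up example cells, not data from the study.
cells = [
    CellQC("AAAC-1", 1200, 3500, 0.04),   # passes all filters
    CellQC("AAAG-1", 300, 900, 0.02),     # too few genes
    CellQC("AACT-1", 2000, 6000, 0.15),   # mitochondrial ratio too high
    CellQC("AAGT-1", 4400, 17000, 0.05),  # too many UMIs
]
kept = [c.barcode for c in cells if passes_qc(c)]
print(kept)
```

Doublet removal, normalization and batch correction are separate steps handled by DoubletFinder, Seurat and Harmony and are not covered by this sketch.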

Cell type annotation

Using UMAP, all cells underwent dimensional reduction and were clustered in a two-dimensional space based on shared features. First, the Azimuth algorithm was used to map the data to a PBMC reference cell set; the mapping was then combined with genes specifically highly expressed in each cluster to determine cell types manually. Specifically, classic biomarkers for specific cell types were used to identify the cells in different clusters. The FindAllMarkers function in Seurat was used to identify the 50 most highly expressed genes in each cluster, and cell types were assigned based on these top genes and the literature. During the first and second rounds of clustering, clusters expressing two or more classic markers and marginalized cells were considered doublets and excluded from subsequent analysis.

Differential cell abundance analysis

The Milo algorithm62 was used to assign the cells of the control group and the BA.2-BTI-6m group to neighborhoods, calculate the differences in their distribution across neighborhoods, and map these neighborhoods to cell types. The key parameters for executing the Milo algorithm were k = 10 and d = 30. In addition, the proportion of each cell type in each sample was calculated as a conventional cell percentage, and differences between groups were assessed using the rank-sum test.
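The second, proportion-based comparison can be sketched in plain Python. The helper names `cell_type_proportions` and `rank_sum_u` and the proportion values are hypothetical; the sketch computes only the Mann-Whitney U statistic (with midranks for ties) and leaves the p-value to a statistics package. It does not reproduce the Milo neighborhood analysis.

```python
from collections import Counter


def cell_type_proportions(labels):
    """Fraction of cells assigned to each cell type within one sample."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cell_type: n / total for cell_type, n in counts.items()}


def rank_sum_u(x, y):
    """Mann-Whitney U statistic for sample x versus y, using midranks for ties."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    return rank_sum_x - len(x) * (len(x) + 1) / 2


# Made-up per-sample proportions of one cell type in each group.
control = [0.10, 0.12, 0.11]
bti = [0.20, 0.18, 0.25]
print(cell_type_proportions(["B", "B", "T", "NK"]))
print(rank_sum_u(control, bti), rank_sum_u(bti, control))
```

A U statistic of 0 for one group means every value in that group is smaller than every value in the other, which is the most extreme separation possible for the given sample sizes.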

Differential gene identification and functional analysis

The FindMarkers() function in the Seurat package was used to identify differentially expressed genes (DEGs) between distinct cell groups, using thresholds of |logFC| > 0.25 and FDR < 0.01. Only genes expressed in at least 25% of the cells of the control group or the infection group were considered as DEGs. The clusterProfiler R package was used for Gene Ontology and KEGG enrichment analyses and for visualization of DEGs.
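The selection criteria above amount to a simple filter on per-gene statistics. The following Python sketch assumes the test results are already available; the `is_deg` helper, the gene labels and the numbers are hypothetical and serve only to show how the three thresholds combine.

```python
def is_deg(log_fc, fdr, pct_control, pct_infected,
           min_abs_logfc=0.25, max_fdr=0.01, min_pct=0.25):
    """Apply the thresholds from the text to one gene.

    pct_control and pct_infected are the fractions of cells expressing
    the gene in each group (0 to 1).
    """
    expressed = pct_control >= min_pct or pct_infected >= min_pct
    return expressed and abs(log_fc) > min_abs_logfc and fdr < max_fdr


# Made-up rows: (gene label, logFC, FDR, pct in control, pct in infection group)
results = [
    ("GENE_A", 0.80, 0.001, 0.40, 0.60),   # passes
    ("GENE_B", -0.50, 0.005, 0.10, 0.30),  # passes (down-regulated)
    ("GENE_C", 0.10, 0.0001, 0.50, 0.50),  # effect size too small
    ("GENE_D", 1.20, 0.20, 0.50, 0.50),    # not significant
    ("GENE_E", 0.90, 0.001, 0.05, 0.10),   # expressed in too few cells
]
degs = [gene for gene, lfc, fdr, p1, p2 in results if is_deg(lfc, fdr, p1, p2)]
print(degs)
```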

Gene set activity score of individual cells

The AddModuleScore() function of Seurat was used to calculate the activity scores of different gene sets in single cells. The gene sets were sourced from the msigdb R package: antigen processing and presentation (hsa04520), JAK_STAT_signaling (hsa04630), B cell activation (GO:0042113), B cell receptor signaling (GO:0050853), positive regulation of Treg activity (GO:0045591), response to interferon (GO:0034341), protein processing (GO:0016485) and coagulation regulation (GO:0007597, GO:0050819, GO:0050820, GO:0050818). T cell cytotoxic activity was defined by the following gene set: PRF1, IFNG, GNLY, NKG7, GZMB, GZMA, GZMH, KLRK1, KLRB1, KLRD1, CTSW, and CST7. The tissue-specific gene sets based on proteomics were taken from Gutmann et al.23 and Li et al.24.
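The idea of a per-cell gene set score can be illustrated with a deliberately simplified Python sketch. It is not Seurat's AddModuleScore(), which subtracts the average of expression-matched control genes; here the background is simply every gene outside the set, and the `simple_module_score` helper and the expression values are hypothetical.

```python
from statistics import mean

# Cytotoxicity gene set as listed in the text.
CYTOTOXIC_GENES = [
    "PRF1", "IFNG", "GNLY", "NKG7", "GZMB", "GZMA",
    "GZMH", "KLRK1", "KLRB1", "KLRD1", "CTSW", "CST7",
]


def simple_module_score(expression, gene_set):
    """Mean expression of gene-set genes minus mean of all other genes.

    Simplification: the background is every gene outside the set, not
    expression-binned control genes as in Seurat.
    """
    members = set(gene_set)
    in_set = [v for g, v in expression.items() if g in members]
    background = [v for g, v in expression.items() if g not in members]
    if not in_set or not background:
        raise ValueError("need at least one gene inside and outside the set")
    return mean(in_set) - mean(background)


# Made-up normalized expression values for one cell.
cell = {"PRF1": 2.0, "NKG7": 3.0, "GZMB": 1.0, "ACTB": 1.0, "CD19": 0.0}
score = simple_module_score(cell, CYTOTOXIC_GENES)
print(score)
```

A positive score indicates that the gene set is expressed above the cell's background level; comparing scores between groups still requires the matched-control approach used by the real function.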

TCR/BCR analysis

Using human GRCh38 as the reference genome, the Cell Ranger vdj pipeline was used to identify TCR/BCR clonotypes and quantify VDJ gene expression. For TCR analysis, we retained only cells with at least one productive TCRα chain (TRA) or TCRβ chain (TRB). Where a cell had two or more TRA or TRB chains, we retained only the one with the highest expression. Clonotypes were defined based on their unique CDR3 amino acid sequences, and each unique TRA, TRB or TRA-TRB pair was defined as a clonotype. For BCR analysis, we retained only cells with at least one productive heavy chain (IGH) and one light chain (IGK/IGL). When a cell had two or more IGH or IGK/IGL chains, only the one with the highest expression was retained. Each unique IGH-IGK/IGL pair was defined as a clonotype. The scRepertoire R package (https://github.com/ncborcherding/scRepertoire) was used to analyze the single-cell immune repertoire and to calculate the clonal diversity of the samples based on the Shannon index. Based on the cell barcode information, TCR and BCR clonotypes were mapped onto the UMAP embedding of the cells.
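The clonotype definition and a diversity calculation can be sketched in Python. This is an illustrative assumption, not the scRepertoire implementation: the `clonotype_key` and `shannon_index` helpers and the CDR3 strings are hypothetical, and Shannon entropy (natural logarithm) is assumed as the diversity metric.

```python
import math
from collections import Counter


def clonotype_key(tra_cdr3, trb_cdr3):
    """Build a clonotype label from CDR3 amino acid sequences.

    Returns None when neither chain is available, so the cell is dropped.
    """
    if not tra_cdr3 and not trb_cdr3:
        return None
    return f"{tra_cdr3 or 'NA'}_{trb_cdr3 or 'NA'}"


def shannon_index(clonotypes):
    """Shannon entropy (natural log) of the clonotype frequency distribution."""
    counts = Counter(clonotypes)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Made-up cells: (TRA CDR3, TRB CDR3); sequences are placeholders.
cells = [
    ("CAVSDF", "CASSLG"),
    ("CAVSDF", "CASSLG"),  # same clonotype as the first cell
    ("CAARNT", "CASSPQ"),
    (None, "CASSYE"),      # only a TRB chain recovered
    (None, None),          # no productive chain: excluded
]
keys = [k for k in (clonotype_key(a, b) for a, b in cells) if k is not None]
diversity = shannon_index(keys)
print(len(keys), round(diversity, 4))
```

A sample dominated by a single expanded clone gives an index near 0, whereas a sample in which every cell carries a different clonotype gives the maximum value, the logarithm of the number of cells.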

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.