Introduction

The current coronavirus disease 2019 (COVID-19) pandemic, caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was initially declared a Public Health Emergency of International Concern in January 20201,2. Despite global efforts to interrupt transmission chains with quarantine, isolation, and travel restrictions at the onset of the pandemic, SARS-CoV-2 spread rapidly across the globe, creating a global pandemic and threat to human health worldwide. By November 15th, 2021, the World Health Organization (WHO) reported 253 million confirmed COVID-19 cases in 222 countries and over 5 million deaths3,4. SARS-CoV-2 reached all 50 states of the United States and associated territories, including Puerto Rico, by March 2020, after multiple introductions by travelers with infection5,6,7. The rapid spread across the United States was primarily propelled by interstate transmission chains and air travel to the associated territories6,8,9.

SARS-CoV-2 is an enveloped virus with a single-stranded positive-sense RNA genome of ~30,000 base pairs. During replication, a virus-encoded exonuclease provides a proof-reading activity that contributes to the observed low mutation rate and stable genome10,11. Nevertheless, the unprecedented spread of SARS-CoV-2 globally and the wealth of genomic sequence data available through the international initiative for genomic studies and surveillance has facilitated phylodynamic approaches to infer viral evolutionary rate, growth rate, and the estimated time of origin for specific outbreaks11. Studies have revealed that the viral genome has been accumulating mutations of concern, especially in the spike protein region, which confer phenotypes with increased fitness and pathogenicity12,13,14. Increased infectivity, resistance to monoclonal antibody therapy and evasion of the immune response were among the most frequently observed phenotypes attributed to WHO-monitored variants; these phenotypes often dominated transmission and replacement of other lineages upon emergence15,16,17,18. The Variant Being Monitored (VBM) B.1.1.7 (Alpha) was first identified in the United States in late December 2020 and was then characterized by a considerable increase in COVID-19 incidence associated with increased infectivity and occasionally more severe disease manifestations that increased hospitalization rates19,20. Alpha became the dominant variant, especially in Europe and the United States, until the emergence of Variant of Concern (VOC) B.1.617.2/AY.x (Delta), first identified in the United States in May 2021, which developed into a prominent variant with an apparent higher virulence and pathogenic phenotype21,22,23. Because of the potential for increased transmissibility, morbidity mortality, and decreased efficacy of vaccines and other intervention strategies, monitoring the spread of variants (VBMs and VOCs) rapidly became a public health concern and priority24,25.

Puerto Rico, an unincorporated territory of the United States, is a densely populated island and a popular tourist destination located in the Caribbean basin. SARS-CoV-2 was first identified in Puerto Rico on March 13th, 2020, in two European travelers who arrived on a cruise ship and in one local resident who had close contact with family members with recent travel history. Additional travel-related and local cases were confirmed within the following weeks26. In response to the emerging threat, the government of Puerto Rico executed the most restrictive (compared to the United States) national stay-at-home order on March 15th, 2020, to mitigate transmission while preparing the public health infrastructure for the imminent impact27,28. Travel restrictions imposed by the United States during the initial pandemic minimized international traffic to Puerto Rico, although domestic travel from the United States continued. Puerto Rico represents a unique epidemiologic setting in a geographically isolated location (an island), but with a regular influx of travelers mostly from the United States. This is an ideal setting to monitor introduction and spread of SARS-CoV-2 variants and answer questions to help inform SARS-CoV-2 spread and disease prevention strategies. Puerto Rico’s public health response incorporated extensive molecular surveillance to the increased laboratory capacity, which presented a unique opportunity to study the impact of SARS-CoV-2 variant turnover, local dissemination, and evolution during a period of changing epidemiology and public health responses.

In response to the impending local epidemic, we established a partnership with the local health authorities and academia to conduct a genomic surveillance initiative to sample complete genomes of SARS-CoV-2 across the island through time, monitor lineage circulation, and understand the genomic epidemiology of the COVID-19 pandemic in Puerto Rico. This report presents the results from 19 months of genomic surveillance and phylogenetic analyses, which identified multiple introduction events that propelled the rapid expansion and persistent transmission of the virus on the island and lead to the establishment of an autochthonous lineage between August 2020 and January 2021.

Methods

Epidemiological data

We retrieved the number COVID-19 cases reported by the Puerto Rico Department of Health (PRDH) from March 2020 to 30 September 2021 from the PRDH database dashboard on 1 December 2021 available here https://covid19datos.salud.gov.pr/. The collection of cases includes cases classified as confirmed (by molecular tests) or probable (by antigen tests) and plotted by date of sample collection.

Patient sample selection

Nasopharyngeal swab samples pre-selected for genomic surveillance were received from COVID-19 passive surveillance conducted by PRDH, the Ponce Health Sciences University (PHSU), and hospital-based acute febrile illness surveillance conducted by the Centers for Disease Control and Prevention (CDC) Dengue Branch. A total of 785 samples were collected from March 2020 to September 30th, 2020, from the seven health regions of the island, including 63 out of the 78 municipalities, and selection criteria included all samples with SARS-CoV-2 detected by reverse-transcriptase polymerase chain reaction (RT-PCR), viral load (CT < 28) and sufficient residual sample volume stored at −80 °C 29. All samples were de-linked from patient identifiable information and processed under the guidelines approved by the CDC and Ponce School of Medicine institutional review boards (IRB) protocol 6731, which waived the need for informed consent for sequencing of residual samples.

Lineage frequency analysis

The frequency of SARS-CoV-2 lineage detection in Puerto Rico was calculated using the total number of SARS-CoV-2 genomes published in the Global Initiative on Sharing All Influenza Data (GISAID) (https://www.gisaid.org) with collection dates ranging between March 1st, 2020 and September 30th, 2021. All complete genome sequences and metadata were retrieved from the GISAID database as of October 31st, 2021. The dataset was filtered for complete genome data, high-coverage data, and complete collection date for a final dataset of 2514 entries. Lineage assignment on GISAID was determined by the Phylogenetic Assignment of Named Global Outbreak Lineages (Pangolin)30,31. R with ggplot package was used to calculate lineage frequency and plot the graph focusing on the following lineages of interest: B.1.1.7 (Alpha), P.1 + P.1.1 (Gamma), B.1.588, Delta (B.1.617.2+AY.x), B.1.427 + B.1.429 (Epsilon), B.1.526 (Iota), B.1.621/1 (Mu), and all other Pangolin-designated B lineages grouped as Other. No genomes collected in May 2020 have been published in GISAID by October 31st, 2021.

Complete genome sequencing and assembly

Complete SARS-CoV-2 genome sequences were generated directly from clinical nasopharyngeal samples. Viral RNA was extracted from viral transport media using the automated MagNA Pure 96 system (Roche) with the MagNA Pure 96 DNA and Viral Nucleic Acid Small Volume Kit (Roche) following manufacturer-recommended protocols for 0.2 mL sample input volume and 0.1 mL RNA elution volume. MP96 external lysis buffer was used to pre-treat the samples for neutralization and assist the lysis process. First strand cDNA was synthesized with random hexamers using SuperScript IV reverse transcriptase (ThermoFisher), and tiling PCR amplicons were generated using Q5® high-fidelity DNA polymerase (New England Biolabs) and the ARTIC nCoV-2019 V3 primer scheme purchased from Integrated DNA Technologies (https://github.com/artic-network/artic-ncov2019/blob/master/primer_scheme/nCoV-2019/V3/nCov-2019.tsv). Candidate samples for sequencing presented clearly visible bands of target size (~400 bp) in DNA gel electrophoresis for both primer pools. PCR products were purified with AMPure XP magnetic beads (Beckman Coulter) and quantified using Qubit 4.0 fluorometer (ThermoFisher). DNA libraries were generated using the NEBNext Ultra II DNA Library Prep Kit for Illumina (New England Biolabs), reducing all reagents volumes to 25% from the manufacturer's recommended protocol to increase throughput. The resulting products were screened for size and quality using the Bioanalyzer 2100 instrument (Agilent Technologies) and quantified with Qubit 4 fluorometer (ThermoFisher). Qualifying libraries were pooled and run in the MiSeq sequencer instrument (Illumina) using the MiSeq Reagent Kit v3 in 600-cycle program.

The resulting sequence reads were screened for quality, trimmed, and assembled into complete consensus SARS-CoV-2 genomes using the Genome Detective Virus Tool v1.13632 (https://www.genomedetective.com) and assembly confirmed with iVar33. The Pangolin COVID-19 Lineage Assigner tool was used for lineage assignment34 (https://pangolin.cog-uk.io). A total of 753 samples were sequenced with more than 95% genome coverage at a minimum of 10x sequence depth. All sequence data obtained for this study was submitted to GISAID, accession numbers available in Supplementary Data 1.

Phylogenetic analysis

Our Puerto Rico SARS-CoV-2 genomes dataset was analyzed against a diverse panel of genomes from across the world which provide regional phylogenetic context. Initially, we downloaded the Genomic Epidemiology metadata package for all entries from GISAID on August 18th, 2021 to screen genomes for subsampling. However, due to the large number of genomes available in GISAID, we downloaded and combined the following pre-sampled datasets for regional studies: NextRegion-North America, NextRegion-South America, and NextRegion-Global. We then used the standard ncov augur/auspice multiple input workflow available in the Nextstrain platform35 (https://github.com/nextstrain/ncov) to subsample contextual genomes and phylogenetic inference with time-stamped trees. The custom subsampling scheme program selected 2611 contextual genomes from the United States, North America, the Caribbean, Central America, South America, Africa, Europe, Asia, and Oceania, with higher proportions from The Americas and selected based on collection dates and genetic proximity to our Puerto Rico dataset. The combined dataset of 3364 genomes was aligned using MAFFT36 and a global maximum likelihood (ML) phylogenetic inference was reconstructed with IQ-TREE37. The ncov workflow then transferred the ML tree to TreeTime38 for time calibration and ancestral state reconstruction of the tree topology at constate rate of 8 × 10−4 nucleotide substitutions per site per year. The resulting global ML tree was visualized with Nextstrain auspice35 and annotated with iTol for region of origin and emerging variants39. Subsampling from the Genomic Epidemiology metadata package retrieved in August and from the combined NextRegions produced phylogenetic inference trees with similar topologies. A list of all the sequences used in this study, including sequence labels and authors can be found in the “Data availability” section.

Selected lineages of interest were studied further by reconstruction of phylogenetic focus trees. For the B.1.588 lineage-focused tree, we selected all B.1.588 genomes published in GISAID by October 31st, 2021. Contextual B.1 lineage genomes were selected based on phylogenetic clustering near the base of the B.1.588 clade in the global tree and by temporal proximity to the date range of B.1.588 circulation between June 2020 and January 2021. Maximum likelihood phylogenetic trees were reconstructed with the resulting dataset of 239 genomes under the GTR + G + I nucleotide substitution model and 1000 bootstrap replicates using IQ-TREE v1.6.1237. The resulting tree topology and node support were compared to Bayesian maximum clade credibility (MCC) tree reconstruction using BEAST v1.10.454. The sampling for this study was also limited due to the ability of our partnership to procure samples from every municipality of the island, especially during the first year of the local epidemic. The availability of case metadata, such as travel history, was also limited, which would have facilitated an in-depth analysis on the impact of importations on the island. Future national genomic surveillance programs could benefit from improved systematic sampling engaging with clinical laboratories to ensure timely reporting of results to the local public health authorities, proper sample storage, and transfer to sequencing laboratories. Regarding phylogenetic analyses, the slow mutation rate and the low genetic diversity of the virus frequently impair the resolution of internal nodes with statistical support affecting the interpretation of phylogenetic histories and geographic origins. These analyses could also be affected by selecting context genomes from the unprecedented abundance of genomic data published in GISAID.

This study provides an overview of the COVID-19 epidemic in Puerto Rico during March 2020–November 2021 from the genomic epidemiology perspective. The documentation of an autochthonous lineage and dynamics of virus movement between the United States and Puerto Rico is important to inform prevention and surveillance efforts in both regions. Our phylogenetic study offers the genomic framework to understand the genomic changes occurring through time in the Puerto Rican viral population and elucidate the mutational landscape within this region. Furthermore, our ongoing genomic surveillance initiative will facilitate the study of SARS-CoV-2 intra-island phylodynamics and compare pre- and post-vaccination populations. Finally, this report highlights the importance of government and academic partnerships to respond to public health threats and the potential of systematic genomic surveillance to improve disease prevention and control.