Findings

Background

DNA sequencing technology is advancing so quickly that routine sequencing of whole human genomes is now within reach. This ability is likely to revolutionize the diagnosis and treatment of many human diseases and to further our understanding of human biology. An ideal DNA sequencing platform would provide the continuous sequence of each chromosome in a genome and enable the direct identification of all sequence variants. However, owing to technical limitations, current methods for sequencing large genomes generate reads that are typically shorter than 250 bp, with limited insert sizes that are usually less than 20 kbp [1]. The subsequent analysis of variation in a human individual generally starts from a re-sequencing strategy, that is, a strategy based on aligning short reads to a consensus reference sequence such as the Genome Reference Consortium human genome build 37 (GRCh37) [2, 3]. This approach has sufficient sensitivity and specificity to discover most single nucleotide polymorphisms (SNPs), small insertions (typically shorter than one fourth of the read length) and small deletions (typically shorter than half of the read length) in the genome, as well as some large deletions in non-repetitive sequences (where short-read alignment is less challenging than in repetitive sequences) [4, 5]. However, this approach is consistently biased against the identification of other forms of variation, such as large insertions, multiple nucleotide polymorphisms (MNPs), inversions, translocations and novel sequences, and against precise breakpoint resolution [3, 6].

The sequence complexity of structural variation in individual genomes and the imperfection of the human genome reference sequence make discovery by the re-sequencing approach challenging [7], despite the importance of these types of variation in defining genome structure and in disease aetiology [8]. These limitations have raised interest in a different approach to investigating human genome variation: first assemble the genome, then discover the variants by analysing the assembly-versus-assembly alignment [7]. An assembly encodes not only small variants but also large variants and is free of the artifacts present in the imperfect genome reference. Because the variants obtained from a de novo genome assembly are sequence-ready and resolved at nucleotide level, their ancestral state and formation mechanism can also be annotated. These features are known to be evolutionarily and pathologically important [9, 1

Glossary: NAHR, non-allelic homologous recombination; NHR, non-homologous recombination; TEI, transposable element insertion; VNTR, variable number of tandem repeats.