Background

Genotyping errors occur in large datasets. Errors can arise for many reasons, including data entry, technician, or assay errors. As we continue to genotype large numbers of microsatellite (MS) and single-nucleotide polymorphism (SNP) markers, we must consider the effect of these genotyping errors on our ability to detect genes. Although several previous studies have shown that genotyping error can reduce the power to detect linkage [1, 2], a comparison of the effect of genotyping error on MS marker maps vs. SNP marker maps has not been performed.

Genotyping errors can be divided into two types: those that result in Mendelian inconsistencies and those that do not. Genotyping errors that result in Mendelian inconsistencies can often be detected using a single marker, for example through the segregation of five or more alleles in a nuclear family. Linkage programs that detect and remove Mendelian-inconsistent genotyping errors are available (e.g., SIBPAL [3], SIMWALK2 [4], PEDCHECK [5]). However, many genotyping errors will be consistent with Mendelian inheritance. Methods to detect genotyping errors that are consistent with Mendelian inheritance include identification of tightly linked double recombinants, which relies on multipoint marker information [6], and distortions of Hardy-Weinberg proportions [7–11]. In addition, likelihood-based approaches are available that do not aim to detect and remove the genotyping errors, but instead incorporate the possibility of genotyping error into the computation of the likelihood [4, 12–14]. However, these methods typically identify fewer than 50% of the genotyping errors that are consistent with Mendelian inheritance.

The proportion of genotyping errors that are consistent with Mendelian inheritance, and thus less easily detectable using current methods, depends on the type of marker and the family structure being evaluated. In the extreme case, all genotyping errors are Mendelian consistent in situations with biallelic markers and sibship data without parents [12]. Although genotyping error rates may be lower for SNPs than MS markers on a per-marker basis, the much larger number of markers that will typically be used for SNP maps means that on an absolute scale there may be more error in SNP maps. What remains unclear is the effect of these errors on the linkage results. In this paper, we use the Genetic Analysis Workshop 14 (GAW14) simulated dataset to evaluate the effect of genotyping error on MS and SNP marker maps. We show that genotyping errors that are consistent with Mendelian inheritance are difficult to detect as double recombinants using multipoint information, and thus are likely to remain in the analysis. This may have a greater impact on SNP marker maps compared to MS marker maps, because SNP genotyping errors are more likely to be consistent with Mendelian inheritance and, as we show, MS errors are easier to detect with this method. In addition, we show that although the simulated SNP marker map resulted in greater power to detect true linkage signals, in the presence of genotyping error there was also an increase in the number of false-positive signals.

Methods

Simulation of data

Because we were interested in sib pairs, we used the GAW14 simulated data from the three nuclear family populations (Aipotu, Danacaa, and Karngar). We combined the three populations to increase our power and used all 100 replicates. Parental information was removed to evaluate sib pairs without known parental genotypes. MS marker maps were simulated with 7-cM marker spacing, and SNP marker maps were simulated with 3-cM marker spacing. The genetic model for affected status is described elsewhere. Random genotyping error was simulated at error rates of 0.14% and 2.8%, which were selected to represent typical error rates for SNP and MS datasets, respectively. To simulate genotyping error for the SNP data, genotypes were randomly chosen for replacement at the specified error rate, and one of the alleles was selected randomly and changed to the other allele. To simulate genotyping error for the MS data, genotypes and alleles were randomly selected for replacement as above; however, the allele was replaced by one of the alleles adjacent in size (i.e., either one more or one fewer repeat) to mimic laboratory conditions.
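The error model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the GAW14 analysis: the function names are hypothetical, SNP alleles are assumed to be coded 1 and 2, MS alleles are assumed to be coded as repeat counts, each genotype is assumed to be selected independently with probability equal to the error rate, and no check is made that the neighboring MS allele actually exists at the locus.

```python
import random


def add_snp_error(genotypes, error_rate, rng):
    """Return a copy of SNP genotypes with simulated error.

    genotypes: list of (allele1, allele2) tuples, alleles coded 1 or 2 (assumed coding).
    Each genotype is hit independently with probability error_rate (assumption).
    """
    out = []
    for g in genotypes:
        if rng.random() < error_rate:
            alleles = list(g)
            i = rng.randrange(2)         # pick one of the two alleles at random
            alleles[i] = 3 - alleles[i]  # change it to the other allele (1 <-> 2)
            g = tuple(alleles)
        out.append(g)
    return out


def add_ms_error(genotypes, error_rate, rng):
    """Return a copy of MS genotypes with simulated error.

    genotypes: list of (allele1, allele2) tuples, alleles coded as repeat counts.
    A selected allele moves to an adjacent size: one more or one fewer repeat.
    """
    out = []
    for g in genotypes:
        if rng.random() < error_rate:
            alleles = list(g)
            i = rng.randrange(2)
            alleles[i] += rng.choice((-1, 1))
            g = tuple(alleles)
        out.append(g)
    return out


rng = random.Random(14)  # fixed seed so the sketch is reproducible
snp_with_error = add_snp_error([(1, 2), (1, 1), (2, 2)], 0.0014, rng)
ms_with_error = add_ms_error([(10, 12), (8, 8)], 0.028, rng)
```

Because a flipped SNP allele always yields another valid biallelic genotype, every simulated SNP error in sibships without parents stays consistent with Mendelian inheritance, which is the situation the Background describes.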

Detection of Mendelian consistent genotyping error as double recombinants

Mendelian consistent genotyping error is not detectable based on information from a single marker. However, this type of genotyping error may appear as a double recombinant in multipoint analysis [15], which could be detected as a large change in the identity-by-descent (IBD) sharing on both sides of a particular location. In addition, we can identify which individual within the family had the genotyping error by identifying the common individual among all pairs of individuals with large changes in the IBD sharing. We obtained estimates of multipoint IBD sharing among siblings using GENIBD [3]. To detect double recombinants, we examined the difference in the estimates for sharing 0 and 2 alleles IBD between the current marker and the one preceding it (δpre) and the current marker and its successor (δpost) for each pair of individuals at each marker. If the absolute values of δpre and δpost exceeded the same predetermined cutoff, δ, for two or more sib pairs that included the same individual, the marker was deemed to be the site of a double recombinant for that individual. This means that, by our definition, double recombinants cannot be detected for pedigrees with a single sib pair or at the ends of the chromosome. The false-positive and false-negative rates were computed separately for the MS and SNP markers based on knowledge of the simulated errors. The Shannon information content (SIC) was computed using MLOD [3] to evaluate the error rates as a function of the SIC.
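The decision rule above can be illustrated with a short Python sketch. It is not the authors' implementation: the function name and data layout are hypothetical, and each sib pair's IBD sharing is assumed to be summarized as a single number per marker, since the text does not specify how the estimates for sharing 0 and 2 alleles are combined.

```python
from collections import defaultdict


def flag_double_recombinants(ibd, cutoff):
    """Flag (individual, marker index) sites that look like double recombinants.

    ibd: dict mapping a sib pair (sib_a, sib_b) within one family to a list of
         per-marker IBD sharing values (one summary value per marker; assumed).
    cutoff: the predetermined cutoff delta applied to both |delta_pre| and |delta_post|.
    """
    hits = defaultdict(int)  # (individual, marker index) -> number of flagged pairs
    for (sib_a, sib_b), values in ibd.items():
        # The first and last markers have no neighbor on one side, so skip them.
        for m in range(1, len(values) - 1):
            delta_pre = values[m] - values[m - 1]
            delta_post = values[m] - values[m + 1]
            if abs(delta_pre) > cutoff and abs(delta_post) > cutoff:
                hits[(sib_a, m)] += 1
                hits[(sib_b, m)] += 1
    # Require two or more flagged pairs that share the same individual.
    return {site for site, n_pairs in hits.items() if n_pairs >= 2}


# Toy family of three siblings; sibling 1 carries an error at marker index 2.
example = {
    (1, 2): [1.0, 1.0, 0.0, 1.0, 1.0],
    (1, 3): [0.0, 0.0, 1.0, 0.0, 0.0],
    (2, 3): [0.5, 0.5, 0.5, 0.5, 0.5],
}
flagged = flag_double_recombinants(example, cutoff=0.9)
```

In the toy family, only sibling 1 appears in two flagged pairs at marker index 2, so only that site is returned. A family with a single sib pair can never reach two flagged pairs for one individual, which matches the limitation noted in the text.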

Evaluation of power and type I error for MS and SNP marker maps

To evaluate the power and type I error rates, we performed model-free linkage analysis separately for the MS and SNP data using the w4 option in SIBPAL [3]. As implemented in SIBPAL, the Haseman-Elston method regresses a weighted combination of the squared trait difference and squared mean-corrected trait sum on the estimated proportion of alleles shared IBD. The weights are chosen to be proportional to the inverse of the residual variances of the squared differences and sums. To compute the power, we defined a true positive as a signal that exceeded the given threshold, and that had a peak location within 20 cM of the true location per the simulation answers. To evaluate the type I error rate, a false positive was defined as a signal that exceeded the given threshold, that had a peak location more than 40 cM from the true location of a disease gene, and that was more than 20 cM from any other peak location, i.e., there could be more than one false positive on a chromosome.
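As a rough illustration of these definitions, the following Python sketch classifies the linkage peaks on one chromosome. The function name, the input layout, and the use of a p-value threshold are assumptions made for the example. The rule that false positives must be more than 20 cM apart is implemented greedily, from the most significant peak downward, which is one possible reading of the text rather than a confirmed description of the original procedure.

```python
def classify_peaks(peaks, true_loci, threshold,
                   tp_window=20.0, fp_exclusion=40.0, fp_spacing=20.0):
    """Classify linkage peaks on a single chromosome.

    peaks: list of (position_cM, p_value) tuples for local peaks (assumed input).
    true_loci: positions (cM) of simulated disease loci on this chromosome.
    Returns (true_positive_found, number_of_false_positives).
    """
    # Keep peaks that pass the significance threshold, most significant first.
    significant = sorted((p for p in peaks if p[1] < threshold), key=lambda p: p[1])

    # True positive: a significant peak within tp_window cM of a true locus.
    detected = any(abs(pos - locus) <= tp_window
                   for pos, _ in significant for locus in true_loci)

    # False positives: significant peaks more than fp_exclusion cM from every
    # true locus and more than fp_spacing cM from each false positive already counted.
    counted = []
    for pos, _ in significant:
        if any(abs(pos - locus) <= fp_exclusion for locus in true_loci):
            continue
        if all(abs(pos - other) > fp_spacing for other in counted):
            counted.append(pos)
    return detected, len(counted)


peaks = [(105.0, 1e-6), (30.0, 1e-5), (40.0, 2e-5), (170.0, 1e-6), (130.0, 1e-6)]
result = classify_peaks(peaks, true_loci=[100.0], threshold=2.2e-5)
```

Here the peak at 105 cM counts as a true positive, the peak at 130 cM falls in the zone between 20 and 40 cM and is counted as neither, and the peaks at 30 and 40 cM are merged into one false positive because they lie within 20 cM of each other.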

Evaluation of likelihood-based approach to account for genotyping error

The loss in power that results from genotyping error can be minimized using likelihood-based approaches to account for the genotyping error in the linkage analysis [4, 12]. We incorporated this approach into SIBPAL by modifying the marker penetrances such that the probability that the observed genotype is the correct genotype is 1-ε, and the probability that it is any other genotype is ε/(n-1), where n is the total number of genotypes for that locus. We reanalyzed the SNP simulated data with GENIBD and SIBPAL using the correct error rate that was simulated (i.e., with ε = 0.0014 and 0.028), and evaluated the power and type I error rate as described above.
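The modified marker penetrance can be written out directly. The sketch below is a minimal Python illustration with a hypothetical function name; it shows only the error model itself, not how SIBPAL or GENIBD use it internally.

```python
def observed_genotype_prob(observed, true, n_genotypes, epsilon):
    """P(observed genotype | true genotype) under the uniform error model.

    The observed genotype is correct with probability 1 - epsilon; otherwise the
    error mass epsilon is spread evenly over the other n_genotypes - 1 genotypes.
    """
    if n_genotypes < 2:
        raise ValueError("the error model needs at least two possible genotypes")
    if observed == true:
        return 1.0 - epsilon
    return epsilon / (n_genotypes - 1)


# A biallelic SNP has n = 3 unordered genotypes.
snp_genotypes = ["1/1", "1/2", "2/2"]
row = [observed_genotype_prob(g, "1/2", len(snp_genotypes), 0.028)
       for g in snp_genotypes]
```

For a SNP at ε = 0.028, a true heterozygote is observed correctly with probability 0.972 and as either homozygote with probability 0.014 each, so the probabilities for each true genotype sum to 1.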

Results

Detection of genotyping error as double recombinants

For the MS marker map, on average 35.6% of the genotyping errors were detected as Mendelian inconsistencies and removed from the analysis, whereas none (0%) of the genotyping errors were identified as Mendelian inconsistencies for the SNP marker map. We then used sib-pair IBD sharing estimates to identify Mendelian consistent genotyping errors as double recombinants, detected as large changes in IBD sharing estimates from neighboring markers on both sides of a particular location. Counting both Mendelian inconsistencies and double recombinants, under the most stringent condition (δ = 0.9), 53.1% of the genotyping errors were detected for the MS marker map, whereas only 2.4% of the genotyping errors were detected for the SNP marker map. Figure 1 shows the true-positive and false-positive rates of this method averaged across all markers and all families, for both the MS and SNP marker maps with a 2.8% simulated error rate. In general, the method performed poorly, with high false-positive rates at low values of δ and low true-positive rates at high values of δ.

Figure 1

True positive and false positive rates of the method to detect double recombinants versus δ. True positive rates are represented by blue diamonds and false positive by pink squares. Rates are plotted versus δ (difference in IBD sharing between adjacent markers) for MS (open symbols) and SNP (closed symbols) marker maps.

The ability to detect Mendelian consistent genotyping errors may depend on factors such as the sibship size and the marker information content. Therefore, we explored the true- and false-positive rates of our method based on sibship size and family-specific SIC (Figure 2). As expected, Mendelian consistent genotyping error was more frequently detected under circumstances with more complete information (e.g., larger sibship sizes and higher SIC). However, even with the same SIC, Mendelian consistent genotyping error was more frequently detected by this method of finding double recombinants using the MS marker map (e.g., true-positive rate of 0.45 for SIC between 0.9–1.0) compared to the SNP marker map (e.g., true-positive rate of 0.20 for SIC between 0.9–1.0) with similar values for the false-positive rate.

Figure 2

True and false positive rates by SIC. True positive rates are represented by blue diamonds and false positive by pink squares. Rates are plotted by Family Specific Shannon Information Content (δ = 0.5) for MS (open symbols) and SNP (closed symbols) marker maps.

Effect of genotyping error on power and type I error

For all levels of genotyping error, the SNP marker map was more powerful than the MS marker map for detecting linkage (Table 1). However, the SNP marker map also had a higher false-positive rate in the presence of genotyping error compared to the MS marker map (Table 2), suggesting that a different threshold should be employed for declaring significant evidence of linkage for the two marker maps. Marker-specific genotyping error rates are expected to be lower for SNP markers compared to MS markers, so a fairer comparison might be to compare the power for the SNP marker map at the 0.14% error rate to the MS marker map at the 2.8% error rate. However, even under this circumstance the SNP marker map has slightly higher power to detect significant evidence of linkage (p-value < 2.2 × 10⁻⁵) compared to the MS marker map (75% vs. 67%), while the numbers of false-positive signals are more similar (5 for SNP and 3 for MS).

Table 1 True-positive rate (%)
Table 2 Number of false-positive signals

Genotyping error decreases the power to detect linkage for both types of marker maps. As shown in Table 2, the type I error rate dramatically increased as the genotyping error rate increased for the SNP marker map, while remaining fairly constant for the MS marker map. This implies that use of a two-stage approach (i.e., genome scan using MS markers and follow-up using SNPs) would be a better overall strategy in terms of reducing cost and effort for fine-mapping.

Results from the likelihood-based approach to account for genotyping error

After using the likelihood approach to account for genotyping error, we found that for the SNP marker map, there was very little improvement in either the true- or false-positive rate at the 0.14% error level. At the higher 2.8% error rate, while the true-positive rate remained fairly constant, there was a significant improvement in the false-positive rate. However, this error rate is probably much higher than is found in more recent genotyping.

Mendelian consistent genotyping errors were more easily detected with our method for the MS marker map than the SNP marker map, even at the same marker information content. This suggests that when the error is corrected or accounted for using the likelihood-based approach, a greater amount of power will be restored for the MS marker map than the SNP marker map.

Conclusion

The effect of genotyping error on linkage analysis for SNP vs. MS marker maps is similar when the power for the SNP marker map at the 0.14% error rate is compared to that of the MS marker map at the 2.8% error rate. Under this comparison, the SNP marker map has slightly higher power to detect significant evidence of linkage than the MS marker map, with similar false-positive rates.

To make a fair comparison between the MS and SNP performances of our method to detect double recombinants, we need to compare the true-positive rates at similar false-positive rates. A cutoff of δ = 0.9 for the MS map gives approximately the same false-positive rate (1.97%) as a cutoff of δ = 0.2 for the SNP map (1.70%). At these cutoffs, the true-positive rates are also very similar: 17.5% for MS and 16.9% for SNPs. From this, we conclude that our method for detecting double recombinants performs comparably for MS and SNP markers, but that achieving those error rates requires a different cutoff δ for each type of marker.

On the whole, our method to detect double recombinants resulted in more false positives than true positives. We conclude that it is not practical to detect genotyping errors as double recombinants through large changes in IBD sharing. Most of the situations that were identified through this method as a genotyping error were actually false positives. These results are consistent with the results of the method implemented in SIMWALK2 [4, 6]. Badzioch [6] reports a maximum error detection rate of 50% and notes that as many as 70% of the errors detected were false positives.

Another issue with this method of identifying double recombinants is that if a single recombinant occurs in both siblings, and the recombinant occurs on one side of a given marker in the first sibling and on the other side of that marker in the second sibling, the two siblings could in fact share an additional allele IBD for only that single marker. This method would have no way of distinguishing between a double recombinant and two single recombinants occurring in almost the exact same chromosomal region in two individual siblings and may account for some of the false positives that have been encountered. However, as the field moves toward using SNP marker maps, the marker density will increase and the chance of this happening will decrease. It would be interesting to look at the effect of the density of a map given the same type of markers on the false positive rate of this method.

Genotyping errors have always been a problem in linkage analysis. In this paper we evaluated error levels of 0.14% and 2.8%, which are within the bounds of realistic genotyping error rates [5, 16]. Methods to find and correct Mendelian consistent genotyping errors generally rely on detecting double recombinants and assuming them to be genotyping errors. Likelihood-based extensions to linkage analysis to maximize the power in the presence of undetected genotyping error have been made. As genotyping error rates shrink and the cost of genotyping decreases, the importance of dealing with genotyping error will fade. However, the current reality for many datasets is that there are higher undetected genotyping error rates. Error rates as low as 1% can significantly affect the power of multipoint linkage analysis [1, 2]. Attempting to remove them by looking for double recombinants via sharp changes in IBD sharing is not a reasonable solution.