Prediction of a deletion copy number variant by a dense SNP panel
© Kadri et al; licensee BioMed Central Ltd. 2012
Received: 9 June 2011
Accepted: 23 March 2012
Published: 23 March 2012
Copy Number Variation (CNV), a recently recognized type of genetic variation, has been detected in mammalian genomes, including the cattle genome. This form of variation can potentially cause phenotypic variation. Our objective was to determine whether dense SNP (single nucleotide polymorphism) panels can capture the genetic variation due to a simple bi-allelic CNV, with the prospect of including the effect of such structural variations in genomic predictions.
A deletion type CNV on bovine chromosome 6 was predicted from its neighboring SNP with a multiple regression model. Our dataset consisted of CNV genotypes of 1,682 cows, along with 100 surrounding SNP genotypes. A prediction model was fitted considering 10 to 100 surrounding SNP and the accuracy obtained directly from the model was confirmed by cross-validation.
Results and conclusions
The accuracy of prediction increased with an increasing number of SNP in the model and the predicted accuracies were similar to those obtained by cross-validation. A substantial increase in accuracy was observed when the number of SNP increased from 10 to 50 but thereafter the increase was smaller, reaching the highest accuracy (0.94) with 100 surrounding SNP. Thus, we conclude that the genotype of a deletion type CNV and its putative QTL effect can be predicted with a maximum accuracy of 0.94 from surrounding SNP. This high prediction accuracy suggests that genetic variation due to simple deletion CNV is well captured by dense SNP panels. Since genomic selection relies on the availability of a dense marker panel with markers in close linkage disequilibrium to the QTL in order to predict their genetic values, we also discuss opportunities for genomic selection to predict the effects of CNV by dense SNP panels, when CNV cause variation in quantitative traits.
A recently recognized source of genomic structural variation, called Copy Number Variation (CNV), is gaining interest in genomic studies. It is defined as a DNA segment that is 1 or more kb long and is present at a variable copy number in comparison with a reference genome. CNV have been shown to be functionally active in humans. They are responsible for phenotypic changes by altering gene dosage, disturbing coding sequences and perturbing long-range gene regulation. With the discovery of CNV in the cattle genome [3–5] and their potential to cause variation in economically important traits, capturing the effects of CNV and other complex genotypes on phenotype becomes an important factor in the prediction of genetic values.
The aim of this study was to investigate whether a simple deletion CNV can be predicted from dense SNP genotyping data using a multiple regression approach, which, if successful, implies that genetic variation due to this deletion CNV can be predicted in an automated manner by dense SNP genotyping. To this end, we report the linkage disequilibrium (LD) of a bi-allelic deletion type of CNV (the locus varying in copy number, 2 = normal and 1 = deletion) with surrounding SNP to determine whether SNP can predict this simple CNV. A model to predict CNV from surrounding SNP is developed and its accuracy is tested by cross-validation. Prediction of CNV with high accuracy would eliminate the need for explicit detection and genotyping of simple CNV. Our approach is general and can be extended to more complex CNV, but estimation of the prediction accuracy of more complex CNV is outside the scope of this paper.
Genotypic data on SNP and CNV
The SNP and CNV genotypes for dairy cattle were provided by the Milk Genomics Project conducted at Wageningen University, The Netherlands. In the project, 2,000 Holstein Friesian cows (belonging to five large sire families with about 200 daughters and 50 small sire families with about 20 daughters) were genotyped for 50 000 SNP on the Illumina Infinium platform, using a custom array described by Charlier et al. The 2,844 SNP genotypes on bovine chromosome 6 with a median interval of 18 kb were used for CNV detection. Two algorithms, PennCNV (2008 Nov19 version) and cnvPartition (v1.2.0, a plug-in of BeadStudio version 3; Illumina Inc.), were used with default settings for the detection of CNV. In total, 476 samples showed CNV regions with PennCNV and 245 samples with cnvPartition.
Samples with missing genotypes were excluded from the analysis. This resulted in a dataset of CNV and SNP genotypes for 1,682 individuals, of which 263 carried the deletion. The dataset can be accessed in Additional file 1.
Ten-fold cross-validation was carried out to test the predictive accuracy of the model. The data were randomly split into ten non-overlapping subsets. The data from nine subsets were used to fit the model with 10, 20, 30, ..., 100 SNP. The estimated SNP effects were then used to predict the copy number in the remaining tenth subset, which was excluded from the model fitting. This procedure was repeated for each of the 10 subsets, so that a prediction for every record was obtained once whilst it was excluded from the estimation model. The correlation between the predicted and observed copy number was calculated for each sample as a measure of accuracy (accuracy estimated by cross-validation; ACV) and was used to obtain the prediction accuracies with different numbers of SNP fitted in the model.
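The cross-validation procedure described above can be sketched in plain Python. This is a minimal illustration, not the SAS® analysis used in the study: the function names (`solve`, `fit_ols`, `cross_validated_accuracy`) are hypothetical, the regression is an ordinary least-squares fit via the normal equations, the sketch assumes that no two SNP columns are perfectly collinear, and it pools all held-out predictions into a single correlation, which is one possible reading of how ACV was computed.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_ols(X, y):
    """Least-squares coefficients [intercept, b1, ..., bk] (assumes full-rank X)."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(XtX, Xty)

def predict(beta, x):
    """Predicted copy number for one record of SNP genotype codes."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

def cross_validated_accuracy(X, y, n_folds=10, seed=0):
    """Correlation between observed values and out-of-fold predictions."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)                 # random split of records
    folds = [idx[f::n_folds] for f in range(n_folds)]  # non-overlapping subsets
    y_hat = [0.0] * len(y)
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        beta = fit_ols([X[i] for i in train], [y[i] for i in train])
        for i in test:                               # each record predicted once
            y_hat[i] = predict(beta, X[i])
    return pearson(y_hat, y)
```

Calling `cross_validated_accuracy(X, y)` with `X` holding the genotype codes of the 10, 20, ..., 100 nearest SNP in turn would give one accuracy per panel size, under the assumptions stated above.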
The linkage disequilibrium (LD) plot for the SNP and CNV was generated using HaploView (http://www.broadinstitute.org/haploview/haploview). The CNV was encoded as an SNP, as described elsewhere: AT for deletion and TT for no deletion in the input file. The LD plot was also used to identify a "disconnected SNP" (dSNP, S173; see next section) that fell outside the tightly linked haplotype blocks, as defined by HaploView.
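The encoding of the CNV as a pseudo-SNP can be illustrated with a short sketch. The function names and the example genotypes below are hypothetical, and the r² shown is the squared correlation between allele counts, a simple genotype-based approximation rather than the haplotype-based estimate that HaploView reports.

```python
def encode_cnv_as_snp(copy_number):
    """Map copy number (1 = deletion, 2 = normal) to a pseudo-SNP genotype."""
    codes = {1: ("A", "T"), 2: ("T", "T")}   # AT = deletion, TT = no deletion
    return codes[copy_number]

def count_allele(genotype, allele):
    """Number of copies of `allele` in a two-allele genotype."""
    return sum(1 for a in genotype if a == allele)

def genotype_r2(x, y):
    """Squared Pearson correlation between two allele-count vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Hypothetical example: eight cows, one CNV and one neighbouring SNP.
copy_numbers = [2, 2, 1, 2, 1, 2, 2, 1]
snp_genotypes = [("G", "G"), ("G", "G"), ("C", "G"), ("G", "G"),
                 ("C", "G"), ("C", "G"), ("G", "G"), ("C", "G")]

cnv_counts = [count_allele(encode_cnv_as_snp(c), "A") for c in copy_numbers]
snp_counts = [count_allele(g, "C") for g in snp_genotypes]
r2 = genotype_r2(cnv_counts, snp_counts)
```

Treating the deletion as the "A" allele lets standard SNP tools handle the CNV without any special case, which is presumably why this encoding was chosen.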
Prediction of dSNP
To compare the prediction accuracy of the CNV by surrounding SNP with that of a single SNP, an SNP was predicted using the same model. Since the majority of the SNP were in tight LD blocks with r² ~ 1, the dSNP S173, which fell outside these blocks, was used for this comparison: its genotypes were included in the "y" vector and models (1) were fitted in SAS®. The R² of the dSNP prediction models were compared with those of the CNV prediction models.
Linkage disequilibrium plot
Prediction of the CNV
Including even more SNP resulted in little increase in the value of R², which reached a maximum value of 0.914 with 100 SNP. Since the curve was very flat between 50 and 100 SNP, we expect limited further increases in R² by extending the SNP panel beyond 100 SNP.
Prediction of dSNP
In this study, we have investigated the prediction of a CNV from surrounding SNP typed on a custom Illumina Infinium 50 k BeadChip, using a multiple regression model. Although the investigated CNV was discovered using specific CNV detection algorithms, we used it as a model for other, currently unknown CNV. We have assessed the accuracy with which an unknown CNV genotype can be predicted by predicting a common deletion CNV genotype using the surrounding SNP. The investigated CNV was a rather common large deletion CNV of 233 kb. We did not find any SNP in perfect LD (r² = 1), contrary to previous studies that reported strong LD for deletions with nearby SNP [15, 16]. However, it was possible to predict the CNV with a high accuracy (0.94) by combining information from 50 or more flanking SNP in a multiple regression model. This accuracy was confirmed by cross-validation. If this result proves to be general, it can be concluded that the presence of a bi-allelic deletion type CNV and, in case it has causative effects, its related phenotypic effects, may be estimated by dense SNP genotyping with a high accuracy.
The in silico CNV detection algorithms used in the present study showed ambiguity in mapping the breakpoints (Figure 1). Seventy-seven of the CNV (out of 263 samples harboring a CNV in the 53 Mb region) started at position 53,535,915 on chromosome 6 (~54 kb downstream relative to the most common variant) and 14 CNV started at position 53,605,836, ~124 kb downstream (Figure 1). Thus, it is possible that there are multiple distinct CNV in this region. Similarly, there is an alternative CNV endpoint 5 kb upstream of the common CNV endpoint. This uncertainty in breakpoints might explain why we failed to find an SNP in perfect LD with the deletion region, although the CNV genotype calls seemed accurate since they showed Mendelian inheritance. Confirming the nature of the deletion and fine-mapping the CNV boundaries may help to detect better tag SNP for this region.
Perhaps a more likely reason for the relatively low LD between the CNVR and its surrounding SNP is the relatively large distance between the CNVR and the closest SNP, which may be general for CNV due to the often low SNP coverage in CNV regions (as shown in Figure 2). The first SNP downstream from the CNV was 39 kb away, whereas upstream, the first SNP was 145 kb away, which is far greater than the median SNP to SNP distance of 18 kb. Studies [15, 16] that report SNP in strong LD with CNV use a denser SNP map and obtain perfect LD for nearby SNP. Thus, with the next generation SNP chips (containing ~700 k SNP), we expect to predict the CNV more accurately with fewer SNP. However, it is difficult to reliably detect SNP in CNV regions because of the genomic complexity that is generally found in the deleted or duplicated regions and the resulting low reliability of the reference sequence.
The accuracy of the model, as estimated by cross-validation, was high. The cross-validation accuracies were only slightly lower than those predicted by the statistical model. Thus, given a sufficiently big training data set, the model proved to be reliable for future predictions of deletion type CNV from SNP data.
A small increase in accuracy was observed for CNV prediction, when increasing from 50 to 100 SNP (Figure 5). The increase was much smaller for cross-validation accuracy than predicted by the model. Thus, the increase in R² (and AM) when increasing from 50 to 100 SNP is to a large extent due to over-fitting of the data by the model, and hardly results in a real increase in R² beyond 50 SNP. This suggests that the LD might be decreasing at distances >500 kb since the 50 SNP used in the model were within 500 kb from the CNV (Figure 2). This is consistent with a previous study that reported LD in eight breeds of cattle and showed that the LD between pair-wise loci drops to background LD level at a distance of 500 kb.
We compared the predictability of the CNV with that of a 'disconnected' SNP (dSNP), S173. When using information from 40 or fewer flanking SNP, the SNP was predicted more accurately than the CNV. When including more SNP, the two models showed a similar R² pattern. Hence, it may be concluded that the predictability of a simple bi-allelic CNV follows the predictability of a dSNP when information from many (>50) SNP is used. This, and the fact that the accuracies of the predictions of the CNV and S173 are almost identical with >50 SNP, suggests that both the CNV and dSNP may be on an extended haplotype that is predicted by the SNP with an accuracy of 0.94.
In this study, we have shown that a simple deletion CNV can be predicted with a high accuracy from neighboring SNP using a multiple regression approach. This suggests that dense SNP panels can capture the effects of this type of CNV. However, our study was limited to one large common deletion type CNV that was detected using CNV detection algorithms from SNP data, and a 50 K SNP chip that was solely targeted at SNP genotyping and generally has a poor coverage of CNV regions (as shown in Figure 2). Thus, although our approach is general, further studies are needed to investigate whether similar accuracies can be attained for other, more complex types of CNV.
Genomic selection relies on dense markers that jointly are in sufficiently high LD with (unknown) QTL, so that the effect of the QTL is accurately predicted by the sum of the SNP effects. This situation closely resembles our prediction of the CNV, in cases where the CNV causes quantitative trait genetic variation and its position is not known. With a CNV of unknown position, we could not have selected the 100 nearest SNP, and we would have had to rely on all ~50 000 genome-wide markers to predict the CNV. Thus, in this case, the number of SNP effects would greatly exceed the number of records, which is known as the k>>n problem in statistics. Genomic selection deals with this problem by using informative prior distributions for the SNP effects. The accuracy of 0.94 found here is thus an upper bound for the accuracy of prediction of breeding values for a quantitative trait by the genomic selection approach, when the quantitative trait is affected by the current CNV, possibly along with other CNV that can be predicted with similar accuracy, and by environmental effects. The prediction accuracy of 0.94 is an upper bound, because the k>>n problem may not be completely resolved by the prior distribution of SNP effects and the environmental effects reduce the accuracy of the estimates of the SNP effects relative to those in our study. Both these problems can be overcome by increasing the number of records, in which case the accuracy of genomic selection will approach this upper bound. A similar maximum accuracy of genomic selection was suggested by the result of Daetwyler. With recent studies providing further evidence that CNV are associated with complex diseases in humans, designing genotyping chips with CNV probes may be important to increase the accuracy from the current ~90% towards 100% and thus to capture all genetic variation.
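The k>>n situation discussed above can be made concrete with a small sketch. It uses ridge regression, which corresponds to assuming a common normal prior on all SNP effects; this is one simple choice of informative prior and not necessarily the one used in any particular genomic selection method. The function names and the shrinkage parameter `lam` are hypothetical, and the value of `lam` would have to be chosen or estimated in practice.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ridge_fit(X, y, lam):
    """Ridge estimates of SNP effects: solve (X'X + lam*I) b = X'y on centred data.

    With lam > 0 the system has a unique solution even when the number of
    SNP (k) exceeds the number of records (n).
    """
    n, k = len(X), len(X[0])
    col_means = [sum(r[j] for r in X) / n for j in range(k)]
    y_mean = sum(y) / n
    Xc = [[r[j] - col_means[j] for j in range(k)] for r in X]
    yc = [v - y_mean for v in y]
    A = [[sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    rhs = [sum(r[i] * v for r, v in zip(Xc, yc)) for i in range(k)]
    b = solve(A, rhs)
    intercept = y_mean - sum(bj * mj for bj, mj in zip(b, col_means))
    return intercept, b

def ridge_predict(intercept, b, x):
    """Predicted value for one record of SNP genotype codes."""
    return intercept + sum(bj * xj for bj, xj in zip(b, x))
```

Without the `lam` term, the matrix X'X is singular whenever k exceeds n, so ordinary least squares has no unique solution; the shrinkage term restores uniqueness at the cost of biasing the SNP effects towards zero.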
This study was based on data collected in the Milk Genomics Initiative of the Animal Breeding and Genomics Centre of Wageningen University, funded by NZO (Dutch Dairy Association, Zoetermeer, the Netherlands), the cattle improvement company CRV (Arnhem, the Netherlands), and the technology foundation STW (Utrecht, the Netherlands). The first author carried out this research as part of the European Master in Animal Breeding and Genetics. THEM gratefully acknowledges funding from the European Community's FP7/2007-2013 under grant agreement no. 222664 ("Quantomics"). This article reflects only the author's views and the European Community is not liable for any use that may be made of the information contained herein.
1. Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nat Rev Genet. 2006, 7: 85-97.
2. Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird CP, de Grassi A, Lee C: Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science. 2007, 315: 848. 10.1126/science.1136678.
3. Matukumalli LK, Lawley CT, Schnabel RD, Taylor JF, Allan MF, Heaton MP, O'Connell J, Moore SS, Smith TPL, Sonstegard TS: Development and characterization of a high density SNP genotyping assay for cattle. PLoS One. 2009, 4: e5350. 10.1371/journal.pone.0005350.
4. Fadista J, Thomsen B, Holm LE, Bendixen C: Copy number variation in the bovine genome. BMC Genomics. 2010, 11: 284. 10.1186/1471-2164-11-284.
5. Liu GE, Van Tassell CP, Sonstegard TS, Li RW, Alexander LJ, Keele JW, Matukumalli LK, Smith TP, Gasbarre LC: Detection of germline and somatic copy number variations in cattle. Animal Genomics for Animal Health Dev Biol. Edited by: Pinard M-H, Gay C, Pastoret PP, Dodet B. 2008, Basel: Karger, 132: 231-237.
6. Schopen G, Koks P, Van Arendonk J, Bovenhuis H, Visker M: Whole genome scan to detect quantitative trait loci for bovine milk protein composition. Anim Genet. 2009, 40: 524-537. 10.1111/j.1365-2052.2009.01880.x.
7. Charlier C, Coppieters W, Rollin F, Desmecht D, Agerholm JS, Cambisano N, Carta E, Dardano S, Dive M, Fasquelle C: Highly effective SNP-based association mapping and management of recessive defects in livestock. Nat Genet. 2008, 40: 449-454. 10.1038/ng.96.
8. Wang K, Li M, Hadley D, Liu R, Glessner J, Grant SFA, Hakonarson H, Bucan M: PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 2007, 17: 1665. 10.1101/gr.6861907.
9. Illumina Inc: DNA copy number and loss of heterozygosity algorithms. [http://www.illumina.com/Documents/products/technotes/technote_cnv_algorithms.pdf]
10. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W: Global variation in copy number in the human genome. Nature. 2006, 444: 444-454. 10.1038/nature05329.
11. Elsik CG, Tellam RL, Worley KC: The genome sequence of taurine cattle: a window to ruminant biology and evolution. Science. 2009, 324: 522.
12. SAS® (SAS® 9.1 software; SAS® Institute Inc., Cary, NC).
13. Hastie T, Tibshirani R, Friedman J: The elements of statistical learning: data mining, inference and prediction. Springer Series in Statistics, New York, USA, 2nd Edition.
14. Barrett J, Fry B, Maller J, Daly M: Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005, 21: 263. 10.1093/bioinformatics/bth457.
15. McCarroll SA, Hadnott TN, Perry GH, Sabeti PC, Zody MC, Barrett JC, Dallaire S, Gabriel SB, Lee C, Daly MJ, Altschuler DV: Common deletion polymorphisms in the human genome. Nat Genet. 2006, 38: 86-92. 10.1038/ng1696.
16. Hinds DA, Kloek AP, Jen M, Chen X, Frazer KA: Common deletions and SNPs are in linkage disequilibrium in the human genome. Nat Genet. 2006, 38: 82-85. 10.1038/ng1695.
17. McKay S, Schnabel R, Murdoch B, Matukumalli L, Aerts J, Coppieters W, Crews D, Neto E, Gill C, Gao C: Whole genome linkage disequilibrium maps in cattle. BMC Genet. 2007, 8: 74.
18. Daetwyler H: Genome-wide evaluation of populations. PhD thesis. 2009, Wageningen Universiteit, ISBN 978-90-8585-528-6.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.