Data
We used targeted CD163 exon sequence data and whole-genome sequence data from pigs in the Pig Improvement Company (PIC) breeding programme. This programme contains a diverse collection of genetics, which represent broadly used populations, including animals of Large White, Landrace, Duroc, Hampshire and Pietrain heritage. The targeted sequencing included DNA samples from 35,000 pigs, which were previously collected from 2011 to 2016 as part of the breeding programme.
To put the targeted exon sequence data in a genomic context and compare it to genomic distributions of selection and constraint, we used whole-genome sequence data from three lines of pigs of the PIC breeding programme. These lines were also sampled in the targeted exon sequencing. We used 1146 individuals from Line 1, sequenced at various coverages. Eighty-four individuals were sequenced at 30X coverage, 11 at 10X, 45 at 5X, 561 at 2X, and the remaining 445 at 1X. Individuals and their sequence coverages were chosen with the AlphaSeqOpt algorithm [13, 14], and we added sires that contributed a large proportion of the progeny in Line 1 that were genotyped as part of the routine breeding activities of PIC. AlphaSeqOpt uses phased genotype data, which in our case consisted of single nucleotide polymorphism (SNP) genotypes from a 60 K or 15 K SNP chip, which were phased with AlphaPhase [15]. The AlphaSeqOpt algorithm consists of several steps. First, we identified focal individuals that had a genome representing the haplotype diversity of the population as much as possible, and allocated a fixed sequencing budget to the families of the focal individuals in order to maximise phasing accuracy proportionate to the population haplotype footprint of the focal individual. Then, we identified individuals that carried underrepresented haplotypes to maximise the number of haplotypes that were sequenced at sufficiently high coverage for accurate imputation. We also used 408 individuals from Line 2 and 638 individuals from Line 3, which were all sequenced at 2X coverage. These individuals were sires that contributed a large proportion of the progeny in Lines 2 and 3 that were genotyped as part of the routine breeding activities of PIC.
Targeted sequencing of CD163
We used a hierarchical pooling strategy to sequence CD163 exons in many individuals cost-effectively. Using sequence capture, we could target the sequencing effort to the CD163 exons only, and thus fit many animals into the same lane of an Illumina sequencer. The pooling allowed us to use fewer targeted capture reactions, while retaining the ability to go back to the original plate and individually sequence animals for validation.
Therefore, we pooled 96 DNA samples each into one combined DNA sample and constructed a shotgun sequencing library using the ThruPLEX Tag-seq kit from Rubicon Genomics. This kit incorporates unique molecular identifiers that allow a consensus sequence to be generated from reads originating from the same molecule, and thus reduces the impact of sequencing errors. Twenty-four such barcoded libraries were combined and used as input into a sequence capture reaction with baits that were designed against the CD163 exons (Arbor Biosciences, Ann Arbor, MI). Then, the product of the library capture was used to generate 2 × 150-bp reads on an Illumina MiSeq sequencer. This pooling scheme allowed us to sequence up to 2304 samples per sequencing run. In total, 35,808 animals were sequenced using this scheme.
We aligned reads with BWA (v 0.7.15-r1140) [16], using the BWA-MEM algorithm, against the 10.2 version of the pig genome to which we added a 33-kb contig representing the CD163 genomic region that was missing from this version of the reference genome. The coding sequence of CD163 on this contig is identical to the sequence that is annotated as CD163 in the version 11.1 of the pig reference genome. We used Connor (https://github.com/umich-brcf-bioinf/Connor) to call consensus sequences from reads with the same unique molecular identifier and called variants from these consensus alignments using the LoFreq variant caller [17]. We used snpEff [18] to classify the variants as synonymous, nonsynonymous, stop-gain and frameshift indel variants.
Validation of potential knock-out variants
Potential stop-gain variants detected in the pooled targeted sequencing data were pursued for validation by sequencing individual animals, i.e. we went back to the pools in which the variants were detected and sequenced amplicons of the appropriate exons from all the DNA samples that made up the pool with individual barcodes on the MiSeq. None of the potential stop-gain variants were validated in the individual sequencing. We tested five potential frameshift indel variants in the same way, and none of these were validated in the individual sequencing either.
Whole-genome sequence data processing
We aligned reads to the pig genome (Sscrofa11.1) with BWA-MEM [16], removed duplicates with Picard (https://broadinstitute.github.io/picard/index.html), and called variants with the GATK HaplotypeCaller [19]. We filtered and processed variant call format files with VCFtools [20]. We used the Ensembl variant effect predictor [21] to find the protein-coding SNPs, and classify them into synonymous and nonsynonymous SNPs based on the Ensembl gene annotation [22] version 90. We downloaded variants in CD163 from the Ensembl variation database.
Residual variant intolerance score
The residual variant intolerance score [23] measures gene-level tolerance to mutations by counting segregating variants. To calculate the residual variant intolerance, we counted the number of nonsynonymous variants and the total number of variants in each gene, and calculated the studentised residual of the regression between the number of nonsynonymous variants and the total number of variants. We included variants that segregated in at least one of the three lines. We applied the residual variant intolerance score both at the level of the gene, and at the level of the protein domain [24], using protein domains that were found by identifying Pfam profiles in Ensembl protein sequences with PfamScan [25]. All gene-level analyses were performed on the principal transcript, as designated with APPRIS annotation [26].
Selection analysis in the pig lineage
SnIPRE [27] uses a Poisson model to measure gene-level selection based on between-species divergence and within-species polymorphism. We calculated the divergence between the pig and cattle (UMD 3.1.1) genomes using the Nei-Gojobori method [28], which estimates the number of potential synonymous and nonsynonymous substitutions between two codons. We aligned the reference genomes using Lastz [29] and refined the alignments using the chain/net method [30]. We excluded all codons that were not fully aligned between genomes, i.e., any codon that contained an alignment gap or a missing base in any of the genomes. For within-species polymorphism data, we used the protein-coding variants from whole genome sequence data of the three lines combined.
SnIPRE models the logarithm of the mutation count in a sample of individuals as a linear function of fixed effects (an intercept term, a term for nonsynonymous variants, a term for divergent variants, an interaction term for divergent nonsynonymous variants, and an offset term) and random gene effects, which allow the coefficients for nonsynonymous variants, divergent variants and the interaction between them to be estimated with regularisation. The selection effect for each gene is the interaction term for nonsynonymous fixed variants (summing the coefficient for the fixed effect and the gene-specific coefficient for that gene), i.e. it provides an estimate of the excess or deficit in nonsynonymous divergent variants in a gene. We ran the empirical Bayes implementation of SnIPRE, using the lme4 R package [31], which generates confidence intervals for the selection effect based on standard errors.
Selective sweep analysis by haplotype diversity
We estimated haplotype diversity at CD163, at 100 random control genes of similar length as CD163 (at most 10% difference), and at 11 homologs of genes that are stably expressed in humans [32]. We imputed genome-wide sequence data to 65,000 pigs from Line 1, using SNP genotypes from a 60 K or 15 K SNP chip and the Line 1 sequence data described above. We extracted mapped read counts that supported each allele from low coverage samples, as outlined in [33], and used multilocus hybrid peeling [34], as implemented in AlphaPeel, to phase and impute all individuals to full sequence data in the selected genes.
We extracted all variants that were within exons and introns of the genes and identified the haplotypes that were carried by each genotyped individual in each gene. For each gene, including introns, strings of phased alleles were compared to define haplotypes carried by each individual in each parental chromosome. Strings of alleles that were identical (with a mismatch threshold) between two individuals were considered to be the same haplotype, while strings with more than two mismatches were considered as different haplotypes to account for sequencing or phasing errors. Then, we calculated haplotype homozygosities based on the pooled frequency of the two most common haplotypes (H12), which is used as a test statistic for detection of selective sweeps, and has been shown to be sensitive to soft sweeps [35].
Gene ontology enrichment of gene lists
We downloaded gene ontology (GO) biological process terms for all Ensembl genes from BioMart [36], and ranked enriched biological processes based on p values from a Fisher’s exact test of independence. For comparison, we extracted the genes found to be under positive selection in the pig, based on dN/dS ratio in [37], and mapped their names to Ensembl gene identifiers with BioMart. We found enriched biological process terms in this gene list in the same way as in our data, and identified the overlap of genes with enriched gene ontology terms between the two lists.