Abstract
High-density and sequence genotype data were expected to improve the accuracy of genomic predictions in commercial animal genomic improvement programs by capturing more genetic variation; however, realized gains in accuracy relative to low- to mid-density SNP chip panels have been modest. This raises the question of how much of this phenomenon can be attributed to lower-density SNP panels already capturing the segregation patterns of causal variants sufficiently, owing to the strong linkage disequilibrium structure in these populations, and how much to highly informative variants in high-density and sequence data being obscured by the dimensionality of the data. This work employs a simulated data design to evaluate how biallelic SNP markers that carry little information about the segregation of causal variants affect genomic predictions, and to investigate strategies for distinguishing spurious, inconsistent trait associations among these markers from true associations driven by linkage between SNPs and causative variants. It is shown that the dimensionality of high-density and sequence data will inevitably result in false SNP-trait associations that lead to overfitting of genomic models and lower-quality genomic predictions. Furthermore, efforts to identify the most informative and biologically relevant SNP markers through genome-wide association studies are impeded by the presence of these spurious associations. It may be possible to identify such spurious associations by comparing multiple criteria that measure trait-specific SNP relevance and eliminating markers whose signals of relevance are inconsistent. A fuzzy inference system approach is evaluated for its potential to aggregate two SNP preselection criteria, FST scores and p-values, into a single composite score for preselection.
Although gains in genomic prediction accuracy using this strategy were moderate compared to results for preselection based on the individual input criteria, the results demonstrate that such an approach has the potential to consolidate information on SNP markers from multiple sources to better distinguish true from spurious trait associations.
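
To make the idea of aggregating two preselection criteria concrete, the following is a minimal illustrative sketch of a zero-order Sugeno-style fuzzy inference step that combines an FST score and a GWAS p-value into one composite score. It is not the system evaluated in this work: the function names, the linear membership functions, their breakpoints (`fst_range`, `logp_range`), the rule table, and the rule output levels are all assumptions chosen only for illustration. The one design choice taken from the text above is that markers with inconsistent signals (one criterion high, the other low) receive a low score.

```python
import math


def ramp(x, lo, hi):
    """Degree of membership in 'high': 0 at or below lo, 1 at or above hi."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def composite_score(fst, p_value, fst_range=(0.0, 0.1), logp_range=(1.0, 5.0)):
    """Combine FST and a p-value into one score in [0, 1].

    The breakpoints and rule outputs are hypothetical, not taken from the source.
    """
    logp = -math.log10(max(p_value, 1e-300))  # guard against log10(0)
    fst_high = ramp(fst, *fst_range)
    sig_high = ramp(logp, *logp_range)
    fst_low, sig_low = 1.0 - fst_high, 1.0 - sig_high

    # Each rule: (firing strength using min as fuzzy AND, crisp output level).
    rules = [
        (min(fst_high, sig_high), 1.0),   # both criteria agree: relevant
        (min(fst_high, sig_low), 0.25),   # inconsistent signals: penalized
        (min(fst_low, sig_high), 0.25),   # inconsistent signals: penalized
        (min(fst_low, sig_low), 0.0),     # both criteria agree: not relevant
    ]
    total = sum(weight for weight, _ in rules)
    # Weighted average of rule outputs (zero-order Sugeno defuzzification).
    return sum(weight * level for weight, level in rules) / total


# Hypothetical markers: (FST, p-value).
markers = {"snp_a": (0.20, 1e-8), "snp_b": (0.20, 0.5), "snp_c": (0.0, 0.5)}
ranked = sorted(markers, key=lambda m: composite_score(*markers[m]), reverse=True)
```

Markers could then be preselected by keeping those whose composite score exceeds a chosen threshold, or by taking the top-ranked fraction. Either choice would itself be a tuning decision not specified here.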