Cleaning genotype data from diversity outbred mice

Karl W. Broman, Daniel M. Gatti, Karen L. Svenson, Saunak Sen, Gary A. Churchill

Research output: Contribution to journal › Article

Abstract

Data cleaning is an important first step in most statistical analyses, including efforts to map the genetic loci that contribute to variation in quantitative traits. Here we illustrate approaches to quality control and cleaning of array-based genotyping data for multiparent populations (experimental crosses derived from more than two founder strains), using MegaMUGA array data from a set of 291 Diversity Outbred (DO) mice. Our approach employs data visualizations that can reveal problems at the level of individual mice or with individual SNP markers. We find that the proportion of missing genotypes for each mouse is an effective indicator of sample quality. We use microarray probe intensities for SNPs on the X and Y chromosomes to confirm the sex of each mouse, and we use the proportion of matching SNP genotypes between pairs of mice to detect sample duplicates. We use a hidden Markov model (HMM) reconstruction of the founder haplotype mosaic across each mouse genome to estimate the number of crossovers and to identify potential genotyping errors. To evaluate marker quality, we find that missing data and genotyping error rates are the most effective diagnostics. We also examine the SNP genotype frequencies with markers grouped according to their minor allele frequency in the founder strains. For markers with high apparent error rates, a scatterplot of the allele-specific probe intensities can reveal the underlying cause of incorrect genotype calls. The decision to include or exclude low-quality samples can have a significant impact on the mapping results for a given study. We find that the impact of low-quality markers on a given study is often minimal, but reporting problematic markers can improve the utility of the genotyping array across many studies.
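Two of the sample-level diagnostics named in the abstract, the proportion of missing genotypes per mouse and the proportion of matching SNP genotypes between pairs of mice, can be illustrated with a minimal sketch. This is not the authors' code. The genotype coding (0/1/2 with NaN for a missing call), the function names, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def missing_proportion(geno):
    """Proportion of missing calls per sample.

    geno: 2-D float array, rows = samples, columns = SNP markers,
    with np.nan marking a missing genotype call (assumed coding).
    """
    return np.isnan(geno).mean(axis=1)


def matching_proportion(geno):
    """Pairwise proportion of matching genotypes.

    For each pair of samples, only markers called in both samples
    are compared. Pairs with no shared calls are left as NaN.
    """
    n = geno.shape[0]
    out = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(n):
            both = ~np.isnan(geno[i]) & ~np.isnan(geno[j])
            if both.any():
                out[i, j] = (geno[i, both] == geno[j, both]).mean()
    return out


# Hypothetical toy data: 3 samples by 6 markers.
geno = np.array(
    [
        [0, 1, 2, 1, np.nan, 0],
        [0, 1, 2, 1, 2, 0],
        [2, 0, np.nan, np.nan, np.nan, 1],
    ],
    dtype=float,
)

miss = missing_proportion(geno)    # high values suggest a poor-quality sample
match = matching_proportion(geno)  # values near 1 between distinct samples suggest duplicates
```

In this toy example the third sample is missing half of its calls, and the first two samples agree at every marker called in both, which is the pattern one would look for when screening for sample duplicates.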

Original language: English (US)
Pages (from-to): 1571-1579
Number of pages: 9
Journal: G3: Genes, Genomes, Genetics
Volume: 9
Issue number: 5
DOI: https://doi.org/10.1534/g3.119.400165
State: Published - May 1 2019

Fingerprint

Genotype
Single Nucleotide Polymorphism
Genetic Loci
Y Chromosome
X Chromosome
Gene Frequency
Quality Control
Haplotypes
Alleles
Genome
Population

All Science Journal Classification (ASJC) codes

  • Molecular Biology
  • Genetics
  • Genetics (clinical)

Cite this

Broman, K. W., Gatti, D. M., Svenson, K. L., Sen, S., & Churchill, G. A. (2019). Cleaning genotype data from diversity outbred mice. G3: Genes, Genomes, Genetics, 9(5), 1571-1579. https://doi.org/10.1534/g3.119.400165

Cleaning genotype data from diversity outbred mice. / Broman, Karl W.; Gatti, Daniel M.; Svenson, Karen L.; Sen, Saunak; Churchill, Gary A.

In: G3: Genes, Genomes, Genetics, Vol. 9, No. 5, 01.05.2019, p. 1571-1579.

Broman, KW, Gatti, DM, Svenson, KL, Sen, S & Churchill, GA 2019, 'Cleaning genotype data from diversity outbred mice', G3: Genes, Genomes, Genetics, vol. 9, no. 5, pp. 1571-1579. https://doi.org/10.1534/g3.119.400165
Broman, Karl W. ; Gatti, Daniel M. ; Svenson, Karen L. ; Sen, Saunak ; Churchill, Gary A. / Cleaning genotype data from diversity outbred mice. In: G3: Genes, Genomes, Genetics. 2019 ; Vol. 9, No. 5. pp. 1571-1579.
@article{e77907977f6543f38992c20cc9766da6,
title = "Cleaning genotype data from diversity outbred mice",
abstract = "Data cleaning is an important first step in most statistical analyses, including efforts to map the genetic loci that contribute to variation in quantitative traits. Here we illustrate approaches to quality control and cleaning of array-based genotyping data for multiparent populations (experimental crosses derived from more than two founder strains), using MegaMUGA array data from a set of 291 Diversity Outbred (DO) mice. Our approach employs data visualizations that can reveal problems at the level of individual mice or with individual SNP markers. We find that the proportion of missing genotypes for each mouse is an effective indicator of sample quality. We use microarray probe intensities for SNPs on the X and Y chromosomes to confirm the sex of each mouse, and we use the proportion of matching SNP genotypes between pairs of mice to detect sample duplicates. We use a hidden Markov model (HMM) reconstruction of the founder haplotype mosaic across each mouse genome to estimate the number of crossovers and to identify potential genotyping errors. To evaluate marker quality, we find that missing data and genotyping error rates are the most effective diagnostics. We also examine the SNP genotype frequencies with markers grouped according to their minor allele frequency in the founder strains. For markers with high apparent error rates, a scatterplot of the allele-specific probe intensities can reveal the underlying cause of incorrect genotype calls. The decision to include or exclude low-quality samples can have a significant impact on the mapping results for a given study. We find that the impact of low-quality markers on a given study is often minimal, but reporting problematic markers can improve the utility of the genotyping array across many studies.",
author = "Broman, {Karl W.} and Gatti, {Daniel M.} and Svenson, {Karen L.} and Saunak Sen and Churchill, {Gary A.}",
year = "2019",
month = "5",
day = "1",
doi = "10.1534/g3.119.400165",
language = "English (US)",
volume = "9",
pages = "1571--1579",
journal = "G3: Genes, Genomes, Genetics",
issn = "2160-1836",
publisher = "Genetics Society of America",
number = "5"
}

TY - JOUR

T1 - Cleaning genotype data from diversity outbred mice

AU - Broman, Karl W.

AU - Gatti, Daniel M.

AU - Svenson, Karen L.

AU - Sen, Saunak

AU - Churchill, Gary A.

PY - 2019/5/1

Y1 - 2019/5/1

N2 - Data cleaning is an important first step in most statistical analyses, including efforts to map the genetic loci that contribute to variation in quantitative traits. Here we illustrate approaches to quality control and cleaning of array-based genotyping data for multiparent populations (experimental crosses derived from more than two founder strains), using MegaMUGA array data from a set of 291 Diversity Outbred (DO) mice. Our approach employs data visualizations that can reveal problems at the level of individual mice or with individual SNP markers. We find that the proportion of missing genotypes for each mouse is an effective indicator of sample quality. We use microarray probe intensities for SNPs on the X and Y chromosomes to confirm the sex of each mouse, and we use the proportion of matching SNP genotypes between pairs of mice to detect sample duplicates. We use a hidden Markov model (HMM) reconstruction of the founder haplotype mosaic across each mouse genome to estimate the number of crossovers and to identify potential genotyping errors. To evaluate marker quality, we find that missing data and genotyping error rates are the most effective diagnostics. We also examine the SNP genotype frequencies with markers grouped according to their minor allele frequency in the founder strains. For markers with high apparent error rates, a scatterplot of the allele-specific probe intensities can reveal the underlying cause of incorrect genotype calls. The decision to include or exclude low-quality samples can have a significant impact on the mapping results for a given study. We find that the impact of low-quality markers on a given study is often minimal, but reporting problematic markers can improve the utility of the genotyping array across many studies.

UR - http://www.scopus.com/inward/record.url?scp=85065785046&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85065785046&partnerID=8YFLogxK

U2 - 10.1534/g3.119.400165

DO - 10.1534/g3.119.400165

M3 - Article

C2 - 30877082

AN - SCOPUS:85065785046

VL - 9

SP - 1571

EP - 1579

JO - G3: Genes, Genomes, Genetics

JF - G3: Genes, Genomes, Genetics

SN - 2160-1836

IS - 5

ER -