T2-11 Understanding False Positives in Mapping of Microbiome Sequence Data Using In-silico Simulations

Monday, August 1, 2016: 11:30 AM
242 (America's Center - St. Louis)
Niina Haiminen, IBM TJ Watson Research Center, Yorktown Heights, NY
Laxmi Parida, IBM TJ Watson Research Center, Yorktown Heights, NY
Robert J Prill, IBM Almaden Research Center, San Jose, CA
David Chambliss, IBM Almaden Research Center, San Jose, CA
Kristen L Beck, IBM Almaden Research Center, San Jose, CA
Simone Bianco, IBM Almaden Research Center, San Jose, CA
Stefan Edlund, IBM Almaden Research Center, San Jose, CA
Kun Hu, IBM Almaden Research Center, San Jose, CA
Matthew Davis, IBM Almaden Research Center, San Jose, CA
James Kaufman, IBM Almaden Research Center, San Jose, CA
Dylan Storey, University of California, Davis, Davis, CA
Bart C Weimer, University of California, Davis, Davis, CA
Peter Markwell, MARS Incorporated, McLean, VA
Robert C. Baker, MARS Incorporated, McLean, VA
Introduction: Consider the task of mapping short-read sequencing data from multiple genomes in a micro-environment to the correct organism or, more generally, "Operational Taxonomic Unit" (OTU), using a reference database. Potential challenges include reference databases that contain redundant candidates or lack sufficient accuracy; species or strains in the environment that are genetically very close and therefore easily confused; and sequencing errors introduced by the extraction process and associated biotechnology. These factors confound the mapping problem, leading to inaccurate results and misleading downstream interpretations.

Purpose: Most existing pipelines yield read-mapping results that are riddled with false positives. For instance, our in-silico simulations show that as many as 85 to 90% of the OTUs predicted by standard pipelines from the literature are false positives. We address this problem with a computational solution.

Methods: Our method is based on the promiscuity of reads, i.e., the tendency of reads to map to multiple OTUs, in contrast to current approaches that rely on read abundance. After ranking the potential OTU matches for each read, we demonstrate through simulations that the rank frequency distribution of reads from true positive OTUs peaks at rank 1. To further enrich for true positives, we define a normalized per-OTU score based on promiscuity. When the OTUs are sorted by this score, the false positives sink to the bottom.
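
The abstract does not give the scoring formula, so the following Python sketch is only an illustration of the two ideas described above: a per-OTU rank frequency distribution and a normalized promiscuity-based score. The input layout (`hits`), the dense ranking by alignment score, and the score itself (the mean of 1/k over an OTU's reads, where k is the number of OTUs each read maps to) are all assumptions made for this example, not the authors' definitions.

```python
from collections import Counter, defaultdict


def rank_histograms(hits):
    """Per-OTU histogram of the ranks it receives across reads.

    hits maps read_id -> list of (otu, alignment_score); this layout is
    a hypothetical one chosen for the example.
    """
    hist = defaultdict(Counter)
    for matches in hits.values():
        # Dense ranking: OTUs with the same alignment score share a rank.
        distinct = sorted({score for _, score in matches}, reverse=True)
        rank_of = {score: i + 1 for i, score in enumerate(distinct)}
        for otu, score in matches:
            hist[otu][rank_of[score]] += 1
    return hist


def promiscuity_scores(hits):
    """Illustrative score (an assumption, not the published formula).

    Each read contributes 1/k to every OTU it maps to, where k is the
    number of distinct OTUs that read maps to. The sum is divided by the
    number of reads hitting the OTU, so the score lies in (0, 1] and is
    independent of read abundance.
    """
    total = defaultdict(float)
    n_reads = Counter()
    for matches in hits.values():
        otus = {otu for otu, _ in matches}
        for otu in otus:
            total[otu] += 1.0 / len(otus)
            n_reads[otu] += 1
    return {otu: total[otu] / n_reads[otu] for otu in total}


# Toy data: OTU "A" is the true source; "B" and "C" only attract
# promiscuous reads.
hits = {
    "r1": [("A", 100), ("B", 90)],
    "r2": [("A", 100)],
    "r3": [("A", 98), ("B", 95), ("C", 95)],
    "r4": [("C", 99), ("B", 97)],
}

hist = rank_histograms(hits)
scores = promiscuity_scores(hits)
ranking = sorted(scores, key=scores.get, reverse=True)
print(dict(hist["A"]), ranking)
```

In this toy input, all three of A's reads place it at rank 1, and A also receives the highest score, so it sorts to the top while B and C fall below it. A real pipeline would derive `hits` from aligner output and would need a threshold or cutoff rule, which the abstract does not specify.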

Results: Our preliminary experiments demonstrate that false positive OTUs can be substantially reduced without losing any true positives. Using wgsim, we simulated 10,000 sequencing reads of 100 bp from the 16S genes of 20 bacterial species from 13 genera, including food pathogens. Averaged over 100 instances, the method reduced the number of false positive OTUs from 368 to 29, without losing a true positive in any instance.
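
To put the reported averages in relative terms, the short calculation below derives the fraction of false positive OTUs removed. It uses only the two averages stated above; because it is a ratio of averages, it may differ slightly from the mean per-instance reduction, which the abstract does not report.

```python
# Average false positive OTU counts reported in the abstract.
fp_before = 368
fp_after = 29

removed = fp_before - fp_after    # false positive OTUs eliminated on average
reduction = removed / fp_before   # fraction of false positives eliminated

print(f"{removed} false positive OTUs removed ({reduction:.1%} reduction)")
```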

Significance: More accurately identifying the truly present organisms in food samples benefits downstream analyses including hazard detection.