Purpose: Most existing pipelines yield read-mapping results that are riddled with false positives. For instance, in in-silico simulations we find that as many as 85 to 90% of the potential OTUs predicted by standard pipelines from the literature are false. We tackle this problem by introducing a computational solution.
Methods: Our method is based on the promiscuity of reads, i.e., reads mapping to multiple OTUs, in contrast to current approaches that rely on read abundance. After ranking the potential OTU matches for each read, we demonstrate through simulations that the rank-frequency distribution of reads from true positive OTUs peaks at rank 1. To further enrich for true positives, we define a normalized, promiscuity-based score per OTU. When OTUs are sorted by this score, the false positives sink to the bottom.
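The ranking and scoring idea can be illustrated with a minimal Python sketch. The abstract does not state the scoring formula, so everything specific here is an assumption made for illustration: each read is assumed to come with a best-to-worst ordered list of OTU hits, the score lets a read that maps to k OTUs contribute 1/k to each of them and normalizes by the number of reads hitting the OTU, and the names `rank_histograms`, `promiscuity_scores`, and the toy `read_hits` data are hypothetical.

```python
from collections import Counter, defaultdict


def rank_histograms(read_hits):
    """Count, per OTU, how often it appears at each rank across reads.

    read_hits maps a read id to its OTU hits ordered best to worst
    (assumed input format, not taken from the source).
    """
    hist = defaultdict(Counter)
    for otus in read_hits.values():
        for rank, otu in enumerate(otus, start=1):
            hist[otu][rank] += 1
    return hist


def promiscuity_scores(read_hits):
    """Hypothetical normalized score per OTU.

    A read mapping to k OTUs contributes 1/k to each of them; the sum is
    divided by the number of reads hitting that OTU, so scores lie in (0, 1].
    OTUs supported mostly by promiscuous reads get low scores.
    """
    total = defaultdict(float)
    n_reads = Counter()
    for otus in read_hits.values():
        k = len(otus)
        for otu in otus:
            total[otu] += 1.0 / k
            n_reads[otu] += 1
    return {otu: total[otu] / n_reads[otu] for otu in total}


# Toy data: OTU "A" is hit at rank 1 by three reads, one of them uniquely.
read_hits = {
    "r1": ["A"],
    "r2": ["A", "B"],
    "r3": ["A", "B", "C"],
    "r4": ["D", "C"],
}

hist = rank_histograms(read_hits)
scores = promiscuity_scores(read_hits)

# Sort OTUs by descending score; low-scoring OTUs end up at the bottom.
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy input, "A" ranks first because its reads are the least promiscuous, while "B" and "C", which are only ever hit alongside other OTUs, fall to the end of the list. Where to cut the sorted list is a separate choice that this sketch does not address.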
Results: Our preliminary experiments demonstrate that false positive OTUs can be substantially reduced without losing any true positives. Using wgsim, we simulated 10,000 sequencing reads of 100 bp from the 16S genes of 20 bacterial species from 13 genera, including food pathogens. Averaged over 100 instances, the method reduced the number of false positive OTUs from 368 to 29 (a reduction of about 92%), without losing a true positive in any instance.
Significance: Identifying the organisms truly present in food samples more accurately benefits downstream analyses, including hazard detection.