P3-13 Biomarker Identification from Next Generation Sequencing Data for Foodborne Pathogen Detection and Verification

Tuesday, July 28, 2015
Hall B (Oregon Convention Center)
Wen Zou, U.S. Food and Drug Administration-NCTR, Jefferson, AR
Weizhong Zhao, U.S. Food and Drug Administration-NCTR, Jefferson, AR
James Chen, U.S. Food and Drug Administration-NCTR, Jefferson, AR
Introduction: Next-generation sequencing (NGS) technology has recently been widely applied in clinical and public health laboratory investigations for pathogen detection and surveillance. Major gaps currently exist in NGS data analysis and data interpretation.  Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. 

Purpose: The purpose of this study is to implement topic modeling in NGS data analysis for biomarker identification in foodborne pathogen detection and verification.

Methods: A framework was developed for data mining of NGS datasets by topic modeling. It consists of four major procedures: NGS data retrieval, preprocessing, topic modeling with latent Dirichlet allocation (LDA), and data mining of the LDA topic outputs. The preprocessed NGS sequences were transformed into a corpus in which each document could reasonably be treated as a bag of words, an assumption essential to the effectiveness of the topic modeling approach.
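
The corpus construction step can be illustrated with a minimal sketch. It assumes that the "words" are overlapping k-mers cut from each sequence; the abstract does not state how words were defined, so the tokenization, the value of k, the function names, and the toy sequences below are all hypothetical.

```python
from collections import Counter


def kmer_words(sequence, k=3):
    """Split a nucleotide sequence into overlapping k-mers ("words")."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_corpus(sequences, k=3):
    """Map each sequence ID to a bag of words: k-mer -> count, order discarded."""
    return {seq_id: Counter(kmer_words(seq, k)) for seq_id, seq in sequences.items()}


# Hypothetical toy sequences; real input would be preprocessed NGS sequences.
toy_sequences = {
    "isolate_A": "ATGGCACAAG",
    "isolate_B": "ATGGCTCAAG",
}

corpus = build_corpus(toy_sequences, k=3)

# Shared vocabulary across all documents, in a fixed (sorted) column order.
vocabulary = sorted(set().union(*corpus.values()))

# Document-term matrix: one row per isolate, one column per k-mer.
doc_term_matrix = [[corpus[s][w] for w in vocabulary] for s in toy_sequences]
```

A document-term count matrix of this shape is the usual input to an LDA implementation, which ignores word order within each document and models only the counts.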

Results: An NGS data set of 119 Salmonella isolates was retrieved from the National Center for Biotechnology Information (NCBI) database and used as an example to demonstrate the workflow of this framework. The topics output by the LDA algorithm could be treated as features of the Salmonella strains that accurately describe the genetic diversity of the fliC gene across serotypes. Distinguishing SNPs were then identified by subsequent data mining as potential biomarkers.
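
As a rough illustration of what a distinguishing SNP means here, the sketch below scans aligned sequences for columns where each group is internally uniform but the groups differ from one another. This is not the data mining method used in the study, which the abstract does not describe; the criterion, the function name, the serotype labels, and the toy alignment are all assumptions.

```python
def distinguishing_positions(aligned, groups):
    """Return 0-based alignment columns whose alleles separate the groups.

    A column qualifies when every group is internally uniform at that column
    and at least two groups carry different alleles.
    """
    by_group = {}
    for seq_id, seq in aligned.items():
        by_group.setdefault(groups[seq_id], []).append(seq)

    length = len(next(iter(aligned.values())))
    hits = []
    for pos in range(length):
        # Set of alleles observed at this column, per group.
        alleles = {g: {s[pos] for s in seqs} for g, seqs in by_group.items()}
        uniform = all(len(a) == 1 for a in alleles.values())
        distinct = len({next(iter(a)) for a in alleles.values()}) > 1
        if uniform and distinct:
            hits.append(pos)
    return hits


# Hypothetical toy alignment with made-up serotype labels.
toy_alignment = {
    "s1": "ATGGCA",
    "s2": "ATGGCA",
    "s3": "ATGACA",
    "s4": "ATGACT",
}
toy_groups = {
    "s1": "serotype_X",
    "s2": "serotype_X",
    "s3": "serotype_Y",
    "s4": "serotype_Y",
}

snps = distinguishing_positions(toy_alignment, toy_groups)
```

In the toy data, column 3 separates the two groups, while column 5 varies within one group and is therefore not reported.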

Significance: Incorporating topic modeling into the NGS data analysis framework provides a new way to elucidate genetic information and identify biomarkers, thereby enhancing NGS data analysis and its applications in pathogen identification, source tracking, and population genome evolution.