Purpose: This study applies topic modeling to next-generation sequencing (NGS) data analysis in order to identify biomarkers for the detection and verification of foodborne pathogens.
Methods: A framework was developed for mining NGS datasets with topic modeling. It consists of four major procedures: NGS data retrieval, preprocessing, topic modeling, and data mining of the latent Dirichlet allocation (LDA) topic outputs. The preprocessed NGS sequences were transformed into a corpus in which each document could reasonably be treated under the bag-of-words assumption, which is essential to the effectiveness of the topic modeling approach.
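As an illustration of the corpus-building step, the following minimal Python sketch turns sequences into bag-of-words documents. It is not the authors' implementation. The abstract does not say how "words" are defined, so the use of overlapping k-mers, the value k=3, the function names, and the toy sequences are all assumptions made for the example.

```python
from collections import Counter


def sequence_to_words(sequence, k=3):
    """Split a nucleotide sequence into overlapping k-mers (assumed 'words')."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_corpus(sequences, k=3):
    """Build a bag-of-words corpus from a dict of {document id: sequence}.

    Returns the sorted vocabulary and, per document, a list of word counts
    aligned with that vocabulary. Word order is discarded, which is exactly
    the bag-of-words assumption.
    """
    docs = {name: Counter(sequence_to_words(seq, k))
            for name, seq in sequences.items()}
    vocabulary = sorted(set().union(*docs.values()))
    matrix = {name: [counts[word] for word in vocabulary]
              for name, counts in docs.items()}
    return vocabulary, matrix


# Hypothetical toy sequences for illustration only, not real fliC data.
toy = {"isolate_A": "ATGGCA", "isolate_B": "ATGGTA"}
vocab, dtm = build_corpus(toy, k=3)
```

A document-term count matrix of this shape is the usual input to an LDA implementation, for example scikit-learn's `LatentDirichletAllocation`. Whether the study used that library is not stated here.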
Results: An NGS data set of 119 Salmonella isolates was retrieved from the National Center for Biotechnology Information (NCBI) database and used as an example to demonstrate the workflow of the framework. The topics output by LDA could be treated as features of the Salmonella strains that accurately describe the genetic diversity of the fliC gene across serotypes. Distinguishing SNPs were then identified by the subsequent data mining methods as potential biomarkers.
Significance: Implementing topic modeling in an NGS data analysis framework provides a new way to elucidate genetic information and identify biomarkers, and thereby enhances NGS data analysis and its applications to pathogen identification, source tracking, and population genome evolution.