Purpose: We describe the development and validation of a novel, high-density DNA microarray representing all known E. coli genes mined from approximately 300 whole-genome sequences. The FDA-ECID array was designed and manufactured using next-generation Affymetrix PEG-GeneAtlas technology. This custom tool is rapid, affordable, and high-throughput.
Methods: Using the BLASTCLUST and NETCLUST tools, we analyzed 300 whole-genome sequences and determined that the non-redundant pangenome of E. coli comprises ~40k unique genes. Each of these ~40k genes is represented by a probe set on the FDA-ECID microarray. Additionally, we represented each allele of the fliC, wzx, and wzy genes, which enables the microarray to perform molecular serotyping. Using the same 300 genome sequences, we identified ~125k conserved 25-mers, each containing a central single nucleotide polymorphism (SNP). From these, we selected the most informative subset (roughly 10%), which was capable of accurately recapitulating the phylogeny of E. coli. Each of these ~10k informative SNPs is represented on the FDA-ECID microarray using a SNP-typing probe strategy.
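To illustrate what a conserved 25-mer with a central SNP looks like, the following is a minimal sketch and not the pipeline used for the array. It assumes the sequences are already aligned to equal length, and it interprets "conserved" as 12 flanking bases on each side being identical across all sequences; both are assumptions made for illustration. The function name `central_snp_25mers` and the `FLANK` constant are hypothetical.

```python
from typing import List, Tuple

FLANK = 12  # assumed layout: 12 flanking bases + 1 SNP + 12 flanking bases = 25-mer
BASES = set("ACGT")


def central_snp_25mers(seqs: List[str]) -> List[Tuple[int, str, str]]:
    """Find candidate SNP-centred 25-mers in pre-aligned, equal-length sequences.

    Returns a list of (snp_position, template, alleles), where template is the
    25-mer with 'N' at the central position and alleles is a sorted string of
    the bases observed at that position.
    """
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be pre-aligned to equal length")

    # For every alignment column, collect the set of bases observed.
    columns = [set(col) for col in zip(*seqs)]

    hits = []
    for pos in range(FLANK, length - FLANK):
        alleles = columns[pos]
        # The central column must be polymorphic and contain only A/C/G/T.
        if len(alleles) < 2 or not alleles <= BASES:
            continue
        flank_cols = columns[pos - FLANK:pos] + columns[pos + 1:pos + FLANK + 1]
        # Every flanking column must be invariant across all sequences.
        if all(len(c) == 1 and c <= BASES for c in flank_cols):
            ref = seqs[0]
            template = ref[pos - FLANK:pos] + "N" + ref[pos + 1:pos + FLANK + 1]
            hits.append((pos, template, "".join(sorted(alleles))))
    return hits


if __name__ == "__main__":
    base = ("ACGT" * 7)[:27]
    variant = base[:13] + "T" + base[14:]  # single substitution at position 13
    for pos, template, alleles in central_snp_25mers([base, variant]):
        print(pos, template, alleles)
```

A real analysis of ~300 whole genomes would need alignment or k-mer indexing that handles gaps, rearrangements, and absent regions, and the subsequent selection of phylogenetically informative SNPs is not covered by this sketch.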
Results: As part of the validation process, we performed quadruplicate hybridizations of four diverse, well-characterized, sequenced reference strains (Sakai, 55989, CFT073, and MG1655). These data allowed us to optimize the gene-calling and SNP-calling algorithms. We also present results from our interrogation of a large collection (>900) of temporally and geographically diverse E. coli isolates.
Significance: In summary, the FDA-ECID microarray is a powerful tool for molecular epidemiology, phylogenetic analysis, virulence assessment, molecular serotyping, and exploration of the global genomic diversity of Escherichia coli.