Abstract - Food-borne salmonellosis is an important public health concern worldwide. Despite the increasing adoption of DNA-based subtyping approaches, serotype information remains a cornerstone in food safety and public health activities aimed at reducing the burden of salmonellosis. At the same time, recent advances in whole-genome sequencing (WGS) promise to revolutionize our ability to perform advanced pathogen characterization in support of improved source attribution and outbreak analysis.
We present the Salmonella In Silico Typing Resource (SISTR), a bioinformatics platform (available at https://usegalaxy.eu/root?tool_id=sistr_cmd) for rapidly performing in silico Salmonella serotype predictions from O (somatic) and H (flagellar) antigen sequences, with a 330-gene cgMLST scheme used to refine the predictions. The SISTR serovar prediction algorithm uses the in silico results to build a query from the O serogroup and the H1 and H2 antigen gene sequences; this query identifies the serovar by its antigenic formula. Individual O antigen factors are determined using the White-Kauffmann-Le Minor (WKL) scheme. As sequencing of Salmonella isolates at public health laboratories around the world becomes increasingly common, rapid in silico analysis of minimally processed draft genome assemblies provides a powerful approach for molecular epidemiology in support of public health investigations.
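The antigenic-formula query described above can be illustrated with a minimal sketch. It is not SISTR's actual code: the function names, the `cgmlst_serovar` parameter, and the five-entry lookup table are assumptions made for illustration, and the real WKL scheme contains far more serovars. The sketch shows why a cgMLST refinement step is useful, since some serovars (here Choleraesuis and Paratyphi C) share the same O serogroup, H1, and H2 combination.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical, heavily truncated lookup table (illustration only):
# (O serogroup, H1, H2) -> serovars sharing that antigenic formula.
ANTIGENIC_FORMULAE: Dict[Tuple[str, str, str], List[str]] = {
    ("B", "i", "1,2"): ["Typhimurium"],
    ("B", "r", "1,2"): ["Heidelberg"],
    ("D1", "g,m", "-"): ["Enteritidis"],
    ("C1", "r", "1,5"): ["Infantis"],
    ("C1", "c", "1,5"): ["Choleraesuis", "Paratyphi C"],
}


def candidate_serovars(serogroup: str, h1: str, h2: str) -> List[str]:
    """Return every serovar whose formula matches the antigen query."""
    return ANTIGENIC_FORMULAE.get((serogroup, h1, h2), [])


def predict_serovar(
    serogroup: str, h1: str, h2: str, cgmlst_serovar: Optional[str] = None
) -> Optional[str]:
    """Pick a serovar from the antigen query, refined by a cgMLST hint.

    `cgmlst_serovar` stands in for the serovar of the closest cgMLST
    match (an assumed input, not a SISTR field name).
    """
    candidates = candidate_serovars(serogroup, h1, h2)
    if not candidates:
        return None  # no formula in the table matches the query
    if len(candidates) == 1:
        return candidates[0]
    if cgmlst_serovar in candidates:
        return cgmlst_serovar  # cgMLST resolves the ambiguity
    return "|".join(candidates)  # report all candidates when unresolved


print(predict_serovar("B", "i", "1,2"))
print(predict_serovar("C1", "c", "1,5"))
print(predict_serovar("C1", "c", "1,5", cgmlst_serovar="Paratyphi C"))
```

In this sketch an unresolved query returns all matching serovars joined by `|` rather than guessing, which keeps the ambiguity visible to the analyst.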
Results - We tested and validated SISTR on Salmonella whole-genome sequences from the public repositories of WGS data at the National Center for Biotechnology Information. The final validation dataset consisted of 4,191 draft genomes with 10 or more genome representatives per serovar with metadata of sufficient quality. A total of 3,967 genomes were accurately predicted out of a total of 4,191 genomes included in this analysis, for a global prediction accuracy of 94.9%. The accuracy of prediction was also assessed on a per-serovar basis. For serovars with a minimum of four genomes in the dataset, 79 of 84 had at least a 75% concordance between reported and predicted serovar. Among serovars with ten or more genomes, 35 of 46 had a concordance of 97% or higher.
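The global and per-serovar concordance measures reported above can be computed along the following lines. This is a minimal sketch under stated assumptions: the function name `concordance` is hypothetical, and the `toy_pairs` list is invented purely to exercise the code; it is not drawn from the validation dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def concordance(
    pairs: Iterable[Tuple[str, str]]
) -> Tuple[float, Dict[str, float]]:
    """Compute overall and per-serovar concordance.

    Each pair is (reported serovar, predicted serovar). Concordance is
    the fraction of genomes whose prediction equals the reported serovar.
    """
    total: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for reported, predicted in pairs:
        total[reported] += 1
        if reported == predicted:
            correct[reported] += 1
    n = sum(total.values())
    if n == 0:
        raise ValueError("no genomes supplied")
    overall = sum(correct.values()) / n
    per_serovar = {s: correct[s] / total[s] for s in total}
    return overall, per_serovar


# Invented toy data (illustration only, not the study's results).
toy_pairs = [
    ("Typhimurium", "Typhimurium"),
    ("Typhimurium", "Typhimurium"),
    ("Typhimurium", "Typhimurium"),
    ("Typhimurium", "Heidelberg"),
    ("Enteritidis", "Enteritidis"),
    ("Enteritidis", "Enteritidis"),
]

overall, per_serovar = concordance(toy_pairs)
print(f"overall: {overall:.1%}")
for serovar, value in sorted(per_serovar.items()):
    print(f"{serovar}: {value:.1%}")
```

Grouping by the reported serovar mirrors the per-serovar assessment in the text, where each serovar's concordance is judged against a threshold such as 75% or 97%.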
Conclusion - We have developed SISTR to accurately perform serotype predictions from minimally processed assemblies via a command-line interface available on the Galaxy and IRIDA platforms. Analytical platforms that rapidly analyze Salmonella genome sequence data using a number of complementary approaches will improve the response capacity of the public health system for the prevention and control of salmonellosis.