Continuous analysis of intra-host variation in SARS-CoV-2
Our effort focuses on four goals:
- Continuous analysis of within-host sequence variants in high-quality public read-level datasets.
- Maintenance of curated workflows for the analysis of SARS-CoV-2 sequence data, and of free, powerful infrastructure to execute them.
- Development of a continuously updated analysis page and dashboard summarizing the latest insights from the variant data.
- Providing access to all results in raw and aggregated form for immediate use.
Current knowledge about the evolutionary dynamics of SARS-CoV-2 comes primarily from genome assemblies rather than from read-level data. While complete genomes allow complex inferences about the evolutionary trajectory of the virus, they hide all information about intra-host dynamics, because a consensus sequence does not show variants that exist at sub-consensus allele frequencies. The situation is aggravated by the fact that the number of publicly available read-level datasets lags dramatically behind the number of complete genome assemblies, which makes it impossible to confirm or further investigate much of the data found in the GISAID database. In addition, only a fraction of the available read-level datasets are useful, because many lack the necessary metadata.
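To make the point about sub-consensus allele frequencies concrete, here is a minimal Python sketch of how a majority-rule consensus hides a minority variant that is clearly present in the reads. It is an illustration only, not the project's workflows: the function names, the toy pileup column, and the 5% reporting threshold (`min_af`) are all assumptions made for this example.

```python
from collections import Counter


def allele_frequencies(pileup_column):
    """Return {base: frequency} for the bases observed at one genome position."""
    counts = Counter(pileup_column)
    depth = sum(counts.values())
    return {base: n / depth for base, n in counts.items()}


def consensus_base(freqs):
    """Majority-rule consensus: the single most frequent base at the position."""
    return max(freqs, key=freqs.get)


def sub_consensus_variants(freqs, reference_base, min_af=0.05):
    """Bases that differ from both the reference and the consensus base,
    reported only if their frequency is at least min_af (assumed threshold)."""
    consensus = consensus_base(freqs)
    return {
        base: af
        for base, af in freqs.items()
        if base not in (consensus, reference_base) and af >= min_af
    }


# Hypothetical position covered by 100 reads: 88 support A, 12 support G.
column = "A" * 88 + "G" * 12
freqs = allele_frequencies(column)

print(consensus_base(freqs))                              # base an assembly would report
print(sub_consensus_variants(freqs, reference_base="A"))  # variant visible only in reads
```

In this toy case an assembled genome would record only the majority base, while the read-level data still show a second allele carried by 12 of the 100 reads.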
Datasets
Information about how we select, pre-process, and analyze public read-level datasets.
Workflows
Curated and validated workflows for immediate use on public Galaxy instances across the globe.
Why analyze intra-host variation?
Many lineage-defining sites have been present in SARS-CoV-2 genomes at below-consensus frequencies well before becoming fixed. As we demonstrate in our recent Virological post, the mutations occurring at the 14 Omicron S-gene codons that display either evidence of negative selection or no evidence of selection (neutral evolution) have rarely been seen in previously sampled sequences (see here), indicating the action of strong purifying selection due to functional constraints. Despite the rarity of these mutations in assembled genomes, it is not uncommon to find them in within-patient sequence datasets (Figure below), often at sub-consensus allele frequencies. This indicates that, with the possible exceptions of S/N764K, S/N856K and S/Q954H, the mutations at these sites are rare not because they are unlikely to occur, but because whenever they do occur they are unlikely either to increase sufficiently in frequency to be transmitted, or to increase sufficiently in frequency among transmitting viruses to be detected by genomic surveillance.
This work is funded by NIH NHGRI Grant U41 HG006620, NIH NIAID Grant R01 AI134384, NIH NIGMS Grant R01 GM093939 and NSF ABI Grant 1661497. Usegalaxy.eu is supported by the German Federal Ministry of Education and Research grants 031L0101C and de.NBI-epi to BG. Usegalaxy.org.au is supported by Bioplatforms Australia and the Australian Research Data Commons through funding from the Australian Government National Collaborative Research Infrastructure Strategy. Usegalaxy.be is supported by the Research Foundation-Flanders (FWO) grant I002919N and the Flemish Supercomputer Center (VSC).