We provide two resources for interpreting our results. The first is this page, which is updated weekly as we continuously add new analysis snippets. The second is our constantly updated Observable Dashboard.
The data
High level descriptive statistics
After filtering (see Methods) the dataset has the following characteristics:
Distribution of allele frequencies
The allele frequencies have the following distribution when stratified by EFFECT types:
Distribution of variants across genes
The density of synonymous and non-synonymous changes across genes:
Allele frequency distribution for individual sites
To view the allele frequency distribution for an individual site, click on a circle in the graph below:
Co-occurring sites
The figure below shows all sets of two or more variants co-occurring in two or more samples:
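As a rough illustration of what "co-occurring" means here, the following minimal sketch finds sets of at least two variants shared by at least two samples. It is not the code behind the figure: the function name `cooccurring_sets`, the input layout (a dict mapping sample IDs to sets of variant labels), and the toy calls are all assumptions made for this example.

```python
from itertools import combinations

def cooccurring_sets(calls, min_variants=2, min_samples=2):
    """Return {variant set: sorted list of samples carrying every variant in it}.

    `calls` is a hypothetical layout: sample ID -> set of variant labels.
    """
    # Candidate sets are the variants shared by each pair of samples.
    candidates = set()
    for a, b in combinations(sorted(calls), 2):
        shared = frozenset(calls[a] & calls[b])
        if len(shared) >= min_variants:
            candidates.add(shared)

    # For each candidate, list every sample that carries the whole set.
    result = {}
    for variant_set in candidates:
        carriers = sorted(s for s, v in calls.items() if variant_set <= v)
        if len(carriers) >= min_samples:
            result[variant_set] = carriers
    return result

# Toy data, invented for illustration only.
toy_calls = {
    "S1": {"v1", "v2", "v3"},
    "S2": {"v1", "v2"},
    "S3": {"v2"},
    "S4": {"v1", "v2", "v3"},
}
groups = cooccurring_sets(toy_calls)
for variant_set, carriers in groups.items():
    print(sorted(variant_set), carriers)
```

Comparing samples pairwise is quadratic in the number of samples, so this approach suits small batches; a larger dataset would call for a frequent-itemset method instead.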
Temporal substitution dynamics at VOC sites
One interesting application of our data is examining the extent of intra-host variation at sites identified as Variants of Concern (VOC). Many VOC sites have been present in SARS-CoV-2 genomes at below-consensus frequencies well before becoming fixed. As we demonstrate in our recent virological post, the mutations occurring at the 14 Omicron S-gene codons that display either evidence of negative selection or no evidence of selection (neutral evolution) have rarely been seen within previously sampled sequences (see here), indicating the action of strong purifying selection due to functional constraints. Despite the rarity of these mutations in assembled genomes, it is not uncommon to find them in within-patient sequence datasets (Figure below), often at sub-consensus allelic frequencies. This indicates that, with the possible exceptions of S/N764K, S/N856K and S/Q954H, the mutations at these sites are not rare simply because they are unlikely to occur, but rather because whenever they do occur they are unlikely either to increase sufficiently in frequency to be transmitted, or to increase sufficiently in frequency among transmitting viruses to be detected by genomic surveillance. The following figure shows these dynamics using pre-Omicron data across the SARS-CoV-2 genome:
Substitution dynamics during chronic infection
In addition to continuously analyzing data from several national surveillance projects, we applied our workflows to a unique dataset generated by Weigang et al. 2021. In this dataset, ampliconic and metatranscriptomic data were collected at nine time points. At several time points (days 14 and 105), cell-culture-propagated isolates were also created and sequenced in addition to the ampliconic sequencing from swab specimens. The following figure shows the temporal dynamics for variants identified in these samples:
Methods
The filtered set
Using the T value for each sample, we then filtered the allelic variants in every batch by removing all variants that have an allele frequency below 50% and appear in fewer than T samples within that batch.
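The filtering rule can be sketched as follows. This is a minimal illustration rather than the actual workflow code: the function name `filter_batch`, the record fields (`sample`, `pos`, `ref`, `alt`, `af`), and the toy values are assumptions, and T is treated here as a single threshold supplied for the whole batch. Note that a variant is dropped only when both conditions hold, so a low-frequency variant seen in at least T samples is kept, as is any variant at 50% or above.

```python
from collections import Counter

def filter_batch(variants, t, min_af=0.5):
    """Drop variants with AF < min_af that also occur in fewer than t samples.

    `variants` is a hypothetical layout: a list of dicts with keys
    sample, pos, ref, alt, af, all belonging to one batch.
    """
    # Count the distinct samples in which each (pos, ref, alt) variant appears.
    seen = {(v["sample"], v["pos"], v["ref"], v["alt"]) for v in variants}
    samples_per_variant = Counter((pos, ref, alt) for _, pos, ref, alt in seen)

    kept = []
    for v in variants:
        key = (v["pos"], v["ref"], v["alt"])
        low_af = v["af"] < min_af
        rare = samples_per_variant[key] < t
        if not (low_af and rare):  # remove only if both conditions hold
            kept.append(v)
    return kept

# Toy batch, invented for illustration only.
batch = [
    {"sample": "S1", "pos": 100, "ref": "A", "alt": "G", "af": 0.90},
    {"sample": "S1", "pos": 200, "ref": "C", "alt": "T", "af": 0.10},
    {"sample": "S2", "pos": 200, "ref": "C", "alt": "T", "af": 0.20},
    {"sample": "S2", "pos": 300, "ref": "G", "alt": "A", "af": 0.05},
]
kept = filter_batch(batch, t=2)
print([(v["sample"], v["pos"]) for v in kept])
```

In this toy batch, only the position 300 variant is removed: it is below 50% and appears in a single sample, whereas the position 200 variant is retained because it recurs in two samples.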