Current sample information

We continuously analyze a subset of the public read-level data. This subset has high-quality metadata, which allows accurate variant calling by ensuring that variants within primer-annealing sites are processed correctly for datasets produced using ARTIC protocols. The current set of samples comes from the COG-UK, Estonian, Greek, Irish, and South African surveillance efforts. The figure below shows the temporal distribution of analyzed samples.

Each project releases data in batches. Reads and metadata from each batch are processed as described in this document.

Why not all data?

The current number of raw read datasets in the EBI sequence archives is in the millions:

Why not analyze all of the data? Well ... essentially not a single dataset among these specifies its metadata correctly. For example, look at the library_construction_protocol metadata field for all amplicon data (library_strategy=="AMPLICON") produced on the Illumina platform (instrument_platform=="ILLUMINA"):

The vast majority of the data does not specify anything at all (None on the X-axis), while the other values (top 25 shown; see X-axis labels) are not much more useful. As a result, we do not know exactly how the data was generated, and so we cannot reliably call variants!
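
The tally behind a plot like this can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce the figure: it assumes the archive metadata has already been loaded as a list of dictionaries keyed by the field names mentioned above. The function name `tally_protocols` and the example records are hypothetical.

```python
from collections import Counter


def tally_protocols(records):
    """Count library_construction_protocol values among Illumina amplicon records.

    Records with a missing or empty protocol field are counted under "None".
    """
    counts = Counter()
    for rec in records:
        # Keep only amplicon data produced on the Illumina platform.
        if rec.get("library_strategy") != "AMPLICON":
            continue
        if rec.get("instrument_platform") != "ILLUMINA":
            continue
        protocol = (rec.get("library_construction_protocol") or "").strip()
        counts[protocol or "None"] += 1
    return counts


# Hypothetical records for illustration only; these are not real archive entries.
example_records = [
    {"library_strategy": "AMPLICON", "instrument_platform": "ILLUMINA",
     "library_construction_protocol": ""},
    {"library_strategy": "AMPLICON", "instrument_platform": "ILLUMINA",
     "library_construction_protocol": None},
    {"library_strategy": "AMPLICON", "instrument_platform": "ILLUMINA",
     "library_construction_protocol": "ARTIC v3"},
    {"library_strategy": "WGS", "instrument_platform": "ILLUMINA",
     "library_construction_protocol": "ARTIC v3"},
    {"library_strategy": "AMPLICON", "instrument_platform": "OXFORD_NANOPORE",
     "library_construction_protocol": "ARTIC v3"},
]

protocol_counts = tally_protocols(example_records)
# The 25 most frequent values, matching the "top 25" cut-off described above.
top_protocols = protocol_counts.most_common(25)
```

Treating empty and missing values as a single "None" category mirrors how the unspecified entries are grouped on the X-axis; free-text values are left as-is, which is exactly why they are hard to interpret.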