VGP + Galaxy
The Vertebrate Genomes Project (VGP) is an international collaborative effort by the G10K consortium to generate near error-free genome assemblies for more than 70,000 vertebrate species. Using Galaxy infrastructure and public instances, this collaboration has produced new, more open methods for genome assembly and data access:
- Integration of GenomeArk on public Galaxy servers.
- A Galaxy platform with toolkits specifically tailored for genome assembly.
- Multiple workflows available from the Intergalactic Workflow Commission (IWC) on US and EU servers, using the most up-to-date VGP pipelines.
- A list of publicly-available histories1 for each assembly completed on Galaxy as they are generated.
Genomes assembled so far using the Galaxy platform
The following continuously updated table represents genome assemblies produced so far using Galaxy workflows.
For species listed twice, two haplotype assemblies are available. Taxonomic labels are clickable: "Class" and "Order" link to Wikipedia, and "Species" links to GenomeArk, a central repository of VGP data. A vector graphics version of this figure is available here.
Size = estimated genome size; Het = estimated heterozygosity; Repeat = estimated repeat content; Contig NG50 and Scaffold NG50 = NG50 statistics for contigs and scaffolds, respectively; Gaps = total length of gaps in scaffolds.
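The NG50 columns in the legend can be made concrete with a short calculation. The sketch below assumes the common definition of NG50: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the estimated genome size. The function name `ng50` and the toy lengths are illustrative and are not taken from the VGP workflows.

```python
def ng50(lengths, genome_size):
    """Return the NG50 of a set of contig or scaffold lengths.

    genome_size is the *estimated* genome size (the "Size" column),
    not the total assembly length; that is what separates NG50 from N50.
    Returns None when the sequences cover less than half of genome_size.
    """
    target = genome_size / 2
    running_total = 0
    # Walk from the longest sequence down, accumulating length.
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return None


# Toy example: five contigs and an estimated genome size of 200.
# 50 + 40 = 90 is below the target of 100; adding 30 reaches 120.
print(ng50([50, 40, 30, 20, 10], 200))  # prints 30
```

Passing the total assembly length as `genome_size` gives the ordinary N50, which is why the two statistics diverge when an assembly is much smaller or larger than the estimated genome.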
I want to assemble a genome!
The whole point of bringing VGP assembly workflows to Galaxy is to give you the ability to produce high-quality assemblies for free.
Prerequisites
To produce high-quality assemblies, you need to start with high-quality, high-coverage HiFi sequencing data (the median coverage for species listed in the table above is 33×). Our workflows support the following "tiers" of sequencing data. Supplementing HiFi data with parental reads, HiC datasets, and/or BioNano optical maps will produce increasingly complete assemblies:
Tier¹ | Assembly quality |
---|---|
HiFi | The minimum requirement |
HiFi + HiC | Better continuity |
HiFi + BioNano | Better continuity |
HiFi + HiC + BioNano | Even better continuity |
HiFi + parental Illumina data | Better haplotype resolution |
HiFi + parental Illumina data + HiC | Better haplotype resolution and improved continuity |
HiFi + parental Illumina data + BioNano | Better haplotype resolution and improved continuity |
HiFi + parental Illumina data + HiC + BioNano | Better haplotype resolution and ultimate continuity |
¹ For details on individual analysis trajectories, visit the workflows page.
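The prerequisites above can be sketched as a small planning helper: estimate HiFi coverage as total sequenced bases divided by estimated genome size, then look up the expected outcome for the data you have. This is a minimal sketch, not part of the VGP workflows. The names `TIERS`, `estimated_coverage`, and `expected_quality` are hypothetical, `"parental"` is shorthand for parental Illumina data, and the value 33 is only the median reported above, not a documented minimum.

```python
# Expected outcome per data combination, copied from the tier table.
# "parental" is shorthand (an assumption) for parental Illumina data.
TIERS = {
    frozenset({"HiFi"}): "The minimum requirement",
    frozenset({"HiFi", "HiC"}): "Better continuity",
    frozenset({"HiFi", "BioNano"}): "Better continuity",
    frozenset({"HiFi", "HiC", "BioNano"}): "Even better continuity",
    frozenset({"HiFi", "parental"}): "Better haplotype resolution",
    frozenset({"HiFi", "parental", "HiC"}):
        "Better haplotype resolution and improved continuity",
    frozenset({"HiFi", "parental", "BioNano"}):
        "Better haplotype resolution and improved continuity",
    frozenset({"HiFi", "parental", "HiC", "BioNano"}):
        "Better haplotype resolution and ultimate continuity",
}

MEDIAN_COVERAGE = 33  # median for the listed species; a reference point only


def estimated_coverage(total_hifi_bases, genome_size):
    """Fold coverage = total sequenced bases / estimated genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_hifi_bases / genome_size


def expected_quality(data_types):
    """Look up the table's description for a combination of data types."""
    key = frozenset(data_types)
    if "HiFi" not in key:
        raise ValueError("HiFi data is the minimum requirement")
    if key not in TIERS:
        raise ValueError(f"unknown data combination: {sorted(key)}")
    return TIERS[key]


# Example: 33 Gbp of HiFi reads for a genome estimated at 1 Gbp.
coverage = estimated_coverage(33_000_000_000, 1_000_000_000)
print(coverage, coverage >= MEDIAN_COVERAGE)   # prints 33.0 True
print(expected_quality(["HiFi", "HiC"]))       # prints Better continuity
```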
Workflows
Once you have the required data, you can upload the datasets to the European, American, or Australian Galaxy instance and begin assembly as described on our workflows page.
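Uploads can be done through the Galaxy web interface; for scripted uploads, one option is the BioBlend Python client. The sketch below is an assumption about how you might do this, not a step from the VGP workflows. It uses BioBlend's `GalaxyInstance`, `histories.create_history`, and `tools.upload_file`. The helper `upload_reads`, the history name, the `GALAXY_API_KEY` environment variable, and the file name `reads.fastq.gz` are hypothetical placeholders.

```python
import os

# Public Galaxy instances for the three regions mentioned above.
GALAXY_SERVERS = {
    "EU": "https://usegalaxy.eu",
    "US": "https://usegalaxy.org",
    "AU": "https://usegalaxy.org.au",
}


def upload_reads(gi, paths, history_name="VGP assembly input"):
    """Create a new history and upload each local file into it.

    gi is a bioblend.galaxy.GalaxyInstance (or any object exposing the
    same histories/tools interface). Returns the new history's ID.
    """
    history = gi.histories.create_history(name=history_name)
    for path in paths:
        gi.tools.upload_file(path, history["id"])
    return history["id"]


if __name__ == "__main__":
    api_key = os.environ.get("GALAXY_API_KEY")  # hypothetical variable name
    if api_key:
        # Imported here so the helper above works without BioBlend installed.
        from bioblend.galaxy import GalaxyInstance

        gi = GalaxyInstance(url=GALAXY_SERVERS["EU"], key=api_key)
        print(upload_reads(gi, ["reads.fastq.gz"]))  # placeholder file name
    else:
        print("Set GALAXY_API_KEY to run the upload.")
```

Very large read sets may be better transferred by other means that the instance offers; the point here is only to show the shape of a scripted upload.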
Where is existing VGP data?
All initial data, intermediate datasets, and final assemblies are available from the GenomeArk platform. Galaxy has integrated the GenomeArk server as a remote data repository on all public servers.
Who pays for computation?
The computational resources required for assembly are supported by public computational infrastructure. In turn, this computational infrastructure is brought to you by:
- EU - de.NBI, Uni-Freiburg, EOSC and ELIXIR
- US - ACCESS-CI, TACC, Jetstream2 (additional funding is provided by NSF and NIH)
- Australia - Australian BioCommons, QCIF, Melbourne Bioinformatics, AARNet