The Vertebrate Genomes Project in Galaxy

VGP + Galaxy


The Vertebrate Genomes Project (VGP) is an international collaborative effort by the G10K consortium to generate near error-free genome assemblies for all 74,962 vertebrate species. Using Galaxy infrastructure and public instances, this collaboration has produced new, more open methods for genome assembly and data access:

  • Integration of GenomeArk on public Galaxy servers.
  • A Galaxy platform with toolkits specifically tailored for genome assembly.
  • Multiple workflows, available from the Intergalactic Workflow Commission (IWC) on US and EU servers, that use the most up-to-date VGP pipelines.
  • A list of publicly available histories¹ for each assembly completed on Galaxy, updated as assemblies are generated.

Genomes assembled so far using the Galaxy platform


The following continuously updated table lists the genome assemblies produced so far using Galaxy workflows.


For species listed twice, two haplotype assemblies are available. Taxonomic labels are clickable: "Class" and "Order" link to Wikipedia, and "Species" links to GenomeArk, a central repository of VGP data. A vector graphics version of this figure is available here.

Column abbreviations: Size = estimated genome size; Het = estimated heterozygosity; Repeat = estimated repeat content; Contig NG50 and Scaffold NG50 = NG50 statistics for contigs and scaffolds, respectively; Gaps = total length of gaps in scaffolds.

I want to assemble a genome!


The whole point of bringing VGP assembly workflows to Galaxy is to give you the ability to produce high-quality assemblies for free.

Prerequisites

To produce high-quality assemblies, you need to start with high-quality, high-coverage HiFi sequencing data (the median coverage for species listed in the table above is 33×). Our workflows support the following "tiers" of sequencing data. Supplementing HiFi data with parental reads, HiC datasets, and/or BioNano optical maps produces increasingly complete assemblies:

Tier¹                                           | Assembly quality
HiFi                                            | The minimum requirement
HiFi + HiC                                      | Better continuity
HiFi + BioNano                                  | Better continuity
HiFi + HiC + BioNano                            | Even better continuity
HiFi + parental Illumina data                   | Better haplotype resolution
HiFi + parental Illumina data + HiC             | Better haplotype resolution and improved continuity
HiFi + parental Illumina data + BioNano         | Better haplotype resolution and improved continuity
HiFi + parental Illumina data + HiC + BioNano   | Better haplotype resolution and ultimate continuity

¹ For details on individual analysis trajectories, visit the workflows page.
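To check where a dataset falls in the tiers above, the table can be expressed as a small lookup. The following Python sketch is illustrative only and is not part of the VGP workflows: the function names `estimated_coverage` and `expected_quality` are hypothetical, and coverage is assumed to be computed in the usual way, as total sequenced bases divided by estimated genome size.

```python
def estimated_coverage(total_hifi_bases: int, genome_size: int) -> float:
    """Return fold coverage: total sequenced bases / estimated genome size.

    Hypothetical helper; both arguments are in base pairs.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be a positive number of base pairs")
    return total_hifi_bases / genome_size


def expected_quality(hic: bool = False, bionano: bool = False,
                     parental: bool = False) -> str:
    """Map the data available alongside HiFi reads to the table's quality label.

    HiFi data is assumed to be present, since it is the minimum requirement.
    """
    # Number of long-range data types (HiC, BioNano) available: 0, 1, or 2.
    n_long_range = int(hic) + int(bionano)

    if parental:
        # Parental Illumina data improves haplotype resolution; long-range
        # data additionally improves continuity.
        suffix = {
            0: "",
            1: " and improved continuity",
            2: " and ultimate continuity",
        }[n_long_range]
        return "Better haplotype resolution" + suffix

    return {
        0: "The minimum requirement",
        1: "Better continuity",
        2: "Even better continuity",
    }[n_long_range]


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(estimated_coverage(33_000_000_000, 1_000_000_000))
    print(expected_quality(hic=True, parental=True))
```

The lookup only restates the table; it does not decide whether a given coverage is sufficient, since the page reports a median coverage rather than a threshold.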

Workflows

Once you have the required data, you can upload the datasets to the European, American, or Australian Galaxy instance and begin assembly as described on our workflows page.

Explore Workflows

Where is existing VGP data?


All initial data, intermediate datasets, and final assemblies are available from the GenomeArk platform. Galaxy has integrated the GenomeArk server as a remote data repository on all public servers.

Explore GenomeArk

Who pays for computation?


The computational resources required for assembly are supported by public computational infrastructure. In turn, this computational infrastructure is brought to you by: