VGP Workflows
Overview
The Vertebrate Genomes Project (VGP) pipelines in Galaxy let users generate high-quality, near error-free genome assemblies from their own data or from data in the GenomeArk database. These workflows use PacBio HiFi reads, Hi-C data, and optionally Bionano optical maps. The image below shows the major assembly workflows as a decision tree. The numbers in the bottom-left corner of each workflow are clickable and link to a detailed description of the corresponding workflow.
Solo = data is only available for the sample whose genome is being assembled. In this case, you can make either a pseudohaplotype assembly, or a HiC-phased assembly if you have HiC data from the same individual.
Trio = parental information is available in the form of Illumina reads from each parent of the F1 being assembled.
Where can I find these workflows?
The latest versions of the workflows are located on the European Galaxy instance. We are still in the process of polishing and optimizing the workflows. Once our optimization work is complete, the workflows will be distributed via all main Galaxy instances as well as through the Dockstore and WorkflowHub platforms.
How can I download and use these workflows?
If you are using the European Galaxy instance you can use them directly:
- Log into your account on usegalaxy.eu
- Go to "Shared Data"→"Workflows"
- In the search box, enter "VGP_curated"
- Pick a workflow from the list
If you would like to use these workflows on a different instance, you can download them:
- Click the "EU latest" link in the tables describing the workflows. These tables are located further down this page.
- Click the "Download workflow" icon in the top right corner of the page
- Log in to the Galaxy instance you are planning to use
- Click "Workflows" at the top of the Galaxy interface
- Click the "Import" button
- Upload the workflow file you downloaded in step 2
- Click "Import workflow"
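
Before importing a downloaded workflow into another instance, it can be useful to check which inputs it expects. The sketch below assumes the downloaded file uses Galaxy's native JSON workflow format (a `.ga` file with a top-level `steps` mapping whose entries carry `type` and `label` fields). The function name `summarize_workflow` and the tiny example document are hypothetical and only stand in for a real download.

```python
import json

# Step types that represent user-supplied inputs in a native Galaxy workflow file.
INPUT_STEP_TYPES = {"data_input", "data_collection_input", "parameter_input"}


def summarize_workflow(ga_text):
    """Return (workflow name, list of input labels) from .ga JSON text."""
    workflow = json.loads(ga_text)
    # Steps are keyed by string indices; sort by numeric id for a stable order.
    steps = sorted(workflow.get("steps", {}).values(), key=lambda s: s.get("id", 0))
    inputs = [
        step.get("label") or step.get("name", "")
        for step in steps
        if step.get("type") in INPUT_STEP_TYPES
    ]
    return workflow.get("name", ""), inputs


# Hand-written stand-in for a downloaded file (NOT a real VGP workflow).
example_ga = json.dumps({
    "a_galaxy_workflow": "true",
    "name": "Example workflow",
    "steps": {
        "0": {"id": 0, "type": "data_collection_input", "label": "HiFi long reads"},
        "1": {"id": 1, "type": "tool", "label": None, "name": "Some tool"},
    },
})

name, inputs = summarize_workflow(example_ga)
print(name, inputs)
```

For a real file, read its contents with `open(path).read()` and pass the text to `summarize_workflow`.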
Workflow 1
K-mer profiling for Solo datasets
Link | Description | Inputs | Outputs | Example History |
---|---|---|---|---|
EU latest | MerylDB generation: Create the Meryl database used for the estimation of assembly parameters and for quality control with Merqury. | 1. HiFi long reads [fastq] | 1. Meryl database of k-mer counts 2. GenomeScope summary, models and plots | EU |
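
To make the idea of a k-mer count database concrete, here is a toy sketch of k-mer counting and of the multiplicity histogram that a profiling tool such as GenomeScope takes as input. It is not how Meryl is implemented. Counting canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement) and skipping k-mers with non-ACGT characters are assumptions of this example, and the function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers over all reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def multiplicity_histogram(counts):
    """Map each multiplicity to the number of distinct k-mers seen that many times."""
    return Counter(counts.values())


counts = count_kmers(["ACGTAC", "GTACGT"], k=3)
histogram = multiplicity_histogram(counts)
print(dict(counts), dict(histogram))
```

Real datasets use a larger k (the examples on this page use 21-mers) and far more reads, which is why a dedicated counter is used in the workflow.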
Workflow 2
K-mer profiling for Trio datasets
Link | Description | Inputs | Outputs | Example History |
---|---|---|---|---|
EU latest | MerylDB generation Trio: Create the Meryl database used for the estimation of assembly parameters and for quality control with Merqury using parental and offspring datasets | 1. HiFi long reads [fastq] 2. Paternal short-read Illumina sequencing reads [fastq] 3. Maternal short-read Illumina sequencing reads [fastq] | 1. Meryl database of k-mer counts 2. GenomeScope summary, models and plots | EU |
Workflow 3
Contigging Solo
Link | Description | Inputs | Outputs | Example History |
---|---|---|---|---|
EU latest | Long Read Assembly with Hifiasm: Generate phased assembly based on PacBio HiFi reads | 1. HiFi long reads [fastq] 2. K-mer database [meryldb] 3. Genome profile summary generated by GenomeScope [txt] 4. Name of first assembly 5. Name of second assembly | 1. Primary assembly 2. Alternate assembly 3. QC: BUSCO report for both assemblies 4. QC: Merqury report for both assemblies 5. QC: Assembly statistics for both assemblies 6. QC: Nx plot for both assemblies 7. QC: Size plot for both assemblies | EU |
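
The Nx plot listed among the QC outputs is built from Nx values: the contig length L such that contigs of length L or longer together cover at least x% of the total assembly length (N50 is the x = 50 case). A minimal sketch of that calculation follows; the function name `nx` is hypothetical and this is not the code the workflow itself runs.

```python
def nx(lengths, x):
    """Return the Nx value for a list of contig lengths (x in percent, 0 < x <= 100)."""
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until x% of the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= total * x:
            return length
    return 0  # empty assembly


contig_lengths = [10, 8, 5, 4, 3]
n50 = nx(contig_lengths, 50)
n90 = nx(contig_lengths, 90)
print(n50, n90)
```

Plotting `nx(lengths, x)` for x from 1 to 100 gives the curve shown in an Nx plot.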
Workflow 4
Contigging Solo with HiC
Link | Description | Inputs | Outputs | Example History |
---|---|---|---|---|
EU latest | Long Read Assembly with Hifiasm and HiC: Generate phased assembly based on PacBio HiFi reads using HiC data from the same individual for phasing | 1. HiFi long reads [fastq] 2. HiC forward reads (if multiple input files, concatenated in same order as reverse reads) [fastq] 3. HiC reverse reads (if multiple input files, concatenated in same order as forward reads) [fastq] 4. K-mer database [meryldb] 5. Genome profile summary generated by GenomeScope [txt] 6. Name of first assembly 7. Name of second assembly | 1. Haplotype 1 assembly 2. Haplotype 2 assembly 3. QC: BUSCO report for both assemblies 4. QC: Merqury report for both assemblies 5. QC: Assembly statistics for both assemblies 6. QC: Nx plot for both assemblies 7. QC: Size plot for both assemblies | EU |
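
The requirement that forward and reverse HiC files be concatenated in the same order matters because read pairs are matched by position. One way to guarantee this outside Galaxy is to concatenate from an explicit list of (forward, reverse) pairs, as in the sketch below. The function name and file names are hypothetical. Byte-wise concatenation is assumed to be acceptable, which holds for plain FASTQ and for gzip-compressed FASTQ.

```python
import shutil


def concatenate_pairs(pairs, forward_out, reverse_out):
    """Concatenate paired files so both outputs follow the same file order.

    pairs: list of (forward_path, reverse_path) tuples, in the desired order.
    """
    with open(forward_out, "wb") as fwd, open(reverse_out, "wb") as rev:
        for forward_path, reverse_path in pairs:
            # Each iteration appends one file to each output, keeping them in step.
            with open(forward_path, "rb") as src:
                shutil.copyfileobj(src, fwd)
            with open(reverse_path, "rb") as src:
                shutil.copyfileobj(src, rev)
```

A call might look like `concatenate_pairs([("lib1_R1.fastq.gz", "lib1_R2.fastq.gz"), ("lib2_R1.fastq.gz", "lib2_R2.fastq.gz")], "hic_forward.fastq.gz", "hic_reverse.fastq.gz")`.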
Workflow 5
Contigging Trio
Link | Description | Inputs | Outputs | Example History |
---|---|---|---|---|
EU latest | Long Read Assembly with Hifiasm and Trio data: Generate phased assembly based on PacBio HiFi reads using parental Illumina data for phasing | 1. HiFi long reads [fastq] 2. Concatenated Illumina reads: Paternal [fastq] 3. Concatenated Illumina reads: Maternal [fastq] 4. K-mer database [meryldb] 5. Paternal hapmer database [meryldb] 6. Maternal hapmer database [meryldb] 7. Genome profile summary generated by GenomeScope [txt] 8. Name of first haplotype 9. Name of second haplotype | 1. Haplotype 1 assembly 2. Haplotype 2 assembly 3. QC: BUSCO report for both assemblies 4. QC: Merqury report for both assemblies 5. QC: Assembly statistics for both assemblies 6. QC: Nx plot for both assemblies 7. QC: Size plot for both assemblies | EU |
Workflow 6
Purging duplicates
Link | Description | Inputs | Outputs | Example History |
---|---|---|---|---|
EU latest | Purge Duplications: Purge contigs marked as duplicates by purge_dups (could be haplotypic duplication or overlap duplication) | 1. HiFi long reads - trimmed [fastq] 2. Primary Assembly (hap1) [fasta] 3. Alternate Assembly (hap2) [fasta] 4. K-mer database [meryldb] 5. GenomeScope model parameters [txt] 6. Estimated Genome Size [txt] 7. Name of first haplotype 8. Name of second haplotype | 1. Haplotype 1 purged assembly 2. Haplotype 2 assembly 3. QC: BUSCO report for both assemblies 4. QC: Merqury report for both assemblies 5. QC: Assembly statistics for both assemblies 6. QC: Nx plot for both assemblies 7. QC: Size plot for both assemblies | EU |
Workflow 7
Scaffolding with BioNano data
Link | Description | Inputs | Outputs | Example history |
---|---|---|---|---|
EU latest | Bionano Scaffolding: Scaffolding using Bionano optical map data | 1. Scaffolded assembly [fasta] 2. Bionano data [cmap] 3. Estimated genome size [txt] 4. Phased assembly generated by Hifiasm [gfa1] | 1. Scaffolds, and non-scaffolded contigs 2. QC: Assembly statistics 3. QC: Nx plot 4. QC: Size plot | EU |
Workflow 8
Scaffolding with HiC
Link | Description | Inputs | Outputs | Example history |
---|---|---|---|---|
EU latest | HiC Scaffolding with YaHS: Scaffolding using HiC data with YaHS | 1. Scaffolded assembly [fasta] 2. Concatenated HiC forward reads [fastq] 3. Concatenated HiC reverse reads [fastq] 4. Restriction enzyme sequence [txt] 5. Estimated genome size [txt] | 1. Scaffolds 2. QC: Assembly statistics 3. QC: Nx plot 4. QC: Size plot 5. QC: BUSCO report 6. QC: Pretext Maps before and after scaffolding | EU |
Workflow 9
Decontamination
Link | Description | Inputs | Outputs |
---|---|---|---|
EU latest | Decontamination: Decontaminate scaffolded assembly | 1. Scaffolded assembly [fasta] | 1. Decontaminated assembly 2. Contaminant list |
Examples of use
The following examples use zebra finch (Taeniopygia guttata) data from GenomeArk to demonstrate the assembly process across different data availability scenarios.
Solo only
Input(s): PacBio HiFi data
Workflow trajectory
1 → 3 → 6 → 9
Results
Workflow | Outputs | History link |
---|---|---|
K-mer profiling 1 | GenomeScope profile using 21-mers on HiFi reads for zebra finch | EU |
Contigging 3 | BUSCO results for the primary assembly from workflow 3, run with HiFi data only and with hifiasm purging turned off. The large number of duplicated BUSCO genes indicates a need for purging. Merqury spectra-cn plot. Spectra are colored by k-mer count in the assemblies (considered together). The presence of k-mers seen three times across the two assemblies (the slight green peak) at diploid k-mer multiplicity (~30-40 on the x-axis) could also indicate a need for purging. Merqury spectra-asm plot, which colors the spectra according to the assembly the k-mers came from. This plot indicates that most of the k-mers are found in the primary assembly, meaning the primary and alternate assemblies are unbalanced. This can be rectified by purging. | EU |
Purging duplicates 6 | BUSCO results for the purged primary assembly. Purging has removed many of the duplicated genes, which were shown in the darker blue color. Spectra-cn plot for purged assemblies. Slight overpurging can be detected from the new gray read-only peak. Spectra-asm plot for purged assemblies. The primary and alternate assemblies have, on the whole, been reconciled, as can be seen from the green shared peak at diploid coverage, indicating that homozygous regions are represented in both assemblies. Nx and Size plots. | EU |
Solo with HiC
Input(s): (1) HiFi data and (2) HiC data for the same individual
Workflow trajectory
1 → 4 → 6 → 8a → 9
Results (shown for one of two haplotypes)
Workflow | Outputs | History link |
---|---|---|
Contigging 4 | BUSCO results for one of the haplotypes produced by hifiasm with HiC phasing. There are far fewer duplicated BUSCO genes than in the primary assembly without purging, indicating that the HiC data was effective at phasing the haplotypes. Spectra-cn plot for the HiC-phased assembly. There are still some 3-copy k-mers that one could try to address with purging. Spectra-asm plot for the HiC-phased assembly. It looks much like the spectra-asm plot for the purged assembly, showing that the two haplotypes are reconciled without a need for purging when HiC is used for phasing. Nx and Size plots. | EU |
Purging duplicates 6 | BUSCO results for a HiC-phased haplotype after purging. Spectra-cn plot for the HiC-phased haplotypes after purging. Potential overpurging can be seen from the new read-only bump that was not there before. Spectra-asm plot for the HiC-phased assemblies after purging. Nx and Size plots. | EU |
Scaffolding with HiC 8a | BUSCO results for a HiC-phased haplotype after purging and scaffolding with HiC data. BUSCO results typically do not change much after HiC scaffolding. PretextMap for HiC-phased contigs after purging, but before HiC scaffolding. PretextMap for HiC-phased contigs after HiC scaffolding. | EU |
Trio only
Input(s): (1) HiFi data for the child, (2) Illumina data for the paternal individual, and (3) Illumina data for the maternal individual.
Workflow trajectory
2 → 5 → 6 → 9
Results
Workflow | Outputs | History link |
---|---|---|
K-mer profiling 2 | GenomeScope profile for Child (top), Father (middle), and Mother (bottom) using 21-mers on HiFi reads for zebra finch | EU |
Contigging 5 | BUSCO results for one of the haplotypes resulting from using hifiasm with Trio data. Spectra-cn plot for the phased Trio assembly. Spectra-asm plot for the phased Trio assembly. Nx and Size plots. | EU |
Purging duplicates 6 | BUSCO results for one of the haplotypes resulting from purging duplicates. Spectra-cn plot for the phased Trio assembly after purging duplicates. Spectra-asm plot for the phased Trio assembly after purging duplicates. Nx and Size plots. | EU |
Trio with HiC
Input(s): (1) HiFi data, (2) Illumina data for the paternal individual, (3) Illumina data for the maternal individual, and (4) HiC data.
Workflow trajectory
2 → 5 → 6 → 8a → 9
Results
Workflow | Outputs | History link |
---|---|---|
Scaffolding with HiC 8a (Data for primary assembly only!) | BUSCO results for one of the haplotypes after scaffolding. PretextMap for HiC-phased contigs after purging, but before scaffolding. PretextMap for HiC-phased contigs after scaffolding. | EU |
Trio with BioNano
Input(s): (1) HiFi data, (2) Illumina data for the paternal individual, (3) Illumina data for the maternal individual, and (4) BioNano data.
Workflow trajectory
2 → 5 → 6 → 7 → 9
Results
Workflow | Outputs | History link |
---|---|---|
Scaffolding with BioNano 7 (Data for primary assembly only!) | | EU |
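
The workflow trajectories listed in the examples above can be summarized as a lookup from the available data types to the sequence of workflows. The sketch below only restates those five trajectories. The data-type keys (`"hifi"`, `"hic"`, `"trio"`, `"bionano"`) and the function name are hypothetical labels chosen for illustration.

```python
# Workflow numbers are kept as strings because one step is labeled "8a".
TRAJECTORIES = {
    frozenset({"hifi"}): ["1", "3", "6", "9"],                             # Solo only
    frozenset({"hifi", "hic"}): ["1", "4", "6", "8a", "9"],                # Solo with HiC
    frozenset({"hifi", "trio"}): ["2", "5", "6", "9"],                     # Trio only
    frozenset({"hifi", "trio", "hic"}): ["2", "5", "6", "8a", "9"],        # Trio with HiC
    frozenset({"hifi", "trio", "bionano"}): ["2", "5", "6", "7", "9"],     # Trio with BioNano
}


def trajectory(available_data):
    """Return the workflow sequence for a set of available data types."""
    key = frozenset(available_data)
    if key not in TRAJECTORIES:
        raise KeyError(f"No example trajectory for data types: {sorted(key)}")
    return TRAJECTORIES[key]


print(" -> ".join(trajectory({"hifi", "hic"})))
```

Combinations not covered by the examples raise a `KeyError` rather than guessing a trajectory.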