VGP Workflows
Overview
The Vertebrate Genomes Project (VGP) pipelines in Galaxy let a user generate high-quality, near error-free genome assemblies from the user's own data or from data in the GenomeArk database. These workflows use PacBio HiFi reads, Hi-C data, and, optionally, BioNano optical maps. The image below shows the major assembly workflows.
Eight analysis trajectories are possible depending on the combination of input data. Whether to run workflow 6 is decided by analyzing the QC output of workflows 3, 4, or 5 (see below). The thicker lines connecting workflows 7, 8, and 9 indicate that these workflows are invoked separately for each phased assembly (once for maternal [or hap1] and once for paternal [or hap2]).
Solo = data is available only for the sample whose genome is being assembled. In this case, you can make either a pseudohaplotype assembly or, if you have HiC data from the same individual, a HiC-phased assembly.
Trio = parental information is available in the form of Illumina reads from each parent of the F1 being assembled.
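As a rough illustration of how the input data determine the order of workflows, the sketch below encodes only the five example trajectories listed under "Examples of use" in this document. The function name, the flag names, and the lookup table are hypothetical and are not part of the VGP workflows. Workflow 6 appears in every example, but in practice it is run only when the QC output calls for it.

```python
from typing import Dict, List, Tuple

# Keys are (trio, hic, bionano) flags; values are workflow numbers in run order,
# copied from the example trajectories in this document.
EXAMPLE_TRAJECTORIES: Dict[Tuple[bool, bool, bool], List[str]] = {
    (False, False, False): ["1", "3", "6", "9"],        # Solo only
    (False, True, False): ["1", "4", "6", "8a", "9"],   # Solo with HiC
    (True, False, False): ["2", "5", "6", "9"],         # Trio only
    (True, True, False): ["2", "5", "6", "8a", "9"],    # Trio with HiC
    (True, False, True): ["2", "5", "6", "7", "9"],     # Trio with BioNano
}


def example_trajectory(trio: bool, hic: bool = False, bionano: bool = False) -> List[str]:
    """Return the documented example trajectory for a combination of inputs.

    Combinations without a documented example raise ValueError rather than
    guessing at an order.
    """
    key = (trio, hic, bionano)
    if key not in EXAMPLE_TRAJECTORIES:
        raise ValueError("No documented example for this combination of inputs")
    return list(EXAMPLE_TRAJECTORIES[key])


if __name__ == "__main__":
    print(" → ".join(example_trajectory(trio=False, hic=True)))
```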
How can I download and use these workflows?
If you are using any of the usegalaxy.* instances (https://usegalaxy.org, https://usegalaxy.eu, https://usegalaxy.org.au), log into your account and follow the steps in this video:
What do individual workflows do?
Workflow | Link | Description | Inputs | Outputs | Example History |
---|---|---|---|---|---|
1 | Dockstore | K-mer profiling: Create the Meryl database used for estimating assembly parameters and for quality control with Merqury. | 1. HiFi long reads [ fastq ] | 1. Meryl database of k-mer counts 2. GenomeScope summary, models, and plots | EU |
2 | Dockstore | K-mer profiling Trio: Create the Meryl database used for estimating assembly parameters and for quality control with Merqury, using parental and offspring datasets | 1. HiFi long reads [ fastq ] 2. Paternal short-read Illumina sequencing reads [ fastq ] 3. Maternal short-read Illumina sequencing reads [ fastq ] | 1. Meryl database of k-mer counts 2. GenomeScope summary, models, and plots | EU |
3 | Dockstore | Contigging Solo: Generate phased assembly based on PacBio HiFi reads | 1. HiFi long reads [ fastq ] 2. K-mer database [ meryldb ] 3. Genome profile summary generated by GenomeScope [ txt ] 4. Name of first assembly 5. Name of second assembly | 1. Primary assembly 2. Alternate assembly 3. QC: BUSCO report for both assemblies 4. QC: Merqury report for both assemblies 5. QC: Assembly statistics for both assemblies 6. QC: Nx plot for both assemblies 7. QC: Size plot for both assemblies | EU |
4 | Dockstore | Contigging Solo w/HiC: Generate phased assembly based on PacBio HiFi reads, using HiC data from the same individual for phasing | 1. HiFi long reads [ fastq ] 2. HiC forward reads (if multiple input files, concatenated in same order as reverse reads) [ fastq ] 3. HiC reverse reads (if multiple input files, concatenated in same order as forward reads) [ fastq ] 4. K-mer database [ meryldb ] 5. Genome profile summary generated by GenomeScope [ txt ] 6. Name of first assembly 7. Name of second assembly | 1. Haplotype 1 assembly 2. Haplotype 2 assembly 3. QC: BUSCO report for both assemblies 4. QC: Merqury report for both assemblies 5. QC: Assembly statistics for both assemblies 6. QC: Nx plot for both assemblies 7. QC: Size plot for both assemblies | EU |
5 | Dockstore | Contigging Trio: Generate phased assembly based on PacBio HiFi reads, using parental Illumina data for phasing | 1. HiFi long reads [ fastq ] 2. Concatenated Illumina reads: Paternal [ fastq ] 3. Concatenated Illumina reads: Maternal [ fastq ] 4. K-mer database [ meryldb ] 5. Paternal hapmer database [ meryldb ] 6. Maternal hapmer database [ meryldb ] 7. Genome profile summary generated by GenomeScope [ txt ] 8. Name of first haplotype 9. Name of second haplotype | 1. Haplotype 1 assembly 2. Haplotype 2 assembly 3. QC: BUSCO report for both assemblies 4. Merqury report for both assemblies 5. Assembly statistics for both assemblies 6. Nx plot for both assemblies 7. Size plot for both assemblies | EU |
6 | Dockstore | Purging: Purge contigs marked as duplicates by purge_dups (could be haplotypic duplication or overlap duplication) | 1. HiFi long reads - trimmed [ fastq ] 2. Primary assembly (hap1) [ fasta ] 3. Alternate assembly (hap2) [ fasta ] 4. K-mer database [ meryldb ] 5. GenomeScope model parameters [ txt ] 6. Estimated genome size [ txt ] 7. Name of first haplotype 8. Name of second haplotype | 1. Haplotype 1 purged assembly 2. Haplotype 2 purged assembly 3. QC: BUSCO report for both assemblies 4. QC: Merqury report for both assemblies 5. QC: Assembly statistics for both assemblies 6. QC: Nx plot for both assemblies 7. QC: Size plot for both assemblies | EU |
6Pri | Dockstore | Purging with custom cutoffs for PRIMARY assembly: Purge contigs marked as duplicates by purge_dups, using custom cutoffs when the automatic detection was not satisfactory. | 1. HiFi long reads - trimmed [ fastq ] 2. Primary assembly (hap1) [ fasta ] 3. Alternate assembly (hap2) [ fasta ] 4. K-mer database [ meryldb ] 5. GenomeScope model parameters [ txt ] 6. Auto-alignment of primary assembly [ paf ] 7. Cutoffs for primary assembly 8. PBCSTATS base coverage for primary assembly [ tab ] 9. Estimated genome size [ txt ] 10. Name of first haplotype 11. Name of second haplotype | 1. Haplotype 1 purged assembly 2. Haplotype 2 purged assembly 3. QC: BUSCO report for both assemblies 4. QC: Merqury report for both assemblies 5. QC: Assembly statistics for both assemblies 6. QC: Nx plot for both assemblies 7. QC: Size plot for both assemblies | EU |
6Alt | Dockstore | Purging with custom cutoffs for ALTERNATE assembly: Purge contigs marked as duplicates by purge_dups, using custom cutoffs when the automatic detection was not satisfactory. | 1. Hifiasm alternate assembly + sequences purged from the primary assembly 2. PBCSTATS base coverage for alternate assembly [ tab ] 3. Cutoffs file for alternate assembly 4. Estimated genome size [ txt ] 5. K-mer database [ meryldb ] 6. Hifiasm purged primary assembly 7. Name of first haplotype 8. Name of second haplotype | 1. Haplotype 1 purged assembly 2. Haplotype 2 purged assembly 3. QC: BUSCO report for both assemblies 4. QC: Merqury report for both assemblies 5. QC: Assembly statistics for both assemblies 6. QC: Nx plot for both assemblies 7. QC: Size plot for both assemblies | EU |
7 | Dockstore | Scaffolding Bionano: Scaffolding using Bionano optical map data | 1. Scaffolded assembly [ fasta ] 2. Bionano data [ cmap ] 3. Estimated genome size [ txt ] 4. Phased assembly generated by Hifiasm [ gfa1 ] | 1. Scaffolds and non-scaffolded contigs 2. QC: Assembly statistics 3. QC: Nx plot 4. QC: Size plot | EU |
8 | Dockstore | Scaffolding HiC: Scaffolding using HiC data with YAHS | 1. Scaffolded assembly [ fasta ] 2. Concatenated HiC forward reads [ fastq ] 3. Concatenated HiC reverse reads [ fastq ] 4. Restriction enzyme sequence [ txt ] 5. Estimated genome size [ txt ] | 1. Scaffolds 2. QC: Assembly statistics 3. QC: Nx plot 4. QC: Size plot 5. QC: BUSCO report 6. QC: Pretext maps before and after scaffolding | EU |
9 | Dockstore | Decontamination: Decontaminate scaffolded assembly | 1. Scaffolded assembly [ fasta ] | 1. Decontaminated assembly 2. Contaminant list | |
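Several workflows expect HiC forward and reverse reads that, when they come from multiple files, have been concatenated in the same order. The sketch below shows one way to do that outside Galaxy. The function name and file layout are hypothetical, and it assumes that all inputs share the same format (all plain FASTQ or all gzip-compressed), since it copies bytes without inspecting them.

```python
import shutil
from pathlib import Path
from typing import Iterable, Tuple, Union

PathLike = Union[str, Path]


def concatenate_hic_pairs(
    pairs: Iterable[Tuple[PathLike, PathLike]],
    forward_out: PathLike,
    reverse_out: PathLike,
) -> None:
    """Concatenate (forward, reverse) file pairs into two files, keeping pair order.

    Iterating over pairs, rather than over two separate lists, guarantees that
    the forward and reverse outputs are assembled in the same order.
    """
    with open(forward_out, "wb") as fwd, open(reverse_out, "wb") as rev:
        for forward_path, reverse_path in pairs:
            with open(forward_path, "rb") as src:
                shutil.copyfileobj(src, fwd)  # append this forward file
            with open(reverse_path, "rb") as src:
                shutil.copyfileobj(src, rev)  # append the matching reverse file
```

Byte-wise concatenation is valid for plain FASTQ and for gzip files, because a sequence of gzip members is itself a valid gzip stream.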
Examples of use
The following examples use zebra finch (Taeniopygia guttata) data from GenomeArk to demonstrate the assembly process across different data availability scenarios.
Solo only
Input(s): PacBio HiFi data
Workflow trajectory
1 → 3 → 6 → 9
Results
Workflow | Outputs | History link |
---|---|---|
K-mer profiling 1 | GenomeScope profile using 21-mers on HiFi reads for zebra finch | EU |
Contigging 3 | BUSCO results for the primary assembly from workflow 3, which uses only HiFi data with hifiasm purging turned off. The large number of duplicated BUSCO genes indicates a need for purging. Merqury spectra-cn plot. Spectra are colored by how many times each k-mer occurs in the assemblies (considered together). The k-mers seen three times across the two assemblies (the slight green peak) at diploid k-mer multiplicity (~30-40 on the x-axis) could also indicate a need for purging. Merqury spectra-asm plot, which colors the spectra according to the assembly the k-mers came from. This plot shows that most of the k-mers are found in the primary assembly, meaning the primary and alternate assemblies are unbalanced. This can be rectified by purging. | EU |
Purging duplicates 6 | BUSCO results for the purged primary assembly. Purging has removed most of the duplicated genes, which were shown in the darker blue color. Spectra-cn plot for the purged assemblies. Slight overpurging can be detected from the new gray read-only peak. Spectra-asm plot for the purged assemblies. The primary and alternate assemblies have, on the whole, been reconciled, as shown by the green shared peak at diploid coverage, which indicates that homozygous regions are represented in both assemblies. Nx and Size plots. | EU |
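The BUSCO duplication signal described above can also be checked programmatically. The sketch below parses a BUSCO one-line summary of the classic form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` and flags assemblies whose duplicated fraction exceeds a threshold. The function names are hypothetical, the 5% default threshold is an arbitrary illustration rather than a VGP recommendation, and the summary string in the usage example is made up. Newer BUSCO versions may add fields to this line, so the pattern may need adjusting.

```python
import re
from typing import Dict

# Classic BUSCO one-line summary, e.g. "C:95.0%[S:70.0%,D:25.0%],F:2.0%,M:3.0%,n:8338"
BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line: str) -> Dict[str, float]:
    """Extract the C, S, D, F, M percentages and n from a BUSCO summary line."""
    match = BUSCO_PATTERN.search(line)
    if match is None:
        raise ValueError("Line does not look like a BUSCO one-line summary")
    return {name: float(value) for name, value in match.groupdict().items()}


def suggests_purging(line: str, max_duplicated_pct: float = 5.0) -> bool:
    """Return True when the duplicated BUSCO percentage exceeds the threshold.

    The default threshold is an assumption chosen for illustration only.
    """
    return parse_busco_summary(line)["D"] > max_duplicated_pct


if __name__ == "__main__":
    # Made-up summary line, not a result from the histories in this document.
    print(suggests_purging("C:95.0%[S:70.0%,D:25.0%],F:2.0%,M:3.0%,n:8338"))
```

A check like this only covers the BUSCO part of the decision; the Merqury spectra plots still need to be inspected by eye.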
Solo with HiC
Input(s): (1) HiFi data and (2) HiC data for the same individual
Workflow trajectory
1 → 4 → 6 → 8a → 9
Results (shown for one of two haplotypes)
Workflow | Outputs | History link |
---|---|---|
Contigging 4 | BUSCO results for one of the haplotypes produced by hifiasm with HiC phasing. There are far fewer duplicated BUSCO genes than in the primary assembly without purging, indicating that HiC phasing separated the haplotypes effectively. Spectra-cn plot for the HiC-phased assembly. There are still some 3-copy k-mers that one could try to address with purging. Spectra-asm plot for the HiC-phased assembly. It looks much like the spectra-asm plot for the purged assembly, showing that with HiC phasing the two haplotypes are reconciled without a need for purging. Nx and Size plots. | EU |
Purging duplicates 6 | BUSCO results for a HiC-phased haplotype after purging. Spectra-cn plot for the HiC-phased haplotypes after purging. Potential overpurging can be seen in the new read-only bump that was not there before. Spectra-asm plot for the HiC-phased assemblies after purging. Nx and Size plots. | EU |
Scaffolding with HiC 8a | BUSCO results for a HiC-phased haplotype after purging and scaffolding with HiC data. BUSCO results typically do not change much after HiC scaffolding. PretextMap for HiC-phased contigs after purging, but before HiC scaffolding. PretextMap for HiC-phased contigs after HiC scaffolding. | EU |
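The Nx plots listed among the QC outputs summarize contiguity: Nx is the length of the shortest sequence in the smallest set of longest sequences that together cover at least x% of the assembly. The sketch below computes that value from a list of sequence lengths using the standard definition. It is an illustration only, not the code the workflows use to draw their plots, and the function name is hypothetical.

```python
from typing import Sequence


def nx(lengths: Sequence[int], x: int) -> int:
    """Return the Nx value (for example N50 when x=50) for sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < x <= 100:
        raise ValueError("x must be in the range 1..100")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the boundary.
        if running * 100 >= total * x:
            return length
    raise AssertionError("unreachable: the full set always covers 100%")


if __name__ == "__main__":
    toy_lengths = [10, 8, 6, 4, 2]  # toy values, not real contig lengths
    print([nx(toy_lengths, x) for x in (50, 90, 100)])
```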
Trio Only
Input(s): (1) HiFi data for the child, (2) Illumina data for the paternal individual, and (3) Illumina data for the maternal individual.
Workflow trajectory
2 → 5 → 6 → 9
Results
Workflow | Outputs | History link |
---|---|---|
K-mer profiling 2 | GenomeScope profile for Child (top), Father (middle), and Mother (bottom) using 21-mers on HiFi reads for zebra finch | EU |
Contigging 5 | BUSCO results for one of the haplotypes resulting from using hifiasm with Trio data. Spectra-cn plot for the phased Trio assembly. Spectra-asm plot for the phased Trio assembly. Nx and Size plots. | EU |
Purging duplicates 6 | BUSCO results for one of the haplotypes resulting from purging duplicates. Spectra-cn plot for the phased Trio assembly after purging duplicates. Spectra-asm plot for the phased Trio assembly after purging duplicates. Nx and Size plots. | EU |
Trio with HiC
Input(s): (1) HiFi data, (2) Illumina data for the paternal individual, (3) Illumina data for the maternal individual, and (4) HiC data.
Workflow trajectory
2 → 5 → 6 → 8a → 9
Results
Workflow | Outputs | History link |
---|---|---|
Scaffolding with HiC 8a (Data for primary assembly only!) | BUSCO results for one of the haplotypes after scaffolding. PretextMap for HiC-phased contigs after purging, but before scaffolding. PretextMap for HiC-phased contigs after scaffolding. | EU |
Trio with BioNano
Input(s): (1) HiFi data, (2) Illumina data for the paternal individual, (3) Illumina data for the maternal individual, and (4) BioNano data.
Workflow trajectory
2 → 5 → 6 → 7 → 9
Results
Workflow | Outputs | History link |
---|---|---|
Scaffolding with BioNano 7 (Data for primary assembly only!) | | EU |