VGP-Galaxy assembly workflows

Overview
Where can I find these workflows?
How can I use these workflows?

The Vertebrate Genomes Project (VGP) pipelines in Galaxy let users generate high-quality, near error-free genome assemblies from their own data or from data in the GenomeArk database. These workflows use PacBio HiFi reads, Hi-C data, and, optionally, Bionano optical maps. The image below shows the major assembly workflows.

We maintain a set of comprehensive tutorials that explain the use of these workflows in minute detail. To access them, see the 'How can I use these workflows?' section at the bottom of this page.

Overview

Eight analysis trajectories are possible, depending on the combination of input data. The decision to invoke workflow 6 is based on the QC output of workflow 3, 4, or 5 (see below). The thicker lines connecting workflows 7, 8, and 9 indicate that these workflows are invoked separately for each phased assembly (once for maternal [or hap1] and once for paternal [or hap2]).

Solo = data is available only for the sample whose genome is being assembled. In this case, you can make either a pseudohaplotype assembly or, if you have HiC data from the same individual, a HiC-phased assembly.
Trio = parental information is available in the form of Illumina reads from each parent of the F1 individual being assembled.
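
The Solo/Trio choice described above can be sketched in a few lines of Python. This is only an illustration, not part of the workflows themselves: the names `AvailableData` and `plan_contiging` are hypothetical, the workflow numbers come from the table in the next section, and the `purge` flag stands in for a decision the user makes after inspecting the QC output. Workflows 7, 8, and 9 are left out on purpose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailableData:
    """Hypothetical description of the data at hand (HiFi reads are always required)."""
    parental_illumina: bool = False  # Trio: Illumina reads from both parents
    hic: bool = False                # HiC reads from the same individual


def plan_contiging(data: AvailableData, purge: bool = False) -> list[int]:
    """Return workflow numbers for k-mer profiling and contiging, in order.

    `purge` is set by the user after reviewing the QC output of
    workflow 3, 4, or 5; no automatic criterion is implied here.
    """
    if data.parental_illumina:
        steps = [2, 5]   # Trio k-mer profiling, then Trio contiging
    elif data.hic:
        steps = [1, 4]   # Solo k-mer profiling, then HiC-phased contiging
    else:
        steps = [1, 3]   # Solo k-mer profiling, then pseudohaplotype contiging
    if purge:
        steps.append(6)  # optional purging of duplicated contigs
    return steps


print(plan_contiging(AvailableData(hic=True), purge=True))
```

In this sketch, Trio data takes precedence over HiC data for phasing; that ordering is an assumption made for the example.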

Where can I find these workflows?

The latest versions of the workflows can be found on Dockstore:

Workflow	Link	Description	Inputs	Outputs
1	Dockstore	K-mer profiling: Create Meryl Database used for the estimation of assembly parameters and quality control with Merqury.	1. Hifi long reads [`fastq`]	1. Meryl Database of k-mer counts 2. GenomeScope summary, models and plots
2	Dockstore	K-mer profiling Trio: Create Meryl Database used for the estimation of assembly parameters and quality control with Merqury using parental and offspring datasets	1. Hifi long reads [`fastq`] 2. Paternal short-read Illumina sequencing reads [`fastq`] 3. Maternal short-read Illumina sequencing reads [`fastq`]	1. Meryl Database of k-mer counts 2. GenomeScope summary, models and plots
3	Dockstore	Contiging Solo: Generate phased assembly based on PacBio Hifi Reads	1. Hifi long reads [`fastq`] 2. K-mer database [`meryldb`] 3. Genome profile summary generated by Genomescope [`txt`] 4. Name of first assembly 5. Name of second assembly	1. Primary assembly 2. Alternate assembly 3. QC: BUSCO report for both assemblies 4. QC: Merqury report for both assemblies 5. QC: Assembly statistics for both assemblies 6. QC: Nx plot for both assemblies 7. QC: Size plot for both assemblies
4	Dockstore	Contiging Solo w/HiC: Generate phased assembly based on PacBio Hifi Reads using HiC data from the same individual for phasing	1. Hifi long reads [`fastq`] 2. HiC forward reads (if multiple input files, concatenated in same order as reverse reads) [`fastq`] 3. HiC reverse reads (if multiple input files, concatenated in same order as forward reads) [`fastq`] 4. K-mer database [`meryldb`] 5. Genome profile summary generated by Genomescope [`txt`] 6. Name of first assembly 7. Name of second assembly	1. Haplotype 1 assembly 2. Haplotype 2 assembly 3. QC: BUSCO report for both assemblies 4. QC: Merqury report for both assemblies 5. QC: Assembly statistics for both assemblies 6. QC: Nx plot for both assemblies 7. QC: Size plot for both assemblies
5	Dockstore	Contiging Trio: Generate phased assembly based on PacBio Hifi Reads using parental Illumina data for phasing	1. Hifi long reads [`fastq`] 2. Concatenated Illumina reads : Paternal [`fastq`] 3. Concatenated Illumina reads : Maternal [`fastq`] 4. K-mer database [`meryldb`] 5. Paternal hapmer database [`meryldb`] 6. Maternal hapmer database [`meryldb`] 7. Genome profile summary generated by Genomescope [`txt`] 8. Name of first haplotype 9. Name of second haplotype	1. Haplotype 1 assembly 2. Haplotype 2 assembly 3. QC: BUSCO report for both assemblies 4. QC: Merqury report for both assemblies 5. QC: Assembly statistics for both assemblies 6. QC: Nx plot for both assemblies 7. QC: Size plot for both assemblies
6	Dockstore	Purging: Purge contigs marked as duplicates by purge_dups (could be haplotypic duplication or overlap duplication)	1. Hifi long reads - trimmed [`fastq`] 2. Primary Assembly (hap1) [`fasta`] 3. Alternate Assembly (hap2) [`fasta`] 4. K-mer database [`meryldb`] 5. Genomescope model parameters [`txt`] 6. Estimated Genome Size [`txt`] 7. Name of first haplotype 8. Name of second haplotype	1. Haplotype 1 purged assembly 2. Haplotype 2 purged assembly 3. QC: BUSCO report for both assemblies 4. QC: Merqury report for both assemblies 5. QC: Assembly statistics for both assemblies 6. QC: Nx plot for both assemblies 7. QC: Size plot for both assemblies
7	Dockstore	Scaffolding Bionano: Scaffolding using Bionano optical map data	1. Scaffolded assembly [`fasta`] 2. Bionano data [`cmap`] 3. Estimated genome size [`txt`] 4. Phased assembly generated by Hifiasm [`gfa1`]	1. Scaffolds, and non-scaffolded contigs 2. QC: Assembly statistics 3. QC: Nx plot 4. QC: Size plot
8	Dockstore	Scaffolding HiC: Scaffolding using HiC data with YaHS (for a single haplotype)	1. Scaffolded assembly [`gfa`] 2. Concatenated HiC forward reads [`fastq`] 3. Concatenated HiC reverse reads [`fastq`] 4. Restriction enzyme sequence [`txt`] 5. Estimated genome size [`txt`]	1. Scaffolds 2. QC: Assembly statistics 3. QC: Nx plot 4. QC: Size plot 5. QC: BUSCO report 6. QC: Pretext Maps before and after scaffolding
9	Dockstore	Decontamination: Decontaminate scaffolded assembly	1. Scaffolded assembly [`fasta`]	1. Decontaminated assembly 2. Contaminant list

How can I use these workflows?

We provide and maintain two tutorials explaining how to perform large genome assembly in Galaxy:

Short tutorial: Describes the workflows, their functions, their inputs and outputs, and how to use them.
Long tutorial: Describes the assembly pipeline tool by tool for an in-depth understanding of the process.

If the Galaxy instance you are using is missing some of the tools used in a workflow, contact the administrators of that instance. Major instances such as https://usegalaxy.eu, https://usegalaxy.org, and https://usegalaxy.org.au already have all necessary tools installed.
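
One possible way to check for missing tools before contacting the administrators is sketched below. It assumes the third-party BioBlend library (`pip install bioblend`), an API key for the instance, and a workflow that has already been imported into your account. The function names `missing_tool_ids` and `check_instance` are hypothetical. The comparison is by exact tool ID, which normally includes the tool version, so a tool installed in a different version is also reported as missing.

```python
def missing_tool_ids(workflow: dict, installed_tools: list[dict]) -> list[str]:
    """Return tool IDs used by the workflow steps but absent from the instance.

    `workflow` has the shape returned by BioBlend's show_workflow();
    input steps carry a tool_id of None and are skipped.
    """
    required = {
        step["tool_id"]
        for step in workflow.get("steps", {}).values()
        if step.get("tool_id")
    }
    installed = {tool["id"] for tool in installed_tools}
    return sorted(required - installed)


def check_instance(url: str, api_key: str, workflow_id: str) -> list[str]:
    """Query a Galaxy instance through BioBlend (needs network access)."""
    from bioblend.galaxy import GalaxyInstance  # imported lazily: optional dependency

    gi = GalaxyInstance(url=url, key=api_key)
    workflow = gi.workflows.show_workflow(workflow_id)
    return missing_tool_ids(workflow, gi.tools.get_tools())
```

Calling `check_instance("https://usegalaxy.eu", "<your API key>", "<workflow id>")` would return the list of tool IDs to report; an empty list suggests every tool the workflow needs is installed.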