VGP-Galaxy assembly workflows

The Vertebrate Genomes Project (VGP) pipelines in Galaxy allow users to generate high-quality, near error-free genome assemblies from their own data or from data in the GenomeArk database. The workflows use PacBio HiFi reads, Hi-C data, and, optionally, Bionano optical maps. The image below shows the major assembly workflows.

Overview


Eight analysis trajectories are possible, depending on the combination of input data. The decision to invoke workflow 6 is based on the QC output of workflows 3, 4, or 5 (see below). The thicker lines connecting workflows 7, 8, and 9 indicate that these workflows are invoked separately for each phased assembly (once for maternal [or hap1] and once for paternal [or hap2]).
Solo = data are available only for the individual whose genome is being assembled. In this case, you can make either a pseudohaplotype assembly or, if you have Hi-C data from the same individual, a Hi-C-phased assembly.
Trio = parental information is available in the form of Illumina reads from each parent of the F1 being assembled.
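
One way to read the trajectories described above is as a function of which data are available. The sketch below is an illustration only: the function name `plan_workflows` and the exact ordering are assumptions based on the workflow table in this README, not on the overview image itself. Workflow 6 (purging) is left out because it is triggered by QC results rather than by the input data.

```python
from itertools import product


def plan_workflows(parental_reads: bool, hic: bool, bionano: bool) -> list[int]:
    """Return workflow numbers, in order, for one combination of input data.

    Assumed mapping (hypothetical, derived from the workflow table):
    trio data select workflows 2 and 5; solo data select workflow 1 plus
    workflow 4 (with Hi-C) or workflow 3 (without Hi-C). Workflow 6 is
    omitted because it depends on QC output, not on the inputs.
    """
    if parental_reads:
        steps = [2, 5]  # K-mer profiling Trio, Contiging Trio
    elif hic:
        steps = [1, 4]  # K-mer profiling, Contiging Solo w/HiC
    else:
        steps = [1, 3]  # K-mer profiling, Contiging Solo
    if bionano:
        steps.append(7)  # Scaffolding Bionano
    if hic:
        steps.append(8)  # Scaffolding HiC
    steps.append(9)  # Decontamination (assumed to run in every trajectory)
    return steps


# Enumerate every combination of the three optional data types.
trajectories = {
    flags: plan_workflows(*flags)
    for flags in product([False, True], repeat=3)
}
```

Under these assumptions the three yes/no choices (parental reads, Hi-C, Bionano) give eight distinct workflow sequences, which matches the number of trajectories stated above. Remember that workflows 7, 8, and 9 would be run once per haplotype for phased assemblies.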

Where can I find these workflows?


The latest versions of the workflows can be found on Dockstore:

Workflow 1: K-mer profiling
Link: Dockstore
Description: Create the Meryl database used for the estimation of assembly parameters and for quality control with Merqury.
Inputs:
1. HiFi long reads [fastq]
Outputs:
1. Meryl database of k-mer counts
2. GenomeScope summary, models and plots

Workflow 2: K-mer profiling Trio
Link: Dockstore
Description: Create the Meryl database used for the estimation of assembly parameters and for quality control with Merqury, using parental and offspring datasets.
Inputs:
1. HiFi long reads [fastq]
2. Paternal short-read Illumina sequencing reads [fastq]
3. Maternal short-read Illumina sequencing reads [fastq]
Outputs:
1. Meryl database of k-mer counts
2. GenomeScope summary, models and plots

Workflow 3: Contiging Solo
Link: Dockstore
Description: Generate a phased assembly based on PacBio HiFi reads.
Inputs:
1. HiFi long reads [fastq]
2. K-mer database [meryldb]
3. Genome profile summary generated by GenomeScope [txt]
4. Name of first assembly
5. Name of second assembly
Outputs:
1. Primary assembly
2. Alternate assembly
3. QC: BUSCO report for both assemblies
4. QC: Merqury report for both assemblies
5. QC: Assembly statistics for both assemblies
6. QC: Nx plot for both assemblies
7. QC: Size plot for both assemblies

Workflow 4: Contiging Solo w/HiC
Link: Dockstore
Description: Generate a phased assembly based on PacBio HiFi reads, using Hi-C data from the same individual for phasing.
Inputs:
1. HiFi long reads [fastq]
2. Hi-C forward reads (if multiple input files, concatenated in the same order as the reverse reads) [fastq]
3. Hi-C reverse reads (if multiple input files, concatenated in the same order as the forward reads) [fastq]
4. K-mer database [meryldb]
5. Genome profile summary generated by GenomeScope [txt]
6. Name of first assembly
7. Name of second assembly
Outputs:
1. Haplotype 1 assembly
2. Haplotype 2 assembly
3. QC: BUSCO report for both assemblies
4. QC: Merqury report for both assemblies
5. QC: Assembly statistics for both assemblies
6. QC: Nx plot for both assemblies
7. QC: Size plot for both assemblies

Workflow 5: Contiging Trio
Link: Dockstore
Description: Generate a phased assembly based on PacBio HiFi reads, using parental Illumina data for phasing.
Inputs:
1. HiFi long reads [fastq]
2. Concatenated Illumina reads: paternal [fastq]
3. Concatenated Illumina reads: maternal [fastq]
4. K-mer database [meryldb]
5. Paternal hapmer database [meryldb]
6. Maternal hapmer database [meryldb]
7. Genome profile summary generated by GenomeScope [txt]
8. Name of first haplotype
9. Name of second haplotype
Outputs:
1. Haplotype 1 assembly
2. Haplotype 2 assembly
3. QC: BUSCO report for both assemblies
4. QC: Merqury report for both assemblies
5. QC: Assembly statistics for both assemblies
6. QC: Nx plot for both assemblies
7. QC: Size plot for both assemblies

Workflow 6: Purging
Link: Dockstore
Description: Purge contigs marked as duplicates by purge_dups (these can be haplotypic duplications or overlap duplications).
Inputs:
1. HiFi long reads, trimmed [fastq]
2. Primary assembly (hap1) [fasta]
3. Alternate assembly (hap2) [fasta]
4. K-mer database [meryldb]
5. GenomeScope model parameters [txt]
6. Estimated genome size [txt]
7. Name of first haplotype
8. Name of second haplotype
Outputs:
1. Haplotype 1 purged assembly
2. Haplotype 2 purged assembly
3. QC: BUSCO report for both assemblies
4. QC: Merqury report for both assemblies
5. QC: Assembly statistics for both assemblies
6. QC: Nx plot for both assemblies
7. QC: Size plot for both assemblies

Workflow 7: Scaffolding Bionano
Link: Dockstore
Description: Scaffolding using Bionano optical map data.
Inputs:
1. Scaffolded assembly [fasta]
2. Bionano data [cmap]
3. Estimated genome size [txt]
4. Phased assembly generated by hifiasm [gfa1]
Outputs:
1. Scaffolds and non-scaffolded contigs
2. QC: Assembly statistics
3. QC: Nx plot
4. QC: Size plot

Workflow 8: Scaffolding HiC
Link: Dockstore
Description: Scaffolding using Hi-C data with YaHS (for a single haplotype).
Inputs:
1. Scaffolded assembly [gfa]
2. Concatenated Hi-C forward reads [fastq]
3. Concatenated Hi-C reverse reads [fastq]
4. Restriction enzyme sequence [txt]
5. Estimated genome size [txt]
Outputs:
1. Scaffolds
2. QC: Assembly statistics
3. QC: Nx plot
4. QC: Size plot
5. QC: BUSCO report
6. QC: Pretext maps before and after scaffolding

Workflow 9: Decontamination
Link: Dockstore
Description: Decontaminate the scaffolded assembly.
Inputs:
1. Scaffolded assembly [fasta]
Outputs:
1. Decontaminated assembly
2. Contaminant list
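
Workflows 4 and 8 expect Hi-C forward and reverse reads that, when split across several files, have been concatenated in the same order. The following is a minimal sketch of one way to guarantee that outside Galaxy, assuming uncompressed FASTQ files on local disk; the helper names `concatenate` and `concatenate_hic_pairs` are hypothetical and are not part of the VGP workflows.

```python
import shutil
from pathlib import Path


def concatenate(parts: list[Path], destination: Path) -> None:
    """Write the files in `parts` to `destination`, in the order given."""
    with destination.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)


def concatenate_hic_pairs(
    pairs: list[tuple[Path, Path]],
    forward_out: Path,
    reverse_out: Path,
) -> None:
    """Concatenate paired Hi-C files so both outputs share one file order.

    Each element of `pairs` is (forward_file, reverse_file) from the same
    sequencing run. Keeping the pairs together in a single list means the
    forward and reverse outputs cannot end up in different orders.
    """
    concatenate([forward for forward, _ in pairs], forward_out)
    concatenate([reverse for _, reverse in pairs], reverse_out)
```

For example, with hypothetical files `run1_R1.fastq`, `run1_R2.fastq`, `run2_R1.fastq`, and `run2_R2.fastq`, passing `[(run1_R1, run1_R2), (run2_R1, run2_R2)]` produces one forward file and one reverse file in which run 1 precedes run 2 in both.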

How can I use these workflows?


We provide and maintain two tutorials explaining how to perform large genome assembly in Galaxy:

  • Short tutorial: Describes the workflows, their functions, their inputs and outputs, and how to use them.
  • Long tutorial: Describes the assembly pipeline tool by tool for an in-depth understanding of the process.