VGP-Galaxy assembly workflows
The Vertebrate Genomes Project (VGP) pipelines in Galaxy let users generate high-quality, near error-free genome assemblies from their own data or from data in the GenomeArk database. The workflows use PacBio HiFi reads, Hi-C data, and, optionally, Bionano optical maps. The image below shows the major assembly workflows.
Overview
Eight analysis trajectories are possible, depending on the combination of input data. Whether to invoke workflow 6 is decided from the QC output of workflow 3, 4, or 5 (see below). The thicker lines connecting workflows 7, 8, and 9 indicate that these workflows are invoked separately for each phased assembly (once for maternal [or hap1] and once for paternal [or hap2]).
Solo = data are available only for the individual whose genome is being assembled. In this case, you can make either a pseudohaplotype assembly or, if you have Hi-C data from the same individual, a Hi-C-phased assembly.
Trio = parental information is available in the form of Illumina reads from each parent of the F1 being assembled.
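The choice between the solo and trio routes can be sketched in a few lines of Python. This is only an illustration built from the workflow numbers listed in the table in the next section. The names `ContigingPlan` and `select_contiging_plan` are hypothetical, and the rule that trio data takes precedence when both parental reads and Hi-C data are available is an assumption, not something the workflows enforce.

```python
from typing import NamedTuple


class ContigingPlan(NamedTuple):
    """Hypothetical container for the first two workflows to run."""

    profiling_workflow: int  # k-mer profiling workflow number (1 or 2)
    contiging_workflow: int  # contiging workflow number (3, 4, or 5)
    mode: str                # human-readable label for the route


def select_contiging_plan(has_hic: bool, has_parental_reads: bool) -> ContigingPlan:
    """Pick profiling and contiging workflows from the available input data.

    Assumption: if both parental Illumina reads and Hi-C data exist,
    the trio route is preferred.
    """
    if has_parental_reads:
        # Trio: workflow 2 profiles child and parental reads, workflow 5 phases with them.
        return ContigingPlan(2, 5, "trio")
    if has_hic:
        # Solo with Hi-C: workflow 1 profiles the HiFi reads, workflow 4 phases with Hi-C.
        return ContigingPlan(1, 4, "solo with Hi-C")
    # Solo without Hi-C: workflow 3 produces primary and alternate assemblies.
    return ContigingPlan(1, 3, "solo")
```

For example, `select_contiging_plan(has_hic=True, has_parental_reads=False)` selects workflows 1 and 4. Purging, scaffolding, and decontamination are left out because their invocation depends on QC results and on which scaffolding data are available.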
Where can I find these workflows?
The latest versions of the workflows can be found on Dockstore:
Workflow | Link | Description | Inputs | Outputs |
---|---|---|---|---|
1 | Dockstore | K-mer profiling: Create Meryl database used for the estimation of assembly parameters and quality control with Merqury. | 1. HiFi long reads [fastq] | 1. Meryl database of k-mer counts 2. GenomeScope summary, models and plots |
2 | Dockstore | K-mer profiling Trio: Create Meryl database used for the estimation of assembly parameters and quality control with Merqury using parental and offspring datasets | 1. HiFi long reads [fastq] 2. Paternal short-read Illumina sequencing reads [fastq] 3. Maternal short-read Illumina sequencing reads [fastq] | 1. Meryl database of k-mer counts 2. GenomeScope summary, models and plots |
3 | Dockstore | Contiging Solo: Generate phased assembly based on PacBio HiFi reads | 1. HiFi long reads [fastq] 2. K-mer database [meryldb] 3. Genome profile summary generated by GenomeScope [txt] 4. Name of first assembly 5. Name of second assembly | 1. Primary assembly 2. Alternate assembly 3. QC: BUSCO report for both assemblies 4. QC: Merqury report for both assemblies 5. QC: Assembly statistics for both assemblies 6. QC: Nx plot for both assemblies 7. QC: Size plot for both assemblies |
4 | Dockstore | Contiging Solo w/HiC: Generate phased assembly based on PacBio HiFi reads using Hi-C data from the same individual for phasing | 1. HiFi long reads [fastq] 2. Hi-C forward reads (if multiple input files, concatenated in same order as reverse reads) [fastq] 3. Hi-C reverse reads (if multiple input files, concatenated in same order as forward reads) [fastq] 4. K-mer database [meryldb] 5. Genome profile summary generated by GenomeScope [txt] 6. Name of first assembly 7. Name of second assembly | 1. Haplotype 1 assembly 2. Haplotype 2 assembly 3. QC: BUSCO report for both assemblies 4. QC: Merqury report for both assemblies 5. QC: Assembly statistics for both assemblies 6. QC: Nx plot for both assemblies 7. QC: Size plot for both assemblies |
5 | Dockstore | Contiging Trio: Generate phased assembly based on PacBio HiFi reads using parental Illumina data for phasing | 1. HiFi long reads [fastq] 2. Concatenated Illumina reads: Paternal [fastq] 3. Concatenated Illumina reads: Maternal [fastq] 4. K-mer database [meryldb] 5. Paternal hapmer database [meryldb] 6. Maternal hapmer database [meryldb] 7. Genome profile summary generated by GenomeScope [txt] 8. Name of first haplotype 9. Name of second haplotype | 1. Haplotype 1 assembly 2. Haplotype 2 assembly 3. QC: BUSCO report for both assemblies 4. Merqury report for both assemblies 5. Assembly statistics for both assemblies 6. Nx plot for both assemblies 7. Size plot for both assemblies |
6 | Dockstore | Purging: Purge contigs marked as duplicates by purge_dups (could be haplotypic duplication or overlap duplication) | 1. HiFi long reads - trimmed [fastq] 2. Primary assembly (hap1) [fasta] 3. Alternate assembly (hap2) [fasta] 4. K-mer database [meryldb] 5. GenomeScope model parameters [txt] 6. Estimated genome size [txt] 7. Name of first haplotype 8. Name of second haplotype | 1. Haplotype 1 purged assembly 2. Haplotype 2 purged assembly 3. QC: BUSCO report for both assemblies 4. QC: Merqury report for both assemblies 5. QC: Assembly statistics for both assemblies 6. QC: Nx plot for both assemblies 7. QC: Size plot for both assemblies |
7 | Dockstore | Scaffolding Bionano: Scaffolding using Bionano optical map data | 1. Scaffolded assembly [fasta] 2. Bionano data [cmap] 3. Estimated genome size [txt] 4. Phased assembly generated by Hifiasm [gfa1] | 1. Scaffolds, and non-scaffolded contigs 2. QC: Assembly statistics 3. QC: Nx plot 4. QC: Size plot |
8 | Dockstore | Scaffolding HiC: Scaffolding using Hi-C data with YaHS (for a single haplotype) | 1. Scaffolded assembly [gfa] 2. Concatenated Hi-C forward reads [fastq] 3. Concatenated Hi-C reverse reads [fastq] 4. Restriction enzyme sequence [txt] 5. Estimated genome size [txt] | 1. Scaffolds 2. QC: Assembly statistics 3. QC: Nx plot 4. QC: Size plot 5. QC: BUSCO report 6. QC: Pretext maps before and after scaffolding |
9 | Dockstore | Decontamination: Decontaminate scaffolded assembly | 1. Scaffolded assembly [fasta] | 1. Decontaminated assembly 2. Contaminant list |
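Workflows 4 and 8 expect multiple Hi-C files to be concatenated in the same order for forward and reverse reads. One way to guarantee this outside Galaxy is to build both outputs from a single ordered list of file pairs, as in the sketch below. The function name `concatenate_paired_reads` and its arguments are hypothetical. The sketch assumes that all files are either plain FASTQ ending in a newline or all gzip-compressed; gzip files can be joined byte-wise because a gzip stream may contain several members.

```python
import shutil
from pathlib import Path
from typing import Sequence, Tuple


def concatenate_paired_reads(
    pairs: Sequence[Tuple[Path, Path]],
    forward_out: Path,
    reverse_out: Path,
) -> None:
    """Concatenate forward and reverse read files pair by pair.

    Iterating over (forward, reverse) pairs keeps both outputs in one
    shared order, so read N in the forward file still matches read N
    in the reverse file.
    """
    with open(forward_out, "wb") as fwd, open(reverse_out, "wb") as rev:
        for forward_path, reverse_path in pairs:
            # Append the forward file of this pair to the forward output.
            with open(forward_path, "rb") as src:
                shutil.copyfileobj(src, fwd)
            # Append the matching reverse file to the reverse output.
            with open(reverse_path, "rb") as src:
                shutil.copyfileobj(src, rev)
```

Passing the pairs explicitly, rather than globbing forward and reverse files separately, avoids a mismatch when the two sets of file names happen to sort differently.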
How can I use these workflows?
We provide and maintain two tutorials explaining how to perform large genome assembly in Galaxy:
- Short tutorial: Describes the workflows, their functions, their inputs and outputs, and how to use them.
- Long tutorial: Describes the assembly pipeline tool by tool for an in-depth understanding of the process.