VGP Workflows

Overview


The Vertebrate Genomes Project's pipelines in Galaxy are intended to allow a user to generate high-quality, near error-free assemblies of species from a user's own data or from the GenomeArk database. These workflows use PacBio HiFi reads, Hi-C data, and, optionally, BioNano optical maps. The image below shows major assembly workflows.

Eight analysis trajectories are possible, depending on the combination of input data. The decision to invoke workflow 6 is based on the QC output of workflow 3, 4, or 5 (see below). The thicker lines connecting workflows 7, 8, and 9 indicate that these workflows are invoked separately for each phased assembly (once for maternal [or hap1] and once for paternal [or hap2]).

Solo = data is available only for the sample whose genome is being assembled. In this case, you can make either a pseudohaplotype assembly or, if you have HiC data from the same individual, a HiC-phased assembly.
Trio = parental information is available in the form of Illumina reads from each parent of the F1 being assembled.
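
The choice of trajectory described above can be sketched in Python. This is a minimal illustration and not part of the VGP workflows: the function name `choose_trajectory` and its parameters are hypothetical, and running the optional scaffolding steps in workflow-number order (7 before 8) is an assumption. The workflow numbers follow the table in the next section.

```python
from typing import List


def choose_trajectory(trio: bool, hic: bool, bionano: bool, purge: bool = True) -> List[str]:
    """Return the ordered workflow numbers for one combination of input data.

    `purge` stands in for the manual decision made from the QC output of
    workflow 3, 4, or 5; it is not computed here.
    """
    # K-mer profiling: workflow 2 when parental reads exist, otherwise workflow 1.
    steps = ["2" if trio else "1"]

    # Contigging: trio phasing (5), HiC phasing (4), or HiFi only (3).
    if trio:
        steps.append("5")
    elif hic:
        steps.append("4")
    else:
        steps.append("3")

    # Purging is optional and decided from QC results.
    if purge:
        steps.append("6")

    # Scaffolding (assumed order): Bionano first, then HiC.
    if bionano:
        steps.append("7")
    if hic:
        steps.append("8")

    # Decontamination closes every trajectory.
    steps.append("9")
    return steps


if __name__ == "__main__":
    print(" -> ".join(choose_trajectory(trio=False, hic=True, bionano=False)))
```

Under these assumptions, the three yes/no inputs (trio, HiC, Bionano) give eight distinct trajectories, which matches the count stated above.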

How can I download and use these workflows?


If you are using any of the usegalaxy.* instances (https://usegalaxy.org, https://usegalaxy.eu, https://usegalaxy.org.au), log into your account and follow the steps in this video:

What do individual workflows do?


Workflow Link Description Inputs Outputs Example History
1 Dockstore K-mer profiling:
Create Meryl Database used for the estimation of assembly parameters and quality control with Merqury.
1. Hifi long reads [fastq]
1. Meryl Database of k-mer counts
2. GenomeScope summary, models and plots
EU
2 Dockstore K-mer profiling Trio:
Create Meryl Database used for the estimation of assembly parameters and quality control with Merqury using parental and offspring datasets
1. Hifi long reads [fastq]
2. Paternal short-read Illumina sequencing reads [fastq]
3. Maternal short-read Illumina sequencing reads [fastq]
1. Meryl Database of k-mer counts
2. GenomeScope summary, models and plots
EU
3 Dockstore Contigging Solo:
Generate phased assembly based on PacBio Hifi Reads
1. Hifi long reads [fastq]
2. K-mer database [meryldb]
3. Genome profile summary generated by Genomescope [txt]
4. Name of first assembly
5. Name of second assembly
1. Primary assembly
2. Alternate assembly
3. QC: BUSCO report for both assemblies
4. QC: Merqury report for both assemblies
5. QC: Assembly statistics for both assemblies
6. QC: Nx plot for both assemblies
7. QC: Size plot for both assemblies
EU
4 Dockstore Contigging Solo w/HiC:
Generate phased assembly based on PacBio Hifi Reads using HiC data from the same individual for phasing
1. Hifi long reads [fastq]
2. HiC forward reads (if multiple input files, concatenated in same order as reverse reads) [fastq]
3. HiC reverse reads (if multiple input files, concatenated in same order as forward reads) [fastq]
4. K-mer database [meryldb]
5. Genome profile summary generated by Genomescope [txt]
6. Name of first assembly
7. Name of second assembly
1. Haplotype 1 assembly
2. Haplotype 2 assembly
3. QC: BUSCO report for both assemblies
4. QC: Merqury report for both assemblies
5. QC: Assembly statistics for both assemblies
6. QC: Nx plot for both assemblies
7. QC: Size plot for both assemblies
EU
5 Dockstore Contigging Trio:
Generate phased assembly based on PacBio Hifi Reads using parental Illumina data for phasing
1. Hifi long reads [fastq]
2. Concatenated Illumina reads : Paternal [fastq]
3. Concatenated Illumina reads : Maternal [fastq]
4. K-mer database [meryldb]
5. Paternal hapmer database [meryldb]
6. Maternal hapmer database [meryldb]
7. Genome profile summary generated by Genomescope [txt]
8. Name of first haplotype
9. Name of second haplotype
1. Haplotype 1 assembly
2. Haplotype 2 assembly
3. QC: BUSCO report for both assemblies
4. QC: Merqury report for both assemblies
5. QC: Assembly statistics for both assemblies
6. QC: Nx plot for both assemblies
7. QC: Size plot for both assemblies
EU
6 Dockstore Purging:
Purge contigs marked as duplicates by purge_dups (could be haplotypic duplication or overlap duplication)
1. Hifi long reads - trimmed [fastq]
2. Primary Assembly (hap1) [fasta]
3. Alternate Assembly (hap2) [fasta]
4. K-mer database [meryldb]
5. Genomescope model parameters [txt]
6. Estimated Genome Size [txt]
7. Name of first haplotype
8. Name of second haplotype
1. Haplotype 1 purged assembly
2. Haplotype 2 purged assembly
3. QC: BUSCO report for both assemblies
4. QC: Merqury report for both assemblies
5. QC: Assembly statistics for both assemblies
6. QC: Nx plot for both assemblies
7. QC: Size plot for both assemblies
EU
6Pri Dockstore Purging with custom cutoffs for PRIMARY assembly:
Purge contigs marked as duplicates by purge_dups, using custom cutoffs when the automatic detection was not satisfactory.
1. Hifi long reads - trimmed [fastq]
2. Primary Assembly (hap1) [fasta]
3. Alternate Assembly (hap2) [fasta]
4. K-mer database [meryldb]
5. Genomescope model parameters [txt]
6. Auto-alignment of Primary assembly [paf]
7. Cutoffs for Primary Assembly
8. PBCSTATS base coverage for Primary assembly [tab]
9. Estimated Genome Size [txt]
10. Name of first haplotype
11. Name of second haplotype
1. Haplotype 1 purged assembly
2. Haplotype 2 purged assembly
3. QC: BUSCO report for both assemblies
4. QC: Merqury report for both assemblies
5. QC: Assembly statistics for both assemblies
6. QC: Nx plot for both assemblies
7. QC: Size plot for both assemblies
EU
6Alt Dockstore Purging with custom cutoffs for ALTERNATE assembly:
Purge contigs marked as duplicates by purge_dups, using custom cutoffs when the automatic detection was not satisfactory.
1. Hifiasm Alternate assembly + sequences purged from the primary assembly
2. PBCSTATS base coverage for Alternate assembly [tab]
3. Cutoffs file for alternate assembly
4. Estimated Genome Size [txt]
5. K-mer database [meryldb]
6. Hifiasm Purged Primary assembly
7. Name of first haplotype
8. Name of second haplotype
1. Haplotype 1 purged assembly
2. Haplotype 2 purged assembly
3. QC: BUSCO report for both assemblies
4. QC: Merqury report for both assemblies
5. QC: Assembly statistics for both assemblies
6. QC: Nx plot for both assemblies
7. QC: Size plot for both assemblies
EU
7 Dockstore Scaffolding Bionano:
Scaffolding using Bionano optical map data
1. Scaffolded assembly [fasta]
2. Bionano data [cmap]
3. Estimated genome size [txt]
4. Phased assembly generated by Hifiasm [gfa1]
1. Scaffolds, and non-scaffolded contigs
2. QC: Assembly statistics
3. QC: Nx plot
4. QC: Size plot
EU
8 Dockstore Scaffolding HiC:
Scaffolding using HiC data with YAHS
1. Scaffolded assembly [fasta]
2. Concatenated HiC forward reads [fastq]
3. Concatenated HiC reverse reads [fastq]
4. Restriction enzyme sequence [txt]
5. Estimated genome size [txt]
1. Scaffolds
2. QC: Assembly statistics
3. QC: Nx plot
4. QC: Size plot
5. QC: BUSCO report
6. QC: Pretext Maps before and after scaffolding
EU
9 Dockstore Decontamination:
Decontaminate scaffolded assembly
1. Scaffolded assembly [fasta]
1. Decontaminated assembly
2. Contaminant list
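
Several of the workflows above report assembly statistics and an Nx plot. As a rough illustration of what that plot summarizes, the sketch below computes Nx from a list of contig or scaffold lengths using the standard definition: the length L such that sequences of length L or longer cover at least x% of the total assembly. The function name `nx` is hypothetical, and this is not the code the workflows run.

```python
from typing import Iterable


def nx(lengths: Iterable[int], x: float = 50.0) -> int:
    """Return Nx for the given sequence lengths (x=50 gives N50)."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")

    # Number of bases that must be covered by the longest sequences.
    threshold = sum(ordered) * x / 100.0

    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    # Only reached if x > 100; fall back to the shortest sequence.
    return ordered[-1]


if __name__ == "__main__":
    contigs = [100, 50, 25, 25]  # made-up lengths, for illustration only
    print("N50:", nx(contigs, 50))
    print("N90:", nx(contigs, 90))
```

An Nx plot is this value drawn for every x from 0 to 100, so a curve that stays high for longer indicates a more contiguous assembly.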

Examples of use

The following examples use zebra finch (Taeniopygia guttata) data from GenomeArk to demonstrate the assembly process across different data availability scenarios.

Solo only


Input(s): PacBio HiFi data

Workflow trajectory

1 → 3 → 6 → 9

Results

Workflow Outputs History link
K-mer profiling
1

GenomeScope profile using 21-mers on HiFi reads for zebra finch
EU
Contigging
3

BUSCO results for the primary assembly produced by workflow 3, which uses only HiFi data and runs hifiasm with purging off. The large number of duplicated BUSCO genes indicates a need for purging.

Merqury spectra-cn plot. Spectra are colored by k-mer count in the assemblies (considered together). The presence of k-mers seen three times across the two assemblies (the slight green peak) at diploid k-mer multiplicity (~30-40 on the x-axis) could also indicate a need for purging.

Merqury spectra-asm plot, which colors the spectra according to the assembly those k-mers came from. This plot indicates that most of the k-mers are found in the primary assembly, meaning the primary and alternate assemblies are unbalanced. This can be rectified by purging.
EU
Purging duplicates
6

BUSCO results for the purged primary assembly. Purging has removed many of the duplicated genes, which are shown in the darker blue color.

Spectra-cn plot for purged assemblies. Slight overpurging can be detected by the new gray read-only peak.

Spectra-asm plot for purged assemblies. The primary and alternate assemblies have, on the whole, been reconciled, as can be seen by the green shared peak at diploid coverage, indicating that homozygous regions are represented in both assemblies.

Nx and Size plots
EU
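
The purging decision illustrated above relies on reading the share of duplicated BUSCO genes before and after workflow 6. The sketch below shows one way to read that share from a BUSCO one-line summary. It assumes the summary has the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`; the function names and the 5% cutoff are hypothetical and not a VGP recommendation, and the example lines are made up rather than taken from the zebra finch results.

```python
import re
from typing import Dict

# Assumed shape of the BUSCO one-line summary.
_BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line: str) -> Dict[str, float]:
    """Extract the C, S, D, F, M percentages and n from a summary line."""
    match = _BUSCO_LINE.search(line)
    if match is None:
        raise ValueError("not a BUSCO summary line")
    return {key: float(value) for key, value in match.groupdict().items()}


def suggests_purging(line: str, max_duplicated_pct: float = 5.0) -> bool:
    """Flag an assembly whose duplicated share exceeds a hypothetical cutoff."""
    return parse_busco_summary(line)["D"] > max_duplicated_pct


if __name__ == "__main__":
    # Made-up summary lines, for illustration only.
    before = "C:96.0%[S:80.0%,D:16.0%],F:1.5%,M:2.5%,n:8338"
    after = "C:96.0%[S:95.0%,D:1.0%],F:1.5%,M:2.5%,n:8338"
    print(suggests_purging(before), suggests_purging(after))
```

A single number cannot replace the plots: the Merqury spectra described above should still be checked, since they also reveal overpurging.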

Solo with HiC


Input(s): (1) HiFi data and (2) HiC data for the same individual

Workflow trajectory

1 → 4 → 6 → 8a → 9

Results (shown for one of two haplotypes)

Workflow Outputs History link
Contigging
4

BUSCO results for one of the haplotypes produced by hifiasm with HiC phasing. There are far fewer duplicated BUSCO genes than in the primary assembly without purging, indicating that the HiC data was effective at phasing the haplotypes.

Spectra-cn plot for HiC phased assembly. There are still some 3-copy k-mers that one could try to address with purging.

Spectra-asm for HiC-phased assembly. Observe that it looks much like the spectra-asm for the purged assembly, showing that the two haplotypes are reconciled without a need for purging, when using HiC for phasing.

Nx and Size plots
EU
Purging duplicates
6

BUSCO results for a HiC-phased haplotype after purging.

Spectra-cn plot for the HiC-phased haplotypes after purging. Potential overpurging can be seen in the new read-only bump that was not there before.

Spectra-asm plot for the HiC-phased assemblies after purging.

Nx and Size plots
EU
Scaffolding with HiC
8a

BUSCO results for a HiC-phased haplotype after purging and scaffolding with HiC data. BUSCO results typically do not change much after HiC scaffolding.

PretextMap for HiC-phased contigs after purging, but before HiC scaffolding.


PretextMap for HiC-phased contigs after HiC scaffolding.
EU

Trio only

Input(s): (1) HiFi data for the child, (2) Illumina data for the paternal individual, and (3) Illumina data for the maternal individual.

Workflow trajectory

2 → 5 → 6 → 9

Results

Workflow Outputs History link
K-mer profiling
2

GenomeScope profile for Child (top), Father (middle), and Mother (bottom) using 21-mers on HiFi reads for zebra finch
EU
Contigging
5

BUSCO results for one of the haplotypes resulting from using hifiasm with Trio data

Spectra-cn plot for phased Trio assembly

Spectra-asm plot for the phased Trio assembly

Nx and Size plots
EU
Purging duplicates
6

BUSCO results for one of the haplotypes resulting from purging duplicates

Spectra-cn plot for phased Trio assembly after purging duplicates

Spectra-asm plot for the phased Trio assembly after purging duplicates

Nx and Size plots
EU

Trio with HiC

Input(s): (1) HiFi data, (2) Illumina data for the paternal individual, (3) Illumina data for the maternal individual, and (4) HiC data.

Workflow trajectory

2 → 5 → 6 → 8a → 9

Results

Workflow Outputs History link
Scaffolding with HiC
8a
(Data for primary assembly only!)

BUSCO results for one of the haplotypes resulting after scaffolding

PretextMap for HiC-phased contigs after purging, but before scaffolding.

PretextMap for HiC-phased contigs after scaffolding.
EU

Trio with BioNano

Input(s): (1) HiFi data, (2) Illumina data for the paternal individual, (3) Illumina data for the maternal individual, and (4) BioNano data.

Workflow trajectory

2 → 5 → 6 → 7 → 9

Results

Workflow Outputs History link
Scaffolding with BioNano
7
(Data for primary assembly only!)

EU