VGP Workflows

Overview


The Vertebrate Genomes Project (VGP) pipelines in Galaxy let a user generate high-quality, near error-free genome assemblies from their own data or from data in the GenomeArk database. These workflows use PacBio HiFi reads, Hi-C data, and optionally Bionano optical maps. The image below shows the major assembly workflows represented as a decision tree. The numbers in the bottom-left corner of each workflow are clickable and point to a detailed description of the corresponding workflow.

Solo = data is only available for the sample whose genome is being assembled. In this case, you can make either a pseudohaplotype assembly, or a HiC-phased assembly if you have HiC data from the same individual.
Trio = parental information is available in the form of Illumina reads from each parent of the F1 being assembled.

Where can I find these workflows?


The latest versions of the workflows are located on the European Galaxy instance. We are still in the process of polishing and optimizing the workflows. Once our optimization work is complete, the workflows will be distributed via all main Galaxy instances as well as through the Dockstore and WorkflowHub platforms.

How can I download and use these workflows?


If you are using the European Galaxy instance you can use them directly:

  1. Log in to your account on usegalaxy.eu
  2. Go to "Shared Data"→"Workflows"
  3. In the search box enter "VGP_curated"
  4. Pick a workflow from the list

If you would like to use these workflows on a different instance, you can download them:

  1. Click the "EU latest" link in the table describing the workflow you want. These tables are located below on this page.
  2. Click the "Download workflow" icon in the top right corner of the page
  3. Log in to the Galaxy instance you are planning to use
  4. Click "Workflows" at the top of the Galaxy interface
  5. Click the "Import" button
  6. Upload the workflow file you downloaded at step 2
  7. Click "Import workflow"
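
Before uploading the file at step 6, it can help to confirm that the download really is a workflow and not, say, a saved HTML page. The sketch below assumes the downloaded file is Galaxy's native `.ga` export, which is a JSON document carrying an `a_galaxy_workflow` marker; the function name `summarize_workflow_file` is made up for this illustration and is not part of Galaxy.

```python
import json
from pathlib import Path


def summarize_workflow_file(path):
    """Return (name, number_of_steps) for a Galaxy .ga workflow file.

    Raises ValueError if the file does not look like a Galaxy workflow.
    Assumption: the file is a native Galaxy JSON export.
    """
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise ValueError(f"{path} is not valid JSON") from exc

    # Native Galaxy exports carry this marker as the string "true".
    if not isinstance(data, dict) or str(data.get("a_galaxy_workflow", "")).lower() != "true":
        raise ValueError(f"{path} does not look like a Galaxy workflow export")

    return data.get("name", "(unnamed)"), len(data.get("steps", {}))
```

Calling `summarize_workflow_file("downloaded_workflow.ga")` (a placeholder file name) returns the workflow's name and step count, which can be compared against what the workflow page shows.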

Workflow 1


K-mer profiling for Solo datasets

Link: EU latest
Description: MerylDB generation. Create Meryl Database used for the estimation of assembly parameters and quality control with Merqury.
Inputs:
  1. HiFi long reads [fastq]
Outputs:
  1. Meryl Database of k-mer counts
  2. GenomeScope summary, models and plots
Example history: EU
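
To make "database of k-mer counts" concrete, here is a toy illustration of what is being counted and of the multiplicity histogram that a tool like GenomeScope fits a model to. This is not Meryl's implementation, and counting canonical k-mers (a k-mer and its reverse complement treated as one) is an assumption made for this sketch; the function names are made up for this example. The default of k=21 matches the 21-mers used in the examples further down this page.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k=21):
    """Count canonical k-mers across reads, skipping windows with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def multiplicity_histogram(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())
```

On real HiFi data the histogram is what shows the haploid and diploid coverage peaks; this pure-Python version is only practical for tiny inputs.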

Workflow 2


K-mer profiling for Trio datasets

Link: EU latest
Description: MerylDB generation Trio. Create Meryl Database used for the estimation of assembly parameters and quality control with Merqury using parental and offspring datasets.
Inputs:
  1. HiFi long reads [fastq]
  2. Paternal short-read Illumina sequencing reads [fastq]
  3. Maternal short-read Illumina sequencing reads [fastq]
Outputs:
  1. Meryl Database of k-mer counts
  2. GenomeScope summary, models and plots
Example history: EU

Workflow 3


Contigging Solo

Link: EU latest
Description: Long Read Assembly with Hifiasm. Generate phased assembly based on PacBio HiFi reads.
Inputs:
  1. HiFi long reads [fastq]
  2. K-mer database [meryldb]
  3. Genome profile summary generated by GenomeScope [txt]
  4. Name of first assembly
  5. Name of second assembly
Outputs:
  1. Primary assembly
  2. Alternate assembly
  3. QC: BUSCO report for both assemblies
  4. QC: Merqury report for both assemblies
  5. QC: Assembly statistics for both assemblies
  6. QC: Nx plot for both assemblies
  7. QC: Size plot for both assemblies
Example history: EU
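
The Nx plot among the QC outputs summarizes assembly contiguity. As a minimal sketch of the statistic behind it (not the code the workflow runs): Nx is the length of the shortest sequence in the smallest set of longest sequences that together cover at least x% of the total assembly length, with N50 being the familiar special case. The function name `nx` is chosen for this illustration.

```python
def nx(lengths, x=50):
    """Return the Nx value for a collection of contig or scaffold lengths.

    Sequences are taken from longest to shortest until their combined
    length reaches x% of the total; the length of the last one taken is Nx.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    if not 0 < x <= 100:
        raise ValueError("x must be greater than 0 and at most 100")

    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Compare in integer-friendly form to avoid rounding surprises.
        if running * 100 >= total * x:
            return length
```

An Nx plot is this function evaluated for x from 1 to 100, so a curve that stays high for longer indicates a more contiguous assembly.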

Workflow 4


Contigging Solo with HiC

Link: EU latest
Description: Long Read Assembly with Hifiasm and HiC. Generate phased assembly based on PacBio HiFi reads using HiC data from the same individual for phasing.
Inputs:
  1. HiFi long reads [fastq]
  2. HiC forward reads (if multiple input files, concatenated in same order as reverse reads) [fastq]
  3. HiC reverse reads (if multiple input files, concatenated in same order as forward reads) [fastq]
  4. K-mer database [meryldb]
  5. Genome profile summary generated by GenomeScope [txt]
  6. Name of first assembly
  7. Name of second assembly
Outputs:
  1. Haplotype 1 assembly
  2. Haplotype 2 assembly
  3. QC: BUSCO report for both assemblies
  4. QC: Merqury report for both assemblies
  5. QC: Assembly statistics for both assemblies
  6. QC: Nx plot for both assemblies
  7. QC: Size plot for both assemblies
Example history: EU
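
The requirement that multiple HiC files be concatenated in the same order for forward and reverse reads is easy to get wrong when the two sides are concatenated separately. One way to rule that out, sketched below with made-up function names, is to build both outputs from a single list of (forward, reverse) pairs. This is an illustration for preparing files outside Galaxy, not part of the workflow itself.

```python
import shutil


def concatenate_in_order(files, destination):
    """Append each file's bytes to destination, in the order given.

    Byte-level concatenation is also valid for gzip-compressed FASTQ,
    because concatenated gzip members form a valid gzip stream.
    """
    with open(destination, "wb") as out:
        for path in files:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


def concatenate_hic_pairs(pairs, forward_out, reverse_out):
    """Write matching forward and reverse concatenations.

    pairs is a list of (forward_file, reverse_file) tuples. Both outputs
    are driven by the same list, so the i-th forward file and the i-th
    reverse file always end up at the same position.
    """
    concatenate_in_order([fwd for fwd, _ in pairs], forward_out)
    concatenate_in_order([rev for _, rev in pairs], reverse_out)
```

For example, `concatenate_hic_pairs([("lane1_R1.fastq", "lane1_R2.fastq"), ("lane2_R1.fastq", "lane2_R2.fastq")], "hic_R1.fastq", "hic_R2.fastq")` uses placeholder file names; do not mix compressed and uncompressed files in one list.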

Workflow 5


Contigging Trio

Link: EU latest
Description: Long Read Assembly with Hifiasm and Trio data. Generate phased assembly based on PacBio HiFi reads using parental Illumina data for phasing.
Inputs:
  1. HiFi long reads [fastq]
  2. Concatenated Illumina reads: Paternal [fastq]
  3. Concatenated Illumina reads: Maternal [fastq]
  4. K-mer database [meryldb]
  5. Paternal hapmer database [meryldb]
  6. Maternal hapmer database [meryldb]
  7. Genome profile summary generated by GenomeScope [txt]
  8. Name of first haplotype
  9. Name of second haplotype
Outputs:
  1. Haplotype 1 assembly
  2. Haplotype 2 assembly
  3. QC: BUSCO report for both assemblies
  4. Merqury report for both assemblies
  5. Assembly statistics for both assemblies
  6. Nx plot for both assemblies
  7. Size plot for both assemblies
Example history: EU

Workflow 6


Purging duplicates

Link: EU latest
Description: Purge Duplications. Purge contigs marked as duplicates by purge_dups (could be haplotypic duplication or overlap duplication).
Inputs:
  1. HiFi long reads - trimmed [fastq]
  2. Primary Assembly (hap1) [fasta]
  3. Alternate Assembly (hap2) [fasta]
  4. K-mer database [meryldb]
  5. GenomeScope model parameters [txt]
  6. Estimated Genome Size [txt]
  7. Name of first haplotype
  8. Name of second haplotype
Outputs:
  1. Haplotype 1 purged assembly
  2. Haplotype 2 assembly
  3. QC: BUSCO report for both assemblies
  4. QC: Merqury report for both assemblies
  5. QC: Assembly statistics for both assemblies
  6. QC: Nx plot for both assemblies
  7. QC: Size plot for both assemblies
Example history: EU

Workflow 7


Scaffolding with BioNano data

Link: EU latest
Description: Bionano Scaffolding. Scaffolding using Bionano optical map data.
Inputs:
  1. Scaffolded assembly [fasta]
  2. Bionano data [cmap]
  3. Estimated genome size [txt]
  4. Phased assembly generated by Hifiasm [gfa1]
Outputs:
  1. Scaffolds, and non-scaffolded contigs
  2. QC: Assembly statistics
  3. QC: Nx plot
  4. QC: Size plot
Example history: EU

Workflow 8


Scaffolding with HiC

Link: EU latest
Description: HiC Scaffolding with YaHS. Scaffolding using HiC data with YaHS.
Inputs:
  1. Scaffolded assembly [fasta]
  2. Concatenated HiC forward reads [fastq]
  3. Concatenated HiC reverse reads [fastq]
  4. Restriction enzyme sequence [txt]
  5. Estimated genome size [txt]
Outputs:
  1. Scaffolds
  2. QC: Assembly statistics
  3. QC: Nx plot
  4. QC: Size plot
  5. QC: BUSCO report
  6. QC: Pretext Maps before and after scaffolding
Example history: EU

Workflow 9


Decontamination

Link: EU latest
Description: Decontamination. Decontaminate scaffolded assembly.
Inputs:
  1. Scaffolded assembly [fasta]
Outputs:
  1. Decontaminated assembly
  2. Contaminant list

Examples of use

The following examples use zebra finch (Taeniopygia guttata) data from GenomeArk to demonstrate the assembly process across different data availability scenarios.
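
The scenarios in this section can be summarized as a small lookup from available data to the sequence of workflows to run. The sketch below takes the workflow numbers listed under each "Workflow trajectory" heading below and reads them as ordered steps, which is an interpretation of this page and not an official API. The names `TRAJECTORIES` and `choose_trajectory` are made up for this illustration, and combinations that the examples do not cover are rejected instead of guessed.

```python
# Trajectories as listed in the examples on this page ("8a" is kept as written there).
TRAJECTORIES = {
    "solo": ["1", "3", "6", "9"],
    "solo+hic": ["1", "4", "6", "8a", "9"],
    "trio": ["2", "5", "6", "9"],
    "trio+hic": ["2", "5", "6", "8a", "9"],
    "trio+bionano": ["2", "5", "6", "7", "9"],
}


def choose_trajectory(trio=False, hic=False, bionano=False):
    """Return the list of workflow numbers for the given data availability."""
    if hic and bionano:
        raise ValueError("no example on this page combines HiC and Bionano data")

    key = "trio" if trio else "solo"
    if hic:
        key += "+hic"
    elif bionano:
        key += "+bionano"

    if key not in TRAJECTORIES:
        raise ValueError(f"no example on this page covers the scenario {key!r}")
    return list(TRAJECTORIES[key])
```

For instance, `choose_trajectory(hic=True)` gives the Solo with HiC path: k-mer profiling, HiC-phased contigging, purging, HiC scaffolding, then decontamination.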

Solo only


Input(s): PacBio HiFi data

Workflow trajectory

1 → 3 → 6 → 9

Results

Workflow Outputs History link
K-mer profiling
1

GenomeScope profile using 21-mers on HiFi reads for zebra finch
EU
Contigging
3

BUSCO results for the primary assembly produced by workflow 3, which uses only HiFi data and runs hifiasm with purging turned off. The large number of duplicated BUSCO genes indicates a need for purging.

Merqury spectra-cn plot. Spectra are colored by k-mer count in the assemblies (considered together). The presence of k-mers seen three times across the two assemblies (the slight green peak) but at diploid k-mer multiplicity (~30-40 on the x-axis) could also indicate a need for purging.

Merqury spectra-asm plot, which colors the spectra according to the assembly those k-mers came from. This plot indicates that most of the k-mers are found in the primary assembly, meaning the primary and alternate assemblies are unbalanced. This can be rectified by purging.
EU
Purging duplicates
6

BUSCO results for the purged primary assembly. Purging has removed many of the duplicated genes, which are shown in the darker blue color.

Spectra-cn plot for purged assemblies. Slight overpurging can be detected by the new gray read-only peak.

Spectra-asm plot for purged assemblies. The primary and alternate assemblies have, on the whole, been reconciled, as can be seen by the green shared peak at diploid coverage, indicating that homozygous regions are represented in both assemblies.

Nx and Size plots
EU

Solo with HiC


Input(s): (1) HiFi data and (2) HiC data for the same individual

Workflow trajectory

1 → 4 → 6 → 8a → 9

Results (shown for one of two haplotypes)

Workflow Outputs History link
Contigging
4

BUSCO results for one of the haplotypes resulting from using hifiasm with HiC phasing. There are far fewer duplicated BUSCO genes than in the primary assembly without purging, indicating that the HiC phasing was effective at separating the haplotypes.

Spectra-cn plot for HiC phased assembly. There are still some 3-copy k-mers that one could try to address with purging.

Spectra-asm for HiC-phased assembly. Observe that it looks much like the spectra-asm for the purged assembly, showing that the two haplotypes are reconciled without a need for purging, when using HiC for phasing.

Nx and Size plots
EU
Purging duplicates
6

BUSCO results for a HiC-phased haplotype after purging.

Spectra-cn plot for the HiC-phased haplotypes after purging. Potential overpurging can be seen in the new read-only bump that was not there before.

Spectra-asm plot for the HiC-phased assemblies after purging.

Nx and Size plots
EU
Scaffolding with HiC
8a

BUSCO results for a HiC-phased haplotype after purging and scaffolding with HiC data. BUSCO results typically do not change much after HiC scaffolding.

PretextMap for HiC-phased contigs after purging, but before HiC scaffolding.


PretextMap for HiC-phased contigs after HiC scaffolding.
EU

Trio Only

Input(s): (1) HiFi data for the child, (2) Illumina data for the paternal individual, and (3) Illumina data for the maternal individual.

Workflow trajectory

2 → 5 → 6 → 9

Results

Workflow Outputs History link
K-mer profiling
2

GenomeScope profile for Child (top), Father (middle), and Mother (bottom) using 21-mers on HiFi reads for zebra finch
EU
Contigging
5

BUSCO results for one of the haplotypes resulting from using hifiasm with Trio data

Spectra-cn plot for phased Trio assembly

Spectra-asm plot for the phased Trio assembly

Nx and Size plots
EU
Purging duplicates
6

BUSCO results for one of the haplotypes resulting from purging duplicates

Spectra-cn plot for phased Trio assembly after purging duplicates

Spectra-asm plot for the phased Trio assembly after purging duplicates

Nx and Size plots
EU

Trio with HiC

Input(s): (1) HiFi data, (2) Illumina data for the paternal individual, (3) Illumina data for the maternal individual, and (4) HiC data.

Workflow trajectory

2 → 5 → 6 → 8a → 9

Results

Workflow Outputs History link
Scaffolding with HiC
8a
(Data for primary assembly only!)

BUSCO results for one of the haplotypes resulting after scaffolding

PretextMap for HiC-phased contigs after purging, but before scaffolding.

PretextMap for HiC-phased contigs after scaffolding.
EU

Trio with BioNano

Input(s): (1) HiFi data, (2) Illumina data for the paternal individual, (3) Illumina data for the maternal individual, and (4) BioNano data.

Workflow trajectory

2 → 5 → 6 → 7 → 9

Results

Workflow Outputs History link
Scaffolding with BioNano
7
(Data for primary assembly only!)

EU