This content has a new home at /learn.
PlantTribes Analysis
Here we show the basic steps of performing comparative and evolutionary analyses of gene families and transcriptomes using the Galaxy PlantTribes tools. Specifically, we will:
- determine the evolutionary relationship of de novo reconstructed transcripts across lineages in a gene family context
- determine whole genome duplication (WGD) events within a single species (paralogs comparison)
Figure 1. The PlantTribes analysis workflow
This tutorial assumes that the reader:
- has a basic understanding of how Galaxy works ([see this](/tutorials/g101) if you don't)
- has an account in Galaxy ([see this](/tutorials/g101/#setting-up-galaxy-account) if you don't)
- has their browser configured as described [here](/tutorials/g101/#getting-your-display-sorted-out)
- knows how to upload data into Galaxy ([see this](/tutorials/upload) if you don't)
- has a basic understanding of dataset collections ([see this](/tutorials/collections) if you don't)
Topic points
- Upload the PlantTribes test data for this tutorial into Galaxy
- Post-process transcripts derived from a de novo transcriptome assembly
- Target gene family assembly of post-processed transcripts derived from a de novo transcriptome assembly
- Classify post-processed transcripts into pre-computed orthologous gene family clusters
- Integrate gene models in pre-computed orthologous gene family clusters with classified transcripts
- Estimate multiple sequence alignments of integrated orthologous gene family clusters
- Build and visualize phylogenetic trees of aligned orthologous gene family clusters
- Estimate paralogous and orthologous pairwise non-synonymous (Ka) and synonymous (Ks) substitution rates and visualize the distribution with fitted significant components
Details about the test data
In this tutorial, we will be using the test data available on the PlantTribes GitHub repository.
Dataset | Description |
---|---|
assembly.fasta | A sub-set of a plant transcriptome de novo assembly |
targetOrthos.ids | A list of targeted orthogroup identifiers corresponding to the 22 representative plant genomes gene family scaffold |
species1.fna | A sub-set of coding sequences (CDS) for the first species for estimating paralogous and orthologous pairwise synonymous (Ks) and non-synonymous (Ka) substitution rates |
species1.faa | Corresponding protein sequences for the first species for estimating paralogous and orthologous pairwise synonymous (Ks) and non-synonymous (Ka) substitution rates |
species2.fna | A sub-set of coding sequences (CDS) for the second species for estimating paralogous and orthologous pairwise synonymous (Ks) and non-synonymous (Ka) substitution rates |
species2.faa | Corresponding protein sequences for the second species for estimating paralogous and orthologous pairwise synonymous (Ks) and non-synonymous (Ka) substitution rates |
Getting started
Upload into Galaxy the test datasets that you downloaded from the PlantTribes GitHub repository. You can use the upload tool's `Auto-detect` feature or manually set the data formats. If you use `Auto-detect`, make sure that the data formats are properly set for all files. With the exception of `targetOrthos.ids`, which is `tabular`, data formats are all `fasta`.
Figure 2. Uploading the test datasets from the PlantTribes GitHub repository into Galaxy
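If you want to sanity-check the expected formats locally before uploading, a rough check along the following lines may help. This is only a minimal sketch and is not how Galaxy's `Auto-detect` works: the function name `guess_format` and the rule "same number of tab-separated fields on every line means tabular" are assumptions made for illustration, and the identifiers in the example are made up.

```python
def guess_format(text):
    """Crudely classify file content as 'fasta', 'tabular', or 'unknown'.

    Illustrative only: Galaxy uses its own datatype sniffers.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return "unknown"
    # FASTA files start with a header line beginning with '>'
    if lines[0].startswith(">"):
        return "fasta"
    # Treat content as tabular when every line has the same number of
    # tab-separated fields (a one-column list of IDs also qualifies)
    field_counts = {len(line.split("\t")) for line in lines}
    if len(field_counts) == 1:
        return "tabular"
    return "unknown"


# Made-up examples, not the real test data
print(guess_format(">contig_1\nATGGCGTTA\n"))   # fasta
print(guess_format("1001\n1002\n1003\n"))        # tabular
```

To check a downloaded file, pass its content to the function, for example `guess_format(open("assembly.fasta").read())`.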
Uploading the datasets into Galaxy creates a new history. You can rename the history to `PlantTribes test data` by clicking on `Unnamed history`.
Figure 3. A new history named `PlantTribes test data`
Post-process transcripts derived from a de novo transcriptome assembly (basic run)
Open the `PlantTribes` section in your tool panel and select the `AssemblyPostProcessor` tool. Select the following settings on the tool form to post-process the de novo assembly test dataset `assembly.fasta` (history item 1) using the `TransDecoder` coding region prediction method. The `Remove duplicate sequences` option will remove similar (sub)sequences and sequences shorter than 200 bp.
Figure 4. `AssemblyPostProcessor` options (basic run)
Executing the `AssemblyPostProcessor` tool with the settings shown in Figure 4 will produce the following items in your history.
Figure 5. `AssemblyPostProcessor` outputs (basic run)
Description of the outputs (basic run)
- Primary TransDecoder coding regions prediction `transcripts.pep` and `transcripts.cds` (history items 7 and 12)
- Validated and filtered representative coding region predictions `transcripts.cleaned.pep` and `transcripts.cleaned.cds` (history items 8 and 11)
- Validated, filtered and non-redundant representative coding region predictions `transcripts.cleaned.nr.pep` and `transcripts.cleaned.nr.cds` (history items 9 and 10)
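To make the effect of the `Remove duplicate sequences` option more concrete, here is a minimal sketch of a length filter combined with removal of sequences that are fully contained in a longer one. It is an illustration under simplifying assumptions, not the tool's implementation: the function names are hypothetical, and only exact (sub)sequence matches are removed, whereas the tool is described as removing similar (sub)sequences.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def filter_records(records, min_length=200):
    """Drop sequences shorter than min_length and exact (sub)sequence duplicates."""
    long_enough = [(h, s.upper()) for h, s in records if len(s) >= min_length]
    # Longest first, so shorter sequences are compared against longer ones
    long_enough.sort(key=lambda rec: len(rec[1]), reverse=True)
    kept = []
    for header, seq in long_enough:
        if not any(seq in kept_seq for _, kept_seq in kept):
            kept.append((header, seq))
    return kept
```

For example, `filter_records(parse_fasta(open("assembly.fasta").read()))` would return the retained records, longest first.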
Target gene family assembly of post-processed transcripts derived from a de novo transcriptome assembly
Using the `AssemblyPostProcessor` tool again, enter the following settings to:
- assign post-processed transcripts of `assembly.fasta` (history item 1) to targeted gene families (orthogroups) listed in the `targetOrthos.ids` (history item 6) of the `22Gv1.1 OrthoMCL scaffold`
- whenever possible, reassemble fragmented primary contigs with sufficiently overlapping ends
Figure 6. `AssemblyPostProcessor` options (advanced run)
Executing the `AssemblyPostProcessor` tool with the settings shown in Figure 6 will produce the following items in your history.
Figure 7. `AssemblyPostProcessor` outputs (advanced run)
Description of the outputs (advanced run)
- Primary `TransDecoder` coding regions prediction `transcripts.pep` and `transcripts.cds` (history items 14 and 19)
- Validated and filtered representative coding region predictions `transcripts.cleaned.pep` and `transcripts.cleaned.cds` (history items 15 and 18)
- Validated, filtered and non-redundant representative coding region predictions `transcripts.cleaned.nr.pep` and `transcripts.cleaned.nr.cds` (history items 16 and 17)
- A collection of sub-directories of post-processed targeted gene family assemblies (history item 13), each of which contains:
  - targeted gene family primary assembly (`*.contigs.fasta`)
  - corresponding non-redundant representative coding region predictions (`*.contigs.fasta.cds`)
  - corresponding non-redundant representative protein predictions (`*.contigs.fasta.pep`)
  - targeted gene family primary assembly summary statistics (`*.contigs.fasta.stats`)
Classify post-processed transcripts into pre-computed orthologous gene family clusters
Select the `GeneFamilyClassifier` tool from the `PlantTribes` section in your tool panel and enter the following settings on the tool form to classify post-processed transcripts `transcripts.cleaned.nr.pep` and `transcripts.cleaned.nr.cds` (history items 9 and 10) using both `blastp` and `hmmscan` into pre-computed orthogroups of the `22Gv1.1 OrthoMCL scaffold` and to create protein and coding sequences orthogroup fasta files for the classified transcripts.
Figure 8. `GeneFamilyClassifier` settings
Executing the `GeneFamilyClassifier` tool with the settings shown in Figure 8 will produce the following items in your history.
Figure 9. `GeneFamilyClassifier` outputs, which include a dataset collection (history item 20)
Description of the outputs
- Gene family classification protein and coding sequences orthogroup fasta files `gene family clusters` (history item 21)
- Gene family classification metadata files contained within a dataset collection (history item 20), which consists of:
  - `proteins.blastp.22Gv1.1` - `blastp` results of predicted peptides `transcripts.cleaned.nr.pep` (history item 9) against the `22Gv1.1 OrthoMCL scaffold` protein blast database
  - `proteins.hmmscan.22Gv1.1` - `hmmscan` results of predicted peptides `transcripts.cleaned.nr.pep` (history item 9) against the `22Gv1.1 OrthoMCL scaffold` protein orthogroup HMM profiles
  - `proteins.blastp.22Gv1.1.bestOrthos` - best scoring `22Gv1.1 OrthoMCL scaffold` orthogroups for predicted peptides `transcripts.cleaned.nr.pep` (history item 9) based on `blastp` results
  - `proteins.hmmscan.22Gv1.1.bestOrthos` - best scoring `22Gv1.1 OrthoMCL scaffold` orthogroups for predicted peptides `transcripts.cleaned.nr.pep` (history item 9) based on `hmmscan` results
  - `proteins.both.22Gv1.1.bestOrthos` - selected best scoring `22Gv1.1 OrthoMCL scaffold` orthogroups for predicted peptides `transcripts.cleaned.nr.pep` (history item 9) based on both `blastp` and `hmmscan` results
  - `proteins.both.22Gv1.1.bestOrthos.summary` - annotation summary of assigned orthogroups that includes gene counts of scaffold backbone taxa, super clusters (super orthogroups) at multiple stringencies, and functional annotations from sources such as Gene Ontology (GO), InterPro protein domains, TAIR, UniProtKB/TrEMBL and UniProtKB/Swiss-Prot
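The idea behind the `bestOrthos` files can be illustrated with a small sketch that picks the highest-scoring orthogroup for each query sequence. Everything here is an assumption made for illustration: the `(query, orthogroup, score)` record layout, the function names, and the example identifiers are hypothetical and do not reflect the actual file formats. In particular, the rule the tool uses to combine `blastp` and `hmmscan` results is not described in this tutorial, so the sketch only reports the queries on which both searches agree.

```python
def best_orthogroups(hits):
    """Map each query to its highest-scoring (orthogroup, score) pair.

    `hits` is an iterable of (query, orthogroup, score) tuples; this layout
    is hypothetical and chosen only for the example.
    """
    best = {}
    for query, orthogroup, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (orthogroup, score)
    return best


def agreeing_assignments(blastp_best, hmmscan_best):
    """Return {query: orthogroup} for queries where both searches agree."""
    return {
        query: orthogroup
        for query, (orthogroup, _) in blastp_best.items()
        if query in hmmscan_best and hmmscan_best[query][0] == orthogroup
    }


# Made-up hits for two transcripts
blastp_hits = [("t1", "100", 50.0), ("t1", "200", 80.0), ("t2", "300", 40.0)]
hmmscan_hits = [("t1", "200", 120.0), ("t2", "301", 60.0)]

blastp_best = best_orthogroups(blastp_hits)
hmmscan_best = best_orthogroups(hmmscan_hits)
print(agreeing_assignments(blastp_best, hmmscan_best))
```

In this made-up example only `t1` is assigned the same orthogroup by both searches.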
Integrate gene models in pre-computed orthologous gene family clusters with classified transcripts
Select the `GeneFamilyIntegrator` tool from the `PlantTribes` section in your tool panel and enter the following settings on the tool form to integrate `22Gv1.1 OrthoMCL scaffold` orthogroup backbone gene models with classified protein and coding sequences orthogroup fasta files (history item 21).
Figure 10. `GeneFamilyIntegrator` settings
Executing the `GeneFamilyIntegrator` tool with the settings shown in Figure 10 will produce the following item in your history.
Figure 11. `GeneFamilyIntegrator` output
Description of the outputs
- Integrated gene family classification protein and coding sequences orthogroup fasta files (history item 28)
Estimate multiple sequence alignments of integrated orthologous gene family clusters
Select the `GeneFamilyAligner` tool from the `PlantTribes` section in your tool panel and enter the following settings on the tool form to align integrated protein and coding sequences orthogroup fasta files (history item 28) using the `MAFFT` multiple sequence alignment method.
Figure 12. `GeneFamilyAligner` settings
Executing the `GeneFamilyAligner` tool with the `Output additional dataset collection of files` option set to "Yes" (in addition to the settings shown in Figure 12) will produce an additional dataset collection item in your history. The elements of this dataset collection are orthogroup multiple sequence alignment files in fasta format. You can render graphic visualizations of these files by clicking on the visualization icon in the history item. You can render the visualization using the `MSA viewer` to visualize large MSAs interactively, or you can launch Jalview (a JNLP-based multiple sequence alignment editing, visualization, and analysis workbench) to perform additional tasks on the multiple sequence alignment.
Figure 13. `GeneFamilyAligner` output
Description of the outputs
- Trimmed gene family classification protein and coding sequences orthogroup multiple sequence alignments fasta files (history item 29)
- A dataset collection of trimmed gene family classification protein and coding sequences orthogroup multiple sequence alignments fasta files (history item 29)
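The outputs above are described as trimmed alignments. As a rough illustration of what alignment trimming can mean, the sketch below removes columns in which more than a given fraction of sequences have a gap. This is an assumption-laden example, not the tool's method: the function name, the `max_gap_fraction` threshold, and the toy alignment are all hypothetical, and the trimming strategy actually applied by `GeneFamilyAligner` is not stated in this tutorial.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Remove alignment columns whose gap fraction exceeds the threshold.

    `alignment` maps sequence names to aligned sequences of equal length.
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    if not rows:
        return {}
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    # Keep the columns where the share of '-' characters is within the limit
    keep = [
        col for col in range(length)
        if sum(row[col] == "-" for row in rows) / len(rows) <= max_gap_fraction
    ]
    return {
        name: "".join(row[col] for col in keep)
        for name, row in zip(names, rows)
    }


# Toy protein alignment: the third column is a gap in two of three sequences
toy = {"s1": "MK-LV", "s2": "MK-LV", "s3": "M-ALV"}
print(trim_gappy_columns(toy))
```

With the default threshold, only the third column of the toy alignment is removed; the second column stays because just one of three sequences has a gap there.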
Build and visualize phylogenetic trees of aligned orthologous gene family clusters
Select the `GeneFamilyPhylogenyBuilder` tool from the `PlantTribes` section in your tool panel and enter the following settings on the tool form to build phylogenetic trees of trimmed protein and coding sequences multiple sequence alignments (history item 29) using the `FastTree` phylogenetic inference method.
Figure 14. `GeneFamilyPhylogenyBuilder` settings
Executing the `GeneFamilyPhylogenyBuilder` tool with the settings shown in Figure 14 will produce multiple items in your history, including the following dataset collection. The elements of this dataset collection are orthogroup newick phylogenetic tree files, all having the `nhx` Galaxy datatype. You can render graphic visualizations of these files by clicking on the `Visualize` icon for the history item, which lets you render the visualization using either the `Charts` plugin or the `Phyloviz` plugin. The graphic to the left of the bottom arrow in Figure 15 is produced when choosing the `Phyloviz` plugin.
Figure 15. `GeneFamilyPhylogenyBuilder` dataset collection output
Description of the outputs
- A dataset collection of gene family orthogroup newick phylogenetic tree files (history item 30)
The `Charts` plugin provides several options for rendering the visualization.
Figure 16. Charts visualization of one of the dataset collection elements produced by the `GeneFamilyPhylogenyBuilder` tool
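If you download one of the newick tree files and want a quick look at which sequences it contains without a viewer, the leaf labels can be pulled out with a few lines of Python. This is a minimal sketch for simple, unquoted labels only. The function name and the example trees are made up, and for real analyses a dedicated phylogenetics library is the safer choice.

```python
import re


def leaf_names(newick):
    """Return leaf labels from a Newick string, in order of appearance.

    Handles plain, unquoted labels; bracketed comments such as NHX
    annotations are stripped first. A tree consisting of a single leaf
    is not handled.
    """
    # Drop bracketed comments, e.g. [&&NHX:S=species]
    without_comments = re.sub(r"\[[^\]]*\]", "", newick)
    # A leaf label directly follows '(' or ','; internal node labels follow ')'
    return re.findall(r"(?<=[(,])\s*([^\s(),:;]+)", without_comments)


# Made-up example trees
print(leaf_names("((A:0.1,B:0.2)0.95:0.3,C:0.4);"))
print(leaf_names("(gene1:0.1[&&NHX:S=sp1],gene2:0.2[&&NHX:S=sp2]);"))
```

Support values and branch lengths are ignored, because they follow `)` or `:` rather than `(` or `,`.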
Estimate paralogous and orthologous pairwise non-synonymous (Ka) and synonymous (Ks) substitution rates and visualize the distribution with fitted significant components
Select the `KaKsAnalysis` tool from the `PlantTribes` section in your tool panel and enter the following settings on the tool form to estimate paralogous pairwise synonymous (Ks) substitution rates and the significant components of the Ks distribution, and to plot the distribution with the fitted significant components.
Figure 17. `KaKsAnalysis` settings
Executing the `KaKsAnalysis` tool with the settings shown in Figure 17 will produce the items shown on the left side of Figure 18 in your history.
Figure 18. `KaKsAnalysis` outputs and process flow to visualize the distribution with fitted significant components
Description of the outputs (left side of Figure 18)
- Reformatted species1 input coding sequences (history item 187)
- Reformatted species1 input amino acids (history item 188)
- Species1 self blastn results (history item 189)
- Species1 paralogous pairs (history item 190)
- Species1 non-synonymous (Ka) and synonymous (Ks) substitution rates analysis results (history item 191)
- Estimated significant components in the distribution of species1 synonymous (Ks) substitution rates (history item 192)
Visualize the distribution with fitted significant components (process flow depicted in Figure 18)
Select the `KsDistribution` tool from the `PlantTribes` section in your tool panel and select history items 191 and 192 for the inputs as shown in the upper right image in Figure 18. Executing the tool will produce a history item with a `pdf` data format (lower right corner of Figure 18), which is the species1 synonymous (Ks) substitution distribution plot (history item 193). When the `View data` icon for the history item is clicked, the dataset is rendered (lower middle of Figure 18).
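To get a feel for what a Ks distribution plot summarizes, the sketch below bins a list of pairwise Ks values into a simple histogram. It is only an illustration: the function name, the bin width, the upper limit, and the example values are all made up, and the sketch does not perform the component fitting that `KaKsAnalysis` reports. Peaks in such a histogram are the kind of feature that fitted components are meant to describe.

```python
def ks_histogram(ks_values, bin_width=0.1, max_ks=2.0):
    """Count Ks values per bin over [0, max_ks); values outside are ignored."""
    n_bins = round(max_ks / bin_width)
    counts = [0] * n_bins
    for ks in ks_values:
        if 0 <= ks < max_ks:
            # Clamp the index to guard against floating-point rounding
            counts[min(int(ks / bin_width), n_bins - 1)] += 1
    return counts


# Made-up Ks values, not taken from the test data
example_ks = [0.05, 0.12, 0.18, 0.74, 0.76, 2.5]
counts = ks_histogram(example_ks, bin_width=0.25, max_ks=1.0)
for i, count in enumerate(counts):
    print(f"{i * 0.25:.2f}-{(i + 1) * 0.25:.2f} {'#' * count}")
```

Values that fall exactly on a bin boundary can land in either neighbouring bin because of floating-point rounding, which is acceptable for a quick look but not for a careful analysis.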
That's it!
Hopefully this tutorial has given you a taste of what is possible with these `PlantTribes` tools. Experiment! There are many more things that you can do with them. If something does not work, report it using the `Open Chat` button below or the Galaxy support forum.