Galaxy Project Workshop 2015

The Galaxy Ecosystem

http://galaxyproject.org

  1. Galaxy Main
  2. Ways to use Galaxy, explained in Choices
  3. Tool Shed Wiki and Main Tool Shed at http://usegalaxy.org/toolshed
  4. More resources at Learn, Support, Teach, and Galaxy Biostars (linked access: Support/Biostar)

Basic Analysis with Galaxy

Protocol

Completed History https://usegalaxy.org/u/usinggalaxy2/h/revised-galaxy-variant-101

  1. History menu Create New and rename Basic
  2. Go to page Galaxy Variant 101: Introduction to Polymorphism Detection via Variant Analysis
    1. go to Shared Data -> Published Pages and find it in the list
    2. optionally search by 101
  3. Get input datasets by importing the datasets from the Page
    1. paired end DNA data, two conditions
      • first set child
      • second set mother
  4. Optionally import the entire finished history
  5. Import Workflow
  6. Edit Workflow
    1. rename to Revised Galaxy Variant 101
    2. Change database to hg19
      • for tools BWA for Illumina, Freebayes, and Naive Variant Caller
  7. Execute the Workflow (a scripted alternative via the API is sketched after this protocol)
    1. send to new history
    2. auto-named after the workflow, e.g. Revised Galaxy Variant 101
  8. Examine the FreeBayes, NVC, and Filter results for polymorphisms
  9. Homework:
    1. upload a SNP reference dataset, compare to the VCF datasets, and see if any of the polymorphic variants are annotated
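
For scripted runs, the same protocol can be driven through the Galaxy API with BioBlend. A minimal sketch in Python, assuming the workflow and input datasets were already imported as above; the API key and dataset IDs are placeholders:

    from bioblend.galaxy import GalaxyInstance

    # Connect to Galaxy Main; create an API key under User -> API Keys.
    gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

    # Create the destination history, mirroring "send to new history".
    history = gi.histories.create_history(name="Revised Galaxy Variant 101")

    # Look up the imported workflow by name (step 5 above).
    workflow = gi.workflows.get_workflows(name="Revised Galaxy Variant 101")[0]

    # Map the workflow's input steps to dataset IDs -- placeholders here;
    # supply one entry per input step defined in the workflow.
    inputs = {
        "0": {"src": "hda", "id": "CHILD_DATASET_ID"},
        "1": {"src": "hda", "id": "MOTHER_DATASET_ID"},
    }
    gi.workflows.invoke_workflow(workflow["id"], inputs=inputs,
                                 history_id=history["id"])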

RNA-seq Examples

Differential Expression using Tuxedo pipeline: Known vs Novel splice variants

Review the source tutorial overview under Teach, named Teach/Resource/GVL-RNA-SeqTutorial. This summary points to the full exercise created by the Genomics Virtual Lab (GVL).

Wikis: Support and several others under Learn

Known Protocol

Completed History https://usegalaxy.org/u/usinggalaxy2/h/gvl-rna-seq-dm3-known

  1. History menu Create New and rename GVL RNA-seq dm3 Known
  2. Get input data
    • single end RNA data, two conditions, three replicates each
    • Use the Upload tool to load datasets via FTP links
      1. batch load the six fastq datasets
      2. assign datatype fastqsanger and database dm3
      3. load the single GTF annotation dataset
      4. auto-detect its format and assign database dm3
    • Optionally import history https://usegalaxy.org/u/usinggalaxy2/h/gvl-rna-seq-dm3-inputs
      • rename history to end with Known
  3. Execute Compute quality statistics
  4. Execute Draw nucleotides distribution chart
    1. review how the output could be used for QA
  5. Run FastQC on one dataset
    1. determine the read length (needed for Tophat), ~75 bases here; a stand-alone check is sketched after this protocol
  6. Execute Tophat
    • maps spliced RNA data to a reference genome
    • parameters
      • set Is this library mate-paired? as Single-end
      • use Multiple datasets
      • runs distinct jobs in batch
    • set Tophat settings to use as Full parameter list
      • set Gene Model Annotations as GTF reference annotation dataset
      • use default Only look for supplied junctions as No
      • set remaining parameters as default
  7. while running, use re-run to examine parameters in more detail
    1. note version (wrapper and binary) at top right of tool form
      • Versions (wrapper)
      • Options link to Tool Shed (wrapper and binary)
      • the version is also reported after execution on the dataset's Info form (the "i" icon)
    2. note option Minimum length of read segments
      • read segments must be no longer than ½ of the input read length or bias will result
      • the input reads are >= 50 bases (2 x the default of 25); leave at default
    3. note option Use Own Junctions
      • optional input dataset, not used
  8. Execute Cuffdiff
    • performs differential expression analysis
    • parameters
      • set Condition C1
      • datasets C1 R1, R2, R3
      • set Condition C2
      • datasets C2 R1, R2, R3
      • make the following changes for the Known analysis:
      • set Transcripts
        • GTF known genes/transcript reference annotation
      • set Perform Bias Correction
        • Locally cached dm3
      • use Set Additional Parameters
        • set Average Fragment Length as 70
        • set Fragment Length Standard Deviation as 2 (or 0)
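
Step 5 takes the read length from the FastQC report; the same number can be checked directly from the file. A minimal sketch in Python, assuming an uncompressed fastqsanger dataset on disk (the filename is a placeholder):

    # Report the read lengths seen in the first records of a FASTQ file.
    # A FASTQ record is 4 lines; line 2 of each record holds the sequence.
    lengths = set()
    with open("C1_R1.fastq") as fh:                  # placeholder filename
        for i, line in enumerate(fh):
            if i % 4 == 1:                           # sequence lines
                lengths.add(len(line.rstrip("\n")))
            if i >= 400:                             # 100 records are plenty
                break
    print("read lengths observed:", sorted(lengths))  # expect ~75 here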

Novel Protocol

Completed History https://usegalaxy.org/u/usinggalaxy2/h/gvl-rna-seq-dm3-novel

  1. Review the Tuxedo protocol at http://cole-trapnell-lab.github.io/cufflinks/manual
    1. Known (above) is a protocol that does not consider novel alternative splicing
      • only Known splice variants are considered, from GTF (not Novel)
  2. Run workflow again with Cufflinks and Cuffmerge to include Novel transcript splice variants found in the sequence input datasets in the analysis
  3. Use History Menu: Copy datasets
    1. send to new history named GVL RNA-seq dm3 Novel
    2. Tophat accepted hits
    3. GTF reference annotation
  4. Execute Cufflinks
    • assembles aligned reads into spliced transcripts
    • note that this does not create consensus transcripts, only the position/coordinates of transcript variants
    • inputs
      • set batch mode for SAM or BAM file of aligned RNA-Seq reads using Multiple Datasets
      • select all six Tophat BAM datasets
    • parameters
      • use defaults except for:
      • set Perform Bias Correction as Yes
      • note that this interprets the database from the inputs
      • bias correction is command-line option "-b" (--frag-bias-correct), which uses the reference genome sequence
    • While running, click on re-run to examine options in more detail
      • why is Use Reference Annotation set as No?
      • this is command-line option "-G", which when used will ignore novel splices
  5. Execute Cuffmerge
    • joins together novel and known splices
    • creates a merged GTF reference annotation dataset for use in Cuffdiff
    • inputs
      • set Cufflinks output assembled transcripts as the core set of inputs
      • set Reference Annotation as the known gene/transcript GTF annotation dataset
    • parameters
      • set Sequence Data as locally cached dm3
    • While running, click on re-run to examine options in more detail
      • why are inputs entered individually and not as batch?
      • click on batch option for Multiple Datasets and review
      • not appropriate for this job since all inputs are for the same run
  6. Execute Cuffdiff
    • performs differential expression analysis
    • inputs
      • copy one Cuffdiff dataset from the History named Known
      • click on pencil icon to Edit Attributes
      • set Info field to Known
      • copy existing contents first to Annotation field if desired
      • note Annotation field is not displayed in the History panel’s dataset view by default
    • use re-run from the Cuffdiff: Known dataset
      1. parameters
      2. set Transcripts
        • GTF combined reference annotation result from Cuffmerge
      3. set Condition C1
        • datasets C1 R1, R2, R3
      4. set Condition C2
        • datasets C2 R1, R2, R3
      5. set Perform Bias Correction
        • Locally cached dm3
      6. use Set Additional Parameters
        • set Average Fragment Length as 70
        • set Fragment Length Standard Deviation as 2 (or 0)
  7. Once complete, click on pencil icon to Edit Attributes
    1. set Info field to Novel
    2. copy existing contents first to Annotation field if desired

Compare

Completed History https://usegalaxy.org/u/usinggalaxy2/h/gvl-rna-seq-dm3-compare

  1. Use History Menu: Copy datasets
    1. all 7 differential expression Cuffdiff outputs from "Known"
    2. all 7 differential expression Cuffdiff outputs from "Novel"
    3. send both to a new history named GVL RNA-seq dm3 Compare
  2. Filter each Cuffdiff DE result for significant rows
    1. How can this be done? (the same filter is sketched as a stand-alone script after this list)
      • Filter is a simple tool
      • it can run in batch using Multiple datasets (use the command key to select each)
      • try With following condition as c14=='yes' (significant)
      • and Number of header lines to skip as 1
    2. use this workflow to run Filter in batch with labels https://usegalaxy.org/u/usinggalaxy2/w/gvl-rna-seq-dm3-compare
  3. enable Scratchbook
    • open filtered datasets as pairs to compare Known vs Novel results
    • disable Scratchbook when done
  4. Homework:
    1. copy all Known and Novel datasets into the same history
      • graph statistics and compare between Known and Novel protocols
      • visualize the data for both runs in Trackster and/or IGV using GVL protocol as a guide
    2. extract a workflow for Known and Novel, annotate, edit, and re-run it.
      • Extract Workflow is included in the next topic Creating Production Workflows
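
The Filter tool evaluates a Python expression against each row of a tabular dataset, so the significance filter above is easy to reproduce outside Galaxy. A minimal sketch, assuming a tab-separated Cuffdiff DE output with one header line and the significant flag in column 14 (filenames are placeholders):

    # Keep rows where column 14 (the Cuffdiff "significant" flag) is "yes",
    # skipping one header line -- the same logic as Filter with c14=='yes'.
    with open("cuffdiff_gene_diff.tabular") as fh, \
         open("significant.tabular", "w") as out:
        out.write(next(fh))              # "Number of header lines to skip" = 1
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if fields[13] == "yes":      # c14 is 1-based; Python index is 13
                out.write(line)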

Creating Production Workflows

Polish Workflows in the editor by including annotation and variables. Learn how to customize tool panel display and tune parameters within the editor.

Wikis: Learn/AdvancedWorkflow and Learn/AdvancedWorkflow/Extract and Learn/AdvancedWorkflow/BasicEditing

Protocol

Starting History https://usegalaxy.org/u/usinggalaxy2/h/workflow0

  1. Import starting History
  2. History menu Extract Workflow, edit (update/replace tools, connect noodles), and execute
  3. Annotate the inputs (name & datatype) and execute
  4. Add in the #{input} variable (inherited name label) and execute
  5. Add in the ${input} variable (run-time name label) and execute
  6. Hide intermediate datasets and execute
  7. Fully annotate each step for Publication and execute
  8. Promote the final Workflow in the Tool Panel

Target Topics

Wikis: Histories and Learn/ManagingDatasets

  1. History menu Copy History (use RNA-seq Novel)
    1. click on link to new copy
  2. Hidden and Delete
    1. turn Operations on multiple datasets on
      • hide 3 datasets (reversible)
      • unhide one (reversible)
      • delete 3 datasets (reversible)
      • undelete one (reversible)
      • permanently delete one (permanent, recovers disk space)
    2. turn Operations on multiple datasets off
  3. use the deleted and hidden links at the top of the History panel
    1. click to show each (reversible)
      • unhide one (reversible)
      • undelete one (reversible)
      • permanently delete one (permanent)
    2. click each link again to return to the normal view
  4. History menu Unhide Hidden Datasets (batch, reversible 1 at a time)
  5. History menu Delete Hidden Datasets (batch, reversible)
  6. History menu Purge Deleted Datasets (batch, permanent, recovers disk space)
  7. History menu Show Structure
  8. History menu Export Citations (beta)
  9. History menu Delete (reversible) - don’t do
  10. History menu Delete Permanently (permanent, recovers disk space) - don’t do
  11. Annotation
    • annotate one
    • click on the pencil icon to show that the Edit Attributes form also contains the annotation
  12. Tags
    • add tags to 3 datasets, one of them a BAM dataset
      • workshop
      • workshop bam
      • gtf
  13. History menu Create New and rename Tags
    • User -> Saved Datasets filter by tags
      • gtf -> shows all datasets of datatype = gtf
      • bam -> shows all datasets of datatype = bam
      • workshop -> shows only datasets with tag workshop
      • click on Tags to show detail
      • copy workshop datasets to the current history Tags
  14. Download dataset
    • Download dataset -> bam (rename to remove history number)
    • Download bam_index -> bam.bai
  15. History menu Saved Histories
    1. rename RNA-seq Novel copy to delete copy
    2. delete delete copy (reversible)
    3. Advanced search as deleted
    4. undelete delete copy
    5. Advanced search as active
    6. permanently delete delete copy (permanent, recovers disk space)
    7. Advanced search as all to show status

Sharing and Publishing

  1. from Saved Histories, select the Tags History
    1. pull-down menu Share or Publish
    2. Share or Publish form
      • accessible link
      • publish
      • share with another user
  2. go to Shared Data -> Published Histories and click on Tags
    1. Upper right corner Switch to this history
    2. History menu Share or Publish
      • unshare
  3. History menu Histories Shared with me
    • import
    • unshare from account
  4. Same form and options for Workflows, Pages, and Visualizations

Importing

Wikis: Support

Upload

Completed History https://usegalaxy.org/u/usinggalaxy2/h/import

  1. History menu Create New and rename Import
  2. go to the Upload File tool (Get Data or top of left Tool Panel)
    1. Choose local file
      • select downloaded bam dataset
      • when a BAM dataset is uploaded, Galaxy coordinate-sorts it and creates the index
      • do not load a bam.bai dataset
      • set Type as Auto-detect and Genome as dm3
      • Start
  3. Paste/Fetch data
    1. open Upload File tool if not already open
    2. URLs pasted here are fetched via http/ftp (as with the FTP links used earlier)
    3. paste in raw data
      • type in a simple BED3 format file: three tab-separated columns (chrom, start, end), a few lines, no extra lines
      • example line: chrM 4000 5000
      • set Type as Auto-detect and Genome as hg19
      • click on gear icon and check box for Convert spaces to tabs
      • close Upload configuration box
      • Start
  4. FTP load
    1. find a target genome, we’ll use H1N1
      • google search with terms ncbi swine flu genome
      • click on first link
      • note section New and click on here
      • on the next form, filter Find related data as Genome
      • click on Bioprojects: PRJNA15521 link for third listed Influenza A virus HxNx (H1N1)
      • on next form, click on Assembly details Assembly GCF_000865725.1 link under Chrs -> 8
    2. the download is from the NCBI H1N1 Genome web page
    3. select Send to -> Choose Destination: File, Format: FASTA, Sort by: Default order
      • click on Create File
      • rename to H1N1.fasta and Save
    4. open Filezilla (if you have it) or another FTP client; a Python alternative is sketched after this list
      • settings
      • Host: usegalaxy.org plus your account email and password then Quickconnect
      • action
      • left side is your computer, right side is the Galaxy server
      • on your computer, navigate to the file on the left, then drag it to the right side to upload
      • check Successful transfers at bottom to ensure complete loading
      • quit out of Filezilla
    5. open the Upload File tool if not already open
      • click on Choose FTP File, then checkbox next to loaded H1N1.fasta file
      • Start
      • Close
  5. Examine all datasets
    1. note that the NCBI fasta dataset is formatted incorrectly
      • spaces between fasta records
      • very long descriptions that will likely cause problems if used as a Custom Reference Genome
      • we will correct this in the sub-topic Custom Genomes
  6. Homework:
    1. use the pasted BED dataset with the tool Extract Genomic DNA to retrieve genomic sequence in fasta format for the specified interval (use the built-in genome for this job); an example with a custom genome is on the Support wiki
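
FileZilla can be replaced by any FTP client, including a few lines of Python's standard library. A minimal sketch of the same transfer as step 4, assuming your Galaxy account credentials and the H1N1.fasta file saved above (credentials are placeholders):

    from ftplib import FTP

    # Galaxy Main accepts FTP uploads authenticated with your account
    # email and password (placeholders below).
    ftp = FTP("usegalaxy.org")
    ftp.login(user="you@example.org", passwd="YOUR_PASSWORD")

    # Transfer in binary mode; the file then appears under "Choose FTP File"
    # in the Upload File tool.
    with open("H1N1.fasta", "rb") as fh:
        ftp.storbinary("STOR H1N1.fasta", fh)
    ftp.quit()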

External Sources

  1. Get Data: UCSC Main
    1. form options
      • set clade: Mammal, genome: Human, assembly: hg19, group: Genes and Gene Predictions, track: RefSeq Genes, table: refGene, position: chrM (lookup)
    2. Run the tool twice
      • run1 GTF
      • set output format GTF
      • leave Send output to Galaxy box checked
      • get output
      • click Send query to Galaxy
      • run2 Selected fields
      • click on describe table schema
      • note that the field name contains the transcript_id and name2 contains the gene_id
      • use browser back button to return to the form
      • set output format selected fields from primary and related tables
      • leave Send output to Galaxy box checked
      • get output
      • on next form, only check name and name2
      • click done with selections
      • click Send query to Galaxy
  2. Get Data: EBI SRA
    1. search ERX010058 (Swine flu aka HxNx, paired-end Illumina reads)
    1. click into the Experiment (1 result found)
      • use tab Read Files
      • sequence data is in column Fastq files (galaxy)
      • load File1 & File2 for Sample Accession named ERS003429 (2nd in list)
      • will take 2 cycles through the tool
      • search by accession ERS003429 for File2
    2. dataset will be loaded directly from the tool
  3. Examine all datasets
    1. note that the UCSC GTF dataset is formatted incorrectly
      • UCSC GTF datasets place the transcript name in both transcript_id and gene_id in the 9th field
      • this format is problematic if used with most tools
      • the GTF file is corrected in the sub-topic Other Dataset Manipulations
    2. note that the SRA datasets have the datatype set as fastq
      • fastqsanger is required for most tools
      • confirm the quality score scaling and assign the correct datatype in the sub-topic FASTQ Dataset Prep & Troubleshooting

Libraries

  1. go to Shared Data: Data Libraries
  2. search for iGenomes
    1. click into the Library
      • check the box for the dataset hg19_genes.gtf
      • use default setting For selected datasets: Import to current history
      • click on Go
  3. go back to Analyze Data

Shared

Wiki: Learn/Share

  1. share a Workflow with another user
  2. go to Workflow
    1. the shared workflow(s) will appear under Workflows shared with you by others
    2. import the Workflow
      • the Workflow is available to edit, run, rename, etc.
  3. go back to Analyze Data
  4. Homework:
    • share a History, Page, and/or Visualization
    • import other objects under Shared Data: Published XXX

FASTQ Dataset Prep & Troubleshooting

Wiki: Support

  1. History menu Saved Histories
  2. History menu Copy Dataset
    • from Import, copy the two fastq datasets into a new History named Fastq
  3. Execute FastQC
    • use Multiple datasets selecting both datasets
    • examine the results: Illumina 1.8+ encoding corresponds to the datatype fastqsanger
    • meaning: Sanger Phred+33 scaled quality scores (see the sketch below)
    • click on the pencil icon for each dataset to reach the Edit Attributes form and assign the fastqsanger datatype
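
FastQC infers the encoding from the range of quality characters, and the check can be reasoned through directly: Phred+33 (fastqsanger) quality characters start at ASCII 33, while the older +64 scalings start at ASCII 59 (Solexa) or 64 (Illumina 1.3-1.7), so any character below ASCII 59 proves Phred+33. A minimal sketch (the filename is a placeholder):

    # Guess the quality score scaling of a FASTQ file from its quality lines
    # (line 4 of every 4-line record).
    def guess_scaling(path, max_records=1000):
        low = 255
        with open(path) as fh:
            for i, line in enumerate(fh):
                if i % 4 == 3:                      # quality lines
                    codes = [ord(c) for c in line.rstrip("\n")]
                    if codes:
                        low = min(low, min(codes))
                if i >= max_records * 4:
                    break
        if low < 59:
            return "Phred+33 (fastqsanger)"         # Sanger / Illumina 1.8+
        if low < 64:
            return "Solexa+64 (fastqsolexa)"        # Solexa / Illumina < 1.3
        return "Phred+64 (fastqillumina)"           # Illumina 1.3-1.7

    print(guess_scaling("ERS003429_1.fastq"))       # placeholder filename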

Other Dataset Manipulations

BONUS as time allows

  1. Create a correct GTF: inputs are the UCSC Table Browser files from the Import history, fixed with a combination of Text Manipulation tools (sketched below)
  2. Select content from a GTF: input is the iGenomes GTF dataset from the Import history, using regular expressions and the Select tool
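
As noted under External Sources, UCSC GTF output reuses the transcript name for both transcript_id and gene_id, and the name/name2 table from the second Table Browser run supplies the real gene IDs. A minimal sketch of the same repair outside Galaxy, assuming the two downloaded files (filenames are placeholders):

    import re

    # Build a transcript -> gene map from the "selected fields" download
    # (column 1: name = transcript_id, column 2: name2 = gene symbol).
    tx_to_gene = {}
    with open("refGene_name_name2.tabular") as fh:
        for line in fh:
            if line.startswith("#"):                # UCSC header line
                continue
            name, name2 = line.rstrip("\n").split("\t")[:2]
            tx_to_gene[name] = name2

    # Rewrite gene_id in the 9th field of the UCSC GTF using the map.
    with open("refGene.gtf") as fh, open("refGene_fixed.gtf", "w") as out:
        for line in fh:
            m = re.search(r'transcript_id "([^"]+)"', line)
            if m and m.group(1) in tx_to_gene:
                line = re.sub(r'gene_id "[^"]+"',
                              'gene_id "%s"' % tx_to_gene[m.group(1)], line)
            out.write(line)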

Custom Genomes

Wikis: Support and Learn/CustomGenomes

  1. History menu Saved Histories
  2. History menu Copy Dataset
    • from Import, copy the H1N1.fasta dataset into a new History named Custom
  3. using methods from the wiki in the Learn/CustomGenomes section, reformat the dataset
    • fasta datasets should be correctly formatted to specification before beginning any analysis
  4. Troubleshooting #3 Extra spaces
  5. Troubleshooting #6 Remove description from identifier line
  6. Troubleshooting #4 Wrap fasta lines to a consistent length (the three fixes are combined in the sketch after this list)
  7. Homework:
    1. create a workflow that includes all steps and use it whenever a new fasta dataset is loaded for use as a Custom Genome
    2. copy the fastqsanger datasets from the History named Fastq and map to this genome
    3. follow the instructions in the wiki Learn/CustomGenomes to promote the Custom Genome to a Custom Build
      • assign the new build database to all related datasets
      • database assignment is needed to visualize in Trackster using wiki guide Learn/Visualization
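
The three troubleshooting fixes above can be applied in one pass outside Galaxy as well. A minimal sketch that drops blank lines, trims each ">" line to the identifier alone, and re-wraps sequence to a consistent width (filenames are placeholders):

    # Clean a FASTA file: remove blank lines, strip descriptions from the
    # identifier lines, and wrap sequence to a fixed line width.
    WRAP = 80

    def write_seq(out, seq):
        for i in range(0, len(seq), WRAP):
            out.write(seq[i:i + WRAP] + "\n")

    with open("H1N1.fasta") as fh, open("H1N1_clean.fasta", "w") as out:
        seq = ""
        for line in fh:
            line = line.strip()
            if not line:                            # extra blank lines
                continue
            if line.startswith(">"):
                write_seq(out, seq)                 # flush previous record
                seq = ""
                out.write(line.split()[0] + "\n")   # identifier only
            else:
                seq += line
        write_seq(out, seq)                         # flush the final record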



Thanks for using Galaxy!

The Galaxy Team


