Help for Differential Expression Analysis
FAQs and Galaxy Help Q&A. Most tool errors have been discussed or have existing help:
- Getting Inputs Right
- Format help for Tabular/BED/Interval Datasets
- Common datatypes explained
- Search all Prior Q&A and Galaxy Resources
The error and usage help in this FAQ applies to:
Expect odd errors or content problems if any of the usage requirements below are not met:
- Differential expression tools all require count dataset replicates when used in Galaxy: at least two per factor level, and the same number for every factor level. Each replicate must contain unique content.
- Factor/Factor level names should only contain alphanumeric characters and optionally underscores. Avoid starting these with a number and do not include spaces.
- If the tool uses Conditions, the same naming requirements apply. DEXSeq additionally requires that the first Condition is labeled as
- Reference annotation should be in GTF format for most of these tools, with no header/comment lines. Remove all GTF header lines with the tool Remove beginning of a file. If any comment lines are internal to the file, those should also be removed. The tool Select can be used.
- Make sure that if a GTF dataset is used, and tool form settings are expecting particular attributes, those are actually in your annotation file (example: gene_id).
- GFF3 data (when accepted by a tool) should have a single # comment line. Any others (at the start or internal), which usually start with ##, should be removed. The tool Select can be used.
- If a GTF dataset is not available for your genome, a two-column tabular dataset containing transcript <tab> gene can be used instead with most of these tools. Some reformatting of a different annotation file type might be needed. Tools in the groups under GENERAL TEXT TOOLS can be used.
- Make sure that if your count inputs have a header, the option Files have header? is set to Yes. If no header, set to
- Custom genomes/transcriptomes/exomes must be formatted correctly before mapping. FAQ: Preparing and using a Custom Reference Genome or Build
- Any reference annotation should be an exact match for any genome/transcriptome/exome used for mapping. Build and version matter. FAQ: Mismatched Chromosome identifiers (and how to avoid them)
- Avoid using annotation extracted from the UCSC Table Browser. All GTF datasets from the UCSC Table Browser have the same value populated for both transcript_id and gene_id: the transcript identifier. This creates scientific content problems, because counts will effectively be summarized "by transcript" and not "by gene", even if a tool's output labels them as "by gene". It is usually possible to extract gene/transcript in tabular format from other related tables. Review the Table Browser usage at UCSC for how to link/extract data, or ask them for guidance if you need extra help to get this information for a specific data track.
- Note: Selected genomes at UCSC do have a reference annotation GTF pre-computed and available with a Gene Symbol populated into the "gene_id" value. Find these in the UCSC "Downloads" area. When available, the link can be directly copy/pasted into the Upload tool in Galaxy. Allow Galaxy to autodetect the datatype to produce an uncompressed GTF dataset in your history ready to use with tools. Examples:
Review the documentation for these tools to better understand the usage. Galaxy tutorials also cover the topic.
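The naming and replicate requirements listed above can be checked before setting up a tool form. The following is a minimal Python sketch for a local check outside of Galaxy, not part of any Galaxy tool. The function names and the layout of the `levels` dictionary (factor level name mapped to a list of count dataset labels) are assumptions made for illustration. It does not check that replicate content is unique.

```python
import re
from typing import Dict, List

# Letters, digits and underscores only; the first character must not be a digit.
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_name(name: str) -> bool:
    """Return True if a factor or factor level name follows the naming rule."""
    return NAME_RE.fullmatch(name) is not None


def check_design(levels: Dict[str, List[str]]) -> List[str]:
    """Return a list of problems found in a {level_name: [replicates]} design."""
    problems = []
    for level, replicates in levels.items():
        if not is_valid_name(level):
            problems.append(f"invalid level name: {level!r}")
        if len(replicates) < 2:
            problems.append(f"fewer than two replicates for level: {level!r}")
    # Every factor level should have the same number of replicates.
    if len({len(reps) for reps in levels.values()}) > 1:
        problems.append("unequal number of replicates per factor level")
    return problems
```

An empty list returned by `check_design` means none of these particular problems were found. It does not guarantee that a tool will run without errors.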
Tool publication/help links are usually at the very bottom of tool forms. Review these links if they apply to the tools you are using.
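The GTF requirements described in the list above (no comment lines, expected attributes present, gene_id distinct from transcript_id) can also be inspected locally before upload. The sketch below is an illustration in Python, not the method used by any Galaxy tool, and all function names are hypothetical. It assumes the standard nine-column, tab-separated GTF layout with quoted attribute values such as gene_id "X"; in the last column.

```python
import re

# Matches attribute pairs written as: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def strip_comments(lines):
    """Drop blank lines and any line starting with '#' (header or internal)."""
    return [ln for ln in lines if ln.strip() and not ln.lstrip().startswith("#")]


def parse_attributes(gtf_line):
    """Return the quoted attributes from the ninth GTF column as a dict."""
    fields = gtf_line.rstrip("\n").split("\t")
    return dict(ATTR_RE.findall(fields[8]))


def transcript_to_gene(lines):
    """Build sorted, de-duplicated (transcript_id, gene_id) pairs."""
    pairs = set()
    for ln in strip_comments(lines):
        attrs = parse_attributes(ln)
        if "transcript_id" in attrs and "gene_id" in attrs:
            pairs.add((attrs["transcript_id"], attrs["gene_id"]))
    return sorted(pairs)


def gene_ids_look_like_transcripts(pairs):
    """True if every gene_id equals its transcript_id, as described above
    for GTF datasets extracted from the UCSC Table Browser."""
    return bool(pairs) and all(t == g for t, g in pairs)
```

The pairs returned by `transcript_to_gene` can be written out one per line, joined with a tab, to produce the two-column transcript <tab> gene table mentioned above. If `gene_ids_look_like_transcripts` returns True, counts summarized "by gene" from that annotation would effectively be per transcript.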