Chromosome Identifier Mismatch Problems in Tool Inputs
The methods described here help identify and correct errors or unexpected results caused by inputs that have non-identical chromosome identifiers and/or different chromosome sequence content.
If using a Custom Reference Genome, the methods below also apply, but the first step is to make certain that the Custom Genome is formatted correctly. Improper formatting is the most common root cause of Custom Genome related errors.
Find BAM dataset identifiers
Quickly learn what the identifiers are in any BAM dataset.
- Run BAM-to-SAM on the aligned data outputting just the SAM header
- The chromosomes will be listed in the header
- Compare these chromosome identifiers with the chromosome (aka "Chrom") field in all other inputs: VCF, GTF, GFF(3), BED, Interval, etc.
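Once the SAM header is available as text, the comparison can also be done outside of Galaxy. The following is a minimal Python sketch, not a Galaxy tool. It assumes the header and the other input are tab-delimited text, that the chromosome field is the first column (true for BED, GTF, GFF and VCF), and the function names and sample rows are made up for illustration.

```python
def sam_header_chroms(header_lines):
    """Return the SN: values from the @SQ lines of a SAM header, in order."""
    names = []
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def chrom_column(lines, column=0):
    """Return the set of values found in the chromosome column of a tabular input."""
    chroms = set()
    for line in lines:
        # Skip blank lines, comment/header lines, and BED track/browser lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chroms.add(line.rstrip("\n").split("\t")[column])
    return chroms


def missing_from_header(header_lines, data_lines, column=0):
    """Identifiers used in the data that do not appear in the SAM header."""
    known = set(sam_header_chroms(header_lines))
    return sorted(chrom_column(data_lines, column) - known)


# Illustrative sample data (not taken from a real dataset).
header = [
    "@HD\tVN:1.0\tSO:coordinate\n",
    "@SQ\tSN:chr1\tLN:248956422\n",
    "@SQ\tSN:chrM\tLN:16569\n",
]
bed = ["1\t100\t200\tfeatureA\n", "chrM\t5\t50\tfeatureB\n"]
print(missing_from_header(header, bed))  # prints ['1']
```

Any identifier reported here is one that the BAM dataset does not know about, which is the kind of mismatch the methods in this document address.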
Directly obtain UCSC sourced genome identifiers
- Go to http://genome.ucsc.edu/, navigate to "genomes", then the species of interest
- On the home page for the genome build, immediately under the top navigation box, in the blue bar next to the full genome build name, is linked text like "(sequences)"
- Click on this and it will take you to a detail page with a table listing out the contents
- Use the tool "Get Data -> UCSC Main"
- In the Table Browser, choose the target genome and build
- For "group" choose the last option "All Tables"
- For "table" choose "chromInfo"
- Leave all other options at default and send the output to Galaxy
- This new dataset will load as a tabular dataset into your history
- It will list out the contents of the genome build, including the chromosome identifiers (in the first column)
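If the chromInfo dataset is downloaded from the history, the identifiers in its first column can be compared against any other list of identifiers. This is a small Python sketch under the assumption that the file is tab-delimited with the identifier in the first column, as described above. The function names and the sample rows are hypothetical.

```python
def genome_identifiers(chrom_info_lines):
    """Return the first-column identifiers from chromInfo-style tabular lines."""
    ids = []
    for line in chrom_info_lines:
        # Skip blank lines and a leading "#..." header line if present.
        if not line.strip() or line.startswith("#"):
            continue
        ids.append(line.split("\t")[0].strip())
    return ids


def compare_identifiers(genome_ids, input_ids):
    """Report identifiers that appear on only one side of the comparison."""
    genome_ids, input_ids = set(genome_ids), set(input_ids)
    return {
        "only_in_genome": sorted(genome_ids - input_ids),
        "only_in_input": sorted(input_ids - genome_ids),
    }


# Illustrative rows only; a real chromInfo table lists every sequence in the build.
chrom_info = ["#chrom\tsize\n", "chr1\t248956422\n", "chrM\t16569\n"]
report = compare_identifiers(genome_identifiers(chrom_info), ["chr1", "MT"])
print(report)
```

An empty "only_in_input" list means every identifier used by the other input exists in the genome build.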
Adjusting Identifiers or Input source
UCSC sourced data used with Other sourced data
A GTF formatted dataset (potentially a "reference annotation dataset") with Ensembl/UCSC/Other based chromosome identifiers is to be used with a tool that also uses a reference genome from a different source.
Or the reverse may be true: an Ensembl/UCSC/Other sourced reference genome is used with a reference annotation from a different source.
The underlying genome sequence content is otherwise identical. If not, see the next section for alternative methods.
To adjust the Ensembl/Other reference annotation to match a UCSC-sourced reference genome (or another source that uses UCSC-style chromosome names), add a "chr" to the chromosome name, so that "N" becomes "chrN". Use tools from the group "Text Manipulation", as in the examples below.
For bed data:
- Tool Add column: add "chr" to the original dataset as a new column.
- Tool Merge Columns: merge "c7" with "c1"
- Tool Cut: cut "c8,c2,c3,c4,c5,c6" (the merged c8, the new chrom identifier, replaces c1 and c7)
- Click on the pencil icon for the result dataset, then the tab for "Datatype". Assign "bed" and save. Allow the metadata to complete assignment (the "yellow" dataset state)
- Now click on the tab for "Attributes" and assign the remaining columns. Strand = 6, name = 4, and score = 5. Save. For best results with certain downstream tools, allow the metadata to complete assignment
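For a dataset that has been downloaded locally, the same "N" to "chrN" change can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Galaxy workflow itself: it assumes tab-delimited BED text with the chromosome in the first column, and the function name is made up. Skipping names that already start with "chr" is a convenience added here, not something the Galaxy steps do.

```python
def add_chr_prefix(lines, prefix="chr"):
    """Prepend `prefix` to the first column of each BED data line."""
    out = []
    for line in lines:
        # Pass blank lines, comments, and track/browser lines through unchanged.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            out.append(line)
            continue
        fields = line.split("\t")
        # Assumption: names that already carry the prefix are left alone.
        if not fields[0].startswith(prefix):
            fields[0] = prefix + fields[0]
        out.append("\t".join(fields))
    return out


bed = ["track name=demo\n", "1\t100\t200\tgeneA\t0\t+\n", "chrX\t5\t50\tgeneB\t0\t-\n"]
for line in add_chr_prefix(bed):
    print(line, end="")
```

Note that a name such as "MT" becomes "chrMT" with this approach, which will not match a genome that uses "chrM". That situation is covered under "Any mixed sourced data".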
For wig/wiggle data (NOT compressed bigWig):
- Tool Replace parts of text
- File to process: Use Multi-select to select the wig datasets to fix (one or more)
- Find pattern: chrom=
- Replace with: chrom=chr
- Remainder of options left at default
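The find-and-replace above can be mirrored locally with a regular expression. The sketch below assumes plain-text wig lines. The function name is hypothetical, and the negative lookahead that leaves existing "chrom=chr..." values untouched is an extra safeguard that the plain find/replace does not have.

```python
import re

# Match "chrom=" only when it is not already followed by "chr".
_CHROM_FIELD = re.compile(r"\bchrom=(?!chr)")


def prefix_wig_chroms(lines):
    """Rewrite chrom=N as chrom=chrN on wig declaration lines."""
    return [_CHROM_FIELD.sub("chrom=chr", line) for line in lines]


wig = [
    "variableStep chrom=2 span=10\n",
    "1000\t3.5\n",
    "fixedStep chrom=chr3 start=1 step=1\n",
]
for line in prefix_wig_chroms(wig):
    print(line, end="")
```

Data lines contain no "chrom=" text, so they pass through unchanged.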
Any mixed sourced data
The inputs are a match for sequence content, but simply adding "chr" will not make all chromosome identifiers sync up between the inputs. The methods below describe how to fix or replace the inputs so that a match is possible.
The underlying genome sequence content may or may not be identical. Read method descriptions carefully to learn if that method is right for your usage case.
Sequence content is a match but adding "chr" is not enough to obtain an exact identifier match. You want to try to fix the identifiers anyway.
- Manipulations with tools can often be used to split up a dataset, perform text substitutions and additions, concatenate datasets, and perform most other common operations one could do with command-line shell tools.
- The dataset could also be downloaded locally to your computer and manipulated there using command-line tools or the text editor of choice.
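As one example of a local manipulation, identifiers can be renamed through an explicit lookup table when a simple prefix is not enough. The Python sketch below assumes tab-delimited input with the chromosome in the first column. The function name and the mapping shown are illustrative only, and a real mapping has to come from a source you trust for your genome build.

```python
def rename_chroms(lines, mapping, column=0):
    """Rename chromosome identifiers using `mapping`; report names with no entry."""
    renamed, unmapped = [], set()
    for line in lines:
        # Keep blank lines and comment/header lines as they are.
        if not line.strip() or line.startswith("#"):
            renamed.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[column] in mapping:
            fields[column] = mapping[fields[column]]
        else:
            # Left unchanged, but recorded so the gap is visible.
            unmapped.add(fields[column])
        renamed.append("\t".join(fields) + "\n")
    return renamed, sorted(unmapped)


# Hypothetical mapping for illustration.
mapping = {"1": "chr1", "MT": "chrM"}
gtf = [
    "#!comment line\n",
    "1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id \"g1\";\n",
    "MT\tsrc\texon\t5\t50\t.\t+\t.\tgene_id \"g2\";\n",
    "KI270728.1\tsrc\texon\t1\t9\t.\t+\t.\tgene_id \"g3\";\n",
]
renamed, unmapped = rename_chroms(gtf, mapping)
print(unmapped)
```

Any identifier returned in the unmapped list still needs a decision: extend the mapping, remove those lines, or switch to one of the other methods.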
Sequence content is a match but adding "chr" is not enough to obtain an exact identifier match. You DO NOT want to try to fix the identifiers, or doing so is overly complicated, or it is simply not possible to fix the data without an external reference mapping file (not always available).
- Obtain a reference annotation dataset that is a match for the reference genome used
- Sometimes the source is the same for both
- Sometimes the source is the same, but the content of the reference annotation is not ideal for the tools used
- Example: The tool Cuffdiff makes use of specific attributes in the reference annotation (p_id, tss_id, gene_name). If these attributes are not present in the GTF dataset, the results will not be fully annotated and some calculations will be skipped
- Use the iGenomes version of the reference annotation, as described below
- Using Cuffdiff and the Gene ID is not present? Check your GTF file - the attribute gene_name is probably missing
- Sometimes the source can be iGenomes, which does contain the extra specific attributes needed for RNA-seq and certain other operations
- Example: The tool htseq-count is used and the attributes gene_id and transcript_id need to be distinct values (also true for the tool Cuffdiff for the best results)
- Two sources: https://support.illumina.com/sequencing/sequencing_software/igenome.html and http://cole-trapnell-lab.github.io/cufflinks/igenome_table/index.html
- Download the .tar file locally, uncompress it, then upload just the genes.gtf dataset to Galaxy
- Note: the compression format .tar is not accepted by the Upload tool
- If an upload of a .tar dataset is attempted, the load may fail or only the first file in the archive will be uploaded (and it will not be the genes.gtf file)
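Pulling just the genes.gtf file out of a downloaded archive can be scripted. This is a minimal Python sketch using the standard tarfile module. The function name and output naming scheme are made up, and it assumes nothing about the internal folder layout of the archive beyond the file being named genes.gtf. If an archive holds more than one genes.gtf, each copy is written out under a distinct name so you can choose between them.

```python
import shutil
import tarfile
from pathlib import Path


def extract_genes_gtf(archive_path, dest_dir, wanted="genes.gtf"):
    """Copy every regular file named `wanted` out of a tar archive.

    Returns the list of paths written. Output names are the member path
    with "/" replaced by "_" so multiple copies do not collide.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    # Mode "r" lets tarfile detect gzip/bzip2/xz compression on its own.
    with tarfile.open(archive_path, "r") as tar:
        for member in tar.getmembers():
            if not member.isfile() or Path(member.name).name != wanted:
                continue
            target = dest / member.name.strip("/").replace("/", "_")
            # Stream the member to disk instead of extracting the whole archive.
            with tar.extractfile(member) as source, open(target, "wb") as out:
                shutil.copyfileobj(source, out)
            written.append(target)
    return written
```

A call such as extract_genes_gtf("downloaded_archive.tar.gz", "gtf_out") (both names are placeholders) leaves only the GTF file or files in gtf_out, ready to upload to Galaxy.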
Sequence content is NOT a match or you want to try using a different reference genome instead of a different reference annotation source (reverse of Method 6 above).
- Map against the same reference genome that the reference annotation is based on
- Where and if this reference genome is available will depend on the genome build
- In most cases, the source will be the same for both
- If loading your own genome, make sure it is formatted correctly as a Custom Genome
- Promote the Custom Genome to a Custom Build and assign the genome/build metadata attribute to datasets
- Custom Reference Genome help
- Be aware that if the genome is large, this option may result in a memory failure. Try Method 2 or consider moving to a local or cloud Galaxy where you can control the resources