Data Preparation
Please note that "built-in" or "cached" data can now be managed directly from within the Galaxy admin interface. For details, see Data Managers Overview and our Data Managers Tutorial.
NOTE: Be aware that as of early 2014, builds are incorporated into the Galaxy schema in tables. Data Managers are recommended for indexing new genomes (these are found in the ToolShed). This wiki is considered legacy and is provided as a reference.
Builds list changes
If you still choose to do this manually, start with the instructions at Data Integration. Note the impact: more than just a builds.txt file is needed to establish a new reference genome. Make certain that your server has the necessary changes/additions to the Data Tables model, or use the alternate configuration file. Then follow the guide here for the organization and execution of data preparation tasks in a local or cloud instance.
Data and indexes hosted at http://usegalaxy.org
Using the Galaxy team's version of reference genomes and indexes can often be a good strategy for those working with both a local and the public Main instance. All options are described at http://datacache.galaxyproject.org/. More details about the rsync server and its contents are at Usegalaxy.org Rsync.
What's in this wiki?
This wiki shows you how to organize, index, and link in your local built-in data for the most commonly used tools. Galaxy's web tool forms are each a web-accessible input wrapper that interacts with one or more underlying tools. Many require that reference data be indexed in a specific way as one of the inputs, whether specifically selected on the form by the user or interpreted from the other input's metadata (specifically, the "database" attribute, or dbkey).
Although a reference genome can be used from the history with most tools (see Custom Genomes), this is a resource-intensive process, and local built-in indexes mean quicker job execution and reduced server load.
The link between a tool and built-in data is a configurable ".loc" file.
Setting up Tools and Reference Data
1. Determine the tool versions needed, the source, and any dependencies.
2. Decide how you will organize your tools.
3. Install tools.
4. Decide how you will organize your data.
5. Install data.
6. Restart your server!
7. Repeat 3, 5, 6 as needed.
Tips for Installing Tools
The how-to is below, but this section gives a quick overview by tool or tool group.
General
For tools from the "Fastx Toolkit", go to the http://hannonlab.cshl.edu/fastx_toolkit web site for the current download and instructions.
Two other common dependencies are "rpy" (a Python library) and "R". Start with rpy. If you run into problems, check whether R was built with the --enable-R-shlib option (problems are unlikely, since this is the default in pre-compiled binaries). Go to http://rpy.sourceforge.net and http://www.r-project.org for the current download and instructions.
Bowtie/Bowtie2 installation
To install Bowtie and/or Bowtie2, go to http://bowtie-bio.sourceforge.net/index.shtml or http://sourceforge.net/projects/bowtie-bio/files/bowtie2. Download (source or binary) and follow the instructions per tool.
BWA installation
To install BWA, download the source from http://bio-bwa.sourceforge.net. To install, open the archive and run make in the new BWA directory.
LASTZ installation
LASTZ is downloaded from http://www.bx.psu.edu/miller_lab/dist (e.g. lastz-X.0n.m.tar.gz). Installation help is at http://www.bx.psu.edu/miller_lab/dist/README.lastz-X.0X.X0/README.lastz-X.0X.X0.html.
Extract Genomic DNA installation
The Extract tool is downloaded from http://genome.ucsc.edu. It uses the same reference index as LASTZ, and the instructions for the data prep are merged below.
Megablast installation
Megablast in Galaxy was updated to use NCBI BLAST+ (BLASTN) in April 2012 (changeset 0b5cb60e4810). See the dependencies wiki for the current version, then download blast+. Many data indexes are available directly from NCBI at ftp://ftp.ncbi.nlm.nih.gov/blast/db/
Picard/SRMA installation
SRMA is a Java program that relies on Picard, a Java implementation of C Samtools. SRMA is available from the Sourceforge SRMA project. The SRMA jar file should be named srma.jar and placed in $GALAXY_PATH/tool-data/shared/jars. If you want to compile SRMA from source, you will also need to install Picard (Sourceforge Picard project) and extract it into the lib directory of the SRMA directory. You can get more info on SRMA from its wiki.
(Note that there also is a C version of SRMA, but the Galaxy team does not use it as of the last page edit.)
NGS: SAM Tools
SAM Tools is highly recommended, if not actually considered required, for every local instance running any other tools in the "NGS:" tool groups. SAM Tools is available at http://samtools.sourceforge.net. To install: unpack the archive in a new /samtools directory and then run make.
Setting Up the Reference Genomes for NGS Tools
There are three key steps:
- Obtain the data
- Index or prepare it
- Modify the associated .loc file (this tells Galaxy how to find/use it)
Build Names
Build names need to exactly match existing build names in $GALAXY_PATH/tool-data/shared/ucsc/builds.txt. Each is unique.
Tools and Their Corresponding loc Files
The following table shows the name of the loc file associated with each tool.
Tool | XML File | loc File |
---|---|---|
Bowtie, Tophat | $GALAXY_PATH/tools/sr_mapping/bowtie_wrapper.xml | $GALAXY_PATH/tool-data/bowtie_indices.loc |
Bowtie, Tophat | $GALAXY_PATH/tools/sr_mapping/bowtie_wrapper_color.xml | $GALAXY_PATH/tool-data/bowtie_indices_color.loc |
Bowtie2, Tophat2 | $GALAXY_PATH/tools/sr_mapping/bowtie2_wrapper.xml | $GALAXY_PATH/tool-data/bowtie2_indices.loc |
BWA | $GALAXY_PATH/tools/sr_mapping/bwa_wrapper.xml | $GALAXY_PATH/tool-data/bwa_index.loc |
SAM Tools | $GALAXY_PATH/tools/samtools/sam_to_bam.xml | $GALAXY_PATH/tool-data/sam_fa_indices.loc |
SAM Tools | $GALAXY_PATH/tools/samtools/sam_merge.xml | $GALAXY_PATH/tool-data/sam_fa_indices.loc |
SAM Tools | $GALAXY_PATH/tools/samtools/sam_pileup.xml | $GALAXY_PATH/tool-data/sam_fa_indices.loc |
LASTZ | $GALAXY_PATH/tools/sr_mapping/lastz_wrapper.xml | $GALAXY_PATH/tool-data/lastz_seqs.loc |
LASTZ | $GALAXY_PATH/tools/sr_mapping/lastz_paired_reads_wrapper.xml | $GALAXY_PATH/tool-data/lastz_seqs.loc |
Megablast | $GALAXY_PATH/tools/metag_tools/megablast_wrapper.xml | $GALAXY_PATH/tool-data/blastdb.loc |
SRMA | $GALAXY_PATH/tools/sr_mapping/srma_wrapper.xml | $GALAXY_PATH/tool-data/srma_index.loc |
There is a sample file for each of these files, with .sample appended to the filename. The sample file explains the necessary format of the loc file.
Organizing Index Files
The best way to organize the various index files is to have a dedicated directory for each build. Each build directory contains a directory for each NGS tool, which in turn contains the actual index files.
A structure like this is recommended:
$BASE_PATH/
  hg18/
    bowtie_path/
      base/
        hg18.1.ebwt
        hg18.2.ebwt
        hg18.3.ebwt
        hg18.4.ebwt
        hg18.rev.1.ebwt
        hg18.rev.2.ebwt
      color/
        hg18.1.ebwt
        hg18.2.ebwt
        hg18.3.ebwt
        hg18.4.ebwt
        hg18.rev.1.ebwt
        hg18.rev.2.ebwt
    bwa_path/
      hg18.amb
      hg18.ann
      hg18.bwt
      hg18.pac
      hg18.rbwt
      hg18.rpac
      hg18.rsa
      hg18.sa
    sam_index/
      hg18.fasta
      hg18.fasta.fai
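The recommended layout can be created up front, before any indexes are built. The following is a minimal sketch, not part of the original instructions: the BUILD variable and the fallback to a temporary directory when BASE_PATH is unset are assumptions made for illustration.

```shell
#!/bin/sh
set -eu

# BASE_PATH and BUILD are placeholders (assumptions for this sketch).
# A temporary directory is used when BASE_PATH is not already set.
BASE_PATH="${BASE_PATH:-$(mktemp -d)}"
BUILD="hg18"

# One directory per tool under the build directory,
# with separate base and color directories for Bowtie.
mkdir -p \
    "$BASE_PATH/$BUILD/bowtie_path/base" \
    "$BASE_PATH/$BUILD/bowtie_path/color" \
    "$BASE_PATH/$BUILD/bwa_path" \
    "$BASE_PATH/$BUILD/sam_index"

# Show the directories that now exist.
find "$BASE_PATH/$BUILD" -type d | sort
```

Running the script a second time is harmless, since mkdir -p does not fail on existing directories.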
Bowtie and Tophat
Generating Indices
Instructions are for both Bowtie/Tophat and Bowtie2/Tophat2.
Bowtie and Tophat use the same index set, and Bowtie2 and Tophat2 use the same index set, but Bowtie/Tophat and Bowtie2/Tophat2 do not use the same index set.
Know which tools you are invoking and which indexes each tool accesses, based on the .loc file contents. If your tools give errors after a data update or at first usage, review the wrapper and tool versions against the indexes named in the error message to verify!
Have Bowtie and/or Bowtie2 installed and in your $PATH.
Also have Tophat and/or Tophat2 installed and in your $PATH so that you can test the tools on the command line for simple installation checks or for use later on in Galaxy.
Usage: bowtie-build [option] index_basename.fa index_basename
(where index_basename.fa is your input reference genome in fasta format)
The Galaxy team uses the [option] -f to create indexes, for example:
bowtie-build -f hg19.fa hg19
or
bowtie2-build -f hg19.fa hg19
The [option] -C would be used instead to create color indexes for Bowtie/Tophat. Please note that Bowtie2/Tophat2 do not support colorspace reads. Confused? Type bowtie-build or bowtie2-build at the command prompt to view the usage and see the manual:
Bowtie: http://bowtie-bio.sourceforge.net/manual.shtml
Bowtie2: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
Put Bowtie/Bowtie2 indexes and color/non-color indexes in distinct directories. That means up to *3 distinct directories* for a full set per reference genome version. But you may only need 1 or 2.
The index files that will be created for Bowtie are:
index_basename.1.ebwt
index_basename.2.ebwt
index_basename.3.ebwt
index_basename.4.ebwt
index_basename.rev.1.ebwt
index_basename.rev.2.ebwt
The index files that will be created for Bowtie2 are:
index_basename.1.bt2
index_basename.2.bt2
index_basename.3.bt2
index_basename.4.bt2
index_basename.rev.1.bt2
index_basename.rev.2.bt2
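Before pointing a .loc file at a new index, it can help to confirm that all six files listed above are present. The function below is a small sketch for that check; its name and arguments are hypothetical and not part of Bowtie or Galaxy.

```shell
#!/bin/sh
# check_bowtie_index DIR BASENAME EXT   (hypothetical helper)
# EXT is "ebwt" for Bowtie or "bt2" for Bowtie2.
# Prints each missing file and returns non-zero if any are missing.
check_bowtie_index() {
    dir=$1
    base=$2
    ext=$3
    missing=0
    # The six suffixes match the file lists shown above.
    for part in 1 2 3 4 rev.1 rev.2; do
        f="$dir/$base.$part.$ext"
        if [ ! -f "$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_bowtie_index /path/to/hg19/bowtie2 hg19 bt2 would report any of the six Bowtie2 files absent from that (placeholder) directory.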
Setting Up loc Files
- Know where the data is
- Know where the bowtie_indices.loc, bowtie_indices_color.loc, and bowtie2_indices.loc files come from and where they should be placed to be used
  - initially are named /galaxy-dist/tool-data/bowtie_indices.loc.sample, /galaxy-dist/tool-data/bowtie_indices_color.loc.sample, and /galaxy-dist/tool-data/bowtie2_indices.loc.sample
  - from the Tool Shed repository
  - to your Galaxy instance
- Follow instructions in sample file to add in rows for each database. One row per database.
- Remove the ".sample" from the file name if this is the first time you are using it
- Remove any rows for databases that you no longer want to host if you are altering an existing .loc
- You can make sure the file was created correctly by restarting the server and opening up the Bowtie/Bowtie2, Bowtie_color, or Tophat/Tophat2 tool, and checking the dropdown menu of genomes. These tools are found in the tool groups NGS: Mapping and NGS: RNA Analysis, unless you custom installed them elsewhere.
- Test the new database(s) by running a few sequences that you expect to have hits with default parameters.
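Adding a row to a .loc file can be scripted. The sketch below assumes a layout of four tab-separated columns (unique build ID, dbkey, display name, index base path); that layout is an assumption here, so confirm it against the instructions in your own .loc.sample file before use. The function name is hypothetical.

```shell
#!/bin/sh
# add_loc_row LOC_FILE BUILD_ID DBKEY DISPLAY_NAME BASE_PATH   (hypothetical helper)
# Appends one tab-separated row unless a row with the same BUILD_ID exists.
# The four-column layout is an assumption; check your .loc.sample file.
add_loc_row() {
    loc=$1
    id=$2
    dbkey=$3
    name=$4
    path=$5
    touch "$loc"
    # Skip the append if the first column already contains this build ID.
    if awk -F '\t' -v id="$id" '$1 == id { found = 1 } END { exit !found }' "$loc"; then
        echo "row for $id already present in $loc"
        return 0
    fi
    printf '%s\t%s\t%s\t%s\n' "$id" "$dbkey" "$name" "$path" >> "$loc"
}
```

A call such as add_loc_row tool-data/bowtie_indices.loc hg19 hg19 "Human (hg19)" /data/hg19/bowtie/hg19 (all paths being placeholders) adds a single row; the last argument is the index base name without the .1.ebwt suffix. The server still needs a restart afterwards.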
BWA
Generating Indices
Have BWA installed in your $PATH.
Usage: bwa index [options] <reference_in>
The Galaxy team uses the option -a bwtsw to create indexes.
The manual is here: BWA manual.
The following index files will be created for the FASTA file reference_in.fasta:
reference_in.fasta.amb
reference_in.fasta.ann
reference_in.fasta.bwt
reference_in.fasta.pac
reference_in.fasta.sa
Note that if you are using a BWA version earlier than 5.10, you will also see the following reverse index files:
reference_in.fasta.rbwt
reference_in.fasta.rpac
reference_in.fasta.rsa
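Indexing a large genome with BWA takes a long time, so a wrapper that skips the build when the five index files already exist can be convenient. This is a sketch under that assumption; the function name is hypothetical, and only the bwa index -a bwtsw invocation comes from the text above.

```shell
#!/bin/sh
# build_bwa_index REFERENCE_FASTA   (hypothetical helper)
# Runs "bwa index -a bwtsw" unless all five index files already exist.
build_bwa_index() {
    ref=$1
    complete=1
    # The five suffixes match the index files listed above.
    for ext in amb ann bwt pac sa; do
        [ -f "$ref.$ext" ] || complete=0
    done
    if [ "$complete" -eq 1 ]; then
        echo "index for $ref already present, skipping"
        return 0
    fi
    # Requires bwa to be installed and on your $PATH.
    bwa index -a bwtsw "$ref"
}
```

The check looks only for the forward index files, so it behaves the same whether or not an older BWA version also wrote the reverse index files.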
Setting Up loc File
- Know where the data is
- Know where the bwa_index.loc file comes from and where it should be placed to be used
  - initially are named /galaxy-dist/tool-data/bwa_index.loc.sample and /galaxy-dist/tool-data/bwa_index_color.loc.sample
  - from the Tool Shed repository
  - to your Galaxy instance
- Follow instructions in sample file to add in rows for each database. One row per database.
- Remove the ".sample" from the file name if this is the first time you are using it
- Remove any rows for databases that you no longer want to host if you are altering an existing .loc
- You can make sure the file was created correctly by restarting the server and opening up the BWA tool, and checking the dropdown menu of genomes.
- Test the new database(s) by running a few sequences that you expect to have hits with default parameters.
SAM Tools
Generating Indices
Have SAMTools installed in your $PATH.
Usage: samtools faidx <ref.fasta> [region1 [...]]
No special options are needed.
The following index file will be created for the FASTA file ref.fasta:
ref.fasta.fai
Place a relative symbolic link to the original FASTA file in the same location as the sam index (or the original file), making sure the original FASTA file can be read by the Galaxy user. The Galaxy team uses a symbolic link both to organize files (placing the FASTA in a distinct directory) and to reduce data duplication. This creates a structure like:
/ref/bowtie
/ref/bwa
/ref/sam/ref.fasta (relative symbolic link to ../seq/ref.fasta)
/ref/sam/ref.fasta.fai
/ref/seq/ref.fasta
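The symbolic link arrangement can be sketched as follows, using ref.fasta as the file name. REF_ROOT and the tiny stand-in FASTA are assumptions for illustration only; in practice the FASTA in seq/ is your real reference genome. The samtools step runs only if samtools is found on your $PATH.

```shell
#!/bin/sh
set -eu

# REF_ROOT is a placeholder; a temporary directory is used if it is unset.
REF_ROOT="${REF_ROOT:-$(mktemp -d)}"
mkdir -p "$REF_ROOT/seq" "$REF_ROOT/sam"

# Stand-in FASTA so the sketch runs end to end; use your real genome instead.
if [ ! -f "$REF_ROOT/seq/ref.fasta" ]; then
    printf '>chr1\nACGT\n' > "$REF_ROOT/seq/ref.fasta"
fi

# Relative link: resolved from inside sam/, so it survives moving REF_ROOT.
ln -sf ../seq/ref.fasta "$REF_ROOT/sam/ref.fasta"

# Build the .fai next to the link, if samtools is available.
if command -v samtools >/dev/null 2>&1; then
    samtools faidx "$REF_ROOT/sam/ref.fasta"
fi

ls -l "$REF_ROOT/sam"
```

Because the link target is relative, the whole reference tree can be copied or rsynced to another location without breaking the link.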
Setting Up loc Files
- Know where the data is
- Know where the sam_fa_indices.loc file comes from and where it should be placed to be used
  - initially is named /galaxy-dist/tool-data/sam_fa_indices.loc.sample
  - from the distribution
  - to your Galaxy instance
- Follow instructions in sample file to add in rows for each database. One row per database.
- Remove the ".sample" from the file name if this is the first time you are using it
- Remove any rows for databases that you no longer want to host if you are altering an existing .loc
- You can make sure the file was created correctly by restarting the server and opening a tool from the SAM Tools tool set. Input datasets should have a database assigned that corresponds to a database having a sam index.
- Test the new database(s) by running a few datasets through tools. Change dataset database assignments using the "Edit Attributes" form (pencil icon).
LASTZ and Extract Genomic DNA
Have LASTZ installed in your $PATH. Although it is not needed for creating indexes, you will need it for testing/using the tool.
The LASTZ and 'Extract Genomic DNA' tools both use a .2bit compressed file representing a reference genome. If the data is sourced from UCSC, this can often just be downloaded.
If from another source, use the FASTA file as input and have faToTwoBit installed in your $PATH (twoBitToFa performs the reverse conversion, from .2bit back to FASTA).
faToTwoBit is available from UCSC as a precompiled binary if needed, see the Downloads link on left side bar.
Usage: faToTwoBit ref.fasta ref.2bit
Type the tool name at the command prompt for more usage details.
The following file will be created for the FASTA file ref.fasta:
ref.2bit
The Galaxy team places the .2bit file in the same location as the original FASTA file to stay organized, such as:
/ref/seq/ref.2bit
/ref/seq/ref.fasta
Setting Up loc Files
- Know where the data is
- Know where the lastz_seqs.loc and alignseq.loc files come from and where they should be placed to be used
  - initially are named /galaxy-dist/tool-data/lastz_seqs.loc.sample and /galaxy-dist/tool-data/alignseq.loc.sample
  - lastz comes from the Tool Shed and alignseq.loc is one of the key configuration files from the distribution (used for many purposes)
  - to your Galaxy instance
- Follow instructions in sample file to add in rows for each database. One row per database.
- Remove the ".sample" from the file name if this is the first time you are using it
- Remove any rows for databases that you no longer want to host if you are altering an existing .loc
- Restart the server
- You can make sure the lastz_seqs.loc is correct by opening up the LASTZ tool, and checking the dropdown menu of genomes. Test the new database(s) by running a few sequences that you expect to have hits with default parameters.
- You can make sure the alignseq.loc is correct by loading a simple BED file of coordinates that you know will pull regions from the target genome as a dataset, assigning the database as the reference genome that you are testing, and running the tool. Change dataset database assignments using the "Edit Attributes" form (pencil icon).
Megablast
Have Megablast installed in your $PATH. Although it is not needed for creating indexes, you will need it for testing/using the tool.
Megablast in Galaxy was updated to use NCBI BLAST+ (BLASTN) in April 2012 (changeset 0b5cb60e4810).
Get the indexes: download directly from NCBI at ftp://ftp.ncbi.nlm.nih.gov/blast/db/.
Or create your own. Usage: formatdb -i <database>.fa -p F -n "<database>" -v 2000
Note that formatdb belongs to the legacy BLAST package; in BLAST+ the corresponding tool is makeblastdb (for example, makeblastdb -in <database>.fa -dbtype nucl -out <database>).
The Galaxy Main public instance uses htgs, wgs, and nt from NCBI.
Put the data files in an organized hierarchy such as:
/galaxy-dist/tool-data/blast/<div>/<date>/<date_div>.*
or
/galaxy-dist/tool-data/blast/<date_db>.*
Setting Up loc Files
- Know where the data is
- Know where the blastdb.loc file is located (/galaxy-dist/tool-data/blastdb.loc.sample is default)
- Follow instructions in sample file to add in rows for each database. One row per database.
- Remove the ".sample" from the file name if this is the first time you are using it
- Remove any rows for databases that you no longer want to host
- You can make sure the file was created correctly by restarting the server and opening up the Megablast page, where you should see the list of databases you added.
- Test the databases by running a few sequences from each database against that same database through the UI (self-hits) with simple filtering set to "no" (-F F). (Load a few .fa sequences as a dataset -> run tool).
SRMA
Generating Indices
SRMA needs three files in the same directory for each genome, named in a specific way. There are two "index" files required: one is the SRMA .dict file and the other is the Samtools .fai index file. To create the .dict file, run the Picard CreateSequenceDictionary command:
Usage: java -cp "../path/to/srma.jar" net.sf.picard.sam.CreateSequenceDictionary R=<ref.fa> O=<ref.dict>
Note that the .fa extension is replaced with the .dict extension. If you haven't already created the Samtools index, you will need to do that:
Usage: samtools faidx <ref.fa> [region1 [...]]
Note that the resulting file should have the extension .fa.fai. Finally, the FASTA file (<ref.fa>) also needs to be available in the same directory (a relative symbolic link is fine).
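Since SRMA depends on three files with matching names, a quick check before editing the loc file can save a failed job. The function below is a sketch with a hypothetical name; it derives the .dict and .fa.fai names from the FASTA path following the naming rules just described.

```shell
#!/bin/sh
# check_srma_files REF_FA   (hypothetical helper)
# Verifies that ref.fa, ref.fa.fai and ref.dict all exist side by side.
check_srma_files() {
    fa=$1
    # ref.fa -> ref.dict: the .fa extension is replaced, not appended to.
    dict="${fa%.fa}.dict"
    status=0
    for f in "$fa" "$fa.fai" "$dict"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f"
            status=1
        fi
    done
    return "$status"
}
```

The test uses -e rather than -f so that a relative symbolic link to the FASTA file passes as long as its target exists.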
Setting Up loc Files
The process for establishing the SRMA loc file is pretty much like the others.
- You need to make sure the files are accessible
- Modify the file srma_index.loc.sample in the Galaxy tool-data directory following instructions within the file itself
- Remove the .sample from the file name before using
- Restart the server and test