Uploading data into Galaxy

Data can be uploaded into Galaxy in a number of ways, summarized in the following chart:

Figure 1. Which upload approach to use depends on the size and number of datasets, whether they are web accessible, and whether they have been deposited to the Sequence Read Archive (SRA, formerly the Short Read Archive). Many = more than 10. Big = over 100 MB.

The general "best practice approach" is this:

  • if you have just a few small (< 100 MB) datasets stored on your computer → use the Local file upload
  • if you have large files (> 100 MB) → use FTP upload
  • if you have many files (> 10) → use FTP upload
  • if datasets you are uploading have been deposited to SRA → use SRA upload
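The rules above can be sketched as a small decision function. This is only an illustration: the function name, its parameters, and the order in which the rules are checked (SRA first, then URL, then FTP) are assumptions, since the list does not say which rule wins when several apply.

```python
# Thresholds taken from the text: "big" = over 100 MB, "many" = more than 10.
BIG_BYTES = 100 * 1000 * 1000
MANY_FILES = 10


def choose_upload_method(sizes_bytes, in_sra=False, web_accessible=False):
    """Suggest an upload approach for datasets with the given sizes (in bytes).

    The precedence used here (SRA, then URL, then FTP) is an assumption
    made for this sketch, not a rule stated by Galaxy.
    """
    if in_sra:
        return "SRA upload"
    if web_accessible:
        return "URL upload"
    too_many = len(sizes_bytes) > MANY_FILES
    too_big = any(size > BIG_BYTES for size in sizes_bytes)
    if too_many or too_big:
        return "FTP upload"
    return "Local file upload"
```

For instance, three 5 MB files stored on your computer fall under Local file upload, while a single 200 MB file or a batch of eleven small files is routed to FTP upload.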

Local file upload

Uploading local files directly into Galaxy works well for a small number (say, a dozen) of relatively small (tens of MB) datasets. The following screencast shows how this works:

FTP upload

FTP stands for File Transfer Protocol. This is the best way to upload large files (or a large number of files) into Galaxy:
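If you prefer a script to a graphical FTP client, the transfer can also be sketched with Python's standard ftplib module. The host name, login, and file names below are placeholders, and the use of FTP over TLS is an assumption; check your Galaxy server's documentation for the actual address and whether it expects plain FTP instead.

```python
import os
from ftplib import FTP_TLS


def connect(host, user, password):
    """Open an FTP-over-TLS session (assumes the server supports it)."""
    ftp = FTP_TLS(host)
    ftp.login(user, password)
    ftp.prot_p()  # encrypt the data channel as well as the login
    return ftp


def upload_files(ftp, paths):
    """Send each local file to the current remote directory.

    `ftp` is any open session object that provides ftplib's
    storbinary() method. Returns the remote names that were written.
    """
    uploaded = []
    for path in paths:
        name = os.path.basename(path)
        with open(path, "rb") as handle:
            ftp.storbinary(f"STOR {name}", handle)
        uploaded.append(name)
    return uploaded


# Example usage (placeholder values, not a real server):
#   ftp = connect("ftp.galaxy.example.org", "you@example.org", "your-password")
#   upload_files(ftp, ["reads_1.fastq.gz", "reads_2.fastq.gz"])
#   ftp.quit()
```

Files transferred this way still have to be selected in Galaxy's upload tool afterwards to be imported into a history.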

URL upload

In some cases you may need to upload public datasets from the Internet. These datasets will have web addresses (also called Uniform Resource Locators or URLs). A web address of a dataset can be pasted directly into the upload tool interface:

SRA upload

Finally, if the data you are uploading have been deposited to the Sequence Read Archive (SRA) at NCBI, use the following approach:

To GZIP or not to GZIP

Compression can be a highly efficient way to store many types of biological data. For example, fastq datasets (a typical representation for sequencing reads) can be compressed to approximately 33% of their original size with the gzip utility. So if the data you are uploading are already compressed (e.g., have a .gz or .bz2 file extension), keep them that way! Upon upload, simply tell Galaxy (as shown in the FTP upload video) that the files are compressed (i.e., of fastqsanger.gz type, if this is appropriate).
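File extensions are not always reliable, so before uploading it can help to confirm that a file really is compressed. The sketch below checks the leading "magic" bytes that gzip and bzip2 files start with; the function name is hypothetical and it is not part of Galaxy.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream
BZIP2_MAGIC = b"BZh"      # first three bytes of every bzip2 stream


def detect_compression(path):
    """Return "gzip", "bzip2", or None based on the file's leading bytes."""
    with open(path, "rb") as handle:
        head = handle.read(3)
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(BZIP2_MAGIC):
        return "bzip2"
    return None
```

A file that reports "gzip" here and contains fastq reads is a candidate for the fastqsanger.gz type mentioned above; a result of None means the file is stored uncompressed (or uses a format this check does not cover).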