NCBI SRA Fastq
This section guides you through downloading experimental metadata, organizing the metadata into short lists corresponding to conditions and replicates, and finally importing the data from NCBI SRA as collections that reflect the experimental design.
It is critical to understand the condition/replicate structure of an experiment before working with the data so that it can be imported as collections ready for analysis. Direct your browser to https://www.ncbi.nlm.nih.gov/Traces/study/ and enter the GEO dataset identifier in the search box (for example: GSE72018). Once the study appears, click the box to download the “RunInfo Table”.
The “RunInfo Table” provides the experimental condition and replicate structure of all of the samples. Prior to importing the data, we need to parse this file into individual files, one per condition, each containing the sample IDs of that condition's replicates. This can be achieved by using a combination of the ‘group’, ‘compare two datasets’, ‘filter’, and ‘cut’ tools to end up with single-column lists of sample IDs (SRRxxxxx) corresponding to each condition.
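For readers who prefer to prepare the lists outside Galaxy, the grouping step described above can be sketched in Python. This is only an illustration, not the Galaxy tool chain itself: the column names "Run" and "condition", the comma delimiter, and the miniature table are assumptions, so check the header of your own RunInfo Table and adjust them.

```python
import csv
import io
from collections import defaultdict


def runs_by_condition(table_text, run_col="Run", condition_col="condition",
                      delimiter=","):
    """Group run accessions by the value found in the condition column.

    run_col and condition_col are assumed column names; the condition
    column in particular differs from study to study.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter=delimiter)
    groups = defaultdict(list)
    for row in reader:
        groups[row[condition_col]].append(row[run_col])
    return dict(groups)


# Hypothetical miniature table, not taken from a real study.
example_table = """Run,condition
SRR0000001,control
SRR0000002,control
SRR0000003,treated
"""

groups = runs_by_condition(example_table)
for condition, runs in groups.items():
    # Each list would be saved as one single-column file of SRR IDs.
    print(condition, runs)
```

Each resulting list corresponds to one single-column file of sample IDs, one file per condition.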
We can now provide the files with SRR IDs to NCBI SRA Tools (fastq-dump) to import the data from SRA to Galaxy. By organizing the replicates of each condition in separate lists, the data will be imported as “collections” that can be directly loaded to a workflow or analysis pipeline.
In these FASTQ data:
- The quality score identifier (+) is sometimes not a match for the sequence identifier (@).
- The forward and reverse reads may be interlaced and need to be separated into distinct datasets.
- Both problems may be present in the same dataset. Correct the first (mismatched identifiers), then the second (interlaced reads), as explained below.
- Format problems of any kind can cause tool failures and/or unexpected results.
- Fix the problems before running any other tools (including FastQC, Fastq Groomer, or other QA tools).
Notice that the quality score name ("+" line) in the record below is NOT in one of these accepted formats:
- The quality score name has exactly the same content as the sequence name ("@" line)
- The quality score name is a single plus sign ("+")
@MG00HS05:491:C7450ACXX:4:1101:1240:2223_forward/1
TTTGTGACTAATTGTATAACAGGTTATTTTAGTTTCTGTTCTGTGGAAAGTGCAAAGCATTCCAATAAGGGTTTTTATGTTTGCAAAGGAGAGTTTGTCTA
+SRR5330501.1 MG00HS05:491:C7450ACXX:4:1101:1240:2223 length=101
:??>D>,,,CDD8,C<EEI@@C;3<AF9:C+22+2A4E:11@C11?D9*?BDB*009)9*08?D9BEIDEI.@CA=6=A@A?D===@?;.;?>AAA;>>A#
Correct the format by running the tool Replace Text in entire line with these options:
Note: If the quality score line is named like "+ERR" instead (or other valid options), modify the pattern search to match.
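As a rough illustration of what the correction achieves, the sketch below reduces every quality score name to a single "+". It is not the Replace Text tool and does not reproduce its options. It assumes strict four-line FASTQ records (no wrapped sequence lines), and the short record used here is made up for the example.

```python
def normalize_plus_lines(lines):
    """Return the FASTQ lines with each quality score name reduced to '+'.

    Assumes every record is exactly four lines: name, sequence,
    quality score name, quality scores.
    """
    fixed = []
    for index, line in enumerate(lines):
        if index % 4 == 2:  # third line of each record
            if not line.startswith("+"):
                raise ValueError(f"line {index + 1} is not a '+' line: {line!r}")
            fixed.append("+")
        else:
            fixed.append(line)
    return fixed


# Hypothetical shortened record with a mismatched "+" line.
record = [
    "@read1_forward/1",
    "TTTGTGACTA",
    "+SRR0000001.1 read1 length=10",
    "+??>D>,,,C",
]

fixed_record = normalize_plus_lines(record)
print("\n".join(fixed_record))
```

The sketch selects the line by its position in the record rather than by its content because a quality score string may itself begin with "+", as the last line of this example does.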
The forward and reverse reads were extracted into the same dataset as interlaced sequences. These need to be filtered into two distinct fastq datasets for use with tools as paired-end datasets.
@1/1
NTCAAAATCTGCCGTTGATTTTTCATCCAATGGACAGCCACAAGCAATCCCAGTCCTCCTCTTACTCTCCTCAGCCTCCCACTTCTTCTG
+
#:9::><<>>DDDDDDDDDDDDDDDDDDDDDDDDDDDDD=DDDD@DDDDDDDDDDD@DDDDDDDDDDDDD@@;@@????7:D=7D@@@@@
@1/2
GTGGTGGAAGGGTGCCAGCGGAGTGGCGGCGATGGCGACGGAGGCGGGGGGGAGGGGGACAGGAGAAGGGGGGGGGCGGGGGGGGGGAGA
+
IFGGCHHEGHHFEHGHHHFHG?E:FECDD#############################################################
@2/1
NGCTTGGTTAAGGGAACCCACCGGAGCCGTAGCGAAAGCGAGTCTTCATAGGGCAATTGTCACTGCTTATGGACCCGAACCTGGGTGATC
+
#::9:;<<<;DDDDDDDDDDD::DDDDDDDDDDDDDDDDDDDDDDDDDD;DDDB@DDD@D=7???7<><<=?<=?DD;D@@9@@@7:71:
@2/2
GGGGAGAACCAGCTAGCTCTGGGTTCGAGTGGCATTTCACCCCTAACCACAACTCATCCGCTGATTCTTCAACATCAGTCGGTTCGGACC
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHFHEHHHHHHHFHHHHHHHHHHHHHHHHHHHHHHFHHFHHHHHHHHHHHHFHEHHHFHHHHFBH
Use the tool FASTQ de-interlacer on paired end reads
Create distinct datasets from an interlaced fastq dataset by running the tool Manipulate FASTQ reads on various attributes on the original dataset. It will be run twice.
Note: The solution is NOT to use the FASTQ Splitter tool. The data to be manipulated are interlaced sequences, which differ in format from reads that have been joined into a single sequence.
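To make the goal of de-interlacing concrete, here is a minimal Python sketch that sorts the records of an interlaced dataset into forward and reverse lists. It is not the Galaxy tool and makes no claim about how that tool works internally. It assumes strict four-line records and recognizes only two identifier styles: a name ending in "/1" or "/2", or a second space-separated field starting with "1:" or "2:". The helper names and the short records are made up for the example.

```python
def read_number(header):
    """Return 1 or 2 depending on which mate the '@' line describes."""
    fields = header.split()
    if fields[0].endswith("/1"):
        return 1
    if fields[0].endswith("/2"):
        return 2
    # Second style: "@name 1:N:0:1" or "@name 2:N:0:1"
    if len(fields) > 1 and fields[1][:2] in ("1:", "2:"):
        return int(fields[1][0])
    raise ValueError(f"cannot tell forward from reverse: {header!r}")


def deinterlace(lines):
    """Split interlaced four-line FASTQ records into (forward, reverse)."""
    forward, reverse = [], []
    for start in range(0, len(lines), 4):
        record = lines[start:start + 4]
        target = forward if read_number(record[0]) == 1 else reverse
        target.extend(record)
    return forward, reverse


# Hypothetical shortened interlaced dataset.
interlaced = [
    "@1/1", "NTCAAAAT", "+", "#:9::><<",
    "@1/2", "GTGGTGGA", "+", "IFGGCHHE",
    "@2/1", "NGCTTGGT", "+", "#::9:;<<",
    "@2/2", "GGGGAGAA", "+", "HHHHHHHH",
]

forward, reverse = deinterlace(interlaced)
print(forward[0::4])  # names of the forward reads
print(reverse[0::4])  # names of the reverse reads
```

Because records are appended in input order, the forward and reverse outputs keep the mates in the same relative order, which is what paired-end tools generally expect.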
Match Reads by
Identifier Match Type
Manipulate Reads by
Miscellaneous Manipulation Type
Exact same settings as above except for this change:
Use the same operations as in Solution 2 above, except change the first Manipulate FASTQ query term to be:
If your data has differently formatted sequence identifiers, the "Match by" expression from Solution 2 above can be modified to suit your identifiers.
Alternative identifiers such as:
@M00946:180:000000000-ANFB2:1:1107:14919:14410 1:N:0:1
@M00946:180:000000000-ANFB2:1:1107:14919:14410 2:N:0:1
Can be split with: