Galaxy 101
Introduction to Galaxy
In this lecture we will introduce you to the bare basics of Galaxy:
- Getting data from external databases such as UCSC
- Performing simple data manipulation
- Understanding Galaxy's History system
- Creating and running a workflow
What are we trying to do?
Suppose you get the following question:
Mom (or Dad) ... Which coding exon has the highest number of single nucleotide polymorphisms on chromosome 22?
You think to yourself "Wow! This is a simple question ... I know exactly where the data is (at UCSC) but how do I actually compute this?" The truth is, there is really no straightforward way of answering this question in a time frame comparable to the attention span of a 7-year-old. Well ... actually there is and it is called Galaxy. So let's try it...
0. Organizing your windows and setting up Galaxy account
0.0. Getting your display sorted out
Note: Some screenshots shown here may appear slightly different from the ones you will see on your screen. Galaxy is quickly evolving and as a result some discrepancies are possible (and likely).
To get the most out of this tutorial open two browser windows. One you already have (it is this page). To open the other, right click this link and choose "Open in a New Window" (or something similar depending on your operating system and browser):

Then organize your windows as something like this (depending on the size of your monitor you may or may not be able to organize things this way, but you get the idea):

0.1. Setting up Galaxy account
Go to the User link at the top of the Galaxy interface and choose Register (unless of course you already have an account):

Then enter your information and you're in!
1. Getting data from UCSC
1.0. Getting coding exons
The first thing we will do is obtain data from UCSC by clicking Get Data → UCSC Main:

You will see the UCSC Table Browser interface appear in your browser window:

Make sure that your settings are exactly the same as shown on the screen (in particular, position should be set to "chr22", output format should be set to "BED - browser extensible data", and "Galaxy" should be checked within the Send output to option). Click get output and you will see the next screen:

here make sure Create one BED record per: is set to "Coding Exons" and click Send Query to Galaxy button. After this you will see your first History Item in Galaxy's right pane. It will go through gray (preparing) and yellow (running) states to become green:

1.1. Getting SNPs
Now is the time to obtain SNP data (SNPs are single nucleotide polymorphisms). This is done almost exactly the same way. The first thing we will do is again click on Get Data → UCSC Main:

but now change group to "Variation":

so that the whole page looks like this:

click get output and you should see this:

where you need to make sure that Whole Gene is selected ("Whole Gene" here really means "Whole Feature") and click Send Query to Galaxy button. You will get your second item in the history:

Now we will rename the two history items to "Exons" and "SNPs" by clicking on the Pencil icon adjacent to each item. After changing the name scroll down and click Save. We will also rename the history to "my example" (or whatever you want) by clicking on Unnamed history so everything looks like this:

Video
Galaxy Intro: UCSC Data from Galaxy Project on Vimeo.
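Both datasets fetched in this section are in BED format, where each line describes one genomic interval. The sketch below shows one way to read such a line in Python. It assumes six whitespace-separated columns (chromosome, start, end, name, score, strand), as in the rows shown later in this tutorial. The names `BedRecord` and `parse_bed6` are made up for illustration and are not part of Galaxy or UCSC.

```python
from typing import NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int   # BED starts are 0-based
    end: int     # BED ends are exclusive
    name: str
    score: str
    strand: str


def parse_bed6(line: str) -> BedRecord:
    """Parse the first six whitespace-separated columns of a BED line."""
    chrom, start, end, name, score, strand = line.split()[:6]
    return BedRecord(chrom, int(start), int(end), name, score, strand)


# One exon line taken from the joined rows shown later in this tutorial.
exon = parse_bed6(
    "chr22\t15710867\t15711034\tuc062bej.1_cds_9_0_chr22_15710868_f\t0\t+"
)
print(exon.name, exon.end - exon.start)  # length in bases: 167
```

Because the start is 0-based and the end is exclusive, the length of an interval is simply end minus start, and a SNP such as `chr22 15710880 15710881` covers exactly one base.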
2. Finding Exons with the highest number of SNPs
2.0. Joining exons with SNPs
Let's remind ourselves that our objective was to find which exon contains the most SNPs. The first step in answering this question is to join exons with SNPs (a fancy word for printing exons and SNPs that overlap side by side). This is done using the Operate on Genomic Intervals → Join tool:

make sure your Exons are first and SNPs are second and click Execute. You will get the third history item:

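To make the idea of a join concrete, here is a minimal Python sketch of joining two interval lists on overlap. It is a naive illustration that compares every exon with every SNP. It is not how Galaxy's Join tool is implemented, and the function names `overlaps` and `join_intervals` are hypothetical. Intervals are assumed to be BED-style tuples with 0-based starts and exclusive ends.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def join_intervals(exons, snps):
    """Return one combined row per overlapping (exon, SNP) pair on the same chromosome."""
    joined = []
    for exon in exons:
        for snp in snps:
            same_chrom = exon[0] == snp[0]
            if same_chrom and overlaps(exon[1], exon[2], snp[1], snp[2]):
                joined.append(exon + snp)  # exon columns first, SNP columns second
    return joined


exons = [("chr22", 15710867, 15711034, "uc062bej.1_cds_9_0_chr22_15710868_f", 0, "+")]
snps = [
    ("chr22", 15710880, 15710881, "rs568292779", 0, "-"),  # inside the exon
    ("chr22", 15719668, 15719669, "rs200891952", 0, "-"),  # outside the exon
]
rows = join_intervals(exons, snps)
for row in rows:
    print(*row)
```

Only the first SNP falls inside the exon, so the sketch produces a single twelve-column row with the exon columns followed by the SNP columns.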
2.1. Counting the number of SNPs per exon
Let's look at the data obtained from the join operation above (you can do it by clicking the "eye" icon adjacent to the dataset). Below is a subsample of rows from the joined datasets (you may need to scroll sideways to see the entire length of the rows below):
```
chr22 15710867 15711034 uc062bej.1_cds_9_0_chr22_15710868_f 0 + chr22 15710880 15710881 rs568292779 0 -
chr22 15710867 15711034 uc062bej.1_cds_9_0_chr22_15710868_f 0 + chr22 15710947 15710948 rs544633418 0 -
chr22 15710867 15711034 uc062bej.1_cds_9_0_chr22_15710868_f 0 + chr22 15710906 15710907 rs548691624 0 -
chr22 15710867 15711034 uc062bej.1_cds_9_0_chr22_15710868_f 0 + chr22 15710920 15710921 rs530488686 0 -
chr22 15710867 15711034 uc062bej.1_cds_9_0_chr22_15710868_f 0 + chr22 15710932 15710933 rs563306354 0 -
chr22 15710867 15711034 uc062bej.1_cds_9_0_chr22_15710868_f 0 + chr22 15711019 15711020 rs559431407 0 -
chr22 15710867 15711034 uc062bej.1_cds_9_0_chr22_15710868_f 0 + chr22 15710949 15710950 rs532940301 0 -
....
chr22 15719659 15719777 uc062bej.1_cds_10_0_chr22_15719660_f 0 + chr22 15719668 15719669 rs200891952 0 -
chr22 15719659 15719777 uc062bej.1_cds_10_0_chr22_15719660_f 0 + chr22 15719751 15719752 rs556077431 0 -
```
Look at the first seven rows. They all correspond to the same exon (id = uc062bej.1_cds_9_0_chr22_15710868_f) that overlaps seven distinct SNPs (ids: rs568292779, rs544633418, rs548691624, rs530488686, rs563306354, rs559431407, rs532940301). In other words, this exon contains seven SNPs. Since our ultimate goal is to count the number of SNPs per exon, we can do this by counting how many times an exon's id appears in our dataset. This can be easily done with the Join, Subtract, and Group → Group tool:

choose column 4 by selecting "Column: 4" in Group by column. Then click on Insert Operation and make sure the interface looks exactly as shown below:

click Execute. Your history will look like this:

If you look at the above image you will see that the result of grouping (dataset #4) contains two columns. The first contains the exon name while the second shows the number of times this name has been repeated in dataset #3.
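The grouping step amounts to counting how often each value occurs in one column. Here is a minimal Python sketch of that idea using three of the joined rows from above. The helper name `count_by_column` is made up for illustration and is not Galaxy's Group tool.

```python
from collections import Counter


def count_by_column(rows, column):
    """Count how many rows share each value in the given 1-based column."""
    return Counter(row[column - 1] for row in rows)


joined_lines = [
    "chr22 15710867 15711034 uc062bej.1_cds_9_0_chr22_15710868_f 0 + chr22 15710880 15710881 rs568292779 0 -",
    "chr22 15710867 15711034 uc062bej.1_cds_9_0_chr22_15710868_f 0 + chr22 15710947 15710948 rs544633418 0 -",
    "chr22 15719659 15719777 uc062bej.1_cds_10_0_chr22_15719660_f 0 + chr22 15719668 15719669 rs200891952 0 -",
]
rows = [line.split() for line in joined_lines]

# Column 4 holds the exon name, so each count is the number of SNPs in that exon.
counts = count_by_column(rows, column=4)
for exon_name, n_snps in counts.items():
    print(exon_name, n_snps)
```

With only these three rows the first exon is counted twice and the second once. On the full joined dataset the same counting would give seven for the first exon.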
2.2. Sorting exons by SNP count
To see which exon has the highest number of SNPs we can simply sort dataset #4 on the second column in descending order. This is done with Filter and Sort → Sort:

This will generate the fifth history item:

and you can now see that the highest number of SNPs per exon is 63.
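As a rough Python equivalent of this sorting step, the sketch below orders two-column rows by their count, largest first. The exon names and the counts are placeholders rather than real results, and `sort_by_count` is a hypothetical helper. The counts are kept as text, as they would be in a tabular file, to show why the comparison has to be numeric.

```python
def sort_by_count(grouped_rows):
    """Sort (exon_name, count) rows on the second column, largest count first."""
    # int() forces a numeric comparison; as plain text "9" would sort above "63".
    return sorted(grouped_rows, key=lambda row: int(row[1]), reverse=True)


# Placeholder rows, not taken from the real dataset.
grouped = [("exon_a", "9"), ("exon_b", "63"), ("exon_c", "26")]
for name, count in sort_by_count(grouped):
    print(name, count)
```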
2.3. Selecting top five
Now let's select the top five exons with the highest number of SNPs. For this we will use the Text Manipulation → Select First tool:

Clicking Execute will produce the sixth history item that will contain just five lines:

2.4. Recovering exon info and displaying data in genome browsers
Now we know that in this dataset the five top exons contain between 26 and 63 SNPs. But what else can we learn about these? To know more we need to get back the positional information (coordinates) of these exons. This information was lost at the grouping step and now all we have is just two columns. To get coordinates back we will match the names of exons in dataset #6 (column 1) against names of the exons in the original dataset #1 (column 4). This can be done with Join, Subtract and Group → Compare two Queries tool (note the settings of the tool in the middle pane):

this adds the seventh dataset to the history:

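The matching step can be pictured as a lookup: keep every row of the first dataset whose name appears in the second. Below is a minimal Python sketch of that idea. It is an illustration under that assumption and not the Compare two Queries tool itself, and `matching_rows` is a made-up name. The `selected` list stands in for dataset #6 and holds a single illustrative entry.

```python
def matching_rows(query1, column1, query2, column2):
    """Keep rows of query1 whose column1 value occurs in column2 of query2 (1-based columns)."""
    wanted = {row[column2 - 1] for row in query2}
    return [row for row in query1 if row[column1 - 1] in wanted]


exons = [
    ("chr22", 15710867, 15711034, "uc062bej.1_cds_9_0_chr22_15710868_f", 0, "+"),
    ("chr22", 15719659, 15719777, "uc062bej.1_cds_10_0_chr22_15719660_f", 0, "+"),
]
# Illustrative stand-in for the two-column result (exon name, SNP count).
selected = [("uc062bej.1_cds_9_0_chr22_15710868_f", 7)]

recovered = matching_rows(exons, 4, selected, 1)
for row in recovered:
    print(*row)
```

In this sketch the output keeps the row order of the first dataset, so the recovered exons come back with their coordinates but not sorted by SNP count.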
The best way to learn about these exons is to look at their genomic surroundings. There is really no better way to do this than using genome browsers. Because this analysis was performed on a "standard" human genome (hg38 in this case), you have two choices - UCSC Genome Browser and IGV:

For example, clicking on display at UCSC main will show something like this:

Video
GXYcast1 from Galaxy Project on Vimeo.
3. Understanding histories
In Galaxy your analysis steps are represented as a list called History:

Histories can be very large, you can have as many histories as you want, and all history behavior is controlled by the three buttons at the top of the History pane:

The refresh button simply refreshes the history. The history options button gives you access to a myriad of history-specific options:

Many of the options here are self-explanatory. If you create a new history, your current history does not disappear. If you would like to list all of your histories just choose Saved Histories and you will see a list of all your histories in the center pane:

Yet there is a better way to look for all your histories. This is what the third button at the top of the History pane is for. It allows you to browse and move datasets across histories:

Here, the current history is on the left (Galaxy 101 (2015)) and your (or mine in this case) other histories are displayed to the right of the current history. These can be ordered in a variety of ways by clicking the ... button:

You can also scroll sideways using trackpad gestures, move datasets across histories by simply clicking and dragging, and search for histories and individual datasets. This interface also allows you to switch to any existing history (i.e., making it current). Click Done once you're done.
A comprehensive overview of Galaxy's history functions is found here.
4. Creating and editing a workflow
4.0. Extracting a workflow
Let's take a look at the history again:

You can see that this history contains all steps of our analysis. So by building this history we have actually created a complete record of our analysis, with Galaxy preserving all parameter settings applied at every step. Wouldn't it be nice to just convert this history into a workflow that we'll be able to execute again and again? This can be done by clicking on the history options button at the top of the History pane and selecting the Extract Workflow option:

The center pane will change as shown below and you will be able to choose which steps to include/exclude and how to name the newly created workflow. In this case I named it galaxy101-2015:

once you click Create Workflow you will get the following message: "Workflow 'galaxy101-2015' created from current history. You can edit or run the workflow".
4.1. Opening workflow editor
Let's click edit (if you click something else and the message in the center pane disappears, you can always access all your workflows, including the one you just created, using the Workflow link at the top of the Galaxy interface). This will open Galaxy's workflow editor (to get this view I clicked the arrow at the lower left corner of the screen, which collapsed the tool pane of the Galaxy interface). It will allow you to examine and change the settings of this workflow as shown below. Note that the box corresponding to the Select First tool is selected (highlighted with the bluish border) and you can see the parameters of this tool in the right pane. This is how you can view and change the parameters of all tools involved in the workflow:

4.2. Hiding intermediate steps
Among the many things you can do with workflows I will mention just one. When a workflow is executed one is usually interested in the final product and not in the intermediate steps. These steps can be hidden by mousing over a small asterisk in the lower right corner of every tool:

Yet there is a catch. In a newly created workflow all steps are hidden by default and the default behavior of Galaxy is that if all steps of a given workflow are hidden, then nothing gets hidden in the history. This may be counterintuitive, but this is done to decrease the amount of clicking if you do want to hide some steps. So in our case, if we want to hide all intermediate steps with the exception of the last one, we will click that asterisk in the last step of the workflow:

Once you do this the representation of the workflow in the bottom right corner of the editor will change, with the last step becoming orange. This means that this is the only step that will generate a dataset visible in the history:

4.3. Renaming inputs
Right now both inputs to the workflow look exactly the same. This is a problem as it will be very confusing which input should be Exons and which should be SNPs:

On the image above you will see that the top input dataset (the one with the blue border) connects to the Join tool first, so it must correspond to the exon data. If you click on this box (in the image above it is already clicked on because it is outlined with the blue border) you will be able to rename the dataset in the right pane:

Then click on the second input dataset and rename it "Features" (this would make this workflow a bit more generic, which will be useful later in this tutorial):

4.4. Renaming outputs
Finally let's rename the workflow's output. For this:
- click on the last dataset (Compare two Queries)
- scroll down the rightmost pane and click on

- Type Top Exons in the Rename dataset text box:

4.5. Setting parameters "at runtime"
What we are trying to do here is to design a generic workflow. This means that from time to time you will need to change parameters within this workflow. For instance, in this tutorial we were selecting 5 exons containing the highest number of SNPs. But what if you need to select 10? Thus it makes sense to leave these types of parameters adjustable. To do this:
First, select a tool in which you want to set parameters at runtime (Select first in this case):

Next, select the parameter you would like to set at runtime. To do this just hover over the icon so it looks like this:

and click! Your parameter will now be set at runtime.
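One way to think about a workflow with a runtime parameter is as a function whose fixed steps are the body and whose adjustable setting is an argument. The Python sketch below chains the steps of this tutorial (join, count, sort, select first, recover coordinates) and exposes the number of exons to keep as `n`. It is an analogy under simplifying assumptions and not how Galaxy stores or runs workflows. The function name `top_exons`, the four-column tuples, and the toy data are all hypothetical.

```python
from collections import Counter


def top_exons(exons, features, n=5):
    """Return the exons overlapping the most features.

    exons and features are lists of (chrom, start, end, name) tuples with
    0-based starts and exclusive ends. n plays the role of the runtime parameter.
    """
    # Join and group: count overlapping features per exon name.
    counts = Counter()
    for e_chrom, e_start, e_end, e_name in exons:
        for f_chrom, f_start, f_end, _f_name in features:
            if e_chrom == f_chrom and e_start < f_end and f_start < e_end:
                counts[e_name] += 1
    # Sort and select first: names of the n exons with the highest counts.
    keep = {name for name, _count in counts.most_common(n)}
    # Compare: recover the full exon rows for the selected names.
    return [exon for exon in exons if exon[3] in keep]


# Toy data, made up for illustration.
exons = [("chr1", 0, 100, "exon_a"), ("chr1", 200, 300, "exon_b"), ("chr2", 0, 50, "exon_c")]
features = [
    ("chr1", 10, 20, "f1"),
    ("chr1", 90, 110, "f2"),
    ("chr1", 250, 260, "f3"),
    ("chr2", 60, 70, "f4"),
]
print(top_exons(exons, features, n=1))
```

Calling the function with a different `n` corresponds to typing a different number into the Select first box when the workflow is launched, and passing a different `features` list corresponds to choosing a different second input.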
4.6. Save! It is important...
Now let's save the changes we've made by clicking the workflow options button in the editor and selecting Save:

5. Run workflow on whole genome data
Now that we have a workflow, let's do something grand like, for example, finding exons with the highest number of repetitive elements across the entire human genome.
5.0. Create a new history
First go back into the analysis view by clicking Analyze Data at the top of the Galaxy interface. Now let's create a new history by clicking the history options button and selecting Create New:

5.1. Get Exons
Now let's get coding exons for the entire genome by going to Get Data → UCSC Main and setting up parameters as shown below. Note that this time region radio button is set to genome:

Click get output and you will get the next page (if it looks different from the image below, go back and make sure output format is set to BED - browser extensible data):

Choose Coding exons and click Send query to Galaxy.
5.2. Get Repeats
Go again to Get Data → UCSC Main and make sure the following settings are selected (in particular group = Repeats and track = RepeatMasker):

Click get output and you will get the next page (if it looks different from the image below, go back and make sure output format is set to BED - browser extensible data):

Select Whole gene and click Send Query to Galaxy.
5.3. Start the Workflow
At this point you will have two items in your history - one with exons and one with repeats. These datasets are large (especially repeats) and it will take some time for them to become green. Luckily you do not have to wait, as Galaxy will automatically start jobs once uploads have ended. So nothing stops us from starting the workflow we have created. First, click on the Workflow link at the top of the Galaxy interface, mouse over galaxy101-2015, click, and select Run. The center pane will change to allow you to launch the workflow. Select the appropriate datasets for the Exons and Features inputs (the repeats dataset goes into Features) as shown below. Now scroll to Step 6 and you will see that we can set the Select first parameter at Runtime (meaning Now!). So let's put 20 in there (or anything else you want) and scroll further down to click the button that launches the workflow to see this:

Once the workflow has started you will initially be able to see all its steps. Note that you are joining all exons with all repeats, so naturally this will take some time:

5.4. Get coffee
As we mentioned above this will take some time, so go get coffee. At last you will see this:

6. We did not fake this:
The two histories and the workflow described in this page are accessible directly from this page below:
- History Galaxy 101 (2015)
- History Exons vs. Repeats
- Workflow Galaxy 101-2015
From there you can import histories and workflows to make them your own. For example, to import Galaxy 101 (2015) history simply click this link and select Import history link:

Next
We will start using Galaxy to process and map NGS data.