Data-rich science depends on complicated computing for analyses. The computing for a non-trivial analysis, from raw data to results, might need hundreds of different open source command line packages. Correct assumptions, models and methods are essential to getting any analysis right. However, once that design is complete, and all the required software packages are downloaded and installed, running the complete analysis will typically rely on a specially written shell script or other code. Automation is essential, because manual processes involving hundreds of steps, many of them long running, are not reliably replicable.
Dedicated code for each different analysis must control and configure all the independent, interacting analysis packages, and the data flows between them. It should start with raw input data and automatically run the required analysis, from start to finish, every time it is executed. Creating and fixing bugs is a familiar activity for scientists who write their own code. Most will be painfully aware that there are many opportunities for subtle, hard-to-find errors. These multiply as the computing becomes more complicated, with more software packages and data flows to get wrong.
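The shape of such a dedicated driver script can be sketched in a few lines. This is a minimal, hypothetical illustration: the step names and functions are invented for this sketch, and a real analysis would invoke external command line packages rather than Python functions.

```python
# Minimal sketch of a dedicated analysis driver (hypothetical steps).
# Each step is a function from input data to output data, and the driver
# runs the whole chain from raw input to final result every time.

def trim(reads):
    """Hypothetical quality-trim step: drop short 'reads'."""
    return [r for r in reads if len(r) >= 4]

def count(reads):
    """Hypothetical counting step: tally occurrences of each read."""
    counts = {}
    for r in reads:
        counts[r] = counts.get(r, 0) + 1
    return counts

def run_analysis(raw_reads):
    # The full analysis, start to finish, with no manual steps between.
    return count(trim(raw_reads))

result = run_analysis(["ACGT", "AC", "ACGT", "TTTT"])
print(result)  # {'ACGT': 2, 'TTTT': 1}
```

Because the driver always starts from the raw input, every execution repeats the entire analysis, which is exactly the property manual processes lack.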
Individual errors generally make things wrong in different ways, rather than cancelling each other out, so the automation code must all be correct for the final result to be valid. The open source approach to ensuring that any piece of code is correct is to make it easy for other researchers to scrutinise and run it for themselves, by sharing it in a way that makes it easy to replicate. Independent replication of reported experimental findings provides evidence of trustworthiness and serves as a cornerstone of progress in science. To keep this discussion focussed, only the computational aspects of replication are considered further here.
The Galaxy web server is designed to offload processor and memory intensive execution of tool software packages to specialised computational nodes, such as on a cluster or cloud. This design feature enables a Galaxy server to scale to very heavy loads, and to make efficient use of large hardware allocations. Each time a user clicks Execute on a tool form after configuring it, the web server prepares a complete, self-contained computational environment, to run as a job on a remote node. All the metadata that defines that job, including all tool parameters and input data, and all executable software package and dependency versions, is recorded in the database for future use in replication. Some of this metadata can be viewed as text using the i (View Details) icon available on the expanded job output in the user's history.
This automated, detailed record of every individual tool execution is a technical achievement that helps enable open scientific computing. It provides a shareable, virtual laboratory notebook, where entries are always complete. Every job also adds a circular-arrow (Run this job again) button next to the View Details button in the job's output in the user's history. Selecting Run this job again really does allow you to completely rerun the original job, by instantly recreating the original tool form. It can be thought of as a specialised “wayback machine” for research, because those form settings are identical to the form settings when the job was originally executed. That is because the metadata in the virtual laboratory notebook is always correct and complete.
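The essence of this record-and-replay design can be sketched as follows. The record shape and the stand-in tool function are invented for illustration; Galaxy's real job record is far richer, covering datasets, dependency versions and execution environment.

```python
# Sketch of the idea behind a job metadata record and "Run this job again".
# Names and record shape are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass(frozen=True)
class JobRecord:
    tool_id: str
    tool_version: str   # pinned: reruns use this, not the latest install
    params: tuple       # every parameter the user set on the form
    input_data: tuple   # the exact input the job consumed

def run_tool(record: JobRecord):
    """Stand-in for dispatching a job; deterministic given the record."""
    # A real tool would execute on a compute node; here we just derive
    # an output from the pinned inputs and parameters.
    return sorted(record.input_data)[: dict(record.params)["top_n"]]

# Original execution: the record is stored when the job is submitted.
original = JobRecord("sort_tool", "1.2.0",
                     params=(("top_n", 2),),
                     input_data=("c", "a", "b"))
first_result = run_tool(original)

# "Run this job again": replaying the stored record gives the same result.
rerun_result = run_tool(original)
assert rerun_result == first_result  # strict replication
print(first_result)  # ['a', 'b']
```

The key property is that the record is immutable and complete, so replaying it is guaranteed to reconstruct the original computation.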
If the form is executed without change, the default is strict replication, with all the original data and settings. Tool software package and dependency versions are re-used by default, even if newer versions have been installed since. Recreating the original form saves error-prone retyping, particularly when complicated tools require many user-controlled parameter settings. So, in addition to strict replication, small, controlled changes to the cloned form allow other useful open science research activities, including extending and repurposing the original work. For example, different input data can be analysed in exactly the same way, or the effects of changes to parameter values can be explored. Finally, if updates to software packages or dependencies have been installed since the original job, it can be re-run with those updated versions, to see whether the results are sensitive to the subsequent software bug fixes.
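The distinction between strict replication and a controlled change can be sketched like this. The form contents and the stand-in tool are hypothetical, invented only to illustrate the contrast.

```python
# Sketch: a cloned tool form is just the original settings; strict
# replication reuses it unchanged, while a controlled change swaps
# exactly one value. All names here are invented for illustration.
original_form = {"tool_version": "1.2.0", "min_length": 4,
                 "input": ["ACGT", "AC", "ACGT"]}

def execute(form):
    """Stand-in for a tool run: filter inputs by the form's threshold."""
    return [s for s in form["input"] if len(s) >= form["min_length"]]

strict = execute(dict(original_form))                    # identical settings
explored = execute({**original_form, "min_length": 2})   # one changed parameter

print(strict)    # ['ACGT', 'ACGT']
print(explored)  # ['ACGT', 'AC', 'ACGT']
```

Because everything else on the cloned form stays fixed, any difference between the two outputs can be attributed to the single changed parameter.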
At a higher level, Galaxy supports automated workflows that sequentially execute many different software packages whenever they are run. Like a specialised analysis script, a workflow can automate a complete, complicated analysis, enabling it to be reliably run from start to end. At execution, each step is a job, so metadata is recorded for replication, but a single workflow can represent and reproduce an entire analysis computation. When independent users share and execute the same workflow with the same input data, they each replicate the entire original analysis, and they all see identical outputs in the resulting history. The unit of computational replication in Galaxy is the individual tool execution as a job, but workflows are inherently replicable at the level of an entire analysis.
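A workflow's structure, as described above, can be sketched as an ordered list of steps, where each execution logs per-step metadata. The step functions and log shape are invented for this sketch; real Galaxy steps are tool jobs with full metadata records.

```python
# Sketch of a workflow as an ordered list of named steps, where each
# execution records per-step metadata (as each Galaxy step is a job).
# Step functions and the log shape are hypothetical, for illustration.

def run_workflow(steps, data):
    """Run each step in order, logging (step name, input, output)."""
    log = []
    for name, step in steps:
        out = step(data)
        log.append((name, data, out))
        data = out
    return data, log

steps = [
    ("upper", lambda xs: [x.upper() for x in xs]),  # hypothetical step 1
    ("dedupe", lambda xs: sorted(set(xs))),         # hypothetical step 2
]

result, log = run_workflow(steps, ["a", "b", "a"])
print(result)  # ['A', 'B']

# Two independent users running the same workflow on the same input
# replicate the entire analysis and get identical outputs.
result2, _ = run_workflow(steps, ["a", "b", "a"])
assert result2 == result
```

The per-step log is what lets a single workflow execution be replicated either as a whole, or one recorded job at a time.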
User data in Galaxy is private by default, but easily shared. Users on the same server can share or publish histories and workflows. Users on other servers can import any history or workflow they are given as an archive file, or as a published link. A shared history carries all those Run this job again buttons just described. This makes every Galaxy job transparent: each is easily replicated, with all the required settings, for others to scrutinise. A shared workflow contains all the tool settings and data flows, so it is a transparent way to share a complete, complicated analysis. This sharable transparency makes independent replication, validation, or repurposing of the originally reported work easy. It also enables extremely efficient technical support, through easy access to a transparent record of a failing job.
The Run this job again button's underlying functionality might be more important than many users realise. It's a feature the community can be proud of, because few of the available alternatives offer simple push-button computational reproducibility. Galaxy provides easy ways for users to share and replicate their complex computational analyses, making open scientific computing accessible to the world's scientists.