Galaxy Main ToolShed

This page provides guidelines (not strict rules) for naming and populating ToolShed repositories. In some cases, populating repositories in ways that differ from these guidelines can be justified. Some repositories were also created before the ToolShed features that now support these guidelines were available, and these repositories should be left alone: changing them to follow newer guidelines could break tool version lineage chains or produce other undesired results that undermine reproducibility in Galaxy. In addition, the guidelines on this page will evolve slightly over time to accommodate new ToolShed features that are not yet available. However, the ToolShed of the future will always be backward-compatible with the ToolShed of the present or past, so the best practices discussed here will always be supported.

The Building repository relationships page examined two kinds of repository relationships supported by the ToolShed: simple repository dependencies and complex repository dependencies. The consequences of both kinds of relationships are seen when repositories that define them are installed into Galaxy. A key principle of repository installation is that a repository revision will only be installed once (an installed repository is uniquely identified by a name-spaced pattern consisting of the ToolShed, repository name, owner and revision). When a Galaxy administrator installs additional repositories over time, any of their dependencies that are already installed will not be installed again. Instead, a relationship “link” will be created between a newly installed dependent repository and the dependencies that were previously installed. In other words, a repository’s contents can be shared by the contents of many other repositories, implying that the contents should be modular. The ToolShed’s support for repository relationships provides the basis for defining guidelines for populating repositories with Galaxy utilities.
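The install-once principle described above can be illustrated with a minimal Python sketch. This is not Galaxy's actual implementation: the RepoKey and GalaxyInstance classes, the ToolShed host name, and the repository names, owners, and revisions are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RepoKey:
    """Hypothetical key identifying one installed repository revision."""
    toolshed: str
    name: str
    owner: str
    revision: str


@dataclass
class GalaxyInstance:
    """Toy model of a Galaxy environment (illustration only)."""
    installed: set = field(default_factory=set)
    links: list = field(default_factory=list)  # (dependent, dependency) pairs

    def install(self, repo, dependencies=()):
        """Install repo and its dependencies, skipping anything already present."""
        newly_installed = []
        for dep in dependencies:
            if dep not in self.installed:
                self.installed.add(dep)
                newly_installed.append(dep)
            # New or previously installed, the dependent is linked to it.
            self.links.append((repo, dep))
        if repo not in self.installed:
            self.installed.add(repo)
            newly_installed.append(repo)
        return newly_installed


# All names below are made up for the example.
shed = "toolshed.example.org"
datatypes = RepoKey(shed, "example_datatypes", "alice", "abc123")
tool_a = RepoKey(shed, "example_tool_a", "alice", "111aaa")
tool_b = RepoKey(shed, "example_tool_b", "bob", "222bbb")

galaxy = GalaxyInstance()
first = galaxy.install(tool_a, [datatypes])   # installs the dependency and tool_a
second = galaxy.install(tool_b, [datatypes])  # installs tool_b only, links to the dependency
```

In this sketch the second installation adds only the new tool, while both tools end up linked to the same shared dependency.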

Guidelines for Tools

In general, repositories with tools should be restricted to containing a single tool. The current set of ToolShed features supports this as a best practice. Again, existing repositories containing multiple tools should not be altered. In fact, there are several repositories owned by the Galaxy devteam account in the ToolShed that contain multiple tools (e.g., the emboss_5 repository, the picard repository, the bwa_wrappers repository and others). These are examples of repositories that were created before the ToolShed supported repository relationships, and extracting tools into separate repositories currently breaks version lineage chains.

The ToolShed’s support for repository relationships, used in combination with restricting repositories to a single tool, has many advantages. A single-tool repository allows Galaxy administrators to install only those tools in which they are interested. Large repositories with multiple Galaxy utilities are more likely to include items that are not desirable in all Galaxy environments. The ToolShed’s features ensuring reproducibility in Galaxy further justify this practice. Repositories containing Galaxy tools have specific installable revisions. These revisions are defined by the versions of the tools included in the repository. For repositories containing a single tool, a new installable revision is created only when the version of that tool changes. For repositories containing multiple tools, however, a new installable revision is created when the version of any one of the tools changes. As a result, multiple instances of the same version of a tool can potentially be installed when repositories contain multiple tools. In these cases, Galaxy will load only a single instance of a tool version, but the tool and related files will still be installed on disk multiple times.
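The effect of tool versions on installable revisions can be sketched with a deliberately simplified model. It assumes that a changeset begins a new installable revision whenever its set of tool versions differs from the preceding changeset; the ToolShed's real logic is more involved, and the function name, changeset ids, and tool ids here are hypothetical.

```python
def installable_revisions(changesets):
    """Return ids of changesets that begin a new installable revision.

    `changesets` is an ordered list of (changeset_id, {tool_id: version}) pairs.
    Simplified assumption: a changeset begins a new installable revision when
    its tool versions differ from those of the preceding changeset.
    """
    revisions = []
    previous = None
    for changeset_id, tool_versions in changesets:
        if tool_versions != previous:
            revisions.append(changeset_id)
        previous = tool_versions
    return revisions


# A single-tool repository: only the version bump in r3 matters.
single_tool = [
    ("r1", {"filter": "1.0"}),
    ("r2", {"filter": "1.0"}),
    ("r3", {"filter": "1.1"}),
]

# A two-tool repository: a bump to either tool creates a revision.
multi_tool = [
    ("r1", {"filter": "1.0", "sort": "1.0"}),
    ("r2", {"filter": "1.0", "sort": "1.1"}),
    ("r3", {"filter": "1.1", "sort": "1.1"}),
]

single_result = installable_revisions(single_tool)
multi_result = installable_revisions(multi_tool)
```

In the two-tool case, version 1.0 of the hypothetical filter tool appears in two distinct installable revisions (r1 and r2), which is how the same tool version can end up on disk more than once.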

Exported Galaxy workflows further demonstrate the benefits of restricting repositories to a single tool. Consider a repository that contains an exported Galaxy workflow that requires various tools not contained in the repository. When the workflow is imported into Galaxy, the user is alerted to the missing tools and allowed to search for them in accessible ToolSheds. Only required tools should be installed when discovered, but this becomes more difficult when repositories contain multiple tools.

Every tool configuration file should include functional test definitions following the instructions and examples in the Galaxy Tool XML File wiki with test data files placed in a directory named test-data in the repository. Repositories that follow these policies will be certified with the ToolShed’s Install and Test Framework.

Of course, installing a set of tools is optimal under some conditions (e.g., tools that are generally useful only as a complete set). The ToolShed provides a way to do this using simple repository dependency definitions. Björn Grüning’s chemicaltoolbox repository in the Main Galaxy ToolShed is an excellent example of this approach. Repositories like this include only a single file named repository_dependencies.xml, which contains a simple repository dependency definition for each required repository. These repositories are called "Repository suite definitions". This approach enables easy installation of a large suite of tools and other utilities, each of which is contained in a separate repository. Each repository can then be more easily maintained within the Galaxy environment after installation.
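As a rough illustration, the single repository_dependencies.xml file of a suite definition could be generated with a short Python script. The element and attribute names used here (repositories, repository, toolshed, name, owner, changeset_revision) reflect an assumed layout that should be checked against the ToolShed documentation, and the ToolShed URL, repository names, owner, and revisions are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def build_suite_definition(description, repositories):
    """Return repository_dependencies.xml content for a suite definition."""
    root = ET.Element("repositories", {"description": description})
    for repo in repositories:
        # One <repository> element per required single-tool repository.
        ET.SubElement(root, "repository", dict(repo))
    if hasattr(ET, "indent"):  # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Hypothetical ToolShed URL, repository names, owner, and revisions.
suite_xml = build_suite_definition(
    "Example suite of related tools",
    [
        {
            "toolshed": "https://toolshed.example.org",
            "name": "example_tool_a",
            "owner": "example_owner",
            "changeset_revision": "0123456789ab",
        },
        {
            "toolshed": "https://toolshed.example.org",
            "name": "example_tool_b",
            "owner": "example_owner",
            "changeset_revision": "ba9876543210",
        },
    ],
)
print(suite_xml)
```

Because each entry points at a separate repository, adding or removing a tool from the suite only means adding or removing one entry in this file.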

Repositories that contain Galaxy Data Manager tools further justify the use of "Repository suite definitions". Simple repository dependency relationships should be used to associate tools with the appropriate Data Manager tools that provide relevant reference data. The relationships between tools and Data Manager tools are most logically defined at the repository level. This approach gives Galaxy administrators the most options for installing these associated utilities into their environments.

Guidelines for Tool Dependency Installation Recipes

Tool dependency definitions are recipes for installing specific versions of packages, either as pre-compiled binaries or from source code. Since recipes will undoubtedly change over time (initial recipes may contain errors, web sites hosting source code may change, etc.), they should be contained in a repository whose type is "Tool dependency definition". Each of these repositories is restricted to a single installable revision (the latest revision), ensuring that dependent repositories will always get the current recipe.

Since these repositories are ultimately used to install specific versions of packages, they should be named appropriately. The convention for naming these repositories is package_<name>_<version> (e.g., package_amos_3_1_0, package_ape_3_0, package_atlas_3_10, etc.). The repository name should consist of the package name as well as the version because the repository must contain only the recipe for installing the specified package version. If a new version of the package is introduced, a new repository should be created to contain the recipe for installing it. This approach supports reproducibility in Galaxy for installed tools that define relationships to repositories containing these recipes.
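The naming convention can be captured in a small helper. This sketch assumes that version separators such as dots become underscores, as the examples above suggest, and that package names are lowercased; the lowercasing and the function name package_repository_name are illustrative assumptions, not part of the ToolShed.

```python
import re


def package_repository_name(package, version):
    """Build a repository name such as package_amos_3_1_0."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    normalized_package = re.sub(r"[^0-9A-Za-z]+", "_", package).strip("_").lower()
    normalized_version = re.sub(r"[^0-9A-Za-z]+", "_", version).strip("_")
    return f"package_{normalized_package}_{normalized_version}"


amos_name = package_repository_name("amos", "3.1.0")
ape_name = package_repository_name("ape", "3.0")
atlas_name = package_repository_name("atlas", "3.10")
```

Deriving the repository name from the package name and version in one place keeps each repository tied to exactly one package version.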

The November 4, 2013 Galaxy Release introduced enhancements to tool dependency installation recipes, providing support for downloading precompiled binaries for selected operating systems and architectures. As of that release, recipes should specify URLs for downloading precompiled binaries, while also continuing to include recipes for installing and compiling from source (contact the Galaxy development team if you are looking for a location to host compiled binaries). With these enhanced recipes, the ToolShed installation process inspects the Galaxy environment to determine whether it is compatible with one of the precompiled binaries defined by the recipe. If not, the installation process proceeds with the source code installation and compilation recipe steps.
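The binary-first, source-fallback behavior described above can be sketched as follows. This is only an illustration of the decision and not the ToolShed's installation code: the function name choose_install_action, the (system, machine) lookup keys, and the download URLs are hypothetical.

```python
import platform


def choose_install_action(binaries, system=None, machine=None):
    """Pick a precompiled binary if one matches, else fall back to source.

    `binaries` maps (system, machine) pairs to download URLs. When `system`
    or `machine` is omitted, the current environment is inspected.
    """
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    url = binaries.get((system, machine))
    if url is not None:
        return ("download_binary", url)
    # No compatible binary: run the source installation and compilation steps.
    return ("compile_from_source", None)


# Hypothetical binaries declared by a recipe.
recipe_binaries = {
    ("linux", "x86_64"): "https://example.org/pkg-1.0-linux-x86_64.tgz",
    ("darwin", "x86_64"): "https://example.org/pkg-1.0-darwin-x86_64.tgz",
}

linux_action = choose_install_action(recipe_binaries, "Linux", "x86_64")
other_action = choose_install_action(recipe_binaries, "FreeBSD", "amd64")
```

Keeping the source recipe as the fallback means an environment without a matching binary can still install the package.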

Guidelines for Custom Datatypes

From Galaxy’s perspective, a datatype is a “custom datatype” if it is not included in the Galaxy code distribution. Custom datatypes should be contained in a repository with no other utilities. Any number of custom datatypes can be contained within a single repository, but they should all be related to a general type of data. This approach has several benefits. Any number of repositories containing tools can define simple repository dependency relationships to these repositories, and "Repository suite definitions" can do so as well. This approach also lowers the risk of installing conflicting datatypes into a Galaxy environment.