Best Practices for Populating a Repository

A common misconception about the Test and Main public Galaxy ToolSheds hosted by the core Galaxy development team is that they are source code repositories for developing new Galaxy tools and other utilities. They are not! Just as a local development Galaxy instance should be used to aid in developing new Galaxy utilities, a local development ToolShed should be used during this process. However, even a local development ToolShed should not be treated as a source code repository for tool development. If you do this, you will undoubtedly experience unexpected (and potentially undesirable) behavior from the ToolShed.

The ToolShed, whether a local instance or not, is for sharing fully functional Galaxy utilities with others. Local development ToolSheds are extremely useful during the development process, but changes added to a repository in a ToolShed should generally consist of fully functional utilities. Of course, if problems are discovered after an upload, smaller change sets can be uploaded to fix them. But this process should be concluded in a local development ToolShed before the repository is created and populated in one of the public ToolSheds hosted by the Galaxy development team.

Since the contents of a repository could ultimately play an important role in scientific research, it is important to consider what you choose to include in those contents. The ToolShed gives you the flexibility to include any type of file in your repository. This allows you to include README files to communicate licensing information about tools or other important information about the contained utilities (e.g., installation details). Since the most important files in a repository are usually those that define Galaxy utilities, this discussion is centered around them. This page simply conveys the “best practice approach” for populating a repository, with little discussion of the reasons behind it.

The ToolShed provides support for defining relationships between repositories and their contents. A repository owner can use ToolShed features to define a relationship between a repository and any number of additional repositories that it requires in order to function when installed into a Galaxy environment. This implies that the ToolShed enables reuse of a specific repository’s contents by many other repositories when they are installed together into Galaxy. The number of utilities included in a repository should therefore be chosen to maximize reusability, and the optimal number depends on the utility type.

From Galaxy’s perspective, a datatype is a “custom datatype” if it is not included in the Galaxy code distribution. Custom datatypes should be contained in a repository with no other utilities. Any number of custom datatypes can be contained within a single repository, but they should all be related to a general type of data. For example, the emboss tools require the emboss datatypes, but these datatypes are generally not useful for other tools. Similarly, the snpeff tools require the snpeff datatypes. So a repository that contains emboss datatypes should not also contain the snpeff datatypes.

Exported Galaxy workflows are simple JSON objects (i.e., they load as Python dictionaries). They can be thought of as a model defining the order in which data flows from one Galaxy tool to another during an overall analysis. Workflows contained in an installed ToolShed repository can be imported into Galaxy. Any number of them can be included in a repository. However, to be functional, a workflow requires all tools to which it refers to be installed into the same Galaxy instance. Workflows are not currently versioned, so it is difficult to define best practices for including them in a repository. If, for example, the version of a workflow were defined by the combination of all of the tools for which it was built, defining best practices would be easier. There are pros and cons for including a workflow in a repository that contains Galaxy tools. There are also pros and cons for restricting workflows to repositories without any additional Galaxy utilities. In either case, installation of repositories that contain exported workflows into a Galaxy instance will generally require manual intervention by the Galaxy administrator to make them functional.
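
Because a workflow only functions when every tool it refers to is installed, it can help to list those tools before adding the workflow to a repository. The following is a minimal sketch, not part of Galaxy or the ToolShed. It assumes the exported file has a top-level "steps" mapping whose entries carry "tool_id" and "tool_version" keys; the sample workflow and the helper name required_tools are hypothetical, so check the key names against a real export.

```python
import json

# Hypothetical, trimmed-down stand-in for an exported workflow file.
# The "steps", "type", "tool_id" and "tool_version" keys are assumptions
# about the export layout; verify them against a real export.
SAMPLE_WORKFLOW = """
{
  "name": "example workflow",
  "steps": {
    "0": {"type": "data_input", "tool_id": null, "tool_version": null},
    "1": {"type": "tool", "tool_id": "example_filter", "tool_version": "1.0.0"},
    "2": {"type": "tool", "tool_id": "example_plot", "tool_version": "0.2.1"},
    "3": {"type": "tool", "tool_id": "example_filter", "tool_version": "1.0.0"}
  }
}
"""


def required_tools(workflow):
    """Return the de-duplicated (tool_id, tool_version) pairs a workflow refers to."""
    tools = set()
    for step in workflow.get("steps", {}).values():
        tool_id = step.get("tool_id")
        if tool_id:  # steps without a tool id (e.g. inputs) are skipped
            tools.add((tool_id, step.get("tool_version")))
    # Sort by id, then version; a missing version sorts as an empty string.
    return sorted(tools, key=lambda pair: (pair[0], pair[1] or ""))


workflow = json.loads(SAMPLE_WORKFLOW)
for tool_id, version in required_tools(workflow):
    print(f"{tool_id} (version {version})")
# prints:
# example_filter (version 1.0.0)
# example_plot (version 0.2.1)
```

A list like this gives the Galaxy administrator a starting point for the manual intervention mentioned above: each listed tool must be present in the target Galaxy instance before the workflow will run.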

With regard to Galaxy tools, a repository should contain only one. The reasons for this are related to tool versioning and reproducibility. Repository owners can define relationships between repositories, with a single repository requiring any number of additional repositories. This feature allows multiple related tools and utilities to be installed by selecting a single repository from the ToolShed.

Tool dependencies are recipes for installing and possibly compiling 3rd-party binaries that a Galaxy tool will locate and use at execution time. These recipes should always be contained within an appropriately named repository of type “Tool dependency definition”.
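
The guidelines on this page can be summarized as a small self-check. The sketch below is purely illustrative and is not a ToolShed feature: the function name check_repository, the shape of the contents dictionary, and the repository type labels are all hypothetical and chosen only to mirror the rules described above.

```python
def check_repository(contents):
    """Return guideline warnings for a hand-written summary of a repository.

    `contents` is a hypothetical summary, not a ToolShed data structure:
      "tools"                   - number of Galaxy tools (int)
      "datatype_groups"         - names of the custom datatype families (list of str)
      "tool_dependency_recipes" - number of tool dependency recipes (int)
      "repository_type"         - assumed label for the repository type (str)
    """
    warnings = []
    tools = contents.get("tools", 0)
    groups = contents.get("datatype_groups", [])
    recipes = contents.get("tool_dependency_recipes", 0)
    repo_type = contents.get("repository_type", "unrestricted")

    # A repository should contain only one Galaxy tool.
    if tools > 1:
        warnings.append("more than one tool in a single repository")
    # Custom datatypes belong in a repository with no other utilities.
    if groups and (tools or recipes):
        warnings.append("custom datatypes mixed with other utilities")
    # All datatypes in one repository should relate to one general type of data.
    if len(set(groups)) > 1:
        warnings.append("unrelated datatype families in one repository")
    # Recipes belong in a repository of type "Tool dependency definition".
    if recipes and repo_type != "tool_dependency_definition":
        warnings.append("tool dependency recipe outside a dedicated repository")
    return warnings


print(check_repository({"tools": 1}))
# prints: []
print(len(check_repository({"tools": 2, "datatype_groups": ["emboss", "snpeff"]})))
# prints: 3
```

Workflows are deliberately left out of the check, since this page describes trade-offs for them and does not give a single rule.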