Galaxy ObjectStore

ObjectStore is Galaxy's data virtualization technology; it abstracts Galaxy's business logic for data persistence technology and topology. In other words, the ObjectStore makes it possible to store data on a wide-variety of persistence media spanning from a local storage to cloud-based solutions and define any data distribution policy, without altering Galaxy's model, functions, api, or any component of its business logic in general.

The ObjectStore allows a Galaxy installation to make use of data living in more than simply a single filesystem of locally mounted disk. Accordingly, ObjectStore enables Galaxy admins to plug additional persistence media to an existing file system, which facilitates expanding a mounted filesystem (e.g., when it has ran out of space) without having to move data. Additionally, ObjectStore enables replicating data onto multiple persistence media, which increases Galaxy's data access fault-tolerance, in case any of the media becomes inaccessible.

Note that ObjectStore is configurable by a Galaxy admin, and data belonging to all the users of that Galaxy instance will be stored on the defined persistence media.

Related publications and presentations:

Galaxy ObjectStore Backends

A backend is any persistence media that ObjectStore can be configured to read/write from/to. The following is a list of backends that ObjectStore currently supports:

  • Local storage (e.g., disk);
  • Network attached storage (NAS);
  • Cloud

    • Amazon Simple Storage Service (S3);
    • Google Cloud Storage (config);
    • Microsoft Azure BLOB Storage;
    • OpenStack Object Storage (Swift)
  • integrated Rule-Oriented Data Store (iRODS)

ObjectStore Configuration

In order to configure ObjectStore you may take the following steps:

  1. Set the path to ObjectStore configuration file in config/galaxy.yml (not to be mistaken with config/galaxy.yml.sample) as the following.

    # Configuration file for the object store. If this is set and exists,
    # it overrides any other objectstore settings. The value of this option 
    # will be resolved with respect to <config_dir>.
    object_store_config_file: object_store_conf.xml
    
  2. Create the object_store_conf.xml file (you may use <config_dir>/object_store_conf.xml.sample as template/reference), and set it as the following:

    <?xml version="1.0"?>
    <object_store type="...">
        <!-- Backend-specific setters.  -->
    </object_store>
    

    where the type attribute can have any of the following values:

    {cloud, disk, distributed, hierarchical, azure_blob, s3, swift, irods, pithos}
    

    For instance, the following configuration defines a single disk storage:

    <?xml version="1.0"?>
    <object_store type="disk">
       <files_dir path="database/files"/>
       <extra_dir type="temp" path="database/tmp"/>
       <extra_dir type="job_work" path="database/job_working_directory"/>
    </object_store>
    

    For backend-specific configuration, see ObjectStore Backends section.

  3. Restart Galaxy for changes to take place.

Data Distribution Methods

A Galaxy admin can set ObjectStore to leverage either a single backend, or define a nested relation between multiple backends. To use a single backend, provide the backend-specific configuration (e.g., similar to the above disk example), and Galaxy will always read/write to that backend.

In order to define a nested relation of multiple backends, you may use hierarchical or distributed backends. The following table captures the difference between the hierarchical and distributed backends:

Where data is read from? Where data is written to?
Hierarchical first backend where data exists always the first backend
Distributed first backend where data exists pseudo-randomly selected backend

For instance, using the following configuration, ObjectStore always writes data on S3, but retrieves data from the first backend where the data exists.

<?xml version="1.0"?>
<object_store type="hierarchical">
    <backends>
        <object_store type="cloud" provider="aws" order="0">
            <auth access_key="..." secret_key="..." />
            <bucket name="..." use_reduced_redundancy="False" />
            <cache path="database/object_store_cache" size="100" />
            <extra_dir type="job_work" path="database/job_working_directory_s3"/>
            <extra_dir type="temp" path="database/tmp_s3"/>
        </object_store>
        <object_store type="disk" id="secondary" order="1">
            <files_dir path="database/files"/>
            <extra_dir type="temp" path="database/tmp"/>
            <extra_dir type="job_work" path="database/job_working_directory"/>
        </object_store>
    </backends>
</object_store>

This configuration is useful when you have been using a backend for a while, then you decide to "extend" it by adding a new backend, but without having to move data from previous backend to the new backend. For instance, you have been using disk for a while, but now you have run out of space on disk, so you decided to add an S3 bucket; you would like to do so by (a) keeping the existing data on disk as they are, and still be able to use them for any job/workflow execution, and (b) you would like any new data to be persisted on S3. This is the scenario for which the hierarchical ObjectStore is defined for, as it will always persist new data on S3, and can look for data on both S3 and disk (where it will find data created prior to adding S3).

On the other hand, the distributed ObjectStore pseudo-randomly selects a backend to which it should persist data. (The pseudo-random selection is based on admin-specified weights for backends.) For instance:

<?xml version="1.0"?>
<object_store type="distributed">
    <backends>
        <backend id="files1" type="disk" weight="99">
            <files_dir path="database/files1"/>
            <extra_dir type="temp" path="database/tmp1"/>
            <extra_dir type="job_work" path="database/job_working_directory1"/>
        </backend>
        <backend id="files2" type="disk" weight="1">
            <files_dir path="database/files2"/>
            <extra_dir type="temp" path="database/tmp2"/>
            <extra_dir type="job_work" path="database/job_working_directory2"/>
        </backend>
    </backends>
</object_store>

Using this configuration, ObjectStore randomly distributes data between database/files1 and database/files2, with 0.99 and 0.01 probability of selecting the former and latter respectively.

Additionally, you can define any nested relation of the backends, for instance:

<?xml version="1.0"?>
<object_store type="hierarchical">
    <backends>
        <object_store type="distributed" id="primary" order="0">
            <backends>
                <backend id="files1" type="disk" weight="1">
                    <files_dir path="database/files1"/>
                    <extra_dir type="temp" path="database/tmp1"/>
                    <extra_dir type="job_work" path="database/job_working_directory1"/>
                </backend>
                <backend id="files2" type="disk" weight="1">
                    <files_dir path="database/files2"/>
                    <extra_dir type="temp" path="database/tmp2"/>
                    <extra_dir type="job_work" path="database/job_working_directory2"/>
                </backend>
            </backends>
        </object_store>
        <object_store type="disk" id="secondary" order="1">
            <files_dir path="database/files3"/>
            <extra_dir type="temp" path="database/tmp3"/>
            <extra_dir type="job_work" path="database/job_working_directory3"/>
        </object_store>
    </backends>
</object_store>