GCC2014 Talk Abstracts

Session 1, Tuesday, July 1, 9:15-10:30

Transcriptomes and Exomes: Computational Challenges of NGS Data

Steven Salzberg

Steven Salzberg1

1 Johns Hopkins University

Steven Salzberg is a Professor of Medicine, Biostatistics, and Computer Science at the Johns Hopkins University School of Medicine where he is also Director of the Center for Computational Biology at the McKusick-Nathans Institute of Genetic Medicine. Steven has made many prominent contributions to open source software, including several of the most popular tools used on Galaxy Platforms. Recently he was awarded the 2013 Benjamin Franklin Award for Open Access in the Life Sciences, and the 2012 Balles Prize in Critical Thinking for his science column at Forbes.

The Galaxy framework as a unifying bioinformatics solution for multi-omic data analysis


Pratik D. Jagtap1,3, James Johnson2, Getiria Onsongo2, Bart Gottschalk2, Timothy J. Griffin1,3

1Center for Mass Spectrometry and Proteomics, University of Minnesota, Minneapolis, Minnesota, United States
2Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, Minnesota, United States
3Department of Biochemistry, Molecular Biology, and Biophysics, University of Minnesota, Minneapolis, Minnesota, United States

Slides, Video

Integration and correlation of multiple 'omics' datasets (genomic, transcriptomic, proteomic) has the potential to provide novel biological insights. Integrating these datasets is challenging, however, because it involves using multiple domain-specific software tools in a sequential manner.

We describe extending the use of Galaxy for proteomics software, enabling novel, advanced multi-omic applications in proteogenomics and metaproteomics. Focusing on the perspective of a biological user, we will demonstrate the benefits of Galaxy for these analyses, as well as its value for software developers seeking to publish new software. We will also report on our experience in training non-expert biologists to use Galaxy for these advanced, multi-omic applications.

Working with biological collaborators, multiple proteogenomics and metaproteomics datasets representing a broad array of biological applications were used to develop workflows. Software required for sequential analytical steps such as database generation (RNA-Seq derived and others), database search and genome visualization were deployed, tested and optimized for use in workflows.

Novel proteoforms (proteogenomic workflows, e.g., 'Galaxy Workflow: Integrated ProteoGenomics Workflow (ProteinPilot)') and microorganisms (metaproteomic workflows, e.g., 'Workflow for metaproteomics analysis - ProteinPilot') were reliably identified using shareable workflows. Tandem proteogenomic and metaproteomic analysis of datasets using modular workflows will be discussed. Sharing of datasets, workflows and histories on the usegalaxyp.org website and proteomic public repositories will also be discussed.

We demonstrate the use of Galaxy for integrated analysis of multi-omic data, in an accessible, transparent and reproducible manner. Our results and experiences using this framework demonstrate the potential for Galaxy to be a unifying bioinformatics solution for multi-omic data analysis.

iReport: HTML Reporting in Galaxy

Saskia Hiltemann

Saskia Hiltemann1, Youri Hoogstrate1, Hailiang Mei2, Guido Jenster1, Andrew Stubbs1

1 ErasmusMC, Rotterdam, The Netherlands
2 LUMC, Leiden, The Netherlands

Slides, Video

Galaxy offers a number of great visualisation tools (Trackster, Circster), but currently lacks the ability to easily summarise the various outputs of a workflow into a single view. iReport is a Galaxy tool for the easy creation of HTML reports from Galaxy outputs. Rather than having a static HTML output, iReport uses JavaScript and jQuery to allow for interactivity in the form of searching and sorting of tables, automatic zooming of image data, tabbed view for organisation of outputs, etc. Users define the number and names of tabs for their report, and can add different types of content-items to these tabs (e.g. text, tabular data, image data, PDF files, links to datasets, and more).

We have previously implemented Galaxy-based data processing pipelines for next-generation sequencing (NGS) and for array-based allelic copy number determination named CGtag (Hiltemann et al. 2014) and developed a web-based fusion gene visualizer, iFUSE (Hiltemann 2013). We used the iReport tool to make iFUSE2, the next-step extension to support fusion gene determination within Galaxy, which runs as the last step of our workflow and combines the outputs of various Galaxy tools into a single view.

iReport is available from the DTL toolshed (toolshed.dtls.nl) and the main Galaxy toolshed.

Session 2, Tuesday, July 1, 11:00-12:15

Galaxy Deployment on Heterogenous Hardware

Carrie Ganote

Carrie Ganote1, Soichi Hayashi1

1National Center for Genome Analysis Support

Slides, Video

Indiana University, like many institutions, houses a heterogenous mixture of compute resources. In addition to university resources, the National Center for Genome Analysis Support, the Extreme Science and Engineering Discovery Environment, and the Open Science Grid all provide resources to biologists with NSF affiliations. Such a diverse mixture of compute power and services could be applied to address the equally diverse set of problems and needs in the bioinformatics field.

Many software suites, such as phylogenetic tree building algorithms, are well suited to large numbers of fast CPUs. De novo assembly problems really crave a machine with lots of RAM to spare. Alignment and mapping problems where each input is a separate invocation lend themselves perfectly to high-throughput, heavily distributed compute systems. Galaxy is a web interface that acts as a mediator between the biologist and the underlying hardware and software; in an ideal setup, Galaxy would be able to delegate work to the best-suited underlying infrastructure.

We present an instance of Galaxy at Indiana University, installed and maintained by NCGAS, that takes advantage of a variety of compute resources to increase utilization and efficiency. The OSG is a distributed grid through which BLAST jobs can be run. IU, NCGAS and XSEDE jointly support Mason, a 512 GB/node system. For IU users, Big Red 2 is the first university-owned petaFLOPS machine. Connecting these resources to Galaxy and using the best tool for the job results in the best performance and utilization - everyone wins.
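
The delegation idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the configuration used at Indiana University: the `Job` fields, the thresholds, and the destination labels (`"mason"`, `"osg"`, `"bigred2"`, `"local"`) are hypothetical names chosen to mirror the resources mentioned in the abstract.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of a job to be placed on a resource."""
    tool: str            # tool identifier, e.g. "blastn"
    n_inputs: int = 1    # number of independent inputs (separate invocations)
    memory_gb: int = 4   # estimated peak memory
    threads: int = 1     # number of CPU cores the tool can use


def choose_destination(job: Job) -> str:
    """Pick a destination label for a job (thresholds are assumptions)."""
    if job.memory_gb > 64:
        # Memory-hungry work such as de novo assembly goes to large-memory nodes.
        return "mason"
    if job.tool.startswith("blast") and job.n_inputs > 1:
        # Many independent invocations suit a distributed grid.
        return "osg"
    if job.threads >= 16:
        # Highly parallel work such as tree building goes to the many-core machine.
        return "bigred2"
    return "local"


print(choose_destination(Job("blastn", n_inputs=500)))
```

The rules are checked in order, so a job that is both memory-hungry and highly parallel lands on the large-memory system first; a real deployment would tune both the order and the thresholds.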

Connecting Galaxy to tools with alternative storage and compute models

Brad Chapman

Brad Chapman1, Rory Kirchner1, Oliver Hofmann1, Winston Hide1

1Bioinformatics Core, Harvard School of Public Health

Slides, Video

The community-developed bcbio-nextgen framework provides implementations of best-practice pipelines for variant calling and RNA-seq analysis. The framework handles computation, data storage and program connectivity in ways that parallel Galaxy's approaches, making it difficult to plug in as a standard tool. We'd like to be able to integrate with Galaxy by sharing the underlying implementation code for accessing data, rather than pushing and pulling large files. This talk will discuss ideas to access shared data on external object stores like S3 or HDFS in a consistent way that does not rely on data copying. It will also incorporate approaches to compartmentalize complex sets of tools inside containers using Docker. The goal is to stimulate discussion about ways to make Galaxy a modular component within complex analysis environments. Our ultimate vision is to have an Amazon-based cloud implementation that uses CloudMan to run a Galaxy front end sending out jobs to tools like bcbio-nextgen.

A journal’s experiences of reproducing published data analyses using Galaxy

Peter Li

Peter Li1, Huayan Gao2, Tin-Lap Lee2 and Scott C. Edmunds1

1GigaScience, BGI-Hong Kong Co., Ltd, Hong Kong
2School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong

Slides, Video

GigaScience is a journal with a focus on the publication of reproducible research. This is facilitated by its GigaDB database where the data and the tools used for its analysis may be deposited by authors and made publicly available with citable DOIs. We have investigated the extent to which the results from articles published in GigaScience can be made reproducible using Galaxy in a pilot project based on a previously published paper reporting on SOAPdenovo2. The performance of this de novo genome assembler was compared with SOAPdenovo1 and ALL-PATHS-LG by Luo et al. (2012) for its ability to assemble bacterial, insect and human genomes. After integrating the three genome assemblers and their associated tools into Galaxy, workflows were implemented in a way that re-created the genome assembly pipelines used by the authors. However, our aim of reproducing the genome assembly statistics from Luo et al. (2012) with the workflows was met with mixed success. Whilst the results generated by SOAPdenovo2 could be reproduced by our Galaxy workflows, we were less successful with SOAPdenovo1 and ALL-PATHS-LG. In this presentation, we will show how Galaxy was used, the problems that were encountered and the results of this reproducibility exercise.


Luo et al. (2012) SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1:18.

Enabling Dynamic Science with Flexible Infrastructure

Anushka Brownley, Aaron Gardner

Anushka Brownley1, Aaron Gardner1


Slides, Video

As a trusted industry leader in designing and implementing effective scientific infrastructure for research and other organizations, BioTeam has partnered with the Galaxy Project to build and offer the SlipStream Galaxy Appliance, a commercially supported platform. With the increasing throughput of data generation instruments, the dynamic landscape of computational tools, and the variability in analysis processes, it is challenging for scientists to work within the confines of a static infrastructure. BioTeam will discuss some of these challenges and the technical advances we have been working on to build a more flexible Galaxy appliance to support the changing compute and analysis needs of the scientific researcher.

Session 3, Tuesday, July 1, 1:15-2:30

State of the Galaxy


Anton Nekrutenko1 and James Taylor2

1Penn State University
2Emory University

Slides, Video

An overview of where the Galaxy Project is and where it is going.

Update on Ion Torrent Sequencing – Accurate, Long Reads

Mike Lelivelt1

1 Director of Bioinformatics and Software Products, Ion Torrent, part of Life Technologies

Slides, Video

Session 4, Tuesday, July 1, 4:00-5:30

The Galaxy Tool Shed: A Framework for Building Galaxy Tools

Greg Von Kuster

Greg von Kuster1 and the Galaxy Team

1Penn State University, State College, Pennsylvania, United States

Slides, Video

The Tool Shed has become an integral part of the process for building and deploying Galaxy tools and other utilities. In addition to tools, the Tool Shed supports Galaxy Data Managers, custom data types and exported Galaxy workflows. This list will be extended to support additional utilities when appropriate. The Tool Shed provides the ability to define relationships between repositories, enabling complementary utilities to be installed together.

The Tool Shed assures reproducibility within Galaxy when utilities are installed from the Tool Shed using the streamlined installation process between the two applications. An underlying principle of this assurance is that all versions of utilities available in the Tool Shed will always be accessible to any Galaxy instance. This principle implies that a select development path should be followed to produce repositories that are optimal for sharing.

Here we'll examine the various components and steps that comprise this process. Development begins within a local environment that includes Galaxy and a Tool Shed, where a hierarchy of related repositories can be built. The Tool Shed allows the developer to export the related repositories into a capsule that can be imported into another Tool Shed. This mechanism streamlines the process of deploying utilities from a development environment to the test and main public Galaxy Tool Sheds where an automated install and test framework certifies the repositories for sharing. When installed together into Galaxy after certification, the related repositories provide complementary Galaxy utilities that function together.

Integrating the NCBI BLAST+ suite into Galaxy

Peter Cock

Peter Cock1, John Chilton2, Björn Grüning3, Jim Johnson4, Nicola Soranzo5

1The James Hutton Institute, Scotland, United Kingdom
2 Department of Biochemistry and Molecular Biology, Penn State University, United States
3 Pharmaceutical Bioinformatics, Institute of Pharmaceutical Sciences, Albert-Ludwigs-University, Freiburg, Germany
4 Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, United States
5 Bioinformatics Research Program, CRS4, Pula, Italy

Slides, SlideShare, Video

NCBI BLAST is one of the best known computational tools in modern biology, and a common addition to Galaxy instances. This talk covers the history of the Galaxy wrappers for the NCBI BLAST+ command line tool suite, example use cases and workflows, and in particular our development process as a potential best practice model for Galaxy tool development, both technically (by showcasing Galaxy functionality) and in terms of community building.

Initially included within the main Galaxy distribution, the BLAST+ wrappers are now run as a separate open source project using a dedicated repository on GitHub, combined with open discussion on the public Galaxy development mailing list.

The BLAST+ wrappers have grown to take advantage of most features offered by Galaxy and the ToolShed, including ToolShed dependencies, custom datatypes (including composite types for BLAST databases), configuration files for local databases, Galaxy tool XML macros to avoid duplication, and functional testing.

Automated testing is an important part of the development model and release process used. Integration with TravisCI provides continuous integration testing where any update to the code is automatically tested on a Virtual Machine. This is reinforced by a policy of staging updates to the Galaxy Test ToolShed for an additional round of automated testing, prior to release on the main Galaxy ToolShed.

Finally, an overview of how BLAST is set up on the Galaxy instances we maintain will cover issues like job parallelization, thread and memory considerations, updating NCBI BLAST databases, and caching BLAST databases on cluster nodes.

deepTools: a flexible platform for exploring deep-sequencing data

Björn Grüning

Fidel Ramírez1, Friederike Dündar1,2, Sarah Diehl1, Björn A. Grüning3, and Thomas Manke1

1Max Planck Institute of Immunobiology and Epigenetics, Freiburg, Germany
2 Faculty of Biology, University of Freiburg, Freiburg, Germany
3Department of Computer Science, University of Freiburg, Freiburg, Germany

Slides, Video

We present a Galaxy-based web server for processing and visualizing deeply sequenced data. The web server's core functionality consists of a suite of newly developed tools, called deepTools, that enable users with little bioinformatic background to explore the results of their sequencing experiments in a standardized setting. Users can upload preprocessed files with continuous data in standard formats and generate heatmaps and summary plots in a straightforward, yet highly customizable manner. In addition, we offer several tools for the analysis of files containing aligned reads and enable efficient and reproducible generation of normalized coverage files. As a modular and open-source platform, deepTools can easily be expanded and customized to future demands and developments. The deepTools web server is freely available at http://deeptools.ie-freiburg.mpg.de and is accompanied by extensive documentation and tutorials aimed at conveying the principles of deep-sequencing data analysis. The web server can be used without registration. deepTools is also available from the Galaxy toolshed, which allows an easy automated installation to any Galaxy instance.

Session 5, Wednesday, July 2, 9:10-10:25

The GCC2014 Hackathon

GCC2014 Hackathon Participants

Dannon Baker1, Brad Chapman2, John Chilton3, Kyle Ellrott4, and GCC2014 Hackathon Participants

1Johns Hopkins University, Baltimore, Maryland, United States
2Harvard University, Cambridge, Massachusetts, United States
3Penn State University, State College, Pennsylvania, United States
4University of California Santa Cruz (UCSC), Santa Cruz, California, United States

Slides, Video

This year, for the three days before GCC, we are hosting a Galaxy Hackathon. Hackathons are events at which a group of developers with different backgrounds and skills collaborate hands-on and face-to-face to try to solve problems affecting a particular community, in this case the Galaxy community. Gathering a diverse set of people in a single room where they can focus on code free of all the distractions that are inevitable back at the office has proven to be a great mechanism for not only getting interesting things done in a short amount of time, but also for community building. The hackathon goals include growing the Galaxy developer community and connecting existing developers who are interested in similar problems, giving them an in-person opportunity to code together and plan for future post-hackathon collaborations.

In this talk, we’ll very briefly describe our Galaxy Hackathon goals and provide a general overview of progress made at the event. Since hackathons are by definition community driven, most of the talk will showcase the efforts of and be presented by the self-organizing groups that form during the event.

More Options, Less Time: Streamlining Access to Reference Datasets

Dan Blankenberg

Daniel Blankenberg1 and the Galaxy Team2

1Penn State University, State College, Pennsylvania, United States

Slides, Video

Recent enhancements to the Galaxy framework have introduced a new class of Galaxy Utilities, known as Data Managers (doi:10.1093/bioinformatics/btu119). Data Manager tools allow the Galaxy administrator to download, create and install additional datasets for any type of built-in datasets using a web-based GUI in real time.

Despite these advances, populating a Galaxy instance with a set of built-in datasets can be quite time consuming, especially in cases where data not only needs to be downloaded, but additional computation, such as building indexes, is required. While this works quite well, it is wasteful to have each Galaxy installation build these datasets, especially for common resources and genomes. It can take considerable amounts of time to populate a new Galaxy instance with needed datasets. Although the Galaxy Project provides a public rsync server with all of the built-in datasets that are used on the Main public site, utilizing this resource can be difficult and unwieldy, as there is a large amount of data and it lacks an accessible interface. While the individual location files are made available, they cannot be used as-is by an end user, unless the user has the exact same directory structure on the machine that is hosting their Galaxy instance.

Here, we describe a new set of resources that aim to rectify this situation. These resources streamline the configuration of built-in datasets for new and existing Galaxy instances and alleviate the technical barriers preventing many users from taking advantage of prebuilt reference datasets.

Building More Powerful Galaxy Workflows with Dataset Collections

John Chilton

John Chilton1 and the Galaxy Team

1Penn State University, State College, Pennsylvania, United States

Slides, Video

Galaxy features the ability to extract sample analysis histories into reusable workflows as well as the ability to construct such workflows from scratch or by modifying existing workflows. While these have been salient features of Galaxy for some time, the kinds of workflows that could be expressed by Galaxy have had critical limitations. Perhaps the most glaring of these is that Galaxy workflows have required a fixed number of inputs. Many relatively basic biomedical analyses require running a variable number of inputs across identical processing steps (“mapping”) and then combining or collecting these results into a merged output (“reducing”). This talk will present dataset collections, an extension to Galaxy that allows for the expression of these mapping and reducing workflows.

In particular, the concepts behind dataset collections will be covered, including a brief discussion of implementation details such as data model modifications and API methods. We will demonstrate how to “map” existing Galaxy tools across dataset collections to produce new collections and how to “reduce” these collections using other tools. Likewise, modifications to the workflow extraction and editing interfaces to accommodate these new operations will be demonstrated.
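
The “map” and “reduce” pattern described above can be illustrated with a minimal Python sketch. It is not Galaxy's data model or API: collections are stood in for by plain lists of file names, and `align` and `merge` are hypothetical placeholder tools.

```python
from typing import Callable, Iterable, List


def map_over_collection(tool: Callable[[str], str], collection: Iterable[str]) -> List[str]:
    """Run the same single-input tool on every element; the result is a new collection."""
    return [tool(dataset) for dataset in collection]


def reduce_collection(tool: Callable[[List[str]], str], collection: List[str]) -> str:
    """Run a tool that consumes the whole collection and yields one merged output."""
    return tool(collection)


def align(reads: str) -> str:
    # Hypothetical per-sample step: one input dataset in, one output dataset out.
    return reads.replace(".fastq", ".bam")


def merge(alignments: List[str]) -> str:
    # Hypothetical combining step: many datasets in, one dataset out.
    return "merged(" + ",".join(alignments) + ")"


# The number of samples is not fixed in advance, which is the point of the pattern.
samples = ["a.fastq", "b.fastq", "c.fastq"]
aligned = map_over_collection(align, samples)
result = reduce_collection(merge, aligned)
print(result)
```

The same two helpers work unchanged for three samples or three hundred, which is what a workflow with a fixed number of inputs cannot express.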

Dataset collections are a powerful new feature that greatly enhances the expressivity of Galaxy workflows, but a lot of work remains to be done. The talk will conclude with a potential roadmap and timeline for dataset collection related development, including building UI components for digging into collections, building new collections, visualizing across collections, and tool enhancements allowing tools to create collections.

An Appliance for Life Science Research: Isilon, Penguin and Galaxy

Patrick Combes

Patrick Combes1

1 Senior Solution Architect for Life Sciences, EMC Isilon

Slides, Video

Isilon and Penguin Computing have paired to create a mid-size appliance for Galaxy by leveraging their respective strengths in storage and compute. This session will detail the architecture and projected use cases for the appliance.

Session 6, Wednesday, July 2, 10:55-12:15

Lab Specimen Tracking with Galaxy

Martin Čech

Martin Čech1, Pavel Švéda1, Ondřej Fabián1 and the Galaxy Team

1Penn State University, State College, Pennsylvania, United States

Slides, Video

No experiment begins with sequencing. Instead it commences with a collection of samples followed by DNA isolation (generation of cDNA, immunoprecipitation etc.), preparation of sequencing libraries, sequencing itself, and, finally, data analysis. In other words, during an NGS experiment a biological specimen undergoes transformation into a dataset to be analyzed. When an experiment involves a handful of samples, tracking the specimen-to-dataset metamorphosis is straightforward. However, the low cost of sequencing enables individual single-PI laboratories to perform studies involving hundreds and even thousands of samples. At this scale tracking information about individual samples becomes challenging. Yet such tracking is essential for troubleshooting and ensuring a successful study. We have developed an open-source sample tracking system based on the mobile devices that everyone carries in their pockets. The mobile application is able to communicate with a variety of sequencing instruments and trigger automated data analyses through the Galaxy system (http://usegalaxy.org).

The Munich NGS-FabLab for medical sequence data

Sebastian Schaaf

Sebastian Schaaf1,2, Aarif Mohamed Nazeer Batcha2, Sandra Fischer2, Guokun Zhang2, Ulrich Mansmann1,2

1German Cancer Consortium (DKTK), Heidelberg, Germany
2Department of Medical Informatics, Biometry and Epidemiology (IBE), Ludwig Maximilians University (LMU) Munich, Germany

Slides, Video

Using NGS data in a clinical context comes along with a whole range of challenges, constraints and requirements, affecting all levels of an IT infrastructure dealing with that type of data – and related biomedical metadata. Especially in Germany, the restrictive data security laws play a key role. In 2010, the Munich regional area successfully applied for a grant ('Leading-Edge Cluster Competition') dedicated to ‘personalized medicine’, supporting infrastructures for improving cross-connections between the medical faculties of both universities and associated institutions, their hospitals, independent research institutes (Helmholtz Centre, Max Planck Institutes) and industrial partners.

Aiming for a structured, biomedical metadata-driven organization of clinical NGS data, an interconnected, user-friendly, modular, broad-ranged and self-hosted open source analysis platform turned out to be crucial. Or in a nutshell: a Galaxy instance.

This talk is about the experiences of nearly three years of getting from a blank slate to a conceptual Galaxy-driven NGS infrastructure, dedicated to scientists and clinicians working on anything from basic research up to experimental molecular diagnostics within a university medical center’s environment. Topics will include experiences with core IT, faculty politics, project cooperations, software establishment etc. as well as derived Dos and Don’ts. Furthermore, some small software improvements will be presented, hopefully contributing back to the community. In addition, we would like to draw connections to topics presented, discussed and improved since the last two GCCs in Chicago and Oslo, some of which may have been forgotten. Over time, we had the impression of facing several of them ourselves, and are pretty glad not to be in a minority of one.

Galaxydx - A Web-server dedicated to diagnosis data analysis

Vivien Deshaies, Alban Lermine

Vivien DESHAIES1,2,3, Alban LERMINE1,2,3, Séverine LAIR1,2,3, Nicolas SERVANT1,2,3, Elodie GIRARD1,2,3, Julien TARABEUX4,5, Philippe HUPE1,2,3, Claude HOUDAYER4,5, Emmanuel BARILLOT1,2,3

1Institut Curie
2INSERM U900, Bioinformatics and Computational Systems Biology of Cancer, Paris, France
3Mines ParisTech, Fontainebleau, France
4INSERM U830, Génétique et biologie des cancers, Paris, France
5Biologie des Tumeurs, Paris, France

Slides, Video

Early cancer diagnosis is a challenge, and meeting it can dramatically improve cancer treatment efficiency. High-throughput sequencing technology is the most promising solution to reach this goal, but the analysis of its output is not straightforward and, most of the time, requires launching software that is only available via a command line interface.

Galaxy is a web platform that aims to: (1) make command line software accessible in an easy-to-use web interface, (2) construct personal workflows, (3) make analyses reproducible over time, (4) share know-how (workflow sharing) as well as data and annotations.

We built Galaxydx, an implementation of Galaxy containing a suite of software tools used for the analysis of diagnosis sequencing data (PGM torrent suite, BWA, GATK, VarScan, Annovar, etc.). Galaxydx allows clinicians as well as biologists to autonomously perform a complete set of analyses such as: (1) mapping, (2) variant calling, (3) variant filtering, (4) variant annotation, (5) rearrangements calling and (6) visualization through a diagnosis-dedicated genome browser (Alamut).

We also work on data integrity and confidentiality by modifying the way Galaxy writes files. Analyses in Galaxydx are organized by project and user, and output files are owned by the user who generates them. This allows us to systematically check system rights on data before any process (Can the current user read the input data? Can the current user write to this project?).
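
The kind of pre-flight rights check described above might look like the following Python sketch. It is an assumption-laden illustration, not the Galaxydx implementation: `check_before_run` is a hypothetical helper, and it relies only on the standard `os.access` call, which tests permissions for the real user ID of the calling process.

```python
import os
from typing import Iterable


def check_before_run(input_paths: Iterable[str], project_dir: str) -> None:
    """Raise PermissionError unless inputs are readable and the project is writable."""
    # Can the current user read every input dataset?
    unreadable = [p for p in input_paths if not os.access(p, os.R_OK)]
    if unreadable:
        raise PermissionError("cannot read input(s): " + ", ".join(unreadable))
    # Can the current user create files in this project directory?
    # Creating a file in a directory needs both write and search (execute) rights.
    if not os.access(project_dir, os.W_OK | os.X_OK):
        raise PermissionError("cannot write to project: " + project_dir)
```

Running such a check before any job starts means a rights problem is reported immediately rather than partway through an analysis.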

Using Galaxy and Globus to deliver Science as a Service

Ravi Madduri

Ravi K Madduri1,2, Paul Dave2, Alex Rodriguez2, Vassily Trubetskoy3, Dinanath Sulakhe2, Lea Davis3, Nancy Cox3 and Ian Foster1,2

1Argonne National Laboratory, Argonne, Illinois, United States
2Computation Institute, University of Chicago, Chicago, Illinois, United States
3Section of Genetic Medicine, University of Chicago, Chicago, Illinois, United States

Slides, Video

At the Computation Institute, we originally posited the notion of science as a service in 2005 as a means of publishing and accessing scientific data and applications through well-defined and internet accessible services. Our vision of science as a service worked well in a world when computing resources were scarce; when we needed to federate heterogeneous resources and make them accessible to researchers; when different tools and data were provided using different interfaces and representations; and when research problems involved datasets that could be hosted and analyzed on a single computer. In this talk we re-examine our vision of science as a service in a world in which computing resources are now commoditized; a world in which researchers are increasingly facing 'big data' challenges; a world in which Cloud providers, such as Amazon Web Services, have become viable alternatives to purchasing dedicated infrastructure; and a world in which building reliable infrastructure for solving scientific problems is only an API call away.

We will present our efforts on using Galaxy and Globus to create cloud-based services for scientific domains such as Genomics, Climate modeling, Cosmology, ECG Analysis and Material Sciences. We will present lessons learned and the extensions we created to enable these communities' adoption of Galaxy as an analysis engine. We will also present a recent genomics use case, enabled by the Galaxy-based Globus Genomics, of creating and running the Consensus Genotyper for Exome Sequencing pipeline on a large-scale Tourette's Syndrome data set. (Joint work with Dr. Nancy Cox's group at UChicago.)

SGI UV: Harnessing the Big Brain Platform for Galaxy

James Reaney

James Reaney1

1 Senior Director, Research Markets, SGI

Slides, Video

SGI UV scales to truly extraordinary levels – today up to 2,560 physical cores and 64TB of cache-coherent, globally shared memory in a single system. UV is also a developer’s dream playground: standard Intel x86 architecture, standard Linux distros, support for large numbers of Nvidia GPU and Xeon® PHI®, and all those cores and memory at your disposal in a single OS. Run standard ISV applications or any open-source code just like any Linux instance, no recompiling necessary. The versatility, high performance, and extreme scale of UV make it the ultimate "analysis supernode", but what if we used UV as an enabling platform for Galaxy workflows? How much more extensible might the tools become? What new scales might Galaxy workflows reach? What larger-scale research might be simply enabled in the first place by having a more effective computational architecture underlying the Galaxy workflow?

Session 7, Wednesday, July 2, 1:15-2:35

Building a virtual research environment with Galaxy

Olivier Inizan, Mikael Loaec

Olivier Inizan1, Mikael Loaec1, Helena Rasche2, Hadi Quesneville1

1URGI-INRA, Versailles, France
2Center for Phage Technology, Texas A&M University, College Station, Texas, United States

Slides, Video

The democratization of virtualization techniques provides a new opportunity to improve bioinformatics analysis. Storing, sharing and reusing tools dedicated to an analysis is the goal of the Galaxy toolshed project. With virtualization techniques, it is now possible to extend this strategy to all the components required to perform a bioinformatic analysis (such as the operating system, the software, the datasets, the dependencies, the user data, …).

Integrating these components in a virtual machine provides a virtual research environment (VRE) that can be duplicated and shared. With the growing availability of infrastructures supporting virtualization (such as cloud computing infrastructures), VREs offer a new opportunity to improve bioinformatics analysis accessibility and reproducibility.

Accessibility and reproducibility are the building blocks of the Galaxy project and the Galaxy platform could play a significant role in such environments. However, to become accessible and shareable, creating and updating a VRE should be automated as much as possible, from the virtual machine provisioning to tools deployment and tests.

Here we describe our progress towards an automated process for deploying a Galaxy instance. The current work is focused on virtual machine provisioning with Cobbler and automatic configuration with Puppet. The opportunities that such an approach provides to developers and biologists will be discussed and illustrated with the future French infrastructures dedicated to cloud computing: the IFB and INRA academic clouds.

The Australian Genomics Virtual Laboratory

Andrew Lonie

Andrew Lonie1, Enis Afgan2,3, Ron Horst4, Simon Gladman5, Clare Sloggett1, Nuwan Goonasekera1, Igor Manukin4, Yousef Kowsar4

1Life Sciences Computation Centre, University of Melbourne, Australia
2University of Melbourne, Australia
3Ruđer Bošković Institute, Croatia
4University of Queensland, Australia
5Life Sciences Computation Centre, Monash University, Australia

Slides, Video

The Australian Genomics Virtual Laboratory (GVL) is a national program aiming to provide the research community with an accessible, scalable genomics analysis platform on national compute infrastructure. The GVL leverages a significant investment in cloud infrastructure by the Australian government, together with existing cloud management tools, to enable researchers to create on-demand genomics analysis environments based on the open source Galaxy workflow platform, linked through high-speed networks to very large, reliable data storage and to local instances of visualization engines like the UCSC browser.

This talk will discuss the technical and practical lessons learned during the development of the Genomics Virtual Lab, including considerations in defining and implementing a one-size-fits-all pre-configured Galaxy image, the constraints a cloud environment places on practical 'real data' genomics, identification of and interaction with the user base, and deliberations on the future of the Genomics Virtual Laboratory including architecting for the entire genomics analysis life cycle on the cloud.

Galaxy on the GenomeCloud: Yet another on-demand Galaxy cloud, but only powered by Apache CloudStack

Youngki Kim

Youngki Kim1, CB Hong1, Kjoong Kim1, Daechul Choi1

1GenomeCloud, Seoul, Korea

Slides, Video

Bioinformatics and genome data analysis in South Korea is at an early stage but is growing quickly. To keep pace with this research trend, GenomeCloud was created at the end of 2012. GenomeCloud is an integrated platform for analysing, interpreting and storing genome data, based on KT's cloud computing infrastructure, which uses Apache CloudStack software. GenomeCloud consists of g-Analysis (automated genome analysis pipelines at your fingertips), g-Cluster (easy-to-use and cost-effective genome research infrastructure) and g-Storage (a simple way to store and share genome-specific data).

Because of its flexible tool integration architecture and seamless workflow creation functionality, Galaxy was selected to serve multiple goals, such as agile pipeline development and bioinformatics education support. To provide an on-demand, Apache CloudStack-based Galaxy cluster, we have automated virtual machine creation, clustering and the setup of various software, including Galaxy.

Furthermore, seamless integration with GenomeCloud helps researchers not only create and manage Galaxy through a convenient web interface but also fully utilize genome data in g-Storage. g-Storage is powered by OpenStack Swift and a specially designed genome file transfer protocol.

Galaxy on the GenomeCloud uses Grid Engine as a cloud HPC solution, Ganglia as a distributed monitoring system and LVM over NFS as large-volume shared storage, all of which are set up automatically upon request. This talk will cover our experiences integrating Galaxy with GenomeCloud and use cases of Galaxy such as a scalable bioinformatics education system and the fulfillment of RNA-seq analysis requests.

Test-driven Evaluation of Galaxy Scalability on the Cloud

Nuwan Goonasekera

Enis Afgan1,2, Derek Benson3, and Nuwan Goonasekera1

1VLSCI, University of Melbourne, Melbourne, Australia
2 CIR, RBI, Zagreb, Croatia
3 Research Computing Centre, University of Queensland

Slides, Video

To verify that the essential functions of a Galaxy instance are being provided correctly to the end-user, functional testing of typical Galaxy tasks is important. In addition, for groups that intend to deploy their own Galaxy instances (on the cloud or otherwise), it is also important to know the scalability characteristics of the instance with respect to the number of users, machine size, storage solution and cloud provider. By combining both functional and performance testing into one common testing infrastructure, we assessed both of these aspects with the same underlying test code.

For the first aspect, assessing whether the basic functions of Galaxy work correctly from an end-user perspective, functional testing was performed via the browser automation tool Selenium, which can mimic the exact actions of an end-user interacting with the application. We then extended these tests to use Selenium Grid, which converted the functional tests into a performance test by running them in parallel, thus simulating multiple concurrent users.
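The idea of reusing one functional test as a load test can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' test suite: `functional_test` and `run_as_load_test` are hypothetical names, the test body is a placeholder that sleeps instead of driving a browser, and a thread pool stands in for Selenium Grid.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def functional_test(user_id):
    """Stand-in for one end-user scenario (hypothetical).

    A real version would drive a browser through Selenium; here we only
    sleep briefly so the sketch runs without a browser or a Galaxy server.
    """
    start = time.perf_counter()
    time.sleep(0.01)  # placeholder for: log in, upload data, run a tool
    passed = True     # a real test would assert on the resulting page
    return {"user": user_id, "passed": passed,
            "seconds": time.perf_counter() - start}


def run_as_load_test(test, n_users):
    """Run the same functional test once per simulated concurrent user."""
    with ThreadPoolExecutor(max_workers=n_users) as pool:
        results = list(pool.map(test, range(n_users)))
    return {
        "users": n_users,
        "all_passed": all(r["passed"] for r in results),
        "mean_seconds": sum(r["seconds"] for r in results) / n_users,
        "max_seconds": max(r["seconds"] for r in results),
    }


if __name__ == "__main__":
    # Increase the number of simulated users and watch the timings.
    for n in (1, 4, 16):
        print(run_as_load_test(functional_test, n))
```

With a real browser-driven test in place of the placeholder, the same function reports both correctness (`all_passed`) and how response times change as the number of simulated users grows.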

This presentation will describe how these two aspects were used to determine the scalability characteristics of Galaxy on the cloud. Specifically, it will:

  • Describe how the same infrastructure is reused for testing the functional and scalability characteristics of Galaxy, using CloudMan;
  • Analyse how a number of variables, such as the number of users, machine size and storage option, affect scalability;
  • Provide insights into how Galaxy scales on the cloud, and what factors to consider when deploying on your own infrastructure;
  • Provide a reusable suite of tests for functionally verifying and benchmarking private GVL/Galaxy instances.

The data and results collected to reach these conclusions will be made publicly available and can serve as reference data points for others reusing the presented system on their own Galaxy instances.

Bioinformatics on AWS: New and Noteworthy Features

Angel Pizarro

Angel Pizarro1

1 Senior Solutions Architect, Amazon Web Services

In this talk, we will cover recent service and feature releases from Amazon Web Services, and how they apply to bioinformatics and scientific computing.