- Galaxy Pipeline for Faster Whole Genome Genotype Calling on the GeneTitan Platform
- Integrating Galaxy with Globus Online: Lessons learned from the CVRG project
- Scalable data management and computable framework for large scale longitudinal studies
- Nebula - A Web-Server for Advanced ChIP-Seq Data Analysis
Abstracts for talks that were presented at the GCC2012 main sessions. See Training Day for training abstracts.
Session 1
State of the Galaxy
Anton Nekrutenko1 and James Taylor2
1Penn State University
2Emory University
Slides, Video
An overview of where the Galaxy Project is and where it is going.
Integration of S-MART, a toolbox to aid RNA-seq data analysis in Galaxy
Yufei Luo1, Matthias Zytnicki1, Olivier Inizan1, Delphine Steinbach1, Hadi Quesneville1
1 Unité de recherche en Génomique-Info, UR1164 INRA, Route de Saint Cyr, 78000, Versailles, France
{yufei.luo, matthias.zytnicki, olivier.inizan, delphine.steinbach, hadi.quesneville}@versailles.inra.fr
Slides, Video
URGI is an INRA bio-informatics unit dedicated to plants and pest genomics. We have developed a toolbox called S-MART, which handles mapped RNA-Seq data. S-MART is an intuitive tool, which performs many tasks usually required for the analysis of mapped RNA-Seq reads. S-MART does not require any computer science background and thus can be used by the biologist community through our Galaxy instance, which is considered as an "official Galaxy instance" by the Penn State Galaxy team, called URGI Galaxy (http://urgi.versailles.inra.fr/galaxy), in which we integrated our tools for biologist and bio-informatician users. S-MART work-flows may perform several entire analyses, from the mapped reads to the loci of interest, without any need for other ad hoc scripts. We are currently integrating these work-flows to our URGI Galaxy: (i) piRNAs (Piwi-interacting RNAs) clusters detection (ii) nucleotidic distribution of the 5' ends of the reads (iii) comparison of RNA-seq with tiling arrays using sliding windows (iv) differential expression when no reference annotation is given. We will present the S-MART Galaxy Tool Box with some of its workflows at the conference.
Connecting Galaxy to a Data Repository
Richard Park1,2, Nils Gehlenborg1, Psalm Haseley1, Ilya Sytchev3, Shannan Ho Sui3, Winston Hide3, Peter Park1
1 Center for Biomedical Informatics of Harvard Medical School
2 Boston University Bioinformatics Program
3 Center for Stem Cell Bioinformatics, Harvard Stem Cell Institute
Slides, Video
Integrating Galaxy with data repositories without a significant amount of manual intervention has remained a difficult task. By extending and adding to Galaxy’s core API functionality, we are able to demonstrate controlling and automating analyses purely through a 3rd party interface using the Harvard Stem Cell Institute's Stem Cell Commons project as a use case. Experimental metadata defined by ISA-TAB specifications are integrated to guide users to choose the proper workflows, parameters, and data sources to run and interpret analyses.
With the avalanche of data available in public and private repositories there is a growing need to use Galaxy in a fully automated and dynamic fashion. We enhanced Galaxy’s API in a number of ways. Galaxy can now import, delete, and download workflows as well as run workflows with variable tool parameters. Finally, we are able to generate and run dynamic workflows based on a variable number of input files selected from a data repository.
This ongoing project demonstrates a novel integration of Galaxy with experimental metadata and raw data available in biomedical data repositories. We directly benefit by using key Galaxy features such as cluster/Cloud deployment, the large selection of tools, and the workflow editor. We hope to provide the greater Galaxy community the utility of our API extensions as well as the novel possibilities of using Galaxy in a fully automated fashion. We hope to use this opportunity to gain feedback and learn better approaches from the Galaxy developer community.
Role of Galaxy in a bioinformatic plant breeding platform
Vincent Maillol1, Roberto Bacilieri1, Stéphanie Sidibe Bocs3, Jean-Michel Boursiquot1,2, Grégory Carrier2, Alexis Dereeper4, Gaétan Droc3, Cécile Fleury3, Pierre Larmande4, Loïc Lecunff2, Jean-Pierre Péros1 Bertrand Pitollat3, Manuel Ruiz3, Gautier Sarah1, Guilhem Sempéré3, Marilyne Summo3, Patrice This1, and Jean-Francois Dufayard3
1 INRA – Montpellier SupAgro, UMR 1334 AGAP, DAVEM team, 2 Place P. Viala, 34060 Montpellier, France.
2 Institut Français de la Vigne et du Vin - Unité Mixte Technologique Géno-Vigne, 2 Place P. Viala, 34060 Montpellier, France.
3 CIRAD - UMR 1334 AGAP, ID team, avenue Agropolis, 34398 Montpellier Cedex 5, France.
4 IRD - 911 avenue Agropolis, 34394 Montpellier, France.
Slides, Video
With NGS development, bioinformatics has become central in plant breeding laboratories, and researchers are in need of some autonomy in its use. The Southgreen platform (CIRAD, IRD, INRA) performs bioinformatics analyses for many plant breeding research teams in Montpellier (France), and offers many systems to users: for example GNPAnnot (automatic genomic sequence annotation), Greenphyl (phylogenetic orthology prediction), ESTtik (EST annotation) or the Bacchus analysis pipeline. Most of these systems have been translated into Galaxy workflows.
As for the Bacchus pipeline, it has been created at INRA Montpellier (France) to investigate clonal diversity in grapevine genomes. For this task, many softwares have been wrapped in the Galaxy framework. Bacchus can be decomposed in three steps: i) Genome reconstruction, ii) test of reconstruction results, and iii) diversity analysis. This last step is done using SNP's and structural variations. To detect SNP's, the latest Freebayes version was used, while the IDfixe software was developped for structural variation detection. Some of the softwares developped for this pipeline are now used in the international project Grapereseq.
Today, the Galaxy framework is widely used by Southgreen plateform users as an alternative to the command line system. In this context, dozens of users have already been trained in Galaxy-using bioinformatics. During weekly collective pair-programming sessions, platform engineers and interested scientists integrate new tools and functionalities.Thus, Galaxy is now a core component of the plant breeding community around the Southgreen platform, and the main access portal for non- bioinformatics specialists to our computing clusters.
Session 2
Galaxy Pipeline for Faster Whole Genome Genotype Calling on the GeneTitan Platform
Oleksiy Karpenko1 and Neil J. Bahroos1
1 University of Illinois at Chicago (UIC), Research Resources Center, Center for Research Informatics, Center for Clinical and Translational Science
Latest genotyping solutions allow for rapid testing of more than two million markers in one experiment. Fully automated instruments such as Affymetrix GeneTitan enable processing of large numbers of samples in a truly high-throughput manner. In concert with solutions like Axiom, fully customizable array plates can now utilize automated workflows that can leverage multi-channel instrumentation like the GeneTitan.
With the growing size of raw data output, the serial computational architecture of the software, typically distributed by the vendors on turnkey desktop solutions for quality control and genotype calling, becomes legacy rather than an advantage. More advanced software and techniques provide more power and flexibility in options and can be deployed in an HPC environment, but become technically inconvenient for biologists to use.
Here we present a pipeline that uses Galaxy as an interface to provide the mechanism to lower the barrier to more complicated and native software for the instrument in a high throughput manner. We will also report the results of processing and genotyping of large samples of African-American population with Affymetrix PanAFR arrays.
Integrating Galaxy with Globus Online: Lessons learned from the CVRG project
Bo Liu1 (boliu@uchicago.edu), Ravi Madduri1 (madduri@mcs.anl.gov)
1 Computation Institute, University of Chicago and Argonne National Laboratory
Slides, Video
Globus Online (GO) is a hosted service that uses powerful grid transfer capabilities to automate the tasks of moving large quantities of data in a secure, efficient and fast way. Integrating Galaxy with Globus Online addresses the challenges in transferring large-scale datasets in and out of Galaxy quickly and reliably. In CVRG (CardioVascular Research Grid) project, we have extended Galaxy with "Globus Online" tools for data transfer, "CRData" tools for executing R scripts, "Picard/GATK via Condor" tools for running Picard and GATK tools at Condor nodes in parallel, etc. The thorough integration of Galaxy and GO services, including GO-transfer, GO-storage and GO-collaborate, will accelerate the development of Galaxy and make it more suitable for complicated bioinformatics pipelines. For example, GO-storage provides large capacity data storage that can be accessed within Galaxy. Galaxy could use GO-collaborate for user authentication, group management and task collaboration, and then GO users could access Galaxy without register and easily share Galaxy history/workflow/dataset with GO users or groups. The distributed computing capabilities of Globus also make the execution of Galaxy jobs faster and more efficiently.
Scalable data management and computable framework for large scale longitudinal studies
Gianmauro Cuccuru1, Simone Leo1, Luca Lianas1, Josh Moore4, Maristella Pitzalis2, Serena Sanna3, Ilenia Zara1, Jason Swedlow4, Gianluigi Zanetti1
1CRS4, Pula, Sardegna, Italy
2 Dipartimento di Scienze Biomediche, Università di Sassari, Sassari, Sardegna, Italy
3 Istituto di Ricerca Genetica e Biomedica (IRGB) del CNR, Monserrato, Sardegna, Italy
4 Wellcome Trust Centre for Gene Regulation and Expression, College of Life Sciences, University of Dundee, Dundee, Scotland, UK
Slides, Video
We have implemented a platform for the analysis of two large scale longitudinal studies (~10,000 individuals) on autoimmune diseases and longevity conducted in Sardinia. We use GALAXY to provide a convenient and user-friendly interface, that allows to access data import, analysis and sharing of results.
The platform is designed to represent, using a uniform computational formalism, information on all relevant objects (e.g., physical samples, experimental and derived data, clinical data) and the network of actions, performed during the experiments and the computational analysis, that relate one to the other. The system supports type introspection on all of its objects and follows OpenEHR, an open standard that describes the management, storage, retrieval and exchange of health data in electronic health records. The latter choice guarantees a robust, computable, uniform and implementation independent description of the clinical data. The results of computation, e.g., genotype calling data, is held in specialized data structures that directly support further parallel processing and analysis.
The platform is built upon the core services of OME Remote Objects (OMERO) and GALAXY. OMERO is an open source software platform that includes a number of storage mechanisms, remoting middleware, an API, and client applications for biological data management developed by the Open Microscopy Environment. The GALAXY front-end exposes a rich set of functionalities including suites of map-reduce programs for GWAS and sequencing applications, as well as basic chain-of-custody inspection tools and tools for biological and clinical data import.
The platform is used in production by the IRGB/CNR since 2011.
Nebula - A Web-Server for Advanced ChIP-Seq Data Analysis
Valentina Boeva1,2,3 (Valentina.Boeva@curie.fr), Alban LERMINE1,2,3 (Alban.Lermine@curie.fr), Camille BARETTE1 (Camille.Barette@curie.fr), Emmanuel BARILLOT1,2,3 (Emmanuel.Barillot@curie.fr)
1 Institut Curie
2 INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris
3 Mines ParisTech, Fontainebleau, France
Keywords: ChIP-seq, Galaxy, peaks, motifs, genome feature association.
We present a web service, Nebula, which allows biologists to perform by them selves complete analysis of ChIP-seq data. ChIP-seq is chromatin immunoprecipitation followed by sequencing of the extracted DNA fragments. This technique allows accurate characterization of the binding sites of transcription factors and other DNA-associated proteins.
Many existing tools for ChIP-seq data analysis are difficult to use by nonbioinformaticians. These tools map sequenced reads to the reference genome or predict binding site locations (ChIP-seq peaks). Several tools exist for peak filtering, motif discovery and genome feature association. Such tools are often command line applications or R packages.
Our web service, Nebula, was designed for biologists. It is based on the Galaxy open source framework. Galaxy already includes a large number of functionalities for mapping reads and peak calling. We added the following to Galaxy: (1) peak calling with FindPeaks and a module for immunoprecipitation quality control, (2) de novo motif discovery with ChIPmunk, (3) calculation of the density and the cumulative distribution of peak locations around gene TSSs, (4) annotation of peaks with genomic features, and (5) annotation of genes with peak information. Nebula generates the graphs and the enrichment statistics at each step of the process. During steps 3 to 5, Nebula optionally repeats the analysis on a control dataset and compares these results with those from the main dataset. Nebula can also incorporate gene expression (or gene modulation) data during these steps. In summary, Nebula is an innovative web service that provides an advanced ChIP-seq analysis pipeline, the output of which is directly publishable.
Additional information: Nebula accepts mapped reads in SAM/BAM format. Each step of the pipeline produces several output files, which are mainly tabdelimited text files, .BED files or images. We used Perl and R to develop the tools used to perform the steps 3 to 5. The pipeline also includes several published tools (samtools, bedTools, MACS, FindPeaks, ChIPmunk).
Lunch: Day 1
Implications of a Galaxy Community Cloud in Clinical Genomics
1 Chief Technical Officer (CTO), Life Sciences, EMC Isilon
The intersection of the established Galaxy Research community with the requirements from the growing Clinical Genomics community for higher throughput results and clinically actionable variants requires increased security and audit along with GxP and HIPAA compliance. We will present these issues for Galaxy storage in both a Clinical Genomics and Sequencing as a Service context within the Private Cloud.
Session 3
Establishing a National Genomics Virtual Laboratory with Galaxy CloudMan
Enis Afgan1,2
1Ruđer Bošković Institute (RBI)
2Victoria Life Sciences Computation Initiative (VLSCI), University of Melbourne
Slides, Video
An increasing number of small groups do not have simple access to needed bioinformatics tools and data resources. The tools are complicated to install and customize, require dedicated compute resources and data stores, and typically involve a high level of ongoing maintenance to keep the software, data and hardware current, which in turn requires significant expertise in software development, system administration and hardware and networking, as well as access to hardware resources and data-centers. Galaxy CloudMan addresses many of the issues dealing with the initial provisioning of a configured set of tools and data that have been integrated with Galaxy, thus facilitating the access to accessible and private bioinformatics platform.
Through the context of CloudMan, this talk will focus on the components required to establish a national collaborative initiative that aims at connecting genome researchers with massive datasets, sophisticated analysis and visualization tools, and large-scale computational and storage infrastructure. The initiative is known as the Genomics Virtual Laboratory; it is designed to scale to multiple locations and arbitrary cluster sizes as well as be supported by comprehensive training courses, outreach programs, and end-user support.
Note: If you want hands-on experience with CloudMan, you are encouraged to attend the Training Day session on CloudMan.
Keeping Track of Life Science Data
Björn Grüning1
1Pharmaceutical Bioinformatics, Institute of Pharmaceutical Sciences, Albert-Ludwigs-University Freiburg
- Video*
Life science data constantly changes and it is challenging to keep local data repositories up to date, because many data providers do not offer a strategy to the user for keeping local data synchronized. A powerful and easy-to-use data distribution system is still missing. The lowest common denominator for data exchange is the File Transfer Protocol. Although it is approved and reliable, it does not offer options to keep track of data revisions. Incremental updates are hardly possible or not possible at all. Up-to-date data are vital to researchers, however, updating might alter downstream calculation results, annotations or predictions. Using Galaxy, it is possible to reproduce the processing of static datasets very comfortably, but reproducibility is difficult to maintain, if external databases are integrated.
We will present a proof-of-concept to process life science data (e.g. sequence or compound libraries) in a Version Control System, typically used to store, share, and control software repositories. The talk will highlight the benefits of such a system for consumers and producers of life science data, initial experiences, and possible pitfalls. Moreover, we will present a prototypical implementation for the Galaxy framework, that utilize revision-controlled data and enables reproducibility, even if source data change frequently.
Easier Workflows & Tool Comparison with Oqtans+
Sebastian J. Schultheiss1, Géraldine Jean1,2, Vipin T. Sreedharan1,3, André Kahles1,3, Regina Bohnert1, Philipp Drewe1,3, Pramod Mudrakarta1, Nico Görnitz4, Georg Zeller1,5, Gunnar Rätsch1,3
1 Machine Learning in Biology Group, Friedrich Miescher Laboratory, Tübingen, Germany
2 LINA, Combinatorics and Bioinformatics Group, University of Nantes, Nantes, France
3 Computational Biology Center, Memorial Sloan-Kettering Cancer Center, New York, USA
4 Machine Learning/Intelligent Data Analysis Group, Technical University Berlin, Berlin, Germany
5 Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
With the latest improvements in next-generation sequencing technologies, data analysis software is constantly updated to obtain the best-possible results in a reasonable time period. However, the sheer number of different software programs available for the same task can be overwhelming, and it is difficult for researchers to determine which ones to use for their experimental set up.
Here, we present Oqtans+, an improved open-source workbench integrated in the Galaxy framework that enables researchers to perform comparative quantitative transcriptome analysis, in part thanks to the Galaxy NGS toolbox. In addition to the NGS toolbox, we provide tool wrappers for the following tools: PALMapper, mTIM, Trinity, rQuant, rDiff, DESeq, Genesetter, GOrilla, SAFT, KIRMES, Shogun, GFF Tools, and others. The distinguishing features of Oqtans+ include a modular pipeline architecture, which facilitates comparative assessment of tool and data quality: Since Oqtans+ contains several tools that can in principle be applied to the same data, it is straightforward to compare the performance of different programs and parameter settings on the same data and choose the best suited for the task. Oqtans+ also contains programs, which are well-suited for the evaluation of RNA-seq read alignment accuracy, in particular when dealing with read alignment filtering and optimal alignment of multiple mapped reads.
Moreover, Oqtans+ provides sophisticated machine learning-powered tools that are shown to perform better or as well as the state-of-the-art for short-read alignments, transcript identification/quantification, and differential expression analysis. Finally, Oqtans+ sets a new standard in terms of reproducibility, building on Galaxy’s features that greatly facilitate persistent storage, exchange, and documentation of intermediate results and analysis workflows. We show how to use Oqtans+ with two easy-to-understand workflow examples and real-world data.
Oqtans+ is available for download (GPL, free for non- commercial use), as a machine image for cloud environments, and at our server via the persistent web address bioweb.me/oqtans.
Contact: support@oqtans.org; ratschg@mskcc.org
CloudMap: A Cloud-based Pipeline for Analysis of Mutant Genome Sequences
Gregory Minevich1, Danny Park1, Richard J. Poole1, Daniel Blankenberg2, Anton Nekrutenko2 and Oliver Hobert1
1 Department of Biochemistry and Molecular Biophysics, Howard Hughes Medical Institute, Columbia University Medical Center, New York, NY, USA
2 Center for Comparative Genomics and Bioinformatics, Penn State University, University Park, PA, USA
Slides, Video
Whole genome sequencing (WGS) is the fastest and most cost effective way to map causal mutations in model organisms such as C. elegans. Our lab has previously developed single step SNP mapping strategies coupled with whole genome sequencing (Doitsidou et al. 2010) as well as software analysis tools for mutant genome sequence analysis (MAQGene, Bigelow et al. 2009). In an effort to take advantage of the cloud and many freely available open source tools, we've adapted our mutant genome sequence analysis pipeline to run on Galaxy. Our pipeline uses custom Python scripts to provide greatly improved mutant mapping tools and relies on the NGS Toolbox in Galaxy, GATK Tools, and snpEff. In addition to allowing for pinpoint mapping of causal mutations in C. elegans using any mapping strain, we also support similar mapping strategies for other model organisms that can be crossed to mapping strains. An alternate mapping strategy whereby mutants are backcrossed to their starting strain (Zuryn, et al. 2010) is also supported. The CloudMap pipeline provides a set of best practices for mapping causal mutations and also facilitates the cataloguing and sharing of WGS variant data among model organism communities that use the tool.
Correspondence to [gm2123 AT columbia DOT edu](mailto:gm2123 AT columbia DOT edu) (G.M.) or [or38 AT columbia DOT edu](mailto:or38 AT columbia DOT edu) (O.H.)
Session 4
GPS: a real-time recommendation system for ChIP-Seq analysis
Hanfei Sun11Dana-Farber Cancer Institute and Harvard School of Public Health
Slides, Video
With the numerous projects focusing on gene regulation and epigenetic mechanisms such as ENCODE, a huge amount of ChIP-seq data has been produced and stored in various databases, such as SRA and UCSC. However, as the size of database grows, it become more difficult for users to retrieve potentially helpful data and analysis the possible relationship within such large datasets.
For many genome databases, keyword search method is widely used for their retrieval. Hence, when a user searches for a dataset with a keyword, he/she can only retrieve the datasets whose meta-data contains this keyword. If a dataset doesn’t contain this keyword in its meta-data, users can’t find it by keyword searching, though it may have most characteristics of the datasets under this keyword.
We proposed a real-time recommendation system that recommends proper datasets that has the similar features of the dataset being viewed by user, which is called GPS (GPS is an acronym for “GPS for Potential Similariry”). Then we make a prototype where the technique is applied with more than 3000 public datasets. Advantages of our system are as follows.
- It provide users a new search method beyond keyword searching.
- It discovers the potential relationship within the large datasets automatically and displays it in a real-time way.
- It may lead to creative thinking support for researchers by showing datasets related to the datasets being viewed.
Integration of Taverna workflows on a Galaxy-based platform for large-scale genomics analysis
Huayan Gao1,2, Peter Li3,4, Tam Sneddon3,4, Dennis Chan3, Alexandra Basford3,4, Scott Edmunds3,4, Alex Wong3, Wai-Yee Chan1,2, Zhang Yong4, Tin-Lap Lee1,2
1 School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China.
2 CUHK-BGI Innovation Institute of Trans-omics, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China.
3 BGI-Hong Kong Ltd., 16 Dai Fu Street, Tai Po Industrial Estate, NT, Hong Kong SAR, China.
4 BGI-ShenZhen, Bei Shan Industrial Zone, Yantian Distrcit, Shenzhen, China.
Slides, Video
The big data derived from next generation sequencing experiments makes efficient data mining strategies indispensible. Despite the plummeting costs of sequencing, the downstream processes create financial and bioinformatics challenges for many biomedical scientists. To alleviate this major stumbling block, we have established a Galaxy-based platform (CBIIT-Galaxy) and UCSC genome browser mirror for fast and efficient genomic data analysis. We have also implemented Taverna workflows to enable additional common Next-gen sequencing analyses in a simplified workflow format. To allow fast access and interpretation of popular Next-gen sequencing datasets, we are collaborating with the online open-access open-data journal Gigascience. This is a novel publication platform that combines scientific publications and datasets, utilizing CBIIT-Galaxy to aid reproducibility, review and use of data. Its database (GigaDB) contains popular genome datasets related to various species and disease models from the BGI as well as submitting authors. Instead of hours of data transfer, CBIIT-Galaxy server allows researchers to easily locate data from GigaDB and directly import to CBIIT-Galaxy for further data analysis. Preliminary analysis on common functions in CBIIT-Galaxy showed significant performance improvement compared to public Galaxy server. We plan to link global data networks such as GLORIAD (Global Ring Network for Advanced Application Development) to our platform to further improve the network traffic capacity. We will also implement customized Taverna or Myexperiment workflows into CBIIT-Galaxy for public access. Taken together, the CBIIT-Galaxy will serve as an important Galaxy portal for biomedical scientist in Asia and around the globe.
NGS analysis for Biologists: experiences from the cloud
Mohammad Heydarian1, Barbara Sollner-Webb1, and Karen Reddy1
1 Department of Biological Chemistry & Center for Epigenetics, Johns Hopkins University.
Slides, Video
As the cost of next generation sequencing decreases and the accessibility increases many labs will be in a position to perform NGS experiments, at one point limited to the mega-lab. With NGS data comes the necessity for adequate analysis capabilities to make sense of millions to billions of short sequence reads. These large data sets require high performance computing power for their analysis, generally not found in office computers that service the standard science lab. The Galaxy Project has developed an instance of its powerful and user friendly environment to be used with Amazon Web Services to allow researchers to perform NGS analysis on the cloud. This is a very convenient option for researchers who are in need of high computational power for a limited number of experiments (time) and may not have access to the necessary computational infrastructure. Here we discuss our experiences as cellular and molecular biologists with no computational/programming background performing RNA-seq analysis using the Galaxy CloudMan. We aim to create resource that can help the non-computational biologist establish cloud space and perform analyses with minimal programming/coding, to maximize the efficiency of the biologist and to allow her/him to focus on the biology of the experiment. This work will hopefully contribute to the dialogue between biologists and developers to maximize the efficiency of both parties.
The Galaxy Visualization Framework
Slides, Video
The Galaxy team is building a framework for Web-based visualization of next-generation sequencing (NGS) data. This framework supports fast random access for most common NGS data formats and HTML5-based libraries for building different types of visualizations such as track browsers, circos plots, and phylogenetic trees. This talk will provide an overview of The Galaxy Visualization Framework and highlight some visualizations produced using the framework.
Evening: Day 1
Ion Torrent: Open, Accessible, Enabling
1 Director of Bioinformatics and Software Products, Ion Torrent
Ion Torrent has pioneered an entirely new approach to sequencing that enables a direct connection between chemical and digital information and leverage decades of semiconductor technology advances. The result is the first commercial sequencing technology that does not use light, and as a result delivers unprecedented speed, scalability, accuracy, and low cost. In just the first year the Ion Torrent Personal Genome Machine (TM) has become the fastest selling sequencing platform. The throughput scaled 100X, from 10Mb to 1Gb, in just the first year and will scale another 100X in the next year with the new Proton (TM) sequencer, which will enable the single day $1000 human genome. Automated data analysis is driven by Torrent Suite, an open-source software suite that provides a simple and intuitive interface to streamline data analysis and provide results in minutes to hours, not days. Built on top of Torrent Suite is a flexible SDK that allows users to expand the analysis capabilities through the development and utilization of plugins and APIs.
Session 5
High level distributed processing pipelines with Galaxy
Brad Chapman1 (bchapman@hsph.harvard.edu), Shannan Ho Sui1, Enis Afgan3, Ilya Sytchev2, Jason Evans1, Oliver Hofmann1, Winston Hide1
1Harvard School of Public Health
2 Center for Stem Cell Bioinformatics, Harvard Stem Cell Institute
3Ruđer Bošković Institute (RBI)
Slides, Video
We will discuss current work at Harvard School of Public Health to develop custom Galaxy interfaces for distributed processing pipelines. Our goal is to complement existing Galaxy tool functionality with best practice approaches for variant calling, RNA-seq and CHiP-seq analysis. We provide a user-friendly wizard interface that calls a remote back end server for running fully distributed jobs and cleaning up intermediate files. Users initiate processing from within Galaxy and receive results uploaded directly into Galaxy Data Libraries.
We have made this processing approach available via an up-to-date fork of Galaxy central and utilize it in a number of current projects. The pipelines integrate into the Stem Cell Discovery Engine, in an ongoing collaboration with Richard Park and Nils Gehlenborg in the Park Lab.
Additionally, a ready-to-run analysis environment is available on Amazon using CloudBioLinux, CloudMan and BioCloudCentral. This infrastructure enables us to provide a fully push-button approach, from provisioning machines to viewing results in Galaxy. BioCloudCentral was specifically designed to improve the experience of users transitioning to cloud resources by automating initial manual setup steps.
This work highlights the value of working within the Galaxy environment and the power of a connected user community. As an in-progress project, we hope to use this opportunity to discuss approaches for handling big, distributed, high-level processing tasks.
Proteomics tools for Galaxy
Ira Cooke1
1 Life Sciences Computation Centre, Department of Biochemistry, La Trobe University
The Mass Spectrometry group at La Trobe University has been developing a suite of tools for running proteomics analyses and visualizing proteomics data in Galaxy. Our tools are tightly integrated with galaxy, including custom data types and an external display application that allows users to interactively view and search identified proteins as well as quickly navigate between those identifications and the raw spectra on which they were based. The tools are available on the Galaxy toolshed and all other software components are available as open source software.
This talk gives a brief overview of our tools and outlines applications that emphasize the utility of making Proteomics tools available within the Galaxy ecosystem (of largely Genomics tools). The talk also touches on some of the technical challenges we faced, especially in dealing with tools that are highly interdependent, and which spread data across multiple files.
Using Galaxy for Molecular Assay Design
James Ireland1, Andrew Evans1, William FitzHugh1
1 5AM Solutions
Molecular assay design is a staple of the bioinformatics trade. Although the types of molecular assays varies widely from PCR primers for resequencing, to TaqMan probes for gene expression assays to molecular inversion probes for genotyping, the overall workflow remains largely the same. A typical workflow starts with target identification, sequence retrieval, candidate probe/primer design, homology checks for non-specific hybridization and finally identifying potential adverse folding or oligo interactions. Galaxy is a natural platform for assay design because of its (1) natural and intuitive workflow support, (2) preservation of design history, (3) existing sequence manipulation functionalities and (4) ability to easily add in new applications and functionality. In this presentation, we discuss how 5AM Solutions uses Galaxy as a platform for custom assay design that includes integration of Primer3, e-PCR and UNAfold tools.
The National Center for Genome Analysis Support and Galaxy
Richard LeDuc1
1National Center for Genome Analysis Support, Indiana University
The National Center for Genome Analysis Support is an NSF funded resource designed to supply bioinformatic support and computational infrastructure to genomics projects requiring large RAM computational resources – specifically de novo sequence assembly. We are developing a national scale infrastructure to support biological researchers that will use Galaxy to interface between the biologists and the cyberinfrastructure. The ability of NCGAS to assist with bioinformatic software optimization was recently demonstrated by our assistance in optimizing the runtime performance of Trinity resulting in over 3x speed improvement.
Session 6
Integration of SeqWare within Galaxy
Zhibin Lu1, Morgan Taschuk1, Brian O’Connor1 and B.F. Francis Ouellette1,2
1 Ontario Institute for Cancer Research, Informatics and Bio-computing platform, Toronto, Ontario, Canada
2 Department of Cell and Systems Biology, University of Toronto, Toronto, Ontario, Canada
Slides, Video
SeqWare, developed at UCLA, UNC, and OICR, is an open source project to create a tool set to work with next generation sequencers. It includes a LIMS, Pipeline, and Query Engine. The production group at Ontario Institute for Cancer Research (OICR) uses this package to control its workflows, perform analysis and manage NGS data. SeqWare is able to trigger and monitor workflows via web services, and this has made it possible to integrate it with other tools like Galaxy. This helps our biology users to use workflows generated by our sequencing production pipeline and helps OICR’s production group continue downstream analysis within galaxy as well, leveraging the strength of multiple approaches. We will show the integration and architecture of the galaxy instance and SeqWare installed at the OICR Bioinformatics Core Facility, showing the benefit of the integration and further development.
Window2Galaxy – Enabling Linux-Windows Hybrid Workflows
Liram Vardi1 and Amir Ben-Dor1
1 Agilent Labs
Galaxy holds the promise of incorporating diverse command-line tools into reusable workflows. However, a big limitation of a typical Galaxy installation is the strict requirement that all tools be run from a Linux shell. As a result, many external tools that are pre-compiled for windows cannot be easily incorporated into a galaxy workflow. While a windows simulator, such as Wine, can provide a partial solution, its installation is not trivial for some Linux distributions, and, moreover, it does not provide full windows compatibility.
We present a Galaxy extension, Window2Galaxy, that acts as a middle-man between Linux and Windows, enabling Galaxy developers to incorporate Windows command-line tools into a standard Linux-based galaxy workflow. Our tool consists of two parts: A Linux client and a windows web-service. The web-service is hosted on a web-server (which can be run on either an external windows machine or on a local windows virtual machine) is responsible for executing the command line. A Linux client is responsible for copying input files from galaxy to a shared directory; send “execute” request to the windows service and finally, copying output files back to Galaxy repository.
With this extension, adding a windows-based tool to galaxy is straight forward – adding "Window2Galaxy" before the windows-command in the xml configuration file.
From end-users perspective, this extension is completely transparent – workflows can be constructed from various tools, independent of whether those tools are Linux or Windows based.
In the talk, we will present the architecture and provide example use cases.
NBIC Galaxy to Strengthen the Bioinformatics Community in the Netherlands
Hailiang Mei1, David van Enckevort1, Mattias de Hollander2, Jeroen F. J. Laros3, Marc van Driel1, Rob Hooft1
1 Netherlands Bioinformatics Centre
2 The Netherlands Institute of Ecology
3 Center for Human and Clinical Genetics, Leiden University Medical Center
The Netherlands Bioinformatics Centre (NBIC) plays a central coordinating role in several new Galaxy related developments that will further strengthen the bioinformatics community in the Netherlands.
- NBICGalaxy@HPCcloud
The NBIC Galaxy server was originally built as a demonstration server for bioinformatics tools made by NBIC developers. However, the need for processing complete research datasets was clearly visible from the start. A newly installed cloud computing system (HPCcloud) by BigGrid and SARA allows the NBIC Galaxy server to be used for this purpose. A planned fiber network connecting this HPCcloud to several key research institutes in the Netherlands will further help. - CTMM TraIT
From the end of 2011, NBIC has become a partner of the CTMM TraIT project where bioinformatics solutions are being built to process data collected from cancer and cardiovascular disease research projects. Galaxy is considered as a major candidate. We are now working on a pilot project where an existing cancer causing genomic variant detection tool is being connected to a Galaxy backend via the Galaxy API. The aim is to keep using a user-friendly and familiar interface for biologists while taking advantage of the latest sequencing data analysis programs installed in Galaxy. - Education
One main mission of NBIC is to provide education and training to students and researchers. We have successfully used the NBIC Galaxy server in several practical courses. After a short introduction about the Galaxy interface, most attendees are able to start using tools they have never used before and perform data analysis tasks they just learned. The NBIC Galaxy server has demonstrated the potential to be a good education platform for future bioinformaticians.
GenomeSpace
Ted Liefeld1
1 Broad Institute
GenomeSpace is a software environment that provides a connection layer between bioinformatics resources, whether they are Web-based applications, desktop packages, or simple scripts. GenomeSpace addresses the growing need for genomics researchers and bioinformaticians to have “frictionless” data transfer among the variety of analysis tools and data sources. GenomeSpace provides an open environment, which other bioinformatics resources can use to join the community ofGenomeSpace-enabled tools. GenomeSpace is seeded by six prominent tools for genomics analysis: Galaxy, Cytoscape, GenePattern, Genomica, the Integrative Genomics Viewer (IGV), and the UCSC Genome Browser, and developed in collaboration with several biological research projects at the Broad Institute, Stanford University, and UCSD.
Session 7
Tool Shed and Changes to Galaxy Distributions
Galaxy’s ease of use and rich feature set have made it a powerful enabler of biological research, and the recent introduction of the Galaxy tool shed has significantly enhanced this process. The Galaxy Tool Shed enables sharing of Galaxy tools, proprietary datatypes, exported Galaxy workflows, and data across the research community with ease. Tools can be automatically discovered and installed into a local Galaxy environment in real time, and they can easily be deactivated or uninstalled when they are no longer needed. The tool shed also provides the ability to simultaneously install different versions of the same tool into a Galaxy environment, enabling reproducibility and more complex analyses.
Big changes have been going on in how tools are packaged with the distribution. This talk will focus on what's changed, and what you need to know about the Galaxy Tool Shed to deploy your own Galaxy instance.
Note: If you want hands-on experience with the Galaxy Tool Shed, you are encouraged to attend the Tool Shed Training Day session.
The [deadline for abstracts was April 16](/events/gcc2012/Key Dates/).
Oral presentations will be approximately 15-20 minutes long, including time for question and answer. There will also be an opportunity for lightning talks, which will be solicited at the meeting.
- Make any slides freely available on this web site, no later than August 15, 2012.
- Have your talk be videotaped and have that videotape be publicly accessible on the web.
(Note: We may or may not have sufficient funds to record talks.)