From Pipettes to Code: How I Participated in a Pangenome Initiative for Galaxy as a Molecular Medicine Student
My journey wrapping HAL, PAF, and TAF tools for the European Galaxy server.
With a background in molecular medicine, I was familiar with cellular processes, genetic diseases, and daily wet-lab routines. And while I already had a few years of general programming experience, the world of bioinformatics was relatively new to me. Now I had to handle larger genomic data structures, perform variant calling, and work with file formats such as FASTQ and BAM. My name is Niklas Mayle, I’m a Master’s student in Molecular Medicine at the Albert-Ludwigs University in Freiburg, and to learn more about bioinformatics, I decided to do an internship with the Galaxy Team in Freiburg.
My journey with the Galaxy Team started with an ongoing pangenome initiative in Galaxy, a concept I had never heard of before. I was tasked with integrating the tools for the Hierarchical Alignment Format, the Transposed Alignment Format, and the Pairwise Alignment Format into Galaxy. For me, that meant “wrapping” tools for the first time while learning how to use them and analyze data with them. Here is the story of how I turned multiple command-line tools into accessible, graphical tools for the scientific community.
The Bias of the Linear Reference Genome
Genomic analysis in bioinformatics traditionally relies on a single linear reference genome. This has enabled great progress, but it suffers from reference bias as a fundamental limitation (Ballouz et al. 2019). A single genome cannot represent the entire genetic diversity of a population. If we only ever map new sequences against this one reference, we miss structural variations, lose critical information, and distort our results (Matthews et al. 2024).
Therefore, the field of genomics is currently shifting towards multiple sequence alignments and pangenomics. This approach captures the natural genomic variation within a species far better (Liao et al. 2023). A pangenome ideally represents the entire genomic repertoire of a species. It incorporates both the “core” genome (sequences shared by all individuals) and the “accessory” genome (sequences present in only a subset of the population) (Matthews et al. 2024). By aligning multiple genomes simultaneously, researchers move from a flat, one-dimensional reference sequence to a comprehensive, interconnected representation of genetic variation. This counteracts reference bias, as every genome within the alignment can be treated equally.
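To make the core/accessory distinction concrete, here is a minimal toy sketch. It assumes each genome can be reduced to a set of sequence labels; the sample names and labels (`sample_A`, `g1`, and so on) are invented for illustration, and real pangenome tools work on alignments or graphs rather than on simple sets.

```python
# Toy example: each "genome" is reduced to a set of sequence labels.
# All names below are hypothetical and only serve the illustration.
genomes = {
    "sample_A": {"g1", "g2", "g3", "g4"},
    "sample_B": {"g1", "g2", "g3", "g5"},
    "sample_C": {"g1", "g2", "g6"},
}


def split_pangenome(genomes):
    """Return (core, accessory) for a dict of genome name -> set of labels."""
    sets = list(genomes.values())
    pangenome = set().union(*sets)    # everything seen in any individual
    core = set.intersection(*sets)    # shared by all individuals
    accessory = pangenome - core      # present in only a subset
    return core, accessory


core, accessory = split_pangenome(genomes)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```

In this toy data, only `g1` and `g2` occur in every sample, so they form the core, while everything else is accessory. A single linear reference built from `sample_A` alone would never contain `g5` or `g6`.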
However, representing, storing, and analyzing whole-genome multiple sequence alignments is computationally challenging with traditional multiple alignment formats, which are fragmented and often still indexed and ordered relative to a single reference genome (Hickey et al. 2013). To address these issues, new tools and file formats were developed, some of which I was tasked to “wrap” into Galaxy.
The Methodological Bottleneck
I know how intimidating a terminal window can be for researchers who just want to answer biological questions. Until now, the tools for the Hierarchical Alignment Format (HAL), the Transposed Alignment Format (TAF), and the Pairwise Alignment Format (PAF) have existed exclusively as command-line interfaces (CLIs). Using them requires IT knowledge and the ability to install and manage software dependencies. For researchers without this knowledge who are ready to adopt pangenomics and the new tools, the command line becomes a methodological bottleneck. This is exactly the kind of bottleneck Galaxy is designed to remove. As a researcher who is familiar with both worlds, I feel driven to help break down technical barriers and make tools more accessible to the scientific community.
“Tool Wrapping”
“Wrapping” a tool is basically about building a bridge. You take the CLI tool and write an XML file (the wrapper) that tells Galaxy exactly how this tool works: What input data does it need? What parameters can the user change? What does the final command running in the background look like? And what output data does it produce? In my case, all the tools were tool suites consisting of multiple subcommands, so this procedure had to be repeated for each subcommand, and each subcommand was “wrapped” as part of its tool suite.
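The four questions above map quite directly onto the main sections of a Galaxy wrapper. The following Python sketch builds a bare-bones wrapper skeleton to show that shape. It is an illustration under assumptions, not one of the actual wrappers: the tool id, the package name, the `example_tool summarize` command, and its `--output` flag are all hypothetical, and the `data` and `txt` formats are generic placeholders.

```python
import xml.etree.ElementTree as ET


def build_wrapper():
    """Build a minimal, hypothetical Galaxy tool wrapper as an XML tree."""
    tool = ET.Element("tool", id="example_summarize",
                      name="Example summarize", version="0.1.0")

    # Dependencies that Galaxy resolves, for example via Bioconda
    # (package name and version are placeholders).
    requirements = ET.SubElement(tool, "requirements")
    requirement = ET.SubElement(requirements, "requirement",
                                type="package", version="1.0")
    requirement.text = "example-package"

    # What does the final command running in the background look like?
    # (hypothetical CLI; $input_file and $summary are filled in by Galaxy)
    command = ET.SubElement(tool, "command", detect_errors="exit_code")
    command.text = "example_tool summarize '$input_file' --output '$summary'"

    # What input data does the tool need, and what can the user change?
    inputs = ET.SubElement(tool, "inputs")
    ET.SubElement(inputs, "param", name="input_file", type="data",
                  format="data", label="Alignment file")

    # What output data does it produce?
    outputs = ET.SubElement(tool, "outputs")
    ET.SubElement(outputs, "data", name="summary", format="txt")

    # User-facing documentation shown below the tool form.
    help_block = ET.SubElement(tool, "help")
    help_block.text = "Summarizes an alignment file (placeholder help text)."
    return tool


tool = build_wrapper()
if hasattr(ET, "indent"):  # pretty-printing helper, available in Python 3.9+
    ET.indent(tool)
wrapper_xml = ET.tostring(tool, encoding="unicode")
print(wrapper_xml)
```

In practice a wrapper is written by hand as an XML file rather than generated, and a real one also carries test cases and citations. For a tool suite, each subcommand gets its own file of this shape.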
My daily routine started to look very different from the one I knew from the lab bench:
- Installation & Resolving dependencies: I had to ensure the tool suites could be cleanly installed via Bioconda so that Galaxy and my local machine could run them in isolated environments. While HAL and PAF existed within the Bioconda “cactus” package and could be easily installed, TAF was missing. This meant a brief detour to learn how to write a Bioconda recipe for TAF, making it easily installable for Galaxy later on.
- Writing the XML wrapper: With the installations complete, I tackled the tool suites one by one, systematically wrapping and testing every single subcommand. I identified all relevant input and output parameters from the CLI and structured the execution commands logically. Using the Galaxy utility “Planemo”, I wrote the actual XML wrappers and embedded detailed instructions into the corresponding help blocks.
- Testing, testing & testing: To guarantee that the wrapped tools work reliably, I designed comprehensive test cases for each wrapper and tested them locally using Planemo. To thoroughly simulate real-world biological queries and edge cases, I utilized existing test files from the official repositories alongside minimal test datasets that I created myself.
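The local check loop from the last step can be sketched as a small helper. It relies on Planemo's `lint` and `test` subcommands; the wrapper path is a hypothetical placeholder, and the helper functions are illustrative names rather than part of Planemo itself.

```python
import shutil
import subprocess


def planemo_commands(wrapper_path):
    """Return the commands for a lint-then-test cycle on one wrapper."""
    return [
        ["planemo", "lint", wrapper_path],  # static checks on the XML
        ["planemo", "test", wrapper_path],  # runs the wrapper's test cases
    ]


def run_checks(wrapper_path):
    """Run lint and test in order; stop at the first failing step."""
    if shutil.which("planemo") is None:
        print("planemo not found on PATH, nothing was run")
        return False
    for cmd in planemo_commands(wrapper_path):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return True


# Hypothetical path; only print the commands here instead of running them.
for cmd in planemo_commands("tools/example/example_summarize.xml"):
    print(" ".join(cmd))
```

Calling `run_checks` with the path of a real wrapper would execute both steps. Linting first is cheap and catches structural mistakes before the slower test run starts.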
Upload to the Intergalactic Utilities Commission
When you develop tools for the global Galaxy ecosystem, they must meet quality standards. This is where the Intergalactic Utilities Commission (IUC) comes into play. Established in 2012, the IUC ensures that all wrapped tools published to the Galaxy Toolshed are functionally correct and optimized for local Galaxy instances. Once my local tests passed, I submitted the finalized tool suites as a pull request to the official IUC GitHub repository. For me, as a “first-time wrapper”, the feedback I received from the IUC gave me first-hand insight into the quality of my wrappers. Through their guidelines and the direct support from experienced developers, I improved my wrappers and learned more about software development best practices.
After months of coding, testing, and refining, it was finally done. All tool suites were successfully wrapped, approved by the IUC, and published to the Galaxy Toolshed. Following this approval, the complete tool suites were installed on the European Galaxy server. With HAL, PAF, and TAF now easily accessible, researchers can use them for complex genomic analyses or combine them with thousands of other Galaxy tools to create fully automated workflows. For me, this work marks the end of my internship. The successful integration of the HAL, TAF, and PAF tool suites into the European Galaxy server advances the much broader “Galaxy Pangenomic Framework”.
A Big Thank You
This project showed me that a gap in modern genomics lies not in a lack of bioinformatics theory, but in getting these complex tools into the hands of everyday researchers. By making the tools easily available, this work translates the benefits of pangenomics into a more accessible reality. And for me personally? It proved that making the leap from molecular medicine into the world of bioinformatics is possible, especially when you have a team as amazing as the Galaxy Team in Freiburg supporting you. The supervision I received was really great! Being surrounded by such a supportive and knowledgeable group of people made all the difference in my learning curve. I especially want to express my sincere thanks to Prof. Dr. Rolf Backofen for the opportunity to join his research group, to Dr. Björn Grüning for welcoming me so warmly into the Galaxy Team, and to Saim Momin for his dedicated mentorship and daily guidance. The entire experience was so rewarding that I’ve decided to stay for my Master’s thesis. I can’t wait to see what I build next!