GPU-enabled JupyterLab enables interactive AI use-cases in Galaxy
Democratising GPU infrastructures by providing accelerated JupyterLab instances via Galaxy and combining them with HPC workflows
Introduction
Artificial intelligence (AI) algorithms are being applied increasingly in many fields of bioinformatics, such as protein 3D structure and drug-response prediction, imputation of missing data in single-cell gene expression, segmentation of biomedical images, and many more. AI algorithms that train on large amounts of scientific data require a powerful compute infrastructure consisting of several CPUs, GPUs and large storage. JupyterLab provides an excellent framework for developing AI programs, but it needs to be hosted on such a powerful infrastructure. To bridge this gap, an open-source, Docker-based, and GPU-enabled JupyterLab notebook infrastructure has been developed that runs on the public compute infrastructure of Galaxy for rapid prototyping and developing end-to-end AI projects. Using such an infrastructure, long-running AI model training programs can be executed remotely. Trained models, represented in the standard Open Neural Network Exchange (ONNX) format, and other resulting datasets are created in a Galaxy history. Other features include GPU support for faster training, support for machine learning packages such as TensorFlow and Scikit-learn, Git integration for version control, the option of creating and executing pipelines of notebooks, multiple dashboards for monitoring compute resources, and visualisations using Bokeh, Seaborn, Matplotlib, bqplot, and Voila. In addition, the JupyterLab tool can also be used as a regular Galaxy tool in a workflow. These features make the JupyterLab notebook highly suitable for creating and managing AI projects.
Implementation
A Docker container is created that installs several packages, such as JupyterLab, TensorFlow, Scikit-learn, Pandas,
Bokeh, Elyra AI, Seaborn, ONNX, Git, a GPU dashboard, ColabFold, JAX and many others, for machine learning and data science projects.
The container inherits from an official NVIDIA CUDA base image that contains the CUDA packages GPUs need
to work with TensorFlow, and then installs the above-mentioned packages. The Docker container is downloaded by a Galaxy interactive tool
to make it available on Galaxy. Having a Docker container running in the backend provides many security benefits, as it interacts
only minimally with the remote computer's operating system. In addition, a non-root user inside the container provides further protection.
These benefits are important because users can execute arbitrary code in JupyterLab notebooks. The Docker container can also
be downloaded separately from Docker Hub and used on any laptop or personal computer (with at least 20 GB of free disk space)
or on any other compute infrastructure.
If NVIDIA GPUs are available, the Docker container will automatically recognise them; otherwise, it will run on CPUs.
The scripts to run the container are provided in the GitHub repository. Inside JupyterLab notebooks, additional packages can be installed for different scientific analyses, and development environments can be created using conda or mamba.
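As an illustration of the GPU auto-detection described above, the sketch below probes for the NVIDIA driver tooling (`nvidia-smi`) and falls back to CPUs when none is found. The helper name is hypothetical; the container's actual startup scripts may detect GPUs differently.

```python
import shutil
import subprocess

def nvidia_gpu_available() -> bool:
    """Return True if NVIDIA driver tooling is present and reports a GPU."""
    # nvidia-smi ships with the NVIDIA driver; its absence means no usable GPU.
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        # Listing devices exits non-zero when the driver finds no GPU.
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout

if __name__ == "__main__":
    backend = "GPU" if nvidia_gpu_available() else "CPU"
    print(f"Running on: {backend}")
```

The same check works both on Galaxy's compute nodes and on a personal machine running the container.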
Use-cases
A recent scientific publication that predicts infected regions in COVID-19 CT scan images has been reproduced using multiple features of JupyterLab. In addition, ColabFold, a faster implementation of AlphaFold2, can be accessed in this notebook to predict the 3D structure of protein sequences. The JupyterLab notebook is accessible in two ways: as an interactive Galaxy tool, or by running the underlying Docker container directly. In both cases, long-running training can be executed on Galaxy's compute infrastructure.
Use-case 1: Image segmentation of COVID-19 CT scans
The figure below shows the infected regions of COVID-19 CT scan images predicted by a U-Net model trained on the JupyterLab infrastructure. The accuracy is similar to that reported in the associated paper.
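Agreement between a predicted segmentation mask and the ground truth is commonly quantified with an overlap metric such as the Dice coefficient. The exact metric used in the reproduced paper may differ; the sketch below is a minimal, dependency-free illustration for binary masks.

```python
def dice_coefficient(pred, truth):
    """Dice overlap between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Toy 3x3 masks flattened to 1D: predicted vs. ground-truth infected region.
pred  = [0, 1, 1, 0, 1, 0, 0, 0, 0]
truth = [0, 1, 1, 0, 0, 0, 0, 0, 0]
print(round(dice_coefficient(pred, truth), 3))  # → 0.8
```

A score of 1.0 means perfect overlap; 0.0 means the masks share no pixels.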
Use-case 2: Prediction of 3D structures of proteins
The figure above shows the predicted 3D structure of the SARS-CoV-2 spike protein, generated using ColabFold.
GPU JupyterLab as a Galaxy tool in a workflow
The GPU JupyterLab integration in Galaxy can be used as a normal Galaxy tool: it takes input datasets from other Galaxy tools, processes them in an IPython notebook, and produces output datasets. These features make JupyterLab well suited to run inside Galaxy workflows. The following figure shows a sample workflow in which the GPU JupyterLab tool is used.
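Inside the notebook, the tool's input datasets appear as ordinary files, and outputs written to disk are collected back into the Galaxy history. The sketch below illustrates that read-process-write pattern; the file names, locations, and column name are hypothetical, not Galaxy's actual conventions.

```python
import csv
from pathlib import Path

def normalise_column(rows, column):
    """Min-max normalise one numeric column across all rows, in place."""
    values = [float(r[column]) for r in rows]
    lo, hi = min(values), max(values)
    span = hi - lo or 1.0  # avoid division by zero for constant columns
    for r, v in zip(rows, values):
        r[column] = (v - lo) / span
    return rows

def process(in_path: Path, out_path: Path, column: str = "expression"):
    """Read an input dataset, normalise one column, write the output dataset."""
    with in_path.open(newline="") as fh:
        rows = list(csv.DictReader(fh))
    rows = normalise_column(rows, column)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    # Hypothetical locations; the actual paths are supplied by the Galaxy tool.
    process(Path("inputs/dataset_1.csv"), Path("outputs/normalised.csv"))
```

Because the notebook only reads and writes files, the same code runs unchanged whether launched as a Galaxy tool or in the standalone container.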
How to apply for this resource
Please follow these steps for application:
- Create an account on Galaxy Europe using your official university/company email id.
  - If you already have an account, make sure your user preferences use an official university/company email.
- Apply for access to GPU JupyterLab.
- Use the official university/company email id in the Google form that matches your Galaxy account.
- Once your request is approved, you will be able to run the GPU-enabled JupyterLab notebook on Galaxy.
- If you are not authorised, you will get an error message that guides you to the request form.
- Contact us if there are any issues.
Current Resources (will be updated regularly)
| Type | Galaxy Europe |
|---|---|
| GPU | 45 |
| CPU | 8000 |
| Memory | 60 TB |
| Disk space | 5 PB |
Each GPU JupyterLab session (will be updated regularly)
| Type | Session |
|---|---|
| GPU/CPU cores | 1/7 |
| Memory | 20 GB + 15 GB from GPU |
| Quota/disk space | 250 GB |
Much more ...
There is much more to explore: check out our Galaxy tutorial at the GTN and our preprint.
Useful links
- Code to create the Docker container
- Docker container on Docker hub
- JupyterLab in a Galaxy workflow
- Galaxy training network (GTN) tutorial on how to use this resource
- Preprint: "An accessible infrastructure for artificial intelligence using a Docker-based JupyterLab in Galaxy"