Topic extraction from astrophysical reports and follow-up analysis tool suggestions in Galaxy

Matching short scientific texts to relevant analysis pipelines using AI

June 11, 2025

Astronomical bulletins such as ATels or GCN Circulars often provide crucial early information about new discoveries, ranging from transient events to peculiar sources across the electromagnetic spectrum. Because some of these phenomena are short-lived, automating follow-up observations with space-based telescopes, informed by past experience, is becoming increasingly valuable. Astro-COLIBRI is a platform that aggregates real-time information from multiple observatories, with the goal of streamlining the coordination of follow-up observations. Nevertheless, much of this process remains manual.

In addition, vast archives of astronomical data already exist, making it possible to enhance the study of a new event by automatically analyzing historical observations from the same region of the sky. However, the lack of standardized formats in these bulletins and the high volume of daily publications make it difficult for computers to interpret these texts and link them to suitable analysis tools.

Building on discussions from Astro-COLIBRI workshops, researchers have developed a Galaxy-based tool as part of a pilot study within the EuroScienceGateway project (WP5). This tool, which can be generalized to alerts distributed by any astronomical broker, combines entity recognition, semantic embeddings, and a trained Convolutional Neural Network to connect short astrophysical texts and the astronomical sources they report to relevant follow-up analysis tools, each related to an astronomical instrument, on the MMODA platform.

The tool enables researchers to:

Input short astrophysical texts directly or fetch them from online sources such as ATel or GCN Circulars.
Automatically extract key entities such as source names, positions, phenomenon types, telescopes, and wavelength ranges using both regex patterns and the astroBERT language model.
Embed the text into a 59-dimensional semantic vector describing source class, wavelength coverage, and relevant instruments.
Use a Convolutional Neural Network (CNN) trained on pairs of initial alerts and follow-up reports to suggest appropriate MMODA tools.
Retrieve direct MMODA URLs tailored to the content of the input text for immediate follow-up analysis.
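
To make the regex side of the entity-extraction step concrete, here is a minimal Python sketch. It is not the tool's actual implementation: the patterns, the small telescope and wavelength vocabularies, the sample text, and the function name extract_entities are all assumptions chosen for illustration, and the astroBERT part of the step is not shown.

```python
import re

# All patterns and vocabularies below are illustrative assumptions,
# not the ones shipped with the Galaxy tool.
COORD_RE = re.compile(
    r"\bRA\s*=?\s*(?P<ra>\d{1,2}:\d{2}:\d{2}(?:\.\d+)?)\s*,\s*"
    r"Dec\s*=?\s*(?P<dec>[+-]?\d{1,2}:\d{2}:\d{2}(?:\.\d+)?)"
)
TELESCOPE_RE = re.compile(r"\b(Swift|INTEGRAL|Fermi)\b")
WAVELENGTH_RE = re.compile(r"\b(X-ray|gamma-ray|optical|radio)\b", re.IGNORECASE)


def unique(items):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


def extract_entities(text):
    """Return coordinates, telescopes and wavelength bands found in text."""
    coord = COORD_RE.search(text)
    return {
        "ra": coord.group("ra") if coord else None,
        "dec": coord.group("dec") if coord else None,
        "telescopes": unique(TELESCOPE_RE.findall(text)),
        "wavelengths": unique(w.lower() for w in WAVELENGTH_RE.findall(text)),
    }


# Made-up bulletin text in the style of a short alert.
SAMPLE = (
    "We report a new X-ray transient detected with Swift/XRT at "
    "RA = 17:45:40.0, Dec = -29:00:28.1 (J2000). "
    "Follow-up with INTEGRAL is encouraged."
)
entities = extract_entities(SAMPLE)
print(entities)
```

Real bulletins write coordinates in many more notations than this single pattern covers, which is one reason to pair regex patterns with a language model.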

The tool produces nine structured output tables, including:

Entity recognition results
Source classification and sky positions
Vectorized input and CNN predictions
Matching MMODA tool URLs with relevance scores

These outputs show how the input text is semantically parsed and matched to suitable analysis tools. The dot-product-based scoring prioritizes the most relevant instruments, based on wavelength or source type.
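
The idea behind dot-product scoring can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the five feature axes stand in for the 59-dimensional vector, the tool names and profile weights are invented, and a simple multi-hot encoding replaces whatever vector (for example the CNN prediction) the actual tool scores against.

```python
# Toy feature axes; the real vector has 59 dimensions whose layout
# is not described in this post.
FEATURES = ["x-ray", "gamma-ray", "optical", "transient", "agn"]

# Hypothetical tool profiles expressed on the same axes.
TOOL_PROFILES = {
    "x-ray-imager": [1.0, 0.2, 0.0, 0.5, 0.3],
    "gamma-ray-monitor": [0.2, 1.0, 0.0, 0.6, 0.4],
    "optical-survey": [0.0, 0.0, 1.0, 0.3, 0.5],
}


def encode(keywords):
    """Multi-hot encode keywords onto the FEATURES axes."""
    present = set(keywords)
    return [1.0 if name in present else 0.0 for name in FEATURES]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def rank_tools(text_vector, profiles):
    """Return (tool, score) pairs sorted from most to least relevant."""
    scores = {name: dot(text_vector, vec) for name, vec in profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


text_vector = encode(["x-ray", "transient"])
ranking = rank_tools(text_vector, TOOL_PROFILES)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

A tool whose profile overlaps with the text on both wavelength and source type accumulates a higher score than one that matches on a single axis, which is how the ranking favors the most relevant instruments.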

This prototype tool is available on usegalaxy.eu.

We acknowledge the data services provided by SIMBAD, TNS, and FINK, as well as the IVOA community for their ontology work on astronomical source types.
