Galaxy: the first 10,000 pubs
The Galaxy Publication Library hits a milestone
By Dave Clements
August 26th 2020
We reached 10,000 publications in the Galaxy Publication Library this month. This library tracks publications that use, extend, implement or reference Galaxy or Galaxy-based platforms. It includes journal articles, theses, book chapters, preprints, and a couple more odds and ends. This milestone is a good opportunity to look at what the library tells us about where the Galaxy project has been, and maybe where it's going as well.
The library was started in December 2011, when the first 168 Galaxy-related publications were added and classified using 8 tags. This included all project publications plus every pub that our ad hoc literature searches could find at the time. The library started on CiteULike and stayed there until September 2017, when we moved it to Zotero. By then, the library had 4500 publications.
The library uses tags to indicate how publications relate to Galaxy. See below for an explanation and history of the tags.
Trends in the publication library reflect the trajectory of the Galaxy Project. Here are some trends that stand out in this data.
From 2013 through 2019 the number of pubs per year increased by an average of 25% each year. This year the trend is even steeper, and if it continues, we will end up with 2400 new publications in 2020.
Note that it took
- 44 months to reach 2,500 publications,
- 26 months to add the next 2,500 publications,
- 18 months to add the next 2,500 (data not shown), and
- 16 months to add the most recent 2,500 pubs.
If trends continue, we will hit 20,000 pubs by 2024.
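The intervals above can be restated as average monthly rates. Here is a minimal Python sketch that uses only the figures quoted in this post; the function and constant names are illustrative and not part of any Galaxy tooling.

```python
# Months taken to add each successive block of 2,500 publications,
# copied from the list above.
MONTHS_PER_BLOCK = [44, 26, 18, 16]
BLOCK_SIZE = 2500


def monthly_rates(months_per_block, block_size=BLOCK_SIZE):
    """Return the average number of pubs added per month for each block."""
    return [block_size / months for months in months_per_block]


rates = monthly_rates(MONTHS_PER_BLOCK)
total_months = sum(MONTHS_PER_BLOCK)

for months, rate in zip(MONTHS_PER_BLOCK, rates):
    print(f"{months} months -> about {rate:.0f} pubs per month")
print(f"Total: {total_months} months for {BLOCK_SIZE * len(MONTHS_PER_BLOCK):,} pubs")
```

The four intervals add up to 104 months, which matches the span from December 2011 to August 2020, and the average rate nearly triples between the first and the most recent block.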
The most obvious trend is that there are a lot of pubs using Galaxy in their methods. 59% of all publications mention Galaxy in their methods section, up from 51% of the first 5000 pubs. So far in 2020, 72% of all pubs are tagged Methods.
This trend doesn't show any sign of slowing down.
Not all Methods papers say which Galaxy instance(s) they used. But starting in 2013, papers that do mention this are also tagged with UseMain, UsePublic, UseLocal, and/or UseCloud tags (see Tags below for an explanation of all tags).
The relative number of UseMain and UsePublic pubs highlights the increasing availability of publicly accessible Galaxy platforms.
- In 2013-2014, there were 2 1/2 times as many UseMain pubs as UsePublic pubs.
- In 2015, they were about the same.
- In 2016-2017, there were nearly twice as many UsePublic pubs as UseMain pubs.
- In 2019-2020, there are nearly 3 times as many UsePublic pubs as UseMain pubs.
This rise reflects the increase in available public platforms from 21 servers at the start of 2012 to over 150 platforms today.
The number of pubs reporting using local / non-public instances of Galaxy increases every year, up from 58 in 2015 to 144 in 2019. However, in most years the percentage of Methods papers that report using local installs is dropping slightly, from 12% of Methods papers in 2015 to 9% in 2020.
We aren't sure what to make of this, as we believe that local installs are by far the largest group of Galaxy installations. Some possible explanations:
- Researchers using local Galaxy instances are less likely to report that they are using Galaxy.
- Most local installs aren't used in research that ends up being published.
- There are far fewer local installs than we think.
We hope to introduce features in 2021 that will allow us to confirm or contradict the third hypothesis.
The library contains any type of academic publication, including theses, conference papers, books and book chapters, and of course journal articles.
Galaxy has appeared in over 1800 journals. The 20 most popular journals in the library are:
|Rank||Journal||Pubs|
|3||Nucleic Acids Research||263|
|6||Microbiology Resource Announcements *||171|
|10||Frontiers in Microbiology||81|
|16||Briefings in Bioinformatics||68|
|17||Proceedings of the National Academy of Sciences||64|
|20||Future Generation Computer Systems||58|
|20||PLOS Computational Biology||58|
* Genome Announcements was re-titled Microbiology Resource Announcements in July 2018.
There are 6 journals in the top 20 that weren't there when we hit 5000 pubs. All 6 newcomers are open access.
This highlights a general trend towards adoption of open access journals by the Galaxy community:
The graph shows the top 20 journals in the library at 4 points in time over the past 5 years. A couple of things to note:
- The graph shows the top 20 journals in the entire library at each point in time, not just in the pubs added since the last point in time. (That graph would show the last two points here even more starkly.)
- Points 1, 2, and 4 correspond to reaching 2500, 5000, and 10,000 pubs.
- With the exception of Bioinformatics, the rankings of paid access journals have been on a clear downward trend since at least 2018.
- Although not shown on the graph, the absolute number of papers published in these formerly high-ranking paid access journals (including Cell and Nature) has remained relatively constant. It is their relative ranking that has dropped as the number of pubs in other journals has increased.
BioRxiv is both over-counted and under-counted because of how we handle preprints: When a paper shows up as a preprint we add it as a preprint. Once or twice a year, we check all pubs currently in preprint form to see if they have been published in a peer-reviewed journal. If they have, then we delete the preprint version, and add the peer-reviewed pub.
This has two implications:
- The total number of articles that have been published in BioRxiv is actually much higher than shown.
- At most points in time, the number of pubs shown in BioRxiv is an over-count of Galaxy-related pubs that still don't exist in a peer-reviewed journal.
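The preprint handling described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not how the library is actually maintained: the `Pub` class, the `reconcile_preprints` function, and the `published` lookup are all hypothetical, and in practice the check for a peer-reviewed version is done by hand.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Pub:
    """Hypothetical minimal record for one library entry."""
    title: str
    venue: str
    is_preprint: bool = False


def reconcile_preprints(library: List[Pub], published: Dict[str, Pub]) -> List[Pub]:
    """Swap each preprint for its peer-reviewed version when one exists.

    `published` maps a preprint title to the peer-reviewed record and
    stands in for the periodic manual check.
    """
    updated = []
    for pub in library:
        replacement = published.get(pub.title) if pub.is_preprint else None
        updated.append(replacement if replacement is not None else pub)
    return updated


# Made-up example entries, for illustration only.
library = [
    Pub("Paper A", "BioRxiv", is_preprint=True),
    Pub("Paper B", "BioRxiv", is_preprint=True),
    Pub("Paper C", "Some Journal"),
]
published = {"Paper A": Pub("Paper A", "Some Journal")}

library = reconcile_preprints(library, published)
preprint_count = sum(pub.is_preprint for pub in library)
print(f"Preprints still in the library: {preprint_count}")
```

After the check, "Paper A" no longer counts toward BioRxiv even though it was published there, which is the under-count. Between checks it would still have been listed as a preprint, which is the over-count.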
The 10,000th pub is
- Dai, W., Xiong, J., Zheng, H., Ni, S., Ye, Y., & Wang, C. (2020). Effect of Rhizophora apiculata plantation for improving water quality, growth, and health of mud crab. Applied Microbiology and Biotechnology, 104(15), 6813–6824. https://doi.org/10.1007/s00253-020-10716-7
This is an exemplary 10,000th publication: It's
- a Methods paper, by far the most popular topic tag;
- a UsePublic paper, an ascendant topic tag;
- and a Huttenhower paper, the most frequently referenced public platform tag.
However, in one way it goes against the trends: It's in a paid access journal.
The future of Galaxy being referenced, used, and extended in journals, theses, books, and preprints is bright. However, the future of curating all those pubs is a wee bit murky. Above I said:
If trends continue, we will hit 20,000 pubs by 2024.
Which means adding about 3000 pubs per year. Lately we have knocked the false positive rate for our pub searches down to around 20%, which means we will be reviewing about 3600 pubs per year, or 300 per month.
This is not scalable with current methods.
How we do, and how much we do, paper curation in the future is uncertain. Right now every paper that automatic searches can identify is scanned by me, and either flagged as irrelevant or added to the library. Easy papers take 1-2 minutes. Most papers take 2-4 minutes, and some take much longer. If we assume 3 minutes per pub, and 300 pubs a month going forward, that is 180 hours per year, or a full work-month. If 4 minutes per pub is a more accurate guess, then we go to 240 hours per year.
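The workload estimate above is simple arithmetic, and a short Python sketch makes it easy to try other assumptions. The function name is made up for illustration; the 300 pubs per month and the 3 and 4 minute figures come from this post.

```python
def curation_hours_per_year(pubs_per_month, minutes_per_pub):
    """Yearly review time in hours for a monthly volume and per-pub time."""
    return pubs_per_month * 12 * minutes_per_pub / 60


PUBS_PER_MONTH = 300  # review volume quoted above

for minutes in (3, 4):
    hours = curation_hours_per_year(PUBS_PER_MONTH, minutes)
    print(f"{minutes} minutes per pub: {hours:.0f} hours per year")
```

Every extra minute per pub adds 60 hours per year at this volume, so the estimate is quite sensitive to how long a typical paper really takes.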
Or: why not just use a purely automated process like Google Scholar citation alerts?
Most papers that use Galaxy don't formally reference it
Getting tools and platforms cited is a common challenge for bioinformatics software. The more popular / ubiquitous the platform is, the less likely it is that researchers will think of citing it in their methods sections. Galaxy faces this challenge constantly. (It could be worse. I imagine even fewer researchers cite PubMed, Python, or R.) We are grateful when papers mention Galaxy at all. Expecting everyone to also include formal citations would just be frustrating, and is just not going to happen.
Using just formal references, we would miss out on the bulk of Galaxy papers.
The tools on many public Galaxy platforms are also available as command line tools
Many public Galaxy platforms are tool publishing platforms. These platforms make a lab's tools easily accessible to researchers via a web interface (Galaxy). These labs often also make their tools available as command line tools that can be locally installed. The challenge is that both the Galaxy platform with the tools, and the command line tools often have the same base citation.
Blindly counting any publication that references these papers would result in over-counting papers, and include a large number of irrelevant papers in the Galaxy pub library, thus greatly reducing the value of the library.
Annotation would not be possible
The tags (see below) add significant value to the library. With them we can track trends over time, and identify, for instance, papers about using Galaxy in education. Without them, we have pubs per year and that's it.
This is how we discover a lot of cool work by the community
Finally, by looking at all these papers (albeit not very deeply) we have found all sorts of cool work done by the Galaxy community. The monthly process of curation is a great way to find out about community developments that might otherwise go unnoticed.
Well, maybe. We do have a large training dataset (10,000 pubs!), and access to machine learning and natural language processing tools and expertise. We might even be able to use Galaxy itself, which has strong support for both those fields. (We also have access to undergraduates!) We are looking into options.
For now, we are stepping back from the monthly curation effort to figure out what to do going forward. If we do return with a manual, automatic, or hybrid technique, then we will likely also adjust how we tag publications. Specifically, the UseMain and UseCloud tags are no longer meaningful.
UseMain and UseCloud tags are past their expiration date
UseMain: When this was introduced the UseGalaxy.org server stood alone as by far the largest publicly accessible platform. It is still a large platform, but it no longer stands alone. Giving it special emphasis is no longer useful. All UseMain tags may get replaced with a couplet of UsePublic and >UseGalaxy.org tags.
UseCloud: When introduced this meant (and still means) that a custom instance was launched on public or private cloud infrastructure, used for analysis, and then shut down. In 2020, this tag is more confusing than enlightening. Several public Galaxy platforms now support launching private instances in a way that is seamless and transparent to the researcher. Several other platforms use public and private cloud infrastructures (including the big 3 UseGalaxy.* servers), again in ways the end user is never aware of. We may just drop this tag.
Galaxy has had an interesting journey over the past 15 years, and the Galaxy Publications Library reflects that. We look forward to continuing to track where we go in the future.
Thanks for using Galaxy (and for saying that you are using Galaxy!),
We've used Topic Tags since the beginning of the library to track how publications relate to Galaxy. Since the move to Zotero, we've also added Galaxy Featured Tags, Platform Tags, and Publisher Tags.
Topic tags indicate how the publication relates to Galaxy. Here's the current set and when each tag was added:
|+HowTo||Papers about how to use Galaxy for specific analyses. These are tutorials.||2011|
|+IsGalaxy||Publications about Galaxy itself or installations of Galaxy.||2011|
|+Methods||Uses Galaxy in their methods.||2011|
|+Other||Publications that don't fit well under any other tag.||2011|
|+Project||Publications with a Galaxy team member as an author.||2011|
|+Reproducibility||Reproducibility and persistence in science.||2011|
|+Shared||Publications that have published workflows, histories, datasets, pages, or visualizations in a Galaxy instance.||2011|
|+Workbench||Publication mentions Galaxy as a platform.||2011|
|+Tools||Tools that run in, have been ported to, or interact with Galaxy||2012|
|+Cloud||Publications referencing / extending / discussing Galaxy in a cloud context.||2013|
|+RefPublic||References a publicly accessible Galaxy instance or a Galaxy service. This is distinct from the +UsePublic tag.||2013|
|+Unknown||Publications that we know refer to Galaxy, but we aren't sure how because they are behind a paywall we don't have access to. These are revisited periodically.||2013|
|+UseCloud||Uses a custom built cloud based instance of Galaxy in its methods.||2013|
|+UseLocal||Uses a local installation of Galaxy in its methods.||2013|
|+UseMain||Uses the project's public server, usegalaxy.org (a.k.a. Main), in its methods.||2013|
|+UsePublic||Uses a publicly accessible Galaxy instance or a Galaxy service in its methods.||2013|
|+Visualization||Publications referencing Galaxy in a visualization and/or visual analytics context.||2013|
|+Education||Papers referencing Galaxy in a training or education context.||2019|
With the move to Zotero we added two new sets of tags. The first set is used to highlight publications that feature Galaxy prominently:
|+Galactic||Publication is about Galaxy.|
|+Stellar||Publication features Galaxy prominently.|
The second set of new tags show which public Galaxy platform is used or discussed in publications. These are tagged with the platform's name, preceded by a ">". For example, the >RepeatExplorer tag lists all papers that use or reference the RepeatExplorer public server.
Zotero is configured to also add any keywords it can detect automatically when the publication is added. These tags are not rationalized in any way, and tend to describe the research topic or domain. Prosapip1 and Genome evolution are examples.
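The prefix conventions described above make it easy to sort a publication's tags into groups. Here is a minimal Python sketch, assuming the tags are available as plain strings; the function names and group labels are hypothetical and not part of Zotero or Galaxy. Publisher Tags are not handled because their format isn't described in this post.

```python
# The two Galaxy Featured Tags share the "+" prefix with topic tags,
# so they are matched by name first.
FEATURED_TAGS = {"+Galactic", "+Stellar"}


def classify_tag(tag):
    """Return which group a tag belongs to, based on its prefix."""
    if tag in FEATURED_TAGS:
        return "featured"
    if tag.startswith("+"):
        return "topic"
    if tag.startswith(">"):
        return "platform"
    # Anything else is treated as an automatically detected keyword.
    return "keyword"


def group_tags(tags):
    """Group a publication's tags by type, preserving their order."""
    groups = {"featured": [], "topic": [], "platform": [], "keyword": []}
    for tag in tags:
        groups[classify_tag(tag)].append(tag)
    return groups


example = ["+Methods", "+UsePublic", ">RepeatExplorer", "+Stellar", "Genome evolution"]
groups = group_tags(example)
for name, members in groups.items():
    print(f"{name}: {members}")
```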
These tags were added over an 8-year period. Are older papers back-tagged when new tags are added? Mostly not, but there are some exceptions:
- Galaxy Featured Tags exist back to the beginning of time. (These were converted from CiteULike's priority feature.)
- Topic and platform tags have been applied to older publications on a selective basis.
Therefore, don't look for a lot of +UsePublic or +Cloud tagged papers from before 2013.