Galaxy: the first 10,000 pubs

The Galaxy Publication Library hits a milestone

By Dave Clements

August 26th 2020

We reached 10,000 publications in the Galaxy Publication Library this month. This library tracks publications that use, extend, implement or reference Galaxy or Galaxy-based platforms. It includes journal articles, theses, book chapters, preprints, and a couple more odds and ends. This milestone is a good opportunity to look at what the library tells us about where the Galaxy project has been, and maybe where it's going as well.

The library was started in December 2011, when the first 168 Galaxy-related publications were added and classified using 8 tags. This included all project publications plus every pub that our ad hoc literature searches could find at the time. The library started on CiteULike and stayed there until September 2017, when we moved it to Zotero. By the time of the move, the library had 4500 publications.

The library uses tags to indicate how publications relate to Galaxy. See below for an explanation and history of the tags.


Publications and Tags Over Time

Year # Pubs Methods UsePublic Workbench UseMain RefPublic UseLocal Tools IsGalaxy Reproducibility Cloud Other Shared Unknown HowTo Project Visualization Education UseCloud
2005 1 1
2006 4 3 1
2007 12 2 7 1 2 2
2008 32 15 12 1 2 2 1 1
2009 52 26 18 3 2 1 1 4 1 1
2010 107 50 36 1 1 5 1 1 7 4 5
2011 205 93 69 1 8 16 6 3 8 3 4 6 1 1
2012 398 197 1 128 3 3 30 15 7 14 9 12 10 12 10 2 1
2013 506 264 16 149 92 12 28 37 28 9 22 9 22 13 7 6 3 3 2
2014 741 331 60 226 98 30 43 67 48 25 40 39 23 7 12 7 8 2 1
2015 929 473 140 233 116 52 58 68 49 26 48 33 23 14 8 11 7 1 3
2016 1125 575 213 246 115 116 72 73 49 46 37 47 19 13 20 7 9 2 4
2017 1333 760 279 239 138 110 98 76 72 72 35 25 24 23 8 6 7 5 1
2018 1578 1028 373 236 185 138 110 51 46 76 28 18 26 25 16 8 6 6 3
2019 1926 1310 579 236 205 167 144 64 51 82 22 17 25 5 12 6 4 12 2
2020 1405 1018 452 134 173 106 96 50 42 38 11 11 8 4 5 8 8 15 1
2021 6 2 1 3 1
Total 10360 6144 2114 1975 1128 731 652 528 425 387 261 210 192 131 112 85 56 48 17

Trends in the publication library reflect the trajectory of the Galaxy Project. Here are some trends that stand out in this data.

Up, Up, Up

From 2013 through 2019 the number of pubs per year increased by an average of 25% each year. This year the trend is even steeper, and if it continues, we will end up with 2400 new publications in 2020.
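For anyone who wants to check the 25% figure, here is a minimal sketch that recomputes it from the "# Pubs" column of the table above. It assumes that "average" means the arithmetic mean of the year-over-year growth rates; the variable names are just for illustration.

```python
# Pubs per year, copied from the "# Pubs" column of the table above.
pubs_per_year = {
    2013: 506, 2014: 741, 2015: 929, 2016: 1125,
    2017: 1333, 2018: 1578, 2019: 1926,
}

years = sorted(pubs_per_year)

# Growth of each year relative to the year before it.
yearly_growth = {
    year: pubs_per_year[year] / pubs_per_year[year - 1] - 1
    for year in years[1:]
}

# Assumption: "average" is the arithmetic mean of the yearly growth rates.
average_growth = sum(yearly_growth.values()) / len(yearly_growth)

for year, growth in yearly_growth.items():
    print(f"{year}: {growth:.1%}")
print(f"Average: {average_growth:.1%}")
```

Under that assumption the average comes out at a little over 25%, with the largest single jump between 2013 and 2014.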

Note that it took

If trends continue, we will hit 20,000 pubs by 2024.

Methods

The most obvious trend is that there are a lot of pubs using Galaxy in their methods. 59% of all publications mention Galaxy in their methods section, up from 51% of the first 5000 pubs. So far in 2020, 72% of all pubs are tagged Methods.

This trend doesn't show any sign of slowing down.
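The Methods percentages can be recomputed from the table above. This is a small sketch using the "# Pubs" and "Methods" columns for 2013 through 2020 plus the "Total" row; the names are illustrative only.

```python
# (total pubs, pubs tagged Methods) per year, from the table above.
counts = {
    2013: (506, 264), 2014: (741, 331), 2015: (929, 473), 2016: (1125, 575),
    2017: (1333, 760), 2018: (1578, 1028), 2019: (1926, 1310), 2020: (1405, 1018),
}

# "Total" row of the table: all pubs, and all pubs tagged Methods.
library_total, methods_total = 10360, 6144

# Share of each year's pubs that carry the Methods tag.
methods_share = {
    year: methods / total for year, (total, methods) in counts.items()
}
overall_share = methods_total / library_total

for year, share in methods_share.items():
    print(f"{year}: {share:.0%}")
print(f"Whole library: {overall_share:.0%}")
```

The share has risen every year since 2017, which is the trend described above.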

UsePublic and UseMain

Not all Methods papers say which Galaxy instance(s) they used. But starting in 2013, papers that do mention this are also tagged with UseMain, UsePublic, UseLocal, and/or UseCloud tags (see Tags below for an explanation of all tags).

The relative number of UseMain and UsePublic pubs highlights the increasing availability of publicly accessible Galaxy platforms.

  • In 2013-2014, there were 2 1/2 times as many UseMain pubs as UsePublic pubs.
  • In 2015, they were about the same.
  • In 2016-2017, there were nearly twice as many UsePublic pubs as UseMain pubs.
  • In 2019-2020, there are nearly 3 times as many UsePublic pubs as UseMain pubs.

This rise reflects the increase in available public platforms from 21 servers at the start of 2012 to over 150 platforms today.
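The ratios in the list above come straight from the UsePublic and UseMain columns of the table. Here is a short sketch of that arithmetic; the grouping of years into periods follows the list, and the function name is made up for this example.

```python
# Counts per year, copied from the UsePublic and UseMain columns of the table.
use_public = {2013: 16, 2014: 60, 2015: 140, 2016: 213, 2017: 279, 2019: 579, 2020: 452}
use_main = {2013: 92, 2014: 98, 2015: 116, 2016: 115, 2017: 138, 2019: 205, 2020: 173}


def public_to_main_ratio(years):
    """Return UsePublic pubs divided by UseMain pubs, summed over the given years."""
    public = sum(use_public[year] for year in years)
    main = sum(use_main[year] for year in years)
    return public / main


periods = {
    "2013-2014": (2013, 2014),
    "2015": (2015,),
    "2016-2017": (2016, 2017),
    "2019-2020": (2019, 2020),
}
ratios = {label: public_to_main_ratio(years) for label, years in periods.items()}

for label, ratio in ratios.items():
    print(f"{label}: {ratio:.2f} UsePublic pubs per UseMain pub")
```

A ratio below 1 means UseMain pubs outnumber UsePublic pubs, as in 2013-2014; a ratio above 1 means the reverse.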

UseLocal ... dropping?

The number of pubs reporting using local / non-public instances of Galaxy increases every year, up from 58 in 2015 to 144 in 2019. However, in most years the percentage of Methods papers that report using local installs is dropping slightly, from 12% of Methods papers in 2015 to 9% in 2020.
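Both halves of that statement can be checked against the table. The sketch below divides the UseLocal column by the Methods column for 2015 through 2020; the variable names are illustrative.

```python
# Counts per year, copied from the Methods and UseLocal columns of the table.
methods = {2015: 473, 2016: 575, 2017: 760, 2018: 1028, 2019: 1310, 2020: 1018}
use_local = {2015: 58, 2016: 72, 2017: 98, 2018: 110, 2019: 144, 2020: 96}

# UseLocal pubs as a fraction of that year's Methods pubs.
local_share = {year: use_local[year] / methods[year] for year in methods}

for year, share in local_share.items():
    print(f"{year}: {use_local[year]} UseLocal pubs, {share:.1%} of Methods pubs")
```

The absolute counts climb through 2019 while the share drifts down, though not in every single year.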

We aren't sure what to make of this, as we believe that local installs are by far the largest group of Galaxy installations. Some possible explanations:

  1. Researchers using local Galaxy instances are less likely to report that they are using Galaxy.
  2. Most local installs aren't used in research that ends up being published.
  3. There are far fewer local installs than we think.

We hope to introduce features in 2021 that will allow us to confirm or contradict hypothesis #3.

Journals

The library contains any type of academic publication, including theses, conference papers, books and book chapters, and of course journal articles.

Galaxy has appeared in over 1800 journals. The 20 most popular journals in the library are:

Rank Journal # Pubs
1 PLOS ONE 381
2 Scientific Reports 300
3 Nucleic Acids Research 263
4 BMC Genomics 243
5 Bioinformatics 228
6 Microbiology Resource Announcements * 171
7 BMC Bioinformatics 167
8 Nature Communications 141
9 BioRxiv 86
10 Frontiers in Microbiology 81
11 Cell Reports 80
12 PLOS Genetics 78
13 Genome Biology 76
14 GigaScience 72
15 PeerJ 70
16 Briefings in Bioinformatics 68
17 Molecular Ecology 64
17 Proceedings of the National Academy of Sciences 64
19 Genome Research 62
20 eLife 58
20 F1000Research 58
20 Future Generation Computer Systems 58
20 PLOS Computational Biology 58

* Genome Announcements was re-titled Microbiology Resource Announcements in July 2018.

There are 6 journals in the top 20 that weren't there when we hit 5000 pubs. All 6 newcomers are open access.

Open Access is Rising

This highlights a general trend towards adoption of open access journals by the Galaxy community:

Top 20 journals over time

This shows the top 20 journals in the library at 4 points in time over the past 5 years. A couple of things to note:

  • The graph shows the top 20 journals in the entire library at each point in time, not just in the pubs added since the last point in time. (That graph would show the last two points here even more starkly.)
  • Points 1, 2, and 4 correspond to reaching 2500, 5000, and 10,000 pubs.
  • With the exception of Bioinformatics, the rankings of paid access journals have been on a clear downward trend since at least 2018.
  • Although not shown on the graph, the absolute number of papers published in these formerly high-ranking paid access journals (including Cell and Nature) has remained relatively constant. It is their relative ranking that has dropped as the number of pubs in other journals has increased.

Preprints & BioRxiv

BioRxiv is both over-counted and under-counted because of how we handle preprints: When a paper shows up as a preprint we add it as a preprint. Once or twice a year, we check all pubs currently in preprint form to see if they have been published in a peer-reviewed journal. If they have, then we delete the preprint version, and add the peer-reviewed pub.

This has two implications:

  • The total number of articles that have been published in BioRxiv is actually much higher than shown.
  • At most points in time, the number of pubs shown in BioRxiv is an over-count of Galaxy-related pubs that still don't exist in a peer-reviewed journal.
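To make the two implications concrete, here is a toy sketch of the preprint replacement step described above. It is not the actual curation tooling: the `Pub` class, the `reconcile_preprints` function, and the paper titles are all hypothetical, and matching on title is a simplification for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pub:
    title: str
    venue: str
    is_preprint: bool


def reconcile_preprints(library, published_versions):
    """Swap each preprint for its peer-reviewed version, when one exists.

    published_versions maps a preprint title to the journal version of that
    paper (a hypothetical lookup; in practice this is checked by hand).
    """
    updated = []
    for pub in library:
        journal_version = published_versions.get(pub.title) if pub.is_preprint else None
        updated.append(journal_version if journal_version is not None else pub)
    return updated


# Made-up library: two preprints and one journal article.
library = [
    Pub("Paper A", "BioRxiv", True),
    Pub("Paper B", "BioRxiv", True),
    Pub("Paper C", "PLOS ONE", False),
]
# Since the last check, only Paper A has appeared in a journal.
published_versions = {"Paper A": Pub("Paper A", "GigaScience", False)}

after = reconcile_preprints(library, published_versions)
biorxiv_before = sum(pub.venue == "BioRxiv" for pub in library)
biorxiv_after = sum(pub.venue == "BioRxiv" for pub in after)
print(f"BioRxiv pubs before check: {biorxiv_before}, after: {biorxiv_after}")
```

The library size stays the same, but the BioRxiv count drops each time a check is run. That is why the count understates how many papers ever appeared there, and why it overstates the unpublished ones between checks.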

The 10,000th Pub?

The 10,000th pub is

  • Dai, W., Xiong, J., Zheng, H., Ni, S., Ye, Y., & Wang, C. (2020). Effect of Rhizophora apiculata plantation for improving water quality, growth, and health of mud crab. Applied Microbiology and Biotechnology, 104(15), 6813–6824. https://doi.org/10.1007/s00253-020-10716-7

Which is an exemplar 10,000th publication: It's

However, in one way it goes against the trends: It's in a paid access journal.

The Future

The future of Galaxy being referenced, used, and extended in journals, theses, books, and preprints is bright. However, the future of curating all those pubs is a wee bit murky. Above I said:

If trends continue, we will hit 20,000 pubs by 2024.

Which means adding about 3000 pubs per year. Lately we have knocked the false positive rate for our pub searches down to around 20%, which means we will be reviewing about 3600 pubs per year, or 300 per month.

This is not scalable with current methods.

How we do paper curation in the future, and how much of it we do, is uncertain. Right now every paper that automatic searches can identify is scanned by me, and either flagged as irrelevant or added to the library. Easy papers take 1-2 minutes. Most papers take 2-4 minutes, and some take much longer. If we assume 3 minutes per pub, and 300 pubs a month going forward, that is 180 hours per year, or a full work-month. If 4 minutes per pub is a more accurate guess, then we go to 240 hours per year.
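The workload estimate is simple arithmetic, sketched here so the assumptions can be varied. One assumption is mine: the 20% false positive rate is treated as 20% extra pubs on top of the 3000 relevant ones, which is what reproduces the 3600 figure above. The function names are just for illustration.

```python
def reviewed_per_year(relevant_per_year, false_positive_overhead):
    """Pubs to review per year, given the pubs that end up in the library.

    Assumption: false positives are counted as a fraction on top of the
    relevant pubs (3000 relevant + 20% extra = 3600 reviewed).
    """
    return round(relevant_per_year * (1 + false_positive_overhead))


def curation_hours_per_year(pubs_per_month, minutes_per_pub):
    """Hours spent per year scanning pubs at a fixed pace."""
    return pubs_per_month * 12 * minutes_per_pub / 60


to_review = reviewed_per_year(3000, 0.20)
per_month = to_review // 12

hours_at_3_minutes = curation_hours_per_year(per_month, 3)
hours_at_4_minutes = curation_hours_per_year(per_month, 4)

print(f"{to_review} pubs per year, {per_month} per month")
print(f"At 3 minutes per pub: {hours_at_3_minutes:.0f} hours per year")
print(f"At 4 minutes per pub: {hours_at_4_minutes:.0f} hours per year")
```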

Why go through the trouble?

Or: why not just use a purely automated process like Google Scholar citation alerts?

Most papers that use Galaxy don't formally reference it

Getting tools and platforms cited is a common challenge for bioinformatics software. The more popular / ubiquitous the platform is, the less likely it is that researchers will think of citing it in their methods sections. Galaxy faces this challenge constantly. (It could be worse. I imagine even fewer researchers cite PubMed, Python, or R.) We are grateful when papers mention Galaxy at all. Expecting everyone to also include formal citations would just be frustrating, and is just not going to happen.

Using just formal references, we would miss out on the bulk of Galaxy papers.

The tools on many public Galaxy platforms are also available as command line tools

Many public Galaxy platforms are tool publishing platforms. These platforms make a lab's tools easily accessible to researchers via a web interface (Galaxy). These labs often also make their tools available as command line tools that can be locally installed. The challenge is that both the Galaxy platform with the tools, and the command line tools often have the same base citation.

Blindly counting any publication that references these papers would result in over-counting papers, and include a large number of irrelevant papers in the Galaxy pub library, thus greatly reducing the value of the library.

Annotation would not be possible

The tags (see below) add significant value to the library. With them we can track trends over time, and identify, for instance, papers about using Galaxy in education. Without them, we have pubs per year and that's it.

This is how we discover a lot of cool work by the community

Finally, by looking at all these papers (albeit, not very deeply) we have found all sorts of cool work done by the Galaxy community. The monthly process of curation is a great way to find out about community developments that might otherwise go unnoticed.

Can we make this tractable?

Well, maybe. We do have a large training dataset (10,000 pubs!), and access to machine learning and natural language processing tools and expertise. We might even be able to use Galaxy itself, which has strong support for both those fields. (We also have access to undergraduates!) We are looking into options.

Next steps

For now, we are stepping back from the monthly curation effort to figure out what to do going forward. If we do return, whether with a manual, automatic, or hybrid technique, we will likely also adjust how we tag publications. Specifically, the UseMain and UseCloud tags are no longer meaningful.

UseMain and UseCloud tags are past their expiration date

UseMain: When this was introduced the UseGalaxy.org server stood alone as by far the largest publicly accessible platform. It is still a large platform, but it no longer stands alone. Giving it special emphasis is no longer useful. All UseMain tags may get replaced with a couplet of UsePublic and >UseGalaxy.org tags.

UseCloud: When introduced this meant (and still means) that a custom instance was launched on public or private cloud infrastructure, used for analysis, and then shut down. In 2020, this tag is more confusing than enlightening. Several public Galaxy platforms now support launching private instances in a way that is seamless and transparent to the researcher. Several other platforms use public and private cloud infrastructures (including the big 3 UseGalaxy.* servers), again in ways the end user is never aware of. We may just drop this tag.


Galaxy has had an interesting journey over the past 15 years, and the Galaxy Publications Library reflects that. We look forward to continuing to track where we go in the future.

Thanks for using Galaxy (and for saying that you are using Galaxy!),

Dave Clements


More on Tags

We've used Topic Tags since the beginning of the library to track how publications relate to Galaxy. Since the move to Zotero, we've also added Galaxy Featured Tags, Platform Tags, and Publisher Tags.

Topic Tags

Topic tags indicate how the publication relates to Galaxy. Here's the current set and when each tag was added:

Tag Explanation Year
+HowTo Papers about how to use Galaxy for specific analyses. These are tutorials. 2011
+IsGalaxy Publications about Galaxy itself or installations of Galaxy. 2011
+Methods Uses Galaxy in their methods. 2011
+Other Publications that don't fit well under any other tag. 2011
+Project Publications with a Galaxy team member as an author. 2011
+Reproducibility Reproducibility and persistence in science. 2011
+Shared Publications that have published workflows, histories, datasets, pages, or visualizations in a Galaxy instance. 2011
+Workbench Publication mentions Galaxy as a platform. 2011
+Tools Tools that run in, have been ported to, or interact with Galaxy. 2012
+Cloud Publications referencing / extending / discussing Galaxy in a cloud context. 2013
+RefPublic References a publicly accessible Galaxy instance or a Galaxy service. This is distinct from the +UsePublic tag. 2013
+Unknown Publications that we know refer to Galaxy, but we aren't sure how because they are behind a paywall we don't have access to. These are revisited periodically. 2013
+UseCloud Uses a custom built cloud based instance of Galaxy in its methods. 2013
+UseLocal Uses a local installation of Galaxy in its methods. 2013
+UseMain Uses the project's public server, usegalaxy.org (a.k.a. Main), in its methods. 2013
+UsePublic Uses a publicly accessible Galaxy instance or a Galaxy service in its methods. 2013
+Visualization Publications referencing Galaxy in a visualization and/or visual analytics context. 2013
+Education Papers referencing Galaxy in a training or education context. 2019

With the move to Zotero we added two new sets of tags. The first set is used to highlight publications that feature Galaxy prominently:

Tag Explanation
+Galactic Publication is about Galaxy.
+Stellar Publication features Galaxy prominently.

Platform tags

The second set of new tags shows which public Galaxy platform is used or discussed in publications. These are tagged with the platform's name, preceded by a ">". For example, the >RepeatExplorer tag lists all papers that use or reference the RepeatExplorer public server.

Publisher Tags

Zotero is configured to also add any keywords it can detect automatically when the publication is added. These tags are not rationalized in any way, and tend to describe the research topic or domain. Prosapip1 and Genome evolution are examples.

Retroactive Tagging?

These tags were added over an 8-year period. Are older papers back-tagged when new tags are added? Mostly not, but there are some exceptions:

  • Galaxy Featured Tags exist back to the beginning of time. (These were converted from CiteULike's priority feature.)
  • Topic and platform tags have been applied to older publications on a selected basis.

Therefore, don't look for a lot of +UsePublic or +Cloud tagged papers from before 2013.
