Bibliometric Blind Spots: Why Document Type Matters in the Age of Open Science

In the age of open science, we are awash in data, but not necessarily in clarity. Platforms like OpenAlex and Semantic Scholar promise more transparency and accessibility, liberating bibliometric research from the paywalled grip of proprietary systems like Scopus and Web of Science (WoS). Yet, as a recent preprint comparing these databases shows, openness is not synonymous with accuracy. One of the least visible yet most consequential battlegrounds in this new scholarly landscape is the deceptively simple act of labeling what a publication is.

Should a brief editorial be counted like a peer-reviewed article? Should a conference paper be treated as a journal contribution? Should a retraction be quietly buried or flagged with its own document type? These are not trivial taxonomic squabbles. They are foundational questions that shape citation metrics, journal impact factors, and university rankings, and those metrics in turn influence careers, funding decisions, and the visibility of entire research communities.

The study in question analyzed nearly 10 million publications from 2012 to 2022 across five databases. It revealed a striking divergence: Web of Science and Scopus assign clear and consistent document types to nearly every entry, thanks to highly curated editorial processes. OpenAlex, by contrast, classified over 99% of its corpus as “research articles,” lumping together everything from empirical studies to commentaries and even book reviews. Semantic Scholar was even less granular, leaving most records without any document type metadata at all.

This matters profoundly. In a world where “publish or perish” still governs academic life, the inflation of research articles by open databases like OpenAlex could skew the perception of productivity and scholarly value. The problem is sharpest in comparisons between institutions: when two universities appear to have similar output, but one university’s count includes more editorials labeled as research articles because of a platform’s misclassification, the two numbers no longer measure the same thing and the integrity of the comparison collapses.
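
To make the comparison problem concrete, here is a minimal sketch in Python. The two universities, their record counts, and the `count_articles` helper are all hypothetical and invented for illustration; none of the numbers come from the study. The sketch only shows how a coarse typology that labels everything as an article can erase a real difference in research output.

```python
from collections import Counter

# Hypothetical records as (institution, curated document type).
# These counts are invented for illustration, not taken from the study.
records = (
    [("University A", "article")] * 80
    + [("University A", "editorial")] * 20
    + [("University B", "article")] * 98
    + [("University B", "editorial")] * 2
)


def count_articles(records, coarse):
    """Count records labeled 'article' per institution.

    coarse=True mimics a database that labels every record as an
    article; coarse=False keeps the curated document type.
    """
    counts = Counter()
    for institution, doc_type in records:
        label = "article" if coarse else doc_type
        if label == "article":
            counts[institution] += 1
    return dict(counts)


curated_counts = count_articles(records, coarse=False)
coarse_counts = count_articles(records, coarse=True)

print(curated_counts)  # {'University A': 80, 'University B': 98}
print(coarse_counts)   # {'University A': 100, 'University B': 100}
```

Under the curated typology the two institutions differ by 18 articles; under the coarse one they look identical.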

What’s more, this isn’t just a question of computational accuracy; it is a question of epistemic justice. Bibliometric systems shape what counts as knowledge. Mislabeling or omitting types of publications disproportionately affects research from underrepresented regions and fields. In low- and middle-income countries where conference proceedings and preprints are more prevalent, failure to classify such work appropriately means these contributions become invisible in global rankings and analytics.

This is not a call to abandon OpenAlex or Semantic Scholar. On the contrary, they represent necessary correctives to closed infrastructures. But as the open science movement matures, it must confront the realities of its own limitations. Metadata quality is not a luxury—it is the infrastructure of meaning in digital scholarship. Policymakers and research institutions must invest in metadata governance as they do in open access. That means standardizing typologies across platforms, integrating human-curated indexing where feasible, and ensuring that openness does not come at the cost of bibliometric precision. If we fail to act, the promise of open science risks becoming a mirage: more data, less insight.

Figure and sources from:
Haupka, N., Culbert, J. H., Schniedermann, A., Jahn, N., & Mayr, P. (2025). Analysis of the Publication and Document Types in OpenAlex, Web of Science, Scopus, PubMed and Semantic Scholar. Retrieved from https://arxiv.org/abs/2406.15154v2
