Investigating missing country information in OpenAlex — and a method to recover some of it
How author affiliation history can significantly close the country-attribution gap
If you build a tool that visualises which countries publish in a given academic journal — and you source your data from OpenAlex — you'll eventually open a journal page and find a country-stacked-bar chart that adds up to far less than the journal's actual paper count. The remaining papers haven't disappeared; they show up in the "papers per year" line above the stack but contribute nothing to the country trends because OpenAlex's institutional graph doesn't yet have a country recorded for them.
For one particular journal we audited while building Journal Trends, 1,156 of its 1,456 OpenAlex-indexed works (≈ 79 %) were of unknown country origin — leaving the country chart showing only about a fifth of the actual catalogue. We don't yet know how typical this is across publishing as a whole (we plan to find out — see the end of the post), but the size of the gap on a single journal was striking enough to investigate properly.
This post lays out:
- Which OpenAlex fields we audited — and why only one provides a meaningful fallback
- The author-affiliation-history trick that can recover an additional ~12 percentage points of country coverage (~24 % of the missing papers that list authors)
- The trade-offs and what we're shipping in Journal Trends
- A planned journal-wide audit
What we audited
For that one journal, we pulled the full OpenAlex record (not the slim projection we usually fetch) for every paper and asked: is there ANY field, anywhere, that gives us a country signal we're not already using?
| Source | Papers covered | Useful as a fallback? |
|---|---|---|
| authorships[].institutions[].country_code | 300 / 1,456 (21 %) | n/a (baseline) |
| authorships[].countries[] | 300 / 1,456 (21 %) | No: exact same papers, derived from the same data |
| Top-level countries_distinct_count | 300 / 1,456 (21 %) | No: same |
| raw_affiliation_strings[] | 349 / 1,456 (24 %) | Marginal: +49 papers via brittle text-parsing for country names |
| Top-level institutions_distinct_count > 0 | 1,026 / 1,456 (70 %) | No: it counts institutions but reveals nothing about which countries they are in |
So the standard fields are a dead end. Any approach that stays within a single OpenAlex Work record tops out at the 21 % baseline for this journal, or about 24 % with brittle parsing of raw affiliation strings.
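The structured-field rows of the table can be reproduced with a check along the following lines. This is a minimal sketch, not the audit script itself: it assumes each work is a full OpenAlex Work record already parsed into a Python dict, the helper names (`has_institution_country`, `has_authorship_country`, `coverage`) are made up for illustration, and the text-parsing of raw affiliation strings is left out.

```python
from typing import Callable, Iterable

Work = dict  # assumption: a full OpenAlex Work record parsed from JSON


def has_institution_country(work: Work) -> bool:
    """True if any authorship has an institution with a country_code."""
    return any(
        inst.get("country_code")
        for auth in work.get("authorships") or []
        for inst in auth.get("institutions") or []
    )


def has_authorship_country(work: Work) -> bool:
    """True if any authorship carries a non-empty countries[] list."""
    return any(auth.get("countries") for auth in work.get("authorships") or [])


# One predicate per audited field; keys are labels for the report only.
CHECKS: dict[str, Callable[[Work], bool]] = {
    "institutions[].country_code": has_institution_country,
    "authorships[].countries": has_authorship_country,
    "countries_distinct_count": lambda w: (w.get("countries_distinct_count") or 0) > 0,
    "institutions_distinct_count": lambda w: (w.get("institutions_distinct_count") or 0) > 0,
}


def coverage(works: Iterable[Work]) -> dict[str, int]:
    """Count how many works each field check covers."""
    counts = {name: 0 for name in CHECKS}
    for work in works:
        for name, check in CHECKS.items():
            if check(work):
                counts[name] += 1
    return counts
```

Dividing each count by the number of works gives the percentages in the "Papers covered" column.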
The author trick
OpenAlex's /authors/<id> endpoint returns each author's affiliation history: a list of {institution, years[]} entries showing where they've been affiliated and when. So even if Paper X (2010, by Author A) has no institution attached, Paper Y (2010, in a different journal, by Author A) might, and A's profile records that affiliation with the year range.
For each paper with authors but no country attribution, we can:
- Pull the author IDs from authorships[].author.id
- Fetch each author from /authors/<id>
- Walk their affiliations[] list looking for an entry whose years[] covers the paper's publication_year
- If found, take the institution.country_code and credit it to the paper
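The steps above can be sketched as follows. This is an illustrative outline rather than the Journal Trends implementation: it assumes the author records have already been fetched from /authors/<id> and parsed into dicts, and `authors_by_id`, `country_from_author` and `infer_countries` are hypothetical names.

```python
from typing import Optional


def country_from_author(author: dict, year: int) -> Optional[str]:
    """Return the country_code of the first affiliation whose years[] includes `year`."""
    for aff in author.get("affiliations") or []:
        if year in (aff.get("years") or []):
            code = (aff.get("institution") or {}).get("country_code")
            if code:
                return code
    return None


def infer_countries(work: dict, authors_by_id: dict[str, dict]) -> set[str]:
    """Collect year-matched countries across all of a work's authors."""
    year = work.get("publication_year")
    if year is None:
        return set()
    found: set[str] = set()
    for auth in work.get("authorships") or []:
        author_id = (auth.get("author") or {}).get("id")
        author = authors_by_id.get(author_id)  # None for unknown or missing IDs
        if author:
            code = country_from_author(author, year)
            if code:
                found.add(code)
    return found
```

When an author has several affiliation entries covering the same year, this sketch takes the first one that carries a country code. That tie-breaking rule is an assumption; crediting all matching countries would be an equally defensible choice.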
Recovery rate on a 50-paper sample
We ran this on a 50-paper sample drawn from the journal's 726 "papers-with-authors-but-no-country" subset (the 430 papers with no authors at all are unrecoverable through this method).
| Outcome | Share | Note |
|---|---|---|
| Country recovered via year-matched author affiliation | 24 % (12 papers) | Reliable; year-matched institutional history is OpenAlex's own data |
| Still no country after fallback | 76 % (38 papers) | Author either has no profile (affiliations[] empty) or no entry covering the paper's year |
Of the 96 distinct authors fetched, only 45 (47 %) had any populated affiliations[] history at all; the rest are "ghost" author records with just an ID and a name. The recoverability of any given paper depends heavily on whether its co-authors happen to be actively profiled researchers.
Extrapolated to the whole journal: coverage would move from 21 % → ~33 % — a 12-percentage-point lift, every point of which is grounded in OpenAlex's own institutional graph.
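The extrapolation is simple arithmetic on the figures already quoted. A short sketch, assuming the 24 % sample rate holds across all 726 papers that have authors but no country (a 50-paper sample leaves real uncertainty around that rate):

```python
total_works = 1456
baseline_covered = 300          # works with a recorded country
with_authors_no_country = 726   # candidates for the author fallback
sample_recovery_rate = 12 / 50  # 24 % recovered in the sample

# Scale the sample rate up to every candidate paper.
recovered = sample_recovery_rate * with_authors_no_country
baseline_share = baseline_covered / total_works
lift = recovered / total_works
projected_share = baseline_share + lift

print(f"recovered ~ {recovered:.0f} papers")
print(f"coverage {baseline_share:.1%} -> {projected_share:.1%} (+{lift * 100:.1f} pp)")
```

This works out to roughly 174 recovered papers, with coverage moving from 20.6 % to 32.6 %.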
Recovered countries from the sample looked plausible — researchers showed institutional affiliations consistent with their field and era. Unrecovered cases skewed heavily toward older single-author papers where the author has no OpenAlex profile beyond a name.
Trade-offs to be honest about
- The recovered data is inferred, not stated. Author A was affiliated with University X in 2015; we attach that to a 2015 paper by A in a journal that didn't record affiliations. Maybe A was a visiting fellow elsewhere when this specific paper was written. The inference is plausible but not certain.
- Author affiliation history is itself OpenAlex's reconstruction. It can be wrong. Garbage-in, garbage-out is a real risk.
- Older / smaller-journal authors lack profiles entirely. The papers we can't recover skew heavily toward older work — exactly where the gap tends to be biggest.
- The integrity-research use case requires care. If you're using country attribution to flag "this journal is suspiciously concentrated in country X", an inferred country isn't quite the same as a paper-level recorded one. Consider weighting them differently or visually distinguishing the source.
What we ship in Journal Trends
For the immediate release we're doing the honest minimum:
- Add an "Unknown" bucket in the country stacked-bar chart, in a darker grey distinct from the existing "Other" tail. Papers with no country attribution land there — so the stack always sums to total works. Users see at a glance how much of a journal's catalogue lacks country attribution rather than seeing data silently dropped.
- Footnote on the chart linking to this post.
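One way to build chart data with an "Unknown" bucket is sketched below. The function name `country_counts_by_year` is hypothetical, and splitting a multi-country paper fractionally across its countries is our assumption for the sketch, chosen so that each year's stack sums to that year's total works.

```python
from collections import defaultdict


def country_counts_by_year(works):
    """Per-year country tallies, with an "Unknown" bucket for unattributed works."""
    by_year = defaultdict(lambda: defaultdict(float))
    for work in works:
        year = work.get("publication_year")
        countries = {
            inst["country_code"]
            for auth in work.get("authorships") or []
            for inst in auth.get("institutions") or []
            if inst.get("country_code")
        }
        if not countries:
            # No country anywhere on the record: count it visibly, don't drop it.
            by_year[year]["Unknown"] += 1
        else:
            # Assumption: fractional credit, so each work contributes exactly 1.
            for code in countries:
                by_year[year][code] += 1 / len(countries)
    return by_year
```

If the chart instead gives every country on a paper a full count, the stack can exceed the total, so the "sums to total works" property depends on this counting rule.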
For a follow-on release:
- The cached-author-affiliations fallback described above, with a small authors_cache table so each author is looked up once across the lifetime of Journal Trends.
- Source transparency in the tooltip: when a paper's country comes from author-affiliation inference (not its own institutions), label it differently so users can weight the signal accordingly.
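A cache of this kind could look roughly like the sketch below, using SQLite from Python's standard library. Only the table name authors_cache comes from the plan above; the column layout, the function names and the `fetch` callback (standing in for a call to /authors/<id>) are assumptions for illustration.

```python
import json
import sqlite3
from typing import Callable


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the cache database and its single table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS authors_cache ("
        " author_id TEXT PRIMARY KEY,"
        " affiliations_json TEXT NOT NULL)"
    )
    return conn


def get_affiliations(
    conn: sqlite3.Connection,
    author_id: str,
    fetch: Callable[[str], list],
) -> list:
    """Return an author's affiliations[], calling `fetch` only on a cache miss."""
    row = conn.execute(
        "SELECT affiliations_json FROM authors_cache WHERE author_id = ?",
        (author_id,),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    affiliations = fetch(author_id)  # hypothetical: wraps the /authors/<id> request
    conn.execute(
        "INSERT OR REPLACE INTO authors_cache VALUES (?, ?)",
        (author_id, json.dumps(affiliations)),
    )
    conn.commit()
    return affiliations
```

Empty histories are cached too, so "ghost" authors are not refetched. Adding a fetched-at timestamp column would allow entries to be refreshed later if OpenAlex back-fills a profile.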
What's next: a journal-wide audit
A single-journal observation tells us the problem exists; it doesn't tell us how widespread or severe it is across academic publishing as a whole, nor how much of it the author-fallback method actually fixes at scale.
We plan to run this audit across every journal currently in Journal Trends to measure:
- The actual distribution of the coverage gap — what share of journals have <50 % country attribution, what share are above 90 %, where the mean sits
- How the gap correlates with journal characteristics — age, publisher, discipline, whether the journal is delisted by Scopus
- The empirical recovery rate of the author-fallback method at full scale, not just on a 50-paper sample
- Which kinds of journals benefit most from the fallback — likely younger, active-researcher-heavy fields where authors are well-profiled
Results in a future post.
The bigger picture
For research-integrity work — which is why Journal Trends exists — the rule should be: prefer transparent partial coverage over invisible completeness. A chart that says "we don't know about this slice" is more useful than a chart that quietly counts only the slice it knows about and pretends that's the whole picture.
The author-affiliation trick won't close the gap for every journal — a 1990s practitioner journal might stay only partially recoverable forever — but it's an honest lift, free of new infrastructure beyond a cache table. And it scales gracefully: as OpenAlex back-fills more author profiles over time, the recoverable fraction grows on its own.