Issues with OA FWCI and citation percentiles

Matthijs de Zwaan

Sep 3, 2025, 8:29:28 AM
to OpenAlex Community
Dear all,
I have some questions about the FWCI and citation-based percentiles. Following up on a question from a researcher at my university, I tried to reconstruct the FWCI, the citation percentiles, and the corresponding top-percentile groups, but I cannot reproduce them. I would very much appreciate your thoughts.

I do not have the infrastructure to recreate this for the entire database, so I decided to look at a small subset and downloaded all works with publication year 2018, of type article, subfield id 3107, and source_type journal (see here for the Python code). In my understanding of the documentation (https://help.openalex.org/hc/en-us/articles/24735753007895-Field-Weighted-Citation-Impact-FWCI), this should form one stratum for the FWCI and the normalised percentiles.

Using this sample, I reconstructed the FWCI by summing citations from the year of publication and the three years thereafter using the data in `counts_by_year`, and then normalizing by the average of that sum over all works in the sample. I take the OpenAlex FWCI directly from `fwci` (Python code here). Comparing the two FWCI versions, I find quite large differences: the difference between my version and the OA version is 10% (of the OA version) or less in only 230 cases, and in many cases it is over 50%. Often the number of citations is small, so a minor mistake (say, counting 2 citations instead of 1) can have a large effect, but cases with more citations are not uncommon. For example, https://openalex.org/W2728155641 (with 14 relevant citations) has an OA FWCI of 3.4, where I find an FWCI of 1.4.
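
In outline, my reconstruction looks like this (a simplified sketch of the linked code, assuming `works` holds the downloaded API records as dicts):

import statistics

def citations_first_4_years(work):
    # Citations in the publication year and the three years after it,
    # summed from the counts_by_year field of an API work record.
    pub_year = work["publication_year"]
    return sum(
        entry["cited_by_count"]
        for entry in work.get("counts_by_year", [])
        if pub_year <= entry["year"] <= pub_year + 3
    )

counts = [citations_first_4_years(w) for w in works]
expected = statistics.mean(counts)  # average over the whole stratum
my_fwci = {w["id"]: c / expected for w, c in zip(works, counts)}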

I also have problems recreating the citation counts and the citation percentiles in OpenAlex, even when using the OA data directly. Adding up all the citations in `counts_by_year` gives me different results than `cited_by_count`. For most works there is no difference, but there is one for more than 22k works.
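
The check itself is simple (sketch, with `works` as above; note that `counts_by_year` only covers the last ten years, but that should not matter for works published in 2018):

mismatches = [
    w["id"]
    for w in works
    if sum(e["cited_by_count"] for e in w.get("counts_by_year", []))
       != w["cited_by_count"]
]
print(len(mismatches), "of", len(works), "works differ")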

As for the percentiles, I see some surprising results. Works that seem to have the same number of citations can have very different percentiles. For example, https://openalex.org/W3104182342 and https://openalex.org/W2772706355 both have one incoming citation (both directly in `cited_by_count` and when manually summing the citations in `counts_by_year`), but the first is in percentile 76 and the second in 58. And in general, constructing the percentiles for this subset from either `cited_by_count` or `counts_by_year` gives me results that are quite different from the percentiles in OpenAlex.
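
My percentile construction is essentially this (sketch, with the sample in a DataFrame `df` that has one row per work and a `citations` column):

import pandas as pd

# method='min' gives tied works the same (lowest) rank;
# pct=True divides the rank by the number of works.
df["percentile"] = df["citations"].rank(method="min", pct=True)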

I have tried to inspect the source code (https://github.com/ourresearch/openalex-databricks/blob/65dab9c8a8d61d4d1718044e965dcfe452fac0ba/jobs/weekly_metric_creation/monthly_fwci_percentile_updates.py) but it is difficult to see exactly what is going on and where the differences with my code are.

I hope someone can help explain these differences. Thanks for your help!

Matthijs

Samuel Mok

Sep 3, 2025, 7:01:02 PM
to Matthijs de Zwaan, OpenAlex Community
Hiya Matthijs, 

Looking at the OpenAlex code for calculating the FWCI, a couple of things jump out that could explain why your recalculation didn't produce the same answers. However, the two works you gave as an example (https://openalex.org/W3104182342 and https://openalex.org/W2772706355) should definitely score the same; there is no reason they would have differing percentiles. So I think that's a bug -- or perhaps one item has not been updated for a while, or something like that? It should be easy to identify these cases programmatically in the backend: all works with the same type, publication year, number of citations, and subfield should end up with the same percentile (and FWCI). Perhaps a good idea for the OA team to include this sanity check in the pipeline?
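
Something like this could catch them (an untested sketch; `scores_df` and the `fwci` column name are my assumptions, the grouping columns match the script quoted below):

from pyspark.sql import functions as F

# Within each (type, year, citations, subfield) cell, every work should
# get the same score, so the number of distinct values should be 1.
inconsistent = (
    scores_df
    .groupBy('work_type', 'pub_year', 'total_citations', 'subfield_id')
    .agg(F.countDistinct('fwci').alias('n_distinct'))
    .filter(F.col('n_distinct') > 1)
)
inconsistent.show()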

For the rest, here are some possibilities that might explain the differences between your numbers and OpenAlex's:

1. Percentiles: On line 316 you can find the code for calculating these: 

w_pert = Window.partitionBy(['pub_year','work_type','subfield_id']).orderBy(F.col('total_citations'))
w_max = Window.partitionBy(['pub_year','work_type','subfield_id'])

They partition the data by `pub_year`, `work_type`, and `subfield_id`. As far as I can tell, they rank works by `total_citations`, i.e. lifetime citations, and not by the 4-year window mentioned in the FWCI calculation. They also don't filter out items without a journal; they just group by `work_type`, which I'll get into in the next point.
The actual percentile calculation is done on line 339, using w_pert and w_max:

final_percentile_df = counts_all_citations \
    .withColumn('rank', F.rank().over(w_pert)) \
    .withColumn('max_rank', F.max(F.col('rank')).over(w_max)) \
    .join(normalized_counts, how='inner', on=['pub_year','subfield_id','work_type'])

Here's another difference: they use the SQL `rank()` function to calculate the rank and max_rank, and then use these to determine the final values:

final_percentile_df.withColumn("normalized_percentile", (F.col('rank')-1)/(F.col('max_rank')+1))

The F.rank() function does the same as your pandas code with method='min'. However, the way you calculate the percentage (with pct=True) is different, as OpenAlex uses a custom formula. Normally (as pandas does with pct=True) you just compute percentile = rank / N. OpenAlex instead uses (rank - 1) / (max_rank + 1). I don't know for certain why, but it does have some advantages: papers with zero citations (i.e. rank 1) get exactly 0%, while the naive formula gives them a small positive value (1/N). At the top of the range, the OA method never reaches 100%, which seems nice as well, because what would a 100% score even mean otherwise? It's also closer to a CDF, I guess, making for a nicer spread.
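
A tiny example shows the difference between the two formulas (sketch):

import pandas as pd

citations = pd.Series([0, 0, 1, 1, 5])
rank = citations.rank(method="min")         # 1, 1, 3, 3, 5
max_rank = rank.max()

pandas_pct = rank / len(citations)          # 0.2, 0.2, 0.6, 0.6, 1.0
openalex_pct = (rank - 1) / (max_rank + 1)  # 0.0, 0.0, 0.33, 0.33, 0.67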

2. As mentioned in the docs, they use different strata for conference proceedings and journal papers, even if both are marked as `article`, so you need to filter for 'journal' items. This is definitely done differently by you and by the OpenAlex code, as it's a bit of a pain to do as a user. Your Python code uses the primary_location of each work: if that location has a source and its type == journal, keep the work, otherwise drop it. I think that's the most logical way to do it using the API data structure, but it misses items without primary_location.source.type == journal, which happens for various reasons: an improperly identified source, the journal source not being the primary_location for some reason, etcetera.
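
That is, roughly this (my sketch of the filter as I understand your code):

def is_journal_article(work):
    # Keep works whose primary_location points to a source of type journal.
    source = (work.get("primary_location") or {}).get("source") or {}
    return source.get("type") == "journal"

journal_works = [w for w in works if is_journal_article(w)]
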
Now, I'm not certain what impact this makes, as I don't know the exact details of the database OA uses in the backend, but the OpenAlex code pulls a journal_id for each work directly from their database, without traversing location -> source -> ... . So regardless of whether the source data is actually different, that's definitely a different data structure. I don't think it's a stretch to assume some data gets mangled, lost, and/or modified when this data is parsed from the backend database to production.
Here is the relevant code: on line 106 they pull the works data, including the journal_id:

works_1 = spark.read.parquet(f"{database_copy_save_path}/mid/work")\
    .dropDuplicates() \
    .filter(F.col('merge_into_id').isNull()) \
    .select(F.col('paper_id').alias('paper_reference_id'), F.col('publication_date').alias('pub_date'), 'type','journal_id')

This journal_id is used to determine the journal_type on line 207:

journal_types = spark.read.parquet(f"{database_copy_save_path}/mid/journal").filter(F.col('merge_into_id').isNull()) \
    .select('journal_id',F.col('display_name').alias('journal_name'), F.col('type').alias('journal_type'))

This is then used for the grouping of articles vs conference proceedings on lines 213 to 224:

work_types = spark.read.parquet(f"{database_copy_save_path}/mid/work")\
    .filter(F.col('merge_into_id').isNull()) \
    .select(F.col('paper_id').alias('paper_reference_id'), 'type', 'journal_id')\
    .join(journal_types, on='journal_id', how='left') \
    .withColumn('work_type_final',
                F.when((F.col('type')=='article') & (F.col('journal_type')=='conference'),
                       'conference_article').otherwise(F.col('type'))) \
    .select('paper_reference_id', 'work_type_final')

work_types.cache().count()

display(work_types.groupBy('work_type_final').count().orderBy('count'))

3. Finally, the FWCI is apparently recalculated once a month (going by the name of the .py file). When you retrieve the data from the API to calculate the expected number of citations, you're probably not seeing the same data, as citation info is updated regularly, not just once a month. As you mentioned, you're often dealing with low citation counts, so an increase in the number of citations for the papers in this subset, due to accumulated daily updates, might cause some differences -- but these should be minor, especially compared to the previous two points.

Oh, and send my regards to Robin, Cees, and Mark if you see any of them around there :)
Cheers,
Samuel Mok
Universiteit Twente

