Looking at the code OpenAlex uses for calculating the FWCI, a couple of things jump out that could explain why your recalculation didn't give the same answers.
However, the two works you provided as an example should definitely be the same; there is no reason they would have differing FWCIs, so I think that's a bug -- or perhaps one item simply hasn't been updated in a while? It should be easy to identify these cases programmatically in the backend: all works with the same type, publication year, number of citations, and subfield should have the same FWCI. Perhaps it would be a good idea for the OA team to include this sanity check in their pipeline?
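For what it's worth, a user-side version of that check could look something like this (a minimal pandas sketch, assuming you've already flattened the API data into a DataFrame with `type`, `publication_year`, `cited_by_count`, `subfield_id`, and `fwci` columns -- the column names are mine, not an official schema):

```python
import pandas as pd

def find_inconsistent_fwci(works: pd.DataFrame) -> pd.DataFrame:
    # Works sharing the same type, publication year, citation count, and
    # subfield should all have the same FWCI; flag the strata where they don't.
    keys = ["type", "publication_year", "cited_by_count", "subfield_id"]
    spread = (works.assign(fwci=works["fwci"].round(2))  # rounding absorbs float noise
                   .groupby(keys)["fwci"]
                   .nunique())
    bad_strata = spread[spread > 1].reset_index()[keys]
    return works.merge(bad_strata, on=keys, how="inner")
```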
For the rest, here are some possibilities that might explain the difference between your FWCI and OpenAlex's:
1. Percentiles: On line 316 you can find the code for calculating these:
```python
w_pert = Window.partitionBy(['pub_year','work_type','subfield_id']).orderBy(F.col('total_citations'))
w_max = Window.partitionBy(['pub_year','work_type','subfield_id'])
```
They partition the data by `pub_year`, `work_type`, and `subfield_id`. As far as I can tell, they use the `total_citations` count to determine these percentiles, not the 4-year citation window mentioned in the FWCI calculation. They also don't filter out items without a journal; they just group by `work_type`, which I'll get into in the next point.
The actual percentile calculation is done on line 339, using `w_pert` and `w_max`:
```python
final_percentile_df = counts_all_citations \
    .withColumn('rank', F.rank().over(w_pert)) \
    .withColumn('max_rank', F.max(F.col('rank')).over(w_max)) \
    .join(normalized_counts, how='inner', on=['pub_year','subfield_id','work_type'])
```
Here's another difference: they use the `rank()` SQL function to calculate `rank` and `max_rank`, and then use these to determine the final values:
```python
final_percentile_df.withColumn("normalized_percentile", (F.col('rank')-1)/(F.col('max_rank')+1))
```
The `F.rank()` function does the same as your pandas code with `method='min'`. However, the way you calculate the percentage (with `pct=True`) is different, because OpenAlex uses a custom formula. Normally (as pandas does with `pct=True`) you'd compute percentile = rank / N, whereas OpenAlex uses (rank - 1) / (max_rank + 1) instead. I don't know for certain why, but it does have some advantages: papers with zero citations (i.e. rank 1) get exactly 0% as their percentile, while the naive approach gives them a small but nonzero value. At the top end, the OA method never reaches 100%, which also seems sensible -- what would a 100% score even mean otherwise? It's also closer to a CDF, I guess, making for a nicer spread.
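To make the difference concrete, here is a toy comparison with made-up citation counts for one stratum (just to illustrate the two formulas, not OpenAlex's actual data):

```python
import pandas as pd

# Made-up citation counts for a single (pub_year, work_type, subfield) stratum.
citations = pd.Series([0, 0, 0, 1, 2, 5, 10, 25])

rank = citations.rank(method='min')       # ties share the lowest rank, like F.rank()
max_rank = rank.max()

pandas_pct = citations.rank(method='min', pct=True)  # rank / N
openalex_pct = (rank - 1) / (max_rank + 1)           # OpenAlex's formula

print(pd.DataFrame({'citations': citations,
                    'pandas_pct': pandas_pct,
                    'openalex_pct': openalex_pct}))
# The zero-citation papers score 0.125 with rank/N but exactly 0.0 with the
# OpenAlex formula, and the top paper scores 1.0 vs roughly 0.78.
```

With realistic stratum sizes the top end creeps toward, but never reaches, 100%.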
2. As mentioned in the docs, they use different strata for conference proceedings and journal papers, even if both are marked as `article`, so you need to filter for journal items. This is definitely done differently by you and by the OpenAlex code, as it's a bit of a pain for API users to do. Your Python code uses the `primary_location` of each work: if that location has a source and its type == `journal`, keep the work, otherwise drop it (see the sketch after their code below). I think that's the most logical way to do it with the API data structure, but it misses items where `primary_location.source.type` is not `journal` even though the work did appear in a journal, which happens for various reasons: an improperly identified source, the journal source not being set as the `primary_location`, etcetera.
Now, I'm not certain what impact this has, as I don't know the exact details of the backend database OA uses, but the OpenAlex code pulls a `journal_id` for each work directly from their database, without traversing location -> source -> ... . So regardless of whether the underlying source data differs, that's definitely a different data structure, and I don't think it's a stretch to assume some data gets mangled, lost, and/or modified when this data is parsed from the backend database into the production API.
Here is the relevant code. On line 106 they pull the works data, including the `journal_id`, which is then used for the grouping of articles vs conference proceedings on lines 213 to 224:
```python
work_types = spark.read.parquet(f"{database_copy_save_path}/mid/work")\
    .filter(F.col('merge_into_id').isNull()) \
    .select(F.col('paper_id').alias('paper_reference_id'), 'type', 'journal_id')\
    .join(journal_types, on='journal_id', how='left') \
    .withColumn('work_type_final', F.when((F.col('type')=='article') & (F.col('journal_type')=='conference'), 'conference_article').otherwise(F.col('type'))) \
    .select('paper_reference_id', 'work_type_final')
work_types.cache().count()

# COMMAND ----------

display(work_types.groupBy('work_type_final').count().orderBy('count'))
```
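For comparison, here is roughly how you could approximate that grouping on the API side (a sketch only; it relies on `primary_location.source.type`, so it will not reproduce the backend join on `journal_id` exactly, and the mapping via the `conference` source type is my assumption based on the snippet above):

```python
from typing import Optional

def source_type(work: dict) -> Optional[str]:
    # Walk primary_location -> source -> type, tolerating missing pieces.
    loc = work.get("primary_location") or {}
    source = loc.get("source") or {}
    return source.get("type")

def work_type_final(work: dict) -> Optional[str]:
    # Mirror the rule above: an 'article' published via a conference source
    # is treated as a 'conference_article'; everything else keeps its type.
    if work.get("type") == "article" and source_type(work) == "conference":
        return "conference_article"
    return work.get("type")

def is_journal_article(work: dict) -> bool:
    # The filter described in your approach: keep works whose primary
    # location's source is a journal.
    return source_type(work) == "journal"

# Example with a made-up API record:
example = {"type": "article", "primary_location": {"source": {"type": "journal"}}}
print(work_type_final(example), is_journal_article(example))  # -> article True
```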
3. Finally, the FWCI is apparently only calculated once a month (judging by the title of the .py file). When you retrieve data from the API to calculate the expected citation counts, you're probably not working with the same data, as they update citation info continuously, not just once a month. As you mentioned, you're often dealing with low citation counts, so an increase in citations for the papers in this subset due to accumulated daily updates might cause some differences -- but these should be minor, especially compared to the previous two points.
Oh, and send my regards to Robin, Cees, and Mark if you see any of them around there :)
Cheers,
Samuel Mok
Universiteit Twente