How do you decide to merge entities?


Ashish Uppala

Aug 10, 2022, 4:56:55 PM
to OpenAlex users
Hey again,

When it comes to merging, e.g., two affiliations, is your deduplication process automated or manual, i.e., someone reports a duplicate and you add it to the merged entities list?

I'm mostly asking because I noticed for example that "I182273258" (Florida State University College of Arts and Sciences) is being merged into "I103163165" (Florida State University), which I can understand, but does this mean that all instances of colleges within universities are now treated as that main "university"? Is this going to be applied consistently across unis?

Hope that made sense. It would be very useful to know how merging works and the thinking behind it. We're probably going to mirror the affiliation list for now to avoid weirdness, but I'm trying to think through the implications of preserving some of the older affiliations instead of explicitly deleting them, in case it turns out we do need them.

Also, semi-related: if you merge two things but realize it was wrong or want to backtrack, what's the mechanism for "un-merging"? Just re-creating it as a new affiliation, updating mappings to works, etc., and then re-ingesting it in the next dump?

Ashish

Richard Orr

Aug 11, 2022, 2:08:09 PM
to Ashish Uppala, OpenAlex users
Hi Ashish,

Merging Institutions is entirely a manual process; so far there have been only 43 merges. Most of these were true duplicates (variant spellings of the same name) or acquisitions like Sun Microsystems -> Oracle. We don't have plans to systematically merge colleges or departments into their parent institutions. The cases where we've done this so far have been to replace an affiliation that has no Research Organization Registry (ROR) ID with one that does.

As for un-merging, that's a great question for the group. The messy part is correcting all the internal references; if we were to un-merge two authors we would have to re-credit the "new" author with their old Works. But all that would be up to us.

We don't have a plan for the IDs yet. I can see two ways it could go:
  1. Create a new ID for the un-merged entity as you suggest, with all the properties of the old one.
  2. Literally un-merge it by removing the merge relationship in our database. The reborn entity would appear in a new snapshot partition with its old ID and be removed from the merged entities list, and in the API the old ID would appear everywhere it used to.
I believe either would work as well as the other for anyone using a short-lived API response or keeping an OpenAlex snapshot and cross referencing only between Entities in the snapshot. But in real-world use cases I think #2 makes it clearer that the "new" entity isn't new, and makes it easier to maintain any mappings from OpenAlex Entities to their real-world counterparts.
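To make the difference between the two options concrete, here is a minimal sketch from a consumer's point of view. It assumes the merged entities list can be modeled as a plain mapping from old ID to the ID it was merged into; the `resolve` helper, the `local_records` mapping, and the placeholder new ID are all hypothetical and not part of OpenAlex. The two Institution IDs are the ones from the question above.

```python
# Hypothetical model of the merged entities list: old ID -> ID it was merged into.
merges = {"I182273258": "I103163165"}

# Hypothetical downstream mapping from OpenAlex IDs to real-world records.
local_records = {"I182273258": "fsu-arts-sciences", "I103163165": "fsu"}


def resolve(entity_id, merges):
    """Follow merge links until reaching an ID that has not been merged away."""
    seen = set()
    while entity_id in merges:
        if entity_id in seen:
            raise ValueError(f"merge cycle at {entity_id}")
        seen.add(entity_id)
        entity_id = merges[entity_id]
    return entity_id


def local_record_for(entity_id, merges):
    """Look up the local record for an ID after applying any merges."""
    return local_records.get(resolve(entity_id, merges))


# While merged, the college's ID resolves to the university.
merged_view = local_record_for("I182273258", merges)

# Option 1: un-merge by minting a new ID (placeholder value, not a real ID).
# The old ID stays in the merge list, so old mappings still point at the parent,
# and the new ID has no local record until someone remaps it.
NEW_ID = "I_HYPOTHETICAL_NEW"
option1_merges = dict(merges)
option1_view_old = local_record_for("I182273258", option1_merges)
option1_view_new = local_record_for(NEW_ID, option1_merges)

# Option 2: un-merge by removing the merge relationship.
# The old ID resolves to itself again and the existing mapping just works.
option2_merges = {k: v for k, v in merges.items() if k != "I182273258"}
option2_view = local_record_for("I182273258", option2_merges)
```

Under these assumptions, option 2 restores the old mapping with no extra work, while option 1 leaves the old ID pointing at the parent and requires a new mapping for the new ID.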

Does one sound better than the other to you?

Thanks,
Richard


--
Richard Orr
Lead Developer - Unpaywall, OpenAlex
OurResearch: We build tools to make scholarly research more open, connected, and reusable for everyone.

Ashish Uppala

Aug 11, 2022, 2:24:44 PM
to OpenAlex users
Hi Richard,

Thanks for the quick response. This all makes sense to me. If we find authors or other records to be duplicated, but they haven't been merged on your end, what's the best way to report that to you?

Regarding "un-merging": I think either of those works. Personally I have a slight preference for #1 because of how I implemented merging, but I can see why #2 is preferred to preserve the old identifiers. Either is fine with me.

For context, here is how I deal with merging: I have a table that tracks all the merged entities from your snapshot (same schema as your file), and I use it as a job table to process data deletions for the old records after the latest snapshot is ingested. If you later deleted a merged entity record, I'd need to know that something was actually "deleted", since right now my ingestion just upserts rows into the job table.

This isn't a big deal though, I can easily update my ingestion to figure that out, so I'm happy for someone else to weigh in if they have a stronger opinion.
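For anyone handling merges the same way, here is a rough sketch of that kind of job table, including the extra step of noticing when a merge row disappears between snapshots. It uses SQLite only for illustration and is not my actual implementation; the table name `merged_jobs`, the column names, the `processed` flag, the dates, and the `I_EXAMPLE_DUP` ID are all made up for the example.

```python
import sqlite3


def make_job_table(conn):
    """Create a job table keyed by the old (merged-away) ID."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS merged_jobs (
               id TEXT PRIMARY KEY,
               merge_into_id TEXT NOT NULL,
               merge_date TEXT,
               processed INTEGER NOT NULL DEFAULT 0
           )"""
    )


def sync_merges(conn, snapshot_rows):
    """Upsert the snapshot's merge rows; return old IDs that vanished (un-merged)."""
    snapshot_ids = set()
    for merge_date, old_id, new_id in snapshot_rows:
        snapshot_ids.add(old_id)
        # Insert new rows, then refresh existing ones without resetting 'processed'.
        conn.execute(
            "INSERT OR IGNORE INTO merged_jobs (id, merge_into_id, merge_date) "
            "VALUES (?, ?, ?)",
            (old_id, new_id, merge_date),
        )
        conn.execute(
            "UPDATE merged_jobs SET merge_into_id = ?, merge_date = ? WHERE id = ?",
            (new_id, merge_date, old_id),
        )
    known_ids = {row[0] for row in conn.execute("SELECT id FROM merged_jobs")}
    unmerged = sorted(known_ids - snapshot_ids)
    conn.executemany("DELETE FROM merged_jobs WHERE id = ?", [(i,) for i in unmerged])
    conn.commit()
    return unmerged


def pending_deletions(conn):
    """Old IDs whose local records still need to be deleted."""
    rows = conn.execute("SELECT id FROM merged_jobs WHERE processed = 0 ORDER BY id")
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
make_job_table(conn)

# First snapshot: two merges (dates and the second ID are placeholders).
first_unmerged = sync_merges(conn, [
    ("2022-08-01", "I182273258", "I103163165"),
    ("2022-08-01", "I_EXAMPLE_DUP", "I103163165"),
])
pending_first = pending_deletions(conn)

# Later snapshot: one merge row is gone, which we treat as an un-merge.
second_unmerged = sync_merges(conn, [
    ("2022-08-01", "I_EXAMPLE_DUP", "I103163165"),
])
pending_second = pending_deletions(conn)
```

The point is just that a pure upsert never sees removals, so the sync step has to compare the table against the full snapshot list to detect them.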

Thanks again,
Ashish

Richard Orr

Aug 11, 2022, 4:20:45 PM
to OpenAlex users
Thanks for explaining your process. I didn't expect that many people would handle merges statefully; we're trying to wave our hands and make the old Entity go away but I can see why you'd want to track everything you can.

The best place to report bugs or data errors is sup...@openalex.org. Thanks!

Richard