Profile RNS--Author Disambiguation and Linux Port

Tanu Malik

Feb 2, 2012, 5:26:46 PM

to profi...@googlegroups.com, ian...@gmail.com

We have deployed a Profiles system at our institution, and we want to customize Profiles for use by researchers in non-medical domains, such as computer science, social science, and astronomy.

We had preliminary success in importing directory information and publications for some of the researchers into Profiles.

As we continue to investigate how to use the system with non-PubMed data repositories, a big issue is author disambiguation.

The Profiles system provides an author disambiguation service that is PubMed specific.

Since its source code is not available, we cannot adapt it for non-medical domains.

Will the author disambiguation source code be open-sourced, or made available through some agreement, so that we can make it work with other publication repositories?

On a related note, are there any ongoing efforts to develop a Linux port of the system?

Some IT departments are Linux proficient, and software maintainability is a huge issue.

Tanu

Weber, Griffin M

Feb 3, 2012, 5:02:58 PM

to profi...@googlegroups.com, ian...@gmail.com

First of all, the presentation I gave about the disambiguation service at the last Profiles Users Group meeting is now on the Profiles website at http://profiles.catalyst.harvard.edu on the Community page.

Yes, the plan is to release the disambiguation service code at some point after we have completed the Profiles RNS 1.0 Software. However, the code is one small part of it, and by itself it is not very useful, especially when considering other types of publication repositories.

A second component of the PubMed disambiguation service is the infrastructure needed to maintain an up-to-date copy of PubMed data. You don't just click a link and download PubMed. You first need to obtain an NLM license, and the database is distributed as hundreds of files that must be assembled on your server. In addition, there are nightly update files that must be applied to keep the data accurate. That is just the raw PubMed data. The disambiguation code runs on a separate database, which is derived from the PubMed data and contains pre-calculated/aggregate tables. Both databases require significant hardware to run. We have dedicated development and production servers, each costing around $50,000. Even on that type of hardware, it takes a couple of days to update the derived disambiguation data from the raw PubMed data. To avoid downtime, we actually have two copies of the database. One is always a live database, which is used by the disambiguation web service. The second is offline and being updated with the latest PubMed data. Each week we flip the two databases so that the data available through the web service is never more than one week old. We currently have none of this documented, and that documentation would be needed for another institution to really set up their own local service. Recombinant uses a somewhat different architecture. They invested a lot of resources into that, and my staff at Harvard spent numerous hours walking them through the setup process.
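
The two-copy arrangement described above can be sketched roughly as follows. This is only an illustration of the flip pattern, not the actual Profiles infrastructure: the class names, the method names, and the representation of update files as plain strings are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DerivedDatabase:
    """Hypothetical stand-in for one copy of the derived disambiguation database."""
    name: str
    loaded_updates: list = field(default_factory=list)

    def apply_updates(self, update_files):
        # In a real deployment this rebuild is the slow, multi-day step.
        self.loaded_updates.extend(update_files)


class FlipFlopService:
    """Answers queries from the live copy while the offline copy is refreshed."""

    def __init__(self, first, second):
        self.live, self.offline = first, second
        self.all_updates = []  # every update file received so far

    def weekly_cycle(self, new_update_files):
        self.all_updates.extend(new_update_files)
        # The offline copy was live during the previous cycle, so it also
        # missed that cycle's files. Apply everything it has not loaded yet.
        pending = [f for f in self.all_updates
                   if f not in self.offline.loaded_updates]
        self.offline.apply_updates(pending)
        # Swap roles only after the refresh has finished, so queries never
        # reach a partially updated database.
        self.live, self.offline = self.offline, self.live

    def query_source(self):
        """Name of the copy that the web service would currently read from."""
        return self.live.name
```

Calling `weekly_cycle` once per week keeps the served data at most one cycle old, at the cost of storing and refreshing two full copies.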

A third part of disambiguation is the probability model. This is derived experimentally and is specific to a data source. The equations, variables, and coefficients we use for PubMed articles probably would not work well for another kind of repository. It's like any other kind of statistical modeling. For example, the Framingham Risk Score is a well-known model for predicting heart disease. However, you can't just plug breathing rate into the variable for blood pressure and expect it to correctly predict the risk of lung cancer. Each model requires collecting training data sets, selecting variables, and applying different statistical approaches before finding something that works.
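
To make the point about source-specific models concrete, here is a minimal sketch that assumes a simple logistic form. It is not the Profiles model: the variable names (`affiliation_match`, `shared_coauthors`), the coefficient values, and the function name are all invented for illustration. Real coefficients would have to be fitted to labeled training data from one particular source.

```python
import math

# Hypothetical coefficients for one data source. These numbers are made up
# for the example and are not taken from Profiles RNS.
EXAMPLE_SOURCE_MODEL = {
    "intercept": -3.0,
    "affiliation_match": 2.5,
    "shared_coauthors": 0.8,
}


def match_probability(features, model):
    """Logistic estimate that a publication belongs to a given person."""
    z = model["intercept"]
    for name, value in features.items():
        # A variable the model was never fitted on cannot simply be plugged in.
        if name not in model:
            raise KeyError(f"model has no coefficient for {name!r}")
        z += model[name] * value
    return 1.0 / (1.0 + math.exp(-z))


# Evidence of the kind the model was fitted on gives a usable score.
p = match_probability(
    {"affiliation_match": 1, "shared_coauthors": 2},
    EXAMPLE_SOURCE_MODEL,
)
```

A repository with different evidence, such as patents or course catalogs, would need its own variables and its own fitted coefficients, which is why the code alone does not transfer.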

Within Harvard we recently launched a project to expand Harvard's Profiles database to include faculty from the entire University. For that we are pulling data from a variety of sources, such as commercial publication data, patents, Harvard Course Catalog, etc. Every one of these requires person disambiguation. As we work on this project, we will make services/code available when possible. If anyone would like to talk to me about working with specific types of data or populations of researchers, then let me know.

Thanks,

Griffin Weber

From: profi...@googlegroups.com [profi...@googlegroups.com] On Behalf Of Tanu Malik [tan...@gmail.com]
Sent: Thursday, February 02, 2012 5:26 PM
To: profi...@googlegroups.com
Cc: ian...@gmail.com
Subject: [ProfilesRNS] Profile RNS--Author Disambiguation and Linux Port

Weber, Griffin M

Feb 3, 2012, 5:16:07 PM

to profi...@googlegroups.com, ian...@gmail.com

I am not aware of any efforts to develop a Linux port. However, I don't think it would be very difficult. Most of what the .NET code does is simply apply XSLT files to RDF or XML. The complexity is all embedded within the XSLT and data files, which are platform/language independent.

Converting the SQL Server code to Oracle or another database would be more difficult. To optimize performance, the code takes advantage of some of the specific ways SQL Server works with XML, which took us a long time to figure out. So, you might need a similar effort to get it to work well on a different database platform.

Thanks,

Griffin Weber

Tanu Malik

Feb 7, 2012, 4:03:46 PM

to profi...@googlegroups.com, ian...@gmail.com

Dear Griffin,

Thanks for your detailed response. We have a better understanding of
the issues at hand.