Valid OpenLibrary topics being deleted - why?


Tom Morris

May 29, 2013, 3:58:51 PM
to freebase...@googlegroups.com
I asked this in another thread a few days ago, but am pulling it into its own thread for more visibility.  I've since run across more instances of deleted topics such as:


Since the MDOs are called:

Openlibrary_cleanup

this appears to be intentional, but I don't remember any discussion of it.

What topics have been deleted?  Why?

Tom


On Sat, May 25, 2013 at 6:07 PM, Tom Morris <tfmo...@gmail.com> wrote:
I'll echo Thad's request for a description of the OpenLibrary load process, but I've also got a specific question.  Why is stuff being deleted?  For example, several books and book editions from this author disappeared over the weekend:


They were linked to the wrong author, but I'm not sure just deleting them outright is the correct solution.

Tom


On Fri, May 24, 2013 at 3:53 PM, Thad Guidry <thadg...@gmail.com> wrote:
BTW, we don't have much documented about what we pull in from the Open Library Project: whether it's only new records, how we handle merges, what we do with orphan Topic records no longer in the Open Library Project, etc. A paragraph describing the basic ingestion strategy would be helpful to know.

Could someone on the Data Team take three minutes to type up a quick paragraph on the wiki page? http://wiki.freebase.com/wiki/Open_Library_Project

which is linked to from the main Data Sources wiki page.

Thanks, Data Team!
 


On Fri, May 24, 2013 at 12:10 PM, Michael Masouras <maso...@google.com> wrote:
We are putting Freebase into read-only mode again, starting now. The read-only period will end Saturday 10pm California time unless we get through our data loads faster.

The data being loaded is music-, Wikipedia-, and Open Library-related.

Michael




--
-Thad
http://www.freebase.com/view/en/thad_guidry



Gordon Mackenzie

May 29, 2013, 7:55:28 PM
to freebase...@googlegroups.com
Tom, Thad et al.

Sorry about the miscommunication regarding the Open Library deletions. The community should have been notified of this deletion operation last month, but was not.

We've had increasing problems with poor reconciliation due to the OpenLibrary data over the past few years, and we've spent a good deal of internal staff resources merging, fixing, and deleting OL topics and assertions where possible. This was noticeably hampering our efforts to increase book data coverage over the past two years, and we have found the new incoming data to be of appreciably higher quality than the existing OL data.

As is already known, Freebase stopped importing OL data in 2009 (I believe) for the very same reason: the overall quality was lower than we had hoped. No further data will be imported from OL.

It was decided to target any OL-sourced Author, Book, or Book Edition topic where no human curation or additional data assertion was ever made, and to delete the topics that met those criteria. We hope the increased volume and breadth of the higher-quality incoming data will offset some of the losses in book edition, book, and author data from OL. About 750K OL topics were deleted in this operation.

Gordon

Gordon Mackenzie | Schema Wrangler (Ontologist) |  gmac...@google.com |  
 

Tom Morris

Jun 1, 2013, 10:25:03 AM
to freebase...@googlegroups.com
Thanks for the update, Gordon.  Three quarters of a million topics is a lot.  Deleting them without notice or discussion is very unfortunate.  It kind of makes the whole flag-then-vote mechanism look silly.

I think there are two, fairly separate, issues here:

1. What is, or should be, the relationship among Freebase, its community, Google, and the Knowledge Graph?
2. What is the best way to improve the state of book metadata?

There's no question that Google invests far more in Freebase than anyone else and certainly has the right (well, the legal right anyway) to do whatever it wants with it.  Having said that, I personally believe that Freebase has benefited greatly from the contributions of its community over the years and that the net benefit to Google of rebuilding a vibrant community would outweigh the cost.  It might be a community of interested organizations rather than individual contributors, but paying out of pocket by the fact, even if it's only at a rate of $0.50/hr, doesn't really scale and doesn't do as good a job as focused domain expertise or pre-curated data.  In addition to the dumb labor and the smart graph algorithms, I still think there's a need for domain expertise either applied directly or in the form of pre-curated data to be ingested.  Part of cultivating the willingness of people to contribute that data and expertise is being respectful of how it's treated.  I could write lots and lots on this, but I'll leave it there.

Book metadata is a mess not just in Freebase but globally -- intergalactically, even.  Google Books has historically been terrible, but has gradually improved over the years. Still, many of the GB keys which were added to Freebase just 404 now when you try to follow the links and a number of other issues remain, so they're still not "there."  OpenLibrary data is composed of a variety of sources ranging from very good (the Library of Congress) to complete crap (Amazon), but the good thing is that they retain full provenance of where the data came from.  There are a bunch of ISFDB books which were imported as editions with no associated works.  There are a bunch of things which have been cleaned up in ISFDB or Stanford Library which remain bad in Freebase.  There are books in Wikipedia which are untyped in Freebase.  The list goes on and on...

The real question isn't "How bad is it now?" but "How do we efficiently make it better?"  Deleting everything and starting over is one approach, but it's certainly no worse than the ancient SFMOMA or Olympic Athlete data loads which were cleaned up in situ, although they were much smaller in scale.  Looking at the stated deletion criteria though:


It was decided to target any OL-sourced Author, Book, or Book Edition topic where no human curation or additional data assertion was ever made, and to delete the topics that met those criteria. We hope the increased volume and breadth of the higher-quality incoming data will offset some of the losses in book edition, book, and author data from OL. About 750K OL topics were deleted in this operation.

I can't help but wonder whether this is really the most efficient path forward.  For example, the criteria sound like they would leave intact all of the conflated OL author records which have been manually flagged for splitting, or which have been edited to remove a work or two and move it to the correct author.  Did all of the books whose titles I edited to fix their leading articles automatically receive a stay of execution, no matter how bad the underlying data?

There are a bunch of obviously bad things, like authors with multiple LCNAF IDs (although some of them represent pseudonyms and are OK) and non-book titles (dumpbins, calendars, blank diaries, etc.), which could have been cleaned up first, incrementally, to see how much that would improve things.

All of the deleted topics I've come across so far have looked like perfectly valid books & book editions, which makes me wonder just how much good data got thrown out with the bad.  What was the breakdown of topic types anyway? i.e., how many authors vs. book editions?
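A rough way to get that kind of breakdown would be a handful of MQL count queries against the mqlread service. The sketch below is illustrative only: it assumes the OpenLibrary keys live under an /authority/openlibrary namespace (the actual namespace may differ) and uses YOUR_API_KEY as a placeholder for a real Freebase API key.

    # Sketch: count Freebase topics that still carry OpenLibrary keys, by type.
    # Assumptions (not verified): OL keys are stored under /authority/openlibrary,
    # and YOUR_API_KEY is a placeholder, not a real credential.
    import json
    import urllib.parse
    import urllib.request

    MQLREAD = "https://www.googleapis.com/freebase/v1/mqlread"

    def count_ol_topics(freebase_type, api_key="YOUR_API_KEY"):
        # "return": "count" asks MQL for the number of matches instead of the
        # matching topics; the result comes back as a one-element list.
        query = [{
            "type": freebase_type,
            "key": [{"namespace": "/authority/openlibrary", "value": None}],
            "return": "count",
        }]
        params = urllib.parse.urlencode({"query": json.dumps(query), "key": api_key})
        with urllib.request.urlopen(MQLREAD + "?" + params) as resp:
            return json.load(resp)["result"]

    for t in ("/book/author", "/book/book", "/book/book_edition"):
        print(t, count_ol_topics(t))

Run per type, that would give a rough picture of how many OL-keyed authors, books, and editions remain after the cleanup.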

I've got to admit, the whole thing is pretty frustrating, but I'm hoping there's a reasonable path forward from here.

Tom

Jason Douglas

Jun 1, 2013, 3:45:06 PM
to freebase...@googlegroups.com
On Sat, Jun 1, 2013 at 7:25 AM, Tom Morris <tfmo...@gmail.com> wrote:
Thanks for the update, Gordon.  Three quarters of a million topics is a lot.  Deleting them without notice or discussion is very unfortunate.  It kind of makes the whole flag-then-vote mechanism look silly.

I think there are two, fairly separate, issues here:

1. What is, or should be, the relationship among Freebase, its community, Google, and the Knowledge Graph?
2. What is the best way to improve the state of book metadata?

There's no question that Google invests far more in Freebase than anyone else and certainly has the right (well, the legal right anyway) to do whatever it wants with it.  Having said that, I personally believe that Freebase has benefited greatly from the contributions of its community over the years and that the net benefit to Google of rebuilding a vibrant community would outweigh the cost.  It might be a community of interested organizations rather than individual contributors, but paying out of pocket by the fact, even if it's only at a rate of $0.50/hr, doesn't really scale and doesn't do as good a job as focused domain expertise or pre-curated data.  In addition to the dumb labor and the smart graph algorithms, I still think there's a need for domain expertise either applied directly or in the form of pre-curated data to be ingested.  Part of cultivating the willingness of people to contribute that data and expertise is being respectful of how it's treated.  I could write lots and lots on this, but I'll leave it there.

I'm only going to speak to your #1, because I've been thinking about it too (and I'm no expert on #2 ;-)...

I think you touched on the core issue...  Freebase started with a Wikipedia-ish edit model that was very focused on "onesy-twosy" editing of individual topics.  Basically, that's still the model we have today.

I think Freebase's "single source of truth, transactional store" approach might not actually be the best long-term approach if the bulk of contributions are really going to come from independent, pre-curated datasets, as you say.  For those dataset owners/communities, especially the ones whose data is constantly changing, dealing with syncing changes is so onerous (not to mention recon), why bother?  For Wikipedia, MBZ, etc. we've taken on that burden ourselves, but as you say, that doesn't scale.

As you might suspect, this tension even exists within Google and so we do often take a fundamentally different "compositional" approach in those cases... where source graphs exist (and update) independently and then are overlaid on each other based on independent reconciliation evidence.

We've been exposing this approach publicly in Freebase a tiny bit, even.  This is what's going on these days with the Wikipedia blurbs and some of the UN/World Bank datasets that appear in the Search and Topic APIs, but not in MQL.  That's because they're composed in, rather than written to, graphd.  Both Wikipedia and the UN update their data a lot, independently, and yet that doesn't put extra strain on Freebase.  It also makes fixing issues in bulk much easier.  In the case of OpenLibrary, you could see making a group decision about whether to include the entire dataset, none of it, or some subset based on rules that could change over time.
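To make the compositional idea concrete, here is a toy sketch (invented names and data, not the actual Google implementation): each source graph keeps its own facts, reconciliation evidence maps source IDs to a canonical topic, and a view is composed at read time, so a whole source can be switched on or off by policy rather than by bulk writes.

    # Toy illustration of a compositional, read-time overlay (not the real system).
    # Hypothetical source graphs: {source_id: {property: value}}
    openlibrary = {"OL123M": {"title": "Example Book", "pages": 123}}
    wikipedia = {"Q1234": {"title": "Example Book", "blurb": "Example blurb"}}

    # Reconciliation evidence: (source name, source id) -> canonical topic id
    recon = {
        ("openlibrary", "OL123M"): "/m/0abc",
        ("wikipedia", "Q1234"): "/m/0abc",
    }

    # Policy: which sources are composed in, in precedence order (earlier wins)
    enabled_sources = [("wikipedia", wikipedia), ("openlibrary", openlibrary)]

    def compose(topic_id):
        """Overlay all enabled sources into a single view of one topic."""
        view = {}
        for name, graph in enabled_sources:
            for source_id, facts in graph.items():
                if recon.get((name, source_id)) == topic_id:
                    for prop, value in facts.items():
                        view.setdefault(prop, value)  # keep higher-precedence value
        return view

    print(compose("/m/0abc"))                      # facts from both sources, overlaid
    enabled_sources = [("wikipedia", wikipedia)]   # dropping OL is a policy change,
    print(compose("/m/0abc"))                      # not a bulk delete

The point of the sketch is that including OpenLibrary (all of it, none of it, or a rule-based subset) would live in the composition policy, not in destructive edits to the base graph.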

The obvious challenge with trying to embrace this approach more broadly would be how to handle gardening and onesy-twosy edits, which presumably don't entirely disappear as a use case.  Options for this are being explored, and I think it really is a solvable problem... but it would mean a pretty fundamentally different approach to the Freebase project, and I'd be curious to hear your (and others') thoughts on that.

Dan Scott

Apr 6, 2014, 10:27:52 PM
to freebase...@googlegroups.com


On Saturday, June 1, 2013 10:25:03 AM UTC-4, Tom Morris wrote:
Thanks for the update, Gordon.  Three quarters of a million topics is a lot.  Deleting them without notice or discussion is very unfortunate.  It kind of makes the whole flag-then-vote mechanism look silly.

I think there are two, fairly separate, issues here:

1. What is, or should be, the relationship among Freebase, its community, Google, and the Knowledge Graph?
2. What is the best way to improve the state of book metadata?

There's no question that Google invests far more in Freebase than anyone else and certainly has the right (well, the legal right anyway) to do whatever it wants with it.  Having said that, I personally believe that Freebase has benefited greatly from the contributions of its community over the years and that the net benefit to Google of rebuilding a vibrant community would outweigh the cost.  It might be a community of interested organizations rather than individual contributors, but paying out of pocket by the fact, even if it's only at a rate of $0.50/hr, doesn't really scale and doesn't do as good a job as focused domain expertise or pre-curated data.  In addition to the dumb labor and the smart graph algorithms, I still think there's a need for domain expertise either applied directly or in the form of pre-curated data to be ingested.  Part of cultivating the willingness of people to contribute that data and expertise is being respectful of how it's treated.  I could write lots and lots on this, but I'll leave it there.

Book metadata is a mess not just in Freebase but globally -- intergalactically, even.  Google Books has historically been terrible, but has gradually improved over the years. Still, many of the GB keys which were added to Freebase just 404 now when you try to follow the links and a number of other issues remain, so they're still not "there."  OpenLibrary data is composed of a variety of sources ranging from very good (the Library of Congress) to complete crap (Amazon), but the good thing is that they retain full provenance of where the data came from.  There are a bunch of ISFDB books which were imported as editions with no associated works.  There are a bunch of things which have been cleaned up in ISFDB or Stanford Library which remain bad in Freebase.  There are books in Wikipedia which are untyped in Freebase.  The list goes on and on...

Confession: I am a systems librarian, and am well aware of the state of open book metadata, well... pretty much everywhere.

To attempt to atone for that, I've been working with the W3C Schema.org BibExtend group with a focus on enabling library systems such as Evergreen, Koha, and VuFind to express clean schema.org structured data via RDFa. I'm reasonably pleased with the progress so far, but now need to move my implementations beyond expressing literals within a linked data silo to linking out, and Freebase seemed like a good first place to link to for editions of books.
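For anyone following along, the markup involved looks roughly like the sketch below: schema.org/Book expressed in RDFa, with sameAs linking out of the catalogue silo to a Freebase topic. The MID and ISBN here are placeholders, not real identifiers.

    <div vocab="http://schema.org/" typeof="Book">
      <span property="name">Example Book Title</span>
      by
      <span property="author" typeof="Person">
        <span property="name">Example Author</span>
      </span>
      <span property="isbn">0000000000</span>
      <!-- link from the library catalogue record to the reconciled Freebase topic -->
      <link property="sameAs" href="http://www.freebase.com/m/0000000" />
    </div>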

However, I was surprised at just how few book instances exist in Freebase to which I can link. The thread that I'm jumping off of appears to be the last one that discussed the state of book metadata, so I was wondering if there have been any updates on either of the issues that Tom brought up. I would love to help.

Tom Morris

Apr 7, 2014, 9:46:48 AM
to freebase...@googlegroups.com
On Sun, Apr 6, 2014 at 10:27 PM, Dan Scott <den...@gmail.com> wrote:

However, I was surprised at just how few book instances exist in Freebase to which I can link. The thread that I'm jumping off of appears to be the last one that discussed the state of book metadata, so I was wondering if there have been any updates on either of the issues that Tom brought up. I would love to help.

I won't repeat/extend my earlier rant, but I'd specifically be interested in hearing what happened to the supposedly much better quality data that was going to be added to the graph to replace everything that was deleted.  This was supposedly imminent almost a year ago.

Tom 