Thanks for the update, Gordon. Three quarters of a million topics is a lot. Deleting them without notice or discussion is very unfortunate, and it makes the whole flag-then-vote mechanism look kind of silly.
I think there are two, fairly separate, issues here:
1. What is, or should be, the relationship among Freebase, its community, Google, and the Knowledge Graph?
2. What is the best way to improve the state of book metadata?
There's no question that Google invests far more in Freebase than anyone else and certainly has the right (well, the legal right anyway) to do whatever it wants with it. Having said that, I personally believe that Freebase has benefited greatly from the contributions of its community over the years, and that the net benefit to Google of rebuilding a vibrant community would outweigh the cost. It might be a community of interested organizations rather than individual contributors, but paying for data entry out of pocket, even if it's only at a rate of $0.50/hr, doesn't really scale and doesn't do as good a job as focused domain expertise or pre-curated data. In addition to the dumb labor and the smart graph algorithms, I still think there's a need for domain expertise, either applied directly or in the form of pre-curated data to be ingested. Part of cultivating people's willingness to contribute that data and expertise is being respectful of how it's treated. I could write lots and lots on this, but I'll leave it there.
Book metadata is a mess, not just in Freebase but globally -- intergalactically, even. Google Books has historically been terrible, but has gradually improved over the years. Still, many of the GB keys which were added to Freebase just 404 now when you try to follow the links, and a number of other issues remain, so they're still not "there." OpenLibrary data comes from a variety of sources ranging from very good (the Library of Congress) to complete crap (Amazon), but the good thing is that they retain full provenance of where the data came from. There are a bunch of ISFDB books which were imported as editions with no associated works. There are a bunch of things which have been cleaned up in ISFDB or Stanford Library but remain bad in Freebase. There are books in Wikipedia which are untyped in Freebase. The list goes on and on...
The real question isn't "How bad is it now?" but "How do we efficiently make it better?" Deleting everything and starting over is one approach, but the current data is certainly no worse than the ancient SFMOMA or Olympic Athlete data loads, which were cleaned up in situ (although those were much smaller in scale). Looking at the stated deletion criteria, though, I can't help but wonder whether this is really the most efficient path forward. For example, the criteria sound like they would leave intact all of the conflated OL author records which have been manually flagged for splitting, or which have been edited to remove a work or two and move it to the correct author. Did all of the books whose titles I edited to fix their leading articles automatically receive a stay of execution, no matter how bad the underlying data?
There are a bunch of obviously bad things, like authors with multiple LCNAF IDs (although some of them represent pseudonyms and are OK), non-book titles (dumpbins, calendars, blank diaries, etc.), and so on, which could have been cleaned up first, incrementally, to see how much that would improve things. A rough sketch of the kind of check I mean is below.
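To make the incremental idea concrete, here's roughly what I have in mind, in Python. The input file and its layout are made up for illustration (a two-column TSV of author MID / LCNAF ID pairs pulled out of a dump); the point is just to surface the multi-ID authors for manual review rather than deleting anything wholesale.

    # Flag authors carrying more than one LCNAF ID so they can be reviewed
    # (or split) by hand. The input is a hypothetical TSV extract with one
    # "author_mid<TAB>lcnaf_id" pair per line.
    import csv
    from collections import defaultdict

    lcnaf_ids = defaultdict(set)
    with open("author_lcnaf.tsv", newline="") as f:   # hypothetical extract from a dump
        for author_mid, lcnaf_id in csv.reader(f, delimiter="\t"):
            lcnaf_ids[author_mid].add(lcnaf_id)

    suspects = {mid: ids for mid, ids in lcnaf_ids.items() if len(ids) > 1}
    for mid, ids in sorted(suspects.items()):
        print(mid, ", ".join(sorted(ids)))            # candidates for review, not deletion
    print(len(suspects), "authors with multiple LCNAF IDs")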
All of the deleted topics I've come across so far have looked like perfectly valid books & book editions, which makes me wonder just how much good data got thrown out with the bad. What was the breakdown of topic types anyway? i.e. how many authors vs. book editions?
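For what it's worth, that breakdown could be pulled together with something as simple as the sketch below. Both input files are hypothetical (a list of the deleted MIDs and a mid-to-type extract from a pre-deletion dump), so this is just to illustrate the question, not a claim about how the deletion was actually run.

    # Count deleted topics by type, e.g. /book/author vs. /book/book_edition.
    # Both inputs are hypothetical extracts: deleted_mids.txt (one MID per line)
    # and topic_types.tsv ("mid<TAB>type" rows from a pre-deletion dump).
    from collections import Counter

    with open("deleted_mids.txt") as f:
        deleted = {line.strip() for line in f if line.strip()}

    type_counts = Counter()
    with open("topic_types.tsv") as f:
        for line in f:
            mid, topic_type = line.rstrip("\n").split("\t")
            if mid in deleted:
                type_counts[topic_type] += 1

    for topic_type, count in type_counts.most_common():
        print("%8d  %s" % (count, topic_type))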
I've got to admit, the whole thing is pretty frustrating, but I'm hoping there's a reasonable path forward from here.
Tom