Freebase+WebQuestions Neighbors Dataset


Petr Baudis

Jun 26, 2015, 2:38:24 PM
to qa-...@googlegroups.com
Hi!

For training models for QA over Freebase, I need a dataset that,
for each WebQuestions question, contains the path between the Freebase
base and answer-bearing entities. Right now, WebQuestions doesn't even
mention the answer-bearing Freebase entities. I took a look at Jacana's
Freebase dataset, but I couldn't find any data I could quite use there.
SEMPRE doesn't seem to carry this kind of data either.

(I suppose it won't be too hard to make a script that runs a 2-degree
exhaustive search, but figured I could shoot out a quick question
first so I don't build a redundant dataset.)
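
A minimal sketch of what such a 2-degree exhaustive search could look like, assuming the relevant Freebase triples have already been loaded into memory as (subject, property, object) tuples. The function names and the toy identifiers below are made up for illustration; this is not the script that was eventually written.

```python
from collections import defaultdict


def build_index(triples):
    """Index (subject, property, object) triples by subject."""
    index = defaultdict(list)
    for subj, prop, obj in triples:
        index[subj].append((prop, obj))
    return index


def paths_within_two_hops(index, question_entity, answer_entities):
    """Return every property path of length 1 or 2 that leads from
    question_entity to one of the answer entities."""
    answers = set(answer_entities)
    found = []
    for prop1, mid in index.get(question_entity, []):
        if mid in answers:
            found.append((prop1,))
        for prop2, obj in index.get(mid, []):
            if obj in answers:
                found.append((prop1, prop2))
    return found


# Toy data; the entity ids are placeholders, not real Freebase mids.
triples = [
    ("m.q", "people.person.place_of_birth", "m.city"),
    ("m.q", "people.person.spouse_s", "m.marriage"),
    ("m.marriage", "people.marriage.spouse", "m.spouse"),
]
index = build_index(triples)
print(paths_within_two_hops(index, "m.q", ["m.spouse"]))
```

The two-hop case matters because Freebase often routes a relation through an intermediate node, so a one-hop search alone would miss such answers.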


P.S.: What are your plans for knowledge bases to use after Jun 30, when
Freebase finally sunsets? It seems to me that Wikidata is still no
match for Freebase's comprehensiveness...

--
Petr Baudis
If you have good ideas, good data and fast computers,
you can do almost anything. -- Geoffrey Hinton

Jack Park

Jun 26, 2015, 3:06:37 PM
to Petr Baudis, qa-...@googlegroups.com
As I recall, the freebase data is downloadable as a tarball. Would that remain sufficient at least for early development?


Xuchen Yao

Jun 26, 2015, 5:33:02 PM
to Jack Park, Petr Baudis, qa-...@googlegroups.com
The Freebase Topic API returns you a JSON file on a particular topic. Then it's relatively easy to:

1. do string matching with WebQuestions answer node and JSON field
2. extract the full dictionary path from top node to the matched JSON field, this will require a recursive function.

If you need a dump of all JSON files with entities in WebQuestions, Jacana released it. If you need entities beyond those, you have to call the Freebase Topic API.
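
The two steps above can be sketched roughly as follows. This is only an illustration: `find_paths` is a made-up name, the `topic` dictionary is a hypothetical layout rather than the actual Topic API schema, and matching is a plain case-insensitive string comparison.

```python
def find_paths(node, answer, path=()):
    """Yield the dictionary-key path from the top node to every string
    leaf that matches `answer` (case-insensitive, whitespace-trimmed)."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_paths(value, answer, path + (key,))
    elif isinstance(node, list):
        for item in node:
            # List positions are skipped so the path holds only dictionary keys.
            yield from find_paths(item, answer, path)
    elif isinstance(node, str) and node.strip().lower() == answer.strip().lower():
        yield path


# Hypothetical topic document; the layout is an assumption for illustration.
topic = {
    "id": "/m/example",
    "property": {
        "/people/person/place_of_birth": {
            "values": [{"text": "Honolulu", "id": "/m/example_city"}]
        }
    },
}

print(list(find_paths(topic, "honolulu")))
```

Exact string matching will miss answers that differ in spelling or formatting, so a real script would probably need a looser comparison.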

On Wikidata:

I'm hoping to build an internal version of the Freebase graph database using its RDF dumps. It's a huge engineering challenge and needs powerful machines. If someone is seriously interested in this and has resources, I'm happy to take it offline and have some more in-depth discussion.

Xuchen

Petr Baudis

Jun 26, 2015, 8:52:48 PM
to Xuchen Yao, Jack Park, qa-...@googlegroups.com
On Fri, Jun 26, 2015 at 02:32:42PM -0700, Xuchen Yao wrote:
> The Freebase Topic API returns you a JSON file on a particular topic. Then
> it's relatively easy to:
>
> 1. do string matching with WebQuestions answer node and JSON field
> 2. extract the full dictionary path from top node to the matched JSON
> field, this will require a recursive function.

Ah, so you are using the JSON API. I'd better hurry if I want to use it
too, I guess. I generally prefer to deal with RDF, but you mentioned that
the results ordering helps.

> If you need a dump of all JSON files with entities in WebQuestions, Jacana
> released it. If you need more entities beyond, then have to call the
> Freebase Topic API.

In Jacana I found the list of question entities, maybe even answer
entities (I wasn't sure) but not graph paths connecting the two...

> On Wikidata:
>
> I'm hoping to build an internal version of the Freebase graph database
> using its RDF dumps. It's a huge engineering challenge and needs powerful
> machines. If someone is seriously interested in this and has resources, I'm
> happy to take it offline and have some more in-depth discussion.

Do you mean just getting a SPARQL endpoint for Freebase RDF?
YodaQA runs a public one and has some docs about how to set one up:

https://github.com/brmson/yodaqa/tree/master/data/freebase

(Feel free to use it *lightly* for other purposes than using YodaQA.
Unfortunately, it really is a lot of data and more complex queries can
easily take minutes.)
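
As an illustration of the kind of query such an endpoint would be asked, here is a small sketch that only builds a SPARQL query string for two-hop property paths between two entities. It assumes the standard Freebase RDF namespace; the helper name and the mids are placeholders, and no particular endpoint URL or client library is implied.

```python
FREEBASE_NS = "http://rdf.freebase.com/ns/"


def two_hop_query(question_mid, answer_mid, limit=100):
    """Build a SPARQL query listing property pairs (?p1, ?p2) that link
    question_mid to answer_mid through one intermediate node."""
    return (
        f"PREFIX ns: <{FREEBASE_NS}>\n"
        "SELECT DISTINCT ?p1 ?p2 WHERE {\n"
        f"  ns:{question_mid} ?p1 ?mid .\n"
        f"  ?mid ?p2 ns:{answer_mid} .\n"
        "}\n"
        f"LIMIT {limit}"
    )


# Placeholder mids, not real Freebase entities.
print(two_hop_query("m.0aaa", "m.0bbb", limit=10))
```

Queries with unbound properties like this touch a lot of data, so keeping a small LIMIT seems prudent on a shared endpoint.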


(Many people seem to use Virtuoso; maybe it's faster or better.
But when I was using Virtuoso, just for DBpedia, it kept randomly
corrupting its own database during imports. Maybe I was using a
buggy version, but I gave up on it.)

Petr Baudis

Jun 26, 2015, 8:54:24 PM
to Jack Park, qa-...@googlegroups.com
On Fri, Jun 26, 2015 at 12:06:36PM -0700, Jack Park wrote:
> As I recall, the freebase data is downloadable as a tarball. Would that
> remain sufficient at least for early development?

Certainly. But it will grow more and more stale, and the data is
so big that setting up a database on top of it may be a hassle
for many, as mentioned in the other email.

Petr Baudis

Xuchen Yao

Jun 26, 2015, 10:26:31 PM
to Petr Baudis, Jack Park, qa-...@googlegroups.com
On Fri, Jun 26, 2015 at 5:52 PM, Petr Baudis <pa...@ucw.cz> wrote:
> On Fri, Jun 26, 2015 at 02:32:42PM -0700, Xuchen Yao wrote:
> > The Freebase Topic API returns you a JSON file on a particular topic. Then
> > it's relatively easy to:
> >
> > 1. do string matching with WebQuestions answer node and JSON field
> > 2. extract the full dictionary path from top node to the matched JSON
> > field, this will require a recursive function.
>
> Ah, so you are using the JSON API. I better hurry if I'd use it too,
> I guess. I generally prefer to deal with RDF, but you mentioned that
> the results ordering helps.
>
> > If you need a dump of all JSON files with entities in WebQuestions, Jacana
> > released it. If you need more entities beyond, then have to call the
> > Freebase Topic API.
>
> In Jacana I found the list of question entities, maybe even answer
> entities (I wasn't sure) but not graph paths connecting the two...

That's correct. You have to write some scripts yourself.
 

> > On Wikidata:
> >
> > I'm hoping to build an internal version of the Freebase graph database
> > using its RDF dumps. It's a huge engineering challenge and needs powerful
> > machines. If someone is seriously interested in this and has resources, I'm
> > happy to take it offline and have some more in-depth discussion.
>
> Do you mean just getting a SPARQL endpoint for Freebase RDF?
> YodaQA runs a public one and has some docs about how to set one up:
>
> https://github.com/brmson/yodaqa/tree/master/data/freebase
>
> (Feel free to use it *lightly* for other purposes than using YodaQA.
> Unfortunately, it really is a lot of data and more complex queries can
> easily take minutes.)
>
> (Many people seem to use Virtuoso, maybe it's faster/better/...
> But when I was using Virtuoso - just for DBpedia -, it always ended up
> randomly corrupting its own database throughout imports, maybe I was
> using some buggy version but I gave up on it.)


No, I was thinking of building a `graphd` (Google's internal version of graph DB, backbone of Freebase) type of graph database to host Freebase.

This is meant to solve a simple task: given any two entities in the graph, what's the shortest path between them? A generic graph DB should handle this easily. But with triple-store DBs, one has to fire *a lot* of queries while walking the graph. It might be fine to do this with a single thread. But what if 100 people ask you questions at the same time...
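
To make the query-volume concern concrete, here is a rough breadth-first search sketch that counts how many neighbor lookups it issues. The `neighbors` callable stands in for one round trip to a triple store, the toy graph is invented, and edges are followed only in the direction given.

```python
from collections import deque


def shortest_path(neighbors, start, goal):
    """Breadth-first search. Returns (path, lookups), where lookups is
    the number of neighbor queries issued (one per expanded node)."""
    if start == goal:
        return [start], 0
    parent = {start: None}
    queue = deque([start])
    lookups = 0
    while queue:
        node = queue.popleft()
        lookups += 1  # each expansion would be one query against the store
        for nxt in neighbors(node):
            if nxt in parent:
                continue
            parent[nxt] = node
            if nxt == goal:
                # Walk the parent links back to the start node.
                path = [nxt]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1], lookups
            queue.append(nxt)
    return None, lookups


# Toy directed graph standing in for the knowledge base.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}


def toy_neighbors(node):
    return graph.get(node, [])


print(shortest_path(toy_neighbors, "a", "e"))
```

Even on this five-node graph a three-hop path takes four lookups; on a graph with high-degree nodes the count grows quickly with every hop, which is the cost being described for triple stores.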

Xuchen

Petr Baudis

Jun 27, 2015, 5:57:21 AM
to Xuchen Yao, Jack Park, qa-...@googlegroups.com
On Fri, Jun 26, 2015 at 07:26:11PM -0700, Xuchen Yao wrote:
> > > On Wikidata:
> > >
> > > I'm hoping to build an internal version of the Freebase graph database
> > > using its RDF dumps. It's a huge engineering challenge and needs powerful
> > > machines. If someone is seriously interested in this and has resources,
> > > I'm happy to take it offline and have some more in-depth discussion.
> >
> > Do you mean just getting a SPARQL endpoint for Freebase RDF?
..snip..
> No, I was thinking of building a `graphd` (Google's internal version of
> graph DB, backbone of Freebase) type of graph database to host Freebase.

Ah, I see! Yeah, it's a shame graphd isn't opensource.

> This is used to solve a simple task: given any two entities in the graph,
> what's the nearest path between them? A generic graph DB should handle this
> easily. But for triple store DBs, one has to fire *a lot* of queries while
> walking the graph. It might be fine to do this with a single thread. But
> what if 100 people ask you questions at the same time...

I see. I have been wondering about this task further down the road as
well. But don't some RDF databases support this as an extension?

https://www.mail-archive.com/virtuos...@lists.sourceforge.net/msg02426.html
http://wimmics.inria.fr/node/27

It still seems a lot easier to me to do this as a relatively simple
modification of an existing graph DB; it actually *is* a generic graph
DB, just with an API somewhat shaped around different usage modes.

Jack Park

Jun 27, 2015, 9:39:43 AM
to Petr Baudis, Xuchen Yao, qa-...@googlegroups.com
Systap's Blazegraph is a TinkerPop Blueprints graph database running on top of their Bigdata quadstore RDF engine. I've used Bigdata for RDF-based topic maps in the past, and did some experiments with Blazegraph recently.

Jack


Petr Baudis

Jul 7, 2015, 7:19:57 PM
to qa-...@googlegroups.com
Hi!

On Fri, Jun 26, 2015 at 08:38:19PM +0200, Petr Baudis wrote:
> For training models for QA over Freebase, I need a dataset that,
> for each WebQuestions question, contains the path between the Freebase
> base and answer-bearing entities. Right now, WebQuestions doesn't even
> mention the answer-bearing Freebase entities. I took a look at Jacana's
> Freebase dataset, but I couldn't find any data I could quite use there.
> SEMPRE doesn't seem to carry this kind of data either.
>
> (I suppose it won't be too hard to make a script that runs a 2-degree
> exhaustive search, but figured I could shoot out a quick question
> first so I don't build a redundant dataset.)

We cleaned up the WebQuestions dataset: we created finer splits,
assigned ids to all questions, stored the answers in a more reasonable
format, annotated each question with extra data (most importantly
these Freebase property paths between concepts and answers), and added
various Python scripts to make common tasks easy:

https://github.com/brmson/dataset-factoid-webquestions

Of course, WebQuestions is probably going to be too easy pretty soon,
but hopefully most of this might be reusable for better datasets too.

This is not v1.0 yet. Feedback (and even pull requests) are very
welcome!

> P.S.: What are your plans for knowledge bases to use after Jun 30 when
> the Freebase finally sunsets? It seems to me that Wikidata is still no
> match for the Freebase comprehensiveness...

Luckily, as of now, the web interface is still up...

Petr Baudis