Fixing the RDF Dumps


Shawn Simister

unread,
Aug 29, 2013, 2:13:09 PM8/29/13
to Freebase Discuss
I've been working on a new version of the Freebase RDF dumps which will address many of the issues that have been discovered since we first started publishing the data as RDF. 

You can see the test dump (21GB gzip) that I did yesterday here:
http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-2013-08-27-test.gz

The biggest change in these dumps is that the format has switched to N-Triples from Turtle. In practice this is a very minimal change since N-Triples is a subset of Turtle which follows the same one-triple-per-line format that we have now. The benefits of the new format are:
  • No need for @prefix statements at the top of the dump. This allows the dumps to be arbitrarily split into chunks which can be processed independently.
  • No more CURIEs means that RDF triple stores can load the data faster without having to expand each URI.
  • The N-Triples syntax is also used by DBpedia and being considered by Wikidata. Sharing a data format with these projects makes it easier to build tools and apps that can pull from all three data sources.
To compare how this new format looks, the current RDF dumps look like this:

@prefix ns: <http://rdf.freebase.com/ns/>.
@prefix key: <http://rdf.freebase.com/key/>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.

ns:m.012rkqx    ns:type.object.type    ns:common.topic.
ns:m.012rkqx    ns:type.object.name    "High Fidelity"@en.
ns:m.012rkqx    ns:type.object.type    ns:music.single.

Whereas the new RDF dumps look like this:

<http://rdf.freebase.com/ns/m.012rkqx>    <http://rdf.freebase.com/ns/type.object.type>    <http://rdf.freebase.com/ns/common.topic>    .
<http://rdf.freebase.com/ns/m.012rkqx>    <http://rdf.freebase.com/ns/type.object.name>    "High Fidelity"@en    .
<http://rdf.freebase.com/ns/m.012rkqx>    <http://rdf.freebase.com/ns/type.object.type>    <http://rdf.freebase.com/ns/music.single>    .

As I mentioned above, this new release addresses many of the issues raised on this mailing list in the past couple of months. 
  • We now differentiate between xsd:date, xsd:dateTime, xsd:time, xsd:gYear and xsd:gYearMonth (see the illustrative literals just after this list).
  • URL property values are cleaned up so that any invalid characters (e.g. a trailing \) get escaped. This means that URLs can be parsed without crashing.
  • Keys are properly URI escaped.
  • A tab has been added before the period at the end of each line to accommodate older parsers.
  • Newlines and tabs have been escaped in descriptions and keys.
  • Errors with improper prefixes being applied to URIs are no longer an issue.
  • No more @prefix headers showing up in the middle of the dumps (or anywhere now).
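For illustration (a sketch of the literal syntax only, not lines copied from an actual dump), date-valued objects now carry explicit datatypes such as:

"1991-02-03"^^<http://www.w3.org/2001/XMLSchema#date>
"1991-02"^^<http://www.w3.org/2001/XMLSchema#gYearMonth>
"1991"^^<http://www.w3.org/2001/XMLSchema#gYear>
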
There are still some bugs that I'm hunting down including:
  • Missing triples linking the human-readable schema IDs to MIDs.
  • Some geolocations have weird property paths as predicates.
  • Occasional triples which are missing an object.
If you rely on the Freebase data dumps please take a look at the new format and let me know what you think. If no one objects, this code will go live and appear in next week's dumps.

--
Shawn Simister

Knowledge Developer Relations
Google

Thad Guidry

unread,
Aug 29, 2013, 4:43:01 PM8/29/13
to Freebase Discuss
Exciting stuff, Shawn and team !!

As it happens, last night I pulled down the old RDF dump....beginning my own trademark tricks on cleanup... but good to know you're already headed down that path.

I will download this N-Triples format version and let you know if I see anything that is a show stopper for me and my processing.

+1 for the N-triples format
( hoping they are complete - Will you have syntax validation in the export pipeline ?  or do you just shove it out ? )

-- 

Shawn Simister

unread,
Aug 29, 2013, 5:25:50 PM8/29/13
to Freebase Discuss
Thanks Thad. Right now I just write it out and upload it straight to Cloud Storage. Validation happens separately and does not currently block the upload from happening.


--
You received this message because you are subscribed to the Google Groups "Freebase Discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to freebase-discu...@googlegroups.com.
To post to this group, send email to freebase...@googlegroups.com.
Visit this group at http://groups.google.com/group/freebase-discuss.
For more options, visit https://groups.google.com/groups/opt_out.

Dan

unread,
Aug 30, 2013, 4:36:57 AM8/30/13
to freebase...@googlegroups.com
Thanks Shawn this looks great, I'll have a look over this in the next few days and let you know.

- Dan

Dan

unread,
Sep 2, 2013, 4:59:07 PM9/2/13
to freebase...@googlegroups.com
I've had a look over this now and generally it's good.

The uncompressed size has jumped quite a bit, to 250 GB. That's OK for how we're processing it, but it might be quicker for us if the compression were splittable, which gzip isn't. Currently we pull it, then re-compress it into a splittable form in S3.

I'm not sure if you explained this or not, but what do the URIs with ".." in them mean? Here's one as an example:


Then finally, type.object.key values now seem to be strings rather than URIs. Is that correct?

Also, will this affect the RDF web endpoint? It's quite handy that those two produce the same output.

Thanks,
Dan

Tom Morris

unread,
Sep 2, 2013, 6:37:25 PM9/2/13
to freebase...@googlegroups.com
On Mon, Sep 2, 2013 at 4:59 PM, Dan <danha...@gmail.com> wrote:

It's jumped quite a bit in the uncompressed size to 250gb, but that's ok for how we're processing but it might be quicker for us if the compression was splittable which gzip isn't. Currently we pull it, then re-compress into a splittable form into S3.

Shawn Simister

unread,
Sep 3, 2013, 6:37:23 PM9/3/13
to Freebase Discuss
Hmm, since these dumps are generated by MapReduce, the final archive is actually already composed of many separate gzip files cat-ed together, as described in that Stack Overflow post. I guess all that's needed is some sort of index of the offsets to split it apart again. Would a txt file of offsets be sufficient?
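
For anyone unfamiliar with that property of gzip: concatenated gzip members form a single valid gzip stream, which is why the dump can be split back apart on member boundaries without recompressing anything. A minimal sketch of this behaviour (hypothetical file name, standard library only):

import gzip

# Two independently compressed gzip members, concatenated byte-for-byte.
with open("combined.nt.gz", "wb") as out:
    out.write(gzip.compress(b"first member\n"))
    out.write(gzip.compress(b"second member\n"))

# The concatenation is itself a valid gzip file and decompresses as one stream.
with gzip.open("combined.nt.gz", "rb") as f:
    print(f.read())  # b'first member\nsecond member\n'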

I looked into the properties with the ".." in the middle and I'll be removing them in the next version of the dumps. It's just duplicate data from CVT values that got in by mistake. You can safely ignore any triple with ".." in it without losing any information.

I also investigated why the RDF dumps didn't run last night and it looks like I mixed up something in a config file - d'oh. I kicked off the job manually just now so there should be a new dump in 5-6 hours and regular dumps again every Monday morning after that.



Dan

unread,
Sep 4, 2013, 5:21:32 AM9/4/13
to freebase...@googlegroups.com
That could work; splittable LZO, which we're using at the moment, works in a similar way. The LzoInputFormat for Hadoop has support for an index file here


There's an LzoIndex class that has the method for creating the indexes to be used by this.

So we could write a custom input format like that to load it from a single file, or I guess more simply we could split the single file into chunks based on the index file and store them separately in S3 or HDFS. Either way, I think that would be great.

Thanks for looking into the ".." issue too, that sounds good.

 - Dan

Paul Houle

unread,
Sep 6, 2013, 2:28:36 PM9/6/13
to freebase...@googlegroups.com
      I had some trouble with the Freebase-specific part of my toolchain but once I fixed that,  it looks like the new dump file is more compatible than ever before.  From the -27 dump I wound up with 1.1 billion valid facts and about 290,000 invalids.

      At some point this month I'm going to be taking a close look at the dates,  but things look good from here.

Shawn Simister

unread,
Sep 6, 2013, 4:15:02 PM9/6/13
to Freebase Discuss
That's great to hear Paul. Do you know if the predicates with ".." in them count towards those 290k invalid statements? If that's the case, then we may be able to bring that number way down in the next version of the dumps.


On Fri, Sep 6, 2013 at 11:28 AM, Paul Houle <onto...@gmail.com> wrote:
      I had some trouble with the Freebase-specific part of my toolchain but once I fixed that,  it looks like the new dump file is more compatible than ever before.  From the -27 dump I wound up with 1.1 billion valid facts and about 290,000 invalids.

      At some point this month I'm going to be taking a close look at the dates,  but things look good from here.


Shawn Simister

unread,
Sep 6, 2013, 6:08:38 PM9/6/13
to Freebase Discuss
Here's a list of the sizes in bytes of each of the 200 gzip files that make up the freebase-rdf-2013-08-27-test.gz dump: https://docs.google.com/file/d/0B5V_3YYf9JnIaHp5ckdvUklXRDA/edit?usp=sharing

You can use something like the following Python script to quickly split the file apart:

# Splits the concatenated dump into its original gzip members, given a file
# listing the size in bytes of each member (one size per line).
import sys
input_file = sys.argv[1]
offset_file = sys.argv[2]
output_file = input_file.replace('.gz', '')
# split() with no arguments tolerates a trailing newline in the sizes file
file_sizes = [int(s) for s in open(offset_file).read().split()]
with open(input_file, "rb") as f:
    offset = 0
    part = 1
    for file_size in file_sizes:
        # each part is just a byte-for-byte slice of the original file
        f.seek(offset)
        byte_data = f.read(file_size)
        with open(output_file+'-part-'+str(part)+'-of-'+str(len(file_sizes))+'.gz', 'wb') as o:
            o.write(byte_data)
        offset += file_size
        part += 1

If this turns out to be the best approach I can look into getting the list of file sizes automatically generated with the weekly dumps.

Another alternative is that you can manually determine the offsets in any of the Freebase RDF data dumps by scanning through and looking for the 2-byte magic number (1F 8B) that appears at the beginning of every gzip file. 

import sys

input_file = sys.argv[1]
with open(input_file, "rb") as f:
    b = ' '
    while len(b) > 0:
        b = f.read(1)
        if b == '\x1F':
            b = f.read(1)
            if b == '\x8B':
                # Found the two-byte gzip magic number; report the offset of
                # its first byte (skip offset 0, the start of the file).
                offset = f.tell() - 2
                if offset > 0:
                    print offset

This will be slower because you have to scan the entire file byte by byte, but probably not as slow as (and definitely more space-efficient than) decompressing and re-compressing. I haven't tried implementing this as a Hadoop InputFormat, but if anyone gets it working please let me know.

Shawn Simister

unread,
Sep 6, 2013, 6:28:30 PM9/6/13
to Freebase Discuss
Slight correction: that file lists the sizes in bytes of each of the 200 gzip files that make up the freebase-rdf-2013-09-03-21-57.gz data dump.

Dan

unread,
Oct 16, 2013, 10:51:04 AM10/16/13
to freebase...@googlegroups.com
Did the generation of the splits on a weekly basis go anywhere?

I can't see them alongside the dump. I'm going to be taking a look at working with the dump again soon, so it would be good to be able to use the split form.

Thanks,
Dan

Shawn Simister

unread,
Oct 17, 2013, 9:53:36 PM10/17/13
to Freebase Discuss
I didn't get any feedback on the test data that I posted a while back so I wasn't prioritizing this, but if it's something that you're willing to test I'd be happy to work on getting it running in production. Did you try the file I uploaded before?

Dan

unread,
Oct 18, 2013, 1:38:51 PM10/18/13
to freebase...@googlegroups.com
Ah ok, yes I'd be happy to help test it out, I didn't try it last time.

I've just tried splitting the dump with the splits you gave in this thread, and using your Python code with a slight modification it worked fine. On an EC2 instance it took a few minutes to split the file and upload it into S3.

Once it was there I just loaded it natively in Pig as a directory of gzipped files to count the tuples in the dump. This ran fine and took around 15 minutes on 5 m1.xlarge instances to count the 1,982,300,898 rows.

So that seems to work fine, and speeds up the initial loading into a form to get the RDF in Hadoop. Makes it a lot easier to deal with too.

Let me know if you want to test anything further.

Thanks,
Dan

Shawn Simister

unread,
Oct 18, 2013, 5:58:26 PM10/18/13
to Freebase Discuss
Great, I'll work on getting this uploaded alongside the data dump every week.

Ewa Szwed

unread,
Oct 23, 2013, 8:31:38 AM10/23/13
to freebase...@googlegroups.com
Hi,
I wanted to ask about the status of this work.
Was the format eventually changed on the 'official' dumps download page?
If not, what is the status of this transition right now?

Shawn Simister

unread,
Oct 23, 2013, 5:12:20 PM10/23/13
to Freebase Discuss
Yes, the data dumps are now using the N-Triples format. When the updated docs go live they will show the new size, which is around 20 GB; that's not so much due to the new format as it is just the natural growth of Freebase data.



Ewa Szwed

unread,
Oct 31, 2013, 11:48:57 AM10/31/13
to freebase...@googlegroups.com

Hi,
Thank you very much for your answer.
I have another question though.
I was trying to load the current (N-Triples: 2013-10-20) Freebase dump into Apache Jena TDB and it was complaining about objects with the values true and false.
When they are replaced with "true" and "false" (with double quotes added) the import is fine.
Is there a chance that we can have it fixed in the dump?
I would appreciate any answer.

On Thursday, August 29, 2013 at 19:13:09 UTC+1, Shawn Simister wrote:

Shawn Simister

unread,
Oct 31, 2013, 9:05:00 PM10/31/13
to Freebase Discuss
Hmm, yeah it appears that this is a bug. Turtle supports true and false as equivalent to "true"^^xsd:boolean and "false"^^xsd:boolean but even though N-Triples is a subset of Turtle it doesn't support the simplified boolean syntax.

Do folks using the dumps have a preference about whether the boolean values should be typed as "true"^^xsd:boolean or just serialized as strings like "true"?



Thad Guidry

unread,
Oct 31, 2013, 9:12:10 PM10/31/13
to Freebase Discuss
I think it does though, right? And Turtle borrows from the Notation3 EBNF... 'true' or 'false'.

Booleans may be written directly as true or false and correspond to the XML Schema Datatype xsd:boolean in both syntax and datatype URI.

# this is not a complete turtle document
true
false
# same in long form
"true"^^xsd:boolean
"false"^^<http://www.w3.org/2001/XMLSchema#boolean>

I think you are fine with just "true" and "false".

Ewa Szwed

unread,
Nov 1, 2013, 6:58:51 AM11/1/13
to freebase...@googlegroups.com
I am happy with "true" and "false" as well.
This problem also exists for decimals: line 92724, Illegal object: [DECIMAL: 9.3]. Can this be taken care of too?


2013/11/1 Thad Guidry <thadg...@gmail.com>


Shawn Simister

unread,
Nov 1, 2013, 7:36:25 PM11/1/13
to Freebase Discuss
Yes, all literal values will be quoted according to the spec.

Ewa Szwed

unread,
Nov 4, 2013, 6:19:04 AM11/4/13
to freebase...@googlegroups.com
Hi,
The problem also appears when numbers are represented as follows:
60e-20
2.3e+30
These need to be enclosed in double quotes too. With these small transformations I was able to load the first 50 million triples into Jena.
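
In case it is useful to others hitting the same Jena errors, a minimal sketch of such a pre-processing pass (a sketch only, not an official fix: it assumes the unquoted value is always the third whitespace-separated field of a well-formed line, and the file names are hypothetical):

import gzip
import re
import sys

# subject, predicate, object, terminating dot
TRIPLE = re.compile(r'^(\S+)\s+(\S+)\s+(.+?)\s*\.\s*$')

def quote_bare_literals(in_path, out_path):
    with gzip.open(in_path, 'rt') as src, gzip.open(out_path, 'wt') as dst:
        for line in src:
            m = TRIPLE.match(line)
            if m:
                s, p, o = m.groups()
                # Bare booleans/numbers start with neither '<' (URI) nor '"' (literal).
                if not o.startswith('<') and not o.startswith('"'):
                    o = '"%s"' % o
                line = '%s\t%s\t%s\t.\n' % (s, p, o)
            dst.write(line)

if __name__ == '__main__':
    quote_bare_literals(sys.argv[1], sys.argv[2])
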
When can we expect these modifications to be added to the dump?
Best regards.


2013/11/1 Shawn Simister <simi...@google.com>

Ewa Szwed

unread,
Nov 6, 2013, 5:56:08 AM11/6/13
to freebase...@googlegroups.com
Hi,
I have another question on this: is there anything else that still needs to be done to clean up the current N-Triples dumps? I have found a few articles in the past on cleaning up the Turtle format (e.g. removing invalid triples) and was wondering whether that was also taken care of here.
Would appreciate any comment.


On Thursday, August 29, 2013 at 19:13:09 UTC+1, Shawn Simister wrote:

Ewa Szwed

unread,
Nov 7, 2013, 7:22:07 AM11/7/13
to freebase...@googlegroups.com
I meant cleanups from this note:
http://people.apache.org/~andy/Freebase20121223/Notes.txt
I know you mentioned some of them in this thread, but I am not sure what the status of this work is, and whether there are further modifications planned for the near future.
I have loaded 50 million triples from this dump and it works fine, but I have only checked some basic SPARQL queries on it, which give me all I need but do not exercise all the RDF features used in the dump.
Also, could the data dump timestamp be exposed more prominently on the download page: https://developers.google.com/freebase/data.
I am looking for an easy way to detect that a new dump was published without downloading it.


On Thursday, August 29, 2013 at 19:13:09 UTC+1, Shawn Simister wrote:

Dan

unread,
Nov 7, 2013, 7:35:37 AM11/7/13
to freebase...@googlegroups.com
A lot of those are fixed now so you shouldn't need to run cleaning as in that script anymore.

You can also find a list of the dumps here: http://commondatastorage.googleapis.com/freebase-public/ (it's the same XML API as AWS S3).

 - Dan



--

Ewa Szwed

unread,
Nov 7, 2013, 8:49:25 AM11/7/13
to freebase...@googlegroups.com
Thank you. I will use that.
Shawn, does rdf/freebase-rdf-2013-11-03-00-00.gz have these double quotes fixed?


2013/11/7 Dan <danha...@gmail.com>


Ewa Szwed

unread,
Nov 15, 2013, 9:02:56 AM11/15/13
to freebase...@googlegroups.com
Hi,
I would like to ask again if we can expect this double-quoted literals issue to be fixed in the upcoming dumps. I am working on a system that requires the Freebase triple store to be updated every time a new dump is released. With these issues it adds 2 days of processing time before I can start importing into Jena. I would appreciate any answer. :)

Mark Johnson

unread,
Nov 21, 2013, 12:13:17 AM11/21/13
to freebase...@googlegroups.com
Hi,

Is it possible to get the sizes of the 200 gzip files that make up the current freebase dump, please?  I've found that searching for the magic bytes doesn't work -- it produces hundreds of thousands of matches -- presumably because the magic bytes can also appear inside a well-formed gzip file.

Thanks in advance for your help,

Mark

Dan

unread,
Nov 21, 2013, 4:39:39 AM11/21/13
to freebase...@googlegroups.com
I've found the same with the magic bytes. I guess getting the sizes alongside the dumps would be useful to quite a few people now.

Andrea Di Menna

unread,
Nov 21, 2013, 9:56:57 AM11/21/13
to freebase...@googlegroups.com
Hi,
have you tried searching for \x1F\x8B\x08 ?
As per [1] \x08 is the compression method which is set to 08 for deflate.

Cheers
Andrea



2013/11/21 Dan <danha...@gmail.com>

Dan

unread,
Nov 21, 2013, 10:03:47 AM11/21/13
to freebase...@googlegroups.com
This is what I tried last week: https://gist.github.com/danharvey/629b5b652b77359ad895

So it includes checking for deflate too but that still seems to split incorrectly even though the chunks are larger.

 - Dan

Tom Morris

unread,
Nov 21, 2013, 10:39:17 AM11/21/13
to freebase...@googlegroups.com
On Thu, Nov 21, 2013 at 10:03 AM, Dan <danha...@gmail.com> wrote:

That gist appears to require an external list of offsets rather than scanning the whole file.
 
The full signature (low byte on the left) is actually:

1f 8b 08 00 00 00 00 00 00 00 

and I bet a run of zeros like that is pretty rare in practice, but I don't think there's any guarantee that the signature couldn't occur in the compressed data.

Shawn said he'd make the list of offsets available.  That's definitely the most straightforward solution.  Otherwise, the only guaranteed way to do this is have your splitter decompress, split, and (optionally) recompress before writing the splits.
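
A minimal sketch of that guaranteed decompress-and-recompress approach (hypothetical paths and an arbitrary chunk size, using only the standard gzip module): stream-decompress the dump and write every N lines back out as an independent gzip part.

import gzip
import sys

LINES_PER_PART = 10000000  # arbitrary; tune to taste

def resplit(in_path, out_prefix):
    part, count, out = 1, 0, None
    with gzip.open(in_path, 'rt') as src:
        for line in src:
            if out is None:
                out = gzip.open('%s-part-%05d.nt.gz' % (out_prefix, part), 'wt')
            out.write(line)
            count += 1
            if count == LINES_PER_PART:
                out.close()
                out, count, part = None, 0, part + 1
    if out is not None:
        out.close()

if __name__ == '__main__':
    resplit(sys.argv[1], sys.argv[2])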

Tom

Dan

unread,
Nov 21, 2013, 12:30:31 PM11/21/13
to freebase...@googlegroups.com
Ha yes sorry, I created that list first with grep -obUaP "\x1F\x8B\x08" freebase-rdf-2013-11-10-00-00.gz | cut -d ":" -f 1 > splits

But I think the offsets are the only reliable and fast/sane way to do this.

- Dan



Tom Morris

unread,
Nov 21, 2013, 1:49:17 PM11/21/13
to freebase...@googlegroups.com
Using an old dump from June, I got 200 splits (which sounds right) in six and a half minutes on my laptop using this command:

    time grep -obUaP "\x1F\x8B\x08\x00\x00\x00\x00\x00" freebase-rdf-2013-06-30-00-00.gz | cut -d ":" -f 1 > splits.txt

Seems like there should be a variant of split/head/tail that takes a list of byte offsets, so you don't have to resort to Python, but I don't know of such a beast.

Tom

Vijay Mohan

unread,
Dec 4, 2013, 4:49:27 PM12/4/13
to freebase...@googlegroups.com
We have downloaded the dump freebase-rdf-2013-12-01-00-00.gz  from http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-2013-12-01-00-00.gz

We're having problems splitting the file. Do you guys have the offset file with the weekly dump now?


Shawn Simister

unread,
Dec 5, 2013, 8:49:01 PM12/5/13
to Freebase Discuss
I'm still working on getting these autogenerated as part of the weekly dump. Here are the offsets for the 2013-12-01 dump.
offsets.txt

Will Dowling

unread,
Dec 6, 2013, 6:13:02 AM12/6/13
to freebase...@googlegroups.com
Hey mate,

Just working with the freebase-rdf-2013-12-01-00-00 dump and getting parser errors.
It looks like there are still values that aren't quoted and don't match the N-Triples grammar.

Specifically I've noted the following predicates coming through unquoted:
  • /ns/type.property.unique
  • /ns/measurement_unit.dated_integer.number
  • /ns/measurement_unit.dated_float.number
  • /ns/type.content.length
Without knowing too much about how this is generated, would it be unreasonable for the default serialisation of these objects to be quoted?
That would be sane when considered in the context of the N-Triples spec, and wouldn't break with new types later.

Either way, thanks for the awesome work :)

Shawn Simister

unread,
Dec 6, 2013, 1:11:42 PM12/6/13
to Freebase Discuss
Yes, I've got that fixed in a test build; it just hasn't made it to production yet. I'll update the list as soon as it's ready.



Ranjeet Shinde

unread,
Jan 14, 2014, 6:01:10 AM1/14/14
to freebase...@googlegroups.com
Hello,

I am going to use the Freebase RDF dump for one of my projects.
The current size of this RDF is 22 GB.
Can I use the Apache Jena TDB parser?
I am new to the Freebase RDF, so I would be thankful if anyone could guide me on which parser to use.

Regards,
Ranjeet

Shawn Simister

unread,
Jan 14, 2014, 2:40:39 PM1/14/14
to Freebase Discuss
Can you describe your project a little bit more, Ranjeet? Sometimes Jena is a good fit, other times grep is sufficient. It depends on what you're building.



Ranjeet Shinde

unread,
Jan 17, 2014, 2:03:51 AM1/17/14
to freebase...@googlegroups.com
Hello Shawn,

Thanks for the reply.
I need to find all the films and related information (actor, director, release date).
I suspect there is also an issue with the data dump: when using Jena it gives an error for 150.0 (decimal) and the value true.
I have to fix the dump as well :(

Regards,
Ranjeet

Ewa Szwed

unread,
Jan 21, 2014, 6:51:11 AM1/21/14
to freebase...@googlegroups.com
Hi Shawn,
I wanted to ask again about the status of fixing the dumps by enclosing literals in double quotes. Should I still expect it to happen in the near future?

Shawn Simister

unread,
Jan 22, 2014, 7:39:26 PM1/22/14
to Freebase Discuss
Yes, I have a test version of this right now which I'm getting into production. Please take a look and let me know if it addresses your concerns:




Dan

unread,
Jan 23, 2014, 5:47:23 AM1/23/14
to freebase...@googlegroups.com
From the previously mentioned issues, numbers and dates seem to be quoted correctly now.

I'm still seeing some of the ".." relations, e.g. this row from the dump:

<http://rdf.freebase.com/ns/m.021r9f> <http://rdf.freebase.com/ns/tv.tv_actor.starring_roles..tv.regular_tv_appearance.series> <http://rdf.freebase.com/ns/m.0330r> .

I spotted another issue yesterday. Values in the http://rdf.freebase.com/key namespace seem to have "key" repeated, so they appear as

http://rdf.freebase.com/key/key.wikipedia.en

rather than

http://rdf.freebase.com/key/wikipedia.no

which is what I think they should be.

But other than those it's working well for me now, and the splitting we discussed before (finding the gzip headers) is fine too.

Thanks,
Dan

Shawn Simister

unread,
Jan 23, 2014, 2:02:51 PM1/23/14
to Freebase Discuss
I'm glad you're able to split the gzip files correctly. I'm testing a new feature on the dump I linked to above that includes all the individual file sizes in a header at the beginning of the file. Here's the Hadoop InputFormat that reads it and generates file splits from the header info.

I'll look into the other two bugs that you found and see if I can get them fixed in the next update.
SplittableGzipInputFormat.java

Dan

unread,
Jan 24, 2014, 3:21:46 AM1/24/14
to freebase...@googlegroups.com
Thanks for looking into them.

Having the file sizes would simplify the code we're using and also make it less risky, so that's great too. I'll have a look at that once they're being generated.

Thanks,
Dan

sameer....@gmail.com

unread,
Jan 24, 2014, 3:41:41 AM1/24/14
to freebase...@googlegroups.com
Hi Shawn ,

I am trying to automate the process of downloading the Freebase dump every week. One question: will the file name always be "freebase-rdf-latest.gz"?

Thad Guidry

unread,
Jan 24, 2014, 11:54:17 AM1/24/14
to Freebase Discuss
+1  ... I would like it to remain consistent with the "latest" term in the filename as suggested.  Makes it easier for me also.


Ewa Szwed

unread,
Jan 27, 2014, 11:07:41 AM1/27/14
to freebase...@googlegroups.com
Hi Shawn,
I have tested the dump on:
and the issue was fixed.
I am able to load the dump to Jena with no errors.

Shawn Simister

unread,
Jan 27, 2014, 2:47:38 PM1/27/14
to Freebase Discuss
Yes, I'll update the pipeline to always generate the file with the "latest" suffix.



Shawn Simister

unread,
Jan 27, 2014, 2:48:07 PM1/27/14
to Freebase Discuss
Glad to hear that fixed your issues Ewa. I'll update this thread again when this is running in production.



sameer....@gmail.com

unread,
Jan 30, 2014, 1:11:13 AM1/30/14
to freebase...@googlegroups.com
Hi Shawn,

Can you tell me at what time the free quota per application gets reset?

Regards,
Sameer

souri datta

unread,
Feb 6, 2014, 11:54:28 PM2/6/14
to freebase...@googlegroups.com
Hi Shawn,
Have you already updated the pipeline? Can we expect freebase-rdf-latest.gz to be the latest from now on?

Currently https://developers.google.com/freebase/data shows 'freebase-rdf-2014-02-02-00-00' as the latest dump, which is different from freebase-rdf-latest.gz. Can you please have a look?

Thanks,
Souri

Dan

unread,
Feb 14, 2014, 12:00:42 PM2/14/14
to freebase...@googlegroups.com
The "latest" file was a one-off dump used for testing, so ignore that.

The latest dumps are the ones with timestamps.

 - Dan



On 14 February 2014 16:54, Josh Sumali <j.su...@gmail.com> wrote:
Current Latest has a LastModified of <LastModified>2014-01-04T03:27:39.783Z</LastModified>, but I think there is a newer file:

<Key>rdf/freebase-rdf-2014-02-02-00-00.gz</Key>
<Generation>1391309397565000</Generation>
<MetaGeneration>1</MetaGeneration>
<LastModified>2014-02-02T02:49:57.181Z</LastModified>

so Latest seems a bit out of date... any ideas?

Ewa Szwed

unread,
Mar 6, 2014, 8:05:02 AM3/6/14
to freebase...@googlegroups.com
Hello Shawn,
Some time ago I checked your test dump to see if the literals were enclosed in double quotes; that change had been made, so my problem disappeared.
This change, however, never went to production, so when I checked today:
rdf/freebase-rdf-2014-03-02-00-00.gz
the problem still exists there.
Can I expect this to be fixed, and when?
Also, what is the content of the latest at the moment:
<Contents>
<Key>rdf/freebase-rdf-latest.gz</Key>
<Generation>1388806059946000</Generation>
<MetaGeneration>1</MetaGeneration>
<LastModified>2014-01-04T03:27:39.783Z</LastModified>
<ETag>"d8ad992680f877162c6c6c839287a533"</ETag>
<Size>27692521045</Size>
<Owner>
<ID>
00b4903a979c59476077b7bbd56dee830753de4d43027ee3ee8681a5d8453197
</ID>
</Owner>
</Contents>
LastModified suggests this is not the latest dump? Is that correct? 

Ewa Szwed

unread,
Apr 11, 2014, 9:48:15 AM4/11/14
to freebase...@googlegroups.com
Hi Shawn,
I was wondering what the status of the 'latest' Freebase dump is: is this dump really the latest? The modification date there is not OK (http://commondatastorage.googleapis.com/freebase-public):
2014-01-04T03:27:39
And the size does not match the one from the 2014-04-06 dump.
Is it really the latest dump?

Ewa Szwed

unread,
Sep 25, 2014, 9:16:56 AM9/25/14
to freebase...@googlegroups.com
Hi everybody,

I would like to come back to the problem of splitting the dump (using an offsets list) before it is processed on Hadoop.

What is your recommendation here for today? I am asking because I have a problem with the approach I learnt from you some time ago.

I was advised to do this:

grep -obUaP "\x1F\x8B\x08\x00\x00\x00\x00\x00" $UPDATER_DATA_DUMPS_DIR/freebase-rdf-2014-09-21-00-00.gz | cut -d ":" -f 1 > splits.txt

And then I run splitter.py script as follows:

import sys
input_file = sys.argv[1]
offset_file = sys.argv[2]
output_file = input_file.replace('.gz','')
offsets = map(long, open(offset_file).read().rstrip().split("\n"))
 
with open(input_file,"rb") as f:
    part = 1
    previous = 0
    for offset in offsets:
        print offset
        if offset == 0:
          continue
 
        file_size = offset - previous
        print file_size
        byte_data = f.read(file_size)
        print f.tell
 
        with open('splits/part-'+str(part)+'-of-'+str(len(offsets))+'.gz', 'wb') as o:
            o.write(byte_data)
        previous = offset
        part += 1

My question is, is this a correct way to generate splits.txt for every freebase dump?

I think there might be a problem with this approach because, working with freebase-rdf-2014-09-21-00-00.gz, the number of triples after this process does not match the number before.

2742939530 total before splitting
2730491205 total after splitting

Can I use the grep command to generate splits.txt, or should I use a list of offsets from you? If the latter, is the one for freebase-rdf-2014-09-21-00-00.gz available?
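
One way to narrow down which split boundaries are wrong (a sketch only, assuming the parts were written to the splits/ directory as in the script above): try decompressing each part on its own and report any that fail or come up short.

import glob
import gzip
import zlib

bad = []
for path in sorted(glob.glob('splits/part-*.gz')):
    try:
        with gzip.open(path, 'rb') as f:
            lines = sum(1 for _ in f)
        print(path, lines)
    except (IOError, EOFError, zlib.error) as e:
        # a part that starts at a false-positive "magic number" or is
        # truncated will fail here
        bad.append((path, str(e)))

print('parts that failed to decompress cleanly:', bad)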

Regards,
Ewa

Georgios Noutsos

unread,
Feb 4, 2015, 3:04:22 AM2/4/15
to freebase...@googlegroups.com
Dear Ewa,

I would like to ask if you found duplicate entries in the data dump. We are importing it into Postgres and there seems to be a difference between the number of triples in the dump and the number of triples imported. We are using the last Freebase dump, from before it moves to Wikidata.

Thank you in advance.

Georgios

Ewa Szwed

unread,
Feb 6, 2015, 5:30:50 AM2/6/15
to freebase...@googlegroups.com
Hi Georgios,
I have never analysed the dump for duplicated triples, so it was never an issue for me.
I cannot say whether the number of triples matches what is in the dump because we apply various filters before the data is loaded. I did not gather any numbers on that.

Sandro Cavallari

unread,
Apr 17, 2015, 12:35:55 AM4/17/15
to freebase...@googlegroups.com
Hi all,

I'm trying to split a freebase dump using the command

time grep -obUa "\x1F\x8B\x08\x00\x00\x00\x00\x00" freebase-rdf-latest.gz | cut -d ":" -f 1 > splits.txt

on a MacBook Pro machine.

But the splits.txt file is always empty. Is there something wrong with using grep on Yosemite?

Or is the search pattern ("\x1F\x8B\x08\x00\x00\x00\x00\x00") wrong?

Thanks,
Sandro