Importing very large RDF


Heather Packer

Mar 4, 2011, 1:41:17 PM
to 4store-support
Hi,

I'm trying to import the YAGO2 RDFS dump into a 4store kb. The file is very
large (187GB), and it causes 4s-import to crash with a segmentation fault.
I've re-run 4s-import with -vv (see below), and it looks like 4s-client
hits a problem a little way into the import.
I also ran 4s-import under valgrind, in case that helps (see below).
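
(For reference, the valgrind run below amounts to wrapping the same import command in valgrind with its default Memcheck tool and no extra options, e.g.:

$ valgrind /usr/local/bin/4s-import yago2 yago2_full_20101210.rdfs

The "Command:" line near the top of the trace shows the exact invocation used.)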

Any ideas?


Thanks,
Heather


$ /usr/local/bin/4s-import -vv yago2 yago2_full_20101210.rdfs
removing old data
Reading <file:///home/hp07r/yago2_full_20101210.rdfs>
Pass 1, processed 5000000 triples (5000000)
Pass 2, processed 5000000 triples, 14202 triples/s
Pass 1, processed 5000000 triples (10000000)
Pass 2, processed 5000000 triples, 37123 triples/s
Pass 1, processed 5000000 triples (15000000)
Pass 2, processed 5000000 triples, 37543 triples/s
Pass 1, processed 5000000 triples (20000000)
Pass 2, processed 5000000 triples, 42644 triples/s
Pass 1, processed 5000000 triples (25000000)
Pass 2, processed 5000000 triples, 37546 triples/s
Pass 1, processed 5000000 triples (30000000)
Pass 2, processed 5000000 triples, 50060 triples/s
Pass 1, processed 5000000 triples (35000000)
Pass 2, processed 5000000 triples, 48152 triples/s
Pass 1, processed 5000000 triples (40000000)
Pass 2, processed 5000000 triples, 48961 triples/s
Pass 1, processed 5000000 triples (45000000)
Pass 2, processed 5000000 triples, 44545 triples/s
Pass 1, processed 5000000 triples (50000000)
Pass 2, processed 5000000 triples, 44452 triples/s
Pass 1, processed 5000000 triples (55000000)
Pass 2, processed 5000000 triples, 12587 triples/s
Pass 1, processed 5000000 triples (60000000)
Pass 2, processed 5000000 triples, 3736 triples/s
4store[28704]: 4s-client.c:475 kb=yago2 write_replica(0) failed: Connection reset by peer
Pass 1, processed 5000000 triples (65000000)
Pass 2, processed 5000000 triples, 3877 triples/s
Pass 1, processed 5000000 triples (70000000)
Pass 2, processed 5000000 triples, 4792 triples/s
Pass 1, processed 5000000 triples (75000000)
Pass 2, processed 5000000 triples, 5495 triples/s
Pass 1, processed 5000000 triples (80000000)
Pass 2, processed 5000000 triples, 7248 triples/s
Pass 1, processed 5000000 triples (85000000)
Pass 2, processed 5000000 triples, 18780 triples/s
Pass 1, processed 5000000 triples (90000000)
Pass 2, processed 5000000 triples, 49802 triples/s
Pass 1, processed 5000000 triples (95000000)
Pass 2, processed 5000000 triples, 54080 triples/s
Pass 1, processed 5000000 triples (100000000)
Pass 2, processed 5000000 triples, 53814 triples/s
Pass 1, processed 5000000 triples (105000000)
Pass 2, processed 5000000 triples, 55516 triples/s
Pass 1, processed 5000000 triples (110000000)
Pass 2, processed 5000000 triples, 54957 triples/s
Pass 1, processed 5000000 triples (115000000)
Pass 2, processed 5000000 triples, 56634 triples/s
Pass 1, processed 5000000 triples (120000000)
Pass 2, processed 5000000 triples, 60010 triples/s
Pass 1, processed 5000000 triples (125000000)
Pass 2, processed 5000000 triples, 64438 triples/s
Pass 1, processed 5000000 triples (130000000)
Pass 2, processed 5000000 triples, 65208 triples/s
Pass 1, processed 5000000 triples (135000000)
Pass 2, processed 5000000 triples, 67814 triples/s
Pass 1, processed 5000000 triples (140000000)
Pass 2, processed 5000000 triples, 67830 triples/s
Pass 1, processed 5000000 triples (145000000)
Pass 2, processed 5000000 triples, 70333 triples/s
Pass 1, processed 5000000 triples (150000000)
Pass 2, processed 5000000 triples, 72168 triples/s
Pass 1, processed 5000000 triples (155000000)
Pass 2, processed 5000000 triples, 69958 triples/s
Pass 1, processed 5000000 triples (160000000)
Pass 2, processed 5000000 triples, 68880 triples/s
Pass 1, processed 5000000 triples (165000000)
Pass 2, processed 5000000 triples, 71081 triples/s
Pass 1, processed 5000000 triples (170000000)
Pass 2, processed 5000000 triples, 67939 triples/s
Pass 1, processed 5000000 triples (175000000)
Pass 2, processed 5000000 triples, 70176 triples/s
Pass 1, processed 5000000 triples (180000000)
Pass 2, processed 5000000 triples, 66729 triples/s
Pass 1, processed 5000000 triples (185000000)
Pass 2, processed 5000000 triples, 69311 triples/s
Pass 1, processed 5000000 triples (190000000)
Pass 2, processed 5000000 triples, 71192 triples/s
Pass 1, processed 5000000 triples (195000000)
Pass 2, processed 5000000 triples, 69065 triples/s
Pass 1, processed 5000000 triples (200000000)
Pass 2, processed 5000000 triples, 71205 triples/s
Pass 1, processed 5000000 triples (205000000)
Pass 2, processed 5000000 triples, 68406 triples/s
Pass 1, processed 5000000 triples (210000000)
Pass 2, processed 5000000 triples, 69395 triples/s
Pass 1, processed 5000000 triples (215000000)
Pass 2, processed 5000000 triples, 73154 triples/s
Pass 1, processed 5000000 triples (220000000)
Pass 2, processed 5000000 triples, 68640 triples/s
Pass 1, processed 5000000 triples (225000000)
Pass 2, processed 5000000 triples, 71499 triples/s
Pass 1, processed 5000000 triples (230000000)
Pass 2, processed 5000000 triples, 67665 triples/s
Pass 1, processed 5000000 triples (235000000)
Pass 2, processed 5000000 triples, 69855 triples/s
Pass 1, processed 5000000 triples (240000000)
Pass 2, processed 5000000 triples, 71274 triples/s
Pass 1, processed 5000000 triples (245000000)
Pass 2, processed 5000000 triples, 61206 triples/s
Pass 1, processed 5000000 triples (250000000)
Pass 2, processed 5000000 triples, 65956 triples/s
Pass 1, processed 5000000 triples (255000000)
Pass 2, processed 5000000 triples, 63816 triples/s
Pass 1, processed 5000000 triples (260000000)
Pass 2, processed 5000000 triples, 66165 triples/s
Pass 1, processed 5000000 triples (265000000)
Pass 2, processed 5000000 triples, 56012 triples/s
Pass 1, processed 5000000 triples (270000000)
Pass 2, processed 5000000 triples, 58260 triples/s
Pass 1, processed 5000000 triples (275000000)
Pass 2, processed 5000000 triples, 57785 triples/s
Pass 1, processed 5000000 triples (280000000)
Pass 2, processed 5000000 triples, 66044 triples/s
Pass 1, processed 5000000 triples (285000000)
Pass 2, processed 5000000 triples, 71316 triples/s
Segmentation fault590000 triples
$




==7111== Memcheck, a memory error detector
==7111== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==7111== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==7111== Command: /usr/local/bin/4s-import yago2 yago2_full_20101210.rdfs
==7111==
==7111== Conditional jump or move depends on uninitialised value(s)
==7111==    at 0xB822D4: inet_ntop (in /lib/libc-2.5.so)
==7111==    by 0xCCF0D6: avahi_address_snprint (in /usr/lib/libavahi-common.so.3.4.3)
==7111==    by 0x8053B68: resolve_callback (4s-mdns.c:115)
==7111==    by 0x1AB12B: avahi_service_resolver_event (in /usr/lib/libavahi-client.so.3.2.1)
==7111==    by 0x1A585D: ??? (in /usr/lib/libavahi-client.so.3.2.1)
==7111==    by 0x40BA673: dbus_connection_dispatch (in /lib/libdbus-1.so.3.4.0)
==7111==    by 0x1ACAEB: ??? (in /usr/lib/libavahi-client.so.3.2.1)
==7111==    by 0xCD21A9: ??? (in /usr/lib/libavahi-common.so.3.4.3)
==7111==    by 0xCD2460: avahi_simple_poll_dispatch (in /usr/lib/libavahi-common.so.3.4.3)
==7111==    by 0xCD2ABA: avahi_simple_poll_iterate (in /usr/lib/libavahi-common.so.3.4.3)
==7111==    by 0x80538E4: fsp_mdns_setup_frontend (4s-mdns.c:190)
==7111==    by 0x805267D: fsp_open_link (4s-client.c:581)
==7111==
==7111== Conditional jump or move depends on uninitialised value(s)
==7111==    at 0xB82582: inet_ntop (in /lib/libc-2.5.so)
==7111==    by 0xCCF0D6: avahi_address_snprint (in /usr/lib/libavahi-common.so.3.4.3)
==7111==    by 0x8053B68: resolve_callback (4s-mdns.c:115)
==7111==    by 0x1AB12B: avahi_service_resolver_event (in /usr/lib/libavahi-client.so.3.2.1)
==7111==    by 0x1A585D: ??? (in /usr/lib/libavahi-client.so.3.2.1)
==7111==    by 0x40BA673: dbus_connection_dispatch (in /lib/libdbus-1.so.3.4.0)
==7111==    by 0x1ACAEB: ??? (in /usr/lib/libavahi-client.so.3.2.1)
==7111==    by 0xCD21A9: ??? (in /usr/lib/libavahi-common.so.3.4.3)
==7111==    by 0xCD2460: avahi_simple_poll_dispatch (in /usr/lib/libavahi-common.so.3.4.3)
==7111==    by 0xCD2ABA: avahi_simple_poll_iterate (in /usr/lib/libavahi-common.so.3.4.3)
==7111==    by 0x80538E4: fsp_mdns_setup_frontend (4s-mdns.c:190)
==7111==    by 0x805267D: fsp_open_link (4s-client.c:581)
==7111==
==7111== Invalid write of size 2
==7111==    at 0x804D884: message_new (4s-common.c:137)
==7111==    by 0x80523E4: fsp_res_import (4s-client.c:1023)
==7111==    by 0x804B84A: buffer_res (import.c:121)
==7111==    by 0x804C455: store_stmt (import.c:835)
==7111==    by 0x403E221: raptor_rdfxml_generate_statement (raptor_rdfxml.c:1280)
==7111==    by 0x403ECD8: raptor_rdfxml_end_element_handler (raptor_rdfxml.c:2785)
==7111==    by 0x4041203: raptor_sax2_end_element (raptor_sax2.c:948)
==7111==    by 0x767E51: ??? (in /usr/lib/libxml2.so.2.6.26)
==7111==    by 0x774718: xmlParseChunk (in /usr/lib/libxml2.so.2.6.26)
==7111==    by 0x4041A29: raptor_sax2_parse_chunk (raptor_sax2.c:593)
==7111==    by 0x403D8BE: raptor_rdfxml_parse_chunk (raptor_rdfxml.c:1151)
==7111==    by 0x402B042: raptor_parser_parse_chunk (raptor_parse.c:471)
==7111==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==7111==
==7111==
==7111==
==7111== Process terminating with default action of signal 11 (SIGSEGV)
==7111==  Access not within mapped region at address 0x0
==7111==    at 0x804D884: message_new (4s-common.c:137)
==7111==    by 0x80523E4: fsp_res_import (4s-client.c:1023)
==7111==    by 0x804B84A: buffer_res (import.c:121)
==7111==    by 0x804C455: store_stmt (import.c:835)
==7111==    by 0x403E221: raptor_rdfxml_generate_statement (raptor_rdfxml.c:1280)
==7111==    by 0x403ECD8: raptor_rdfxml_end_element_handler (raptor_rdfxml.c:2785)
==7111==    by 0x4041203: raptor_sax2_end_element (raptor_sax2.c:948)
==7111==    by 0x767E51: ??? (in /usr/lib/libxml2.so.2.6.26)
==7111==    by 0x774718: xmlParseChunk (in /usr/lib/libxml2.so.2.6.26)
==7111==    by 0x4041A29: raptor_sax2_parse_chunk (raptor_sax2.c:593)
==7111==    by 0x403D8BE: raptor_rdfxml_parse_chunk (raptor_rdfxml.c:1151)
==7111==    by 0x402B042: raptor_parser_parse_chunk (raptor_parse.c:471)
==7111==  If you believe this happened as a result of a stack
==7111==  overflow in your program's main thread (unlikely but
==7111==  possible), you can try to increase the size of the
==7111==  main thread stack using the --main-stacksize= flag.
==7111==  The main thread stack size used in this run was 10485760.
==7111==
==7111== HEAP SUMMARY:
==7111==     in use at exit: 392,637,298 bytes in 21,549,804 blocks
==7111==   total heap usage: 665,905,425 allocs, 644,355,620 frees, 37,348,012,212 bytes allocated
==7111==
==7111==
==7111== Valgrind's memory management: out of memory:
==7111==    newSuperblock's request for 86200320 bytes failed.
==7111==    3120476160 bytes have already been allocated.
==7111== Valgrind cannot continue. Sorry.
==7111==
==7111== There are several possible reasons for this.
==7111== - You have some kind of memory limit in place. Look at the
==7111==   output of 'ulimit -a'. Is there a limit on the size of
==7111==   virtual memory or address space?
==7111== - You have run out of swap space.
==7111== - Valgrind has a bug. If you think this is the case or you are
==7111==   not sure, please let us know and we'll try to fix it.
==7111== Please note that programs can take substantially more memory than
==7111== normal when running under Valgrind tools, eg. up to twice or
==7111== more, depending on the tool. On a 64-bit machine, Valgrind
==7111== should be able to make use of up 32GB memory. On a 32-bit
==7111== machine, Valgrind should be able to use all the memory available
==7111== to a single process, up to 4GB if that's how you have your
==7111== kernel configured. Most 32-bit Linux setups allow a maximum of
==7111== 3GB per process.
==7111==
==7111== Whatever the reason, Valgrind cannot continue. Sorry.
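
(Note that this final failure is valgrind itself running out of address space, apparently on a 32-bit build, which is separate from the original segfault. The check valgrind suggests is simply:

$ ulimit -a    # all per-process limits
$ ulimit -v    # the virtual memory limit specifically, in kB

Both are standard shell builtins and only confirm whether an explicit memory limit is in play; nothing here is 4store-specific.)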


Steve Harris

Mar 4, 2011, 2:05:53 PM
to 4store-...@googlegroups.com
Hi Heather,

Interesting... I don't think anyone's tried to import an RDF/XML file that big.

Thanks for the Valgrind trace; it looks like it's something in the XML parser, but it could be a 4store bug.

Could you try parsing it with rapper -gc, and see if it does the same thing?
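
(For reference, that is a run along the lines of:

$ rapper -gc yago2_full_20101210.rdfs    # -g guesses the parser from the content, -c only counts triples

which exercises the same raptor/libxml2 RDF/XML parsing path on its own, with no 4store backend involved.)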

The connection line is interesting too:

> 4store[28704]: 4s-client.c:475 kb=yago2 write_replica(0) failed: Connection reset by peer

Is this on a cluster, or a single machine?

- Steve


Mischa

Mar 4, 2011, 2:49:35 PM
to 4store-...@googlegroups.com
<snip/>

Sent on the move

On Mar 4, 2011, at 7:05 PM, Steve Harris <s.w.h...@gmail.com> wrote:

> Hi Heather,
>
> Interesting... I don't think anyone's tried to import an RDF/XML file that big.

I will try and find the file from somewhere; I'm intrigued!

>
> Thanks for the Valgrind trace, looks like it's something in the XML parser, but could be a 4store bug.
>
> Could you try parsing it with rapper -gc, and see if it does the same thing?
>
> The connection line is interesting too:
>
>> 4store[28704]: 4s-client.c:475 kb=yago2 write_replica(0) failed: Connection reset by peer
>
> Is this on a cluster, or a single machine?

I also wonder how much RAM would be needed. I can't recall how many triples you get on average per GB of RDF/XML.

Mischa

Heather Packer

Mar 4, 2011, 3:31:05 PM
to 4store-...@googlegroups.com
> Could you try parsing it with rapper -gc, and see if it does the same thing?

It looks like it:

$ rapper -gc yago2_full_20101210.rdfs
rapper: Parsing URI file:///home/hp07r/yago2_full_20101210.rdfs with parser guess
rapper: Guessed parser name 'rdfxml'
rapper: Error - - XML parser error: Memory allocation failed
rapper: Error - - XML parser error: attributes construct error
rapper: Error - - XML parser error: Specification mandate value for attribute fact_23200124106
rapper: Error - - XML parser error: attributes construct error
rapper: Error - - XML parser error: error parsing attribute name
rapper: Error - - XML parser error: attributes construct error
rapper: Error - - XML parser error: xmlParseStartTag: problem parsing attributes
rapper: Error - - XML parser error: Couldn't find end of Start Tag rdf:Description
rapper: Failed to parse file yago2_full_20101210.rdfs guess content
rapper: Parsing returned 291250650 triples
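
(Back-of-envelope, and only a rough answer to the triples-per-GB question above: the count reported here, over a 187GB file, works out to about 1.6 million triples per GB of RDF/XML:

$ echo $((291250650 / 187))
1557490

This says nothing about how much RAM 4store needs to index them, only how verbose this particular serialisation is.)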


>
> The connection line is interesting too:
>
>> 4store[28704]: 4s-client.c:475 kb=yago2 write_replica(0) failed: Connection reset by peer
>
> Is this on a cluster, or a single machine?

A single machine.


Thanks,
Heather

Steve Harris

Mar 4, 2011, 5:21:01 PM
to 4store-...@googlegroups.com
On 2011-03-04, at 20:31, Heather Packer wrote:

>> Could you try parsing it with rapper -gc, and see if it does the same thing?
>
> It looks like it:
>
> $ rapper -gc yago2_full_20101210.rdfs
> rapper: Parsing URI file:///home/hp07r/yago2_full_20101210.rdfs with parser guess
> rapper: Guessed parser name 'rdfxml'
> rapper: Error - - XML parser error: Memory allocation failed

OK, thanks, I'll file a bug with the parser author.

Is the data available in another format? I'd recommend N-Triples for really large imports.
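
(A conversion along these lines should do it, assuming rapper can get cleanly through the file; the -i/-o options are standard rapper flags, and the import mirrors the command used earlier in the thread:

$ rapper -i rdfxml -o ntriples yago2_full_20101210.rdfs > yago2_full_20101210.nt
$ /usr/local/bin/4s-import -vv yago2 yago2_full_20101210.nt

Since N-Triples is line-oriented, the .nt file can also be split and loaded in chunks, using 4s-import's --add option so later chunks don't trigger the "removing old data" step seen above.)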

- Steve
