Parsing a large ttl file


Abhay Kujur

Oct 22, 2023, 7:07:27 PM
to rdflib-dev

Hello,

I am working with a large TTL file of 20 GB. When I try to read it in using rdflib, I get an error:

killed

I am trying to create a smaller file from it using a grep command.

The sample data is yagoTransitiveType.ttl

grep "wordnet_" yagoTransitiveType.ttl >wordnet_yagoTransitiveType.ttl

The problem is that the grep output does not include the initial @prefix declarations such as yago: and others, so rdflib is not able to parse the resulting TTL file.

import rdflib 
g = rdflib.Graph() 
g.parse('yagoTransitiveType.ttl', format='ttl')

How can I fix the issue, either by adding the roughly 10 prefix lines back after running the grep command, or in some other way?

Nicholas Car

Oct 22, 2023, 8:14:14 PM
to rdfli...@googlegroups.com
Turtle files have structure that a line-by-line sampler such as grep will break. It's not just the prefixes: many Turtle statements span multiple lines (for example predicate-object lists continued with ";" and ","), so individual lines only make sense in groups.

If you want to sample an RDF file line by line, you need to serialise it to N-Triples first, where each line is a complete triple, and then filter that.

With a sampled N-Triples file, you can then convert back to Turtle. To preserve the original prefixes, re-add them to the graph before serialising, using g.bind("prefix", Namespace("namespace")) for each one.
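
A minimal sketch of that round trip, assuming the filtered subset has already been saved as N-Triples (the file names and the yago namespace IRI below are illustrative, not taken from the thread):

import rdflib
from rdflib import Namespace

g = rdflib.Graph()
# Hypothetical input: the grep-filtered subset, already in line-oriented N-Triples.
g.parse('wordnet_yagoTransitiveType.nt', format='nt')

# Re-add the prefixes from the original Turtle file so they appear in the output.
g.bind('yago', Namespace('http://yago-knowledge.org/resource/'))  # assumed namespace IRI
g.bind('rdf', Namespace('http://www.w3.org/1999/02/22-rdf-syntax-ns#'))

g.serialize(destination='wordnet_yagoTransitiveType.ttl', format='turtle')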

Regards, Nick



Abhay Kujur

Oct 23, 2023, 5:08:21 AM
to rdflib-dev
Hello. 

Thank you for the suggestions. The problem is handling such a large file; can you please suggest an efficient way to convert the TTL file to an N-Triples file?

Nicholas Car

Oct 23, 2023, 6:05:22 AM
to rdfli...@googlegroups.com
If you are having trouble with RDFLib, you could use the Jena command-line tool called RIOT - https://jena.apache.org/documentation/io/. It can format-convert large files like your 20 GB one.

Use RDFLib or RIOT on a machine with lots of RAM.
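
For the RDFLib route, a minimal sketch of the Turtle-to-N-Triples conversion (note it still builds the whole graph in memory, hence the need for plenty of RAM; file names are illustrative):

import rdflib

g = rdflib.Graph()
g.parse('yagoTransitiveType.ttl', format='turtle')
# N-Triples puts one triple per line, so the result can then be filtered with grep.
g.serialize(destination='yagoTransitiveType.nt', format='nt')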

Nick

Miel Vander Sande

Oct 23, 2023, 6:17:57 AM
to rdfli...@googlegroups.com
RIOT should use stream parsing, so RAM-wise you should be fine. Otherwise, there's always https://github.com/drobilla/serd, which is highly efficient.


Boris Pelakh

Oct 23, 2023, 8:40:02 PM
to rdfli...@googlegroups.com
I am curious as to what the use case is, given that you appear to be looking at a subset of the RDF data. Perhaps there is no need to hold the entire graph (or even the subset) in memory at all - just a streaming parser (of any RDF format) with a per-triple event handler to gather whatever data/analytics you need. I am not sure if there is a Python implementation, but there are several in Java.
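
A rough way to approximate that per-triple handler in Python, assuming the data has first been converted to line-oriented N-Triples (the handler and file name are made up for illustration):

import rdflib

def handle(s, p, o):
    # Gather whatever data/analytics you need from a single triple.
    print(s, p, o)

# N-Triples puts one complete triple on each line, so lines can be parsed independently.
with open('yagoTransitiveType.nt') as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        g = rdflib.Graph()              # tiny one-triple graph for this line only
        g.parse(data=line, format='nt')
        for s, p, o in g:
            handle(s, p, o)

Parsing each line into its own Graph is slow in pure Python compared with the Java streaming parsers, but it keeps memory use constant regardless of file size.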

Boris
