How to load Getty TGN


Kelvin Kan

Oct 23, 2017, 10:33:01 PM
to Getty Vocabularies as Linked Open Data
Hi, does anyone know a good way to load TGNOut_Full.nt (~38 GB) into GraphDB? Feel free to also share which other tools are suitable for this dataset.

Vladimir Alexiev

Oct 24, 2017, 9:47:01 AM
to Getty Vocabularies as Linked Open Data
Hi Kelvin, have you just tried to load the files?
The official endpoint runs GraphDB 6.2.7, but I don't expect problems with the latest version either (let me know).
Gregg Garcia from the Getty could provide extra info, e.g. about loading time.

Note: Getty loads the Explicit triples, but implements specific reasoning as per http://vocab.getty.edu/doc/#Inference. With the Total export you don't need to worry about the inference.
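For instance, here's a minimal sanity check (a sketch; gvp:broaderExtended is one of the hierarchy-closure properties described on that page, and in the Total export these triples come pre-materialized):

# gvp:broaderExtended is produced by Getty's hierarchical inference;
# with the Total export it is present as plain explicit triples
PREFIX gvp: <http://vocab.getty.edu/ontology#>
SELECT ?s ?ancestor
WHERE { ?s gvp:broaderExtended ?ancestor }
LIMIT 10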

Kelvin Kan

Oct 24, 2017, 10:17:42 PM
to Getty Vocabularies as Linked Open Data
Hi, I am using the latest version, GraphDB 8.3. I have tried to load the dataset in two ways: through the Workbench and with the LoadRDF tool, but the loading always fails after some time. Can you guide me on a better way to load this large dataset?

Vladimir Alexiev

Oct 30, 2017, 4:57:40 AM
to Getty Vocabularies as Linked Open Data
We'll try to load all of http://vocab.getty.edu/dataset/tgn/full.zip into GraphDB 8.3 and report.
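For anyone following along, fetching and unpacking the dump is roughly (illustrative commands, not the only way):

# download the TGN Total export and unpack it;
# the zip contains the N-Triples dump, TGNOut_Full.nt (~38 GB unpacked)
curl -LO http://vocab.getty.edu/dataset/tgn/full.zip
unzip full.zip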
Cheers! Vladimir

Kelvin Kan

Oct 30, 2017, 5:23:05 AM
to Getty Vocabularies as Linked Open Data
Hi Vladimir, I appreciate your effort.

plamen.t...@ontotext.com

Oct 31, 2017, 4:12:29 AM
to Getty Vocabularies as Linked Open Data
Hi Kelvin,
I successfully loaded the data; it took me 5 hours. Could you please share your logs and system details?
Thanks,

Kelvin Kan

Oct 31, 2017, 5:16:36 AM
to Getty Vocabularies as Linked Open Data
Hi, thanks for sharing the info. I do not have logs of my data loading. Here are my system details:
- Platform: VMware® Workstation 12 Pro v12.1.1
- OS: Windows Server 2012 R2 64-bit
- RAM: 12GB
- Processor: Intel(R) Core i7 @3.40GHz (2 cores)

Could you please share how you loaded the data? Thanks.

plamen.t...@ontotext.com

Nov 1, 2017, 9:26:03 AM
to Getty Vocabularies as Linked Open Data
Your system seems fine. For the test I used the LoadRDF tool.
This is what I did:
1. I created a repository with the standard configuration (didn't change any options) and named it getty-test-load.
You could use the ruleset "No Inference" instead of the default RDFS-Plus Optimized; this will speed things up (see the config sketch after these steps).
2. Stopped the database
3. Ran the LoadRDF tool through the console:
bin/loadrdf -i getty-test-load -m parallel -f [path to the data dump]
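For reference, a repository config with inference disabled looks roughly like this (a minimal sketch based on the stock GraphDB Free config template; "empty" is the internal name of the "No Inference" ruleset, and unlisted parameters keep their defaults):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rep: <http://www.openrdf.org/config/repository#> .
@prefix sr: <http://www.openrdf.org/config/repository/sail#> .
@prefix sail: <http://www.openrdf.org/config/sail#> .
@prefix owlim: <http://www.ontotext.com/trree/owlim#> .

[] a rep:Repository ;
   rep:repositoryID "getty-test-load" ;
   rdfs:label "TGN Total export, no inference" ;
   rep:repositoryImpl [
      rep:repositoryType "graphdb:FreeSailRepository" ;
      sr:sailImpl [
         sail:sailType "graphdb:FreeSail" ;
         # "empty" = no inference; the default is "rdfsplus-optimized"
         owlim:ruleset "empty"
      ]
   ] .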

If you still have issues please share your logs.

Cheers,

Kelvin Kan

Nov 1, 2017, 9:46:43 PM
to Getty Vocabularies as Linked Open Data
Thanks for the info. Would you mind sharing your system details?

plamen.t...@ontotext.com

Nov 2, 2017, 4:41:40 AM
to Getty Vocabularies as Linked Open Data
Yep.
Memory: 16 GB (usage was below 4 GB for the LoadRDF process)
Processor: Intel® Core™ i7-6700 CPU @ 3.40GHz × 8
OS: Ubuntu 16.04

Kelvin Kan

Nov 7, 2017, 9:05:23 PM
to Getty Vocabularies as Linked Open Data
Thanks for sharing the info. I am currently running the data load using your method, but it is still running after 24 hours. Could you tell me the total number of statements in TGNOut_Full.nt?

Plamen Tarkalanov

Nov 9, 2017, 8:23:35 AM
to Getty Vocabularies as Linked Open Data
Hi Kelvin,
In the repo I have:
Total:    291,860,768 statements
Inferred:         284 statements

Vladimir Alexiev

Nov 13, 2017, 2:32:36 AM
to Getty Vocabularies as Linked Open Data
This query
select * {graph <http://vocab.getty.edu/.well-known/void> {?s void:triples ?o}}
says 205,891,470 for TGN, but that doesn't count inferred statements.
The Total export loaded by Plamen has all the inferred statements materialized, so we see there are about 86M inferred statements (291,860,768 − 205,891,470).
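A direct count over the loaded repository should match (trivial, but it takes a while on ~290M statements):

# compare against the repository statistics / void:triples
SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }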

Kelvin, were you able to load the thing?

Kelvin Kan

Nov 15, 2017, 8:55:42 PM
to Getty Vocabularies as Linked Open Data
I was able to load the dataset, but it took me 7 days to finish. I noticed that the loading gets slower and slower over time. Is there any way to speed up the process?

Another question: do I still need to import the External Ontologies and the GVP Ontology as instructed in the Getty documentation?

Vladimir Alexiev

Nov 16, 2017, 4:45:30 AM
to Getty Vocabularies as Linked Open Data
- You haven't yet told us your server specs. Tell us as much as you can, in particular the exact GDB version, the amount of RAM, and whether you have an SSD.
- What reasoning do you use?
- What's the total number of triples (explicit and inferred: that info is shown in a tooltip over the repo name)?
- If you want to see property and class info, load the ontologies. But set NO reasoning before that, since all inferred consequences are already materialized in the Total export (see the sketch after this list).
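A minimal way to load them (a sketch; the exact .rdf URL is my assumption based on Getty's download links) is a SPARQL Update:

# run only after switching the ruleset to "No Inference",
# so no new consequences get computed on import
LOAD <http://vocab.getty.edu/ontology.rdf>
# repeat for each external ontology (SKOS, SKOS-XL, etc.) listed in the doc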

My guess is that you're doing plenty of useless inference. We use rather specific inference, see http://vocab.getty.edu/doc/#Inference, in particular Reduced SKOS Inference and Hierarchical Relations Inference.

So you're best off without any inference. http://vocab.getty.edu/doc/#Total_Exports says "Because it includes all required Inference, you can load it to any repository (even one without RDFS reasoning)" but now that I reread it, it doesn't explicitly say "use no inference".

Kelvin, thanks for these trials! You and Plamen may be the first to try the TGN Total Exports in the last year (Getty loads the Explicit Exports, http://vocab.getty.edu/doc/#Explicit_Exports, and then does the inference).

7 days is unacceptably slow and not comparable to the 5 hours it took Plamen, so once we diagnose the problem here, we'll raise a GraphDB support ticket.

Vladimir Alexiev

Nov 16, 2017, 4:49:34 AM
to Getty Vocabularies as Linked Open Data
Also:
- what exact commands did you use for loading? (see Plamen's Nov 1 message)

Size: "291,860,768 statements in the repo".
You could also check the NT files with the Unix command "wc -l *.nt": it will take a while, but since N-Triples puts one triple per line, it should output pretty much the same number.

Kelvin

Nov 23, 2017, 9:22:25 PM
to Getty Vocabularies as Linked Open Data
Here's my server spec:
- OS: Windows Server 2012 R2 Datacenter 64 bit
- RAM: 16GB
- Processor: Intel(R) Core(TM) i3-2100 CPU @ 3.1GHz x 4
- GraphDB 8.3.1
- Using HDD

I am using NO INFERENCE for my repo. Total statements: 291,851,257 (Explicit: 291,851,257, Inferred: 0)

Command used for loading:
loadrdf -f -i TGN -m parallel \...\TGNOut_Full.nt

After the first trial I ran the load again on the same server, and it took about 11 hours. I did not change anything after the first load, which took 7 days. I then repeated the loading test three times, and each run took about 11 hours as well.

Anyway, I have no problem loading the dataset now. Thanks so much, Vladimir and Plamen, for your help.