enable KCB option for rdf2hdt


Gang Fu

Dec 14, 2015, 9:49:54 AM
to BioHDT
I am wondering how I can run the rdf2hdt conversion with the kyotocabinet database (KCB) option. We have a very large RDF N-Triples file whose conversion exceeds the memory limit, and the job gets killed for using too much memory. I cannot find anything about this in the README file.

In addition, I took a look at the BasicHDT.cpp file and found that there is no option for KCB:
void BasicHDT::createComponents() {
    // HEADER
    header = new PlainHeader();

    // DICTIONARY
    std::string dictType = spec.get("dictionary.type");
    if(dictType==HDTVocabulary::DICTIONARY_TYPE_FOUR) {
        dictionary = new FourSectionDictionary(spec);
    } else if(dictType==HDTVocabulary::DICTIONARY_TYPE_PLAIN) {
        dictionary = new PlainDictionary(spec);
    } else if(dictType==HDTVocabulary::DICTIONARY_TYPE_LITERAL) {
#ifdef HAVE_CDS
        dictionary = new LiteralDictionary(spec);
#else
        throw "This version has been compiled without support for this dictionary";
#endif
    } else {
        dictionary = new FourSectionDictionary(spec);
    }

    // TRIPLES
    std::string triplesType = spec.get("triples.type");
    if(triplesType==HDTVocabulary::TRIPLES_TYPE_BITMAP) {
        triples = new BitmapTriples(spec);
    } else if(triplesType==HDTVocabulary::TRIPLES_TYPE_COMPACT) {
        triples = new CompactTriples(spec);
    } else if(triplesType==HDTVocabulary::TRIPLES_TYPE_PLAIN) {
        triples = new PlainTriples(spec);
    } else if(triplesType==HDTVocabulary::TRIPLES_TYPE_TRIPLESLIST) {
        triples = new TriplesList(spec);
#ifndef WIN32
    } else if (triplesType == HDTVocabulary::TRIPLES_TYPE_TRIPLESLISTDISK) {
        triples = new TripleListDisk();
#endif
    } else {
        triples = new BitmapTriples(spec);
    }
}
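
From the code above, the only disk-backed triples option I can see is TripleListDisk, selected through the "triples.type" key of the spec on non-Windows builds; there is no kyotocabinet branch at all. For reference, this is a minimal sketch of how I would try to request it through the C++ API. The file name and base URI are placeholders, and the HDTSpecification::set call and the generateHDT signature are my assumptions from reading the headers, so please correct me if the real API differs:

#include <iostream>
#include <HDT.hpp>
#include <HDTManager.hpp>
#include <HDTSpecification.hpp>
#include <HDTVocabulary.hpp>

using namespace hdt;

int main() {
    // Ask for the on-disk triples list instead of the in-memory default;
    // this mirrors the TRIPLES_TYPE_TRIPLESLISTDISK branch shown above.
    HDTSpecification spec;
    spec.set("triples.type", HDTVocabulary::TRIPLES_TYPE_TRIPLESLISTDISK);

    try {
        // NOTE: the argument order of generateHDT is my assumption from
        // memory of HDTManager.hpp; please double-check it.
        HDT *hdt = HDTManager::generateHDT("dataset.nt", "http://example.org/base",
                                           NTRIPLES, spec);
        hdt->saveToHDT("dataset.hdt");
        delete hdt;
    } catch (const char *e) {
        std::cerr << "Error: " << e << std::endl;
        return 1;
    }
    return 0;
}

Even with that, I suppose the dictionary would still be built in memory (all the dictionary branches above look like in-memory implementations), so it may not be enough for our file.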

migumar2

Dec 14, 2015, 9:58:12 AM
to BioHDT
Hi!

Memory consumption is the main bottleneck of the HDT compression process, mainly because of the large strings in the dataset. Please consider using our MapReduce-based implementation, which addresses this lack of scalability: http://dataweb.infor.uva.es/projects/hdt-mr/. I hope it is useful for you!

Best,

Miguel

Arto Bendiken

Dec 15, 2015, 10:50:24 AM
to Miguel Ángel Martínez Prieto, BioHDT mailing list
Hi Miguel,

On Mon, Dec 14, 2015 at 3:58 PM, migumar2 <migu...@gmail.com> wrote:
> Memory consumption is the main bottleneck of the HDT compression process, mainly because of the large strings in the dataset. Please consider using our MapReduce-based implementation, which addresses this lack of scalability: http://dataweb.infor.uva.es/projects/hdt-mr/. I hope it is useful for you!

Any chance you guys could put up the HDT-MR implementation on GitHub
[1], so that people could file bug reports and submit patches as pull
requests?

Thanks,
Arto

[1] https://github.com/rdfhdt

--
Arto Bendiken | @bendiken | @dydradata

migumar2

Dec 15, 2015, 10:54:23 AM
to BioHDT, migu...@gmail.com
Hi Arto,

I will forward this suggestion to our former student José M. Giménez-García, who is the author of this implementation and is responsible for it. I hope he can help you!

Miguel

Gang Fu

Dec 22, 2015, 1:41:20 PM
to BioHDT
Hi Miguel,

Do you have the pom.xml file for the HDT-MR 'src' code? It would make the build process much easier for me :) Thank you very much.
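
In case it helps to see what I have been trying, this is the rough pom.xml skeleton I put together myself. The HDT-MR coordinates are placeholders and the Hadoop and hdt-java versions are guesses on my part, so it is certainly not the official build file:

<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>

    <!-- Placeholder coordinates; not the official HDT-MR ones -->
    <groupId>org.example</groupId>
    <artifactId>hdt-mr</artifactId>
    <version>0.1-SNAPSHOT</version>
    <packaging>jar</packaging>

    <dependencies>
        <!-- Hadoop client libraries; version guessed, adjust to your cluster -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.5.1</version>
        </dependency>
        <!-- HDT Java library; coordinates/version guessed from the hdt-java project,
             it may be necessary to build and install hdt-java locally instead -->
        <dependency>
            <groupId>org.rdfhdt</groupId>
            <artifactId>hdt-java-core</artifactId>
            <version>2.0</version>
        </dependency>
    </dependencies>
</project>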

Best,
Gang

Arto Bendiken

Jun 13, 2016, 4:54:13 AM
to Miguel Ángel Martínez, BioHDT mailing list
Hi Miguel,

On Wed, Dec 16, 2015 at 12:54 AM, migumar2 <migu...@gmail.com> wrote:
> Hi Arto,
>
> I will forward this suggestion to our former student José M. Giménez-García,
> who is the author of this implementation and is responsible for it. I hope he
> can help you!

Just to follow up on this, did you ever hear back from your former
student on this question?

Thanks,
Arto