Using various alternatives to create a Neo4j database for the first time

Guenter Hipler

Mar 21, 2016, 7:22:55 PM
to Neo4j
Hi,

we are taking our first steps with Neo4j and have tried various alternatives for creating an initial database.

1) We used the Java API with an embedded database. Here
https://github.com/linked-swissbib/swissbib-metafacture-commands/blob/neo4j-tests/src/main/java/org/swissbib/linked/mf/writer/NeoIndexer.java#L76
a transaction is closed which wraps 20,000 nodes with relationships to around 40,000 other nodes.
We are surprised that the Transaction.close() method needs up to 30 seconds to write these nodes to disk.


2) Then I wanted to compare my results with the neo4j-import script provided by the Neo4j server.
Using this method I ran into difficulties with the format of the CSV files.

My small examples:
first node file:
lsId:ID(localsignature),:LABEL
"NEBIS/002527587",LOCALSIGNATURE
"OCoLC/637556711",LOCALSIGNATURE



second node file:
brId:ID(bibliographicresource),active,:LABEL
146404300,true,BIBLIOGRAPHICRESOURCE


relationship file:
:START_ID(bibliographicresource),:END_ID(localsignature),:TYPE
146404300,"NEBIS/002527587",SIGNATUREOF
146404300,"OCoLC/637556711",SIGNATUREOF

./neo4j-import --into [path-to-db]/test.db/ --nodes files/br.csv --nodes files/br.csv --relationships:SIGNATUREOF files/signatureof.csv
which throws the following exception:

Done in 191ms
Prepare node index
Exception in thread "Thread-3" org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.DuplicateInputIdException: Id '146404300' is defined more than once in bibliographicresource, at least at /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2 and /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2
    at org.neo4j.unsafe.impl.batchimport.input.BadCollector$2.exception(BadCollector.java:107)
    at org.neo4j.unsafe.impl.batchimport.input.BadCollector.checkTolerance(BadCollector.java:176)
    at org.neo4j.unsafe.impl.batchimport.input.BadCollector.collectDuplicateNode(BadCollector.java:96)
    at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.detectDuplicateInputIds(EncodingIdMapper.java:590)
    at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.buildCollisionInfo(EncodingIdMapper.java:494)
    at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.prepare(EncodingIdMapper.java:282)
    at org.neo4j.unsafe.impl.batchimport.IdMapperPreparationStep.process(IdMapperPreparationStep.java:54)
    at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep$1.run(LonelyProcessingStep.java:56)
Duplicate input ids that would otherwise clash can be put into separate id space, read more about how to use id spaces in the manual: http://neo4j.com/docs/2.3.2/import-tool-header-format.html#import-tool-id-spaces
Caused by:Id '146404300' is defined more than once in bibliographicresource, at least at /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2 and /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2


I can't see what I'm doing differently from the documentation at
http://neo4j.com/docs/2.3.2/import-tool-header-format.html#import-tool-id-spaces
since I did try to use the ID space notation (as far as I can tell...).

Thanks for any hints!

Günter

Michael Hunger

Mar 22, 2016, 3:16:32 AM
to ne...@googlegroups.com
Hi Guenter,


On 21.03.2016 at 23:43, 'Guenter Hipler' via Neo4j <ne...@googlegroups.com> wrote:

Hi

we are taking our first steps with Neo4j and have tried various alternatives for creating an initial database.

1) We used the Java API with an embedded database. Here
https://github.com/linked-swissbib/swissbib-metafacture-commands/blob/neo4j-tests/src/main/java/org/swissbib/linked/mf/writer/NeoIndexer.java#L76
a transaction is closed which wraps 20,000 nodes with relationships to around 40,000 other nodes.
We are surprised that the Transaction.close() method needs up to 30 seconds to write these nodes to disk.


That depends on your disk performance; I hope you're not using a spinning disk?

So you have 20k + 40k + 40k+ records that you write (plus properties)?
Then you'd need a 4G heap and a fast disk to write them out quickly.
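
For an embedded database the heap is set on the JVM that hosts your application. A sketch (the jar name is just a placeholder):

java -Xms4g -Xmx4g -jar your-importer.jar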

One option is to reduce the batch size to e.g. 10k records per tx in total. If your domain allows it, you can also parallelize node creation and relationship creation (watch out for concurrent writes to the same nodes, though).
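
A minimal sketch of that batching pattern (Record and createNodesAndRels are illustrative placeholders, not from your code):

void importAll(GraphDatabaseService graphDb, Iterable<Record> records) {
    // Record / createNodesAndRels stand in for your input type and write logic
    Transaction tx = graphDb.beginTx();
    try {
        int counter = 0;
        for (Record record : records) {
            createNodesAndRels(record);       // your per-record node/relationship writes
            if (++counter % 10_000 == 0) {    // commit and start a fresh tx every 10k records
                tx.success();
                tx.close();
                tx = graphDb.beginTx();
            }
        }
        tx.success();                         // commit the final partial batch
    } finally {
        tx.close();
    }
}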

I have some comments on your code below; especially the tx handling for the schema creation has to be fixed.

For the initial import neo4j-import should work well for you.

You only made the mistake of passing your first node file twice on the command line; you probably wanted the second node file in the second place:

./neo4j-import --into [path-to-db]/test.db/ --nodes files/br.csv --nodes files/br.csv --relationships:SIGNATUREOF files/signatureof.csv
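
A corrected invocation would look like this (assuming the local-signature node file is saved as files/ls.csv; your post doesn't name the files):

./neo4j-import --into [path-to-db]/test.db/ --nodes files/ls.csv --nodes files/br.csv --relationships:SIGNATUREOF files/signatureof.csv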

You can also provide the overarching label for the nodes on the command line.
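
For example (with the labels given per node file, the :LABEL column in the CSV can be dropped; again assuming files/ls.csv as the first node file):

./neo4j-import --into [path-to-db]/test.db/ --nodes:LOCALSIGNATURE files/ls.csv --nodes:BIBLIOGRAPHICRESOURCE files/br.csv --relationships:SIGNATUREOF files/signatureof.csv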


Cheers, Michael

Code comments:

package org.swissbib.linked.mf.writer;

// imports added for completeness (assuming the Metafacture 2.x and Neo4j 2.3 package layout)
import java.io.File;

import org.culturegraph.mf.framework.DefaultStreamPipe;
import org.culturegraph.mf.framework.ObjectReceiver;
import org.culturegraph.mf.framework.StreamReceiver;
import org.culturegraph.mf.framework.annotations.Description;
import org.culturegraph.mf.framework.annotations.In;
import org.culturegraph.mf.framework.annotations.Out;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@Description("Transforms documents to a Neo4j graph.")
@In(StreamReceiver.class)
@Out(Void.class)
public class NeoIndexer extends DefaultStreamPipe<ObjectReceiver<String>> {

    private final static Logger LOG = LoggerFactory.getLogger(NeoIndexer.class);
    GraphDatabaseService graphDb;
    File dbDir;
    Node mainNode;
    Transaction tx;
    int batchSize;
    int counter = 0;
    boolean firstRecord = true;


    public void setBatchSize(String batchSize) {
        this.batchSize = Integer.parseInt(batchSize);
    }

    public void setDbDir(String dbDir) {
        this.dbDir = new File(dbDir);
    }

    @Override
    public void startRecord(String identifier) {

        if (firstRecord) {
// is there no explicit onStartStream method in your API?
// otherwise it might be better to pass the graphDb in via the constructor or a setter
            graphDb = new GraphDatabaseFactory().newEmbeddedDatabase(dbDir);
            tx = graphDb.beginTx();
            graphDb.schema().indexFor(lsbLabels.PERSON).on("name");
            graphDb.schema().indexFor(lsbLabels.ORGANISATION).on("name");
            graphDb.schema().indexFor(lsbLabels.BIBLIOGRAPHICRESOURCE).on("name");
            graphDb.schema().constraintFor(lsbLabels.PERSON).assertPropertyIsUnique("name");
            graphDb.schema().constraintFor(lsbLabels.ORGANISATION).assertPropertyIsUnique("name");
            graphDb.schema().constraintFor(lsbLabels.BIBLIOGRAPHICRESOURCE).assertPropertyIsUnique("name");
            graphDb.schema().constraintFor(lsbLabels.ITEM).assertPropertyIsUnique("name");
            graphDb.schema().constraintFor(lsbLabels.LOCALSIGNATURE).assertPropertyIsUnique("name");
            tx.success();
// missing tx.close(): this is a schema tx, which can't be mixed with data txs
// also, if this tx is not closed, the indexes and constraints will not be in place, so your lookups will be slow
            firstRecord = false;
// create new tx after tx.close()
        }

        counter += 1;
        LOG.debug("Working on record {}", identifier);
        if (identifier.contains("person")) {
            mainNode = createNode(lsbLabels.PERSON, identifier, false);
        } else if (identifier.contains("organisation")) {
            mainNode = createNode(lsbLabels.ORGANISATION, identifier, false);
        } else {
            mainNode = createNode(lsbLabels.BIBLIOGRAPHICRESOURCE, identifier, true);
        }

    }

    @Override
    public void endRecord() {
        tx.success();
        if (counter % batchSize == 0) {
            LOG.info("Commit batch upload ({} records processed so far)", counter);
            tx.close();
            tx = graphDb.beginTx();
        }
        super.endRecord();
    }

    @Override
    public void literal(String name, String value) {
        Node node;

        switch (name) {
            case "br":
                node = graphDb.findNode(lsbLabels.BIBLIOGRAPHICRESOURCE, "name", value);
                mainNode.createRelationshipTo(node, lsbRelations.CONTRIBUTOR);
                break;
            case "bf:local":
                node = createNode(lsbLabels.LOCALSIGNATURE, value, false);
                node.createRelationshipTo(mainNode, lsbRelations.SIGNATUREOF);
                break;
            case "item":
                node = createNode(lsbLabels.ITEM, value, false);
                node.createRelationshipTo(mainNode, lsbRelations.ITEMOF);
                break;
        }
    }
// naming of variables! l, v and a are hard to read
// you might consider using an :Active label instead, which is more efficient than a property
// but it's good that you only set the property for the true value

    private Node createNode(Label l, String v, boolean a) {
        Node n = graphDb.createNode(l);
        n.setProperty("name", v);
        if (a)
            n.setProperty("active", "true");
        return n;
    }

    @Override
    protected void onCloseStream() {
        LOG.info("Cleaning up (altogether {} records processed)", counter);
// does this always happen after an endRecord? otherwise you need a tx.success() here
        tx.close();
    }

    private enum lsbLabels implements Label {
        BIBLIOGRAPHICRESOURCE, PERSON, ORGANISATION, ITEM, LOCALSIGNATURE
    }

    public enum lsbRelations implements RelationshipType {
        CONTRIBUTOR, ITEMOF, SIGNATUREOF
    }
}
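
For reference, a sketch of the fixed schema setup. Note that in the original the schema builder calls are missing .create(), so the indexes and constraints would never actually be created; also, a uniqueness constraint already creates a backing index, so separate indexFor(...) calls on the same label/property are redundant:

// dedicated schema transaction, closed before the first data transaction is opened
try (Transaction schemaTx = graphDb.beginTx()) {
    graphDb.schema().constraintFor(lsbLabels.PERSON).assertPropertyIsUnique("name").create();
    graphDb.schema().constraintFor(lsbLabels.ORGANISATION).assertPropertyIsUnique("name").create();
    graphDb.schema().constraintFor(lsbLabels.BIBLIOGRAPHICRESOURCE).assertPropertyIsUnique("name").create();
    graphDb.schema().constraintFor(lsbLabels.ITEM).assertPropertyIsUnique("name").create();
    graphDb.schema().constraintFor(lsbLabels.LOCALSIGNATURE).assertPropertyIsUnique("name").create();
    schemaTx.success();
}
tx = graphDb.beginTx(); // the first data transaction starts only here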





Guenter Hipler

Apr 10, 2016, 1:42:35 PM
to Neo4j
Hi Michael,

thanks for your hints and sorry for my delayed response.

In the meantime (shortly after your response):
- my colleague found an error in his index definitions (thanks to your remarks), which made lookups on existing labels much faster.
- he doesn't experience the same difficulties (slow performance) writing new nodes that I have. I still have to look into the reason for this difference.

By the way: are you planning to offer a Neo4j workshop in June around Berlin Buzzwords, as you did last year? I would be interested in taking part.

The week before that I will meet people from SLUB in Dresden.

Günter

Michael Hunger

Apr 10, 2016, 2:46:14 PM
to ne...@googlegroups.com
Last year we had a hackathon on the Sunday before, and I presented on graph computing with Neo4j.

You can also meet me in Dresden when I'm around.

Please let me know if you want to meet. Where are you originally located?


Guenter Hipler

Apr 13, 2016, 2:53:24 PM
to Neo4j, Sebastian Schüpbach
Yes, it was a nice day in the C-Base Raumstation.
I'm located in Basel, Switzerland, and I'm working on the swissbib project (https://www.swissbib.ch/). That's the reason I'm going to meet people from SLUB: to get more familiar with their D:Swarm solution. So in general Dresden is far away... but Buzzwords would be a possibility.

Very best wishes, Günter