De-duplication is very slow?


Sven Hodapp

Apr 21, 2015, 11:03:25 AM
to orient-...@googlegroups.com
Hi all,

For my project I want to import a lot of data into the database, but the data should be de-duplicated.

Without de-duplication, inserting a minimal test set takes about 500 ms:

    // create the vertex, attach the remaining attributes, then commit
    docelem = graph.addVertex("class:DocElem", "uri", uri, "type", type, "model", model);
    docelem.setProperties(attrs);
    graph.commit();


Now I don't want the same data to be uploaded twice, so I check whether it is already in the database like this:


    // look up at most one vertex with the same uri
    Iterable<Vertex> iter = graph.query()
        .has("uri", Compare.EQUAL, uri)
        .limit(1)
        .vertices();


Then I check with iter.iterator().hasNext() whether the vertex is already in the database. But this is dead slow (even with an index, unless I've made a mistake)! Inserting now takes about 15 s.


Can you suggest a better solution? Ideally I wouldn't have to query the database myself: the database would recognize that the requested uri has already been inserted and just update the existing entry, or something like that.
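
One way to avoid the lookup for repeats inside a single import run is to remember the URIs already handled on the client side. The following is only a minimal sketch under stated assumptions: `SeenUriFilter` and `firstTime` are hypothetical names, not part of OrientDB or Blueprints, and the filter only knows about URIs offered to it in this process. It sees neither records already stored in the database nor inserts made by other workers, so it cannot replace a check or constraint on the database side.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical client-side helper: remembers which URIs were already
 * offered during this import run, so repeats can be skipped without
 * a round trip to the database.
 */
public final class SeenUriFilter {
    private final Set<String> seen = new HashSet<>();

    /** Returns true the first time a URI is offered, false on every repeat. */
    public boolean firstTime(String uri) {
        // Set.add returns false when the element was already present
        return seen.add(uri);
    }
}
```

In the import loop the insert would then be guarded with something like `if (filter.firstTime(uri)) { ... addVertex ... }`, so only URIs that are new to this run reach the database.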


Note 1: With println instead of the db insert, the code needs about 50 ms to fetch/create the data. Is it possible to go faster?

Note 2: I'm using OrientDB 2.1-rc1 with a remote connection (on the same host).


Regards,

Sven

Luigi Dell'Aquila

Apr 21, 2015, 11:09:07 AM
to orient-...@googlegroups.com
Hi Sven,

how are you connected to the db (remote or plocal)?
Could you post the details of your db schema, in terms of classes, properties and index definitions?

Thanks

Luigi



Luigi Dell'Aquila

Apr 21, 2015, 11:10:43 AM
to orient-...@googlegroups.com
Ok, I saw the remote only now. If you want to go much master you should use plocal (this avoids network overhead).
If you need other clients to connect to the db in remote mode, you can start the server as embedded, see http://orientdb.com/docs/last/Embedded-Server.html

Luigi Dell'Aquila

Apr 21, 2015, 11:11:19 AM
to orient-...@googlegroups.com
much *F*aster of course, not *m*aster...

Sven Hodapp

Apr 22, 2015, 2:52:59 AM
to orient-...@googlegroups.com
Hi Luigi,

thanks for your response! I've tried the plocal and memory mode. But all I get is a stack trace:

    Cannot open the storage 'somedb' because it does not exist in path: somedb

This also happens if I pass valid Unix paths. Maybe because of 2.1-rc1?

Nevertheless, it should also be possible to do such transactions over a remote connection, so that multiple workers can upload data to the database.

Any ideas how to make this faster?

Regards,
Sven

Luigi Dell'Aquila

Apr 22, 2015, 3:34:46 AM
to orient-...@googlegroups.com
Hi Sven,

The error you are getting is quite strange. Are you using an absolute file path for that?
E.g. plocal:/home/user1/my/db

In OrientDB there is an option called UPSERT that does exactly what you need. Unfortunately you cannot use it from the Java API; you have to use SQL. Here is an example statement to do this:

    String statement = "UPDATE DocElem set uri = ?, type = ?, model = ? UPSERT WHERE uri = ?";
    db.command(new OCommandSQL(statement)).execute(uri, type, model, uri);
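
If the vertices carry more properties than the three above, the statement and its positional arguments can be assembled from a map. This is a minimal sketch, not an OrientDB API: `UpsertStatement` and `build` are hypothetical names, the class and field names are concatenated into the SQL text unchanged (so they must come from trusted code, never from user input), and the map should be insertion-ordered so that the `?` placeholders and the arguments line up predictably.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: an UPDATE ... UPSERT statement plus its positional arguments. */
public final class UpsertStatement {
    public final String sql;
    public final Object[] params;

    private UpsertStatement(String sql, Object[] params) {
        this.sql = sql;
        this.params = params;
    }

    /**
     * Builds "UPDATE <class> SET a = ?, b = ? UPSERT WHERE <key> = ?".
     * props must contain keyField; its value is appended once more
     * as the argument for the WHERE clause.
     */
    public static UpsertStatement build(String className, String keyField,
                                        Map<String, Object> props) {
        if (!props.containsKey(keyField)) {
            throw new IllegalArgumentException("props must contain " + keyField);
        }
        StringBuilder set = new StringBuilder();
        List<Object> values = new ArrayList<>();
        for (Map.Entry<String, Object> e : props.entrySet()) {
            if (set.length() > 0) {
                set.append(", ");
            }
            set.append(e.getKey()).append(" = ?");
            values.add(e.getValue());
        }
        // last placeholder: the key value used in the WHERE clause
        values.add(props.get(keyField));
        String sql = "UPDATE " + className + " SET " + set
                + " UPSERT WHERE " + keyField + " = ?";
        return new UpsertStatement(sql, values.toArray());
    }
}
```

The result would be passed on the same way as above, e.g. `db.command(new OCommandSQL(s.sql)).execute(s.params);`. As far as I know, a unique index on uri is still advisable so that concurrent workers cannot create two records for the same uri.
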
Hope it helps

Luigi