Importing from huge CSV data with Java Graph API

71 views
Skip to first unread message

Fabio Rinnone

unread,
Jun 19, 2016, 6:50:17 AM6/19/16
to OrientDB
Hi there,

I'm trying to load data from huge CSV files using Java Graph API. Most
bigger CSV file has about 1M of rows. I would understand what is the
best approach to create edges between rows of couples of CSV files.

I've tried two different approaches:

1) store all vertices and after iterate all vertices of a class and for
every vertices (sources) find all targets vertices from second class
using a SELECT query on an indexed attribute. This approach is very
slowly and I obtain for every query execution the following alert:

Query 'SELECT FROM cluster:clusterName WHERE primaryid = 117538101'
fetched more than 50000 records: to speed up the execution, create an
index or change the query to use an existent index

I don't understand this alert because I've definited an index on my
primaryid property.

2) store all vertices on graph and create two HashMap (key=primaryid,
value=OrientVertex) that contains all vertices loaded and use it to find
target vertices for every source: this approach is more quickly but it
is use a lot of memory. I'm executing the process on a server that have
12GB of RAM but it's insufficient.

Can you suggest me another approach more efficient, please?

What is exactly the approach used by ETL, for instance?

--
Fabio Rinnone
Skype: fabiorinnone
Web: http://www.fabiorinnone.eu


signature.asc

Luca Garulli

unread,
Jun 19, 2016, 8:15:16 AM6/19/16
to OrientDB
Hi Fabio,

The ETL approach is to query the index every time. However, if you received that message, it means the index wasn't used in the query, so lookups will be super slow.

Could you please execute this query from console or studio against your db?

EXPLAIN SELECT FROM cluster:clusterName WHERE primaryid = 117538101'

And check if there is any involvedIndexes?


Best Regards,

Luca Garulli
Founder & CEO


--

---
You received this message because you are subscribed to the Google Groups "OrientDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to orient-databa...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Fabio Rinnone

unread,
Jun 19, 2016, 9:53:22 AM6/19/16
to orient-...@googlegroups.com
From console 'info class' command show me this index correctly defined:

INDEXES (1 altogether)
-------------------------------+----------------+
NAME | PROPERTIES |
-------------------------------+----------------+
primary_demo | primaryid |
-------------------------------+----------------+

If I execute the following query:

SELECT FROM cluster:demo15q4 WHERE primaryid = 117596311

I obtain correct result in 6.456 sec(s): it is too much time for a
cluster with 314858 records.

I'm yet using OrientDB 2.1.16, do you suggest me to migrate to 2.2.x
version?

Thank you.

Il 19/06/2016 14:14, Luca Garulli ha scritto:
> Hi Fabio,
>
> The ETL approach is to query the index every time. However, if you
> received that message, it means the index wasn't used in the query, so
> lookups will be super slow.
>
> Could you please execute this query from console or studio against your db?
>
> EXPLAIN SELECT FROM cluster:clusterName WHERE primaryid = 117538101'
>
> And check if there is any involvedIndexes?
>
>
> Best Regards,
>
> Luca Garulli
> Founder & CEO
> OrientDB <http://orientdb.com/>
signature.asc

Luca Garulli

unread,
Jun 19, 2016, 10:03:41 AM6/19/16
to OrientDB
Hi Fabio,

I definitely suggest to move to v2.2. However, could you please try the EXPLAIN?

Best Regards,

Luca Garulli
Founder & CEO


Fabio Rinnone

unread,
Jun 19, 2016, 10:13:23 AM6/19/16
to orient-...@googlegroups.com
Sorry, I did not understand, the output of EXPLAIN is the following:

Profiled command
'{documentReads:314858,current:#23:314857,recordReads:314858,fetchingFromTargetElapsed:6029,evaluated:314858,user:#5:0,tips:[1],elapsed:6030.6943,resultType:collection,resultSize:1}'
in 6.037000 sec(s):
{"@type":"d","@version":0,"documentReads":314858,"current":"#23:314857","recordReads":314858,"fetchingFromTargetElapsed":6029,"evaluated":314858,"user":"#5:0","tips":["Query
'SELECT FROM cluster:demo15q4 WHERE primaryid = 117596311' fetched more
than 50000 records: to speed up the execution, create an index or change
the query to use an existent
index"],"elapsed":6030.6943,"resultType":"collection","resultSize":1,"@fieldTypes":"documentReads=l,current=x,recordReads=l,fetchingFromTargetElapsed=l,evaluated=l,user=x,elapsed=f"}

The tip is the same of the Java application output.

Il 19/06/2016 16:03, Luca Garulli ha scritto:
> Hi Fabio,
>
> I definitely suggest to move to v2.2. However, could you please try the
> EXPLAIN?
>
> Best Regards,
>
> Luca Garulli
> Founder & CEO
> OrientDB LTD <http://orientdb.com/>
signature.asc

Luca Garulli

unread,
Jun 19, 2016, 10:17:10 AM6/19/16
to OrientDB
Ok, so no index has been used. Could you please drop and recreate it? If you are using 'primaryid' to just lookups and never ranges (ex: SELECT FROM cluster:demo15q4 WHERE primaryid > 117596311) I suggest you to use the HASH_INDEX because it's faster.



Best Regards,

Luca Garulli
Founder & CEO


Fabio Rinnone

unread,
Jun 19, 2016, 10:35:23 AM6/19/16
to orient-...@googlegroups.com
I've tried following commands from console:

> DROP INDEX demo.primary_demo
Index removed successfully

> CREATE INDEX demo.primaryid UNIQUE_HASH_INDEX
Created index successfully with 314858 entries in 15.230000 sec(s).

but

EXPLAIN SELECT FROM cluster:demo15q4 WHERE primaryid = 117596311

produces the same result.

But I noticed now that if I try to execute the query not into cluster
but into class as following:

EXPLAIN SELECT FROM demo WHERE primaryid = 117596311

the query obtains result in 0.006000 sec(s): it is very best time! But
why? Is it not possible to query a cluster using an indexed property?

I will load many cluster and I'm not sure if in the future the query can
maintain these optimal results, if I query is executed into entire class
rather than a specified cluster.

Il 19/06/2016 16:16, Luca Garulli ha scritto:
> Ok, so no index has been used. Could you please drop and recreate it? If
> you are using 'primaryid' to just lookups and never ranges (ex: SELECT
> FROM cluster:demo15q4 WHERE primaryid > 117596311) I suggest you to use
> the HASH_INDEX because it's faster.
>
>
>
> Best Regards,
>
> Luca Garulli
> Founder & CEO
signature.asc

Luca Garulli

unread,
Jun 19, 2016, 2:42:43 PM6/19/16
to OrientDB
I think you've just create a manual index (it's up to you to populate it). You need an automatic index, so let's keep it strict to the SQL syntax:


CREATE INDEX demo.primaryid ON demo(primaryid) UNIQUE_HASH_INDEX



Best Regards,

Luca Garulli
Founder & CEO


Fabio Rinnone

unread,
Jun 20, 2016, 4:54:07 AM6/20/16
to orient-...@googlegroups.com
I have the same result: if I query the cluster the index is not used.
Can be a bug?

Il 19/06/2016 20:42, Luca Garulli ha scritto:
> I think you've just create a manual index (it's up to you to populate
> it). You need an automatic index, so let's keep it strict to the SQL syntax:
>
>
> CREATE INDEX demo.primaryid ON demo(primaryid) UNIQUE_HASH_INDEX
>
>
>
> Best Regards,
>
> Luca Garulli
> Founder & CEO
signature.asc

Luca Garulli

unread,
Jun 20, 2016, 9:02:21 AM6/20/16
to OrientDB
Hi Fabio,

Sorry I didn't read your last message. You're right, indexes works only at class level. We have a proposal to support indexes at cluster level, but in the meanwhile you can create one class per cluster that extend the base "demo":

create class demo15q4 extends demo
create index demo15q4.primaryid on demo15q4(primaryid) UNIQUE_HASH_INDEX
create class demo15q5 extends demo
create index demo15q5.primaryid on demo15q5(primaryid) UNIQUE_HASH_INDEX
...


In this case both queries would work with index:

SELECT FROM demo WHERE primaryid = 117596311
SELECT FROM demo15q4 WHERE primaryid = 117596311


Best Regards,

Luca Garulli
Founder & CEO


Fabio Rinnone

unread,
Jun 23, 2016, 4:24:27 AM6/23/16
to orient-...@googlegroups.com
Thank you Luca, now it seems it works very good.

Best regards.

Il 20/06/2016 15:01, Luca Garulli ha scritto:
> Hi Fabio,
>
> Sorry I didn't read your last message. You're right, indexes works only
> at class level. We have a proposal to support indexes at cluster level,
> but in the meanwhile you can create one class per cluster that extend
> the base "demo":
>
> create class demo15q4 extends demo
> create index demo15q4.primaryid on demo15q4(primaryid) UNIQUE_HASH_INDEX
> create class demo15q5 extends demo
> create index demo15q5.primaryid on demo15q5(primaryid) UNIQUE_HASH_INDEX
> ...
>
>
> In this case both queries would work with index:
>
> SELECT FROM demo WHERE primaryid = 117596311
> SELECT FROM demo15q4 WHERE primaryid = 117596311
>
>
> Best Regards,
>
> Luca Garulli
> Founder & CEO
signature.asc
Reply all
Reply to author
Forward
0 new messages