Loading data from CSV into specified cluster

Fabio Rinnone

Apr 15, 2016, 6:29:47 AM
to OrientDB
Hi,

is there a way to load data from CSV via the ETL module into only one
specified cluster of a class? I have several CSV files and I would like
to load each file's data into its own cluster. I need as many clusters
as there are CSV files, and each CSV file must be loaded into a single
cluster.

This option in the ETL loader, placed before the classes definition,
doesn't work for me:

"cluster":"clusterName",

Thanks, everybody.

--
Fabio Rinnone
Skype: fabiorinnone
Web: http://www.fabiorinnone.eu

Roberto Franchini

Apr 15, 2016, 10:10:38 AM
to orient-...@googlegroups.com
Well, it's a bug.
The cluster name is not handled correctly inside the ETL:

https://github.com/orientechnologies/orientdb/issues/5987

At the moment ETL doesn't create clusters: they must already exist in
the database BEFORE you launch ETL.
RF
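
Since ETL does not create clusters, they have to be added to the class
beforehand. A minimal sketch of generating the statements for that,
assuming OrientDB's ALTER CLASS ... ADDCLUSTER SQL command (which, to my
knowledge, also creates the cluster if it is missing). The class name
and file names are hypothetical and do not come from this thread:

```python
from pathlib import Path


def cluster_name_for(csv_path: str) -> str:
    """Derive a cluster name from a CSV file name ('data/Part_A.csv' -> 'part_a')."""
    return Path(csv_path).stem.lower()


def add_cluster_statements(class_name: str, csv_paths: list[str]) -> list[str]:
    """Build one ALTER CLASS ... ADDCLUSTER statement per CSV file.

    Assumption: the class already exists and the statements are run in the
    OrientDB console before ETL is launched.
    """
    return [
        f"ALTER CLASS {class_name} ADDCLUSTER {cluster_name_for(path)}"
        for path in csv_paths
    ]


if __name__ == "__main__":
    # Hypothetical class and file names, for illustration only.
    for statement in add_cluster_statements("MyClass", ["part_a.csv", "part_b.csv"]):
        print(statement)
```

The script only prints the statements, so they can be reviewed before
being pasted into the console.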

--
Best regards,

Roberto Franchini

OrientDB LTD - http://orientdb.com

Fabio Rinnone

Apr 16, 2016, 3:24:48 PM
to orient-...@googlegroups.com
Thank you, now it creates the cluster, but my question is a little
different: if, for instance, I create a class named "foo" and in my ETL
loader I define a new class named "foocluster", then when I launch ETL
(assume the CSV has 2000 rows) it loads 1000 rows into the "foo" cluster
and the other 1000 into "foocluster". I want all 2000 rows to be loaded
only into "foocluster".

Is my assumption correct, or have I perhaps misunderstood the role of
clusters in OrientDB?

Thank you again.

scott molinari

Apr 18, 2016, 12:46:50 AM
to OrientDB, fabio.r...@gmail.com
What is your understanding of the meaning of a cluster?

Scott

Roberto Franchini

Apr 18, 2016, 4:25:38 AM
to orient-...@googlegroups.com
First of all, the documentation:

http://orientdb.com/docs/last/Tutorial-Classes.html

http://orientdb.com/docs/last/Tutorial-Clusters.html

Now, ETL. If you configure ETL to store into a given cluster, all the
loaded documents will be stored in that cluster.
So you can load different partitions of the data into different
clusters of the same class.
Suppose you have 12 CSVs, one for each month. Each CSV contains the
invoices for a single month:
invoices_01.csv contains invoices for January
invoices_12.csv contains invoices for December

It could be useful to "partition" the Invoice class into 12 clusters and
load each CSV into its own cluster.

I hope this clarifies the purpose of clusters.
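
The one-file-per-cluster idea can be sketched as a small generator of
ETL configurations. This is only an illustration: the "cluster" loader
option is the one discussed in this thread, while the "source",
"extractor", "dbURL" and "class" keys are assumed from the usual ETL
configuration layout, and the database URL is a placeholder. Whether the
loader honours "cluster" depends on the ETL version, given the bug
linked earlier.

```python
import json
from pathlib import Path


def etl_config_for(csv_path: str, db_url: str, class_name: str) -> dict:
    """Build an ETL configuration that loads one CSV file into one cluster.

    The cluster name is taken from the file name, so 'invoices_01.csv'
    targets the cluster 'invoices_01'.
    """
    cluster = Path(csv_path).stem
    return {
        # Assumed standard ETL sections; adjust them to your OrientDB version.
        "source": {"file": {"path": csv_path}},
        "extractor": {"csv": {}},
        "loader": {
            "orientdb": {
                "dbURL": db_url,
                "class": class_name,
                "cluster": cluster,
            }
        },
    }


if __name__ == "__main__":
    # One configuration per monthly file: invoices_01.csv to invoices_12.csv.
    for month in range(1, 13):
        csv_path = f"invoices_{month:02d}.csv"
        # Placeholder database URL, for illustration only.
        config = etl_config_for(csv_path, "plocal:../databases/invoices", "Invoice")
        print(json.dumps(config["loader"]["orientdb"]))
```

Each generated configuration differs only in the CSV path and the
cluster name, which keeps the twelve ETL runs consistent with each other.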

Fabio Rinnone

Apr 18, 2016, 5:29:12 AM
to orient-...@googlegroups.com
Thank you for the reply. I have read the documentation and I think that
I correctly understand the role of clusters in OrientDB.

So, I will explain my issue using your invoices example.

Suppose we have two CSV files, invoices_01.csv defined as follows:

"id","customer","total"
"1","John","1000"
"2","Bob","250"
"3","Jack","630"
"4","Alice","900"

and invoices_02.csv defined as follows:

"id","customer","total"
"1","John","1000"
"2","Bob","250"
"3","Jack","630"
"4","Alice","900"

So, I would like to create a class named invoices (with its default
cluster, also named invoices) and two more clusters named invoices_01
(for the data from the first CSV file) and invoices_02 (for the data
from the second one).

I define my first ETL loader as follows:

"loader": {
  "orientdb": {
    "dbURL": "plocal:../databases/invoices",
    "wal": false,
    "tx": false,
    "batchCommit": 10000,
    "dbType": "graph",
    "cluster": "invoices_01",
    "classes": [
      {"name": "invoices", "extends": "V"}
    ],
    "indexes": [
      {"class": "invoices", "fields": ["id:integer"], "type": "UNIQUE"}
    ]
  }
}

Note the "cluster" parameter with value "invoices_01" (the second ETL
configuration is similar; only the cluster name and the CSV file path
change).

When I launch the first ETL configuration I expect it to create a class
with two clusters, named invoices and invoices_01, and I expect the
invoices cluster to contain no records and invoices_01 to contain all 4
records of the CSV file.

But my output is different: it creates two clusters with ids 11
(invoices) and 12 (invoices_01), and it loads the data as follows:

[1:vertex] DEBUG Transformer output: v(invoices)[#11:0]
[2:vertex] DEBUG Transformer output: v(invoices)[#12:0]
[3:vertex] DEBUG Transformer output: v(invoices)[#11:1]
[4:vertex] DEBUG Transformer output: v(invoices)[#12:1]

I think this is not correct, because my loader should load data only
into the cluster with id 12.

When I launch the second ETL loader the result is similar: it creates a
new cluster named invoices_02 with id 13, and the log contains:

[1:vertex] DEBUG Transformer output: v(invoices)[#11:2]
[2:vertex] DEBUG Transformer output: v(invoices)[#12:2]
[3:vertex] DEBUG Transformer output: v(invoices)[#13:0]
[4:vertex] DEBUG Transformer output: v(invoices)[#11:3]

I think that the second ETL loader should load data only into the
cluster with id 13 (invoices_02). In the end I have three clusters (11,
12, 13) which contain 4, 3 and 1 records respectively.

I don't know what my error is.
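
The per-cluster totals can be checked by tallying the record ids in the
two logs above. A minimal sketch, assuming each record id has the form
#<cluster id>:<position> as it appears in the log lines; the helper name
is made up for this illustration:

```python
import re
from collections import Counter

# Log lines copied from the two ETL runs described above.
LOG_LINES = [
    "[1:vertex] DEBUG Transformer output: v(invoices)[#11:0]",
    "[2:vertex] DEBUG Transformer output: v(invoices)[#12:0]",
    "[3:vertex] DEBUG Transformer output: v(invoices)[#11:1]",
    "[4:vertex] DEBUG Transformer output: v(invoices)[#12:1]",
    "[1:vertex] DEBUG Transformer output: v(invoices)[#11:2]",
    "[2:vertex] DEBUG Transformer output: v(invoices)[#12:2]",
    "[3:vertex] DEBUG Transformer output: v(invoices)[#13:0]",
    "[4:vertex] DEBUG Transformer output: v(invoices)[#11:3]",
]

# A record id looks like [#<cluster id>:<position>].
RID_PATTERN = re.compile(r"\[#(\d+):(\d+)\]")


def records_per_cluster(lines):
    """Count how many records landed in each cluster id."""
    counts = Counter()
    for line in lines:
        match = RID_PATTERN.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    for cluster_id, count in sorted(records_per_cluster(LOG_LINES).items()):
        print(f"cluster {cluster_id}: {count} record(s)")
```

The tally gives 4, 3 and 1 records for clusters 11, 12 and 13, which
suggests the records are being spread across all clusters of the class
instead of going to the configured one.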

Fabio Rinnone

Apr 18, 2016, 5:35:24 AM
to orient-...@googlegroups.com
In my example the data in the two CSV files are different:

invoices_01.csv:

"id","customer","total"
"1","John","1000"
"2","Bob","250"
"3","Jack","630"
"4","Alice","900"

invoices_02.csv:

"id","customer","total"
"5","Jimmy","1200"
"6","Bart","1250"
"7","Bob","920"
"8","John","200"

because ids are unique.

Sorry for the mistake: I wrote in a hurry.

Roberto Franchini

Apr 19, 2016, 3:08:19 AM
to orient-...@googlegroups.com
You're right, there's another bug.
I guess your configuration contains the vertex transformer: if you
don't use it, documents will be saved in the configured cluster.
I'm on it, trying to fix this wrong behaviour.
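
The workaround amounts to dropping the vertex transformer from the
pipeline. A minimal sketch of doing that on a configuration loaded as a
Python dict, assuming "transformers" is a list of single-key objects
such as {"vertex": {...}}; the sample configuration is hypothetical. As
a later reply in this thread reports, the edge transformer then rejects
its input, so this only helps when no edges are created.

```python
import copy


def without_vertex_transformer(config: dict) -> dict:
    """Return a copy of an ETL configuration without any 'vertex' transformer.

    Assumption: 'transformers' is a list of single-key objects such as
    {"vertex": {...}}. The input configuration is left unchanged.
    """
    result = copy.deepcopy(config)
    result["transformers"] = [
        step for step in result.get("transformers", []) if "vertex" not in step
    ]
    return result


if __name__ == "__main__":
    # Hypothetical configuration fragment, for illustration only.
    config = {
        "transformers": [{"vertex": {"class": "invoices"}}],
        "loader": {"orientdb": {"cluster": "invoices_01"}},
    }
    print(without_vertex_transformer(config)["transformers"])
```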

Fabio Rinnone

Apr 19, 2016, 5:21:52 PM
to orient-...@googlegroups.com
On 19/04/2016 09:07, Roberto Franchini wrote:
> You're right, there's another bug.
> I guess your configuration contains the vertex transformer: if you
> don't use it, documents will be saved in the configured cluster.
> I'm on it, trying to fix this wrong behaviour.

Yes, if I remove the vertex transformers from my configuration it works
properly.

Fabio Rinnone

Apr 20, 2016, 6:28:34 AM
to orient-...@googlegroups.com
On 19/04/2016 09:07, Roberto Franchini wrote:
> You're right, there's another bug.
> I guess your configuration contains the vertex transformer: if you
> don't use it, documents will be saved in the configured cluster.

It works if I remove the vertex transformer, but for some classes I need
edge transformers to implement relationships between classes. For
instance, if I remove the vertex transformer but keep the edge
transformer, I get the following error for every edge during the import:

Error in Pipeline execution:
com.orientechnologies.orient.etl.transformer.OTransformException: edge:
input type
'com.orientechnologies.orient.core.record.impl.ODocument$1$1@53667cbe'
is not supported