ETL: how to import a huge csv and generate different vertexes/edges based on the csv column values

760 views
Skip to first unread message

Lars Plessmann

unread,
Jan 21, 2015, 2:06:15 PM1/21/15
to orient-...@googlegroups.com
I have a really huge CSV (about 240GB) file with several columns (lets say there are columns A - H).
The first column A is the primary key of the main record (vertex MainRecord). But the columns D, E, F, G are columns which should be stored in an own vertex (because these fields are redundant over all the records and I dont want to store them in the main record again and again). So the column value of D-G itself should be stored as a property called "title" in a new vertex (but it should not generate duplicates). Afterwards these vertexes needs to be linked.
Is this possible to reach this with an single orient-etl configuration? I think the only way I know is to split the huge csv file's columns and create sepperate files for each vertex. But I dont want to do this if that is not neccessairy (file is so big).
I hope you can give me an advice?

I try to describe it in the config json syntax what I need (of course, this will not work):

{
"source": {
"file": {
"path": "dataexport.csv"
}
},
"extractor": {"row": {}},
"transformers": [
{
"csv": {
"separator": ",",
"nullValue": "NULL",
"skipFrom": -1,
"skipTo": -1
}
},
{
"field": {
"fieldName": "_id",
"expression": "$input._id.substring(9, 33)"
}
},
{
"field": {
"fieldName": "colD",
"class": "ColumnD",
"classProperty": "title"
}
},
{
"field": {
"fieldName": "colE",
"class": "ColumnE",
"classProperty": "title"
}
},

{
"field": {
"fieldName": "colF",
"class": "ColumnF",
"classProperty": "title"
}
{
"field": {
"fieldName": "colG",
"class": "ColumnG",
"classProperty": "title"
}
}
},
{
"vertex": {"class": "MainRecord"}
}
],
"loader": {
"orientdb": {
"dbURL": "remote:127.0.0.1/msales_testing",
"dbUser": "admin",
"dbPassword": "admin",
"dbAutoCreate": true,
"dbType": "graph",
"classes": [
{
"name": "MainRecord",
"extends": "V"
},
{
"name": "ColumnD",
"extends": "V"
},
{
"name": "
ColumnE",
"extends": "V"
},
{
"name": "
ColumnF",
"extends": "V"
},
{
"name": "
ColumnG",
"extends": "V"
}

],
"indexes": [
{
"class": "MainRecord",
"fields": ["_id:string"],
"type": "UNIQUE"
}
]
}
}
}



By the way: _id is in the MongoDB ObjectID format. I just want to store the original hex value, so I used the substring sql method to extract the hex id. Maybe there is a better way.


regards
Lars

Luca Garulli

unread,
Jan 21, 2015, 5:09:40 PM1/21/15
to orient-database
Hi Lars,
I don't know the "class" and "classProperty" fields in "field" transformer. This is the syntax:


Furthermore why do you need multiple classes?

Lvc@


--

---
You received this message because you are subscribed to the Google Groups "OrientDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to orient-databa...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Lars Plessmann

unread,
Jan 22, 2015, 5:11:03 AM1/22/15
to orient-...@googlegroups.com
Hi Luca,

I know these fields do not exist. This should only be an example what I'm looking for for a better understanding (see my note above the config file ;)).
The Wiki is not really informative about complex Transformation :-(

I think its more efficient to store records in its own classes instead of building document records with lots of redundant fields (esp. during reporting/OLAP)? So that is the reason I want to use multiple classes (reduce space and enhance query speed).

So let me make an concrete example of an CSV file I want to process:

id, voucherid, description,               creationdate,          brand,     model,       customerid
1,  4711,      "Special Offer III",       "2012-05-02T09:12:17", "Samsung", "Galaxy S3", 50112
2,  4712,      "Special Offer II",        "2012-05-03T09:1417",  "HTC",     "One",       50113
3,  4713,      "Special Offer III",       "2012-05-04T10:17:17", "HTC",     "One",       50002
4,  4714,      "Special Offer for Mr XY", "2012-05-04T11:09:17", "Apple",   "iPhone 5",  50017
[...]


So I want to create the class "Order" (extends V), "Brand" (extends V), "Model" (extends V), "Customer" (extends V).
The Brand and Model and the Customer should be linked with the Order and I do not want duplicates in the class Order, Brand, Model, Customer.

class Order gets the properties voucherid, description, creationdate.
class Brand gets the property title, where the value of csv field "brand" is stored.
class Model gets the property title, where the value of csv field "model" is stored.
class Customer gets the property customerid (firstname, lastname etc.), where the csv field "customerid" is stored (and firstname, lastname etc.)

I want to run a ETL job just one time (because there are millions of records) because its a huge file instead of running the job for each Class again and again.
I hope you know what I want?


regards
Lars

kurtuluş yılmaz

unread,
Mar 30, 2015, 4:02:51 AM3/30/15
to orient-...@googlegroups.com
Hi Lars;
Can you find a solution to migrate data from this type of table. I also need to migrate data from this type table. If you find a solution can you describe it.
Best regrads.

22 Ocak 2015 Perşembe 12:11:03 UTC+2 tarihinde Lars Plessmann yazdı:

sck2015

unread,
Mar 31, 2015, 5:10:38 PM3/31/15
to orient-...@googlegroups.com
Hi Lars and Luca, was this ever resolved? This is related to my question re: creating multiple vertices from within a single etl config json file. Similarly, another post asked how to create edges between two vertices that already exist via the etl (is this possible). My perception is that different vertex classes have to be created in separate source files in the etl.  I'd like to understand if this is the case, how to then create edges between *two pre-existing vertices* in the etl if possible. Thanks, Sonu

kurtuluş yılmaz

unread,
Apr 3, 2015, 4:17:40 PM4/3/15
to orient-...@googlegroups.com
Hi I find a solution for this. Actually  you can create edge between existing nodes  you can use command property of transformers at json etl file. I  send you my solution. It can help you .
 Key point at config json  
{
            "command": {
                "command": "create edge cookieUsedBy from (select  from Cookie where compId = ${input.compId}) to (select from Member where member_id = ${input.member_id})",
                "output": "edge"
            }
Key point at config json 

Best regards.


{
    "config": {
        "log": "debug"
    },
    "extractor": {
        "jdbc": {
            "driver": "com.mysql.jdbc.Driver",
            "url": "",
            "userName": "log",
            "userPassword": "",
            "query": "select compId,member_id from login_log where member_id = 12034788 ",
            "fetchSize": 100
        }
    },
    "transformers": [
        {
            "command": {
                "command": "create edge cookieUsedBy from (select  from Cookie where compId = ${input.compId}) to (select from Member where member_id = ${input.member_id})",
                "output": "edge"
            }
        }
    ],
    "loader": {
        "orientdb": {
            "dbURL": "remote: localhost/",
            "dbUser": "root",
            "dbPassword": "",
            "dbAutoCreate": true,
            "tx": true,
            "batchCommit": 1000,
            "wal": false,
            "dbType": "graph",
            "classes": [
                {
                    "name": "Cookie",
                    "extends": "V"
                }
            ],
            "indexes": [
                {
                    "class": "Cookie",
                    "fields": [
                        "compId: String"
                    ],
                    "type": "UNIQUE_HASH_INDEX"
                }
            ]
        }
    }
}

1 Nisan 2015 Çarşamba 00:10:38 UTC+3 tarihinde sck2015 yazdı:

Kunal Goyal

unread,
Mar 8, 2018, 2:32:36 AM3/8/18
to OrientDB
HI Lars,

did you get the solution for this problme ??

Maxim Nikolaev

unread,
Mar 8, 2018, 2:44:26 AM3/8/18
to orient-...@googlegroups.com
in my suggestion , you must use java api, it work 100%

чт, 8 марта 2018 г. в 10:32, Kunal Goyal <12ku...@gmail.com>:
--
Reply all
Reply to author
Forward
0 new messages