Suppress exceptions in ETL


Gregor Frey

Feb 13, 2015, 10:40:58 AM
to orient-...@googlegroups.com
Is it possible to suppress exceptions in ETL, skip the current document and proceed with the next one?
Background: I have a highly denormalized Excel sheet in which each row contains a value that must be unique in the database. I would like to skip any rows with duplicate keys and just continue with the next row. Is this possible?
Ciao
Gregor

Luca Garulli

Feb 15, 2015, 1:21:32 PM
to orient-database
Hi Gregor,
Could you please post the stack trace here?

Lvc@


--

---
You received this message because you are subscribed to the Google Groups "OrientDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to orient-databa...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Gregor Frey

Feb 16, 2015, 4:53:16 AM
to orient-...@googlegroups.com
Salve!
The exception complains about a duplicate record, and I think that's quite okay. The only thing is that I do not want the whole ETL process to stop when this exception occurs; I would rather skip the record and continue with the next one.
To better understand my requirement: I have an Excel file that contains the employees of a company. It also contains the cost center of each employee, in the same row as name, address and so on. So the file is far from being normalized (this is quite often the case with Excel files). My problem is that I want to create not only the employees (as vertices) out of this file, but the cost centers as well. But as the cost centers occur more than once, I get duplicate key errors.

Here is the exception:

[2:vertex] DEBUG Transformer input: {cc:Geschäftsleitung,cc_no:1,responsible:0,next_cc:1}
Error in Pipeline execution: com.orientechnologies.orient.core.storage.ORecordDuplicatedException: Cannot index record CostCenter{cc:Geschäftsleitung,cc_no:1,responsible:0,next_cc:1}: found duplicated key '1' in index 'CostCenter.cc_no' previously assigned to the record #11:0 RID=#11:0
com.orientechnologies.orient.core.storage.ORecordDuplicatedException: Cannot index record CostCenter{cc:Geschäftsleitung,cc_no:1,responsible:0,next_cc:1}: found duplicated key '1' in index 'CostCenter.cc_no' previously assigned to the record #11:0 RID=#11:0
    at com.orientechnologies.orient.core.index.OIndexTxAwareOneValue.checkEntry(OIndexTxAwareOneValue.java:221)


And, if anyone is interested, this is my config file (perhaps someone has proposals to improve it):

{
    "config": {
        "log": "debug"
    },
    "source": {
        "file": {
            "path": "/Users/d022051/NetBeansProjects/ETL/Personalstamm - Kostenstellenhierarchie.csv"
        }
    },
    "extractor": {
        "row": {}
    },
    "transformers": [
        {
            "csv": {
                "skipFrom": 0,
                "skipTo": 0,
                "nullValue": "",
                "separator": ";",
                "columns": ["id:integer", "name", "firstname", "title", "birthday:string", "function", "country", "cc", "cc_no:integer", "responsible:integer", "next_cc:integer"]
            }
        },
        { "field": { "fieldName": "id", "operation": "remove" } },
        { "field": { "fieldName": "name", "operation": "remove" } },
        { "field": { "fieldName": "firstname", "operation": "remove" } },
        { "field": { "fieldName": "title", "operation": "remove" } },
        { "field": { "fieldName": "birthday", "operation": "remove" } },
        { "field": { "fieldName": "function", "operation": "remove" } },
        { "field": { "fieldName": "country", "operation": "remove" } },
        {
            "vertex": {
                "class": "CostCenter"
            }
        }
    ],
    "loader": {
        "orientdb": {
            "dbURL": "plocal:/Users/d022051/NetBeansProjects/ETL/db",
            "dbType": "graph",
            "dbAutoDropIfExists": false,
            "dbAutoCreate": true,
            "standardElementConstraints": false,
            "classes": [
                {
                    "name": "CostCenter",
                    "extends": "V"
                }
            ],
            "indexes": [
                {
                    "class": "CostCenter",
                    "fields": ["cc_no:integer"],
                    "type": "UNIQUE"
                }
            ]
        }
    }
}
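One possible workaround that does not depend on the ETL module at all is to drop the repeated cost-center rows from the CSV before the import, so the UNIQUE index on cc_no never sees a second row with the same key. The following is only a minimal sketch of that idea: the class CostCenterDedup, its method, and the sample rows are hypothetical, the rows are assumed to be already reduced to the four columns cc;cc_no;responsible;next_cc, and the splitting is naive (no quoted fields).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical pre-processing helper; not part of OrientDB or its ETL module.
class CostCenterDedup {

    // Keeps the first row seen for each key value and drops later rows with the same key.
    static List<String> dedupeByColumn(List<String> rows, int keyColumn, String separator) {
        Set<String> seenKeys = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String row : rows) {
            // Naive split: assumes the separator never appears inside a quoted field.
            String[] fields = row.split(Pattern.quote(separator), -1);
            if (keyColumn >= fields.length) {
                continue; // malformed row without a key column: skip it
            }
            // Set.add returns false if the key was already present.
            if (seenKeys.add(fields[keyColumn].trim())) {
                unique.add(row);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        // Made-up sample rows in the assumed layout cc;cc_no;responsible;next_cc.
        List<String> rows = Arrays.asList(
                "Management;1;0;1",
                "Sales;2;0;1",
                "Management;1;0;1");
        for (String row : dedupeByColumn(rows, 1, ";")) {
            System.out.println(row);
        }
    }
}
```

The de-duplicated rows could then be written to a second CSV file and used as the ETL source for the CostCenter import, while the original file remains the source for the employees.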




Luca Garulli

Feb 16, 2015, 5:00:01 AM
to orient-database
Hi Gregor,
Could you please post the full stack trace here? I need to see which ETL module reports this problem. I see this:

at com.orientechnologies.orient.core.index.OIndexTxAwareOneValue.checkEntry(OIndexTxAwareOneValue.java:221)

But not the rest.

Lvc@


Gregor Frey

Feb 16, 2015, 6:52:23 AM
to orient-...@googlegroups.com
Thanks for looking into this issue!

Here is the full stack trace:

Error in Pipeline execution: com.orientechnologies.orient.core.storage.ORecordDuplicatedException: Cannot index record CostCenter{cc:Geschäftsleitung,cc_no:1,responsible:0,next_cc:1}: found duplicated key '1' in index 'CostCenter.cc_no' previously assigned to the record #11:0 RID=#11:0
com.orientechnologies.orient.core.storage.ORecordDuplicatedException: Cannot index record CostCenter{cc:Geschäftsleitung,cc_no:1,responsible:0,next_cc:1}: found duplicated key '1' in index 'CostCenter.cc_no' previously assigned to the record #11:0 RID=#11:0
    at com.orientechnologies.orient.core.index.OIndexTxAwareOneValue.checkEntry(OIndexTxAwareOneValue.java:221)
    at com.orientechnologies.orient.core.index.OClassIndexManager.checkIndexedPropertiesOnCreation(OClassIndexManager.java:322)
    at com.orientechnologies.orient.core.index.OClassIndexManager.checkIndexes(OClassIndexManager.java:581)
    at com.orientechnologies.orient.core.index.OClassIndexManager.onRecordBeforeCreate(OClassIndexManager.java:423)
    at com.orientechnologies.orient.core.hook.ODocumentHookAbstract.onTrigger(ODocumentHookAbstract.java:218)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.callbackHooks(ODatabaseDocumentTx.java:966)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.executeSaveRecord(ODatabaseDocumentTx.java:1686)
    at com.orientechnologies.orient.core.tx.OTransactionNoTx.saveRecord(OTransactionNoTx.java:94)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.save(ODatabaseDocumentTx.java:2274)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.save(ODatabaseDocumentTx.java:117)
    at com.orientechnologies.orient.core.record.impl.ODocument.save(ODocument.java:1704)
    at com.orientechnologies.orient.core.record.impl.ODocument.save(ODocument.java:1695)
    at com.tinkerpop.blueprints.impls.orient.OrientElement.save(OrientElement.java:304)
    at com.tinkerpop.blueprints.impls.orient.OrientElement.save(OrientElement.java:284)
    at com.tinkerpop.blueprints.impls.orient.OrientElement.setProperty(OrientElement.java:186)
    at com.orientechnologies.orient.etl.transformer.OVertexTransformer.executeTransform(OVertexTransformer.java:77)
    at com.orientechnologies.orient.etl.transformer.OAbstractTransformer.transform(OAbstractTransformer.java:37)
    at com.orientechnologies.orient.etl.OETLPipeline.execute(OETLPipeline.java:108)
    at com.orientechnologies.orient.etl.OETLProcessor.executeSequentially(OETLProcessor.java:483)
    at com.orientechnologies.orient.etl.OETLProcessor.execute(OETLProcessor.java:291)
    at com.orientechnologies.orient.etl.OETLProcessor.main(OETLProcessor.java:163)
ETL process halted: com.orientechnologies.orient.etl.OETLProcessHaltedException: com.orientechnologies.orient.core.storage.ORecordDuplicatedException: Cannot index record CostCenter{cc:Geschäftsleitung,cc_no:1,responsible:0,next_cc:1}: found duplicated key '1' in index 'CostCenter.cc_no' previously assigned to the record #11:0 RID=#11:0

Gregor Frey

Feb 16, 2015, 11:30:11 AM
Hi,
just as a test, I patched the class OVertexTransformer to ignore ORecordDuplicatedExceptions. But then I found that when the OAbstractTransformer was about to log that the document was missing, I got the same exception again. The reason is that the toString() method of the OVertex implicitly calls a save, which then throws the exception. See the stack trace below. But, perhaps even more importantly, I was wondering whether you wouldn't expect the toString() method to have no side effects. This could also become a performance issue in the end. I don't know the details, of course. But from a high-level point of view, I think it would be better not to attempt any data modifications within a toString() method.
Ciao
Gregor


 
save_exception.txt
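To make the point about skipping and logging concrete, here is a minimal sketch of the behaviour I would expect: a failed save is caught, the record is skipped, and the log message is built from plain field values only, so logging cannot trigger a second save. All names here (UniqueIndex, Vertex, describe, SkipDuplicatesSketch) are hypothetical stand-ins written for this example; none of them are OrientDB or ETL classes, and IllegalStateException merely plays the role of ORecordDuplicatedException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a UNIQUE index; not an OrientDB class.
class UniqueIndex {
    private final Set<Integer> keys = new HashSet<>();

    void put(int key) {
        if (!keys.add(key)) {
            throw new IllegalStateException("found duplicated key '" + key + "'");
        }
    }
}

// Hypothetical vertex whose description never touches the index.
class Vertex {
    final int ccNo;

    Vertex(int ccNo) {
        this.ccNo = ccNo;
    }

    void save(UniqueIndex index) {
        index.put(ccNo); // may throw on a duplicate key
    }

    // Side-effect-free: reads fields only. If this method called save(),
    // logging inside the catch block below would throw the same exception again.
    String describe() {
        return "Vertex{cc_no:" + ccNo + "}";
    }
}

class SkipDuplicatesSketch {

    // Saves each key as a vertex; duplicates are logged and skipped instead of halting.
    static List<Integer> loadSkippingDuplicates(List<Integer> keys, UniqueIndex index, List<String> log) {
        List<Integer> loaded = new ArrayList<>();
        for (int key : keys) {
            Vertex v = new Vertex(key);
            try {
                v.save(index);
                loaded.add(key);
            } catch (IllegalStateException e) {
                log.add("skipped " + v.describe() + ": " + e.getMessage());
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        List<Integer> loaded = loadSkippingDuplicates(Arrays.asList(1, 2, 1, 3), new UniqueIndex(), log);
        System.out.println("loaded: " + loaded);
        System.out.println("log: " + log);
    }
}
```

The design choice is simply to keep the description method read-only, so that error handling and debug logging can never fail for the same reason as the operation they report on.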

Rajendra Prasad Gujja

Aug 21, 2017, 7:26:02 PM
to OrientDB
Hi Gregor,

I still see this problem. Is there a fix for it?

Thanks,
RP.

Rajendra Prasad Gujja

Aug 21, 2017, 8:00:46 PM
to OrientDB
I think I just need to set the log level to "INFO" to get past this particular exception, but for some reason, in my case, it created some weird empty vertices with no class and no name.
:( :( 

Gregor Frey

Aug 22, 2017, 2:01:54 PM
to OrientDB

I have to say that I haven't looked at the OrientDB forum for quite a while, so I cannot say what the current state is.

