OpenRefine fix to overcome RAM limitation


Luca Garulli

unread,
Dec 2, 2013, 8:43:15 AM
to openr...@googlegroups.com
Hi guys,
good job with OpenRefine! The only problem I see is the usage of heap memory, which limits the real use cases with huge files. So I've put OrientDB embedded instead of using RAM, and everything works smoothly and you can finally work against millions of rows.

In practice, instead of using ArrayList<Row> I've implemented an OrientList<Row>. OrientList saves everything on disk, so the initial import and online operations are a bit slower. We could still optimize it by providing a Row + Cell implementation that relies on OrientDB documents, avoiding un-marshalling every time a row/cell is accessed.
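
The rough shape is something like this (a simplified sketch, not the exact code in my branch; the Marshaller interface and the class name are invented for illustration):

import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.id.ORID;
import com.orientechnologies.orient.core.record.impl.ODocument;

// Simplified sketch of a List that keeps its elements on disk as OrientDB
// documents instead of holding them all in the heap. Only the record ids
// (a few bytes each) stay in RAM.
public class DiskBackedList<T> extends AbstractList<T> {

  // Stands in for the Row/Cell <-> ODocument marshalling (invented interface).
  public interface Marshaller<T> {
    ODocument serialize(T value);
    T deserialize(ODocument doc);
  }

  private final ODatabaseDocumentTx db;
  private final Marshaller<T> marshaller;
  private final List<ORID> rids = new ArrayList<ORID>();

  public DiskBackedList(ODatabaseDocumentTx db, Marshaller<T> marshaller) {
    this.db = db;
    this.marshaller = marshaller;
  }

  @Override
  public boolean add(T value) {
    ODocument doc = marshaller.serialize(value);
    db.save(doc);                  // the element goes to disk, not to the heap
    rids.add(doc.getIdentity());   // keep only the lightweight record id
    return true;
  }

  @Override
  public T get(int index) {
    ODocument doc = db.load(rids.get(index));   // un-marshalled on every access
    return marshaller.deserialize(doc);
  }

  @Override
  public int size() {
    return rids.size();
  }
}

The real OrientList also has to cover removals and in-place updates, but this is where the heap saving comes from: only the record ids stay in memory.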

However, this is only a POC I've pushed to my GitHub:


WDYT? Any thoughts? Is anybody interested in using it?

Lvc@

Thad Guidry

unread,
Dec 2, 2013, 9:53:52 AM
to openr...@googlegroups.com
Luca,

You picked perhaps the most interesting and exciting DB out there right now, one that I myself have been eyeballing but have yet to install. When I get some time, I will try to play a bit with the implementation and work through the rough edges, if there are any.

Ideally, you would wire up OpenRefine's RecordRow to the OrientDB Document-type record, I think. Another idea is to somehow support recursion itself (like a pseudo-cross() function, which we do not have in OpenRefine) that implements OrientDB's Traverse... https://github.com/orientechnologies/orientdb/wiki/Java-Traverse
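
To give a flavour of what Traverse offers from the Java side, here is a rough sketch using the SQL TRAVERSE form instead of the fluent API on that wiki page (the storage path, record id and "relatedTo" field are made up):

import java.util.List;

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;
import com.orientechnologies.orient.core.sql.query.OSQLSynchQuery;

public class TraverseSketch {
  public static void main(String[] args) {
    // Embedded database; "plocal:" keeps everything on local disk.
    ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/tmp/refine-demo").open("admin", "admin");
    try {
      // Walk the (made-up) "relatedTo" links up to 3 hops from one record --
      // the kind of recursion a pseudo-cross() in Refine could expose.
      List<ODocument> reached = db.query(new OSQLSynchQuery<ODocument>(
          "SELECT FROM (TRAVERSE relatedTo FROM #9:0 WHILE $depth <= 3)"));
      for (ODocument doc : reached) {
        System.out.println(doc.toJSON());
      }
    } finally {
      db.close();
    }
  }
}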

For others interested,

OrientDB brings multi-master replication and a graph model with graph operators, layered on a 100% ACID-compliant SQL DB with an HTTP REST and JSON API. (Pretty cutting edge, without learning a new query language... still plain old SQL.)




--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.




Luca Garulli

unread,
Dec 2, 2013, 7:00:11 PM
to openr...@googlegroups.com
Hi Thad,
with OrientDB under the hood we could:
- replace the current file format (a compressed text file with one JSON object per line). In fact, every time you open a project it re-parses the entire file into RAM; with OrientDB this step would be skipped
- filter rows with full SQL instead of just like and regexp (see the sketch at the end of this message)
- have relationships/links/joins against other entities like Refine projects. This means you could import 2 CSVs and set a relationship between records across projects

The current implementation I pushed has OrientDB embedded, so no server is needed.
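
To make point 2 concrete, filtering could become a plain query against the embedded database. A sketch only; the "Row" class and "cells" field are invented, not the real schema of the POC:

import java.util.List;

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;
import com.orientechnologies.orient.core.sql.query.OSQLSynchQuery;

public class SqlFilterSketch {
  public static void main(String[] args) {
    // Embedded database, no server needed.
    ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/tmp/refine-project").open("admin", "admin");
    try {
      // Filter rows with real SQL instead of only like/regexp on a facet:
      // "give me the first 1000 rows whose second cell contains 'acme'".
      List<ODocument> rows = db.query(new OSQLSynchQuery<ODocument>(
          "SELECT FROM Row WHERE cells[1].value LIKE '%acme%' LIMIT 1000"));
      System.out.println("Matched rows: " + rows.size());
    } finally {
      db.close();
    }
  }
}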

Lvc@

Tom Morris

unread,
Dec 3, 2013, 10:03:07 AM
to openr...@googlegroups.com
On Mon, Dec 2, 2013 at 8:43 AM, Luca Garulli <l.ga...@gmail.com> wrote:
good job with OpenRefine! The only problem I see is the usage of heap memory, which limits the real use cases with huge files.

Well, OpenRefine has many more problems than that, but that particular item was a conscious design choice by the original designers to prioritize interactivity over massive database sizes.  People seem to like the instant feedback.  Of course, they'd rather not have to make any tradeoffs at all. :-)
 
So I've put OrientDB embedded instead of using RAM, and everything works smoothly and you can finally work against millions of rows.

 I don't suppose you have a little graph of performance vs database size for us to peruse?  That's the interesting part.

On Mon, Dec 2, 2013 at 7:00 PM, Luca Garulli <l.ga...@gmail.com> wrote:

with [a SQL database] under the hood we could:
- filter rows with full SQL instead of just like and regexp
- have relationships/links/joins against other entities like Refine projects. This means you could import 2 CSVs and set a relationship between records across projects

Those might be interesting design directions to explore, but you'd end up with quite a different beast than today's OpenRefine. 

Tom

Thad Guidry

unread,
Dec 3, 2013, 10:37:15 AM
to openr...@googlegroups.com
Luca,

OpenRefine fills a unique niche that is hard to come by in a free-of-cost, open-source way.
I have used many costly ETL tools and databases, but none of them approach the instant feedback and flow that OpenRefine does for "cleaning and refining".

I agree that handling larger and larger data cleaning operations would be a win. But like Tom says, there's a tradeoff. Personally, I typically do mass cleanup within validation routines inside my databases or ETL tool steps, passing the data fields through validation, schema checks, etc. When I can, I grab and fire up OpenRefine to do this instead, since it is much faster and easier.

I agree with Tom that extending OpenRefine and molding it into an ETL tool would not be an efficient choice. A better choice would be allowing OpenRefine to be a front-end for data exploration and cleanup for ETL tools and databases. For instance, Distinct and Unique operations in SQL are important query types for which OpenRefine, through its faceting mechanisms, could provide better views than existing databases and tools do.

Imagine this... just having a SQL dialog window (one that uses a connector and grabs row-based data from a SQL store) would be a win in OpenRefine, where you could then manipulate the data and clean it up in Refine, and then finally push it back to the SQL store with an Update button, keeping track of which rows have changed and the changes needed. That would save someone from having to do import/export operations.
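
In concrete terms, that round trip is just plain JDBC on both ends. A sketch only; the connection URL, table, columns and the change tracking are all made up:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the "SQL dialog" round trip: pull rows in, clean them up in the
// grid, push back only what actually changed. Names are made up.
public class SqlRoundTripSketch {
  public static void main(String[] args) throws Exception {
    Map<Long, String> changed = new LinkedHashMap<Long, String>(); // id -> cleaned value

    try (Connection conn = DriverManager.getConnection(
        "jdbc:postgresql://localhost/demo", "refine", "secret")) {

      // 1. The dialog runs a user-supplied query and loads the rows.
      try (Statement st = conn.createStatement();
           ResultSet rs = st.executeQuery("SELECT id, company_name FROM suppliers")) {
        while (rs.next()) {
          long id = rs.getLong("id");
          String raw = rs.getString("company_name");
          String cleaned = raw.trim();          // stand-in for the user's edits in Refine
          if (!cleaned.equals(raw)) {
            changed.put(id, cleaned);           // remember only the rows that changed
          }
        }
      }

      // 2. The "Update" button pushes back just those rows.
      try (PreparedStatement up = conn.prepareStatement(
          "UPDATE suppliers SET company_name = ? WHERE id = ?")) {
        for (Map.Entry<Long, String> e : changed.entrySet()) {
          up.setString(1, e.getValue());
          up.setLong(2, e.getKey());
          up.addBatch();
        }
        up.executeBatch();
      }
    }
  }
}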

It sounds like you have a good general idea of "putting a data storage layer under OpenRefine that is SQL-based". But I think what is really needed is "connecting and interacting with data storage layers, particularly SQL storage layers", keeping OpenRefine as a front-end only.

Keep hacking !!

fabio.tacchelli

unread,
Dec 4, 2013, 9:39:34 AM
to openr...@googlegroups.com
Hi!
I'm trying to test your implementation of Refine. I'm creating a new project with the following clipboard input:
name surname
test test
but as soon as I click "Create Project" I get this exception:
Exception in thread "Thread-23" com.orientechnologies.orient.core.exception.ODatabaseException: Database instance is not set in current thread. Assure to set it with: ODatabaseRecordThreadLocal.INSTANCE.set(db);
        at com.orientechnologies.orient.core.db.ODatabaseRecordThreadLocal.get(ODatabaseRecordThreadLocal.java:31)
        at com.orientechnologies.orient.core.record.ORecordAbstract.getDatabase(ORecordAbstract.java:271)
        at com.orientechnologies.orient.core.record.ORecordSchemaAwareAbstract.setClassName(ORecordSchemaAwareAbstract.java:102)
        at com.orientechnologies.orient.core.record.impl.ODocument.<init>(ODocument.java:169)
        at com.google.refine.model.Project$1.serialize(Project.java:111)
        at com.google.refine.model.Project$1.serialize(Project.java:104)
        at com.google.refine.model.OrientList.add(OrientList.java:107)
        at com.google.refine.importers.TabularImportingParserBase.readTable(TabularImportingParserBase.java:202)
        at com.google.refine.importers.SeparatorBasedImporter.parseOneFile(SeparatorBasedImporter.java:123)
        at com.google.refine.importers.ImportingParserBase.parseOneFile(ImportingParserBase.java:122)
        at com.google.refine.importers.ImportingParserBase.parse(ImportingParserBase.java:83)
        at com.google.refine.importing.ImportingUtilities.createProjectSynchronously(ImportingUtilities.java:1017)
        at com.google.refine.importing.ImportingUtilities.access$200(ImportingUtilities.java:92)

Any clues?

Luca Garulli

unread,
Dec 9, 2013, 7:52:25 PM
to openr...@googlegroups.com


On Tuesday, 3 December 2013 16:03:07 UTC+1, Tom Morris wrote:
On Mon, Dec 2, 2013 at 8:43 AM, Luca Garulli <l.ga...@gmail.com> wrote:
good job with OpenRefine! The only problem I see is the usage of heap memory, which limits the real use cases with huge files.

Well, OpenRefine has many more problems than that, but that particular item was a conscious design choice by the original designers to prioritize interactivity over massive database sizes.  People seem to like the instant feedback.  Of course, they'd rather not have to make any tradeoffs at all. :-)

Putting OrientDB under the hood slowed operations down by about 2x. I guess with some extra effort I could improve the OrientDB implementation. Is 2x slower than now acceptable for most users? Or perhaps, when the user opens a file, they could select with a simple checkbox whether to manage it with the heap or with OrientDB.
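
On the code side the checkbox could boil down to a tiny factory (a sketch; the factory is invented and I'm assuming a no-argument OrientList constructor, which the real class in my branch may not have):

import java.util.ArrayList;
import java.util.List;

import com.google.refine.model.OrientList;
import com.google.refine.model.Row;

// Sketch of the per-project storage choice behind the checkbox.
public class RowStorageSketch {
  public static List<Row> createRowList(boolean useOrientDb) {
    if (useOrientDb) {
      // disk-backed, roughly 2x slower in my tests, but no heap limit
      return new OrientList<Row>();   // assumption: no-arg constructor
    }
    // current behaviour: everything in the heap, fastest but bounded by -Xmx
    return new ArrayList<Row>();
  }
}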
 

So I've put OrientDB embedded instead of using RAM, and everything works smoothly and you can finally work against millions of rows.

 I don't suppose you have a little graph of performance vs database size for us to peruse?  That's the interesting part.

Do you have a few CSV files to load on both, to get an idea of the performance on the same hardware?
 
On Mon, Dec 2, 2013 at 7:00 PM, Luca Garulli <l.ga...@gmail.com> wrote:

with [a SQL database] under the hood we could:
- filter rows with full SQL instead of just like and regexp
- have relationships/links/joins against other entities like Refine projects. This means you could import 2 CSVs and set a relationship between records across projects

Those might be interesting design directions to explore, but you'd end up with quite a different beast than today's OpenRefine. 

To be honest, I'd also like to implement the JDBC source. What makes Refine great is the approach, and the user interface is totally amazing. I'd like to have such a tool in the OrientDB suite, maybe also in batch mode with an ETL-like engine.

Lvc@

 

Tom

Luca Garulli

unread,
Dec 9, 2013, 7:56:34 PM
to openr...@googlegroups.com


On Tuesday, 3 December 2013 16:37:15 UTC+1, Thad Guidry wrote:
Luca,

OpenRefine fills a unique niche that is hard to come by in a free-of-cost, open-source way.
I have used many costly ETL tools and databases, but none of them approach the instant feedback and flow that OpenRefine does for "cleaning and refining".

Totally agree.
 
I agree that handling larger and larger data cleaning operations would be a win. But like Tom says, there's a tradeoff. Personally, I typically do mass cleanup within validation routines inside my databases or ETL tool steps, passing the data fields through validation, schema checks, etc. When I can, I grab and fire up OpenRefine to do this instead, since it is much faster and easier.

That's the point. Imagine working with Refine with RAM or OrientDB under the hood (this could be configurable), then saving the steps somewhere in JSON format, and then letting Refine run them in batch.
 
I agree with Tom that extending OpenRefine and molding it into an ETL tool would not be an efficient choice. A better choice would be allowing OpenRefine to be a front-end for data exploration and cleanup for ETL tools and databases. For instance, Distinct and Unique operations in SQL are important query types for which OpenRefine, through its faceting mechanisms, could provide better views than existing databases and tools do.

Imagine this... just having a SQL dialog window (one that uses a connector and grabs row-based data from a SQL store) would be a win in OpenRefine, where you could then manipulate the data and clean it up in Refine, and then finally push it back to the SQL store with an Update button, keeping track of which rows have changed and the changes needed. That would save someone from having to do import/export operations.

This is very close to my idea, but I think the ETL solution is also possible and easy to implement. Let the user play with the data, then save the "change stack" in JSON and apply it when triggered by a cron event or something similar.
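
For the batch part, a cron job could re-apply the saved change stack to a project through a running Refine instance. A sketch, assuming the apply-operations command endpoint behaves the way I think it does; the port, project id and file path are placeholders:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: POST a saved "change stack" (the JSON you get from Undo/Redo > Extract)
// back to a project, so a cron entry can run it unattended.
public class ApplyChangeStack {
  public static void main(String[] args) throws Exception {
    String operationsJson = new String(
        Files.readAllBytes(Paths.get("/path/to/change-stack.json")), StandardCharsets.UTF_8);

    URL url = new URL("http://127.0.0.1:3333/command/core/apply-operations?project=1234567890");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

    String body = "operations=" + URLEncoder.encode(operationsJson, "UTF-8");
    try (OutputStream out = conn.getOutputStream()) {
      out.write(body.getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("Refine answered HTTP " + conn.getResponseCode());
  }
}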
 
It sounds like you have a good general idea of "putting a data storage layer under OpenRefine that is SQL-based". But I think what is really needed is "connecting and interacting with data storage layers, particularly SQL storage layers", keeping OpenRefine as a front-end only.

Keep hacking !!

Probably my changes will go into an "OrientDB Refine" branch, waiting to see if anyone else is interested in merging them.

Lvc@

Raymond Wong

unread,
Jan 12, 2014, 11:41:25 PM
to openr...@googlegroups.com
But I think what is really needed is "connecting and interacting with data storage layers, particularly SQL storage layers", keeping OpenRefine as a front-end only

Kind of like MySQL, where the storage engine can be switched out while keeping the front-end interaction the same. That would push OpenRefine to another level. :-)

Ray