--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
Good job with OpenRefine! The only problem I see is the use of heap memory, which limits the real use cases with huge files.
So I've embedded OrientDB underneath instead of keeping everything in RAM, and it all works smoothly; you can finally work against millions of rows.
With [a SQL database] under the hood we could:
- filter rows by SQL instead of just "like" and regexp
- have relationships/links/joins against other entities like Refine projects. This means you could import 2 CSVs and set a relationship between records across projects
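A rough sketch of what that could look like, using an in-memory SQLite database as a stand-in for the embedded store (the table and column names are invented for illustration, not OpenRefine's actual schema):

```python
import sqlite3

# In-memory SQLite as a stand-in for an embedded SQL store under OpenRefine.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two "projects", each imported from its own CSV (hypothetical schemas).
cur.execute("CREATE TABLE project_people (id INTEGER, name TEXT, city_id INTEGER)")
cur.execute("CREATE TABLE project_cities (id INTEGER, city TEXT)")
cur.executemany("INSERT INTO project_people VALUES (?, ?, ?)",
                [(1, "Ada", 10), (2, "Grace", 20), (3, "Alan", 10)])
cur.executemany("INSERT INTO project_cities VALUES (?, ?)",
                [(10, "London"), (20, "New York")])

# SQL filtering beyond like/regexp, plus a join across the two projects.
rows = cur.execute("""
    SELECT p.name, c.city
    FROM project_people p
    JOIN project_cities c ON c.id = p.city_id
    WHERE p.city_id = 10
    ORDER BY p.name
""").fetchall()
print(rows)  # [('Ada', 'London'), ('Alan', 'London')]
```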
On Mon, Dec 2, 2013 at 8:43 AM, Luca Garulli <l.ga...@gmail.com> wrote:
> Good job with OpenRefine! The only problem I see is the use of heap memory, which limits the real use cases with huge files.

Well, OpenRefine has many more problems than that, but that particular item was a conscious design choice by the original designers to prioritize interactivity over massive database sizes. People seem to like the instant feedback. Of course, they'd rather not have to make any tradeoffs at all. :-)

> So I've embedded OrientDB underneath instead of keeping everything in RAM, and it all works smoothly; you can finally work against millions of rows.

I don't suppose you have a little graph of performance vs. database size for us to peruse? That's the interesting part.

> With [a SQL database] under the hood we could:
> - filter rows by SQL instead of just "like" and regexp
> - have relationships/links/joins against other entities like Refine projects. This means you could import 2 CSVs and set a relationship between records across projects

Those might be interesting design directions to explore, but you'd end up with quite a different beast than today's OpenRefine.
Tom
Luca,

OpenRefine fits a unique niche that is hard to come by in a free-cost, open-source way. I have used many costly ETL tools and databases, but none of them approach the instant feedback and flow that OpenRefine offers for "cleaning and refining".
I agree that handling larger and larger data cleaning operations would be a win. But as Tom says, there's a tradeoff. Personally, I typically do mass cleanup within validation routines inside my databases or ETL tool steps, passing the data fields through validation, schema checks, etc. When I can, I grab and fire up OpenRefine to do this instead, since it is much faster and easier.
I agree with Tom that extending OpenRefine and molding it into an ETL tool would not be an efficient choice. A better choice would be letting OpenRefine act as a front end for data exploration and cleanup on top of the ETL tools and databases. For instance, DISTINCT and unique operations in SQL are important query types for which OpenRefine's faceting mechanisms could provide better views than existing databases and tools do.
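To make the facet/SQL correspondence concrete, here's a sketch of what a text facet amounts to in SQL terms: DISTINCT yields the facet's choices, and GROUP BY with COUNT yields the per-choice counts OpenRefine displays. SQLite in-memory is used here purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE data_rows (country TEXT)")
cur.executemany("INSERT INTO data_rows VALUES (?)",
                [("US",), ("FR",), ("US",), ("DE",), ("US",), ("FR",)])

# DISTINCT: the facet's list of choices.
choices = [r[0] for r in cur.execute(
    "SELECT DISTINCT country FROM data_rows ORDER BY country")]

# GROUP BY + COUNT: the per-choice counts a text facet shows.
counts = cur.execute("""
    SELECT country, COUNT(*) AS n
    FROM data_rows
    GROUP BY country
    ORDER BY n DESC, country
""").fetchall()
print(choices)  # ['DE', 'FR', 'US']
print(counts)   # [('US', 3), ('FR', 2), ('DE', 1)]
```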
Imagine this: just having a SQL dialog window (that uses a connector style and grabs row-based data from a SQL store) would be a win in OpenRefine. You could then manipulate the data and clean it up in Refine, and finally push it back to the SQL store with an Update button, keeping track of which rows have changed and the changes needed, thus saving someone from having to do import/export operations.
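That "Update button" boils down to tracking which rows were edited and pushing back only those. A minimal sketch of the idea, with all names hypothetical and SQLite standing in for the real store (an actual connector would go through JDBC or similar):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO people VALUES (?, ?)",
                [(1, " Ada "), (2, "Grace"), (3, "alan")])

# Local working copy, as if the rows were pulled into Refine's grid.
grid = {rid: name for rid, name in cur.execute("SELECT id, name FROM people")}

# Edits made in the grid; remember which rows changed (the "dirty" set).
dirty = set()
for rid, name in grid.items():
    cleaned = name.strip().title()   # e.g. a trim + titlecase cleanup step
    if cleaned != name:
        grid[rid] = cleaned
        dirty.add(rid)

# "Update" button: push back only the changed rows, not the whole table.
cur.executemany("UPDATE people SET name = ? WHERE id = ?",
                [(grid[rid], rid) for rid in sorted(dirty)])
conn.commit()

print(sorted(dirty))  # [1, 3]
print(cur.execute("SELECT name FROM people ORDER BY id").fetchall())
```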
It sounds like you have a good general idea of "putting a data storage layer under OpenRefine that is SQL based". But I think what is really needed is "connecting and interacting with data storage layers, particularly SQL storage layers", keeping OpenRefine as a front end only.

Keep hacking!!