Here's a bit more info on how the Apache Arrow folks say we might leverage Arrow...
Thad Guidry [8:44 PM] Anyone know about us at OpenRefine? We're curious if/how Arrow might be useful for us to improve our ancient in-memory data model and processing so desktop and laptop users can work with more local data in OpenRefine.
bhulette [10:28 AM] @Thad Guidry I hadn't heard of OpenRefine before, but it definitely looks like something that could benefit from the Arrow format. The biggest selling point would be the ability to easily interoperate with other tools that use Arrow (e.g. Spark, pandas, etc...) without any serialization costs.
[10:29 AM] I don't know what your current data model looks like, but there could be performance benefits from the columnar layout as well
Thad Guidry [10:53 AM] @bhulette That's described here
https://github.com/OpenRefine/OpenRefine/wiki/Server-Side-Architecture
bhulette [11:05 AM] yeah that looks pretty amenable to the arrow format - a loose analogy could be that "column models" are specified by the arrow schema, and the "raw data" is stored in record batches/dictionary batches
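(Editor's aside: a rough, library-free sketch of that analogy, to make the mapping concrete. It does not use the Arrow API. The `Field` class, the `to_batches` helper, the column names and the toy rows are all made up for illustration. The list of fields stands in for the schema / "column models", and each batch holds one list per column, which is roughly the shape a record batch has.)

```python
from dataclasses import dataclass


@dataclass
class Field:
    """Hypothetical stand-in for a column model / schema field."""
    name: str
    type: str


# Plays the role of the schema: column names and types, no data.
schema = [Field("name", "string"), Field("count", "int64")]

# Toy row-oriented input (made-up values); None marks a blank cell.
rows = [
    {"name": "a", "count": 1},
    {"name": "b", "count": None},
    {"name": "c", "count": 3},
    {"name": "d", "count": 4},
]


def to_batches(rows, schema, batch_size):
    """Split rows into batches; each batch stores one list per column."""
    batches = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        batches.append({f.name: [r.get(f.name) for r in chunk] for f in schema})
    return batches


batches = to_batches(rows, schema, batch_size=3)
for batch in batches:
    print(batch)
```

A real Arrow implementation would store each column in typed, contiguous buffers rather than Python lists; the sketch only shows the schema/batch split.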
Thad Guidry [11:06 AM] @bhulette gotcha
bhulette [11:07 AM] the column groups idea for storing a tree is pretty interesting
[11:08 AM] you would be able to specify blank cells in arrow using validity buffers
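(Editor's aside: a minimal sketch of the validity-buffer idea, again in plain Python rather than the Arrow API. `pack_validity` and `is_valid` are hypothetical helper names. In the Arrow columnar format a validity bitmap has one bit per slot, least-significant bit first, with 1 meaning "has a value" and 0 meaning null. The sketch skips the buffer padding a real implementation would add.)

```python
def pack_validity(values):
    """Return (bitmap, null_count); bit i is 1 when values[i] is not None."""
    bitmap = bytearray((len(values) + 7) // 8)
    null_count = 0
    for i, value in enumerate(values):
        if value is None:
            null_count += 1
        else:
            # Least-significant-bit numbering within each byte.
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap), null_count


def is_valid(bitmap, i):
    """True when slot i holds a value, False when it is blank."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


# Toy column with two blank cells.
cells = ["a", None, "c", None, "e"]
bitmap, null_count = pack_validity(cells)
print(bin(bitmap[0]), null_count)
```

The point for blank cells is that the data buffer can stay densely packed while blankness lives in a separate bitmap.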
Thad Guidry [11:11 AM] @bhulette keep the ideas coming! (fyi, we had also thought of Apache Ignite)
bhulette [11:12 AM] I'm not sure if arrow could help with storing changes
[11:13 AM] but that could be a welcome addition to the project, if people like @wesmckinn think it's in scope :slightly_smiling_face:
[11:15 AM] is the OpenRefine server distributed?
Thad Guidry [11:16 AM] @bhulette no
[11:18 AM] @bhulette OpenRefine is used locally (desktop/laptop) to clean data. We eventually want to separate the backend and frontend, so that we can do large transformations via streaming/batching against Apache Beam, etc. But our users, once their data gets that big, typically move to other tools.