Apache Arrow might save some...

Thad Guidry

Jan 25, 2018, 5:57:17 PM

to OpenRefine Development, Holden Karau

https://fosdem.org/2018/schedule/event/big_data_outside_jvm/

Wish I could attend this. Hopefully it gets recorded?

-Thad

Antonin Delpeuch (lists)

Jan 25, 2018, 7:15:54 PM

to openref...@googlegroups.com

It looks very interesting indeed. FOSDEM usually broadcasts a lot of the
talks they have (if not all?).

Antonin

> --
> You received this message because you are subscribed to the Google
> Groups "OpenRefine Development" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to openrefine-de...@googlegroups.com
> <mailto:openrefine-de...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

Thad Guidry

Jan 25, 2018, 7:26:51 PM

to openref...@googlegroups.com

I've added a Project Card to our GitHub Projects to look into Apache Arrow further:

https://github.com/OpenRefine/OpenRefine/projects/1#card-6959399

Martin Magdinier

Jan 25, 2018, 8:35:43 PM

to openref...@googlegroups.com

Antonin, or any developer in Europe: do you want to go and bring back some insights to this mailing list?

This is definitely the type of event where we want to be to learn and present the project to other developers.

There are other interesting presentations, like The Open Decision Framework for open source governance, and plenty of networking opportunities with people who lead open source communities (BSD, OpenJDK). You can even pitch OpenRefine to a room full of UX people and designers!

--

Martin Magdinier

Antonin Delpeuch (lists)

Jan 26, 2018, 11:00:54 AM

to openref...@googlegroups.com

I'd love to (it's always a fantastic event), but my feeling is that it
would have been better to apply in advance to give a talk there (it's
totally doable). If I just show up, I doubt it will give the project
much visibility.

(Plus I am quite busy and it's coming up very soon.)

Antonin

On 26/01/2018 01:35, Martin Magdinier wrote:
> Antonin, or any developer in Europe: do you want to go and bring back
> some insights to this mailing list?
>
> This is definitely the type of event where we want to be to learn and
> present the project to other developers.
>
> There are other interesting presentations, like The Open Decision Framework
> <https://fosdem.org/2018/schedule/event/osd_the_open_decision_framework/>
> for open source governance, and plenty of networking opportunities with
> people who lead open source communities (BSD
> <https://fosdem.org/2018/schedule/event/the_freebsd_fundation_how_we_can_change_the_world/>,
> OpenJDK <https://fosdem.org/2018/schedule/event/gb_qa/>). You can even
> pitch OpenRefine to a room full of UX people and designers
> <https://fosdem.org/2018/schedule/event/osd_pitch_your_project/>!
>
> --
> Martin Magdinier

Thad Guidry

Jan 26, 2018, 2:53:38 PM

to openref...@googlegroups.com

Here's a bit more info on how the Apache Arrow folks say we might leverage it:

https://apachearrow.slack.com/archives/C0S8Z7VBK/p1516934654000036

-Thad

----

Thad Guidry [8:44 PM] Anyone know about us at OpenRefine? We're curious if/how Arrow might be useful for us to improve our ancient in-memory data model and processing, so that desktop and laptop users can work with more local data in OpenRefine.

bhulette [10:28 AM] @Thad Guidry I hadn't heard of OpenRefine before, but it definitely looks like something that could benefit from the Arrow format. The biggest selling point would be the ability to easily interoperate with other tools that use Arrow (e.g. Spark, pandas, etc...) without any serialization costs.

[10:29 AM] I don't know what your current data model looks like, but there could be performance benefits from the columnar layout as well

Thad Guidry [10:53 AM] @bhulette That's described here https://github.com/OpenRefine/OpenRefine/wiki/Server-Side-Architecture

bhulette [11:05 AM] yeah that looks pretty amenable to the arrow format - a loose analogy could be that "column models" are specified by the arrow schema, and the "raw data" is stored in record batches/dictionary batches

Thad Guidry [11:06 AM] @bhulette gotcha

bhulette [11:07 AM] the column groups idea for storing a tree is pretty interesting

[11:08 AM] you would be able to specify blank cells in arrow using validity buffers

Thad Guidry [11:11 AM] @bhulette keep the ideas coming ! (fyi, we had also thought of Apache Ignite)

bhulette [11:12 AM] I'm not sure if arrow could help with storing changes

[11:13 AM] but that could be a welcome addition to the project, if people like @wesmckinn think it's in scope :slightly_smiling_face:

[11:15 AM] is the OpenRefine server distributed?

Thad Guidry [11:16 AM] @bhulette no

[11:18 AM] @bhulette OpenRefine is used locally (desktop/laptop) to clean data. We eventually want to separate the backend and frontend, so that we can do large transformations via streaming/batching against Apache Beam, etc. But our users, once their data gets that big, typically use other tools.

----
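
The analogy bhulette draws in the chat (column models as a schema, cell data held column by column, blank cells marked through validity buffers) can be sketched in plain Python. This is only an illustration of the layout idea, not the Arrow API and not OpenRefine's actual data model: `Column`, `build_column`, `schema`, `batch`, and the sample rows are all hypothetical names, and real Arrow string columns use offset and data buffers rather than a Python list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Column:
    """A single column kept as one contiguous list plus a validity bitmap."""
    name: str
    values: List[str]    # one slot per row; blank rows hold a placeholder ""
    validity: bytearray  # bit i (least-significant first) is 1 if row i is set

    def is_valid(self, row: int) -> bool:
        return bool((self.validity[row // 8] >> (row % 8)) & 1)

    def get(self, row: int) -> Optional[str]:
        return self.values[row] if self.is_valid(row) else None

    def null_count(self) -> int:
        return sum(1 for row in range(len(self.values)) if not self.is_valid(row))


def build_column(name: str, cells: List[Optional[str]]) -> Column:
    """Pack row-oriented cells (None = blank) into the columnar form."""
    validity = bytearray((len(cells) + 7) // 8)  # one bit per row, rounded up
    values = []
    for row, cell in enumerate(cells):
        if cell is None:
            values.append("")  # placeholder; the bitmap marks it as blank
        else:
            validity[row // 8] |= 1 << (row % 8)
            values.append(cell)
    return Column(name, values, validity)


# Hypothetical sample data: a three-row project with two columns.
rows = [("Alice", "Paris"), (None, "Berlin"), ("Carol", None)]
schema = ["name", "city"]  # plays the role of the column model
batch = [build_column(col, [r[i] for r in rows]) for i, col in enumerate(schema)]
```

Here `batch[0].get(1)` returns `None` for the blank cell without needing a sentinel value in the data itself, which is the point of keeping validity separate from the values.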
