feature requests

307 views

Skip to first unread message

David Huynh

unread,

Nov 17, 2012, 10:47:29 PM11/17/12

to openr...@googlegroups.com

Hi all,

From Martin's usage survey I've tried to extract and organize feature requests that you've mentioned. I want to jot them down here, with my own comments, so we can start some discussion.

Please elaborate on any feature request you're interested in, or add new ones. Perhaps we'll vote on which ones are more important to prioritize.

Thanks,

David

High-Level Capabilities

Scalability: Refine should handle more data. (How much is enough?) While there's still room for optimization, there is a limit of how much data we can handle with a given amount of memory. At some point, it becomes a product design issue: how much interactivity can we promise before we usher the user onto the pipelining usage mode.

Pipelining mode: Refine's operations can be run in a pipeline, especially in MapReduce parallelism on a Hadoop cluster. This should allow processing of really large data sets. Some product design issues include:

How to handle operations like Reconcile that are non-deterministic, and so, need some human intervention?
Can pipelining mode be invoked from the UI? The UI can block until the pipeline finishes.
Does the pipelining mode work on normal projects, or directly from input data files to output data files?
Are operations done in a pipeline run undo-able?

Scriptability: Refine can be driven from other apps through libraries in more languages. This is similar to pipelining mode, but not necessarily the same. For example, another app might automatically create a new project in Refine given some data that it generates, drive Refine to perform some fixed sequence of operations, and then let the user continue from there interactively.

Hosting: Refine can be hosted within some other web app. Issue: how does access control work between Refine and that web app?

Collaboration support: In addition to hosting, Refine should support collaboration. Multiple users can access the same project. Consider other hosted products, like Google Spreadsheets, that support collaborative editing: operations in them tend to modify only small pieces of data, such as one spreadsheet cell at a time; so, it's enough to just lock the current cell that each user focuses on. This is not possible in Refine, as each operation in Refine can and usually modify a large amount of data.

Data visualizations + statistics: Refine should support visualizations and statistics, e.g., integration of R, integration of d3js. Basic visualizations shouldn't be hard to support, though it can get complicated quickly.

Import/Export

Better XML (hierarchical formats) import/export. Make sure the hierarchical data is parsed correctly into records.

Better interface with databases: e.g., loading data directly from database tables, writing back to database tables, exporting SQL commands to update.

Easier to join / concatenate data: Allow adding rows to an existing project.

Needlebase-like web scraping functionality. This is hard and maybe fall outside the scope of Refine.

Operations

Better undo/redo: Out-of-order undo; more intuitive undo format; annotations/notes in the undo/redo history.

Better clustering. For example, allow grouping rows by values in one column and then clustering values within each group of rows in another column.

More powerful Fetch URL command. E.g., support POST, support user's credentials.

Reconciliation

Reconciliation between projects or against databases.

Expressions and Data Model

Internal JSON-like data model. Improve the record model and make it behave coherently in expressions, facets, and operations.

Better GREL. More intuitive, more capabilities, access over more things in a project or in other projects, etc., e.g., expressions able to reference other rows.

More tutorials / documentation on GREL.

More native features. E.g., native handling of addresses.

Customization + Work Flow

Customization. E.g., commonly used operations can be pinned on column drop-down menus.

Sharing of operation scripts. Operation scripts can be readily uploaded onto some wiki and shared with other people.

Development

Easier third-party extension development.

Owen Stephens

unread,

Nov 19, 2012, 5:13:31 AM11/19/12

to openr...@googlegroups.com

Thanks David - this looks like a useful list.

While the High Level Capabilities all look interesting, they also feel like they target a very different audience to the other requests. I'm more directly interested in those things that would offer immediate improvements to data manipulation than those that offer (mainly) developer facing advantages.

I'd be keen on seeing better handling of record based data with XML/JSON input and output - which would really start to offer some differentiation for Refine vs Excel. I think this goes hand in hand with more consistent use of the record model across Refine, although this seems like it might take some substantial effort.

The current lack of ability to add rows to an existing project seems a key missing piece of functionality as it limits you to single file based projects.

Owen

David Huynh

unread,

Nov 20, 2012, 12:26:31 AM11/20/12

to openr...@googlegroups.com

I think both of your requests are on the list already. Perhaps we can wait for other people to chime in, and then take a vote or figure out some other way to prioritize these features.

David

--

Thad Guidry

unread,

Nov 20, 2012, 10:36:19 AM11/20/12

to openr...@googlegroups.com

High-Level Capabilities

Scalability: Refine should handle more data. (How much is enough?) While there's still room for optimization, there is a limit of how much data we can handle with a given amount of memory. At some point, it becomes a product design issue: how much interactivity can we promise before we usher the user onto the pipelining usage mode.

In regards to scalability, I have noticed that the problem seems to increase with the state of things. Where when I apply 10 facets, and am in Records mode, things begin to have long Working... pauses with even just 100,000 records and 15 columns. But this might be resolved with a better internal JSON-like data model. dunno.

Pipelining mode: Refine's operations can be run in a pipeline, especially in MapReduce parallelism on a Hadoop cluster. This should allow processing of really large data sets. Some product design issues include:

How to handle operations like Reconcile that are non-deterministic, and so, need some human intervention?
Can pipelining mode be invoked from the UI? The UI can block until the pipeline finishes.

Does the pipelining mode work on normal projects, or directly from input data files to output data files?
Are operations done in a pipeline run undo-able?

I think folks interested in the pipelining with MapReduce and Hadoop or Mahout should look into Pentaho (open source) that already provides integration of them. I would think that it is possible for OpenRefine to leverage Hadoop in a similar fashion. Having an easier tool than Pentaho, would certainly be useful. One thing that I often have to do is still apply custom partitioning expressions for Hadoop reducers, etc. And using GREL is a breeze compared to trying to wire things up in Java, such as this example using Pentaho and Hadoop to wire up your custom record key that needs to pipe to a particular reducer http://wiki.pentaho.com/display/BAD/Using+a+Custom+Partitioner+in+Pentaho+MapReduce

Here's some more info for those interested in Pentaho and Hadoop usage:

http://www.pentahobigdata.com/ecosystem/platforms/hadoop

http://wiki.pentaho.com/display/BAD/Hadoop

http://infocenter.pentaho.com/help/topic/getting_started_with_pdi/task_getting_started_with_hadoop.html

Data visualizations + statistics: Refine should support visualizations and statistics, e.g., integration of R, integration of d3js. Basic visualizations shouldn't be hard to support, though it can get complicated quickly.

A few of the D3js visualizations I think would actually be quite useful for clustering I think. I agree it could get real complicated given that visualization requires a lot of HTML, CSS, Javascripting, etc. So perhaps a pluggable vizualization window to your data is needed that supports a pluggable control panel ? HTML5 ? Pre-wire this and then the community can probably support the plugins I guess ?

Import/Export

Better XML (hierarchical formats) import/export. Make sure the hierarchical data is parsed correctly into records.

Yes, we need to fix this and it should be a priority.

Better interface with databases: e.g., loading data directly from database tables, writing back to database tables, exporting SQL commands to update.

This would be one major improvement in my life, I guess. And it goes along with perhaps Reconciling between projects and databases. In Pentaho, how I am able to drop into a directory, the latest SQL library .jar file for my database vendor.

Easier to join / concatenate data: Allow adding rows to an existing project.

We should have supported this from day 1. We allow adding columns. And new record rows do get created, just not as the user really needs, which is an appendRows() type of function, I guess. Using the appendRows() function during an import would also be useful, to support aggregating a bunch of files and creating a bigger Refine project holding all the rows from your individual datasets.

Operations

Better undo/redo: Out-of-order undo; more intuitive undo format; annotations/notes in the undo/redo history.

Better clustering. For example, allow grouping rows by values in one column and then clustering values within each group of rows in another column.

This would be useful, it's like a double cluster operation. And should be Human operated only.

More powerful Fetch URL command. E.g., support POST, support user's credentials.

+ 1 , we are holding back a lot of useful data for folks because they have to do so much extra work to Fetch things with common Web APIs and cannot use Refine fully for that.

Reconciliation

Reconciliation between projects or against databases.

I would like to see Reconcile between projects 1st. If it is just as easy for databases also, once SQL in/out is supported, then I would like to see that also. The thing about databases is that they can get REALLY large, but during reconciling your just doing it in mass against the cells value keys themselves, so hopefully it's not too bad. The visualization part is what I need some kind of graphic example or screenshot to help me visualize how reconciling between both projects might work. Both projects often need to be side by side, because there's a lot of back and forth reconciling that can happen between 2 projects. A different faceting mechanism might be needed that runs along the top or bottom of screen and is configurable to minimize precious window real estate, might be needed.

Expressions and Data Model

Internal JSON-like data model. Improve the record model and make it behave coherently in expressions, facets, and operations.

+1 yes, we need to fix this and make it flow so much better and much faster. I feel this is # 1 priority and I'm tired of seeing Working... all the time while working with larger data sets in Refine.

More native features. E.g., native handling of addresses.

More native features actually could just come from the Customization operations and where the Community can easily contribute. Easier Plugability would be so cool.

Customization + Work Flow

Customization. E.g., commonly used operations can be pinned on column drop-down menus.

+1 Being able to pin the pluggable features, custom GREL expressions, etc, that the Community comes up with and contributes to, in whatever languages that OpenRefine supports, like Jython, Clojure, GREL, etc.

Sharing of operation scripts. Operation scripts can be readily uploaded onto some wiki and shared with other people.

Development

Easier third-party extension development.

I am not sure HOW to get better Python support (versus just Jython) into OpenRefine, but feel that it would greatly open the doors for a much wider community for extension development. We are missing a CODE window that could take Python (not Jython), Clojure and GREL directly, and the Expression Editor window is feeling rather limiting.

--
-Thad
http://www.freebase.com/view/en/thad_guidry

Martin Magdinier

unread,

Nov 20, 2012, 6:13:28 PM11/20/12

to openrefine

Thanks for this summary !

I think the discussion should be split up in different thread or even turned into GitHub issue / request when an improvement start to be clearly identified.

See my comments below.

On Tue, Nov 20, 2012 at 10:36 AM, Thad Guidry <thadg...@gmail.com> wrote:

High-Level Capabilities

Scalability: Refine should handle more data. (How much is enough?) While there's still room for optimization, there is a limit of how much data we can handle with a given amount of memory. At some point, it becomes a product design issue: how much interactivity can we promise before we usher the user onto the pipelining usage mode.

In regards to scalability, I have noticed that the problem seems to increase with the state of things. Where when I apply 10 facets, and am in Records mode, things begin to have long Working... pauses with even just 100,000 records and 15 columns. But this might be resolved with a better internal JSON-like data model. dunno.

Pipelining mode: Refine's operations can be run in a pipeline, especially in MapReduce parallelism on a Hadoop cluster. This should allow processing of really large data sets. Some product design issues include:

How to handle operations like Reconcile that are non-deterministic, and so, need some human intervention?
Can pipelining mode be invoked from the UI? The UI can block until the pipeline finishes.

Does the pipelining mode work on normal projects, or directly from input data files to output data files?
Are operations done in a pipeline run undo-able?

I think folks interested in the pipelining with MapReduce and Hadoop or Mahout should look into Pentaho (open source) that already provides integration of them. I would think that it is possible for OpenRefine to leverage Hadoop in a similar fashion. Having an easier tool than Pentaho, would certainly be useful. One thing that I often have to do is still apply custom partitioning expressions for Hadoop reducers, etc. And using GREL is a breeze compared to trying to wire things up in Java, such as this example using Pentaho and Hadoop to wire up your custom record key that needs to pipe to a particular reducer http://wiki.pentaho.com/display/BAD/Using+a+Custom+Partitioner+in+Pentaho+MapReduce

Here's some more info for those interested in Pentaho and Hadoop usage:

http://www.pentahobigdata.com/ecosystem/platforms/hadoop

http://wiki.pentaho.com/display/BAD/Hadoop

http://infocenter.pentaho.com/help/topic/getting_started_with_pdi/task_getting_started_with_hadoop.html

Data visualizations + statistics: Refine should support visualizations and statistics, e.g., integration of R, integration of d3js. Basic visualizations shouldn't be hard to support, though it can get complicated quickly.

A few of the D3js visualizations I think would actually be quite useful for clustering I think. I agree it could get real complicated given that visualization requires a lot of HTML, CSS, Javascripting, etc. So perhaps a pluggable vizualization window to your data is needed that supports a pluggable control panel ? HTML5 ? Pre-wire this and then the community can probably support the plugins I guess ?

Regarding the Stats part: The Chicago Tribune Stats extension already exists. A tutorial is available on their blog. Please note that the extension does not work with Google Refine 2.5. It should be tested with the 2.0 version. Might be a first easy steps by testing. I can do that sometime this week.

Import/Export

Better XML (hierarchical formats) import/export. Make sure the hierarchical data is parsed correctly into records.

Yes, we need to fix this and it should be a priority.

Better interface with databases: e.g., loading data directly from database tables, writing back to database tables, exporting SQL commands to update.

This would be one major improvement in my life, I guess. And it goes along with perhaps Reconciling between projects and databases. In Pentaho, how I am able to drop into a directory, the latest SQL library .jar file for my database vendor.

There is already a Google Refine Uploader Extension allows you to export datasets from Google Refine and post them as JSON to web servers! Intended for use with CouchDB. Please note that this extension is a work in progress. Feel free to join and help. Access the github of the project .

Easier to join / concatenate data: Allow adding rows to an existing project.

We should have supported this from day 1. We allow adding columns. And new record rows do get created, just not as the user really needs, which is an appendRows() type of function, I guess. Using the appendRows() function during an import would also be useful, to support aggregating a bunch of files and creating a bigger Refine project holding all the rows from your individual datasets.

I think a join / concatenate function through menu (and not GREL expression) could be nice to have. The poorly written article on my blog on this topic is by far the most view and searched. So I guess there is a need for an nice interface there.

Operations

Better undo/redo: Out-of-order undo; more intuitive undo format; annotations/notes in the undo/redo history.

Being able to fork the history. Currently when we go back in the history and perform a new action, all the next steps are lost. So if you didn't back up your project previously, or extracted the JSON code, you might loose some work. Sometimes you just also want to do one change in your process and keep the process the same.

I reckon this is at least a UI challenge to implement it. Don't know code site.

Better clustering. For example, allow grouping rows by values in one column and then clustering values within each group of rows in another column.

This would be useful, it's like a double cluster operation. And should be Human operated only.

Agree.

I would like to mentioned that I discover only recently (article) that cluster feature where also available in GREL expression. I guess it would be worth to mentioned it in the help section

More powerful Fetch URL command. E.g., support POST, support user's credentials.

+ 1 , we are holding back a lot of useful data for folks because they have to do so much extra work to Fetch things with common Web APIs and cannot use Refine fully for that.

+1. I see also a need for user's credentials from the project creation page where the user can access data protected by login and platform (for example buzzdata)

Reconciliation

Reconciliation between projects or against databases.

I would like to see Reconcile between projects 1st. If it is just as easy for databases also, once SQL in/out is supported, then I would like to see that also. The thing about databases is that they can get REALLY large, but during reconciling your just doing it in mass against the cells value keys themselves, so hopefully it's not too bad. The visualization part is what I need some kind of graphic example or screenshot to help me visualize how reconciling between both projects might work. Both projects often need to be side by side, because there's a lot of back and forth reconciling that can happen between 2 projects. A different faceting mechanism might be needed that runs along the top or bottom of screen and is configurable to minimize precious window real estate, might be needed.

Expressions and Data Model

Internal JSON-like data model. Improve the record model and make it behave coherently in expressions, facets, and operations.

+1 yes, we need to fix this and make it flow so much better and much faster. I feel this is # 1 priority and I'm tired of seeing Working... all the time while working with larger data sets in Refine.

More native features. E.g., native handling of addresses.

More native features actually could just come from the Customization operations and where the Community can easily contribute. Easier Plugability would be so cool.

Customization + Work Flow

Customization. E.g., commonly used operations can be pinned on column drop-down menus.

+1 Being able to pin the pluggable features, custom GREL expressions, etc, that the Community comes up with and contributes to, in whatever languages that OpenRefine supports, like Jython, Clojure, GREL, etc.

User already have the option to star their favorite expression and reuse from a project to an other.

Sharing of operation scripts. Operation scripts can be readily uploaded onto some wiki and shared with other people.

Development

Easier third-party extension development.

I am not sure HOW to get better Python support (versus just Jython) into OpenRefine, but feel that it would greatly open the doors for a much wider community for extension development. We are missing a CODE window that could take Python (not Jython), Clojure and GREL directly, and the Expression Editor window is feeling rather limiting.

--
-Thad
http://www.freebase.com/view/en/thad_guidry

--

Martin

Ian Mak

unread,

Nov 20, 2012, 6:16:26 PM11/20/12

to openr...@googlegroups.com

Hi David,

Thad has prompted me to feedback on this, so my two cents below.

Ian

On Sunday, 18 November 2012 03:47:32 UTC, David Huynh wrote:

Hi all,

From Martin's usage survey I've tried to extract and organize feature requests that you've mentioned. I want to jot them down here, with my own comments, so we can start some discussion.

Please elaborate on any feature request you're interested in, or add new ones. Perhaps we'll vote on which ones are more important to prioritize.

Thanks,

David

High-Level Capabilities

Scalability: Refine should handle more data. (How much is enough?) While there's still room for optimization, there is a limit of how much data we can handle with a given amount of memory. At some point, it becomes a product design issue: how much interactivity can we promise before we usher the user onto the pipelining usage mode.

We find it doesn't take long to run into Refine's limitations for size, so we'd definitely want to vote for more capability in this area, but from our experience of using Pentaho with larger datasets it doesn't take you long to run into RAM memory issues when using a standard desktop 4GB of RAM. If pipeline can help with both the scale and the RAM issues that should take priority.

Pipelining mode: Refine's operations can be run in a pipeline, especially in MapReduce parallelism on a Hadoop cluster. This should allow processing of really large data sets. Some product design issues include:

How to handle operations like Reconcile that are non-deterministic, and so, need some human intervention?

Can pipelining mode be invoked from the UI? The UI can block until the pipeline finishes.

Does the pipelining mode work on normal projects, or directly from input data files to output data files?

Are operations done in a pipeline run undo-able?

This sounds close to Pentaho which does the pipelining very well, but doesn't do the human intervention. Most ETL products seem to use a graphical flow chart interface, which works well, but can be fiddly, because the user is forced to hunt around in endless dialog boxes, looking for the correct box to tick. I'd prefer a series of GREL statements, provided that there is a suitable way to preview / interrupt flow for debugging purposes.

Scriptability: Refine can be driven from other apps through libraries in more languages. This is similar to pipelining mode, but not necessarily the same. For example, another app might automatically create a new project in Refine given some data that it generates, drive Refine to perform some fixed sequence of operations, and then let the user continue from there interactively.

This could be fantastic, one of the great things about Pentaho is the ability to set up a routine that picks up a new file for processing once it has been added to a directory.

Hosting: Refine can be hosted within some other web app. Issue: how does access control work between Refine and that web app?

Collaboration support: In addition to hosting, Refine should support collaboration. Multiple users can access the same project. Consider other hosted products, like Google Spreadsheets, that support collaborative editing: operations in them tend to modify only small pieces of data, such as one spreadsheet cell at a time; so, it's enough to just lock the current cell that each user focuses on. This is not possible in Refine, as each operation in Refine can and usually modify a large amount of data.

If you can make it work, this would be great, but it looks like a really big job and there are other priorities I would put ahead of this.

Data visualizations + statistics: Refine should support visualizations and statistics, e.g., integration of R, integration of d3js. Basic visualizations shouldn't be hard to support, though it can get complicated quickly.

We do a lot of work with visualisations but we always find that we want such precise control over the output that we end up developing specific javascript code for each instance.

Import/Export

Better XML (hierarchical formats) import/export. Make sure the hierarchical data is parsed correctly into records.

Better interface with databases: e.g., loading data directly from database tables, writing back to database tables, exporting SQL commands to update.

YES, YES, YES

Easier to join / concatenate data: Allow adding rows to an existing project.

YES, YES, YES

Needlebase-like web scraping functionality. This is hard and maybe fall outside the scope of Refine.

This feels like it is out of scope to me, there are lots of other products to do this and scraperwiki too.

Operations

Better undo/redo: Out-of-order undo; more intuitive undo format; annotations/notes in the undo/redo history.

Better clustering. For example, allow grouping rows by values in one column and then clustering values within each group of rows in another column.

More powerful Fetch URL command. E.g., support POST, support user's credentials.

That could be very useful for some datasets.

Reconciliation

Reconciliation between projects or against databases.

This is our biggest challenge, we're using open corps for some of our reconciliation, but we also need to match against four or five other data sets. Using pentaho and a custom built list we're able to get an 75% match, but our datasets are so large that we need to deploy our service on an EC2 cluster. It still leaves us with a lump of 25% of our data that needs to be reviewed manually. We can't do that in Pentaho and we'd love Open Refine to be our single reconciliation service.

I agree with Thad that project to project reconciliation should come first. This would be good for smaller datasets and would probably serve a good proportion of you user's reconciliation requirements. For people like us, who want to use Open Refine to reconcile large datasets, I don't think its an issue of adding new functionality, its the difficulty of setting up a working API. If you could make it easier for users to set up a site that will take refine commands then you don't need to add local database reconciliation.

I'm happy to be your guinea pig, I'm also willing to write up tutorials and screencasts for users showing them how to get an API up and running, so if anyone can point me in the right direction I'll gladly take on the challenge of demystifying this aspect of refine.

Expressions and Data Model

Internal JSON-like data model. Improve the record model and make it behave coherently in expressions, facets, and operations.

Better GREL. More intuitive, more capabilities, access over more things in a project or in other projects, etc., e.g., expressions able to reference other rows.

Extending GREL would be great.

More tutorials / documentation on GREL.

Yes please.

More native features. E.g., native handling of addresses.

Customization + Work Flow

Customization. E.g., commonly used operations can be pinned on column drop-down menus.

Sharing of operation scripts. Operation scripts can be readily uploaded onto some wiki and shared with other people.

See scraperwiki, very useful and could also overcome the issue around native handling of addresses etc.

Development

Easier third-party extension development.

I'm sorry I missed the original questionnaire but a couple of other things leap out at me:

1. The ability to rename data exports and to save them to different directories

2. Clustering and edit for date formats. We often see mixed date formats in our data, it would be great to have a tool for clustering these into a consistent format.

3. Ability to check / monitor / change encoding formats

4. The ability to remember manual reconciliation choices - we often see misspelt data coming our way, it would be great to reconcile it once and have the option to repeat any previous manual reconciliations. (I know you can do this using the JSON step code, but it often gets mixed up with other activities, and so we have to unpick the JSON and compile it all into a separate file so that they can be reused).

I'd also like to say thank you to you guys, thank you for creating such a great product that can be used by people who aren't experts, I don't think you realise how much you've already achieved for those of us who are working on dirty data day to day. This tool is opening so many possibilities for us, so a heartfelt thank you.

David Huynh

unread,

Nov 23, 2012, 5:03:25 PM11/23/12

to openr...@googlegroups.com

Thanks, Ian. This is valuable feedback for us, and gives us plenty to go on!

I think I understand most of what you wrote, except for these 2 points:

3. Ability to check / monitor / change encoding formats
4. The ability to remember manual reconciliation choices - we often see misspelt data coming our way, it would be great to reconcile it once and have the option to repeat any previous manual reconciliations. (I know you can do this using the JSON step code, but it often gets mixed up with other activities, and so we have to unpick the JSON and compile it all into a separate file so that they can be reused).

Could you elaborate a bit more on both?

For #4, how do you imagine the feature will work? Do you and others on your team want to share these manual recon choices? Are there risks to apply the same choices without checking on any newly reconciled column?

David

--

Mateja Verlic

unread,

Jan 24, 2013, 3:09:44 PM1/24/13

to openr...@googlegroups.com

Hi,

better late than never... refine-stat extension should now work with OpenRefine.

I forked their repository and made some changes. You can find it on my github (until they accept the pull request): https://github.com/sparkica/refine-stats

Please note that the extension has been restructured: everything was moved one level up (from stats subfolder). Please let me know if you have any trouble using it...

Best,

Mateja

Reply all

Reply to author

Forward

0 new messages