--
High-Level Capabilities
Scalability: Refine should handle more data. (How much is enough?) While there's still room for optimization, there is a limit of how much data we can handle with a given amount of memory. At some point, it becomes a product design issue: how much interactivity can we promise before we usher the user onto the pipelining usage mode.
Regarding scalability, I have noticed that the problem seems to grow with the state of the project. When I apply 10 facets and am in Records mode, things begin to have long Working... pauses with even just 100,000 records and 15 columns. But this might be resolved with a better internal JSON-like data model; I don't know.
Pipelining mode: Refine's operations can be run in a pipeline, especially in MapReduce parallelism on a Hadoop cluster. This should allow processing of really large data sets. Some product design issues include:
- How to handle operations like Reconcile that are non-deterministic, and so, need some human intervention?
- Can pipelining mode be invoked from the UI? The UI can block until the pipeline finishes.
- Does the pipelining mode work on normal projects, or directly from input data files to output data files?
- Are operations done in a pipeline run undo-able?
I think folks interested in pipelining with MapReduce and Hadoop or Mahout should look into Pentaho (open source), which already provides integration with them. I would think it is possible for OpenRefine to leverage Hadoop in a similar fashion, and having an easier tool than Pentaho would certainly be useful. One thing that I often still have to do is apply custom partitioning expressions for Hadoop reducers, and using GREL is a breeze compared to trying to wire things up in Java, as in this example of using Pentaho and Hadoop to wire up a custom record key that pipes to a particular reducer: http://wiki.pentaho.com/display/BAD/Using+a+Custom+Partitioner+in+Pentaho+MapReduce
Data visualizations + statistics: Refine should support visualizations and statistics, e.g., integration of R, integration of d3js. Basic visualizations shouldn't be hard to support, though it can get complicated quickly.
A few of the D3js visualizations would actually be quite useful for clustering, I think. I agree it could get really complicated, given that visualization requires a lot of HTML, CSS, JavaScript, etc. So perhaps what's needed is a pluggable visualization window onto your data that supports a pluggable control panel? HTML5? Pre-wire this and then the community can probably supply the plugins, I guess.
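To make the custom-partitioner point above concrete, here's a rough Hadoop Streaming mapper in Python (the column layout is made up, and nothing here is Refine-specific) that builds the record key deciding which reducer a row lands on; this is the kind of thing I'd rather express in one short expression than in Java:

    #!/usr/bin/env python
    # Hadoop Streaming mapper sketch: emit key<TAB>row so that Hadoop's
    # default hash partitioner sends rows with the same normalized key
    # to the same reducer. Assumes tab-separated input with the join
    # column in position 0 (an assumption for this example).
    import sys

    def partition_key(raw):
        # normalize the key the way a GREL expression would:
        # trim, lowercase, strip a little punctuation noise
        return raw.strip().lower().replace(".", "").replace(",", "")

    for line in sys.stdin:
        row = line.rstrip("\n")
        fields = row.split("\t")
        if not fields or not fields[0]:
            continue
        print("%s\t%s" % (partition_key(fields[0]), row))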
Import/Export
Better XML (hierarchical formats) import/export. Make sure the hierarchical data is parsed correctly into records.
Yes, we need to fix this, and it should be a priority.
Better interface with databases: e.g., loading data directly from database tables, writing back to database tables, exporting SQL commands to update.
This would be one major improvement in my life. It goes along with reconciling between projects and databases. I like how in Pentaho I am able to simply drop the latest SQL library .jar file for my database vendor into a directory.
Easier to join / concatenate data: Allow adding rows to an existing project.
We should have supported this from day 1. We allow adding columns, and new record rows do get created, just not in the way the user really needs, which is an appendRows() type of function, I guess. Using such an appendRows() function during import would also be useful, to support aggregating a bunch of files and creating a bigger Refine project holding all the rows from your individual datasets.
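Just so it's clearer what I mean by appendRows(), here's a plain-Python sketch that concatenates several exported CSVs into one table, aligning columns by header name (the function name and file paths are made up; this is not Refine's actual API):

    import csv, glob

    def append_rows(paths, out_path):
        # build the union of all column names across the input files
        tables, columns = [], []
        for path in paths:
            with open(path, newline="") as f:
                rows = list(csv.DictReader(f))
            tables.append(rows)
            for name in (rows[0].keys() if rows else []):
                if name not in columns:
                    columns.append(name)
        # write every row, leaving blanks where a file lacked a column
        with open(out_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=columns, restval="")
            writer.writeheader()
            for rows in tables:
                writer.writerows(rows)

    append_rows(sorted(glob.glob("exports/*.csv")), "combined.csv")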
Operations
Better undo/redo: Out-of-order undo; more intuitive undo format; annotations/notes in the undo/redo history.
Better clustering. For example, allow grouping rows by values in one column and then clustering values within each group of rows in another column.
This would be useful; it's like a double cluster operation, and it should be human-operated only.
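In plain Python, the double-cluster idea looks roughly like this, using a fingerprint-style key (the column names are just examples, and the keying is a simplification of what Refine's clustering actually does):

    import re, string
    from collections import defaultdict

    def fingerprint(value):
        # lowercase, strip punctuation, sort the unique tokens
        cleaned = re.sub("[%s]" % re.escape(string.punctuation), " ", value.lower())
        return " ".join(sorted(set(cleaned.split())))

    def cluster_within_groups(rows, group_col, cluster_col):
        # first group rows by one column, then cluster the values of
        # another column inside each group
        clusters = defaultdict(lambda: defaultdict(set))
        for row in rows:
            clusters[row[group_col]][fingerprint(row[cluster_col])].add(row[cluster_col])
        # keep only the keys where more than one spelling collided
        return {group: {k: sorted(v) for k, v in keys.items() if len(v) > 1}
                for group, keys in clusters.items()}

    rows = [
        {"country": "UK", "supplier": "Acme Ltd."},
        {"country": "UK", "supplier": "ACME Ltd"},
        {"country": "FR", "supplier": "Acme Ltd."},
    ]
    print(cluster_within_groups(rows, "country", "supplier"))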
More powerful Fetch URL command. E.g., support POST, support user's credentials.
+1. We are holding back a lot of useful data from folks because they have to do so much extra work to fetch things from common Web APIs, and they cannot use Refine fully for that.
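For anyone wondering what that would buy us, here's the per-row fetch I keep having to script outside of Refine: a POST with a JSON body and basic-auth credentials (standard-library Python; the URL and field names are placeholders):

    import base64, json, urllib.request

    def fetch_post(url, payload, user, password):
        # POST a JSON body with HTTP Basic authentication
        data = json.dumps(payload).encode("utf-8")
        req = urllib.request.Request(url, data=data, method="POST")
        req.add_header("Content-Type", "application/json")
        token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
        req.add_header("Authorization", "Basic " + token)
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.read().decode("utf-8")

    # e.g. one call per row, with the cell value in the request body:
    # print(fetch_post("https://api.example.org/lookup", {"q": "Acme Ltd"}, "me", "secret"))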
Reconciliation
Reconciliation between projects or against databases.
I would like to see reconciling between projects first. If it is just as easy for databases, once SQL in/out is supported, then I would like to see that also. The thing about databases is that they can get REALLY large, but during reconciling you're just matching in bulk against the cell values themselves, so hopefully it's not too bad. The visualization part is where I need some kind of graphic example or screenshot to help me picture how reconciling between two projects might work. Both projects often need to be side by side, because there's a lot of back-and-forth reconciling that can happen between them. A different faceting mechanism, one that runs along the top or bottom of the screen and is configurable to minimize precious window real estate, might be needed.
Expressions and Data Model
Internal JSON-like data model. Improve the record model and make it behave coherently in expressions, facets, and operations.
+1, yes, we need to fix this and make it flow much better and much faster. I feel this is the #1 priority, and I'm tired of seeing Working... all the time while working with larger data sets in Refine.
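The way I picture the record model, roughly, is rows nesting into JSON-ish records whenever the key column is blank. A tiny Python sketch of that reading (the column names are invented for the example):

    def rows_to_records(rows, key_column):
        # a non-blank key starts a new record; blank-key rows belong
        # to the record started above them (Refine's records mode)
        records = []
        for row in rows:
            if row.get(key_column):
                records.append({key_column: row[key_column], "rows": [row]})
            elif records:
                records[-1]["rows"].append(row)
        return records

    data = [
        {"name": "Acme Ltd", "phone": "555-0100"},
        {"name": "",         "phone": "555-0101"},  # continuation row
        {"name": "Globex",   "phone": "555-0199"},
    ]
    print(rows_to_records(data, "name"))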
More native features. E.g., native handling of addresses.
More native features could actually just come from the customization operations, where the community can easily contribute. Easier pluggability would be so cool.
Customization + Work Flow
Customization. E.g., commonly used operations can be pinned on column drop-down menus.
+1. Being able to pin the pluggable features, custom GREL expressions, etc., that the community comes up with and contributes, in whatever languages OpenRefine supports, like Jython, Clojure, GREL, etc.
Sharing of operation scripts. Operation scripts can be readily uploaded onto some wiki and shared with other people.
Development
Easier third-party extension development.
I am not sure HOW to get better Python support (versus just Jython) into OpenRefine, but I feel it would greatly open the doors to a much wider community for extension development. We are missing a CODE window that could take Python (not Jython), Clojure, and GREL directly; the Expression Editor window is feeling rather limiting.
-Thad
http://www.freebase.com/view/en/thad_guidry
--
Hi David,
Thad has prompted me to give feedback on this, so my two cents are below.
Ian
On Sunday, 18 November 2012 03:47:32 UTC, David Huynh wrote:
Hi all,
From Martin's usage survey I've tried to extract and organize feature requests that you've mentioned. I want to jot them down here, with my own comments, so we can start some discussion.
Please elaborate on any feature request you're interested in, or add new ones. Perhaps we'll vote on which ones are more important to prioritize.
Thanks,
David
High-Level Capabilities
Scalability: Refine should handle more data. (How much is enough?) While there's still room for optimization, there is a limit of how much data we can handle with a given amount of memory. At some point, it becomes a product design issue: how much interactivity can we promise before we usher the user onto the pipelining usage mode.
We find it doesn't take long to run into Refine's size limitations, so we'd definitely vote for more capability in this area. But from our experience of using Pentaho with larger datasets, it doesn't take long to run into RAM issues on a standard desktop with 4GB of RAM. If pipelining can help with both the scale and the RAM issues, that should take priority.
Pipelining mode: Refine's operations can be run in a pipeline, especially in MapReduce parallelism on a Hadoop cluster. This should allow processing of really large data sets. Some product design issues include:
- How to handle operations like Reconcile that are non-deterministic, and so, need some human intervention?
- Can pipelining mode be invoked from the UI? The UI can block until the pipeline finishes.
- Does the pipelining mode work on normal projects, or directly from input data files to output data files?
- Are operations done in a pipeline run undo-able?
This sounds close to Pentaho, which does the pipelining very well but doesn't do the human intervention. Most ETL products seem to use a graphical flow-chart interface, which works well but can be fiddly, because the user is forced to hunt around in endless dialog boxes looking for the correct box to tick. I'd prefer a series of GREL statements, provided that there is a suitable way to preview / interrupt the flow for debugging purposes.
Scriptability: Refine can be driven from other apps through libraries in more languages. This is similar to pipelining mode, but not necessarily the same. For example, another app might automatically create a new project in Refine given some data that it generates, drive Refine to perform some fixed sequence of operations, and then let the user continue from there interactively.
This could be fantastic. One of the great things about Pentaho is the ability to set up a routine that picks up a new file for processing once it has been added to a directory.
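For what it's worth, even a simple polling loop covers most of what we use that Pentaho routine for; something like this Python sketch (the directory name and handler are placeholders for whatever would kick off the scripted Refine run):

    import time
    from pathlib import Path

    def watch_and_process(incoming_dir, handler, poll_seconds=10):
        # remember what has already been handled and pass new CSVs
        # to the processing routine as they appear
        seen = set()
        while True:
            for path in sorted(Path(incoming_dir).glob("*.csv")):
                if path not in seen:
                    seen.add(path)
                    handler(path)
            time.sleep(poll_seconds)

    # watch_and_process("incoming", lambda p: print("would process", p))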
Hosting: Refine can be hosted within some other web app. Issue: how does access control work between Refine and that web app?
Collaboration support: In addition to hosting, Refine should support collaboration. Multiple users can access the same project. Consider other hosted products, like Google Spreadsheets, that support collaborative editing: operations in them tend to modify only small pieces of data, such as one spreadsheet cell at a time; so, it's enough to just lock the current cell that each user focuses on. This is not possible in Refine, as each operation in Refine can, and usually does, modify a large amount of data.
If you can make it work, this would be great, but it looks like a really big job and there are other priorities I would put ahead of this.
Data visualizations + statistics: Refine should support visualizations and statistics, e.g., integration of R, integration of d3js. Basic visualizations shouldn't be hard to support, though it can get complicated quickly.
We do a lot of work with visualisations but we always find that we want such precise control over the output that we end up developing specific javascript code for each instance.
Import/Export
Better XML (hierarchical formats) import/export. Make sure the hierarchical data is parsed correctly into records.
Better interface with databases: e.g., loading data directly from database tables, writing back to database tables, exporting SQL commands to update.
YES, YES, YES
Easier to join / concatenate data: Allow adding rows to an existing project.
YES, YES, YES
Needlebase-like web scraping functionality. This is hard and may fall outside the scope of Refine.
This feels out of scope to me; there are lots of other products that do this, and ScraperWiki too.
Operations
Better undo/redo: Out-of-order undo; more intuitive undo format; annotations/notes in the undo/redo history.
Better clustering. For example, allow grouping rows by values in one column and then clustering values within each group of rows in another column.
More powerful Fetch URL command. E.g., support POST, support user's credentials.
That could be very useful for some datasets.
Reconciliation
Reconciliation between projects or against databases.
This is our biggest challenge. We're using open corps for some of our reconciliation, but we also need to match against four or five other data sets. Using Pentaho and a custom-built list we're able to get a 75% match, but our datasets are so large that we need to deploy our service on an EC2 cluster. That still leaves us with a lump of 25% of our data that needs to be reviewed manually. We can't do that in Pentaho, and we'd love OpenRefine to be our single reconciliation service.
I agree with Thad that project-to-project reconciliation should come first. This would be good for smaller datasets and would probably serve a good proportion of your users' reconciliation requirements. For people like us, who want to use OpenRefine to reconcile large datasets, I don't think it's an issue of adding new functionality; it's the difficulty of setting up a working API. If you could make it easier for users to set up a site that will take Refine's reconciliation calls, then you don't need to add local database reconciliation.
I'm happy to be your guinea pig. I'm also willing to write up tutorials and screencasts for users showing them how to get an API up and running, so if anyone can point me in the right direction I'll gladly take on the challenge of demystifying this aspect of Refine.
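To show the sort of thing I mean, here's a minimal Python/Flask sketch of a reconciliation endpoint, assuming the standard shape of Refine's reconciliation API (a "queries" parameter holding JSON like {"q0": {"query": "Acme Ltd"}}, answered with {"q0": {"result": [...]}}). The lookup table and URLs are placeholders, and a real service would do fuzzy matching rather than exact lookups:

    import json
    from flask import Flask, request

    app = Flask(__name__)
    KNOWN = {"acme ltd": "id/001", "globex corporation": "id/002"}  # stand-in data

    def jsonp(payload):
        # Refine sometimes asks for JSONP via a callback parameter
        body = json.dumps(payload)
        callback = request.values.get("callback")
        return "%s(%s)" % (callback, body) if callback else body

    @app.route("/reconcile", methods=["GET", "POST"])
    def reconcile():
        queries = request.values.get("queries")
        if not queries:
            # no queries means the client is asking for service metadata
            return jsonp({"name": "My reconciliation service",
                          "identifierSpace": "http://example.org/ids",
                          "schemaSpace": "http://example.org/schema"})
        results = {}
        for key, q in json.loads(queries).items():
            text = q.get("query", "").strip().lower()
            hit = KNOWN.get(text)
            results[key] = {"result": [{"id": hit, "name": q["query"],
                                        "type": [], "score": 100,
                                        "match": True}] if hit else []}
        return jsonp(results)

    if __name__ == "__main__":
        app.run(port=8000)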
Expressions and Data Model
Internal JSON-like data model. Improve the record model and make it behave coherently in expressions, facets, and operations.
Better GREL. More intuitive, more capabilities, access over more things in a project or in other projects, etc., e.g., expressions able to reference other rows.
Extending GREL would be great.
More tutorials / documentation on GREL.
Yes please.
More native features. E.g., native handling of addresses.
Customization + Work Flow
Customization. E.g., commonly used operations can be pinned on column drop-down menus.
Sharing of operation scripts. Operation scripts can be readily uploaded onto some wiki and shared with other people.
See ScraperWiki; it's very useful and could also help overcome the issue around native handling of addresses, etc.
Development
Easier third-party extension development.
I'm sorry I missed the original questionnaire but a couple of other things leap out at me:
1. The ability to rename data exports and to save them to different directories
2. Clustering and edit for date formats. We often see mixed date formats in our data, and it would be great to have a tool for clustering these into a consistent format (there's a rough sketch of what I mean after this list).
3. Ability to check / monitor / change encoding formats
4. The ability to remember manual reconciliation choices - we often see misspelt data coming our way, it would be great to reconcile it once and have the option to repeat any previous manual reconciliations. (I know you can do this using the JSON step code, but it often gets mixed up with other activities, and so we have to unpick the JSON and compile it all into a separate file so that they can be reused).
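On point 2, the mixed date formats, here's a plain-Python sketch of the kind of normalisation I mean (the candidate formats are just examples of what turns up in our data):

    from datetime import datetime

    CANDIDATE_FORMATS = ["%d/%m/%Y", "%m-%d-%Y", "%Y-%m-%d", "%d %b %Y", "%B %d, %Y"]

    def normalise_date(text):
        # try each known format; anything that matches nothing is
        # returned unchanged so it can be reviewed by hand
        for fmt in CANDIDATE_FORMATS:
            try:
                return datetime.strptime(text.strip(), fmt).date().isoformat()
            except ValueError:
                pass
        return text

And on point 4, this is roughly how we currently unpick the extracted operation-history JSON to keep just the reconciliation-related steps for reuse (the substring test on the "op" field is deliberately loose, since I don't want to hard-code operation names):

    import json

    def extract_recon_steps(history_path, out_path):
        # keep only the reconciliation / mass-edit steps from an
        # extracted operation history so they can be reapplied alone
        with open(history_path) as f:
            ops = json.load(f)
        wanted = [op for op in ops
                  if "recon" in op.get("op", "") or "mass-edit" in op.get("op", "")]
        with open(out_path, "w") as f:
            json.dump(wanted, f, indent=2)

    # extract_recon_steps("full-history.json", "recon-steps.json")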
I'd also like to say thank you to you guys: thank you for creating such a great product that can be used by people who aren't experts. I don't think you realise how much you've already achieved for those of us who are working on dirty data day to day. This tool is opening up so many possibilities for us, so a heartfelt thank you.
--