Re: future plans


Tim McNamara

Nov 14, 2010, 3:45:39 PM
to google...@googlegroups.com
On Mon, Nov 15, 2010 at 9:12 AM, Randall Amiel <randy1...@gmail.com> wrote:

Will Refine be transitioned to the Google cloud, or will it just stay as a local service using our computers' resources before hitting Google? I've experienced issues loading extremely large datasets. Hopefully you guys will integrate Refine with Google Storage sooner rather than later; it would help a lot.

Randy, yes is the short answer. I'm sure patches are welcome :)

There are issues with loading big projects because Refine keeps the whole project in memory. Google Refine's resources are mostly restricted by your Java settings[1], especially if you run with the default settings.
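If you're hitting that ceiling, one common workaround (the exact file and default value vary by platform and Refine version, so treat this as a sketch) is to raise the maximum JVM heap via the standard -Xmx flag, e.g. in refine.ini:

```ini
# refine.ini -- raise the JVM heap so larger projects fit in memory
# (-Xmx is the standard JVM maximum-heap flag; 2048M is an arbitrary example)
JAVA_OPTIONS=-Xmx2048M
```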

More importantly though, there is a component of the source specifically for App Engine[2] in development. This should be a win, because the platform dynamically allocates memory. As we speak, I'm sure many engineers are working on a data pipeline: upload into Storage > Refine > BigQuery & Prediction API.

Personally, I'm glad Google doesn't restrict users to this. I think the benefits of a local service far outweigh a centralised cloud service. Being outside of the USA, it's impossible for me to access Google Storage. More importantly however, data are sensitive. It's great that the Refine project can perform operations locally.

My view is that if this project had been started as a Google project, its modus operandi of cloud storage and processing would have been followed. I think it's great that things have worked out the way they have.

Tangent: many of the cloud apps, e.g. Google Apps, still rely on the client to do lots of processing. Pamela Fox from Google Sydney did a great talk at Webstock 09[3] on how Google's infrastructure is best used. The moral I took from that was that while distributed file systems are great, making use of the client's CPU cycles is still going to be the best way to do intensive tasks.

Tim


[1] See issues #145, #147
[2] http://code.google.com/p/google-refine/source/browse/#svn/trunk/broker/appengine%3Fstate%3Dclosed
[3] http://vimeo.com/4861271

Judson Dunn

Nov 14, 2010, 3:16:17 PM
to google...@googlegroups.com
On Sun, Nov 14, 2010 at 2:12 PM, Randall Amiel <randy1...@gmail.com> wrote:
> Will Refine be transitioned to the Google cloud, or will it just stay as a
> local service using our computers' resources before hitting Google? I've
> experienced issues loading extremely large datasets. Hopefully you guys will
> integrate Refine with Google Storage sooner rather than later; it would help a lot.
>

As someone who uses Refine for medical data with protected health
information, if it were *only* offered as a hosted service it wouldn't
be usable to me anymore. Please keep uses like this in mind too. As
you say in your intro, it could have repercussions for journalists and
others as well.

Thanks.

--
Judson Dunn
http://sleepyhead.org

Randall Amiel

Nov 14, 2010, 6:09:12 PM
to google...@googlegroups.com

Agreed. The point I'm trying to make is that, if a user doesn't have the CPU resources to handle such big datasets, Google's cloud could be used to do such transformations. In the future, I'm sure Google will handle sensitive data as they cater to enterprise users.

randall


David Huynh

Nov 14, 2010, 7:11:29 PM
to google...@googlegroups.com
We're definitely seeing several audiences:

1. Individual people who have small to medium-sized, public or at least not sensitive data sets -- Google-hosted service would be best.

2. Individual people who have small to medium-sized, private/sensitive data sets -- desktop application (the current form) would be best.

3. Communities who want to collaborate on processing some public data sets -- Google-hosted service would be best.
    a. Freebase sub-communities who want to load some data sets into Freebase.
    b. Citizen reporters who want to collaborate on sifting through some government public data sets.

4. Communities who want to collaborate on processing some private data sets -- self-hosted service would be best.
    a. News agencies working through data for news stories.
    b. Government agencies like data.gov.uk cleaning up data before publishing it.
    c. Crisis response teams handling private data who can't rely on connectivity to the cloud.

The two different hosted options will require significantly different technology stacks.

Am I missing any other audience?

David

Resty Cena

Nov 14, 2010, 7:32:04 PM
to google...@googlegroups.com
David,
Corpus linguists -- who collaboratively or privately harvest mono- or multi-lingual data from the web, clean it, and do all sorts of text manipulation and analysis on the text data.

Stefano Mazzocchi

Nov 14, 2010, 7:42:56 PM
to google-refine
On Sun, Nov 14, 2010 at 4:32 PM, Resty Cena <rest...@gmail.com> wrote:
David,
Corpus linguists -- who collaboratively or privately harvest mono- or multi-lingual data from the web, clean it, and do all sorts of text manipulation and analysis on the text data.

I'm curious, what do you see as the difference between what you suggest and David's option #3?

Note, we're not discussing "types of data" but "types of interaction and control requirements".



--
Stefano Mazzocchi  <stef...@google.com>
Software Engineer, Google Inc.

Tim McNamara

Nov 14, 2010, 7:50:11 PM
to google...@googlegroups.com
On Mon, Nov 15, 2010 at 1:11 PM, David Huynh <dfh...@gmail.com> wrote:
We're definitely seeing several audiences:

1. Individual people who have small to medium-sized, public or at least not sensitive data sets -- Google-hosted service would be best.

2. Individual people who have small to medium-sized, private/sensitive data sets -- desktop application (the current form) would be best.

3. Communities who want to collaborate on processing some public data sets -- Google-hosted service would be best.
    a. Freebase sub-communities who want to load some data sets into Freebase.
    b. Citizen reporters who want to collaborate on sifting through some government public data sets.

4. Communities who want to collaborate on processing some private data sets -- self-hosted service would be best.
    a. News agencies working through data for news stories.
    b. Government agencies like data.gov.uk cleaning up data before publishing it.
    c. Crisis response teams handling private data who can't rely on connectivity to the cloud.

The two different hosted options will require significantly different technology stacks.

Am I missing any other audience?

David

I don't know if this counts.. what about companies that want to get a feel for the product locally, but would then like to grow and use the hosted solution to achieve scale?

Here are some things that I can think of:
  - Field staff => self-hosted (micro mobile app?)
  - Developers experimenting with their own extensions => self-hosted

Stefano Mazzocchi

Nov 14, 2010, 8:00:51 PM
to google-refine
On Sun, Nov 14, 2010 at 3:09 PM, Randall Amiel <randy1...@gmail.com> wrote:

Agreed. The point I'm trying to make is that, if a user doesn't have the CPU resources to handle such big datasets, Google's cloud could be used to do such transformations. In the future, I'm sure Google will handle sensitive data as they cater to enterprise users.

There is one important thing that needs to be understood: the types of operations supported by Google Refine are very hard to scale. You can slice and dice thru hundreds of thousands of rows in a few seconds, but even if we were able to map-reduce the hell out of this (which is not a given, btw!) and slice and dice thru hundreds of millions of rows in a few minutes (assuming one could keep the cost per row linear), the overall UI experience would be so poor you wouldn't be able to stand it.
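To make that concrete: a text facet is essentially a group-by count over an entire column, so every row gets touched on every recomputation. A rough sketch in Python (illustrative only, not Refine's actual code):

```python
from collections import Counter

def text_facet(rows, column):
    """Return facet choices (value -> row count) for one column.

    Every row must be visited, so cost grows linearly with row count,
    and the whole column has to be quickly reachable -- in Refine's
    case, held in memory.
    """
    return Counter(row[column] for row in rows)

rows = [
    {"city": "Wellington"},
    {"city": "Auckland"},
    {"city": "Auckland"},
]
print(text_facet(rows, "city"))
# Counter({'Auckland': 2, 'Wellington': 1})
```

Now imagine recomputing that after every single edit, over hundreds of millions of rows, with the results round-tripping through a browser UI.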

Moreover, Refine was designed from the start to be a locally hosted web service, which means it heavily depends on ajax latencies being tiny, local bandwidth being very high, and I/O concurrency being small to none.

These are all issues that will need to be addressed, and while some of them just require engineering resources to be executed, others require hard-core research in distributed computing; that will take time and has a high risk of failure.

Plus, we're a very small team (at least so far) so set your expectations accordingly.


Resty Cena

Nov 14, 2010, 9:19:54 PM
to google...@googlegroups.com
#3 is collaboration, #4 is private. Multi-lingual corpus sets. Individual language sets could be private or collaborative, but multilingual corpus sets must be collaborative.

Randall Amiel

Nov 14, 2010, 10:42:46 PM
to google...@googlegroups.com

I don't mean to start such a debate, but what if you want to start connecting datasets without using Freebase? I mean, Freebase isn't a centralized location to link all datasets *yet*. What about the Facebook graph?


Randall Amiel

Nov 14, 2010, 10:51:23 PM
to google...@googlegroups.com

Stefano:
I guess that's what I was getting at: the costly operations. Not all operations can be put into one DB. We must support operations over any dataset that exposes a standardized format (RDF, etc.) and maybe a URI to another. Costly refining must take place in the cloud, especially refining a join between the Facebook graph and Freebase ;)


Stefano Mazzocchi

Nov 15, 2010, 12:14:03 AM
to google-refine
On Sun, Nov 14, 2010 at 7:42 PM, Randall Amiel <randy1...@gmail.com> wrote:

I don't mean to start such a debate, but what if you want to start connecting datasets without using Freebase? I mean, Freebase isn't a centralized location to link all datasets *yet*.

why not? 

what about the Facebook graph?

:-) 


Rebecca Shapley

Nov 16, 2010, 7:32:49 PM
to google...@googlegroups.com
This is a good division of audiences!

Another dimension to keep in mind is the presence or absence of DeveloperChops and DeveloperPower. Some things need to be available exclusively through point-and-click to reach their broader audience; other times it's OK to assume someone can do a little coding or ask someone else to develop something.

Refine currently seems to straddle this a bit - you need some expression-writing chops to really use its power, although no actual development code is required. 

-R. 
--
Rebecca Shapley
Google Research  |  Structured Data and Semantic Services
Check out Fusion Tables: http://www.google.com/fusiontables

David Huynh

Nov 17, 2010, 12:19:45 PM
to google...@googlegroups.com
On Tue, Nov 16, 2010 at 4:32 PM, Rebecca Shapley <rsha...@google.com> wrote:
This is a good division of audiences!

Another dimension to keep in mind is the presence or absence of DeveloperChops and DeveloperPower. Some things need to be available exclusively through point-and-click to reach their broader audience; other times it's OK to assume someone can do a little coding or ask someone else to develop something.

Refine currently seems to straddle this a bit - you need some expression-writing chops to really use its power, although no actual development code is required. 

Good point! I was designing for people who can handle Excel formulas, but I'm hoping this 2.0 launch can give us, from actual usage feedback, a sense of where the real balance point should be.

And even for those who can handle expressions, translating the desired overall transformation effect on the data to an actual sequence of steps (invoke this command, then invoke that command, ...) is another gulf of execution. Recipes and screencast tutorials are one possible solution to help ...

----

Regarding the two hosted use cases (private versus Google), there is another dimension: there are different models for collaboration. Fusion Tables and Google Docs follow one model: simultaneous editing by multiple users (is there an official name for that?). Code development tends to follow another model: check out a working copy from, and check in/commit patches to, a central repository. And then there's Git versus the old commit model, too. The big question is, which of these models makes sense for something like Refine? And which can we implement?

Let's keep in mind that editing changes in Refine are categorically different from editing changes in text editors and spreadsheets:
- in text editors and spreadsheets, you have a cursor that determines the locus of change, and changes tend to be small
- in Google Refine, there is no cursor, and individual changes can be huge
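To illustrate the difference: Refine records edits as a history of whole-table operations that can be replayed over the data. A toy sketch of that model (Python, with illustrative names only, not Refine's actual API):

```python
# Toy model of Refine-style editing: a "change" is not a cursor edit but a
# named transform applied to an entire column, recorded in a history that
# can be replayed. Merging two users' concurrent histories means merging
# operations that may each rewrite millions of cells, not two small diffs.
def to_uppercase(value):
    return value.upper()

def trim(value):
    return value.strip()

def apply_history(rows, history):
    """Replay a list of (column, transform) operations over all rows."""
    for column, transform in history:
        for row in rows:  # every operation touches every row
            row[column] = transform(row[column])
    return rows

history = [("name", trim), ("name", to_uppercase)]
rows = [{"name": "  ada "}, {"name": "grace"}]
apply_history(rows, history)
print(rows)  # [{'name': 'ADA'}, {'name': 'GRACE'}]
```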

David

Randall Amiel

Nov 17, 2010, 12:51:36 PM
to google...@googlegroups.com

I think the realtime collaboration was based on Openfire, and then ported to Google Wave. Docs and Gmail probably use a variation of this. Furthermore, I don't think you would need realtime collaboration in Refine, unless you're refining realtime military data or some kind of realtime stream flow.

