Data about the other side of food - growing it

Michael Bernstein

Apr 13, 2012, 7:41:08 PM

to open...@googlegroups.com

Hi folks,

I've been working (intermittently) on a gardening focused startup for
a few months now, investigating user needs, conducting interviews and
so on, to focus on problems that gardeners need solving. After all
that, the problem I've picked is discovering and sharing the fruit,
vegetable, and herb varieties that will do best in your garden.

To that end, I have started building some data sets of plant varieties
from a number of sources and will soon start merging them together. I am
de-lurking now for several reasons:

1. I wanted to give you a heads up that this data will likely increase
your coverage considerably. As an example, just one source yielded
over 500 varieties of tomato, compared to the 134 you have now.

2. I want to make sure the data I release is useful to you, but I am
not sure what kind of information about each variety you will want, or
whether (as a group) you prefer a data dump or an API.

3. I wanted to point out the tool/service I am using in case you
hadn't come across it yet: scrapinghub.com, which is based on the open
source scrapy, scrapely, and slybot libraries. Their UI is still a bit
confusing and buggy, but it has some nice features for point-and-click
selection of elements to scrape from a page, and the thus-defined jobs
are exportable to be used outside of their service with scrapy, should
you want or need to do so.

4. As a startup, Urbsly is competing in the 'Lean Startup Challenge
2012' (the winner gets some resources, mostly consulting and
mentorship). To win, I have to get the most tweets and retweets of a
custom hashtag: #leanvote2012-19, and I thought you might be
interested in helping out. Some more details at this blog post:
http://blog.urbsly.com/post/21011884622/what-is-urbsly

5. I wanted to ask for your experience using different tools for
merging data. Google Refine and Google Fusion Tables both look
interesting, but do any of the folks on this list have a
recommendation of one over the other (or for that matter, of any other
tools I should be considering)?

Thanks in advance,
--
Michael R. Bernstein
michaelbernstein.com
Founder - urbsly.com

Chacha Sikes

Apr 15, 2012, 3:12:48 PM

to open...@googlegroups.com

Hi Michael,

Well, it is exciting to hear about your new startup.

In terms of sharing data --

The easiest way I have found to combine datasets is this:

-- You can provide a CSV file that includes as much data as you are interested in sharing.

-- An API is fine too, if you have one. A document describing all of the fields in the API and the search queries would be helpful. I haven't used Fusion Tables yet, but would be interested in trying it out.

-- For each record, you would probably want to include

* a unique id (which we would list as 'external_id') --

* a field for datasource - with information about the source
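
As a rough illustration of such a CSV, here is a minimal sketch. Only `external_id` and `datasource` come from the list above; the `name` and `crop` columns, the id values, and the source name are made-up placeholders.

```python
import csv
import io

# Hypothetical records: only external_id and datasource are fields named in
# the thread; name, crop, and all values are assumptions for illustration.
records = [
    {"external_id": "src-a-0001", "datasource": "Example Seed Catalog A",
     "name": "Brandywine", "crop": "tomato"},
    {"external_id": "src-a-0002", "datasource": "Example Seed Catalog A",
     "name": "Green Zebra", "crop": "tomato"},
]

FIELDS = ["external_id", "datasource", "name", "crop"]

def to_csv(rows):
    """Serialize the records to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

csv_text = to_csv(records)
print(csv_text)
```

Keeping the source's own id in `external_id` makes it possible to trace a merged record back to where it came from.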

What I've been doing as I've started to combine different food datasets: bring the datasets into Google Refine, run a script that looks at each record and finds the non-duplicates, then the duplicates. Review additions and duplicates. Merge them into one. I'm starting to figure out how to make MongoDB do this too.
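
The "find the non-duplicates, then the duplicates" step could look roughly like the sketch below. This is not the actual script, just an illustration that assumes records are matched on a normalized `name` field; real matching would probably also need the scientific name.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())

def split_duplicates(existing, incoming):
    """Split incoming records into additions and duplicates.

    Matching on the normalized 'name' field is an assumption for this sketch.
    A record repeated within `incoming` counts as a duplicate the second time.
    """
    seen = {normalize(r["name"]) for r in existing}
    additions, duplicates = [], []
    for record in incoming:
        key = normalize(record["name"])
        if key in seen:
            duplicates.append(record)
        else:
            additions.append(record)
            seen.add(key)
    return additions, duplicates

existing = [{"name": "Brandywine"}]
incoming = [{"name": "brandywine "}, {"name": "Green Zebra"}]
additions, duplicates = split_duplicates(existing, incoming)
```

Both lists would then be reviewed by hand before anything is merged.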

You can see how I've been structuring my fields in this file:

https://github.com/chachasikes/openfood_list/tree/master/data/complete

I'm recording the variety name & scientific name -- don't feel it is perfect yet - but having all of that information is super helpful. It's getting there. Taking examples from a number of different datasets like Foodista, GRIN & Wikipedia.

With respect to your startup... tweeted. :)

We are looking for people who are interested in having part of their food startup use and share open data about food.

I expect we will have a little badge that you can put on your project site that links back to the data & explains why open food data is important.

Germplasm (GRIN): I have started by getting all of the plant varieties from GRIN, which has thousands of plant varieties of every kind. Not all are growable or commercially available, but they are still varieties.

- Chach

--
--
Open Food - http://open-food.org/
You received this message because you are subscribed to the Google
Groups "Open Food" group.
For more options, visit this group at
http://groups.google.com/group/open-food?hl=en

--
Chacha

Michael Bernstein

Apr 15, 2012, 4:30:57 PM

to open...@googlegroups.com

On Sun, Apr 15, 2012 at 1:12 PM, Chacha Sikes <chach...@gmail.com> wrote:
> Hi Michael,

Hi, Chach!

> What I am noticing to be the easiest way to combine datasets is to do this.
>
> -- You can provide a CSV file that includes as much data as you are
> interested in sharing.
>
> -- An API is fine too - if you have one -- a document describing all of the
> fields in the API and search queries would be helpful. Haven't used Fusion
> tables yet - would be interested in trying it out.
>
> -- For each record, you would probably want to include
> * a unique id (which we would list as 'external_id') --
> * a field for datasource - with information about the source

Hmm. In my own app I will probably have a record for each instance of
a variety per-source, as well as a record that ties sources together
as a single variety along with some sort of reconciliation of
conflicts (a good example is days to maturity: one source might claim
60-75 days, another 65-78. In that case I intend to just average it,
rounding up).

So, fields in my merged data likely won't be attributable to a single source.
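
A minimal sketch of that days-to-maturity reconciliation, assuming each source reports a (low, high) range in days and that "average it" means averaging the low ends and the high ends separately (that reading is an assumption, and the function name is made up):

```python
import math

def merge_ranges(ranges):
    """Average the low and high ends of several (low, high) ranges, rounding up."""
    lows = [low for low, _ in ranges]
    highs = [high for _, high in ranges]
    return (math.ceil(sum(lows) / len(lows)),
            math.ceil(sum(highs) / len(highs)))

# The two example sources from above: 60-75 days and 65-78 days.
merged = merge_ranges([(60, 75), (65, 78)])
print(merged)  # (63, 77)
```

The merged range matches neither source exactly, which is why the resulting field can't be attributed to a single one.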

> What I've been doing as I've started to combine different foods datasets:

> [snip]

Thanks for explaining your current workflow.

> With respect to your startup... tweeted. :)

Thanks! Every vote counts, and I appreciate your support.

BTW, which 'x for y' analogy makes the most sense to you, where 'y' is
any of seeds, plants, gardens, gardeners, etc.: Yelp for y, Goodreads
for y, Ravelry for y, Octopart for y?

> We are looking for people who are interested in having part of their food
> startup use and share open data about food.
> I expect we will have a little badge that you can put on your project site
> that links back to the data & explains why open food data is important.

Sign me up. I am not yet sure exactly how much data I will share
(user-contributed data might not be as open as the stuff I start off
with by scraping), but at a minimum you can expect my
scraped-and-merged data.

> Germplasm -- GRIN - I have started with getting all of the plant varieties
> from GRIN -- which has thousands of everything plant variety- not all
> growable or commercially available - but still are varieties.

Awesome! I was expecting to have to do some of that work myself much
later on (to make it easier for more gardeners to do their own
breeding). Pardon me for pointing this out in case you're already
aware, but even when accession ids refer to the same variety or
cultivar, they should still be treated as unique, since they really
represent 'strains', often with minor variations in their traits and
genes.
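
One way to keep that distinction in the data is to group records by variety while still keying every record on its accession id. A rough sketch, in which the field names and the id values are placeholders rather than real GRIN identifiers:

```python
from collections import defaultdict

# Hypothetical accession records; ids and field names are placeholders.
accessions = [
    {"accession_id": "ACC-0001", "variety": "Brandywine"},
    {"accession_id": "ACC-0002", "variety": "Brandywine"},
    {"accession_id": "ACC-0003", "variety": "Green Zebra"},
]

def group_by_variety(rows):
    """Map variety name -> {accession_id: record}, never collapsing accessions."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row["variety"]][row["accession_id"]] = row
    return grouped

by_variety = group_by_variety(accessions)
```

Here "Brandywine" maps to two separate accessions, so any per-strain differences in traits stay attached to the right id.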

Michael Bernstein

Apr 29, 2012, 12:20:24 PM

to open...@googlegroups.com

On Sun, Apr 15, 2012 at 1:12 PM, Chacha Sikes <chach...@gmail.com> wrote:

> What I've been doing as I've started to combine different foods datasets:
> bring the datasets into Google Refine, run a script that looks at each
> record and finds the non-duplicates, then the duplicates. Review additions
> and duplicates. Merge them into one. I'm starting to figure out how to make
> Mongo DB do this too.

I've been cleaning up a lot of data using Refine (and in some cases
sending it back to the source to fill in some of the blanks revealed).
But can you explain a bit more (or provide an example of) how you're
merging data from multiple sources?
