I've been working (intermittently) on a gardening-focused startup for
a few months now, investigating user needs, conducting interviews, and
so on, to focus on problems that gardeners need solved. After all
that, the problem I've picked is discovering and sharing the fruit,
vegetable, and herb varieties that will do best in your garden.
To that end, I have started building some data sets of plant varieties
from a number of sources and will soon start merging them together. I am
de-lurking now for several reasons:
1. I wanted to give you a heads up that this data will likely increase
your coverage considerably. As an example, just one source yielded
over 500 varieties of tomato, compared to the 134 you have now.
2. I want to make sure the data I release is useful to you, but I am
not sure what kind of information about each variety you will want, or
whether (as a group) you prefer a data dump or an API.
3. I wanted to point out the tool/service I am using in case you
hadn't come across it yet: scrapinghub.com, which is based on the open
source scrapy, scrapely, and slybot libraries. Their UI is still a bit
confusing and buggy, but it has some nice features for point-and-click
selection of elements to scrape from a page, and the jobs you define
that way can be exported and run outside of their service with scrapy,
should you want or need to do so.
4. As a startup, Urbsly is competing in the 'Lean Startup Challenge
2012' (the winner gets some resources, mostly consulting and
mentorship). To win, I have to get the most tweets and retweets of a
custom hashtag: #leanvote2012-19, and I thought you might be
interested in helping out. Some more details at this blog post:
http://blog.urbsly.com/post/21011884622/what-is-urbsly
5. I wanted to ask for your experience using different tools for
merging data. Google Refine and Google Fusion Tables both look
interesting, but do any of the folks on this list have a
recommendation of one over the other (or for that matter, of any other
tools I should be considering)?
Thanks in advance,
--
Michael R. Bernstein
michaelbernstein.com
Founder - urbsly.com
--
Open Food - http://open-food.org/
You received this message because you are subscribed to the Google
Groups "Open Food" group.
For more options, visit this group at
http://groups.google.com/group/open-food?hl=en
Hi, Chach!
> What I am noticing to be the easiest way to combine datasets is to do this.
>
> -- You can provide a CSV file that includes as much data as you are
> interested in sharing.
>
> -- An API is fine too - if you have one -- a document describing all of the
> fields in the API and search queries would be helpful. Haven't used Fusion
> tables yet - would be interested in trying it out.
>
> -- For each record, you would probably want to include
> * a unique id (which we would list as 'external_id') --
> * a field for datasource - with information about the source
Hmm. In my own app I will probably have one record per source for each
variety, as well as a record that ties those sources together as a
single variety, along with some sort of reconciliation of conflicts.
A good example is days to maturity: one source might claim 60-75 days,
another 65-78. In that case I intend to just average them, rounding
up.
So, fields in my merged data likely won't be attributable to a single source.
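
To make the reconciliation concrete, here is a rough Python sketch of
what I have in mind. Everything in it is an assumption for
illustration only: the SourceRecord class, the field names other than
'external_id' and 'datasource', the sample ids, and the choice to
average each end of the range separately before rounding up.

```python
import math
from dataclasses import dataclass


@dataclass
class SourceRecord:
    """One variety as described by a single source (hypothetical shape)."""
    external_id: str   # the source's own id for the variety
    datasource: str    # which source the record came from
    variety: str
    days_min: int      # low end of days to maturity
    days_max: int      # high end of days to maturity


def merge_days_to_maturity(records):
    """Average each end of the range across sources, rounding up."""
    low = math.ceil(sum(r.days_min for r in records) / len(records))
    high = math.ceil(sum(r.days_max for r in records) / len(records))
    return low, high


# Made-up records matching the 60-75 and 65-78 example above.
records = [
    SourceRecord("a-101", "source-a", "Example Tomato", 60, 75),
    SourceRecord("b-7", "source-b", "Example Tomato", 65, 78),
]

print(merge_days_to_maturity(records))  # (63, 77)
```

The merged range (63-77) matches neither source, which is why the
merged record can't point at a single 'datasource'. The per-source
records would still keep theirs.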
> What I've been doing as I've started to combine different foods datasets:
> [snip]
Thanks for explaining your current workflow.
> With respect to your startup... tweeted. :)
Thanks! Every vote counts, and I appreciate your support.
BTW, which 'x for y' analogy makes the most sense to you, where 'y' is
any of seeds, plants, gardens, gardeners, etc.: Yelp for y, Goodreads
for y, Ravelry for y, Octopart for y?
> We are looking for people who are interested in having part of their food
> startup use and share open data about food.
> I expect we will have a little badge that you can put on your project site
> that links back to the data & explains why open food data is important.
Sign me up. I am not yet sure exactly how much data I will share
(user-contributed data might not be as open as the stuff I start off
with by scraping), but at a minimum you can expect my
scraped-and-merged data.
> Germplasm -- GRIN - I have started with getting all of the plant varieties
> from GRIN -- which has thousands of everything plant variety- not all
> growable or commercially available - but still are varieties.
Awesome! I was expecting to have to do some of that work myself much
later on (to make it easier for more gardeners to do their own
breeding). Pardon me for pointing this out in case you're already
aware, but even when accession ids refer to the same variety or
cultivar, they should still be treated as unique, since they really
represent 'strains', often with minor variations in their traits and
genes.
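
In data terms, that could look something like the sketch below: each
accession keeps its own id and record, and the variety name is only
used to group them, never to collapse them. The ids and names here are
invented for illustration and are not real GRIN identifiers.

```python
from collections import defaultdict

# Hypothetical accession records: (accession_id, variety_name).
accessions = [
    ("ACC-0001", "Example Tomato"),
    ("ACC-0002", "Example Tomato"),  # same variety, different strain
    ("ACC-0003", "Other Tomato"),
]

# Group accession ids under each variety name without merging them.
by_variety = defaultdict(list)
for accession_id, variety in accessions:
    by_variety[variety].append(accession_id)

for variety, ids in by_variety.items():
    print(variety, ids)
```

That way, differences in traits between two accessions of the same
variety stay attached to the individual accession.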