Re: [PLOTS] Machine Learning tools


Jeffrey Warren

Jul 25, 2014, 1:18:14 PM
to plot...@googlegroups.com, plots-spe...@googlegroups.com, Bryan, Daniela Antonova
Ugh, didn't manage to do it before Ben's email, then failed on addresses. Another try! I think plots-dev is a great place to have this conversation! Thanks!


On Fri, Jul 25, 2014 at 1:16 PM, Jeffrey Warren <je...@publiclab.org> wrote:
Hi, all - going to try to bump this to plots-dev again -- sounds like you've made a lot of progress already :-)

You can also hit the JSON/XML api on a per-tag basis: http://spectralworkbench.org/tag/cfl
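For example, pulling those records down programmatically is only a couple of lines of Python. The .json route below is an assumption on my part; the JSON/XML links on the tag page itself are the authoritative way in:

import json
import urllib.request

# Assumed endpoint; the exact JSON route may differ from this guess.
TAG_URL = "http://spectralworkbench.org/tag/cfl.json"

with urllib.request.urlopen(TAG_URL) as resp:
    spectra = json.load(resp)

print(len(spectra), "records for tag 'cfl'")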

The main reason to develop tools in JavaScript is so that other people can easily use them on the site without downloading new software or anything.  

Jeff


On Fri, Jul 25, 2014 at 1:16 PM, Ben Gamari <bga...@gmail.com> wrote:
Bryan <btbo...@gmail.com> writes:

> Btw I'm using "tarball" figuratively. Ben, I assumed you meant a literal
> tarball of data files. We can provide lumpsums of data in efficient ways
> without resorting to tarballs of data files which require preprocessing.
>
I would be interested to hear more about what you are thinking of
here. Please expand when you have a chance.

> So if you're arguing for lumpsum data, yeah, totally we should continue to
> support that (apparently it's already there).
>
This is excellent, although I would argue that this isn't bulk _data_,
it's bulk _metadata_. If I want the actual meat of the data (that is,
the images), I need to write a script to parse the metadata, figure
out which images the corpus contains, and crawl them, taking care to
rate-limit my requests, handle errors, etc. This isn't by any means
_difficult_ in any language worth its salt, but it is superfluous work
and poses another small barrier to entry for those seeking to work with
the corpus. After all, in the end all I need for my analysis is a
directory full of images and their associated metadata.
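To be concrete, the boilerplate I am describing looks roughly like this in Python. The metadata file and the field names (including the image URL) are placeholders, not the actual schema:

import json
import os
import time
import urllib.request

os.makedirs("images", exist_ok=True)

# Placeholder metadata dump; field names below are assumptions, not the real schema.
with open("spectra.json") as f:
    records = json.load(f)

for record in records:
    spectrum = record.get("spectrum", record)
    image_url = spectrum.get("image_url")  # hypothetical field name
    if not image_url:
        continue
    try:
        with urllib.request.urlopen(image_url) as resp:
            data = resp.read()
        with open(os.path.join("images", "%s.png" % spectrum["id"]), "wb") as out:
            out.write(data)
    except Exception as exc:
        print("failed to fetch %s: %s" % (image_url, exc))
    time.sleep(1)  # crude rate limiting, to stay polite to the server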

> But a zip of images? yuck.
>
I'm not sure I understand your objection to the suggestion of a
tarball. This is a common technique for distributing datasets, as
I pointed out earlier. The reason for this is simple:

If I'm trying to work with a data set, the interface to access it is the
last thing I want to worry about. The sooner I can get a directory of
files, the sooner I can move on to the actual problems I want to work
on. `curl $URL | tar -jx` is the quickest way I know of to make this
happen. Yes, the data then needs preprocessing but this would have been
necessary regardless and there is no shortage of tools for munging text,
indexing JSON, and the like.
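In Python, the same thing without shelling out would be something like the following; the dump URL is of course hypothetical, since no such dump exists yet:

import tarfile
import urllib.request

# Hypothetical dump location; substitute the real URL once a dump is published.
DUMP_URL = "https://example.org/spectralworkbench-dump.tar.bz2"

with urllib.request.urlopen(DUMP_URL) as resp:
    # Stream-extract the bzip2 tarball without writing it to disk first.
    with tarfile.open(fileobj=resp, mode="r|bz2") as archive:
        archive.extractall("spectralworkbench-dump")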

I would be happy to contribute a script to generate these dumps if
others agree that this is a useful exercise.

Cheers,

- Ben


Bryan

Jul 25, 2014, 1:29:14 PM
to Jeffrey Warren, plot...@googlegroups.com, plots-spe...@googlegroups.com, Daniela Antonova
Ben, I think the only reason anyone would need the original image is if they were working to create a new image processing algorithm for the website. If that's the case (and it is a valid case, in which case your point is well made), we can bug Jeff about how to get the files. I think that's going to be a very rare use case compared with people who want to analyze the data in its processed form.

Do you, personally, want data in the raw format for analyzing something? If so, we can extract that data off the disk somehow, and hopefully get a pull request for better analysis. Otherwise I'm not sure this component of the discussion is productive. It sounds like Daniela is primarily concerned with how to extract data in its processed form for doing ML work. Please correct me if I'm wrong.
-Bryan

Yagiz Sutcu

Jul 25, 2014, 1:37:12 PM
to plots-spe...@googlegroups.com, Jeffrey Warren, plot...@googlegroups.com
The simplest and most useful option for downloading a lot of spectral data, for me personally, would be something like this:

As Jeff mentioned below, http://spectralworkbench.org/tag/cfl gives you 471 spectra. I would love to have the option to choose some or all of them (check boxes next to them?) and download them in CSV format.

Does that make sense?

Yagiz

Jeffrey Warren

Jul 25, 2014, 1:38:29 PM
to Yagiz Sutcu, plots-spe...@googlegroups.com, plot...@googlegroups.com
Hi, Yagiz - you can already click "JSON" or "XML" to the right side of the screen near the top to download them all - is that good enough?
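And if CSV is more comfortable, turning that JSON into a zip of per-spectrum CSV files is a short script away. A rough Python sketch follows; the field names are assumptions about the JSON structure, so treat it as illustrative:

import csv
import io
import json
import zipfile

# Field names here are assumptions about the JSON structure; adjust as needed.
with open("cfl.json") as f:
    records = json.load(f)

with zipfile.ZipFile("cfl-spectra.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for record in records:
        spectrum = record.get("spectrum", record)
        lines = json.loads(spectrum["data"])["lines"]  # data is a JSON-encoded string
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["wavelength", "average"])
        for point in lines:
            writer.writerow([point["wavelength"], point["average"]])
        zf.writestr("spectrum-%s.csv" % spectrum["id"], buf.getvalue())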



Yagiz Sutcu

Jul 25, 2014, 1:43:17 PM
to Jeffrey Warren, plots-spe...@googlegroups.com, plot...@googlegroups.com
Personally, I prefer separate CSV files zipped together... Also, I prefer to be able to choose some of them...
But that's me :) I do not have any experience playing with the XML or JSON formats at all.

Ben Gamari

Jul 25, 2014, 2:03:18 PM
to Bryan, Jeffrey Warren, plot...@googlegroups.com, plots-spe...@googlegroups.com, Daniela Antonova
Bryan <btbo...@gmail.com> writes:

> Ben, I think the only reason anyone would need the original image is if
> they were working to create a new image processing algorithm for the
> website. If that's the case, and it is a valid case in which case your
> point is well made, we can bug Jeff about how to get the files. I think
> that's going to be a very rare use case compared with people who want to
> analyze the data in its processed form.
>
That is correct, working on the image processing algorithm is one reason
you would want the original images. Given the number of improvements that
could still be made to the image processing side of things [1], it seems
reasonable to accommodate this use-case.

This, however, is far from the only reason why one might be interested
in the original data. An electrical engineering researcher may want to
view the images to gather statistics on the noise characteristics of
consumer-grade webcams. An information retrieval researcher may be
interested in the data as a small corpus for training an image retrieval
algorithm. A physical scientist may be interested in performing their
own analysis on the corpus with no intent of upstreaming their code. All
of these are perfectly reasonable uses for what should be easily
available data. This is the great thing about making data open: people
will do things with it that you did not foresee.

Moreover, there is a matter of principle here: if the service is going
to claim to be open, the data needed to reproduce it should be readily
available. This is, in my opinion, reason enough to make the images
available, if not full anonymized database dumps. If I want to, I should
be able to spin up my own Spectral Workbench instance along with the
full data (privacy issues aside).

Finally, there is the related matter of scientific reproducibility. In
the physical sciences, raw data is the purest form of empirical
evidence. For this reason, it is important that this data be available
so others may perform their own independent analyses and validate
existing findings. While this doesn't always happen in the traditional
research setting (although things are improving), there is no reason why
we shouldn't strive towards this ideal by providing low-friction
mechanisms for the distribution of raw data and associated metadata
wherever possible.

> Do you, personally, want data in the raw format for analyzing something? If
> so, we can extract that data of the disk somehow, and hopefully a pull
> request for better analysis.
>
At the moment I'm not in a position to work on the image processing side
of things. That being said, I believe that it is that this data be made
available and I would be happy to put together a script to generate raw
dumps if someone would agree to configure a cron job to ensure it is
run.
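Roughly what I have in mind is a sketch like the following; the paths are placeholders for wherever the photos and a metadata export actually live on the server:

import datetime
import tarfile

# Placeholder paths; adjust to the real image directory and metadata export.
IMAGES_DIR = "public/system/photos"
METADATA_FILE = "tmp/spectra-metadata.json"

stamp = datetime.date.today().isoformat()
archive_name = "spectralworkbench-raw-%s.tar.bz2" % stamp

with tarfile.open(archive_name, "w:bz2") as archive:
    archive.add(IMAGES_DIR, arcname="photos")
    archive.add(METADATA_FILE, arcname="metadata.json")

print("wrote", archive_name)

Run from a nightly cron job, that would keep the archive reasonably fresh.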

Cheers,

- Ben


[1] http://publiclab.org/notes/PascalW/03-11-2014/gsoc-proposal-spectralworkbench

Bryan

Jul 25, 2014, 2:07:54 PM
to Ben Gamari, Jeffrey Warren, plot...@googlegroups.com, plots-spe...@googlegroups.com, Daniela Antonova
Part of the problem with the tarball idea is that the data set is live. Unlike, say, a dead data set that supports an arXiv publication, our dataset continues to grow. Conceivably it can also change over time, if a user chooses to remove his/her data. Open data should also, to the greatest extent we can manage, support control over one's submissions. Then again, the license we use sort of takes control away from the user, so maybe this isn't an issue.

It might be worth looking into a third-party service to host the raw data. I'm not sure if we're hosting the images on S3, but static hosting from S3 would be a plausible way to handle the situation. Alternatively, we could cross-post images to Flickr or another image-hosting site as a set. I believe many of the image-hosting sites allow bulk download as a zip file.
-Bryan

Ben Gamari

Jul 25, 2014, 2:11:19 PM
to Bryan, Jeffrey Warren, plot...@googlegroups.com, plots-spe...@googlegroups.com, Daniela Antonova
Ben Gamari <bga...@gmail.com> writes:

> Bryan <btbo...@gmail.com> writes:
>
Snipped. Correction below.

>> Do you, personally, want data in the raw format for analyzing something? If
>> so, we can extract that data of the disk somehow, and hopefully a pull
>> request for better analysis.
>>
> At the moment I'm not in a position to work on the image processing side
> of things. That being said, I believe that it is that this data be made
> available and I would be happy to put together a script to generate raw
> dumps if someone would agree to configure a cron job to ensure it is
> run.
>
"I believe that it is that..." should read "I believe that it is
important that..."

Cheers,

- ben

Ben Gamari

Jul 25, 2014, 2:20:47 PM
to Bryan, Jeffrey Warren, plot...@googlegroups.com, plots-spe...@googlegroups.com, Daniela Antonova
Bryan <btbo...@gmail.com> writes:

> Part of the problem with the tarball idea is that the data set is live.
> Unlike say, a dead data set that supports an arxiv publication, our dataset
> continues to grow.
>
I don't see this being a problem. All of the data sources I cited
earlier also grow in time. The solution is simply to periodically
(e.g. on a nightly basis) update the dump.

> Conceivably it can change over time, if a user chooses
> to remove his/her data. Open data should also, to the greatest extent for
> which we can manage, support control over one's submissions. Then again the
> license we use sort of takes control away from the user, maybe this isn't
> an issue.
>
While I see your point here and we should certainly support the user in
removing data from the database, it should be made clear when the user
uploads their data that the license they grant is irrevocable and copies
may "leak" out in the form of the raw data archive. It is very important
for reproducibility that the right to continue distributing removed data
is protected. Otherwise you open the door to people allowing their data
to be used only to support the theses they intend, which can quickly kill
scientific discourse.

> It might be worth looking into a third party service to host the raw data.
> I'm not sure if we're hosting the images on S3, but static hosting from S3
> would be a plausible way to handle the situation. Alternatively we could
> cross post images to flickr or one of these image sites as a set. I believe
> many of the image hosting sites allow bulk download as a zip file.
>
Hosting the images and a metadata dump on S3 would be a fantastic option.
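Publishing a nightly archive that way would be a one-liner with any S3 client; for example, with boto3 (the bucket name and key are hypothetical, to be chosen by whoever administers the account):

import boto3

# Hypothetical bucket and key; adjust to the real account layout.
s3 = boto3.client("s3")
s3.upload_file(
    "spectralworkbench-raw-2014-07-25.tar.bz2",
    "spectralworkbench-dumps",
    "raw/spectralworkbench-raw-2014-07-25.tar.bz2",
)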

Cheers,

- Ben

Jeffrey Warren

Jul 25, 2014, 2:29:45 PM
to Ben Gamari, Laura Dietz, Bryan, plot...@googlegroups.com, plots-spe...@googlegroups.com, Daniela Antonova
Hi, all - you can get the raw images with the following format:


http://spectralworkbench.org/system/photos/<spectrum_id>/original/<filename>

The filename is included in the JSON output, but we ought to make an alias route to make it even easier:

{"spectrum":{"photo_position":"false","version":null,"title":"d3-rw-s0","id":30872,"updated_at":"2014-07-16T21:41:57Z","photo_file_name":"capture.png","parent_id":null,"control_points":null,"baseline_content_type":null,"notes":" -- (Cloned calibration from <a href='/spectra/show/26204'>test2</a>)","sample_row":1,"photo_content_type":"image/png","client_code":"","baseline_file_name":null,"user_id":3081,"slice_data_url":null,"reversed":true,"data":"{\"lines\":[{\"average\":22,\"b\":32,\"g\":13,\"r\":23,\"wavelength\":914.047383561644},{\"average\":21,\"b\":31,\"g\":12,\"r\":22,\"wavelength\":912.936793578767},
How about: 

although sometimes they are jpgs. 
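So, given a parsed spectrum record, building the raw-image URL is just string formatting; this sketch reuses the id and photo_file_name fields shown in the JSON above:

# Values taken from the JSON example above.
spectrum = {"id": 30872, "photo_file_name": "capture.png"}

photo_url = "http://spectralworkbench.org/system/photos/%d/original/%s" % (
    spectrum["id"],
    spectrum["photo_file_name"],
)
# -> http://spectralworkbench.org/system/photos/30872/original/capture.png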

BTW, I'm in support of any and all APIs people want -- the main limitation on the API is developers actually building it out, or using it, or whatever... there are not a lot of developers on this codebase, unfortunately, although the GSoC program is changing that this summer since Sreyanth is working on some exciting stuff. But the API is definitely neglected, so I would be very happy to accept pull requests that improve or expand it!

The API (both server-side and client side) is the future of the platform - any time someone develops a new feature, a new way to clean, process, compare, analyze etc, my belief is that the best way to enable others to use it is to build the tools into the SW platform. The JavaScript API is a low-barrier way to do that, but if there are things that really can't be done there, let's do them server-side!

Jeff




Jeffrey Warren

Jul 25, 2014, 4:26:11 PM
to plot...@googlegroups.com, plots-spe...@googlegroups.com
Hi, folks - it's great to see so much interest in the topic -- please join up on the plots-dev mailing list (http://publiclab.org/lists) to stay involved, as we've moved the discussion over there and don't want it to get split in two!

Thanks!


On Fri, Jul 25, 2014 at 4:23 PM, thomas <ttaylor...@gmail.com> wrote:
Yagiz makes an important point regarding calibration.

By chance, just this morning in a coffee-house conversation I was reminded of a (naming no names) very well funded government program for automated recognition of very important things. It worked very well at the stage of training the classifier but performed miserably in the field; if I remember correctly, upon analysis it turned out that the classifier distinguished light and shadow very well, and there were many more shadows in the field than in the training set.

My take-home thoughts are: 1) be careful, because it's possible to accidentally build a classifier that says more about the calibration than about the environmental sample; and 2) to aid interpretation of spectra from Spectral Workbench, it would be useful if the spectra could be linked to comprehensive metadata; Yagiz's studies of olive oil and red wine adulteration are good examples of this.


On Friday, July 25, 2014 7:31:53 AM UTC-7, ygzstc wrote:
Hi Daniela,

I was thinking a classifier (SVM, for example) application would be nice for classifying different spectral data. Or maybe, before that, PCA or PLS regression would be nice to have as well. But the main problem is that data collected by different users have very different calibration, intensity, and related issues, which makes designing a classifier difficult, I guess.
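As a rough sketch of what I mean, in scikit-learn it would look something like this; the feature matrix of resampled intensities and the labels are placeholders, not real data:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: each row is a spectrum resampled onto a common wavelength
# grid, and each label is a class of interest (e.g. lamp type).
X = np.random.rand(100, 256)
y = np.random.randint(0, 3, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), PCA(n_components=20), SVC(kernel="rbf"))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))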

On the other hand, once you have access to the data from Public Lab's website, you can download it and play with it on your own PC/laptop as well.

Cheers,
Yagiz

On 7/25/2014 7:54 AM, Daniela Antonova wrote:
Hi all :)

I am looking to use my machine learning expertise to contribute some tools for automated analysis of data, probably as part of the workbench, and I was hoping to get some opinions on what might be useful.

In particular, how could such tools be integrated with the workbench so that people get the most out of them?

Looking forward to hearing your views!

Daniela

Bryan

Jul 26, 2014, 11:28:35 AM
to Jeffrey Warren, Ben Gamari, Laura Dietz, plot...@googlegroups.com, plots-spe...@googlegroups.com, Daniela Antonova
Alright, I tried to codify the separate needs of our conversation into tickets.

For Daniela, we want to support indexing and possibly bulk downloads with CSVs or JSON.
https://github.com/publiclab/spectral-workbench/issues/3

CSV downloads of spectra operate differently from the other formats for some reason, so I'm tempted to remove it until we have that problem sorted.
https://github.com/publiclab/spectral-workbench/issues/4

For Ben and science, bulk downloads of the raw images are desirable.
https://github.com/publiclab/spectral-workbench/issues/5

Did I miss anything?

Ben Gamari

Jul 26, 2014, 11:52:36 AM
to Bryan, Jeffrey Warren, Laura Dietz, plot...@googlegroups.com, plots-spe...@googlegroups.com, Daniela Antonova
Bryan <btbo...@gmail.com> writes:

> Alright I tried to codify the separate needs of our conversation into
> tickets.
>
Thanks Bryan!

> For Daniela, we want to support indexing and possibly bulk downloads with
> CSVs or JSON.
> https://github.com/publiclab/spectral-workbench/issues/3
>
> CSV downloads of spectra operate differently than the other formats for
> some reason, so I'm tempted to remove it until we have that problem sorted.
> https://github.com/publiclab/spectral-workbench/issues/4
>
> For Ben and science, bulk downloads of the raw images are desirable.
> https://github.com/publiclab/spectral-workbench/issues/5
>
> Did I miss anything?
>
Looks good to me!

Cheers,

- Ben

Daniela Antonova

Jul 28, 2014, 12:55:14 PM
to Ben Gamari, Bryan, Jeffrey Warren, Laura Dietz, plot...@googlegroups.com, plots-spe...@googlegroups.com
Hey :) 

Sorry I've been away :) I'll catch up on my messages and let you know how it goes :)

Many thanks for your help!