Spark and Dataverse (Big Data Containers, computation)


Philip Durbin

May 1, 2018, 8:56:45 PM
to dataverse...@googlegroups.com
This afternoon I stopped by Boston University for the final project demos of a group of students I've been mentoring[1]. A second group in the same class has been doing some interesting work with Dataverse, and they said they don't mind if I share the video they made:

https://www.youtube.com/watch?v=6G86wwgJHnc

The video above is called "Big Data Containers" and I'll attach a screenshot of the project goals, which involve computation with Apache Spark in a containerized environment that operates on data downloaded from Dataverse. The demo is the standard Spark "hello world" example, a word count, but it gives you an idea of what computation could look like.

I talked to one of the professors, who is happy to be a contact going forward. Just let me know if you're interested and I'll put you in touch.

Thanks,

Phil
[Attachment: Screen Shot 2018-05-01 at 8.42.18 PM.png]

Péter Király

May 2, 2018, 6:55:24 AM
to dataverse...@googlegroups.com
Hi Philip,

That's quite interesting; however, there are lots of details in the
project that are unclear to me (maybe it is the terminology: I am
familiar with Dataverse and Spark, but not with OpenShift and
Kubernetes).

I am working on a (meta)data quality measurement framework, partly based
on Spark. I can imagine that this Boston University project could be
used as a connector between Dataverse and Spark. Would it be possible
to use it without OpenShift, and with all the languages Spark supports?

In the https://github.com/dataverse-broker/sample-dataverse-app/blob/master/spark_wordcount.py
file (which is the Spark wordcount implementation), the details of the
connection are hidden behind the Spark API:

rdd = sc.textFile(filename)

If I am not mistaken, the "filename" parameter is actually a URL of
the Dataverse API for retrieving a file.
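
To check my understanding, here is a minimal sketch of what I imagine is
going on (the host and file id are invented, and since sc.textFile() does
not read http:// URLs out of the box, I fetch the bytes first and
parallelize them):

import requests
from pyspark import SparkContext

sc = SparkContext(appName="dataverse-wordcount")

# Dataverse serves raw file contents from its data access API at
# /api/access/datafile/{id}; the host and id 12345 are hypothetical.
url = "https://demo.dataverse.org/api/access/datafile/12345"
text = requests.get(url).text

# The standard Spark "hello world": split into words and count them.
counts = (sc.parallelize(text.splitlines())
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.take(10))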

Do you know how much of the Dataverse API is implemented? Can one
access metadata as well?
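
For concreteness, this is the kind of metadata access I have in mind,
using the regular Dataverse native API (the host and DOI below are
made up):

import requests

# Dataset metadata via the Dataverse native API; the DOI is hypothetical.
resp = requests.get(
    "https://demo.dataverse.org/api/datasets/:persistentId/",
    params={"persistentId": "doi:10.5072/FK2/EXAMPLE"})
dataset = resp.json()["data"]

# Each metadata block (citation, geospatial, ...) holds a list of fields.
for block in dataset["latestVersion"]["metadataBlocks"].values():
    for field in block["fields"]:
        print(field["typeName"], "->", field["value"])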

BTW: is there anybody else who works on data quality in a Dataverse context?

Best,
Péter



--
Péter Király
software developer
GWDG, Göttingen - Europeana - eXtensible Catalog - The Code4Lib Journal
http://linkedin.com/in/peterkiraly

Philip Durbin

May 2, 2018, 8:30:15 AM
to dataverse...@googlegroups.com
These are great questions, and I just copied them over to https://github.com/dataverse-broker/dataverse-broker/issues/46 in case it's easier for the students to reply there. I'm not sure how many of them are signed up on this mailing list, but they are welcome to subscribe and answer here as well. Thanks for your interest!

I was just having a conversation with a collaborator yesterday about data quality. I encouraged him to open a GitHub issue about what he's up to and maybe he'll see this and it'll be the nudge he needs. :)

You're welcome to open a new issue about data quality as well or start a new thread. It's an important topic. See also discussions about CoreTrustSeal on this list.

Thanks,

Phil

Pete Meyer

May 3, 2018, 9:58:34 AM
to Dataverse Users Community
Hi Péter,


On Wednesday, May 2, 2018 at 6:55:24 AM UTC-4, Péter Király wrote:
> BTW: is there anybody else who works on data quality in a Dataverse context?

I do some things that could fall under "data quality in Dataverse" - but "data quality" has a lot of internal complexity buried in it (inter-dataset metadata consistency, intrinsic re-usability/reproducibility, metadata-assisted re-usability/reproducibility, data integrity, etc.). Could you say a little more about what type of quality you're working on measuring?

Best,
Pete

Péter Király

May 3, 2018, 10:47:08 AM
to dataverse...@googlegroups.com
Hi Pete,

There are at least three directions:

- In the relevant literature (see the Metadata Assessment group on Zotero:
https://www.zotero.org/groups/metadata_assessment) there are different
metrics such as completeness, accessibility, multilinguality, etc.
- In the Linked Data context, Zaveri et al. recently did a survey and set
up a classification of 67 metrics
(https://content.iospress.com/articles/semantic-web/sw175). Some of
the metrics are not relevant outside of the Linked Data context, but
others still are.
- In the research data repository context, the FAIR Metrics Group
(http://www.fairmetrics.org/) formed last year to set up rules and metrics
to measure the "FAIRness" of research data. You can find their
micropublications at https://github.com/FAIRMetrics/Metrics.

Actually, this is my PhD research, and I have started the work
(http://pkiraly.github.io) on other library-related data, such as
digital library metadata (Europeana), MARC bibliographic records, etc.
Basically, what is (relatively) easy to check is the structure of the
metadata records: do they fit the rules? Are there outliers? Are
there missing parts? Are the descriptions unique? In some cases it is
also possible to check some semantics. Regarding the data files
themselves, there is a project at Bielefeld University called
Conquaire (http://conquaire.uni-bielefeld.de/) which tries to set up
rules for checking the integrity of files (triggered by commits to the
repository).
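
As a toy illustration of the structural checks I mean (the
required-field list and records below are invented; in my framework a
function like this would be mapped over an RDD of records in Spark):

REQUIRED = ["title", "creator", "date", "description"]

def completeness(record):
    """Fraction of required fields that are present and non-empty."""
    return sum(1 for f in REQUIRED if record.get(f)) / len(REQUIRED)

records = [
    {"title": "Dataset A", "creator": "Smith", "date": "2018"},
    {"title": "Dataset B", "creator": "Jones", "date": "2017",
     "description": "Survey data"},
]
for r in records:
    print(r["title"], completeness(r))  # e.g. 0.75 vs. 1.0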

Best,
Péter

Pete Meyer

May 3, 2018, 12:51:03 PM
to Dataverse Users Community
Hi Péter,

Thanks for the links - it looks like there's quite a bit of interesting work going on in this area. I haven't had time to investigate all of them, but I'm excited to see investigation of a probabilistic approach to evaluating metadata. Most of the "data quality" areas I've been involved with have been targeted at supporting computational pipelines for relatively large datasets within Dataverse. This has been focused on a specific scientific discipline, but most of the infrastructure (compute access; co-location of datasets with compute resources; pre/post-deposition datafile integrity checks; etc.) is intended to be generalizable to other areas. Some of the tools to automate parts of data curation might fall under data quality too - but those have been focused more on working efficiently than on formal investigation, and may not generalize as well.
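
As a sketch of what I mean by a datafile integrity check: conceptually
it is just recomputing a checksum and comparing it against the one
recorded at deposit time (the recorded value below is made up):

import hashlib

def md5sum(path):
    """Stream the file through MD5 so large datafiles need not fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

recorded = "9e107d9d372bb6826bd81d3542a419d6"  # hypothetical deposit-time MD5
if md5sum("datafile.csv") != recorded:
    raise ValueError("datafile.csv failed its integrity check")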

Best,
Pete