Re: Updated invitation: (Recurring) WikiLoop Coalition Conf Call @ Wed 2020-04-29 12:00 - 13:10 (PDT) (Zainan Victor Zhou)


Zainan Zhou (a.k.a Victor)

Apr 29, 2020, 7:55:49 PM
to Samuel Klein, Denny Vrandečić, Elan Hourticolon-Retzler, Aaron Halfaker, Forest Rui Jiang, Oelen, Allard, Zainan Victor Zhou, sebas...@allenai.org, WikiLoop Coalition, María Cruz, thadg...@gmail.com, Lydia Pintscher, ant...@delpeuch.eu
Thank you all for participating today. Here are the notes. Don't forget to sign up for the mailing list for future WikiLoop Coalition Conference Calls here.

2020-04-29 Time Series Data and OpenRefine

Participants: 

  • Lydia Pintscher

  • Antonin Delpeuch

  • SJ Klein

  • Thad Guidry

  • Sebastian Kohlmeier

  • María Cruz

  • Zainan Zhou

Summary

In this WL3C we discussed time series data and OpenRefine. Here is a high-level summary.


For Time Series Data, 

  • SJ presented COVID-19 data collaboration, including the Wikidata and COVID-19 Tracker efforts, as background.

  • Collaboration on time series data has recently become a pressing technical need, as COVID-19 case numbers get a lot of attention.

  • Challenges with time series data include various sources and different standards for the same number (e.g. COVID-19 “test-positive”).

  • Wikidata currently organizes time series data in a format similar to that of geographic data, but has no plan for a major upgrade of the technology supporting time series data for better collaboration, presentation and use in Wikipedia templates. The reason is a lack of hands.

  • There are many use cases, such as climate data, population data and stock price data, which can be used, presented and cited in Wikipedia and elsewhere.

  • A potential “Loop Offer” is for companies to provide internships to support projects like this, for example starting with better support for CSV on Commons.


For OpenRefine, 

  • Antonin presented how to use OpenRefine to conduct reconciliation.

  • We discussed how to encourage users to share their reconciliation choices or usage back with OpenRefine or the public for research.

Raw Notes


Intro on WikiLoop Coalition: invite people to describe an effort leveraging open knowledge, a challenge ("Loop Request"), and perhaps


  • "Loop Request" challenge example: COVID-19 four-dimension series data: breakdown by time and geo locations.

    • Source from: 

      • predominantly government sources.

      • 3rd party

      • COVID Tracking Project, e.g. usually provides extended references.

      • 3 major aggregatable sources

        • JHU dashboard

    • Challenges

      • various sources, different standards

      • "time series data"

        • the current way to organize the data in Wikidata is like geographic data.

  • Are we aware of any templates? No, and it could impose high computational complexity.

  • Is there a better way to capture a language-neutral dataset for a given schema?

  • One possible solution envisioned is to keep the data in a CSV file on Commons
      - discuss WD wikiproject?

    • Thoughts: a) allow a schematized csv as with [structured data on commons] 
      b) Try a schemaless approach  --  attach labels + relabel

    • Can’t make statements about csv files yet.  [LP: if we think this is a big deal, push related issues in Phabricator → general csv support]

  • Aside: the csv-on-commons project is somewhat leaderless atm.  [how widely used?]

    • G! For instance, to host campus interns // now can’t, can’t access code! Redesigning [intern projects] to work on OS / Wiki projects 8=) 


  • Examples:  Climate over time (Temp, Rainfall), Demographics (population)
    Passengers (airports, trains), Health (epidemics, lifespan) -- 


  • Norms/ templates for linking visualizations [files on Commons] to source data used to compile them?  (link to a URL that is a WDQS query?)

  • Final Q: for tabular data that doesn’t fit neatly into WD entities, with context per cell (refs and timestamps) -- where should it be stored/archived today?  Indexed so that we can all find it in the future [as tools develop]

    • TG: CKAN / data.world / datahub / ODI apply metadata to columns. 

    • Single-topic wikibase, which doesn’t worry about query performance

    • Vincent’s q: Can we still auto-generate tables from this? (bot query from wbase?)

  • Related challenge: from Oxford
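
A minimal sketch of the "context per cell" idea from the Final Q above, assuming a long-format CSV with one row per cell so that each value keeps its own reference and timestamp. The column names, region names, URLs and numbers are all placeholders made up for illustration; none of them come from the sources discussed on the call, and this is not an existing Commons or Wikidata format.

```python
import csv
import io
from collections import defaultdict

# Hypothetical long-format layout: one row per cell, so every value
# carries its own reference URL and retrieval timestamp.
# All values below are placeholders, not real figures.
RAW = """location,date,metric,value,reference_url,retrieved_at
Region A,2020-04-01,test_positive,10,https://example.org/source-1,2020-04-02T08:00:00Z
Region A,2020-04-02,test_positive,15,https://example.org/source-1,2020-04-03T08:00:00Z
Region B,2020-04-01,test_positive,7,https://example.org/source-2,2020-04-02T09:30:00Z
"""


def load_cells(text):
    """Parse the long-format CSV into dict rows with integer values."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["value"] = int(row["value"])
    return rows


def pivot(rows, metric):
    """Build {date: {location: value}} for one metric, e.g. to render a table."""
    table = defaultdict(dict)
    for row in rows:
        if row["metric"] == metric:
            table[row["date"]][row["location"]] = row["value"]
    return dict(table)


cells = load_cells(RAW)
wide = pivot(cells, "test_positive")
```

The long format keeps refs and timestamps next to each number, while a wide table for display can still be generated from it on demand, which is one possible answer to the auto-generated tables question above.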

Loop Offer: Intern support! (potentially)

  • "Loop Offer" for companies that are transitioning intern projects to open source: because of COVID-19, one possibility is to divert some of the interns to work on open-source projects.

  • OR is developing its GSOC projects right now, relevant.


Reconciliation + OpenRefine

  • When converting a Listeria-produced wikitable (w/ a wikilink or reference URL per cell) -- What’s a good UX for displaying this?  Currently in OL: reconciled names are linked to their WD item; a new column is generated for URL, year-opened + last-updated times.

  • Given a sparql query result, how do you generate implied references? 

  • How can we improve recon generally?  Usually you don’t have a recon ID for free in your table: you have strings.

  • How can we make [hand-curated] supervision of recon available to train future models / different work on the same sources / schemas? (VZ)

    • 1) give option to feed recon info back to the service. Sometimes this isn’t sufficient to be useful in the future -- you’re often using extra info from your query/context to make decisions. Also, wide range of expectations about quality of results, varying by individual fields

    • 2) where does this recon-mapping data get stored?

    • 3) is it useful to just share the logs of queries sent to service?

    • FB comparison: if you reconciled data + submitted results for upload, they would be included into the community’s [pipeline?]

  • How can we make reconciliation scoring more flexible? (AD)

    • Let people train models on their own data. Important to do locally. Right now the service just gives you a score, as its priority ranking of potential recon targets.

    • Option to name the recon algorithm.  Explore the features used as inputs to the algo. Make available {features, algos, + scores}

    • How to change the API to allow this?  [what do different WD tools do?]

→ check w/ MM and other active users //

  • How to encourage OR users to share their usage back w/ the service?  Clarify how it will be useful to others.  

    • Permalink to a recon-mapping file from [source] to [target ID]

    • Permalink to a named classifier/algo (parameters that work in context)

    • Is there a single-purpose service (WB?) to store these? A wikilambda fxn?

  • One question is storage.

    • VZ: we can find where to store them; if we agree on schemas.

    • Problem outside of G-sphere: long-term archiving?  [can do; compare CCrawl] 

  • Another question is capturing data model context 

    • (metadata per cell; tying into part I)
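
As a rough illustration of the "make available {features, algos, + scores}" idea above, here is a toy scorer that returns the algorithm name and the per-feature values alongside the final score. This is only a sketch under assumptions: the algorithm name, the two features, the weights and the candidate records are all hypothetical, and it is not how the OpenRefine reconciliation service computes its scores.

```python
from difflib import SequenceMatcher

ALGO_NAME = "toy-weighted-v0"  # hypothetical name, for illustration only
WEIGHTS = {"label_similarity": 0.8, "type_match": 0.2}  # assumed weights


def features(query, candidate):
    """Compute the named input features for one query/candidate pair."""
    similarity = SequenceMatcher(
        None, query["label"].lower(), candidate["label"].lower()
    ).ratio()
    type_match = 1.0 if query.get("type") == candidate.get("type") else 0.0
    return {"label_similarity": similarity, "type_match": type_match}


def score(query, candidates):
    """Rank candidates, exposing algo name, features and score for each."""
    results = []
    for candidate in candidates:
        feats = features(query, candidate)
        total = sum(WEIGHTS[name] * value for name, value in feats.items())
        results.append(
            {"id": candidate["id"], "algo": ALGO_NAME,
             "features": feats, "score": total}
        )
    return sorted(results, key=lambda r: r["score"], reverse=True)


# Placeholder query and candidates, not real Wikidata items.
query = {"label": "Example City", "type": "city"}
candidates = [
    {"id": "candidate-1", "label": "Example City", "type": "city"},
    {"id": "candidate-2", "label": "Example County", "type": "county"},
]
ranked = score(query, candidates)
```

Returning the features next to the score lets a user see why a candidate ranked where it did, and gives them something to re-weight or train on locally.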


Aside on this: constant challenge to identify the recon mappings; but we have less federated knowledge use, so it doesn’t arise that often.  Making these maps visible would increase use of federated KGs.  [GKG: using +++ sources; ]  Check w/ active KGs tackling this problem 
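
One way such a shareable recon-mapping file could look, sketched as JSON Lines with one decision per line. The field names, dataset name and target IDs are placeholders invented for this example; no schema was agreed on the call and this is not an existing OpenRefine export format.

```python
import json

# Hypothetical records: each line maps one source string to a target ID,
# with enough context (dataset, column, who decided) to reuse it later.
RECORDS = [
    {"source": "example-dataset", "column": "city",
     "source_value": "Example City", "target_id": "candidate-1",
     "decided_by": "human"},
    {"source": "example-dataset", "column": "city",
     "source_value": "Exmple City", "target_id": "candidate-1",
     "decided_by": "human"},
]


def to_jsonl(records):
    """Serialize mapping records as JSON Lines, one decision per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


def load_mapping(text):
    """Rebuild a lookup from (source, column, value) to target ID."""
    mapping = {}
    for line in text.splitlines():
        if line.strip():
            rec = json.loads(line)
            key = (rec["source"], rec["column"], rec["source_value"])
            mapping[key] = rec["target_id"]
    return mapping


shared_file = to_jsonl(RECORDS)
mapping = load_mapping(shared_file)
```

A file like this could sit behind a permalink and be reused by anyone reconciling the same source, including the misspelled variants a human already resolved.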



Appendix - Thad Guidry’s footnote on “Data Metadata”

I didn't get a chance to share some of this on the call, so here it is as a large footnote :-)


Here's my enriched dataset example of metadata usage on data.world (currently limited to only 1 single description field)


Supports Project Summary / Data dictionary / Project data set

Can click ( i ) buttons or click on tabs along the top or left side. Right side of Data dictionary displays nice dictionary summary of the dataset


One thing to note is that you can annotate external data on the data.world platform, the idea being that you make the external data more valuable for the entire community.


Example (dataset is actually hosted on Google Docs, but pulled in, annotated and fully connected and can be shared and visualized or repurposed on data.world by others):


https://data.world/thadguidry/historical-imports-and-exports-of-the-united-states/workspace/data-dictionary


I feel that data.world brings a decent platform that ticks a lot of the boxes of Open Data guidelines by


OKFN https://okfn.org/opendata/how-to-open-data/


ODI https://theodi.org/service/tools-resources/#1536319010899-45e0c54b-ac12


ODI itself has new and improved hosting for Open Data with https://octopub.io/


+met...@gmail.com I think OKFN and ODI should not be left out of the question for Open Data Publishing and Hosting. They both should be invited when it comes to the larger Loop question of Open Data Publishing & Sharing of datasets.


The final note is that of Validation of formats... CSVLint is one, but ODI is working towards a more shareable platform of validation plugins... https://octopub.io/getting-started#Data_Validation so they are working with these awesome Irish guys https://lintol.io/ who also just happen to contribute back to Frictionless Data and CKAN.






On Tue, Apr 28, 2020 at 12:30 PM Samuel Klein <met...@gmail.com> wrote:

This event has been changed.

(Recurring) WikiLoop Coalition Conf Call

When
Wed 2020-04-29 12:00 – 13:10 Pacific Time - Los Angeles
Where
MTV-2000-2-El Sereno (13) [GVC, No External Guests] (map)
Joining info
Join Hangouts Meet
meet.google.com/uam-sfma-zeh
Join by phone
+1 970-639-1957 (PIN: 859449)
More phone numbers
Calendar
Zainan Victor Zhou
Who
Zainan Victor Zhou - creator
Denny Vrandečić
Elan Hourticolon-Retzler
Aaron Halfaker
WikiLoop Coalition
Lydia Pintscher
Changed: Topics:
1. Review: Linking epidemic time-series on  Wikidata + Wikipedia (check w Tiago?)
2. New:  Antonin on an OpenRefine loop (details TBD; reconciliation + provenance)  


Zainan Victor Zhou

Apr 29, 2020, 9:23:00 PM
to Zainan Zhou (a.k.a Victor), Samuel Klein, Denny Vrandečić, Elan Hourticolon-Retzler, Aaron Halfaker, Forest Rui Jiang, Oelen, Allard, sebas...@allenai.org, WikiLoop Coalition, María Cruz, thadg...@gmail.com, Lydia Pintscher, ant...@delpeuch.eu
Oops, now fixed! Thank you for letting me know
