Vocabs

Amanpreet Singh

May 16, 2014, 2:18:38 PM

to Simone Fonda, Christian Morbidoni, annotation-tool-gsoc

As Christian said, I should use a JSON dump for the vocabs, but I think it's a bad idea in the case of Wikidata: its properties dump is 4.6 MB in size, see here. So we should rethink this.

And as I wanted, I think we should build a selector for this vocab, because I want to suggest properties (predicates) to the user as they type, in the same way we do for objects.

Opinion?

--

Amanpreet Singh,

IIT Roorkee

David Cuenca

May 16, 2014, 2:29:55 PM

to Amanpreet Singh, Simone Fonda, Christian Morbidoni, annotation-tool-gsoc

Amanpreet,

Maybe it is better to query Wikidata for the property. This thread on the Wikidata mailing list might give you some insights:
http://comments.gmane.org/gmane.org.wikimedia.wikidata/3684

However, I have no idea whether it is a requirement of Pundit to have all the vocabs available.
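A minimal sketch of what querying Wikidata for properties as the user types could look like. It assumes the public `wbsearchentities` API module with `type=property`; the function names and the User-Agent string are hypothetical, and nothing in this thread confirms that Pundit would call the API this way.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_property_search_url(prefix, language="en", limit=7):
    """Build a wbsearchentities URL that searches properties by label prefix."""
    params = {
        "action": "wbsearchentities",
        "search": prefix,
        "language": language,
        "type": "property",  # restrict the search to properties (P...), not items (Q...)
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)


def search_properties(prefix, language="en", limit=7):
    """Return (property id, label) pairs matching the typed prefix."""
    request = urllib.request.Request(
        build_property_search_url(prefix, language, limit),
        # Hypothetical client name; Wikimedia asks clients to identify themselves.
        headers={"User-Agent": "annotation-tool-gsoc-example/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [(hit["id"], hit.get("label", "")) for hit in data.get("search", [])]
```

Calling `search_properties("date of")` would issue one request per keystroke batch, so a real selector would probably debounce the input first.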

Cheers,

Micru

--
You received this message because you are subscribed to the Google Groups "Annotation tool GSoC" group.
To unsubscribe from this group and stop receiving emails from it, send an email to annotation-tool-...@googlegroups.com.
Visit this group at http://groups.google.com/group/annotation-tool-gsoc.
For more options, visit https://groups.google.com/d/optout.

--
Etiamsi omnes, ego non

Simone Fonda

May 17, 2014, 11:24:22 AM

to Amanpreet Singh, Christian Morbidoni, annotation-tool-gsoc

On Fri, May 16, 2014 at 8:18 PM, Amanpreet Singh
<amanpreet...@gmail.com> wrote:

> As said by Christian, I should use json dump for vocabs, but I think its bad
> idea in case of Wikidata, its properties dump is of size 4.6 mb, see here.
> So we should rethink about this.

I'm sure you watched and read the documentation that Christian and I
showed you, so you will have noticed that our predicate vocabulary is
much, much leaner than the massive JSON you showed us.

If you prettify the JSON, the first 750 rows are taken up by the
first predicate alone. Using this number as a rule of thumb, we end up
with an estimate of 286 predicates. Is this far from the number of
predicates we are talking about?

I'm pretty confident that if we are talking about something under
300-500 items, the browser would handle it just fine. The final word,
though, must come from an actual test rather than this rough
estimate.
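The rule-of-thumb estimate above can be checked against an actual count. A minimal sketch, assuming the dump is a single JSON object or array with one top-level entry per property (the real layout of the Wikidata dump is not shown in this thread, and both function names are hypothetical):

```python
import json


def estimate_entries(total_lines, lines_per_entry):
    """Rule of thumb: prettified line count divided by lines per entry."""
    return total_lines // lines_per_entry


def count_entries(dump_text):
    """Count the top-level entries of a JSON dump exactly."""
    data = json.loads(dump_text)
    if isinstance(data, (dict, list)):
        return len(data)
    raise ValueError("expected a JSON object or array at the top level")
```

The estimate is only as good as the assumption that every entry is as long as the first one; the exact count does not depend on it.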

So, what I would do is export that massive JSON to the Pundit
format and see where we end up.

> And as I wanted, I think we should a selector for this vocab, because I want
> to suggest user properties(predicates) just as he types in the same way we
> do for object.

This is a non-issue: Pundit already suggests the predicates to the
user this way, whether they are already loaded or get loaded through
AJAX.

Building a selector for predicates is certainly possible and it might
even be very easy to accomplish. One thing I haven't seen in that
JSON, though, is the crucial information about range and domain. Are
they present in the Wikidata predicates or are they missing completely?

Closely tied to these two pieces of information is the fact that
Pundit suggests only the predicates that fit any subject and/or object
already present in the statement you are composing. If you decide to
go for a selector, this will no longer happen and, basically, you are
forcing the user to type something each time they need to build a
statement.

Simone

Amanpreet Singh

May 19, 2014, 10:54:50 AM

to Simone Fonda, Christian Morbidoni, annotation-tool-gsoc

As per my conversation on IRC, Wikidata doesn't have domain and range for properties.
IRC log.

About the properties: the estimate is quite far off, as there are 1200 properties in existence at Wikidata. So should we still export them to the Pundit format, or should we go the selector route instead?

Simone Fonda

May 19, 2014, 11:04:22 AM

to Amanpreet Singh, Christian Morbidoni, annotation-tool-gsoc

On Mon, May 19, 2014 at 4:54 PM, Amanpreet Singh
<amanpreet...@gmail.com> wrote:

> As per my conversation at IRC, Wikidata doesn't have domain and range for
> properties.
> IRC log.
>
> About the properties, the estimation is quite wrong as we have 1200
> properties at Wikidata in existence, so should we still export them to
> pundit format, or we should take other way round of selector?

If it doesn't take too much time, I would go as suggested: export them
in the Pundit predicates format and see if the world explodes right
after.

Simone

Christian Morbidoni

May 20, 2014, 3:05:42 AM

to Amanpreet Singh, annotation-tool-gsoc, Simone Fonda

I see they do not have domains and ranges, and I understand why from their perspective.
I think for now you can use empty domains and ranges for all the predicates.

Amanpreet Singh

May 20, 2014, 12:10:08 PM

to Christian Morbidoni, annotation-tool-gsoc, Simone Fonda

I created the JSON file in the required format using a Python script, then loaded it in Pundit and voila, it worked: we have 1031 predicates in our bucket. It hung Pundit for a second, but then it was OK. Here is the JSON file.

Hopefully, this is what we need. Maybe we will have to reduce the number of predicates.
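A minimal sketch of such a conversion, not the actual script. The input shape (a list of dicts with `id`, `label` and `description`) and the output field names `type`, `uri`, `label` and `description` are assumptions for illustration; only the empty `domain` and `range` arrays come from this thread.

```python
import json

# Wikidata entity URI prefix; property P569 becomes .../entity/P569.
WIKIDATA_ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def to_pundit_predicate(prop):
    """Map one Wikidata property record to a Pundit-style predicate dict."""
    return {
        "type": ["predicate"],        # hypothetical field name
        "uri": WIKIDATA_ENTITY_PREFIX + prop["id"],
        "label": prop["label"],
        "description": prop.get("description", ""),
        "domain": [],                 # empty: any subject is allowed
        "range": [],                  # empty: any object is allowed
    }


def convert(properties):
    """Convert all properties, skipping records that have no label."""
    return [to_pundit_predicate(p) for p in properties if p.get("label")]


def dump_predicates(properties):
    """Serialize the converted predicates as pretty-printed JSON."""
    return json.dumps(convert(properties), indent=2)
```

Skipping unlabeled records is one possible reason for ending up with fewer predicates than there are properties, though the thread does not say why the two numbers differ.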

David Cuenca

May 20, 2014, 12:22:05 PM

to Amanpreet Singh, Christian Morbidoni, annotation-tool-gsoc, Simone Fonda

I'm just looking at the file and I notice that there is no info about the type of property (some of them take a string, others a number, etc.).

How will Pundit know which value formatter/suggester to use in each case? Will it be queried from WD?

About the number of predicates: yes, you could eliminate the ones that refer to identifiers, like IMDb, VIAF, etc. I don't think there is any use case for annotating anything with them.

Btw Amanpreet, there is a GSoC office hour now, I hope you are there :)

Cheers,

Micru

Simone Fonda

May 20, 2014, 1:31:46 PM

to David Cuenca, Amanpreet Singh, Christian Morbidoni, annotation-tool-gsoc

On Tue, May 20, 2014 at 6:22 PM, David Cuenca <dac...@gmail.com> wrote:

> I'm just seeing the file and I notice that there is no info about the type
> of property (some of them take a string, other a number, etc).
> How will the Pundit know which value formatter/suggester use in each case?
> Will it be queried from WD?

Pundit stores this information as domain (what can be used as subject)
and range (what can be used as object).

Both are arrays, so a predicate can go from a text fragment OR image
fragment to person OR place, for example.

Range and domain values must be URIs (usually RDF types/classes), like
http://xmlns.com/foaf/0.1/Person or
http://purl.org/pundit/ont/ao#fragment-text. If empty, Pundit will
allow the user to use any kind of item.
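The rule described above can be sketched as follows. This is an illustration of the described behaviour, not Pundit's actual code; the function names, the predicate dict layout and the http://example.org/Place type are hypothetical.

```python
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
TEXT_FRAGMENT = "http://purl.org/pundit/ont/ao#fragment-text"
PLACE = "http://example.org/Place"  # hypothetical class URI, for illustration only


def allows(constraint, item_types):
    """True if an item with these types fits a domain or range constraint."""
    if not constraint:       # empty domain/range: any kind of item is allowed
        return True
    if item_types is None:   # that slot of the statement is not filled in yet
        return True
    return any(t in constraint for t in item_types)


def suggest_predicates(predicates, subject_types=None, object_types=None):
    """Keep only predicates compatible with the subject and object chosen so far."""
    return [
        p for p in predicates
        if allows(p["domain"], subject_types) and allows(p["range"], object_types)
    ]
```

With every domain and range left empty, as proposed for the Wikidata predicates, `suggest_predicates` returns the whole list regardless of what the user has selected.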

At the moment there is no special formatter/suggester for "simple"
types like number/string, but I guess they can be implemented if
needed.

> About the number of predicates, yes, you could eliminate the ones that refer
> to identifiers, like IMDB, viaf, etc. I don't think there is any use case to
> annotate anything using them.

Reducing them sounds like a very good idea!

Simone

David Cuenca

May 21, 2014, 6:13:57 AM

to Simone Fonda, Amanpreet Singh, Christian Morbidoni, annotation-tool-gsoc

On Tue, May 20, 2014 at 7:31 PM, Simone Fonda <fo...@netseven.it> wrote:

> Pundit stores this information as domain (what can be used as subject)
> and range (what can be used as object).

In that case, wouldn't it be convenient (even if not totally accurate) to use "Wikidata item" as the domain (because the subject will always be a Wikidata item) and, as the range, the datatype that the predicate/property can take?

> At the moment there is no special formatter/suggester for "simple"
> types like number/string, but i guess they can be implemented, if
> needed.

It would be great if the existing Wikidata formatting widgets could be reused, but that is probably better left for the end.

>> About the number of predicates, yes, you could eliminate the ones that refer
>> to identifiers, like IMDB, viaf, etc. I don't think there is any use case to
>> annotate anything using them.
>
> Reducing them sounds like a very good idea!

Filtering out the properties that refer to identifiers is not going to be easy (other than removing all those that have "ID" or "identifier" in the name, which does not catch all of them). The final solution for this will come when properties allow statements: then we'll have a proper classification and it will be easy to sort them out without having to do it manually. This is not enabled yet, so for now either trim manually or just select some test properties.
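The name-based heuristic can be sketched like this. It is only the rough filter described above, with hypothetical function names and a hypothetical `label` field, and as noted it misses identifier properties whose names contain neither word.

```python
import re

# Match "ID" or "identifier" as whole words, ignoring case.
IDENTIFIER_PATTERN = re.compile(r"\b(ID|identifier)\b", re.IGNORECASE)


def looks_like_identifier(label):
    """Heuristic: the property label mentions 'ID' or 'identifier'."""
    return bool(IDENTIFIER_PATTERN.search(label))


def drop_identifier_properties(predicates):
    """Return the predicates whose label does not look like an identifier."""
    return [p for p in predicates if not looks_like_identifier(p["label"])]
```

A label such as "ISBN-13" slips through, which is why a manual pass or a small hand-picked test set is still needed for now.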

Micru

Simone Fonda

May 21, 2014, 6:47:02 AM

to David Cuenca, Amanpreet Singh, Christian Morbidoni, annotation-tool-gsoc

On Wed, May 21, 2014 at 12:13 PM, David Cuenca <dac...@gmail.com> wrote:

>> Pundit stores this information as domain (what can be used as subject)
>> and range (what can be used as object).
>
> In that case, wouldn't be convenient (even if not totally acurate) to use as
> domain "wikidata item" (because the subject will be always a Wikidata item)
> and as range the datatype than the predicate/property can take?

Yep, absolutely. To be even more precise, it depends on the shape of
the statements you want to achieve: you can have it either in the
domain or in the range (think of inverses: city > has mayor > someone
/ someone > is mayor of > city).

The very most bestest solution (R), if Wikidata has class information
like places, persons, etc., would be to stuff this into domains and
ranges to get a better annotation environment.

Simone

David Cuenca

May 21, 2014, 7:26:34 AM

to Simone Fonda, Amanpreet Singh, Christian Morbidoni, annotation-tool-gsoc

On Wed, May 21, 2014 at 12:47 PM, Simone Fonda <fo...@netseven.it> wrote:

> Yep, absolutely. To be even more precise, it depends on the shape of
> the statements you want to achieve, you can either have it in the
> domain or in the range (think at inverses: city > has major > someone
> \\\ someone > is major of > city).
>
> The very most bestest solution (R), if wikidata has classes
> informations like places, persons etc.. would be to stuff this into
> domains and ranges to get a better annotation environment.

In Wikidata there are no hard-coded constraints like this (and I cannot recommend them either), because there are always edge cases that limit usability. What is being implemented is a property suggester based on which properties the item already has, and maybe in the future there will be a value suggester too. This is being integrated in the backend, so if the standard Wikidata entity selector is used, that will be an added benefit in the future.

For instance, if an item has a property "sex or gender", "viaf ID" and "place of birth", it will suggest "date of birth", "date of death", etc. Test:

http://suggester.wmflabs.org/wiki/Special:PropertySuggester
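To illustrate the idea of suggesting properties from the ones an item already has, here is a toy co-occurrence sketch. It is not the algorithm of the Wikidata property suggester, which this thread does not describe; the function name and the scoring rule are assumptions.

```python
from collections import Counter


def suggest_properties(item_props, corpus, top_n=3):
    """Suggest properties that co-occur with the item's existing ones.

    `corpus` is an iterable of property sets, one per known item. Each missing
    property is scored by how many of the item's properties the other item shares.
    """
    item_props = set(item_props)
    scores = Counter()
    for other in corpus:
        other = set(other)
        overlap = len(item_props & other)
        if overlap == 0:
            continue  # unrelated item, contributes nothing
        for prop in other - item_props:
            scores[prop] += overlap
    return [prop for prop, _ in scores.most_common(top_n)]
```

For an item with "sex or gender", "VIAF ID" and "place of birth", a corpus of person items would push "date of birth" and "date of death" to the top, while properties seen only on unrelated items are never suggested.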

The WD entity selector also handles several languages, property aliases and property descriptions. I would recommend considering it.

Cheers,

Micru

Bob Morris

May 21, 2014, 12:16:58 PM

to fo...@netseven.it, David Cuenca, Amanpreet Singh, Christian Morbidoni, annotation-tool-gsoc

I've been following this discussion under the assumption that it is
about something that, sooner or later, comes down to a discussion of
rdfs:domain and rdfs:range. If that's wrong, ignore my post.

IMO, the problem with prematurely---or at all---assigning rdfs:domain
is that the world, especially the Open World, doesn't explode right
away. Thanks to the OWA, it explodes only when someone wants to
extend the application of the predicate in a way that the originator
didn't foresee, but is perhaps anyway useful. Worst of all, my
experience is that often when humans with a little mathematical
training see an assertion <Predicate> rdfs:domain <Class> they often
imagine that "domain" is used as in "domain of a function." They
become confused when one or another RDFS reasoning application doesn't
ever tell them that some usage is invalid, but only tells them that,
depending on the data, a reasonable query may return "false"---if they
ask the right question.

I don't know enough about Pundit to know what consequences are risked
by having empty domains or ranges(*). The consensus of the thread
seems to be that the consequences are few. But naively, it wouldn't
surprise me if doing so trips on use (or misuse) of one or another
existing pundit-based tools thereby making more, not less, work for
Amanpreet. So it might be a good idea for him to formulate and
routinely test some competency questions, probably in SPARQL, that
exhibit some minimal success with realistic data having empty domains,
and exhibit some failure with other (realistic or synthetic) data. At
the other end of the summer, this might lead to useful reports of the
limitations of whatever he produces. Or, it might indeed suggest that
it doesn't matter.

Bob Morris
(*) range typing has somewhat different pitfalls, especially when data
are moved around the internet serialized as text, which sadly
continues to be thought necessary. Even that metadata luminary Dublin
Core began only(?) as late as 2012 to try to remedy the ambiguity
whether some dcmi predicates were to take literal values or object
values [1]. Happily, they are now explicitly concerned with exploring
"[...] How can validatable constraints be defined for RDF data ..."
[2].

[1] http://wiki.dublincore.org/index.php/User_Guide/Publishing_Metadata
[2] http://wiki.dublincore.org/index.php/Main_Page#Purculating_activities


--

Robert A. Morris

Emeritus Professor of Computer Science
UMASS-Boston
100 Morrissey Blvd
Boston, MA 02125-3390

Filtered Push Project
Harvard University Herbaria
Harvard University

email: morri...@gmail.com
web: http://efg.cs.umb.edu/
web: http://wiki.filteredpush.org
http://www.cs.umb.edu/~ram
===
The content of this communication is made entirely on my
own behalf and in no way should be deemed to express
official positions of The University of Massachusetts at Boston or
Harvard University.

Christian Morbidoni

May 21, 2014, 12:56:46 PM

to Bob Morris, Simone Fonda, David Cuenca, Amanpreet Singh, annotation-tool-gsoc

Dear Bob,

In Pundit we are using rdfs:domain and rdfs:range as "constraints". This means that the user will be able to use only instances of the domain and range classes, if specified. I know this is "wrong" from an RDF Schema perspective, where their purpose is rather to drive inference. We do not do inference at all.

So the drawback of having no domains and ranges for a property would simply be that the user is free to create a nonsense triple (e.g. a table buys a person).

I'm not sure if this is acceptable in the Wikidata use case. What do the Wikidata guys think?

best

Christian

David Cuenca

May 21, 2014, 6:38:19 PM

to Christian Morbidoni, Bob Morris, Simone Fonda, Amanpreet Singh, annotation-tool-gsoc

Dear Bob and Christian,

I recommend reading these blog posts about some of the design decisions taken in Wikidata:
http://blog.wikimedia.de/2013/02/22/restricting-the-world/
http://blog.wikimedia.de/2013/06/04/on-truths-and-lies/
https://blog.wikimedia.de/2013/09/12/a-categorical-imperative/

To this I must add that Wikidata is not a stone where a wrong inscription, once carved, stays wrong forever. Think of it more as an evolving organism in which community and statements coexist and improve iteratively. If some statement is wrong or absurd, it doesn't matter much: someone will notice and correct it in due time (you can also correct one yourself if you spot it, just click "edit"). We also have tools to detect outliers: constraints that check that certain properties ("predicates" in Pundit jargon) have a value within the defined ranges. For instance, check the constraints for "date of birth":
https://www.wikidata.org/wiki/Property_talk:P569

These are not hard constraints: I can add wrong information to any item, or use a property in an item where it doesn't belong, but it will raise some alarms and someone will fix it, sometimes automatically, sometimes manually. Furthermore, as a community we know each other's history: if an editor is new there are people checking for vandalism, and if they are regular editors, they have a history and a reputation. Mistakes can happen, sure, but that is part of Wikidata's nature.

So, sure, let annotators log in and create their statements with complete freedom (except for datatype or format, which might trigger some nasty error). If they get the properties wrong, we will teach them how to use them better and they will do better next time. No big deal.