Regarding #1: convert bot to Wikidata

chinmay naik

Apr 12, 2013, 12:06:58 PM

to crow...@googlegroups.com

Hi,

For the past week, I have been trying to understand what needs to be done. I was waiting for the English Wikipedia to go live with Wikidata (it was scheduled for Apr 10 but was postponed).

I created some Wikidata items, but they were soon deleted. So I have instead done some experimental test writes through the Wikidata API to already existing items, e.g. the reelin item http://www.wikidata.org/wiki/Q414043 (I shortened its description). I was also able to add properties (statements) to Wikidata items, and I have prepared a very rough bot script to modify/add properties. As of now, the current inclusion syntax (through which Wikidata properties are linked to wiki articles) is very rigid: http://meta.wikimedia.org/wiki/Wikidata/Notes/Inclusion_syntax_v0.3
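As a rough illustration of the kind of read and test-write calls described above, here is a minimal Python sketch that only builds the requests and does not send them. It assumes the Wikibase API modules wbgetentities, wbsetdescription and wbcreateclaim with the parameters shown; the function names, the "P_ENTREZ" property placeholder and the token value are hypothetical, and login and edit-token retrieval are left out.

```python
import json
from urllib.parse import urlencode

API_URL = "https://www.wikidata.org/w/api.php"


def read_item_url(item_id, languages=("en",)):
    """Build a GET URL that fetches one item as JSON via wbgetentities."""
    params = {
        "action": "wbgetentities",
        "ids": item_id,
        "languages": "|".join(languages),
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def set_description_params(item_id, language, text, token):
    """Build the POST body for changing an item's description.

    The token must come from a logged-in session (not shown here).
    """
    return {
        "action": "wbsetdescription",
        "id": item_id,
        "language": language,
        "value": text,
        "token": token,
        "format": "json",
    }


def create_claim_params(item_id, property_id, value, token):
    """Build the POST body for adding a string-valued statement.

    property_id is a placeholder here; a real bot would use the ID of
    an existing Wikidata property.
    """
    return {
        "action": "wbcreateclaim",
        "entity": item_id,
        "property": property_id,
        "snaktype": "value",
        "value": json.dumps(value),  # the API expects a JSON-encoded value
        "token": token,
        "format": "json",
    }


if __name__ == "__main__":
    print(read_item_url("Q414043"))
    print(create_claim_params("Q414043", "P_ENTREZ", "5649", "TOKEN"))
```

Keeping request construction separate from sending makes it possible to check a bot's edits before anything is written to a live item.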

I have a few concerns and questions regarding the project. As of now, only 400 properties are supported for a Wikidata item, and these properties are predefined. I am assuming the actual gene data will be saved in Wikidata properties. Would that be the correct way? Would the existing properties be sufficient?

Interlanguage links can be implemented. I haven't been able to create interwiki links (e.g. a list of articles which access a data item). Can you provide some info on this?

Chinmay

Benjamin Good

Apr 12, 2013, 12:52:40 PM

to crow...@googlegroups.com

We will certainly create all the wikidata properties needed to capture the gene information. We should soon start to create the properties listed as proposals here:

http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force/Properties

We aren't too concerned with managing the interlanguage links. For the most part that is happening automatically now anyway.

-Ben

--
--
You received this message because you are subscribed to the Google
Groups "Crowdsourcing Biology" group.
To post to this group, send email to crow...@googlegroups.com
To unsubscribe from this group, send email to
crowdbio+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/crowdbio?hl=en?hl=en

2012 GSoC Organization page: http://www.google-melange.com/gsoc/org/google/gsoc2012/scripps_crowdbio
GSoC Ideas page: http://sulab.org/gsoc/

chinmay naik

Apr 12, 2013, 1:31:35 PM

to crow...@googlegroups.com

Thanks Ben for the info. I was asking about interwiki links. For example, if we have a gene Wikidata item, should it also have a list of links to the Wikipedia articles that access the item?

Also, from the ideas page, Idea 3 "BioGPS new query interface" seems interesting as well. I am now familiar with the MyGene.info API. For #3, is proficiency in JavaScript necessary?

Thanks in advance

Chinmay

Benjamin Good

Apr 12, 2013, 1:48:44 PM

to crow...@googlegroups.com

You can already access interwiki links programmatically and wikidata is really about having precisely defined properties connecting things rather than generic hyperlinks. So no on the interwiki link idea. (Though we do play with those often and it was not a bad idea.)

Yes, as I understand it, the new query interface (idea 3) would definitely entail some JavaScript skills.

chinmay naik

Apr 12, 2013, 2:02:51 PM

to crow...@googlegroups.com

Thanks Ben. I will get back to you if I have any more questions.

chinmay naik

Apr 15, 2013, 2:57:15 PM

to crow...@googlegroups.com

Hi Ben,

I have now created some Wikidata items (e.g. items on Cristiano Ronaldo and Lionel Messi), updated statements, obtained interlanguage links, etc. through the bot. Modifications to existing items persist, but new items get deleted. I have accessed Wikidata statements by logging into Italian Wikipedia to get a better understanding.

A simple way would be to modify the interlanguage link, link it to your user page, and access the item.

As of now, we can only access the Wikidata item from its default linked Wikipedia article (on the 11 live Wikipedias), though the current inclusion syntax says otherwise. From the Torino page, http://it.wikipedia.org/wiki/Torino, I tried to edit the article and access a different Wikidata item, but without success.

I also had a look at the proposed properties to capture gene information. As stated, many identifier properties are associated with both humans and mice. If such IDs are specified as qualifiers, then even linkage properties (linking genes to genes, genes to biological concepts) will need qualifiers as well. To have a clean structured model, would it be better to have separate items to represent human and mouse genes? I think it is also critical to have an efficient set of properties to capture gene information. I am hoping that the English Wikipedia will go live soon.

Chinmay

Benjamin Good

Apr 15, 2013, 3:20:05 PM

to crow...@googlegroups.com, Andrew Su

Nice progress so far. (Don't forget to write your proposal...) If you haven't done so already, you should register for the wikidata mailing list

https://lists.wikimedia.org/mailman/listinfo/wikidata-l

That will be your best source of information for updates to wikidata. It looks like we have another week or two to wait before we see English Wikipedia activated.

Regarding the data model and human/mouse/etc. Our primary focus is and will remain on human genes. We will have one item identifier for each human gene e.g. Q414043 for the human gene Reelin. Within Wikipedia/wikidata this is the important object. For interactions, we would have human genes interacting with other human genes (Q... interacts with Q...). We do not have plans for adding mouse genes directly into wikidata at this time.

For identifiers in databases outside of wikidata like NCBI Gene, we will again be focused on human. So Q414043 hasEntrezGeneId 5649 works just fine.

The question is how we connect the human wikidata item Q414043 to the mouse identifier in external databases, 19699. I don't think this is a good case for qualifiers.

I think (and this is open for debate) now would be a great time to introduce a new relationship 'hasOrthologousGeneId' that could be used to link these human gene records to related genes in other species. While the genes in mouse are obviously highly related, they are not the same things as human genes, and there are many genes in other organisms (monkeys, cows, drosophila, etc.) that also have strong orthology relationships that we should capture. This was never accomplished before in the gene wiki, but I think it's within reach now.

(cc'ing gene wiki leader for comment..)

-Ben

chinmay naik

Apr 15, 2013, 4:08:30 PM

to crow...@googlegroups.com

Thanks Ben.

I have subscribed to the Wikidata mailing list. I still have to start drafting my proposal. Since this is my first time, could you kindly provide some good example proposals or application templates?

The first part, where we will be capturing gene info in Wikidata, is crystal clear to me. If the English Wikipedia is activated soon, I will submit concrete coding examples for it. I believe deploying a stable Wikidata integration for Gene Wiki articles could be faster than expected. I still have to understand "add visualisation elements to gene wiki plus" and decide on the approach. Adding 'hasOrthologousId' seems better than using qualifiers. This way we can capture relationships with other species without corrupting the structure of the human gene Wikidata item (which is our first priority).

Have issues 1 and 2 for the existing bot been resolved? I have submitted pull requests for them. Recently, Chunlei Wu suggested a coding idea, "make language specific client for mygeneinfo API". I do not have any proficiency in the suggested languages. However, to demonstrate my coding skills, I will be submitting coding examples for the above project. Kindly suggest the way ahead.

Chinmay

Benjamin Good

Apr 15, 2013, 4:49:48 PM

to crow...@googlegroups.com, Max Nanis

Hi Chinmay (and everyone else)

First, we now have an application template up on our gsoc organization site. See

http://www.google-melange.com/gsoc/org/google/gsoc2013/scripps_crowdbio

That should give you a basic idea of what we are looking for.

You should also seek out general GSoC information for students. A good place to start is

http://www.google-melange.com/document/show/gsoc_program/google/gsoc2013/help_page

Note that you can apply to multiple mentoring organizations. All applications are very competitive.

Regarding the wikidata project specifically. Don't worry too much about the genewikiplus integration for right now (it's down at the moment anyway). The emphasis is on developing highly reliable, error-tolerant code to manage wikidata processes and to get things displaying on Wikipedia. (Feel free to work with examples on other language Wikipedias in the meantime.)

I'm waiting for Max to decide on pulling changes regarding those issues into the gene wiki bot code. He is the current president of that code base.

You don't need to submit a client as Chunlei suggested. The point is for you to demonstrate, one way or another, that you are capable of developing useful code. Gene wiki patches or prototype wikidata bots or even prior successful projects would be adequate for that demonstration. The MyGene.info clients were just one example for people looking for something to do.

Way forward:

1) Understand the whole GSoC process.

2) Look at our application template, check out other orgs' application templates, and get an idea of what we are looking for.

3) Draft your proposal

4) If concerned, feel free to post it here for discussion. (I can't completely guarantee feedback on all proposals or all proposal parts, but I will do what I can.)

Aside from what is in the template, the best small piece of advice I can give you regarding the proposal is to be as specific as possible with regard to your plan. Say, in detail, how you envision the entire system working, how each component will work, and what will happen when components fail (e.g. wikidata server goes down, bot gets banned as a spammer, network connection times out, mygene.info does not return information on the id you pass it, etc.). You must also have a detailed timeline; in general it's a safe estimate to (at least) double the amount of time you think it will take to complete anything.
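One way to make the failure cases listed above concrete in a proposal is a small retry wrapper around each remote call. The following is only a sketch under stated assumptions: TransientError and call_with_retries are hypothetical names, and deciding which real failures count as transient (a timeout, a server outage) versus permanent (a banned bot account, an unknown gene id) is left to the bot author.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, outages)."""


def call_with_retries(operation, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run operation(), retrying on TransientError with exponential backoff.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    Any other exception, and the final TransientError, propagate to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_fetch():
        # Simulated remote call: fails twice, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise TransientError("simulated timeout")
        return {"status": "ok"}

    # sleep is stubbed out so the demo finishes immediately.
    print(call_with_retries(flaky_fetch, sleep=lambda seconds: None))
```

Passing sleep as a parameter keeps the wrapper testable without real waiting; permanent errors are deliberately not retried so that they surface quickly.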

hope that helps

-Ben

chinmay naik

Apr 22, 2013, 2:34:19 PM

to crow...@googlegroups.com

Hi Ben,

From http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force/Properties , the identifier properties and media properties are existing fields in the GNF_Protein_box template. I know the various public databases from which they are retrieved. However, the relationship properties (genes to genes) are new. Are we going to use the NCBI database for them? Kindly provide some info on this.

Recently, Wikidata added qualifiers to properties. Qualifiers now act something like variants of an existing property; I say this because we can now add any property as a qualifier to an existing property. I was wondering about possible ways to link the identifiers of mouse (and other organisms) with human identifiers. Consider the following approach:

Create a property called 'HasOrthologousId'. Add qualifiers to indicate the organism (mouse, etc.).
Add qualifiers to the other identifiers (Entrez ID, etc.) as well. Maintain a 1-1 positional correspondence so that each identifier can be matched to its organism.

Ex: HasOrthologousId - organism x, organism y, organism z, and so on

Entrez ID - first value (human), second (organism x), third (nil), fourth (organism z), ...
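To make the positional scheme in this example concrete, here is a small Python sketch of how a reader of the data would have to pair the values back up. The function name and the "org y", "org z" and "id-z" values are hypothetical placeholders; 5649 and 19699 are the human and mouse Entrez ids mentioned earlier in the thread. The sketch assumes the order of values is preserved when the data is stored and read back.

```python
def pair_ids_with_organisms(organisms, entrez_ids):
    """Match Entrez ids to organisms by position.

    entrez_ids[0] is the human id; entrez_ids[i + 1] belongs to
    organisms[i]. None marks an organism with no known id.
    """
    if len(entrez_ids) != len(organisms) + 1:
        raise ValueError("expected one id for human plus one per organism")
    labels = ["human"] + list(organisms)
    return {
        label: value
        for label, value in zip(labels, entrez_ids)
        if value is not None
    }


if __name__ == "__main__":
    organisms = ["mouse", "org y", "org z"]
    entrez_ids = ["5649", "19699", None, "id-z"]
    print(pair_ids_with_organisms(organisms, entrez_ids))
```

The pairing works only as long as both lists stay aligned: inserting or removing one organism without updating the other list silently shifts every later id to the wrong organism.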

Kindly let me know your thoughts on this.

Benjamin Good

Apr 22, 2013, 3:00:00 PM

to crow...@googlegroups.com

Hi Chinmay,

Focus the proposal and/or early development work on the properties that are in the current GNF_Protein_box template. That's really the goal for the summer. If we nail that down completely we can consider other data like interactions.

Hmmm.. I think that when the object of a property resides within the wikidata space you would definitely not want to use the qualifier to add descriptive information about it (e.g. that it's about mice), because you could get this from a query that would pass the information on from the secondary topic. In my understanding, qualifiers are meant to describe something about the specific use of the property rather than something about the entities being linked by the property. Make sense?

But.. in the case that the topic is outside of wikidata (as it is here for the mouse ids) maybe the qualifier is a good way to do this. The only other ways I see would be to pack that information into the property (e.g. hasMouseOrthologueId) or to create an item to represent each relevant mouse gene within wikidata and then add the information there. That would get you humanGeneX hasOrthologue mouseGeneY and mouseGeneY hasEntrezId id0007. If it's allowed, and assuming you could get the needed data back in order to render it on the human gene page in Wikipedia, I like the last version the best as it's the most extensible. Need to verify those ifs though.

I suggest you run this problem by the wikidata mailing list (please cc me if you do).

-Ben

chinmay naik

Apr 22, 2013, 3:37:23 PM

to crow...@googlegroups.com

Yes, qualifiers are used to describe the specific use of a property: http://meta.wikimedia.org/wiki/Wikidata/Notes/Data_model_primer#Qualifiers

But consider any Wikidata item, e.g. http://www.wikidata.org/wiki/Q4115189: we can edit a property and add a qualifier, which can be any other property-value pair. Does this not mean that we can actually pack more than one qualifier into a single property? For example, consider the property Sandbox-string. In a similar way, we could also pack in multiple Entrez IDs.

Yes, this will definitely make for a complex mechanism.

humanGeneX hasOrthologue mouseGeneY and mouseGeneY hasEntrezId id0007 works fine; I tried it out on the Sandbox item. Currently only "transclusion of default wikidata items" is supported; transclusion of other items will be supported in the future: http://lists.wikimedia.org/pipermail/wikidata-l/2013-April/002117.html
But wouldn't it raise further questions? If we create separate Wikidata items, how would we describe them, and with what properties?

chinmay naik

Apr 22, 2013, 4:39:55 PM

to crow...@googlegroups.com

I apologize for the wrong set of questions. Pardon me if I communicated it the wrong way.
I wanted to ask the following. If we have a gene, would it be best to have a single Wikidata item that captures all aspects of the gene information? Then this item could serve as a sort of central repository for the gene. If we create multiple items for different organisms, would this result in some redundancy?

Kindly let me know your thoughts on this.

Chinmay

Benjamin Good

Apr 22, 2013, 5:16:44 PM

to crow...@googlegroups.com

As long as you re-use the wikidata items, you avoid redundancy by _not_ packing information into the properties/qualifiers. For example, if you do this:

humanGeneX hasOrthologue mouseGeneY

mouseGeneY hasEntrezId id0007

mouseGeneY hasTaxon mouse

you only ever represent

mouseGeneY hasTaxon mouse

once.

Say there was another entry in wikidata like a rat gene that also had that gene as a mouse orthologue. In this model you only add one more statement

ratGeneZ hasOrthologue mouseGeneY

and you don't have to restate the relationship between mouseGeneY and 'taxon mouse' in the property/qualifier context.
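The model above can be sketched as a small set of subject-predicate-object triples. This is only an in-memory illustration, not how Wikidata stores statements; the item and property names are the ones used in this message, and the helper functions objects and orthologue_details are hypothetical.

```python
def objects(triples, subject, predicate):
    """Return all objects of (subject, predicate, ?) in sorted order."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def orthologue_details(triples, gene):
    """Follow hasOrthologue links and collect facts stored on each target item."""
    details = {}
    for orthologue in objects(triples, gene, "hasOrthologue"):
        details[orthologue] = {
            "taxon": objects(triples, orthologue, "hasTaxon"),
            "entrez_id": objects(triples, orthologue, "hasEntrezId"),
        }
    return details


triples = {
    ("humanGeneX", "hasOrthologue", "mouseGeneY"),
    ("mouseGeneY", "hasEntrezId", "id0007"),
    ("mouseGeneY", "hasTaxon", "mouse"),
}

# Linking a rat gene to the same mouse gene costs exactly one new statement;
# the taxon and Entrez id of mouseGeneY are not repeated.
triples.add(("ratGeneZ", "hasOrthologue", "mouseGeneY"))

if __name__ == "__main__":
    print(orthologue_details(triples, "humanGeneX"))
    print(orthologue_details(triples, "ratGeneZ"))
```

Both lookups return the same taxon and Entrez id for mouseGeneY, because those facts live once on the mouse gene item rather than on each link pointing to it.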

The only reason not to do it this way (that I can see right now) would be if it somehow prevented you from using the information associated with the mouseGeneY data item in the context of the page about humanGeneX, or if you were prevented from creating items like mouseGeneY that might not have a Wikipedia page associated with them. (I don't know the answer to the first, and the second was still being debated last time I checked, but I think you would be ok.)

Does that make sense?

-ben

chinmay naik

Apr 22, 2013, 5:35:33 PM

to crow...@googlegroups.com

Ahh yes. I get it now. Thanks Ben for the info. Adding mouse identifiers is clear to me now.

Most parts of my proposal are ready. Hopefully it will be finished in the next couple of days.