[Wikitech-l] Automatic Taxobox


Ashwin Ravichandran

Apr 2, 2012, 7:47:00 AM
to wikit...@lists.wikimedia.org
Hey,

I am very interested in the idea of an Automatic Taxobox. I have a method
for generating it with a basic Python script: the script would gather all
the data and store it in the taxonomy templates, and we could then render
it through the display templates.

I will look into the merits and demerits of generating the Automatic
Taxobox, and you will receive my proposal in the coming week.

I just need to know whether I am on the right track here.

Regards,
Ashwin

--
Ashwin.S.Ravichandran
_______________________________________________
Wikitech-l mailing list
Wikit...@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Martijn Hoekstra

Apr 2, 2012, 8:05:39 AM
to Wikimedia developers
Hey Ashwin,

Where are you getting the data for the taxobox? Does it need human
supervision? If not, depending on how heavy this is, this could be a
very cool Lua template that generates the taxobox dynamically.

Ashwin Ravichandran

Apr 2, 2012, 8:17:21 AM
to Wikimedia developers
Hey Martijn,

The data for the Taxobox can be obtained from Wikipedia itself, can't it?
Imagine we had to generate the template for the animal "Elephant":
http://en.wikipedia.org/wiki/Elephant. We could write a Python script
that extracts the data from that page and stores it in the taxonomy
template, and we could then choose how to render it in the display
template.

We can generate the taxobox using a Lua template, or we can derive our own
template, which would make it more user-friendly.

Regards,
Ashwin


Yury Katkov

Apr 2, 2012, 8:23:28 AM
to Wikimedia developers
Could you clarify how exactly you want to generate the data for the
Taxobox? [1] Do you plan to use Natural Language Processing? Or do
you want to do the opposite and parse the template, like the
guys from dbpedia.org?

[1] http://en.wikipedia.org/wiki/Template:Taxobox
-----
Yury Katkov

Ashwin Ravichandran

Apr 2, 2012, 8:30:18 AM
to Wikimedia developers
Yury,

I was thinking more along the lines of what the DBpedia people do. They
have a strong algorithm that makes good use of the data. But *I didn't
think about NLP, thanks for the input.* :)

I was looking more at deriving the information from the page itself:
checking the whole page for all the info we need and then storing it.

Cheers,
Ashwin

Martijn Hoekstra

Apr 2, 2012, 8:31:40 AM
to Wikimedia developers
Ah, that will certainly need human supervision for automatic
generation. Chances that we will be able to interpret the line

Elephants are large land mammals in two extant genera of the family
Elephantidae: Elephas and Loxodonta, with the third genus Mammuthus
extinct.

into

Elephant := genus ( Elephantidae Elephas || Elephantidae Loxodonta ||
Elephantidae Mammuthus)

and resolve the genera from there with 100% accuracy seem pretty slim,
no matter how sophisticated the script is, but writing a script that
assists a human user in making a taxobox should be possible.
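
A minimal sketch of such an assisting script, in Python since that is the
language proposed earlier in the thread. It only suggests candidates for a
human to confirm. The function name suggest_taxa and both heuristics are
assumptions made for illustration: capitalised words after the first word
are treated as possible taxon names, and names ending in "-idae" (the
conventional suffix for zoological families) are treated as families.

```python
import re


def suggest_taxa(sentence):
    """Guess (families, genus_candidates) from one sentence.

    Heuristic only: the result is a list of suggestions for a human
    editor to review, not a resolved classification.
    """
    words = re.findall(r"[A-Za-z]+", sentence)
    # Skip the first word: it is capitalised only because it starts
    # the sentence.
    capitalised = [w for w in words[1:] if w[0].isupper()]
    families = [w for w in capitalised if w.endswith("idae")]
    genera = [w for w in capitalised if not w.endswith("idae")]
    return families, genera


LEAD = (
    "Elephants are large land mammals in two extant genera of the family "
    "Elephantidae: Elephas and Loxodonta, with the third genus Mammuthus "
    "extinct."
)

families, genera = suggest_taxa(LEAD)
print(families)  # ['Elephantidae']
print(genera)    # ['Elephas', 'Loxodonta', 'Mammuthus']
```

On this one sentence the guesses happen to be right, but any capitalised
place name or person name in a lead sentence would show up as a false
genus candidate, which is why the output is meant for review rather than
for direct use.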

Ashwin Ravichandran

Apr 2, 2012, 8:41:09 AM
to Wikimedia developers
Agreed, but we will be diving into further classification, won't we?

Imagine.

Elephant := (Elephantidae Elephas || Elephantidae Loxodonta || Elephantidae
Mammuthus)

But we didn't specify which type of elephant.

Imagine we have the Asian Elephant:

Then we know that Asian Elephant := (Elephantidae Elephas)

Whereas African Elephant := (Elephantidae Loxodonta)

and the extinct genus, Extinct := (Elephantidae Mammuthus).

With the above script, we might not be 100% correct, but at least we are
aiming for 100%.
Taxobox generation will be quite easy after that.

Cheers,
Ashwin
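
The narrowing described above can be sketched as a plain lookup table in
Python. The names Taxon, KNOWN and lookup are hypothetical, the table is
hand-written rather than extracted from any article, and the "mammoth" key
is an assumed common name for the extinct genus Mammuthus.

```python
from typing import NamedTuple, Optional


class Taxon(NamedTuple):
    family: str
    genus: str
    extinct: bool


# Hand-written illustration of the three mappings from the message above.
KNOWN = {
    "asian elephant": Taxon("Elephantidae", "Elephas", False),
    "african elephant": Taxon("Elephantidae", "Loxodonta", False),
    "mammoth": Taxon("Elephantidae", "Mammuthus", True),
}


def lookup(common_name: str) -> Optional[Taxon]:
    """Return the known taxon, or None when the name is ambiguous or unknown."""
    return KNOWN.get(common_name.strip().lower())


print(lookup("Asian Elephant"))
print(lookup("Elephant"))  # None: "Elephant" alone does not pick one genus
```

Returning None for the bare name "Elephant" reflects the point made above:
without knowing which elephant is meant, the script cannot choose between
the three genera.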

Jelle Zijlstra

Apr 2, 2012, 9:45:02 AM
to Wikimedia developers
Virtually all Wikipedia articles that need one already have a taxobox,
which will be far easier to process than the lead sentence, so I'm not sure
where the need for natural language processing comes in. Also, are you
aware of the existing automatic taxobox system on en.wikipedia
(https://en.wikipedia.org/wiki/Template:Automatic_taxobox)?
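
To illustrate why an existing taxobox is easier to process than the lead
sentence, here is a minimal Python sketch that reads the parameters of a
taxobox out of wikitext. It assumes one "| name = value" parameter per line
and a closing "}}" on its own line, and it does not handle nested
templates. The function name parse_taxobox and the sample wikitext are made
up for the example.

```python
import re


def parse_taxobox(wikitext):
    """Return the parameters of the first {{Taxobox ...}} block as a dict."""
    params = {}
    in_box = False
    for line in wikitext.splitlines():
        stripped = line.strip()
        if not in_box:
            # Look for the line that opens the template.
            if stripped.lower().startswith("{{taxobox"):
                in_box = True
            continue
        if stripped == "}}":
            break  # end of the template
        match = re.match(r"\|\s*([^=|]+?)\s*=\s*(.*)$", stripped)
        if match:
            params[match.group(1)] = match.group(2).strip()
    return params


# Made-up sample, not copied from a real article.
SAMPLE = """{{Taxobox
| name = Elephant
| regnum = Animalia
| familia = Elephantidae
}}
Elephants are large land mammals.
"""

box = parse_taxobox(SAMPLE)
print(box["familia"])  # Elephantidae
```

A line-based parser like this breaks as soon as a parameter value spans
several lines or contains another template, so a real tool would need a
proper wikitext parser.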


Ashwin Ravichandran

Apr 2, 2012, 11:20:27 AM
to Wikimedia developers
Jelle,

I saw the Automatic taxobox, but it doesn't seem very user-friendly, and
isn't that what we are working on, so that it becomes more practical to
use? We can actually use NLP to a major extent, through text extraction,
which can be quite helpful.

Ryan Kaldari

Apr 2, 2012, 1:08:30 PM
to wikit...@lists.wikimedia.org
I definitely agree that the Automatic taxobox needs a better user
interface - that is the biggest obstacle to its adoption. Right now
there is a significant learning curve to being able to use it. I would
support a project to improve the user interface of the existing
Automatic taxobox, but frankly I don't see much value in using NLP to
populate the data. In fact, I would hesitate to automatically populate
any of the data for any taxobox from article content. The taxonomy in
Wikipedia articles is notoriously unreliable and outdated and very
frequently contradictory. Just look through the mess we have in our
Zebra articles and you'll see what I mean. And if we can't get Zebras
right, imagine what our taxonomy is like for arthropods!

Ryan Kaldari

Ashwin Ravichandran

Apr 3, 2012, 12:35:25 AM
to Wikimedia developers
Ryan,


> I definitely agree that the Automatic taxobox needs a better user
> interface - that is the biggest obstacle to its adoption. Right now there
> is a significant learning curve to being able to use it. I would support a
> project to improve the user interface of the existing Automatic taxobox,

Thanks a ton for clearing that up. I will know what to put in the proposal
then.

> but frankly I don't see much value in using NLP to populate the data. In
> fact, I would hesitate to automatically populate any of the data for any
> taxobox from article content. The taxonomy in Wikipedia articles is
> notoriously unreliable and outdated and very frequently contradictory. Just
> look through the mess we have in our Zebra articles and you'll see what I
> mean. And if we can't get Zebras right, imagine what our taxonomy is like
> for arthropods!

So what you are suggesting is that we generate the taxonomy using other
sites? Wiki = unreliable?
