Crawling Tbox - Seeds

Valentino Hudhra

Feb 6, 2014, 1:33:28 PM
to ldsp...@googlegroups.com
Hi,

First, kudos for the tbox_only feature. My question, though, is about the seed list for obtaining the T-Box/vocabularies out there. Let's say I need to collect all of them (at least those that are live) in order to do some analysis.

I guess that the LOV Project [1] is the right place to start, but what exactly is a good seed list in this case? I am asking because I am new to crawling.

Another question: can you give a very rough estimate of how long it will take to crawl the T-Box(es) with a good seed list?

Cheers,

Valentino

Andreas Harth

Feb 6, 2014, 3:25:21 PM
to ldsp...@googlegroups.com
Hi Valentino,

On 02/06/2014 07:33 PM, Valentino Hudhra wrote:
> firs, kudos for the tbox_only feature. My question though is about the
> seed list for obtaining the Tbox/vocabularies out there. Let's say I
> need to collect all of them (at least what is live) in order to do some
> analysis.
>
> I guess that the LOV Project [1] is the right place to start, but what
> exactly is a good seed list in this occasion. I am asking because I am
> new to crawling.
>
> Another question, can you give a very rough estimate of how long will it
> take to crawl the TBox(es) with a good seed list?

IIRC we have the tbox_only feature in there for the following case:
if you're crawling RDF data and want to make sure your dataset contains
all the T-Box information, you need to crawl another hop for the T-Box
URIs.
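
To make "the T-Box URIs" concrete, here is a minimal Python sketch that
pulls them out of already-crawled N-Triples data. It assumes that the
T-Box URIs of interest are the predicates plus the objects of rdf:type
statements; that is an illustrative reading, not necessarily what
tbox_only does internally. The function name tbox_uris and the
simplified line regex are hypothetical, not part of any crawler.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Simplified N-Triples line: subject (IRI or blank node), IRI predicate,
# then everything up to the final "." as the object.
TRIPLE = re.compile(r'^\s*(<[^>]*>|_:\S+)\s+<([^>]*)>\s+(.*?)\s*\.\s*$')

def tbox_uris(ntriples_lines):
    """Collect predicate URIs and rdf:type object URIs (assumed T-Box)."""
    uris = set()
    for line in ntriples_lines:
        m = TRIPLE.match(line)
        if not m:
            continue  # skip comments, blank lines, anything unparsable
        _subject, predicate, obj = m.groups()
        uris.add(predicate)
        # Objects of rdf:type are classes; keep them only if they are IRIs.
        if predicate == RDF_TYPE and obj.startswith("<") and obj.endswith(">"):
            uris.add(obj[1:-1])
    return uris
```

The resulting set is what the extra hop would have to dereference.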

The problem with only crawling T-Box statements is that you might soon
run out of new URIs.

With vocab.cc we have the T-Box URIs from the BTC-2012 dataset [1]. To
quote our paper [2]:
"vocab.cc offers with 261,119 unique vocabulary elements a significantly
larger coverage of existing vocabularies than previous approaches."
At the time we wrote the paper, LOV had 3,714 vocabulary elements.

So: to get a good T-Box dataset, grab the vocab.cc URIs [3] and crawl
hop-1 from there. Done. Should take a few hours max.
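
As a rough sketch of the seed-list step: assuming the vocab.cc files
boil down to one vocabulary element URI per line (I have not checked the
actual format), the Python below turns them into a deduplicated seed
list. It strips the fragment, since hash URIs such as ...rdf-schema#label
and ...rdf-schema#comment resolve to the same document and only need to
be fetched once. The function name seed_list is made up for this
example.

```python
from urllib.parse import urldefrag

def seed_list(element_uris):
    """Return unique document URIs, in first-seen order."""
    seen = set()
    seeds = []
    for uri in element_uris:
        uri = uri.strip()
        if not uri:
            continue  # ignore empty lines
        # Drop the "#fragment" part; it is never sent to the server.
        doc, _fragment = urldefrag(uri)
        if doc not in seen:
            seen.add(doc)
            seeds.append(doc)
    return seeds
```

Writing the returned list out one URI per line gives a plain seed file
to start the crawl from.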

If you want, you can also use the URIs from the other repositories.
I'd be interested in knowing how much overlap there is between them.
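
One simple way to measure that overlap, assuming each repository's URIs
have been loaded into a Python collection (the function name overlap and
the choice of the Jaccard index are just a suggestion):

```python
def overlap(a, b):
    """Compare two collections of URIs as sets."""
    a, b = set(a), set(b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),   # URIs present in both
        "only_a": len(a - b),    # URIs only in the first collection
        "only_b": len(b - a),    # URIs only in the second collection
        # Jaccard index: shared / union, 0.0 if both are empty
        "jaccard": len(shared) / len(union) if union else 0.0,
    }
```

Note that this compares URIs as exact strings, so differences such as a
trailing slash or http vs. https count as distinct entries.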

Best regards,
Andreas.

[1] http://km.aifb.kit.edu/projects/btc-2012/
[2] http://www.aifb.kit.edu/images/8/8e/Vocab.pdf
BibTeX: http://www.aifb.kit.edu/web/Spezial:Semantische_Suche/-5B-5BInproceedings3362-5D-5D/format%3Dkiteva/limit%3D10000
[3] http://code.google.com/p/vocab/source/browse/#svn%2Ftrunk%2Fwar%2FWEB-INF%2Ffiles
