Crawling existing bulletin boards, representing contents as SIOC?

Matthias Samwald

Oct 1, 2009, 10:43:04 AM
to SIOC-Dev
Dear SIOC community,

At the moment, I am thinking about possible ways of turning existing
bulletin boards (often based on the popular vBulletin software) into
SIOC, by crawling them and extracting the content.

Does any of you have experience with crawling bulletin boards? Is
there any existing software that could be built upon?

Cheers,
Matthias

Paul A Houle

Oct 1, 2009, 11:14:40 AM
to sioc...@googlegroups.com
Back in '99 I wrote a web crawler in Java that I called 'Blackbird.'
It was, in a lot of ways, like the airplane of the same name. (Yes,
web crawling is a bit of a 'black art'.)

Although it wasn't distributed, it had fancy concurrency control
and queuing policies; it could get most of the performance that is
possible with a reasonable Unix box and internet connection.

Then I went through a phase of creating simpler and simpler web
crawlers. I kind of thought I was devolving until I saw the crawling
strategy Nutch uses and realized it was pretty much the same.

These days I'm a big believer in breadth-first crawling. The web
crawler runs in stages: stage N outputs a list of urls to stage N+1.
The crawler itself is pretty dumb: it grabs the URLs, writes the
contents into files or stuffs them into DB blobs. Concurrency control
can be ~simple~, for instance, just divide the list of tasks to do
into M sublists, fork into M children, and let each child do 1/M of
the work. (That's not the best strategy, but you can even do it in
Perl or PHP.)
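
A minimal sketch of the "divide the list into M sublists" idea for a single stage, in Python. It is only an illustration under stated assumptions: threads stand in for the forked children mentioned above, and `fetch`, `split_into_sublists` and `run_stage` are hypothetical names, not part of any existing crawler.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def split_into_sublists(urls: List[str], m: int) -> List[List[str]]:
    """Divide the task list into m sublists of nearly equal size."""
    return [urls[i::m] for i in range(m)]


def run_stage(urls: List[str], fetch: Callable[[str], str], m: int = 4) -> Dict[str, str]:
    """Fetch every URL of one stage, giving each of m workers 1/m of the list.

    `fetch` is supplied by the caller (assumption), e.g. a function that
    downloads a URL and returns the body as text.
    """

    def work(sublist: List[str]) -> Dict[str, str]:
        # The worker is deliberately dumb: fetch and store, nothing else.
        return {url: fetch(url) for url in sublist}

    stored: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=m) as pool:
        for result in pool.map(work, split_into_sublists(urls, m)):
            stored.update(result)
    return stored
```

The returned dictionary plays the role of the "files or DB blobs"; a real crawler would write each body to disk instead of keeping it in memory.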

Once a stage of the crawl is done, I run some scripts that extract
whatever data comes out of the stage. The nice thing about having this
decoupled from the crawler is that you can fix bugs in your extractor
without having to re-run the crawl. The extractor sends URLs on to
stage N+1; you can even move URLs that were temporary failures in
stage N to stage N+1.
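
One possible shape for such a decoupled extractor, sketched in Python. It works only on pages that were already stored, so it can be re-run after a bug fix without touching the network. The function name, the simple `href` regular expression and the way failures are passed in are all assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, List, Set
from urllib.parse import urljoin

# Naive link pattern, good enough for a sketch; real pages need a real parser.
HREF = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def next_stage_urls(
    stored_pages: Dict[str, str],
    already_seen: Set[str],
    temporary_failures: Iterable[str] = (),
) -> List[str]:
    """Build the URL list for stage N+1 from the pages stored in stage N."""
    # URLs that failed temporarily in stage N are retried unconditionally.
    queued: List[str] = list(dict.fromkeys(temporary_failures))
    queued_set: Set[str] = set(queued)
    for base_url, html in stored_pages.items():
        for href in HREF.findall(html):
            url = urljoin(base_url, href)  # resolve relative links
            if url not in already_seen and url not in queued_set:
                queued.append(url)
                queued_set.add(url)
    return queued
```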

You'll usually see a rapid increase in the size of the stages, then
a gentle plateau, then it falls off and you're left with some
stragglers, which are all web traps. Terminate the crawl then... The
real advantage of breadth-first is that it easily shakes off common web
traps.
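
The "rise, plateau, fall-off" pattern suggests a simple stopping rule, sketched here in Python. The 5% threshold and the function name are assumptions chosen for illustration; the right cut-off depends on the site being crawled.

```python
from typing import List


def should_terminate(stage_sizes: List[int], tail_fraction: float = 0.05) -> bool:
    """Return True once the latest stage is a small fraction of the largest one.

    stage_sizes[i] is the number of URLs in stage i. The crawl is only
    stopped after the peak has passed, so the early rapid growth never
    triggers termination.
    """
    if len(stage_sizes) < 2:
        return False
    peak = max(stage_sizes)
    latest = stage_sizes[-1]
    past_peak = stage_sizes.index(peak) < len(stage_sizes) - 1
    return past_peak and latest <= peak * tail_fraction
```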

In early development or for small jobs you can do it manually and
have a lot of control over what's happening. In a more mature system
you can have higher-level optimization start and stop the stages, run
the extractor scripts, decide when to terminate a crawl, etc.

My current web crawler has a centralized work queue: other scripts
submit jobs to the crawler, which works through them, and runs
callback scripts when jobs are completed. It works pretty nicely.
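
A centralized work queue with callbacks could look roughly like the following Python sketch. This is not the crawler described above, just an illustration of the idea; `CrawlQueue`, `submit` and `run_until_empty` are hypothetical names, and in-process callables stand in for callback scripts.

```python
import queue
from typing import Callable

Callback = Callable[[str, str], None]  # receives (url, body)


class CrawlQueue:
    """Single queue that other code submits jobs to."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch  # caller-supplied download function (assumption)
        self._jobs: "queue.Queue" = queue.Queue()

    def submit(self, url: str, callback: Callback) -> None:
        """Queue a URL together with the callback to run when it is done."""
        self._jobs.put((url, callback))

    def run_until_empty(self) -> int:
        """Work through the queue; callbacks may submit further jobs."""
        done = 0
        while True:
            try:
                url, callback = self._jobs.get_nowait()
            except queue.Empty:
                return done
            callback(url, self._fetch(url))
            done += 1
```

Because callbacks can call `submit` themselves, an extractor can feed newly found URLs straight back into the same queue.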


James Howison

Oct 2, 2009, 10:32:17 AM
to sioc...@googlegroups.com

Well, the first move would be to see if they have some form of export
or, if the forums are open source, whether you can add one and then
have it taken up by the site :)

Or this might be an option (I haven't used it but I have written
crawlers and they are a pain).

http://news.idg.no/cw/art.cfm?id=E1888BDC-1A64-67EA-E4609525E2DBCDB9

http://80legs.com/

--J

Uldis Bojars

Oct 7, 2009, 7:52:04 AM
to SIOC-Dev
Hi Matthias,

On Oct 1, 3:43 pm, Matthias Samwald <samw...@gmx.at> wrote:
> At the moment, I am thinking about possible ways of turning existing
> bulletin boards (often based on the popular vBulletin software) into
> SIOC, by crawling them and extracting the content.
>
> Does any of you have experience with crawling bulletin boards? Is
> there any existing software that could be built upon?

Are you able to install plugins on these bulletin boards?
If yes, just install the vBulletin SIOC exporter [1] and use any RDF
crawler.

We used this approach to collect data for the boards.ie SIOC data
competition [2][3]. Most of the work was done by Thomas Schandl and
Tuukka Hastrup, and they can probably tell more about the process and
tools used.

[1] http://wiki.sioc-project.org/index.php/VBSIOC
[2] http://wiki.sioc-project.org/index.php/Data/Boards.ie/Structure
[3] http://www.johnbreslin.com/blog/2008/07/30/deri-nui-galway-launches-the-boardsie-sioc-data-competition/

If there is no possibility to install plugins, then some kind of a
wrapper for converting HTML into RDF will need to be used.

Does anyone know if there are such "SIOC wrappers" available for
vBulletin and other systems?

Uldis

[ http://captsolo.net | http://twitter.com/CaptSolo ]

Matthias Samwald

Oct 8, 2009, 4:30:12 AM
to SIOC-Dev
Thanks for your replies.

I know that there is a vBulletin SIOC plugin, but this is not an
option. I want to RDFize existing forums out there, without having
any control over them.

However, after looking into it a bit more, I think that this should be
simple to do for most bulletin boards based on vBulletin. Most of them
have an 'archive', which is a stripped-down version of the forum
content, looking like this:
http://www.vbulletin.com/forum/archive/index.php

So what I plan to do is to write a PHP script that creates a local
mirror of each forum archive with the standard Unix tool 'wget'. Then,
the script will iterate through the downloaded files and create SIOC
from the archive HTML pages with a bunch of XML querying and regular
expressions. I think this should be fairly trivial to implement (at
least when I find the time to do it).
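
The mirroring step of this plan might be sketched as follows, in Python rather than PHP purely for illustration. The wget options used (`--mirror`, `--no-parent`, `--wait`, `--directory-prefix`) are standard ones; the function names, the two-second wait and the directory layout are assumptions.

```python
import subprocess
from pathlib import Path
from typing import Iterator, List


def wget_mirror_command(archive_url: str, target_dir: str, wait_seconds: int = 2) -> List[str]:
    """Build the wget command line for mirroring one forum archive."""
    return [
        "wget",
        "--mirror",                          # recursive download with timestamping
        "--no-parent",                       # stay below the archive URL
        f"--wait={wait_seconds}",            # pause between requests (assumed value)
        f"--directory-prefix={target_dir}",  # where the local mirror is written
        archive_url,
    ]


def mirror(archive_url: str, target_dir: str) -> None:
    """Run wget; raises CalledProcessError if wget exits with an error."""
    subprocess.run(wget_mirror_command(archive_url, target_dir), check=True)


def downloaded_pages(target_dir: str) -> Iterator[Path]:
    """Yield every downloaded file below target_dir in a stable order."""
    for path in sorted(Path(target_dir).rglob("*")):
        if path.is_file():
            yield path
```

Each path yielded by `downloaded_pages` can then be handed to whatever code turns archive HTML into SIOC.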

-- Matthias

Matthias Samwald

Oct 8, 2009, 5:06:40 AM
to SIOC-Dev
Oh, and if someone feels motivated to help with this project, please
say so! :)

seb

Oct 8, 2009, 7:56:27 AM
to SIOC-Dev
Hi Matthias,

What kind of help/contribution would you need more specifically?

/seb

Matthias Samwald

Oct 8, 2009, 9:04:57 AM
to SIOC-Dev
Hi Seb,

Basically, help with the actual programming from some experienced
programmer. For example, I could write down how the vBulletin archive
HTML should be mapped to SIOC, and someone else could help with
writing the code. However, it might still be more efficient if I just
start hacking right away...

Another possible form of help would be writing some of the necessary code
snippets for extracting the various attributes from the archive HTML
pages (e.g., as PHP code that uses SimpleXML and regular expressions
to extract each post, each author, each content, thread title, date,
links to external resources et cetera). Yes, I guess that would be
more efficient.
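
To make the kind of snippet meant here concrete, below is a rough Python sketch (Python rather than PHP, purely for illustration). The markup it expects (`div` elements with classes `post`, `username`, `date` and `posttext`) is an assumption about what an archive page might look like, not verified vBulletin output, and the mapping to `sioc:Post`, `sioc:has_creator`, `sioc:name`, `sioc:content`, `sioc:has_container` and `dcterms:created` is one possible choice among several.

```python
import html
import re
from typing import Dict, List

# Hypothetical archive markup: one <div class="post"> per message.
POST = re.compile(
    r'<div class="post">.*?'
    r'<div class="username">(?P<author>.*?)</div>.*?'
    r'<div class="date">(?P<date>.*?)</div>.*?'
    r'<div class="posttext">(?P<content>.*?)</div>',
    re.DOTALL,
)
TAGS = re.compile(r"<[^>]+>")


def extract_posts(page: str) -> List[Dict[str, str]]:
    """Pull author, date and content out of every post block on a page."""
    posts = []
    for match in POST.finditer(page):
        # Strip inline tags, then decode entities such as &amp;.
        posts.append({key: html.unescape(TAGS.sub("", value)).strip()
                      for key, value in match.groupdict().items()})
    return posts


def turtle_literal(text: str) -> str:
    """Quote a string as a Turtle literal, escaping the characters that need it."""
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
               .replace("\n", "\\n").replace("\r", "\\r"))
    return '"' + escaped + '"'


def posts_to_turtle(thread_url: str, posts: List[Dict[str, str]]) -> str:
    """Serialize extracted posts as SIOC in Turtle (dates kept as plain strings)."""
    lines = ["@prefix sioc: <http://rdfs.org/sioc/ns#> .",
             "@prefix dcterms: <http://purl.org/dc/terms/> .", ""]
    for number, post in enumerate(posts, start=1):
        lines.append(f"<{thread_url}#post{number}> a sioc:Post ;")
        lines.append(f"    sioc:has_container <{thread_url}> ;")
        lines.append(f"    sioc:has_creator [ sioc:name {turtle_literal(post['author'])} ] ;")
        lines.append(f"    dcterms:created {turtle_literal(post['date'])} ;")
        lines.append(f"    sioc:content {turtle_literal(post['content'])} .")
    return "\n".join(lines)
```

The `#postN` fragment identifiers are made up for the sketch; a real converter would want stable post URIs taken from the forum itself.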

-- Matthias

Alexandre Passant

Oct 8, 2009, 9:18:56 AM
to sioc...@googlegroups.com
Hi,

On 8 Oct 2009, at 14:04, Matthias Samwald wrote:

>
> Hi Seb,
>
> Basically help with doing the actual programming by some experience
> programmer. For example, I could write down how the vBulletin archive
> HTML should be mapped to SIOC, and someone else could help with
> writing the code. However, it might still be more efficient if I just
> start hacking right away...
>
> Another possible help could be writing some of the necessary code
> snippets for extracting the various attributes from the archive HTML
> pages (e.g., as PHP code that uses SimpleXML and regular expressions
> to extract each post, each author, each content, thread title, date,
> links to external resources et cetera). Yes, I guess that would be
> more efficient.

I guess the current SIOC PHP API, available at [1], may help you
write such a wrapping service.
If you have any questions about this API, please ask us on the ML.

Best,

Alex.

[1] http://wiki.sioc-project.org/index.php/PHPExportAPI

>
> -- Matthias
>
> On 8 Okt., 13:56, seb <seb....@gmail.com> wrote:
>> Hi Matthias,
>>
>> What kind of help/contribution would you need more specifically?
>>
>> /seb
>>
>> On 8 oct, 11:06, Matthias Samwald <samw...@gmx.at> wrote:
>>
>>> Oh, and if someone feels motivated to help with this project, please
>>> say so! :)
> >

--
Dr. Alexandre Passant
Digital Enterprise Research Institute
National University of Ireland, Galway
:me owl:sameAs <http://apassant.net/alex> .






James Howison

Oct 8, 2009, 2:38:14 PM
to sioc...@googlegroups.com
My experience with writing crawl and processing bots like this [1] is
that you want to have an architecture that has each step (e.g. get
list of pages, get each page, parse the page, generate) as a separate
job. That makes it much easier to recover from the inevitable
unpredictable error conditions. Generate a set of jobs, throw them in
a queue, have forked agents pull them out and record errors separately
for review later. You need a singleton grabbing pages, so that you
can control how often you hit the server.
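
The "singleton grabbing pages" part could be sketched like this in Python. It is an illustration only: `PoliteFetcher`, the two-second default interval and the injectable `clock` and `sleep` parameters are assumptions, and errors are simply collected in a list for later review.

```python
import time
from typing import Callable, List, Optional, Tuple


class PoliteFetcher:
    """The one place that touches the server, so the request rate has one owner."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        min_interval: float = 2.0,               # seconds between requests (assumed)
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request: Optional[float] = None
        self.errors: List[Tuple[str, str]] = []  # (url, error) pairs for review

    def get(self, url: str) -> Optional[str]:
        """Fetch one URL, waiting first if the previous request was too recent."""
        if self._last_request is not None:
            wait = self._min_interval - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        self._last_request = self._clock()
        try:
            return self._fetch(url)
        except Exception as exc:  # broad on purpose: record it and keep going
            self.errors.append((url, repr(exc)))
            return None
```

Parsing and generation jobs never call the network themselves; they only consume what this one fetcher has stored.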

Good luck!
James

[1]: Howison, J. and Crowston, K. (2004). The perils and pitfalls of
mining Sourceforge. In Proc. of Workshop on Mining Software
Repositories at the International Conference on Software Engineering
ICSE.
http://citeseer.ist.psu.edu/howison04perils.html

Sergio Fernández

Oct 8, 2009, 8:44:19 AM
to sioc...@googlegroups.com
Dear Matthias,

We did some work on scraping forums and mailing lists:

Diego Berrueta, Sergio Fernández and Lian Shi. Bootstrapping the
Semantic Web of Social Online Communities. WWW2008 Workshop on
Social Web Search and Mining (SWSM2008), Beijing, China, April
22, 2008.
http://www.wikier.org/stuff/research/publications/2008/SWSM2008-bootstrapping-social-semantic-web.pdf

There are many, many technologies (TagSoup in Java, pyquery in Python,
XSLT and many others...) that can be deployed by adapting any current
crawler. But I don't know of any packaged open-source product that
fulfils your requirements.

BTW, having RDFa in that forum would be cool.

Cheers,
--
Sergio Fernández - sergio.f...@fundacionctic.org
Departamento I+D+i
Fundación CTIC - www.fundacionctic.org
Tlfn: +34 984 29 12 12
Fax: +34 984 39 06 12
Edificio Centros Tecnológicos
Parque Científico Tecnológico
33203 Cabueñes - Gijón - Asturias - Spain

Paul A Houle

Oct 9, 2009, 10:03:28 AM
to sioc...@googlegroups.com
Sergio Fernández wrote:
>
> There are many many technologies (TagSoup in Java, pyquery in python,
> XSLT or many others...) that can be deployed adapting any current
> crawler. But I don't know any packaged open-source product that fullfil
> your requirements.
>
>
A general strategy I like is to run HTML through HTML Tidy,
converting it to XHTML. Then you can use all kinds of XML tools, such
as XQuery, XSLT, or the DOM to do your parsing. I've done this in
both Java and PHP and I've had good results. In one project (parsing
all of Slashdot) bad HTML caused structural instability in the XHTML
generated by Tidy, but most of the time this approach works like a charm.
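
A small Python sketch of the second half of this strategy, i.e. querying a page once it is well-formed XHTML. It assumes the Tidy step has already been run separately (for example `tidy -asxhtml -numeric`, so that no named entities are left for the XML parser to trip over); the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Tidy's XHTML output puts every element in the XHTML namespace.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def links_in_xhtml(xhtml: str) -> List[Tuple[str, str]]:
    """Return (href, link text) pairs for every <a href="..."> in an XHTML document."""
    root = ET.fromstring(xhtml)
    pairs = []
    # ElementTree supports a limited XPath subset, including [@attribute] tests.
    for anchor in root.iterfind(".//x:a[@href]", XHTML_NS):
        text = "".join(anchor.itertext()).strip()
        pairs.append((anchor.get("href"), text))
    return pairs
```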

seb

Oct 9, 2009, 10:31:05 AM
to SIOC-Dev
I found Ruby's Hpricot package to be quite handy with its ability to
query the HTML document via XPath expressions.
Not sure how well it handles bad markup though.

/seb

Clay Fink

Oct 9, 2009, 5:18:54 PM
to sioc...@googlegroups.com
I've been using the Jericho Java API. JTidy works well also, but
Jericho seems like a much more active project and it's easy to use.

Clay

Sent from my iPhone

Davide Eynard

Oct 12, 2009, 3:07:14 AM
to SIOC-Dev
Hi Matthias,

Years ago I developed a crawler/scraper with this exact purpose (I
mean, extracting messages from fora... I guess SIOC did not exist
yet ;)). It is called TWO and you can find it at http://two.sf.net.
It is REALLY old, so I guess that the regular expressions I used to
scrape page contents are probably not valid anymore, but I think that
the main structure of vBulletin-like fora has remained the same so it
might just require some changes to work.

Just a couple more points:

- the tool is written in Perl (which might be good or very bad
depending on how you like it ;)

- there is some documentation in PDF available on the website: it is
more like high-level project documentation, but it should give you
all the information you need to understand the tool and modify it to
suit your needs

- the tool has been designed to be modular - it already came with
three different scrapers and it is built to be extensible... I'm
waiting to see export plugins for other fora ;-)

Well, if any of you is interested in reviving the project, let me
know - I'd be glad to see an old product of mine being useful to
someone ;)

Cheers,

da