html/wiki markup: discard or retain?


Paul McQuesten

Jan 6, 2011, 4:13:33 PM
to link-grammar
On Tue, Jan 4, 2011 at 8:42 PM, Linas Vepstas <linasv...@gmail.com>
wrote:
> p.s. you realize that you can download most of wikipedia, pre-parsed,
> by link-grammar, all in a "easy" to use xml file? See
>
> http://gnucash.org/linas/nlp/data/enwiki/

Re stripping html/wiki markup:

Mayhap links embedded in a document would make a good clue about the
subject matter? And that could help set likelihood for word sense
disambiguation? Might it be useful for LG to recognize links (via a
regex for http://...) and retain them, perhaps marking them as .ij
(interjection) at first?
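[For what that regex-recognition idea might look like: link-grammar does
ship a regex file (4.0.regex) that maps token patterns to dictionary
classes, so a sketch of an entry could be something like the following.
The class name, the exact pattern, whether internal slashes need
escaping, and whether the tokenizer even delivers a whole URL as a
single token are all assumptions that would need checking; a matching
URL-WORD entry would also have to be added to 4.0.dict to give it
connectors.]

```
% Hypothetical 4.0.regex entry -- assumes the tokenizer keeps the
% whole URL together as one token, which needs verifying.
URL-WORD: /^https?:\/\/.+$/
```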

More generally, how could information from footnotes be captured? Used
by Relex?


Linas Vepstas

Jan 6, 2011, 7:09:00 PM
to link-g...@googlegroups.com
On 6 January 2011 15:13, Paul McQuesten <mcqu...@gmail.com> wrote:
> On Tue, Jan 4, 2011 at 8:42 PM, Linas Vepstas <linasv...@gmail.com>
> wrote:
>> p.s. you realize that you can download most of wikipedia, pre-parsed,
>> by link-grammar, all in a "easy" to use xml file?  See
>>
>> http://gnucash.org/linas/nlp/data/enwiki/
>
> Re stripping html/wiki markup:
>
> Mayhap links embedded in a document would make a good clue about the
> subject matter? And that could help set likelihood for word sense
> disambiguation? Might it be useful for LG to recognize links (via a
> regex for http://...) and retain them, perhaps marking them as .ij
> (interjection) at first?

My knee-jerk reaction would be to say that some other software
layer should strip these out, remember where they went, and
then put them back in after a parse. This is based on the
observation that links wouldn't improve the speed or accuracy
of the parse, and are thus irrelevant clutter.
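[That strip/remember/reinsert layer could be sketched roughly like so in
Python. The regex and the placeholder scheme are purely illustrative,
not anything link-grammar or Relex actually ships; the real layer would
hand the cleaned text to the parser where the comment indicates.]

```python
import re

# A simple pre/post-processing layer: strip URLs before parsing,
# remember them, and re-insert them afterwards.
URL_RE = re.compile(r'https?://\S+')

def strip_urls(text):
    """Replace each URL with a numbered placeholder token.

    Returns (clean_text, urls), where urls[i] is the URL that was
    replaced by the placeholder '<url-i>'.
    """
    urls = []
    def repl(match):
        urls.append(match.group(0))
        return '<url-%d>' % (len(urls) - 1)
    return URL_RE.sub(repl, text), urls

def restore_urls(text, urls):
    """Substitute the remembered URLs back for their placeholders."""
    for i, url in enumerate(urls):
        text = text.replace('<url-%d>' % i, url)
    return text

sentence = "See http://gnucash.org/linas/nlp/data/enwiki/ for the dumps."
clean, urls = strip_urls(sentence)
# ... hand `clean` to the parser here ...
assert restore_urls(clean, urls) == sentence
```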

> More generally, how could information from footnotes be captured? Used
> by Relex?

Relex would be a reasonable layer in which to do this kind of
processing; however, I don't see how it could be "used" by relex.
Relex is a mashup of two different things: a pathetic document
management framework wrapped around a core dependency parser.
The dependency parser is all about English grammar, and can't
per se "do anything" with urls or footnotes.

However, patches that would strip out urls/footnotes, and then
reinsert them, post-parse, would be accepted.

-- linas
