Parsed Wikipedia Articles

Richard

Jun 15, 2011, 9:08:20 PM
to link-grammar
Hi,

I know there is a collection of Relex-parsed Wikipedia articles, and
that the raw text of all articles in the Wikipedia dump they were
taken from is on the same website; but is the raw text of just the
articles that were parsed available anywhere?

Thanks,
Richard

Dominic Lachowicz

Jun 16, 2011, 11:42:11 AM
to link-g...@googlegroups.com
I don't know if this is what you're looking for, but you can download
all of the Wikipedia articles:

http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia


--
"I like to pay taxes. With them, I buy civilization." --  Oliver Wendell Holmes

Bruce Williams

Jun 16, 2011, 3:38:12 PM
to link-g...@googlegroups.com
Hi

Are the Relex-parsed Wikipedia articles available?

Bruce Williams
Concepts, like individuals, have their histories and are just as
incapable of withstanding the ravages of time as are individuals.
But in and through all this they retain a kind of homesickness
for the scenes of their childhood.
-- Soren Kierkegaard

Linas Vepstas

Jun 16, 2011, 5:16:21 PM
to link-g...@googlegroups.com
On 16 June 2011 14:38, Bruce Williams <william...@gmail.com> wrote:
> Hi
>
> Are the Relex-parsed Wikipedia articles available?

In the http://gnucash.org/linas/nlp/data/enwiki-20080524/ directory

raw wikipedia dump:
enwiki-20080524-pages-articles.xml.bz2 16-Jul-2008 16:59 3.7G

stripped of wiki markup:
enwiki-20080524-alpha.tar.bz2 23-Jul-2008 20:54 1.6G

In the http://gnucash.org/linas/nlp/data/enwiki-20101011/ directory, likewise
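
A minimal fetch-and-unpack sketch, assuming wget, bzip2, and GNU tar
are installed; the URL and file name are exactly as listed above:

  # fetch the markup-stripped article set (about 1.6G)
  wget http://gnucash.org/linas/nlp/data/enwiki-20080524/enwiki-20080524-alpha.tar.bz2
  # -j routes extraction through bzip2
  tar -xjf enwiki-20080524-alpha.tar.bz2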

--linas

Bruce Williams

Jun 17, 2011, 12:18:47 PM
to link-g...@googlegroups.com
Thanks, Linas

Bruce Williams
Concepts, like individuals, have their histories and are just as
incapable of withstanding the ravages of time as are individuals.
But in and through all this they retain a kind of homesickness
for the scenes of their childhood.
-- Soren Kierkegaard

Richard

Jun 18, 2011, 3:42:16 PM
to link-grammar
Thanks, but that's not what I'm looking for. I want precisely the set
of articles which were parsed with Relex, and no more.

I want to repeat an experiment I performed using the Relex-parsed
Wikipedia articles, but with a different parser, so that I can
determine the effect the parser has on my results.

As per Linas' post below, the parsed articles are available, as are
'just-text' versions of the articles. But only 200,000 of the 4.5
million articles were parsed. For my comparison to be valid, I need
to parse exactly the set of 200,000 articles that were parsed with
Relex, not the full 4.5 million.

In the meantime, I'm trying to write a script that will take the
'intersection' of the parsed articles and the entire dump, but Unix
scripting is not my strong suit, and it's proving somewhat difficult.
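
A rough sketch of the intended intersection, assuming both the parsed
archive and the 'alpha' raw-text archive unpack to one file per
article named by title (the directory names parsed/ and alpha/ are
hypothetical, and -printf requires GNU find):

  # list the article file names present in each tree
  find parsed/ -type f -printf '%f\n' | sort -u > parsed-titles.txt
  find alpha/  -type f -printf '%f\n' | sort -u > raw-titles.txt

  # comm -12 keeps only the lines common to both sorted lists
  comm -12 parsed-titles.txt raw-titles.txt > common-titles.txt

  # copy just the raw-text articles that were also parsed
  mkdir -p subset
  while IFS= read -r title; do
      cp "alpha/$title" subset/
  done < common-titles.txt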

Richard

On Jun 16, 11:42 am, Dominic Lachowicz <domlachow...@gmail.com> wrote:
> I don't know if this is what you're looking for, but you can download
> all of the Wikipedia articles:
>
> http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia
>
>
>
> On Wed, Jun 15, 2011 at 9:08 PM, Richard <r.kee...@gmail.com> wrote:
> > Hi,
>
> > I know there is a collection of Relex-parsed Wikipedia articles, and
> > that the raw text of all articles in the Wikipedia dump they were
> > taken from is on the same website; but is the raw text of just the
> > articles that were parsed available anywhere?
>
> > Thanks,
> > Richard
>

Linas Vepstas

Jun 18, 2011, 10:20:54 PM
to link-g...@googlegroups.com
On 18 June 2011 14:42, Richard <r.ke...@gmail.com> wrote:
> I want to repeat an experiment I performed using the Relex-parsed
> Wikipedia articles, but with a different parser, so that I can
> determine the effect the parser has on my results.

!? "a different parser" I presume the stanford parser? Note
that the wikiepedia articles were not parsed in stanford compatibility
mode, so I don't know how you expect to compare. And, given the
other recent email thread about how recent versions of link-grammar
bungled the constituent tree when a sentence contained and/or clauses,
comparing constituent trees will yeild ...poor results.

--linas
