Trying to run reuters-21578 benchmark

Luís Oliveira

Aug 17, 2009, 9:36:32 PM
to montezuma-dev
Hello,

I downloaded the reuters-21578 corpus, unpacked it into
'tests/corpora/reuters-21578/corpus' and ran REUTERS-INDEXER::RUNIT, but
all I'm getting is:

------------------------------------------------------------
0 Secs: 0.72 Docs: NIL

Is the test stale or am I missing something else?

Thanks,

Leslie P. Polzer

Aug 18, 2009, 4:38:47 AM
to montez...@googlegroups.com

Luís Oliveira wrote:

Yeah, this seems to have suffered from bitrot; I've fixed it
in SVN (r416, r417).

Thanks for checking this out.

Leslie

Luís Oliveira

Aug 18, 2009, 6:59:28 AM
to montez...@googlegroups.com
On Tue, Aug 18, 2009 at 9:38 AM, Leslie P. Polzer
<s...@viridian-project.de> wrote:
> Yeah, this seems to have suffered from bitrot; I've fixed it
> in SVN (r416, r417).

Thanks for the quick fix. I think I'd like to make the test download
the corpus automatically. Do you think I should use Drakma/cl-md5 or a
shell script?

--
Luís Oliveira
http://student.dei.uc.pt/~lmoliv/

Leslie P. Polzer

Aug 18, 2009, 7:05:41 AM
to montez...@googlegroups.com

Luís Oliveira wrote:

> Thanks for the quick fix. I think I'd like to make the test download
> the corpus automatically. Do you think I should use Drakma/cl-md5 or a
> shell script?

I've been wondering whether we shouldn't just include it in the
repository. What do you think about that?

If we opt for automatic download, a simple shell script would be best,
IMO.
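
A minimal sketch of what such a script could look like, assuming curl,
md5sum (GNU coreutils) and tar are available. The function names
verify_md5 and fetch_corpus are made up for illustration, and neither
the corpus URL nor its checksum appears in this thread, so both are left
as arguments for the caller to supply:

```shell
#!/bin/sh
# Sketch only: fetch a corpus tarball, check its MD5, unpack it.
# The URL and expected checksum are assumptions supplied by the caller.

# verify_md5 FILE EXPECTED_MD5
# Succeeds only if FILE's MD5 digest equals EXPECTED_MD5.
verify_md5() {
    actual=$(md5sum "$1" | cut -d ' ' -f 1)
    [ "$actual" = "$2" ]
}

# fetch_corpus URL EXPECTED_MD5 DEST_DIR
# Downloads URL, verifies the checksum, and extracts into DEST_DIR.
fetch_corpus() {
    url=$1
    expected=$2
    dest=$3
    archive="$dest/corpus.tar.gz"

    mkdir -p "$dest" || return 1

    # Nothing to do if an earlier run already unpacked the corpus.
    if [ -e "$dest/.unpacked" ]; then
        return 0
    fi

    curl -fsSL -o "$archive" "$url" || return 1

    if ! verify_md5 "$archive" "$expected"; then
        echo "checksum mismatch for $archive" >&2
        rm -f "$archive"
        return 1
    fi

    tar -xzf "$archive" -C "$dest" || return 1
    rm -f "$archive"
    touch "$dest/.unpacked"
}
```

The test setup could then call something like
fetch_corpus "$CORPUS_URL" "$CORPUS_MD5" tests/corpora/reuters-21578/corpus
before running the indexer, with both variables set by whoever maintains
the benchmark. The .unpacked marker file keeps repeated runs from
downloading the archive again.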

Leslie

Luís Oliveira

Aug 18, 2009, 7:11:43 AM
to montez...@googlegroups.com
On Tue, Aug 18, 2009 at 12:05 PM, Leslie P. Polzer
<s...@viridian-project.de> wrote:
> I've been wondering whether we shouldn't just include it in the
> repository. What do you think about that?

Mirroring the corpus in the Google Code page sounds like a good idea.
I'm not sure about including it in the SVN trunk, though. Isn't it a bit too big?

Luís Oliveira

Aug 18, 2009, 8:45:08 AM
to montez...@googlegroups.com
On Tue, Aug 18, 2009 at 12:11 PM, Luís Oliveira <lui...@gmail.com> wrote:
> Mirroring the corpus in the Google Code page sounds like a good idea.
> Not sure about including it in the SVN trunk, isn't it a bit too big?

By the way, I just noticed that the test is indexing just the *.txt
files, while 99% of the corpus is contained in the *.sgm files. I've
attached a simple patch that fixes that.

montezuma-reuters-21578-indexer.diff

Leslie P. Polzer

Aug 18, 2009, 9:24:02 AM
to montez...@googlegroups.com

Luís Oliveira wrote:

> By the way, I just noticed that the test is indexing just the *.txt
> files, while 99% of the corpus is contained in the *.sgm files. I've
> attached a simple patch that fixes that.

Applied as r418, thanks.
