WordNet::QueryData changes

danny

Mar 15, 2009, 3:29:26 PM

to wn-perl

Hi all -

I have a bunch of proposed changes to WordNet::QueryData. Jason asked
me to post them here for any feedback anyone might have. You can grab
the module with changes from:

http://conceptuary.com/wn/WordNet-QueryData-1.47-b.tar.gz

Tests are mostly untouched other than several lines to test the new
changes, so you can see they pass even with the changes. I use a bunch
of modules that depend on QueryData and my changes seem to be
compatible with them. In fact, I kept some methods (like dataPath())
that seem kinda useless within the context of QueryData, but which
have other things depending on them (like a call from WordNet::Tools,
in this case).

* added the ability for new() to take a named param list
* added a new() param "noload" to not preload index files, but to
instead use Search::Dict lookups thereafter
* added _getIndexFH() and _getDataFH() to consolidate opening and
caching of filehandles
* added _dataLookup() to consolidate reads from data files
* added _indexLookup() to consolidate reads from index files
* added _indexOffsetLookup() to consolidate offset reads from index
files
* added _parseIndexLine() to consolidate the parsing of index file
lines
* moved path data to new(), so that everything reads off of
$self->{dir}
* removed the cntlinst path special-casing
* all file opens are deferred until necessary; for noload this means
as long as possible, for caching it means during the constructor (see
_get*FH() functions)
* documented "noload" option
* loop tests again for "noload"
* cleaned up some formatting
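
To illustrate the first few items in the list, here is a minimal sketch of a constructor that accepts both a positional argument and a named parameter list, plus a helper that opens files lazily and caches the handles. The class name Lexicon::Sketch, the helper _getFH(), and the '/tmp/dict' path are hypothetical and are not taken from WordNet::QueryData; the sketch also assumes the older call style was a single positional directory argument.

```perl
use strict;
use warnings;

package Lexicon::Sketch;    # hypothetical class, not WordNet::QueryData itself

# Accept either one positional directory (assumed old style) or a named
# parameter list (new style), e.g. new(dir => $d, noload => 1).
sub new {
    my $class = shift;
    my %args  = @_ == 1 ? (dir => $_[0]) : @_;
    my $self  = bless {
        dir    => defined $args{dir} ? $args{dir} : '.',
        noload => $args{noload} ? 1 : 0,
        fh     => {},    # cache of opened filehandles, keyed by file name
    }, $class;
    return $self;
}

# Open a file under $self->{dir} on first use and reuse the handle afterwards,
# so no file is touched until something actually needs it.
sub _getFH {
    my ($self, $name) = @_;
    return $self->{fh}{$name} if $self->{fh}{$name};
    my $path = "$self->{dir}/$name";
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    return $self->{fh}{$name} = $fh;
}

package main;

my $old_style = Lexicon::Sketch->new('/tmp/dict');
my $new_style = Lexicon::Sketch->new(dir => '/tmp/dict', noload => 1);
printf "%s noload=%d\n", $_->{dir}, $_->{noload} for $old_style, $new_style;
```

Treating a single argument as the directory keeps existing positional callers working while new callers pass named options.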

The major change is the ability to pass a "noload" parameter to new().
Below is the POD blurb explaining.

I'd like to convince Jason to adopt these changes, so if anybody has
any feedback, please provide it. I have been through most of the
dependent modules on CPAN to be sure these changes are compatible.

Regards,
Danny

--------------------------

CACHING VERSUS NOLOAD

The "noload" option results in data being retrieved using a
dictionary lookup rather than caching the indexes in RAM.
This method yields an immediate startup time but *slightly* (though
less than you might think) longer lookup time. For the curious, here
are some profile data for each method on a duo core Intel Mac, in
seconds, averaged over 10000 iterations:

Caching versus noload times in seconds

                                    noload => 1    noload => 0
---------------------------------------------------------------
new()                               0.00001        2.55
queryWord("descending")             0.0009         0.0001
querySense("sunset#n#1", "hype")    0.0007         0.0001
validForms("lay down#2")            0.0004         0.0001

Obviously the new() comparison is not very useful, because nothing is
happening with the constructor in the case of noload => 1. Similarly,
lookups with caching are basically just hash lookups, and therefore
very fast. The lookup times for noload => 1 illustrate the tradeoff
between caching at new() time and using dictionary lookups.
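
As a rough illustration of what a dictionary lookup means here, the sketch below binary-searches a small sorted file with look() from the core Search::Dict module instead of reading the file into a hash first. The file contents and the "key, space, payload" line layout are made up for the example and are not the WordNet index format; lookup() is a hypothetical helper, not a QueryData method.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Search::Dict;    # core module; exports look()

# Build a tiny sorted "index" file (invented data, one "key payload" per line).
my @entries = sort ('sunset 00000004', 'hype 00000002',
                    'descending 00000001', 'sunrise 00000003');
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "$_\n" for @entries;
close $out or die "close failed: $!";

# Position the handle at the first line >= "$word " with a binary search,
# then check that the line really starts with the key. Nothing is preloaded.
sub lookup {
    my ($file, $word) = @_;
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    my $pos  = look($fh, "$word ");
    my $line = $pos >= 0 ? <$fh> : undef;
    close $fh;
    return unless defined $line && index($line, "$word ") == 0;
    chomp $line;
    return (split ' ', $line, 2)[1];
}

my $hit  = lookup($path, 'sunset');
my $miss = lookup($path, 'sunshine');
print defined $hit  ? "sunset => $hit\n"    : "sunset not found\n";
print defined $miss ? "sunshine => $miss\n" : "sunshine not found\n";
```

Each call pays for a few seeks and reads rather than a hash access, which is the per-lookup cost shown in the noload => 1 column.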

Because of the lookup speed increase when noload => 0, many users will
find it useful to set noload to 1 during development cycles, and to 0
when RAM is less of a concern than speed. The bottom line is that
noload => 1 saves you over 2 seconds of startup time, and costs you
about 0.0005 seconds per lookup.
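
The trade-off can be turned into a break-even estimate. This is only back-of-the-envelope arithmetic using the figures quoted above (2.55 s versus 0.00001 s for new(), 0.0001 s per cached lookup, and roughly 0.0005 s extra per noload lookup); total_cost() is a hypothetical helper, and real numbers will vary by machine and query mix.

```perl
use strict;
use warnings;

# Figures quoted in the blurb above.
my $startup_saved   = 2.55 - 0.00001;   # seconds saved in new() with noload => 1
my $extra_per_query = 0.0005;           # approximate extra seconds per lookup

# Number of lookups after which caching (noload => 0) is cheaper overall.
my $break_even = $startup_saved / $extra_per_query;

# Estimated total seconds for a run with $n lookups in either mode.
sub total_cost {
    my ($n, $noload) = @_;
    return $noload ? 0.00001 + $n * (0.0001 + $extra_per_query)
                   : 2.55    + $n * 0.0001;
}

printf "break-even at roughly %d lookups\n", $break_even;
printf "%7d lookups: noload=1 %.2fs, noload=0 %.2fs\n",
    $_, total_cost($_, 1), total_cost($_, 0) for 100, 5_000, 300_000;
```

Under these assumptions, a short-lived script making a few hundred queries comes out ahead with noload => 1, while a long batch job is better served by the default caching.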

Ted Pedersen

Mar 15, 2009, 9:22:13 PM

to wn-...@googlegroups.com

Greetings all,

I did some profiling of the proposed changes to QueryData. I used one
of the test files in WordNet::Similarity (t/pairs.t) ... I ran the
following using the old (currently released) and new (proposed)
versions of QueryData...

perl -d:DProf t/pairs.t
dprofpp -u

We do see something of a slowdown with the new changes with even a
single test file, and so I'd be a bit concerned about having the
"noload" incorporated as a default (since most users of
WordNet::Similarity simply install QueryData without making
modifications, etc.)

My own experience of QueryData has been that typically there is one
load followed by lots of queries, and so loading the index has
generally been the right thing to do (and the cost in terms of RAM at
least is fairly negligible).

I suppose the goal in general is to "make the common case fast" - the
question then is whether or not a load followed by a few queries or a
load followed by many queries is the common case...from the point of
view of WordNet::Similarity at least, the common case is a single load
followed by many queries, so I think we'd be happiest if that remained
the default behavior of WordNet::QueryData.

The profiling results...

The old (currently released) version...

Total Elapsed Time = 78.50668 Seconds
User Time = 71.62668 Seconds
Exclusive Times
%Time ExclSec  CumulS  #Calls sec/call Csec/c Name
 34.9   25.01  31.700  295828   0.0001 0.0001 WordNet::QueryData::getSensePointers
 12.7   9.133  42.404  304699   0.0000 0.0001 WordNet::QueryData::querySense
 11.0   7.929   9.715  297028   0.0000 0.0000 WordNet::QueryData::getSense
 4.70   3.368   3.368  851828   0.0000 0.0000 WordNet::QueryData::lower
 4.62   3.312   4.222  199220   0.0000 0.0000 WordNet::QueryData::offset
 3.85   2.760   2.760       3   0.9200 0.9200 WordNet::Similarity::ICFinder::configure
 3.82   2.734  42.626   63623   0.0000 0.0007 WordNet::Similarity::hso::_getDownwardOffsetsPOS
 3.70   2.652   2.793   23376   0.0001 0.0001 WordNet::QueryData::getWordPointers
 3.55   2.540   2.540       2   1.2700 1.2700 WordNet::Similarity::DepthFinder::_processSynsetsFile
 3.52   2.520   2.520       1   2.5200 2.5200 WordNet::QueryData::loadIndex
 2.90   2.080   3.110       9   0.2311 0.3455 WordNet::Tools::new
 2.68   1.920  50.714  189836   0.0000 0.0003 WordNet::Similarity::hso::_medStrong
 1.44   1.030   1.030      36   0.0286 0.0286 WordNet::QueryData::listAllWords
 1.37   0.979   0.979  301422   0.0000 0.0000 WordNet::QueryData::delMarker
 1.20   0.860   3.740   23376   0.0000 0.0002 WordNet::QueryData::queryWord

The new (proposed) version ....

Total Elapsed Time = 205.4782 Seconds
User Time = 96.26824 Seconds
Exclusive Times
%Time ExclSec  CumulS  #Calls sec/call Csec/c Name
 36.2   34.92  43.965  295828   0.0001 0.0001 WordNet::QueryData::getSensePointers
 12.6   12.20  58.032  304699   0.0000 0.0002 WordNet::QueryData::querySense
 11.2   10.86  13.387  297028   0.0000 0.0000 WordNet::QueryData::getSense
 4.41   4.244   4.244  851828   0.0000 0.0000 WordNet::QueryData::lower
 4.33   4.171   5.401  199220   0.0000 0.0000 WordNet::QueryData::offset
 4.12   3.967  58.153   63623   0.0001 0.0009 WordNet::Similarity::hso::_getDownwardOffsetsPOS
 3.81   3.670   3.670       3   1.2233 1.2233 WordNet::Similarity::ICFinder::configure
 3.70   3.566   3.702   23376   0.0002 0.0002 WordNet::QueryData::getWordPointers
 3.45   3.320   3.320       1   3.3200 3.3200 WordNet::QueryData::loadIndex
 3.34   3.220   3.220       2   1.6100 1.6100 WordNet::Similarity::DepthFinder::__processSynsetsFile
 2.98   2.865  69.397  189836   0.0000 0.0004 WordNet::Similarity::hso::_medStrong
 2.84   2.730   3.870       9   0.3033 0.4300 WordNet::Tools::new
 1.45   1.399   1.399  301422   0.0000 0.0000 WordNet::QueryData::delMarker
 1.16   1.120   1.120      36   0.0311 0.0311 WordNet::QueryData::listAllWords
 1.10   1.055   4.895   23376   0.0000 0.0002 WordNet::QueryData::queryWord

We ran into this same question of what to do about load times some
time ago, with the command line interface to WordNet::Similarity
(similarity.pl). We created an interactive mode for similarity.pl that
loads the database once and then lets the user proceed in an
interactive session without reloading. We also created a --file option
that lets a user load a number of different pairs of concepts all at
once (and thereby only requires a single load of the database...) So,
I'm wondering if there might not be a solution like that which would
work for a command line oriented application?
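
The load-once idea can be sketched as follows. This is not similarity.pl's code: load_database() is a hypothetical stand-in for an expensive one-time load, the word list is invented, and the batch is read from an in-memory filehandle so the example runs on its own.

```perl
use strict;
use warnings;

my $loads = 0;

# Stand-in for an expensive one-time load (hypothetical; a real program
# would construct its WordNet objects here instead).
sub load_database {
    $loads++;
    return { map { $_ => 1 } qw(cat dog sunset sunrise) };
}

# Read "word1 word2" pairs from a filehandle and answer every pair against
# the already-loaded data, so the load cost is paid once for the whole batch.
sub process_pairs {
    my ($db, $in) = @_;
    my @results;
    while (my $line = <$in>) {
        my ($w1, $w2) = split ' ', $line;
        next unless defined $w2;    # skip blank or malformed lines
        my $known = (exists $db->{$w1} && exists $db->{$w2}) ? 1 : 0;
        push @results, "$w1 $w2 $known";
    }
    return @results;
}

my $db    = load_database();
my $batch = "cat dog\nsunset hype\n\nsunrise sunset\n";
open(my $in, '<', \$batch) or die "Cannot open in-memory file: $!";
my @results = process_pairs($db, $in);
close $in;

print "$_\n" for @results;
print "loads: $loads\n";
```

The same loop works for an interactive session by passing \*STDIN instead of the in-memory handle.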

Anyway, WordNet::Similarity is here :
http://search.cpan.org/dist/WordNet-Similarity/

and similarity.pl is here...
http://search.cpan.org/dist/WordNet-Similarity/utils/similarity.pl

Just some thoughts, very much from the WordNet::Similarity point of
view...I'll be curious to hear what other users have to say about how
they are using QueryData, and what they might have done about this
database load issue in the past (and what they think about the
future...)

Thanks!
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

Danny Brian

Mar 15, 2009, 11:59:44 PM

to wn-...@googlegroups.com

> My own experience of QueryData has been that typically there is one
> load followed by lots of queries, and so loading the index has
> generally been the right thing to do (and the cost in terms of RAM at
> least is fairly negligible).

Ted -

With so many dependent tools I wouldn't propose changing the defaults,
even *if* the common case were wanting a shorter startup time (and I
don't think it is), which is why I've retained the default behavior
and (hopefully) all compatibility. However, in a development
environment, an immediate start-up with query times that are still
under hundredths of a second is a highly useful option.

As for RAM, it's not negligible in all apps. It takes about 20MB VRAM
more to load the indexes into memory on my own system. I'm simply
asking for users to have the option of the trade-off. I for one use
many tools that don't have the luxury of "sticking around"
indefinitely like a web app, or waiting 2-3 seconds to start up.

I'd also argue that one of the values of an embedded database like
WordNet in the first place is not having to cache the indexes. It's a
strength of the wn tools IMO, not a deficiency.

Regards,
Danny

Jason Rennie

Mar 16, 2009, 10:26:15 AM

to wn-...@googlegroups.com

To add my own 2c, I personally very much like Danny's "noload" contribution. The default is the same as it was before, but by adding extra constructor arguments, you can have it run off disk, making development/debugging iterations faster, as Danny notes.

Ted, considering that QueryData's default behavior should not change, what do you think?

Jason

--
Jason Rennie
Research Scientist, ITA Software
http://www.itasoftware.com/

Ted Pedersen

Mar 16, 2009, 10:55:10 AM

to wn-...@googlegroups.com

Hi Jason and Danny,

I agree, my main concern really was any change in default behavior. If
that's not the case, then I think this would work out quite nicely!

Thanks!
Ted

Jason Rennie

Mar 20, 2009, 9:16:42 PM

to wn-...@googlegroups.com

Danny's changes have just been released as QueryData 1.48. It's available at

http://people.csail.mit.edu/jrennie/WordNet/

I also fixed a bug in the WNSEARCHDIR handling which caused it to not work properly on un*x. I also submitted to CPAN, but it will probably take a few hours to show up.

Thanks to Danny for his efforts. Enjoy!

Jason
