So I've found the problem. What I thought were indexes are actually
offsets: each synset is identified by an 8-digit number giving the
byte position of its data in the data.*pos* file. So adding
characters to a line in the data file means that every synset
following that line needs a new offset, increased by the number of
bytes added.
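To make that concrete, here is a minimal sketch (the `read_synset` helper is mine, and it assumes a local data.noun file) showing that the 8-digit ID is literally a byte position you can seek() to:

```python
def read_synset(path, offset):
    """Read one synset line from a data.* file; the synset's
    8-digit ID is its byte offset in the file."""
    with open(path, "rb") as f:
        f.seek(offset)                     # jump straight to the synset
        return f.readline().decode("ascii").rstrip("\n")

# e.g. read_synset("data.noun", 1740) should return the 'entity'
# line in a stock WordNet database.
```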
Using my previous example: if you add the word 'thingy' to the
'entity' synset, you insert 'thingy 0' into that line (and bump its
word count from 01 to 02), then add len(' thingy 0') = 9 to the
offsets of all succeeding lines. i.e.
00001740 03 n 01 entity 0 003 ~ 00001930 n ............
00001930 03 n 01 physical_entity 0 007 @ 00001740 n ............
00002137 03 n 02 abstraction 0 abstract_entity 0 010 @ 00001740 ............
.
.
.
goes to
00001740 03 n 02 entity 0 thingy 0 003 ~ 00001939 n ............
00001939 03 n 01 physical_entity 0 007 @ 00001740 n ............
00002146 03 n 02 abstraction 0 abstract_entity 0 010 @ 00001740 ............
.
.
.
because nine bytes ('thingy 0' plus its separating space) were
inserted ahead of them. Note that pointer targets shift too: the
entity line's '~ 00001930' becomes '~ 00001939'. And, of course,
these new offsets must be re-defined in the index.*pos* file and the
index.sense file.
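The index side at least looks mechanical: once you know the old-to-new offset mapping, every offset in the index files can be patched in one pass. A rough sketch (the `remap_index` helper is mine, and it assumes offsets are the only 8-digit tokens on index.* and index.sense lines):

```python
import re

def remap_index(index_lines, mapping):
    """Given an old->new offset mapping (8-digit strings on both
    sides), patch every synset offset appearing in index.* or
    index.sense lines.  Leans on the assumption that offsets are
    the only 8-digit fields in those files."""
    fix = lambda m: mapping.get(m.group(0), m.group(0))
    return [re.sub(r"\b\d{8}\b", fix, line) for line in index_lines]
```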
Coming up with an automated way to do this could be hard. Adding a
new synset should be relatively easy: you just append a line to the
end of the data.*pos* file and insert a line into the index files at
the right alphabetical position (the search tools binary-search the
index files, so those do have to stay sorted). However, adding new
words to an existing synset is very much non-trivial, since the
changes must be cascaded through every line after the one that is
edited.
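Here's as far as I've gotten: since the whole file has to be rewritten anyway, a two-pass streaming rewrite seems workable even at this size. A rough sketch (all helper names are mine; it assumes the wndb layout `offset lex_filenum ss_type w_cnt word lex_id ... p_cnt ...` with a 2-digit hex word count, and it naively ignores cross-POS pointers and any 8-digit numbers inside glosses):

```python
import re

def add_word(lines, header_len, target, word, lex_id="0"):
    """Add `word` to the synset whose current offset is `target`,
    then cascade new offsets through `lines` (the data.* lines with
    the copyright header stripped; `header_len` is the header's size
    in bytes).  Returns (new_lines, old->new offset map)."""
    edited, mapping, pos = [], {}, header_len
    # Pass 1: edit the target line and assign every line its new offset.
    for line in lines:
        fields = line.split(" ")
        if int(fields[0]) == target:
            w_cnt = int(fields[3], 16)        # word count is 2-digit hex
            fields[3] = "%02x" % (w_cnt + 1)
            at = 4 + 2 * w_cnt                # just past the last word/lex_id pair
            fields[at:at] = [word, lex_id]
            line = " ".join(fields)
        mapping[fields[0]] = "%08d" % pos
        edited.append(line)
        pos += len(line)
    # Pass 2: rewrite every 8-digit offset (line IDs and pointer targets).
    fix = lambda m: mapping.get(m.group(0), m.group(0))
    return [re.sub(r"\b\d{8}\b", fix, ln) for ln in edited], mapping
```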
Does anyone have any tips on how to automate this in an efficient way?
The data.noun file is 15MB!
Thanks,
Ben