Hrm, well I would shy away from a SQLite db initially. I'd implement
option 2) first instead. Once that was done and people were using it,
I'd implement 4) as an optional add-on, disabled by default.
In the end, I went with the 'none of the above' option ;-)
I spent a lot of time and effort trying to get a shelve module based
solution to work. It parsed the data files and persisted all the
records discovered in shelve objects (dict-like interface with
Berkeley DB backend) keyed by the OUI/IAB record id. This turned out
to be an extremely bad idea that wasted a lot of precious development
time.
Bad points :-
- created unnecessary dependencies between netaddr and Berkeley DB
support from Python (either as builtin or 3rd party). Suffered from
the same issues as the SQLite option in this regard.
- added a lot of weight to the netaddr release tarballs and repository
checkins. The bsddb cache files were 3-4 times larger than the actual
data files as they were effectively pickled Python data structures.
They were binary blobs in the repository too, so subversion had to
send full copies across the wire whenever the files changed - YUCK!
- a really nasty exit message was being printed to stdout from deep
inside some Python or Berkeley DB C code every time the netaddr module
was being unloaded. I wasn't going to spend time trying to hunt that
one down!
So, I chose a totally different approach and this is how it works now.
If the netaddr.eui module is run as a script, it parses the IEEE data
files looking for information to populate a couple of index files, one
for OUIs and one for IABs. Each index is a basic CSV file whose
records contain 3 fields :-
- the OUI/IAB record id as an integer
- the OUI/IAB record offset from file start
- the size of the record
The index generator runs in O(N) time but only needs to be run very
infrequently, usually whenever the information in the main data files
changes (possibly once or twice annually). This never needs to be done
when netaddr is being used in its default mode.
If the netaddr.eui module is loaded via an import call (main usage),
it only loads the index files on module start up, populating a lookup
dictionary in memory. Whenever a user calls the .info() method on an
EUI object :-
- an OUI and/or IAB object is created
- this object consults the in-memory index looking for its record id.
This performs in O(1) time.
- it opens the relevant IEEE OUI or IAB data file, seeks to the
required file offset and reads in a single OUI/IAB raw text record
using the size field from the index.
- the OUI/IAB object parses the single raw text record to extract the
required information and exposes it to the end user as its object
attributes.
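The lookup steps above can be sketched like this. The class and attribute names here are illustrative and do not match netaddr's actual internals; it assumes the in-memory index is a dict mapping record ids to (offset, size) pairs, as described above.

```python
import io

class OUIRecord:
    """Hypothetical sketch of the seek-and-read lookup path."""

    def __init__(self, oui_int, index, data_file):
        offset, size = index[oui_int]   # O(1) dict lookup in the index
        data_file.seek(offset)          # jump straight to the record
        raw = data_file.read(size)      # read only this one record
        # parse the single raw text record into object attributes
        lines = raw.decode('ascii', 'replace').splitlines()
        self.org = lines[0].split(None, 2)[-1]
        self.address = [ln.strip() for ln in lines[1:] if ln.strip()]
```

Only one small record is ever read and parsed per lookup, which is what keeps the memory footprint low while still giving fast access to individual records.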
I think this approach ends up being very elegant :-
- zero dependencies (pure-Python)
- relatively low memory footprint (only stores the index dict lookup,
not full data records)
- relatively quick startup time and good seek time for individual
record lookups
From start to finish this took a lot longer to complete than I ever
expected but I am very happy with the results and I've learned a lot
of useful things in the process.
By contrast, the IANA data files are parsed every time the netaddr.ip
module is first imported. There is no need to create indices because
of the relatively small amount of data being processed and stored. In
future netaddr releases I hope to store these address ranges in memory
using a more efficient nested tree structure rather than a list to cut
down on the number of comparisons (currently O(N) time). However, as
the sequence is relatively short, it isn't much of an issue at present
and doesn't warrant time spent optimizing it. Issue 13 in the bug
tracker, once resolved, should help here at some point.
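One way the O(N) scan could be cut down, sketched here purely as an illustration (this is not netaddr code, and it assumes the address ranges are sorted and non-overlapping): keep the ranges sorted by start address and use binary search, giving O(log N) lookups.

```python
import bisect

def find_range(ranges, addr):
    """Look up addr in a sorted list of (start, end, info) tuples.

    Assumes ranges are sorted by start and do not overlap. In practice
    the list of start addresses would be precomputed once rather than
    rebuilt on every call, as is done here for brevity.
    """
    starts = [r[0] for r in ranges]
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0:
        start, end, info = ranges[i]
        if start <= addr <= end:
            return info
    return None
```

A nested tree structure, as mentioned above, would generalize this to hierarchical ranges; a flat sorted list with bisection is the simplest variant of the same idea.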