Design decisions required for IEEE OUI and IAB lookups

David Moss

Oct 2, 2008, 9:06:39 AM
to netaddr
Overview

I'm working on a feature for netaddr that I've had in mind since its
inception. I've left it until now as it isn't a massively important
feature, but it will have some implications for the library's future
which I'd like to air here. As user requests for this are now coming
in, it is time to get something coded up and out the door.

For the uninitiated, MAC addresses contain useful information allowing
you to identify the specific vendor who requested either a 24-bit OUI
(Organisationally Unique Identifier) or 36-bit IAB (Individual Address
Block) from the IEEE.

So, if you extract the OUI "00-16-3E" from a MAC address, you can be
fairly certain the host the interface belongs to is virtual and is
running on a Xen virtual server (unless the person who set it up has
overridden the default configuration). There are many other instances
where this kind of information would be useful, e.g. in hardware asset
databases, network packet sniffers, etc.
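
As a rough illustration (plain Python only, not the eventual netaddr
API), extracting the OUI prefix from a MAC address string boils down
to normalising the separators and keeping the first three octets:

    # A minimal sketch (not the eventual netaddr interface): pull the
    # 24-bit OUI prefix out of a colon- or hyphen-separated MAC address.
    def extract_oui(mac):
        octets = mac.replace(':', '-').upper().split('-')
        return '-'.join(octets[:3])

    print(extract_oui('00:16:3e:5d:2a:01'))   # -> 00-16-3E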

The Problem

I've been hemming and hawing about exactly how to implement it.

I need to make 100,000+ records from published IEEE data files
available in lookup form through each EUI object's interface. The real
question is deciding on a way to do this that is reasonably quick and
resource efficient. Because netaddr is a library, rather than a
framework/tool/application, I want to keep things as simple as
possible with the minimum of overhead.

The Solution

I have considered several options and currently prefer option 4. The
others are no less relevant; they just have differing costs associated
with them.

1) Write a script to auto-generate a Python module that contains all
the information required.

Pros

- Easy to include in netaddr's namespace
- Easy to deploy
- Pure Python, no dependencies
- Quick and easy to set up

Cons

- Hogs memory in the Python interpreter once imported (more than
doubles netaddr's memory footprint)
- High start-up cost: slows down imports in client code (annoying
for users creating small, sharp CLI tools)
- Users who don't use the functionality will take an unnecessary
hit

2) Parsing the raw IEEE data files for each query.

Pros
- Quick and easy to set up
- Minimal code required
- Memory efficient (if caching of data is omitted)

Cons
- Slow, especially when performing multiple reads
- If data is cached it becomes a memory hog
- Where to store the data files? Differs from platform to platform

3) Parse and store IEEE information in a custom data file format for
easy and quick searching (fixed-width records)

Pros
- Pure Python
- Performance should be pretty good
- Memory efficient

Cons
- A lot of work!
- Wasteful of disk space (fixed-width record storage is not
particularly efficient)
- Difficult to modify and maintain if other data needs to be
stored in future (such as IANA information for IP addresses)
- Where to store the data files? Differs from platform to platform

4) Including a pre-built SQLite database (plus populating code) with
each distribution of netaddr that can be queried and updated easily
(see the sketch after this list).

Pros
- Memory efficient
- Fast!
- Easy to update
- Easy to change and maintain (database schemas etc)
- Powerful querying capability

Cons
- Adds a SQLite dependency to netaddr. Fine for Python 2.5 onwards
but a burden for users of 2.4 and lower.
- Where to store the data files? Differs from platform to platform
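
To make option 4 concrete, here is a minimal sketch of what the
bundled database and a lookup against it might look like. The schema,
table and column names are illustrative assumptions only, not a
committed design:

    # Sketch of option 4 (schema and sample record are assumptions).
    # Requires the sqlite3 module, bundled with Python 2.5 onwards.
    import sqlite3

    conn = sqlite3.connect('ieee.db')
    conn.execute("""
        CREATE TABLE IF NOT EXISTS oui (
            id  INTEGER PRIMARY KEY,    -- 24-bit OUI as an integer
            org TEXT NOT NULL           -- registering organisation
        )
    """)
    conn.execute("INSERT OR REPLACE INTO oui VALUES (?, ?)",
                 (0x00163E, 'Xensource, Inc.'))
    conn.commit()

    row = conn.execute("SELECT org FROM oui WHERE id = ?",
                       (0x00163E,)).fetchone()
    print(row[0])   # -> Xensource, Inc.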

Conclusion

The benefits of using SQLite are substantial. I could (via some clever
importing) make SQLite an optional requirement at the expense of some
functionality: no SQLite == no lookups.

Anyone got any strong feelings about having a SQLite dependency? As it
is already "batteries included" from 2.5 onwards things are only going
to get better as time goes on and older Python versions get replaced.

The other concern is where to store the included data files, as this
varies by platform. You also cannot assume that the files or databases
will be read-writeable.

How does option 4 sound? Am I muddying the purity of the library by
doing this?

Comments, suggestions and *constructive* criticism welcome ;-)

Dave M.

Duncan McGreggor

Oct 2, 2008, 12:57:44 PM
to net...@googlegroups.com

Hrm, well I would shy away from a SQLite db initially. I'd implement
option 2) first instead. Once that was done and people were using it,
I'd implement 4) as an optional add-on, disabled by default.

d

David Moss

Oct 7, 2008, 9:53:59 AM
to netaddr
Thanks Duncan.

Definitely a great idea to implement a safe fallback default (option
2). It's much better to have the functionality available even if it is
a lot slower.

I have another, slightly less controversial alternative to SQLite: the
shelve standard library module. This could either replace the SQLite
option or live alongside it as a fast, memory-efficient alternative.
I've read that bsddb support may be dropped in Py3K, so again, option 2
will remain a mandatory fallback of last resort.
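
For reference, shelve usage is about as simple as persistence gets. A
minimal sketch (the key scheme and record layout are assumptions,
purely for illustration):

    # Sketch of the shelve alternative: dict-like persistence from the
    # standard library, backed by anydbm/bsddb. Keys and record layout
    # are assumptions for illustration only.
    import shelve

    db = shelve.open('oui_records')
    db['00-16-3E'] = {'org': 'Xensource, Inc.'}   # store a parsed record
    db.close()

    db = shelve.open('oui_records')
    print(db['00-16-3E']['org'])                  # -> Xensource, Inc.
    db.close()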

nna...@gmail.com

Oct 9, 2008, 12:22:54 AM
to netaddr
> I have another, slightly less controversial alternative to SQLite: the
> shelve standard library module. This could either replace the SQLite
> option or live alongside it as a fast, memory-efficient alternative.
> I've read that bsddb support may be dropped in Py3K, so again, option 2
> will remain a mandatory fallback of last resort.

This sounds like an excellent option.
But I'm also biased, since the sqlite module wasn't compiled with my
Python 2.5.

Let us know if you'd like any assistance such as testing, etc. I
really like this project and I would like to do whatever I can to help
improve the quality of the code (although you guys seem to be doing a
superb job on your own).

David Moss

Oct 9, 2008, 12:35:46 PM
to netaddr
On Oct 9, 5:22 am, nnat...@gmail.com wrote:
> This sounds like an excellent option.
> But I'm also biased since sqlite module wasn't compiled with my python
> 2.5.

Useful to know. SQLite might not be such a hot idea after all ...

> Let us know if you'd like any assistance such as testing, etc. I
> really like this project and I would like to do whatever I can to help
> improve the quality of the code (although you guys seem to be doing a
> superb job on your own).

Patches are always welcome!

Some testing of this feature will be very handy when I can get it
checked in, because it makes quite a few assumptions about the Python
setup it runs on. Hopefully they are all valid assumptions ;-)

David Moss

Jan 22, 2009, 4:09:39 PM
to netaddr
Just thought I'd finish off this thread with a bit of detail on the
final solution I chose for this problem, now part of release 0.6.

In the end, I went with the 'none of the above' option ;-)

I spent a lot of time and effort trying to get a shelve-based solution
to work. It parsed the data files and persisted all the records
discovered in shelve objects (a dict-like interface with a Berkeley DB
backend), keyed by the OUI/IAB record id. This turned out to be an
extremely bad idea that wasted a lot of precious development time.

Bad points :-

- created unnecessary dependencies between netaddr and Berkeley DB
support in Python (either as a builtin or 3rd party). It suffered from
the same issues as the SQLite option in this regard.
- added a lot of weight to the netaddr release tarballs and repository
checkins. The bsddb cache files were 3-4 times larger than the actual
data files as they were effectively pickled Python data structures.
They were binary blobs in the repository too, so subversion had to send
full copies across the wire when the files changed - YUCK!
- a really nasty exit message was being printed to stdout from deep
inside some Python or Berkeley DB C code every time the netaddr module
was being unloaded. I wasn't going to spend time trying to hunt that
one down!

So, I chose a totally different approach, and this is how it works now.

If the netaddr.eui module is run as a script, it parses the IEEE data
files looking for information to populate a couple of index files, one
for OUIs and one for IABs. Each index is a basic CSV file whose records
contain three fields :-

- the OUI/IAB record id as an integer
- the OUI/IAB record offset from file start
- the size of the record

The index generator performs in O(N) time but only needs to be run
very infrequently, usually whenever the information in the main data
files changes (possibly once or twice annually). This never needs to be
done when netaddr is being used in its default mode.
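
In outline, the index generator looks something like the following
Python 2-era sketch. The '(hex)' marker test and the file names are
assumptions about the raw IEEE file format, and the real code handles
more cases than this:

    # Simplified sketch of the index generator: scan the raw IEEE data
    # file and emit a CSV index of (record id, offset, size) rows.
    import csv

    def build_index(data_path, index_path):
        records = []                  # (oui, offset, size) tuples
        pos = 0                       # running byte offset into the file
        oui = offset = size = None
        for line in open(data_path, 'rb'):
            if '(hex)' in line:       # marks the start of a new record
                if oui is not None:
                    records.append((oui, offset, size))
                oui = int(line.split('(hex)')[0].strip()
                              .replace('-', ''), 16)
                offset, size = pos, len(line)
            elif oui is not None:
                size += len(line)     # record continues on this line
            pos += len(line)
        if oui is not None:
            records.append((oui, offset, size))
        csv.writer(open(index_path, 'wb')).writerows(records)

    build_index('oui.txt', 'oui.idx')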

If the netaddr.eui module is loaded via an import (the main usage), it
only loads the index files at module start-up, populating a lookup
dictionary in memory. Whenever a user calls the .info() method on an
EUI object :-

- an OUI and/or IAB object is created
- this object consults the in-memory index looking for its record id.
This performs in O(1) time.
- it opens the relevant IEEE OUI or IAB data file, seeks to the
required file offset and reads in a single OUI/IAB raw text record
using the size field from the index.
- the OUI/IAB object parses that single raw text record to extract the
required information and exposes it to the end user as its object
attributes.
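
The lookup side is correspondingly small. Something along these lines
(again a sketch; the file names are the same assumptions as above, and
the real OUI class does more parsing):

    # Sketch of the lookup path: load the CSV index into a dict once at
    # import time, then seek straight to a record on demand.
    import csv

    def load_index(index_path):
        index = {}
        for oui, offset, size in csv.reader(open(index_path, 'rb')):
            index[int(oui)] = (int(offset), int(size))
        return index

    OUI_INDEX = load_index('oui.idx')    # done once, at module import

    def raw_record(oui, data_path='oui.txt'):
        offset, size = OUI_INDEX[oui]    # O(1) dict lookup
        fh = open(data_path, 'rb')
        fh.seek(offset)                  # jump straight to the record
        record = fh.read(size)
        fh.close()
        return record

    print(raw_record(0x00163E))          # raw IEEE text for 00-16-3E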

I think this approach ends up being very elegant :-

- zero dependencies (pure-Python)
- relatively low memory footprint (only stores the index lookup dict,
not full data records)
- relatively quick startup time and good seek time for individual
record lookups

From start to finish this took a lot longer to complete than I ever
expected, but I am very happy with the results and I've learned a lot
of useful things in the process.

By contrast, the IANA data files are parsed when the netaddr.ip
module is first imported. There is no need to create indices because
of the relatively small amount of data being processed and stored. In
future netaddr releases I hope to store these address ranges in memory
using a more efficient nested tree structure rather than a list, to cut
down on the number of comparisons (currently O(N) time). However, with
the sequence being relatively short it isn't much of an issue at
present and doesn't warrant time spent optimizing it. Issue 13 in the
bug tracker, when solved, should also help here at some point.
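
As a sketch of the kind of improvement meant here, keeping the ranges
sorted by start address lets the stdlib bisect module replace the
linear scan with a binary search. This is just one possible approach
(not necessarily the tree structure that eventually lands), and the
sample ranges are illustrative:

    # Sketch: binary search over non-overlapping (start, stop, info)
    # ranges sorted by start address, via the stdlib bisect module.
    import bisect

    RANGES = [
        (0x0A000000, 0x0AFFFFFF, '10/8 Private-Use Networks'),
        (0xC0A80000, 0xC0A8FFFF, '192.168/16 Private-Use Networks'),
    ]
    STARTS = [r[0] for r in RANGES]

    def lookup(addr):
        i = bisect.bisect_right(STARTS, addr) - 1  # O(log N) comparisons
        if i >= 0 and addr <= RANGES[i][1]:
            return RANGES[i][2]
        return None

    print(lookup(0xC0A80101))   # 192.168.1.1 -> '192.168/16 ...'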
