I also tried pulling the latest code from master and got the same
results. Since we're both using the TIGER 2008 data, I think it's
probably the shp2sql, like you said. I use the Makefile.nix when
running make.
How do I know my shp2sql is correct?
Thanks,
Andrew
Did you ever get a resolution to your issue? I'm encountering the
same numbers that you mentioned when I build the Geocommons system on
a Fedora 11 system using the TIGER 2009 data (with a comparable set of
packages to those that you listed). Any thoughts would be appreciated.
Thanks,
Bill
On Dec 17 2009, 10:22 pm, Andrew Hallock <andrew.hall...@gmail.com> wrote:
The import process is producing a metaphone (street_phone) value that
includes the street name and suffix (Rd, St, Ave, etc.). The geocoding
functions only use a street name metaphone and don't include the
suffix. I'm pretty sure the Address object strips off the suffix and
that's what the geocoding functions are using. The import script is
pretty much just straight SQL and doesn't use the same logic to
produce the metaphone. Also, the rebuild_metaphones script is only
rebuilding the city_phone and not the street_phone.
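To make the mismatch concrete, here is a toy illustration using the
Text::Metaphone gem (an assumption on my part -- the geocoder actually
ships its own metaphone as a C SQLite extension, which may differ in
detail):

require 'text'

# name + suffix, roughly as the import's SQL produces it
import_side = Text::Metaphone.metaphone('main st')[0, 5]
# bare name, as the Address object produces it at query time
query_side = Text::Metaphone.metaphone('main')[0, 5]

# If the feature table stores the first form but the query computes the
# second, WHERE street_phone IN (metaphone(?,5)) can never match.
puts "import stores:  #{import_side.inspect}"
puts "query computes: #{query_side.inspect}"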
I may be completely wrong, but that's what I remember running into
about a month ago. Kate has been very helpful and may have an
uncommitted fix for this. I know she was planning on doing a full
import of the 2009 TIGER data using the current code base. If I find
the solution, I will definitely post it here. If anyone else has some
Ruby experience and can dig into this, I would be interested in the
solution.
-Nick
Thank you for your persistence in tracking this down--I was bitten by
it as well. I'd love to get a 2009 geocoder built (I'm presently
using an old checkout + 2008 data), but I'm afraid I don't quite see
how to apply the fix you identified. If it's not too much trouble,
could you please give me a hint as to how to patch my checkout of
trunk in the way you suggest?
Thanks again,
David
On Jan 7, 1:12 pm, Kate Chapman <k8chap...@gmail.com> wrote:
> Nick,
>
> That is correct. At some point I imported an additional column without the
> suffix that was just street names. I must have committed the wrong file at
> some point in an edit or something silly.
>
> My suggestion would be to add the additional column to the import process to
> create the metaphones.
>
> -Kate
>
The older script is:

#!/usr/bin/ruby
require 'geocoder/us/import/tiger'

# the bundled SQL lives in ../sql relative to this script
base = File.join(File.dirname(__FILE__), "..", "sql")

# ARGV[0] is the target database; the remaining args are TIGER directories
db = Geocoder::US::Import::TIGER.new(ARGV[0], :sql => base)
ARGV[1..-1].each do |path|
  db.import_tree path
end
db.log "creating index"
#db.create_index
The command that I used was:

bin/tiger_import district_of_columbia.db TIGER2008/11_DISTRICT_OF_COLUMBIA
I don't have a fix for the import script. I tried modifying the custom
SQLite function that generates the metaphone value, but it didn't like
me using the Address object, and I'm not very familiar with Ruby, so it
was an uphill battle for me. If I do find a fix, I'll post it in this
forum.
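In the meantime, Kate's suggestion upthread (build the metaphones from
a suffix-free street name) could be approximated with a rebuild pass
after import. A sketch only: the suffix list below is a made-up sample,
and it assumes the geocoder's SQLite extension providing metaphone()
has been loaded into the connection.

#!/usr/bin/ruby
# Recompute street_phone from a suffix-stripped street name so the data
# matches what the Address object feeds to metaphone() at query time.
require 'sqlite3'

SUFFIX = /\s+(rd|st|ave?|blvd|dr|ln|ct|pl)\.?\s*$/i # toy list, not exhaustive

db = SQLite3::Database.new(ARGV[0])
rows = db.execute('SELECT rowid, street FROM feature')
rows.each do |rowid, street|
  name = street.to_s.sub(SUFFIX, '')
  db.execute('UPDATE feature SET street_phone = metaphone(?, 5) WHERE rowid = ?',
             name, rowid)
end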
Dan's suggestion that you use the old Ruby import script might work
but I think there was a reason they moved away from that. It may have
just been memory usage, in which case it might still work just fine.
-Nick
The output that I generated when trying to geocode the White House was
the following, which makes me believe that the metaphones are OK and
that the correct street is being selected from all of the possible
candidates. I'm assuming that the data points at the end of this
output should be the intermediate lat/long points on Pennsylvania Ave
(in this case), and they do seem a bit off from what they should be.
Can anyone on this list confirm my interpretation of the geographic
points and/or that their output is the same (once the debug variable
is set)?
>> p db.geocode("1600 Pennsylvania Av, Washington DC")
ADDR: #<Geocoder::US::Address:0xb6488094 @street=["pennsylvania av",
"pennsylvania avenue", "pennsylvania ave"], @plus4="",
@full_state="dc", @sufnum="", @state="DC", @city=["washington"],
@prenum=nil, @number="1600", @zip="", @text="1600 Pennsylvania Av,
Washington DC">
addr=#<Geocoder::US::Address:0xb6488094 @street=["pennsylvania av",
"pennsylvania avenue", "pennsylvania ave"], @plus4="",
@full_state="dc", @sufnum="", @state="DC", @city=["washington"],
@prenum=nil, @number="1600", @zip="", @text="1600 Pennsylvania Av,
Washington DC">
SQL : SELECT *, levenshtein(?, city) AS city_score
FROM place WHERE city_phone IN (metaphone(?,5)) AND
state = ? order by priority desc LIMIT 1;
EXEC: ["washington", "washington", "DC"]
ROWS: 1 (0.037s)
street parts = ["pennsylvania"]
SQL :
SELECT feature.*, levenshtein(?, street) AS street_score
FROM feature
WHERE street_phone IN (metaphone(?,5)) AND feature.zip IN
(?)
EXEC: ["pennsylvania av", "pennsylvania", "20014"]
ROWS: 0 (0.001s)
zip results 2
EXEC: ["pennsylvania av", "pennsylvania", "200%"]
ROWS: 9 (0.006s)
candidates=
[{:paflag=>"P", :street_score=>"0.176470588235294", :zip=>"20052", :street=>"Pennsylvania
Ave NW", :street_phone=>"PNSLF", :fid=>"909478"},
{:paflag=>"P", :street_score=>"0.176470588235294", :zip=>"20037", :street=>"Pennsylvania
Ave NW", :street_phone=>"PNSLF", :fid=>"909477"},
{:paflag=>"P", :street_score=>"0.176470588235294", :zip=>"20020", :street=>"Pennsylvania
Ave SE", :street_phone=>"PNSLF", :fid=>"909481"},
{:paflag=>"P", :street_score=>"0.176470588235294", :zip=>"20006", :street=>"Pennsylvania
Ave NW", :street_phone=>"PNSLF", :fid=>"909476"},
{:paflag=>"A", :street_score=>"0.176470588235294", :zip=>"20005", :street=>"Pennsylvania
Ave NW", :street_phone=>"PNSLF", :fid=>"909473"},
{:paflag=>"P", :street_score=>"0.176470588235294", :zip=>"20004", :street=>"Pennsylvania
Ave NW", :street_phone=>"PNSLF", :fid=>"909475"},
{:paflag=>"A", :street_score=>"0.176470588235294", :zip=>"20004", :street=>"Pennsylvania
Ave NW", :street_phone=>"PNSLF", :fid=>"909472"},
{:paflag=>"P", :street_score=>"0.176470588235294", :zip=>"20003", :street=>"Pennsylvania
Ave SE", :street_phone=>"PNSLF", :fid=>"909480"},
{:paflag=>"P", :street_score=>"0.176470588235294", :zip=>"20001", :street=>"Pennsylvania
Ave NW", :street_phone=>"PNSLF", :fid=>"909474"}]
SQL :
SELECT feature_edge.fid AS fid, range.*
FROM feature_edge, range
WHERE fid IN (?,?,?,?,?,?,?,?,?)
AND feature_edge.tlid = range.tlid
ORDER BY min(abs(fromhn - ?), abs(tohn - ?))
LIMIT 36;
EXEC: ["909481", "909472", "909473", "909474", "909475", "909476",
"909477", "909478", "909480", 1600, 1600]
ROWS: 36 (0.012s)
SQL : SELECT edge.* FROM edge WHERE edge.tlid IN (?)
EXEC: ["76225813"]
ROWS: 1 (0.000s)
SQL : SELECT tlid, side,
min(fromhn) > min(tohn) AS flipped,
min(fromhn) AS from0, max(tohn) AS to0,
min(tohn) AS from1, max(fromhn) AS to1
FROM range WHERE tlid IN (?)
GROUP BY tlid, side;
EXEC: ["76225813"]
ROWS: 2 (0.001s)
NUM : 1522 < 1600 < 1608 (flipped? false)
DIST: 0.906976744186046
POINTS: [[0.000513, 0.00128], [1604.462336, 1396.850523],
[-1225.405504, 1131.612825], [-944.755392, 1396.853316], [-736.122944,
1131.612741], [-623.423168, 1396.856059], [220.726976, 1131.612725],
[1554.434624, 1396.858337], [2134.427072, 1131.612691], [1090.440768,
1396.862393], [1177.577152, 1131.612708]]
DONE: 0.071s
[{:precision=>:range, :lat=>1233.403826, :number=>"1600", :prenum=>"", :zip=>"20502", :components=>
{:denominator=>10.25, :state=>0, :parity=>1.25, :number=>2.0, :prenum=>1, :total=>7.72058823529412, :zip=>0, :street=>2.47058823529412, :city=>1.0}, :street=>"Pennsylvania
Ave NW", :lon=>2533.967734, :score=>0.753}]
=> nil
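One note on reading that trace: the NUM and DIST lines do check out,
which again points at the geometry rather than the matching. DIST is
just the fractional position of house number 1600 along the 1522-1608
range:

(1600 - 1522) / (1608 - 1522).to_f # => 0.9069767441860465, matching DIST

So the suspicious part really is the POINTS values, which look nothing
like lat/long pairs.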
According to this...
http://dev.mysql.com/doc/refman/5.0/en/gis-wkb-format.html
WKB has 5 bytes of "stuff" in front (a 1-byte byte-order flag and a
4-byte geometry type), then two 8-byte doubles. However, the C code...
uint32_t compress_wkb_line (void *dest, const void *src, uint32_t len)
{
    uint32_t d, s;
    double value;
    if (!len) return 0;
    /* s = 9 skips the WKB LineString header: 1-byte byte order,
       4-byte geometry type, and 4-byte point count. Each 8-byte
       double coordinate is then stored as a 4-byte integer of
       millionths of a degree. */
    for (s = 9, d = 0; s < len; d += 4, s += 8) {
        value = *(double *)(src + s);
        value *= 1000000;
        *(int32_t *)(dest + d) = (int32_t) value;
    }
    return d;
}
appears to start at byte 9. Yes? I haven't done C in years, and I'm
learning Ruby as I'm doing this, so it's a bit of a struggle.
I'm going to recompile the C code and see (ha ha) what that does.
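For what it's worth, reading that layout by hand in Ruby would look
something like this (a quick sketch, untested against the geocoder's
actual blobs, and assuming little-endian WKB):

# Decode a little-endian WKB LineString: 1-byte byte order, 4-byte
# geometry type, 4-byte point count, then npoints * 2 doubles.
def decode_wkb_linestring(blob)
  _byte_order, _geom_type, npoints = blob.unpack('CVV')
  coords = blob[9, npoints * 16].unpack("E#{npoints * 2}") # 'E' = LE double
  coords.each_slice(2).to_a # => [[x, y], ...]
end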
I don't think that compress code is ever being called. I don't really
have debugging ability right now, so I did the poor man's version of
debugging: I put 'exit(1);' in the body of that function:
uint32_t compress_wkb_line (void *dest, const void *src, uint32_t len)
{
    uint32_t d, s;
    double value;
    exit(1); /* poor man's breakpoint: die immediately if this runs */
    if (!len) return 0;
    for (s = 5, d = 0; s < len; d += 4, s += 8) {
        value = *(double *)(src + s);
        value *= 1000000;
        *(int32_t *)(dest + d) = (int32_t) value;
    }
    return d;
}
Nothing seems to happen. I'll recompile and try again.
So, maybe the geometries are regular WKB instead of packed? I can't
see the SQLite blob values, which is pretty frustrating, but what can
you do?
Another issue: the Ruby code uses 'unpack', which might act differently
on 32-bit vs. 64-bit systems (I'm on 64-bit). This is probably not an
issue, but the docs mention something about platform sizes on output.
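For what it's worth, the platform-size caveat in the pack/unpack docs
applies to the '!'-suffixed directives, not the plain ones, so this is
probably a red herring:

# Plain directives have fixed sizes; '!'-suffixed ones follow the
# platform's native C types, which is where 32/64-bit differences bite.
[1, 2].pack('l*').bytesize  # => 8 everywhere ('l' is always 32-bit)
[1, 2].pack('l!*').bytesize # => 16 on typical 64-bit Linux, 8 on 32-bit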
require 'geo_ruby'
include GeoRuby::SimpleFeatures

then, in 'unpack_geometry':

def unpack_geometry(geom)
  points = []
  unless geom.nil?
    # the blob is plain (E)WKB, so let GeoRuby parse it directly
    g = Geometry.from_ewkb(geom)
    g.points.each { |point| points << [point.x, point.y] }
  end
  points
end
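A quick way to sanity-check that helper is to round-trip a GeoRuby
linestring through it (coordinates made up for illustration):

# Build a linestring, serialize it to EWKB, and read it back.
ls = LineString.from_coordinates([[-77.0365, 38.8977], [-77.0355, 38.8978]])
unpack_geometry(ls.as_ewkb) # => [[-77.0365, 38.8977], [-77.0355, 38.8978]]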
I basically learned Ruby today to do this, so if there's a better way
to hack that up, give it a shot.
I tried it with one address in Delaware, and the lat/lon produced came
out what looks like literally 10-20 feet away. So, pretty good. I
haven't tested this on other addresses yet.
On Feb 8, 1:46 pm, Kevin Galligan <kgalli...@gmail.com> wrote:
I haven't tested this on 64-bit systems, so I'm not sure. If you find
out anything more about this, let me know.
-Kate
> ...
>
> read more »
Well, the summary of the last email is this: the Ruby code expects a special packed number format, not standard WKB. However, the stored value actually is plain WKB, so just reading it with geo_ruby works. I'm able to geocode addresses at the range level. Haven't tried anything else yet.
I'm probably going to be putting in some significant tweaks and testing over the next few weeks. I've never used git, but I assume there's an easy way to pull patches to send along?
If you fork the geocoder on github and make your commits you can then
do a pull request to me. That allows me to then merge in your
changes.
-Kate
Other than the 64-bit system part, I think you've really nailed it.
Based on your advice (using Geo_Ruby), I was able to solve this
problem. The basic issue is consistency (and it pops up in a few
other places). I pointed out before that you could successfully
import the Tiger data if you used the Ruby-based importer instead of
the SQL-based importer, but I never stopped to figure out why. As it
turns out:
* The Ruby-based TIGER importer uses Ruby's String#pack on the
geometry data, and the geocoder uses String#unpack on the same
geometry to produce valid coordinates.
* The SQLite extension includes compress_wkb_line, which appears to
be used in the NAVTEQ convert.sql, and its inverse
uncompress_wkb_line. These don't seem to be used anywhere else.
* The SQL-based TIGER importer does not apply any packing, but the
geocoder still uses String#unpack on the otherwise valid geometry.
Kevin's addition of Geo_Ruby works because it simply reads the valid
geometry blob as WKB instead of trying to unpack a geometry that was
never packed. You'll note that the import.rb also uses Geo_Ruby
prior to packing the geometry blob!
In my opinion, this small inconsistency is very similar to the
previous issue with the use of metaphones in places.sql (namely, if
the Ruby metaphone function differs at all from the metaphone
function used to build places.sql, the geocoder will not produce the
expected results). The same goes for constructing the street_phone
during import versus lookup. In the end, you can't expect the geocoder
to behave consistently if the data isn't consistent.
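To sketch the invariant (method names illustrative, modeled on what
compress_wkb_line does): whatever transformation the import applies,
the lookup must apply the exact inverse.

PACK_FORMAT = 'l*' # 32-bit signed ints holding millionths of a degree

# import side: degrees -> packed int32 blob
def pack_points(points)
  points.flatten.map { |deg| (deg * 1_000_000).round }.pack(PACK_FORMAT)
end

# geocoder side: must be the exact inverse, or lookups silently break
def unpack_points(blob)
  blob.unpack(PACK_FORMAT).map { |i| i / 1_000_000.0 }.each_slice(2).to_a
end

unpack_points(pack_points([[-77.0365, 38.8977]])) # => [[-77.0365, 38.8977]]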
My suggestion is to fix (or just use) the importer instead of relying
on the SQL scripts and SQLite3 extensions. I don't mind writing the
changes and making the necessary pull requests if that sounds like a
reasonable plan.
And thanks again, Kevin. I spent most of the weekend trying to figure
out this problem, and your email cleared it all up very quickly.
Dan
If I can paraphrase: we should try to use the same "stack", for lack of a better term, for importing and accessing. I've run into similar issues in my time. Using C to pack, then Ruby to unpack (and other examples) is asking for trouble; using Ruby for both makes more intuitive sense to me. For "macro" operations (again, for lack of a better term), though, it's not as big of a deal. "Pack" is risky because it's doing bit manipulation, while inserting data into a table isn't a big deal because you either have the right schema or you don't.
I have a stupid question, though: why are we packing those values? I'd suspect a moderate performance boost, but I'd be curious to see how that stacks up against the rest of the lookup. Also, I can envision wanting to use this data in different ways, maybe from a different platform. On top of that, since other parts of the process seem somewhat brittle, time would probably be better spent there rather than on optimizing this.
I'm going to take a look at the data consistency in the next few days. I have a lot of valid address data, and geo for a big chunk of Zip+4. I geocoded about 100k addresses last night, then took the distance from where the geocoder thinks each one is. Most are really close, some are reasonable, and 1-2% are over a mile. I tried that late yesterday, though; I may have had mush brain.
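If anyone wants to repeat that check, the distance half is just a
haversine computation; a minimal sketch, assuming decimal-degree
inputs:

include Math

# great-circle distance in miles between two lat/lon points (degrees)
def haversine_miles(lat1, lon1, lat2, lon2)
  radius = 3958.8 # mean Earth radius, miles
  dlat = (lat2 - lat1) * PI / 180
  dlon = (lon2 - lon1) * PI / 180
  a = sin(dlat / 2)**2 +
      cos(lat1 * PI / 180) * cos(lat2 * PI / 180) * sin(dlon / 2)**2
  2 * radius * asin(sqrt(a))
end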
-kevin
Some of the GIS vendors routinely compress their blob data to save
storage space; I'm guessing the geocoder adopted that practice as
well. I did some quick comparisons of the compress_wkb_line and
uncompress_wkb_line functions, and the compressed data was often 50%
smaller or more (each 8-byte double coordinate becomes a 4-byte
integer of millionths of a degree, so a pure coordinate stream shrinks
by half). I don't think it was an area of focus; it was probably
simple enough to throw in a set of reliable compress/uncompress
functions (or pack/unpack methods) and move on to the actual
geocoding algorithms.
The Tiger data isn't perfect, so your results may vary. I started
working with local data from my state's Department of Transportation
to get more accurate results (which is when I started running into
issues with non-Tiger data), but even it isn't perfect. Google and
some of the other geocoders use commercial databases (like NAVTEQ and
Tele Atlas) to augment Tiger data, but they still generate imperfect
results.
And unless you're using better data for places.sql, Zip+4 shouldn't
buy you any additional precision. The algorithm uses the zip code to
narrow down the number of matching streets; the data distributed with
the geocoder doesn't include data on the +4.
So, as a higher-level issue: is there active development going on with
the geocoder, or is it just fixes and patches? If there's an active
development effort, maybe some planning along those lines would help?
I did a little work on a different geocoder once; I think it was a
port of the Perl code to Java. Tiger had some fun quirks, as I
remember.
Can I get some clarification here?
"And unless you're using better data for places.sql, Zip+4 shouldn't
buy you any additional precision."
Do you mean the geocoder should be as precise as the Zip+4 data, or
that I'm not going to do much better than Zip+4? To be honest, I'm
only trying a geocoder because our Zip+4 data is older and incomplete;
otherwise, that geo was good enough. Outside of city areas it isn't as
hot, but generally fine. I'm trying to use the geocoder to fill in the
gaps where I have Zip+4 on an address but no entry in my geo lookup
table.
"The algorithm uses the zip code to narrow down the number of matching streets"
So, it checks that the matches are inside the zip code's boundary? I
was wondering about that. What's most interesting is that I didn't
really see any catastrophic misses, like 100 miles off. Most were
close, a few were a couple miles away, and then it fell off sharply.
Of course, I only have data for Delaware right now, so...
Speaking of which, I'm currently downloading and building the full set
on an EC2 instance. If we can figure out the DB format, I think I
could make the block storage device with the DB available for those in
need, at least for a bit.
Picking up a Ruby book today. Will tinker around a bit.
-Kevin
I can provide at least some background on the geocoder that will
hopefully give some context.
The reason for the C/Ruby mix is that we had various issues when we
were initially releasing it: some of the C code was not threadsafe and
was segfaulting, and some of the Ruby code seemed to have a memory
leak and was slowing down and never finishing. Things are a bit messy
now and shouldn't be; I haven't had the time to clean everything up
the way I would like, though.
As for active development, there have really only been two people who
have done major work on the geocoder so far. One was a contractor paid
to create it for GeoCommons; the other is myself. I am an employee of
FortiusOne and work on GeoCommons. Features that have been added so
far were based on GeoCommons' needs specifically, but it would be
great to come up with additional work to improve the geocoder.
I can put the database up on my own server for people to download.
Ideally, though, it would be good to make the import process smoother.
That way additional datasets could be used for greater accuracy; for
example, if someone had good data for a particular city or state,
swapping it in for TIGER would be great. I would also like to allow
rooftop-style geocoding at some point.
Hope this helps. If people fork the geocoder on github and submit
patches I'll be more than happy to apply them. If there is a group
that wants to work on enhancements I'll be more than willing to help
with that as well.
Thanks,
Kate
I don't have any other street data sources besides Tiger. What I do
have is a large amount of actual address data, cleaned and updated,
and I've been killing time all day dreaming up ways to stress the
geocoder to see what it does with it.
As for merged source data, I think that's sort of what OpenStreetMap
is, right? Augmented Tiger plus other sources? I have no idea how
good that data is. We use it for map graphics, but that's about it.
The load process is relatively smooth; it just doesn't pack the data.
For now, performance and size gains aside, I'm going to use the
existing loader and skip the packing, especially if there's memory
trouble with Ruby. I have almost zero Ruby experience.
As a side discussion, how specific is the SQL to SQLite? I have an
urge to get it onto PostGIS. I'm not sure what benefit that would
offer, but it's natural for our environment.
Thanks,
-Kevin
Kate would be able to speak to the active development. The goal of
the project was probably to deliver a geocoder for US street addresses
to the Ruby platform (the original Geocoder/US was delivered in Perl)
and let Ruby users take it from there. I get the feeling that
development right now is focused on fixing bugs and packaging the
files for a broader audience (by making the installation easier or
simplifying import procedures). You could improve the data sources
(by using NAVTEQ or others), but you'd have to pay for them. You
could improve the algorithm (by using shapefiles of property maps, for
example, instead of straight-line approximations), but you'd probably
lose nationwide coverage.
I may have misread your earlier statement on Zip+4. I thought you
meant that you were going to supply addresses with Zip+4 in the hope
that they would be more accurate than the same addresses without the
+4. My point was that the geocoder isn't checking zipcode boundaries
at all. Instead, it uses the zipcode (or city and state) to find a
match in the place table, which it then uses to find an initial set of
street data (from the feature table) for that place. A zipcode or
city/state combination allows it to find the right Main Street in a
database full of Main Streets. From there, it just interpolates the
house number along the range of addresses on the matching street
segment.
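The debug trace earlier in the thread shows exactly that flow. Reduced
to a sketch (assuming the sqlite3 gem with the geocoder's custom
metaphone/levenshtein SQL functions loaded; names like street_text and
place_zip are placeholders), it is roughly:

# 1. resolve city/state (or zip) to a single place row
place = db.get_first_row(
  "SELECT *, levenshtein(?, city) AS city_score
   FROM place WHERE city_phone IN (metaphone(?,5)) AND state = ?
   ORDER BY priority DESC LIMIT 1", city, city, state)

# 2. use that place's zip to pull candidate streets from feature
candidates = db.execute(
  "SELECT feature.*, levenshtein(?, street) AS street_score
   FROM feature
   WHERE street_phone IN (metaphone(?,5)) AND feature.zip IN (?)",
  street_text, street_name, place_zip)

# 3. interpolate the house number along the best candidate's edge ranges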
BUT, you could use the lat/lon data from the geocoder to locate a
point and then use some other system to determine whether that point
falls within a set of boundaries (like a Zip+4 geometry), and I think
that's what you meant.
Ideally, yes, I would like to use OpenStreetMap data. There are not
many addresses in OSM for the US right now, though. I have been
working with a group importing DC GIS data into OSM, and I have data
for a couple of other states as well. Once the OSM data is better than
the TIGER data, it would be good to have the geocoder use it.
I don't think it would be that bad to port to PostGIS, though I would
be interested in seeing what the performance is like. With the current
implementation, data is loaded into memory as you geocode, so it
actually gets faster with use. How that would translate to PostGIS I
am not sure.
-Kate