I also tried pulling the latest code from master and got the same
results. Since we're both using the TIGER 2008 data, I think it's
probably the shp2sql, like you said. I use the Makefile.nix when
running make.
How do I know my shp2sql is correct?
Thanks,
Andrew
Did you ever get a resolution to your issue? I'm encountering the
same numbers that you mentioned when I build the Geocommons system on
a Fedora 11 system using the TIGER 2009 data (with a comparable set of
packages to those that you listed). Any thoughts would be appreciated.
Thanks,
Bill
On Dec 17 2009, 10:22 pm, Andrew Hallock <andrew.hall...@gmail.com> wrote:
The import process is producing a metaphone (street_phone) value that
includes the street name and suffix (Rd, St, Ave, etc.). The geocoding
functions only use a street name metaphone and don't include the
suffix. I'm pretty sure the Address object strips off the suffix and
that's what the geocoding functions are using. The import script is
pretty much just straight SQL and doesn't use the same logic to
produce the metaphone. Also, the rebuild_metaphones script is only
rebuilding the city_phone and not the street_phone.
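To make the mismatch concrete, here is a toy illustration using the
Text::Metaphone gem (an assumption on my part -- the geocoder actually
ships its own metaphone as a C SQLite extension, which may differ in
detail):

require 'text'

# name + suffix, roughly as the import's SQL produces it
import_side = Text::Metaphone.metaphone('main st')[0, 5]
# bare name, as the Address object produces it at query time
query_side = Text::Metaphone.metaphone('main')[0, 5]

# If the feature table stores the first form but the query computes the
# second, WHERE street_phone IN (metaphone(?,5)) can never match.
puts "import stores:  #{import_side.inspect}"
puts "query computes: #{query_side.inspect}"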
I may be completely wrong, but that's what I remember running into
about a month ago. Kate has been very helpful and may have an
uncommitted fix for this. I know she was planning on doing a full
import of the 2009 TIGER data using the current code base. If I find
the solution, I will definitely post it here. If anyone else has some
Ruby experience and can dig into this, I would be interested in the
solution.
-Nick
Thank you for your persistence in tracking this down--I was bitten by
it as well. I'd love to get a 2009 geocoder built (I'm presently
using an old checkout + 2008 data), but I'm afraid I don't quite see
how to apply the fix you identified. If it's not too much trouble,
could you please give me a hint as to how to patch my checkout of
trunk in the way you suggest?
Thanks again,
David
On Jan 7, 1:12 pm, Kate Chapman <k8chap...@gmail.com> wrote:
> Nick,
>
> That is correct. At some point I imported an additional column without the
> suffix that was just street names. I must have committed the wrong file at
> some point in an edit or something silly.
>
> My suggestion would be to add the additional column to the import process to
> create the metaphones.
>
> -Kate
>
The older script is:

#!/usr/bin/ruby
require 'geocoder/us/import/tiger'

# the bundled SQL lives in ../sql relative to this script
base = File.join(File.dirname(__FILE__), "..", "sql")

# ARGV[0] is the target database; the remaining args are TIGER directories
db = Geocoder::US::Import::TIGER.new(ARGV[0], :sql => base)
ARGV[1..-1].each do |path|
  db.import_tree path
end
db.log "creating index"
#db.create_index
The command that I used was:

bin/tiger_import district_of_columbia.db TIGER2008/11_DISTRICT_OF_COLUMBIA
I don't have a fix for the import script. I tried modifying the custom
SQLite function that generates the metaphone value, but it didn't like
me using the Address object, and I'm not very familiar with Ruby, so it
was an uphill battle for me. If I do find a fix, I'll post it in this
forum.
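In the meantime, Kate's suggestion upthread (build the metaphones from
a suffix-free street name) could be approximated with a rebuild pass
after import. A sketch only: the suffix list below is a made-up sample,
and it assumes the geocoder's SQLite extension providing metaphone()
has been loaded into the connection.

#!/usr/bin/ruby
# Recompute street_phone from a suffix-stripped street name so the data
# matches what the Address object feeds to metaphone() at query time.
require 'sqlite3'

SUFFIX = /\s+(rd|st|ave?|blvd|dr|ln|ct|pl)\.?\s*$/i # toy list, not exhaustive

db = SQLite3::Database.new(ARGV[0])
rows = db.execute('SELECT rowid, street FROM feature')
rows.each do |rowid, street|
  name = street.to_s.sub(SUFFIX, '')
  db.execute('UPDATE feature SET street_phone = metaphone(?, 5) WHERE rowid = ?',
             name, rowid)
end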
Dan's suggestion that you use the old Ruby import script might work
but I think there was a reason they moved away from that. It may have
just been memory usage, in which case it might still work just fine.
-Nick
The output that I generated when trying to geocode the White House was
the following, which makes me believe that the metaphones are OK and
that the correct street is being selected from all of the possible
candidates. I'm assuming that the data points at the end of this
output should be the intermediate lat/long points on Pennsylvania Ave
(in this case), and they do seem a bit off from what they should be.
Can anyone on this list confirm my interpretation of the geographic
points and/or that their output is the same (once the debug variable
is set)?
>> p db.geocode("1600 Pennsylvania Av, Washington DC")
ADDR: #<Geocoder::US::Address:0xb6488094 @street=["pennsylvania av",
"pennsylvania avenue", "pennsylvania ave"], @plus4="",
@full_state="dc", @sufnum="", @state="DC", @city=["washington"],
@prenum=nil, @number="1600", @zip="", @text="1600 Pennsylvania Av,
Washington DC">
addr=#<Geocoder::US::Address:0xb6488094 @street=["pennsylvania av",
"pennsylvania avenue", "pennsylvania ave"], @plus4="",
@full_state="dc", @sufnum="", @state="DC", @city=["washington"],
@prenum=nil, @number="1600", @zip="", @text="1600 Pennsylvania Av,
Washington DC">
SQL : SELECT *, levenshtein(?, city) AS city_score
FROM place WHERE city_phone IN (metaphone(?,5)) AND
state = ? order by priority desc LIMIT 1;
EXEC: ["washington", "washington", "DC"]
ROWS: 1 (0.037s)
street parts = ["pennsylvania"]
SQL :
SELECT feature.*, levenshtein(?, street) AS street_score
FROM feature
WHERE street_phone IN (metaphone(?,5)) AND feature.zip IN
(?)
EXEC: ["pennsylvania av", "pennsylvania", "20014"]
ROWS: 0 (0.001s)
zip results 2
EXEC: ["pennsylvania av", "pennsylvania", "200%"]
ROWS: 9 (0.006s)
candidates=
[{:paflag=>"P", :street_score=>"0.176470588235294", :zip=>"20052", :street=>"Pennsylvania
Ave NW", :street_phone=>"PNSLF", :fid=>"909478"},
{:paflag=>"P", :street_score=>"0.176470588235294", :zip=>"20037", :street=>"Pennsylvania
Ave NW", :street_phone=>"PNSLF", :fid=>"909477"},
{:paflag=>"P", :street_score=>"0.176470588235294", :zip=>"20020", :street=>"Pennsylvania
Ave SE", :street_phone=>"PNSLF", :fid=>"909481"},
{:paflag=>"P", :street_score=>"0.176470588235294", :zip=>"20006", :street=>"Pennsylvania
Ave NW", :street_phone=>"PNSLF", :fid=>"909476"},
{:paflag=>"A", :street_score=>"0.176470588235294", :zip=>"20005", :street=>"Pennsylvania
Ave NW", :street_phone=>"PNSLF", :fid=>"909473"},
{:paflag=>"P", :street_score=>"0.176470588235294", :zip=>"20004", :street=>"Pennsylvania
Ave NW", :street_phone=>"PNSLF", :fid=>"909475"},
{:paflag=>"A", :street_score=>"0.176470588235294", :zip=>"20004", :street=>"Pennsylvania
Ave NW", :street_phone=>"PNSLF", :fid=>"909472"},
{:paflag=>"P", :street_score=>"0.176470588235294", :zip=>"20003", :street=>"Pennsylvania
Ave SE", :street_phone=>"PNSLF", :fid=>"909480"},
{:paflag=>"P", :street_score=>"0.176470588235294", :zip=>"20001", :street=>"Pennsylvania
Ave NW", :street_phone=>"PNSLF", :fid=>"909474"}]
SQL :
SELECT feature_edge.fid AS fid, range.*
FROM feature_edge, range
WHERE fid IN (?,?,?,?,?,?,?,?,?)
AND feature_edge.tlid = range.tlid
ORDER BY min(abs(fromhn - ?), abs(tohn - ?))
LIMIT 36;
EXEC: ["909481", "909472", "909473", "909474", "909475", "909476",
"909477", "909478", "909480", 1600, 1600]
ROWS: 36 (0.012s)
SQL : SELECT edge.* FROM edge WHERE edge.tlid IN (?)
EXEC: ["76225813"]
ROWS: 1 (0.000s)
SQL : SELECT tlid, side,
min(fromhn) > min(tohn) AS flipped,
min(fromhn) AS from0, max(tohn) AS to0,
min(tohn) AS from1, max(fromhn) AS to1
FROM range WHERE tlid IN (?)
GROUP BY tlid, side;
EXEC: ["76225813"]
ROWS: 2 (0.001s)
NUM : 1522 < 1600 < 1608 (flipped? false)
DIST: 0.906976744186046
POINTS: [[0.000513, 0.00128], [1604.462336, 1396.850523],
[-1225.405504, 1131.612825], [-944.755392, 1396.853316], [-736.122944,
1131.612741], [-623.423168, 1396.856059], [220.726976, 1131.612725],
[1554.434624, 1396.858337], [2134.427072, 1131.612691], [1090.440768,
1396.862393], [1177.577152, 1131.612708]]
DONE: 0.071s
[{:precision=>:range, :lat=>1233.403826, :number=>"1600", :prenum=>"", :zip=>"20502", :components=>
{:denominator=>10.25, :state=>0, :parity=>1.25, :number=>2.0, :prenum=>1, :total=>7.72058823529412, :zip=>0, :street=>2.47058823529412, :city=>1.0}, :street=>"Pennsylvania
Ave NW", :lon=>2533.967734, :score=>0.753}]
=> nil
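One note on reading that trace: the NUM and DIST lines do check out,
which again points at the geometry rather than the matching. DIST is
just the fractional position of house number 1600 along the 1522-1608
range:

(1600 - 1522) / (1608 - 1522).to_f # => 0.9069767441860465, matching DIST

So the suspicious part really is the POINTS values, which look nothing
like lat/long pairs.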
According to this...
http://dev.mysql.com/doc/refman/5.0/en/gis-wkb-format.html
WKB has 5 bytes of "stuff" in front (a 1-byte byte-order flag and a
4-byte geometry type), then two 8-byte doubles. However, the C code...
uint32_t compress_wkb_line (void *dest, const void *src, uint32_t len)
{
    uint32_t d, s;
    double value;
    if (!len) return 0;
    /* s = 9 skips the WKB LineString header: 1-byte byte order,
       4-byte geometry type, and 4-byte point count. Each 8-byte
       double coordinate is then stored as a 4-byte integer of
       millionths of a degree. */
    for (s = 9, d = 0; s < len; d += 4, s += 8) {
        value = *(double *)(src + s);
        value *= 1000000;
        *(int32_t *)(dest + d) = (int32_t) value;
    }
    return d;
}
appears to start at byte 9. Yes? I haven't done C in years, and I'm
learning Ruby as I'm doing this, so it's a bit of a struggle.
I'm going to recompile the C code and see (ha ha) what that does.
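For what it's worth, reading that layout by hand in Ruby would look
something like this (a quick sketch, untested against the geocoder's
actual blobs, and assuming little-endian WKB):

# Decode a little-endian WKB LineString: 1-byte byte order, 4-byte
# geometry type, 4-byte point count, then npoints * 2 doubles.
def decode_wkb_linestring(blob)
  _byte_order, _geom_type, npoints = blob.unpack('CVV')
  coords = blob[9, npoints * 16].unpack("E#{npoints * 2}") # 'E' = LE double
  coords.each_slice(2).to_a # => [[x, y], ...]
end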
I don't think that compress code is ever being called. I don't really
have debugging ability right now, so I did the poor man's version of
debugging: I put 'exit(1);' in the body of that function:
uint32_t compress_wkb_line (void *dest, const void *src, uint32_t len)
{
    uint32_t d, s;
    double value;
    exit(1); /* poor man's breakpoint: die immediately if this runs */
    if (!len) return 0;
    for (s = 5, d = 0; s < len; d += 4, s += 8) {
        value = *(double *)(src + s);
        value *= 1000000;
        *(int32_t *)(dest + d) = (int32_t) value;
    }
    return d;
}
Nothing seems to happen. I'll recompile and try again.
So, maybe the geometries are regular WKB instead of packed? I can't
see the SQLite blob values, which is pretty frustrating, but what can
you do?
Another issue: the Ruby code uses 'unpack', which might act differently
on 32-bit vs. 64-bit systems (I'm on 64-bit). This is probably not an
issue, but the docs mention something about platform sizes on output.
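For what it's worth, the platform-size caveat in the pack/unpack docs
applies to the '!'-suffixed directives, not the plain ones, so this is
probably a red herring:

# Plain directives have fixed sizes; '!'-suffixed ones follow the
# platform's native C types, which is where 32/64-bit differences bite.
[1, 2].pack('l*').bytesize  # => 8 everywhere ('l' is always 32-bit)
[1, 2].pack('l!*').bytesize # => 16 on typical 64-bit Linux, 8 on 32-bit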
require 'geo_ruby'
include GeoRuby::SimpleFeatures

then, in 'unpack_geometry':

def unpack_geometry(geom)
  points = []
  unless geom.nil?
    # the blob is plain (E)WKB, so let GeoRuby parse it directly
    g = Geometry.from_ewkb(geom)
    g.points.each { |point| points << [point.x, point.y] }
  end
  points
end
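A quick way to sanity-check that helper is to round-trip a GeoRuby
linestring through it (coordinates made up for illustration):

# Build a linestring, serialize it to EWKB, and read it back.
ls = LineString.from_coordinates([[-77.0365, 38.8977], [-77.0355, 38.8978]])
unpack_geometry(ls.as_ewkb) # => [[-77.0365, 38.8977], [-77.0355, 38.8978]]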
I basically learned Ruby today to do this, so if there's a better way
to hack that up, give it a shot.
I tried it with one address in Delaware, and the lat/lon produced came
out what looks like literally 10-20 feet away. So, pretty good. I
haven't tested this on other addresses yet.
On Feb 8, 1:46 pm, Kevin Galligan <kgalli...@gmail.com> wrote:
I haven't tested this on 64-bit systems, so I'm not sure. If you find
out anything more about this, let me know.
-Kate
> ...
>
> read more »
Well, the summary of the last email is this: the Ruby code expects a special packed number format, not standard WKB. However, the stored value actually is plain WKB, so just reading it with geo_ruby works. I'm able to geocode addresses at the range level. Haven't tried anything else yet.
I'm probably going to be putting in some significant tweaks and testing over the next few weeks. I've never used git, but I assume there's an easy way to pull patches to send along?
If you fork the geocoder on github and make your commits you can then
do a pull request to me. That allows me to then merge in your
changes.
-Kate
Other than the 64-bit system part, I think you've really nailed it.
Based on your advice (using Geo_Ruby), I was able to solve this
problem. The basic issue is consistency (and it pops up in a few
other places). I pointed out before that you could successfully
import the Tiger data if you used the Ruby-based importer instead of
the SQL-based importer, but I never stopped to figure out why. As it
turns out:
* The Ruby-based TIGER importer uses Ruby's String#pack on the
geometry data, and the geocoder uses String#unpack on the same
geometry to produce valid coordinates.
* The SQLite extension includes compress_wkb_line, which appears to
be used in the NAVTEQ convert.sql, and its inverse
uncompress_wkb_line. These don't seem to be used anywhere else.
* The SQL-based TIGER importer does not apply any packing, but the
geocoder still uses String#unpack on the otherwise valid geometry.
Kevin's addition of Geo_Ruby works because it simply reads the valid
geometry blob as WKB instead of trying to unpack a geometry that was
never packed. You'll note that the import.rb also uses Geo_Ruby
prior to packing the geometry blob!
In my opinion, this small inconsistency is very similar to the
previous issue with the use of metaphones in places.sql (namely, if
the Ruby metaphone function differs at all from the metaphone
function used to build places.sql, the geocoder will not produce the
expected results). The same goes for constructing the street_phone
during import versus lookup. In the end, you can't expect the geocoder
to behave consistently if the data isn't consistent.
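To sketch the invariant (method names illustrative, modeled on what
compress_wkb_line does): whatever transformation the import applies,
the lookup must apply the exact inverse.

PACK_FORMAT = 'l*' # 32-bit signed ints holding millionths of a degree

# import side: degrees -> packed int32 blob
def pack_points(points)
  points.flatten.map { |deg| (deg * 1_000_000).round }.pack(PACK_FORMAT)
end

# geocoder side: must be the exact inverse, or lookups silently break
def unpack_points(blob)
  blob.unpack(PACK_FORMAT).map { |i| i / 1_000_000.0 }.each_slice(2).to_a
end

unpack_points(pack_points([[-77.0365, 38.8977]])) # => [[-77.0365, 38.8977]]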
My suggestion is to fix (or just use) the importer instead of relying
on the SQL scripts and SQLite3 extensions. I don't mind writing the
changes and making the necessary pull requests if that sounds like a
reasonable plan.
And thanks again, Kevin. I spent most of the weekend trying to figure
out this problem, and your email cleared it all up very quickly.
Dan
If I can paraphrase: we should try to use the same "stack", for lack of a better term, for importing and accessing. I've run into similar issues in my time. Using C to pack, then Ruby to unpack (and other examples) is asking for trouble; using Ruby for both makes more intuitive sense to me. For "macro" operations (again, for lack of a better term), though, it's not as big of a deal. "Pack" is risky because it's doing bit manipulation, while inserting data into a table isn't a big deal because you either have the right schema or you don't.
I have a stupid question, though: why are we packing those values? I'd suspect a moderate performance boost, but I'd be curious to see how that stacks up against the rest of the lookup. Also, I can envision wanting to use this data in different ways, maybe from a different platform. On top of that, since other parts of the process seem somewhat brittle, time would probably be better spent there rather than on optimizing this.
I'm going to take a look at the data consistency in the next few days. I have a lot of valid address data, and geo for a big chunk of Zip+4. I geocoded about 100k addresses last night, then took the distance from where the geocoder thinks each one is. Most are really close, some are reasonable, and 1-2% are over a mile. I tried that late yesterday, though; I may have had mush brain.
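If anyone wants to repeat that check, the distance half is just a
haversine computation; a minimal sketch, assuming decimal-degree
inputs:

include Math

# great-circle distance in miles between two lat/lon points (degrees)
def haversine_miles(lat1, lon1, lat2, lon2)
  radius = 3958.8 # mean Earth radius, miles
  dlat = (lat2 - lat1) * PI / 180
  dlon = (lon2 - lon1) * PI / 180
  a = sin(dlat / 2)**2 +
      cos(lat1 * PI / 180) * cos(lat2 * PI / 180) * sin(dlon / 2)**2
  2 * radius * asin(sqrt(a))
end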
-kevin
Some of the GIS vendors routinely compress their blob data to save
storage space; I'm guessing the geocoder adopted that practice as
well. I did some quick comparisons of the compress_wkb_line and
uncompress_wkb_line functions, and the compressed data was often 50%
smaller or more (each 8-byte double coordinate becomes a 4-byte
integer of millionths of a degree, so a pure coordinate stream shrinks
by half). I don't think it was an area of focus; it was probably
simple enough to throw in a set of reliable compress/uncompress
functions (or pack/unpack methods) and move on to the actual
geocoding algorithms.
The Tiger data isn't perfect, so your results may vary. I started
working with local data from my state's Department of Transportation
to get more accurate results (which is when I started running into
issues with non-Tiger data), but even it isn't perfect. Google and
some of the other geocoders use commercial databases (like NAVTEQ and
Tele Atlas) to augment Tiger data, but they still generate imperfect
results.
And unless you're using better data for places.sql, Zip+4 shouldn't
buy you any additional precision. The algorithm uses the zip code to
narrow down the number of matching streets; the data distributed with
the geocoder doesn't include data on the +4.
So, as a higher-level issue: is there active development going on with
the geocoder, or is it just fixes and patches? If there's an active
development effort, maybe some planning along those lines would help?
I did a little work on a different geocoder once; I think it was a
port of the Perl code to Java. Tiger had some fun quirks, as I
remember.
Can I get some clarification here?
"And unless you're using better data for places.sql, Zip+4 shouldn't
buy you any additional precision."
Do you mean the geocoder should be as precise as the Zip+4 data, or
that I'm not going to do much better than Zip+4? To be honest, I'm
only trying a geocoder because our Zip+4 data is older and incomplete;
otherwise, that geo was good enough. Outside of city areas it isn't as
hot, but generally fine. I'm trying to use the geocoder to fill in the
gaps where I have Zip+4 on an address but no entry in my geo lookup
table.
"The algorithm uses the zip code to narrow down the number of matching streets"
So, it checks that the matches are inside the zip code's boundary? I
was wondering about that. What's most interesting is that I didn't
really see any catastrophic misses, like 100 miles off. Most were
close, a few were a couple miles away, and then it fell off sharply.
Of course, I only have data for Delaware right now, so...
Speaking of which, I'm currently downloading and building the full set
on an EC2 instance. If we can figure out the DB format, I think I
could make the block storage device with the DB available for those in
need, at least for a bit.
Picking up a Ruby book today. Will tinker around a bit.
-Kevin
I can provide at least some background on the geocoder that will
hopefully give some context.
The reason for the C/Ruby mix is that we had various issues when we
were initially releasing it: some of the C code was not threadsafe and
was segfaulting, and some of the Ruby code seemed to have a memory
leak and was slowing down and never finishing. Things are a bit messy
now and shouldn't be; I haven't had the time to clean everything up
the way I would like, though.
As for active development, there have really only been two people who
have done major work on the geocoder so far. One was a contractor paid
to create it for GeoCommons; the other is myself. I am an employee of
FortiusOne and work on GeoCommons. Features that have been added so
far were based on GeoCommons' needs specifically, but it would be
great to come up with additional work to improve the geocoder.
I can put the database up on my own server for people to download.
Ideally, though, it would be good to make the import process smoother.
That way additional datasets could be used for greater accuracy; for
example, if someone had good data for a particular city or state,
swapping it in for TIGER would be great. I would also like to allow
rooftop-style geocoding at some point.
Hope this helps. If people fork the geocoder on github and submit
patches I'll be more than happy to apply them. If there is a group
that wants to work on enhancements I'll be more than willing to help
with that as well.
Thanks,
Kate
I don't have any other street data sources besides Tiger. What I do
have is a large amount of actual address data, cleaned and updated,
and I've been killing time all day dreaming up ways to stress the
geocoder to see what it does with it.
As for merged source data, I think that's sort of what OpenStreetMap
is, right? Augmented Tiger plus other sources? I have no idea how
good that data is. We use it for map graphics, but that's about it.
The load process is relatively smooth; it just doesn't pack the data.
For now, performance and size gains aside, I'm going to use the
existing loader and skip the packing, especially if there's memory
trouble with Ruby. I have almost zero Ruby experience.
As a side discussion, how specific is the SQL to SQLite? I have an
urge to get it onto PostGIS. I'm not sure what benefit that would
offer, but it's natural for our environment.
Thanks,
-Kevin
Kate would be able to speak to the active development. The goal of
the project was probably to deliver a geocoder for US street addresses
to the Ruby platform (the original Geocoder/US was delivered in Perl)
and let Ruby users take it from there. I get the feeling that
development right now is focused on fixing bugs and packaging the
files for a broader audience (by making the installation easier or
simplifying import procedures). You could improve the data sources
(by using NAVTEQ or others), but you'd have to pay for them. You
could improve the algorithm (by using shapefiles of property maps, for
example, instead of straight-line approximations), but you'd probably
lose nationwide coverage.
I may have misread your earlier statement on Zip+4. I thought you
meant that you were going to supply addresses with Zip+4 in the hope
that they would be more accurate than the same addresses without the
+4. My point was that the geocoder isn't checking zipcode boundaries
at all. Instead, it uses the zipcode (or city and state) to find a
match in the place table, which it then uses to find an initial set of
street data (from the feature table) for that place. A zipcode or
city/state combination allows it to find the right Main Street in a
database full of Main Streets. From there, it just interpolates the
house number along the range of addresses on the matching street
segment.
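The debug trace earlier in the thread shows exactly that flow. Reduced
to a sketch (assuming the sqlite3 gem with the geocoder's custom
metaphone/levenshtein SQL functions loaded; names like street_text and
place_zip are placeholders), it is roughly:

# 1. resolve city/state (or zip) to a single place row
place = db.get_first_row(
  "SELECT *, levenshtein(?, city) AS city_score
   FROM place WHERE city_phone IN (metaphone(?,5)) AND state = ?
   ORDER BY priority DESC LIMIT 1", city, city, state)

# 2. use that place's zip to pull candidate streets from feature
candidates = db.execute(
  "SELECT feature.*, levenshtein(?, street) AS street_score
   FROM feature
   WHERE street_phone IN (metaphone(?,5)) AND feature.zip IN (?)",
  street_text, street_name, place_zip)

# 3. interpolate the house number along the best candidate's edge ranges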
BUT, you could use the lat/lon data from the geocoder to locate a
point and then use some other system to determine whether that point
falls within a set of boundaries (like a Zip+4 geometry), and I think
that's what you meant.
Ideally, yes, I would like to use OpenStreetMap data. There are not
many addresses in OSM for the US right now, though. I have been
working with a group importing DC GIS data into OSM, and I have data
for a couple of other states as well. Once the OSM data is better than
the TIGER data, it would be good to have the geocoder use it.
I don't think it would be that bad to port to PostGIS, though I would
be interested in seeing what the performance is like. With the current
implementation, data is loaded into memory as you geocode, so it
actually gets faster with use. How that would translate to PostGIS I
am not sure.
-Kate