Hi!
This was a fun one to do!
Here are the steps I took. I am fairly confident in the results, but
I'm also fairly confident it could be done different/better.
1. The file with geographic names was pipe (|) delimited and not comma
delimited. Although not necessary I replaced the pipes with commas. In
a terminal:
cat NationalFile_20120416.txt | sed -i 's/|/,/g' > places.csv
2. I had a hunch that a file with 2.221.270 lines would have some
errors, so I ran it through the awesome csvclean tool
(
http://csvkit.readthedocs.org/en/0.4.4/scripts/csvclean.html). That
stripped 645 lines from the file, most of them with the message
"Expected 20 columns, found 21 columns". The error lines get saved to
a seperate file, places_err.csv, for inspection and the clean file in
a new one, places_out.csv.
3. Next I wanted to remove the columns that were not needed to make
the file a little bit more managable (it was at 301mb). So I ran
csvcut (
http://csvkit.readthedocs.org/en/0.4.4/scripts/csvcut.html)
csvcut -c 1,2,3,4,5,6,7,8,9,10,11 places_out.csv > places_out_cut.csv
but it gave a encoding error. A bit of googling around led me to
conclude that the file was utf-8 with BOM. To strip the BOM from the
file I used the shell script found here:
http://thegreyblog.blogspot.com/2010/09/shell-script-to-find-and-remove-bom.html
(if you are on OS X then you will have to install seq for it to work -
instructions here:
http://scruss.com/blog/2008/02/08/seq-for-os-x/
After the BOM cleaning then the csvcut command ran fine and the new
file was down to 206 mb.
4. Then I wrote a short python script that processed each line from
the file and if found a match from a slur csv file, writes a new file:
matched.csv, adding a column with the slur word found.
The result:
Unique values: 83
5 most frequent values:
Coon: 1016
Squaw: 864
Pole: 794
Negro: 604
Gables: 106
Count: 4137
And here they are on a Google Fusion map:
http://go.gogn.in/IXcX7m
The files are in a dropbox folder here:
http://go.gogn.in/M28DEH
All the best,
pallih
--
pal...@kaninka.net
http://gogn.in
http://twitter.com/pallih
http://is.linkedin.com/in/pallih
My email load:
http://courteous.ly/KbEbzo