CSV Request

446 views
Skip to first unread message

gerald

unread,
May 10, 2012, 4:10:56 PM5/10/12
to ScraperWiki
Hi,

So, I've got a massive delimited text file of the names of every
geographic location in the US, a list of racial slurs, and I'm looking
for the names of locations that include a racial slur. Specifically,
I'll need a CSV with the following information for each
controversially named location separated into columns: Feature ID,
Feature Name, Feature Class, State Name Abbreviation, State Number,
County Name, County Number, Primary Latitude DEC and Primary Longitude
DEC.

Here are link to the geographic names and a list of racist terms.

http://geonames.usgs.gov/docs/stategaz/NationalFile_20120416.zip

https://docs.google.com/spreadsheet/ccc?key=0Antez86oOXPndGZ4LVNnMlFWWVZSdnVETzhqb3RRN1E

PS. I'd like to see how you do this one so I can learn.

Páll Hilmarsson

unread,
May 10, 2012, 9:10:15 PM5/10/12
to scrap...@googlegroups.com
Hi!

This was a fun one to do!

Here are the steps I took. I am fairly confident in the results, but
I'm also fairly confident it could be done different/better.

1. The file with geographic names was pipe (|) delimited and not comma
delimited. Although not necessary I replaced the pipes with commas. In
a terminal:

cat NationalFile_20120416.txt | sed -i 's/|/,/g' > places.csv

2. I had a hunch that a file with 2.221.270 lines would have some
errors, so I ran it through the awesome csvclean tool
(http://csvkit.readthedocs.org/en/0.4.4/scripts/csvclean.html). That
stripped 645 lines from the file, most of them with the message
"Expected 20 columns, found 21 columns". The error lines get saved to
a seperate file, places_err.csv, for inspection and the clean file in
a new one, places_out.csv.

3. Next I wanted to remove the columns that were not needed to make
the file a little bit more managable (it was at 301mb). So I ran
csvcut (http://csvkit.readthedocs.org/en/0.4.4/scripts/csvcut.html)

csvcut -c 1,2,3,4,5,6,7,8,9,10,11 places_out.csv > places_out_cut.csv

but it gave a encoding error. A bit of googling around led me to
conclude that the file was utf-8 with BOM. To strip the BOM from the
file I used the shell script found here:
http://thegreyblog.blogspot.com/2010/09/shell-script-to-find-and-remove-bom.html
(if you are on OS X then you will have to install seq for it to work -
instructions here: http://scruss.com/blog/2008/02/08/seq-for-os-x/

After the BOM cleaning then the csvcut command ran fine and the new
file was down to 206 mb.

4. Then I wrote a short python script that processed each line from
the file and if found a match from a slur csv file, writes a new file:
matched.csv, adding a column with the slur word found.

The result:

Unique values: 83
5 most frequent values:
Coon: 1016
Squaw: 864
Pole: 794
Negro: 604
Gables: 106

Count: 4137

And here they are on a Google Fusion map: http://go.gogn.in/IXcX7m

The files are in a dropbox folder here: http://go.gogn.in/M28DEH


All the best,

pallih

--
pal...@kaninka.net
http://gogn.in
http://twitter.com/pallih
http://is.linkedin.com/in/pallih

My email load: http://courteous.ly/KbEbzo

gerald

unread,
May 11, 2012, 5:29:37 PM5/11/12
to ScraperWiki
Thanks so much!

On May 10, 8:10 pm, Páll Hilmarsson <pal...@kaninka.net> wrote:
> Hi!
>
> This was a fun one to do!
>
> Here are the steps I took. I am fairly confident in the results, but
> I'm also fairly confident it could be done different/better.
>
> 1. The file with geographic names was pipe (|) delimited and not comma
> delimited. Although not necessary I replaced the pipes with commas. In
> a terminal:
>
> cat NationalFile_20120416.txt | sed -i 's/|/,/g' > places.csv
>
> 2. I had a hunch that a file with 2.221.270 lines would have some
> errors, so I ran it through the awesome csvclean tool
> (http://csvkit.readthedocs.org/en/0.4.4/scripts/csvclean.html). That
> stripped 645 lines from the file, most of them with the message
> "Expected 20 columns, found 21 columns". The error lines get saved to
> a seperate file, places_err.csv, for inspection and the clean file in
> a new one, places_out.csv.
>
> 3. Next I wanted to remove the columns that were not needed to make
> the file a little bit more managable (it was at 301mb). So I ran
> csvcut (http://csvkit.readthedocs.org/en/0.4.4/scripts/csvcut.html)
>
> csvcut -c 1,2,3,4,5,6,7,8,9,10,11 places_out.csv > places_out_cut.csv
>
> but it gave a encoding error. A bit of googling around led me to
> conclude that the file was utf-8 with BOM. To strip the BOM from the
> file I used the shell script found here:http://thegreyblog.blogspot.com/2010/09/shell-script-to-find-and-remo...
> (if you are on OS X then you will have to install seq for it to work -
> instructions here:http://scruss.com/blog/2008/02/08/seq-for-os-x/
>
> After the BOM cleaning then the csvcut command ran fine and the new
> file was down to 206 mb.
>
> 4. Then I wrote a short python script that processed each line from
> the file and if found a match from a slur csv file, writes a new file:
> matched.csv, adding a column with the slur word found.
>
> The result:
>
> Unique values: 83
> 5 most frequent values:
> Coon:   1016
> Squaw:  864
> Pole:   794
> Negro:  604
> Gables: 106
>
> Count: 4137
>
> And here they are on a Google Fusion map:http://go.gogn.in/IXcX7m
>
> The files are in a dropbox folder here:http://go.gogn.in/M28DEH
>
> All the best,
>
> pallih
>
> --
> pal...@kaninka.nethttp://gogn.inhttp://twitter.com/pallihhttp://is.linkedin.com/in/pallih
>
> My email load:http://courteous.ly/KbEbzo
>
>
>
>
>
>
>
> On Thu, May 10, 2012 at 8:10 PM, gerald <gerald.a.r...@gmail.com> wrote:
> > Hi,
>
> > So, I've got a massive delimited text file of the names of every
> > geographic location in the US, a list of racial slurs, and I'm looking
> > for the names of locations that include a racial slur. Specifically,
> > I'll need a CSV with the following information for each
> > controversially named location separated into columns: Feature ID,
> > Feature Name, Feature Class, State Name Abbreviation, State Number,
> > County Name, County Number, Primary Latitude DEC and Primary Longitude
> > DEC.
>
> > Here are link to the geographic names and a list of racist terms.
>
> >http://geonames.usgs.gov/docs/stategaz/NationalFile_20120416.zip
>
> >https://docs.google.com/spreadsheet/ccc?key=0Antez86oOXPndGZ4LVNnMlFW...
Reply all
Reply to author
Forward
0 new messages