Data Quality


bosconet

Feb 23, 2011, 8:36:32 PM
to CIVIC HACK DAY
I started playing around with the Real Property data today and noticed
that the data needed some manipulation to load cleanly into a DB.
Specifically, I found commas embedded in fields in the CSV download
file. I was able to fix the few instances I encountered by
hand, but have since noticed some other data quality issues.
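
For anyone who wants to find the stray-comma rows without eyeballing the file, here is a minimal sketch. It assumes the first line is a header and that a row with an unquoted embedded comma shows up with more fields than the header has. Apart from 'lotSize', the column names and sample rows are made up for illustration and are not from the actual download.

```python
import csv
import io

def find_bad_rows(text):
    """Return (line_number, field_count) for rows whose field count
    differs from the header's field count."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    expected = len(header)
    bad = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the number of the source line just read
            bad.append((reader.line_num, len(row)))
    return bad

# Hypothetical sample: line 3 has an unquoted comma inside the owner field,
# line 4 has the same kind of value but properly quoted.
sample = (
    "id,owner,lotSize\n"
    "1,SMITH JOHN,14X80\n"
    "2,JONES, MARY,0.5 ACRES\n"
    '3,"DOE, JANE",20X90\n'
)

bad_rows = find_bad_rows(sample)
```

Rows flagged this way still need a manual look, since the script can only tell that the field count is off, not which comma is the extra one.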

As an example, I identified 66 properties where the 'lotSize' is
given in cubic feet (that is, if CU FT means the same to
the city as it does to normal people). There are also over 16,000
properties where the lot size is given in acres. Then you have
other lot sizes given as dimensions (e.g. 14X80).
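
A rough way to count how many 'lotSize' values fall into each of these buckets is sketched below. The "CU FT" marker and the 14X80 form come from what I saw in the data; the "ACRE" marker and the exact regular expression are assumptions about how the values are written, so adjust them against the real file.

```python
import re
from collections import Counter

# Matches plain "width X depth" values such as 14X80 (the pattern is a guess).
DIMENSIONS = re.compile(r"^\d+(?:\.\d+)?\s*X\s*\d+(?:\.\d+)?$")

def classify_lot_size(value):
    """Bucket a raw lotSize string by the kind of unit it appears to use."""
    text = value.strip().upper()
    if DIMENSIONS.match(text):
        return "dimensions"
    if "CU FT" in text:
        return "cubic_feet"
    if "ACRE" in text:
        return "acres"
    return "other"

# Hypothetical values for illustration only.
samples = ["14X80", "0.5 ACRES", "1200 CU FT", "", "14x80 "]
counts = Counter(classify_lot_size(v) for v in samples)
```

Anything landing in "other" is worth listing separately, since that bucket is where unexpected formats will show up.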

Does anyone know if the City is aware of this issue?

Mark Headd

Feb 24, 2011, 9:58:47 AM
to civic-h...@googlegroups.com
I saw that you started a forum thread on this:

http://discuss.baltimorecity.gov/topic/data-quality-issues-in-real-pr...


Hope it leads to getting the data cleaned up.  If you have any interest, you may also want to check out Google Refine (if you haven't already) - http://code.google.com/p/google-refine/

I've used it to clean up open gov data and make it more usable. It's not ideal because you have to offload the data from the City's portal, but it can make the data more usable if there are issues or if you have specific usage requirements for it. From the city of Chicago, I took this:


And converted it to this:


In about 20 minutes.  I was not happy that they had jammed all of the hours of operation into a single field, and I wanted to strip off formatting on phone numbers (so I could make an app dial them).
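
I did this in Google Refine, but the phone number step is easy to sketch in plain Python for anyone who prefers a script. This is not what Refine does internally, just the same idea: drop every character that is not a digit. The sample number is made up.

```python
import re

def digits_only(phone):
    """Strip punctuation and spaces from a phone number, keeping digits only."""
    return re.sub(r"\D", "", phone)

# Hypothetical example value.
dialable = digits_only("(312) 555-0100")
```

Splitting the hours of operation depends on how that field is laid out, so I have not tried to sketch it here.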

Google Refine is awesome for general data clean up / enhancement needs.

Hope this helps. 

Mark

--
Event Info at: http://CivicHackDay.org


bosconet

Feb 24, 2011, 10:07:00 AM
to CIVIC HACK DAY
Thanks for the pointer to Google Refine. I was not aware of it.

I posted the thread after complaining on Twitter about the same thing
and being asked by @bmorewebmaster to post it.


On Feb 24, 9:58 am, Mark Headd <mhe...@gmail.com> wrote:
> I saw that you started a forum thread on this:
>
> http://discuss.baltimorecity.gov/topic/data-quality-issues-in-real-pr...
>
> Hope it leads to getting the data cleaned up.  If you have any interest, you
> may also want to check out Google Refine (if you haven't already) -http://code.google.com/p/google-refine/