parsing dates

86 views
Skip to first unread message

Jon Voss

unread,
May 7, 2014, 5:53:52 PM5/7/14
to openr...@googlegroups.com
Hi all,
I'm stumped on this one, though think there's got to be an easy solution.

I'm working on a dataset of metadata for photos (from this amazing collection: http://sfmta.photoshelter.com) and I'm trying to parse out the dates as the date fields are sometimes mixed up.  It turns out that the description field has the most accurate dates reported, and they look like this:

Curve in New Track Ferries and Cliff Line | April 7, 1905
Sutter Street Powerhouse Boiler Room | Circa 1905
End of Streetcar 1074 | April 1905

So I've written this to get at most of them, though it doesn't work for the Circa 1905 variants.  

if(value.split(' | ')[1].contains('Circa'),
   toString(toDate(value.split(' | ')[1].split(' ')[1]
      + ' 1, '
      + value.split(' | ')[1].split(' ')[2]), "yyyy-MM"),
   if(value.split(' | ')[1].contains(',')
      , toString(toDate(value.split(' | ')[1]), "yyyy-MM-dd")
      , toString(toDate(value.split(' | ')[1].split(' ')[0]
         + ' 1, '
         + value.split(' | ')[1].split(' ')[1]), "yyyy-MM"))) 

Wonder if anyone has any thoughts on a better approach?

Thanks! Jon

Thad Guidry

unread,
May 7, 2014, 9:44:28 PM5/7/14
to openr...@googlegroups.com
1st off, don't try to wack off too much in one go... take your time..starting with:

1. Split your columns, so that your date part is contain in 1 column in Refine.
2. After that, use this to cut to the chase and skip over stupid strings 1st... (by cheating with fingerprint() !! )

value.fingerprint().toDate('y d MMM')

3. Iterate using a Custom Facet with that above GREL expression and hack on it, changing the format as necessary as you go until all your dates in the date part column have been converted to a proper date object for Refine to do its magic on with other functions, facets, etc.

the standard syntax is value.toDate('some Simple Date Format string')


Let us know if we can help further, Jon.



--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--

Owen Stephens

unread,
May 8, 2014, 4:59:19 AM5/8/14
to openr...@googlegroups.com
If the format is as regular as presented here, and assuming that 'Circa' can be added to the front of any date expression then the following might suffice

trim(value.replace("Circa","")).match(/.*\| ?(\w*)\s?(\d?),? (\d\d\d\d)/)

Or if you break out the date part to a separate column as suggested by Thad:
match(trim(value.replace("Circa")),/(\w*)\s?(\d?),? (\d\d\d\d)/)

The resulting array would have Month/Array in position 0, day or empty string in position 1 and year in position 2. You could then build the date from these components testing for blank strings in pos 0 and pos 1

I like Thad's use of fingerprint though - very neat and might be useful to me elsewhere - thanks :)

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: ow...@ostephens.com
Telephone: 0121 288 6936

Jon Voss

unread,
Jun 30, 2014, 9:38:29 PM6/30/14
to openr...@googlegroups.com
ugh, missed your responses on this last month Thad and Owen!  fortunately coming back to it on another iteration of WWI data.  Thanks for your hints on this!
Reply all
Reply to author
Forward
0 new messages