future plans


Alexis

Feb 18, 2007, 11:01:46 AM
to parsedatetime-dev
Hi,

I just discovered parsedatetime. It looks great! I'm wondering if I
should abandon my own parsing library and apply my efforts here, so
I'm trying to figure out the scope of its functionality and the future
roadmap. So I was hoping I could put a couple quick questions to the
list:

1) Is there a document that comprehensively summarizes what human-
readable formats it can digest?

2) Does the library support switching to date-month-year format,
instead of just month-date-year format?

3) What does PyICU provide besides localized month and day names?

I've poked around with the svn-62 checkout, but there doesn't seem to be
a document summarizing the overall architecture and plans.

bear

Feb 19, 2007, 12:29:03 AM
to parsedat...@googlegroups.com


On 2/18/07, Alexis <alexisg...@gmail.com> wrote:
Hi,

Hi and thanks for using pdt :)

I just discovered parsedatetime. It looks great! I'm wondering if I
should abandon my own parsing library and apply my efforts here, so
I'm trying to figure out the scope of its functionality and the future
roadmap. So I was hoping I could put a couple quick questions to the
list:

I don't have a public roadmap - just a bunch of thoughts Darshana and I have about where pdt should go.  I want to get some of them down and start using the new wiki pages for that.

I'm more than happy to take ideas and even code to use in pdt with full credit given.  Darshana joined the "team" in just that way - she was doing work for Chandler and after a while I just added her to the team :)

1) Is there a document that comprehensively summarizes what human-
readable formats it can digest?

Right now the best documentation for that is the unit tests.  We made sure to add a test for every new parsing style we needed to handle.
 

2) Does the library support switching to date-month-year format,
instead of just month-date-year format?

Yes, it's completely driven by the information PyICU returns for a locale and pdt also has an internal set of classes if PyICU is not available or may be too "heavy" to use.

One of the earlier bug reports was from someone in Australia requesting support for that so it's been in there for a while.

3) What does PyICU provide besides localized month and day names?

PyICU provides everything that IBM's ICU package provides - so that's a lot of information.  Right now I only use it to pull month, day, am/pm and other date format related items.
 
I've poked around with the svn-62 but there doesn't seem to be a
document summarizing the overall architecture and plans.

Yeah, that is something that is needed - the early need for pdt was personal, so like all personal projects gone public, docs are lacking - that will change.

---
bear

Build and Release Engineer
Open Source Applications Foundation (OSAF)
be...@osafoundation.org
http://www.osafoundation.org

be...@code-bear.com
http://code-bear.com

Alexis

Feb 19, 2007, 12:00:33 PM
to parsedatetime-dev
Hi again,

On Feb 19, 5:29 am, bear <bea...@gmail.com> wrote:
> I don't have a public roadmap - just a bunch of thoughts Darshana and I have
> about where pdt should go. I want to get some of them down and start using
> the new wiki pages for that.

Great! I will keep my eye on those pages.

>
> 1) Is there a document that comprehensively summarizes what human-readable formats it can digest?


>
> Right now the best documentation for that is the unit tests. We made sure
> to add a test for every new parsing style we needed to handle.
>

Yes, I was looking over the unit tests. Did you ever consider using
doctest (http://docs.python.org/lib/module-doctest.html) instead of
unittest? With it, you can make unit tests that have the appearance of
annotated console sessions. I switched to doctest recently, and quite
like it.

It's especially apt for testing something like a parsing library,
which consists of many string specimens without too much setup code.
For instance, my unit tests for a few dates consist of this excerpt
from my (restructured) text file datescraper.txt:

[code]
Tests of ``datescraper.scrapeDate``
-------------------------------------------

>>> from datescraper import scrapeDate
>>> import datetime

>>> scrapeDate("22 February 2004")
datetime.date(2004, 2, 22)

>>> scrapeDate("22 feb 2004")
datetime.date(2004, 2, 22)

>>> scrapeDate("22nd February 2004")
datetime.date(2004, 2, 22)
[/code]

It's a nicely self-documenting style of unit test. I suspect the
unittest package carries baggage from its origins in less dynamic,
more strictly class-based languages -- maybe too much baggage. If this
direction intrigues you, I would be glad to rewrite one of the unit
test files in doctest, to give you a sense of how that would look.

> > 2) Does the library support switching to date-month-year format, instead of just month-date-year format?
>
> Yes, it's completely driven by the information PyICU returns for a locale
> and pdt also has an internal set of classes if PyICU is not available or may
> be too "heavy" to use.

Interesting. So does that mean that instead of custom regexes, pdt
uses the standard date and time formats provided by ICU in order to
generate corresponding regexes automatically? That's quite a feat. But
it puzzles me, considering all the regexes I see in the code, e.g.,
around lines 130-150.

> 3) What does PyICU provide besides localized month and day names?
>
> PyICU provides everything that IBM's ICU package provides - so that's a lot
> of information. Right now I only use it to pull month, day, am/pm and other
> date format related items.

Okay. I was just wondering what it provided that was not provided by
resources such as calendar.month_names or time.strftime(). I think
time.strftime(), for instance, has access to what you mention --
localized month names, day names, meridian markings, and abbreviations
-- and it's all provided by the underlying ANSI C library, which is
already built into the platform. However, time.strftime() may be
restricted to parsing the current locale. I'm not sure.
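To make the point concrete, here is a quick sanity check of what the standard library already exposes (names come out in English under the default C/English locale; under another locale they would differ):

```python
import calendar
import time

# struct_time for Sunday, 22 Feb 2004 (tm_wday=6, tm_yday=53)
t = time.struct_time((2004, 2, 22, 0, 0, 0, 6, 53, -1))

# Locale-dependent day and month names via the strftime mini-language
print(time.strftime("%A, %d %B %Y", t))

# The same names are also reachable through the calendar module
print(calendar.month_abbr[2], calendar.day_name[6])
```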

time.strftime() and time.strptime() rely on the same mini-language for
specifying date/time formats. strftime generates strings in a given
format, while strptime parses them. I suppose a lot of our work would
be unnecessary if there were a function like time.strptime() which,
instead of parsing a string based on a format, returned a regex based
on a format. With such a function, you could build a parsing library
merely by using the minilanguage to specify the subset of desired
formats, using the resulting regexes to find a hit in your sample
text, and then parsing the hit with time.strptime() or just by pulling
out matching groups. This is not the approach I took, but it might be
if I were starting from scratch. I'm curious how it strikes you.
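The format-to-regex idea above can be sketched in a few lines. The function name and the handful of supported directives here are illustrative, not a real library API:

```python
import re
import time

# Map a few strptime directives to regex fragments (illustrative subset)
DIRECTIVES = {
    "%d": r"\d{1,2}",   # day of month
    "%m": r"\d{1,2}",   # month number
    "%Y": r"\d{4}",     # four-digit year
}

def format_to_regex(fmt):
    """Turn a strptime-style format into a regex for scanning free text."""
    pattern = re.escape(fmt)
    for directive, sub in DIRECTIVES.items():
        pattern = pattern.replace(re.escape(directive), sub)
    return re.compile(pattern)

# Find a hit in sample text, then let time.strptime() do the real parsing
fmt = "%d/%m/%Y"
hit = format_to_regex(fmt).search("the meeting on 22/02/2004 was moved")
parsed = time.strptime(hit.group(0), fmt)
print(hit.group(0), parsed.tm_year, parsed.tm_mon, parsed.tm_mday)
```

The same format string drives both the scanning regex and the final parse, which is what makes the approach attractive.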

Thanks for your reply. I'll keep my eyes open in case more material
appears on the wiki or in the svn repos.

Cheers,
Alexis

bear

Feb 19, 2007, 8:56:26 PM
to parsedat...@googlegroups.com
On 2/19/07, Alexis <alexisg...@gmail.com> wrote:

Hi again,

On Feb 19, 5:29 am, bear <bea...@gmail.com> wrote:
> I don't have a public roadmap - just a bunch of thoughts Darshana and I have
> about where pdt should go.  I want to get some of them down and start using
> the new wiki pages for that.

Great! I will keep my eye on those pages.

:)

>
> 1) Is there a document that comprehensively summarizes what human-readable formats it can digest?
>
> Right now the best documentation for that is the unit tests.  We made sure
> to add a test for every new parsing style we needed to handle.
>

Yes, I was looking over the unit tests. Did you ever consider using
doctest (http://docs.python.org/lib/module-doctest.html) instead of
unittest? With it, you can make unit tests that have the appearance of
annotated console sessions. I switched to doctest recently, and quite
like it.

I've tried a number of times to make the switch to doctest and just have not enjoyed it.


It's a nicely self-documenting style of unit test. I suspect the
unittest package carries baggage from its origins in less dynamic,
more strictly class-based languages -- maybe too much baggage. If this
direction intrigues you, I would be glad to rewrite one of the unit
test files in doctest, to give you a sense of how that would look.

I would love to see one of the unit tests rewritten in a doctest style - that would probably be a better example for me than my own attempt :)
 

> > 2) Does the library support switching to date-month-year format, instead of just month-date-year format?
>
> Yes, it's completely driven by the information PyICU returns for a locale
> and pdt also has an internal set of classes if PyICU is not available or may
> be too "heavy" to use.

Interesting. So does that mean that instead of custom regexes, pdt
uses the standard date and time formats provided by ICU in order to
generate corresponding regexes automatically? That's quite a feat. But
it puzzles me, considering all the regexes I see in the code, e.g.,
around lines 130-150.

pdt uses a blend of both.  Within the more common regexes are placeholders for the localized text and that is generated once when ptc is first initialized.  For the m/d/y d/m/y y/d/m style regexes there is a custom bit of code that looks at how ICU describes the locale's date preference and then builds the regex to match.
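That locale-order idea can be illustrated with a small sketch (this is not pdt's actual code; the names are made up): take the field order a locale prefers and assemble the numeric-date regex to match.

```python
import re

# Named groups for each date field (illustrative)
PARTS = {
    "d": r"(?P<day>\d{1,2})",
    "m": r"(?P<month>\d{1,2})",
    "y": r"(?P<year>\d{2,4})",
}

def date_regex(order, sep="/"):
    """order is e.g. 'dmy' (en_AU/en_GB style) or 'mdy' (en_US style)."""
    return re.compile(re.escape(sep).join(PARTS[field] for field in order))

# The same input string, interpreted per the locale's field order
m = date_regex("dmy").search("due 01/02/2007")
print(m.group("day"), m.group("month"), m.group("year"))
```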
 
> 3) What does PyICU provide besides localized month and day names?
>
> PyICU provides everything that IBM's ICU package provides - so that's a lot
> of information.  Right now I only use it to pull month, day, am/pm and other
> date format related items.

Okay. I was just wondering what it provided that was not provided by
resources such as calendar.month_names or time.strftime(). I think
time.strftime(), for instance, has access to what you mention --
localized month names, day names, meridian markings, and abbreviations
-- and it's all provided by the underlying ANSI C library, which is
already built-in to the platform. However, time.strftime() may be
restricted to parsing the current locale. I'm not sure.

time.strftime() and time.strptime() rely on the same mini-language for
specifying date/time formats. strftime generates strings in a given
format, while strptime parses them. I suppose a lot of our work would
be unnecessary if there were a function like time.strptime() which,
instead of parsing a string based on a format, returned a regex based
on a format. With such a function, you could build a parsing library
merely by using the minilanguage to specify the subset of desired
formats, using the resulting regexes to find a hit in your sample
text, and then parsing the hit with time.strptime() or just by pulling
out matching groups. This is not the approach I took, but it might be
if I were starting from scratch. I'm curious how it strikes you.

In a lot of ways that's exactly what the various init routines for ptc are doing.  They parse the information returned from ICU and use that to either build the lists of text that are then inserted into regexes or create the regexes.

The initial reason for using PyICU, though, was not as fancy or meaningful to be honest - I was using it because it's written by an OSAF staffer and I wanted to get a feel for it and see how it works so I could give him some feedback.  There may be a better way of doing it and I'm open to those kinds of refactoring discussions.
 
Thanks for your reply. I'll keep my eyes open in case more material
appears on the wiki or in the svn repos.

Glad to chat with someone about the ins and outs of date/time parsing :)

Alexis

Feb 21, 2007, 3:57:48 AM
to parsedatetime-dev
Hi again,

On Feb 20, 1:56 am, bear <bea...@gmail.com> wrote:
> > It's a nicely self-documenting style of unit test. I suspect the
> > unittest package carries baggage from its origins in less dynamic,
> > more strictly class-based languages -- maybe too much baggage. If this
> > direction intrigues you, I would be glad to rewrite one of the unit
> > test files in doctest, to give you a sense of how that would look.
>
> I would love to see one of the unit tests rewritten in a doctest style -
> that would probably be a better example for me than my own attempt :)

Okay, I will have a whack at rewriting one of the unit tests as a
doctest to see how it goes. I notice that Calendar.parse() returns an
entire struct_time even if it only finds a few data values, and that it
also returns error values. That may make it a little challenging to use
with doctest, but I'll see when I try.

In my opinion the design of a nice, easy-to-use API for a parsing
library is one of the hardest parts. This issue of how to return
incomplete values is one of the main problems. Do you return only what
was parsed successfully (e.g., only day and month)? Do you guess the
missing value based on current time, a specified start time, or an
optional parameter that specifies the guessing policy? Do you return
only the data you found, or a large data type with codes indicating
which parts of it are meaningful?
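One possible answer to those questions, purely as an illustration: return only the fields that were actually parsed, with a way to ask which ones are meaningful. (The class and method names here are hypothetical.)

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ParseResult:
    """A partial parse: unfound fields stay None."""
    year: Optional[int] = None
    month: Optional[int] = None
    day: Optional[int] = None

    def found(self):
        """Names of the fields the parser actually filled in."""
        return [f for f in ("year", "month", "day")
                if getattr(self, f) is not None]

r = ParseResult(month=2, day=22)   # "22 February", no year given
print(r.found())
```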

I'm still studying your work. My library leans more in the direction
of returning a simple data structure (datetime.time, and datetime.date
objects), which may be why doctest was a good fit. However, it's also
less powerful about what kind of guesswork it does to fill in the year
(for instance). Analogous issues come up with respect to time parsing (assume
PM times past a certain hour?).

The issue of incomplete return values is the first big tricky issue
for this kind of library. It complicates both the return value and the
calling interface of the functions. I'd say the other big issues are
(1) how to prepare valid date and time objects (do you allow the user
to specify the tz and dst of the specimen?), and (2) designing the
library to allow fairly easy addition of new supported formats.

As regards the API design, I suspect neither of our libraries has
really hit the right balance. I suspect there's a way to require less
study from the user and little setup code, in order to make the
majority of simple parsing requests. As regards the internals and
adding new formats, I'm still not sure about the best way to do the
internals -- I'm reading your code now. My code does not have as
strict a division as I would like between the *textual parsing*, which
takes specified formats and strings and returns particular data
values, and the *processing* of those values, which may need to take
into account timezone or DST issues.

I'm curious how perl's Date::Manip handles these broader issues, since
I think it's the mother-of-all human language date parsers. Also, I
don't really know much about grammars and parsing in general, so
I may be missing the obvious here.

> In a lot of ways that's exactly what the various init routines for ptc are
> doing. They parse the information returned from ICU and use that to either
> build the lists of text that are then inserted into regexes or create the
> regexes.

Very interesting. I'll have to look at it more closely.

> The initial reason for using PyICU tho was not as fancy or meaningful to be
> honest - I was using it because it's written by an OSAF staffer and I wanted
> to get a feel for it and see how it works so I could give him some
> feedback. There may be a better way of doing it and I'm open to those kind
> of refactoring discussions.

Okay. I'm just wondering if ICU provides anything not available from
the standard library.

My own instinct is that supporting English-language date and time
parsing is a pretty hairy problem on its own, given the variety of
formats people actually use. I live in the UK, so I'd like to support
the mm/dd vs dd/mm distinction, but that's the extent of my ambitions.

Also, because of the variety of forms in real-life use, I'm a bit
skeptical of the value of trying to support non-English languages in
the absence of a compendium of specimens of foreign language text
taken from actual use. I wonder if getting the name strings from ICU
really buys enough foreign language parsing to justify the complexity
cost of bringing in an external library. But maybe ICU also has good
information about prevailing informal formatting styles?

I don't mean to criticize. Just thinking out loud...

> > Thanks for your reply. I'll keep my eyes open in case more material
> > appears on the wiki or in the svn repos.
>
> Glad to chat with someone about the ins and outs of date/time parsing :)

Yes, it is nice!

alexis

bear

Feb 21, 2007, 5:15:37 PM
to parsedatetime-dev

On Feb 21, 3:57 am, "Alexis" <alexisgallag...@gmail.com> wrote:
> Hi again,
>
> On Feb 20, 1:56 am, bear <bea...@gmail.com> wrote:
>
> > > It's a nicely self-documenting style of unit test. I suspect the
> > > unittest package carries baggage from its origins in less dynamic,
> > > more strictly class-based languages -- maybe too much baggage. If this
> > > direction intrigues you, I would be glad to rewrite one of the unit
> > > test files in doctest, to give you a sense of how that would look.
>
> > I would love to see one of the unit tests rewritten in a doctest style -
> > that would probably be a better example for me than my own attempt :)
>
> Okay, I will have a whack at rewriting one of the unit tests as a
> doctest to see how it goes. I notice that Calendar.parse() returns an
> entire struct_time even if it only finds a few data values, and that it
> also returns error values. That may make it a little challenging to use
> with doctest, but I'll see when I try.
>
> In my opinion the design of a nice, easy-to-use API for a parsing
> library is one of the hardest parts. This issue of how to return
> incomplete values is one of the main problems. Do you return only what
> was parsed successfully (e.g., only day and month)? Do you guess the
> missing value based on current time, a specified start time, or an
> optional parameter that specifies the guessing policy? Do you return
> only the data you found, or a large data type with codes indicating
> which parts of it are meaningful?

One of the earlier uses for the library was to populate a full date
and time from as little information as possible, so that drove the API
for sure. The internals all work from a start datetime value and
default to "now" if it's not given.
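That populate-from-a-start-time behaviour can be shown in miniature (the dict-based interface here is illustrative, not pdt's API): any field the parser didn't find is copied from a source datetime, defaulting to "now".

```python
import datetime

def complete(parsed, source=None):
    """Fill unparsed date/time fields from a source datetime."""
    source = source or datetime.datetime.now()
    return datetime.datetime(
        parsed.get("year", source.year),
        parsed.get("month", source.month),
        parsed.get("day", source.day),
        parsed.get("hour", source.hour),
        parsed.get("minute", source.minute),
    )

base = datetime.datetime(2007, 2, 21, 9, 30)
print(complete({"day": 25}, base))   # day from the parse, the rest from base
```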

> I'm still studying your work. My library leans more in the direction
> of returning a simple data structure (datetime.time, and datetime.date
> objects), which may be why doctest was a good fit. However, it's also
> less powerful about what kind of guesswork it does to fill in the year
> (for instance). Analogous issues come up with respect to time parsing (assume
> PM times past a certain hour?).
>
> The issue of incomplete return values is the first big tricky issue
> for this kind of library. It complicates both the return value and the
> calling interface of the functions. I'd say the other big issues are
> (1) how to prepare valid date and time objects (do you allow the user
> to specify the tz and dst of the specimen?), and (2) designing the
> library to allow fairly easy addition of new supported formats.

Item 1 I decided to defer to the caller :) basically assuming that
the library will be passed generic (in the TZ sense) information and
that any adjustments will be made afterwards. When I looked at all
the messy details involved in knowing and dealing with TZ info, I
quickly decided to make the library TZ neutral.
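That TZ-neutral stance in miniature: the parser hands back a naive datetime, and the caller attaches a zone afterwards.

```python
import datetime

# What a TZ-neutral parser would return: a naive datetime
naive = datetime.datetime(2007, 2, 21, 15, 0)
assert naive.tzinfo is None

# The caller's adjustment, made after parsing (a fixed offset as an example)
eastern = datetime.timezone(datetime.timedelta(hours=-5))
aware = naive.replace(tzinfo=eastern)
print(aware.isoformat())
```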

Issue 2 is something I'm struggling with as I think about what needs
to be changed to split the API up into smaller parts that work with
each other. Right now the code has huge chunks of linear logic that
are, IMO, a huge source of bugs waiting to be discovered.

> As regards the API design, I suspect neither of our libraries has
> really hit the right balance. I suspect there's a way to require less
> study from the user and little setup code, in order to make the
> majority of simple parsing requests. As regards the internals and
> adding new formats, I'm still not sure about the best way to do the
> internals -- I'm reading your code now. My code does not have as
> strict a division as I would like between the *textual parsing*, which
> takes specified formats and strings and returns particular data
> values, and the *processing* of those values, which may need to take
> into account timezone or DST issues.

You're hitting all of the same points I've gone over myself. Finding
that balance I think is going to be crucial if the library is to
become very stable. My thought is that I will have to provide a lot
of little entry points to cover specific parsing requirements and then
a couple entry points that have more defaults and can be considered
portals for the less specific user.

The division between text parsing (what I call chunking :) and the
actual datetime calculation is another big point, I agree. Right now
the internals are intermingled between those two items and they need
to be split up. This way someone who may not need the actual date
value but is instead interested in the relation of the various
datetime chunks to each other (or to the surrounding text) can also
use the library.

> I'm curious how perl's Date::Manip handles these broader issues, since
> I think it's the mother-of-all human language date parsers. Also, I
> also don't really know much about grammars and parsing in general, so
> I may be missing the obvious here.

I studied Date::Manip intensely to see how it was structured and also
to see what features it has that I would like to implement and came
away feeling that the internals of Date::Manip may have started small
and tight but quickly grew into a monster over time.

The sad part was I found myself running into the same pitfalls as soon
as I took the first version of pdt and started adding locale specific
items! But I'm hoping I've avoided a lot of the maintenance nightmare
in how I've structured the internals.

I wouldn't consider myself a parsing/grammar guru but I've done enough
of both to know the tools, and I think I went the route I did simply
for ease of maintenance right now. I feel that later, once all of
the patterns are worked out, some of the parsing internals could be
refactored into a grammar if needed. But right now I don't think it
needs to be done.

> > In a lot of ways that's exactly what the various init routines for ptc are
> > doing. They parse the information returned from ICU and use that to either
> > build the lists of text that are then inserted into regexes or create the
> > regexes.
>
> Very interesting. I'll have to look at it more closely.
>
> > The initial reason for using PyICU tho was not as fancy or meaningful to be
> > honest - I was using it because it's written by an OSAF staffer and I wanted
> > to get a feel for it and see how it works so I could give him some
> > feedback. There may be a better way of doing it and I'm open to those kind
> > of refactoring discussions.
>
> Okay. I'm just wondering if ICU provides anything not available from
> the standard library.

It's possible that the standard library does and I'm always open to
moving bits away from ICU to reduce the complexity.

> My own instinct is that supporting English-language date and time
> parsing is a pretty hairy problem on its own, given the variety of
> formats people actually use. I live in the UK, so I'd like to support
> the mm/dd vs dd/mm distinction, but that's the extent of my ambitions.

From my personal experience the mm/dd dd/mm issue is a solved problem
*if* the library and the caller have a way of telling each other what
the assumptions are. I make it an easier problem to solve by having
the caller tell the library (using the localeID parameter) how it
should interpret 01/02.
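The ambiguity in miniature: the same string parsed under the two conventions, with the caller's locale choice deciding which one applies.

```python
import time

s = "01/02/2007"
us = time.strptime(s, "%m/%d/%Y")   # a month-first locale's reading
uk = time.strptime(s, "%d/%m/%Y")   # a day-first locale's reading
print((us.tm_mon, us.tm_mday), (uk.tm_mon, uk.tm_mday))
```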

> Also, because of the variety of forms in real-life use, I'm a bit
> skeptical of the value of trying to support non-English languages in
> the absence of a compendium of specimens of foreign language text
> taken from actual use. I wonder if getting the name strings from ICU
> really buys enough foreign language parsing to justify the complexity
> cost of bringing in an external library. But maybe ICU also has good
> information about prevailing informal formatting styles?

All good points, and I asked myself the same when considering whether
to even do localized parsing. It turns out that the IBM folks have done a
*lot* of brain work within ICU and they have the API for the library
to discover the various differences in date/time formats.

The only part that any library cannot solve is how to interpret the
flow of the different chunks of text. For example, in English it is
common to say "next tuesday" and most people understand that you are
talking about the first Tuesday that follows the current day. But how
is that expressed in French or Spanish?

Heck, even in the US that phrase can be different if the current day
is Monday. Some regions would say the Tuesday that is +1 day away
(i.e. in the same week) is the answer but others mean that it's the
Tuesday that is +8 days away. All very twisty and fun!
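The two readings of "next Tuesday" from a Monday can be written out as a small sketch (the function and flag names are made up for illustration):

```python
import datetime

TUESDAY = 1   # date.weekday(): Monday == 0

def next_weekday(start, target, following_week=False):
    """Nearest upcoming `target` weekday; with following_week=True,
    the one in the week after (the "+8 days" reading)."""
    days = (target - start.weekday()) % 7 or 7
    if following_week:
        days += 7
    return start + datetime.timedelta(days=days)

monday = datetime.date(2007, 2, 19)   # a Monday
print(next_weekday(monday, TUESDAY))                        # +1 day
print(next_weekday(monday, TUESDAY, following_week=True))   # +8 days
```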

> I don't mean to criticize. Just thinking out loud...

Hey - if I cannot answer then that means I haven't thought it out and
that means it's an area I need to learn more about. I *love* talking
about these items with people.
