A question about parsers


Alex Sirota

Dec 2, 2010, 3:08:00 PM
to tor...@googlegroups.com
Hi gang,

I'm looking for some good open source, generic parser software. Is Lucene considered a parser? From what I've read it is a high-quality search engine, but can it be programmed to parse specific grammars? I'm looking at parsing curriculum vitae documents -- a much more complex version of a resume.

Thanks for any info you can lend.

--
Alex Sirota

Amplify your small business online with NewPath Technologies!

E: al...@newpathconsulting.com

"At the moment of commitment, the universe conspires to assist you." -- Johann Wolfgang von Goethe

Michael Reid

Dec 2, 2010, 4:27:44 PM
to tor...@googlegroups.com
Hi,

Are these CV documents really described by a regular grammar? If they
don't have a well-defined structure then I suppose you would need
something like data mining or maybe some really rudimentary/lightweight
natural language processing? I'm an expert in neither of these areas,
so my terminology may be way off or I may simply be wrong.

However, if you really do have a regular grammar, then there are reams
and reams of parser generators/libraries available for almost any
(computer) language you care to work in.

Two packages available for Java are:

* JParsec - http://jparsec.codehaus.org/
* ANTLR - http://www.antlr.org/

I chose these as examples because they represent the two major
approaches you'll encounter. The JParsec style is more like a library
of small parsers that you can combine to compose larger, more complex
parsers. With the ANTLR approach, you precisely specify your grammar
to the parser software, and the tool spits out a program that can
parse that grammar.
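
To give a feel for the combinator style, here's a tiny hand-rolled
sketch in Java. To be clear, this is not JParsec's actual API (which I
won't try to reproduce from memory), just an illustration of composing
small parsers into bigger ones:

// A minimal hand-rolled combinator sketch -- NOT JParsec's real API.
public class ComboSketch {
    // A parser tries to match at pos; returns the new position on
    // success, or -1 on failure.
    interface P { int parse(String in, int pos); }

    // Match one or more digits.
    static P digits() {
        return (in, pos) -> {
            int i = pos;
            while (i < in.length() && Character.isDigit(in.charAt(i))) i++;
            return i > pos ? i : -1;
        };
    }

    // Match a literal string.
    static P lit(String s) {
        return (in, pos) -> in.startsWith(s, pos) ? pos + s.length() : -1;
    }

    // Run a, then b on what's left.
    static P seq(P a, P b) {
        return (in, pos) -> {
            int mid = a.parse(in, pos);
            return mid < 0 ? -1 : b.parse(in, mid);
        };
    }

    public static void main(String[] args) {
        // Compose a tiny ISO-date parser: digits "-" digits "-" digits.
        P date = seq(digits(), seq(lit("-"), seq(digits(), seq(lit("-"), digits()))));
        System.out.println(date.parse("2010-12-02", 0));  // 10 (whole input matched)
        System.out.println(date.parse("Dec 2, 2010", 0)); // -1 (no match)
    }
}

Note that the date parser is just an ordinary value built out of
smaller parser values; nothing is generated ahead of time, which is the
key contrast with the ANTLR route.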

Personally, I tend to favour the first approach, since it lets you
work more informally, and generally I find such libraries easier to
integrate into existing code. However, the ANTLR approach has been
around since the 70s, so it is well proven.

If you can clarify what kind of data you are looking to extract, as
well as some details about the grammar, I'm sure a lot of options will
present themselves.

/mike.


Alex Sirota

Dec 2, 2010, 4:43:11 PM
to tor...@googlegroups.com
Mike,

Thank you for the detailed set of options. Much appreciated. CVs are a somewhat more complex form of resume. Although it would be great to present a grammar, there is unfortunately no single grammar that works. There are a multitude of resume formats, and CVs unfortunately share very little with resumes beyond work history and demographic information.

Here's a sample CV format from the University of Toronto's Department of Medicine.

http://www.deptmedicine.utoronto.ca/Faculty/Promotion_CV/Curriculum_Vitae.htm

CVs can be many pages long and can have as many as 100 different data types.

Here's a standard being proposed that would form a grammar, but how do you get from unstructured CVs to a proposed standard without going through a lengthy adoption and conversion process (many years)?

http://casrai.org/data-standards/cv-standard

I'm thinking of a multiphase process that includes:

a) a standard best-effort parse to get items like work history, demographics, and honors and awards (roughly sketched just below this list)

b) a custom parse that looks at bibliographic and funded projects as well as research teams and publications

c) a system that allows one to fill in the gaps and correct mistakes visually after a) and b) have completed.

d) conversion to XML

e) a Lucene-style search engine on top to analyze the results
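
As a rough illustration of what I mean by a), here is the kind of best-effort sectioning pass I'm imagining, sketched in Java. The heading list is just a guess on my part; a real system would need a much larger, tuned list.

import java.util.*;
import java.util.regex.*;

// Best-effort phase a) sketch: split raw CV text into labelled
// sections on common headings. The heading list is a guess.
public class SectionSplitter {
    static final Pattern HEADING = Pattern.compile(
        "(?im)^\\s*(work history|employment|education|honors and awards|publications)\\s*$");

    static Map<String, String> split(String cvText) {
        Map<String, String> sections = new LinkedHashMap<>();
        Matcher m = HEADING.matcher(cvText);
        String current = "preamble"; // whatever precedes the first heading
        int start = 0;
        while (m.find()) {
            sections.put(current, cvText.substring(start, m.start()).trim());
            current = m.group(1).toLowerCase();
            start = m.end();
        }
        sections.put(current, cvText.substring(start).trim());
        return sections;
    }

    public static void main(String[] args) {
        String cv = "Jane Doe, MD\nEducation\nU of T, MD 1995\nPublications\nPaper A (2001)";
        split(cv).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}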

The main issue is that while there are only about 40,000 such CVs in Canada, each CV is long and quite complex. There are probably close to 40,000 different formats, and only a small subset of them would share any formatting structure (owing to similarities in how they were created).

While resume parsers seem to do a decent enough job on a well-formatted resume, it seems CV parsing is a potentially intractable problem, or so I've been told.

But surely it can't be.

Thanks for replying. Looking forward to any other input.

Yanni Chiu

Dec 6, 2010, 12:24:08 PM
to tor...@googlegroups.com
On 02/12/10 4:43 PM, Alex Sirota wrote:
>
> I'm thinking of a multiphase process that includes:
>
> a) a standard best-effort parse to get items like work history,
> demographics, and honors and awards
>
> b) a custom parse that looks at bibliographic and funded projects as
> well as research teams and publications
>
> c) a system that allows one to fill in the gaps and correct mistakes
> visually after a) and b) have completed.
>
> d) conversion to XML
>
> e) a Lucene-style search engine on top to analyze the results
>
> The main issue is that while there are only about 40,000 such CVs in
> Canada, each CV is long and quite complex. There are probably close
> to 40,000 different formats, and only a small subset of them would
> share any formatting structure (owing to similarities in how they
> were created).
>
> While resume parsers seem to do a decent enough job on a
> well-formatted resume, it seems CV parsing is a potentially
> intractable problem, or so I've been told.
>
> But surely it can't be.
>
> Thanks for replying. Looking forward to any other input.

I see a solution like this...

- develop a model of CV metadata (your link to
http://casrai.org/data-standards/cv-standard looks like a good place to
start)
- build your CV database
- analyze each CV, and record the metadata (a rough sketch of such a
record follows this list)
- a manual interface to fill in the metadata while simultaneously
viewing the original CV would be a useful escape hatch
- a search interface that can look at the raw CV (i.e. free-text search)
and/or the metadata (i.e. tags or keywords in particular metadata fields)
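
To make "record the metadata" concrete, here's a minimal sketch of the
kind of per-CV record you'd accumulate. I'll use Java since the thread
started with Java tools; the field names are invented, and the CASRAI
draft would drive the real model.

import java.util.*;

// Minimal sketch of a per-CV metadata record. Field names are
// invented; the CASRAI draft standard would drive the real model.
public class CvMetadata {
    String name;                                  // demographic basics
    List<String> positions = new ArrayList<>();   // work history
    List<String> degrees = new ArrayList<>();
    List<String> honors = new ArrayList<>();
    List<String> publications = new ArrayList<>();
    // Sections the parsers couldn't handle, kept verbatim so the
    // manual fill-in interface can show them beside the original CV.
    Map<String, String> unparsed = new LinkedHashMap<>();
}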

The technical unknown is "analyze each CV". You could have a set of CV
parsers available (and more developed as needed). A heuristic (or manual
selection) could be used to choose which parser to apply. Maybe multiple
parsers, each targeted at gathering specific metadata, could be used. A
final validation heuristic could determine whether all the useful text
has been extracted into metadata. A manual interface could be available
to fine-tune the metadata of any particular CV.
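
In code, that select-and-validate loop might look something like the
following sketch (Java again for illustration; every name is invented,
and the 0.8 threshold is a pure guess):

import java.util.*;

// Sketch of the "analyze each CV" step: run every available parser,
// score each result with a coverage heuristic, keep the best, and
// flag low-coverage CVs for manual review. All names are invented.
public class CvAnalyzer {
    // A parser maps raw CV text to extracted field -> value pairs.
    interface CvParser { Map<String, String> parse(String cvText); }

    // Validation heuristic: fraction of the CV's characters that made
    // it into some extracted field. Crude, but it flags bad parses.
    static double coverage(Map<String, String> fields, String cvText) {
        int extracted = 0;
        for (String v : fields.values()) extracted += v.length();
        return cvText.isEmpty() ? 0 : Math.min(1.0, (double) extracted / cvText.length());
    }

    static Map<String, String> analyze(String cvText, List<CvParser> parsers,
                                       List<String> needsManualReview) {
        Map<String, String> best = Collections.emptyMap();
        double bestScore = -1;
        for (CvParser p : parsers) {
            Map<String, String> fields = p.parse(cvText);
            double score = coverage(fields, cvText);
            if (score > bestScore) { bestScore = score; best = fields; }
        }
        if (bestScore < 0.8) needsManualReview.add(cvText); // threshold: a guess
        return best;
    }
}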

As for technology, have a look at the hot-off-the-press PhD work and
open-source implementation at:
http://www.lukas-renggli.ch/blog/petitparser-1
http://www.lukas-renggli.ch/blog/petitparser-2

In this system, the parsers are just a part of your development
environment (Smalltalk). You can modify them, create new ones, and
compose them -- which, IMHO, is the kind of parsing system you'll need to
solve your problem.

Hope that helps.

--
Yanni Chiu

Alex Sirota

Dec 6, 2010, 4:25:15 PM
to tor...@googlegroups.com
Thank you, Yanni. Thanks especially for the open source parsers -- it has been difficult to find something like this. Smalltalk is an unfortunate technology choice, but it is probably very well suited to the task.

It does seem that different parts of the CV need different parsers suited to them, as some sections (like the bibliography) are very consistent while many others differ widely in format.


If you come across any more parsers please let me know.

Alex



Yanni Chiu

Dec 6, 2010, 6:01:34 PM
to tor...@googlegroups.com
On 06/12/10 4:25 PM, Alex Sirota wrote:
> Smalltalk is an unfortunate
> technology choice, but is probably very well suited to the task.

Not sure whether there's a link off the blog, but there's a paper that
lists the author's reasons for choosing Smalltalk for the task of
"language engineering". The reasons are:
- minimal syntax
- dynamic semantics
- reflective facilities
- homogeneous language (Smalltalk is implemented in Smalltalk)
- homogeneous tools (developer tools & env implemented in Smalltalk)
- on-the-fly programming (no compile/link, changes are immediate to the
live system)

I'd like to understand why you chose the words "unfortunate technology
choice". What impediment do you foresee due to the language choice?
Within the Smalltalk community this is a difficult question to answer,
due to familiarity with what can be done with Smalltalk -- "it makes
simple things simple, and complicated things possible".

Alex Sirota

Dec 6, 2010, 6:33:03 PM
to tor...@googlegroups.com
Unfortunate because the system should be open source, and there is a very small Smalltalk community compared to other languages. The installation profile is probably more complex, and getting support/help is more expensive and/or harder to come by.

But that's the breaks... Parsing is a somewhat esoteric field, I guess.

Alex


Rick Innis

Dec 7, 2010, 10:11:41 AM
to tor...@googlegroups.com
Might I suggest there are some Prolog apps out there that'll solve this?
Prolog and natural language can be a pretty good fit. There's a list of
Prolog natural language tools at http://www.ai.uga.edu/mc/pronto/

Unfortunately, as with Smalltalk, finding a good Prolog developer can
be challenging. (OK, I have to ask: hands up, anyone else here who's
programmed in Prolog and Smalltalk outside of coursework. Anyone?
Bueller?)

R.

Yanni Chiu

Dec 7, 2010, 12:18:48 PM
to tor...@googlegroups.com
On 06/12/10 6:33 PM, Alex Sirota wrote:
> Unfortunate because the system should be open source, and there is a
> very small Smalltalk community compared to other languages.

The Squeak/Pharo variant of Smalltalk, on which the PetitParser work is
built, as well as the parser code itself, is all MIT licensed. It took
many years of community effort to move from the original "custom open
source" license used by Apple for Squeak's release.

As for the small community, how big does it need to be to provide you
with a solution? I recently learned that Ryerson U. is actually using
Smalltalk as the main teaching language in one of its core courses. So
there are more Smalltalk-knowledgeable people out there than you might
think. Also, there's:

"I always knew that one day Smalltalk would replace Java. I just didn't
know it would be called Ruby." -- Kent Beck

> The installation profile is probably more complex, and getting
> support/help is more expensive and/or harder to come by.

It's not complex to install. Download the one-click image, execute 5
lines of installation code, and you should have the parser system ready
to evaluate. Most things seem complicated when you've not seen them
before.

Getting support/help on someone's fresh PhD work could be hard in
general, but I've seen the author respond to email and blog questions
about his work.

As for Squeak/Pharo support/help being expensive or hard to obtain, I'm
not sure what your expectations are. Generally the mailing list is the
place to ask, as with any open-source project -- support there is free
and easy to get. There are several commercial Smalltalk vendors -- so
that support is not free, but I doubt it'd be out of line with industry
pricing. If you're looking for a highly visible ecosystem of small
companies, then that's not here in North America, but it's developing in
Europe and South America.

> But that's the breaks... Parsing is a somewhat esoteric field, I
> guess.

For your parsing problem, it seems the evaluation criteria need to be
wider.

Yanni Chiu

Dec 7, 2010, 12:23:03 PM
to tor...@googlegroups.com
On 07/12/10 10:11 AM, Rick Innis wrote:
> Might I suggest there are some Prolog apps out there that'll solve this?
> Prolog and natural language can be a pretty good fit. There's a list of
> Prolog natural language tools at http://www.ai.uga.edu/mc/pronto/
>
> Unfortunately, as with Smalltalk, finding a good Prolog developer can
> be challenging. (OK, I have to ask: hands up, anyone else here who's
> programmed in Prolog and Smalltalk outside of coursework. Anyone?
> Bueller?)

Can I raise my hand halfway? Part of my job years ago was to install
and support an experimental Prolog implementation. I only had to learn
enough Prolog to know that it was working.

Alex Sirota

Dec 7, 2010, 12:55:03 PM
to tor...@googlegroups.com
All right -- you make some great points. The last time I looked at Smalltalk was on a Mac in the 80s. Lots has evolved since then. I like the Ruby quote -- I was hoping this system was somehow available in Ruby, but maybe it's still early days.

I'm learning Ruby now, so maybe this will be a side project to explore how parsers work alongside it.

Seems like a steep learning curve, but I agree everything does until you actually do it.

Alex


Aaron Wieland

Dec 7, 2010, 8:10:11 PM
to tor...@googlegroups.com
On 07/12/2010 10:11 AM, Rick Innis wrote:
> Unfortunately, as with Smalltalk, finding a good Prolog developer can
> be challenging. (OK, I have to ask: hands up, anyone else here who's
> programmed in Prolog and Smalltalk outside of coursework. Anyone?
> Bueller?)

I moved to Toronto in 2001 for a Smalltalk position, where I remained
for a few years. Previously, I had used Smalltalk semi-covertly for
several months at a federal job.

-- Aaron

Bob Hutchison

Dec 8, 2010, 6:50:18 AM
to tor...@googlegroups.com
On 2010-12-07, at 12:55 PM, Alex Sirota wrote:

> All right -- you make some great points. The last time I looked at Smalltalk was on a Mac in the 80s. Lots has evolved since then. I like the Ruby quote -- I was hoping this system was somehow available in Ruby, but maybe it's still early days.
>
> I'm learning Ruby now, so maybe this will be a side project to explore how parsers work alongside it.
>
> Seems like a steep learning curve, but I agree everything does until you actually do it.

Hi Alex, I know you a little bit, and I seriously doubt that you'll have any difficulty at all :-) And I think you should take this opportunity to experience something eye-opening. Smalltalk hasn't gone away, and for good reason. If you go with Squeak/Pharo you'll find support easily enough through the mailing lists.

Cheers,
Bob
----
Bob Hutchison
Recursive Design Inc.



