Are these CV documents really regular grammars? If they don't have a
well-defined structure, then I suppose you would need something like
data mining, or maybe some rudimentary/lightweight natural language
processing. I'm an expert in neither of these areas, so my terminology
may be way off, or I may simply be wrong.
However, if you really do have a regular grammar then there are simply
reams and reams of parser generators/libraries available for almost
any (computer) language you care to work in.
Two packages available for Java are:
* JParsec - http://jparsec.codehaus.org/
* ANTLR - http://www.antlr.org/
I chose these as examples because they represent the two major
approaches you'll encounter. The JParsec style is more like a library
of small parsers that you can combine to compose larger, more complex
parsers. With the ANTLR approach, you precisely specify your grammar
to the parser software, and the tool spits out a program that can
parse that grammar.
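To make the first style concrete, here is a toy combinator sketch from scratch (this is not JParsec's actual API, and it's in Ruby rather than Java, purely for brevity). The idea: each parser is a small function taking (string, position) and returning [value, new_position] on success or nil on failure, and you glue them together.

```ruby
# A parser = lambda from (string, position) to [value, new_position] or nil.
def char(c)
  ->(s, i) { s[i] == c ? [c, i + 1] : nil }
end

def regex(re)
  # Succeed only if the regex matches exactly at the current position.
  ->(s, i) { (m = re.match(s, i)) && m.begin(0) == i ? [m[0], m.end(0)] : nil }
end

def seq(*parsers)
  # Run parsers in order; fail as a whole if any one fails.
  lambda do |s, i|
    values = []
    parsers.each do |p|
      result = p.call(s, i) or return nil
      values << result[0]
      i = result[1]
    end
    [values, i]
  end
end

# Compose small parsers into a bigger one: a "Field: value" line.
header = seq(regex(/[A-Za-z]+/), char(":"), char(" "), regex(/[^\n]*/))
value, = header.call("Email: jane@example.org", 0)
# value => ["Email", ":", " ", "jane@example.org"]
```

Real combinator libraries add choice, repetition, and error reporting on top of this, but composition is the core of the approach.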
Personally, I tend to favour the first approach since it lets you work
more informally, and I generally find such libraries easier to
integrate into existing code. However, the ANTLR approach has been
around since the 70s, so it is well proven.
If you can clarify what kind of data you are looking to extract, as
well as some details about the grammar I'm sure a lot of options will
present themselves.
/mike.
> --
> You received this message because you are subscribed to the Google Groups
> "TorCamp" group.
> To post to this group, send email to tor...@googlegroups.com.
> To unsubscribe from this group, send email to
> torcamp+u...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/torcamp?hl=en.
>
I see a solution like this...
- develop a model of CV metadata (your link to
http://casrai.org/data-standards/cv-standard looks like a good place to
start)
- build your CV database
- analyze each CV, and record the metadata
- a manual interface to fill in the metadata, while simultaneously
viewing the original CV would be a useful escape hatch
- a search interface that can look at the raw CV (i.e. free-text search)
and/or the metadata (i.e. tags or keywords in particular metadata fields)
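The database and search pieces above could be sketched as follows (a hypothetical Ruby sketch; the record fields are invented, and a real model would follow the CASRAI CV standard linked earlier):

```ruby
# A CV record keeps the original text plus extracted metadata fields.
CvRecord = Struct.new(:raw_text, :name, :keywords)

class CvDatabase
  def initialize
    @records = []
  end

  def add(record)
    @records << record
  end

  # Free-text search over the raw CV.
  def search_text(term)
    @records.select { |r| r.raw_text.include?(term) }
  end

  # Metadata search over the keyword field.
  def search_keywords(tag)
    @records.select { |r| r.keywords.include?(tag) }
  end
end

db = CvDatabase.new
db.add(CvRecord.new("Jane Doe. PhD thesis on parsing...", "Jane Doe", ["parsing"]))
```

The manual escape hatch would just be a form that edits a CvRecord's metadata fields while displaying raw_text beside it.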
The technical unknown is "analyze each CV". You could have a set of CV
parsers available (with more developed as needed). A heuristic (or manual
selection) could be used to choose which parser to apply. Maybe multiple
parsers, each targeted at gathering specific metadata, could be used. A
final validation heuristic could determine whether all the useful
text has been extracted into metadata, and a manual interface could be
available to fine-tune the metadata of any particular CV.
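That parser-pool idea might look something like this (a hypothetical Ruby sketch; the class names and the applies?/extract protocol are invented for illustration):

```ruby
# Each parser has a heuristic (applies?) and an extractor (extract).
class EmailParser
  def applies?(cv)
    cv.include?("@")
  end

  def extract(cv)
    { email: cv[/\S+@\S+/] }
  end
end

class DegreeParser
  def applies?(cv)
    !!(cv =~ /PhD|MSc|BSc/)
  end

  def extract(cv)
    { degree: cv[/PhD|MSc|BSc/] }
  end
end

PARSERS = [EmailParser.new, DegreeParser.new]

# Run every parser whose heuristic says it applies, merging the metadata.
def analyze(cv)
  PARSERS.select { |p| p.applies?(cv) }
         .map { |p| p.extract(cv) }
         .reduce({}, :merge)
end

# Final validation heuristic: flag CVs where too little was extracted,
# so a person can fill in the rest by hand.
def needs_manual_review?(metadata)
  metadata.size < 2
end

meta = analyze("Jane Doe, PhD. Contact: jane@example.org")
# meta => { email: "jane@example.org", degree: "PhD" }
```

New parsers get added to the pool as new CV shapes turn up, and anything that fails the validation check goes to the manual interface.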
As for technology, have a look at the hot off the press PhD work and
open-source implementation at:
http://www.lukas-renggli.ch/blog/petitparser-1
http://www.lukas-renggli.ch/blog/petitparser-2
In this system, the parsers are just a part of your development
environment (Smalltalk). You can modify them, create new ones, and
compose them - which, IMHO, is the kind of parsing system you'll need to
solve your problem.
Hope that helps.
--
Yanni Chiu
Not sure whether there's a link off the blog, but there's a paper that
lists the author's reasons for choosing Smalltalk for the task of
"language engineering". The reasons are:
- minimal syntax
- dynamic semantics
- reflective facilities
- homogeneous language (Smalltalk is implemented in Smalltalk)
- homogeneous tools (developer tools & env implemented in Smalltalk)
- on-the-fly programming (no compile/link, changes are immediate to the
live system)
I'd like to understand why you chose the words "unfortunate technology
choice". What impediment do you foresee due to the language choice?
Within the Smalltalk community this is a difficult question to answer,
due to familiarity with what can be done with Smalltalk - "it makes
simple things simple, and complicated things possible".
Unfortunately, as with Smalltalk, finding a good Prolog developer can
be challenging. (OK, I have to ask: hands up, anyone else here who's
programmed in Prolog and Smalltalk outside of coursework. Anyone?
Bueller?)
R.
The Squeak/Pharo variant of Smalltalk, on which the PetitParser work is
built, as well as the parser stuff itself, is all MIT licensed. It
took many years of community effort to move from the original "custom
open source" license used by Apple for Squeak's release.
As for small community, how big does it need to be to provide you a
solution? I recently learned that Ryerson U. is actually using Smalltalk
as the main teaching language in one of its core courses. So there are
more Smalltalk knowledgeable people out there than you might think.
Also, there's:
"I always knew that one day Smalltalk would replace Java. I just didn't
know it would be called Ruby." -- Kent Beck
> And the installation profile is probably more complex and getting
> support/help is either more expensive and/or harder to do.
It's not complex to install. Download the one-click image, execute 5
lines of installation code, and you should have the parser system ready
to evaluate. Most things seem complicated when you've not seen them before.
Getting support/help on someone's fresh PhD work could be hard in
general, but I've seen the author respond to email and blog questions
about his work.
As for Squeak/Pharo support/help being expensive or hard to obtain, I'm
not sure what your expectations are. Generally the mailing list is the
place to ask, as with any open-source project - support there is free
and easy to get. There are several commercial Smalltalk vendors - so
that support is not free, but I doubt it'd be out of line with industry
pricing. If you're looking for a highly visible ecosystem of small
companies, that's not here in North America, but it's developing in
Europe and South America.
> But that's the breaks... Parsers are somewhat an esoteric field I
> guess.
For your parsing problem, it seems the evaluation parameters need to be
wider.
Can I raise my hand halfway, if part of my job years ago was to
install and support an experimental Prolog implementation? I only had to
learn enough Prolog to know that it was working.
I moved to Toronto in 2001 for a Smalltalk position, where I remained
for a few years. Previously, I had used Smalltalk semi-covertly for
several months at a federal job.
-- Aaron
All right -- you make some great points. The last time I looked at Smalltalk was on a Mac in the 80s; a lot has evolved since then. I like the Ruby quote -- I was hoping this system was somehow available in Ruby, but maybe it's early days still.
I'm learning Ruby now, so maybe this will be a side project to explore how parsers work alongside it.
Seems like a steep learning curve, but I agree everything does until you actually do it.