>
> WOW. A lot of interest in this area apparently. There have been a
> lot of people that joined the list in just 2 days.
Pent up demand! I posted on the Apache Lucene and Mahout mailing
lists plus blogged it, which I assume others did as well.
>
> Some thoughts:
> 1) I think I was the only one that recommended a paper every week.
> Everyone else seems to favor one every month. In retrospect, every
> week is more than I want to tackle as well. One month seems more
> manageable.
+1, although I could likely handle once every two weeks, too. I'm
guessing that, over time, newcomers will look in the archives and ask
questions about older discussions, too.
>
> 2) The question on how we select which paper to review hasn't really
> been addressed. I don't really have any ideas. I've never been part
> of a virtual reading group before. Does anyone have any experience
> with how others handle this?
I haven't participated in one either.
>
> 3) A number of people have mentioned that we need to start with the
> basics but how do we define "the basics"? Everyone is at a different
> level apparently.
Here are some classes:
http://en.wikipedia.org/wiki/User:Stevenbird/List_of_NLP_Courses
I'd suggest we need to walk before we can run, so, to me, the basics
start by looking at the lower levels of language like morphology and
syntax as well as Part of Speech (POS) and the idea of parsing out
sentences into a parse tree.
Perhaps, by a simple show of response, people can indicate their level
of comfort with those areas. That is, do people understand what those
things are in terms of language, never mind NLP? I'm not saying we need
to know what algorithms are for them, I'm saying just basic
definitions. I'd say most of it is covered by High School grammar,
but that was a long time ago, so it may be worth a refresher.
After that, is it safe to say people understand the basics of what
tokenization/segmentation and sentence detection (at least for English
and other whitespace delimited languages) are? Again, it isn't
important that one knows how to actually implement them, just be
familiar with the concept.
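For anyone who wants a concrete picture of those two concepts, here is
a minimal sketch in Python. It assumes English-like text where
whitespace separates words and where ., ! or ? ends a sentence; the
function names are made up for illustration, and real tools handle
many cases this does not (abbreviations such as "Dr.", for instance).

```python
import re

def split_sentences(text):
    """Naive sentence detection: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Naive tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "The dog barked. Did the cat run? It did!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
```

Here `sentences` holds three strings and `tokens` holds one list of
words and punctuation marks per sentence.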
If we can establish that foundation, then I'd suggest we start looking
at papers that actually discuss how to implement POS tagging, since
POS tagging is often one of the things you need to do the higher level
stuff. From there, I'd then suggest looking at parsing. Given those
two foundational pieces, we can then start looking into deeper things
like word sense disambiguation, info extraction, emotion detection,
etc. Basically, wherever people want to go.
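To make "POS tagging" concrete before we get to papers, here is a
rough Python sketch of a common baseline: tag each word with the tag
it was seen with most often in some hand-tagged text. The tiny
training set, the tag names, and the default tag for unknown words are
all invented for illustration; this is not the method of any
particular paper.

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Map each lowercased word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(words, model, default="NOUN"):
    """Tag each word from the model; unknown words get the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]

# Made-up training data, just enough to show the idea.
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("was", "VERB"), ("long", "ADJ")],
    [("dogs", "NOUN"), ("run", "VERB")],
    [("they", "PRON"), ("run", "VERB")],
]
model = train(toy_corpus)
tagged = tag(["The", "dogs", "run"], model)
```

Note that "run" appears as both a noun and a verb in the toy data; the
baseline simply picks the more frequent tag and ignores context, which
is the kind of weakness the papers on tagging try to address.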
Thus, I'd suggest the following start:
1. Part of Speech Tagging
2. Parsing
3. Named Entity Recognition - identifying people, places, organizations.
From there, we'll have more of a sense of the group dynamic and how
all of this plays out.
As for a model, I'd _suggest_: At the beginning of the next reading
period (which may not be the first of the month depending on when we
start), we ask for a volunteer (the "Editor of the Month" - EOTM) to
spend 2-3 days researching the topic and then come back with a few
suggested papers (2-5). Then, the group votes on the list and the
paper with the most votes is selected. Votes are open for 3 days (72
hours) so that we can account for time differences, travel, etc.
Then, readers have two weeks to read the paper (and feel free to ask
questions as you go). Once it is assumed everyone has read it, more
discussion can follow. I'd also suggest that the EOTM is
responsible for coming up with 5-10 questions to help seed the
discussion.
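The timeline above can be sketched in a few lines of Python. The
function name and the example start date are placeholders of mine, and
the defaults simply take the upper end of "2-3 days" of research, 3
days of voting, and two weeks of reading.

```python
from datetime import date, timedelta

def reading_schedule(start, research_days=3, vote_days=3, reading_days=14):
    """Return the key dates of one reading period, given its start date."""
    papers_posted = start + timedelta(days=research_days)
    vote_closes = papers_posted + timedelta(days=vote_days)
    discussion_starts = vote_closes + timedelta(days=reading_days)
    return {
        "papers_posted": papers_posted,
        "vote_closes": vote_closes,
        "discussion_starts": discussion_starts,
    }

# Placeholder start date, chosen only to make the arithmetic easy to follow.
schedule = reading_schedule(date(2000, 1, 1))
```

With those defaults, a full cycle takes 20 days from the EOTM starting
research to the seeded discussion, which fits inside a month.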
As for the EOTM job, this is often as simple as going to Google
Scholar or CiteSeer or some other scholar search tool and plugging in
the topic and then finding the most cited papers and doing a little
pre-reading and a little verification to come up with reasonable
results, plus maybe looking at some online syllabi, etc. I'd say it
is probably 1-2 hours of time, likely less. For example, for POS
tagging, Google Scholar suggests
(http://scholar.google.com/scholar?q=part+of+speech+tagging&hl=en&btnG=Search)
the Brill paper, which is one of the papers I had in mind. The
other trick is that the EOTM needs to pick papers that are freely
available and not locked up in journals.
Just a suggestion, please add your own. Also, we need not feel like
we have to solve it all now. The group can evolve as the membership
changes/grows.
Cheers,
Grant
I have not been part of a reading group (virtual or otherwise) before,
but I wonder if the virtual format could actually take the supposed
benefits to the next level.
What if this was a reading _and writing_ group? And I mean writing code.
Say, we are looking at a new algorithm. Let's create an open source
project to implement that algorithm. So, as people are reading
different paper aspects they can put their new knowledge into code,
samples, new algorithms for GATE/NLTK/Mahout/etc. At the end those
bits can be contributed to the underlying project or kept separately.
And people at different levels of understanding can still contribute
at their level of understanding. Having a centrally referenced
repository could also simplify the discussion by just pointing a URL at
the relevant part.
This does require a slower pace of reading, but I think may have a
stronger effect long term. It also has some network-effect benefits.
And we could have the running code demonstrations on Google AppEngine.
That would support both Java and Python, so covers multiple good
libraries.
Just a thought!
Regards,
Alex.
Personal blog: http://blog.outerthoughts.com/
Research group: http://www.clt.mq.edu.au/Research/
- I think age is a very high price to pay for maturity (Tom Stoppard)
I found the following reading lists from CMU (Thanks to Tom Mitchell,
William Cohen, Scott Fahlman and Eric Nyberg) very helpful in the area of
active learning and bootstrap learning for NLP. These lists might not be as
basic as some people on the list would like, but we can still keep them for
future use.
Bootstrap Learning:
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-21/www/semisupervised.html
Active Learning:
http://www.cs.cmu.edu/~ReadTheWeb/activelearning/activelearningbib.html
Ozgur
>
> This sounds great to me!
>
> In the interests of maintaining momentum, I would recommend that you
> take first rotation as EOTM, Grant. Possibly targeting July 15th to
> have 2-5 papers from POS or Parsing for us to vote on. I would be
> willing to volunteer to take the second rotation targeting August 1st to
> have a second paper on POS or parsing. After about two rotations,
> there should be more feedback in this area about pace and format.
>
I accept and the 15th sounds reasonable.
> To summarize what you presented (and other posts)
>
> 1) Editor of the Month (EOTM) will spend 2-3 days selecting 2-5 papers
> for consideration
> 2) Everyone votes on papers for 3 days.
> 3) Everyone reads papers for 2 weeks.
> 4) At end of 2 weeks EOTM posts 5-10 seed questions.
> 5) Next EOTM starts process at #1 again.
>
> Some questions:
> - How do we pick EOTM? I suggest it be on a volunteer basis. We can
> start a discussion for people to volunteer with their topic. We could
> also have the current EOTM responsible for tallying the votes or
> selecting the next EOTM.
I think a volunteer basis is good. Like anything done in open source,
the group will only be viable if there are volunteers willing to
sustain it. If people don't volunteer, then it shows the group is not
viable and we can all move on.
> - How do we vote for papers? I suggest emails be sent directly to the
> EOTM to avoid confusion on the list and make it easy to tally them. I
> think we can rely on the honor of the EOTM for an honest tally.
I think the EOTM should just start a thread like:
Subject: [VOTE] Select paper on Part of Speech
Content:
Please place a [x] by the paper you would like to read:
[] POS Tagging using Magic
Abstract: .....
EOTM Comments: ....
[] POS Tagging using the Dark Arts
Abstract: .....
EOTM Comments: ....
[] POS Tagging using chemical reactions
Abstract: .....
EOTM Comments: ....
At the end of three days, the EOTM calls the vote.
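Calling the vote from such a thread could be done by hand, or with a
small script along these lines. This is only a sketch: it assumes each
reply body keeps the ballot lines intact (possibly behind ">" quote
marks) and marks a choice with [x] or [X]. The function name is made
up, and nothing here enforces one vote per person.

```python
import re
from collections import Counter

# A ballot line: optional quote markers, "[x]" or "[]", then the paper title.
BALLOT_LINE = re.compile(r"^\s*(?:>\s*)*\[\s*([xX]?)\s*\]\s*(.+?)\s*$")

def tally(replies):
    """Count marked ballot lines across a list of reply bodies."""
    counts = Counter()
    for body in replies:
        for line in body.splitlines():
            match = BALLOT_LINE.match(line)
            if match and match.group(1):  # only lines marked with x
                counts[match.group(2)] += 1
    return counts

replies = [
    "[x] POS Tagging using Magic\n[] POS Tagging using the Dark Arts",
    "> [] POS Tagging using Magic\n> [X] POS Tagging using the Dark Arts",
    "[x] POS Tagging using Magic",
]
results = tally(replies)
winner = results.most_common(1)[0]
```

Because the replies are on the list, anyone can rerun the count and
check the EOTM's result.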
By doing it on the list, we don't have to worry about spam filters,
etc. and there is a public record. I personally don't want any
private email. To me, much like open source, a group like this is all
about things happening in the open.
> - How long should we discuss a paper? I suggest we start a discussion
> specifically for the paper, probably with the start read date as part
> of the subject (as well as topic and paper), and then discussion can
> continue as long as it takes. However we could start getting the next
> paper ready immediately (or maybe 2 weeks after if we do one every
> month?). This allows people that want to do biweekly papers to skip
> every other one. In fact, members could set their own pace to
> whatever they want, one every month, two months, every six weeks, etc.
I'd suggest we try the month approach for the first couple, but after
that let's just see where the group goes. We're all volunteers here
and no one is paying any money, so we should feel free to refactor as
we see fit.
I like the idea, as it is often useful to put these ideas into
practice to really understand them, but I think it can be done as a
background task for those who are interested. Some implementations
may take months to complete in Open Source, which would likely remove
any momentum the group has from a reading/discussion standpoint.
Plus, it will likely be hard to decide which project to contribute
to. I'm partial to Mahout since I am a co-founder and others are
likely partial to some of the other projects listed. Now, one thing
that is likely useful is to actually, during discussion, say things
like: "Try this out in GATE by doing X, Y, Z" or "Take a look at how
this is implemented in Mahout by looking here: ..."
Certainly, however, I'd personally extend a welcome to anyone that
wants to contribute to Mahout who has an itch to scratch when it comes
to Machine Learning, but that isn't why I'm here.
I also think that discussions on papers are likely to go beyond just
the reading/discussion period as people come and go from the project.
This, to me, is one of the real benefits of a group like this over a
live discussion in some meeting place (which has its own merits).
-Grant
I like Grant's version of doing it by referencing where appropriate.
With most code bases on the web, that's probably the best way. Other
(shorter) things can be done on individual blogs.
Regards,
Alex.