Neural net - guessing literature kind

Borys Musielak

Nov 17, 2003, 5:02:41 PM

I have to write an algorithm which guesses the literature kind
having 100 words of text.
Of course the natural way to design it is by using a neural net
and back propagation.
The question is - how to design the net and what function to use
to make it work well. An even more important question is: what really
differentiates poems from fiction or crime stories in terms of words? (I am
thinking of using some recurring syntactic patterns, like verses with rhymes.)
It would be enough to tell poetry from prose. Better, though I don't know
how easy, would be guessing the kind of prose: love story, whodunit, fantasy, etc.
I don't expect solutions; I mostly need some ideas and maybe links to
similar projects.
I'd be very grateful for any advice.

--
best regards
Borys

Anthony Ventimiglia

Nov 17, 2003, 5:36:28 PM

Borys Musielak <mic...@poczta.onet.pl> writes:

> I have to write an algorithm which guesses the literature kind
> having 100 words of text.
> Of course the natural way to design it is by using a neural net
> and back propagation.

I have a thing for Bayesian filters, so for me the natural way
to do it would be to use a Bayesian filter and feed it some text to teach
it the different classifications.
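
A minimal sketch of the Bayesian idea, assuming a multinomial naive Bayes model with add-one smoothing over word counts. The class name NaiveBayes, the tokenizer, and the four toy training sentences are illustrative assumptions; a real filter would be trained on far more text per category.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training texts
        self.vocab = set()

    def train(self, text, label):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            counts = self.word_counts[label]
            total = sum(counts.values())
            # Work in log space so many small probabilities do not underflow.
            score = math.log(self.doc_counts[label] / total_docs)
            for w in words:
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[w] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (hypothetical, for illustration only).
nb = NaiveBayes()
nb.train("the moon sighs soft upon the silver sea", "poetry")
nb.train("o heart my heart the night is long and dear", "poetry")
nb.train("the detective opened the door and found the body", "crime")
nb.train("the inspector asked who had the knife that night", "crime")
print(nb.classify("the detective found the knife"))
```

With only 100 words to classify, smoothing matters: most words in the sample will be rare or unseen in at least one category.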

--
(incf *yankees-world-series-losses*)

Richard Heathfield

Nov 17, 2003, 9:46:47 PM

Borys Musielak wrote:

> I have to write an algorithm which guesses the literature kind
> having 100 words of text.

Compress the canonical works.
Add the 100 words to each canonical work in turn and re-compress. You should
get the best compression when the 100 words are of the same literary kind
(ideally, same author!) as the canonical work itself.

You could do this with existing software and batch files. Very little actual
programming involved.
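
A rough sketch of the compression trick, assuming Python's zlib module stands in for the "existing software". The function names and the two one-line "canonical works" are hypothetical placeholders. Note that DEFLATE only looks back 32 KB, so with a long canonical work only its tail influences the result; a compressor with a larger window may suit whole books better.

```python
import zlib


def compressed_size(text):
    """Size in bytes of the text after zlib compression at the highest level."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify_by_compression(sample, corpora):
    """Return the label whose corpus grows least when the sample is appended."""
    best_label, best_extra = None, None
    for label, corpus in corpora.items():
        extra = compressed_size(corpus + " " + sample) - compressed_size(corpus)
        if best_extra is None or extra < best_extra:
            best_label, best_extra = label, extra
    return best_label


# Hypothetical stand-ins for the canonical works of each kind.
corpora = {
    "poetry": "shall i compare thee to a summers day thou art more lovely and more temperate",
    "crime": "the inspector examined the body and found a knife beside the locked door",
}
print(classify_by_compression("the inspector found a knife", corpora))
```

The quantity compared is the extra compressed size the sample adds, not the total size, so canonical works of different lengths can still be compared.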

--
Richard Heathfield : bin...@eton.powernet.co.uk
"Usenet is a strange place." - Dennis M Ritchie, 29 July 1999.
C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
K&R answers, C books, etc: http://users.powernet.co.uk/eton

Borys Musielak

Nov 18, 2003, 4:56:54 PM

Richard Heathfield wrote:

> Borys Musielak wrote:
>
>
>>I have to write an algorithm which guesses the literature kind
>>having 100 words of text.
>
>
> Compress the canonical works.
> Add the 100 words to each canonical work in turn and re-compress. You should
> get the best compression when the 100 words are of the same literary kind
> (ideally, same author!) as the canonical work itself.
>
> You could do this with existing software and batch files. Very little actual
> programming involved.
>

Well, but the thing is I NEED TO use neural nets and I CANNOT use any
contributed work. This is a project I have to do all by myself (hopefully
with your help).
So if anyone has some ideas on how to design a net like this, I
would appreciate it very much.
Thanks for your answers anyway.

--
regards
Borys

osmium

Nov 18, 2003, 6:30:23 PM

Borys Musielak writes:

It seems to me that poetry and prose differ mostly in line length and
punctuation. I was taught that Shakespeare was poetry but I never quite
bought that. It seemed more like an assertion than anything else.
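
If line length and punctuation are the main cues, they can be measured directly. This is a small sketch, assuming the 100-word sample keeps its original line breaks; the function name and the two feature names are made up for illustration.

```python
import string


def layout_features(text):
    """Measure words per non-empty line and punctuation marks per word."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    words = text.split()
    punct = sum(1 for ch in text if ch in string.punctuation)
    return {
        "avg_words_per_line": len(words) / len(lines) if lines else 0.0,
        "punct_per_word": punct / len(words) if words else 0.0,
    }


print(layout_features("roses are red\nviolets are blue\n"))
print(layout_features("It was a dark night, and rain fell."))
```

If the sample has been re-wrapped or flattened to a single line, the first feature carries no information, which is one reason blank verse is hard to tell from prose this way.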

For a prose classification scheme, I would think of something a bit like
the Minnesota Multiphasic Personality Inventory (MMPI), which is probably
well documented. As I see it, it is a correlation problem. Discard the
prepositions and articles and focus mostly on the nouns. If the word
"tomahawk" is used, it is one indicator of a western.

Calum

Nov 19, 2003, 6:23:04 AM

You can't just shove text at a neural net. All a neural net does is
classify an n-dimensional space. What would be the "inputs"?

If I were you I'd drop the neural net idea. People who think neural
nets "work like the brain man, and therefore can classify anything"
don't know what they're talking about. You could certainly train it to
match an existing text, but the minute you give it a new text, I expect
it would fail miserably, since it would be little more than a memory.

There is a "cosine correlation" of two documents. To match your author
you could get a cosine correlation between your sample text and the
author's corpus. But this doesn't use neural nets.

Cosine correlation represents each document as a vector of word counts,
with one component for each of the n distinct words. Then it finds the
angle between the vectors of two documents. This is easily computed using
a dot product. The smaller the angle, the greater the similarity.
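
A short sketch of that computation, assuming raw word counts as the vector components (weighting schemes such as tf-idf are a common refinement but are not shown). The function names are illustrative.

```python
import math
import re
from collections import Counter


def word_vector(text):
    """Map each distinct word to the number of times it occurs."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


sample = word_vector("the knife lay beside the locked door")
corpus = word_vector("the inspector found a knife beside the door")
print(cosine_similarity(sample, corpus))
```

Because counts are never negative, the result lies between 0 (no shared words) and 1 (same word proportions). To classify, compute it against each author's or genre's corpus and pick the largest value.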

This would be a great project and would almost certainly work with cosine
correlation. The first step would be to cluster your existing corpus
using cosine correlation, to check that documents cluster according to
author. But please drop neural nets - it's not going to work.

osmium

Nov 19, 2003, 11:39:26 AM

Calum writes:

> You can't just shove text at a neural net. All a neural net does is
> classify an n-dimensional space. What would be the "inputs"?
>
> If I were you I'd drop the neural net idea. People who think neural
> nets "work like the brain man, and therefore can classify anything"
> don't know what they're talking about. You could certainly train it to
> match an existing text, but the minute you give it a new text, I expect
> it would fail miserably, since it would be little more than a memory.

When I saw I NEED TO I took it to be a clue that someone told him to use
neural nets. I visualized that the program would be given several (30 or
so?) training texts along with a human's classification of those texts.
After all, that's the way a child trains his neurons, isn't it? After
absorbing the training, the program could improve its powers of
discrimination if some human would comment on the results, thus providing
feedback. A big problem I foresaw but didn't propose a solution for is
finding the root word, "stemming" it if you will. To address that I
would look into how spell checkers attack this problem. ISTR that K or R
addressed this problem somewhere; I think there is a succinct explanation of
the process. If I ran out of time, that capability would be relegated to a
promise of wonderful things to come in a future release.

All in all, it looked like a pretty ambitious assignment to me.

BTW, I would hope that James Joyce's _Ulysses_ wasn't one of the training
samples.

Calum

Nov 19, 2003, 10:20:45 AM

Porter's stemmer is very effective, and freely available. There are
also lexical databases like Wordnet that will give you a more precise
stemming.

http://www.tartarus.org/~martin/PorterStemmer/
http://www.cogsci.princeton.edu/~wn/
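
To show what stemming buys you, here is a deliberately crude suffix stripper. It is NOT the Porter algorithm (which applies several ordered rule phases with conditions on the remaining stem); the suffix list and the minimum stem length of three letters are arbitrary assumptions for illustration.

```python
# Longest suffixes first, so "ingly" is tried before "ing" or "ly".
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")


def crude_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("walking", "walked", "walks", "quickly", "is"):
    print(w, "->", crude_stem(w))
```

Even this much maps "walking", "walked" and "walks" onto one feature. It also mangles plenty of words, which is why a tested stemmer such as Porter's is the better choice when outside code is allowed.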

I think the problem is that a neural network would not know how to
structure its input. If you just put a book in front of a child, it
will not learn to read. Similarly, an artificial neural net could hardly
build up a lexicon or grammar; this is way beyond the capabilities of an
artificial neural network. So what exactly would it discriminate against???

> All in all, it looked like a pretty ambitious assignment to me.

It sounds like a pretty duff assignment. To be told to do this without
any guidance seems ridiculous.

On the other hand, some pre-processed metrics could be classified. I
mean, suppose there was a text processing algorithm that could build up
a "signature" of an author. Perhaps ten numbers or so. These could
then be classified using a neural network. But I've never heard of that
being done before.
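
A sketch of that pipeline under loud assumptions: the "signature" here is four hand-picked numbers (not ten), scaled by arbitrary constants to roughly 0..1, and the classifier is a tiny one-hidden-layer network trained by plain backpropagation. Every name, feature, and constant is hypothetical, and nothing here shows that these features actually separate genres.

```python
import math
import random
import re


def signature(text):
    """Reduce a text to a few numbers, each scaled to roughly 0..1."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = max(len(words), 1)
    return [
        min(sum(len(w) for w in words) / n / 10.0, 1.0),  # average word length
        min(n / max(len(sentences), 1) / 40.0, 1.0),      # words per sentence
        min(n / max(len(lines), 1) / 40.0, 1.0),          # words per line
        len({w.lower() for w in words}) / n,              # distinct-word ratio
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyNet:
    """One hidden layer, one sigmoid output, trained by backpropagation."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _forward(self, x):
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        out = sigmoid(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)
        return hidden, out

    def predict(self, x):
        return self._forward(x)[1]

    def train(self, samples, epochs=2000, rate=0.5):
        for _ in range(epochs):
            for x, target in samples:
                hidden, out = self._forward(x)
                # Cross-entropy loss with a sigmoid output: error = out - target.
                d_out = out - target
                for j, h in enumerate(hidden):
                    # Propagate the error back through hidden unit j.
                    d_hidden = d_out * self.w2[j] * h * (1.0 - h)
                    self.w2[j] -= rate * d_out * h
                    self.b1[j] -= rate * d_hidden
                    for i, xi in enumerate(x):
                        self.w1[j][i] -= rate * d_hidden * xi
                self.b2 -= rate * d_out


# Toy usage: 1.0 means "poetry", 0.0 means "prose" (labels are assumptions).
poem = "Roses are red\nViolets are blue\nSugar is sweet\nAnd so are you"
prose = ("The inspector examined the room carefully, noting every detail "
         "of the overturned furniture before he spoke to the witness.")
net = TinyNet(n_in=4, n_hidden=3)
net.train([(signature(poem), 1.0), (signature(prose), 0.0)])
print(net.predict(signature(poem)), net.predict(signature(prose)))
```

Two training texts only show the mechanics. The network will happily memorize them, so any real claim of accuracy needs many labelled texts and a test on texts the net has never seen.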

Doing something like "average word length" could perhaps discriminate
between a tabloid and a broadsheet. But I don't think this would work
reliably in general. Cosine correlation is one method that does work.
I'd rather see a project that works than one that is k001 because it
uses a neural net.

Will Dwinnell

Nov 29, 2003, 12:12:52 PM

Borys Musielak <mic...@poczta.onet.pl> wrote:
"I have to write an algorithm which guesses the literature kind having
100 words of text."

Does that mean the first 100 words, specifically? Also, you'll need
to be specific about what kinds (literature, technical, poetry, news
copy, etc.) and how many kinds (2, 3, 12?) of literature you want to
distinguish.

Borys Musielak <mic...@poczta.onet.pl> continues:

"Of course the natural way to design it is by using a neural net and
back propagation."

I don't see why; there are many classification systems, both neural
and non-neural. There's nothing wrong with this choice, but I don't
understand why it is "the natural way to design it".

Borys Musielak <mic...@poczta.onet.pl> continues:

"The question is - how to design the net and what function to use to
make it work well. An even more important question is: what really
differentiates poems from fiction or crime stories in terms of words? (I am
thinking of using some recurring syntactic patterns, like verses with rhymes.)
It would be enough to tell poetry from prose. Better, though I don't know
how easy, would be guessing the kind of prose: love story, whodunit, fantasy,
etc. I don't expect solutions; I mostly need some ideas and maybe
links to similar projects.
I'd be very grateful for any advice."

Filling in the above details would be a good start. All such
projects, though, share the common element of needing appropriate
features by which to distinguish classes. Designing these will need
to proceed from the aforementioned details.

-Will Dwinnell, MBA
http://will.dwinnell.com

Will Dwinnell

Nov 29, 2003, 12:18:14 PM

Calum <calum...@ntlworld.com> wrote

"You can't just shove text at a neural net. All a neural net does is
classify an n-dimensional space. What would be the "inputs"?

If I were you I'd drop the neural net idea. People who think neural
nets "work like the brain man, and therefore can classify anything"
don't know what they're talking about. You could certainly train it to
match an existing text, but the minute you give it a new text, I expect
it would fail miserably, since it would be little more than a memory."

This really depends on how the data is prepared. All data presented
to any machine learning system (neural network or not) requires
preparation. The idea is to provide a format for the data which is
conducive to solving the problem. I'm not claiming that I know for
certain that this is even possible in this case (I have too little
information for that), but only a poor analyst would accept a model so
geared to the training data as you describe.
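
One way to catch a model that has merely memorized its training texts is to hold some labelled texts back and score only on those. A minimal sketch; the function name, the 25% split, and the train_fn interface (takes training pairs, returns a text -> label function) are assumptions made for illustration.

```python
import random


def holdout_accuracy(samples, train_fn, test_fraction=0.25, seed=0):
    """Train on part of the (text, label) pairs and score on the unseen rest."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    classify = train_fn(train)
    correct = sum(1 for text, label in test if classify(text) == label)
    return correct / len(test)


def always_prose(train):
    """Trivial baseline (hypothetical): ignore the training data entirely."""
    return lambda text: "prose"


data = [("text one", "prose"), ("text two", "poetry"),
        ("text three", "prose"), ("text four", "poetry")]
print(holdout_accuracy(data, always_prose))
```

Any real classifier, neural or not, should beat a trivial baseline like this on the held-out texts before its results are taken seriously.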

-Will Dwinnell
http://will.dwinnell.com

Will Dwinnell

Nov 29, 2003, 3:03:10 PM

Calum <calum...@ntlworld.com> wrote:
"I think the problem is that a neural network would not know how to
structure its input."

Structuring the input is the job of the analyst who builds the neural
network, or whatever model is being used.

Calum <calum...@ntlworld.com> continues:

"If you just put a book in front of a child, it will not learn to
read. Similarly, an artificial neural net could hardly build up a
lexicon or grammar; this is way beyond the capabilities of an
artificial neural network."

Since we're arguing by analogy, let's also note that 'If you just put
a disc in front of a computer, it will not read it'. Again, people
still have some niggling responsibilities in the information age.

Calum <calum...@ntlworld.com> continues:

"So what exactly would it discriminate against???"

Assuming that this is at all possible, such a model would require
derived features, as you describe below.

Calum <calum...@ntlworld.com> continues:

"On the other hand, some pre-processed metrics could be classified. I
mean, suppose there was a text processing algorithm that could build
up a "signature" of an author. Perhaps ten numbers or so. These
could then be classified using a neural network. But I've never heard
of that being done before."

Consider the following, which discriminate authors (of text and
computer programs), genders of authors, etc.:

http://www.cavi.univ-paris3.fr/lexicometrica/jadt/jadt2002/PDF-2002/baayen_vanhalteren_neijt_tweedie.pdf

http://webster.cs.uga.edu/~khaled/MLcourse/Abstract1.pdf

http://clue.eng.iastate.edu/~guan/course/CprE-592-YG-Fall-2002/paper/Olivier_DeVel.pdf

http://ftp.cerias.purdue.edu/pub/papers/ivan-krsul/krsul-spaf-authorship-analysis.pdf

"Doing something like "average word length" could perhaps discriminate
between a tabloid and a broadsheet. But I don't think this would work
reliably in general. Cosine correlation is one method that does work.
I'd rather see a project that works than one that is k001 because it
uses a neural net."

How well this works (keep in mind there is a spectrum of performance)
will depend on the particulars of the problem. So far, I haven't seen
enough information about the original poster's problem to even hazard
a guess as to whether a solution is feasible.

-Will Dwinnell
http://will.dwinnell.com