A naive attempt at parsing English into just sentences


Dilapidus

Jul 9, 2021, 3:17:58 AM
to antlr-discussion

I figured I would see if I could get the sentences out of a document.  Preparations A through G have failed me so far...

// First attempt at English parsing
grammar English;

// A paragraph is a list of sentences
paragraph   :   SENTENCE+ (NL | EOF)  ;

// A sentence is anything that is not a period followed by a period
SENTENCE    :   ~('.')+ '.'   ;

NL  : '\r'?'\n'   |  '\r';

I'll worry about Dr. and Mrs.  later.  

The given input text :

The quick brown fox. Jumps over the lazy dog.

sir.

Gets me very close. 


[Attached image: Untitled.jpg]
But the two newlines show up in the last sentence. I don't seem to be able to skip the newlines either, though I think that's what I want to do.

Any thoughts?

R

Dilapidus

Jul 9, 2021, 12:33:25 PM
to antlr-discussion
To be clear, I imagine I could very easily trim the sentences in Java. I just want to understand grammars better.

rtm...@googlemail.com

Jul 11, 2021, 11:41:46 AM
to antlr-discussion
Message has been deleted

Mike Cargal

Jul 12, 2021, 11:06:02 AM
to antlr-di...@googlegroups.com
You’re not matching the NL rule because your SENTENCE rule will consume \n and \r characters (and produce a longer token than the NL rule).

You could fix that part by having SENTENCE be “not period or \n or \r” followed by a period.  But, then, of course, you’re going to include the \n, \r as part of the content of your sentence.

You could create a SENTENCE_PART token that is everything that’s not a period, newline, or carriage return, and then set up a “sentence” parser rule like “sentence: SENTENCE_PART+ '.';”.
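
A minimal sketch of that second suggestion, assuming the `paragraph` and `NL` rules stay as originally posted. The character-set form `~[.\r\n]` is just one way to spell "not a period, newline, or carriage return", and the grammar has not been run against the sample input from the first message:

```antlr
// Sketch only: SENTENCE_PART and the sentence rule follow the suggestion above
grammar English;

// A paragraph is one or more sentences ended by a newline or end of input
paragraph     : sentence+ (NL | EOF) ;

// A sentence is one or more text fragments followed by a period
sentence      : SENTENCE_PART+ '.' ;

// Any run of characters that is not a period, line feed, or carriage return
SENTENCE_PART : ~[.\r\n]+ ;

// Line endings are their own token, so they no longer end up inside a sentence
NL            : '\r'? '\n' | '\r' ;
```

Because SENTENCE_PART can no longer swallow `\r` or `\n`, the lexer emits NL tokens for line breaks instead of folding them into the sentence text. Leading spaces after a period would still be part of the next SENTENCE_PART, so some trimming (in the grammar or in the host code) would remain.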

All of that said, you seem to be headed down the road of natural language parsing, and I think you’ll find the consensus here to be that ANTLR is the wrong tool for that.  And, that probably accounts for the silence in response to your questions.
--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/5802a373-5f82-4735-bb6b-924ae973031bn%40googlegroups.com.

Richard Ross

Jul 13, 2021, 8:24:20 PM
to antlr-di...@googlegroups.com
Not really, but thanks.

See, I know how to skip in general, but what I'd like to do is take a blob of text roughly formatted like:

Section name

paragraphs are multiple sentences which in turn are just anything terminated by a period.   This is a paragraph and it is terminated by a newline.   More here doesn't
matter (as I'm expecting soft wraps).  After this you would expect another section

Section name2

Different paragraph.

From what I've read this is fairly hard to do; it's just the problem that most interested me when I started playing with this.


R





Dilapidus

Aug 18, 2021, 6:54:54 PM
to antlr-discussion
Mike,

Thanks very much. This got back-burnered a bit and I'm just coming back to think through what you've written. I appreciate the input, though.

As far as NLP goes, I quite recognize that doing it with ANTLR (or any grammar-based approach) would be quite impossible. My goal is quite different: to define a special 'English lite' type of language for my own nefarious purposes. Once I've broken out sentences (and after accommodating something like headers), I would then use a separate grammar (perhaps) to parse the 'not really but similar to' sentences and act on them from there. Similar to a pseudo-code compiler, I suppose, with great attention paid to not being too nitpicky on the writer.

Again, thanks. 

Oscar Fernández Sierra

Aug 19, 2021, 11:13:47 AM
to antlr-discussion
Perhaps you can try https://www.nltk.org/ instead of ANTLR.

Oscar