Identify natural language fragment

Julius Hamilton

Sep 13, 2021, 1:42:38 PM
to nltk-users

Hey,

I'd like a script that can tell, fairly accurately, whether a line of text is a sentence fragment (an incomplete excerpt of a natural language sentence) or a complete and correct non-sentence, such as a title or chapter heading.

I suspect this is too sophisticated to accomplish in a purely rule-based manner, because humans draw on a lot of context to decide whether something reads as part of a sentence, like "and then they went off to", or as a title, like "The beginning of the story". Rules based on a syntactic parse of the fragment might be possible, but they could be a lot of work for little reward if they turn out not to be effective. I'm not sure.
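To make the rule-based idea concrete, here is a rough sketch of the kind of surface heuristic I have in mind. It uses no parser at all; the word lists and the function name `looks_like_fragment` are my own assumptions for illustration, not anything from NLTK.

```python
import re

# Hypothetical word lists, chosen only for illustration (not from NLTK).
# Words that usually need something after them, so a line ending on one looks cut off.
DANGLING_END = {"to", "of", "the", "a", "an", "and", "or", "but",
                "with", "in", "on", "at", "for", "from", "by", "that"}
# Lowercase words that usually continue an earlier clause.
CONTINUATION_START = {"and", "or", "but", "then", "which", "because", "so"}


def looks_like_fragment(line: str) -> bool:
    """Guess whether a line is a cut-off piece of a sentence."""
    words = re.findall(r"[A-Za-z']+", line)
    if not words:
        return False
    # Rule 1: the last word is a function word that expects a complement.
    ends_dangling = words[-1].lower() in DANGLING_END
    # Rule 2: the first word is a lowercase continuation word
    # (a capitalized "And" does not match, by design).
    starts_mid_sentence = words[0] in CONTINUATION_START
    return ends_dangling or starts_mid_sentence


for text in ("and then they went off to", "The beginning of the story"):
    print(repr(text), looks_like_fragment(text))
```

This flags my first example and passes the second, but it would misjudge plenty of real input (for example, a title that deliberately ends in "of"), which is exactly why I doubt hand-written rules will be enough.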

I'm thinking that this would be a very good occasion for a deep learning algorithm instead. I've never set up my own learning algorithm. As far as I know, there are unsupervised algorithms that don't need labeled training data and can simply group items by similarity.

Is this feasible? Does anyone think a machine learning algorithm could distinguish sentence fragments from titles? Would it need to be trained on data or not? Does NLTK perhaps already ship pre-trained models for this kind of text type and structure identification?
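In case it helps frame the question, this is roughly what I imagine the trained version would look like: a tiny naive Bayes-style classifier in plain Python. The training lines are made-up toy examples, and the names `features`, `train`, and `classify` are hypothetical; a real model would need far more data and better features.

```python
import math
from collections import Counter

# Made-up toy examples, purely illustrative.
TRAIN = [
    ("and then they went off to", "fragment"),
    ("because it was raining and the", "fragment"),
    ("which is why she decided to", "fragment"),
    ("The beginning of the story", "title"),
    ("Chapter One", "title"),
    ("Notes on the Second Voyage", "title"),
]


def features(line):
    """Turn a line into a few crude surface features."""
    words = line.split()
    if not words:
        return ["empty"]
    case = "first_lower" if words[0][0].islower() else "first_upper"
    return [case, "first=" + words[0].lower(), "last=" + words[-1].lower()]


def train(examples):
    """Count labels and per-label feature occurrences."""
    label_counts, feat_counts, vocab = Counter(), {}, set()
    for text, label in examples:
        label_counts[label] += 1
        for f in features(text):
            feat_counts.setdefault(label, Counter())[f] += 1
            vocab.add(f)
    return label_counts, feat_counts, vocab


def classify(line, model):
    """Pick the label with the highest add-one-smoothed log score."""
    label_counts, feat_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        # +1 leaves room for features never seen in training.
        denom = sum(feat_counts[label].values()) + len(vocab) + 1
        for f in features(line):
            score += math.log((feat_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(classify("and so they walked over to", model))
print(classify("The End of the Road", model))
```

So my guess is that some labeled data is needed for this style of approach, and what I don't know is how much, or whether an unsupervised method could avoid it.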

Thanks very much,
Julius