I'd like a script that can tell, fairly accurately, whether a line of text is a sentence fragment (an incomplete excerpt of a natural-language sentence) or a complete and correct non-sentence, for example a title or chapter heading.
I suspect this is too subtle to accomplish in a purely rule-based manner. Humans draw on a lot of context to tell that something like "and then they went off to" is part of a sentence, whereas "The beginning of the story" is a title. Rules based on a syntactic parse of the line might be possible, but they could be a lot of work for little reward if they turn out not to be effective. I'm not sure.
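To make the rule-based idea concrete, here is a rough sketch of the kind of surface heuristics I have in mind. It does no syntactic parsing at all; the word lists, the scoring, and the name `classify_line` are my own assumptions for illustration, and I'd expect it to break on plenty of real input.

```python
import re

# Hypothetical word lists, chosen for illustration only; far from exhaustive.
LEADING_CONNECTIVES = {"and", "but", "or", "so", "then", "because", "which", "that"}
TRAILING_FUNCTION_WORDS = {
    "to", "of", "the", "a", "an", "and", "or", "but",
    "with", "for", "in", "on", "at", "by", "from",
}


def classify_line(line):
    """Guess 'fragment' or 'title' from surface cues only (no parsing)."""
    words = re.findall(r"[A-Za-z']+", line)
    if not words:
        return "unknown"

    score = 0
    # A lowercase first letter suggests the line starts mid-sentence.
    if words[0][0].islower():
        score += 1
    # Opening with a connective suggests a continuation of earlier text.
    if words[0].lower() in LEADING_CONNECTIVES:
        score += 1
    # Ending on a preposition, article, or conjunction suggests a cut-off.
    if words[-1].lower() in TRAILING_FUNCTION_WORDS:
        score += 2
    # A trailing comma or semicolon also suggests the sentence continues.
    if line.rstrip().endswith((",", ";")):
        score += 1

    return "fragment" if score >= 1 else "title"


if __name__ == "__main__":
    for text in ["and then they went off to", "The beginning of the story"]:
        print(text, "->", classify_line(text))
```

The obvious weakness is that a heading written in lowercase, or a fragment that happens to start with a capital letter and end on a noun, would be misclassified, which is part of why I doubt that rules alone are enough.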
I'm thinking this would be a good use case for a deep learning algorithm instead, though I've never set up my own learning algorithm. As far as I know, there are also unsupervised algorithms that don't have to be trained on labeled data; they just group entities by similarity somehow.
Is this feasible? Does anyone think a machine learning algorithm could distinguish sentence fragments from titles? Would it need to be trained on data or not? Does NLTK perhaps already ship pre-trained models for this kind of text type and structure identification?
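For reference, this is roughly what I imagine the supervised version would look like: a tiny naive Bayes classifier over three yes/no features, written with the standard library only. The features, the function names (`features`, `train`, `predict`), and above all the eight training lines are made up by me for illustration; a real attempt would need far more labeled data and probably better features.

```python
import math
from collections import defaultdict

# Hypothetical function-word list, for illustration only.
FUNCTION_WORDS = {
    "to", "of", "the", "a", "an", "and", "or", "but",
    "with", "for", "in", "on", "at", "by", "from",
}

# Made-up toy training data; NOT a real corpus.
TRAINING_EXAMPLES = [
    ("and then they went off to", "fragment"),
    ("which was the reason for", "fragment"),
    ("but nobody had seen the", "fragment"),
    ("walked slowly towards the old", "fragment"),
    ("The Beginning of the Story", "title"),
    ("Chapter One", "title"),
    ("A Brief History of Time", "title"),
    ("Introduction", "title"),
]


def features(line):
    """Three boolean surface features of a line."""
    words = line.split()
    if not words:
        return (False, False, False)
    capitalised = sum(w[0].isupper() for w in words)
    return (
        words[0][0].islower(),                                 # starts lowercase
        words[-1].lower().strip(".,;:!?") in FUNCTION_WORDS,   # ends on a function word
        capitalised / len(words) > 0.5,                        # mostly capitalised words
    )


def train(examples):
    """Count, per label, how often each feature is true."""
    label_counts = defaultdict(int)
    true_counts = defaultdict(lambda: [0, 0, 0])
    for line, label in examples:
        label_counts[label] += 1
        for i, value in enumerate(features(line)):
            true_counts[label][i] += int(value)
    return dict(label_counts), dict(true_counts)


def predict(model, line):
    """Return the label with the highest naive Bayes log-score."""
    label_counts, true_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        for i, value in enumerate(features(line)):
            # Laplace-smoothed estimate of P(feature is true | label).
            p_true = (true_counts[label][i] + 1) / (n + 2)
            score += math.log(p_true if value else 1 - p_true)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING_EXAMPLES)

if __name__ == "__main__":
    for text in ["because she wanted to", "Notes on the Second Edition"]:
        print(text, "->", predict(model, text))
```

Even in this toy form it needs labeled examples to learn from, which is what makes me wonder whether an unsupervised or pre-trained approach could avoid that step.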
Thanks very much.