Ruby for Wrint

William Hertling

Feb 4, 2012, 4:36:49 PM
to pdxruby
I'm thinking about writing a tool to warn me for common novel writing
mistakes that wouldn't be caught by a grammar checker. A sort-of lint
for writers.

Some of the types of rules I want to write are:
- double words are errors
- warn for "was" as the second word of a sentence (weak writing)
- warn for use of "really" (weak writing)
- warn for use of "just" (crutch word)
- warn for commas > n in a sentence (overly complex sentence)
- warn for "the <word> the" (a typo I frequently make, in a word
construct that rarely exists)
- warn if frequency of occurrence of any non-frequently-occurring word
exceeds X in Y words
- identification of proper names used and frequency of occurrence (to
handle the case where I unintentionally change a character's name
partway through)
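
As a rough sketch of how the simpler, word-level rules in the list above might be expressed, each rule can be a name paired with a regular expression. The rule names, the `WORD_RULES` table and the `word_warnings` helper are all hypothetical, and the patterns are first guesses rather than a finished rule set:

```ruby
# Hypothetical rule table: warning name => pattern. Names and patterns are
# illustrative assumptions, not a finished rule set.
WORD_RULES = {
  "double word"    => /\b(\w+)\s+\1\b/i,
  "weak: really"   => /\breally\b/i,
  "crutch: just"   => /\bjust\b/i,
  "the <word> the" => /\bthe\s+\w+\s+the\b/i
}

# Returns [rule_name, matched_text] pairs for every rule hit in text.
def word_warnings(text)
  hits = []
  WORD_RULES.each do |name, regex|
    # Inside the block, Regexp.last_match is the full MatchData, so
    # capture groups in a rule do not change what gets reported.
    text.scan(regex) { hits << [name, Regexp.last_match[0]] }
  end
  hits
end

sample = "I just gave the the book to the girl the other day. Really."
word_warnings(sample).each { |name, hit| puts "#{name}: #{hit}" }
```

Keeping the rules in a table means a new check is one more line, at least for anything a single regex can catch. The sentence-level and frequency rules would need more than this.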

As a proof of concept, I just want to handle plain text input. Output
would also be text, giving the warning type and the text in which it
occurred.
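
A minimal sketch of that plain-text-in, text-out shape might look like the following. The report format (`line number: [warning type] line text`), the `CHECKS` table and the `report` method are assumptions made up for illustration, and only two sample checks are included:

```ruby
# Hypothetical report format: "<line number>: [<warning type>] <line text>".
# CHECKS holds two sample rules only; the names are assumptions.
CHECKS = {
  "double word" => /\b(\w+)\s+\1\b/i,
  "crutch word" => /\bjust\b/i
}

# Takes plain text, returns one report line per (line, rule) hit.
def report(text)
  out = []
  text.each_line.with_index(1) do |line, number|
    CHECKS.each do |type, regex|
      out << "#{number}: [#{type}] #{line.strip}" if line =~ regex
    end
  end
  out
end

puts report("It was the the best day.\nI just knew it.\nNothing wrong here.")
```

Because this works one line at a time, a doubled word that straddles a line break would be missed; checking whole paragraphs instead would be one way around that.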

Any suggestions for a starting direction? Parsing is probably one of
my weak points as a programmer, so if there are obvious ways to
approach this, it's probably not obvious to me.

Thanks,
Will

Brian Troutwine

Feb 4, 2012, 4:55:23 PM
to pdx...@googlegroups.com
On Sat, Feb 4, 2012 at 4:36 PM, William Hertling
<william....@gmail.com> wrote:
> I'm thinking about writing a tool to warn me for common novel writing
> mistakes that wouldn't be caught by a grammar checker. A sort-of lint
> for writers.

I tend to parse only machine languages, but what you're hoping
to do won't be too difficult once you get the input text into a tree
form. If you want to do this work in Ruby--and I imagine that
you do, given the mailing list--have a look at these:

http://www.complang.org/ragel/
http://treetop.rubyforge.org/

They're both equally pleasant to work in, I find. Be forewarned that
Treetop's website isn't the best place for documentation. You'll
need to dive into the source, browse example code on GitHub, or read
through blog tutorials; see this SO question for more details:

http://stackoverflow.com/questions/520818/learning-treetop

Ping the list when you've produced something? I'd enjoy a tool like this.

Happy hacking!
--
Brian L. Troutwine

Colin Curtin

Feb 4, 2012, 5:03:07 PM
to pdx...@googlegroups.com
Will,

I would check out the Ruby Linguistics framework:

There's also NLTK in Python (with a stellar book to get you into NLP), which I've had some success with. The Alchemy API, which is free for noncommercial use, is also pretty good for identifying parts of speech and such.

Some of the things you mentioned are pretty trivial (double words, "really", "just", the comma threshold, etc.), and they would be great to have alongside the hard stuff like proper nouns. For tracking name changes, I think you could run a diff against the previous version to make sure that if one occurrence has changed, all of them have, and turn it into a revision-tracking feature too.
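
To make the trivial-versus-hard split concrete, here is a rough sketch of the comma threshold plus a very crude stand-in for proper-noun detection. Everything in it is an assumption for illustration: the method names, the `max_commas` default, the naive sentence splitter, and especially the "capitalized word that doesn't start a sentence" heuristic, which is nowhere near real part-of-speech tagging:

```ruby
# Naive sentence split on ., ! or ? followed by whitespace. Abbreviations
# such as "Mr. Smith" will be split incorrectly; this is only a sketch.
def sentences(text)
  text.split(/(?<=[.!?])\s+/)
end

# Sentences with more than max_commas commas (max_commas is a made-up knob).
def comma_heavy(text, max_commas = 3)
  sentences(text).select { |s| s.count(",") > max_commas }
end

# Rough proper-name guess: capitalized words that do not start a sentence.
def name_counts(text)
  counts = Hash.new(0)
  sentences(text).each do |s|
    words = s.scan(/[[:alpha:]']+/)
    words.drop(1).each { |w| counts[w] += 1 if w =~ /\A[A-Z][a-z]+\z/ }
  end
  counts
end

text = "When Anna left, Tom, tired, hungry, and cold, went home. Later Ana called Tom."
puts comma_heavy(text)
name_counts(text).each { |name, n| puts "#{name}: #{n}" }
```

A name that shows up only once or twice next to a similar, frequent one (Anna vs. Ana here) is the kind of thing a frequency list makes easy to spot by eye, even without real NLP.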

Let me know if you need more help, I think this is a great project.
Colin



Jason LaPier

Feb 4, 2012, 5:09:27 PM
to pdx...@googlegroups.com
On Sat, Feb 4, 2012 at 1:36 PM, William Hertling <william....@gmail.com> wrote:
> I'm thinking about writing a tool to warn me for common novel writing
> mistakes that wouldn't be caught by a grammar checker. A sort-of lint
> for writers.


I started on something like this a few months ago, though I only wrote a few rules for myself so far: one is your double-word typo (I do that all the time). The other is warning about -ing and -ly words. In those cases, I like to see them highlighted so I can make sure I'm not using too many of them.

I'm not interested in parsing grammar - when I'm writing, I intentionally write grammatically incorrect sentences all the time, especially in dialog. From your list, I don't think you care too much about grammar either; it looks like you're just trying to catch some of the common problems that lead to weaker writing.

Let me know if you're interested in doing this as an open-source ruby script (or gem or whatever), I'd be happy to help. Here's a few hints to get started:

irb> dbl_words_regex = /(\b\w+)\s\1\b/i
=> /(\b\w+)\s\1\b/i
irb> str = "Give it the the kid."
=> "Give it the the kid."
irb> str.match(dbl_words_regex)
=> #<MatchData "the the" 1:"the">
irb> str.gsub(dbl_words_regex) { |m| "****#{m}****" }
=> "Give it ****the the**** kid."
irb> str2 = "The the kid went home."
=> "The the kid went home."
irb> str2.gsub(dbl_words_regex) { |m| "****#{m}****" }
=> "****The the**** kid went home."


irb> ings_and_lys_regex = /\w{2,}ly\b|\w{2,}ing\b|\bas\b/i
=> /\w{2,}ly\b|\w{2,}ing\b|\bas\b/i
irb> str3 = "He really wanted to know what happened so suddenly, turning around as he pondered."
=> "He really wanted to know what happened so suddenly, turning around as he pondered."
irb> str3.gsub(ings_and_lys_regex) { |m| "****#{m}****" }
=> "He ****really**** wanted to know what happened so ****suddenly****, ****turning**** around ****as**** he pondered."

In regex, \b is a word boundary, so that helps find the beginnings and ends of words whether they have whitespace or punctuation on either side of them. In most cases with writing, you'll want your regex to end with i after the last / (i = case insensitive).
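
To see what \b and the trailing i actually buy you, here is a small comparison you can paste into a file or irb. The sample sentences are made up for illustration:

```ruby
text = "He was as fast as Asimov."

loose  = text.scan(/as/i)      # also hits inside "was", "fast", "Asimov"
strict = text.scan(/\bas\b/i)  # only the standalone word "as"

puts "without \\b: #{loose.length} matches"
puts "with \\b:    #{strict.length} matches"

opening = "The cat saw the dog."
puts opening.scan(/\bthe\b/).length   # case-sensitive: misses "The"
puts opening.scan(/\bthe\b/i).length  # case-insensitive: finds both
```

Without the boundaries, a rule for "as" fires on every word that merely contains those letters, which would bury the real warnings in noise.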

- Jason L.

William Hertling

Feb 4, 2012, 7:51:46 PM
to pdxruby
On Feb 4, 2:09 pm, Jason LaPier <jason.lap...@gmail.com> wrote:
> Let me know if you're interested in doing this as an open-source ruby
> script (or gem or whatever), I'd be happy to help. Here's a few hints to
> get started:

It sounds like there is interest in both using and contributing, so
yes, I think it would be a great idea.

And yes, I'm not looking to replace a traditional grammar checker.
Just to look for some specific problems I would normally find in the
late proofreading stage.

I think it would be good to run the core code as an open source
project. At some point, if it became generally useful to non-technical
users, I would love to host it as a web application.

Reid Beels

Feb 8, 2012, 4:03:07 AM
to pdxruby
Not ruby, but possibly relevant:
http://onlinelabor.blogspot.com/2012/02/word-smell-detector-wsd-tool-for.html?spref=tw


William Hertling

Feb 9, 2012, 1:52:50 AM
to pdxruby
Brilliant, thanks Reid. He provides all the regular expressions he
uses, which is awesome.

On Feb 8, 1:03 am, Reid Beels <rei...@gmail.com> wrote:
> Not ruby, but possibly relevant: http://onlinelabor.blogspot.com/2012/02/word-smell-detector-wsd-tool-...