Filtering before the tokenization

Florian Birée

Sep 4, 2009, 5:19:13 PM

to pyencha...@googlegroups.com

Hello PyEnchant people,

I'm writing a text editor in Python (named Bristoledit:
<http://dev.filyb.info/bristoledit/>).

I'm implementing a spell checker plugin using PyEnchant (which is
really easy to use, thanks to the developers!). I've quickly worked on
filters for LaTeX markup (the code:
<http://dev.filyb.info/bristoledit/browser/trunk/bristol/plugins/spellchecker.py?rev=%2C168>),
and I'm considering doing the same job for (x)html, but... it seems that
filters are applied only after a first tokenization, which splits words
on spaces.

But there are spaces in LaTeX comments! And in almost every (x)html
tag... So, is it possible to filter the text in another way to avoid
this problem?

Yours,
--
Florian Birée
e-mail : flo...@biree.name
Messagerie Instantanée Jabber/XMPP/Google Talk : floria...@jabber.fr
Site web : http://florian.biree.name/
Carnet web : http://filyb.info/

Ryan Kelly

Sep 5, 2009, 6:02:58 AM

to pyencha...@googlegroups.com

Hi Florian,

> I'm implementing a spell checker plugin using PyEnchant (which is
> really easy to use, thanks to the developers!). I've quickly worked on
> filters for LaTeX markup and I'm considering doing the same job for
> (x)html, but... it seems that filters are applied only after a first
> tokenization, which splits words on spaces.
>
> But there are spaces in LaTeX comments! And in almost every (x)html
> tag... So, is it possible to filter the text in another way to avoid
> this problem?

Agreed, this is currently much harder than it should be. I don't think
it can be done using a standard filter in the current setup.

Instead, you could construct a custom tokenizer function that is applied
before the standard one. Off the top of my head, it would look
something like this:

    from enchant.checker import SpellChecker
    from enchant.tokenize import tokenize, get_tokenizer, wrap_tokenizer

    class latex_tokenize(tokenize):
        """Tokenizer that splits LaTeX documents into checkable chunks.

        This tokenizer removes comments, commands etc. and yields chunks
        of checkable text from a LaTeX document.
        """
        def next(self):
            # ...logic here, producing checkable chunks of text...
            # ...use basic_tokenize as an example...
            raise NotImplementedError

    # Placeholders: substitute your own filter list and language tag
    filters = []
    language = "en_US"

    # Combine latex_tokenize with the standard tokenizer
    tknz = wrap_tokenizer(latex_tokenize, get_tokenizer("en", filters=filters))

    # Use the custom tokenizer in a SpellChecker instance
    chkr = SpellChecker(language, tokenize=tknz)

In this setup, latex_tokenize will receive the entire text as a single
string, rather than having to work with individual tokens.
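
To give a rough idea of the chunking logic, here is a minimal standalone
sketch that does not import enchant at all. It assumes the chunks should be
(text, offset) pairs, the same shape PyEnchant tokenizers yield. The names
latex_chunks and _SKIP are made up for illustration, and the pattern only
covers comments, simple commands and a few delimiters, so real LaTeX will
need more cases:

```python
import re

# Hypothetical pattern: an unescaped "%" comment running to end of line,
# a command name such as \section or \textbf, or a brace/bracket/dollar
# delimiter. This is a simplification, not a LaTeX parser.
_SKIP = re.compile(r"(?<!\\)%[^\n]*|\\[A-Za-z]+\*?|[{}\[\]$]")


def latex_chunks(text):
    """Yield (chunk, offset) pairs of checkable text from a LaTeX string."""
    pos = 0
    for match in _SKIP.finditer(text):
        chunk = text[pos:match.start()]
        if chunk.strip():
            # Offsets refer to the original string, so errors can be
            # mapped back to positions in the editor buffer.
            yield chunk, pos
        pos = match.end()
    tail = text[pos:]
    if tail.strip():
        yield tail, pos


if __name__ == "__main__":
    sample = "Hello \\textbf{wrold} % a comment here\nnext line"
    for chunk, offset in latex_chunks(sample):
        print(offset, repr(chunk))
```

Something along these lines could sit behind the next() method of
latex_tokenize, with the standard tokenizer then splitting each chunk into
words. Note that the comment is dropped as a whole, spaces included, which
is the part a per-word filter cannot do.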

Of course this is far from ideal going forward; I will have to think
about how to do this with a Filter subclass while maintaining backwards
compatibility.

Ryan

--
Ryan Kelly
http://www.rfk.id.au | This message is digitally signed. Please visit
ry...@rfk.id.au | http://www.rfk.id.au/ramblings/gpg/ for details