Hunspell flagging ‘wouldn’t’, ‘couldn’t’, ‘shouldn’t’, etc. as INCORRECT due to smart quotes

Michael Beijer

Oct 9, 2013, 3:02:39 PM
to cafetra...@googlegroups.com
Hunspell spell check doesn’t flag a word like: 

don't 

but it does flag the smart quote version as incorrect: 

don’t 

Is there a way to fix this?

Michael

Igor Kmitowski

Oct 12, 2013, 6:01:50 PM
to cafetra...@googlegroups.com
Hi Michael,

How about adding them to the user's spell checking dictionary?

Igor

Michael Beijer

Oct 12, 2013, 6:26:58 PM
to cafetra...@googlegroups.com
Nope, that doesn't work.

E.g., I just added 'couldn’t' to my user.dic file.

However, the word 'couldn’t' is still flagged. Or, more precisely: only 'couldn' is underlined in red. The '’t' part is ignored.

Ctrl+Shift+S lists 'couldn’t', but clicking on it in the list produces: 'couldn’t’t'.

That is, CT is treating words as if they end with the ’ character.

Hope this makes some sense.

Michael

Igor Kmitowski

Oct 13, 2013, 2:55:15 AM
to cafetra...@googlegroups.com
Hi Michael,

Adding them to Placeables should do the trick.

Igor
--
Igor Kmitowski
Translator and Java developer
CafeTran website: http://www.cafetran.com
CafeTran support: cafetran...@gmail.com

Michael Beijer

Oct 13, 2013, 6:20:54 AM
to cafetra...@googlegroups.com
Cool, I will try that and report back.

Michael

Sent from my iPad
> --
> You received this message because you are subscribed to the Google Groups "CafeTranslators" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to cafetranslato...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.

Michael Beijer

Oct 13, 2013, 10:48:59 AM
to cafetra...@googlegroups.com
Hi Igor,

Just tried, and adding them to the placeables list has no effect whatsoever.

I just added 

couldn’t
wouldn’t
didn’t
can’t
shouldn’t

to my placeables list, even restarted CT, and they are all still underlined by Hunspell as incorrect.

By the way, is adding a word to the Placeables list supposed to stop it from being flagged as incorrect by the Hunspell spell checker? Because if this is true, it is not working for me. 

Michael

Michael Beijer
Translator & Terminologist
(Dutch/Flemish into English)
46 Priory Street, Lewes, 
East Sussex BN7 1HJ, 
United Kingdom.
Mob. +44 (0)797 093 5608
michael@wordbook.nl
Skype/Twitter: michaelbeijer




Hans list

Oct 13, 2013, 11:28:25 AM
to cafetra...@googlegroups.com
Perhaps it‘’'s broken?

Hans list

Oct 13, 2013, 11:34:34 AM
to cafetra...@googlegroups.com
Would not? Could not? Should not!

I am Sam 

I am Sam 
Sam I am 

That Sam-I-am 
That Sam-I-am! 
I do not like 
that Sam-I-am 

Do you like 
green eggs and ham 

I do not like them, 
Sam-I-am. 
I do not like 
green eggs and ham. 

Would you like them 
Here or there? 

I would not like them 
here or there. 
I would not like them 
anywhere. 
I do not like 
green eggs and ham. 
I do not like them, 
Sam-I-am 

Would you like them 
in a house? 
Would you like them 
with a mouse? 

I do not like them 
in a house. 
I do not like them 
with a mouse. 
I do not like them 
here or there. 
I do not like them 
anywhere. 
I do not like green eggs and ham. 
I do not like them, Sam-I-am. 


Would you eat them 
in a box? 
Would you eat them 
with a fox? 

Not in a box. 
Not with a fox. 
Not in a house. 
Not with a mouse. 
I would not eat them here or there. 
I would not eat them anywhere. 
I would not eat green eggs and ham. 
I do not like them, Sam-I-am. 

Would you? Could you? 
in a car? 
Eat them! Eat them! 
Here they are. 

I would not , 
could not, 
in a car 

You may like them. 
You will see. 
You may like them 
in a tree? 
I would not, could not in a tree. 
Not in a car! You let me be. 

I do not like them in a box. 
I do not like them with a fox 
I do not like them in a house 
I do not like them with a mouse 
I do not like them here or there. 
I do not like them anywhere. 
I do not like green eggs and ham. 
I do not like them, Sam-I-am. 

A train! A train! 
A train! A train! 
Could you, would you 
on a train? 

Not on a train! Not in a tree! 
Not in a car! Sam! Let me be! 
I would not, could not, in a box. 
I could not, would not, with a fox. 
I will not eat them with a mouse 
I will not eat them in a house. 
I will not eat them here or there. 
I will not eat them anywhere. 
I do not like them, Sam-I-am. 


Say! 
In the dark? 
Here in the dark! 
Would you, could you, in the dark? 

I would not, could not, 
in the dark. 

Would you, could you, 
in the rain? 

I would not, could not, in the rain. 
Not in the dark. Not on a train, 
Not in a car, Not in a tree. 
I do not like them, Sam, you see. 
Not in a house. Not in a box. 
Not with a mouse. Not with a fox. 
I will not eat them here or there. 
I do not like them anywhere! 

You do not like 
green eggs and ham? 

I do not 
like them, 
Sam-I-am. 

Could you, would you, 
with a goat? 

I would not, 
could not. 
with a goat! 

Would you, could you, 
on a boat? 

I could not, would not, on a boat. 
I will not, will not, with a goat. 
I will not eat them in the rain. 
I will not eat them on a train. 
Not in the dark! Not in a tree! 
Not in a car! You let me be! 
I do not like them in a box. 
I do not like them with a fox. 
I will not eat them in a house. 
I do not like them with a mouse. 
I do not like them here or there. 
I do not like them ANYWHERE! 

I do not like 
green eggs 
and ham! 

I do not like them, 
Sam-I-am. 

You do not like them. 
SO you say. 
Try them! Try them! 
And you may. 
Try them and you may I say. 

Sam! 
If you will let me be, 
I will try them. 
You will see. 

Say! 
I like green eggs and ham! 
I do!! I like them, Sam-I-am! 
And I would eat them in a boat! 
And I would eat them with a goat... 
And I will eat them in the rain. 
And in the dark. And on a train. 
And in a car. And in a tree. 
They are so good so good you see! 

So I will eat them in a box. 
And I will eat them with a fox. 
And I will eat them in a house. 
And I will eat them with a mouse. 
And I will eat them here and there. 
Say! I will eat them ANYWHERE! 

I do so like 
green eggs and ham! 
Thank you! 
Thank you, 
Sam-I-am 

Michael Beijer

Oct 13, 2013, 11:50:23 AM
to cafetra...@googlegroups.com
Hi Igor,

OK, I think I have tracked down the problem, which is a Hunspell problem.

In order to get Hunspell to work correctly with smart/curly quotes, you need to use the so-called ICONV command.

ICONV = input character conversion 
OCONV = output character conversion

Have a look at what I found:


Chris Lott <[hidden email]> wrote:
 
> Does anyone know how I can get spell check
> within Vim to handle  "smart" quotes (i.e.
> typographically correct ones) properly?
>
> For instance, spell check in Vim doesn't flag a word like:
> don't
>
> but it does flag the smart quote version as incorrect:
> don’t


In Hunspell, it is possible to make several characters
equivalent with the ICONV command. For example,
the following Hunspell command in the *.aff file converts
the typographic quote U+2019 into the ASCII quote U+0027
prior to probing the dictionary, so that don’t and don't are both accepted:

ICONV ’ '

-------------------------------------------*

> On Wed, Dec 1, 2010 at 22:00, Nikolai Weibull <now <at> bitwi.se> wrote:
>> On Wed, Dec 1, 2010 at 21:12, Bram Moolenaar <Bram <at> moolenaar.net> wrote:
>>>
>>> Nikolai Weibull wrote:
>>>
>>>> Writing “Let’s begin …” marks the ‘s’ as a spelling
>>>> error.  Writing “Let's begin …” works fine.  Is this a bug,
>>>> or am I missing something?
>
>>> Right, only latin1 quotes are supported.
>
>> OK, so let’s fix that.  How do we fix that?

The hunspell doc is not very clear but I think this is what the
ICONV directive of Hunspell is for. Looking at this English
dictionary of OpenOffice 3.x at:


... the en_US.aff file contains:

ICONV 6
ICONV ’ '
ICONV ﬃ ffi
ICONV ﬄ ffl
ICONV ﬀ ff
ICONV ﬁ fi
ICONV ﬂ fl

OCONV 1
OCONV ' ’

My understanding is that ICONV converts the input
fancy quote U+2019 into a regular quote (among other conversions)
before probing the dictionary.  So "Let’s" and "Let's" are both
recognized as correct.
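
As a rough illustration of the behaviour described above (not Hunspell's actual implementation), the ICONV table can be thought of as a set of string replacements applied to each input word before the dictionary lookup. In the Python sketch below, `ICONV_TABLE`, `WORDS`, `apply_iconv` and `is_known` are names made up for the example, and the word list is a tiny stand-in for a real .dic file. Note that in a real .aff file the rules are preceded by a count line such as `ICONV 6`.

```python
# Toy model of Hunspell's ICONV step: rewrite the input word, then look it up.
# ICONV_TABLE, WORDS, apply_iconv and is_known are hypothetical names.

ICONV_TABLE = {
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK -> ASCII apostrophe
    "\ufb01": "fi",  # LATIN SMALL LIGATURE FI -> "fi"
}

# Stand-in for the .dic word list, which stores the ASCII-apostrophe forms.
WORDS = {"don't", "let's", "find"}


def apply_iconv(word: str) -> str:
    """Apply every input conversion rule to the word."""
    for source, target in ICONV_TABLE.items():
        word = word.replace(source, target)
    return word


def is_known(word: str) -> bool:
    """Look the word up after input conversion, ignoring case."""
    return apply_iconv(word).lower() in WORDS


print(is_known("don\u2019t"))  # curly apostrophe
print(is_known("don't"))       # straight apostrophe
print(is_known("\ufb01nd"))    # ligature form of "find"
```

Because the conversion happens before the lookup, the dictionary itself only needs to contain the straight-apostrophe spelling.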

-------------------------------------------*

So, it would appear that we need to use the ICONV command. Not sure how or where to do that though...

Michael 

Michael Beijer

Oct 13, 2013, 11:58:56 AM
to cafetra...@googlegroups.com
Zooming in on it.

I just found the following in my nl_NL.aff file (at C:\CafeTran Espresso\cafetran\resources\spellchecker):

# replace correct accented double vowels with unaccented ones for acceptance
ICONV 9
ICONV áá aa
ICONV éé ee
ICONV íé ie
ICONV óó oo
ICONV úú uu
ICONV óé oe
ICONV ’ '
ICONV ĳ ij
ICONV Ĳ IJ

-------------------------------*

So that seems to be how you use this command: place it at the beginning of the text file. I am not sure where to put it though, in en_GB.aff or en_GB.dic, or maybe in my own user.aff or user.dic. I will experiment and report back.
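
For what it is worth, the Hunspell documentation describes ICONV as an affix-file (.aff) option, so the .aff file is the likely place to look. A quick way to check whether a given affix file already maps U+2019 to the ASCII apostrophe could be sketched in Python as follows. The function name `has_curly_quote_rule` and the sample text are invented for this example, and a real file would have to be read with the encoding named in its SET line.

```python
# Hypothetical helper: does this .aff text contain the rule "ICONV ’ '"?

def has_curly_quote_rule(aff_text: str) -> bool:
    """Return True if an ICONV rule maps U+2019 to the ASCII apostrophe."""
    for line in aff_text.splitlines():
        if line.strip().startswith("#"):
            continue  # skip comment lines
        fields = line.split()
        # A rule line has three fields; the "ICONV <count>" header has two.
        if len(fields) == 3 and fields[0] == "ICONV":
            if fields[1] == "\u2019" and fields[2] == "'":
                return True
    return False


sample = """\
SET UTF-8
ICONV 2
ICONV \u2019 '
ICONV \u0133 ij
"""

print(has_curly_quote_rule(sample))
print(has_curly_quote_rule("SET UTF-8\nTRY esianrtolcdugmphbyfvkwz\n"))
```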

Michael

Michael Beijer

Oct 13, 2013, 2:25:46 PM
to cafetra...@googlegroups.com
Christ what a mess. 

I managed to get it to work, kind of. 

I downloaded a different .dic and .aff file, this time from the LibreOffice project. However, guess what, now I have a new problem. 

For some reason, the LibreOffice en-GB list also has all sorts of strange, incorrect words in it, like 'wouldn', 'couldn', etc. I am not sure why. The original en-GB .dic and .aff files I was using were 560 KB, whereas the LibreOffice one is 6.80 MB. That is, the LibreOffice lists have many more words in them, but a lot of them are wrong.

I can't remember where I downloaded the original en_GB.dic and en_GB.aff I am currently using in CT. I think either Igor or Hans (1 or 2) suggested a link, or I found it myself. In any case, I did a little test and the en-GB list I have is missing too many words. For example, it doesn't have the word 'disruptive'. 'Disruptive' is in the LibreOffice list (as well as tons of incorrect words;).

Hmm. So it is a matter of finding a .dic and .aff set that (a) has the most words in it, and (b) doesn't flag wouldn’t as incorrect (because it has ICONV in its .aff file).

All this messing around has taught me one thing: there are many different versions of the en-GB Hunspell dictionaries in circulation and it is a good idea to make sure you know which version you are using, as some of them are crap. I suppose the same applies to all the other languages too. It would probably be a good idea if each CT user using a different language would find the best version for their language and then we could maintain a collection of them somewhere. I will do the English one, as that is the only language I translate into. Hans1/Hans2: feel like looking into the Dutch one? And then there is French, German, Polish, Spanish, etc. etc. etc. ;)

Michael 

Igor Kmitowski

Oct 13, 2013, 2:48:54 PM
to cafetra...@googlegroups.com
Hi Michael,

I did some investigation too, and it turns out that CT (or rather a Java
word-breaking library that CT depends on) does not treat that
(unusual?) quote character as part of the word and separates it from the
word. So Hunspell never even sees the word as a whole. I will think of some
workaround or replace this part of the code.
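
The effect described here can be reproduced with any tokenizer that accepts only letters and digits inside a word. The Python sketch below is not CafeTran's code (which is Java and is not shown in this thread); it simply contrasts a naive word pattern with one that tolerates an apostrophe between letters. Both pattern names are made up for the example.

```python
import re

# Naive pattern: a word is a run of letters, digits or underscores.
NAIVE_WORD = re.compile(r"\w+")

# Tolerant pattern: also allow ' or U+2019 when it sits between word characters.
APOSTROPHE_AWARE_WORD = re.compile(r"\w+(?:['\u2019]\w+)*")

text = "I couldn\u2019t and wouldn't."

print(NAIVE_WORD.findall(text))
print(APOSTROPHE_AWARE_WORD.findall(text))
```

With the naive pattern the spell checker is handed `couldn` and `t` as separate tokens, which matches the underlining reported earlier in the thread.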

Cheers,
Igor

Igor Kmitowski

Oct 13, 2013, 2:58:54 PM
to cafetra...@googlegroups.com
Actually, there is a simple workaround. Just add the following words to
the user's spellchecking dictionary:

wouldn
couldn
shouldn
etc.

Igor

Michael Beijer

Oct 13, 2013, 3:02:14 PM
to cafetra...@googlegroups.com
Hi Igor,

Yeah, I think that’s what the people making the LibreOffice list did (http://extensions.libreoffice.org/extension-center/american-british-canadian-spelling-hyphen-thesaurus-dictionaries). 

However, then these words are considered to be correct, which is not such a good thing. Or am I missing something?

Michael

Igor Kmitowski

Oct 13, 2013, 3:08:35 PM
to cafetra...@googlegroups.com

> However, then these words are considered to be correct, which is not
> such a good thing. Or am I missing something?

Yes, as I said it is a workaround. What are the odds of making that kind
of spelling mistake? :)

Michael Beijer

Oct 13, 2013, 3:31:26 PM
to cafetra...@googlegroups.com
You might be right, but this is not really much of a long term solution, is it? 

I hope you manage to fix this in CT, so that we don’t have to resort to workarounds like this.

Michael

Michael Beijer

Nov 4, 2013, 12:20:35 PM
to cafetra...@googlegroups.com
The problem has been solved (by Roberto Savelli)!


Basically, the solution is this:

Do not use: U+2019 (RIGHT SINGLE QUOTATION MARK)  
Use: U+02BC (MODIFIER LETTER APOSTROPHE) 

...in any words like: wouldn’t, couldn’t, etc.
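
A plausible reason this works, offered as an assumption rather than something confirmed in this thread: Unicode classifies U+02BC as a letter (general category Lm), while U+2019 and U+0027 are punctuation, so word-breaking code that splits on punctuation keeps U+02BC inside the word. The Python standard library can display the categories and names of the three characters:

```python
import unicodedata

# Print code point, general category and official name for each apostrophe-like character.
for ch in ("'", "\u2019", "\u02bc"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# A word containing U+02BC consists of letters only; one containing U+2019 does not.
print("couldn\u02bct".isalpha())
print("couldn\u2019t".isalpha())
```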

See:



So now you can type correct (curly) quotes in your translations and not have Hunspell flag them as incorrect! 

I use this AutoHotkey script to type curly quotes in words like wouldn’t in CafeTran (and any other program on my PC): 

^+0::Send, {U+02BC} ; MODIFIER LETTER APOSTROPHE ( ʼ ) ; works with Hunspell

Michael

Hans List

Nov 4, 2013, 12:31:00 PM
to CafeTran Google Group
Please. Adapt. Your. Life. To. Computer. Beep.

Beep.

Resetting.

Michael Beijer

Nov 4, 2013, 12:32:22 PM
to cafetra...@googlegroups.com
Wow, turns out this also solves the problem of CafeTran glossaries not giving you any hits for words like this:

PGBʼs

Before importing your document into CafeTran, do a find and replace and change all: 

U+2019 (RIGHT SINGLE QUOTATION MARK) characters into: U+02BC (MODIFIER LETTER APOSTROPHE) characters

Note that the two characters may look exactly the same to the naked eye, depending on the font you use: 

U+2019 (RIGHT SINGLE QUOTATION MARK) =  ’ 
U+02BC (MODIFIER LETTER APOSTROPHE) = ʼ 
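
The find and replace could also be scripted. The Python sketch below is only an illustration with invented function names. It deliberately converts U+2019 only when there is a letter on both sides (couldn’t, PGB’s), so that closing single quotation marks are left alone; the price is that a trailing apostrophe, as in a plural possessive, is not converted. A reverse function is included for restoring U+2019 afterwards.

```python
import re

RIGHT_SINGLE_QUOTE = "\u2019"   # ’
MODIFIER_APOSTROPHE = "\u02bc"  # ʼ

# U+2019 with a letter on both sides, e.g. couldn’t or PGB’s.
_INNER_QUOTE = re.compile(r"(?<=[^\W\d_])\u2019(?=[^\W\d_])")


def to_modifier_apostrophe(text: str) -> str:
    """Replace word-internal U+2019 with U+02BC; leave closing quotes alone."""
    return _INNER_QUOTE.sub(MODIFIER_APOSTROPHE, text)


def to_right_single_quote(text: str) -> str:
    """Reverse the substitution by turning every U+02BC back into U+2019."""
    return text.replace(MODIFIER_APOSTROPHE, RIGHT_SINGLE_QUOTE)


sample = "The PGB\u2019s rules couldn\u2019t be \u2018simpler\u2019."
converted = to_modifier_apostrophe(sample)
print(converted)
print(to_right_single_quote(converted) == sample)
```

The round trip only restores the original text exactly if the document contained no U+02BC characters to begin with.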

Michael


Hans List

Nov 4, 2013, 1:00:54 PM
to CafeTran Google Group

So the problem started when people started to use the curly quote instead of an apostrophe :)

Well "people" is a little exaggerated here.

Michael Beijer

Nov 5, 2013, 11:16:43 AM
to cafetra...@googlegroups.com
Oh oh, looks like this won't work after all. I just read this on the memoQ mailing list:

Max B.
Today at 4:10 PM
On 05/11/2013 11:48, michael@... wrote:
> In conclusion, I think that Kilgray may want to correct the issue by 
> changing the curly version of the character that automatically 
> replaces the straight version to U+02BC from the current U+2019. 
> However, I am not aware if this may create backwards-compatibility 
> problems or other issues due to the fact that it's Unicode-only. I 
> have systematically replaced the characters in a series of projects 
> and I had no problems with the exported files in a limited series of 
> trials with Word and Studio documents.
I tried using U+02BC instead of U+2019.
In memoQ, I see a square with a question mark, probably because my font 
does not have this character.

The exported Word file looks OK, but there are problems with 
spellchecking software:
• The Word spellchecker puts curly waves under each word with U+02BC.

• The Antidote grammar checker treats U+2019 as if it were not there at all.

If I sent out such a file to a client, she would probably wonder what is 
happening.



Hans List

Nov 5, 2013, 11:45:48 AM
to CafeTran Google Group
First business principle: always avoid waking sleeping dogs.