Script for finding words of any size that do NOT contain vowels with acute diacritic marks?

nwaits

Oct 17, 2012, 10:31:42 AM

to

I'm very impressed with Python's wordlist script for plain text. Is there a script for finding words that do NOT have certain diacritic marks, like acute or grave accents (UTF-8), over the vowels?
Thank you.

Dave Angel

Oct 17, 2012, 11:00:11 AM

to nwaits, pytho...@python.org

On 10/17/2012 10:31 AM, nwaits wrote:
> I'm very impressed with Python's wordlist script for plain text. Is there a script for finding words that do NOT have certain diacritic marks, like acute or grave accents (UTF-8), over the vowels?
> Thank you.

If you can construct a list of "illegal" characters, then you can simply
check each character of the word against the list, and if none of the
characters is in the list, the word is a winner.

If that's not fast enough, you can build a translation table from the
list of illegal characters, and use translate on each word. Then it
becomes a question of checking if the translated word is all zeroes.
More setup time, but much faster looping for each word.
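Both ideas can be sketched roughly as follows. This is only an illustration, assuming Python 3 strings and precomposed (NFC) input; the names ILLEGAL, DELETE_ILLEGAL, is_clean and is_clean_translate are made up for the example, and the translate variant deletes the illegal characters and compares the result with the original rather than checking for "all zeroes".

```python
import unicodedata

# Hypothetical "illegal" set: vowels carrying an acute (U+0301) or grave
# (U+0300) accent, built by composing base vowel + combining mark via NFC.
ILLEGAL = {
    unicodedata.normalize("NFC", vowel + mark)
    for vowel in "aeiouyAEIOUY"
    for mark in "\u0301\u0300"
}

def is_clean(word):
    """Return True if no character of word is in ILLEGAL."""
    return not any(ch in ILLEGAL for ch in word)

# Translation-table variant: map each illegal character to None, which makes
# str.translate delete it. A clean word comes back unchanged.
DELETE_ILLEGAL = {ord(ch): None for ch in ILLEGAL}

def is_clean_translate(word):
    """Return True if translating word removes nothing."""
    return word.translate(DELETE_ILLEGAL) == word

words = ["éléphant", "elephant", "voilà", "naïve"]
clean = [w for w in words if is_clean(w)]
```

Note that a word such as "naïve" counts as clean here, because the diaeresis is not in the illegal set.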

--

DaveA

wxjm...@gmail.com

Oct 17, 2012, 11:32:52 AM

to nwaits, pytho...@python.org, d...@davea.name

Lazy way.
Py3.2

>>> import unicodedata
>>> def HasDiacritics(w):
...     w_decomposed = unicodedata.normalize('NFKD', w)
...     return 'no' if len(w) == len(w_decomposed) else 'yes'
...
>>> HasDiacritics('éléphant')
'yes'
>>> HasDiacritics('elephant')
'no'
>>> HasDiacritics('\N{LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON}')
'yes'
>>> HasDiacritics('U')
'no'

Should be OK for the Combining Diacritical Marks Unicode block
(common diacritics)
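The length comparison reports any diacritic, and with NFKD it also reports compatibility characters such as the "ﬁ" ligature, which decomposes into two letters. For the original question (acute or grave accents only), a narrower check might look like the sketch below. The function name has_acute_or_grave is an assumption for illustration, and it looks at every letter, not only vowels.

```python
import unicodedata

ACUTE = "\N{COMBINING ACUTE ACCENT}"  # U+0301
GRAVE = "\N{COMBINING GRAVE ACCENT}"  # U+0300

def has_acute_or_grave(word):
    """Return True if word contains an acute or grave accent on any letter."""
    # NFD splits precomposed letters into base letter + combining marks,
    # so the accents can be looked up directly.
    decomposed = unicodedata.normalize("NFD", word)
    return ACUTE in decomposed or GRAVE in decomposed

words = ["éléphant", "elephant", "voilà", "naïve"]
without_accents = [w for w in words if not has_acute_or_grave(w)]
```

Because the word is decomposed first, this also works when the input is already in decomposed form, a case where the length comparison reports no diacritics.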

jmf

Ian Kelly

Oct 17, 2012, 1:07:11 PM

to Python

On Wed, Oct 17, 2012 at 9:32 AM, <wxjm...@gmail.com> wrote:
> >>> import unicodedata
> >>> def HasDiacritics(w):
> ...     w_decomposed = unicodedata.normalize('NFKD', w)
> ...     return 'no' if len(w) == len(w_decomposed) else 'yes'
> ...
> >>> HasDiacritics('éléphant')
> 'yes'
> >>> HasDiacritics('elephant')
> 'no'
> >>> HasDiacritics('\N{LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON}')
> 'yes'
> >>> HasDiacritics('U')
> 'no'

Is there something wrong with True and False that you had to replace
them with strings?

"return len(w) != len(w_decomposed)" is all you need.
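Written out in full, the boolean version could look like this. It is a sketch only; the snake_case name has_diacritics and the sample word list are assumptions for the example. Returning a bool lets the function act directly as a filter predicate, which is what the original question needs.

```python
import unicodedata

def has_diacritics(word):
    """Return True if NFKD decomposition makes word longer."""
    return len(word) != len(unicodedata.normalize("NFKD", word))

# Keep only the words that do NOT carry diacritics.
words = ["éléphant", "elephant", "Ü", "U"]
plain = [w for w in words if not has_diacritics(w)]
```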

David Robinow

Oct 17, 2012, 1:16:43 PM

to Python

On Wed, Oct 17, 2012 at 1:07 PM, Ian Kelly <ian.g...@gmail.com> wrote:
> "return len(w) != len(w_decomposed)" is all you need.

Thanks for helping, but I already knew that.

wxjm...@gmail.com

Oct 17, 2012, 2:17:16 PM

to Python

Not at all, I knew this. In this case I decided to program like
this.

Do you get it? Yes/No or True/False

jmf

Chris Angelico

Oct 17, 2012, 2:22:29 PM

to pytho...@python.org

On Thu, Oct 18, 2012 at 5:17 AM, <wxjm...@gmail.com> wrote:
> Not at all, I knew this. In this case I decided to program like
> this.
>
> Do you get it? Yes/No or True/False

Yes, but why? When you're returning a boolean concept, why not return a
boolean value? You don't even use a pair of values where one
compares as true and the other compares as false (for instance,
you could write the function so that it returns just the
diacritic-containing characters, meaning it'll return "" if there
aren't any). To what benefit?
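A function along those lines might be sketched like this. The name diacritic_chars is hypothetical, and the approach (decomposing each character with NFD and testing for combining marks) is one possible way to do it, not something taken from the earlier post.

```python
import unicodedata

def diacritic_chars(word):
    """Return the characters of word that carry a combining mark."""
    return "".join(
        ch for ch in word
        # A character counts if its NFD decomposition contains any code point
        # with a non-zero canonical combining class.
        if any(unicodedata.combining(c) for c in unicodedata.normalize("NFD", ch))
    )

# The empty string is falsy, so the result works directly in an if test.
if not diacritic_chars("elephant"):
    print("no diacritics")
```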

Puzzled.

ChrisA

Ian Kelly

Oct 17, 2012, 2:27:12 PM

to Python

On Wed, Oct 17, 2012 at 12:17 PM, <wxjm...@gmail.com> wrote:
> Not at all, I knew this. In this case I decided to program like
> this.
>
> Do you get it? Yes/No or True/False

It's just bad style, because both 'yes' and 'no' evaluate true.

if HasDiacritics('éléphant'):
    print('Correct!')

if HasDiacritics('elephant'):
    print('Error!')

Prints:

Correct!
Error!

You could replace the test with "if HasDiacritics('elephant') ==
'yes':", but why force the caller to write that out when the former
test is more natural and less prone to error (e.g. typoing 'yes')?

wxjm...@gmail.com

Oct 17, 2012, 2:33:30 PM

to Python

I *know* all this. In my previous message, the goal was to emphasize the
usage of *unicodedata.normalize()*.

jmf

Steven D'Aprano

Oct 17, 2012, 7:18:03 PM

to

David, Ian was directly responding to wxjm...@gmail.com, whose
suggestion included an entirely unnecessary conversion from a bool flag
to the strings 'yes' and 'no'. That can be seen in the part of Ian's post
that you deleted.

Regardless of whether *you personally* already knew that jmf's function
was unidiomatic and a poor design, you weren't directly the target of the
comment. I'm glad you already knew what Ian said, but you're not the only
person reading this thread.

--
Steven