chemistry character set

Mike Dewhirst

Feb 5, 2018, 6:56:00 PM
to Django users
Chemical names start with both upper and lower case as well as Greek
characters. Chemical names also exist in multiple non-western non-latin
languages.

To get lists of chemicals sorting more or less "correctly" I currently
slugify with allow_unicode=True.

This for example gets tert-Butyl... sorted nicely among names starting
with upper-case T.

Unfortunately the α-terpineol or beta this or  ε that all sink to the
end of the list instead of sorting into the A, B or Es.
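
The Greek-prefixed names sink because plain string comparison goes by Unicode code point, and α (U+03B1) comes well after z (U+007A). A minimal sketch of the effect; the sample names are made up for illustration and are not from the actual database:

```python
# Sample names are illustrative only.
names = ["tert-Butyl alcohol", "Toluene", "α-terpineol", "acetone", "Benzene"]

# Default sort compares Unicode code points: upper case first, then lower
# case, then Greek letters, which have higher code points than Latin ones.
by_code_point = sorted(names)

# Case-folding removes the upper/lower split but not the Greek problem.
by_casefold = sorted(names, key=str.casefold)

print(by_code_point)
print(by_casefold)
```

In both orderings α-terpineol ends up last, which matches the behaviour described above.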

My google-fu indicates I can sort on a property but that is slow. I have
thought about tweaking slugify to include a table of equivalences
between Greek and Western chars but that doesn't necessarily cater for
non-Western character sets. Maybe an ever expanding table of equivalences?

Thanks for any ideas ...

Mike

Jason

Feb 6, 2018, 6:08:16 AM
to Django users
At first glance, I thought this was an easy problem to solve, but apparently it is not! I came across an Oracle whitepaper <http://www.oracle.com/technetwork/products/globalization/twp-appdev-linguistic-sorting-10gr2-132064.pdf> that describes how to sort linguistic data, and you might find some clues there to adapt to your current db. http://ilmarkerm.blogspot.com/2009/07/using-linguistic-indexes-for-sorting-in.html is an old post describing linguistic indexes in postgres and mysql, but the dbs used are almost 8 years out of date, so you might have to update the syntax for your current version.

Julio Biason

Feb 6, 2018, 6:28:27 AM
to django...@googlegroups.com
Hi Mike,

One thing that occurs to me is that you can override the model save() to update another field -- one that the user doesn't have access to. In that method, you write a new field, say `sortable_name`, in which you transform the chemical name into something that will appear in the proper order, like converting alphas to A, betas to B, etc.

When you request the list of chemicals by name order, you actually use the `sortable_name` field, which will have all the conversions in place.
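
A rough sketch of that idea, using a plain Python class as a stand-in for the Django model so that it runs on its own. The class, the field names and the tiny conversion table are all illustrative assumptions; in a real model, save() would finish by calling super().save(*args, **kwargs):

```python
# Stand-in for a Django model; names and the conversion table are hypothetical.
CONVERSIONS = {"α": "a", "β": "b"}

class Chemical:
    def __init__(self, name):
        self.name = name
        self.sortable_name = ""  # never shown to the user

    def save(self):
        # Recompute the hidden sort field every time the record is saved.
        converted = "".join(CONVERSIONS.get(ch, ch) for ch in self.name)
        self.sortable_name = converted.lower()
        # A real Django model would call super().save(*args, **kwargs) here.

chemicals = [Chemical("Toluene"), Chemical("α-terpineol"), Chemical("β-carotene")]
for chemical in chemicals:
    chemical.save()

# Order by the stored field, the way order_by("sortable_name") would.
ordered = sorted(chemicals, key=lambda c: c.sortable_name)
print([c.name for c in ordered])
```

The conversion cost is paid once per save rather than on every listing, which is the point of storing the value in its own field.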



--
You received this message because you are subscribed to the Google Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscribe@googlegroups.com.
To post to this group, send email to django...@googlegroups.com.
Visit this group at https://groups.google.com/group/django-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/4160ee4d-8b36-1118-1bec-2ba8ab40d891%40dewhirst.com.au.
For more options, visit https://groups.google.com/d/optout.



--
Julio Biason, Software Engineer
AZION  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554

Mike Dewhirst

Feb 6, 2018, 6:44:31 AM
to django...@googlegroups.com, Julio Biason
On 6/02/2018 10:27 PM, Julio Biason wrote:
> Hi Mike,
>
> One thing that occurs to me is that you can override the model save() to update another field -- one that the user doesn't have access to. In that method, you write a new field, say `sortable_name`, in which you transform the chemical name into something that will appear in the proper order, like converting alphas to A, betas to B, etc.

Agreed. This is what I have done to date ...

In substance.save() ...

        self.slug = greek_tweak(self.name, allow_unicode=True)

substance.slug is not displayed anywhere, nor is it used in urls, because there can be many substances with the same name. And greek_tweak() ...

    from django.utils.text import slugify

    def greek_tweak(name, allow_unicode=True):
        # Replace Greek letters with Latin equivalents before slugifying
        name = name.replace('α', 'a').replace('β', 'b').replace('γ', 'g')
        name = name.replace('δ', 'd').replace('ε', 'e')
        return slugify(name, allow_unicode=allow_unicode)

And back in substance Meta ...

        ordering = ['slug']




Mike Dewhirst

Feb 6, 2018, 6:52:17 AM
to Django users
On 6/02/2018 10:08 PM, Jason wrote:
> At first glance, I thought this was an easy problem to have, but
> apparently it is certainly not!  I came across an Oracle whitepaper
> <http://www.oracle.com/technetwork/products/globalization/twp-appdev-linguistic-sorting-10gr2-132064.pdf> that
> describes how to sort your linguistic data, and you might find some
> clues there to adapt with your current db.
> http://ilmarkerm.blogspot.com/2009/07/using-linguistic-indexes-for-sorting-in.html is
> an old post describing linguistic indexes in postgres and mysql, but
> the dbs used are almost 8 years out of date, so you might have to
> update the syntax to your current version.

Thank you. I think this is where we probably need to go. I asked the
original question because I'm hoping the project will reach a tipping
point and start to accumulate a growing number of multilingual users. We
have our first multinational user but they only operate in the English
speaking world so no pressure at the moment.

I really appreciate that pointer.

Cheers

Mike


Hanne Moa

Feb 15, 2018, 6:20:49 AM
to django...@googlegroups.com
On 2018-02-06 12:51, Mike Dewhirst wrote:
> Thank you. I think this is where we probably need to go. I asked the
> original question because I'm hoping the project will reach a tipping
> point and start to accumulate a growing number of multilingual users. We
> have our first multinational user but they only operate in the English
> speaking world so no pressure at the moment.

There can be no sort that satisfies every possible language at the same
time. For instance, Norwegian sorts "ä" as "a" and "ö" as "o". Swedish
sorts them after "å" as separate letters: åäö. Then there is Turkish
where "i" sorts differently from "ı" (dotless i).

I'm guessing chemistry names follow their own rules. You could see how
hard it is to make your own OS collation table and use that. Then
everything running on the server would sort by the same rules.
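
One way to picture a custom collation table, done in Python here rather than at the OS level: rank each character by its position in an explicit alphabet string and sort on those ranks. The alphabet constant and the `collation_key` helper are illustrative assumptions, a sketch rather than a complete Swedish or chemistry collation:

```python
# Hypothetical helper: sort strings by an explicit alphabet, not code points.
SWEDISH = "abcdefghijklmnopqrstuvwxyzåäö"

def collation_key(alphabet):
    rank = {ch: i for i, ch in enumerate(alphabet)}
    unknown = len(alphabet)  # characters outside the alphabet sort last

    def key(text):
        return [rank.get(ch, unknown) for ch in text.casefold()]

    return key

words = ["öl", "zebra", "ärlig", "åka", "apa"]
print(sorted(words))                              # code-point order: ä before å
print(sorted(words, key=collation_key(SWEDISH)))  # Swedish order: å before ä
```

A different alphabet string passed to the same helper gives a different ordering, which is roughly what a per-language or chemistry-specific table would have to provide.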


HM

Mike Dewhirst

Feb 15, 2018, 7:56:51 AM
to django...@googlegroups.com
On 15/02/2018 10:19 PM, Hanne Moa wrote:
> On 2018-02-06 12:51, Mike Dewhirst wrote:
>> Thank you. I think this is where we probably need to go. I asked the original question because I'm hoping the project will reach a tipping point and start to accumulate a growing number of multilingual users. We have our first multinational user but they only operate in the English speaking world so no pressure at the moment.
> There can be no sort that satisfies every possible language at the same time. For instance, Norwegian sorts "ä" as "a" and "ö" as "o". Swedish sorts them after "å" as separate letters: åäö. Then there is Turkish where "i" sorts differently from "ı" (dotless i).

That is interesting! It says to me that longer term I need to think
about special sort orders for different languages. A bit above my pay
grade just now.

I've worked around the Greek letter prefixes by using a separate sort
field only seen by the software. A simple replace('α', 'a') lets me adjust
sort order for the moment. That may work with diacritics for some time.
I'll be driven by actual requirements until I hit a brick wall and then
I'll ask for PhD help :)
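
For diacritics, a list of single replace() calls grows quickly. A possible alternative, sketched here with the standard unicodedata module, is to decompose each character and drop the combining marks. `strip_diacritics` is a hypothetical helper name, and as noted earlier in the thread this kind of folding is wrong for languages such as Swedish, where ä and ö are letters in their own right:

```python
import unicodedata

def strip_diacritics(text):
    # NFD splits "é" into "e" plus a combining accent; drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("Ångström"))     # Angstrom
print(strip_diacritics("α-terpineol"))  # unchanged: α has no decomposition
```

Greek letters pass through untouched, so this would sit alongside the Greek replacement rather than replace it.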

Thanks

Mike

Peter of the Norse

Mar 15, 2018, 9:41:52 AM
to django...@googlegroups.com
I ran into a similar problem with one of my projects; people were using Greek and Cyrillic letters and other symbols to be cute.  It’s all in English, but they kept doing things like using ß for B and ¥ for Y.  And then expecting to be able to search the way it looks.  So I am doing the cleanup in the .save() method.  My only advice is to use https://docs.python.org/3/library/stdtypes.html?highlight=translate#str.translate instead of multiple replaces.  If you make the translation map a global variable, it is much faster.
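
A minimal sketch of the str.translate approach, with the map built once at module level via str.maketrans. The `GREEK_TO_LATIN` table and the `sort_key` name are assumptions for illustration, and the table only covers the letters mentioned earlier in this thread:

```python
# Built once at import time, then reused for every call.
GREEK_TO_LATIN = str.maketrans({
    "α": "a",
    "β": "b",
    "γ": "g",
    "δ": "d",
    "ε": "e",
})

def sort_key(name):
    # One pass over the string instead of one pass per replace() call.
    return name.translate(GREEK_TO_LATIN).casefold()

names = ["Toluene", "α-terpineol", "ε-caprolactone", "Benzene", "acetone"]
print(sorted(names, key=sort_key))
```

Adding a language or symbol then means adding entries to the table, not chaining another replace().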

- Peter of the Norse

Mike Dewhirst

Mar 15, 2018, 6:21:47 PM
to django...@googlegroups.com, Peter of the Norse
On 16/03/2018 12:40 AM, Peter of the Norse wrote:
> I ran into a similar problem with one of my projects; people were
> using Greek and Cyrillic letters and other symbols to be cute.  It’s
> all in English, but they kept doing things like using ß for B and ¥
> for Y.  And then expecting to be able to search the way it
> looks.  So I am doing the cleanup in the .save() method.  My only
> advice is to use
> https://docs.python.org/3/library/stdtypes.html?highlight=translate#str.translate instead
> of multiple replaces.  If you make the translation map a global
> variable, it is much faster.

Wow!

Thank you. Ain't Python marvellous!

Mike
