Accented characters ... search inconsistency

Peter Koves

Aug 24, 2025, 1:59:36 PM
to mementodatabase
There is an inconsistency in how search handles accented characters between the Windows desktop app and the Android app.
  • On Windows searching for Dery (name of Hungarian author Déry) will find his entries. Good.
  • On Android the same search finds nothing; you have to search for the proper name Déry. Bad.
  • Why is this important specifically in Hungarian? There are short and long variants of certain letters: i and í, o and ó, ö and ő, u and ú, ü and ű. It is sometimes difficult to remember which one is used, especially in names. Note: in other cases, the diacritics completely transform the sound, so a is very different from á, similarly for o and ö, ő, etc.
The actual solution I prefer would be this on all platforms:
  • If the search string contains any accented character, do an exact search, i.e., as it works on Android now.
  • If there are no characters with diacritics (aka accented), then search should find any variant of the character. So, a search for garcon should find garçon, etc.
  • Note: Both Windows and Android provide APIs to convert a string containing characters with diacritics to one that does not ... or so ChatGPT tells me. This being the case, the implementation should be easy and work across Cyrillic, Greek, etc.
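The two rules above can be sketched in a few lines. This is only an illustration in Python, not how Memento works: the function names are made up for the example, case-insensitive matching via `casefold()` is an added assumption, and "removing diacritics" is done by Unicode NFD decomposition followed by dropping combining marks, which is roughly what the OS-level normalization APIs offer.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD and drop combining marks (é becomes e, ő becomes o)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def has_diacritics(text: str) -> bool:
    """True if any character carries a combining mark after decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return any(unicodedata.combining(ch) for ch in decomposed)


def matches(query: str, candidate: str) -> bool:
    """Hypothetical search rule from the post (substring match).

    Assumption: matching ignores case, which the post does not discuss.
    """
    q = unicodedata.normalize("NFC", query).casefold()
    c = unicodedata.normalize("NFC", candidate).casefold()
    if has_diacritics(q):
        # Accented query: exact (accent-sensitive) search.
        return q in c
    # Plain query: match any accented variant of the stored text.
    return q in strip_diacritics(c)


print(matches("Dery", "Déry Tibor"))    # True
print(matches("Déry", "Dery Tibor"))    # False
print(matches("garcon", "garçon"))      # True
```

The accented branch stays strict on purpose, so a user who does type ő is not shown entries containing ö or o.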
Extra credit 😉:
  • You may be thanked by German clients for this. Recognize that ae could represent ä, similarly for oe and ue representing ö and ü, and more interestingly, that ss may stand for ß.
  • There may be other similar contractions. For example, I know that Dutch has Ĳ and ĳ. You may think you see 4 characters there but there are actually two (use the arrow keys to verify, or try to click between the two parts). So Ĳ is the same as IJ (looks the same, but the latter is the two characters I and J, and indeed the actual ligature is often written with the two separate characters). Example: the river Ĳssel can be written as IJssel, but never as Ijssel.
  • Super extra credit. Handle ø (Danish/Norwegian), å (Swedish), ł (Polish), ð (South Slavic). Note that these are all special cases that the APIs do not handle. Icelandic also has þ, which is represented as th. Further, there is also æ, which is represented as ae, so ae should match both ä and æ (but note that the cited APIs will not convert æ to ae).
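A rough Python sketch of how the "extra credit" cases could be layered on top of plain diacritic stripping. Everything here is an assumption made for illustration: the two mapping tables are small and incomplete, the transliterations (for example ð to d) are one common choice among several, and the function names are invented. One detail worth noting: å does decompose under Unicode NFD (a plus a combining ring), so it needs no table entry, whereas ø, ł, ð, þ, æ and ĳ do not decompose and need explicit mappings.

```python
import unicodedata

# Letters that NFD does not decompose (illustrative, not exhaustive).
SPECIAL = {"ø": "o", "ł": "l", "ð": "d", "þ": "th", "æ": "ae", "ĳ": "ij"}
# German-style digraph spellings of umlauts.
GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue"}


def strip_marks(text: str) -> str:
    """Decompose to NFD and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search_keys(text: str) -> set:
    """Return the spellings under which a stored text should be findable."""
    # casefold() lowercases and also turns ß into ss.
    folded = unicodedata.normalize("NFC", text).casefold()
    plain = strip_marks("".join(SPECIAL.get(ch, ch) for ch in folded))
    german = strip_marks(
        "".join(SPECIAL.get(ch, GERMAN.get(ch, ch)) for ch in folded)
    )
    # Keep the folded original too, so an accented query can match exactly.
    return {folded, plain, german}


def matches(query: str, candidate: str) -> bool:
    """Substring match of the query against every key of the candidate."""
    q = unicodedata.normalize("NFC", query).casefold()
    return any(q in key for key in search_keys(candidate))


print(matches("Muenchen", "München"))  # True
print(matches("Munchen", "München"))   # True
print(matches("strasse", "Straße"))    # True
print(matches("lodz", "Łódź"))         # True
print(matches("IJssel", "Ĳssel"))      # True
```

Generating several keys per stored value, instead of rewriting the query, lets ae match both ä and æ without the query having to guess which language it is in.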
This covers languages with alphabets based on Latin, and you have learned more than you ever wanted to know about them. I have no idea whether languages with other alphabets need treatment. For example, I know that Russian Cyrillic used to have Ѣ, ѭ, etc., but these are a) all pre-1918 and b) the latter only appears in archaic church texts.

David Gilmore

Aug 25, 2025, 11:31:30 AM
to mementodatabase
Good write up on the issue.

Unfortunately, there are significant differences between the Windows environment and the Android/Apple (Unix-like) environments. For example, under Windows, searches/sorting default to case insensitive ("A" and "a" are the same), whereas Android defaults to case sensitive ("A" and "a" are different). The Memento applications for the three environments are completely different versions because of those differences.

The English alphabet can be contained in one byte (256 different values), whereas other languages may require more characters and thus more bytes. That is why the industry created multi-byte characters (1 to 4 bytes in UTF-8). (Emojis are part of that extended character set.) There is a standard for extended characters, but not all keyboard software obeys that standard, whether on Windows or Android. And to further complicate it, each language has its own rules.
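To make the byte counts concrete, here is a small Python check of how many bytes a few characters take when encoded as UTF-8 (the helper name is just for this example; other encodings such as UTF-16 give different sizes):

```python
def utf8_length(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# ASCII letter, two accented Latin letters, a Chinese character, an emoji.
for ch in ["a", "é", "ő", "中", "😉"]:
    print(ch, utf8_length(ch))
# a 1
# é 2
# ő 2
# 中 3
# 😉 4
```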

If you regularly use different languages and they use extended characters, you might try different keyboards to see if one or another works better in your environment.

As far as searches go, that is also dependent on how the application implements it. But applications have it very difficult, since there are completely different rules for Slavic (Russian) versus Hanzi (Chinese) alphabets, for example. But this is something that needs to be brought to the attention of the application developers, and it is part of their effort to "nationalize" their application.

This forum is a user's forum, and Memento staff do not regularly monitor this site. I would suggest sending this question directly to Memento (Contact information can be found on the Memento Database Web site).

Bill Crews

Aug 25, 2025, 12:13:06 PM
to David Gilmore, mementodatabase
I haven't tried it, but to the extent application searches may depend on the OS environment, you could also see differences between the platforms of the desktop edition. I know Windows and Unix/Linux have had distinct ways of sorting.



Peter Koves

Aug 25, 2025, 2:37:09 PM
to mementodatabase
Got it, thanks. I'll send this directly to the developer.
Having been a developer and architect for about 45 years I know all about single-byte, double-byte, code pages, UTF-16 and UTF-32 as well as UTF-8, etc., and especially about how hard it can be to support non-standard-ASCII (literally "hard" or at least firm: around 1986 with the help of a friend we burned a character generator ROM for my monitor whose content I modified so it could display ő, Ő, ű, and Ű; it was then just a question of writing a TSR so the characters could be input easily).

My point is this. Both Windows and Android have built-in support for converting text with diacritics to text with them removed. Android was Unicode from the ground up, so obviously it supports it in the DB. But so does the Windows version of Memento, as can be verified by adding an entry with a Chinese character (which I tried). So all that remains is:
  • to use the OS APIs to convert strings with diacritics to ones that have them removed, and
  • to be able to have an SQL SELECT that can match "diacritic-insensitively". ChatGPT shows how this can be done for Postgres, MySQL/MariaDB, SQL Server, and SQLite. I wonder which DB is used on Android and Windows (likely different).
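For the SQL side, one possible approach is sketched below using SQLite through Python's built-in `sqlite3` module. Whether Memento uses SQLite on either platform is not known here, and the table name `books`, the column `author`, and the SQL function name `fold` are all invented for the example. The idea is to register a user-defined function that strips diacritics and compare the folded column against the folded search term.

```python
import sqlite3
import unicodedata


def fold(text):
    """Strip combining marks and case; registered below as SQL fold()."""
    if text is None:
        return None
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


conn = sqlite3.connect(":memory:")
conn.create_function("fold", 1, fold)

# Hypothetical table and sample rows for the demonstration.
conn.execute("CREATE TABLE books (author TEXT)")
conn.executemany(
    "INSERT INTO books (author) VALUES (?)",
    [("Déry Tibor",), ("Garçon",), ("Dickens",)],
)


def search(term):
    """Return authors whose folded name contains the folded search term."""
    rows = conn.execute(
        "SELECT author FROM books "
        "WHERE instr(fold(author), fold(?)) > 0 ORDER BY author",
        (term,),
    )
    return [row[0] for row in rows]


print(search("dery"))    # ['Déry Tibor']
print(search("garcon"))  # ['Garçon']
```

Calling a function on every row rules out an ordinary index, so on a large library it would probably be better to store the folded text in an extra column at write time and search that instead.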
