How to count the number of characters in Persian?


Pejman Habashi

Jun 24, 2022, 7:21:10 PM
to Persian Computing
Simple question: how do you measure the length of a string?

The simple, naive answer is to count the number of Unicode characters (code points) in the string. This is what most programmers do and understand very well. However, in languages with diacritics, I assume the diacritics do not contribute to the length of the string (or do they?). How about the zero-width non-joiner (ZWNJ)?

Specifically, in Persian:

len ( "علی" ) = 3
len( "عَلی" ) = 3 or 4?
len( "عِلّی" ) = 3 or 4 or 5?
len( "می‌نوشته‌ای" ) = 9 or 11?

The ZWNJ becomes more critical because in everyday writing it is sometimes required and sometimes not. For example, "سایه‌ی" cannot be written without a ZWNJ, while "مسیری" is usually written without one.
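To make the alternatives in the question concrete, here is a minimal Python sketch of three possible counts. The function names are hypothetical, and the strings are written as escape sequences that are my own transcription of the examples above (assuming the diacritics are fatha, kasra and shadda). It treats "diacritic" as Unicode category Mn, which is one possible definition, not an agreed standard.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

def count_code_points(text: str) -> int:
    """Naive length: every Unicode code point counts."""
    return len(text)

def count_without_marks(text: str) -> int:
    """Skip combining marks (category Mn), but still count ZWNJ."""
    return sum(1 for ch in text if unicodedata.category(ch) != "Mn")

def count_letters_only(text: str) -> int:
    """Skip both combining marks and ZWNJ."""
    return sum(
        1 for ch in text
        if unicodedata.category(ch) != "Mn" and ch != ZWNJ
    )

# Transcriptions of the examples (assumed code point sequences).
ali = "\u0639\u0644\u06cc"                    # no diacritics
ali_fatha = "\u0639\u064e\u0644\u06cc"        # with fatha
elli = "\u0639\u0650\u0644\u0651\u06cc"       # with kasra and shadda
mineveshtei = (
    "\u0645\u06cc\u200c\u0646\u0648\u0634\u062a\u0647\u200c\u0627\u06cc"
)

for word in (ali, ali_fatha, elli, mineveshtei):
    print(
        count_code_points(word),
        count_without_marks(word),
        count_letters_only(word),
    )
# prints: 3 3 3 / 4 3 3 / 5 3 3 / 11 11 9 (one triple per line)
```

Which of the three numbers is "the length" depends on the use case; the code only shows that each answer in the question corresponds to a well-defined rule.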





Peter von Kaehne

Jun 25, 2022, 3:52:01 AM
to Pejman Habashi, Persian Computing
I think there are several aspects. Depending on what you want (reliable iteration and movement of a pointer along a text, reliable search, or something else), you need to agree on a standard, or possibly on several within the same programme.

Where I deal with Farsi and similar texts in programmes (searching poor-quality user input against long, previously prepared texts that are often of higher quality, i.e. with ZWNJ and diacritics reliably present), we use a "decomposed, stripped of diacritics, ZWNJ etc., normalised" form for search, which gives you the minimal count. Similarly, for "give me the third and seventh letter of your password" you need a count based on stripped characters with ZWNJ deleted.

Simple decomposed, unstripped text will always give the highest count of your alternatives. We use that for all pointing and iterating over the texts, where user input is not involved.
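As a rough illustration of the two forms described above, the following Python sketch uses the standard `unicodedata` module. The function names are hypothetical, and the choice of NFD and of category Mn for "diacritics" is an assumption; the actual pipeline is not shown in this thread.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

def decomposed(text: str) -> str:
    """Decomposed, unstripped form: gives the highest count."""
    return unicodedata.normalize("NFD", text)

def search_key(text: str) -> str:
    """Decomposed form with combining marks and ZWNJ removed."""
    return "".join(
        ch for ch in decomposed(text)
        if unicodedata.category(ch) != "Mn" and ch != ZWNJ
    )

def contains(prepared_text: str, user_input: str) -> bool:
    """Match sloppy user input against a carefully prepared text."""
    return search_key(user_input) in search_key(prepared_text)

# ALEF WITH MADDA ABOVE (U+0622) decomposes into ALEF + MADDAH ABOVE,
# so the decomposed count is larger and the stripped count is smaller.
ab = "\u0622\u0628"
print(len(ab), len(decomposed(ab)), len(search_key(ab)))  # prints: 2 3 2
```

Note one side effect of this assumption: after decomposition the maddah is a combining mark, so stripping turns alef with madda into a plain alef. Whether that is acceptable depends on the application.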

I am sure there are other applications. The bottom line is that you need to think about, and agree on (with anyone else contributing, at the very least), what you want to achieve and hence which way of processing Unicode is best for your particular use case. But no way is "wrong", and hence none of the results is wrong.

Peter



Pejman Habashi

Jun 27, 2022, 11:16:45 AM
to Persian Computing
Thanks Peter, that was mostly what I was expecting. For low-level processing of strings (e.g. UTF-8), it is clear that you have to consider the in-memory representation of the content.
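For the low-level case, a small Python sketch shows how the size of the encoded representation differs from the code point count. The helper names are hypothetical; the only facts relied on are that Arabic-block letters take two bytes each in UTF-8 and ZWNJ (U+200C) takes three.

```python
def utf8_length(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of the text."""
    return len(text.encode("utf-8"))

def utf16_units(text: str) -> int:
    """Number of 16-bit code units in the UTF-16 encoding of the text."""
    return len(text.encode("utf-16-le")) // 2

ali = "\u0639\u0644\u06cc"  # three letters, no diacritics
mineveshtei = (
    "\u0645\u06cc\u200c\u0646\u0648\u0634\u062a\u0647\u200c\u0627\u06cc"
)  # nine letters and two ZWNJ

print(len(ali), utf8_length(ali), utf16_units(ali))  # prints: 3 6 3
print(len(mineveshtei), utf8_length(mineveshtei), utf16_units(mineveshtei))
# prints: 11 24 11
```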

My main concern is text processing, and I have decided to normalize the text (even more than I should :D ) into the basic 32-letter Persian alphabet. At some points this looks too harsh, and I needed a second opinion on it.
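The exact normalization rules are not spelled out here, so the following Python sketch is only one guess at what such a reduction could look like. Everything in it is an assumption for illustration: the names `BASIC_LETTERS`, `VARIANTS` and `to_basic_alphabet`, the choice to fold Arabic yeh, alef maksura and Arabic kaf into their Persian forms, and the choice to drop ZWNJ and combining marks. Characters outside the 32 letters (spaces, digits, hamza and so on) are left untouched.

```python
import unicodedata

# The 32 letters of the basic Persian alphabet, as code points.
BASIC_LETTERS = frozenset(
    "\u0627\u0628\u067e\u062a\u062b\u062c\u0686\u062d"
    "\u062e\u062f\u0630\u0631\u0632\u0698\u0633\u0634"
    "\u0635\u0636\u0637\u0638\u0639\u063a\u0641\u0642"
    "\u06a9\u06af\u0644\u0645\u0646\u0648\u0647\u06cc"
)

# Assumed folding table: Arabic variants to Persian letters, ZWNJ removed.
VARIANTS = {
    0x064A: "\u06cc",  # ARABIC YEH -> FARSI YEH
    0x0649: "\u06cc",  # ALEF MAKSURA -> FARSI YEH
    0x0643: "\u06a9",  # ARABIC KAF -> KEHEH
    0x200C: None,      # ZWNJ is dropped
}

def to_basic_alphabet(text: str) -> str:
    """Decompose, drop combining marks, then fold variants."""
    nfd = unicodedata.normalize("NFD", text)
    no_marks = "".join(
        ch for ch in nfd if unicodedata.category(ch) != "Mn"
    )
    return no_marks.translate(VARIANTS)

# A word typed with kasra, shadda and an Arabic yeh folds to three letters.
word = to_basic_alphabet("\u0639\u0650\u0644\u0651\u064a")
print(len(word), all(ch in BASIC_LETTERS for ch in word))  # prints: 3 True
```

A reduction like this loses information (for example, whether a ZWNJ was present), which is the "too harsh" trade-off mentioned above.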

I wish there were a more standard way of doing this, something that everyone, or at least the majority (including linguists and the general population), agrees upon. In the absence of such a standard, I will go ahead and create one for now, unless someone has a better idea.

Thanks,
Pejman.