remove right-to-left script elements from allCountries.txt

Alois Treindl

Jun 9, 2020, 5:04:38 AM
to GeoNames
allCountries.txt contains a field for alternative names.
All kinds of languages and scripts appear there, among them Hebrew and Arabic.

These two run right to left and are hard to handle when processing the alternative names for a user interface
if, like me, one is not used to dealing with such scripts.

Is there a way to remove these two languages from the data?

Occasionally, but not systematically, the Unicode character U+200E LEFT-TO-RIGHT MARK also appears in this field; it should be removed together with the Hebrew and Arabic names.

Alois Treindl

Jun 9, 2020, 5:49:38 AM
to GeoNames
I found a kind of solution. This often happens just after asking for help via a group posting.
Posting seems to sharpen my attention.

perl -CS -pe 's/,[\x{0530}-\x{ffff}]+[^\t]+//' < allCountries.txt | perl -CS -pe 's/[\x{2000}-\x{20ff}]//g' > allCountries.cleaned.txt

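The first command looks for a comma followed by a character at or beyond U+0530 and deletes everything from there up to the next TAB character, which marks the end of the alternative names field.
The second command removes every character in the range U+2000 to U+20FF, which includes the right-to-left and left-to-right marks.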

It leaves Cyrillic and Greek script elements.
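
One caveat with the one-liner: because it deletes from the first match to the end of the field, any Latin-script names listed after that point are lost too, and the U+2000 to U+20FF range also covers typographic quotes and currency signs. A more targeted variant could drop only the names that contain Hebrew or Arabic characters. The following is a minimal sketch in Python rather than Perl, not a tested replacement. It assumes the alternative names are the fourth tab-separated column and are comma-separated, as described in the GeoNames readme. The function names and the exact choice of Unicode ranges are my own assumptions.

```python
import re

# Hebrew and Arabic ranges (assumed selection; extend if other
# right-to-left scripts such as Syriac or Thaana should go too).
RTL_CHARS = re.compile(
    "["
    "\u0590-\u05FF"  # Hebrew
    "\u0600-\u06FF"  # Arabic
    "\u0750-\u077F"  # Arabic Supplement
    "\u08A0-\u08FF"  # Arabic Extended-A
    "\uFB1D-\uFB4F"  # Hebrew presentation forms
    "\uFB50-\uFDFF"  # Arabic Presentation Forms-A
    "\uFE70-\uFEFF"  # Arabic Presentation Forms-B
    "]"
)

# Directional marks, embeddings, overrides and isolates.
BIDI_CONTROLS = re.compile("[\u200E\u200F\u202A-\u202E\u2066-\u2069]")

ALT_NAMES_COL = 3  # zero-based index of the alternatenames column


def clean_line(line):
    """Return the line with Hebrew/Arabic alternative names removed."""
    body = line.rstrip("\n")
    newline = line[len(body):]
    fields = body.split("\t")
    if len(fields) <= ALT_NAMES_COL:
        return line  # not a regular data row, leave untouched
    kept = []
    for name in fields[ALT_NAMES_COL].split(","):
        name = BIDI_CONTROLS.sub("", name)
        if name and not RTL_CHARS.search(name):
            kept.append(name)
    fields[ALT_NAMES_COL] = ",".join(kept)
    return "\t".join(fields) + newline


def clean_file(src_path, dst_path):
    """Stream src_path to dst_path, cleaning one line at a time."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(clean_line(line))
```

Calling `clean_file("allCountries.txt", "allCountries.cleaned.txt")` would produce the same output file name as the Perl command. Like the one-liner, this keeps Cyrillic and Greek names; unlike it, Latin names that follow a Hebrew or Arabic entry stay in the field.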