Hi
During the annual ROR meeting on February 4th, there was brief mention of the possible value of adding ‘script codes’ to ROR organisation names. Similar to the code indicating the name’s language, a script code would indicate the script or alphabet being used: Latin, Cyrillic, Greek, Arabic, Hanzi, Hangul etc. While the great majority (approximately 89%) of non-acronym names are currently provided in the Latin script, as are over 99% of acronyms, I think one of the great strengths of the ROR system is its inclusion of non-Latin names. One could argue that in the (much) longer term, if ROR is to develop into a truly global system, the proportion of non-Latin versions of names needs to increase, but an initial step is to identify and tabulate the scripts currently used. This would also allow names to be filtered and/or sorted by script, in use cases where that was appropriate.
Fortunately, identifying scripts is reasonably straightforward – much easier in many cases than identifying the ‘language’ of a name. While the latter can be hard to define, the script is generally obvious, and it can be derived automatically from the characters used. Scripts can be identified, and added to each name record, for 100% of ROR names in the space of a few seconds.
The key is that ROR names, like almost all textual data, are presented in Unicode. Unicode is the successor to, and a huge extension of, the old ASCII character set – it has a symbol and name for every letter, character or mark used in almost every language, current and ancient, encompassing almost 155,000 symbols in total. Furthermore Unicode is itself almost always encoded using a standard scheme called UTF-8, and the ROR names are no exception.
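To make the distinction between a Unicode character number (its ‘code point’) and its UTF-8 encoding concrete, here is a minimal Python sketch. The sample characters and the helper name `describe` are illustrative only and are not taken from ROR data or from imp_ror.

```python
def describe(ch: str) -> tuple:
    """Return (code point, code point in hex, UTF-8 bytes in hex) for one character."""
    code_point = ord(ch)                      # the Unicode number of the character
    utf8_hex = ch.encode("utf-8").hex(" ")    # the bytes actually stored, space separated
    return code_point, f"{code_point:04X}", utf8_hex


if __name__ == "__main__":
    # One Latin, one accented Latin, one Cyrillic and one CJK character.
    for ch in ["A", "\u00e9", "\u0416", "\u5927"]:
        number, hex_number, utf8_hex = describe(ch)
        print(f"{ch}  U+{hex_number}  decimal {number}  UTF-8 bytes: {utf8_hex}")
```

The code point is what identifies the character and its block; UTF-8 is only the way that number is stored as one to four bytes.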
Each script has its own block, or blocks, of related Unicode characters. For example, Latin letters, including their accented variations, are found in characters numbered 32 – 767 (0020 – 02FF in hex), Cyrillic in the block 1024 – 1279 (0400 – 04FF), whilst the traditional Chinese characters (or ‘CJK ideographs’) used in Hanzi, Kanji and Hanja have a block from 19,968 – 40,959 (4E00 – 9FFF). The different scripts are themselves defined by an ISO standard, ISO 15924: Codes for the representation of names of scripts. Details on scripts and the associated Unicode blocks are readily available on the web (e.g. https://en.wikipedia.org/wiki/ISO_15924, https://www.unicodepedia.com/groups/). Identifying the script of a name should therefore involve little more than decoding its UTF-8 characters and seeing which block of characters they belong to.
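The block lookup just described can be sketched in a few lines of Python. This is only an illustration using the three ranges quoted above, keyed by their ISO 15924 codes (Latn, Cyrl, Hani); it is not the SQL used in imp_ror. The function names and the sample names are assumptions made for the example.

```python
from typing import Optional, Set

# Code point ranges as quoted in the text, keyed by ISO 15924 script code.
# Only three scripts are covered; a full table would list many more ranges.
SCRIPT_RANGES = {
    "Latn": (0x0020, 0x02FF),  # Latin, including accented letters
    "Cyrl": (0x0400, 0x04FF),  # Cyrillic
    "Hani": (0x4E00, 0x9FFF),  # CJK ideographs (Hanzi, Kanji, Hanja)
}


def script_of_char(ch: str) -> Optional[str]:
    """Return the script code whose range contains ch, or None if no range matches."""
    code_point = ord(ch)
    for code, (low, high) in SCRIPT_RANGES.items():
        if low <= code_point <= high:
            return code
    return None


def scripts_in_name(name: str) -> Set[str]:
    """Return the set of script codes found among the characters of a name."""
    return {code for code in map(script_of_char, name) if code is not None}


if __name__ == "__main__":
    for name in ["University of Oxford", "北京大学", "Московский университет"]:
        print(name, sorted(scripts_in_name(name)))
```

Note that with these raw ranges the space character (code 32) counts as Latin, so the two-word Cyrillic name is reported as containing both Cyrl and Latn.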
Needless to say, it is not quite as simple as that. A lot of non-Latin names use, in practice, ‘Latin’ punctuation. Occasionally non-Latin characters are found in Latin names. In addition, a very small proportion of the names listed in ROR seem to genuinely include two scripts. It is therefore necessary to ‘pre-process’ names before the script coding takes place, and do a small amount of ‘post-processing’ after that coding. The process remains, however, generally straightforward, entirely automated and fast.
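One possible shape for such pre- and post-processing is sketched below in Python. It is an assumption-laden illustration, not the process used in imp_ror: the pre-processing step simply drops every character that is not a Unicode letter, and the post-processing step reports the most frequent script and flags names in which more than one script remains. The range table covers only three scripts, and the label "other" and all function names are hypothetical.

```python
import unicodedata
from collections import Counter

# Illustrative ranges only, keyed by ISO 15924 script code.
SCRIPT_RANGES = {
    "Latn": (0x0020, 0x02FF),
    "Cyrl": (0x0400, 0x04FF),
    "Hani": (0x4E00, 0x9FFF),
}


def letters_only(name: str) -> str:
    """Pre-processing: keep letters (Unicode categories L*), drop punctuation, digits, spaces."""
    return "".join(ch for ch in name if unicodedata.category(ch).startswith("L"))


def script_of_char(ch: str) -> str:
    """Return the script code for ch, or the hypothetical label 'other' if no range matches."""
    code_point = ord(ch)
    for code, (low, high) in SCRIPT_RANGES.items():
        if low <= code_point <= high:
            return code
    return "other"


def code_name(name: str) -> tuple:
    """Return (most frequent script or None, True if more than one script is present)."""
    counts = Counter(script_of_char(ch) for ch in letters_only(name))
    if not counts:
        return None, False          # nothing left to code, e.g. digits and punctuation only
    dominant = counts.most_common(1)[0][0]
    # Post-processing: flag mixed names for review rather than deciding silently.
    return dominant, len(counts) > 1


if __name__ == "__main__":
    for name in ["Московский университет (МГУ)", "北京大学 (PKU)", "1234 - !"]:
        print(name, code_name(name))
```

With the brackets and spaces removed, the Cyrillic name is coded cleanly, while a name that really does combine ideographs with a Latin acronym is still flagged as mixed.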
In the hope it might be of interest or use to others I have summarised further details of a coding process in the attached document. I have also attached a spreadsheet with some of the associated data, which acts as a resource / appendix to the description document. No apologies for diving into some of the weeds of Unicode and Postgres regular expressions! Though the implementation details are system / language dependent, I think the actions required would be very similar whatever the technical tools employed.
The process as described assumes that the names, along with other ROR data, have been imported into a Postgres database, and it consists essentially of a series of SQL statements against that database. The process is integrated within an updated version of the imp_ror system I have mentioned previously, available on my GitHub, where the SQL commands are executed within a simple framework written in Rust. Most of the code dealing with encoding scripts is to be found at https://github.com/steve-canham/imp_ror/blob/master/src/process/src_script_coder.rs.
I appreciate that all this might not be a high priority, given all the other activity in ROR, but I hope it is of some use, if not now then in the future!
Cheers
Steve
P.S. Minor point, but the docs seem a little confused about the total number of records in ROR. The 1.63 release states that 287 new records were added, and given that the 1.62 release included 114,725 records that gives a total of 115,012. I get that number when I download the 1.63 file into a Postgres DB. But the release notes for 1.63 state that there are 115,299 records. In other words I think the 287 has been added twice.