Unicode characters to actual scripts


Parthasarathi Mukhopadhyay

Oct 30, 2020, 6:46:36 AM
to openr...@googlegroups.com
Dear all

I am trying to create a geographic name authority dataset in OR 3.4.1.
The steps that I've followed are as follows:

1. Collect Indian district names along with the state names (631 districts);
2. Use Nominatim to fetch data on the basis of DistrictName and StateName (able to gather almost all, except 8 districts);
3. Create a column for DisplayName on the basis of the fetched data;
4. Reconcile DisplayName with Wikidata;
5. Add columns on the basis of the reconciled data (except the names of the districts in other languages);
6. Tried to fetch datasets in JSON format using the Wikidata API;
7. Received language data (along with other properties);
8. It shows the labels as Unicode escape codes rather than in the scripts of the languages;

{"entities":{"Q1356112":{"pageid":1295707,"ns":0,"title":"Q1356112","lastrevid":1299287312,"modified":"2020-10-29T03:53:59Z","type":"item","id":"Q1356112","labels":{"hi":{"language":"hi","value":"\u0939\u0928\u0941\u092e\u093e\u0928\u0917\u0922\u093c \u091c\u093f\u0932\u093e"},"ru":{"language":"ru","value":"\u0425\u0430\u043d\u0443\u043c\u0430\u043d\u0433\u0430\u0440\u0445"},"fr":{"language":"fr","value":"district de Hanumangarh"},"en":{"language":"en","value":"Hanumangarh District"},"es":{"language":"es","value":"Distrito de Hanumangarh"},"it":{"language":"it","value":"distretto di Hanumangarh"},"pa":{"language":"pa","value":"\u0a39\u0a28\u0a41\u0a2e\u0a3e\u0a28\u0a17\u0a5c\u0a4d\u0a39 \u0a1c\u0a3c\u0a3f\u0a32\u0a4d\u0a39\u0a3e"},"de":{"language":"de","value":"Hanumangarh"},"gu":{"language":"gu","value":"\u0ab9\u0aa8\u0ac1\u0aae\u0abe\u0aa8\u0a97\u0aa2 \u0a9c\u0abf\u0ab2\u0acd\u0ab2\u0acb"},"vi":{"language":"vi","value":"Hanumangarh"},"pnb":{"language":"pnb","value":"\u0636\u0644\u0639 \u06c1\u0646\u0648\u0645\u0646\u06af\u0691\u06be"},"ar":{"language":"ar","value":"\u0645\u0646\u0637\u0642\u0629 \u0647\u0627\u0646\u0648\u0645\u0627\u0646\u062c\u0631\u0627"},"nl":{"language":"nl","value":"Hanumangarh"},"sv":{"language":"sv","value":"Hanumangarh"},"nb":{"language":"nb","value":"Hanumangarh"},"sa":{"language":"sa","value":"\u0939\u0928\u0941\u092e\u093e\u0928\u0917\u0922\u093c\u092e\u0923\u094d\u0921\u0932\u092e\u094d"},"zh":{"language":"zh","value":"\u54c8\u52aa\u8292\u52a0\u723e\u7e23"},"te":{"language":"te","value":"\u0c39\u0c28\u0c41\u0c2e\u0c3e\u0c28\u0c4d\u200c\u0c17\u0c30\u0c4d \u0c1c\u0c3f\u0c32\u0c4d\u0c32\u0c3e"},"ne":{"language":"ne","value":"\u0939\u0928\u0941\u092e\u093e\u0928\u0917\u0922\u093c \u091c\u093f\u0932\u094d\u0932\u093e"},"mr":{"language":"mr","value":"\u0939\u0928\u0941\u092e\u093e\u0928\u0917\u0922 \u091c\u093f\u0932\u094d\u0939\u093e"},"nan":{"language":"nan","value":"Hanumangarh"},"ta":{"language":"ta","value":"\u0b85\u0ba9\u0bc1\u0bae\u0bbe\u0ba9\u0bcd\u0b95\u0bbe\u0b9f\u0bcd \u0bae\u0bbe\u0bb5\u0b9f\u0bcd\u0b9f\u0bae\u0bcd"},"ja":{"language":"ja","value":"\u30cf\u30cc\u30de\u30fc\u30f3\u30ac\u30eb\u770c"},"bho":{"language":"bho","value":"\u0939\u0928\u0941\u092e\u093e\u0928\u0917\u0922\u093c \u091c\u093f\u0932\u093e"}

9. Tried Facet > Customized facets > Unicode char-code facet, but it made no changes to the dataset;

10. Is it possible to convert the Unicode character codes into the native scripts, so the values read like "en":{"language":"en","value":"Hanumangarh District"}?
11. What am I missing?

Best
-----------------------------------------------------------------------
Parthasarathi Mukhopadhyay
Professor, Department of Library and Information Science,
University of Kalyani, Kalyani - 741 235 (WB), India
-----------------------------------------------------------------------

Thad Guidry

Oct 30, 2020, 10:49:20 AM
to openr...@googlegroups.com
Hi Parthasarathi!

Besides languages, the API also has a languagefallback parameter that you could supply.
Scroll down to the bottom of that API documentation page to see several examples.

An example of getting the EN language for both Labels and Descriptions on your entity id:
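A minimal sketch of that request in Python (the parameter names languages, languagefallback, and props come from the MediaWiki wbgetentities action; Q1356112 is the entity id from your JSON):

```python
from urllib.parse import urlencode

# Request only the English labels and descriptions for one entity,
# falling back to another language when no English label exists.
params = {
    "action": "wbgetentities",
    "ids": "Q1356112",
    "props": "labels|descriptions",
    "languages": "en",
    "languagefallback": "1",
    "format": "json",
}
url = "https://www.wikidata.org/w/api.php?" + urlencode(params)
print(url)
```

Fetching that URL returns the same JSON structure as your snippet, but restricted to the en entries.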


If your question was indeed within the OpenRefine context only, then you could use reinterpret(), but it won't work in your case because most of the values are in native language scripts.


Trying reinterpret() with ASCII won't help either: there is no direct conversion, which is where translations, aliases, etc. come into play. Hence using Wikidata to get the label already translated into EN or whichever language is needed.
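To see why an ASCII reinterpretation cannot work, here is a minimal Python sketch (not GREL): Devanagari code points simply have no ASCII mapping, so a Latin-script form has to come from a translated or transliterated label.

```python
# The Hindi label "हनुमानगढ़" from the JSON above, written with the
# same \uXXXX escapes that appear in the raw response.
label = "\u0939\u0928\u0941\u092e\u093e\u0928\u0917\u0922\u093c"

try:
    label.encode("ascii")
except UnicodeEncodeError as err:
    # The ASCII codec cannot represent any of these code points.
    print("no ASCII form:", err.reason)
```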


Hope this helps! If you get really stuck on the Wikidata API, just let us know by private email.



--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/CAGM_5uYeSO41yZLWAqZRDL0b%3D5jJqAo7EQ2kp%2B5vS8JBQpu8_w%40mail.gmail.com.

Thad Guidry

Oct 30, 2020, 10:54:22 AM
to openr...@googlegroups.com
Parthasarathi,

I just reread your ask in #8 ... somehow missed that ... so yeah, just use parseJson() to extract the values; if you have a Unicode font installed, the browser and OpenRefine will display the scripts for you.

Hopefully you already know how to use parseJson() for those JSON blobs you have?


Thad Guidry

Oct 30, 2020, 10:57:34 AM
to openr...@googlegroups.com
Parthasarathi,

This is what I use in my Firefox settings, but you can use whatever you want, wherever. I use Lucida Sans Unicode as the default sans-serif.



Parthasarathi Mukhopadhyay

Oct 30, 2020, 11:19:57 AM
to openr...@googlegroups.com
Hi Thad

Thanks for this excellent explanation.


But the method you have demonstrated for handling the Wikidata API is much more useful for my case.

I do hope I can manage the rest :-)

Best regards and heartfelt thanks.

-----------------------------------------------------------------------
Parthasarathi Mukhopadhyay
Professor, Department of Library and Information Science,
University of Kalyani, Kalyani - 741 235 (WB), India
-----------------------------------------------------------------------

Tom Morris

Oct 30, 2020, 11:36:07 AM
to openr...@googlegroups.com
The \u notation that you see is just an artifact of the JSON encoding. It will all get resolved properly when you decode it. The quoted JSON isn't valid, but once I cleaned it up, this GREL expression:

    value.parseJson().entities.Q1356112.labels.te.value

produced this text:

    హనుమాన్‌గర్ జిల్లా

[I'm assuming that you were interested in something harder to decode than English :) ]
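The same decoding works outside OpenRefine too; a minimal Python sketch using the "te" label from the snippet above (json.loads resolves the \u escapes automatically, just as parseJson() does):

```python
import json

# A fragment of the API response, with the JSON \uXXXX escapes intact.
blob = (
    '{"entities":{"Q1356112":{"labels":{"te":{"language":"te",'
    '"value":"\\u0c39\\u0c28\\u0c41\\u0c2e\\u0c3e\\u0c28\\u0c4d\\u200c'
    '\\u0c17\\u0c30\\u0c4d \\u0c1c\\u0c3f\\u0c32\\u0c4d\\u0c32\\u0c3e"}}}}}'
)

# json.loads turns every \uXXXX escape back into its character.
label = json.loads(blob)["entities"]["Q1356112"]["labels"]["te"]["value"]
print(label)  # హనుమాన్‌గర్ జిల్లా
```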

Tom

p.s. If the OpenRefine Wikidata extension doesn't allow the fetching of any language label directly, that's probably something we need to add. I'm sure it's a very common need. I'll investigate the current state of affairs.

Thad Guidry

Oct 30, 2020, 11:41:25 AM
to openr...@googlegroups.com
Tom,
The Wikidata Reconcile endpoint already supports Language selection.

Special properties

Labels, aliases and descriptions can be accessed as follows (L for label, D for description, A for aliases, S for sitelink):

  • Len for Label in English
  • Dfi for Description in Finnish
  • Apt for Alias in Portuguese
  • Sdewiki for Sitelink in German Wikipedia page titles



Parthasarathi Mukhopadhyay

Oct 30, 2020, 11:51:40 AM
to openr...@googlegroups.com
Yes, I can see it now. I panicked at first with all those \u escapes in the fetched data.

Actually, I am in the process of creating name authority files for Indian place names (which will be useful for libraries in India in managing place name authority), with geocoordinate values and properties like population, literacy, capital, area, borders shared with other places, etc., by fetching data from Nominatim (nominatim.openstreetmap.org) and Wikidata. The names of the places in all Indian languages can be a very useful see/see also feature for retrieval in Indian scripts. At the end I'll try to port the data back (with MARC tags/subfields) from OR to MarcEdit and then into an ILS (library management software).

Let's see ...

Best regards
-----------------------------------------------------------------------
Parthasarathi Mukhopadhyay
Professor, Department of Library and Information Science,
University of Kalyani, Kalyani - 741 235 (WB), India
-----------------------------------------------------------------------


Thad Guidry

Oct 30, 2020, 12:03:54 PM
to openr...@googlegroups.com
Parthasarathi,

Can you also update any properties, aliases, labels, descriptions, etc. in Wikidata where needed as well for those Indian place names to help the world?
I hope that is the end goal? :-)



Parthasarathi Mukhopadhyay

Oct 30, 2020, 12:16:07 PM
to openr...@googlegroups.com
I have collected 631 districts, 50K+ subdivisional towns, and 600K+ village names.

Most of the districts (except eight newly formed ones) are covered in Wikidata so far, but towns and villages are naturally not covered very well in Wikidata.

I'll share my completed datasets (initially in MARC for libraries) and as an OR project.

Best regards
-----------------------------------------------------------------------
Parthasarathi Mukhopadhyay
Professor, Department of Library and Information Science,
University of Kalyani, Kalyani - 741 235 (WB), India
-----------------------------------------------------------------------

Thad Guidry

Oct 30, 2020, 12:25:33 PM
to openr...@googlegroups.com
Woohoo!  Love it!
Then the community can gladly take up the task of reconciling where necessary, uploading it, etc.

Thanks Parthasarathi for this great effort and contribution towards Indian culture and locations!



Tom Morris

Nov 3, 2020, 12:18:28 PM
to openr...@googlegroups.com
On Fri, Oct 30, 2020 at 11:41 AM Thad Guidry <thadg...@gmail.com> wrote:

The Wikidata Reconcile endpoint already supports Language selection.

Special properties

Labels, aliases and descriptions can be accessed as follows (L for label, D for description, A for aliases, S for sitelink):

  • Len for Label in English
  • Dfi for Description in Finnish
  • Apt for Alias in Portuguese
  • Sdewiki for Sitelink in German Wikipedia page titles
I should have remembered that since I just reviewed the documentation. These should be more discoverable -- perhaps included in autocomplete with a term like "label" instead of "L".

Tom 

Antonin Delpeuch (lists)

Nov 3, 2020, 12:43:07 PM
to openr...@googlegroups.com
On 03/11/2020 18:18, Tom Morris wrote:
> I should have remembered that since I just reviewed the documentation.
> These should be more discoverable -- perhaps included in autocomplete
> with a term like "label" instead of "L".

Agreed, I have opened an issue here:
https://github.com/wetneb/openrefine-wikibase/issues/96

Antonin