It has lately become customary to mark sheva na, kamatz katan, and dagesh hazak differently in order to make reading more accurate.
While kamatz katan finally seems to have its own Unicode character, the other two are still missing.
What I have been trying to do so far:
- Talk to the Culmus people about adding special signs for all three, using either the Unicode character or the Private Use Area, which seems to be the Unicode way of adding extra characters.
Most (all?) of the Culmus fonts already support qamats qatan.
The *only* correct way to add characters at this point is to use the Unicode Private Use Area. Any other way breaks the Unicode standard.
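For reference, here is a minimal Python sketch (not from the thread) of how the standard library's unicodedata module reports the code points under discussion. U+05C7 and U+05BA are the assigned qamats qatan and holam haser for vav characters. Using U+E000 for sheva na is purely a hypothetical Private Use Area assignment chosen for illustration; no project mentioned here has defined it.

```python
import unicodedata

# Hypothetical PUA assignment, for illustration only.
HYPOTHETICAL_SHEVA_NA = "\uE000"

def describe(ch):
    """Return a short description of a single code point."""
    cp = f"U+{ord(ch):04X}"
    # General category "Co" marks Private Use code points, whose
    # meaning exists only by private agreement between users.
    if unicodedata.category(ch) == "Co":
        return f"{cp} private use (no standard meaning)"
    return f"{cp} {unicodedata.name(ch, '<unnamed>')}"

for ch in ("\u05C7", "\u05BA", HYPOTHETICAL_SHEVA_NA):
    print(describe(ch))
```

A text that relies on such a private code point is only readable with a font and tools that share the same private agreement, which is the interoperability cost of the PUA route.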
The best solution is to propose the characters to the Unicode Technical Committee. Note that the sheva na character was proposed and rejected back in 2002/3 (or thereabouts). With enough people saying "we need it", I think it could get through, as its necessity was one of the sticking points.
We made some attempts at this. It is doable *to some extent*, but also reveals some issues:
- Write a Python script that detects and marks them (WIP)
(1) ambiguities: basically, if you know *one* of qamats qatan or sheva na, you know the other. If you know neither,
the answer is ambiguous without understanding context. If you know either, dagesh kal/dagesh hazak are easily distinguishable automatically.
(2) differences in custom: e.g., is the sheva merachef (medial sheva) pronounced as a sheva na (e.g., Chabad) or a sheva nach (almost everyone else?).
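To illustrate point (1), here is a rough Python sketch. It assumes the common textbook rule that a medial sheva after a long vowel is na and after a short vowel is nach, and it assumes the text consistently encodes qamats qatan as U+05C7, so that a plain U+05B8 can be read as qamats gadol. The function names are made up for this example, and the sketch ignores stress, meteg, and the differences in custom from point (2).

```python
import unicodedata

SHEVA = "\u05B0"
QAMATS = "\u05B8"        # read here as qamats gadol (long vowel)
QAMATS_QATAN = "\u05C7"  # short vowel

def clusters(word):
    """Split a word into (letter, marks) pairs."""
    out = []
    for ch in word:
        if unicodedata.category(ch) == "Mn" and out:
            out[-1] = (out[-1][0], out[-1][1] + ch)
        else:
            out.append((ch, ""))
    return out

def classify_sheva_after_qamats(word):
    """Return (letter_index, 'na' | 'nach') for each medial sheva
    whose preceding letter carries some kind of qamats."""
    cl = clusters(word)
    verdicts = []
    # Skip the first letter (no predecessor) and the last letter
    # (a word-final sheva is not covered by this rule).
    for i in range(1, len(cl) - 1):
        if SHEVA not in cl[i][1]:
            continue
        prev_marks = cl[i - 1][1]
        if QAMATS_QATAN in prev_marks:
            verdicts.append((i, "nach"))
        elif QAMATS in prev_marks:
            verdicts.append((i, "na"))
    return verdicts

# "lishmorkha" spelled with an explicit qamats qatan under the mem.
LISHMORKHA = "\u05DC\u05B4\u05E9\u05C1\u05B0\u05DE\u05C7\u05E8\u05B0\u05DA\u05B8"
print(classify_sheva_after_qamats(LISHMORKHA))
```

If the same word is given with an undistinguished U+05B8 under the mem, the sketch reports the sheva as na, which shows why a text that marks neither feature cannot be resolved by such rules alone.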
- Work with the Wikipedia people on integrating it into MediaWiki, though this is more project-specific.
Not sure what you mean here. Most of Hebrew Wikipedia is without vowels. (MediaWiki, Wikisource, and Wikipedia are different projects) A bigger issue might be that even biblical keyboards won't support PUA characters.
There's one more Unicode issue: HOLAM HASER FOR VAV, which is an actual Unicode 5.0 character -- the holam dot above a vav when it is a holam haser and not a holam maleh. Its existence is entirely a typographic issue: holam haser for vav should be typeset to the left of the vav, and holam maleh should be typeset above the vav. Even in a text that does not distinguish the two, telling them apart programmatically involves no ambiguity.
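A possible way to do this in Python is sketched below. It assumes the usual heuristic that a vav with holam is a consonant (holam haser) when the preceding letter already carries a vowel point or sheva, and a vowel letter (holam maleh) otherwise. The function names are invented for this example, and unusual spellings may need extra rules.

```python
import unicodedata

VAV = "\u05D5"
HOLAM = "\u05B9"
HOLAM_HASER_FOR_VAV = "\u05BA"
# Sheva through qubuts (U+05B0..U+05BB), plus qamats qatan.
VOWEL_POINTS = {chr(c) for c in range(0x05B0, 0x05BC)} | {"\u05C7"}

def clusters(word):
    """Split a word into [letter, marks] pairs."""
    out = []
    for ch in word:
        if unicodedata.category(ch) == "Mn" and out:
            out[-1][1] += ch
        else:
            out.append([ch, ""])
    return out

def mark_holam_haser(word):
    """Re-encode holam on a consonantal vav as U+05BA."""
    cl = clusters(word)
    for i in range(1, len(cl)):
        base, marks = cl[i]
        if base != VAV or HOLAM not in marks:
            continue
        # A pointed preceding letter means the vav is a consonant.
        if any(m in VOWEL_POINTS for m in cl[i - 1][1]):
            cl[i][1] = marks.replace(HOLAM, HOLAM_HASER_FOR_VAV)
    return "".join(base + marks for base, marks in cl)

MITZVOT = "\u05DE\u05B4\u05E6\u05B0\u05D5\u05B9\u05EA"  # consonantal vav
SHALOM = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"   # vav as vowel letter
print(mark_holam_haser(MITZVOT) != MITZVOT, mark_holam_haser(SHALOM) == SHALOM)
```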
I would be happy to hear your opinion and suggestions about the topic. Is adding a new font really the right way to go about it?
Are there other characters which are missing?
My 2c,
--
---
Efraim Feinstein
Lead Developer
Open Siddur Project
http://opensiddur.net
http://wiki.jewishliturgy.org
--
Yes, the issue with Kamatz Katan is that it has no official shape. Some make it a bit longer, some make it
The best solution is to propose the characters to the Unicode Technical Committee. Note that the sheva na character was proposed and rejected back in 2002/3 (or thereabouts). With enough people saying "we need it", I think it could get through, as its necessity was one of the sticking points.
True, but the only way to get more people to say we need it is to get them used to it :-)
We made some attempts at this. It is doable *to some extent*, but also reveals some issues:
- Write a python script that discovers and mark them (WIP)
(1) ambiguities: basically, if you know *one* of qamats qatan or sheva na, you know the other. If you know neither,
the answer is ambiguous without understanding context. If you know either, dagesh kal/dagesh hazak are easily distinguishable automatically.
I used very basic rules. Can you give me an example of something the script might miss?
(2) differences in custom: eg, is the sheva merachef (medial sheva) pronounced as a sheva na (eg, Chabad) or a sheva nach (almost everyone else?).
Yes, I'm following Rav Mazuz's custom, but I guess the script can take the custom as a parameter.
I guess it's not that hard to support all of them; the question is how to mark it. Should we add a separate private character for sheva merachef?
Or a separate character for the kamatz katan cases where Harav Breuer and Harav Mazuz differ?
- Work with wikipedia people about integrating it into mediawiki, but this is more project specific.
Not sure what you mean here. Most of Hebrew Wikipedia is without vowels. (MediaWiki, Wikisource, and Wikipedia are different projects) A bigger issue might be that even biblical keyboards won't support PUA characters.
Wikipedia uses nikud in places where it explains how to pronounce a word. Also, the lack of a free nakdan makes it really tedious to add nikud in Hebrew. Hopefully, with the right tools, it will become more common.
Hi,
No character has an "official shape!" Unicode defines code points, not glyphs. The glyph representation is entirely up to the font designer (as it should be).
On 10/16/2012 12:46 AM, E L wrote:
Yes, the issue with Kamatz Katan is that it has no official shape. Some make it a bit longer, some make it
The UTC can also be convinced by published books. There weren't many in '02/'03. There are a lot more now. Also, users actually saying "we need it" on the Unicode email lists during the discussion would help.
The best solution is to propose the characters to the Unicode Technical Committee. Note that the sheva na character was proposed and rejected back in 2002/3 (or thereabouts). With enough people saying "we need it", I think it could get through, as its necessity was one of the sticking points.
True, but the only way to get more people to say we need it is to get them used to it :-)
The true, correct, and unhelpful answer: not without source code. :-)
We made some attempts at this. It is doable *to some extent*, but also reveals some issues:
- Write a python script that discovers and mark them (WIP)
(1) ambiguities: basically, if you know *one* of qamats qatan or sheva na, you know the other. If you know neither,
the answer is ambiguous without understanding context. If you know either, dagesh kal/dagesh hazak are easily distinguishable automatically.
I used very basic rules. Can you give me an example of something the script might miss?
Try: לִשְׁמָרְךָ, צִדְּקָתְךָ
(first examples I could think of)
Not really. According to most opinions, there is no such thing as a shva merachef. It's either a sheva na or a sheva nach.
(2) differences in custom: eg, is the sheva merachef (medial sheva) pronounced as a sheva na (eg, Chabad) or a sheva nach (almost everyone else?).
Yes, I'm going according to the rav mazuz custom, but I guess the script can take it as a parameter.
I guess it's not that hard to support all of them, the question is how do you mark it? Should we add a different private char for Shva merachef?
Too many characters just gets confusing to type and read. I don't think adding characters is the answer.
Or a separate character for the kamatz katan cases where Harav Breuer and Harav Mazuz differ?
Auto-pointing words in context is hard, but might be doable to good accuracy with machine learning; out of context is *really* hard (particularly nouns): Try this and you'll see what I mean: עבד
- Work with wikipedia people about integrating it into mediawiki, but this is more project specific.
Not sure what you mean here. Most of Hebrew Wikipedia is without vowels. (MediaWiki, Wikisource, and Wikipedia are different projects) A bigger issue might be that even biblical keyboards won't support PUA characters.
Wikipedia uses nikud in places where it explains how to pronounce a word. Also, the lack of a free nakdan makes it really tedious to add nikud in Hebrew. Hopefully, with the right tools, it will become more common.
:-)
No character has an "official shape!" Unicode defines code points, not glyphs. The glyph representation is entirely up to the font designer (as it should be).
We do need to find some solution, though, or it will grow into a very annoying problem when sharing texts.
Auto-pointing words in context is hard, but might be doable to good accuracy with machine learning; out of context is *really* hard (particularly nouns): Try this and you'll see what I mean: עבד
:-)
I know, though there are some articles that reach 95% accuracy in Hebrew. But even a basic nakdan that just offers nikud suggestions would already save a lot of work, e.g. you click on a word and see a list of its possible nikud.
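A suggestion-only nakdan of this kind could be sketched in Python roughly as follows. The three-entry word list is hand-entered for illustration only; a real tool would need a large pointed lexicon. The function names are invented for this example.

```python
import unicodedata
from collections import defaultdict

def strip_points(word):
    """Drop all combining marks, leaving the consonantal skeleton."""
    return "".join(ch for ch in word
                   if unicodedata.category(ch) != "Mn")

# Illustrative sample: three pointings of the same three letters.
POINTED_WORDS = [
    "\u05E2\u05B6\u05D1\u05B6\u05D3",  # eved
    "\u05E2\u05B8\u05D1\u05B7\u05D3",  # avad
    "\u05E2\u05B9\u05D1\u05B5\u05D3",  # oved
]

def build_index(pointed_words):
    """Map each unpointed skeleton to its known pointings."""
    index = defaultdict(list)
    for word in pointed_words:
        index[strip_points(word)].append(word)
    return index

def suggest(index, word):
    """Return the known pointings for a word, pointed or not."""
    return list(index.get(strip_points(word), []))

index = build_index(POINTED_WORDS)
for candidate in suggest(index, "\u05E2\u05D1\u05D3"):
    print(candidate)
```

The tool only lists candidates and leaves the choice to the user, so it sidesteps the out-of-context ambiguity rather than trying to resolve it.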
On 10/16/2012 01:49 PM, E L wrote:
Not really -- as a text provider, I don't consider this my job at all, the same way the developer of a word processor doesn't care what TTFs are available on a system. Every provider might choose a *default*, but, beyond that, it's the user's choice.
No character has an "official shape!" Unicode defines code points, not glyphs. The glyph representation is entirely up to the font designer (as it should be).
Open Siddur uses (or rather, proposes to use; no texts actually have this yet) a system similar to Dovi's: put the variant in markup.
We do need to find some solution, though, or it will grow into a very annoying problem when sharing texts.
I'd presume that means 95% accuracy *with context* (also, it means 1 in 20 points is wrong). Without context, I'd be very skeptical of that kind of claim. Also, it's far easier to get high accuracy when the spelling is consistent with modern Hebrew conventions.
Auto-pointing words in context is hard, but might be doable to good accuracy with machine learning; out of context is *really* hard (particularly nouns): Try this and you'll see what I mean: עבד
:-)
I know, though there are some articles that reach 95% accuracy in Hebrew. But even a basic nakdan that just offers nikud suggestions would already save a lot of work, e.g. you click on a word and see a list of its possible nikud.
6c(?),
-- --- Efraim Feinstein Lead Developer Open Siddur Project http://opensiddur.net http://wiki.jewishliturgy.org
--
--
Hi everyone, responses to some points and questions that have been raised:
1. A silluq is the true sign for sof pasuq, and is indicated in the stressed syllable in the final word of the pasuq. It looks like a meteg but it is not one. Adding two dots after the pasuq is a custom that the manuscripts don't always keep; the silluq is actually more important than the better known "sof pasuq" sign. In terms of Unicode this is not urgent, because the final "meteg" in the verse is always really the silluq, but it would still be nice.
2. It is true that mappiq and dagesh appear the same. Dagesh hazaq in "heh" would default to mappiq. So this is also not urgent, but would be nice to have someday.
3. Ephraim noted: "Short meteg/left-vs-right meteg". These are two entirely different issues. In terms of left-vs-right meteg, the people at WLC are very careful about this because they want to convey every orthographic anomaly in the Leningrad Codex (even though the distinction has no value whatsoever). They have also found an adequate way to accomplish this for their needs, by designing a font that can show the meteg before or after the vowel depending on whether it is entered before or after the vowel. But I don't think this distinction is relevant to any of us, nor does it appear in any Jewish edition (only in the WLC/BHQ type literature).
4. However, short/long meteg IS extremely relevant, and I will explain why. First of all it is widely used in some of the most important Tanakh editions of the past 30 years, namely Mosad Harav Kook, Horev, and Keter Yerushalayim (the three Breuer editions). It is not rare, but employed tens of thousands of times in these three editions. So there is a huge differentiation in the literature.
Furthermore, through my experience working on a Tanakh edition I've learned (as others have before me) that metagim/ge'ayot are by far the most difficult and widespread choice that has to be made in any edition because the variations between the manuscripts and printed editions are so huge (as opposed to differences in letters, vowels, and cantillation notes, which are relatively rare compared to metagim). The purpose of long versus short metagim is to indicate whether the meteg occurs in the source text, or whether it has been added by the editor based on other considerations. This is how Breuer used them, and it is extremely relevant for any kind of edition we might contemplate doing. Without this the documentation becomes extraordinarily difficult or impossible. Alternatively: Different metagim have different functions, and these can be indicated by using two different signs. For either of these reasons, this is definitely a distinction that is sorely needed in order to prevent losing information when entering the text of Tanakh in a digital format.
5. Regarding "qamaz qatan": I agree with Maxim that, based on what has already become the custom in many printed editions, it should be narrower and have a longer tail. That said, I think it is important to stress that the tail should be *significantly* longer so that the distinction is absolutely clear to someone reading from the screen. In many of the fonts I've looked at there is a distinction, but it still isn't easy to tell the two apart.
--
We should think of a way to explain it to the Unicode people. But I agree with Efraim: we should start with sheva na, then, I think, dagesh chazak, and then the meteg.
We need to see what they are looking for and write a formal letter with examples. I can take pictures from books if that will help convince them.