A scheme for adding diacritics to English - A qu̹ic̹k brőw̵n fox jumps ōve͠r t̹h̵è lāzy dog.

Yao Ziyuan

Mar 23, 2010, 4:13:09 PM
to Natural Language Processing Virtual Reading Group
A better-formatted version is at http://sites.google.com/site/yaoziyuan/pee001


The Phonetics-Enhanced English (PEE) Scheme ver 0.01

Quick Examples

Full-fledged: A qu̹ic̹k brőw̵n fox jumps ōve͠r t̹h̵è lāzy dog.
Lite: A qu̹ick brőwn fox jumps ōver t̹hè lāzy dog.

Usage

The user is supposed to learn this PEE scheme "by example", i.e. they
will know how new words sound by looking at diacritics used in known
words. They don't need to systematically study the rules in the
scheme, although gradual rule instruction in the context of reading
(by an automatic code-switching system) is definitely a boost.

Depending on the user's level of English phonics knowledge, lighter
versions of this scheme can be used. Moreover, already acquired words
and word parts (e.g. -tion) don't need diacritics.

Design Remarks

Version 0.01 is a working but not optimized scheme. Some diacritics
used in this version could be replaced by ones that are visually
clearer in some fonts, and diacritic assignment could be more
scientific, logical and memorable.

Diacritics can be assigned in several ways: (1) each diacritic
corresponds to a phoneme, regardless of the letter modified; (2) each
diacritic corresponds to a certain phonetic aspect, regardless of the
letter modified; (3) each diacritic is just an arbitrarily chosen
symbol to differentiate a letter's possible phonemes. This version
makes use of all three principles.

Encoding

PEE is based on Unicode.

Unicode often provides both "combining" code points, which add
diacritics to other characters, and "pre-composed" characters, which
are letters that already carry diacritics. We should prefer whichever
approach renders better in most fonts. For example, the combining
sequence H̵ (H + U+0335) looks less disturbing than the pre-composed
Ħ (U+0126).
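
To make the distinction concrete, here is a minimal Python sketch using
the standard unicodedata module. It is only an illustration, not part
of the scheme; the helper name describe is made up. It shows that some
combining sequences have a pre-composed equivalent that normalization
will substitute, while H + U+0335 does not.

```python
import unicodedata

def describe(text):
    """Return the code points of `text` as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

# 'i' followed by COMBINING ACUTE ACCENT has a pre-composed equivalent,
# so NFC normalization folds the pair into the single code point U+00ED.
combining_i = "i\u0301"
composed_i = unicodedata.normalize("NFC", combining_i)

# 'H' followed by COMBINING SHORT STROKE OVERLAY has no canonical
# equivalent (U+0126 is a separate letter), so NFC leaves the pair alone.
combining_h = "H\u0335"
normalized_h = unicodedata.normalize("NFC", combining_h)

print(describe(composed_i))    # ['U+00ED']
print(describe(normalized_h))  # ['U+0048', 'U+0335']
```

A practical consequence: text that passes through a normalizing tool may
switch between the two forms for letters like í, so the choice between
them cannot always be controlled by the author.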

Combining code points can be found at http://en.wikipedia.org/wiki/Combining_character

Pre-composed characters can be found by visiting the Wikipedia page
for a basic Latin letter (e.g. http://en.wikipedia.org/wiki/A_%28letter%29)
and then looking at "Letter <X> with diacritics" at the bottom of the
page, where <X> is that basic letter.

All phonetic transcriptions in [...] in this document are in IPA
(International Phonetic Alphabet).

The Scheme

1. Unrepresentable or Variable Sounds (UNREP/VAR)

Example: bu͂siness

A "~" above (U+0342, which is clearer than U+0303 in some fonts) or
below (U+0330) a vowel/consonant letter means either that this
letter's sound can't be represented by diacritics in this version
(because the word is a rare exception to English orthography or a loan
word), or that the letter's sound varies with context (for example,
the "ea" in "read" sounds different depending on whether "read" is a
past tense/past participle form).

Pre-composed characters are preferred if they display better in most
fonts.

UNREP/VAR always appears above a vowel letter (a͂, e͂, i͂, o͂, u͂, w͂,
y͂), and usually appears below a consonant letter. If there is not
enough space below a consonant letter (e.g. g), it appears above the
letter.

UNREP/VAR does not affect vowel/consonant letters around the letter
modified.

2. Silences

Example: také

No silence diacritic affects the vowel/consonant letters around the
letter modified.

2.1. Single-Letter Silence

A "/" above (U+0341, which is clearer than U+0301 in some fonts) or
below (U+0317) a vowel/consonant letter silences this letter.

Pre-composed characters are preferred if they display better in most
fonts. An example is í (U+00ED).

The "/" always appears above a vowel letter (á, é, í, ó, ú, ẃ,
ý), and usually appears below a consonant letter. If there is not
enough space below a consonant letter (e.g. g), it appears above the
letter.
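
The placement rule can be sketched in Python as below. This is a rough
illustration only: the function name silence is hypothetical, it always
emits combining sequences (ignoring the pre-composed preference), and
the set of consonants with "not enough space below" is my assumption,
since only g is named above.

```python
ACUTE_ABOVE = "\u0341"  # the "/" above
ACUTE_BELOW = "\u0317"  # the "/" below

# Vowel letters as listed in the scheme (w and y included).
VOWEL_LETTERS = set("aeiouwy")
# Assumption: consonants with descenders leave no room below.
# Only "g" is given as an example in the scheme itself.
NO_ROOM_BELOW = set("gjpq")

def silence(letter):
    """Return `letter` with the single-letter silence mark attached."""
    base = letter.lower()
    if base in VOWEL_LETTERS or base in NO_ROOM_BELOW:
        return letter + ACUTE_ABOVE
    return letter + ACUTE_BELOW

print(silence("e") == "e\u0341")  # True: vowel, mark goes above
print(silence("d") == "d\u0317")  # True: consonant, mark goes below
print(silence("g") == "g\u0341")  # True: no room below, mark goes above
```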

A short "-" inside (U+0335) a letter can also silence this letter. It
serves in cases where we want to avoid excessive separate diacritics.
Examples include "how̵", "hey̵" and "dooɍ".

Pre-composed characters are preferred if they display better in most
fonts, as with ɍ (U+024D). Otherwise the combining sequence is used,
as with H̵ (H + U+0335).

2.2. Double-Letter Silence

A reverse arch above (U+035D) two letters silences these letters.
Examples include rig͝ht and chequ͝e.
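
In terms of code points, the double mark sits between the two letters
it spans. A small Python sketch (the function name silence_pair is
made up, and it assumes the input word is plain ASCII so that string
indexes line up with letters):

```python
DOUBLE_BREVE = "\u035d"  # the reverse arch spanning two letters

def silence_pair(word, start):
    """Silence word[start] and word[start + 1] by placing U+035D between them."""
    if not 0 <= start < len(word) - 1:
        raise ValueError("start must index the first letter of a two-letter pair")
    return word[:start + 1] + DOUBLE_BREVE + word[start + 1:]

print(silence_pair("right", 2) == "rig\u035dht")     # True: silences "gh"
print(silence_pair("cheque", 4) == "chequ\u035de")   # True: silences "ue"
```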

3. Stress (STR)

Example: wikipẹdia

A "." below (U+0323) a vowel letter (ạ, ẹ, ị, ọ, ụ, ẉ, ỵ) means
the syllable this vowel letter belongs to is stressed.

Pre-composed characters are preferred if they display better in most
fonts. Examples include ị (U+1ECB) and ỵ (U+1EF5).

A multi-syllable word left without STR has variable stress, e.g.
"present" (stressed differently as a noun and as a verb).

4. Syllable Separator

Example: a·way

A "·" (U+00B7) between two characters separates two syllables. It is
necessary if a word's spelling is not left-associative. For example,
"away" should be separated as "a·way" instead of "aw·ay".

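The stress mark and the separator can be read back mechanically. The
Python sketch below is an illustration under two assumptions of mine:
that every syllable boundary in the word is marked with "·" (the scheme
only requires it where the spelling is ambiguous), and that the marked
form "a·wạy" is a plausible spelling; it does not appear in the scheme.
The function names are made up.

```python
import unicodedata

MIDDLE_DOT = "\u00b7"  # syllable separator
DOT_BELOW = "\u0323"   # stress mark (STR)

def syllables(word):
    """Split a PEE word at its syllable separators."""
    return word.split(MIDDLE_DOT)

def stressed_syllable(word):
    """Return the index of the syllable carrying a dot below, or None."""
    for index, syllable in enumerate(syllables(word)):
        # NFD splits pre-composed letters such as U+1EA1 into a + U+0323.
        if DOT_BELOW in unicodedata.normalize("NFD", syllable):
            return index
    return None

print(syllables("a\u00b7way"))                # ['a', 'way']
print(stressed_syllable("a\u00b7w\u1ea1y"))   # 1
print(stressed_syllable("pre\u00b7sent"))     # None (variable stress)
```
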
5. Vowels

All vowel diacritics affect vowel letters around the letter modified.

5.1. Schwas

5.1.1. Single-Letter Short Schwa

A "\" above (U+0340, which is clearer than U+0300 in some fonts) a
vowel letter (à, è, ì, ò, ù, ẁ, ỳ) means that letter, along with
vowel letters around it, has a schwa sound (IPA [ə]). An example is
"wikipedià".

Pre-composed characters are preferred if they display better in most
fonts. An example is ì (U+00EC).

5.1.2. Double-Letter Short Schwa

A "⁓" above (U+0360) "ar", "er", "ir", "or", "ur" and "re" (a͠r, e͠r,
i͠r, o͠r, u͠r, r͠e) means it is a schwa sound. An example is worke͠r.

5.1.3. Double-Letter Long Schwa

An arch above (U+0361) "ar", "er", "ir", "or", "ur" and "yr" (a͡r,
e͡r, i͡r, o͡r, u͡r, y͡r) means it is a long schwa sound (IPA [əː]). An
example is wo͡rk.

5.2. Short Vowels

Without diacritics, the vowel letters a, e, i/y, o and u sound [æ],
[e], [i], [ɔ] and [ʌ], e.g. "bat", "bet", "bit"/"gym", "bot" and
"but".

Diacritic-equipped vowel letters for short vowels are categorized into
four "classes". The first class has the same sounds as the above
diacritic-free letters do.

Columns: a, e, i/y, o, u ("-" marks an unassigned cell)

1st-Class Short (U+0306): ă [æ], ĕ [e], ĭ/y̆ [i], ŏ [ɔ], ŭ [ʌ]
(e.g. băt, bĕt, bĭt/gy̆m, bŏt, bŭt)
2nd-Class Short (U+0307): ȧ [e], ė [i], -, ȯ [ʌ], u̇ [u]
(e.g. ȧny, dėsign, sȯme, pu̇t)
3rd-Class Short (U+0311): ȃ [i], ȇ [ɔ], -, ȏ [u], -
(e.g. privȃte, ȇncore, bȏok)
4th-Class Short (U+030D): a̍ [ɔ], -, -, -, -
(e.g. swa̍p)

It is interesting that letters on anti-diagonal lines have the same
sound. For example, ȃ, ė and ĭ/y̆ all sound [i].
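
The anti-diagonal remark can be checked mechanically. In the Python
sketch below the table is transcribed from the four classes above, with
rows as classes and columns as a, e, i/y, o, u; the list layout and the
function name are my own. A cell lies on anti-diagonal number
row + column, so each sound should map to exactly one such number.

```python
# Rows: 1st..4th class; columns: a, e, i/y, o, u; None = unassigned cell.
SHORT_VOWELS = [
    ["æ", "e", "i", "ɔ", "ʌ"],       # 1st class, U+0306
    ["e", "i", None, "ʌ", "u"],      # 2nd class, U+0307
    ["i", "ɔ", None, "u", None],     # 3rd class, U+0311
    ["ɔ", None, None, None, None],   # 4th class, U+030D
]

def diagonals_by_sound(table):
    """Map each sound to the set of anti-diagonals (row + col) it occupies."""
    result = {}
    for row, sounds in enumerate(table):
        for col, sound in enumerate(sounds):
            if sound is not None:
                result.setdefault(sound, set()).add(row + col)
    return result

diagonals = diagonals_by_sound(SHORT_VOWELS)
# Every sound sits on a single anti-diagonal.
print(all(len(d) == 1 for d in diagonals.values()))  # True
print(diagonals["i"])                                # {2}
```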

5.3. Long Vowels

For convenience, [juː] and [ju] are considered long vowels in this
section.

Diacritic-equipped vowel letters for long vowels are also categorized
into four classes.

Columns: a, e, i/y, o, u/w ("-" marks an unassigned cell)

1st-Class Long (U+0304): ā [ei], ē [iː], ī/ȳ [ai], ō [əu], ū/w̄ [juː]
(e.g. tāke, mēet, līght/mȳ, gō, hūge/new̄)
2nd-Class Long (U+0308): ä [ɑː], ë [ei], ï/ÿ [iː], ö [ɔː], ü/ẅ [uː]
(e.g. cär, ëight, machïne/quaÿ, förce, blüe/jeẅ)
3rd-Class Long (U+030F): ȁ [ɔː], -, -, ȍ [uː], ȕ [ju]
(e.g. tȁll, fȍod, cȕre)
4th-Class Long (U+030B): -, -, -, ő [au], - (e.g. rőund)
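
A reader program could look up the sound of a marked long vowel as
sketched below in Python. The table is transcribed from the four
classes above; the dictionary layout and the function name
long_vowel_sound are hypothetical. NFD normalization is used so that
pre-composed letters and combining sequences are handled alike.

```python
import unicodedata

# Combining mark -> {base letter -> IPA}, transcribed from the long-vowel classes.
LONG_VOWELS = {
    "\u0304": {"a": "ei", "e": "iː", "i": "ai", "y": "ai",
               "o": "əu", "u": "juː", "w": "juː"},
    "\u0308": {"a": "ɑː", "e": "ei", "i": "iː", "y": "iː",
               "o": "ɔː", "u": "uː", "w": "uː"},
    "\u030f": {"a": "ɔː", "o": "uː", "u": "ju"},
    "\u030b": {"o": "au"},
}

def long_vowel_sound(marked):
    """Return the IPA sound of one marked vowel letter, or None if unknown."""
    decomposed = unicodedata.normalize("NFD", marked)
    if len(decomposed) != 2:
        return None  # not exactly one base letter plus one mark
    base, mark = decomposed
    return LONG_VOWELS.get(mark, {}).get(base.lower())

print(long_vowel_sound("\u0101"))   # ā -> ei
print(long_vowel_sound("\u0151"))   # ő -> au
print(long_vowel_sound("w\u0304"))  # w̄ -> juː
```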

6. R's

No diacritic for "r" affects the vowel/consonant letters around the
letter modified.

The default "r" (without any diacritic) has the [r] sound.

A right-half circle below (U+0339) "r" (r̹) means this "r" sounds
[ər]. An example is exper̹ience.

A "-" in "r" (ɍ, U+024D) means this "r" is silenced.

7. Consonants

No consonant diacritic affects the vowel/consonant letters around the
letter modified.

7.1. [tʃ], [ʃ] and [ʒ]

[tʃ], [ʃ] and [ʒ] graphemes are assigned U+032F, U+032E and U+0331:

[tʃ]: ch, c̯h̵, t̯, t̯c͝h, t̯ɨ, c̯, c̯z̵, t͝sc̯h̵
[ʃ]: sh, s̮h̵, t̮ɨ, c̮ɨ, s̮s̮ɨ, s̮ɨ, s̮s̮, c̮h̵, s̮, s̮c̮ɨ, c̮é,
s̮ch̵, s̮c̮
[ʒ]: s̱ɨ, s̱, ẕ, ẕh̵, ṯɨ, s̱h̵

7.2. X's

The default "x" (without any diacritic) has the [ks] sound.

A right-half circle below (U+0339) "x" (x̹) means this "x" sounds
[gz]. An example is ex̹ample.

A "\" below (U+0316) "x" (x̖) means this "x" sounds [kʃ]. An example
is anx̖ɨous.

A left-half circle below (U+031C) "x" (x̜) means this "x" sounds [z].
An example is x̜ylophone.
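
The four "x" variants amount to a small lookup on the code point that
follows the "x". A Python sketch (the names X_SOUNDS and x_sound are
made up; it looks only at the first "x" in the word and assumes the
mark, if any, comes immediately after it):

```python
# Mark following "x" -> IPA; the empty key is the default, unmarked "x".
X_SOUNDS = {
    "": "ks",
    "\u0339": "gz",  # right-half circle below
    "\u0316": "kʃ",  # "\" below
    "\u031c": "z",   # left-half circle below
}

def x_sound(word):
    """Return the IPA sound of the first "x" in `word`, or None if there is none."""
    pos = word.find("x")
    if pos == -1:
        return None
    following = word[pos + 1:pos + 2]
    return X_SOUNDS.get(following, X_SOUNDS[""])

print(x_sound("fox"))                # ks
print(x_sound("ex\u0339ample"))      # gz
print(x_sound("x\u031cylophone"))    # z
```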

7.3. Other Consonants

Graphemes for other consonants that may need diacritics are below:

[t]: éd̖
[ɡ]: g, gg, gu͝e, gh̵
[k]: c̹, k, c̹k, c̹h̵, c̹c̹, qu̵, q, c̹q, c̹u̵, qu͝e, kk, kh̵
[ŋ]: ng, n̹g̵, n̹, n̹g̵u͝e, n̹g̵h̵
[f]: f, ph, p̀h̵, ff, ğh̵
[v]: v, vv, f̹
[θ]: th, t̜h̵, c͝ht̜h̵, p͝ht̜h̵, t̜t̜h̵
[ð]: t̹h, t̹h̵
[s]: s, c, ss, sc, st̵, p̵s, sc͝h, cc, sé, cé
[z]: s̹, z, x̜, zz, s̹s̹, zé
[dʒ]: g̀, j, d̹g̀, d̹g̀é, d̹, d̹ɨ, g̀ɨ, g̀é, d̹j, g̀g̀
[w]: u̹
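
Since every mark in the scheme is a combining character, a pre-composed
letter, a stroked letter or the "·" separator, the plain spelling can
be recovered mechanically. A rough Python sketch follows; the function
name to_plain and the stroked-letter table are my own, and the table
covers only the stroked letters used in this post. It also drops every
"·", which is only safe for text that uses it solely as a syllable
separator.

```python
import unicodedata

# Stroked letters have no Unicode decomposition, so map them by hand.
STROKED_LETTERS = {"\u024d": "r", "\u0268": "i", "\u0126": "H", "\u0127": "h"}
MIDDLE_DOT = "\u00b7"

def to_plain(text):
    """Strip PEE diacritics and separators, returning ordinary spelling."""
    plain = []
    for ch in unicodedata.normalize("NFD", text):
        # "Mn" is the category of non-spacing combining marks.
        if ch == MIDDLE_DOT or unicodedata.category(ch) == "Mn":
            continue
        plain.append(STROKED_LETTERS.get(ch, ch))
    return "".join(plain)

pangram = ("A qu\u0339ic\u0339k br\u0151w\u0335n fox jumps "
           "\u014dve\u0360r t\u0339h\u0335\u00e8 l\u0101zy dog.")
print(to_plain(pangram))  # A quick brown fox jumps over the lazy dog.
```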
