Multilingual Systems

Syllable level codes for Arabic, Urdu and Hebrew

IITM syllable Encoding applied to Arabic, Urdu and Hebrew.

  As seen before, the IITM encoding scheme operates at the level of the syllable. Each code is expressed as a triplet, which is mapped into three independent fields that together make up 15 bits. Hence any writing system which is syllabic in nature can be handled well with this encoding scheme.

  Arabic, Urdu, Hebrew etc., employ writing systems which are syllabic in nature. The associated scripts are written right to left. These scripts (specifically Arabic and Hebrew) are characterized by some unique aspects of their writing systems, viz., the absence of specific forms for vowels. So one sees only the consonants. The vowels are shown through a carrier symbol though it is not strictly correct to call them vowel representations. In Arabic the carrier is the familiar "Alif". In Hebrew, the carrier is the "Alef". The medial vowel representations are usually in the form of short strokes in Arabic and a set of dots arranged in different patterns (called points) in Hebrew. These points or strokes usually appear above or below the consonants. The long vowels are usually represented through the addition of the consonant "ya" for "ie" and "va" for "ouh".

  The most significant aspect of the Arabic or Urdu writing system is that each consonant is written with different shapes depending on whether the consonant appears standalone, at the beginning, in the middle or at the end of a word. Thus, there are four possible shapes for a consonant though about six of the consonants have only two. The syllables are generally connected together in a continuous fashion except when these six occur. A syllable is invariably a simple consonant vowel combination and may admit of only consonant doubling. There are no conjuncts (similar to the Samyuktakshar).

  The figure below illustrates a line of text printed in Arabic. One remarkable aspect of the writing is the continuity of the strokes from syllable to syllable (a calligrapher's delight), but such continuity can pose difficulties for the reader. It is perhaps for this reason that some of the consonants have the property of breaking the continuity, so that the writing is not just one single stroke from the beginning to the end.

  It was stated that the consonants have four possible shapes depending on their position within a word.

A standalone consonant is rarely seen in normal writing and is used only to show its basic shape.

A consonant at the beginning of a word has continuity only with the syllable which follows. That is, it connects on the left.

A consonant in the middle connects with the preceding syllable as well as the one which follows and hence connects on the right as well as the left.

A consonant at the end of a word connects only with the preceding one.

  The six non connecting consonants mentioned above connect only with the preceding consonant. They do not connect with the consonant that may follow. In other words, these connect only to the right. We now see why these provide the breaks in writing. When a consonant follows one of these six, it is always written as if it begins a new word even though it is in the middle of a word.
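The rules above for choosing among the four shapes can be sketched as a small function. This is only an illustration and not the editor's actual state machine. The function name and the transliterated letter names are assumptions, and the set of non connecting letters is the one commonly listed for Arabic (alif, dal, dhal, ra, zay, waw); Urdu would need a larger set. Vowels are ignored here since they do not affect the joining.

```python
# Letters that join only to the preceding letter (assumed Arabic set).
NON_CONNECTING = frozenset({"alif", "dal", "dhal", "ra", "zay", "waw"})

def contextual_forms(word):
    """Return the shape name for each letter of a word.

    `word` is a list of transliterated letter names in reading order.
    """
    forms = []
    for i, letter in enumerate(word):
        # Joined to the preceding letter unless that letter breaks the stroke.
        joins_prev = i > 0 and word[i - 1] not in NON_CONNECTING
        # Joined to the following letter unless this letter breaks the stroke.
        joins_next = i < len(word) - 1 and letter not in NON_CONNECTING
        if joins_prev and joins_next:
            forms.append("middle")
        elif joins_next:
            forms.append("beginning")
        elif joins_prev:
            forms.append("final")
        else:
            forms.append("standalone")
    return forms

print(contextual_forms(["ka", "ta", "ba"]))
print(contextual_forms(["ka", "ta", "alif", "ba", "ta"]))
```

In the second word the letter after "alif" receives the beginning form although it sits in the middle of the word, which is the break in the stroke described above.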

  The IITM encoding scheme is easily adapted for Arabic and Urdu by using only the six bit consonant field and the four bit vowel field. Since conjuncts are absent, the five bit intermediate field of the IITM code would otherwise be empty. This field is used instead to indicate which one of the four forms the consonant should appear in, and only two of its bits are needed for the purpose.
This approach is also applicable to the special consonants which connect on only one side. The middle form and the final form may be treated as identical in their case.
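The division of the 15 bits into three fields may be illustrated with a short sketch. The field widths (6, 5 and 4 bits) come from the text; the order of the fields within the code word and the function names are assumptions made only for this example.

```python
CONSONANT_BITS = 6   # consonant field
MIDDLE_BITS = 5      # intermediate (conjunct) field, reused for the form
VOWEL_BITS = 4       # vowel field

def pack_syllable(consonant, middle, vowel):
    """Pack three field values into one 15 bit code (assumed field order)."""
    if not 0 <= consonant < (1 << CONSONANT_BITS):
        raise ValueError("consonant value does not fit in 6 bits")
    if not 0 <= middle < (1 << MIDDLE_BITS):
        raise ValueError("intermediate value does not fit in 5 bits")
    if not 0 <= vowel < (1 << VOWEL_BITS):
        raise ValueError("vowel value does not fit in 4 bits")
    return (consonant << (MIDDLE_BITS + VOWEL_BITS)) | (middle << VOWEL_BITS) | vowel

def unpack_syllable(code):
    """Split a 15 bit code back into (consonant, middle, vowel)."""
    vowel = code & ((1 << VOWEL_BITS) - 1)
    middle = (code >> VOWEL_BITS) & ((1 << MIDDLE_BITS) - 1)
    consonant = (code >> (MIDDLE_BITS + VOWEL_BITS)) & ((1 << CONSONANT_BITS) - 1)
    return consonant, middle, vowel

code = pack_syllable(5, 1, 3)
print(code, unpack_syllable(code))
```

Because the consonant and the vowel occupy their own fields, either can be read off with a shift and a mask, whatever the intermediate field holds.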

  In the encoding for Arabic and Urdu, each of the four possible shapes for a consonant is treated as an individual syllable, though all four represent the same linguistic content. Thus, for each base consonant, four different syllable representations will be present. While displaying the consonant, the appropriate form is displayed by selecting the relevant code value. The two bit value in the conjunct field is automatically inserted by the Arabic or Urdu specific state machine which handles the keyboard input.

The assignment within the five bit field is as follows.

00- The middle form of the consonant
01- The beginning form of the consonant
10- The final form of the consonant
11- The standalone form of the consonant
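The assignment above may be written down as a small table in code. The four values are those listed; placing them in the two low order bits of the five bit field, and the names used, are assumptions for the purpose of illustration.

```python
from enum import IntEnum

class Form(IntEnum):
    MIDDLE = 0b00
    BEGINNING = 0b01
    FINAL = 0b10
    STANDALONE = 0b11

FIELD_MASK = 0b11111  # the whole five bit field
FORM_MASK = 0b00011   # assumed position of the two form bits

def set_form(field, form):
    """Return the five bit field with its form bits replaced."""
    return (field & FIELD_MASK & ~FORM_MASK) | int(form)

def get_form(field):
    """Read the form back from the five bit field."""
    return Form(field & FORM_MASK)

field = set_form(0, Form.FINAL)
print(field, get_form(field).name)
```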

  This arrangement allows us to retain the approach to representing consonant vowel combinations while maintaining the display requirements. There is however the situation which requires the use of the double consonant. The writing system uses a specific mark above a consonant to indicate the doubling. In the case of Indian languages, consonant doubling is treated as a conjunct and the keyboard input method allows us to combine a consonant with itself. Since the five bit field is used for a slightly different purpose in Arabic and Urdu, it will be necessary to think of some means to generate double consonants.

  In the current implementation of the Arabic editor, consonant doubling has also been handled with a vowel mark. This is linguistically incorrect. There are three more bits available in the five bit field for additional syllables and so in principle, one can handle consonant doubling easily. This has not been done in the present editor.

If the input state machine is modified, perhaps the five bit field could be specified differently and one more bit assigned for consonant doubling. This should be tried and the next version of the editor will probably incorporate the modification.
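One possible way of specifying the field with an extra bit for doubling is sketched below. This is purely hypothetical and does not describe the present editor: the position of the doubling bit, the position of the form bits and all the names are assumptions.

```python
FORM_MASK = 0b00011    # two bits for the shape of the consonant (assumed position)
DOUBLED_BIT = 0b00100  # hypothetical flag marking a doubled consonant

def make_field(form, doubled=False):
    """Build the five bit field from a two bit form value and a doubling flag."""
    if not 0 <= form <= FORM_MASK:
        raise ValueError("form must fit in two bits")
    return form | (DOUBLED_BIT if doubled else 0)

def is_doubled(field):
    """True when the doubling flag is set."""
    return bool(field & DOUBLED_BIT)

def form_of(field):
    """The two bit form value, independent of the doubling flag."""
    return field & FORM_MASK

field = make_field(0b01, doubled=True)
print(field, is_doubled(field), form_of(field))
```

With such an arrangement the vowel field would be left free to carry only the vowel, and the doubling mark could be displayed whenever the flag is set.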

  Please observe that for linguistic purposes, the encoding is still very appropriate. The consonant in the syllable is identified directly. So also the vowel. In modern Arabic or Urdu writing, the vowel marks are not normally shown. This may also be accommodated in the scheme by simply using a different lookup table.
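The idea of switching lookup tables may be sketched as follows. The vowel code values here are invented for the example and are not the actual IITM assignments. The Unicode characters used are the Arabic letter ba (U+0628) and the marks fatha (U+064E), kasra (U+0650) and damma (U+064F).

```python
BA = "\u0628"  # Arabic letter ba

# Hypothetical vowel codes mapped to the marks shown in fully vowelled text.
VOWELLED = {
    0: "",        # no vowel
    1: "\u064E",  # fatha
    2: "\u0650",  # kasra
    3: "\u064F",  # damma
}

# Second table for modern writing: same codes, but no mark is displayed.
UNVOWELLED = {code: "" for code in VOWELLED}

def render(base, vowel_code, table):
    """Return the text to display for a consonant and its vowel code."""
    return base + table[vowel_code]

print(render(BA, 1, VOWELLED))
print(render(BA, 1, UNVOWELLED))
```

The stored code is the same in both cases, so the vowel remains available for linguistic processing even when it is not displayed.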

The Hamza.

  In Arabic and Urdu, the Hamza represents the glottal stop and is linguistically treated as a consonant. However, a carrier symbol is always associated with the Hamza depending on its position in a word and the vowel it goes with. The Hamza can appear with the "Alif", with the medial forms of some consonants, or by itself. The different forms relate to the glottal stop combined with different vowels. The conventions adopted for the use of the Hamza seem to be context specific and it may be necessary to handle the situation properly. The coding scheme should permit the context to be discerned so that linguistic processing may proceed properly.

Numerals and punctuation.

  The numerals in Arabic and Urdu are written left to right, following the normal convention in English. The punctuation marks unique to Arabic and Urdu are handled in a fairly straightforward fashion, exactly as in the scheme for Indian scripts.

A note on lexical ordering of the letters

  The lexical ordering of the consonants is different from the ordering specified for Indian languages. Arabic and Urdu have many consonants whose sounds do not correspond to those of Sanskrit. Since there is enough code space for the consonants, one need not be unduly concerned with this issue. Lexical ordering for Hebrew also differs. It is quite easy to assign codes on the basis of the conventional ordering for these languages, but transliteration across the scripts, specifically into Roman Diacritics, will involve additional processing.