IITM Software: Editor for Urdu,
Arabic and Hebrew
Editor for Scripts written right to left
The multilingual software
from IIT Madras includes a special version of the editor for languages
such as Urdu, Arabic, Hebrew, and Avestan which are written right to left
on a page. The version described here is a preliminary version with minimum
capabilities for handling right to left text. The Editor serves to illustrate
the principle of syllable level coding developed in the lab for writing
systems which are based on syllables.
The Editor will work
with Microsoft Windows systems and does not require the presence of the
Arabic or Hebrew Language kits. The editor is a standalone application
and uses fonts developed at IIT Madras.
Besides allowing text
to be entered in these languages, the editor also supports data entry in
Roman and will allow automatic transliteration of Urdu, Arabic or Hebrew
text into Roman Diacritics. During data entry, the keyboard may be toggled
from Urdu, Arabic or Hebrew to English and vice versa using a function
key. Thus the editor is at least bilingual and can support four scripts
in any one document.
The text prepared
with the editor may be copied into Microsoft Word and other applications
supporting the Rich Text Format. Thus applications such as email, creating
web pages in Arabic etc., may be very easily handled by copying text into
the standard applications.
The speech-enhanced
version of the editor will allow visually handicapped persons to use it.
At present the synthesized speech may sound flat, lacking intonation and
carrying a foreign accent, but this can be improved by using appropriate
databases with the MBROLA speech engine used in the application.
Given below are some
screen shots of the Editor in operation. For those familiar with the standard
left to right scripts, the editor will appear a bit confusing to begin
with. This is largely due to the direction of text entry and the use of
the arrow keys. One will have to follow proper procedures for data entry
when switching to English. A line of text may combine both English and
the Semitic scripts.
This version is not yet
bug free, but it appears that people may use it to advantage
as it is (as of Sep. 2002).
The design of the
editor has kept in mind the syllabic structure of text in the languages
and the internal representation conforms to units of storage which map
directly to syllables, thus effecting a meaningful approach to linguistic
The editor automatically
adjusts the displayed letters to conform to the different shapes depending
on the location of the letter (syllable) within a word. Besides the letters,
a useful set of punctuation marks is supported. The entry of numerals
requires that the user type the digits in reverse order.
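The position-dependent shaping described above can be sketched in a few lines of Python. This is only an illustration, not the editor's implementation: it assumes Unicode Arabic code points rather than the lab's syllable-level codes, the name `positional_forms` is hypothetical, and only a small subset of the non-connecting letters is listed.

```python
# Letters that never join to the letter that follows them
# (alef, dal, thal, reh, zain, waw). Illustrative subset only.
NON_CONNECTING = set("\u0627\u062f\u0630\u0631\u0632\u0648")

def positional_forms(word):
    """Return the shape name chosen for each letter of a single word."""
    forms = []
    for i, letter in enumerate(word):
        # A letter joins backwards only if the previous letter can join forward.
        joins_prev = i > 0 and word[i - 1] not in NON_CONNECTING
        # A letter joins forward only if it is not last and is itself connecting.
        joins_next = i < len(word) - 1 and letter not in NON_CONNECTING
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms

# kaf + teh + beh: all three letters can join to their neighbours.
print(positional_forms("\u0643\u062a\u0628"))
# beh + alef + beh: alef does not join forward, so the last beh stands alone.
print(positional_forms("\u0628\u0627\u0628"))
```

The first call gives initial, medial and final forms; the second gives initial, final and isolated, because the alef breaks the chain. A real renderer would also map each form name to a glyph in the font.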
The editor does not
support word processing features. This is not a limitation, since the required
formatting can be accomplished by copying the entered text into Word.
The editor saves the files in two independent formats, one conforming to
the syllabic structure of the entered text and the other in the Rich Text
Format. The RTF file may be imported into Windows applications as well.
In the present version
of the editor, the keyboard mapping used is based on a phonetic relationship
with the keys on the standard ASCII keyboard. This mapping may seem a bit
arbitrary but it is possible to accommodate other mappings as well without
having to recompile the application. The figure below illustrates the mapping
used in the editor. Though vowel marks may not be explicitly shown in Arabic
and Hebrew text printed today, the editor allows these marks to be properly
A few changes have been made to the above keyboard mapping and the trial
version has the letters corresponding to M and B shifted to "[" and "]"
of the r2leditor.
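Since the keyboard mapping can be changed without recompiling, it is presumably held in a table outside the program code. The sketch below shows one way such a table could be read. The text format, the sample key assignments and the names `load_mapping` and `keys_to_letters` are assumptions made for illustration; they are not the editor's actual file format or layout.

```python
# Hypothetical mapping table: one "key letter" pair per line.
SAMPLE_TABLE = """
# key  letter
b  \u0628
t  \u062a
k  \u0643
"""

def load_mapping(text):
    """Parse 'key letter' lines, skipping blank lines and # comments."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, letter = line.split()
        table[key] = letter
    return table

def keys_to_letters(keys, table):
    """Translate a string of keystrokes; unmapped keys pass through unchanged."""
    return "".join(table.get(k, k) for k in keys)

table = load_mapping(SAMPLE_TABLE)
print(keys_to_letters("ktb", table))

# Moving a letter to another key is just one more line in the table.
remapped = load_mapping(SAMPLE_TABLE + "]  \u0628\n")
print(keys_to_letters("kt]", remapped))
```

Both calls produce the same three letters (kaf, teh, beh); only the key used for the last one differs.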
When invoked, either
from a shell prompt or by clicking on its icon, the editor opens a window
which may be resized to suit the user's requirements. Urdu is retained as
the default script. The cursor is positioned at the top right corner of
the window. Data entry may begin on the first line.
As data is entered,
the text is rendered from right to left, conforming to the shapes of the
consonants depending on their position in the word. A word ends when a
space or a special punctuation mark is entered.
Data entry in Roman
is accomplished by using the Function key F9 as a toggle key. When typing
data in English, the cursor will remain in the same position and the English
string will move left as each letter is typed in. Arabic or Urdu may be
continued beyond the English string by toggling F9 once more. The cursor
will jump to the leftmost character of the English string and Arabic or
Urdu letters will be rendered right to left one letter after another.
a line contains text in Arabic as well as English, certain conventions
cursor will move left after
each Arabic letter is entered. A carriage return will bring the cursor
to the next line so that another line of text may be entered.
Toggling F9 in the middle of
data entry will permit Roman letters to be typed in along with Arabic.
The cursor will not move but the entered characters will. Thus characters
will be inserted into the text when typing in English. When the ENTER key
is pressed, the cursor will move to the next line, leaving the English
string in place even though the cursor was in the middle of the line.
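The behaviour of a mixed line can be modelled with a small Python sketch. It is a simplified model under stated assumptions, not the editor's code: text is kept as runs in the order they were typed, capital letters stand in for Arabic or Urdu letters, lower-case letters are Roman, and the function name `visual_line` is hypothetical.

```python
def visual_line(runs):
    """Lay out runs, given in entry order, as they appear left to right on screen."""
    pieces = []
    for direction, text in runs:
        # Right-to-left runs are mirrored; Roman runs keep their own order.
        pieces.append(text[::-1] if direction == "rtl" else text)
    # The line as a whole flows right to left, so earlier runs sit further right.
    return "".join(reversed(pieces))

print(visual_line([("rtl", "ABC")]))                                # CBA
print(visual_line([("rtl", "ABC"), ("ltr", "x")]))                  # xCBA
print(visual_line([("rtl", "ABC"), ("ltr", "xyz")]))                # xyzCBA
print(visual_line([("rtl", "ABC"), ("ltr", "xyz"), ("rtl", "DE")])) # EDxyzCBA
```

As each Roman letter is typed, the English string grows toward the left while its right end stays against the Arabic text. Arabic entered after toggling back continues from the left end of the English string.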
The right and left arrow keys
must be understood and used according to the logical function they perform.
Right arrow - move the cursor
to the character that was entered following the current character. Left
arrow - move the cursor to the character that was entered prior to the
Movement of the cursor inside
a string in Arabic will be the opposite of what a person used to typing
in English expects. The cursor will move to the left when the right arrow
is pressed, because the next Arabic character entered would be at the left.
Users may want to experiment on what happens when the cursor is positioned
with an Arabic character on its right and a Roman letter on its left.
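The logical meaning of the arrow keys can be illustrated with a short Python sketch. It assumes a line containing only right-to-left text, stored in entry order, with columns counted from 0 at the left edge of the window. The class `RtlLine` and its methods are hypothetical names, not part of the editor.

```python
class RtlLine:
    """A pure right-to-left line stored in the order the letters were typed."""

    def __init__(self, width):
        self.width = width   # window width in character columns
        self.chars = []      # letters in entry (logical) order
        self.cursor = 0      # logical position: number of letters before the cursor

    def enter(self, ch):
        self.chars.insert(self.cursor, ch)
        self.cursor += 1

    def right_arrow(self):
        # Move toward the letter entered after the current one.
        self.cursor = min(self.cursor + 1, len(self.chars))

    def left_arrow(self):
        # Move toward the letter entered before the current one.
        self.cursor = max(self.cursor - 1, 0)

    def cursor_column(self):
        # Logical position 0 is the rightmost column of the window.
        return self.width - 1 - self.cursor

line = RtlLine(width=80)
for ch in "ABC":             # capitals stand in for Arabic letters
    line.enter(ch)
print(line.cursor_column())  # 76
line.left_arrow()
print(line.cursor_column())  # 77: the cursor moved right on screen
line.right_arrow()
print(line.cursor_column())  # 76: the cursor moved left on screen
```

In this model the right arrow always advances in entry order, which on a right-to-left line means a smaller column number, that is, a move to the left.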
It is recommended that as far
as possible, a line of text should be in one script. If English must follow,
then the Arabic text should be entered first followed by English on the
same line but further Arabic text beyond the English string should be avoided.
Text, where each line is in one script only, either Arabic or English,
will be the best choice.
and pasting into Microsoft Word.
Text entered using
the r2leditor may be copied and pasted into Microsoft Word. Text alignment
will be retained and further formatting may be attempted in Word. The screen
shots shown below illustrate this.
A string of text in
Arabic may be automatically transliterated into Roman diacritics by selecting
(blocking) the string and invoking the language switch menu. Switching from
Roman diacritics back to Arabic is not supported, however.
The following steps indicate
how on-screen transliteration is done.
Step 1: Type in the required
Step 2: Make a copy
of the text if you would like to show both Arabic Text
and below it the transliteration.
Step 3: Select the
text to be transliterated using the mouse, invoke the language Menu and
select IPA. The selected line will change in script from Arabic to Roman
diacritics. The reverse operation will not work.
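In principle, the transliteration step above amounts to a table lookup applied letter by letter. The Python sketch below illustrates the idea under loud assumptions: it works on Unicode Arabic code points rather than the editor's internal syllable codes, the five-letter table is a made-up sample and not the scheme used by the editor, and `transliterate` is a hypothetical name.

```python
# Hypothetical sample table; a real scheme would cover the whole alphabet
# and use diacritics for letters with no plain Roman equivalent.
ROMAN = {
    "\u0628": "b",  # beh
    "\u062a": "t",  # teh
    "\u0643": "k",  # kaf
    "\u0644": "l",  # lam
    "\u0645": "m",  # meem
}

def transliterate(text, table=ROMAN):
    """Replace each mapped letter; leave spaces and unmapped characters alone."""
    return "".join(table.get(ch, ch) for ch in text)

print(transliterate("\u0643\u062a\u0628"))  # ktb
```

Like the feature described above, the sketch only goes one way, from Arabic letters to Roman.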
IIT Madras views as limitations in the present version of the r2leditor.
1. The keyboard
mapping is phonetic following a scheme which is typically suited to the
sounds of Indian languages. It is possible to assign any appropriate mapping
and thus honour existing keyboard mapping schemes for Urdu and Arabic.
We do not yet know what will be considered meaningful.
2. In the current version,
the text entered in Arabic or Urdu, is coded into syllables following the
lexical ordering for Indian languages. The lexical ordering for Arabic
(and Hebrew) is different. Hence the sorting utility supplied with the
IITM software may not sort text as required. This is a problem which may
be handled without much difficulty later.
3. It is not guaranteed that
the editor will correctly render Arabic or Urdu text consistent with the
shapes used in traditional writing. The font used by the Editor is a TrueType
font which is freely available on the net. This font is meant for Urdu
and hence the shapes for the vowel marks differ from the corresponding
ones for Arabic. The "Hamza" is also not handled properly as of now. This
may be a major limitation as far as Arabic is concerned.
IIT Madras has a proposal
to design a new font consistent with the requirements for both Arabic and
4. The scheme of transliteration
used in the current version of the Editor may not present the correct equivalents.
We are still looking for schemes acceptable to all in respect of Roman
transliteration for Arabic and Urdu. We have seen many different schemes
such as "Qalam Arabic Transliteration" which seem to be popular. The scheme
used in the examples above corresponds to the ArabTeX specification. It
will be possible to accommodate almost any desired scheme so long as the
diacritics used are chosen from the standard set as per conventions used
in the past (e.g., books printed in the 19th century). The online transliteration
feature is included only to show that it can be done with the editor.
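The sorting limitation in item 2 can be illustrated with a short Python sketch. Roman letters are used as stand-ins for six consonants, and the two order lists are assumptions chosen to mimic an Indian-language ordering and the Arabic alphabetical ordering. The names `make_key`, `INDIC_ORDER` and `ARABIC_ORDER` are hypothetical and are not taken from the IITM sorting utility.

```python
# Stand-ins for six consonants, listed in two different lexical orders.
INDIC_ORDER = ["k", "t", "b", "m", "l", "s"]   # order implied by the internal codes
ARABIC_ORDER = ["b", "t", "s", "k", "l", "m"]  # order an Arabic reader expects

def make_key(order):
    """Build a sort key that ranks each letter by its place in `order`."""
    rank = {letter: i for i, letter in enumerate(order)}
    def key(word):
        return [rank[ch] for ch in word]
    return key

words = ["ktb", "blk", "slm", "tbk"]
print(sorted(words, key=make_key(INDIC_ORDER)))   # ['ktb', 'tbk', 'blk', 'slm']
print(sorted(words, key=make_key(ARABIC_ORDER)))  # ['blk', 'tbk', 'slm', 'ktb']
```

The same words come out in a different sequence under each ordering. This suggests why a utility that sorts by the internal codes may not give the expected result for Arabic, and why adding a rank table for the target language is one plausible remedy.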
A trial version of the editor
is available for evaluation. Those interested in getting a copy may send
a request to the lab at the address given in the contact
is necessary to mention here that there are still many bugs to be fixed
in the editor. Though Arabic, Urdu and Roman may be typed in easily, unpredictable
results will be seen if one tries to edit an already prepared file containing
bilingual data. However, the editor can be used to prepare text for pasting