Linguistic issues in
text processing
Dealing with text consistent
with linguistic requirements
Text processing with
linguistic requirements in mind can be accomplished with a minimal set of characters
and a few special symbols. By this we mean that a displayed text string
can be interpreted with respect to the language it represents. When we
look for the meaning of a word in a text string, the language does
come into the picture, and a computer program may actually match the string
against a set of words in order to arrive at a linguistically important feature
of the word.
Interestingly, what
associates a word with a language is not the script in which the word is
written but the sounds associated with the word. For example, the bilingual
text we see in railway stations in India conveys the same linguistic information
even though written in different scripts. Unfortunately,
computers have forced us to work with scripts rather than sounds, constraining
us to handle representations of the shapes of the written letters.
The reader will agree with this readily once he/she reads the following
text strings and relates them all to the same linguistic content.
An important
consequence of the above observation is that in the case of two of the
scripts (Roman with diacritics and Greek), a minimal set of about 30-40 shapes
is adequate to represent virtually any text one wishes to display. In the
case of the other two (Devanagari and Tamil), hundreds of shapes may have
to be used, since each shape is associated with a unique sound; this is in
contrast with the other situation, where a sequence of shapes from a small
set is placed one after the other. In other words,
while in the western scripts a syllable is always shown in decomposed form,
in Indian scripts a syllable is usually shown in its individual form, though
this individual form may conform to some convention in respect of how it
is generated.
In the context of
Indian scripts, one seldom runs into the problem of reading text incorrectly,
since the reader automatically associates the shapes with the sounds, whereas
there is enough room for incorrect reading with the Roman script. Thus
the shapes of the symbols used in Indian scripts relate more directly to
linguistic content without ambiguity when one pronounces the sounds as
inferred from the shapes. This brings us to an important problem of text
representation. If we want to code the text in such a way that the linguistic
content and the shape are mapped one to one, we will have to find a code for
each syllable and provide for thousands of these, even for a
single language. The reader who is familiar with language primers in elementary
schools will immediately remember the very basic set consisting of all
the consonant vowel combinations. Shown below is a portion of the table
of syllable representations in their most basic form, with just one vowel
combined with a consonant; this includes the case where the generic consonant
is represented as well. Thus the total set equals the product of the number
of consonants and the number of vowels, together with the set of vowels,
and this may constitute the bare minimum requirement for syllable
representation. This set is linguistically adequate, though the writing
conventions may require special ligatures when specific conjuncts are formed.
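The size of this basic set can be estimated with simple arithmetic. The counts below are assumptions for illustration only (Devanagari is commonly described as having about 33 consonants and about 13 vowels; the exact figures vary with the inventory one adopts):

```python
# Assumed counts for illustration; actual inventories vary by language.
consonants = 33
vowels = 13

# Every consonant combines with every vowel, the generic (vowel-less)
# form of each consonant is also needed, and the standalone vowels
# complete the set.
consonant_vowel_forms = consonants * vowels   # 33 * 13 = 429
generic_consonants = consonants               # 33
total = consonant_vowel_forms + generic_consonants + vowels

print(total)  # 475 with these assumed counts
```

Even with this bare minimum, the count is already well beyond what an eight-bit code space can hold once conjuncts are considered.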
This large
set of displayed shapes has certainly posed problems for the computer scientists
who had always worked with a limited set of letters. The new requirement
can be met only with schemes that allow more than eight bits per code since
the required number is far in excess of 256. Until recently, the majority of
computer applications had been written to work only with eight-bit codes
for text representation, except perhaps those meant for use with Chinese,
Japanese and Korean, where more than 20,000 shapes are required. Surprisingly,
individual codes have been assigned to each of these (a very tedious process
indeed, but one that has been handled meticulously). To circumvent the data
entry problem with that many symbols, a dictionary based approach is used
for these specific languages where the name of the shape is typed in using
a very small set of letters (called kana) and the application substitutes
the shapes (called ideographs).
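The dictionary-based substitution described above can be sketched in a few lines. The tiny dictionary here is purely illustrative; real input methods use large dictionaries with candidate ranking:

```python
# A minimal sketch of dictionary-based text entry: the user types the
# name of the shape in kana, and the application substitutes the
# ideograph. The three entries below are assumptions for illustration.
KANA_DICT = {
    "やま": "山",   # mountain
    "かわ": "川",   # river
    "ひ": "日",     # sun / day
}

def convert(kana: str) -> str:
    """Substitute an ideograph for a kana spelling, if known."""
    return KANA_DICT.get(kana, kana)

print(convert("やま"))  # the ideograph for "mountain"
print(convert("ねこ"))  # unknown spellings pass through unchanged
```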
Handling Indian scripts.
Computer applications
written for the western scripts can handle about 150-200 shapes (letters,
accented letters and symbols). Designers have thought of clever approaches
to dealing with Indian scripts by identifying a minimal set of primitive
shapes from which the required shape for any syllable could be constructed.
For Indian scripts, the basic set of consonant vowel combinations can be
easily accommodated through a minimal set of basic shapes involving only
the vowels, consonants and the matras. When we write text in our languages,
we can in fact build the required shape of the syllable from these but
writing conventions are such that for almost all the scripts (except Tamil)
many syllables have independent shapes. It is very likely that as writing
systems evolved in India, the syllables which occurred more frequently
were assigned special shapes. We observe that there are about a
hundred and fifty of these special shapes, which will have to be included
in our set if we wish to generate displays conforming to most of the conventions.
These basic shapes
can be used as the glyphs in a font so that one can generate meaningful
displays conforming to the writing conventions. If we look at the number
of glyphs, we will find that about 230-240 may be adequate to build almost
all the syllables in use. However, fonts used in computers cannot always
support this many glyphs. Each system, Win9x, Unix or the Macintosh, has
its own specifications for the correct handling of fonts, and the common
denominator that all these platforms can truly cater to is only about 190
glyphs, though individually the Macintosh can support many more. For most
scripts, multiple copies of the Matras, each one magnified or reduced in
size and located appropriately to blend with the consonant or conjunct
will be required. In some cases, it may be difficult to add a matra by
overlaying two glyphs because the basic shape of the consonant may not
permit an attachment that is not individually tailored to it. This
happens for example with the "u" matra for the consonant "ha". In these
cases, new glyphs are invariably added.
The observations made
above may not hold for the case of text representation through Unicode,
which provides a large code space of more than 64,000 codes. Yet, within
this large space, each language (identified through the script associated
with it) will be confined to a much smaller set of codes, though this set
may well exceed 256. Thus Unicode, used with an appropriate 16-bit font, can
accommodate a fairly large number of characters for a script. The Western
Latin set has more than 450 assigned codes to cater to most European requirements.
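The point that each script occupies its own block of code points beyond the eight-bit range can be seen directly. Devanagari, for instance, occupies the Unicode block U+0900 to U+097F:

```python
# The Devanagari letter KA sits at U+0915, outside the 0..255 range
# of any eight-bit code.
ka = "\u0915"                      # DEVANAGARI LETTER KA

print(hex(ord(ka)))                # 0x915
print(ord(ka) > 255)               # True: cannot fit in an 8-bit code
print(len(ka.encode("utf-8")))     # 3: UTF-8 spends three bytes on it
```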
We will now make some
specific observations about handling our scripts and assigning codes.
1. If we agree to represent
text using codes assigned to shapes used in building up the displayed symbols,
we will certainly be able to store and display the text and possibly handle
data entry as well using the same methods adopted for plain ASCII text.
However, tracing the displayed text to the linguistic content requires
us to map the displayed shape into the consonants and vowels that make
up the syllable. This makes linguistic processing quite complicated. Also,
this approach will not work uniformly across fonts since each font has
its own selection of basic glyphs and ligatures.
2. We can agree
to assign codes to the basic vowels and consonants of our languages, which
run to about fifty-one symbols. However, these codes cannot be directly
mapped to shapes in the displayed text. A string containing these codes
will necessarily have to be parsed to identify syllable boundaries, and
the result mapped to a shape. If we do what is done in the western scripts,
we will end up with a situation such as the one seen below. If we take the
approach through ISCII and try to display text directly with the codes, we
will also run into similar difficulties.
In the use
of ISCII, a situation similar to that of the Roman script is acceptable so
long as the convention of attaching the vowel shape to only one side of the
consonant is retained. The group of codes will indeed contribute to
identifying the linguistic content properly, but the display may require
swapping of glyphs if the matra addition follows a different rule.
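The glyph swapping mentioned above can be sketched as follows. In Devanagari, the short-i matra attaches to the left of the consonant even though it follows the consonant in the internal code sequence; the single-member set below is an assumption for illustration, since a real renderer would consult a fuller table:

```python
# Matras that attach to the left of the consonant in display order.
# Only the short-i matra (U+093F) is listed here, as an illustration.
LEFT_SIDE_MATRAS = {"\u093f"}

def display_order(consonant: str, matra: str) -> str:
    """Reorder a consonant-matra pair from internal to display order."""
    if matra in LEFT_SIDE_MATRAS:
        return matra + consonant   # glyph swap: matra drawn first
    return consonant + matra       # most matras stay after the consonant

print(display_order("\u0915", "\u093f"))  # short-i glyph precedes KA
print(display_order("\u0915", "\u093e"))  # aa-matra stays after KA
```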
The main advantage
of ISCII is that it provides for codes that relate to the linguistic content
(sounds) and thus these could be used uniformly across the Indian languages
which are based on a more or less common set of sounds. However, this simplistic
view does not always hold, for ISCII also prescribed the means for interpreting
specific codes to result in a specific display form. It achieved this through
two special codes called the INV and Nukta.
Going from an ISCII
string to displayed shapes requires one to identify syllable boundaries
and also properly interpret the INV and Nukta characters. This approach
will be script dependent as well as font dependent. Such a program will
code into itself the rules of the writing system followed for a language
when using the script. Clearly, writing such programs to handle multiple
scripts in the same document will not be easy. Also, since the writing
system rules are coded into the program, handling a new script for a language
will require the program to be modified and recompiled. It is, however,
possible to read the rules into the program if it is written in an appropriate
manner, involving data structures that directly specify the rules and are
read in at run time from appropriate files (tables or simple structures
can help).
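A table-driven renderer of the kind suggested above might load its rules like this. The file layout, key names and the sample entries are assumptions for illustration; the point is only that a new script means a new data file, not a recompiled program:

```python
import json

# In practice the rules would come from a per-script file, e.g.
# json.load(open("devanagari_rules.json")); a literal string stands in
# for the file here so the sketch is self-contained.
RULES_JSON = '''
{
    "left_side_matras": ["\\u093f"],
    "ligatures": {"\\u0915\\u094d\\u0937": "KSSA-LIGATURE-GLYPH"}
}
'''

rules = json.loads(RULES_JSON)

def is_left_side(matra: str) -> bool:
    """Consult the run-time table instead of hard-coded logic."""
    return matra in rules["left_side_matras"]

print(is_left_side("\u093f"))  # True: rule came from data, not code
```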
Going from the displayed
shape to the internal representation.
How easy or difficult
will it be for us to retrace the steps and go from a displayed shape to
the ISCII codes which generated the shape? This problem is faced in practice
when we perform copy paste operations. The problem is quite difficult to
handle since the display is based on codes corresponding to the glyphs
in the font while the internal representation conforms to ISCII (or Unicode).
What is recommended in practice is the approach through a backing store
for the displayed string, typically implemented as a buffer in memory that
retains the internal codes of the displayed text. This buffer will have
to be maintained in addition to any other buffer maintained by the application
for manipulating the text. When a block of text is selected on the screen,
a copy of the display is generated again from the internal buffer and this
is compared with the codes corresponding to the display. In other words,
one really does not go from the displayed codes to the internal codes but
rather matches the displayed codes by generating a virtual display and
comparing the two. We now appreciate that if the displayed code
and the internal code were the same, there would be no difficulty at all in
doing this. Syllable-based writing systems, however, do not permit this.
Tracing back can be
quite complicated when the same syllable gets displayed in alternate forms
as in the illustration below.
One has perfect
freedom in choosing any of the above forms when displaying text and no
one would complain that the text is not readable since all the forms are
accepted as equivalent.
The
assignment of ISCII or Unicode values does not specify in which form a
syllable should be rendered so long as the result is acceptable.
The rendering in practice will have to take into account the availability
of the required basic shapes to build up the final form. Hence the rendering
process will depend on the font used for the script. Experience tells us
that at least in respect of Devanagari, the first and the fourth forms
above are seen only in some commercially available fonts which are normally
recommended for high quality typesetting.
Summary and specific observations.
1. The characters defined
in any coding scheme should meet the basic linguistic requirements as applicable
to a language. It is also necessary to accommodate all the special symbols
used in the writing system to add syntactic value to a string. For instance,
the Vedic marks used in Sanskrit text or the accounting symbols used in
Tamil provide additional information which may not be strictly linguistic
in nature but useful for interpreting the contents.
2. As far as possible, every
text string must conform to the basic requirement that the displayed shape
always carry specific linguistic information. That is, some amount of semantic
detail must also be part of the information conveyed by the string. In
the absence of this, an application will have great difficulty in interpreting
a text string from a linguistic angle, though the string may contain only
valid codes.
3. The same linguistic information
may be conveyed by more than one displayed shape. The coding schemes must
permit alternative representations to be traced back to specific linguistic
content.
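Unicode's canonical equivalence illustrates how alternative representations can be traced back to one linguistic content. The precomposed Devanagari letter QA (U+0958) and the sequence KA followed by NUKTA (U+0915 U+093C) are distinct code sequences for the same letter, and normalization maps one onto the other:

```python
import unicodedata

precomposed = "\u0958"          # DEVANAGARI LETTER QA
decomposed = "\u0915\u093c"     # KA followed by NUKTA

print(precomposed == decomposed)   # False: different code sequences
# Canonical decomposition (NFD) reduces both to the same sequence.
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```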