Linguistic issues in text processing
Dealing with Text consistent with linguistic requirements

  Text processing that respects linguistic requirements can be carried out with a minimal set of characters and a few special symbols. By this we mean that a displayed text string can be interpreted with respect to the language it represents. When we look for the meaning of a word in a text string, the language comes into the picture: a computer program may match the string against a set of words in order to arrive at a linguistically important feature of the word.

  Interestingly, what associates a word with a language is not the script in which the word is written but the sounds associated with it. For example, the bilingual text we see in railway stations in India conveys the same linguistic information even though it is written in different scripts. Unfortunately, computers have forced us to work with scripts rather than sounds, constraining us to handle representations of the shapes of the written letters. The reader will agree with this readily on reading the following text strings and relating them all to the same linguistic content.

  An important consequence of the above observation is that in the case of two of the scripts (Roman with diacritics and Greek), a minimal set of about 30-40 shapes is adequate to represent virtually any text one wishes to display. In the case of the other two (Devanagari and Tamil), hundreds of shapes may have to be used, since each shape is associated with a unique sound; this contrasts with the other situation, where a sequence of shapes from a small set is placed one after the other. In other words, while in the western scripts a syllable is always shown in decomposed form, in Indian scripts a syllable is usually shown as a single shape of its own, though this shape may conform to some convention governing how it is generated.

  In the context of Indian scripts, one seldom runs into the problem of reading text incorrectly, since the reader automatically associates the shapes with the sounds, whereas there is considerable room for incorrect reading with the Roman script. Thus the shapes of the symbols used in Indian scripts relate directly and unambiguously to linguistic content when one pronounces the sounds inferred from the shapes.

  This brings us to an important problem of text representation. If we want to code text in such a way that linguistic content and shape are mapped one to one, we will have to find a code for each syllable, and we will have to provide for thousands of these even for a single language. The reader who is familiar with language primers in elementary schools will immediately remember the very basic set consisting of all the consonant-vowel combinations. Shown below is a portion of the table of syllable representations in their most basic form, one vowel with one consonant, including the case where the generic (vowel-less) consonant is represented as well. Thus the total set equals the product of the number of consonants and the number of vowels, together with the generic consonants and the vowels themselves, and this may constitute the bare minimum requirement for syllable representation. This set is linguistically adequate, though the writing conventions may require special ligatures when specific conjuncts are formed.
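
  To get a feel for the numbers involved, here is a rough calculation in Python; the counts of 34 consonants and 14 vowels are illustrative approximations for a Devanagari-like script, not exact figures.

# Rough size of the basic syllable set for a Devanagari-like script.
# The consonant and vowel counts are illustrative approximations.
consonants = 34
vowels = 14

# consonant-vowel combinations, plus one generic (vowel-less) form per
# consonant, plus the independent vowels themselves
total = consonants * vowels + consonants + vowels
print(total)   # 34*14 + 34 + 14 = 524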

  This large set of displayed shapes has certainly posed problems for computer scientists, who had always worked with a limited set of letters. The new requirement can be met only with schemes that allow more than eight bits per code, since the required number is far in excess of 256. Till recently, the majority of computer applications had been written to work only with eight-bit codes for text representation, except perhaps those meant for use with Chinese, Japanese and Korean, where more than 20,000 shapes are required. Surprisingly, individual codes have been assigned to each of these (a very tedious process indeed, but one that has been handled meticulously). To circumvent the data entry problem with that many symbols, a dictionary-based approach is used for these languages: the name of the shape is typed in using a very small set of letters (called kana) and the application substitutes the shapes (called ideographs).
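
  The substitution step can be sketched as follows; this is a minimal illustration, not the actual mechanism of any particular input system, and the toy dictionary holds just two entries.

# Toy sketch of dictionary-based data entry: the user types the name of
# the shape using kana and the application substitutes the ideographs.
# The two entries below are illustrative only.
kana_to_ideographs = {
    "やま": ["山"],        # "yama" -> mountain
    "かわ": ["川", "河"],  # "kawa" -> two candidate ideographs
}

def candidates(kana):
    """Return the ideographs whose name matches the typed kana."""
    return kana_to_ideographs.get(kana, [kana])  # fall back to the kana

print(candidates("かわ"))   # ['川', '河'] - the user picks the one intended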

Handling Indian scripts.

  Computer applications written for the western scripts can handle about 150-200 shapes (letters, accented letters and symbols). Designers have therefore thought of clever approaches to dealing with Indian scripts by identifying a minimal set of primitive shapes from which the required shape for any syllable can be constructed. For Indian scripts, the basic set of consonant-vowel combinations can be accommodated through a minimal set of basic shapes involving only the vowels, the consonants and the matras. When we write text in our languages, we can in fact build the required shape of a syllable from these, but writing conventions are such that in almost all the scripts (except Tamil) many syllables have independent shapes. It is very likely that as writing systems evolved in India, the syllables which occurred more frequently were assigned special shapes. We observe that there are about a hundred and fifty of these special shapes, which will have to be included in our set if we wish to generate displays conforming to most of the conventions.

  These basic shapes can be used as the glyphs in a font so that one can generate meaningful displays conforming to the writing conventions. If we count the glyphs, we find that about 230-240 may be adequate to build almost all the syllables in use. However, fonts used in computers cannot really support this many glyphs. Each system, Win9x, Unix or the Macintosh, has its own specifications for the correct handling of fonts, and the common denominator that all these platforms can truly cater to is only about 190 glyphs, though individually the Macintosh can support many more. For most scripts, multiple copies of the matras, each magnified or reduced in size and positioned appropriately to blend with the consonant or conjunct, will be required. In some cases it may be difficult to add a matra by overlaying two glyphs, because the basic shape of the consonant may not permit an attachment that is not individually tailored to it. This happens, for example, with the "u" matra for the consonant "ha". In such cases, new glyphs are invariably added.
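
  A composition table of this kind might look like the sketch below; the glyph names are invented for illustration, and the "ha" + "u" entry reflects the exception just mentioned.

# Sketch of matra attachment: most consonant-matra pairs are built by
# overlaying two glyphs, but some pairs need a pre-designed ligature.
# Glyph names here are invented; a real font would use its own glyph IDs.
special_ligatures = {
    ("ha", "u"): ["ha_u_ligature"],   # "hu" cannot be built by overlay
}

def glyphs_for(consonant, matra):
    if (consonant, matra) in special_ligatures:
        return special_ligatures[(consonant, matra)]
    return [consonant + "_glyph", matra + "_matra_glyph"]  # plain overlay

print(glyphs_for("ka", "u"))   # ['ka_glyph', 'u_matra_glyph']
print(glyphs_for("ha", "u"))   # ['ha_u_ligature']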

  The observations made above may not hold for text representation through Unicode, which provides a large code space of more than 64000 codes. Within this large space, each language (identified through the script associated with it) is confined to a much smaller set of codes, but this set can exceed 256. Thus Unicode, used with an appropriate 16-bit font, can accommodate a fairly large number of characters for a script. The Western Latin set has more than 450 assigned codes to cater to most European requirements.
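
  The confinement of a script to a small block within the large code space is easy to verify; Devanagari, for instance, occupies the range U+0900 to U+097F.

# Every character of a Devanagari string falls within one small block
# (U+0900 - U+097F) of the much larger Unicode code space.
word = "भारत"   # "Bharat"
for ch in word:
    print(f"U+{ord(ch):04X}", ch)
# U+092D, U+093E, U+0930, U+0924 - all within U+0900..U+097F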

  We will now make some specific observations about handling our scripts and assigning codes.

1. If we agree to represent text using codes assigned to the shapes used in building up the displayed symbols, we will certainly be able to store and display the text, and possibly handle data entry as well, using the same methods adopted for plain ASCII text. However, tracing the displayed text back to its linguistic content requires us to map the displayed shapes into the consonants and vowels that make up each syllable. This makes linguistic processing quite complicated. Also, this approach will not work uniformly across fonts, since each font has its own selection of basic glyphs and ligatures.

2. We can agree to assign codes to the basic vowels and consonants of our languages, which run to about fifty-one symbols. However, these codes cannot be directly mapped to shapes in the displayed text. A string containing these codes will necessarily have to be parsed to identify syllable boundaries and the result mapped to a shape. If we do what is done in the western scripts, we will end up with a situation such as the one seen below. If we take the approach through ISCII and try to display text directly from the codes, we will run into similar difficulties.
  In the use of ISCII, a situation similar to the Roman one is acceptable so long as the convention of placing the vowel shape on only one side of the consonant is retained. The group of codes will indeed identify the linguistic content properly, but the display may require swapping of glyphs if the matra addition follows a different rule, as the sketch below illustrates.
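
  The swap is easily illustrated with the short "i" matra of Devanagari, which follows the consonant in the stored ISCII or Unicode order but is displayed to its left. The sketch below is a simplification that ignores conjuncts and other reordering rules.

# Simplified sketch: the short-i matra is stored after the consonant
# but displayed before it, so the two must be swapped for display.
I_MATRA = "\u093F"   # Devanagari vowel sign i

def display_order(codes):
    out = []
    for ch in codes:
        if ch == I_MATRA and out:
            out.insert(len(out) - 1, ch)   # move before the consonant
        else:
            out.append(ch)
    return "".join(out)

# Stored order: ka + i-matra; display order: i-matra + ka
print(list(display_order("\u0915\u093F")))   # ['ि', 'क']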

  The main advantage of ISCII is that it provides codes that relate to the linguistic content (the sounds), and thus these can be used uniformly across the Indian languages, which are based on a more or less common set of sounds. However, this simple view does not always hold, for ISCII also prescribed the means for interpreting specific codes so as to produce a specific display form. It achieved this through two special codes called INV and the Nukta.

  Going from an ISCII string to displayed shapes requires one to identify syllable boundaries and also to interpret the INV and Nukta characters properly. This approach will be script dependent as well as font dependent. Such a program will code into itself the rules of the writing system followed for a language when using the script. Clearly, writing such programs to handle multiple scripts in the same document will not be easy. Also, since the writing-system rules are coded into the program, handling a new script for a language will require the program to be modified and recompiled. It is, however, possible for the program to read in the rules, if it is written in an appropriate manner with data structures that directly specify the rules and are read at run time from appropriate files (tables or simple structures can help).
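
  A minimal sketch of such a data-driven arrangement is given below; the rule-file format (one "syllable -> glyph sequence" mapping per line) is invented for illustration.

# Sketch of reading writing-system rules from a file at run time rather
# than compiling them into the program. The file format is invented.
def load_rules(path):
    rules = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue                      # skip blanks and comments
            syllable, glyphs = line.split("->")
            rules[syllable.strip()] = glyphs.strip().split()
    return rules

# Supporting a new script then means supplying a new rule file,
# not modifying and recompiling the program, e.g.:
# rules = load_rules("devanagari.rules")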

Going from the displayed shape to the internal representation.

  How easy or difficult will it be to retrace the steps and go from a displayed shape to the ISCII codes which generated it? This problem is faced in practice when we perform copy-paste operations. The problem is quite difficult to handle, since the display is based on codes corresponding to the glyphs in the font while the internal representation conforms to ISCII (or Unicode). What is recommended in practice is the approach through a backing store for the displayed string, typically implemented as a buffer in memory that retains the internal codes of the displayed text. This buffer will have to be maintained in addition to any other buffer maintained by the application for manipulating the text. When a block of text is selected on the screen, a copy of the display is generated afresh from the internal buffer and compared with the codes corresponding to the display. In other words, one does not really go from the displayed codes to the internal codes but rather matches the displayed codes by generating a virtual display and comparing the two. We now appreciate that if the displayed code and the internal code were the same, there would be no difficulty at all in doing this. The writing systems which are syllable based do not permit this, however.
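
  The backing-store idea can be sketched as follows; the class and its structures are invented for illustration, under the assumption that the renderer records which glyph run came from which span of internal codes.

# Sketch of a backing store: alongside the glyph codes sent to the
# display, keep the internal (ISCII/Unicode) codes and a map from each
# glyph run back to the internal span it was generated from.
class BackingStore:
    def __init__(self):
        self.internal = []   # internal codes of the displayed text
        self.spans = []      # (glyph_start, glyph_end, int_start, int_end)

    def internal_for_selection(self, g_start, g_end):
        """Recover the internal codes for a selected block of glyphs by
        matching the selection against the recorded glyph spans."""
        out = []
        for gs, ge, istart, iend in self.spans:
            if gs >= g_start and ge <= g_end:
                out.extend(self.internal[istart:iend])
        return out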

  Tracing back can be quite complicated when the same syllable gets displayed in alternate forms, as in the illustration below.

  One has perfect freedom in choosing any of the above forms when displaying text and no one would complain that the text is not readable since all the forms are accepted as equivalent.

  The assignment of ISCII or Unicode values does not specify the form in which a syllable should be rendered, so long as the result is acceptable. The rendering in practice will have to take into account the availability of the required basic shapes to build up the final form. Hence the rendering process will depend on the font used for the script. Experience tells us that, at least in respect of Devanagari, the first and the fourth forms above are seen only in some commercially available fonts which are normally recommended for high-quality typesetting.
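
  The font dependence can be sketched as a preference list; the form names and the order of preference below are invented for illustration.

# Sketch: choose the best form of a conjunct that the font can supply,
# falling back when a glyph is missing. Names and order are invented.
def render_conjunct(conjunct, font_glyphs):
    preferences = [
        conjunct + "_ligature",   # single pre-designed ligature glyph
        conjunct + "_halfform",   # half-form of first consonant + second
        conjunct + "_halant",     # explicit halant between full forms
    ]
    for form in preferences:
        if form in font_glyphs:
            return form
    return conjunct + "_halant"   # the halant form can always be built

print(render_conjunct("kta", {"kta_halfform"}))   # 'kta_halfform'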

Summary and specific observations.

1. The characters defined in any coding scheme should meet the basic linguistic requirements applicable to a language. It is also necessary to accommodate all the special symbols used in the writing system to add syntactic value to a string. For instance, the Vedic marks used in Sanskrit texts or the accounting symbols used in Tamil provide additional information which may not be strictly linguistic in nature but is useful for interpreting the contents.

2. As far as possible, every text string must conform to the basic requirement that the displayed shape always carries specific linguistic information. That is, some amount of semantic detail must also be part of the information conveyed by the string. In the absence of this, an application will have great difficulty in interpreting a text string from a linguistic angle, even though the string may contain only valid codes.

3. The same linguistic information may be conveyed by more than one displayed shape. The coding schemes must permit alternative representations to be traced back to specific linguistic content.


 