Open type fonts and True type fonts
Do we really require an Open type font to work with Unicode text in Indian languages?
  Handling text in a computer can involve mere data entry and display (or printing), or more complex processing such as matching strings or generating a web page through a script. A suitable font is almost always required for generating the display in an interactive application. It is well known that eight bit fonts (fonts with 255 glyphs or fewer) are adequate to render text in all the Indian languages, even though a few representations involving complex ligatures might not be available.

  The real issue, however, is not the number of glyphs but how one identifies the glyphs to be rendered, given a representation of the text. The easiest method has been to simply generate and store the text in terms of the glyph codes themselves and use conventional methods of rendering ASCII strings. By and large, most applications supporting text preparation in Indian languages seem to have adopted this method. When glyph codes are used, data entry is not intuitive, and keyboard mappings tend to confuse the user, since one will quite frequently be typing in ligatures and not just the basic consonants and vowels.

  The use of ISCII as a standard for storing text required that a suitable processing module be used to arrive at the glyph codes from the internally stored ISCII codes for the consonants, vowels and the matras. Such a processing module can become quite complex, since it has to first identify syllable boundaries in the input codes and then map each syllable to the required shape by piecing together an appropriate set of glyphs from the font. As an illustration, the syllable seen below may be built from three glyphs in the Sanskrit98 font. Among the three, the middle glyph has zero width, with the corresponding ligature drawn to the left of the vertical axis.

  Zero width glyphs have helped build the required complex shapes through a simple process of concatenating glyphs. In other words, zero width glyphs help create shapes which are formed by overlapping several more basic shapes (usually two or three). Typically, the matras in Devanagari are overlapped with the consonant shapes. The main advantage of this approach is that text can be rendered on most systems which simply take a string of eight bit codes. Most font rendering schemes (Win9X, Linux, Mac, Postscript) correctly handle zero width glyphs, and in these cases the rendering engine is quite simple: it just concatenates the shapes together.
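
  The idea can be captured in a few lines of code. Below is a minimal sketch, in Python, of the concatenation style of rendering just described; the glyph codes and widths are hypothetical, and a real renderer would read its metrics from the font itself.

    GLYPH_WIDTHS = {          # assumed metrics, in font units
        0xA1: 520,            # a consonant glyph
        0xB7: 0,              # a zero width matra glyph
        0xC3: 480,            # another consonant glyph
    }

    def layout(glyph_codes):
        """Return (glyph_code, x_position) pairs for a string of 8-bit glyph codes."""
        pen_x = 0
        placed = []
        for g in glyph_codes:
            placed.append((g, pen_x))        # draw the glyph at the pen position
            pen_x += GLYPH_WIDTHS.get(g, 0)  # a zero width glyph leaves the pen in place
        return placed

    print(layout([0xA1, 0xB7, 0xC3]))
    # [(161, 0), (183, 520), (195, 520)]: the matra shares its position with
    # the following glyph, its ink falling to the left of the pen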

  The question which has always been asked is, "Can every display requirement be handled through the use of zero width glyphs (in respect of most scripts in India)?" While the answer to this question is certainly "yes", a large number of such glyphs will be required in practice to handle all the shapes which can be generated only by overlapping more basic shapes. It is quite difficult to accommodate so many glyphs in an eight bit font. It may be noted here that TeX has indeed shown that an eight bit font may be all we need for our scripts, but the approach cannot be used in interactive applications.

  Developers who wish to use Unicode for Indian languages face the problem of building up the required shape for each syllable using only a Unicode font. For the majority of the world's languages, a Unicode font needs only one glyph for each Unicode character defined for the language. For Indian languages the situation is very different, since a Unicode font would have to accommodate literally thousands of glyphs. Certainly, one could think of a Unicode font with several thousand glyphs where each glyph directly represents a syllable. Unfortunately, when the Unicode assignments were made, the experts felt that a scheme similar to ISCII would be sufficient. So each Indian language was assigned a limited set of 128 code values, from which, it was assumed, all syllables would be represented using a variable number of Unicode characters. Since a one to one mapping between a Unicode character and a glyph does not apply, a rendering engine would have to be used which maps the Unicode characters to the glyphs of SOME font, without specifying the range of Unicode values for the font glyphs.

  The way out of this situation was to propose a new font concept, the Open type font, which would incorporate features to map one or more Unicode characters to one or more glyphs in an appropriate Unicode range. An Open type font permits a large number of glyphs, several hundred perhaps, enough to generate all the required ligatures by positioning glyphs with respect to one another. The required ligatures are obtained by selecting the glyphs appropriate to a syllable and shaping the display by placing the glyphs at precisely defined locations. The need for zero width glyphs does not arise, for the font rendering program gets positioning information from the glyph to be displayed, which identifies the component glyphs to be pieced together. The Open type font allows a string of Unicode characters to be mapped into a single glyph, thus permitting the generation of the shape of a syllable from a variable length string. By precisely locating the glyphs in relation to one another graphically, the need for multiple zero width glyphs for the same ligature (as in True type fonts) is eliminated. It is said that such precise positioning allows superior quality typography as well. It is a different matter, however, if the basic glyphs themselves are not aesthetically pleasing, as is the case with the Microsoft Mangal font!

  An Open type Unicode font not only allows more than 256 glyphs but also builds into the font the positioning information needed when multiple glyphs are overlaid. Essentially this is the same concept as that of a composite glyph in a conventional True type font. The composite glyph also has the advantage that we can specify it with just one code. However, when mapping characters in the text, a True type font permits only one glyph to be mapped to one character. Here lies the distinct advantage of the Open type font, where a string of Unicode values can map to a single glyph. When a font rendering program is called to display a composite glyph, it dynamically builds the glyph from the component glyphs by positioning them properly. If one uses zero width glyphs in a font, the same final result can be obtained, but only by specifying a code for each glyph. If we examine the syllable shown earlier, an Open type font could indeed include a glyph that combines the first two components ("sht" and "ra") and have the string "sh, t, ra" mapped to it. In reality, many glyphs in the Microsoft Mangal font are composite glyphs (almost 500 of them), and the recommendation from Microsoft experts emphasizes the use of composite glyphs for as many glyphs as possible which directly relate to a syllable.
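
  The many to one mapping can be pictured as a table keyed by strings of Unicode values. The Python sketch below imitates such a ligature lookup. The Devanagari code points are the standard ones for the syllable discussed above, but the glyph ids and the table entries are assumptions made purely for illustration, not the contents of any actual Open type font.

    HALANT = 0x094D
    LIGATURE_TABLE = {
        (0x0937, HALANT, 0x091F, HALANT, 0x0930): 0xE231,  # assumed glyph for "sh, t, ra"
        (0x0937, HALANT, 0x091F): 0xE145,                  # assumed glyph for "sht"
    }

    def substitute(codes):
        glyphs, i = [], 0
        while i < len(codes):
            for length in range(len(codes) - i, 0, -1):    # longest match first
                key = tuple(codes[i:i + length])
                if key in LIGATURE_TABLE:
                    glyphs.append(LIGATURE_TABLE[key])
                    i += length
                    break
            else:
                glyphs.append(codes[i])                    # no ligature applies here
                i += 1
        return glyphs

    # substitute([0x0937, HALANT, 0x091F, HALANT, 0x0930]) -> [0xE231]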

  The Uniscribe module, which constitutes the shaping engine for Unicode in Microsoft applications, identifies that "sh", "t" and "ra" should come out as a single shape by applying the rule that when the consonant "ra" is the last consonant in a syllable, it is written using a ligature which occurs either as an attachment to the vertical stroke of the preceding consonant (as in "p, ra") or as an individual ligature below it, if the preceding consonant does not have a vertical stroke. It turns out that Microsoft displays the syllable in the illustration above not as a single ligature for "sh" and "t" but through a half form of "sh" and a ligature for "ra" under the consonant "t".

  It is now reasonably clear to us that a lot of rules are hard coded into Uniscribe. Some of the rules depend on the availability of specific shapes (glyphs) in the font under use. Since the form of the syllable is hard coded into Uniscribe, the user or the developer cannot provide alternate forms for a syllable even if such a form can be pieced together from other glyphs available in the font. People often prefer a form in which a conjunct is shown without a halanth on any of the consonants; this is certainly not possible with Uniscribe as of today (March 2003). Tomorrow, if we do agree to build a new glyph into the Mangal font, Uniscribe will have to be rewritten! Of course, Microsoft does not insist on the developer using Uniscribe. The onus is then on the designer to shape the syllables in the application itself, something that can lead to a lot of additional work.

  Uniscribe also works on the principle of internally defined rules which specify which form of a consonant applies in a given context. Thus a "ra" occurring as the first consonant of a syllable is treated differently from a "ra" that occurs in the middle or at the end. Towards this, Uniscribe also reorders the input string to handle cases where the first consonant is graphically positioned at the end, as happens when the "reph" form applies. In Marathi, the reph form is not always used when "ra" occurs as the first consonant. So these rules, which are language dependent, can be handled by Uniscribe only when the language associated with the script is also specified as a parameter. It is not possible to dynamically introduce a language that uses Devanagari but has rules different from Sanskrit or Hindi!
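
  The reordering itself is simple to state, as the Python sketch below suggests: an initial "ra" followed by a halant is detached and drawn at the end of the syllable as the reph mark. The glyph id is hypothetical, and the use_reph flag stands in for the language parameter just discussed.

    RA, HALANT = 0x0930, 0x094D
    REPH_GLYPH = 0xE201    # hypothetical glyph id for the reph mark

    def shape_syllable(codes, use_reph=True):
        # A language written in Devanagari (Marathi, say) may opt out of the
        # reph form, hence the flag.
        if use_reph and len(codes) > 2 and codes[0] == RA and codes[1] == HALANT:
            return codes[2:] + [REPH_GLYPH]    # graphically, "ra" moves to the end
        return codes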

Glyph codes are required to be Unicode values.

  Applications which transfer information between themselves through copy/paste benefit greatly from scripts which map one Unicode character to one font glyph. In this case the code of the displayed character is identical to that of the character in storage, and one can readily identify the internally stored text merely by looking at the displayed string.

  We have seen that this cannot be the case with the Indian languages, for several Unicode characters in sequence constitute a syllable and hence a shape. The computer system (basically the OS) must use only a Unicode font to render the text, since everything is Unicode based. The large set of glyph codes required in a font for an Indian language (Tamil may do with a small set) cannot be accommodated in any other Unicode range unless that range has no specific Unicode assignments. Taking note of this, developers have struck a compromise by designing Unicode (Open type too) fonts whose glyph codes lie in the region designated the "Private Use area" by the Unicode consortium, where one is free to locate the characters of one's own scripts. In essence, this allows the characters of any new language to be assigned Unicode values freely, without prejudice to, or interference with, the codes legally assigned to other languages in the Unicode standard.

  Thus, Unicode text in Indian languages will be represented through the standard Unicode assignments for the different Indian languages, but the corresponding fonts will locate their glyphs in the Private Use area. One can readily see that this entails no loss of flexibility in processing a syllable, for what is needed is only the identification of a glyph through a valid Unicode value assigned to it. In a document displayed using such a font, going from the displayed codes back to the internal codes is still possible, so long as we retain the stored text internally in some buffer: we backtrack from the displayed codes by repeatedly generating temporary display codes and matching them against the ones actually displayed. Copy and paste operations are therefore possible. In a one code, one glyph situation, the need for this internally stored text does not arise, because the internally stored text from which the display was generated is identical to the displayed codes themselves.
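
  The arrangement amounts to keeping two parallel strings, as the Python sketch below shows: the stored text is standard Unicode, the displayed string is built of Private Use area codes, and copy/paste must always hand out the former. The shaping table and the PUA values in it are assumptions made for illustration.

    SHAPING_TABLE = {
        "\u0915": "\uE021",              # ka -> a PUA glyph (assumed)
        "\u0915\u094D\u0937": "\uE0A5",  # the "ksha" conjunct -> one PUA glyph (assumed)
    }

    class DisplayBuffer:
        def __init__(self, text):
            self.text = text                  # the authoritative Unicode string
            self.display = self._shape(text)  # the PUA codes actually drawn

        def _shape(self, text):
            out, i = "", 0
            while i < len(text):
                for length in (3, 1):         # longest match first
                    seg = text[i:i + length]
                    if seg in SHAPING_TABLE:
                        out += SHAPING_TABLE[seg]
                        i += len(seg)
                        break
                else:
                    out += text[i]            # no table entry: pass through
                    i += 1
            return out

        def copy(self):
            return self.text                  # never the PUA display codes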

  When we use the Private Use area, we may have no way of finding out which language's text is being displayed unless we access the Unicode values of the internally stored text. Multilingual applications will have quite some work to do in relating the display to a language if the displayed text uses fonts in the Private Use area while the actual code values are different. Thus all applications dealing with Unicode in Indian languages MUST always retain a buffer in which the Unicode string that gave rise to the current display is kept. Worse still, as editing operations are performed on the displayed text, pointers linking the graphical positions of the glyphs to the internally stored text string must be maintained. This is a very complex issue, and we know that Microsoft applications themselves have not handled it with care, as will be seen below.

  It is now apparent that the application bears a lot of responsibility in actually positioning the syllables on the screen when Unicode strings have to be displayed. Errors in computing the widths of displayed glyphs can lead to much confusion during the backtracking process. Errors of this type can cause unpleasant gaps in the displayed text, and we know that this situation exists even with Microsoft software!

  Seen below is a screen shot of three Microsoft applications handling the same text: Wordpad, Word and Excel, all running under Windows XP. The text was generated by typing into Wordpad and was copied and pasted into the other two. The identical looking strings in the Wordpad display are not really identical in their internally stored form but differ through the incorporation of zero width joiners. It is clear, however, that all the strings refer to the same syllable. Whether the applications actually perform syllable level processing is also apparent from the illustration.

  Examine how Word displays the strings. The wavy red line put in by Word (pointing out a spelling error) tells us what Word thinks is the actual width of the displayed string! The situation with Excel is no less amusing: it does not seem to use Uniscribe at all but goes by the one Unicode, one glyph maxim, ignoring the zero width joiners altogether. More interesting still is what happens when you try a string match for the word. Wordpad matches only one string, while Word matches five and misses the one where gaps are seen in the word.

  You can verify all this for yourself if you have Windows XP running on your computer. Just download the Unicode text file corresponding to the displayed text, which we have made available for you. You can open the file in Wordpad or Word directly but must do a copy and paste into Excel.

  At this point one might point out the inconsistencies in text processing with Unicode. Text processing at the syllable level cannot be solved merely by providing modules which identify syllable boundaries and display the text. The need to check the linguistic validity of a text string that is perfectly valid as a Unicode string is really the crux of the problem. The multibyte nature of the syllable, coupled with the need to separate codes which carry linguistic information from codes which merely help in rendering the syllable, will require many comparisons on each Unicode character, severely affecting performance, besides complicating the algorithms themselves.
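
  The string matching anomaly seen above illustrates the point: two strings that differ only in zero width joiners should match at the syllable level, yet the applications disagree. A minimal sketch of the comparison step, in Python, is given below; a real implementation would have to account for many more rendering-only codes than the two joiners shown here.

    import unicodedata

    ZWNJ, ZWJ = "\u200C", "\u200D"

    def linguistic_key(text):
        # Normalize, then drop codes that affect rendering but carry no
        # linguistic content.
        text = unicodedata.normalize("NFC", text)
        return text.replace(ZWNJ, "").replace(ZWJ, "")

    def same_word(a, b):
        return linguistic_key(a) == linguistic_key(b)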

  All this goes to show that it is very difficult to write applications based on Unicode rendering. Applications which go only one way, i.e., from Unicode text to display, are perhaps the only ones which may work, but this restricts the applications to mere data entry and display. Even here an application must know how the shaping engine (Uniscribe or equivalent) renders the text, in order to present the display appropriately to the user's needs. For instance, the onus is on the application to format the text graphically by ascertaining the character widths. Worse still, an application may actually be required to know when rendering information has to be inserted into a string through zero width or non zero width joiners and such. A major constraint which most applications will face is in permitting multilingual data entry. It will be very difficult to build applications that allow data entry in different scripts within the same interface unless they handle the keyboard themselves. The moment you rely on the support given by the OS, you will invariably be forced to use alternate keyboards. As indicated elsewhere in this essay, it is not possible to type punctuation marks in Tamil using the Microsoft Tamil keyboard, and one will have to switch keyboards. While one can certainly argue that this is consistent with the basic concept of Unicode, where punctuation marks are assigned codes in a different region, the need to switch keyboards can be frustrating.

  It is never a good policy to require applications to handle text formatting by themselves. At least a meaningful API should be available which can take a Unicode string and render it on the display in a predictable fashion. This is very difficult to manage unless we have a one code one glyph situation. Perhaps a one code many glyphs situation is also not difficult to deal with, since the one code can really be that of a syllable. Unfortunately, Unicode has not taken this route.

  In Microsoft's implementation of Unicode support for Indian languages, it appears that the calculation of the widths of displayed glyphs has some error, particularly with zero width glyphs. It is clear that the responsibility for correct display rests with the application and not the shaping engine. Shown below are screen shots of the same text in different applications: Word, Wordpad and Netscape. One wonders how this has come about! Zero width glyphs from standard fonts (in this case a True type font from IIT Madras) are rendered correctly under Word, but gaps are seen in Wordpad. Wordpad correctly interprets the widths of characters in the Latha font, which is Microsoft's own font, but Word seems to suffer, especially with zero width space characters. If you are intrigued by the clean text typed in Windows 2000 (Devanagari and Tamil text), just look at the simple multilingual text editor developed at IIT Madras.

The adequacy of True type fonts

  Supporting user interfaces in Indian languages is entirely feasible with Unicode and True type fonts. It will be necessary to place many glyphs side by side to display a syllable, but this can be managed with appropriate zero width glyphs. The application must now parse the input string to identify syllables. A significant amount of simplification can be effected if we agree to restrict syllable formation to a limited set of, say, about six hundred syllables (which, by the way, will cover most of our requirements in respect of our languages). The mapping from a syllable to its glyphs may then be accomplished through simple table lookup, as opposed to the complex rules built into Uniscribe. The multilingual software from IIT Madras has established that this approach is not only viable but very simple to implement. Syllable formation is effected at the input stage itself, during data entry, and each syllable is stored internally as a fixed size code (two bytes).
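
  A sketch of the table driven rendering, in Python, is given below. The two byte syllable codes and the glyph values are invented for illustration; the point is only that rendering reduces to one dictionary lookup per syllable.

    SYLLABLE_TO_GLYPHS = {
        0x1A07: [0xE121],          # a syllable drawn with a single ligature glyph
        0x1A08: [0xE042, 0xE0B3],  # a syllable drawn as base glyph + zero width matra
    }

    def render(syllable_codes):
        glyphs = []
        for code in syllable_codes:
            glyphs.extend(SYLLABLE_TO_GLYPHS.get(code, [0xE000]))  # 0xE000: fallback
        return glyphs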

  It is relatively easy to write parsing applications which can handle dynamically entered strings. The Acharya web site hosts a demo page where the viewer can verify that a sequence of consonants and vowels can be input to generate syllables dynamically and display them in any script. Syllables may also be standardized by collectively taking all the basic sounds from each language and working with a superset of vowels and consonants.
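
  Syllable boundary detection itself is straightforward, as the Python sketch below suggests: consonants accumulate as long as each is followed by a halant, and a vowel sign closes the syllable. The character classes are simplified assumptions covering Devanagari only.

    CONSONANTS = set(range(0x0915, 0x093A))  # Devanagari ka .. ha
    MATRAS = set(range(0x093E, 0x094D))      # dependent vowel signs
    HALANT = 0x094D

    def syllables(codes):
        current = []
        for c in codes:
            if c in CONSONANTS and current and current[-1] != HALANT:
                yield current       # a consonant not preceded by a halant
                current = [c]       # starts a new syllable
            else:
                current.append(c)
                if c in MATRAS:     # a matra closes the syllable
                    yield current
                    current = []
        if current:
            yield current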

  The text rendering process can be simplified considerably if we agree to deal with a finite set of syllables, as opposed to allowing arbitrarily long ones. Over the years one has seen that almost all the text ever prepared in India involves just about 500-800 syllables, depending on the language, which have to be shown with special ligatures. It is therefore sufficient if this set is catered to. Restricting the set of syllables gives us the flexibility to use tables to map the syllables to glyphs. The table lookup can also be effected dynamically, giving us the additional flexibility of using alternate forms of display for syllables.

  If we carefully design our True type fonts, we can create a multilingual font supporting all the important scripts (nine of them) and place the glyphs in the region E000-E9FF, where each script will have close to 250 glyphs. We can include in this font many common glyphs, such as punctuation marks and special symbols, which we could not manage in a regular True type font for want of glyph positions. A comparable Open type font would require at least 650 glyphs per script, and one can see that it would be difficult to manage such a huge font, let alone design one.
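
  The arithmetic of such a layout is simple, as the Python lines below show: E000-E9FF provides 2560 code positions, so each of nine scripts can take a block of 256. The ordering of the scripts here is an assumption made for illustration.

    SCRIPT_BASE = {name: 0xE000 + i * 0x100 for i, name in enumerate(
        ["devanagari", "bengali", "gurmukhi", "gujarati", "oriya",
         "tamil", "telugu", "kannada", "malayalam"])}

    def glyph_code(script, index):        # index: 0..255 within the script's block
        return SCRIPT_BASE[script] + index

    print(hex(glyph_code("tamil", 0x21)))  # 0xe521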

  True type fonts also have other advantages. The rendering process is not tied to the availability of a specific font, so long as the glyphs are present at the expected locations. We can prepare text and have it rendered in any font of our choice in which the glyphs occupy the specified locations. With Open type fonts, unless the input conforms to the assigned Unicode values rather than the glyph codes, the characters will not be rendered right.

  If we create text in a Microsoft application that allows us to type in Unicode values in the Private Use area (E000-F7FF), we will not be able to view the text with the Mangal font even though it has glyphs in this range. There would be greater flexibility if an application could correctly identify the glyph codes and use any True type font that renders the glyphs right. This is how we currently display text in many Win9x applications, where we generate ASCII text but view it with a Devanagari or Tamil font. While it is true that a shaping engine is always required to render Unicode in Indian languages, the shaping engine should permit the flexibility of using any compatible font. It does not appear that this is possible as of now, since there is only one Open type font available for Devanagari and Uniscribe is tied to it.

One can summarize the observations as follows.

  What Microsoft (and perhaps other developers as well) has done is to demonstrate that text in Indian languages can be typed into any application. While it may appear that this is all one would require to run an application with Indian language support, the truth is that none of the applications can correctly interpret the entered text to effect further processing. In other words, localization, the ability to support a truly interactive user interface where user commands are correctly and consistently interpreted across all applications, is something that has not been viewed seriously. When this does happen, we would not be surprised if the application is just monolingual and script specific.

  The use of Unicode (in respect of Indian languages) to truly bring in localization does not seem to offer much promise. While one cannot deny that someone may actually accomplish this in spite of the problems of multibyte codes, it is becoming clear to many that developers will find it easier to provide script and problem specific solutions by handling the script related issues themselves, for there is no doubt that they can handle the linguistic aspects with confidence.
