 
Text processing with Indian languages.

In text processing with Indian languages, the basic unit of information is the syllable. The writing systems of India are based on syllables, so computing with text in Indian languages means working with syllables. How a syllable is represented in the computer therefore assumes significance.

Text processing algorithms have generally been written for English, since most of computing has been based on the English language and the information available electronically is mostly in English. These algorithms work on one character at a time. Text is represented as a string of characters specified through codes (typically ASCII) for the letters of the alphabet and special symbols. For example, an algorithm to check whether a word is a palindrome simply reverses the string and tries a match with the original. The length of a word is specified as the number of characters in the word.
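As a minimal sketch of the character-at-a-time approach described above, the two operations can be written as follows. The function names are chosen here for illustration only.

```python
def is_palindrome(word: str) -> bool:
    """Character-level check: reverse the string and compare with the original."""
    return word == word[::-1]


def word_length(word: str) -> int:
    """Length of a word counted in characters (one code per letter)."""
    return len(word)


print(is_palindrome("level"))      # True
print(is_palindrome("syllable"))   # False
print(word_length("level"))        # 5
```

This works for English because each letter is one code. The same functions applied to Indian language text would operate on codes, not on syllables.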

The approach required for Indian languages has to be different since all processing has to be done with syllables. Text in any Indian language is reckoned only in this manner and syllable identification is critical to determining the linguistic content. Therefore the algorithm to identify a syllable gains significance.
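The following is a rough sketch of syllable identification for Devanagari text stored as Unicode, followed by a palindrome check and a length count done on syllables. The pattern is a simplification assumed for this example (consonant clusters joined by the virama, an optional vowel sign, optional modifiers); it is not the algorithm used in the IITM software and does not cover every construct of the script. All names are hypothetical.

```python
import re

# Simplified building blocks for Devanagari (an assumption of this sketch).
_CONS = "[\u0915-\u0939\u0958-\u095F]\u093C?"   # consonant with optional nukta
_VIRAMA = "\u094D"                               # vowel killer joining consonants
_MATRA = "[\u093E-\u094C\u0962\u0963]"           # dependent vowel signs
_MOD = "[\u0900-\u0903]"                         # candrabindu, anusvara, visarga
_VOWEL = "[\u0904-\u0914\u0960\u0961]"           # independent vowels

SYLLABLE = re.compile(
    f"(?:{_CONS}{_VIRAMA})*{_CONS}(?:{_MATRA}|{_VIRAMA})?{_MOD}*"
    f"|{_VOWEL}{_MOD}*"
)


def split_syllables(text: str) -> list[str]:
    """Return the syllables found in text; other characters are skipped."""
    return SYLLABLE.findall(text)


def is_syllable_palindrome(word: str) -> bool:
    """Reverse the sequence of syllables, not the sequence of codes."""
    syllables = split_syllables(word)
    return syllables == syllables[::-1]


# "Malayalam" written in Devanagari: ma, la, yaa, la, ma
word = "\u092E\u0932\u092F\u093E\u0932\u092E"

print(len(word))                     # 6 codes
print(len(split_syllables(word)))    # 5 syllables
print(word == word[::-1])            # False: reversing codes misplaces the vowel sign
print(is_syllable_palindrome(word))  # True
```

The word reads the same in both directions syllable by syllable, yet the character-level check fails because reversing the codes detaches the vowel sign from its consonant.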

Regrettably, the approaches to representing text in Indian languages do not lend themselves to easy implementations of text processing algorithms. There are virtually no universally accepted standards for coding schemes, although ISCII, Unicode and even font-based schemes are frequently put forward.

While ISCII and Unicode have been shown to be viable in implementations, they have fairly serious problems in representing syllables unambiguously. For effective text processing it is desirable to use codes of fixed size for a syllable. Fixed length codes lend themselves to easy processing through "Regular Expression Matching", which is the very basis of text processing. Both ISCII and Unicode represent a syllable with a variable number of codes. Moreover, with these codes the display of a syllable involves decision making in the application handling the text. As a result, the same set of syllables gets rendered differently by different applications. When the display of a syllable cannot be traced back to the syllable without ambiguity, string processing of displayed text suffers.
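To illustrate the difference between variable length and fixed size representations, the sketch below counts the Unicode codes in three Devanagari syllables and then assigns each syllable one fixed-size integer. The code table built here is hypothetical and is not the IITM syllable encoding; the 16-bit width is likewise an assumption made for the example.

```python
from array import array


def encode_fixed(syllables, table):
    """Map each syllable to one integer code, extending the table as needed.

    The table is a plain dict built on the fly (hypothetical, for illustration).
    array("H") stores unsigned short integers, 2 bytes each on common platforms,
    and raises OverflowError if a code exceeds the range of that type.
    """
    codes = array("H")
    for syllable in syllables:
        if syllable not in table:
            table[syllable] = len(table)   # next unused code
        codes.append(table[syllable])
    return codes


# ka, skri, stree: one syllable each, but a different number of Unicode codes
syllables = [
    "\u0915",
    "\u0938\u094D\u0915\u0943",
    "\u0938\u094D\u0924\u094D\u0930\u0940",
]
print([len(s) for s in syllables])   # [1, 4, 6]

table = {}
codes = encode_fixed(syllables, table)
print(len(codes))                    # 3, one code per syllable
```

With one code per syllable, the n-th syllable of a text is simply the n-th element of the array, and matching can proceed one element at a time just as it does for English characters.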

Leaving these problems aside, the following are representative of the computations one would perform from a linguistic point of view.

String processing and pattern matching (Regular Expressions)

Indexing text and generating concordances

Search applications (including searches on the web)

Database applications (MySQL, SQL, etc.)

Grammatical analysis of text (e.g., morphological analysis)

Parsing text and translation

Taggers and generating linguistic corpora

Frequency of occurrences of syllables

Transliteration across scripts

On-the-fly conversion of text into different formats (images, PDF, etc.)

Web interfaces to Indian languages

Text processing applications available with the IITM Software.

The syllable frequency count application is a particularly useful one. This specially written application takes care of alternate forms for writing a syllable (linguistically equivalent but differing in appearance). The results of using the application on different texts in Sanskrit and Tamil can be seen in the linked page.
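The idea of counting syllables while folding alternate written forms together can be sketched as below. This is not the IITM application. It assumes the text has already been split into syllables, and it assumes, purely for illustration, that two Unicode forms are equivalent when they differ only by a zero width joiner or non-joiner, or by canonical composition (for example a precomposed nukta consonant versus consonant plus nukta). The function names are hypothetical.

```python
import unicodedata
from collections import Counter

ZWJ, ZWNJ = "\u200D", "\u200C"   # joiners that change appearance, not the syllable


def canonical(syllable: str) -> str:
    """Reduce a syllable to one agreed form (assumed equivalences, see above)."""
    stripped = syllable.replace(ZWJ, "").replace(ZWNJ, "")
    return unicodedata.normalize("NFC", stripped)


def syllable_frequencies(syllables) -> Counter:
    """Count syllables after folding alternate forms together."""
    return Counter(canonical(s) for s in syllables)


sample = [
    "\u0915\u094D\u0937",         # ksha
    "\u0915\u094D\u200D\u0937",   # ksha written with a zero width joiner
    "\u0915",                     # ka
    "\u0958",                     # qa, precomposed
    "\u0915\u093C",               # qa, consonant followed by nukta
]
freq = syllable_frequencies(sample)
print(freq["\u0915\u094D\u0937"])   # 2
print(freq["\u0915\u093C"])         # 2
print(freq["\u0915"])               # 1
```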

The applications which perform on-the-fly conversion of text into different formats will be very useful for serving content on the web, where the most appropriate format for the contents can be decided before sending them to the browser. The "Learn Sanskrit through self study" lessons at this site have become popular all over the world since they can be viewed on almost any browser. Here the lessons are sent in the form of images, converted on the fly when the browser requests a page containing Devanagari text.

Search applications are easy to implement using the software base developed at IITM. The fixed size syllable level code greatly simplifies string processing. In fact, conventional indexing software such as Swish-E can be used directly to index the local language text prepared by the IITM editor or equivalent software.
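A small sketch of why searching on syllables is simpler than searching on raw codes: the example below looks for the syllable "ka" in a Devanagari word that contains "ki" but not "ka". It assumes the word has already been split into syllables, and the function name is hypothetical; it does not reproduce the IITM software or Swish-E.

```python
def find_syllables(haystack, needle):
    """Return the start positions where the syllable sequence needle occurs."""
    n = len(needle)
    return [i for i in range(len(haystack) - n + 1) if haystack[i:i + n] == needle]


# The word "kitaab": syllables ki, taa, ba
text = "\u0915\u093F\u0924\u093E\u092C"
syllables = ["\u0915\u093F", "\u0924\u093E", "\u092C"]

# Searching on codes reports a match for "ka" inside "ki" ...
print("\u0915" in text)                              # True
# ... while searching on syllables does not.
print(find_syllables(syllables, ["\u0915"]))         # []
print(find_syllables(syllables, ["\u0924\u093E"]))   # [1]
```

When every syllable is a single unit, a match can never start or end in the middle of a syllable, which is the property that lets conventional string tools be reused.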
 


Last updated on 08/17/20