Linguistic processing (computation) with Indian languages.

In the context of computing with Indian languages, the basic quantum of information to be processed is a syllable. The writing systems of India are based on syllables. Computation with text in Indian languages is hence a question of working with syllables. The representation of a syllable in the computer assumes significance in this context.

Text processing algorithms have generally been written for English, since most computing has been based on the English language and most of the information available electronically is in English. These algorithms work on one character at a time. Text is represented as a string of characters specified through codes (typically ASCII) for the letters of the alphabet and special symbols. For example, an algorithm to check whether a word is a palindrome simply reverses the string and compares it with the original. The length of a word is specified as the number of characters in the word.
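The character-at-a-time approach can be illustrated with a minimal sketch. Python is used here only for illustration, and the function name is hypothetical; the point is that both the palindrome test and the word length operate directly on character codes.

```python
def is_palindrome(word: str) -> bool:
    """Return True if the word reads the same forwards and backwards.

    Works one character at a time, which is adequate for ASCII English text.
    """
    return word == word[::-1]


# Word length is simply the number of characters in the string.
word = "level"
print(is_palindrome(word), len(word))
```

This works for English because each letter is one code and one unit of writing. The same code applied to text in an Indian script compares individual codes rather than syllables.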

The approach required for Indian languages has to be different since all processing has to be done with syllables. Text in any Indian language is reckoned only in this manner and syllable identification is critical to determining the linguistic content. Therefore the algorithm to identify a syllable gains significance.
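The following is a simplified sketch of syllable identification for Devanagari text stored in Unicode. It is not the IITM algorithm. It assumes a syllable is either an independent vowel or a run of consonants joined by the virama, optionally followed by one vowel sign, and then by any modifiers (candrabindu, anusvara, visarga). The character ranges, the rule itself and the function names are assumptions made for illustration; Vedic marks, ZWJ/ZWNJ and other scripts are ignored.

```python
import re

# Simplified Devanagari character classes (Unicode code points).
CONSONANT = "[\u0915-\u0939\u0958-\u095F]\u093C?"  # consonant, optional nukta
VIRAMA = "\u094D"                                   # suppresses the inherent vowel
VOWEL = "[\u0904-\u0914\u0960\u0961]"               # independent vowels
MATRA = "[\u093E-\u094C\u0962\u0963]"               # dependent vowel signs
MODIFIER = "[\u0900-\u0903]"                        # candrabindu, anusvara, visarga

# One syllable: (consonant + virama)* consonant [vowel sign or final virama],
# or an independent vowel; either form may carry trailing modifiers.
SYLLABLE = re.compile(
    f"(?:(?:{CONSONANT}{VIRAMA})*{CONSONANT}(?:{MATRA}|{VIRAMA})?|{VOWEL}){MODIFIER}*"
)


def split_syllables(text):
    """Return the syllables found in text; other characters are skipped."""
    return [m.group() for m in SYLLABLE.finditer(text)]


def is_syllable_palindrome(word):
    """Palindrome test that compares syllables instead of code points."""
    units = split_syllables(word)
    return units == units[::-1]


# The word "Sanskrit" written in Devanagari, given as escapes for clarity.
word = "\u0938\u0902\u0938\u094D\u0915\u0943\u0924"
print(len(word), len(split_syllables(word)))
```

For this word the string holds seven code points but only three syllables, so length and reversal give different answers depending on the unit chosen. A word such as "\u092E\u093E\u092E\u093E" is a palindrome when compared syllable by syllable but not when compared code by code.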

Regrettably, the current approaches to representing text in Indian languages do not lend themselves to easy implementation of text processing algorithms. Virtually no coding scheme has been accepted as a standard, though ISCII, Unicode and even font-based schemes are constantly put forward.
While ISCII and Unicode have been shown to be viable in implementations, they suffer from a fairly serious problem: they do not represent syllables unambiguously.
The pages at this site that discuss these issues in detail convey the problems of using variable-length codes to represent syllables.

Leaving these problems aside, the following are representative of the computations one would carry out from a linguistic point of view.

String processing and pattern matching (Regular Expressions)
Indexing text and generating concordances
Search applications (including searches on the web)
Database applications (MySQL, SQL, etc.)
Grammatical Analysis of text (e.g., Morphological Analysis) 
Parsing Text and Translation
Taggers and generating Linguistic Corpora
Frequency of occurrences of syllables
Transliteration across scripts
On-the-fly conversion of text into different formats (images, PDF, etc.)

Text processing applications available with the IITM Software.

The syllable frequency count application is a particularly useful one. This specially written application takes care of alternate forms of writing a syllable (forms that are linguistically equivalent but differ in appearance). The results of using the application on different texts in Sanskrit and Tamil can be seen in the linked page.
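A minimal sketch of such a count is given below. It is not the IITM application, and the source does not list the alternate forms that application handles. As a stand-in, the sketch folds together spellings that Unicode treats as canonically equivalent, using NFC normalization from Python's standard library. It assumes the text has already been split into syllables; the function name is hypothetical.

```python
import unicodedata
from collections import Counter


def syllable_frequencies(syllables):
    """Count syllables, treating canonically equivalent spellings as one."""
    # NFC maps every canonically equivalent code sequence to a single form,
    # so two encodings of the same syllable land on the same counter key.
    return Counter(unicodedata.normalize("NFC", s) for s in syllables)


# The same syllable (ka with nukta) encoded in two ways, plus one other syllable:
# the single code U+0958, and U+0915 followed by the nukta U+093C.
sample = ["\u0958", "\u0915\u093C", "\u0924"]
counts = syllable_frequencies(sample)
print(counts.most_common())
```

Here the two encodings of the first syllable are counted together, so the result holds two distinct syllables rather than three. Equivalences that are linguistic rather than encoding-level would need a mapping of their own, applied in the same place as the normalization step.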

The applications which perform on-the-fly conversion of text into different formats are very useful for serving content on the web, where the most appropriate format for the content can be decided before it is sent to the browser. The "Learn Sanskrit through self study" lessons at this site have become popular all over the world since they can be viewed on almost any browser. Here the lessons are sent in the form of images, converted on the fly when the browser requests a page containing Devanagari text.

Search applications are easy to implement using the software base developed at IITM. The fixed-size, syllable-level code makes string processing much simpler. In fact, conventional indexing software such as Swish-E can be used directly to index local language text prepared with the IITM editor or equivalent software.
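Why a fixed-size code per syllable simplifies string processing can be seen in the sketch below. The code numbering is hypothetical and is not the IITM code assignment: each distinct syllable is simply given the next unused 16-bit number. The class and function names are likewise illustrative, and the input is assumed to be already split into syllables.

```python
from array import array


class SyllableCodec:
    """Assigns each distinct syllable one 16-bit code (hypothetical numbering)."""

    def __init__(self):
        self.code_of = {}      # syllable -> code
        self.syllable_of = []  # code -> syllable

    def encode(self, syllables):
        """Return an array holding one unsigned 16-bit code per syllable."""
        codes = array("H")
        for s in syllables:
            if s not in self.code_of:
                self.code_of[s] = len(self.syllable_of)
                self.syllable_of.append(s)
            codes.append(self.code_of[s])
        return codes

    def decode(self, codes):
        """Map the codes back to the syllables they stand for."""
        return [self.syllable_of[c] for c in codes]


def find(haystack, needle):
    """Return the syllable index of the first match of needle, or -1."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


codec = SyllableCodec()
# Three syllables of the word "Sanskrit" in Devanagari, given as escapes.
text = codec.encode(["\u0938\u0902", "\u0938\u094D\u0915\u0943", "\u0924"])
query = codec.encode(["\u0938\u094D\u0915\u0943", "\u0924"])
print(len(text), find(text, query))
```

Because every element of the array is exactly one syllable, the length of the text, indexing and substring search all work in syllable units with the same simple loops used for ASCII strings.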

Last updated on 10/30/12