Indexing text and generating concordances

The syllable-level coding scheme used in the IITM software lends itself to direct use with standard text-indexing algorithms. Text is usually indexed with hashing methods that handle collisions between keys.

Standard indexing libraries such as GNU dbm (gdbm) could be used to index the local language text created by the Multilingual Editor or any other appropriate application. The indexing application breaks the text into words and discards those that appear in a specified exclusion list. The sixteen-bit codes in the word are converted into a three-byte ASCII representation before being indexed, and the reverse conversion recovers the original syllable-based representation when matches are retrieved.
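The conversion described above can be sketched in Python. This is only an illustration under stated assumptions: the source does not give the actual three-byte layout, so the sketch assumes that each sixteen-bit code is split into groups of 4, 6 and 6 bits, each shifted into the printable ASCII range. The names encode_code, decode_code, encode_word, decode_word and index_words are hypothetical, and the words are assumed to be already separated into sequences of codes.

```python
# Sketch only: the 4 + 6 + 6 bit layout is an assumption, not the IITM format.

OFFSET = 0x30  # assumed base character ('0'); keeps every byte printable


def encode_code(code):
    """Map one 16-bit syllable code to three printable ASCII characters."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("syllable code must fit in 16 bits")
    parts = (code >> 12, (code >> 6) & 0x3F, code & 0x3F)
    return "".join(chr(OFFSET + part) for part in parts)


def decode_code(triple):
    """Reverse encode_code: three ASCII characters back to a 16-bit code."""
    if len(triple) != 3:
        raise ValueError("expected exactly three characters")
    high, middle, low = (ord(ch) - OFFSET for ch in triple)
    return (high << 12) | (middle << 6) | low


def encode_word(codes):
    """Build the ASCII key for a word given as a sequence of 16-bit codes."""
    return "".join(encode_code(code) for code in codes)


def decode_word(key):
    """Recover the list of 16-bit codes from an ASCII key."""
    return [decode_code(key[i:i + 3]) for i in range(0, len(key), 3)]


def index_words(words, excluded):
    """Map each ASCII key to the positions of its word, skipping excluded words."""
    index = {}
    for position, word in enumerate(words):
        word = tuple(word)
        if word in excluded:
            continue
        index.setdefault(encode_word(word), []).append(position)
    return index
```

A Python dictionary stands in here for the hashed store. With a dbm-style library, the same ASCII keys could be used as database keys, since they contain only printable characters.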

IITM has developed its own indexing software, which can index a set of files and create a concordance map and a sorted list of words. The front end for this would be a program that uses the IITM local language library to interact with the user.

Alternatively, the popular Swish-E application, used extensively for indexing on the web, may be used.

The indexing software developed at IITM is used as follows.

Create the required local language files and organize them into a meaningful directory structure. The IITM Multilingual Editor could be used for this purpose, or conversion utilities could be used to convert Indian language text in other formats into the .llf form.

Create a list containing the pathname of each file. This can be obtained from a recursive listing of the directory, retaining only the path names.

Run the indexing program by specifying the list on the command line.

Run the utility program to generate concordance information for each word.

Additionally, run other utilities to generate word lists and sort them.
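The steps above can be sketched in Python. This is not the IITM software, whose command names and file formats are not given here. It is a minimal illustration that assumes files carry a .llf extension and that each document has already been read into a list of words, each word being a tuple of sixteen-bit syllable codes. The names list_llf_files, build_concordance and sorted_word_list are hypothetical.

```python
import os
from collections import defaultdict


def list_llf_files(root, extension=".llf"):
    """Recursively list files under root, keeping only the path names."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extension):
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def build_concordance(documents):
    """Map each word to every (document name, word position) where it occurs.

    documents: dict of name -> list of words, each word a tuple of codes.
    """
    concordance = defaultdict(list)
    for name in sorted(documents):
        for position, word in enumerate(documents[name]):
            concordance[tuple(word)].append((name, position))
    return dict(concordance)


def sorted_word_list(concordance):
    """Return the distinct words ordered by their numeric syllable codes."""
    return sorted(concordance)
```

Sorting by numeric code is an assumption of the sketch. Whether that order matches the conventional lexical order of a given language depends on how the codes are assigned.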
 

The search applications hosted at this site (Bhagavadgita and Tirukkural) have been generated using the above steps.


Those interested in using the IITM indexing applications may contact the lab for binaries, or for sources that can be compiled under Linux or Microsoft Windows (Cygwin).