Deivathin Kural Pages - Rationale behind Concordance

The approach to creating the Concordance

When an index is generated for a book, it takes the form of an alphabetical list in which each word is accompanied by the places where it appears in the book (typically page numbers). When a concordance has to be generated for an electronic document, page numbers do not make sense. Web applications that provide search features usually identify the web page in which the word appears instead.

The volumes of Deivathin Kural are available as printed books, and the chapters of each volume are also served in electronic form from the known authoritative site www.kamakoti.org. Generating a concordance therefore poses the difficulty of deciding whether to refer to a page number or to a web page, depending on the needs of the person searching for information.

This has been handled at this site as explained below.

Words in common use are not included in an index, as the number of occurrences of such words would be huge. It should be kept in mind that the number of words in each volume of Deivathin Kural ranges from 1,30,000 to as many as 2,10,000. One therefore has to provide ways to restrict the search to special words and to words that are emphasized in the text.

The group attempting to handle this decided to take an approach that would reflect reasonably well the nature of the query that someone has in mind. Typically one might be looking for words from scriptures, names of places, etymological derivations, proverbs and the like. These would have to be culled out from the huge list in each volume.

The selection mechanism suited for the above was based on the following assumptions. The assumptions are reasonable but may not conform to accepted norms followed by linguistic experts.

Words with fewer than 3 aksharas are most likely to be common words and may not merit a search unless they happen to be special, say as seen in quotation marks when dealing with topics on grammar.
Words from scriptures, historical anecdotes, names of Kings and names of places generally tend to have more than 6 aksharas.
Words between 3 and 6 aksharas may include many common words and possibly words of interest in specific contexts.
Words in English are likely to be significant when they are used in the text to explain or give a meaning to a word in Tamil or other languages. Besides, these are essential when references are made to authoritative sources written in English.
Words or short phrases in quotes also become candidates for search, as the quotation marks imply some sort of importance for their presence in a sentence.
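
The assumptions above can be sketched in Python as a simple classifier. This is only an illustration, not the programs actually used: it works on Unicode text rather than on the IIT Madras coding scheme, and it approximates an akshara as one base character together with the combining marks (vowel signs, pulli) that follow it. Conjuncts such as ksha would need extra handling. The function names are hypothetical, and the cut-off between short and long words follows the 3 to 6 and 7 or more bands used for the generated lists.

```python
import unicodedata


def count_aksharas(word: str) -> int:
    """Approximate akshara count (assumption): every code point that is
    not a combining mark starts a new akshara."""
    return sum(1 for ch in word if not unicodedata.category(ch).startswith("M"))


def classify(word: str) -> str:
    """Place a word into one of the bands described in the assumptions."""
    if word.isascii() and word.isalpha():
        return "english"
    n = count_aksharas(word)
    if n < 3:
        return "common"   # most likely a common word
    if n <= 6:
        return "short"    # 3 to 6 aksharas
    return "long"         # 7 or more aksharas


print(count_aksharas("குரல்"))   # 3
print(classify("குரல்"))         # short
print(classify("Vedas"))         # english
```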

Keeping all this in mind, computer programs were written to analyse the text. The source for the analysis of each volume was an electronic version of the volume, coded in a special way that identifies each akshara uniquely and allows it to be processed using conventional methods of string matching. The coding scheme was developed at IIT Madras during the 1990s to provide fast processing of multilingual text in Indian languages. The programs generated the following lists for each volume.

Full list of all the words in each volume
List of all words with 7 or more aksharas
List of all words with 3 to 6 aksharas
List of all quoted words and short quoted phrases having 2 or 3 words
List of all quoted longer phrases having 3-20 words
List of all English words
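
As a rough sketch of how the quoted and English lists might be pulled out of a text, the Python fragment below uses regular expressions on Unicode text. It is an assumption-laden illustration rather than the original programs: the patterns, the function names and the handling of straight and curly double quotes are all hypothetical, and the 20-word upper limit is taken from the list above.

```python
import re

# Straight or curly double quotes around a phrase (assumed quoting styles).
QUOTED = re.compile(r'"([^"]+)"|“([^”]+)”')
# Runs of Latin letters; Tamil letters never match this class.
ENGLISH = re.compile(r"[A-Za-z]+")


def quoted_phrases(text: str, max_words: int = 20) -> list[str]:
    """Return quoted words and phrases of at most max_words words."""
    phrases = []
    for match in QUOTED.finditer(text):
        phrase = (match.group(1) or match.group(2)).strip()
        if 1 <= len(phrase.split()) <= max_words:
            phrases.append(phrase)
    return phrases


def english_words(text: str) -> list[str]:
    """Return every English word found in the text, in order."""
    return ENGLISH.findall(text)


sample = 'இது "dharma" என்று Veda சொல்கிறது'
print(quoted_phrases(sample))   # ['dharma']
print(english_words(sample))    # ['dharma', 'Veda']
```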

Duplicate words within a paragraph were eliminated. For short words, words sharing the same first 3 aksharas were treated as variations on a root word, and only a few (probable) root words were retained. For long words, root words were identified by matching the first four aksharas. This is adequate for the purpose of search, since all the words matching a root are returned when searching through the corresponding full list of words in a volume. This forms the first stage of filtering of the lists.
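
The grouping by leading aksharas can be illustrated with the short Python sketch below. It is not the original program: it splits Unicode text into approximate aksharas (a base character plus its combining marks), and it keeps the shortest word of each group as the probable root, which is purely an assumption since the text does not say how the retained roots were chosen. All names are hypothetical.

```python
import unicodedata
from collections import defaultdict


def split_aksharas(word: str) -> list[str]:
    """Split a word into approximate aksharas: each combining mark is
    attached to the base character that precedes it."""
    units: list[str] = []
    for ch in word:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


def group_by_root(words: list[str], prefix_len: int) -> dict[str, list[str]]:
    """Group words by their first prefix_len aksharas, dropping duplicates."""
    groups: dict[str, list[str]] = defaultdict(list)
    for word in dict.fromkeys(words):          # removes duplicates, keeps order
        key = "".join(split_aksharas(word)[:prefix_len])
        groups[key].append(word)
    return dict(groups)


def probable_roots(words: list[str], prefix_len: int) -> list[str]:
    """Keep one word per group; choosing the shortest is an assumption."""
    return [min(group, key=len) for group in group_by_root(words, prefix_len).values()]


words = ["தெய்வம்", "தெய்வத்தின்", "குரல்", "தெய்வம்"]
print(group_by_root(words, 3))
print(probable_roots(words, 3))
```

With prefix_len set to 3 this corresponds to the treatment of short words, and with 4 to that of long words.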

After this filtering, the list of long words is reduced to about 15,000-20,000 and the list of short words to about 30,000-35,000 for each volume. Quoted words were seen to number about 1,000-2,000 depending on the volume, and English words were typically in the range 500-1,000.

The filtered lists of long words and short words were then scanned manually to identify the important ones from the point of view of what one might want in a quick reference list. Manual scanning selected about 1,500-2,500 long words and about 2,000-3,000 short words for each volume. These were identified as the most important words one is likely to search for.

This manually prepared list merits printing in hard copy form as the Concordance of most important words. The list is made available for download for each volume.

During the manual scanning process, typographical errors seen in the words were tagged but not corrected. It is therefore likely that words with spelling errors may be returned for a query.

The web application returns matching words based on the options chosen from the drop down menu. Search by volume may appear redundant when all the qualifying words from all the volumes have been included in the concordance. In practice, however, many short words return tens of matches if all volumes are included. Very long lists returned by the application require one to scroll through the words, and this can be a bit tedious. Search by volume is helpful here.

The manually filtered lists of long words (and, separately, of short words) from all the volumes were combined into a global set, and this is what the "ALL-7" option selects for the Filtered Long (or Filtered Short) word type. This selection includes approximately 20,000 words for each option, i.e., Long or Short.

A typical search may be effected as follows.

Select the ALL-7 option for long or short words and submit the query. If the returned list satisfactorily reflects the needed results, one need not go further.

The next search would be a volume-wise search of the filtered root words, and this is likely to return many more words. If required, the search could be extended to the full list of Long and Short words within a volume, resulting in many more matches.

If matches are not returned for a specific query from either of the Filtered lists for a volume, one can always look for the word from the much larger and full set of each type. It is quite likely that with this full set, multiple occurrences of the same word will be returned as the full set will include all the words in the volume matching the word type (Long or Short words). In these sets one will see duplications as well as many different variations for a root word, mostly constituting commonly spoken words.
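
The sequence of searches described above can be summarised in a small Python sketch. It is only a model of the workflow, not the web application itself: the function and tier names are hypothetical, matching by prefix is an assumption, and the automatic fall-through to the next list stands in for the user choosing the next option by hand. Transliterated words are used as stand-ins for Tamil text.

```python
def tiered_search(query, global_filtered, volume_filtered, volume_full):
    """Return (tier name, matches) from the first list that yields a match."""
    tiers = [
        ("ALL-7 filtered", global_filtered),
        ("volume filtered", volume_filtered),
        ("volume full", volume_full),
    ]
    for name, words in tiers:
        hits = [word for word in words if word.startswith(query)]
        if hits:
            return name, hits
    return "none", []


global_filtered = ["dharma", "veda"]
volume_filtered = ["dharma", "dharmam", "yoga"]
# The full list keeps duplicates and variations, as noted above.
volume_full = ["dharma", "dharmam", "yoga", "yogam", "yogam"]

print(tiered_search("dhar", global_filtered, volume_filtered, volume_full))
# ('ALL-7 filtered', ['dharma'])
print(tiered_search("yogam", global_filtered, volume_filtered, volume_full))
# ('volume full', ['yogam', 'yogam'])
```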

To avoid going through all the seven volumes for Short or Long words, one could attempt an advanced search across all the volumes for both Long and Short words. Obscure words very rarely seen in common use could be searched this way. Usually English words written in Tamil will merit search in this manner. The advanced search facility is on a separate page. It returns results (Long as well as Short words) for the query from all the seven volumes, drawn from the filtered sets.

The advanced search may be used to check if a word is present in any of the volumes. One can then return to the standard search to find multiple occurrences in a volume.

In the search application, the drop down menu for the word type allows the selection of the list of interest. There are six choices: Filtered Long, Filtered Short, Quoted, Full list of Long words, Full list of Short words and English words. By selecting the full lists one after another, one would be searching for a match among about 60,000-90,000 words, depending on the volume selected.

The table in the page on concordance has the details of the different word types in each volume.

Concordance Generation

This page discusses the factors taken into consideration while generating the Concordance for the volumes of Deivathin Kural. The approach may not conform to conventional indexing of documents or web pages. Viewers are encouraged to offer their views on this.

Concordance here implies that the prepared list relates words to their occurrences in an essay in one of the volumes of Deivathin Kural. The search here does not extend to searching for phrases or conditional searches as one might see in search engines on the web.
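
A concordance in this sense can be pictured as a mapping from each word to the essays in which it occurs. The Python sketch below shows one possible shape for such a mapping. The data layout (a dictionary keyed by volume number and essay title), the whitespace tokenisation and the sample essay titles are assumptions made for illustration, not a description of the site's implementation.

```python
from collections import defaultdict


def build_concordance(essays: dict) -> dict:
    """Map each word to the sorted list of (volume, essay) pairs containing it.

    essays maps a (volume number, essay title) pair to the essay text.
    """
    index = defaultdict(set)
    for ref, text in essays.items():
        for token in text.split():
            word = token.strip('.,;:!?"()')
            if word:
                index[word].add(ref)
    return {word: sorted(refs) for word, refs in index.items()}


essays = {
    (1, "Essay A"): "veda dharma.",
    (2, "Essay B"): "dharma yoga",
}
concordance = build_concordance(essays)
print(concordance["dharma"])   # [(1, 'Essay A'), (2, 'Essay B')]
```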

The structure of words in Tamil is based on the principle of adding prefixes and suffixes to roots to derive variations. This property, characteristic of agglutinative languages, is useful while searching for words, since the algorithms for word selection can be written to match the roots.

The algorithms written for concordance generation for Deivathin Kural are based on the scheme of text representation developed at IIT Madras during the 1990s. This scheme, which represents each syllable using a fixed-length code (16 bits, but quite different from Unicode), allows regular expression matching to be effected with ease on Tamil text (or, for that matter, text in all Indian languages).
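
The benefit of a fixed-length code per syllable can be illustrated in Python, with the caveat that this is NOT the IIT Madras scheme, whose code tables are not given here. The sketch simply assigns each distinct akshara a single code point from the Unicode private use area, in order of first appearance, so that one character in the encoded string stands for one akshara. An ordinary regular expression such as ".{7,}" then means "7 or more aksharas". The class and pattern names are hypothetical.

```python
import re
import unicodedata


def split_aksharas(word: str) -> list[str]:
    """Approximate aksharas: a base character plus its combining marks."""
    units: list[str] = []
    for ch in word:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


class AksharaCodec:
    """Toy fixed-length encoding: one private-use code point per akshara."""

    def __init__(self) -> None:
        self._codes: dict[str, str] = {}

    def encode(self, word: str) -> str:
        encoded = []
        for unit in split_aksharas(word):
            if unit not in self._codes:
                # U+E000 onwards is reserved for private use.
                self._codes[unit] = chr(0xE000 + len(self._codes))
            encoded.append(self._codes[unit])
        return "".join(encoded)


SHORT_WORD = re.compile(r".{3,6}")   # 3 to 6 aksharas
LONG_WORD = re.compile(r".{7,}")     # 7 or more aksharas

codec = AksharaCodec()
encoded = codec.encode("தெய்வத்தின்")
print(len(encoded))                              # 6
print(bool(SHORT_WORD.fullmatch(encoded)))       # True
print(bool(LONG_WORD.fullmatch(encoded)))        # False
```

Because every akshara occupies exactly one position, matching a root reduces to comparing the leading characters of two encoded words.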