Analyzing the frequencies of occurrences of syllables

  The linguistic quantum for analyzing text in Indian languages is the syllable. Looking at the frequencies of occurrence of syllables in a text gives us some idea of the most commonly occurring sounds when the text is spoken. The syllable is also significant from the point of view of grammar, since syllables prefixed or suffixed to root forms result in words that conform to the rules of grammar. It should therefore be possible to perform computations on a string of syllables to arrive at the underlying structure of a sentence and, in the process, understand the sentence as well.

  In Indian languages, it is not unusual to form compound words whose form is related to the underlying words, but where the process of joining the words changes the sound. When a compound word is seen in text, its meaning can be ascertained from the component words, yet one often finds that the compound can be split in multiple ways, each leading to a valid but different meaning. The scriptures of India cannot be properly understood unless the splitting is done correctly, consistent with the context conveyed by the sentence.

Poets of India cleverly used compounds to hide the true meaning of a sentence from all but those who could correctly split the words. One is amazed to know that in the Mahabharata Epic, consisting of 100,000 Slokas, every thousandth sloka could be interpreted in two ways!

Frequency analysis should therefore be effected on text only when the splitting of words has been accomplished properly.

Fortunately for us, several scholars over the past two millennia have given proper interpretations of the scriptures, and the text of the scriptures with correctly split words is available to us.

Multiple displayed forms for the same syllable

The rules of the writing systems are not rigid about the form used for displaying a syllable so long as one of the permitted forms is used. It is always permitted to represent syllables using generic consonants when two or more consonants are present. When syllable level codes are used, one would find multiple representations for the same syllable. This does not violate the principle of fixed length codes for a syllable since a generic consonant is also viewed as a syllable.

In a way, the rules of the writing systems state that linguistic information is not lost or made ambiguous when syllables are represented using such linear forms. This concept assumes significance in the context of text processing with Tamil, where syllables are always written in decomposed form with generic consonants. Tamil has only a few syllables with three consonants, and they mostly involve consonant doubling preceded by a soft consonant.
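The equivalence between a conjunct form and its decomposed form can be illustrated with a minimal sketch. It assumes a hypothetical romanized representation in which a syllable is a pair `(consonants, vowel)` and a generic consonant is a pair whose vowel is `None`; neither this representation nor the function name `merge_generic_consonants` comes from the IITM Software. The sketch also assumes that generic consonants inside a word attach to the next vowel-bearing syllable.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: a syllable is (consonants, vowel).
# A generic consonant (no vowel) looks like (("k",), None).
Syllable = Tuple[Tuple[str, ...], Optional[str]]


def merge_generic_consonants(word: List[Syllable]) -> List[Syllable]:
    """Fold generic consonants into the vowel-bearing syllable that follows."""
    merged: List[Syllable] = []
    pending: Tuple[str, ...] = ()
    for consonants, vowel in word:
        if vowel is None:
            # Generic consonant(s): hold until a syllable with a vowel arrives.
            pending += consonants
        else:
            merged.append((pending + consonants, vowel))
            pending = ()
    if pending:
        # Word-final generic consonants have nothing to attach to.
        merged.append((pending, None))
    return merged


# "kta" written as one conjunct, and as generic "k" followed by "ta".
conjunct = [(("k", "t"), "a")]
decomposed = [(("k",), None), (("t",), "a")]
print(merge_generic_consonants(decomposed) == conjunct)
```

Normalizing both spellings to one form before counting keeps the same syllable from being tallied under two different entries.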

A study of the frequency of occurrence of sounds could lead to interesting results and help explain how changes in a language have taken place. With this idea in mind, SDL undertook the job of analyzing the text of the Bhagavadgita, Tirukkural, Tolkappiam and Sambhandar Tevaram.

The analysis undertaken includes the computation of frequencies of occurrence of syllables which fit into the syllable level coding scheme used in the IITM Software. Syllables which have been entered as a sequence of generic consonants are also reckoned appropriately. The analysis yields information on syllables of the form

V, C, CV, CCV and CCCV

with provision to identify 4 and 5 consonant cases.
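As a rough illustration of tallying syllables by form, the sketch below counts shapes such as V, C, CV, CCV and CCCV. It assumes a hypothetical romanized representation where each syllable is a pair `(consonants, vowel)`, with `None` as the vowel of a generic consonant; the names `shape` and `shape_counts` are illustrative and are not taken from the Frequency Analysis program.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Hypothetical representation: (consonants, vowel); vowel is None
# for a generic consonant.
Syllable = Tuple[Tuple[str, ...], Optional[str]]


def shape(syllable: Syllable) -> str:
    """Return the form of a syllable, e.g. 'V', 'C', 'CV', 'CCV'."""
    consonants, vowel = syllable
    return "C" * len(consonants) + ("V" if vowel is not None else "")


def shape_counts(syllables: Iterable[Syllable]) -> Counter:
    """Count how many syllables fall under each form."""
    return Counter(shape(s) for s in syllables)


# Toy input, not drawn from any of the analyzed texts.
sample = [
    ((), "a"),               # V
    (("m",), None),          # C (generic consonant)
    (("k",), "a"),           # CV
    (("k", "t"), "a"),       # CCV
    (("s", "t", "r"), "i"),  # CCCV
]
for form, count in sorted(shape_counts(sample).items()):
    print(form, count)
```

Because the shape is built from the consonant count, four and five consonant cases show up as `CCCCV` and `CCCCCV` without extra code.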

Frequency of occurrence for a vowel is specified in two ways: once for when it occurs by itself (usually at the beginning of a word) and once for when it is part of a combination with one or more consonants.

Similarly, frequency of occurrence of a consonant is specified differently depending on whether the syllable involves only that consonant or other consonants are also present.
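The two-way counts for vowels and consonants might be computed along the following lines. This is a sketch under assumptions: syllables are hypothetical `(consonants, vowel)` pairs with `None` marking a generic consonant, and a consonant counts as "alone" when it is the only consonant in its syllable, whether or not a vowel follows.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Hypothetical representation: (consonants, vowel); vowel is None
# for a generic consonant.
Syllable = Tuple[Tuple[str, ...], Optional[str]]


def vowel_counts(syllables: Iterable[Syllable]) -> Tuple[Counter, Counter]:
    """Return (standalone, combined) vowel counts."""
    standalone, combined = Counter(), Counter()
    for consonants, vowel in syllables:
        if vowel is None:
            continue  # generic consonant, no vowel to count
        (combined if consonants else standalone)[vowel] += 1
    return standalone, combined


def consonant_counts(syllables: Iterable[Syllable]) -> Tuple[Counter, Counter]:
    """Return (alone, clustered) consonant counts."""
    alone, clustered = Counter(), Counter()
    for consonants, _ in syllables:
        # Assumption: "alone" means the syllable has exactly one consonant.
        target = alone if len(consonants) == 1 else clustered
        for c in consonants:
            target[c] += 1
    return alone, clustered


# Toy input, not drawn from any of the analyzed texts.
sample = [
    ((), "a"), (("k",), "a"), (("k", "t"), "a"),
    (("k",), None), ((), "i"), (("t",), "i"),
]
standalone, combined = vowel_counts(sample)
alone, clustered = consonant_counts(sample)
print("vowels standalone:", dict(standalone))
print("vowels combined:  ", dict(combined))
print("consonants alone: ", dict(alone))
print("consonants in clusters:", dict(clustered))
```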

A frequency count for a syllable requires that the number of occurrences of the syllable in the text be reckoned properly. Since there can be so many different syllables, it becomes unwieldy to present the results in one consolidated form for all the syllables. The Frequency Analysis program permits results to be listed in terms of the consonants in a syllable, with the occurrences of all the vowels. This way, one can identify whether a specific sequence of consonants occurs and, if it does, the distribution of the count across all the vowels.
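A listing organized by consonant sequence, with the count spread across vowels, could be built as in the sketch below. The `(consonants, vowel)` representation, the `"-"` label for a generic consonant, and the function name `by_consonant_cluster` are all assumptions made for illustration; the actual output format of the Frequency Analysis program is not described here.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical representation: (consonants, vowel); vowel is None
# for a generic consonant.
Syllable = Tuple[Tuple[str, ...], Optional[str]]


def by_consonant_cluster(
    syllables: Iterable[Syllable],
) -> Dict[Tuple[str, ...], Counter]:
    """Map each consonant sequence to a count of the vowels seen with it."""
    table: Dict[Tuple[str, ...], Counter] = defaultdict(Counter)
    for consonants, vowel in syllables:
        if not consonants:
            continue  # pure vowels have no consonant row
        # "-" marks a generic consonant (no vowel); an assumed label.
        table[consonants][vowel if vowel is not None else "-"] += 1
    return table


# Toy input, not drawn from any of the analyzed texts.
sample = [
    (("k",), "a"), (("k",), "i"), (("k",), "a"),
    (("k", "t"), "a"), (("k",), None), ((), "a"),
]
table = by_consonant_cluster(sample)
for cluster in sorted(table):
    row = ", ".join(f"{v}={n}" for v, n in sorted(table[cluster].items()))
    print("+".join(cluster), ":", row)
```

A consonant sequence that never occurs simply has no row, which answers the "does this sequence occur" question directly.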

The analyses performed on the texts of different literary works are linked here.

Bhagavad Gita

Tirukkural (Tamil)

Thirugnana Sambandar Tevaram (Tamil)

Upanishads (Sanskrit)

Tamil corpora from TDIL, India



Last updated on 08/17/20