Analyzing the frequencies of occurrences of syllables
The linguistic quantum
for analyzing text in Indian languages is a syllable. Looking at the frequencies
of occurrences of syllables in text would give us some idea of the most
commonly occurring sounds when the text is spoken. Also from the point
of view of grammar, the syllable assumes significance since syllables suffixed
or prefixed to root forms result in words that conform to the rules of
grammar. It should therefore be possible to perform computations on a string
of syllables to arrive at the underlying structure of a sentence and in
the process, understand the sentence as well.
In Indian languages,
it is not unusual to form compound words whose form is related to the underlying
words but the process of connecting the words results in a change of sound.
Interestingly, when a compound word is seen in text, its meaning can
be ascertained from the connected words, but one finds that the compound
can be split in multiple ways, all yielding valid but different meanings.
The scriptures of India cannot be properly understood unless the splitting
is correctly effected, consistent with the context conveyed by the sentence.
Poets of India had cleverly
used compounds to hide the true meaning of the sentence from all but those
who could correctly split the words. One is amazed to know that in the
Mahabharata Epic, consisting of 100,000 Slokas, every thousandth sloka
could be interpreted in two ways!
Frequency analysis should
therefore be effected on text only when the splitting of words has been
done correctly.
Fortunately for us, during
the past two millennia, several scholars have given proper interpretations
of the scriptures, and the text of the scriptures with correctly split words
is available to us.
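The ambiguity in splitting a compound can be illustrated with a small sketch. It is not the method used by SDL; it only enumerates every way of cutting an unspaced string into entries of a known word list. The function name `all_splits` and the English toy lexicon are assumptions made for illustration, and the sketch ignores the sound changes that occur at word junctions in Indian languages, which a real splitter would have to undo first.

```python
from functools import lru_cache

def all_splits(text, lexicon):
    """Return every way to cut `text` into words drawn from `lexicon`.

    `lexicon` is assumed to be a set (or frozenset) of known words.
    """
    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the text yields one empty segmentation.
        if start == len(text):
            return [[]]
        results = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in lexicon:
                # Extend every segmentation of the remainder with this word.
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)

# Toy example: one string, two valid readings.
toy_lexicon = frozenset({"god", "is", "now", "here", "nowhere"})
for words in all_splits("godisnowhere", toy_lexicon):
    print(" ".join(words))
```

Both readings are valid with respect to the word list; only the context of the sentence can decide between them, which is why the frequency counts are taken from text that has already been split.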
Multiple displayed forms for the same syllable
The rules of the writing
systems are not rigid about the form used for displaying a syllable so
long as one of the permitted forms is used. It is always permitted to represent
syllables using generic consonants when two or more consonants are present.
When syllable level codes are used, one would find multiple representations
for the same syllable. This does not violate the principle of fixed length
codes for a syllable since a generic consonant is also viewed as a syllable.
In a way, the rules of the writing systems
state that linguistic information is not lost or made ambiguous in the
representation of syllables using such linear forms. This concept
assumes significance in the context of text processing with Tamil, where
syllables are always written in decomposed form with generic consonants.
Tamil has only a few syllables with three consonants, and they
mostly involve consonant doubling preceded by a soft consonant.
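One way to make the decomposed and composed forms of a syllable count as the same item is to fold each generic consonant into the syllable that follows it before counting. The sketch below assumes a hypothetical representation in which a syllable is a pair `(consonants, vowel)`, with `consonants` a tuple of romanized consonants and `vowel` set to `None` for a generic consonant. Neither this representation nor the function name comes from the IITM Software.

```python
def merge_generic_consonants(word):
    """Fold generic consonants into the following syllable of one word.

    `word` is a list of (consonants, vowel) pairs; vowel is None for a
    generic (vowel-less) consonant. Generic consonants left over at the
    end of the word are kept together as a single vowel-less syllable.
    """
    merged = []
    pending = ()  # generic consonants waiting for a syllable with a vowel
    for consonants, vowel in word:
        if vowel is None:
            pending += consonants
        else:
            merged.append((pending + consonants, vowel))
            pending = ()
    if pending:
        merged.append((pending, None))
    return merged

# Tamil "pakkam" entered in decomposed form: pa, k, ka, m
pakkam = [(("p",), "a"), (("k",), None), (("k",), "a"), (("m",), None)]
print(merge_generic_consonants(pakkam))
```

After merging, the doubled consonant is counted as one CCV syllable, and the word-final generic consonant remains a syllable of form C.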
A study of the frequencies of occurrence
of sounds could lead to interesting results and offer explanations
for the manner in which changes in a language have taken place. It is with
this idea in mind that SDL undertook the analysis of the text of
the Bhagavadgita, Tirukkural, Tolkappiam and Sambhandar Tevaram.
The analysis undertaken includes
the computation of frequencies of occurrence of syllables which fit into
the syllable level coding scheme used in the IITM Software. Also, syllables
which have been entered as a sequence of generic consonants are reckoned
appropriately. The analysis yields information on syllables of the form
V, C, CV, CCV and CCCV, with
provision to identify 4 and 5 consonant cases.
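The classification of syllables by form can be sketched as follows. This is an illustration only, not the Frequency Analysis program itself. It assumes a hypothetical representation in which each syllable is a pair `(consonants, vowel)`, with `vowel` set to `None` for a generic consonant, and the romanization in the example is an informal one chosen for readability.

```python
from collections import Counter

def syllable_form(consonants, vowel):
    """Return the form label of a syllable, e.g. 'V', 'C', 'CV', 'CCV'."""
    return "C" * len(consonants) + ("V" if vowel is not None else "")

def form_frequencies(syllables):
    """Count how many syllables of each form occur in `syllables`."""
    return Counter(syllable_form(c, v) for c, v in syllables)

# "dharmakshetre", hand-split as dha / rma / kSe / tre
sample = [
    (("dh",), "a"),
    (("r", "m"), "a"),
    (("k", "S"), "e"),
    (("t", "r"), "e"),
]
print(form_frequencies(sample))  # Counter({'CCV': 3, 'CV': 1})
```

Because the label is built from the number of consonants, syllables with four or five consonants are labelled CCCCV and CCCCCV without any special handling.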
Frequency of occurrence for
a vowel is specified in two ways: once for when it occurs by itself (usually
at the beginning of a word) and once for when it is part of a combination with one
or more consonants.
Similarly, frequency of occurrence
of a consonant is specified differently depending on whether the syllable
involves only that consonant or other consonants are also present.
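The two-way counts for vowels and consonants described above can be sketched in a few lines. As before, the `(consonants, vowel)` representation, the function name and the dictionary keys are assumptions made for this illustration and are not taken from the IITM Software.

```python
from collections import Counter

def vowel_and_consonant_counts(syllables):
    """Tally vowels and consonants separately by the context they occur in."""
    counts = {
        "vowel_alone": Counter(),         # vowel forms a syllable by itself
        "vowel_combined": Counter(),      # vowel follows one or more consonants
        "consonant_single": Counter(),    # the only consonant in its syllable
        "consonant_conjunct": Counter(),  # occurs together with other consonants
    }
    for consonants, vowel in syllables:
        if vowel is not None:
            key = "vowel_combined" if consonants else "vowel_alone"
            counts[key][vowel] += 1
        key = "consonant_single" if len(consonants) == 1 else "consonant_conjunct"
        for consonant in consonants:
            counts[key][consonant] += 1
    return counts

# a / ka / kSe / t (generic consonant)
sample = [((), "a"), (("k",), "a"), (("k", "S"), "e"), (("t",), None)]
result = vowel_and_consonant_counts(sample)
print(result["vowel_alone"], result["consonant_conjunct"])
```

Keeping the four tallies apart means that a consonant which is frequent only inside conjuncts is not confused with one that is frequent on its own.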
Frequency count for a syllable
means the number of occurrences of the syllable in the text, reckoned
properly. Since there can be so many different syllables, it becomes unwieldy
to present the results in one consolidated form for all the syllables.
The Frequency Analysis program permits results to be listed in terms of
the consonants in a syllable, with the occurrences of all the vowels. This
way, one will be able to identify if a specific sequence of consonants
occurs, and if it does, the distribution of the count across all the