Corpus linguistics seeks to further the understanding of language through the analysis of large quantities of naturally occurring data. There is a long tradition of corpus linguistic studies in Europe. A corpus serves many needs of a language: from the preparation of a dictionary or lexicon to machine translation, it has become an indispensable resource for the technological development of languages. A corpus is a large body of text incorporating various types of material, including newspapers, weeklies, fiction, scientific writing, literary writing, and so on. Ideally it represents all styles of a language. A corpus must be very large because it is used for many language applications, such as preparing lexicons of different sizes and types, machine translation programs, and so on.

1. Tagged corpus, Parallel corpus, and Aligned corpus

Corpora can be distinguished as tagged, parallel, and aligned corpora.

CIIL Corpus for Tamil:

As far as corpus building for Indian languages is concerned, it was the Central Institute of Indian Languages (CIIL) that took the initiative and started preparing corpora for Indian languages such as Tamil, Telugu, Kannada, and Malayalam. The work was financed by the Department of Electronics (DOE). The initial target was a ten-million-word corpus for each language, but because of financial and time constraints the project ended up with three million words per language. The three-million-word Tamil corpus was built by CIIL in this way. It is a partially tagged corpus. It is available on CD, and a free copy can be obtained from CIIL for research purposes. At present CIIL is planning to build ten-million-word corpora for all Indian languages.

AUKBC’s Improved Tagged Corpus for Tamil: AUKBC Research Centre, which has taken up NLP-oriented work for Tamil, has improved upon the CIIL Tamil corpus and tagged it for its MT programs. It has also developed English-Tamil parallel corpora in support of its goal of building an MT tool for English-Tamil translation. A parallel corpus is very useful for training MT programs and for building example-based machine translation.


Annamalai, E. Corpora Development in Indian Languages. Agarawal and Pani (eds.) Information Technology Applications in Language, Script and Speech, New Delhi: BPB Publication.

Final Report: Development of Corpora of Texts of Indian Languages in Machine Readable form: Part II (Tamil, Telugu, Kannada, Malayalam), TDIL-Corpora Group, CIIL, Mysore, January 1995.


AUKBC Research Centre, MIT Campus, Chromepet, Chennai.
Central Institute of Indian Languages,  Mysore 57006.


Ganesan, M. A Scheme for Grammatical Tagging of Corpora of Indian Languages. In B.B. Rajpurohit (ed.) Technology and Languages, Mysore: Central Institute of Indian Languages. 1994

Jayaram, B.D. Development of Corpora in Indian Languages. Paper presented in the Seminar on “The use of Computers in Indian Languages” held at CIIL, Mysore, August 1992.

2. Corpus Indexing Tools (Concordance, KWIC index, etc.)

CIIL Corpus Indexing Tool:

CIIL has prepared a tool for indexing the Tamil corpus it built.

Tamil University’s Indexing of Sangam Literature:

The Departments of Computer Science and Lexicography of Tamil University, Thanjavur, have jointly prepared a tool for building concordances and indexes of Tamil Sangam literary works. Using it, they have prepared a computer-generated Index of Tamil Sangam Literature, which is to be printed by Tamil University.

AUKBC Corpus Indexing Tool:

AUKBC has also prepared a tool for corpus indexing. The tool can sort and alphabetize Tamil texts in any font and produce a concordance and a KWIC (keyword in context) index for them. It can find the frequency of lexical items in both lemmatized and inflected forms.
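What a KWIC index and a frequency list contain can be illustrated with a minimal Python sketch. It is not the AUKBC tool: the function names, the sample sentence, and the assumption that the text is romanized and separated by whitespace are all illustrative.

```python
from collections import Counter

def kwic(text, keyword, width=2):
    """Return (left context, keyword, right context) for every hit.

    Assumes whitespace-separated, romanized text (an illustrative choice).
    """
    tokens = text.split()
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows

def word_frequencies(text):
    """Count inflected word forms; no lemmatization is attempted here."""
    return Counter(text.split())

# Illustrative sample in the romanization used in this survey.
sample = "avan puttakam paTittaan avaL puttakam vaangkinaaL"

for left, key, right in kwic(sample, "puttakam", width=1):
    print(f"{left:>12} [{key}] {right}")
```

Counting lemmatized rather than inflected forms would additionally need a morphological analyzer to map each word form to its root before counting.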

Ganesan’s Tool for Corpus Analysis:

Ganesan and Rajan are involved in making tools for corpus analysis.


Rajan, K. and Ganesan, M. Tools for Corpus Analysis. Paper presented at the International Conference on South Asian Linguistics (ICOSAL-4), 3-5 December 2002, CASL, Annamalai University, Annamalainagar. 2002


Baskaran’s study on Zipf’s Law for Tamil:

Baskaran of Tamil University has studied the possibility of using Zipf’s law for the compression of Tamil texts.
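Zipf’s law states that the frequency of a word is roughly inversely proportional to its rank in the frequency list, so that rank multiplied by frequency stays approximately constant. The sketch below only illustrates that relation; it does not reproduce Baskaran’s study, and the token list is artificial, constructed so that the relation holds exactly.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically to make ties stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]

# Artificial tokens with frequencies 12, 6, 4, 3 (that is, 12 / rank).
tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3

for row in rank_frequency(tokens):
    print(row)
```

On real text the product only stays roughly constant. A strongly skewed distribution of this kind is what frequency-based coding schemes exploit, by giving the most frequent items the shortest codes.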

AUKBC’s Research on Encryption:

AUKBC Research Centre is involved in extensive research on encryption. Encryption tools help in exchanging information from one end to the other without its being intercepted, sabotaged, or spied on. Attempts are under way to prepare such tools for Tamil too.


Baskaran, S. Zip’s Law: A perspective study in Tamil, Eighth International Conference of Tamil Studies.

4. Text processing tools

Much of the information processed by computers is text, that is, data of the type generally called character or alphanumeric data. In natural language processing, all written material is text, so text processing and the input and output of text are important. Dealing with text is both simpler and harder than manipulating numeric data. In physical terms it is simpler, in that text is linear: the first character is handled, then the second, then the third, until the last is reached, at which point the data has been processed. But in logical terms text is quite slippery. In the computer, numeric data is generally represented in a specific form: a number is stored in a fixed quantity of bits and in a set format. On a particular computer all integers occupy the same amount of memory, as do all real numbers. Text, on the other hand, is made up of words, names, and other strings of characters, which come in many different lengths and thus require differing amounts of storage.
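The contrast between fixed-size numbers and variable-length text can be made concrete with a short Python sketch. It uses the standard struct module to pack integers into a fixed 32-bit format (Python’s own integers are not fixed-width, so the packing stands in for a machine integer) and measures the UTF-8 size of a few romanized words taken from the examples later in this survey.

```python
import struct

# Every value packed as a 32-bit little-endian integer takes the same space.
int_sizes = [len(struct.pack("<i", n)) for n in (7, 1000, 2_000_000)]

# Strings take an amount of storage that depends on their length and encoding.
words = ["nduul", "puttakam", "paaluuTTi"]
text_sizes = [len(w.encode("utf-8")) for w in words]

print(int_sizes)   # same size for every integer
print(text_sizes)  # a different size for each word

# Text is processed linearly: one character after another until the end.
for position, char in enumerate(words[0]):
    print(position, char)
```

Text in Tamil script needs more than one byte per character in UTF-8, so the gap between character count and storage size is larger there than in this romanized example.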

AUKBC Text Processing tool:

AUKBC Research Centre has prepared a tool for Tamil text processing. It is a reasonably good tool for all sorts of string analysis. Using the morphological analyzer and syntactic parser prepared by the NLP team of AUKBC, the processor is able to give all the linguistic information one can expect from text analysis. It can tag the text with tab labels; separate the text into sentences, phrases, and words; and count the frequency of occurrence of words in terms of their tokens and word forms. It can count the number of characters of a specified length. The text processor can supply a wide range of statistics on string analysis.

5. Statistical analysis tool

AUKBC Statistical Analysis Tool:

AUKBC Research Centre at Chennai has prepared a tool that works on a Tamil corpus and reports statistics such as the number of word forms, the number of words (lexemes) or word tokens, and the number of lines and paragraphs in a text.
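The kind of counts described above can be sketched in a few lines of Python. This is not the AUKBC tool: the function name and the sample text are illustrative, and the sketch assumes that paragraphs are separated by blank lines and words by whitespace.

```python
def text_statistics(text):
    """Count paragraphs, non-empty lines, word tokens, and distinct word forms."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    lines = [ln for ln in text.splitlines() if ln.strip()]
    tokens = text.split()
    return {
        "paragraphs": len(paragraphs),
        "lines": len(lines),
        "tokens": len(tokens),           # running words
        "word_forms": len(set(tokens)),  # distinct inflected forms
    }

# Illustrative romanized sample: two paragraphs, three lines.
sample = "avan puttakam paTittaan\navaL puttakam vaangkinaaL\n\navan vandtaan"
print(text_statistics(sample))
```

Counting lexemes rather than word forms is not attempted here, since it would require reducing each form to its root with a morphological analyzer.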


1. Text editing tools

Baskaran, S. User Interface with Computers in Tamil - an Overview [kaNippoRi uraiyaakam] in Proceedings of the Third TamilNadu Science Congress, Pondicherry.

2. Word Processing tools

Word processing refers to the activity of using a computer and suitable software to create, view, edit, manipulate, transmit, store, retrieve, and print documents. A document may contain text, tables, graphs, equations, pictures, and drawings. A word processor is used to produce documents for business or personal use, such as newsletters, reports, letters, and essays. One might say that a word processor is an intelligent typewriter. One can type a whole page, make corrections (editing), use more than one font to make the text attractive, set paragraphs in different styles and columns (formatting), check spelling, find and insert synonyms for a word (thesaurus), and process the page in many more ways before actually sending it to print. In general, word processing operations can be listed as follows:

	∙	Creating a New Document
	∙	Entering Text in a Document
	∙	Saving, Closing and Opening Documents
	∙	Moving around the Document
	∙	Scrolling the Document
	∙	Correcting Mistakes
	∙	Inserting the Text
	∙	Moving the Text
	∙	Copying the Text
	∙	Searching and Replacing the Text

CIIL Word Processor for Tamil:

Many word processors for Tamil are available in the market. CIIL, Mysore, has prepared a word processor for Tamil named Bharati, which is available on CD.


C–DAC of Pune has prepared a word processor for Indian languages named ILipi, which also supports Tamil.

Word Processor of Deyvasundaram: Under the supervision of Prof. Deyvasundaram of Madras University, Switha computers, Chennai, has prepared a word processor for Tamil with tools such as a spell checker and a morphological analyzer.

Many commercial word processors are available for Tamil. A few of them are:

∙ Ilango Tamil Word Processor, which has the following features: it highlights all incorrect words, adds new words to the dictionary, checks single words and full stories, and checks for sandhi errors.

∙ Shakthi English/Tamil Office Suite (prepared by Chennai Kavigal, Kanini Pvt. Ltd, Chennai), which has a Tamil word processor named patami.


The Resource Centre for Indian Language Technology Solutions – Tamil (RCILTST), functioning at the School of Computer Science and Engineering, Anna University, Chennai, Tamil Nadu, India, has prepared a Tamil word processor named palagai. Palagai provides basic word processing facilities in both Tamil and English. It supports a spell checker, a grammar checker, and an e-mail facility. It has the following salient features:

	∙	Files of both Rich Text and Text-only formats can be created and edited
	∙	HTML files can be viewed
	∙	E-mails can be sent 
	∙	Provisions to check the spelling and grammar
	∙	Supports Sort, Find and Replace operations
	∙	Print facility is provided
	∙	Unlimited Undo and Redo operations

Navaneethan’s Multi Word Processor with online Noun Translator:

Navaneethan and his team (Navaneethan et al, 2004) have prepared a multi word processor with an online noun translator. The word processor is built on top of PANDITHAM (Protocol for Applications Development in Thamizh and Multilingual computing), a protocol that provides an efficient framework for bringing multiple languages into the machine. In any document opened with this processor, the translator can be invoked by selecting a noun and using the appropriate accelerator key. Other features of the processor include multilingual support and switching between the environment language and the application language.


Brochure on Language Technology Products of the Resource Centre for Indian Language Technology Solutions- Tamil, Chennai.

paaskaran, ca. 2004. tamizhil kaNippoRiyiyal, kaNippoRiyiyiyal tamizh. tanjcaavuur: umaa patippakam.

cuppaiyaa piLlai, 2000. iyaRkai mozhi aayvu. cennai: ulakt tamizhaaraaycci niRuvanam.

Navaneethan, P et al. Multilingual Word Processor with Online Noun Translator, paper read in National Seminar on “Cross Language Communication Tools (Machine Translation) for Indian Languages”, February 25-27, 2004.

Relevant organizations/institutions/companies

Chennai Kavigal, Kanini Pvt. Ltd, Chennai

3. DTP tools

Desktop publishing leaves the doors to writing and publishing wide open. No one is locked out from the tools he or she needs. One can use a personal computer to print and publish whatever one considers important and worth making known to people. This makes desktop publishing a very exciting endeavour. Desktop publishing is an alternative to professional publishing. Personal computers allow one to write and edit text with ease. Page layout software enables one to design and compose pages, integrating text and graphics. Laser printers print good-looking text, so one can avoid the expense of typesetting documents. With a little publishing experience, one can produce effective publications at less expense than a commercial publisher. Desktop publishing is largely a do-it-yourself operation. The publisher is responsible for a wide variety of tasks and can do almost all the planning, managing, writing, editing, and production work that goes into the publication. Desktop publishing is done by people who are not publishers and by people with little publishing experience.

As far as Tamil is concerned, many DTP software packages are available. DTP has replaced traditional printing techniques, which are very costly and time-consuming. The publication of books and journals in Tamil has increased drastically thanks to DTP. The following are a few of the DTP packages available for Tamil.

4. Fonts

Innumerable fonts are available for Tamil. The availability of so many fonts makes text recognition difficult, because each font uses different character codes. This undermines uniformity and makes computational processing of Tamil difficult. A few years ago the Tamil Nadu government intervened and proposed standard fonts using standard codes. The following are the names of some of the fonts: Diamond, Kampan, Tiruvalluvar, Tolkappiyan, Kannaki, TAM, TAB, and so on.


1. Word list/Vocabulary


It is one of the content creations of RCILTST, Chennai. Sollooviyam is a picture dictionary that teaches children the alphabet and simple words of the Tamil language. It has about three hundred words, organized as nouns and verbs. The words are grouped alphabetically under vowels and consonants. Each word has an associated picture, an explanation, and an English equivalent. Every word also has a small poem that illustrates its meaning. The poem helps the child picture the word, so that the word becomes etched in his or her memory.

Union Catalog of Tamil Palmleaf Manuscript:

This catalogue has been prepared by Tamil University and is available in electronic form. A database was created for cataloguing Tamil palm-leaf manuscripts in Tamil Nadu. The project is funded by the Ministry of Education and Culture of the Government of India and was undertaken by the Computer Centre, the Department of Palm-leaf Manuscripts, and the Library of Tamil University. Information was collected from 49 institutions. The catalogue consists of two parts, Part A and Part B. Part A consists of entries arranged in a classified sequence of subjects. Part B provides [1] an author index, [2] a commentator index, [3] a subject-wise index, and [4] a title index. In addition, some statistical details of the collection are given under the heading bibliometry. The total number of entries in the catalogue is 21,973.

Scientific Data Base for Technical terms through computer:

This has been prepared by the Department of Computer Science, Tamil University, Thanjavur. As an experimental measure, a system has been developed to create a scientific database for basic science subjects such as mathematics, physics, and chemistry, to help users write books and articles in Tamil. The system provides the equivalent technical term in Tamil for a given word or group of words in English, and vice versa.

References:

Brochure on ‘Content Creation’ of the Resource Centre for Indian Language Technology Solutions- Tamil, Chennai.

2. Electronic/Online dictionaries

Many electronic/online dictionaries have been prepared for Tamil by research institutions as well as commercial organizations. A few of them are listed below:

Radha Chellappan’s Electronic dictionary for Scientific Technical Terms in Tamil:

It is a very exhaustive and efficient electronic dictionary of scientific terms in Tamil. It provides synonymous words along with the standardized lexical items, and it has different kinds of retrieval and browsing facilities. The third version is more efficient and more informative than the earlier versions. The dictionary can also be obtained directly in printed form.


It is also one of the content creations of RCILTST, Chennai. Sorputhaiyal is language-oriented software for the retrieval of lexical entries from monolingual and bilingual dictionaries by many users simultaneously. This Tamil online dictionary contains 20,000 root words. Each entry includes the Tamil root word, its English equivalent, the different meanings of the word, and the associated syntactic category. The root words are classified into 15 categories. The information is presented through a user-friendly interface. The tool uses the morphological analyzer, which retrieves the root word from the word given by the user. Hence, even if the user enters an inflected word, the dictionary fetches the root word and gives all relevant information about it.
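The lookup strategy described above, in which an inflected query is first reduced to its root and the root is then looked up, can be sketched as follows. This is not Sorputhaiyal’s implementation: the two-entry lexicon, the suffix list (plural kaL and locative il), and the plain suffix stripping that stands in for a real morphological analyzer are all illustrative assumptions.

```python
# Toy lexicon keyed by root word (illustrative entries only).
LEXICON = {
    "meecai": {"english": "table", "category": "noun"},
    "kaal": {"english": "leg", "category": "noun"},
}

# Toy suffix inventory: plural and locative. A real analyzer also handles
# sandhi changes, which plain suffix stripping cannot.
SUFFIXES = ("kaL", "il")

def find_root(word):
    """Strip known suffixes from the right until a lexicon root is reached."""
    candidate = word
    while candidate not in LEXICON:
        for suffix in SUFFIXES:
            if candidate.endswith(suffix) and len(candidate) > len(suffix):
                candidate = candidate[:-len(suffix)]
                break
        else:
            return None  # no suffix applies and the candidate is not a root
    return candidate

def lookup(word):
    """Return the dictionary entry for the root of an inflected word, if any."""
    root = find_root(word)
    if root is None:
        return None
    return {"root": root, **LEXICON[root]}

print(lookup("meecaikaLil"))  # inflected query, entry of the root is returned
```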

Pals e-Dictionary:

It is an English-English-Tamil dictionary, available on CD-ROM. It has 22,000 headwords and 35,000 subwords. The supporting software makes it easy to turn the pages and to search quickly. Pals e-dictionary is also available for browsing on the Tamil Virtual University website.

Tamil Lexicon of Madras University:

Tamil Lexicon published by Madras University is available in e-form in Tamil Virtual University Website.

English-Tamil Dictionary of Madras University:

English-Tamil dictionary published by Madras University is also available in e-form in Tamil Virtual University Website.

Tamil-Tamil Dictionary of M.Shanmugam Pillai:

Tamil-Tamil dictionary prepared by Shanmugam Pillai is now available in e-form in Tamil Virtual University Website.

Multidimensional SMART dictionary:

Vijaya and John Paul (2004) proposed an electronic dictionary for Tamil named the Multidimensional SMART dictionary, a project of CIIL. The Multidimensional SMART (Small Readable Tamil) dictionary is an advanced Tamil dictionary with the following distinctive features: it is web-based and it is open to all at any time. When a user searches for a word, the dictionary gives the following details:

	∙	Search word in Tamil
	∙	Spoken pronunciation/Alternative pronunciation in Roman transcription
	∙	Most frequent meanings in English
	∙	Grammatical Categories in English
	∙	Combination / Collocation construction in Tamil, its Roman transcription and English meaning
	∙	Compound constructions in Tamil, its Roman transcription and English meaning
	∙	Proverbial constructions in Tamil, its Roman transcription and English meaning
	∙	Antonyms in Tamil and its Roman transcription

They have collected and entered 6992 head words (search words), their grammatical categories, meanings, synonyms and antonyms.

Tamil WordNet (TWN):

TWN has been built by the Department of Linguistics of Tamil University in collaboration with AUKBC Research Centre, Chennai. Tamil Virtual University has sanctioned four lakhs for the project, of which one lakh has been released so far. Dr. Rajendran of Tamil University is the chief investigator and Arulmozi of AUKBC is the co-investigator.

TWN is based on the architecture of EuroWordNet, an online lexical database. Until recently, only dictionaries in printed book format represented the lexicon of a language. TWN is a semantic dictionary designed as a network, partly because representing words and concepts as an interrelated system seems consistent with the evidence for the way speakers organize their mental lexicons. The design of TWN resembles that of a thesaurus in that its building block is a synset, consisting of all the words that express a given concept. Thus a user of TWN who has a given concept in mind can find, by calling up one of the words expressing that concept, other words that lexicalize the same concept. But TWN does much more than list concepts in the form of synsets. The synsets are linked by a number of relations, including hyponymy, meronymy, and entailment. The different kinds of semantic opposition lumped together in the antonymy relation link words only, rather than concepts. TWN thus clearly separates the conceptual and the lexical levels, and this distinction is reflected in the one between semantic-conceptual relations, which hold among synsets, and lexical relations, which hold among words. Unlike in a thesaurus, the relations between concepts and words in TWN are made explicit and labeled. Users select the relation that guides them from one concept to the next and choose the direction of their navigation in conceptual space. Words express concepts, and the lexicon is constrained by the kinds of concepts that are available by virtue of one’s perception of, and interaction with, the world around.

TWN differs from thesauruses, where only lexicalized concepts are accounted for. In some respects TWN resembles a traditional dictionary. For example, TWN gives definitions and sample sentences for most of its synsets. TWN also contains information about morphologically related words. TWN’s goals differ little from those of a good standard college-level dictionary, and the semantics of TWN is based on the notion of sense that lexicographers have traditionally used in writing dictionaries. It is in the organization of that information that TWN aspires to innovation (Miller, 1998). TWN does not give pronunciation, derivational morphology, etymology, usage notes, or pictorial illustrations. TWN does, however, try to make the semantic relations between word senses more explicit and easier to use.

TWN relies on extensive preliminary investigations of the vocabulary of Tamil (Rajendran, 1976-2003) based on the componential analysis of meaning (Nida, 1975a & 1975b) and structural semantics (Lyons, 1977). Portions of this work have been compiled into a Tamil thesaurus (Rajendran, 2001). The Tamil thesaurus in electronic form represents the ontological structure of Tamil vocabulary (OST for short), giving scope to any kind of semantic or lexical relation that holds between lexical items.

TWN makes the commonly accepted distinction between conceptual-semantic relations, which link concepts, and lexical relations, which link individual words. The mental lexicon tends to build semantic networks with conceptual-semantic relations, whereas workers focusing on lexical aspects use primarily lexical, word-word relations. Wordnet is organized by semantic relations. Since a semantic relation is a relation between meanings, and since meanings can be represented by synsets, it is natural to think of semantic relations as pointers between synsets. TWN does not contain syntagmatic relations linking words from different syntactic categories. The four major syntactic categories (noun, verb, adjective, adverb) are treated separately. Nouns are organized in lexical memory as topical hierarchies, verbs are organized by a variety of entailment relations, and adjectives and adverbs are organized as N-dimensional hyperspaces.

Table of lexical/semantic relations for nouns:

Relation | Subtype | Example
Synonymy | - | puttakam ‘book’ to nduul ‘book’
Hypernymy-Hyponymy | - | vilangku ‘animal’ to paaluuTTi ‘mammal’
Hyponymy-Hypernymy | - | pacu ‘cow’ to paaluuTTi ‘mammal’
Holonymy-Meronymy | Wholes to parts | meecai ‘table’ to kaal ‘leg’
Holonymy-Meronymy | Groups to members | tuRai ‘department’ to peeraaciriyar ‘professor’
Meronymy-Holonymy | Parts to wholes | cakkaram ‘wheel’ to vaNTi ‘cart’
Meronymy-Holonymy | Members to groups | paTaittalaivar ‘captain’ to paTai ‘army’
Opposites | Antonymic (gradable) | ndallavan ‘good person’ to keTTavan ‘bad person’
Opposites | Complementary | iravu ‘night’ to pakal ‘day’
Opposites | Privative (opposing features) | ahRiNai ‘irrational’ to uyartiNai ‘rational’
Opposites | Equipollent (positive features) | aaN ‘male’ to peN ‘female’
Opposites | Reciprocal social roles | maruttuvar ‘doctor’ to ndooyaaLi ‘patient’
Opposites | Kinship relations | ammaa ‘mother’ to makaL ‘daughter’
Opposites | Temporal relations | munnar ‘before’ to pinnar ‘after’
Opposites | Orthogonal or perpendicular | vaTakku ‘north’ to kizakku ‘east’ and meeRku ‘west’
Opposites | Antipodal opposition | vaTakku ‘north’ to teRku ‘south’
Multiple opposites | Serial | onRu ‘one’, iraNTu ‘two’, muunRu ‘three’, ndaanku ‘four’
Multiple opposites | Cycle | njaayiRu ‘Sunday’ to tingkaL ‘Monday’ .. to cani ‘Saturday’
Lexical association | Collocation | cingkam ‘lion’ to karji ‘roar’
Lexical association | Morphological relations | paTi ‘study’ to paTittavan ‘educated man’
Compatibility | - | ndaay ‘dog’ to cellappiraaNi ‘pet’

Table of lexical/semantic relations for verbs:

Relation | Definition/subtype | Example
Synonymy | Replaceable events | tuungku ‘sleep’ ↝ uRangku ‘sleep’
Meronymy-Holonymy | Events to super-ordinate events | paRa ‘fly’ ↝ pirayaaNi ‘travel’
Troponymy | Events to their subtypes | ndaTa ‘walk’ ↝ ndoNTu ‘limp’
Entailment | Events to the events they entail | kuRaTTaiviTu ‘snore’ ↝ tuungku ‘sleep’
Entailment | Event to its cause | uyar ‘rise’ ↝ uyarttu ‘raise’
Entailment | Event to its presupposed event | vel ‘succeed’ ↝ muyal ‘try’
Entailment | Event to its implied event | kol ‘murder’ ↝ iRa ‘die’
Antonymy | Opposites | kuuTu ‘increase’ ↝ kuRai ‘decrease’
Antonymy | Converseness | vil ‘sell’ ↝ vaangku ‘buy’
Antonymy | Directional opposites | puRappaTu ‘start’ ↝ vandtuceer ‘reach’
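The idea that synsets are the building blocks and that semantic relations are labeled pointers between synsets can be sketched in Python, using a few noun examples from the table above. The class, the synset identifiers, and the relation names used as dictionary keys are illustrative assumptions, not the actual TWN data format.

```python
from dataclasses import dataclass, field

@dataclass
class Synset:
    words: list   # all words that express the concept
    gloss: str    # English gloss of the concept
    relations: dict = field(default_factory=dict)  # relation name -> synset ids

# A tiny network built from the noun examples in the table above.
SYNSETS = {
    "book": Synset(["puttakam", "nduul"], "book"),
    "animal": Synset(["vilangku"], "animal", {"hyponym": ["mammal"]}),
    "mammal": Synset(["paaluuTTi"], "mammal",
                     {"hypernym": ["animal"], "hyponym": ["cow"]}),
    "cow": Synset(["pacu"], "cow", {"hypernym": ["mammal"]}),
}

def synonyms(word):
    """Other words in any synset that contains the given word."""
    return sorted(w for s in SYNSETS.values() if word in s.words
                  for w in s.words if w != word)

def hypernym_chain(synset_id):
    """Follow hypernym pointers upward from a synset to the top of its hierarchy."""
    chain = []
    current = synset_id
    while SYNSETS[current].relations.get("hypernym"):
        current = SYNSETS[current].relations["hypernym"][0]
        chain.append(current)
    return chain

print(synonyms("puttakam"))
print(hypernym_chain("cow"))
```

Because every pointer carries a relation name, a user can choose which relation to follow and in which direction, which is the navigation the description of TWN refers to.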

Chitiraputhiran’s Tamil Lexical Resource (An Electronic Data Base):

Chitiraputhiran has taken up a major UGC-supported project on preparing a lexical resource for Tamil. He tries to capture polysemy and the inter- and intra-relations between lexical items by preparing a database, which he is building from the dictionaries and lexicons available for Tamil. So far he has collected lexical information for 2000 lexical items.


Chithiraputhiran, H. Tamil Lexical Resource: An electronic Data Base. Paper read in International Seminar on Tamil Computing, February 27, 28. Madras University.

Rajendran, S. Preliminaries to the Preparation of Wordnet for Tamil. Language in India 2:1, www.languageinindia.com. 2002.

------------. Dravidian WordNet: a proposal in R.M. Sundaram, et al (eds) Facets of Language, Thanjavur. 2003

-----------. Creating Generative Lexicon from Dictionaries: Tamil Experience. In Rajeev Sangal, S.M. Bendre & Udaya Narayana Singh (eds.), Recent Advances in Natural Language Processing. Mysore: CIIL. 2003

Ganesan, M. Compilation of Electronic Dictionary for Tamil. Annamalai University, Annamalai Nagar. www.bhasshaindia.cjb.net/G7ganesan.pdf 2000.

3. Electronic/Online thesaurus

Tamil University’s Electronic Thesaurus for Tamil:

An electronic thesaurus for Tamil has been prepared by Rajendran and Baskaran of Tamil University. It is based on a paper thesaurus prepared by Rajendran for Tamil (Rajendran, 2001: taRkaalat tamizhc coRkaLanjciayam, Tamil University: Thanjavur). The classification is based on Nida’s theory of componential analysis of meaning (Nida, E.A. 1975a. Componential Analysis of Meaning: An Introduction to Semantic Structure, The Hague: Mouton; Nida, E.A. 1975b. Exploring Semantic Structure, The Hague: Mouton), and the lexical and semantic relations between the lexical items are established on the basis of John Lyons’s structural semantics (Lyons, J. 1977. Semantics, volume 1, New York: Cambridge University Press). The theory of field semantics (Lehrer, A. 1974. Semantic Fields and Lexical Structure, Amsterdam: North-Holland Publishing Company) was also taken into consideration in preparing the thesaurus.

The preparation of an electronic thesaurus raises linguistic issues such as the hierarchical classification of lexical items, the establishment of semantic domains, the selection of lexical items, the establishment of a network of relations between lexical items, and the classification and ordering of lexical items under terminal domains. Computerizing a thesaurus requires procedures and methods for creating a special kind of database and an access system. The thesaurus is prepared in Roman script, with a facility to display the content in Tamil script as well. It contains nearly thirty thousand words, classified into four major domains: entities, events, abstracts, and relationals. The entities consist mainly of words denoting concrete nouns. The events consist of verbs and verbal nouns; the abstracts consist of abstract nouns, adjectives, and adverbs; and the relationals consist of coordinators, complementizers, postpositions, case suffixes, and anaphoric references such as pronouns.

The lexical items are arranged hierarchically to capture hypernymy-hyponymy and holonymy-meronymy relations. They are also arranged along the horizontal axis to capture horizontal relations such as synonymy, lexical opposition, and lexical association. Thus the arrangement captures the network of lexical and semantic relations between lexical items.


Rajendran, S. Prerequisite for the Preparation of an Electronic Thesaurus for a Text Processor in Indian Languages, Language in India 3:1, www.languageinindia.com 2003

Rajendran, S. and Baskaran, S. Preparation of Electronic Thesaurus for Tamil in Proceedings of the International Conference on Natural Language Processing. Mumbai: NCST. 2002


4. Morphological analyzers/generators

Morphological analyzers and generators for Tamil which could be used for automatic lemmatization of word forms from corpus are discussed under morphological parsing.


Anandan, P, Rajani Parthasarathy, Geetha, T.V. Morphological Generator for Tamil, Tamil Inaiyam Conference. 2001.

Balakrishnan, R. Morphology and Tamil Computing. Paper read in International Seminar on Tamil Computing, February 27, 28, 2002, Madras University. 2002.

Deivasundaram, N. and Gopal, A. Computational Morphology of Tamil. B. Ramakrishna Reddy (ed.) Word Structure in Dravidian, Kuppam: Dravidian University, 406-410. 2003.

Ganesan, M. Computational Morphology of Tamil. B. Ramakrishna Reddy (ed.) Word Structure in Dravidian, Kuppam: Dravidian University, 399-405. 2003

Rajendran S, Arulmozi S, Ramesh Kumar S, & Viswanathan S. Computational Morphology of Verbal Complex. B. Ramakrishna Reddy (ed.) Word Structure in Dravidian, Kuppam: Dravidian University, 376-398. 2003

Ranganathan, V. A Lexical Phonology Approach to Processing Tamil Word by Computer. International Journal of Dravidian Linguistics 26.1: 1997.

Ramaswamy, Vaishnavi. A Morphological Generator for Tamil. M.Phil. dissertation submitted to University of Hyderabad. 2000.

Ramaswamy, Vaishnavi. A Morphological Analyzer for Tamil. Ph.D. dissertation submitted to University of Hyderabad. 2003.

Viswanathan, S., Rameshkumar, S., Kumara Shanmugam, B., Arulmozhi, S. & Vijay Shankar, K. A Tamil Morphological Analyzer. In Rajeev Sangal, S.M. Bendre & Udaya Narayana Singh (eds.), Recent Advances in Natural Language Processing. Mysore: Central Institute of Indian Languages. 2003.

Winston Cruz, S. Parsing and Generation of Tamil Verbs in GSMorph. M.Phil. dissertation submitted to the University of Hyderabad. 2002.


Language Technologies Research Centre, Indian Institute of Information Technology, Hyderabad.


Tamil University’s Spell and Grammar Checker Project:

The Department of Linguistics of Tamil University undertook a major project entitled ‘Spell and Grammar Checker for Tamil’ with the financial support of UGC. A morphological analyzer has been prepared to help in spell checking. An error analysis of written documents in Tamil has been undertaken, and the errors have been listed and classified. A limited-scope model of a spell and grammar checker for Tamil has been prepared.


A parser is a device that makes use of the representation of the knowledge of the structure of a language provided by the grammar. It analyzes the input language structures and provides a parsed structure as output. This output can be used for understanding the structure and, if necessary, for translating it. The efficiency of the parser depends on the quality of the database and the precision of the algorithms provided to the machine. In principle, a parser applies whatever the grammar declares in its parsing and analysis.


Morphological processing forms the basis of any NLP system. Parsing is the activity in which the analyzer recognizes and analyzes a given input word and normally returns its root along with its meaning and grammatical features as output. Generation is the reverse of this process: the root is combined with various morphemes to produce one or more surface forms. Most available processors follow a structural approach. A number of models are available for morphological parsing, the most important being PCKimmo, Ample, and GS Morph. PCKimmo uses two-level morphology; Ample and GS Morph are based on the item-and-arrangement model. PCKimmo has been tested and shown to work well for agglutinative languages, at least for Finnish and Turkish. Ample works by ‘matching and filtering’: in parsing, it starts from the left periphery of the word given for analysis.
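The relation between parsing and generation can be illustrated with a toy item-and-arrangement sketch in Python. It is not PCKimmo, Ample, or GS Morph: the one-verb lexicon, the tense and person-number-gender markers, and the feature labels are illustrative assumptions, and morphemes are simply concatenated with no sandhi rules.

```python
# Toy morpheme inventories (illustrative; covers one verb class only).
ROOTS = {"paTi": "study"}
TENSE = {"tt": "PAST", "kkiR": "PRESENT"}
PNG = {"aan": "3SG.MASC", "aaL": "3SG.FEM"}  # person-number-gender endings

def analyze(word):
    """Parse from the left edge: root first, then tense, then PNG ending."""
    analyses = []
    for root, gloss in ROOTS.items():
        if not word.startswith(root):
            continue
        rest = word[len(root):]
        for marker, tense in TENSE.items():
            if not rest.startswith(marker):
                continue
            ending = rest[len(marker):]
            if ending in PNG:
                analyses.append((root, gloss, tense, PNG[ending]))
    return analyses

def generate(root, tense, png):
    """Reverse of analysis: concatenate root, tense marker, and PNG ending."""
    marker = next(m for m, feature in TENSE.items() if feature == tense)
    ending = next(e for e, feature in PNG.items() if feature == png)
    return root + marker + ending

print(analyze("paTittaan"))
print(generate("paTi", "PRESENT", "3SG.FEM"))
```

Anything the sketch generates it can also analyze, because both directions read the same three tables.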

Rajendran’s Morphological Analyzer for Tamil:

The first step towards the preparation of a morphological analyzer for Tamil was initiated by the Anusaraka group of researchers, under whose guidance Rajendran of Tamil University prepared a morphological analyzer for Tamil for translating Tamil into Hindi at the word level.

Ganesan’s Morphological Analyzer for Tamil:

Ganesan developed a morphological analyzer for Tamil to analyze CIIL corpus. Now he is involved in improving his morphological parser.

Kapilan’s Morphological Analyzer for Tamil Verbal Forms:

Kapilan prepared a morphological analyzer for verbal forms in Tamil.

Vasu Ranganathan’s Tagtamil:

Tagtamil by Vasu Ranganathan is based on the Lexical Phonology approach. Tagtamil handles the morphotactics of the morphological processing of verbs by using an index method. Tagtamil does both tagging and generation.

AUKBC Morphological Parser for Tamil:

The AUKBC NLP team, under the supervision of Rajendran, prepared a morphological parser for Tamil. The API Processor of AUKBC makes use of finite state machinery like PCKimmo. It parses, but does not generate.

Vaishnavi’s Morphological Generator for Tamil:

Vaishnavi worked on a morphological generator for Tamil for her M.Phil. dissertation. Vaishnavi’s morphological generator implements the item and process model of linguistic description. The generator works by the synthesis method of PCKimmo.

Winston Cruz’s Parsing and Generation of Tamil Verbs:

Winston Cruz makes use of GSmorph method for parsing Tamil verbs. GSmorph too does morphotactics by indexing. The algorithm simply looks up two files to see if the indices match or not. The processor generates as many forms as it parses and uses only two files.
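Morphotactics by index matching can be sketched as follows. This is only an illustration of the idea of two lookup tables whose indices must agree; the table contents, the index numbers and the names `STEMS`, `SUFFIXES`, `parse` and `generate` are hypothetical and do not reproduce the file formats of GSmorph.

```python
# Stand-ins for the two lookup files; contents and index numbers are invented.
STEMS = {"paTi": 1, "koTu": 1, "naTa": 2}  # stem -> paradigm index
SUFFIXES = [                                # (suffix string, paradigm index, gloss)
    ("tteen", 1, "PAST+1SG"),
    ("nteen", 2, "PAST+1SG"),
    ("ppaan", 1, "FUT+3SG.M"),
    ("ppaan", 2, "FUT+3SG.M"),
]


def parse(word):
    """Try every split point; accept a split only if the stem and suffix indices match."""
    analyses = []
    for i in range(1, len(word)):
        stem, ending = word[:i], word[i:]
        if stem not in STEMS:
            continue
        for suffix, index, gloss in SUFFIXES:
            if suffix == ending and index == STEMS[stem]:
                analyses.append((stem, gloss))
    return analyses


def generate(stem, gloss):
    """Use the same two tables in the other direction."""
    for suffix, index, suffix_gloss in SUFFIXES:
        if index == STEMS[stem] and suffix_gloss == gloss:
            return stem + suffix
    return None


print(parse("naTanteen"))            # indices agree, so the word is analysed
print(parse("naTatteen"))            # indices clash, so no analysis is returned
print(generate("paTi", "PAST+1SG"))
```

Because parsing and generation consult the same pair of tables, such a processor generates exactly the forms it can parse.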

Vaishnavi’s Morphological Analyzer for Tamil:

Vaishnavi continued this work in her Ph.D. dissertation on the preparation of a morphological analyzer for Tamil.

Dhurai Pandi’s Morphological Generator and Parsing Engine for Tamil Verb Forms:

It is a full-fledged morphological generator and a parsing engine on verb patterns in modern Tamil.



AUKBC Research Centre, MIT Campus, Chennai.

Centre Of Applied Linguistics and Translation Studies, School of Humanities, University of Hyderabad.


Arulmozi, S. Aspects of Inflectional Morphology – A Computational Approach. Ph.D dissertation submitted to University of Hyderabad. 1998.

CevveeL Kapilan. kaNippoRivazhi tamizh vinaikaLin pakuuppaayvu. Chennai: Puttaakka Mozhiyiyal Kazhakam. 1994.

Dhurai Pandi, Morphological Generator and Parsing Engine for Tamil Verb Forms (abstract), in Kalyanasundaram (ed) Tamil Internet 2002: Conference Papers, Chennai: Asian Printers, p.59. 2002.

Ganesan, M. Functions of the Morphological Analyser Developed at CIIL, Mysore in Harikumar Basi (ed.) Automatic Translation (seminar proceedings), Thiruvananthapuram: ISDL. 1994.

-------------2003. Computational Morphology of Tamil In B. Ramakrishna Reddy (ed.) Word Structure in Dravidian, Kuppam: Dravidian University, pp. 399-405.

Ganesan, M and Francis Ekka. 1994. Morphological analyzer for Indian Languages. Agarawal and Pani (eds.) Information Technology Applications in Language, Script and Speech, New Delhi: BPB Publication.

Rajendran S, Arulmozi S, Ramesh Kumar S, & Viswanathan S. Computational Morphology of Verbal Complex. B. Ramakrishna Reddy (ed.). Word Structure in Dravidian, Kuppam: Dravidian University, 376-398. 2003.

Ramaswamy, V. Morphological Generator for Tamil. Unpublished M.Phil. dissertation. University of Hyderabad. 2000.

Ranganathan, V. A Lexical Phonology Approach to Tamil Words by Computer. International Journal of Dravidian Linguistics 26:1.57-70. 1997.

------------. A Lexical Phonology of Agglutinating Languages (A Case Study of Tamil).

Vaishnavi Ramaswamy. A Morphological Generator for Tamil. M.Phil. Dissertation Submitted to University of Hyderabad. 2000.

Vaishnavi Ramaswamy. A Morphological Analyzer for Tamil. Ph.D Dissertation Submitted to University of Hyderabad. 2003.

-------------------. Parsing in AMPLE, KIMMO & PERL: Nouns in Tamil.

Viswanathan, S., Rameshkumar, S., Kumara Shanmugam, B., Arulmozhi, S. & Vijay Shanker, K. A Tamil Morphological Analyzer. In Rajeev Sangal, S.M. Bendre & Udaya Narayana Singh (eds.), Recent Advances in Natural Language Processing. Mysore: CIIL. 2003.

Winston Cruz, S. Parsing and Generation of Tamil Verbs in GSMorph. M.Phil. dissertation submitted to the University of Hyderabad. 2002.


http://www.aukbc.org/research_areas/project/documentation/docoment.html (17/9/2001)


Baskaran’s Finite-state Machine for Syntactic Parsing:

The finite-state automaton is one of the important techniques for parsing at all levels of language structure. On an experimental basis, Baskaran (1984) attempted a finite-state machine for parsing sentences in Tamil.
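A finite-state recognizer for sentences can be sketched in a few lines. The states, the transition table and the four-word transliterated lexicon below are assumptions chosen to accept a simple verb-final pattern; they are not taken from Baskaran's experiment.

```python
# Hypothetical word -> category lexicon (transliterated Tamil).
LEXICON = {
    "avan": "NOUN",       # he
    "puttakam": "NOUN",   # book
    "neeRRu": "ADV",      # yesterday
    "paTittaan": "VERB",  # (he) read
}

# Transition table of the automaton: (state, category) -> next state.
TRANSITIONS = {
    ("START", "NOUN"): "ARGS",
    ("START", "ADV"): "ARGS",
    ("ARGS", "NOUN"): "ARGS",
    ("ARGS", "ADV"): "ARGS",
    ("ARGS", "VERB"): "S",
}
ACCEPTING = {"S"}


def accepts(sentence):
    """Return True if the category sequence of `sentence` ends in an accepting state."""
    state = "START"
    for word in sentence.split():
        category = LEXICON.get(word, "UNK")
        state = TRANSITIONS.get((state, category))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING


print(accepts("avan neeRRu puttakam paTittaan"))
print(accepts("paTittaan avan"))
```

The table accepts any sequence of nouns and adverbs followed by exactly one final verb; a fuller automaton would need more states for modifiers, clause embedding and the freer word order of Tamil.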

Kumara Shanmugam’s Syntactic Parser for Tamil:

Phrase structure grammars have been designed for fixed word order languages like English. Tamil is a variable word order language. In a sentence, the features of words or grammatical constituents can be tightly coupled or loosely coupled. In a fixed word order language, features like number and gender in the case of nouns, and tense and number in the case of verbs, are tightly coupled attachments to the respective syntactic category. The other linkages are loosely coupled and indicated by word proximity. In Tamil, in addition to features like number, gender and tense, the case attachments of nouns and the aspect and mood of verbs are tightly coupled through inflectional attachments and do not need word proximity to indicate dependency. Tamil therefore requires a different kind of grammatical formalism, one which relies on dependency rather than proximity. Tamil relies more on morphology than on syntax to indicate grammatical functions. Keeping these characteristics of Tamil in mind, Kumara Shanmugam (2004) has prepared a parser for Tamil. The parser he has designed carries out a complete morphological analysis of the words of the sentence at the first level in order to help in dependency determination. The parser divides the sentence into two basic constituents, namely the noun part and the verb part. In other words, it has a one-level syntax tree. Since Tamil has variable word order, the noun or verb parts may be discontinuous. The parser therefore uses the morphological analyzer to determine the tightly coupled features, which help in the classification of the words. Unclassified words are classified based on heuristics. Dependencies between the noun head and the verb head and their respective modifiers are handled with the help of dependency rules. Sentence patterns are then used to analyze the sentence. The selection of the sentence pattern depends on the information provided by the morphological analyzer. The addition of rules for semantic dependencies can enhance the performance of the parser.

Shanmugam’s Parsing Techniques:

For processing a natural language, certain formalisms are required. The grammatical models proposed by linguists, otherwise called grammatical formalisms, try to capture the phonological, grammatical and semantic organization of natural language partially or fully. Grammatical formalisms are written with the purpose of comprehending the units and patterns found at all levels of language. Computer scientists take the grammatical formalisms and modify them suitably to create database procedures for machines, so as to make the machines process, recognize and produce natural language units and structures. Such a computational description is called a computational formalism. Shanmugam, while proposing a program for syntactic parsing in Tamil, makes the following comments: “Structural description of the units of a language can be provided by the grammar of language, making use of a principle called ‘Projection Principle’. According to this principle, as in transformational grammatical treatise, the structure of a sentence or phrase can be projected or plotted from the lexical specification of the head of the phrase or sentence. That projected structure will be abstract structure which will be modified with due substitution of appropriate lexical items.” Shanmugam (2002) advocates the Minimalist Program for Tamil parsing. All grammatical formalisms identify a lexicon and certain procedures for creating and manipulating grammatical structures. The Minimalist Program, which is a grammatical model and an extension of the GB framework, was proposed by Chomsky to expose the grammatical patterns found in languages. Some of his M.Phil. and Ph.D. students have worked for their dissertations on the Context Free Grammar formalism, the Transformational Generative Grammar formalism, the Projection Principle and the Minimalist Program, and have prepared syntactic parser models for Tamil based on the formalism they have chosen.


Baskaran, S. Experiment in Implementation of Finite-State-Machine Parser for Tamil Sentences, Computer Society of India Communications, Mumbai.

--------- Compute as a Language Research Tools, in Proceedings of the Second Tamilnadu Congress, Chennai.

----------Tamil Computing, tamizh vaLarcci.

----------Evolution of Tamil Computing, Souvenir, National Symposium on Current Trends in Computer Applications.

Kumara Shanmugam, B. Parse Representation of Tamil Syntax. 2001.

------------------. Syntactic Parser for Tamil. M.S. (in NLP) dissertation submitted to Anna University, Chennai. 2004.

Shanmugam, C. Computer Analysis of Simple Sentences in Tamil, Paper read in UGC-SAP National Seminar on Computational Linguistics and Dravidian Languages, 22-24 February, 2001, CAS in Linguistics, Annamalai University, Annamalainagar. 2001.

-----------------Grammar and Parser: A Program for Syntactic Parsing in Tamil, International Seminar on Tamil Computing, 27-28 February and March 1, 2002, University of Madras, Chennai. 2002

------------ Minimalist Program for Tamil Parsing.


Tamil University Machine Translation system for Russian-Tamil:

On an experimental basis, the Departments of Computer Science, Linguistics and Translation of Tamil University, Thanjavur have developed a Machine Translation System for translating technical literature from Russian to Tamil. The research team achieved considerable success in the venture, and a monograph entitled Tamil University Machine Translation System has been published.

Tamil-Hindi Anusaraka:

Dr. Rajeev Sangal, who was working in the Department of Computer Science and Engineering, IIT, Kanpur, took up a major project, with the financial support of the DOE, for preparing translation aids (Anusaraka) to translate from one Indian language to another using the computer. The translation is at the word level, as it presumes that the word order of Indian languages is more or less the same and that grammatical function depends more on inflection than on word order. The commonness between Indian languages at the syntactic level has been exploited. Anusaraka works without a syntactic parser. There is only an efficient morphological analyzer. Rajendran of Tamil University, Thanjavur was associated with the group to build the Tamil-Hindi Anusaraka. A morphological analyzer was prepared for Tamil, and the equivalent Hindi forms were given. The analyzed word structure of Tamil was mapped against the Hindi equivalents. The transfer relied on a Tamil-Hindi transfer dictionary. A rudimentary Tamil-Hindi Anusaraka was developed in 1994. The NLP team of the AUKBC Research Centre obtained all these materials for the Tamil-Hindi Anusaraka from Rajeev Sangal and started improving on them. The team came out with an improved version of the above-mentioned device.

Machine Translation Aid to translate Linguistics texts in English into Tamil:

The demand for teaching Linguistics in Tamil has made it necessary to look for a tool which helps in translating English textbooks into Tamil. So a project aiming to prepare a Machine Translation Aid (MTA) to translate Linguistics texts in English into Tamil was conceived by Rajendran, Department of Linguistics, Tamil University, Thanjavur. This has culminated in a dissertation by Kamakshi (2001). The project has the following objectives:

∙ To understand the different machine translation models which are in vogue for selecting a feasible model.

∙ To study the language of linguistics so that the domain specific features of the language of linguistics are thoroughly understood before proceeding to translate linguistics texts by machine.

∙ To correlate the structure of the source language, English and target language, Tamil so that the transfer model adopted for translation can be successfully manipulated.

∙ To prepare a prototype of Machine Translation Aid (MTA) to transfer linguistics texts in English into Tamil.

The preliminaries to prepare MTA can be listed as follows:

1. Understanding the architecture of English structure

2. Correlation of English structure with that of Tamil to find out the salient commonalities and differences

3. Listing of transfer rules

4. Studying the style of Linguistics with special reference to Chomsky’s Aspects of the Theory of Syntax

5. Preparation of a bilingual transfer dictionary.

An MTA model has been built to transfer Linguistics texts in English into Tamil.

UNL-Interlingual Machine Translation approach for Tamil:

This project has been undertaken by the RCILRST, Chennai. The Universal Networking Language (UNL) is used as the intermediate representation. The device has an EnConverter and a DeConverter. The EnConverter is a language-dependent parser that provides a framework for synchronous morphological, syntactic and semantic analysis. It generates UNL expressions from sentences (or lists of the words of sentences) of Tamil by applying enconversion rules. In addition to the fundamental function of enconversion, it checks the formats of rules and outputs messages for any errors. It also outputs the information required at each stage of conversion at different levels. With these facilities, a rule developer can easily develop and improve rules by using the EnConverter. The DeConverter is likewise a language-dependent generator that provides a framework for synchronous word selection, morphological and syntactic generation, and the natural collocation necessary to form a sentence. The DeConverter can convert UNL expressions into a variety of native languages, using a language-specific word dictionary, set of grammatical rules and co-occurrence dictionary. Given a set of UNL structures, the primary task is to retrieve the relevant entries from the Tamil word dictionary corresponding to the words in the word part of the UNL structures. The next step in the deconversion process is the use of language-specific, linguistically based deconversion rules to convert the UNL structures into natural language sentences. These sentences have to obey the morphological and syntactic rules of the language. This is ensured by appropriately building the deconversion rules, which specify the morphological and syntactic structure of the language under consideration.

Vasu Renganathan’s Interactive Approach to Development of English-Tamil Machine Translation System on Web:

The work-in-progress version of this system may be tested online at the URL http://lrrc3.plc.upenn.edu/tamil/. This is a rule-based system containing around five thousand words in its lexicon and a wide range of transfer rules written in Prolog, which map frequently occurring English structures to the corresponding Tamil structures. Both the rule base and the lexicon of this system are built in such a way that users can extend the scope of the system interactively by adding words to the lexicon and rules to the rule base. Translating both colloquial and technical English into Tamil with a computer essentially involves the construction of these two blocks, namely the lexicon and the rule base.

Durai Pandi’s English-Tamil Machine Translation System:

A working model of English to Tamil Machine Aided Translation package for finite verb (simple sentences) structures has been designed based on TAM encoded MAT compatible lexicon, English and Tamil structure parsing engine and a Tamil structure generator engine. Presently the package is under development with the assistance from Tamil Software Development Fund (TSDF) of Tamil Virtual University (TVU), Chennai.


Chellamuthu, K.C. et al. Tamil University Machine Translation System (TUMTS), Thanjavur: Tamil University.

Chellamuthu, K.C. Russian to Tamil Machine Translation System at Tamil University. Kalyansundaram K (ed.) Tamil Internet 2002: Conference Papers, Chennai: Asian Printers, pp. 74-83. 2002.

Durai Pandi. English-Tamil Machine Translation System, Kalyansundaram K (ed.) Tamil Internet 2002: Conference Papers, Chennai: Asian Printers, p. 86. 2002.

Ganesan, M. Relevance on Indian Grammatical Theories for Automatic Translation: chances and challenges. paper presented in National Seminar of Dravidian Linguists Association, Dravidian University, Kuppam, June 1996.

Kamakshi, S. Preliminaries for Digitizing the Personal Pronouns in English into Tamil (Distribution Sensitive Machine Translation Aid – DSMTA) – A Demo Paper. Kalyansundaram K (ed.) Tamil Internet 2002: Conference Papers, Chennai: Asian Printers, pp. 87-97. 2002.

Kamakshi, S. Machine Recognition and Translation of ‘~ing’ words in English into Tamil through bilingual Machine Tractable Dictionary. Paper presented in International Conference on Indian Lexicography, 28-30 January 2004, CASL, Annamalai University, Annamalainagar. 2004.

Kumara Shanmugam, B. ‘Machine Translation as related to Tamil,’ in Kalyansundaram K (ed.) Tamil Internet 2002: Conference Papers, Chennai: Asian Printers, pages 84-85. 2002.

Rajendran, S. and Kamakshi, S. Preliminaries to the preparation of a machine translation aid to translate Linguistics Texts in English into Tamil. Paper read in UGC-SAP National Seminar ‘On Translation’ 7th-9th March 2002, CAS in Linguistics, Annamalai University. 2002.

Vasu Renganathan. Interactive Approach to Development of English-Tamil Machine Translation System on Web, in Kalyansundaram K (ed.) Tamil Internet 2002: Conference Papers, Chennai: Asian Printers, pp. 68-73. 2002.

1. Word Sense Disambiguation (WSD) tools

AUKBC Word Sense Disambiguation Tool:

Word sense disambiguation is the task of assigning the appropriate sense to every occurrence of an ambiguous word in a text. Most WSD systems use the context of the ambiguous word to determine its intended usage. Baskaran, a member of the AUKBC NLP team, worked on word sense disambiguation in Tamil for his MS degree and has prepared a WSD tool for Tamil to help in MT.
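The use of context to choose a sense can be illustrated with a minimal overlap-based sketch. It is not the AUKBC tool or Baskaran's method; the sense inventory for the ambiguous word aaRu (‘river’ or ‘six’), the cue words and the function name `disambiguate` are hypothetical.

```python
# Hypothetical sense inventory: each sense is described by a few cue words
# (transliterated Tamil) that tend to occur near that sense.
SENSES = {
    "aaRu": {
        "river": {"taNNiir", "karai", "kaTal"},  # water, bank, sea
        "six": {"aintu", "eezhu", "maNi"},       # five, seven, hour
    }
}


def disambiguate(word, context):
    """Pick the sense whose cue words overlap most with the context words.

    If no cue word occurs in the context, the first listed sense is returned.
    """
    context_words = set(context) - {word}
    senses = SENSES[word]
    return max(senses, key=lambda sense: len(senses[sense] & context_words))


print(disambiguate("aaRu", ["aaRu", "karai", "taNNiir"]))
print(disambiguate("aaRu", ["maNi", "aaRu"]))
```

A practical system would learn the cue words from a corpus and would analyze inflected context words morphologically before matching them.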


Baskaran, S. Word Sense Disambiguation of Tamil. MS dissertation submitted to Anna University, Chennai. 2002.

Baskaran, S. Word Sense Disambiguation in Tamil. International Journal of Dravidian Linguistics. 2003.

Baskaran, S and Vaidehi, V. Collocation Based Word Sense Disambiguation using Clustering for Tamil, International Journal of Dravidian Linguistics 33.1: 13-28. 2004.

Baskaran, S. Vijay-Shanker, K. Influence of Morphology in Word Sense Disambiguation for Tamil. In Rajeev Sangal et al (eds.) Recent Advances in Natural Language Processing. Mysore: Central Institute of Indian Languages. 2003.


1. Single font/ multifont / omnifont OCR systems

Traditionally, data was keyed in and stored in a computer by a data entry operator. While this is still done, and is the most reliable method, a need arose to automate the process in more labor-intensive tasks such as mail sorting, passport processing, insurance and finance form processing, bill processing and other tasks in which large volumes have to be sifted in a very short time. To automate these processes, character and handwriting recognition techniques are used.

Krishnamoorthy’s OCR System:

Krishnamoorthy seems to be the torch bearer in the race to prepare OCR software for Tamil. There are many approaches to designing an OCR system. Krishnamoorthy has chosen a new method, based on representing a letter as a graph. If a letter is considered as strokes of thin lines, it can be treated as a graph by inserting vertices. These vertices can be the end points and the points which are local minima or local maxima in the x and y directions. This representation of a letter as a graph has some advantages. The major advantage is that the information content is greatly reduced. Hence the processing needed to recognize a character is reduced in many cases, which speeds up the recognition process considerably. Krishnamoorthy has devised a method in which this graph is constructed without thinning the character map. This again speeds up the recognition. The major disadvantage, according to Krishnamoorthy, is that sometimes too much information is lost, and he has to resort to different methods to recognize a letter. He has listed a number of problems and solutions to overcome them. According to him, the solutions are complicated and will increase the recognition time considerably. But if the accuracy required is more than 95% or so, it may be necessary to employ all these techniques. Also, the logic should be built in such a way that each solution is handled intelligently so that the time taken is small. The shape of the characters, the linguistic nature of words, and the different approaches to character recognition all have to be combined and used judiciously to get the best result in OCR.
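The reduction of a stroke to graph vertices can be sketched as follows. The sketch assumes that a stroke has already been traced as an ordered list of (x, y) points, which is a simplification: Krishnamoorthy's method builds the graph directly from the character map without thinning, and that step is not reproduced here. The function names are hypothetical.

```python
def stroke_vertices(points):
    """Keep the end points and every point that is a strict local
    minimum or maximum in the x or the y direction."""
    if len(points) < 3:
        return list(points)
    vertices = [points[0]]
    for prev, cur, nxt in zip(points, points[1:], points[2:]):
        for axis in (0, 1):  # 0 = x direction, 1 = y direction
            a, b, c = prev[axis], cur[axis], nxt[axis]
            if (b > a and b > c) or (b < a and b < c):
                vertices.append(cur)
                break
    vertices.append(points[-1])
    return vertices


def stroke_edges(points):
    """Edges of the graph: consecutive vertices along the stroke."""
    vertices = stroke_vertices(points)
    return list(zip(vertices, vertices[1:]))


# An arch-shaped stroke sampled at five points.
arch = [(0, 0), (1, 2), (2, 3), (3, 2), (4, 0)]
print(stroke_vertices(arch))
print(stroke_edges(arch))
```

The five sampled points of the arch collapse to three vertices and two edges, which shows both the gain (less data to match) and the risk noted above (the curvature between vertices is lost).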


Vasudevan V. Character Recognition Techniques – A Demo Program. International Seminar on Tamil Computing, 27-28 February and March 1, 2002, University of Madras, Chennai. 2002.


vasu@kamban.com.au, vasi@au.ibm.com

2. Printed / typed / handwritten /shorthand


Hewavitharana, S.A. Two Stage Classification Approach to Tamil Handwriting Recognition. Kalyansundaram K (ed.) Tamil Internet 2002: Conference Papers, Chennai: Asian Printers, p. 86. 2002.


1. Signal Processing

Natanasabapathy’s Study of Segmental Duration of Tamil Sounds:

Natanasabapathy (2002) studies the segmental duration of Tamil sounds, knowledge of which is essential for speech synthesis. According to his analysis, the timing process operates at at least three different levels:

∙ at the level of sentence, phrase, word, and syllable where it is related to the boundary phenomena represented by vowel lengthening.

∙ at the phonological level as contrastive distribution in both vowels and consonants.

∙ at the segmental phonetic level to accommodate positional and contextual effects.

The relative duration is sufficient for phonological and linguistic purposes. However, for the purpose of speech synthesis, the absolute duration or at least three levels of duration is needed for the computer to understand the speech fully and correctly.


Natanasabapathy, S. International Seminar on Tamil Computing. paper presented in International Seminar on Tamil Computing, 27-28 February and 1 March, 2002, University of Madras, Chennai. 2002.

2. Text to speech (TTS)


This is also a product of RCILTST, Chennai. Ethiroli is a Text-to-speech Engine for Tamil. The engine has been so designed that it can be plugged into any application requiring a text-to-speech output. It is Tamil language specific and can be used in systems like Content Packages, Chatterbots, Telephone Interfaces to Online Help Systems and Interactive Sale Computers. The important features of this device are:

	∙	It handles ambiguity.
	∙	It uses Concatenation methods.
	∙	It implements Cues and Silence techniques.
	∙	It processes TAM encoded text files.


Yegnanarayana, B. et al. Text-to-speech System for Indian Languages. Presented at the Workshop on Computer Applications in Indian Languages held at CIIL, Mysore, August 19-21. 1992.

Yegnanarayana, B., Rajendran, S., Ramachandran, V.R. & Madhukumar, A.S. Significance of Knowledge Sources for a Text-to-speech System for Indian Languages, in Sadhana, Academic Proceedings in Engineering Sciences of India.

Yegnanarayana, B. Speech Synthesis by Machine. In B.B. Rajpurohit (ed.) Technology and Languages. Mysore: CIIL. 1994.

3. Speech Recognition / Understanding

Syllabic Study of Nayeemulla Khan:

Speech recognition aims at signal transformation. For recognition in Indian languages, syllable-like units seem to be appropriate units. Nayeemulla Khan and his team, in their efforts to develop a language-independent syllable recognizer, studied the basic characteristics of the words and syllable-like sub-word units occurring in three Indian languages (Tamil, Telugu and Hindi). Their observations about the structure, duration and statistical properties of the words and syllable-like units are presented in the paper cited below. The implications of these observations for speech recognition and language identification are highlighted in the paper.


Nayeemulla Khan, A., Suryakanth V. Gangashetty & Yegnanarayana, B. Syllabic Properties of Three Indian Languages: Implications for Speech Recognition and Language Identification. In Rajeev Sangal, S.M. Bendre & Udaya Narayana Singh (eds.), Recent Advances in Natural Language Processing. Mysore: Central Institute of Indian Languages. 2003.


A simple application like publishing can be implemented with nothing more than fonts to compose Indian language text. If the text to be processed is monolingual, this can be done by mapping on to any existing coding. However, in a bilingual or multilingual context, a language text cannot be processed without identifying the language code. Encoding a language is therefore necessary for the purposes outlined below:

∙ It makes it easy to identify the characters of a language, thereby simplifying language processing.

∙ It allows text to be intermixed easily with any other language.

∙ It eliminates the need for markup languages.
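The first point, identifying the language of each character from its code alone, can be sketched as follows. The sketch assumes Unicode text and uses the published block ranges for Devanagari, Tamil and Telugu; the function names `script_of` and `script_runs` are hypothetical.

```python
from itertools import groupby

# Unicode block ranges (inclusive code points) for three Indian scripts.
BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
}


def script_of(ch):
    """Name the script of a single character from its code point."""
    code = ord(ch)
    for name, (low, high) in BLOCKS.items():
        if low <= code <= high:
            return name
    if ch.isascii() and ch.isalpha():
        return "Latin"
    return "Other"


def script_runs(text):
    """Split mixed text into (script, substring) runs without any markup."""
    return [(script, "".join(chars)) for script, chars in groupby(text, key=script_of)]


print(script_runs("தமிழ் text"))
```

With a font-based glyph encoding, by contrast, the same byte values are reused across languages, so the language of a run cannot be recovered from the codes themselves.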

1. Character level standard: ISCII/UNICODE

In the computer world, R&D institutions and companies involved in technology development and its usage evolve a common specification called a ‘standard’ to meet their varied requirements. Sometimes the well-established protocols developed by the market leaders also evolve into a ‘standard’. There are standards for every aspect of computing, and natural language is no exception. BIS is the nodal authority; it has standardized the encoding of all Indian languages for computers and notified it as IS:13194:1991. Likewise, the Unicode Consortium has defined Unicode as the standard for all languages of the world. The current version at the time of writing is Unicode 3.0. Characters of all languages can be interpreted by using Unicode. When all operating system software uses Unicode for information interchange, no language text will turn into garbage (a common problem with today’s implementations). Unicode is the only way out of these language problems. The present Unicode encoding for Tamil, which encodes vowels, consonants and vowel signs, is based on an earlier version of ISCII.

Shortcomings of Unicode for Tamil:

Anbarasan (2001) notes the following as the shortcomings of Unicode 3.0 for Tamil:

Encoding Tamil alphabet:

Tamil alphabets are encoded as ayutham, vowels, consonants with inherent a, and vowel signs.

∙ It is to be noted here that the matras are not part of the character set of Tamil, so encoding them amounts to redefining the character set of Tamil.

∙ Vowel consonants are encoded instead of pure consonants.

∙ Anuswar is not a Tamil alphabet but is encoded.

Order of Tamil alphabets:

While encoding Tamil alphabets, the Tamil linguistic alphabetic order has not been followed.

Wrong interpretation:

The following explanation is presented only to show the quantum of error and not to rectify the coding of matras. The vowel matras ொ, ோ and ௌ are treated as if they are formed of two matras, which is not correct. A vowel in a vowel consonant is identified by its allograph, called a matra or sign, and the vowel consonant is formed of one consonant and one vowel. Similarly, ஔ cannot be formed of short ‘o’ and the ‘au’ vowel sign. Moreover, two ‘au’ signs are necessary even in the present implementation.


Natural language processing applications such as morphological analyzers, spell checkers, grammar checkers and translation systems depend entirely on pure consonants. Encoding vowel consonants forces further processing beyond language identification, which is unjustifiable.

Clash with Tamil Grammar:

By encoding halant, the basic rule of forming vowel consonants is modified. It is known that vowel consonants are formed from the basic consonants. As per the standard, however, the basic letter is the vowel consonant, i.e. the consonant with ‘a’, which leads to many problems in linguistic analysis.
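The points raised above about halant and the two-part vowel signs can be inspected with Python's standard `unicodedata` module. The sketch only reports what the Unicode character database contains; the helper names `names` and `decompose` are assumptions introduced for illustration.

```python
import unicodedata

KA = "\u0B95"        # TAMIL LETTER KA: the encoded unit is the syllable 'ka'
PULLI = "\u0BCD"     # TAMIL SIGN VIRAMA (halant / pulli): removes the inherent 'a'
pure_k = KA + PULLI  # the pure consonant 'k' therefore needs two code points


def names(text):
    """Official Unicode character names of each code point in `text`."""
    return [unicodedata.name(ch) for ch in text]


def decompose(text):
    """Code points of the canonical decomposition (NFD) of `text`."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", text)]


print(names(pure_k))
for sign in ("\u0BCA", "\u0BCB", "\u0BCC"):  # vowel signs o, oo, au
    print(unicodedata.name(sign), decompose(sign))
print(unicodedata.name("\u0B94"), decompose("\u0B94"))  # the vowel letter au
```

The output shows that the letter KA is a single code point while pure k is KA followed by the virama, and that the vowel signs o, oo and au, as well as the letter au, each decompose canonically into two code points.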

Standardization efforts of Government of Tamilnadu:

The Government of Tamilnadu constituted a Committee to study the issues related to encoding schemes, keyboard layouts and technical words, and to recommend suitable standards for developing Tamil on computers. To deliberate, discuss and arrive at a consensus on the technical issues, the first International Seminar was organized at NUS, Singapore, in 1997, thanks to the efforts of (Late) Naa. Govindaswamy. The second International Seminar was organized by the Government of Tamilnadu in 1999 in Chennai. The following draft standards were announced at the valedictory function of the seminar:

	∙	Standardized phonetic keyboard
	∙	Bilingual Glyph encoding
	∙	Monolingual Glyph encoding
	∙	Character encoding

Of these, only the first three draft standards were finalized and announced during June 1999. The character encoding scheme was clubbed with the Tamil Unicode standard.

Standardization efforts of Government of India:

The Department of Electronics (DoE), Government of India, announced the first standard, ISCII-83, for all Indian languages. The ISCII code was revised in 1988 to evolve its IBM PC counterpart, PC-ISCII. The existing ISCII standard was adopted by the Bureau of Indian Standards and announced as IS:13194:1991. Recognizing the need to rectify the existing standard, the DoE sent a questionnaire to all State Governments in June 2000 to clarify points related to Unicode. In response to this communication, the Standardisation Committee constituted by the Government of Tamilnadu recommended a 384-syllable encoding scheme for Tamil in Unicode, which consists of various allographs and symbols apart from vowels, consonants and vowel consonants.

There are three possibilities for encoding Tamil in Unicode (Anbarasan, 2001):

	∙	Vowels, ayutham and pure consonants
	∙	All letters, symbols, glyphs, etc.
	∙	Vowels, consonants and matras

Anbarasan (2001) points out that encoding Tamil by its alphabet, i.e. vowels, ayutham and consonants, would support the following:

∙	It represents the true Tamil alphabet system.
∙	It can be used for any type of writing system.
∙	Tamil can be implemented with just fonts.
∙	OS, database and office suites are already available with Tamil Unicode support.
∙	Uniscribe technology is available to develop any specific application with support for Tamil.
∙	OTF supports complex Indic scripts including Tamil.
∙	Development of NLP applications can be further accelerated due to simplicity in handling language codes.
∙	Sorting and indexing.
∙	Less overhead when embedding fonts.
∙	Unique representation of language alphabets and letters.



Anbarasan, N. Tamil Unicode-What do we need? Paper read in National Seminar on Computational Linguistics and Dravidian Languages. CAS in Linguistics, Annamalai University, Annamalai. 2001.

Baskaran, S. Report on Standardization of Tamil Key Board, Submitted to the Committee formed for Standardization of Tamil Key board by Tamilnadu Government.

Elangovan, A. Optimisation Techniques in Unicode Tamil Font Development, in Tamil Internet 2002: Conference Papers, Chennai: Asian Printers, p. 30. 2002.

Francis Ekka and Ganesan, M. Issues on Standard ISCII Codes and Inscript Keyboard. Agarawal and Pani (eds.) Information Technology Applications in Language, Script and Speech, New Delhi: BPB Publication. 1994.

The Unicode Standard: A Technical Introduction 8/05/2001.

Michael S. Kapalan. Unicode and Tamil, in Tamil Internet 2002: Conference Papers, Cennai: Asian Printers, p.1. 2002.

Ponnavaiko, M. An Investigation on Unicode Standards for Tamil, in Tamil Internet 2002: Conference Papers, Cennai: Asian Printers, p.29. 2002.


Applesoft, Bangalore-560010, e-mail: aplesoft@vsnl.com


Centre for Development of Advanced Computing (C-DAC), Pune.



2. Glyph standardization

Muthu Nedumaran. Glyph Choices and Techniques for Building Unicode Based Tamil Fonts, in Tamil Internet 2002: Conference Papers, Chennai: Asian Printers, p. 31. 2002.


It has been argued that although the classificatory system of Tamil sounds is based on their phonetic articulatory properties, the alphabetic system is not phonetic. The standard keyboard overlay, standardized on the basis of convenience, lacks scientific objectivity. The assumption that Tamil consonant characters include an implicit vowel a, and the arbitrary provision for deleting it in conjunct formation and in some other distributional contexts, are not based on any phonetic consideration. It is argued that although the existing standard ISCII character codes for consonants with a supposedly implicit vowel a, together with the provision for deleting it by a dot (halant), may serve the visual representation of consonant characters, they are technologically inadequate for NLP involving linguistic analysis. It is further argued that a uniform keyboard overlay for all Indian languages is arbitrary, since the frequency of a given letter differs from language to language.
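
The extra step that the implicit vowel imposes on linguistic analysis can be sketched as follows. This is an illustration only, assuming Unicode Tamil text, whose consonant letters carry an inherent a and whose virama (pulli) plays the role of the halant. The function name `to_phonemes` and the choice to write a pure consonant as consonant plus pulli are assumptions made for this example.

```python
import unicodedata

VIRAMA = "\u0BCD"          # pulli: cancels the inherent vowel
INHERENT_VOWEL = "\u0B85"  # the vowel a

# Dependent vowel signs mapped to the corresponding independent vowels.
SIGN_TO_VOWEL = {
    "\u0BBE": "\u0B86", "\u0BBF": "\u0B87", "\u0BC0": "\u0B88",
    "\u0BC1": "\u0B89", "\u0BC2": "\u0B8A", "\u0BC6": "\u0B8E",
    "\u0BC7": "\u0B8F", "\u0BC8": "\u0B90", "\u0BCA": "\u0B92",
    "\u0BCB": "\u0B93", "\u0BCC": "\u0B94",
}


def is_consonant(ch):
    """True for code points in the Tamil consonant range."""
    return "\u0B95" <= ch <= "\u0BB9"


def to_phonemes(text):
    """Split Tamil text into pure consonants and vowels."""
    # NFC joins two-part vowel signs into their single-code-point form.
    text = unicodedata.normalize("NFC", text)
    out, i = [], 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if is_consonant(ch):
            pure = ch + VIRAMA
            if nxt == VIRAMA:             # explicit pure consonant
                out.append(pure)
                i += 2
            elif nxt in SIGN_TO_VOWEL:    # consonant followed by a vowel sign
                out.extend([pure, SIGN_TO_VOWEL[nxt]])
                i += 2
            else:                         # bare consonant: restore the implicit a
                out.extend([pure, INHERENT_VOWEL])
                i += 1
        else:                             # independent vowel, aytham, other text
            out.append(ch)
            i += 1
    return out


print(to_phonemes("தமிழ்"))
```

A bare consonant code has to be expanded into a consonant and a vowel, while a consonant followed by the pulli is kept as a consonant alone. An encoding built on pure consonants and vowels would not need this reconstruction.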


Information Technology Department. Information Technology – Standardization of Tamil Key Board and Encoding of Tamil Glyphs – Recommendations of the Sub-committee on Tamil in Information Technology – Accepted – Orders – Issued. (G.O. Ms. No. 17 dated 13 June 1999)

Kalyanasundaram, K.A. Comparison of transliteration schemes and keymapping of Tamil fonts. http://www.Geocities.com/Athens/5180/translit.html 19/12/2000

4. Operating System level support

PONN: A Tamil Operating Environment:

To help Tamil speakers who do not know English, or who do not wish to use English for computer operations, Prasanna Venkatesan and team have proposed a Tamil operating environment called PONN. PONN has the following generic features:

∙	Open to the different standards followed in different schemes of Tamil software development
∙	Platform independent
∙	Scalable
∙	Generic framework for Tamil software development
∙	PONN Abstract Classes – reusable class library
∙	Set of utilities such as Shell, PONN Explorer, VASU, etc.

The PONN desktop provides the user interface: it launches the different utilities and lets the user dynamically choose the working language of PONN as either Tamil or English. Shell is a non-GUI interface to the PONN storage manager, which lets users carry out file and directory operations in Tamil using the keyboard. In contrast, PONN Explorer is a GUI for the PONN storage manager that helps users visualize their files and directories and operate on them with mouse clicks.

KURAL is a Tamil programming language, similar to any imperative programming language, designed for teaching programming in Tamil. A compiler translates a KURAL program into KURAL intermediate code, and an execution unit executes that code. The KURAL Integrated Development Environment acts as an editor for creating KURAL programs and for carrying out their translation and execution.
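
The general split between a compiler that emits intermediate code and an execution unit that runs it can be sketched as below. This is not KURAL: the source gives neither its syntax nor its intermediate code, so the toy arithmetic language, the instruction names (`PUSH`, `ADD`, `SUB`, `MUL`, `DIV`) and all function names here are invented purely for illustration.

```python
import operator
import re

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")
OPS = {"ADD": operator.add, "SUB": operator.sub,
       "MUL": operator.mul, "DIV": operator.floordiv}


def tokenize(src):
    """Split an arithmetic expression into number and operator tokens."""
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        match = TOKEN.match(src, pos)
        if not match:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def compile_expr(tokens):
    """Translate tokens into stack-machine intermediate code."""
    code, pos = [], 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def factor():
        tok = peek()
        if tok == "(":
            advance()
            expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
        elif tok is not None and tok.isdigit():
            code.append(("PUSH", int(tok)))
            advance()
        else:
            raise SyntaxError(f"unexpected token {tok!r}")

    def term():
        factor()
        while peek() in ("*", "/"):
            op = peek()
            advance()
            factor()
            code.append(("MUL",) if op == "*" else ("DIV",))

    def expr():
        term()
        while peek() in ("+", "-"):
            op = peek()
            advance()
            term()
            code.append(("ADD",) if op == "+" else ("SUB",))

    expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return code


def execute(code):
    """Execution unit: run the intermediate code on a value stack."""
    stack = []
    for instr in code:
        if instr[0] == "PUSH":
            stack.append(instr[1])
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(OPS[instr[0]](left, right))
    return stack.pop()


program = compile_expr(tokenize("2 + 3 * 4"))
print(program)
print(execute(program))
```

Keeping the two stages separate means the same intermediate code can be run by execution units written for different platforms, which is one plausible reason for such a design in a platform-independent environment.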

VASU is a simple document editor designed to create documents in Tamil. Help facilities are also provided with the environment.

PONN: A Tamil Operating Environment has been implemented using VC++ and Java and tested on Windows and Linux platforms.


RCILTST has developed a package called ‘Thaenkoodu’, which supports the MS-Access format. Thaenkoodu is a Tamil database package that helps business groups handle their data efficiently. It stores Tamil data and provides various means of manipulating, processing and retrieving it. Relevant data can be extracted from the database using queries, forms and reports. The entire database is stored in MS-Access format.


Arangam, developed by RCILTST, is a presentation tool for Tamil. It helps to organize a presentation with slides consisting of text, pictures and images. Computerized presentation is a necessity in a technologically savvy world, but existing presentation software does not easily support the Tamil language; it is also typically written for only one operating system and highly priced for a single user. Arangam offers an affordable, thin presentation package. It has the following features:

	∙	Supports add, remove or edit operations on objects.
	∙	Has three modes viz. Edit, Preview and Presentation Mode.
	∙	Edit mode allows correction on one selected slide.
	∙	Preview mode gives a preview of many slides.
	∙	Presentation mode is for slideshow.
	∙	Supports slide background and font format.


Chathurangam, also prepared by RCILTST, is a spreadsheet application that helps with financial applications and calculations. It has a Tamil user interface and accepts information in Tamil. Chathurangam makes it easy to edit or save data and to view it using various charts. Mathematical expressions are also handled. The salient features of this application are:

	∙	Supports many data formats.
	∙	Handles basic editing operations and mathematical expressions.
	∙	Print facility is available.
	∙	Different charts help to see the patterns in data.

6. Browser level support


Bavani, also a product of RCILTST, Chennai, is a Tamil search engine for documents on the Internet. It searches for Tamil words in Tamil web sites published in popular font-encoding schemes. An important feature of the search engine is its integration of a morphological analyzer: all searches are based on root words generated from the user's query, which allows a large coverage of documents on the Internet. It has the following salient features:

	∙	Currently supports 22 font-encoding schemes.
	∙	Allows search on multiple keywords.
	∙	Supports search on English words.
	∙	Has an elegant user interface to enter the query.
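
Root-word search of the kind described above can be sketched as follows. This is not Bavani's implementation: the function `root_of` is a toy stand-in for a morphological analyzer that only strips the Tamil plural suffix and ignores sandhi changes, and combining multiple keywords with AND is an assumption, since the source does not say how keywords are combined.

```python
from collections import defaultdict

PLURAL_SUFFIX = "கள்"  # Tamil plural marker; the only rule in this toy analyzer


def root_of(word):
    """Toy stand-in for a morphological analyzer: strip the plural suffix."""
    if word.endswith(PLURAL_SUFFIX) and len(word) > len(PLURAL_SUFFIX):
        return word[: -len(PLURAL_SUFFIX)]
    return word


def build_index(docs):
    """Map each root word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[root_of(word)].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain the root of every query word."""
    roots = [root_of(word) for word in query.split()]
    if not roots:
        return set()
    result = set(index.get(roots[0], set()))
    for root in roots[1:]:
        result &= index.get(root, set())
    return result


# Hypothetical three-document collection: "houses country", "house", "countries".
docs = {1: "வீடுகள் நாடு", 2: "வீடு", 3: "நாடுகள்"}
index = build_index(docs)

print(search(index, "வீடு"))        # query "house"
print(search(index, "வீடு நாடு"))   # query "house country"
```

Because both the documents and the query are reduced to roots before matching, a query for the singular form also reaches documents that contain only the plural, which is how root-based search widens coverage.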


Kasirao, V. Impact of Information Technology on Information Management Services with Special Reference to Tamil Computing: A Study. Paper read in International Seminar on Tamil Computing, February 27, 28. Madras University.

Prasanna Venkatesan, P and Chitrakekha T and Kuppuswami P. PONN: A Tamil Operating Environment in Tamil Internet 2002: Conference Papers, Chennai: Asian Printers. 2002

Brochures on Language Technology Products of the Resource Centre for Indian Language Technology Solutions- Tamil, Chennai.




Copyright CIIL-India Mysore