A. Corpus and Corpus Management Tools

 1. General, Tagged Corpus, Parallel Corpus, Aligned Corpora
 2. Corpus indexing tools (Concordance, KWIC index etc.) 
 3. Corpus compression and encryption tools 
 4. Text processing tools 
 5. Statistical Analysis Tools	

B. Text Editors and Word Processors

 1. Text editing tools 
 2. Word Processing tools 
 3. DTP tools

1. Text Editing Tools:


ITRANS is a package for printing texts in Indian language scripts. It was developed by Avinash Chopde. Its output scripts include Devanagari (Sanskrit/Hindi/Marathi), Tamil, Telugu, Kannada, Bengali, Gujarati, Gurmukhi, and Romanized Sanskrit. The input text to ITRANS is in transliterated form: each letter of an Indian script is assigned an English equivalent, and the English letters are used to construct what will eventually be printed in the Indian language script.
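The transliteration idea can be sketched in a few lines of Python. This is a minimal illustration and not the ITRANS implementation: the tables cover only a small subset of the scheme, the function name itrans_to_devanagari is hypothetical, and greedy longest-match scanning is an assumption made for brevity.

```python
VIRAMA = "\u094d"  # sign that suppresses a consonant's inherent "a"

# Small illustrative subset of ITRANS-style consonant codes.
CONSONANTS = {
    "k": "क", "kh": "ख", "g": "ग", "gh": "घ", "ch": "च", "j": "ज",
    "T": "ट", "Th": "ठ", "D": "ड", "N": "ण", "t": "त", "th": "थ",
    "d": "द", "dh": "ध", "n": "न", "p": "प", "b": "ब", "bh": "भ",
    "m": "म", "y": "य", "r": "र", "l": "ल", "v": "व", "sh": "श",
    "s": "स", "h": "ह", "L": "ळ",
}

# Each vowel maps to (independent letter, dependent sign used after a consonant).
# Dependent signs are written as escapes because they are combining marks.
VOWELS = {
    "a": ("अ", ""),
    "aa": ("आ", "\u093e"), "A": ("आ", "\u093e"),
    "i": ("इ", "\u093f"),
    "ii": ("ई", "\u0940"), "I": ("ई", "\u0940"),
    "u": ("उ", "\u0941"),
    "uu": ("ऊ", "\u0942"), "U": ("ऊ", "\u0942"),
    "e": ("ए", "\u0947"), "ai": ("ऐ", "\u0948"),
    "o": ("ओ", "\u094b"), "au": ("औ", "\u094c"),
}


def itrans_to_devanagari(text: str) -> str:
    """Convert a romanized string to Devanagari using the tables above."""
    out = []
    pending = False  # True while the last consonant still awaits its vowel
    i = 0
    while i < len(text):
        # Try a two-letter code first, then a one-letter code.
        for width in (2, 1):
            token = text[i:i + width]
            if token in CONSONANTS or token in VOWELS:
                break
        else:
            token = None

        if token is None:
            # Unknown character (space, punctuation): close any open consonant.
            if pending:
                out.append(VIRAMA)
            out.append(text[i])
            pending = False
            i += 1
            continue

        if token in CONSONANTS:
            if pending:
                out.append(VIRAMA)  # consonant cluster forms a conjunct
            out.append(CONSONANTS[token])
            pending = True
        else:
            independent, dependent = VOWELS[token]
            out.append(dependent if pending else independent)
            pending = False
        i += len(token)

    if pending:
        out.append(VIRAMA)
    return "".join(out)


print(itrans_to_devanagari("marAThI"))
print(itrans_to_devanagari("namaste"))
```

The pending flag is what handles composition: a consonant followed by another consonant receives a virama so that the two join into a conjunct, while a following vowel is written as a dependent sign rather than as an independent letter.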

2. Word Processing Tools:

Word processing software supports typing and document preparation, often in multiple languages from within a single application. Some packages support right-to-left scripts such as Arabic, Hebrew or Urdu. Some provide the input methods required for double-byte languages such as Japanese, Chinese or Korean. A word processor may or may not include spell checking for each language it supports; where a language is listed as supported, its spell checker may be bundled or sold as a separate module. Some word processing tools are as follows:

1) Marathi Saral Klik-2-Type is a Marathi word processor.

2) Shree-Lipi 5.0 is a script processor and font collection with a number of utilities for Windows 95, Windows 98, Windows NT, Windows 2000, Windows Me and Novell NetWare. It supports Devanagari (Hindi, Marathi, Nepali, Konkani etc.), Gujarati, Punjabi, Bengali, Assamese, Oriya, Tamil, Kannada, Telugu, Malayalam, Sindhi, Sanskrit, Sinhalese, Russian, Arabic, etc.

3) iLEAP is an Internet-ready Indian language word processor for Windows. Its main features are:

a) Self explanatory User interface

b) Multilingual Spellchecker

c) Choice of Keyboard layouts

d) Email facility for Indian languages. Send Indian language messages in LP2, ACI RTF, HTML, BMP, and JPEG formats to enable use of Web browsers or any standard text editor to view these messages

e) Facility to make web pages in Indian languages.

f) Language Sensitive Multilingual Editor

g) User Definable Shortcuts to type frequently used words and phrases.

h) Search and Replace in Indian Languages

i) Carry Indian language text in RTF format to other programs for Graphic Enhancements and Pre-press Processing

j) Define Styles and Design Templates in any language

		Website: www.cdacindia.com/html/gist/products/ileap.asp


1. Word Lists/ Vocabulary:

A sample Marathi vocabulary list is as follows:

		adrishya = invisible
		agyāna = ignorance
		aikya = unity
		akasmāta = unexpectedly
		akhanḍīta = unbroken
		āni = and 
		aṅgikārū = give shelter
		anna = food
		anta = end
		artha = money
		asā = like 
		ati = excess 
		āpalī = our
		ātā = now
		īsha = Lord 
		uḍī = jump
		upēkshā = neglect
		eka = one
		kā = why
		karā = do
		kāy = what
		kele = made
		kharā = ass
		koṇa = who
		kripā = grace
		ghyāvī = take
		gūja = secret
		chāla = walk(act)
		chāṅgale = good
		jana = people
		jāḷū = burn
		jāṇatā = wise
		jāvā = go
		jiṇe = life
		te = that
		to = he
		tujhā = your 
		tujhī = your
		tyāce = his
		daksha = attentive
		dāna = charity
		dilā = gave
		deha = body
		duṣṭa = evil 
		dūri = far
		na = no
		nahī = not
		nishā = night 
		nīḷa = blue
		pālaṭe = change
		parī = but
		pāve = bless
		bolije = speak
		mahī = earth
		makshikā = fly
		mājhe = my
		mānīta = accept 
		mhaṇenā = utter
		mī = I
		mithyā = false
		yā = this
		yeta = come
		rahāve = live
		ritā = empty
		laksha = attention
		laṅḍī = coward
		lapāve = hide
		lokū = people
		vadā = speak
		vase = lives
		vāhaṇe = carry
		vāṇī = speech
		vikalpe = doubt
		viṣā = poison
		vīsarā = forget
		vegaḷe = different
		sattā = rule
		sāṅḍū = drop
		sāpaḍe = find
		shīkavū = teach 
		shiḷā = stone
		shoka = grief
		shreṣṭha = great 
		hā = this
		hāni = harm
		hīta = well-being

2. Electronic/ Online Dictionaries:

a) MaTra Lite: a fully automatic online translator with a simple web-based interface.

b) I.I.T., Mumbai:

It is developing a Marathi text corpus in electronic form (including a Marathi dictionary and 10 Marathi classics).

c) IIT, Mumbai is a participant in the Universal Networking Language (UNL) project, an international project of the United Nations University. UNL is an interlingua for semantic representation. In this project, input in the source language is enconverted into UNL and then deconverted from UNL into the target language. At present, work is in progress on Marathi, Hindi and English.


d) Marathi and Hindi WordNets by Indian Institute of Technology, Mumbai:

		http://www.cfilt.iitb.ac.in/wordnet/webmwn/ - Marathi WordNet
		http://www.cfilt.iitb.ac.in/wordnet/webhwn/ - Hindi WordNet

These WordNets are compatible with the English WordNet and EuroWordNet. There are 5,521 synsets in the Marathi WordNet and 11,312 in the Hindi WordNet.

e) Marathi Dictionary


f) World Language page for Marathi


g) URLs on Marathi language:

Institutes and organisations working on the Marathi language:


D. Spell Checkers/ Grammar Checkers/Style Checkers:

Spell Checkers:

Spell checking utilities come in various forms. Some work only with specific applications, such as MS Word or Office. Others work through highlight or clipboard functions in most applications under Windows or Mac.

Grammar Checkers:

Grammar checking is a utility, typically part of a spell checker, that checks grammar and sometimes makes suggestions. A spell checker may not support grammar checking in every language for which it offers spell checking.

E. Parsing Systems:

Parsing amounts to extracting the underlying semantics of an expression by identifying its parts of speech and their inter-relations. Indian languages are free word order languages: the role of a word in a sentence is defined by its morphology, by world knowledge and, in a limited way, by its position.

Parsing free word order languages therefore relies on semantic parsing techniques. There are two modern techniques for semantic “parsing” of sentences: 1) the Universal Networking Language, and 2) the Paninian Grammar formalism. The Universal Networking Language is aimed at machine translation, whereas the Paninian Grammar formalism is a lightweight, general-purpose method. Work on Indian languages uses the Paninian Grammar formalism.

The problems currently being tackled are the development of a stable word grouper (which handles noun phrases and compound verbs) and a clause-level parser for Indian languages.
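The point that a word's role follows from its morphology rather than its position can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the romanized Hindi tokens, the KARAKA_BY_MARKER table (real markers such as "ko" and "se" are ambiguous between several roles) and the function assign_roles. It is not a Paninian parser, only a picture of why word order does not change the analysis.

```python
# Hypothetical, heavily simplified table mapping a case marker
# (postposition) to a Paninian karaka role.
KARAKA_BY_MARKER = {
    "ne": "karta (agent)",
    "ko": "karma (object)",
    "se": "karana (instrument)",
    "meM": "adhikarana (location)",
}


def assign_roles(tokens):
    """Assign a role to each word that is followed by a known case marker."""
    roles = {}
    unmarked = []
    # Pair every token with the token that follows it (None at the end).
    for word, following in zip(tokens, tokens[1:] + [None]):
        if word in KARAKA_BY_MARKER:
            continue  # markers are consumed together with the preceding word
        if following in KARAKA_BY_MARKER:
            roles[KARAKA_BY_MARKER[following]] = word
        else:
            unmarked.append(word)  # e.g. the verb
    roles["unmarked"] = unmarked
    return roles


# Two orderings of the same toy sentence ("Ram cut the fruit with a knife").
order_a = "raam ne chaakuu se phal ko kaaTaa".split()
order_b = "phal ko raam ne chaakuu se kaaTaa".split()

print(assign_roles(order_a))
print(assign_roles(order_a) == assign_roles(order_b))
```

Because roles are read off the markers, both orderings yield the same assignment. A real system would also need the word grouper and clause-level parser described in this section.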

2. Morphological Analyzers:

The morphological analyzer is a component of Natural Language Processing systems for Indian languages. In Indian languages with free word order (such as Hindi), the semantics depend on the surface structure of the word. Morphological analyzers identify the structural components of a word and collect information about it.

For Marathi, the morphological analyzer identifies the tense, aspect, modality and person of an inflected verb form, as well as the root of the verb. For Hindi, gender and number may be identified as well. For nouns, the analyzer determines the inflection, suffixes and prefixes. It also analyses the lexical word groups that correspond to a noun and determines its semantic role.
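A suffix-stripping analyzer gives a rough picture of how a root and its features might be recovered from an inflected verb. The sketch below is an assumption-laden toy: the SUFFIX_RULES table is a hypothetical, simplified sample of romanized Marathi verb endings that should be checked against a proper grammar, and the function analyze is not taken from any of the systems described here.

```python
# Hypothetical suffix table for romanized Marathi verb forms.
# Each entry: (suffix, features it signals). Illustrative only.
SUFFIX_RULES = [
    ("tos", {"tense": "present", "person": 2}),
    ("tāt", {"tense": "present", "person": 3}),
    ("tīl", {"tense": "future", "person": 3}),
    ("īn", {"tense": "future", "person": 1}),
    ("el", {"tense": "future", "person": 3}),
]


def analyze(word):
    """Return the root and features of a verb form, or None if no rule fits."""
    # Try longer suffixes first so a short suffix never shadows a long one.
    for suffix, features in sorted(SUFFIX_RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix) and len(word) > len(suffix):
            analysis = {"root": word[: -len(suffix)]}
            analysis.update(features)
            return analysis
    return None


for form in ["kartos", "kartāt", "karel", "karīn"]:
    print(form, analyze(form))
```

A production analyzer would use a lexicon of roots and paradigm tables rather than bare suffix matching, which cannot separate forms that happen to share an ending.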


F. Machine Translation and Translation Tools:

1. Indian Institute of Technology, Bombay:

The main aim of I.I.T., Mumbai is to empower the people of India through Information Technology solutions in Indian languages. It is developing new products and services for processing information in Indian languages, and it conducts research in the computer processing of Indian languages. Marathi is the Indian language on which I.I.T. focuses.

It has created a web site on the Marathi language and Marathi language technologies. It is developing a Marathi text corpus in electronic form (including a Marathi dictionary and 10 Marathi classics). It has produced a Marathi portal complete with a search engine, machine translation software with a dictionary, a WordNet, and online textbooks in Marathi for schools. It has conducted research on machine translation between Marathi on the one hand and Hindi and English on the other. It has also introduced speech technology for Marathi. The output of I.I.T. is as follows:

	1. Web site on Marathi language and Marathi language technologies
	2. Training programmes & workshops
	3. Electronic corpus of Marathi text.
	4. Hindi Wordnet.
	5. Marathi portal complete with search engine.
	6. Machine Translation software along with dictionary.
	7. Online textbooks in Marathi for schools.

2. MaTra:

MaTra is a human-aided machine translation tool for English to Hindi. It is a web-based subsystem for translating English into Hindi. There are two versions of MaTra, which differ in the amount of interaction expected from the user.

a) MaTra Lite: a fully automatic online translator with a simple web-based interface.

b) MaTra Pro: a professional translator's tool with Auto, Semi-Auto and Manual modes, a GUI and a customizable lexicon.

They are useful to media and news agencies, translation bureaus, and educational institutions involved in distance and online education.


3. ManTra:

ManTra is a machine-assisted translation tool for English to Hindi. It translates English text into Hindi in the specified domain of Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars. The strategy adopted in ManTra is the lexical tree. The ManTra technology is being extended to translate English texts into other Indian languages such as Gujarati, Marathi, Bengali, and Telugu. It is useful to translators, linguists, government offices, the Central Translation Bureau and other translation units.


4. Anusaaraka:

Anusaaraka is a language accessor: computer software that renders text from one Indian language into another. It produces output that is understandable to the reader, although at times it may not be grammatical. For example, a Marathi to Hindi Anusaaraka can take a Marathi text and produce output in Hindi that a Hindi reader can understand but that is not fully grammatical. The reader requires some training to read the output. Anusaarakas have been built from Telugu, Kannada, Bengali, Marathi and Punjabi to Hindi. Beta versions for these languages have been released for use over the Internet as e-mail servers. The storage code for Anusaaraka is ISCII. It can be used in various scenarios. For example, a reader browsing web sites that contain Indian language texts comes across a site of interest and wants to read the material on it, but does not know the language. The reader can run Anusaaraka and read the text. Normally, the reader's motivation is high and he or she is willing to put in some effort.

5. Marathi WordNet:

A Lexical Database for Marathi:

Marathi WordNet Online lets users browse the Marathi WordNet database through an HTML form interface. The web site uses Devanagari fonts.


1. Optical Character Recognition System for Devanagari

This is a software product that works with a scanner. A user places a paper document printed in Devanagari (Hindi) script under the scanner, runs the OCR software and obtains all the text of the document on the computer, just as if it had been typed in. The data is stored in ISCII code. The system is developed in the C programming language. It runs on the Linux platform and can easily be ported to Windows. The OCR software can be integrated with a Hindi speech synthesis system to build a text-to-speech system for Hindi. It can also serve as the front end of a machine-aided translation system. It is used by newspaper houses (printing in Devanagari script), libraries, offices looking for office automation, the linguistic community (for creating corpora), blind people, etc.

2. Fonts and Keyboard Handler:

This is a software subsystem. Indian scripts require composition: the subsystem must replace a sequence of key codes with the corresponding conjunct character form. Matras and exceptions are displayed according to the rules of each language. The fonts are designed to support a large set of conjuncts and are available in TrueType or Adobe Type 1 format. At present, there are no accepted standards for Indian language fonts. A variety of keyboard layouts are supported for each language. The subsystem can be installed on Windows 95 through Windows 2000. It enables the Indian community to use Indian languages on the PC.

3. India Multilingual Solution:

This is a library of over 250 high-quality fonts covering all Indian languages. The fonts are available as PostScript Type 1, TrueType, OpenType or PFR for the Web. They are used by software development companies that build eGovernance applications for various government departments. They are also used by State governments that wish to buy an enterprise license to meet all their text and data processing requirements.

4. OCR for Printed Devanagari Text:

Various image processing algorithms have been developed for obtaining the image matrices of characters and for identifying Devanagari characters and words in laser-printed text. This OCR was developed by C-DAC, Pune.

J. Speech Technology:

2. Text to Speech System (TTS):

In this system, synthesized speech is generated from a text sentence in several steps. The sentence is first analyzed by a Natural Language Parser. After morphological and phonological analysis, the grapheme string is converted to a phoneme string, whose units can be looked up directly in the dictionary and concatenated.
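The last two stages, grapheme-to-phoneme conversion followed by dictionary lookup and concatenation, can be sketched as follows. All names and data are assumptions for illustration: G2P is a hypothetical rule table for a romanized input, UNIT_DICTIONARY stands in for a store of recorded speech units (short lists of numbers here instead of audio samples), and the parsing and morphological stages are omitted.

```python
# Hypothetical grapheme-to-phoneme rules for romanized input.
G2P = {"sh": "S", "aa": "A", "a": "a", "l": "l", "k": "k", "m": "m"}

# Hypothetical unit dictionary: phoneme label -> placeholder "samples".
UNIT_DICTIONARY = {
    "S": [1, 2], "A": [3, 3, 3], "a": [3],
    "l": [4], "k": [5], "m": [6],
}


def to_phonemes(graphemes):
    """Convert a grapheme string to phoneme labels, longest match first."""
    phonemes = []
    i = 0
    while i < len(graphemes):
        for width in (2, 1):
            chunk = graphemes[i:i + width]
            if chunk in G2P:
                phonemes.append(G2P[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no rule for {graphemes[i]!r}")
    return phonemes


def synthesize(graphemes):
    """Look up each phoneme's unit and concatenate the units in order."""
    samples = []
    for phoneme in to_phonemes(graphemes):
        samples.extend(UNIT_DICTIONARY[phoneme])
    return samples


print(to_phonemes("shaalaa"))
print(synthesize("shaalaa"))
```

Plain concatenation ignores the joins between units; an actual synthesizer would also smooth unit boundaries and apply prosody.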

K. Standardisation Issues:

1. Character Level Standards: ISCII/UNICODE

The Unicode Indic ranges are based on the Indian standard ISCII (Indian Standard Code for Information Interchange, 1988). ISCII is a well-intentioned standard that covers all the major Indian scripts. In its Devanagari range, it tries to capture all the characters of any Indic script. For example, native Devanagari has no equivalent of the letter 'ZHA' found in Tamil and Malayalam (in Unicode, TAMIL LETTER LLLA (U+0BB4) and MALAYALAM LETTER LLLA (U+0D34)), so ISCII and Unicode define a placeholder character for this letter. The idea is to have at least one script that can represent all the Indian languages in a lossless way. ISCII is also meant to be a symmetric encoding, in the sense that text encoded in ISCII can be displayed in any supported script without much re-encoding, so long as the target script has symbols for all the characters in the text. In achieving this, however, ISCII had to make many compromises. It also turned out to be a very complex encoding, not readily implementable with the technologies used for other popular languages.

The Unicode range used for the Marathi language is:

	Script		Language					Unicode Range
	Devanagari	Sanskrit, Hindi, Marathi, Konkani and Nepali	U+0900 to U+097F
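Both the range check and the symmetric layout inherited from ISCII can be tried out with Python's standard unicodedata module. This is a sketch under stated assumptions: the function names and the BLOCK_START table are illustrative, only four scripts are listed, and shifting code points by a block offset is a simplification that holds only where the target block assigns a character at the same position.

```python
import unicodedata

# First code point of some Indic Unicode blocks. The blocks share a
# common layout inherited from ISCII, each spanning 0x80 code points.
BLOCK_START = {
    "Devanagari": 0x0900,
    "Bengali": 0x0980,
    "Gujarati": 0x0A80,
    "Tamil": 0x0B80,
}
BLOCK_SIZE = 0x80


def is_devanagari(ch):
    """True if the character lies in the Devanagari block U+0900..U+097F."""
    return 0x0900 <= ord(ch) <= 0x097F


def to_script(text, target, source="Devanagari"):
    """Shift characters from the source block to the same slot in the target."""
    start = BLOCK_START[source]
    offset = BLOCK_START[target] - start
    out = []
    for ch in text:
        code = ord(ch)
        if start <= code < start + BLOCK_SIZE:
            candidate = chr(code + offset)
            # Keep the original when the target script has no such character.
            if unicodedata.name(candidate, ""):
                ch = candidate
        out.append(ch)
    return "".join(out)


ka = "\u0915"  # DEVANAGARI LETTER KA
print(is_devanagari(ka))
print(unicodedata.name(to_script(ka, "Bengali")))
print(unicodedata.name(to_script(ka, "Tamil")))
```

The fallback branch reflects the condition stated above: conversion works only so long as the target script has a symbol for the character. DEVANAGARI LETTER KHA (U+0916), for instance, has no assigned counterpart at the same slot in the Tamil block and is left unchanged.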

1. Keyboard Layout:


The InScript keyboard layout is used for data entry in Indian languages. It uses the default 101-key keyboard. The mapping of characters is common to all Indian languages. All the vowels are placed on the left side of the keyboard layout and the consonants on the right side.

 	Web Site: www.cdacindia.com/htmlgis/standard/inscript.asp 
Marathi Keyboard Layout


This is an Indian language keyboard program developed by Avinash Chopde. It lets the user type text in any Indian language script by memorizing only 50-60 keys. If the user knows the basic vowels and consonants of a language, the program automatically generates the 200+ characters (glyphs) required to typeset text correctly in that language. It contains a high-quality TrueType font (developed by Shrikrishna Patel) and a software module that runs in the background under the Microsoft Windows operating system. The software maps the ASCII English keyboard to a particular Indian language script.

	Web Site: www.aczoom.com/ilkeyb   

Keyboard guide:

