Revision as of 11:16, 14 July 2013
Technical Tools for Tamil (3T) Laboratory
Home
Vision
We see Tamil as one of the mightiest, oldest and still thriving languages in the world. We believe that adopting technological advancements is crucial for the longevity and completeness of Tamil. We aim to lay the groundwork for the technological tools that are essential to make the language available on up-to-date technology and to prepare it for the next generation.
Objectives
• Perform the fundamental research in language and technology that is needed to develop tools dedicated to the Tamil language.
• Gather information from present and previous works, then analyse and review them. Organise these works into the themes of the 3T Laboratory in order to support current research and launch future projects.
• Promote synergy with independent research groups, institutions and universities that engage in research on Tamil linguistics and technology. Offer a common platform for collaboration and information exchange. Provide the required resources, help and consultancy for emerging research scholars.
• Establish scientifically validated methodologies and frameworks that help standardise technology related to the Tamil language. These standards will be used as ground rules and serve as the basis for future research and application development.
• Deliver urgently needed language essentials to researchers and communities, e.g. dictionary databases, text corpora, language packs for multi-language applications, etc.
About
What We Do
3T Laboratory researches and develops technological tools for the Tamil language.
Who We Are
3T Laboratory is a team of experts and volunteers with a common goal: building contemporary tools for the Tamil language and exploiting cutting-edge technologies to explore its heritage and uniqueness. The team is made up of technologists and linguists whose specialities include science, engineering, linguistics, library science, and archaeology. Its researchers provide the institutional expertise needed to keep 3T Laboratory at the technological and linguistic forefront of the many specialised areas involved in technologies for Tamil.
We Invite
We invest time and resources in the areas that we believe will most effectively advance combined research in technology and language. We invite individual researchers and groups from language studies and technology to work collaboratively in this laboratory. By bringing isolated studies and efforts together on this common platform, we aim to institutionalise and standardise technology for the language. 3T Laboratory creates scope for knowledge sharing, peer review, constructive criticism, and collaboration. Besides the benefits to the language, the laboratory gives individual researchers a platform to demonstrate their knowledge and skills, supporting their career development.
Open Source
We are committed to developing open source tools and to furthering the open source research community. The Tamil language is open to everyone in the world, and we therefore strongly believe that knowledge related to the language should not be restricted to a closed community. Our research and the tools we develop will be published on publicly available media, making them free for anyone to use, change, and improve. We see this open source approach as enabling innovation and helping to ensure that the adoption of technology into the language is a transparent process with a positive societal impact. Nevertheless, some contributors donate their own intellectual property, and we also receive proprietary contributions from third parties. These contributions are made by broad-minded individuals and institutions devoted to the Tamil language. We respect our agreements with these contributors and endeavour to protect and use their property as stated in the agreed conditions.
Research
The following research themes have been identified as important and are prioritised for research and development in the 3T Laboratory.
- Creation of an Open-ended Framework for Language Modelling
- Natural Language Processing Tools for Tamil
- Tamil Optical Character Recognition
- Tamil Speech Recognition
- Tamil Speech Synthesis
- Translation for Tamil
- Text Input Methods for Tamil
- Knowledge Engineering in Tamil
- Evolution of the Tamil Language
Creation of an Open-ended Framework for Language Modelling
Parsing is the fundamental task in Natural Language Processing and Computational Linguistics, and the language model is the heart of a parser: it provides the means to predict words and sentences that conform to the patterns and grammar of a language. Language models for Tamil have previously been constructed using a statistical model, which deals with semantics, and a structural model, which deals with syntax. Significant improvements have been achieved recently by applying structural information to the statistical model, an approach called the hybrid model. Several features of Tamil make parsing particularly interesting. Tamil is an agglutinative language, which results in long words with many suffixes. Nouns in Tamil are classified as rational (humans) and irrational (other nouns). Tamil verbs are inflected through suffixes that indicate person, number, mood, tense and voice. Tamil is also a head-final language in which the verb comes at the end of the clause, giving an SOV (Subject Object Verb) order in contrast to the SVO order of English. In addition, Tamil allows the word order to be changed, making it a relatively free word order language. We aim to create an open-ended framework for language modelling. Creating a language model that caters for these features is a huge challenge, and hence requires a careful study of existing approaches and the development of an appropriate model.
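As a minimal sketch of the statistical side of such a language model, the Python example below estimates bigram probabilities with add-one smoothing from a three-sentence toy corpus. The corpus, the function names `train_bigram` and `bigram_prob`, and the choice of smoothing are assumptions made for illustration only; they are not the framework described above, and a practical model for an agglutinative language would probably need to work on morphemes rather than whole words.

```python
from collections import Counter, defaultdict

# Toy corpus (assumed for illustration): three short verb-final (SOV) sentences.
CORPUS = [
    "நான் புத்தகம் படித்தேன்",
    "அவன் புத்தகம் படித்தான்",
    "நான் பாடம் படித்தேன்",
]

def train_bigram(sentences):
    """Count unigrams and bigrams, padding each sentence with boundary markers."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[prev][cur] += 1
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, cur):
    """P(cur | prev) with add-one (Laplace) smoothing over the vocabulary."""
    vocab_size = len(unigrams)
    context_total = sum(bigrams[prev].values())
    return (bigrams[prev][cur] + 1) / (context_total + vocab_size)

unigrams, bigrams = train_bigram(CORPUS)
subject, obj, verb = CORPUS[0].split()
# After the object, a verb seen in training scores higher than a subject.
print(bigram_prob(unigrams, bigrams, obj, verb) > bigram_prob(unigrams, bigrams, obj, subject))
```

Because the sketch only counts adjacent word pairs, it captures none of the structural information mentioned above; a hybrid model would combine such counts with syntactic constraints.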
Natural Language Processing Tools for Tamil
We focus on studying the existing approaches and tools that have already been implemented for Tamil, while also examining their limitations. We expect to take a broader approach to creating new NLP tools for Tamil by carefully studying the tools that are available for English and coming up with innovative solutions. Listed below is a set of tools widely used in text engineering for English, some of which have already been implemented for Tamil.
- Tokeniser - Splitting of text into very simple tokens such as numbers and words of different types.
- Sentence Splitter - Splitting of text into individual sentences.
- Tamil Character Analyzer - Analyses each character in a word and, at the basic level, determines whether it is a vowel, a consonant (vallinam, mellinam or idaiyinam) or a syllable. Tamil has 12 vowels and 18 consonants.
- Tamil Part of Speech Tagger - Automatically annotates each word in a corpus with its syntactic category (e.g. DET, ADJ, NN).
- Chunker - Chunking is the task of identifying and segmenting the text into syntactically correlated word groups. It divides a sentence into its major non-overlapping phrases, such as noun phrases and verb phrases, and attaches a label to each chunk.
- Dependency Parser - Dependency grammar is based on the idea that the syntactic structure of a sentence consists of binary asymmetrical relations between the words of the sentence. The dependency parsing approach should use linguistic information to identify the relationships between words. Given an input sentence, the output should be a dependency tree and its dependency relations.
- Tamil Morphological Analyzer - Segments words into morphemes and analyses word formation. It should retrieve the structure, syntactic rules, morphological properties and meaning of a morphologically complex word. The morphological analyzer could also be used in speech synthesis, speech recognition, spelling and grammar checking, and machine translation.
- Tamil Morphological Generator - The morphological generator should take a lemma and a morpho-lexical description as input and give a word form as output. It is the reverse of the morphological analyzer.
- Named Entity Recognizer - Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, or organisation).
- Coreference Identifier - It should determine whether two or more expressions in natural language refer to the same entity in the text. The entity could be a person, location or organisation.
- Anaphora Resolution of Pronouns - Anaphora is a cohesive device that points back to some previous item. The expression that points back (the reference) is called an anaphor, and the entity to which it refers is its antecedent. The anaphora resolver should be able to resolve pronouns by linking each one to its antecedent.
- Sentiment Analysis - A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level: whether the opinion expressed in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. Advanced, "beyond polarity" sentiment classification could look, for instance, at emotional states such as "angry," "sad," and "happy."
- Word Sense Disambiguation - Some words in Tamil have two or more meanings, which is challenging for machine translation, question answering and information retrieval. Hence the tool should aim at disambiguating such words when they occur in a text, so that the right sense is used.
- English to Tamil Machine Translator - The morphological and syntactic differences between English and Tamil make the problem harder to solve. The basic English word order, SVO (Subject, Verb, Object), becomes SOV in Tamil. This calls for a syntax reordering module and a morphological generator.
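The character analyzer in the list above can be sketched in a few lines of Python. This is an illustrative sketch, not an existing 3T Laboratory tool: the function name `analyse` and the label strings are hypothetical, and it only covers the 12 vowels and 18 consonants mentioned above. It relies on the way Unicode encodes Tamil, where a consonant letter on its own carries an inherent "a", a following pulli (U+0BCD) marks a pure consonant, and a following vowel sign forms another syllable.

```python
import unicodedata

VOWELS = set("அஆஇஈஉஊஎஏஐஒஓஔ")  # the 12 vowels

# The 18 consonants, grouped into the three traditional classes.
CONSONANT_CLASS = {}
for class_name, letters in (("vallinam", "கசடதபற"),
                            ("mellinam", "ஙஞணநமன"),
                            ("idaiyinam", "யரலவழள")):
    for letter in letters:
        CONSONANT_CLASS[letter] = class_name

PULLI = "\u0bcd"  # removes the inherent vowel, leaving a pure consonant
VOWEL_SIGNS = {chr(cp) for cp in (0x0BBE, 0x0BBF, 0x0BC0, 0x0BC1, 0x0BC2,
                                  0x0BC6, 0x0BC7, 0x0BC8, 0x0BCA, 0x0BCB, 0x0BCC)}

def analyse(word):
    """Split a word into letters and label each as vowel, consonant or syllable."""
    # NFC merges two-part vowel signs (e.g. U+0BC6 + U+0BBE) into one code point.
    word = unicodedata.normalize("NFC", word)
    result = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in VOWELS:
            result.append((ch, "vowel"))
            i += 1
        elif ch in CONSONANT_CLASS:
            cls = CONSONANT_CLASS[ch]
            if nxt == PULLI:
                result.append((ch + nxt, f"consonant ({cls})"))
                i += 2
            elif nxt in VOWEL_SIGNS:
                result.append((ch + nxt, f"syllable ({cls})"))
                i += 2
            else:  # bare consonant letter: syllable with the inherent vowel
                result.append((ch, f"syllable ({cls})"))
                i += 1
        else:  # Grantha letters, aytham, digits, punctuation, other scripts
            result.append((ch, "other"))
            i += 1
    return result

for letter, label in analyse("தமிழ்"):
    print(letter, label)
```

For the word "தமிழ்" the sketch yields three units: two syllables followed by a pure idaiyinam consonant. Anything outside the 30 base letters is labelled "other", which a fuller analyzer would need to handle explicitly.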
Tamil Optical Character Recognition
Character recognition involves extracting defined characteristics, called features, in order to classify an unknown character into one of the known classes. OCR therefore involves two processes: feature extraction and classification. Character recognition is particularly difficult for Tamil because many inter-class dependencies exist and many letters look alike, so classification becomes a big challenge. Rule-based classifiers and SVMs have recently been used for classification in Tamil OCR. The large collection of online textbooks and other relevant material that has already been captured in digital format would be of great use once there is a Tamil OCR system that reads image formats, since the extracted data could then be used for research purposes.
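The two stages can be illustrated with a deliberately small Python sketch. It uses zoning features (ink density per region of the glyph image) and a nearest-prototype classifier, which is a simpler stand-in for the rule-based and SVM classifiers mentioned above. The 4x4 bitmaps and the labels `class_A` and `class_B` are made-up placeholders, not real Tamil glyph data.

```python
def zoning_features(bitmap, zones=2):
    """Feature extraction: ink density in each cell of a zones x zones grid."""
    step = len(bitmap) // zones
    features = []
    for zone_row in range(zones):
        for zone_col in range(zones):
            cells = [bitmap[r][c]
                     for r in range(zone_row * step, (zone_row + 1) * step)
                     for c in range(zone_col * step, (zone_col + 1) * step)]
            features.append(sum(cells) / len(cells))
    return features

def classify(bitmap, prototypes):
    """Classification: pick the class whose prototype features are closest."""
    features = zoning_features(bitmap)

    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, prototypes[label]))

    return min(prototypes, key=squared_distance)

# Hypothetical training glyphs: a vertical stroke and a horizontal stroke.
PROTOTYPE_BITMAPS = {
    "class_A": [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]],
    "class_B": [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
}
prototypes = {label: zoning_features(bitmap)
              for label, bitmap in PROTOTYPE_BITMAPS.items()}

# An unseen, slightly noisy vertical stroke.
UNKNOWN = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
print(classify(UNKNOWN, prototypes))  # class_A
```

With letters that look alike, coarse features like these would place several classes close together, which is exactly where a stronger classifier and richer features become necessary.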
Tamil Speech Recognition
Speech recognition converts speech to text, which broadens the ways in which users can interact with a system. A speech recognition system has two major components: the Feature Extractor (FE) and the Recogniser. The FE block generates a sequence of feature vectors that represent the input speech signal. It is based on a priori knowledge that is always true and does not change with time. The Recogniser performs the trajectory recognition and generates the output word. Artificial neural networks have been used in Tamil speech recognition to classify the feature vectors of the input signal.
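A rough sketch of the Feature Extractor stage in Python is given below. It cuts the signal into overlapping frames and emits one feature vector per frame. The two features used here (log energy and zero-crossing rate) are simple stand-ins chosen for brevity, and the function name `frame_features`, the frame length and the hop size are illustrative assumptions. The Recogniser, for example a neural network over these vectors, is not shown.

```python
import math

def frame_features(signal, frame_len=8, hop=4):
    """Return one (log_energy, zero_crossing_rate) vector per overlapping frame."""
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Mean squared amplitude; the small constant avoids log(0) on silence.
        energy = sum(x * x for x in frame) / frame_len
        # Fraction of adjacent sample pairs whose sign differs.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        features.append((math.log(energy + 1e-10), crossings / (frame_len - 1)))
    return features

# Synthetic input: 8 samples of silence followed by 8 samples of a rapid oscillation.
signal = [0.0] * 8 + [1.0, -1.0] * 4
for log_energy, zcr in frame_features(signal):
    print(round(log_energy, 2), round(zcr, 2))
```

Each tuple is one point in the feature trajectory that a Recogniser would then map to an output word.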
Tamil Speech Synthesis
An unrestricted text-to-speech system for Tamil is expected to produce a speech signal that corresponds to the given text and is highly intelligible to a human listener. Generally speaking, the intelligibility and comprehensibility of synthesised speech should be good in natural environments. Furthermore, listeners should be able to perceive the message clearly with little attention, and to act on a synthesised spoken command correctly and without perceptible delay, even in noisy environments. At present, unit selection-based synthesis (USS) and statistical parametric synthesis are the state-of-the-art techniques for this task. Recent work has reported that Hidden Markov Model (HMM) based synthesis is the more successful approach for Tamil speech synthesis.
Translation for Tamil
Machine translation of natural (human) languages has a long tradition, benefiting from decades of manual and semi-automatic analysis by linguists, sociologists, psychologists and computer scientists, among others. Much of the literature, however, reports on translation between languages within the European family. More recently, statistical machine translation approaches have been applied to translation for Tamil. This is one of the active areas of research in the 3T Laboratory.
Text Input Methods for Tamil
Touch and gesture interfaces are becoming popular in modern user interfaces. This research theme is centred on developing modern text input methods for the Tamil language. Moving away from conventional keyboard-based input methods, this research focuses on developing standardised cross-platform tools specifically designed for Tamil text input. Dynamic keyboards, handwriting input, and gesture input are a few examples of the methods that need to be developed for Tamil.
Knowledge Engineering in Tamil
Digital and electronic media provide unprecedented ways of communicating and handling information. Vast quantities of information are exchanged through digital documents, social media and the internet. Traditionally, information has been articulated through textual formats and natural human language. Data retrieval is the basic form of information access, while information retrieval and knowledge retrieval are successively higher levels of access. Technology for simple data access is largely mature and does not vary much across languages. Information and knowledge access methods, however, are more complex and depend heavily on the language. Consequently, it is important to develop language models and information processing techniques specifically for Tamil. This research focuses on two main aspects of knowledge processing in Tamil, namely knowledge retrieval and knowledge representation. Knowledge retrieval techniques and tools, when applied to Tamil digital archives, could turn unstructured content into structured data and provide new historical insights. One such example is the possible generation of historical timelines. Knowledge representation methods, in turn, help to create content in natural language automatically. Research in knowledge representation techniques leads to the development of artificial intelligence and problem-solving techniques dedicated to the Tamil language.
Evolution of the Tamil Language
Evidence from archaeology and literature demonstrates that Tamil is one of the ancient languages of the world. Geological, cultural, and synthetic influences have caused the language to evolve in the past. Besides preserving the language, it is important to understand the process of its evolution in order to develop an in-depth understanding of the current form of Tamil. This knowledge can be useful in many ways, such as tracing fundamental language roots, finding historical interactions with other languages, and predicting future trends of the language. Developing mathematical models that embed the parameters of the evolution process makes it possible to simulate trends in the Tamil language. However, if understanding a language entirely is hard, developing a mathematical model of the language is harder, and modelling its process of evolution is the hardest. Fortunately, ever-growing computational power leaves room for many simulations and trial-and-error methods, so that approximate models can be refined until satisfactory performance is achieved. This research studies the development of a mathematical model for representing the trends of the Tamil language. Opportunities to exploit cutting-edge parallel computing and simulation techniques are considered in this task. The fundamental research in this task includes finding interconnections with other languages in its proximity, such as Sanskrit, and comparing the present language forms of the geographical regions where Tamil is believed to be the root language.
Collaborations
Noolaham Foundation
Noolaham Foundation pioneers the documentation and preservation of all spheres of knowledge related to Sri Lankan Tamil-speaking communities. Its ever-growing digital archives need efficient and automatic methods of information extraction so that the actual knowledge content can be accessed. Beyond the standard need for textual data search, Noolaham Foundation steers the 3T Laboratory's focus towards advanced information and knowledge retrieval methods for the Tamil language.
Website : Noolaham Foundation
Cre-A
Cre-A is one of the popular Tamil dictionaries at present and holds the most comprehensive word bank. Cre-A provides the base word list, a valuable word corpus of the Tamil language, for research in the 3T Laboratory.
Website : Cre - A
People
Current Members
- Sivanathan Aparajithan
- Saatviga Sudhahar Shaseevan
- Sivanathan Arunan
- Jegatheeswaran Kuruparan
News
2013/04/01
3T Laboratory started hosting a wiki to preview its vision and research themes. The content is still in flux and needs refinement from contributors.
2013/02/14
Cre-A contributes a valuable piece of material: a base word corpus of the Tamil language for research in the 3T Laboratory.