Hindi Text Processing Framework

The aim of this project is to handle a number of text-processing tasks for the Hindi language, each of which is explained below.

A bit about the Project

This project was assigned to Team 7 (which is us) as part of the Major Project for the Information Retrieval and Extraction course. The project was divided into three phases. In the first phase we had to come up with a scope document and the deliverables for the second and third phases. In the second and third phases, we had to actually code and implement what we had proposed.

Tags

#IIITH #IIIT-H #Major_Project #IRE #Information_Retrieval_and_Extraction #Hindi #StopWordDetection #POS #POS_Tagger #Tokeniser #Tokenisation #Sentence_Identifier #SentenceBreaking #IdentifyingVariations #Identify_Variations #Concept_Identification #Keywords_Identification #Entity_Recognition #NER #Categorisation #News_Categories #Information_Retrieval_and_Extraction_Course #COOL


def get_ready():
    # Print a small banner to mark the start of the pipeline
    print("IT BEGINS!!")

get_ready()

Explanation of the Individual Tasks

Below is an explanation of each of the individual tasks that we have tried to achieve with this project.

Tokenisation

Tokenisation is the process of breaking a stream of text into words, phrases or any other meaningful units. Those who have had a chance to look at the working of compilers will be familiar with this.
Given that we have a lot of text in Hindi, we want the entire text to be segregated into tokens, i.e. words. At the same time, other tokens such as dates, designations and abbreviations should retain their original form and not get broken apart.
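For illustration, here is a minimal sketch of what such a tokeniser could look like, built around a single regular expression over Devanagari text. The pattern and the tokenise function are assumptions made for this example, not the project's exact implementation.

import re

# Illustrative regex-based tokeniser; the rules below are assumptions,
# not the project's exact implementation.
TOKEN_PATTERN = re.compile(
    r"\d{1,2}[./-]\d{1,2}[./-]\d{2,4}"       # dates such as 12/03/2015
    r"|[\u0900-\u0963\u0966-\u097F]+"        # Devanagari words (danda excluded)
    r"|[A-Za-z]+\.?"                         # Latin abbreviations like Dr.
    r"|\S"                                   # any other lone symbol
)

def tokenise(text):
    # Return the word / date / punctuation tokens in order of appearance.
    return TOKEN_PATTERN.findall(text)

print(tokenise("डॉ. शर्मा 12/03/2015 को दिल्ली गये।"))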

Sentence Breaker

In Hindi the end of a sentence is denoted by what is called a 'purna viraam', written as '।'. Our goal here was to identify it and, taking care of all the edge cases, break the entire text into sentences. A given text is thus broken down into sentences, and a corresponding id is assigned to each one.
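A rough sketch of the idea (not the project's exact logic) could look like this, splitting after the purna viraam as well as '?' and '!' and numbering the resulting sentences:

import re

# Split after a sentence-ending mark and keep the mark attached to its sentence.
SENTENCE_END = re.compile(r"(?<=[।?!])\s*")

def split_sentences(text):
    sentences = [s.strip() for s in SENTENCE_END.split(text) if s.strip()]
    return list(enumerate(sentences, start=1))     # (id, sentence) pairs

for sid, sentence in split_sentences("राम घर गया। वह खुश था। क्या आप आयेंगे?"):
    print(sid, sentence)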

Stop Word Detection

These are the most common words found in a text. Generally, they are words that contribute to the grammaticality of a sentence rather than to its content. Two methods have been implemented to find the stop words. The first simply takes a number n and returns the top n most frequently occurring words across a corpus. The second relies on the tf-idf scheme and requires a corpus made up of several documents.
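As a quick illustration of the first (top-n frequency) method; the function name and toy corpus below are made up for the example:

from collections import Counter

# Count tokens over the whole corpus and treat the n most frequent ones
# as stop words.
def top_n_stop_words(tokenised_documents, n):
    counts = Counter()
    for tokens in tokenised_documents:
        counts.update(tokens)
    return [word for word, _ in counts.most_common(n)]

corpus = [["राम", "ने", "कहा", "कि", "वह", "आयेगा"],
          ["सीता", "ने", "कहा", "कि", "वह", "नहीं", "आयेगी"]]
print(top_n_stop_words(corpus, 3))   # e.g. ['ने', 'कहा', 'कि']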

Part of Speech Tagging

The task here was to tag each word using a predefined set of POS tags. It is a method by which word categories are disambiguated. For this purpose, two models were implemented: one follows the HMM (Hidden Markov Model) approach, the other is a CRF-based model.
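To give a feel for the HMM route, here is a toy example that uses NLTK's HMM trainer as a stand-in. The tag set and training sentences are invented for the example, and the project's own model may differ from this.

from nltk.tag.hmm import HiddenMarkovModelTrainer

# Train a small supervised HMM tagger on hand-tagged sentences.
train_data = [
    [("राम", "NNP"), ("घर", "NN"), ("गया", "VM")],
    [("सीता", "NNP"), ("खाना", "NN"), ("बनाती", "VM"), ("है", "VAUX")],
]
tagger = HiddenMarkovModelTrainer().train_supervised(train_data)

# Tag a sentence made of words the toy model has already seen.
print(tagger.tag(["सीता", "घर", "गया"]))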

Concept / Keyword Identification

The task here was to identify the different NPs (Noun Phrases). This has again been done using an HMM.
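The project does this with an HMM; purely to illustrate what finding noun phrases over POS-tagged text means, here is a tiny rule-based chunker instead (an illustrative stand-in, not the project's method):

import nltk

# NP = zero or more adjectives followed by one or more nouns.
grammar = "NP: {<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

tagged = [("बड़ा", "JJ"), ("घर", "NN"), ("में", "PSP"), ("राम", "NNP")]
print(chunker.parse(tagged))       # prints a tree with the NP chunks marked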

Entity Recognition

The task here was to identify the different named entities. For this purpose, CRFs (Conditional Random Fields) were used, via the CRF++ toolkit.
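CRF++ expects its training data as one token per line, with feature columns and the gold label in the last column, and a blank line between sentences. The exact feature columns the project used are an assumption in this sketch:

# Write (token, pos, label) triples in the CRF++ column format.
def write_crfpp_file(sentences, path):
    with open(path, "w", encoding="utf-8") as fh:
        for sentence in sentences:
            for token, pos, label in sentence:
                fh.write(f"{token}\t{pos}\t{label}\n")
            fh.write("\n")                     # blank line ends the sentence

data = [[("राम", "NNP", "B-PER"), ("दिल्ली", "NNP", "B-LOC"), ("गया", "VM", "O")]]
write_crfpp_file(data, "train.crf")

Training and tagging are then done with the crf_learn and crf_test command-line tools that ship with CRF++.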

Categorisation

Every news article belongs to one of several categories, such as politics, entertainment or sports. To identify an article's category, a model based on KNN (k-nearest neighbours) was implemented. The articles were converted into their corresponding vectors using the tf-idf method.
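A compact sketch of this pipeline using scikit-learn; the project's actual preprocessing, value of k and category set are assumptions here:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# tf-idf vectors fed into a k-nearest-neighbour classifier.
train_texts = ["चुनाव में पार्टी की जीत हुई", "टीम ने मैच जीत लिया"]
train_labels = ["politics", "sports"]

model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
model.fit(train_texts, train_labels)
print(model.predict(["आज का मैच रोमांचक था"]))   # should predict ['sports']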

Identifying Variations

The task here was to identify variations within the language. It often happens that, because of some kind of divergence within the same language, the spelling of the same word changes. Identifying such variations is important for a variety of tasks: in machine translation, or in any sequence-labelling task such as NER, we would want to recognise these variations in order to improve accuracy. For this purpose, a set of rules has been designed to identify them.
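By way of illustration only (the project's actual rule set is not reproduced here), two frequent sources of spelling variation in Hindi are nukta characters (क़ vs क) and chandrabindu vs anusvara, and rules for them could be sketched as:

import unicodedata

# Normalise away two common sources of Hindi spelling variation.
def normalise(word):
    word = unicodedata.normalize("NFD", word)
    word = word.replace("\u093C", "")          # drop the nukta
    word = word.replace("\u0901", "\u0902")    # chandrabindu -> anusvara
    return unicodedata.normalize("NFC", word)

def same_word(a, b):
    return normalise(a) == normalise(b)

print(same_word("ज़िंदगी", "जिंदगी"))   # True: the words differ only by a nukta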

Link to YouTube video
Link to SlideShare