
Developing Concept Mining Methods for Turkish Document Analysis

TÜBİTAK-funded project that developed concept mining methods for Turkish documents.

Status

Completed 2011 - 2014


Funding Agency: TÜBİTAK 1001 Program (Project 110E162)

Project Manager: Tunga Güngör

Dates: 2011-2014

Concept mining can be described simply as a kind of data mining that extracts concepts from events. Applied to text documents, it aims to convert terms into concepts and to analyze those concepts. The concepts that appear in a document usually cannot be identified directly, because they are a kind of metadata for the document. It is therefore necessary to determine which terms match which concepts.

For this purpose, concept dictionaries and thesauri that capture the relationships between terms are used. During matching, a single term may correspond to more than one concept, or several words may together define a single concept. This ambiguity between terms and concepts can be resolved using the context. In addition, semantic similarity is used to try to extract the relationships between the concepts and the context.
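The matching step described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the dictionary `CONCEPT_DICTIONARY`, the concept labels, the related-word sets, and the function `match_concept` are all hypothetical, and simple word overlap stands in for a real semantic similarity measure. The Turkish word "yüz" is used because it can mean "face", "hundred", or "swim".

```python
# Hypothetical toy concept dictionary: term -> {concept: related context words}.
# A multi-word term ("yapay zeka") maps to a single concept.
CONCEPT_DICTIONARY = {
    "yüz": {
        "FACE": {"göz", "burun", "ayna"},
        "HUNDRED": {"sayı", "lira", "elli"},
        "SWIM": {"deniz", "havuz", "su"},
    },
    "yapay zeka": {
        "ARTIFICIAL_INTELLIGENCE": {"bilgisayar", "öğrenme"},
    },
}


def match_concept(term, context_words):
    """Pick the candidate concept whose related words overlap most with the context.

    Word overlap is an assumed stand-in for semantic similarity.
    Returns None when the term is not in the dictionary.
    """
    candidates = CONCEPT_DICTIONARY.get(term, {})
    if not candidates:
        return None
    context = set(context_words)
    return max(candidates, key=lambda concept: len(candidates[concept] & context))


# The context words decide which sense of the ambiguous term is chosen.
print(match_concept("yüz", ["deniz", "su", "soğuk"]))  # SWIM
```

The sketch assumes context words are already reduced to their base forms; for an agglutinative language such as Turkish, a morphological analysis step would normally come first.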

The main stage in concept mining is concept extraction. There are basically two approaches to concept extraction: the expert-based approach and the statistical approach. The first is also called the rule-based or information-engineering-based approach; such systems rely on a set of pattern-matching rules compiled by experts in the field. The second, the statistical approach, is also known as the automatic learning approach; it learns from statistical information gathered from available corpora.
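The contrast between the two approaches can be sketched in Python as follows. This is only an illustration under stated assumptions, not the method used in the project: the pattern `RULE` is a hypothetical hand-written rule that treats capitalized multi-word phrases as concept candidates, and `tfidf_scores` uses a smoothed TF-IDF weighting as one possible example of a corpus statistic.

```python
import math
import re
from collections import Counter

# Expert-based: a hypothetical hand-written rule that picks up
# capitalized multi-word phrases as candidate concepts.
RULE = re.compile(r"\b[A-ZÇĞİÖŞÜ]\w+(?: [A-ZÇĞİÖŞÜ]\w+)+")


def rule_based_candidates(text):
    """Return every phrase in the text that matches the hand-written rule."""
    return RULE.findall(text)


# Statistical: score the terms of one document using counts from a corpus.
def tfidf_scores(doc_tokens, corpus):
    """Smoothed TF-IDF: terms frequent here but rare elsewhere score highest."""
    term_counts = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in term_counts.items():
        doc_freq = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + doc_freq))
        scores[term] = (count / len(doc_tokens)) * idf
    return scores


corpus = [
    ["kavram", "madenciliği", "metin"],
    ["metin", "analizi"],
    ["metin", "veri"],
]
scores = tfidf_scores(corpus[0], corpus)
# "metin" occurs in every document, so it scores lower than "kavram".
print(sorted(scores, key=scores.get, reverse=True))
```

The rule needs an expert to write and maintain it, while the scoring function needs only a corpus, which is the trade-off the two approaches embody.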

Our goal in this project was to develop a concept mining application that automatically extracts concepts from Turkish documents and performs concept-based text mining. We chose the statistical approach because it was better suited to this purpose. Although some studies have extracted concepts from documents in English and some other languages, this type of work had not previously been carried out for Turkish.