Concept Extraction from Turkish Texts by Automatic Methods

Funding Agency: Bogazici University Research Fund, BAP (Project 5187)

Project Manager: Tunga Güngör

Dates: 2010-2011

Concept extraction is a subtopic of concept mining, which is an important branch of data mining. Concept mining can be defined as the study of extracting the important concepts that appear in documents. A basic step in concept mining is processing words to obtain concepts, and thesauri are usually used during this process. Word-concept matching is often ambiguous, and context is used to resolve the ambiguity. The relationships between the concepts and the context are extracted using semantic similarity.
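To make the word-concept matching step concrete, the following is a minimal sketch that picks a concept for an ambiguous word by comparing its context with each candidate concept. It is not the project's method: the toy thesaurus, the function names, and the use of Jaccard overlap (a simplified Lesk-style measure standing in for a real semantic similarity measure) are all assumptions made for illustration.

```python
# Hypothetical toy thesaurus (an assumption, not project data): each word maps
# to its candidate concepts, and each concept is described by related words.
THESAURUS = {
    "yüz": {
        "face": {"göz", "burun", "ağız", "cilt"},
        "hundred": {"sayı", "lira", "adet", "bin"},
        "swim": {"deniz", "havuz", "su"},
    }
}


def jaccard(a, b):
    """Jaccard overlap of two word sets; a crude stand-in for semantic similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_concept(word, context, thesaurus=THESAURUS):
    """Return the candidate concept whose related words best overlap the context."""
    candidates = thesaurus.get(word, {})
    if not candidates:
        return None
    ctx = set(context)
    # sorted() makes tie-breaking deterministic (alphabetical order).
    return max(sorted(candidates), key=lambda c: jaccard(candidates[c], ctx))


sea_sense = match_concept("yüz", ["deniz", "su", "soğuk"])
money_sense = match_concept("yüz", ["lira", "bin"])
```

Here the same surface word is resolved to different concepts depending only on the surrounding words. A real system would replace the toy thesaurus and the overlap score with a full lexical resource and a proper similarity measure.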

The two basic approaches to concept extraction are the expert-based approach and the statistical approach. The first is also called the rule-based or information engineering-based approach. The second, the statistical approach, is also known as the automatic learning approach; it learns from statistical information gathered from available corpora.

Our goal in this project was to build an automated concept extraction system for Turkish using statistical approaches. The system works as follows: first, nouns are extracted from the documents; then sub-dictionary groups are formed by clustering similar words; then these sub-dictionaries are labeled; and finally data mining techniques are applied to the resulting concepts.
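The first three steps of the pipeline described above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the input is assumed to be already tokenized and part-of-speech tagged (a real Turkish system would need a morphological analyzer for this), word similarity is assumed to be cosine similarity over document co-occurrence counts, clustering is a simple threshold-based single-link grouping, and each group is labeled with its most frequent member. All function names and the `threshold` parameter are hypothetical. The final data mining step is left out.

```python
import math
from collections import Counter
from itertools import combinations


def extract_nouns(doc):
    """Keep the tokens tagged as nouns. Input format (word, tag) is an assumption."""
    return [word for word, tag in doc if tag == "NOUN"]


def occurrence_vectors(noun_docs):
    """Represent each noun by how often it occurs in each document."""
    vectors = {}
    for doc_id, nouns in enumerate(noun_docs):
        for noun in nouns:
            vectors.setdefault(noun, Counter())[doc_id] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(vectors, threshold):
    """Single-link grouping: merge words whose similarity reaches the threshold."""
    parent = {w: w for w in vectors}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path compression
            w = parent[w]
        return w

    for a, b in combinations(sorted(vectors), 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for w in vectors:
        groups.setdefault(find(w), set()).add(w)
    return list(groups.values())


def extract_concepts(docs, threshold=0.5):
    """Map a label (most frequent member) to each group of similar nouns."""
    noun_docs = [extract_nouns(d) for d in docs]
    freq = Counter(n for nouns in noun_docs for n in nouns)
    groups = cluster(occurrence_vectors(noun_docs), threshold)
    return {max(sorted(g), key=lambda w: freq[w]): g for g in groups}


# Made-up example documents, for illustration only.
docs = [
    [("banka", "NOUN"), ("kredi", "NOUN"), ("verdi", "VERB"), ("banka", "NOUN")],
    [("banka", "NOUN"), ("faiz", "NOUN"), ("kredi", "NOUN")],
    [("takım", "NOUN"), ("maç", "NOUN"), ("kazandı", "VERB"), ("maç", "NOUN")],
    [("maç", "NOUN"), ("takım", "NOUN"), ("gol", "NOUN")],
]
concepts = extract_concepts(docs)
```

On this toy input the finance-related nouns and the sports-related nouns end up in separate groups, labeled "banka" and "maç" respectively. Labeling by the most frequent member is only one possible choice; a thesaurus-based label would be another.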
