Ongoing Projects

Chemical Language Processing for Target-based Drug Design

This project is funded by The Scientific and Technological Research Council of Turkey (TUBITAK 1001) and managed by Arzucan Özgür as the Principal Investigator. It started in 2019 and is expected to end in 2022.

Contextual Text Mining from the Biomedical Scientific Literature

This project is funded under 7th FWP (Seventh Framework Programme; Research area: FP7-PEOPLE-2011-CIG Marie-Curie Action: "Career Integration Grants") and managed by Arzucan Özgür. It started in March 2012, and Ilknur Karadeniz currently works on this project.

Scientific publications are the main medium through which researchers report their new findings. The huge number of published articles in biomedicine, and its continuing rapid growth, has made it particularly difficult for researchers to access and utilize the knowledge contained in them. Currently, there are over 21 million publications indexed in PubMed, which is the main system that provides access to the biomedical literature. Over 2000 new entries are added to the system every day. Developing text mining techniques to automatically extract biologically important information, such as relationships between biomolecules, is not only useful but also necessary to facilitate biomedical research and to speed up scientific progress. Most prior studies in biomedical text mining tackle the problem of extracting the fact that a relationship exists between a pair of biomolecules. However, for the extracted information to make sense, a great deal of biological context is required. While some of this context, such as relationship type and directionality, is found in the sentence that actually reports the relationship, some of it, such as species and experimental method, is likely stated elsewhere in the article. The goal of the proposed project is to design methods based on natural language processing and machine learning to extract relationships among biomolecules together with their local (sentence-level) and non-local (document-level) context information, as well as to design novel knowledge discovery methods that utilize the extracted contextual information.

Past Projects

A Deep Learning based Turkish Dependency Parser

This project was funded by The Scientific and Technological Research Council of Turkey (TUBITAK 1005) and managed by Arzucan Özgür as the Principal Investigator. It started in June 2018 and ended in December 2019.

Protein-Ligand Database Construction

This project was funded by Bogazici University Research Fund, BAP and managed by Arzucan Özgür as the Principal Investigator. It started in 2016 and ended in 2018.

Named Entity Recognition and Hashtag Segmentation in Social Media Text

This project was funded by Bogazici University Research Fund, BAP and managed by Arzucan Özgür as the Principal Investigator. It started in 2015 and ended in 2017.

Developing an Adaptive Question Answering System Enabling Primary and Secondary Education Students Accessing Accurate and Reliable Information

This project was funded by the TÜBİTAK 1001 Program (FATIH Project) and led by Tunga Güngör. It started in 2013 and ended in 2016.

Jointly Self-trained Parsers

This project was funded by Bogazici University Research Fund, BAP and managed by Arzucan Özgür as the Principal Investigator. It started in 2012 and ended in 2013.

Developing Concept Mining Methods for Turkish Document Analysis

This project (110E162) is funded by TÜBİTAK 1001 Program and managed by Tunga Güngör. It started in 2011 and is expected to end in 2014.

Concept mining can be described simply as a kind of data mining that extracts concepts from events. Concept mining on text documents aims at converting terms into concepts and analyzing these concepts. It is usually not possible to directly identify the concepts that appear in a document, since the concepts are a kind of metadata for the document. Thus, it is necessary to determine the term-concept matchings. For this purpose, concept dictionaries and thesauri that encode relationships between terms are used. During the matching process, a single term may correspond to more than one concept, or several words may define a single concept. This ambiguous relationship between terms and concepts can be resolved using the context. The relationships between the concepts and the context are, in turn, extracted using semantic similarity.

The main stage in concept mining is concept extraction. There are basically two approaches to concept extraction: the expert-based approach and the statistical approach. The first one is also called the rule-based or information engineering-based approach. These systems include a set of pattern matching rules compiled by experts in the field. Such systems can only be developed by experts who have sufficient knowledge and experience in the field. The main disadvantages of this method are the difficulty of finding such experts and the dependence of the system on a particular domain. The second one, the statistical approach, is also known as the automatic learning approach. It makes use of statistical information gathered from available corpora for learning. Hidden Markov Models (HMMs) are mostly used for this purpose. The transition probabilities are estimated from the training data. The advantage of this approach is its portability across domains. The disadvantages are the difficulty and cost of building corpora, the need to retrain the system for different concepts, and slower execution.
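
As an illustration of how transition probabilities can be estimated from training data, the following minimal sketch counts label-to-label transitions in tagged sequences and normalizes them (a maximum-likelihood estimate). The label names CONCEPT and OTHER and the toy sequences are assumptions made for this example; they do not come from the project.

```python
from collections import Counter, defaultdict


def estimate_transitions(tag_sequences):
    """Maximum-likelihood estimate of P(next_tag | tag) from labeled sequences."""
    pair_counts = defaultdict(Counter)
    for tags in tag_sequences:
        # Count every adjacent (previous, next) label pair.
        for prev, nxt in zip(tags, tags[1:]):
            pair_counts[prev][nxt] += 1
    transitions = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        transitions[prev] = {nxt: count / total for nxt, count in counter.items()}
    return transitions


# Hypothetical toy training data: each token is labeled CONCEPT or OTHER.
training = [
    ["OTHER", "CONCEPT", "CONCEPT", "OTHER"],
    ["CONCEPT", "OTHER", "OTHER"],
]
probs = estimate_transitions(training)
print(probs["CONCEPT"])
```

A full HMM would also need emission probabilities and smoothing for unseen transitions; both are omitted here to keep the sketch short.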

Concept mining studies aim at obtaining efficient solutions to some problems which are harder to solve using data mining. Extracting concepts from documents has the potential of contributing to several applications including new generation search engines. By concept-based searching, the users will be able to obtain more accurate and more extensive search results easily. Some search engines like Google and Yahoo have already initiated concept-based searching research.

Our goal in this project is to develop a concept mining application that enables automatic concept extraction from Turkish documents and that performs concept-based text mining. Since it is more appropriate for our purpose, we chose the statistical approach. Although there are some studies on extracting concepts from documents in English and some other languages, this type of work has not been performed for Turkish before. Within the scope of this work, which will be a pioneering study for Turkish, all the functionalities that exist in concept mining studies for other languages will be provided. However, the methods and products developed in other systems will not be imitated; instead, new methods and a new product that are original and have research value will be developed by taking into account the characteristics of the Turkish language. The system will work as follows: First, terms (nouns and noun phrases) will be extracted from the documents. Then sub-dictionary groups will be formed by clustering similar terms. These sub-dictionaries will be labeled semi-automatically (by an algorithm, under the control of an expert) to determine the concepts, and the sub-dictionaries will be recorded in a database with their labels. Finally, data mining techniques will be applied to these concepts. For extracting words from documents, analyzing these words, and disambiguating them, the morphological parser and the morphological disambiguator developed in the scope of the Boğaziçi University BAP 08M103 and TÜBİTAK 107E261 projects will be used. The cosine similarity metric will be employed for discovering term similarities. To group similar terms, we will use k-means or a related clustering approach, and the clusters will be labeled. In this way, word-concept matching will be obtained. The concept extraction phase will be completed after labeling. Finally, some interesting information will be obtained from the concepts by employing data mining methods.
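
The term clustering step described above can be sketched as follows, assuming each term has already been turned into a numeric vector. The toy vectors (co-occurrence counts with three hypothetical context words), the example terms, and the choice to seed the centroids with the first k vectors are all assumptions made for illustration; the project description does not specify these details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def kmeans_cosine(vectors, k, iterations=10):
    """k-means variant that assigns by cosine similarity.

    The first k vectors seed the centroids, which keeps the sketch deterministic.
    """
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to the most similar centroid.
        assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical terms (bank, football, credit, goal) and toy context-count vectors.
terms = ["banka", "futbol", "kredi", "gol"]
vectors = [[5, 4, 0], [0, 1, 6], [4, 5, 1], [1, 0, 5]]
clusters = kmeans_cosine(vectors, k=2)
for term, cluster in zip(terms, clusters):
    print(term, cluster)
```

Each resulting cluster plays the role of a sub-dictionary; assigning a concept label to it is the separate labeling step and is not shown.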

Concept Extraction from Turkish Texts by Automatic Methods

This project (Project 5187) was funded by BAP and managed by Tunga Güngör. It started in 2010 and ended in 2011.

Concept extraction is a subtopic of concept mining, which forms an important branch of data mining. Concept mining can be defined as the study of extracting important concepts that appear in documents. A basic step in concept mining is processing the words to obtain the concepts. Thesauri are usually used during this process. Word-concept matching is usually ambiguous, and context is used to resolve the ambiguity. The relationships between the concepts and the context are extracted using semantic similarity. Formal concept analysis, which makes this relationship explicit, is another important topic. Concept extraction aims at obtaining efficient solutions to some problems that are harder to solve using data mining.

The two basic approaches to concept extraction are the expert-based approach and the statistical approach. The first one is also called the rule-based or information engineering-based approach. These systems include a set of pattern matching rules compiled by experts in the field. The main disadvantage of this method is the difficulty of finding such experts. The second one, the statistical approach, is also known as the automatic learning approach. It makes use of statistical information gathered from available corpora for learning. Hidden Markov Models (HMMs) are mostly used for this purpose. The transition probabilities are estimated from the training data. The advantage is portability across domains. The disadvantages are the difficulty and cost of building corpora, the need to retrain the system for different concepts, and slower execution.

Our goal in this project is to build an automated concept extraction system for Turkish. Since it is more appropriate for the model we build, we will use the statistical approach rather than the expert-based approach. The PASW software developed by SPSS for data mining is a successful concept extraction application for English and other well-known languages. In this project, we aim to implement for Turkish the functionality provided by the PASW Text Analytics module. The system will work as follows: First, nouns will be extracted from the documents. Then sub-dictionary groups will be formed by clustering similar words, and these sub-dictionaries will be labeled manually. Finally, data mining techniques will be applied to these concepts. For extracting words from documents and disambiguating these words, the morphological parser and the morphological disambiguator developed in the scope of the Boğaziçi University BAP 08M103 and TÜBİTAK 107E261 projects will be used. The cosine similarity metric will be employed for discovering term similarities. To group similar terms, we will use k-means or a related clustering approach, and the terms will be labeled manually. In this way, word-concept matching will be obtained. The concept extraction phase will be completed after labeling. Finally, some interesting information will be obtained from the concepts by employing data mining methods.

Morphology Based Language Modeling for Turkish Speech Recognition

This project (107E261) was funded by TÜBİTAK 1001 Program and managed by Tunga Güngör. It started in 2008 and is expected to end in 2010.

In this project, our aim is to develop a large vocabulary continuous speech recognition system for Turkish. State-of-the-art speech recognition systems are basically composed of three main components: the acoustic model, the language model, and the speech decoder.

Acoustic model: This model is used to calculate the probability of an acoustic feature vector sequence. Hidden Markov Models (HMMs) are widely used for training an acoustic model using a speech corpus. Many software packages such as HTK are available for this purpose, and acoustic model training is no different for Turkish. Building an acoustic model for Turkish is not a contribution of this project; however, it is required for all speech recognition systems. For training acoustic models, we plan to use about 200 hours of news speech recordings compiled by the project researcher Murat Saraçlar. The recordings have been transcribed.

Language model: This model is used to estimate the probability of a word sequence. N-gram language models have been successfully used for languages like English. Researchers have found that these word-based N-gram models are not as successful for agglutinative languages like Turkish. Therefore, building language models for morphologically complex languages like Turkish, Finnish, and Arabic is an active research area. The most important contribution of this project is to construct a language model for Turkish that takes into account the morphological structure of the words. The language resources that we developed for this purpose are:

  • Morphological Parser: As far as we know, this parser is the first system that is based on finite-state machines and does not depend on any external system such as PC-Kimmo. Since this project aims to develop a real-time speech recognition engine, the efficiency of finite-state machines and the independence from other systems are very important factors.

  • Morphological Disambiguation: The morphological parser may return more than one possible analysis for a Turkish word. Choosing the correct analysis for the word in a given context is required to build a morphology-based language model. The system that we developed has the best accuracy reported so far.

  • Turkish Web Corpus: One of the problems that we face in training a language model for Turkish is the lack of a large text corpus. The existing corpora are either very small or consist of larger but very unclean web text. Researchers have found that a very large corpus, even if noisy, is better than a small clean one. In this project, we have compiled a fairly clean web corpus of about 430 million words.

We aim to build an effective language model for Turkish using these resources. Our previous work has shown the importance of using the morphological structure of words. For this purpose, we plan to parse the web corpus using the morphological parser, disambiguate these parses using our morphological disambiguation system, and use the morpheme statistics to train a statistical model for Turkish. We will also convert the morphological parser into a probabilistic one that gives the probability of each analysis. We are aware that the language model may need to be designed together with the speech decoder. Therefore, we aim to design a speech decoder that can work with the morphology-based language model.
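
To make the idea of training a statistical model on morpheme statistics concrete, here is a minimal sketch of a bigram model over morpheme sequences with add-one smoothing. It is an illustration only, not the model developed in the project: the hand-segmented toy words, the "+" prefix on suffixes, and the boundary markers are assumptions introduced for this example.

```python
from collections import Counter


def train_bigram(sequences):
    """Count bigram contexts and bigrams over morpheme sequences."""
    contexts, bigrams = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] + seq + ["</s>"]
        contexts.update(padded[:-1])               # every unit that has a successor
        bigrams.update(zip(padded, padded[1:]))
    # Units that can be predicted: all morphemes plus the end marker.
    vocab = {m for seq in sequences for m in seq} | {"</s>"}
    return contexts, bigrams, vocab


def bigram_prob(prev, cur, contexts, bigrams, vocab):
    """Add-one smoothed estimate of P(cur | prev)."""
    return (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))


# Hypothetical hand-segmented words: evlerde, evde, okullarda.
corpus = [
    ["ev", "+ler", "+de"],
    ["ev", "+de"],
    ["okul", "+lar", "+da"],
]
contexts, bigrams, vocab = train_bigram(corpus)
p_seen = bigram_prob("ev", "+ler", contexts, bigrams, vocab)
p_unseen = bigram_prob("ev", "+da", contexts, bigrams, vocab)
print(p_seen, p_unseen)
```

Because the units are morphemes rather than whole words, an unseen word form can still receive a probability as long as its morphemes were observed, which is the motivation given above for morphology-based modeling.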

Speech Decoder: In our experiments we have seen that, due to the complex morphology of Turkish, using a morphology-based static language model with a standard speech decoder is not feasible. In this project, we aim to develop a speech decoder that uses the morphotactics of Turkish. In this way, we will be able to limit the morphemes that may be suffixed to a word and estimate the probabilities using our morphology-based language model. We will design the speech decoder so that it can run in real time.

In summary, the contributions of this project are as follows:

  • Turkish Language Resources: We have submitted a paper to the Language Resources and Evaluation journal that describes our language resources, namely the morphological parser, the morphological disambiguation system, and the Turkish web corpus.

  • Morphology-based language model: A good language model has many application areas ranging from spell checking to machine translation.

  • A real-time speech decoder system for Turkish.

The outputs available so far and the first experimental results obtained in this project are very promising. The Turkish language resources that we developed are important both for academic research in this area and for the computer processing of Turkish, and we have already shared some of them with researchers. Developing a high-accuracy speech recognition system for Turkish is vital. Application areas of such a system include man-machine interaction such as dictation, broadcast news transcription, and word spotting in telephone recordings for national security.

Morphology Based Language Modeling for Turkish Speech Recognition

This project (Project 08M103) was funded by BAP and managed by Tunga Güngör. It started in 2008 and ended in 2009.

In this project, we aimed to develop a high-performance large vocabulary continuous speech recognition system for Turkish. The most important contribution of this work has been the development of a morphology-based language model for Turkish. In our previous work, we built language resources for Turkish such as a morphological parser, a morphological disambiguator, and a web corpus. Using these resources, we developed an effective morphology-based language model for Turkish in this project. We also replaced the static lexicon with a dynamic one based on the morphological parser, which greatly alleviated the out-of-vocabulary problem for Turkish. In addition, we developed a speech decoder that can decode speech on morphology-integrated search networks.

Developing Structure-preserving and Query-biased Automated Summarization Methods for Web Search Engines

This project (Project 07A106) was funded by BAP and managed by Tunga Güngör. It started in 2007 and ended in 2009.

In this project, a new two-stage summarization approach was developed to improve the effectiveness of Web search. In the first stage, a rule-based approach and a machine learning approach were implemented to identify the sectional hierarchies of Web documents. In the second stage, query-biased summaries were created based on the document structure. The evaluation results show that the system improves significantly over unstructured summaries and Google snippets.

Developing a General-Purpose Turkish Handwritten Recognition System using a Large Lexicon

This project (Project 09A107D) was funded by BAP and managed by Tunga Güngör. It started in 2009 and ended in 2011.

The aim of this project is to develop algorithms for identifying patterns in periodic sequences and to apply these algorithms to the handwritten character recognition problem. Handwritten character recognition methods are usually divided into two groups: with segmentation (general purpose) and without segmentation (special purpose). In this work, instead of these methods, it was decided to apply a cognitive approach that has recently become popular. The short sequences obtained from each handwritten sample are combined to generate a periodic sequence. These sequences are learned and matched using compression algorithms.
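
One common way to match sequences with a compressor is the normalized compression distance (NCD). The sketch below uses it as a stand-in, since the project description does not name the specific algorithm that was used. The byte strings of stroke-direction codes (U, D, L, R), the class names, and the nearest-reference rule are all assumptions made for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: smaller means more shared structure."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def nearest_reference(sample: bytes, references: dict) -> str:
    """Return the label whose reference sequence is closest to the sample."""
    return min(references, key=lambda label: ncd(sample, references[label]))


# Hypothetical periodic sequences built from stroke-direction codes.
references = {
    "class_a": b"UDLR" * 50,
    "class_b": b"ULDRRDLU" * 25,
}
sample = b"UDLR" * 40
print(nearest_reference(sample, references))
```

The intuition is that concatenating two sequences with the same repeating pattern compresses almost as well as one of them alone, so their distance is small, while sequences with different patterns add to each other's compressed size.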

Developing Natural Language Processing-based Methods for Text Classification

This project (Project 05A103D) was funded by BAP and managed by Tunga Güngör. It started in 2005 and ended in 2008.

This project considers the use of natural language processing techniques for the text categorization problem. A great deal of research on text categorization is currently being carried out, and some of it has practical applications. However, the success rate of these studies cannot exceed a certain limit. The main reason is that almost all studies use only syntactic information and do not make use of semantic information. In other words, the words in the texts are treated independently of their meanings. In this project, new methods will be proposed to remedy this shortcoming and to classify texts by taking into account the meaning they contain.

Morphotactic based Statistical Language Modeling for Large Vocabulary Continuous Speech Recognition Systems

This project (Project 06A102) was funded by BAP and managed by Tunga Güngör. It started in 2006 and ended in 2007.

This project aims to develop a new language model for large vocabulary continuous speech recognition (LVCSR) systems for agglutinative languages such as Turkish. As is well known, the fact that an unlimited number of words can be produced in agglutinative languages makes it difficult to build language models for speech recognition systems, and the lack of a good language model significantly affects the effectiveness of these systems. One of the most important reasons why speech recognition systems have been developed successfully for relatively non-agglutinative languages such as English, while the same success has not yet been achieved for languages such as Turkish, is the lack of an effective language model. The n-gram language model is widely used in speech recognition systems. This model attempts to model the language statistically. It is built from a large text corpus by considering how frequently words follow one another, and it is used to compute the probabilities of word sequences. Another problem in Turkish is that the order of words within a sentence is relatively free. When this free word order is combined with the agglutinative nature of Turkish, the effectiveness of a simple word-based n-gram language model in speech recognition systems is reduced. This work aims to build an effective language model by combining the morphotactic information of Turkish (the rules governing the ordering of morphs) with an n-gram language model. In this way, it will become possible to develop large vocabulary speech recognition systems for Turkish, which have many application areas.

Developing Dynamic and Adaptive Methods for Turkish Spam Filtering

This project (Project 04A101) was funded by BAP and managed by Tunga Güngör. It started in 2003 and ended in 2004.

In this project, anti-spam filtering methods will be developed for Turkish in order to prevent spam e-mail messages. Today, spam messages make up 10% of all e-mail messages and cause significant loss of time for users. Filtering algorithms exist for widely used languages such as English, but no such work has yet been done for Turkish messages. A study of this kind needs to take into account the complex morphological structure of Turkish. The methods to be developed in this project will be dynamic and will be based on artificial neural networks and Bayesian networks. The resulting algorithms are expected to contain two main components: a morphology module that performs morphological analysis of the message contents, and a learning module that classifies messages as normal or spam.
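
The two-module design (a morphology module feeding a learning module) can be illustrated with a small naive Bayes classifier, which is the simplest form of a Bayesian network classifier. This is a sketch under assumptions, not the project's method: the messages below are hypothetical lists of stems standing in for the output of a morphology module, and the label names are invented for the example.

```python
import math
from collections import Counter


def train_naive_bayes(messages):
    """messages: list of (stems, label) pairs. Returns the counts needed for scoring."""
    label_counts = Counter()
    stem_counts = {}
    vocab = set()
    for stems, label in messages:
        label_counts[label] += 1
        stem_counts.setdefault(label, Counter()).update(stems)
        vocab.update(stems)
    return label_counts, stem_counts, vocab


def classify_message(stems, model):
    """Return the label with the highest add-one smoothed log-probability."""
    label_counts, stem_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(stem_counts[label].values()) + len(vocab)
        for stem in stems:
            score += math.log((stem_counts[label][stem] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: stems as a morphology module might produce them.
training = [
    (["bedava", "kazan", "tıkla"], "spam"),
    (["kampanya", "bedava", "fırsat"], "spam"),
    (["toplantı", "yarın", "rapor"], "normal"),
    (["rapor", "gönder", "lütfen"], "normal"),
]
model = train_naive_bayes(training)
print(classify_message(["bedava", "fırsat"], model))
print(classify_message(["rapor", "toplantı"], model))
```

Working on stems rather than surface forms means that differently inflected forms of the same word contribute to the same count, which is one way to account for the morphological complexity mentioned above.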

Statistical Analysis of Turkish

This project (Project 02A107) was funded by BAP and managed by Tunga Güngör. It started in 2002 and ended in 2003.

In this project, the statistical processing of natural languages, which is a new approach, will be applied to Turkish. Research on this topic is being carried out for some widely used languages, but no such study exists yet for Turkish. To this end, the project members will conduct a comprehensive literature survey. Based on this survey, an infrastructure for the statistical processing of Turkish will be established and a program will be developed. The program will be designed, implemented, and tested.