Master's Thesis

Comparison of Natural Language Document Classification Approaches

Completion

2024/11

Research Area

Intelligent Information Management

Students

Sozan Hama Kaka

student

Muhammad Saad Afzal

student

Advisers

Christoph Göpfert M.Sc.

researcher

Room: 1/B204

Phone: +49 371 531 35747

Fax: +49 371 531 8 35747

Email: christoph.goepfert@informatik.tu-chemnitz.de

Prof. Dr.-Ing. Martin Gaedke

professor

Room: 1/B319

Phone: +49 371 531 25530

Fax: +49 371 531 25539

Email: gaedke@informatik.tu-chemnitz.de

Description

The emergence of large language models (LLMs) has led to rapid advances in natural language processing (NLP). LLMs have proven effective across a range of tasks, including language translation and text completion, and have set new standards in the field.

Despite these advances, content-based document classification remains a critical challenge. Traditional methods still struggle with semantic nuances and the complexity of natural language, which leads to classification errors. Accurate document classification is essential for information retrieval applications, content organization systems, and various other domains. This thesis project investigates whether large language models can handle such classification tasks effectively.

The objective of this thesis is to compare a state-of-the-art large language model with established NLP approaches, such as classical machine learning methods and transformer-based models. In this comparative analysis, the performance of the selected approaches is to be evaluated on both binary and multi-label classification tasks using one or more suitable test datasets. The influence of different prompting strategies on the performance of the chosen LLM approach is also to be investigated. Finally, the thesis is to demonstrate how the most appropriate approach identified in the analysis can be used to effectively classify natural language documents from a selected domain. For this purpose, a prototype with a web-based user interface is to be developed.
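
As an illustration of how the compared approaches could be scored on a common footing for both binary and multi-label tasks, the following is a minimal sketch in plain Python. The choice of micro- and macro-averaged F1 is an assumption and not a metric prescribed by this description; the function names and the sample labels are hypothetical.

```python
from typing import Dict, List, Set


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined here as 0.0 when all counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(gold: List[Set[str]], pred: List[Set[str]]) -> Dict[str, float]:
    """Micro- and macro-averaged F1 for documents annotated with label sets.

    A binary task is the special case where each set is either empty
    or contains a single label.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Every label seen in either the reference or the predictions.
    labels = sorted(set().union(*gold, *pred)) if gold else []
    counts = {label: [0, 0, 0] for label in labels}  # [tp, fp, fn]

    for g, p in zip(gold, pred):
        for label in p & g:
            counts[label][0] += 1  # predicted and correct
        for label in p - g:
            counts[label][1] += 1  # predicted but wrong
        for label in g - p:
            counts[label][2] += 1  # missed

    # Micro: pool the counts over all labels, then compute one F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)

    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts.values()) / len(counts) if counts else 0.0
    return {"micro": micro, "macro": macro}


# Hypothetical reference labels and predictions for three documents.
gold = [{"sports"}, {"politics", "economy"}, {"economy"}]
pred = [{"sports"}, {"politics"}, {"economy", "sports"}]
scores = multilabel_f1(gold, pred)
print(scores)
```

Micro-averaging weights every label decision equally, whereas macro-averaging weights every label equally, so reporting both shows whether an approach performs well only on frequent labels.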