Master's Thesis
Comparison of Natural Language Document Classification Approaches
Completion
2024/11
Research Area
Intelligent Information Management
Students

Sozan Hama Kaka

Muhammad Saad Afzal
Advisers


Description
The emergence of large language models (LLMs) has led to rapid advances in natural language processing (NLP). LLMs have proven effective across a range of tasks, including machine translation and text completion, and have set new standards in the field.
Despite these advances, content-based document classification remains a critical challenge. Traditional methods still struggle with semantic nuances and the complexity of natural language, which leads to classification errors. Accurate document classification is essential for information retrieval applications, content organization systems, and various other domains. This thesis project investigates whether large language models can handle such classification tasks effectively.
The objective of this thesis is to compare a state-of-the-art large language model with established NLP approaches, such as classical machine learning methods and transformer-based models. The performance of the selected approaches is to be evaluated on both binary and multi-label classification tasks using one or more suitable test datasets. The influence of different prompting strategies on the performance of the chosen LLM is also to be investigated. Finally, the thesis is to demonstrate how the most suitable approach identified in the analysis can be used to classify natural language documents from a selected domain. For this purpose, a prototype with a web-based user interface is to be developed.
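As an illustration of how the approaches might be scored on a common footing, the following is a minimal sketch of micro- and macro-averaged F1 for multi-label predictions. The description does not name an evaluation metric, so the choice of F1, the function name `f1_scores`, and the example labels are all assumptions made for illustration. A binary task can be expressed in the same form by using a single label that is either present or absent.

```python
from typing import Dict, Sequence, Set


def f1_scores(gold: Sequence[Set[str]], pred: Sequence[Set[str]]) -> Dict[str, float]:
    """Compute micro- and macro-averaged F1 for multi-label predictions.

    Hypothetical helper: each document is represented by its set of labels.
    """
    # Collect every label that occurs in the gold or predicted sets.
    labels = sorted(set().union(*gold, *pred))
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}

    # Count true positives, false positives and false negatives per label.
    for g, p in zip(gold, pred):
        for label in labels:
            if label in p and label in g:
                tp[label] += 1
            elif label in p:
                fp[label] += 1
            elif label in g:
                fn[label] += 1

    def f1(t: int, p_: int, n: int) -> float:
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when nothing was counted.
        denom = 2 * t + p_ + n
        return 2 * t / denom if denom else 0.0

    per_label = [f1(tp[l], fp[l], fn[l]) for l in labels]
    macro = sum(per_label) / len(per_label) if per_label else 0.0
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return {"micro_f1": micro, "macro_f1": macro}


# Toy example with two assumed labels, "a" and "b".
gold = [{"a", "b"}, {"a"}, {"b"}, {"a"}]
pred = [{"a"}, {"a", "b"}, {"b"}, {"a"}]
scores = f1_scores(gold, pred)
```

Micro-averaging pools the counts over all labels, so frequent labels dominate the score, while macro-averaging weights every label equally. Reporting both would make it visible whether an approach performs well only on the common classes.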