Master's Thesis

Automatic Construction of Scholarly Knowledge Graphs Using Large Language Models

Completion

2025/07

Research Area

Web Engineering

Students

Sandra Schaftner

student

Advisers

Jan Haas M.Sc.

researcher

Room: 1/B204

Phone: +49 371 531 32141

Fax: +49 371 531 8 32141

Email: jan-ingo.haas@informatik.tu-chemnitz.de

Prof. Dr.-Ing. Martin Gaedke

professor

Room: 1/B319

Phone: +49 371 531 25530

Fax: +49 371 531 25539

Email: gaedke@informatik.tu-chemnitz.de

Description

The rapid growth of scholarly publications has resulted in an overwhelming volume of digital documents. The predominant use of the PDF format adds further complexity, particularly with respect to the reproducibility, comparability, and machine-readability of research outcomes. Together, these issues can impede the efficient flow of knowledge throughout the scientific process. Scientific or Scholarly Knowledge Graphs (SKGs) have emerged as a potential solution. SKGs are designed to capture the metadata and, in some cases, the content of research publications, including authors, locations, research topics, and citations. Ontologies are used to map deep semantic relationships, enabling sophisticated comparisons, further analysis, and exploration that exploit the inherent link structure. However, the literature widely recognizes the construction of SKGs as challenging. Major obstacles include knowledge extraction from diversely structured textual data, the heterogeneity and linking of research objects, and ontology matching. As a result, current approaches struggle to generate knowledge graphs that capture the semantics of scholarly knowledge accurately and sufficiently.

Recent advances in Natural Language Processing, more specifically in Natural Language Understanding through Large Language Models (LLMs), have raised expectations that the quality of automated Knowledge Graph (KG) construction can be improved, with the potential to replace traditional methodologies. Owing in particular to their superior in-context learning capabilities, contextual awareness, linguistic capabilities, and reasoning, LLMs are now regarded as the primary instruments for automated KG construction. Prior work on the use of LLMs in KG construction has shown promising results when strategies are implemented to address known problems with LLMs, such as “hallucinations” and non-determinism. The objective of this master's thesis is to explore approaches that leverage LLMs for SKG construction, with the goal of challenging current state-of-the-art solutions in the domain of computer science.

To this end, an exhaustive and systematic analysis of the shortcomings of existing state-of-the-art approaches will be conducted prior to the implementation stage. Based on this analysis, and with the aim of overcoming the identified deficiencies, a comprehensive and automated SKG construction pipeline will be conceptualized. Key components of the pipeline include the extraction of entities and relations, along with their mapping to ontologies such as the Computer Science Ontology (CSO), DBpedia, and Wikidata. The input to the pipeline has to be scientific publication data, such as title and abstract, and the output has to be a KG in RDF format. After giving an overview of the implementation, the concept must be evaluated with suitable benchmarking methods, including recent “LLMs-as-Judges” approaches for automated evaluation. On the one hand, the LLM-based evaluation will be performed at intermediate stages of the pipeline. On the other hand, a final qualitative evaluation of the whole integrated process will also be conducted using LLMs, benchmarking against randomly selected individual ORKG contributions. Different LLMs, encompassing both closed-source and open-source models, will be tested for their suitability to perform SKG construction tasks.
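
To make the pipeline's input and output contract concrete, the following is a minimal sketch of the final serialization step only, not the thesis implementation. It assumes that an earlier (LLM-based) stage has already produced subject-predicate-object triples as plain text labels, and it writes them as RDF in N-Triples syntax. The namespace `https://example.org/skg/` and the helper names `mint_iri`, `escape_literal`, and `to_ntriples` are hypothetical. In a real pipeline, the minted IRIs would be replaced by identifiers from ontologies such as CSO, DBpedia, or Wikidata during the mapping step.

```python
from urllib.parse import quote

# Hypothetical namespace for minted resources (an assumption for illustration).
BASE = "https://example.org/skg/"
# Dublin Core Terms property used to attach the publication title.
DCTERMS_TITLE = "<http://purl.org/dc/terms/title>"


def mint_iri(label: str) -> str:
    """Turn a free-text label into an IRI under the hypothetical BASE namespace."""
    slug = quote(label.strip().lower().replace(" ", "_"), safe="_")
    return f"<{BASE}{slug}>"


def escape_literal(text: str) -> str:
    """Escape the characters that must be escaped inside an N-Triples string literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriples(title: str, triples: list[tuple[str, str, str]]) -> str:
    """Serialize a publication title and extracted triples as N-Triples.

    `triples` holds (subject, predicate, object) labels that an upstream
    extraction step is assumed to have produced; no ontology mapping is done here.
    """
    paper = mint_iri(title)
    lines = [f'{paper} {DCTERMS_TITLE} "{escape_literal(title)}" .']
    for subj, pred, obj in triples:
        lines.append(f"{mint_iri(subj)} {mint_iri(pred)} {mint_iri(obj)} .")
    return "\n".join(lines)


if __name__ == "__main__":
    # Stand-in for the output of the entity and relation extraction stage.
    extracted = [("Large Language Model", "used for", "Knowledge Graph Construction")]
    print(to_ntriples("An Example Paper Title", extracted))
```

N-Triples is used here because it can be written line by line without an RDF library. A fuller implementation would more likely build the graph with an RDF toolkit and choose the serialization at the end.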