Master's Thesis / Bachelor's Thesis

An LLM-based approach for mining reproducibility information from scholarly publications

Completion

2024/12

Research Area

Intelligent Information Management

Advisers

Dr. Sheeba Samuel

senior researcher

Room: 1/B203

Phone: +49 371 531 39355

Fax: +49 371 531 839355

Email: sheeba.samuel@informatik.tu-chemnitz.de

Prof. Dr.-Ing. Martin Gaedke

professor

Room: 1/B319

Phone: +49 371 531 25530

Fax: +49 371 531 25539

Email: gaedke@informatik.tu-chemnitz.de

Description

Reproducibility of results is a cornerstone of scientific research, ensuring that findings can be independently verified and built upon. Scholarly publications contain critical information necessary for reproducing results, including data presented in text, tables, and images. However, manually extracting this reproducibility metadata is challenging and time-consuming. Leveraging Large Language Models (LLMs) to automate the extraction of this information offers a promising way to improve the efficiency and accuracy of this crucial process.

This thesis aims to develop an LLM-based approach that automatically mines reproducibility metadata from the text of scholarly publications. The solution must ensure robust and accurate extraction of reproducibility information. The thesis will focus on extracting information about deep learning methods in order to understand how reproducible these methods are, since deep learning is widely used across domains ranging from natural language processing to computer vision. A key feature of the solution will be LLM-based information extraction, in which LLMs accurately extract data from text and capture the detailed information necessary for reproducibility. This information will further be used to expand the metadata collected from the text. The solution should provide a user-friendly interface that allows users to easily input academic papers (PDFs or text files), review the extracted reproducibility metadata, and make any necessary adjustments or annotations. The solution should be easily extendable to accommodate other elements, such as images and tables, and to integrate with additional LLMs and data extraction techniques.
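As a rough illustration of the extensibility requirement described above, the extraction step could be sketched as follows. This is only a minimal sketch under stated assumptions, not the design of the thesis: the field names in `FIELDS`, the prompt wording, and the helper names (`ReproMetadata`, `build_prompt`, `extract`, `fake_llm`) are all hypothetical, and the LLM is modelled as a plain callable so that different models could be plugged in.

```python
import json
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical reproducibility fields; the real set would be defined in the thesis.
FIELDS = ("dataset", "model_architecture", "hyperparameters",
          "source_code_url", "random_seed")


@dataclass
class ReproMetadata:
    """Extracted values per field; None marks information not found."""
    values: dict = field(default_factory=dict)

    def missing(self) -> list:
        """Fields a reviewer would still have to fill in or annotate."""
        return [name for name in FIELDS if self.values.get(name) is None]


def build_prompt(text: str) -> str:
    """Ask the model for a JSON object with exactly the expected keys."""
    keys = ", ".join(FIELDS)
    return (
        "Extract reproducibility metadata from the publication text below. "
        f"Answer with a JSON object using the keys: {keys}. "
        "Use null for anything the text does not state.\n\n" + text
    )


def extract(text: str, llm: Callable[[str], str]) -> ReproMetadata:
    """Run one extraction pass; `llm` is any function mapping prompt to reply."""
    reply = llm(build_prompt(text))
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        data = {}  # unparsable reply: treat every field as missing
    if not isinstance(data, dict):
        data = {}
    # Keep only the expected keys so unexpected output cannot leak through.
    return ReproMetadata({name: data.get(name) for name in FIELDS})


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the sketch."""
    return json.dumps({"dataset": "CIFAR-10", "random_seed": 42})


if __name__ == "__main__":
    result = extract("We train on CIFAR-10 with seed 42.", fake_llm)
    print(result.values)
    print("Needs review:", result.missing())
```

Because the model is passed in as a function, swapping in another LLM or adding a second extractor for tables or images would not require changes to the review step, which only consumes `ReproMetadata`.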

The objective of this thesis is to analyze the current state of methods for extracting information from scholarly publications, identify existing challenges, and develop a comprehensive LLM-based solution that extracts reproducibility metadata from publication text. This includes designing and implementing the software tool, followed by an experimental evaluation in a pilot study to demonstrate its effectiveness and usability. By advancing text extraction for reproducibility metadata, this thesis aims to improve the efficiency and accuracy of data extraction and thereby make it easier for researchers to verify and build upon published scientific results.