Master's Thesis / Bachelor's Thesis
An LLM-based approach for mining reproducibility information from scholarly
publications
Research Area
Intelligent Information Management
Advisers
Description
Reproducibility of results is a cornerstone of scientific
research, ensuring that findings can be independently verified and built upon. Scholarly
publications contain critical information necessary for reproducing results, including
data presented in text, tables, and images. However, manually extracting this
reproducibility metadata is challenging and time-consuming. Using Large Language Models
(LLMs) to automate the extraction of this information could make this crucial process
both more efficient and more accurate.
This thesis aims to develop an LLM-based approach that automatically mines
reproducibility metadata from the text of scholarly publications. The solution must
extract reproducibility information robustly and accurately. The thesis will focus
specifically on information about deep learning methods, in order to assess how
reproducible these methods are, since deep learning methods are widely used across
various domains, from natural language processing to computer vision.
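As a rough illustration of what LLM-based mining of text could look like, the following
minimal sketch builds an extraction prompt, passes it to a language model, and parses the
reply into a fixed record. Everything named here is an assumption made for illustration:
the `ReproMetadata` schema and its fields, the prompt wording, and the `llm` callable
(any function that maps a prompt string to a reply string). None of it is prescribed by
the thesis description.

```python
import json
from dataclasses import dataclass, field, fields
from typing import Callable, Optional


@dataclass
class ReproMetadata:
    """Hypothetical schema; the field names are illustrative assumptions."""
    model_architecture: Optional[str] = None
    dataset: Optional[str] = None
    hyperparameters: dict = field(default_factory=dict)
    code_url: Optional[str] = None
    random_seed: Optional[int] = None


FIELD_NAMES = [f.name for f in fields(ReproMetadata)]


def build_prompt(passage: str) -> str:
    """Ask the model for a single JSON object restricted to the schema keys."""
    return (
        "Extract reproducibility metadata from the passage below.\n"
        f"Answer with one JSON object using only these keys: {', '.join(FIELD_NAMES)}.\n"
        "Omit keys the passage does not mention.\n\n"
        f"Passage:\n{passage}"
    )


def parse_response(raw: str) -> ReproMetadata:
    """Pull the JSON object out of the reply; fall back to an empty record."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        return ReproMetadata()
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return ReproMetadata()
    if not isinstance(data, dict):
        return ReproMetadata()
    # Drop keys outside the schema so unexpected model output cannot break the record.
    known = {k: v for k, v in data.items() if k in FIELD_NAMES}
    return ReproMetadata(**known)


def extract(passage: str, llm: Callable[[str], str]) -> ReproMetadata:
    """Run one passage through the (assumed) LLM callable and parse the reply."""
    return parse_response(llm(build_prompt(passage)))
```

Passing the model in as a plain callable keeps the sketch independent of any particular
LLM provider, and it allows the parsing logic to be tested with a stub that returns a
canned reply.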
The key features of the solution will include sophisticated text extraction, where LLMs
are used to accurately extract data from text, capturing the detailed information
necessary for reproducibility. The extracted data will further be used to expand the
metadata collected
from the text. The solution should provide a user-friendly interface that allows users to
easily input academic papers (PDFs or text files), review the extracted reproducibility
metadata, and make necessary adjustments or annotations. The solution should be designed
to be easily extensible to accommodate other elements, such as images and tables, and to
integrate with additional LLMs and data extraction techniques.
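One possible way to keep such a tool extensible is a small registry that maps each
element type to its own extractor, so that support for tables or images can be added
later without touching the pipeline. The sketch below is only an assumption about how
this might be structured: `register`, `run_pipeline`, the element kinds, and the keyword
check standing in for the LLM-based text extractor are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# An extractor turns the raw content of one element into metadata key/value pairs.
Extractor = Callable[[str], Dict[str, object]]

_REGISTRY: Dict[str, Extractor] = {}


def register(kind: str) -> Callable[[Extractor], Extractor]:
    """Decorator that registers an extractor for one element kind."""
    def decorator(fn: Extractor) -> Extractor:
        _REGISTRY[kind] = fn
        return fn
    return decorator


@register("text")
def extract_from_text(content: str) -> Dict[str, object]:
    # Placeholder: a keyword check stands in for the LLM-based text extractor.
    return {"mentions_seed": "seed" in content.lower()}


def run_pipeline(elements: List[Tuple[str, str]]) -> Dict[str, object]:
    """Merge the output of every registered extractor; record unsupported kinds."""
    merged: Dict[str, object] = {}
    skipped: List[str] = []
    for kind, content in elements:
        extractor = _REGISTRY.get(kind)
        if extractor is None:
            skipped.append(kind)  # no extractor yet, e.g. images or tables
            continue
        merged.update(extractor(content))
    merged["_skipped"] = skipped
    return merged
```

With this layout, adding table support would amount to registering one more function
under a new kind such as `"table"`, while elements without an extractor are reported
rather than silently dropped.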
The objective of this thesis is to analyze the current state of extraction methods for scholarly
publications, identify existing challenges, and develop a comprehensive LLM-based solution
to extract reproducibility metadata from publication text. This includes designing and
implementing the software tool, followed by an experimental evaluation through a pilot
study to demonstrate its effectiveness and usability. By advancing text extraction for
reproducibility metadata through this LLM-based system, this thesis aims to significantly
improve the efficiency and accuracy of data extraction processes, thereby enhancing the
ability of researchers to verify and build upon published scientific results.