Master's Thesis
Assessing Data Quality in Digital Research Dataset Metadata to improve
Discoverability and Interdisciplinary Reuse
Research Area
Intelligent Information Management
Students
Biswajit Panda
Advisers
Description
Nowadays, scientists are encouraged to publish
their research artifacts in established research data repositories so that others can find
and reuse them. In this publishing process, researchers typically provide additional
information as metadata to describe the characteristics of such a dataset. The data
repository then exposes this metadata in various common formats together with the research
artifact, and it is used for indexing and crawling activities. Nevertheless, the
discoverability of such a research dataset is often still limited and relies mainly on
administrative metadata and less on structured descriptive metadata about the content of
the dataset.
In this project, we focus on data quality metrics that specifically address the discoverability of published research datasets, and on how to assess them against existing metadata descriptions. In a first step, a typical research dataset discovery process and its shortcomings have to be described, together with the metadata description formats and schemas commonly used for it. After a requirements and state-of-the-art analysis, relevant data quality criteria have to be identified that can be measured on research dataset metadata descriptions to assess how well other researchers can find, access, and reuse the dataset across disciplines for a particular use case. In a second step, it has to be checked which of these criteria can be assessed automatically on an existing metadata description, and it has to be shown conceptually how these measurements can be run for a given research dataset. Existing approaches such as FAIRmetrics or the OpenAIRE Guidelines for Data Archives can be taken into consideration. An implementation and evaluation have to show the feasibility and correctness of the approach based on a representative corpus of research datasets from different repositories or application domains.
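To illustrate what an automatically assessable criterion could look like, the following minimal sketch computes a completeness score for a metadata record, separately for administrative and descriptive fields. It is only one possible reading of such a metric under stated assumptions: the field names, the two field groups, and the example record are hypothetical and are not taken from FAIRmetrics, the OpenAIRE guidelines, or any particular repository schema.

```python
from typing import Any, Mapping, Sequence

# Hypothetical field groups (assumption): a real assessment would derive
# these from the metadata schema that the repository actually exposes.
ADMINISTRATIVE_FIELDS = ("identifier", "creator", "publisher", "publication_year")
DESCRIPTIVE_FIELDS = ("title", "description", "subjects", "variables", "license")


def is_filled(value: Any) -> bool:
    """Treat None, blank strings, and empty collections as missing."""
    if value is None:
        return False
    if isinstance(value, str):
        return bool(value.strip())
    if isinstance(value, (list, tuple, set, dict)):
        return len(value) > 0
    return True


def completeness(record: Mapping[str, Any], fields: Sequence[str]) -> float:
    """Return the fraction of the given fields that carry a non-empty value."""
    if not fields:
        return 0.0
    filled = sum(1 for name in fields if is_filled(record.get(name)))
    return filled / len(fields)


def assess(record: Mapping[str, Any]) -> dict:
    """Score one metadata record per field group."""
    return {
        "administrative": completeness(record, ADMINISTRATIVE_FIELDS),
        "descriptive": completeness(record, DESCRIPTIVE_FIELDS),
    }


# Made-up example record (assumption): complete administrative metadata,
# but a blank description, no subjects, and no variables or license.
example_record = {
    "identifier": "doi:10.0000/example",  # placeholder, not a real DOI
    "creator": "Example Author",
    "publisher": "Example Repository",
    "publication_year": 2020,
    "title": "Example dataset",
    "description": "   ",
    "subjects": [],
}

scores = assess(example_record)
print(scores)
```

In this made-up record all administrative fields are present while only one of the five descriptive fields is filled, which mirrors the imbalance described above. Completeness is deliberately the simplest criterion; others, such as the use of controlled vocabularies or resolvable identifiers, would need their own checks.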