Master's Thesis
Automated Metadata Annotation of Research Data with Homonym Disambiguation
Completion
2024/05
Research Area
Intelligent Information Management
Advisers
Description
CKAN is a popular repository and framework for research data and research data
management. Its core functionality is the ability to upload datasets and to
annotate them with metadata, such as keywords, description,
and title. When enriching datasets with metadata, finding the most suitable tags in a
predefined set of available keywords can be difficult, because it requires knowledge of
the tags that already exist. Without that knowledge, datasets may be annotated
improperly. Additionally, some tags might be homonyms which, if not disambiguated, make
it hard to correctly classify a dataset. This thesis explores
algorithms and techniques for automatic annotation of datasets with plausible tags based
on a specified context, e.g., the dataset description, title or previous work of the
author and their field. The investigated and implemented techniques should
also be aware of the semantics of each keyword and thus be capable of automatically
disambiguating homonyms. The objective of this thesis is to research the
different approaches to automatically annotating datasets with keywords and their
ability to disambiguate homonym tags as described above. To this end, a
thorough state-of-the-art analysis must be conducted, and the most applicable approaches
should be implemented and their performance evaluated objectively using
existing metrics and benchmarks. Finally, a demonstrator has to be created
which shows the capabilities of the implemented algorithms.
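
To make the task more concrete, the following is a minimal sketch of context-based tag suggestion with homonym disambiguation. It is not the method the thesis prescribes: it uses plain bag-of-words cosine similarity as a simple stand-in baseline. The tag vocabulary `TAG_SENSES`, its sense glosses, the function `suggest_tags`, and the `threshold` parameter are all hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical tag vocabulary (an assumption, not CKAN data): each tag maps to
# one or more senses, each described by a short gloss. A tag with more than
# one sense is a homonym.
TAG_SENSES = {
    "cell": {
        "biology": "smallest living unit in an organism tissue membrane nucleus",
        "battery": "electrochemical unit storing energy voltage lithium battery",
    },
    "climate": {
        "weather": "long term weather temperature precipitation atmosphere",
    },
}


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_tags(context, tag_senses, threshold=0.1):
    """Return (tag, sense, score) triples, best match first.

    For every tag, the sense whose gloss is most similar to the context is
    selected (this is the disambiguation step). Tags whose best score falls
    below the threshold are not suggested.
    """
    ctx = bag_of_words(context)
    suggestions = []
    for tag, senses in tag_senses.items():
        scored = [(sense, cosine(ctx, bag_of_words(gloss)))
                  for sense, gloss in senses.items()]
        sense, score = max(scored, key=lambda pair: pair[1])
        if score >= threshold:
            suggestions.append((tag, sense, score))
    return sorted(suggestions, key=lambda item: item[2], reverse=True)


# Usage: the context here is a dataset description.
description = "Voltage and capacity measurements of lithium battery packs during charging"
for tag, sense, score in suggest_tags(description, TAG_SENSES):
    print(f"{tag} ({sense}): {score:.2f}")
```

In this sketch the homonym "cell" is resolved to whichever sense shares the most vocabulary with the description. A realistic approach would likely replace the token counts with semantic embeddings and draw sense descriptions from a curated vocabulary, but the overall structure (score every sense against the context, keep the best one per tag, rank the tags) would stay the same.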