Master's Thesis
Text Analysis to identify scientific concepts in Research Dataset Abstracts
Completion
2021/05
Research Area
Intelligent Information Management
Students
Egisa Kasemi
Advisers
Description
In the context of Open Science, researchers are encouraged to
publish their research datasets in common data repositories so that others can find and
reuse them. As the files contained in such a research dataset are commonly not
self-descriptive, the authors have to provide additional meta-information about the
nature, format and content of the dataset. This commonly includes the names of the
authors, a title, a description text and some keywords. However, keywords often do not
capture all characteristics of a dataset. Most of the information is provided in an
unstructured way in the description text (abstract) of the meta description. It would
be beneficial if automated means could extract relevant entities from such a text and
semantically map them to corresponding concept identifiers in a well-known
terminology.
The objective of the Master's thesis project is to
apply an appropriate Text Mining approach, such as Natural Language Processing (NLP), to a
set of research dataset abstracts in order to identify relevant entities, such as the
examined object, research objective, devices or software used, file type, scientific method
or other measurement characteristics, as long as they are mentioned in the descriptive
text.
The use case can be limited to a particular research domain, such as
research datasets from human-machine interaction.
To achieve this,
a requirements analysis has to be performed first. Then, a state-of-the-art analysis
of existing approaches has to be conducted. A concept has to be designed and
described in which a semi-structured metadata description of a research dataset from a
common data repository is provided as input, the approach analyses the abstract text, and
the solution reconciles all identified entities to appropriate Linked Data identifiers.
An implementation has to show the feasibility of this approach, and an evaluation has to
assess quality parameters such as accuracy in a practical environment.
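
The concept outlined above (metadata record as input, analysis of the abstract, reconciliation to identifiers, evaluation) could be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the approach the thesis prescribes: simple dictionary lookup stands in for a real NLP pipeline, and the `GAZETTEER` terms, the `example.org` URIs, the record fields and the function names are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

# Hypothetical gazetteer: surface form -> (entity type, placeholder concept URI).
# The example.org URIs are stand-ins, not identifiers from a real terminology.
GAZETTEER: Dict[str, Tuple[str, str]] = {
    "eye tracker": ("device", "http://example.org/concept/eye-tracker"),
    "csv": ("file_type", "http://example.org/concept/csv"),
    "usability study": ("method", "http://example.org/concept/usability-study"),
}


@dataclass(frozen=True)
class Entity:
    text: str   # matched surface form
    type: str   # entity category, e.g. device or file type
    uri: str    # reconciled concept identifier


def extract_entities(record: Dict[str, object]) -> List[Entity]:
    """Find gazetteer terms in the abstract (the 'description' field only)."""
    abstract = str(record.get("description", "")).lower()
    found = []
    for term, (etype, uri) in GAZETTEER.items():
        # Whole-word, case-insensitive match of the term in the abstract.
        if re.search(r"\b" + re.escape(term) + r"\b", abstract):
            found.append(Entity(term, etype, uri))
    return found


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Compare predicted URIs with manually annotated (gold) URIs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical semi-structured metadata record from a data repository.
record = {
    "title": "Gaze data from a usability study",
    "description": "Gaze data recorded with an eye tracker and stored as CSV files.",
    "keywords": ["gaze"],
}
entities = extract_entities(record)
for entity in entities:
    print(entity.type, entity.text, entity.uri)
```

A dictionary lookup of this kind misses synonyms, inflected forms and ambiguous terms, which is where trained NLP models and a reconciliation service for an established terminology would come in. The precision and recall helper illustrates one possible way to quantify the quality parameters mentioned above against a manually annotated sample.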