Master's Thesis
Exploration of efficient storage mechanisms for industrial Linked Data architectures
Completion
2024/04
Research Area
Students
Shivam Patel
Advisers
Description
Manufacturing a product involves numerous independent entities distributed along the value chain, which makes it difficult to record, connect, and analyze the heterogeneous data generated by diverse sources such as products, processes, and machines. Generalizing these data is an essential step toward solutions that can track product data throughout the production process. Several existing architectures address this with Linked Data and Semantic Web technologies. However, the substantial amount of data generated at high velocity and volume by the various components poses a challenge for long-term data retention. Additionally, the SPARQL endpoints, which are available in various implementations, can run queries against the RDF store, yet they lack support for federated SPARQL queries, which are essential in real-world scenarios for fetching data from distributed sources.
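To illustrate what a federated query looks like, the following minimal sketch builds a SPARQL 1.1 query string that uses the standard SERVICE keyword to join local triples with triples held by a remote endpoint. The function name, the `ex:` vocabulary, and all IRIs are hypothetical and chosen only for illustration; they are not taken from the thesis or from the architecture it studies.

```python
def _check_iri(iri: str) -> str:
    """Reject characters that would break out of the <...> IRI delimiters."""
    if any(ch in iri for ch in '<>"') or any(ch.isspace() for ch in iri):
        raise ValueError(f"not a safe IRI: {iri!r}")
    return iri


def build_federated_query(remote_endpoint: str, product_iri: str) -> str:
    """Build a query that joins local data with data from a remote endpoint.

    The ex: properties are hypothetical placeholders, not a real vocabulary.
    """
    remote = _check_iri(remote_endpoint)
    product = _check_iri(product_iri)
    return (
        "PREFIX ex: <http://example.org/>\n"
        "SELECT ?process ?machine WHERE {\n"
        # Pattern evaluated against the local RDF store.
        f"  <{product}> ex:producedBy ?process .\n"
        # Pattern delegated to the remote SPARQL endpoint.
        f"  SERVICE <{remote}> {{\n"
        "    ?process ex:executedOn ?machine .\n"
        "  }\n"
        "}"
    )


query = build_federated_query(
    "http://example.org/plant-b/sparql",
    "http://example.org/product/42",
)
print(query)
```

An engine without federation support would reject or ignore the SERVICE block, which is the gap described above.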
This thesis aims to address challenges in storing and querying heterogeneous data generated in the product manufacturing life cycle using “Linked Data Architecture”[1]. The thesis will investigate data storage technologies such as Parquet, ORC, and Delta Lake to find a suitable high-performance alternative to which data from the main RDF store can be archived. The thesis will also seek a way to support federated SPARQL queries in the SPARQL engine and extend the current implementation accordingly. The goal is to develop an efficient and scalable solution that provides an archival mechanism for storing time-annotated Linked Data over longer periods and supports federated SPARQL queries for data retrieval from distributed sources.
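As a rough illustration of the archival idea, the sketch below splits time-annotated triples at a cutoff date and pivots the older ones into a column-oriented layout, which is the shape that columnar formats such as Parquet or ORC store. It is a sketch under stated assumptions, not the mechanism developed in the thesis: the `TimedTriple` type, the function names, the cutoff policy, and the sample data are all hypothetical.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class TimedTriple(NamedTuple):
    """A triple annotated with the time it was recorded (hypothetical model)."""
    subject: str
    predicate: str
    obj: str
    timestamp: datetime


def split_for_archival(triples, cutoff):
    """Partition triples into (keep, archive); older than cutoff is archived."""
    keep = [t for t in triples if t.timestamp >= cutoff]
    archive = [t for t in triples if t.timestamp < cutoff]
    return keep, archive


def to_columns(triples):
    """Pivot row-wise triples into one list per column."""
    return {
        "subject": [t.subject for t in triples],
        "predicate": [t.predicate for t in triples],
        "object": [t.obj for t in triples],
        "timestamp": [t.timestamp.isoformat() for t in triples],
    }


# Sample data, invented for illustration only.
triples = [
    TimedTriple("ex:machine1", "ex:temperature", "71.5",
                datetime(2023, 1, 5, tzinfo=timezone.utc)),
    TimedTriple("ex:machine1", "ex:temperature", "73.0",
                datetime(2024, 3, 1, tzinfo=timezone.utc)),
]
cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)

keep, archive = split_for_archival(triples, cutoff)
archive_columns = to_columns(archive)
print(archive_columns)
```

A dictionary of equal-length lists like `archive_columns` is the kind of input that a columnar writer library typically accepts, so the actual file format can be swapped without changing the selection logic.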
The objective of this thesis is to create a solution, or to combine existing approaches, that solves the storage problem for the industrial Linked Data Architecture described above. This comprises an analysis of the state of the art in high-performance data storage, temporal Linked Data, federated SPARQL query support, and long-term storage, as well as a demonstration of the solution through an implementation and a suitable evaluation based on experiments that simulate different scenarios.