
Theses

Writing a thesis is the final step in obtaining a Bachelor's or Master's degree. A thesis is always coupled to a scientific project in some field of expertise. Candidates who want to write their thesis in the Big Data Analytics group should, therefore, be interested and trained in a field related to our research areas.

A thesis is an independent, scientific, and practical work. This means that the thesis and its related project are conducted exclusively by the candidate, that the execution follows proper scientific practices, and that all necessary artifacts, algorithms, and evaluations have been implemented and submitted as part of the thesis. A proper way of sharing code and evaluation artifacts is to create a public GitHub repository, which can then be referenced in the thesis. The thesis serves as documentation of the project and as a scientific analysis and reflection of the gathered insights.

For students interested in a thesis, we offer interesting topics and close, continuous supervision during the entire thesis period. Every thesis is supervised by at least one member of our team, who can give advice and help in critical situations. The condensed results of our best Master theses have been published at top scientific venues, such as VLDB, CIKM, and EDBT.

A selection of open thesis topics can be found on this page. We also encourage interested students to suggest their own ideas in the context of our research areas and to contact individual members of the group directly. An ideal thesis topic is connected in some form to the research projects of a group member, who then becomes the supervisor of the thesis. Hence, taking a look at the personal pages and our current projects is a good starting point for a thesis project. Recent publications at conferences, such as VLDB or SIGMOD, and open research challenges on, for example, Kaggle are good resources for finding interesting thesis ideas.

Organizational information

  • Exposé: Before starting a thesis, Master students have to write a 2-5 page exposé. The exposé is a description of the planned project and includes a motivation for the topic, a literature review of related work, a draft of the research/project idea, and a plan for the final evaluation. Please use our template, which contains initial instructions, when starting your exposé. The exposé can be created in the context of the "Selbstständiges wissenschaftliches Arbeiten" (independent scientific work) module.
  • Timetable: Once the thesis project is started, it must be finished within six months for Master theses and four months for Bachelor theses. Only special circumstances, such as sickness, can extend this period; the period can also be extended if you work a regular job or need to take further courses during your thesis. A thesis can be started at any time, either aligned with the semester schedule or independently of it.
  • Presentations: The work on a Master thesis requires students to give at least two talks. A mid-term talk serves to gather additional feedback from a larger audience and to practice the final thesis defense; this talk is not graded. The final talk is a proper defense of the thesis and its final results; this talk is graded as part of the academic performance.


Open Topics for Bachelor and Master Theses

  • Data Profiling
    • Sindy++: Reactive, Distributed Discovery of Inclusion Dependencies
      • We aim to translate the batch-oriented Sindy algorithm for the discovery of inclusion dependencies into a reactive, more efficient data profiling approach using Akka (sketch below).
    • Many++: Fuzzy Inclusion Dependency Discovery for Data Integration
      • We aim to translate the Many algorithm for inclusion dependency discovery on Web Tables into a partial inclusion dependency (IND) discovery algorithm that is better suited for data integration scenarios (sketch below).
    • DPQL: The Data Profiling Query Language
      • The Data Profiling Query Language (DPQL) is a recently developed metadata profiling interface that supports the discovery of complex metadata patterns.
      • We aim to develop efficient profiling approaches that find these metadata patterns as fast as possible.
  • Time Series Analytics 
    • TimeEngine - A Time Series Engineering Library
      • IoT applications, multi-sensor systems, and many distributed software systems record time series at different frequencies, temporal alignments, speeds, and formats, which makes their integrated analysis a technically and algorithmically challenging task. We therefore aim to develop a time series engineering library that assists in the integration and preparation of time series for analytical tasks, such as anomaly detection, forecasting, and clustering (sketch below).
      • As part of the project, we could record our own time series with different sensors and afterwards aggregate the measurements with the time series library into a single multivariate time series.
    • Crisis in emergenCity - Intelligent Info-Station Planning
      • Based on the movement events of agents in cities, we aim to plan the placement of info-stations such that these stations inform as many nearby agents as possible within some fixed time period (sketch below).
      • The project will be conducted in collaboration with the emergenCity project.
      • We will use the streams of movement data and the Lambda engine that is currently in development at the UMR.
      • Keywords: Lambda queries, lattice search
    • Anomaly Detection in Medical Sensor Data
      • Given non-invasive medical sensor measurements, such as heartbeats or temperature curves, we aim to find anomalous recordings that may indicate diseases or body malfunctions via modern anomaly detection, clustering, and/or prediction techniques for time series (sketch below).
      • The project will be conducted in collaboration with the VirtualDoc project.
      • Keywords: time series analytics, machine learning
    • SemanticWindows: Slicing of Time Series into Variable-Length, Meaningful Subsequences
      • In this project, we aim to slice time series into semantically meaningful subsequences. In contrast to traditional sliding or hopping windows, semantic windows should capture variable-length concepts, such as heartbeats in ECG data (sketch below). These subsequences will then help anomaly detection or clustering algorithms produce better results.
    • Anomaly Detection on Time Series Data Streams
      • Discovering anomalies in streaming data is a challenging task; hence, we aim to translate batch anomaly detection algorithms into the streaming scenario (sketch below).
      • Our goal is to discover anomalies as fast as possible while sacrificing as little precision as possible.
      • Keywords: stream processing
    • SoundMaker: An AI for Film Scoring
      • In film scoring, certain visual scenes are accompanied by appropriate sounds; we plan to automate this process with artificial intelligence.
      • Given a database of already scored films, we first extract the scene-to-sound mappings and then train a model to learn the scoring process.
      • The project will be conducted in collaboration with a professional film scorer.
      • Keywords: image processing, machine learning
  • Data Integration
    • Second-Line Schema Matching
      • First-line schema matching produces similarity matrices that indicate how likely it is that two attributes of different schemata represent the same semantic concept.
      • Second-Line schema matching consumes similarity matrices and aims to produce improved similarity matrices.
      • There are two main approaches to second-line matching: 1) similarity matrix boosting and 2) ensemble matching. While the former transforms a given similarity matrix into a more valuable one, the latter consumes multiple matrices and combines them into a single new similarity matrix (sketch below).
    • HungarianMethod++: Matching Web-sized schemata
      • We aim to improve the efficiency of the Hungarian Method in exchange for a bit of fuzziness/approximation (i.e., reduced correctness) (sketch below).
      • Also interesting: Can we allow (to some extent) 1:n and n:m mappings in the attribute matching?
    • OntologyShredding: Transforming ontology data to relational tables
      • Knowledge bases are a valuable source of publicly available data and data integration scenarios. To make these scenarios usable for relational data integration systems as well, this project aims to develop a shredding algorithm that translates linked open data into meaningful relational tables for data integration purposes (sketch below).
    • RelationDecomposer: Decomposing relational databases into schema matching scenarios
      • Data integration test scenarios are very rare, especially scenarios with special properties, such as join- and unionable tables, unary and complex attribute matches, a broad selection of data types, schema-based and schema-less data, real-world data values, and many other properties. This project, therefore, aims to develop a relation decomposer that takes existing, integrated datasets as input and automatically generates different integration scenarios with specific properties from these seed datasets via relational decomposition (sketch below).
    • WDCIntegration - Fusing the Web Data Commons Data
      • The Web Data Commons Crawl is a large dataset of relational tables that stem from crawled HTML Web tables. These tables often store data about the same or similar concepts but are, due to the crawling process, completely unconnected. Hence, we aim to integrate the WDC corpus as meaningfully and correctly as possible, which is both a technically and conceptually challenging task.
    • LakeHouse: Virtual Integration on the Dynamic Data of Data Lakes
      • Data in data lakes is subject to constant change. Data lakes also lack most of the control mechanisms that traditional database systems use to, for example, standardise schemata, maintain indexes, or enforce constraints. In this project, we aim to develop a system named LakeHouse that dynamically integrates certain parts of a data lake to serve certain user-defined queries.
  • Machine Learning
    • DataGossip++: A Data Exchange Extension for Distributed Machine Learning Algorithms
      • The federated learning technique DataGossip proposes to exchange not only model weights but also some training data items for better convergence on skewed data distributions; we aim to improve this technique with more intelligent training data selection (sketch below).
      • Keywords: federated learning, distributed computing
  • Teaching
    • BYTE Challenge: Platform Development for effective Online Learning
      • The BYTE Challenge is a digital learning platform for computer science that targets children from grade 3 to 13.
      • In this project, we aim to assist in the development of the platform and in the assessment and curation of digital learning material, which includes videos, quizzes, papers, etc.
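
Sketches for selected topics

The following sketches illustrate the core technique behind some of the topics above. They are minimal, single-node illustrations under simplifying assumptions (toy data, invented names), not reference implementations of the respective projects.

Sindy++: Sindy-style discovery of unary inclusion dependencies builds an inverted index from each value to the attributes that contain it and then intersects these attribute sets. A minimal sketch, assuming three toy columns (the real algorithm runs distributed on a batch engine):

```python
from collections import defaultdict

# Toy relations: attribute name -> set of values (assumption for illustration).
relations = {
    "orders.customer_id": {1, 2, 3},
    "customers.id": {1, 2, 3, 4, 5},
    "invoices.customer_id": {2, 3},
}

# Step 1: invert values to the set of attributes in which they occur.
value_index = defaultdict(set)
for attribute, values in relations.items():
    for value in values:
        value_index[value].add(attribute)

# Step 2: for each attribute, intersect the attribute sets of all its values;
# every surviving attribute contains all of its values, i.e., is a valid IND partner.
candidates = {attr: set(relations) for attr in relations}
for attrs in value_index.values():
    for attr in attrs:
        candidates[attr] &= attrs

for lhs, rhs_set in candidates.items():
    for rhs in rhs_set - {lhs}:
        print(f"{lhs} <= {rhs}")  # lhs is included in rhs
```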
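
Many++: A partial IND can be defined via an inclusion coefficient, i.e., the fraction of distinct values of one column that also appear in the other. A minimal sketch of this check (the definition, threshold, and toy columns are assumptions):

```python
# Partial (fuzzy) IND check: A <= B holds partially if at least `threshold`
# of A's distinct values also appear in B (assumed definition).
def partial_ind(a_values: set, b_values: set, threshold: float = 0.9) -> bool:
    if not a_values:
        return True  # an empty column is trivially included
    coverage = len(a_values & b_values) / len(a_values)
    return coverage >= threshold

# Hypothetical web-table columns; one value is misspelled.
country_col = {"Germany", "France", "Spain", "Germny"}
reference = {"Germany", "France", "Spain", "Italy"}
print(partial_ind(country_col, reference, threshold=0.75))  # True: 3/4 covered
```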
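
TimeEngine: The central task is aligning series that were recorded at different rates onto a common temporal grid. A minimal sketch with pandas (the sensor names, rates, and the 500 ms target grid are assumptions):

```python
# Align two sensor series recorded at different rates onto a common grid.
import pandas as pd

idx_fast = pd.date_range("2024-01-01", periods=10, freq="100ms")
idx_slow = pd.date_range("2024-01-01", periods=5, freq="250ms")
accel = pd.Series(range(10), index=idx_fast, name="accelerometer")
temp = pd.Series([20.0, 20.1, 20.2, 20.1, 20.3], index=idx_slow, name="temperature")

# Downsample the fast series by averaging, upsample the slow one by
# forward-filling, then join both into one multivariate time series.
aligned = pd.concat(
    [accel.resample("500ms").mean(), temp.resample("500ms").ffill()],
    axis=1,
)
print(aligned)
```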
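
Intelligent Info-Station Planning: Choosing station sites so that they inform as many agents as possible is an instance of maximum coverage, for which greedy selection gives the classic (1 - 1/e) approximation. A minimal sketch (the candidate sites and agent sets are assumptions):

```python
# Greedy maximum-coverage placement of k info-stations.
def place_stations(coverage: dict, k: int) -> list:
    """coverage maps candidate site -> set of agent ids it would inform."""
    chosen, informed = [], set()
    for _ in range(k):
        # Pick the site that informs the most not-yet-informed agents.
        best = max(coverage, key=lambda site: len(coverage[site] - informed))
        if not coverage[best] - informed:
            break  # no remaining site informs anyone new
        chosen.append(best)
        informed |= coverage[best]
    return chosen

sites = {  # hypothetical candidate sites and the agents passing nearby
    "station_a": {1, 2, 3}, "station_b": {3, 4}, "station_c": {4, 5, 6, 7},
}
print(place_stations(sites, k=2))  # ['station_c', 'station_a']
```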
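
Anomaly Detection in Medical Sensor Data: A simple baseline flags points that deviate strongly from their local context. A minimal sliding-window z-score sketch on synthetic body-temperature data (the window size and threshold are assumptions; the thesis would use stronger detectors):

```python
import numpy as np

def zscore_anomalies(signal: np.ndarray, window: int = 50, thresh: float = 4.0):
    """Flag points that deviate more than `thresh` std devs from the window mean."""
    anomalies = []
    for i in range(window, len(signal)):
        ctx = signal[i - window:i]
        std = ctx.std()
        if std > 0 and abs(signal[i] - ctx.mean()) / std > thresh:
            anomalies.append(i)
    return anomalies

rng = np.random.default_rng(0)
temperature = 37.0 + 0.1 * rng.standard_normal(500)  # synthetic body temperature
temperature[300] = 39.5                               # injected fever spike
print(zscore_anomalies(temperature))                  # [300]
```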
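
SemanticWindows: Instead of fixed-size windows, the series can be cut at structural landmarks, e.g., detected peaks, so that each window spans one variable-length cycle. A minimal sketch with scipy's find_peaks on a synthetic beat-like signal (the peak parameters are assumptions):

```python
import numpy as np
from scipy.signal import find_peaks

t = np.linspace(0, 10, 2000)
signal = np.sin(2 * np.pi * 1.3 * t) ** 15  # spiky, roughly beat-like curve

# Slice between consecutive peaks: one window per "beat".
peaks, _ = find_peaks(signal, height=0.5, distance=50)
windows = [signal[a:b] for a, b in zip(peaks, peaks[1:])]
print([len(w) for w in windows])  # lengths follow the data, not a fixed size
```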
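
Anomaly Detection on Time Series Data Streams: A batch z-score detector can be lifted to streams by maintaining mean and variance incrementally with Welford's online update, so each point is scored in O(1) time and space. A minimal sketch (the threshold is an assumption):

```python
class StreamingDetector:
    def __init__(self, thresh: float = 3.0):
        self.n, self.mean, self.m2, self.thresh = 0, 0.0, 0.0, thresh

    def observe(self, x: float) -> bool:
        """Return True if x is anomalous w.r.t. the values seen so far."""
        is_anomaly = False
        if self.n > 1:
            std = (self.m2 / (self.n - 1)) ** 0.5
            is_anomaly = std > 0 and abs(x - self.mean) / std > self.thresh
        # Welford's online update of mean and (scaled) variance.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_anomaly

detector = StreamingDetector()
stream = [1.0, 1.1, 0.9, 1.0, 1.2, 9.0, 1.1]
print([x for x in stream if detector.observe(x)])  # [9.0]
```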
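
Second-Line Schema Matching: The simplest ensemble matcher combines the first-line similarity matrices element-wise; learned combinations are the actual research question. A minimal sketch with two hypothetical first-line matchers scoring three source against two target attributes:

```python
import numpy as np

# Hypothetical first-line similarity matrices (rows: source attributes,
# columns: target attributes).
name_matcher = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.3]])
value_matcher = np.array([[0.7, 0.2], [0.1, 0.9], [0.6, 0.2]])

# Ensemble step: element-wise average of all input matrices.
ensemble = np.mean([name_matcher, value_matcher], axis=0)
matches = ensemble.argmax(axis=1)  # best target attribute per source attribute
print(ensemble)
print(matches)  # [0 1 0]
```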
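
HungarianMethod++: The exact Hungarian method derives a 1:1 matching from a similarity matrix in O(n^3) time; the thesis would replace this step with a faster approximate variant. A minimal sketch of the exact baseline using scipy (the similarity values are assumptions):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

similarity = np.array([[0.9, 0.1, 0.3],
                       [0.2, 0.8, 0.4],
                       [0.5, 0.3, 0.7]])

# linear_sum_assignment minimizes cost, so negate similarities to maximize.
rows, cols = linear_sum_assignment(-similarity)
for r, c in zip(rows, cols):
    print(f"source attribute {r} -> target attribute {c} ({similarity[r, c]})")
```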
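
OntologyShredding: One plausible shredding strategy groups entities into one table per rdf:type, with one column per predicate. A minimal sketch on hypothetical triples (real linked open data would need multi-valued predicates and missing types handled as well):

```python
from collections import defaultdict

triples = [  # hypothetical linked-data triples (subject, predicate, object)
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:city", "Marburg"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:bob", "ex:name", "Bob"),
]

# Collect each subject's predicate-object pairs.
entities = defaultdict(dict)
for s, p, o in triples:
    entities[s][p] = o

# One relational table per rdf:type, one column per predicate
# (assumes every entity carries an rdf:type, true for the toy data).
tables = defaultdict(list)
for subject, props in entities.items():
    row = {"id": subject, **{p: v for p, v in props.items() if p != "rdf:type"}}
    tables[props["rdf:type"]].append(row)

print(tables["ex:Person"])
```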
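
RelationDecomposer: A seed relation can be decomposed vertically into overlapping, joinable tables; the shared key column then serves as ground truth for evaluating a matcher. A minimal sketch with pandas (the column names and renamings are assumptions):

```python
import pandas as pd

seed = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "city": ["Marburg", "Kassel", "Gießen"],
    "salary": [50000, 52000, 49000],
})

# Vertical decomposition into two joinable tables; renaming simulates
# independently designed schemata.
left = seed[["id", "name", "city"]].rename(columns={"name": "full_name"})
right = seed[["id", "salary"]].rename(columns={"id": "person_id"})

# Ground truth: left.id corresponds to right.person_id, and joining on it
# reconstructs the seed relation.
print(left.merge(right, left_on="id", right_on="person_id"))
```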
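
DataGossip++: One plausible selection heuristic is to share a class-balanced sample instead of a random one, so that skewed partitions level out across workers. A minimal sketch (the balancing heuristic is an assumption for illustration, not the published technique):

```python
from collections import defaultdict

def items_to_gossip(dataset, k):
    """dataset: list of (features, label) pairs local to one worker."""
    by_label = defaultdict(list)
    for item in dataset:
        by_label[item[1]].append(item)
    # Round-robin over labels yields a class-balanced sample of size k.
    picked = []
    while len(picked) < k and any(by_label.values()):
        for items in by_label.values():
            if items and len(picked) < k:
                picked.append(items.pop())
    return picked

local_data = [([0.1], "a"), ([0.2], "a"), ([0.3], "a"), ([0.9], "b")]
print(items_to_gossip(local_data, k=2))  # one "a" item and the rare "b" item
```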

Current Theses

  • Efficient Partial Inclusion Dependency Discovery
  • Development of a Chat AI for Data Engineering
  • Image2Surface: Predicting Surface Properties of Workpieces from Laserscan Images
  • Image2Surface: Data Engineering for Visual Analytics

Completed Theses

  • Detection of Anomalous Medical Patterns – Analysis of Non-Invasive Medical Data Using Machine Learning (2024)
  • Data Generation and Machine Learning in the Context of Optimizing a Twin Wire Arc Spray Process (2023)
  • A Clustering Approach to Column Type Annotation: Effects of Pre-Clustering (2023)
  • Holistic Integration of Web Data (2023)
  • User-Centric Explainable Deep Reinforcement Learning for Decision Support Systems (2023)
  • Combining Time Series Anomaly Detection Algorithms (2023)
  • DPQLEngine: Processing the Data Profiling Query Language (2023)
  • Aggregating Machine Learning Models for the Energy Consumption Forecast of Heat Generators (2023)
  • Correlation Anomaly Detection in High-Dimensional Time Series (2023)
  • HYPAAD: Hyper Parameter Optimization in Anomaly Detection (2022)
  • Time Series Anomaly Detection: An Aircraft Turbine Case Study (2022)
  • Distributed Duplicate Detection on Streaming Data (2021)
  • UltraMine - Scalable Analytics on Time Series Data (2021)
  • Distributed Graph Based Approximate Nearest Neighbor Search (2020)
  • A2DB: A Reactive Database for Theta-Joins (2020)
  • Distributed Detection of Sequential Anomalies in Time Related Sequences (2020)
  • Efficient Distributed Discovery of Bidirectional Order Dependencies (2020)
  • Distributed Unique Column Combination Discovery (2019)
  • Reactive Inclusion Dependency Discovery (2019)
  • Inclusion Dependency Discovery on Streaming Data (2019)
  • Generating Data for Functional Dependency Profiling (2018)
  • Efficient Detection of Genuine Approximate Functional Dependencies (2018)
  • Efficient Discovery of Matching Dependencies (2017)
  • Discovering Interesting Conditional Functional Dependencies (2017)
  • Multivalued Dependency Detection (2016)
  • DataRefinery - Scalable Offer Processing with Apache Spark (2016)
  • Spinning a Web of Tables through Inclusion Dependencies (2014)
  • Discovery of Conditional Unique Column Combination (2014)
  • Discovering Matching Dependencies (2013)