02.12.2025 Opening the Tap for Structural Metadata

Marburg University has received DFG funding to develop the next-generation data profiling platform Metaserve.

Shown from left to right: Thorsten Papenbrock, Ilnaz Tayebi, Marcian Seeger

Data Profiling

Data profiling is a computer science discipline concerned with inferring structural metadata, such as functional dependencies, inclusion dependencies, and unique column combinations, from arbitrary datasets. Because structural metadata is often not stored explicitly, data profiling plays a crucial role in many data-intensive applications. It is, for example, a necessary step in data discovery, cleaning, integration, normalization, querying, management, and analytics. Because data profiling is computationally very hard (the number of candidate dependencies grows exponentially with the number of columns), considerable research has gone into automating it efficiently.
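To make two of these metadata types concrete, the following minimal Python sketch checks a functional dependency and enumerates minimal unique column combinations by brute force. The function names (holds_fd, is_unique, minimal_uccs) and the sample rows are made up for illustration; they are not part of Metanome or Metaserve, whose algorithms are far more efficient than this naive enumeration.

```python
from itertools import combinations

def holds_fd(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # The same left-hand side must always map to the same right-hand side.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True

def is_unique(rows, columns):
    """Return True if no two rows share the same values in columns."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    return len(set(keys)) == len(keys)

def minimal_uccs(rows, columns):
    """Enumerate minimal unique column combinations, smallest first.

    Naive approach: tries up to 2^n - 1 column subsets, which is why
    real profiling algorithms rely on aggressive pruning.
    """
    found = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            # Skip supersets of an already discovered (smaller) UCC.
            if any(set(ucc) <= set(combo) for ucc in found):
                continue
            if is_unique(rows, combo):
                found.append(combo)
    return found

# Hypothetical sample data.
rows = [
    {"zip": "35037", "city": "Marburg", "name": "Ada"},
    {"zip": "35037", "city": "Marburg", "name": "Bob"},
    {"zip": "14482", "city": "Potsdam", "name": "Ada"},
]

zip_determines_city = holds_fd(rows, ["zip"], "city")
name_determines_city = holds_fd(rows, ["name"], "city")
uccs = minimal_uccs(rows, ["zip", "city", "name"])
```

In this sample, zip determines city, name does not determine city, and no single column is unique, so the minimal unique column combinations are the pairs (zip, name) and (city, name).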

Metanome

In recent years, the “Big Data Analytics” research group of Prof. Dr. Thorsten Papenbrock (Marburg University) and the “Information Systems” research group of Prof. Dr. Felix Naumann (Hasso Plattner Institute) have developed the data profiling system Metanome, which offers modern and highly efficient algorithms for the fully automatic discovery of various types of structural metadata. Metanome has become a popular open-source tool for data profiling and has also inspired many commercial data profiling products. Yet data profiling remains a discipline for expert users and highly skilled data engineers, because automatically discovered metadata is still very difficult to manage and apply.

Metaserve

To close the gap between modern data profiling technology and practical data management applications that require discovered metadata, the Big Data Analytics group has just received three years of funding from the German Research Foundation (DFG) to develop the next-generation data profiling platform Metaserve. With this platform, users will be able to easily discover not only individual data dependencies but also arbitrarily complex, user-defined patterns of structural metadata. This is made possible by the group’s recently developed declarative metadata query language DPQL (“Data Profiling Query Language”).

Example

In relational database theory, inclusion dependencies (INDs) and unique column combinations (UCCs) are two popular types of structural metadata, and efficient profiling algorithms exist for both. A foreign key is a very common database constraint that ensures referential integrity and, in this way, keeps databases connected. A valid foreign key always requires a combination of an IND and a UCC, i.e., a very specific structural metadata pattern: the values of the referencing columns must be included in the referenced columns, and the referenced columns must be unique. Neither INDs nor UCCs alone therefore suffice to detect foreign keys reliably. Because combining separately discovered metadata, such as INDs and UCCs, is computationally very hard (NP-hard), existing profiling technology alone does not meet practical application demands. With Metaserve, users will be able to specify IND-UCC patterns in DPQL, so that the profiling process can find foreign key structures (and many further structures) directly.
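The IND-UCC pattern behind a foreign key can be sketched in a few lines of plain Python. This is not DPQL and not how Metaserve works internally; it is a naive illustration restricted to single-column (unary) dependencies, and all names (foreign_key_candidates, the customers and orders tables) are hypothetical. The results are candidates only, since an inclusion can also hold by coincidence.

```python
def columns(table):
    """Column names of a table given as a list of row dictionaries."""
    return list(table[0]) if table else []

def column_values(table, column):
    """Distinct non-null values of one column."""
    return {row[column] for row in table if row[column] is not None}

def is_unique_column(table, column):
    """UCC check for a single column: no nulls and no duplicate values."""
    values = [row[column] for row in table]
    return None not in values and len(set(values)) == len(values)

def foreign_key_candidates(tables):
    """Find (dependent, referenced) column pairs matching the pattern:
    dependent values are included in the referenced column (IND)
    and the referenced column is unique (UCC)."""
    candidates = []
    for dep_name, dep_rows in tables.items():
        for ref_name, ref_rows in tables.items():
            if dep_name == ref_name:
                continue
            for dep_col in columns(dep_rows):
                dep_vals = column_values(dep_rows, dep_col)
                for ref_col in columns(ref_rows):
                    ind_holds = bool(dep_vals) and dep_vals <= column_values(ref_rows, ref_col)
                    if ind_holds and is_unique_column(ref_rows, ref_col):
                        candidates.append(((dep_name, dep_col), (ref_name, ref_col)))
    return candidates

# Hypothetical sample tables.
tables = {
    "customers": [
        {"id": 1, "name": "Ada"},
        {"id": 2, "name": "Bob"},
        {"id": 3, "name": "Cy"},
    ],
    "orders": [
        {"order_id": 10, "customer_id": 1},
        {"order_id": 11, "customer_id": 1},
        {"order_id": 12, "customer_id": 3},
    ],
}

fk_candidates = foreign_key_candidates(tables)
```

On these sample tables, the only pair that satisfies both conditions is orders.customer_id referencing customers.id. Evaluating the two conditions together, rather than discovering all INDs and all UCCs first and joining them afterwards, is the kind of pattern-level query that the article describes for DPQL.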

Further Information

DPQL: The Data Profiling Query Language
DPQL: Applications for Holistic Data Profiling

Contact: Prof. Dr. Thorsten Papenbrock