Data profiling describes the activity of extracting implicit metadata, such as schema
descriptions, data types, and various kinds of data dependencies, from a given data set. The
considerable number of research papers on novel metadata types and ever-faster data profiling
algorithms emphasizes the importance of data profiling in practice. Unfortunately, though, the current
state of data profiling research fails to address practical application needs: typical data profiling
algorithms (i.e., structures that are challenging to operate) discover all (i.e., too many) minimal (i.e., the
wrong) data dependencies within minutes to hours (i.e., too slowly). Consequently, if we look at the
practical success of our research, we find that data profiling targets data cleaning, but most cleaning
systems still use only hand-picked dependencies; data profiling targets query optimization, but hardly
any query optimizer uses modern discovery algorithms for dependency extraction; data profiling targets
data integration, but the application of automatically discovered dependencies for matching purposes
has yet to be shown, and the list goes on. We aim to close this disconnect between profiling and
applications with a novel data profiling engine that integrates modern profiling techniques for various
types of data dependencies and provides applications with a versatile, intuitive, and declarative Data
Profiling Query Language (DPQL).
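
To make the phrase "all minimal data dependencies" concrete, the following is a minimal sketch of naive functional dependency discovery on a toy table. It is an illustration only and is neither the Metaserve engine, a Metanome algorithm, nor DPQL; the helper names `holds` and `minimal_fds` and the sample rows are assumptions introduced here.

```python
from itertools import combinations


def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # Two rows that agree on lhs must also agree on rhs.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def minimal_fds(rows, columns):
    """Naively enumerate all minimal FDs by growing the left-hand side."""
    found = []
    for rhs in columns:
        others = [c for c in columns if c != rhs]
        minimal_lhs = []
        for size in range(len(others) + 1):
            for lhs in combinations(others, size):
                # A superset of a known left-hand side is not minimal.
                if any(set(m) <= set(lhs) for m in minimal_lhs):
                    continue
                if holds(rows, lhs, rhs):
                    minimal_lhs.append(lhs)
                    found.append((lhs, rhs))
    return found


# Hypothetical sample data, used only for illustration.
rows = [
    {"id": 1, "zip": "14482", "city": "Potsdam"},
    {"id": 2, "zip": "14482", "city": "Potsdam"},
    {"id": 3, "zip": "10115", "city": "Berlin"},
]
fds = minimal_fds(rows, ["id", "zip", "city"])
for lhs, rhs in fds:
    print(", ".join(lhs), "->", rhs)
```

Even this three-column table yields four minimal dependencies, some of which (such as city -> zip) hold only by accident of the sample. The exhaustive search also grows exponentially with the number of columns, which hints at why discovering all minimal dependencies is both slow and more than an application typically needs.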

The first paper on the Metaserve project was presented at BTW 2023 (Cite as BibTeX/EndNote/ACM Ref).

Tools and Repeatability

Metaserve is still under development and will soon be available on GitHub (Engine, Frontend).

The DPQL engine is built on top of Metanome. Metanome was the first data profiling tool to support the discovery of various types of metadata and currently has an algorithm repository of over 40 different algorithms (Engine, Algorithms).





ProjectManagement GitHub
DeNorm GitHub
AdventureWorks Microsoft
Valentine GitHub
MusicBrainz MetaBrainz

Also, check out the Metanome Repeatability and Data (here).