Research (in progress)

DataPLANT (2021 - present)

Research Data Management

DataPLANT is a consortium of plant researchers within the German National Research Data Infrastructure (NFDI). Its main objective is to drive the democratization and digital transformation of research data in the field of basic plant research. The consortium's mission is to develop a Research Data Management (RDM) platform that meets the community's requirements and allows research datasets to be processed and contextualized in line with the FAIR principles (Findable, Accessible, Interoperable, Reusable).

We collaborate with our Computational Systems Biology Group here at RPTU to research and develop the RDM ecosystem. Our focus is on large-scale plant-science data modeling, data integration, and the research and development of software solutions for data indexing, search, and findability to enable interdisciplinary data analysis. Specifically, we work on RDM concepts, Annotated Research Contexts (ARCs) as FAIR Digital Objects, and supporting tools and solutions.

Contact: Gajendra Doniparthi


Research (concluded)

NotaQL (2013 - 2017) 

The Cross-System Data-Transformation Language.

NoSQL databases are a popular alternative to relational database systems. They offer new access methods and data models and are built for "Big Data". Typically, access methods in these systems are very simple, and complex queries with aggregations and joins are not supported. That is why frameworks like Hadoop MapReduce, Spark, or Flink are called into action. But using these frameworks involves a lot of coding effort, and they are not accessible to people without programming experience.

The language NotaQL has three aims. First, it is an easy-to-learn, concise language for defining data transformations in a few lines of code, including transformations on flexible schemas and complex data types. Second, NotaQL allows for cross-system transformations by supporting input and output engines for various database systems, file formats, and services with different data models. Third, NotaQL transformations can be executed in different ways: classical approaches iterate over the whole input data, transform and aggregate it, and store the result in the target system, while incremental approaches reuse the result of a former computation and only process the changed data.
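As an illustration of the cross-system idea (plain Python, not actual NotaQL syntax), the following sketch reads rows from a CSV "input engine", aggregates them, and writes the result to a key-value "output engine" with a different data model; the file name, column name, and engine helpers are invented for the example.

```python
# Sketch of a cross-system transformation in plain Python (not NotaQL syntax):
# a CSV "input engine", a grouping/counting transformation, and a key-value
# "output engine". File name "people.csv" and column "city" are invented.
import csv
from collections import Counter

def csv_input_engine(path):
    """Yield rows of a CSV file as dictionaries (flexible schema)."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def kv_output_engine(pairs, store):
    """Persist (key, value) pairs into a dict standing in for a key-value store."""
    store.update(pairs)

def transform(rows):
    """The actual transformation: group by 'city' and count rows."""
    return Counter(row["city"] for row in rows).items()

store = {}
kv_output_engine(transform(csv_input_engine("people.csv")), store)
print(store)  # e.g. {'Kaiserslautern': 42, 'Mainz': 17}
```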

NotaQL can be used for different applications such as data integration, migration, polyglot persistence, and data analytics. A transformation platform allows for the distributed execution of NotaQL transformations in different incremental and non-incremental ways.

Virga (2011 - 2015)

Incremental Recomputations in MapReduce

In 2004, Google introduced the MapReduce paradigm for parallel data processing in large shared-nothing clusters. MapReduce was primarily built for processing large amounts of unstructured data, such as web request logs or crawled web documents stored in Google’s distributed file system (GFS). More recently, MapReduce has been reported to be used on top of Google’s Bigtable, a distributed storage system for managing structured data. A key difference between GFS and Bigtable is their update model. While GFS files are typically append-only data sets, Bigtable supports row-level updates.

When Bigtable is used as input source for MapReduce jobs, often only parts of the source data have been changed since the job’s previous run. As yet, MapReduce results have to be recomputed from scratch to incorporate the latest base data changes. This approach is obviously inefficient in many situations and it seems desirable to maintain MapReduce results in an incremental way similar to materialized views. From an abstract point of view, materialized views and MapReduce computations have a lot in common. A materialized view is derived from one or more base tables in a way specified by a user-supplied view definition and persistently stored in the database. Similarly, a MapReduce job reads data from one or more Bigtable datasets, transforms it in a way specified by user-supplied Map and Reduce functions and persistently stores the result in Bigtable.
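To make the analogy concrete, here is a minimal plain-Python stand-in (not Virga or Hadoop code) for a MapReduce-style word count whose persisted result plays the role of a materialized view; the documents and function names are invented for the example.

```python
# A toy word count in plain Python, standing in for a MapReduce job whose
# persisted result plays the role of a materialized view.
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit (word, 1) for every word of the document.
    for word in text.split():
        yield word, 1

def reduce_fn(word, values):
    # Aggregate all partial values for one key.
    return word, sum(values)

def run_job(documents):
    groups = defaultdict(list)
    for doc_id, text in documents.items():          # "map phase"
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())   # "reduce phase"

base_data = {"d1": "big data", "d2": "big table"}
view = run_job(base_data)      # {'big': 2, 'data': 1, 'table': 1} -- the "view"
```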

However, applying view maintenance techniques in the MapReduce environment is challenging, because the programming models (or query languages) and data models differ heavily. View definitions are specified in SQL, a language closely tied to the relational algebra and the relational data model. The MapReduce programming model is more generic; the framework provides hooks to plug-in custom Map and Reduce functions written in standard programming languages. In this project, we explore incremental recomputation techniques in the MapReduce environment to find answers to the following questions.

- Given a (sequence of) MapReduce jobs, how can "incremental counterparts" be derived that consume source deltas and compute deltas to be applied to the target view? What is an appropriate level of abstraction for such a derivation process?
- MapReduce by itself requires programmers to plug in custom code. Is it feasible to identify classes of Mappers and Reducers that share interesting properties with regard to incremental processing?
- Infrastructure has been built on top of MapReduce to provide programmers with high-level languages such as Jaql, PigLatin, or HiveQL. Is it possible to derive incremental MapReduce programs automatically for (a subset of) any of these languages?
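As a hint at what an "incremental counterpart" could look like, the following sketch builds on the toy word count above: it consumes source deltas (inserted and deleted documents) and applies the resulting output deltas to the stored result. It only works because a SUM-like Reducer can absorb deletions by subtraction; a MAX-like Reducer could not, which illustrates why identifying classes of Mappers and Reducers matters.

```python
# Incremental counterpart of the toy word count above: it consumes source
# deltas (inserted/deleted documents) and applies output deltas to the stored
# view instead of recomputing from scratch. Reuses map_fn and view from the
# previous sketch.
from collections import defaultdict

def incremental_update(view, inserted_docs, deleted_docs):
    deltas = defaultdict(int)
    for doc_id, text in inserted_docs.items():
        for word, one in map_fn(doc_id, text):
            deltas[word] += one
    for doc_id, text in deleted_docs.items():
        for word, one in map_fn(doc_id, text):
            deltas[word] -= one
    for word, d in deltas.items():                 # apply output deltas
        new_count = view.get(word, 0) + d
        if new_count:
            view[word] = new_count
        else:
            view.pop(word, None)
    return view

incremental_update(view, inserted_docs={"d3": "big cluster"},
                   deleted_docs={"d2": "big table"})
# view is now {'big': 2, 'data': 1, 'cluster': 1}
```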

INDI (2008 - 2011)

Incremental Recomputations in Materialized Data Integration

Incremental recomputations have been studied by the database research community mainly in the context of the maintenance of materialized views. Materialized views and data integration systems such as Extract-Transform-Load (ETL) tools share a key characteristic: the result data is pre-computed and materialized so that future queries can be evaluated efficiently. Upon updates to the base data, materialized views become stale and need to be maintained. A naïve solution is to recompute views from scratch; however, an incremental recomputation approach is often more efficient. While database systems are able to maintain views incrementally, today's ETL tools lack this capability. We believe that incremental recomputation techniques can be applied advantageously in the ETL process to improve the efficiency of data warehouse maintenance. Doing so will shrink the data warehouse update window, improve data timeliness in the warehouse, and thus be a step towards near-real-time data warehousing.

The INDI project is carried out in close cooperation with the IBM Research & Development Lab Böblingen. We follow an algebraic approach in the sense that we aim at deriving incremental variants of ETL jobs, which are built from standard ETL processing primitives. This is analogous to algebraic view maintenance, where incremental expressions are derived from SQL/RA view definitions, again using SQL/RA. This approach has several advantages: incremental ETL jobs can be executed by standard ETL tools without the need for modifications, already existing ETL jobs may be "incrementalized", and the development of new incremental ETL solutions is eased. The ETL environment has distinct characteristics that require us to rethink and adapt traditional view maintenance techniques. We identified the following major research challenges:

- A common language, such as SQL/RA for relational database systems, does not exist in the ETL world. Instead, commercial ETL tools provide proprietary scripting languages or graphical user interfaces for defining ETL jobs. Because of these programming-model differences, standard view maintenance techniques cannot be applied directly.
- Standardizing and improving the quality of source data is a key task in data integration. For this purpose, ETL tools provide rich sets of data cleansing operators. This class of operators has no counterpart in the relational world and calls for new optimization strategies.
- In a data warehouse (DWH) environment, so-called Change Data Capture (CDC) techniques are used to gather deltas at the source systems. The captured deltas may be incomplete (or partial) due to principal restrictions of the CDC technique or for improved CDC efficiency. Traditional view maintenance techniques, however, require deltas to be complete.
- Database view maintenance depends on transactional guarantees. In particular, transactions allow for synchronizing view maintenance and concurrent base-data updates. In a warehousing environment, the source systems are distributed; distributed transactions, however, are prohibitively expensive, and warehouse maintenance thus must proceed without them.
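To make the algebraic derivation of incremental jobs concrete, here is a minimal plain-Python sketch (not tied to any ETL tool) of the classic delta rule for an inner join under insert-only deltas; the relations and attribute names are invented, and the sketch deliberately assumes complete deltas, one of the assumptions questioned above.

```python
# Illustrative delta rule for an inner join under insert-only deltas:
#   delta(R join S) = (dR join S_old) + (R_old join dS) + (dR join dS)
# Relation contents and join attributes are invented; deltas are assumed
# to be complete, which the CDC challenge above shows is not always given.
def join(r, s, key_r, key_s):
    return [(x, y) for x in r for y in s if x[key_r] == y[key_s]]

def incremental_join(r_old, s_old, d_r, d_s, key_r, key_s):
    """Return only the new join tuples caused by the insertions d_r and d_s."""
    return (join(d_r, s_old, key_r, key_s)
            + join(r_old, d_s, key_r, key_s)
            + join(d_r, d_s, key_r, key_s))

r_old = [{"cid": 1, "name": "Ann"}]       # customers already in the warehouse
s_old = [{"cid": 1, "total": 10}]         # orders already in the warehouse
d_r = [{"cid": 2, "name": "Bob"}]         # newly extracted customers
d_s = [{"cid": 2, "total": 99}]           # newly extracted orders
delta_view = incremental_join(r_old, s_old, d_r, d_s, "cid", "cid")
# delta_view == [({'cid': 2, 'name': 'Bob'}, {'cid': 2, 'total': 99})],
# exactly what a full recomputation would add to the old join result.
```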

GEM (2006 - 2009)

A Graphical Editor for Arbitrary Metadata

Both the Caro and Paladin projects of our working group need to visualize and modify metadata of many different and ever-evolving metamodels. Instead of creating custom editors from scratch that are specific to our respective models and would require constant maintenance to keep them up to date with the evolving metamodels, we formulated the requirement for a generic editor that can easily and declaratively be customized to visualize and edit arbitrary metadata. In the course of a diploma thesis, GEM, a functional prototype of such an editor, was created.

The editor's underlying model for the application data to be displayed is that of attributed, typed multigraphs. Almost any kind of data and metadata can be converted into this very generic representation. Once the data or metadata of the respective application has been converted into a graph, graph stylesheets are used to specify how the different types of elements of the application graph are to be displayed. Graph stylesheets are based on the concept of graph transformations: a stylesheet describes which elements of the editor's visualization model are to be created for certain parts of the application graph. The visualization model is itself represented as a graph and coexists with the original application graph. Its nodes and edges directly correspond to simple and complex widgets (boxes, ovals, labels, etc.) and to connectors between these widgets, which are then displayed by the editor. The set of available widgets can be easily extended.
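The following toy sketch (plain Python, not the actual GEM stylesheet mechanism) illustrates the idea: an application graph as an attributed, typed multigraph, and a single stylesheet-like rule that derives widget nodes of the visualization graph from application nodes of one type; all type and widget names are invented.

```python
# Toy attributed, typed multigraph plus one stylesheet-like rule that derives
# widget nodes of the visualization graph from application nodes of type
# "Table". The type and widget names ("Table", "Box", "Label") are invented.
app_graph = {
    "nodes": [
        {"id": "n1", "type": "Table", "attrs": {"name": "Customer"}},
        {"id": "n2", "type": "Table", "attrs": {"name": "Order"}},
    ],
    "edges": [
        {"id": "e1", "type": "ForeignKey", "src": "n2", "dst": "n1", "attrs": {}},
    ],
}

def apply_stylesheet(graph):
    """Build the visualization graph: a Box widget with a Label per Table node."""
    vis = {"nodes": [], "edges": []}
    for node in graph["nodes"]:
        if node["type"] == "Table":                  # the rule's match part
            box = {"id": "w_" + node["id"], "type": "Box", "attrs": {}}
            label = {"id": "l_" + node["id"], "type": "Label",
                     "attrs": {"text": node["attrs"]["name"]}}
            vis["nodes"] += [box, label]
            vis["edges"].append({"id": "c_" + node["id"], "type": "contains",
                                 "src": box["id"], "dst": label["id"], "attrs": {}})
    return vis

vis_graph = apply_stylesheet(app_graph)   # widgets the editor would then render
```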

While an application graph can be directly represented by basic widgets, it is often desirable to aggregate larger subgraphs of the application graph into one or a few more complex widgets. This is useful to mimic the respective metamodel's native visualization: e.g., a class in a UML class diagram is a complex widget with many different fields, but will in general be represented by many different nodes in the graph representation of the application model. Aggregation also allows a graph stylesheet to abstract from the complexity and details of an application model, enabling the visualization of very large and complex models.

Since the necessary aggregation of elements gives rise to a variant of the well-known view-update problem, graph stylesheets are also used to explicitly specify the valid edit operations available to the users. To perform the graph transformations, GEM uses a relational database system; a number of commercial and open-source DBMSs are supported.

CARO (2003 - 2010)

Robust Change Management in Distributed Environments

To get the most benefit out of existing information, companies try to combine information from different sources by using information integration technologies. This results in highly complex, distributed systems with many dependencies between them.

Because of restructuring, acquisitions, and similar events, such an information infrastructure is subject to constant change. This "system evolution" normally affects only a few systems directly. But because of the interwoven dependencies, it might have an indirect impact on other systems as well. Possible consequences are data corruption, system failures, or inconsistencies that are detected much later. It is very important to reduce these consequences to a minimum in both number and duration.

Caro tries to monitor the state of the whole information infrastructure and to analyse the effects of changes to systems, no matter whether they are planned or ad hoc. We make two assumptions here: first, we will have to live with incomplete information, and second, predefined processes will not be adhered to. These assumptions simply reflect human behaviour, which we cannot change. Even under these difficult circumstances, our approach always works on a "best effort" basis: we may lose some detail in the analysis, but we still get approximate results. The more information (ontologies, database schemas, etc.) we have, and the more people adhere to processes (agreements between responsible administrators, etc.), the better the results of the performed analysis. A trade-off exists between putting more work into enabling Caro to perform a good analysis and accepting more manual work afterwards when problems are detected. In neither case will the consequences of a change go undetected, since we always use pessimistic estimates.

PALADIN (2003 - 2009)

Pattern-based Approach to LArge-scale Dynamic INformation Integration

The goal of the PALADIN project is to develop methods and tools that enable the use of information integration technology in highly dynamic environments. A prime example of such an environment is the nascent data grid technology, which provides the infrastructure to enable access to a huge number of globally distributed and highly heterogeneous structured or semi-structured data sources. In order to benefit from these massive amounts of data, users or applications must not be confronted with the individual data sources directly, but instead be provided with an integrated view specific to the requirements of a particular application. However, in these environments many assumptions made by conventional human-driven integration approaches no longer hold: not only are the requirements of users or applications on the integrated schema much more diverse and volatile, but the data sources that contribute to the integrated schema are also subject to permanent change, as they are no longer under the control of a single administrative entity. This results in data sources joining and leaving the grid or changing their exported schema and data. An integration solution in such an environment would have to be modified permanently in order to keep up with these changes, which is obviously infeasible with today's slow-paced, human-driven approaches.

In order to provide information integration in these dynamic and at the same time often cost-sensitive environments, the different steps in the setup of an integration solution, which are currently performed by integration and application-domain experts, have to be supported by suitable tools and should ultimately be largely automated. The initial step is the analysis of the requirements on the integrated system, essentially the choice of a suitable data model and the information schema. In the next step, suitable data sources that can contribute to this integrated schema have to be discovered and selected. Then, an integration plan has to be developed, which maps the data represented in the structure of the respective sources to the structure of the integrated view. This plan then has to be deployed to a suitable runtime environment, which provides access to the integrated data.

An essential foundation for the (partial) automation of the integration process is a unified handling of data and metadata expressed in different data models. This generic metadata management is made possible by the PALADIN metamodel (PMM), a layered metadata architecture loosely based on the Common Warehouse Metamodel. All source and target schemas are represented in PMM. Every PMM model can be understood as an attributed, typed multigraph.

The focus of our project is on the creation of the integration plan, i.e., the creation of the mapping from the source schemas to the target schema. Integration patterns are our primary concept to capture the knowledge of human integration experts about solving small- and large-scale mapping problems. Each pattern is expressed as a graph transformation: it describes a problem constellation in an abstract fashion and provides an approach to its solution.
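As a toy illustration of the pattern idea (plain Python, not the PMM representation or an actual PALADIN pattern), the following rule matches a simple constellation, two source attributes with the same name in different source schemas, and adds an abstract merge operator plus a target attribute to the graph; all names are invented.

```python
# Toy "integration pattern": match the constellation of two source attributes
# with the same name in different source schemas and add an abstract merge
# operator plus a target attribute. All names and structures are invented.
def apply_merge_pattern(schema_graph):
    nodes, edges = schema_graph["nodes"], schema_graph["edges"]
    by_name = {}
    for n in nodes:
        if n["type"] == "SourceAttribute":
            by_name.setdefault(n["attrs"]["name"], []).append(n)
    for name, matches in by_name.items():
        if len(matches) == 2:                        # the pattern's constellation
            target = {"id": "t_" + name, "type": "TargetAttribute",
                      "attrs": {"name": name}}
            op = {"id": "op_" + name, "type": "MergeOperator", "attrs": {}}
            nodes += [target, op]
            for m in matches:
                edges.append({"type": "input", "src": m["id"], "dst": op["id"]})
            edges.append({"type": "output", "src": op["id"], "dst": target["id"]})
    return schema_graph

source_schemas = {"nodes": [
    {"id": "s1.price", "type": "SourceAttribute", "attrs": {"name": "price"}},
    {"id": "s2.price", "type": "SourceAttribute", "attrs": {"name": "price"}},
], "edges": []}
apply_merge_pattern(source_schemas)   # adds a MergeOperator and a TargetAttribute "price"
```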

By combining several of these patterns, the initial situation, i.e., the schemas of the data sources, is transformed into the desired end result, i.e., the integrated schema. If such a deduction is successful, the sequence of pattern applications essentially describes an abstract integration or operator plan, which is then transformed into the language or operators of the chosen target runtime environment and finally deployed.