Parquet data source in Polypheny (Master Project, Ongoing)

Author

Yulia Cher

Description

Parquet is a file format for data analytics. Unlike the row-oriented storage of traditional database systems, Parquet stores data in a columnar format. Parquet files have a mandatory schema, but unlike relational schemas, Parquet schemas can be nested (more like a document schema). The format is designed so that a file can be queried efficiently without reading it in its entirety. Parquet files also contain column statistics that can be used to optimize queries further.
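To illustrate how column statistics can help avoid reading an entire file, the following minimal sketch assumes that the file is split into row groups and that each row group records the minimum and maximum value of a column. The class RowGroupStats, the function row_groups_to_read, and the sample values are hypothetical and serve only as an illustration; they are not part of Parquet or Polypheny.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupStats:
    """Hypothetical stand-in for the min/max statistics of one column in one row group."""
    row_group: int
    min_value: int
    max_value: int


def row_groups_to_read(stats, lower, upper):
    """Return the row groups whose [min, max] range overlaps [lower, upper]."""
    return [
        s.row_group
        for s in stats
        if s.max_value >= lower and s.min_value <= upper
    ]


# Illustrative statistics for a hypothetical integer column "year".
stats = [
    RowGroupStats(0, 1990, 1999),
    RowGroupStats(1, 2000, 2009),
    RowGroupStats(2, 2010, 2019),
]

# A predicate 2005 <= year <= 2012 can only match rows in row groups 1 and 2,
# so row group 0 never has to be read.
selected = row_groups_to_read(stats, 2005, 2012)
print(selected)
```

A row group is skipped only when its value range cannot overlap the predicate, so the pruning never drops matching rows; it merely reduces how much of the file has to be scanned.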

One of the strengths of a PolyDBMS is that it unifies access to different data sources. Given that, it is only natural that Polypheny should also be able to use Parquet files.

A Parquet data source poses several challenges. Because the columnar format allows nesting, the source must expose data both as relational tables and as document collections. For relational tables, a suitable relational schema must be derived from the schema of the Parquet file. Furthermore, the query planner should take advantage of the file format to optimize read queries. The integrated workflow engine would also benefit from supporting Parquet files, for both reading and writing.
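One possible way to derive a relational schema from a nested one is to flatten nested groups into columns with dotted path names. The sketch below assumes that the nested schema is given as a plain Python dictionary; the function flatten_schema, the sample schema, and the type labels are hypothetical, and repeated fields (lists) are deliberately not handled. It is one option among several, not the approach prescribed by the project.

```python
def flatten_schema(fields, prefix=""):
    """Flatten a nested schema into a list of (column_name, type) pairs.

    `fields` maps a field name either to a type label (str) or to a
    nested dict describing a group of sub-fields.
    """
    columns = []
    for name, field_type in fields.items():
        path = f"{prefix}{name}"
        if isinstance(field_type, dict):
            # Nested group: recurse and extend the dotted path.
            columns.extend(flatten_schema(field_type, prefix=f"{path}."))
        else:
            # Leaf field: becomes one relational column.
            columns.append((path, field_type))
    return columns


# Hypothetical nested schema; the type labels are illustrative only.
nested = {
    "id": "INT64",
    "name": "STRING",
    "address": {
        "city": "STRING",
        "geo": {"lat": "DOUBLE", "lon": "DOUBLE"},
    },
}

columns = flatten_schema(nested)
for column_name, column_type in columns:
    print(column_name, column_type)
```

Flattening keeps every leaf field addressable as a column, which suits the relational view, while the original nested structure could still be exposed unchanged as a document collection.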


Objectives

 

Optional objectives

 

Requirements

Start / End Dates

2026/03/11 - 2026/07/22

Supervisors

Research Topics