The columnar format Parquet: structure and read optimization

Parquet is a columnar format used mainly in the Hadoop ecosystem. Compared to row-oriented formats such as CSV, a reader can skip unnecessary columns, and there is also a mechanism to read only the rows that are needed, so queries can be executed efficiently.
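
As a minimal sketch of the column-skipping behavior, the following uses pyarrow (not part of the original article; the file name events.parquet and its columns are hypothetical):

import pyarrow.parquet as pq

# Only the listed columns are read; the column chunks of every other
# column are skipped on disk. ("events.parquet" is a hypothetical file.)
table = pq.read_table("events.parquet", columns=["user_id", "ts"])
print(table.schema)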

Format

https://github.com/apache/parquet-format#file-format

Rows are horizontally partitioned into Row Groups, and within each Row Group the Column Chunks of the columns are arranged in order. Each Column Chunk is further divided into Pages, which are the unit of compression and encoding.

The file metadata is written after the Row Groups so that the file can be written in a single pass.
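
This structure can be inspected from the footer metadata. A sketch with pyarrow (assumed here, and "events.parquet" is again a hypothetical file):

import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")
meta = pf.metadata                 # FileMetaData parsed from the footer
print(meta.num_row_groups, meta.num_rows)

rg = meta.row_group(0)             # first Row Group
col = rg.column(0)                 # first Column Chunk in that group
print(col.path_in_schema, col.compression, col.total_compressed_size)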

Parquet supports nested schemas. A nested schema is flattened into columns and represented with Repetition Levels and Definition Levels, following the record-shredding model of Google's Dremel paper.
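
For instance, writing one nested record and printing the flattened Parquet schema with pyarrow (a sketch; the data mirrors the Glue example later in this article, and the output file name is made up):

import pyarrow as pa
import pyarrow.parquet as pq

# One nested record: a string, a list of strings, and a struct.
table = pa.table({"A": ["aaaa"], "B": [["b"]], "C": [{"D": 10}]})
pq.write_table(table, "test_local.parquet")

# Prints the flattened schema; the repetition/definition levels
# themselves are stored per value and are not shown here.
print(pq.ParquetFile("test_local.parquet").schema)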

Read optimization

The ColumnMetaData of each Column Chunk contains the number of values, min/max statistics, the compression codec, and sizes, so a reader can decompress and read only the necessary chunks. The number of pages to read can therefore be reduced by selecting only the needed columns, or by sorting the data on the columns used for filtering so that the min/max statistics prune more chunks.
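
Combining both optimizations with pyarrow (a sketch; file and column names are hypothetical):

import pyarrow.parquet as pq

# The filter is pushed down: row groups whose min/max statistics
# cannot satisfy ts >= 1000 are skipped without being decompressed.
table = pq.read_table(
    "events.parquet",
    columns=["user_id", "ts"],     # column pruning
    filters=[("ts", ">=", 1000)],  # row-group pruning via statistics
)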

Metadata example

Let's check the actual metadata with parquet-cli, an official CLI tool. parquet-tools also existed, but it has been deprecated and removed from the repository.

$ brew install parquet-cli

The target is a Parquet file converted from JSON with AWS Glue.

$ cat test.json
{"A":"aaaa","B":["b"],"C":{"D":10}}

$ parquet meta test.parquet

File path:  test.parquet
Created by: parquet-glue version 1.8.2
Properties:
  org.apache.spark.sql.parquet.row.metadata: {"type":"struct","fields":[{"name":"A","type":"string","nullable":true,"metadata":{}},{"name":"B","type":{"type":"array","elementType":"string","containsNull":true},"nullable":true,"metadata":{}},{"name":"C","type":{"type":"struct","fields":[{"name":"D","type":"integer","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]}
Schema:
message glue_schema {
  optional binary A (STRING);
  optional group B (LIST) {
    repeated group list {
      optional binary element (STRING);
    }
  }
  optional group C {
    optional int32 D;
  }
}


Row group 0:  count: 1  216.00 B records  start: 4  total: 216 B
--------------------------------------------------------------------------------
                type      encodings count     avg size   nulls   min / max
A               BINARY    G   _     1         76.00 B    0       "aaaa" / "aaaa"
B.list.element  BINARY    G   _     1         66.00 B    0       "b" / "b"
C.D             INT32     G   _     1         74.00 B    0       "10" / "10"
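
The min/max statistics shown above can also be read programmatically (a sketch with pyarrow, cross-checking the same test.parquet):

import pyarrow.parquet as pq

# Walk the column chunks of row group 0 and print their statistics,
# matching the min / max column of the `parquet meta` output.
rg = pq.ParquetFile("test.parquet").metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.statistics.min, col.statistics.max)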
