Vortex: a State-of-the-Art Columnar File Format

The File Format

Currently just a schematic. Specification forthcoming.

The Rust API

The primary interface to the Vortex toolkit.

/vortex/docs/rust/doc/vortex

Quickstart

For end-users looking to read and write Vortex files.

The Benchmarks

Random access, throughput, and TPC-H.

https://bench.vortex.dev/

Vortex is a fast, extensible columnar file format built on recent research from the database community. It uses cascading compression with lightweight, vectorized encodings (i.e., no block compression), which allows both efficient random access and extremely fast decompression.
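To make "cascading" concrete, here is a minimal toy sketch in plain Python. It does not use the Vortex API or its actual codecs: it assumes a dictionary encoding whose integer codes are then run-length encoded, and every function name (`dict_encode`, `rle_encode`, `get`) is hypothetical. The point it illustrates is that lightweight encodings can be stacked while still allowing a single row to be looked up without decoding its neighbours.

```python
from bisect import bisect_right
from itertools import accumulate, groupby


def dict_encode(values):
    """First encoding: map each value to a small integer code."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


def rle_encode(codes):
    """Second encoding, applied to the output of the first: collapse runs."""
    run_values, run_lengths = [], []
    for code, run in groupby(codes):
        run_values.append(code)
        run_lengths.append(sum(1 for _ in run))
    # Store the exclusive end offset of each run so lookups can binary-search.
    run_ends = list(accumulate(run_lengths))
    return run_values, run_ends


def get(dictionary, run_values, run_ends, i):
    """Random access to logical row i without decoding any other row."""
    run = bisect_right(run_ends, i)  # index of the run that contains row i
    return dictionary[run_values[run]]


column = ["eu", "eu", "eu", "us", "us", "eu", "ap", "ap"]
dictionary, codes = dict_encode(column)
run_values, run_ends = rle_encode(codes)
```

With block compression, fetching one row would generally mean decompressing the whole block that contains it; in this sketch a lookup costs one binary search and two list indexes.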

Vortex includes an accompanying in-memory format for these (recursively) compressed arrays that is zero-copy compatible with Apache Arrow in uncompressed form. Taken together, the Vortex library is a toolkit for working with compressed Arrow data in memory, on disk, and over the wire.

Vortex consolidates file metadata into a series of FlatBuffers in the footer. This minimizes both the number of reads (important when reading from object storage) and the deserialization overhead (important for wide tables with many columns).
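The benefit of a consolidated footer can be sketched with a toy container. This is not the Vortex file layout (the specification is still forthcoming): it assumes JSON as a stand-in for FlatBuffers, an 8-byte little-endian footer length as the final bytes of the file, and hypothetical helper names `write_file` and `read_column`.

```python
import io
import json
import struct


def write_file(columns):
    """Serialize {name: bytes} as column segments followed by one footer."""
    buf = io.BytesIO()
    footer = {}
    for name, payload in columns.items():
        # Record where each column lives so readers never have to scan for it.
        footer[name] = {"offset": buf.tell(), "length": len(payload)}
        buf.write(payload)
    footer_bytes = json.dumps(footer).encode("utf-8")
    buf.write(footer_bytes)
    buf.write(struct.pack("<Q", len(footer_bytes)))  # trailing footer length
    return buf.getvalue()


def read_column(blob, name):
    """Fetch one column using tail reads plus a single ranged read."""
    f = io.BytesIO(blob)
    f.seek(-8, io.SEEK_END)
    (footer_len,) = struct.unpack("<Q", f.read(8))
    f.seek(-(8 + footer_len), io.SEEK_END)
    footer = json.loads(f.read(footer_len))
    entry = footer[name]
    f.seek(entry["offset"])
    return f.read(entry["length"])


blob = write_file({"a": b"\x01\x02", "b": b"hello"})
```

Because every offset lives in one place, the number of metadata reads stays constant no matter how many columns the file holds. On object storage, a reader could also fetch a fixed-size tail speculatively so that the length and the footer arrive in one request.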

Vortex aspires to succeed Apache Parquet by pushing the Pareto frontier outwards: 1-2x faster writes, 2-10x faster scans, and 100-200x faster random access reads, while preserving the same approximate compression ratio as Parquet v2 with zstd.

Its features include:

  • A zero-copy data layout for disk, memory, and the wire.

  • Kernels for computing on, filtering, slicing, indexing, and projecting compressed arrays.

  • Built-in state-of-the-art codecs including FastLanes (integer bit-packing), ALP (floating point), and FSST (strings).

  • Support for custom user-implemented codecs.

  • Support for, but no requirement for, row groups.

  • A read sub-system supporting filter and projection pushdown.
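As a rough illustration of filter and projection pushdown, the toy scan below works over in-memory chunks. It is a sketch under stated assumptions, not the Vortex read sub-system: the chunk structure, the per-chunk min/max statistics, and the `scan` function are all hypothetical.

```python
def scan(chunks, columns, column, lo, hi):
    """Yield projected rows where lo <= row[column] <= hi."""
    for chunk in chunks:
        cmin, cmax = chunk["stats"][column]
        if cmax < lo or cmin > hi:
            # Filter pushdown: the statistics prove no row can match,
            # so the chunk's data is never touched.
            continue
        values = chunk["data"][column]
        keep = [i for i, v in enumerate(values) if lo <= v <= hi]
        for i in keep:
            # Projection pushdown: only the requested columns are read.
            yield {name: chunk["data"][name][i] for name in columns}


chunks = [
    {
        "stats": {"ts": (0, 9)},
        "data": {"ts": [0, 4, 9], "user": ["a", "b", "c"], "payload": ["x", "y", "z"]},
    },
    {
        "stats": {"ts": (10, 19)},
        "data": {"ts": [10, 15, 19], "user": ["d", "e", "f"], "payload": ["p", "q", "r"]},
    },
]

rows = list(scan(chunks, ["user"], "ts", 12, 30))
```

Here the first chunk is skipped entirely because its maximum `ts` is below the filter range, and the `payload` column is never materialized for any row.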

Vortex’s flexible layout empowers writers to choose the right layout for their setting: fast writes, fast reads, small files, few columns, many columns, over-sized columns, etc.

Documentation