Crate vortex_file

Read and write Vortex layouts, a serialization of Vortex arrays.

A layout is a serialized array which is stored in some linear and contiguous block of memory. Layouts are recursively defined in terms of one of three kinds:

  1. The FlatLayout. A contiguously serialized array using the Vortex flatbuffer Batch message.

  2. The ColumnarLayout. Each column of a StructArray (vortex_array::array::StructArray) is sequentially laid out at known offsets. This permits reading a subset of columns in time linear in the number of kept columns.

  3. The ChunkedLayout. Each chunk of a ChunkedArray is sequentially laid out at known offsets. This permits reading a subset of rows in time linear in the number of kept rows.

A layout, alone, is not a standalone Vortex file because layouts are not self-describing. They neither contain a description of the kind of layout (e.g. flat, column of flat, chunked of column of flat) nor a data type (DType). A standalone Vortex file comprises seven sections, the first of which is the serialized array bytes. The interpretation of those bytes, i.e. which particular layout was used, is given in the fourth section: the footer.

| Section     | Size               | Description                                                            |
|-------------|--------------------|------------------------------------------------------------------------|
| Data        | In the Footer.     | The serialized arrays.                                                 |
| Metadata    | In the Footer.     | A table per column with a row per chunk. Contains statistics.          |
| Schema      | In the Postscript. | A serialized data type.                                                |
| Footer      | In the Postscript. | A recursive description of the layout including the number of rows.    |
| Postscript  | 32 bytes           | Two 64-bit offsets pointing at schema and the footer.                  |
| Version     | 4 bytes            | The file format version.                                               |
| Magic bytes | 4 bytes            | The ASCII bytes “VRTX” (86, 82, 84, 88; 0x56525458).                   |
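
As an illustration of the fixed-size tail described in the table, the following is a minimal sketch that splits the last 40 bytes of a file into postscript, version, and magic bytes, and checks the magic. `Trailer` and `split_trailer` are hypothetical names and are not part of vortex_file. The fields are left as raw bytes because the table does not state their internal encoding or byte order.

```rust
/// Sizes taken from the section table: postscript, version, magic bytes.
pub const POSTSCRIPT_LEN: usize = 32;
pub const VERSION_LEN: usize = 4;
pub const MAGIC_LEN: usize = 4;
pub const TRAILER_LEN: usize = POSTSCRIPT_LEN + VERSION_LEN + MAGIC_LEN;

/// The ASCII bytes "VRTX" (86, 82, 84, 88).
pub const MAGIC: [u8; MAGIC_LEN] = *b"VRTX";

/// Hypothetical view over the fixed-size tail of a file.
/// All fields are raw bytes; no byte order is assumed here.
#[derive(Debug, PartialEq)]
pub struct Trailer<'a> {
    pub postscript: &'a [u8],
    pub version: &'a [u8],
    pub magic: &'a [u8],
}

/// Hypothetical helper: split the last `TRAILER_LEN` bytes of `file`.
/// Returns `None` if the input is too short or the magic bytes do not match.
pub fn split_trailer(file: &[u8]) -> Option<Trailer<'_>> {
    let start = file.len().checked_sub(TRAILER_LEN)?;
    let tail = &file[start..];
    let (postscript, rest) = tail.split_at(POSTSCRIPT_LEN);
    let (version, magic) = rest.split_at(VERSION_LEN);
    if magic != MAGIC.as_slice() {
        return None;
    }
    Some(Trailer { postscript, version, magic })
}
```

Calling `split_trailer` on the suffix of a file is enough to locate the postscript; decoding the two offsets inside it is deliberately left out, since that would require knowing the actual encoding.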

A Parquet-style file format is realized by using a chunked layout containing column layouts containing chunked layouts containing flat layouts. The outer chunked layout represents row groups. The inner chunked layout represents pages.
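
The nesting can be made concrete with a small model. The sketch below uses a hypothetical `Layout` enum, which is not the crate's actual layout type, to build a chunked layout of column layouts of chunked layouts of flat layouts. The per-page row count stored in `Flat` is an assumption added only so that rows can be counted.

```rust
/// Hypothetical model of the three layout kinds (not the vortex_file types).
#[derive(Debug, Clone, PartialEq)]
pub enum Layout {
    /// A contiguously serialized array holding `rows` rows.
    Flat { rows: usize },
    /// One child per column; every child covers the same rows.
    Columnar(Vec<Layout>),
    /// One child per chunk; children cover consecutive row ranges.
    Chunked(Vec<Layout>),
}

impl Layout {
    /// Number of flat leaves reachable from this layout.
    pub fn flat_count(&self) -> usize {
        match self {
            Layout::Flat { .. } => 1,
            Layout::Columnar(children) | Layout::Chunked(children) => {
                children.iter().map(Layout::flat_count).sum()
            }
        }
    }

    /// Nesting depth, counting the flat leaf as one level.
    pub fn depth(&self) -> usize {
        match self {
            Layout::Flat { .. } => 1,
            Layout::Columnar(children) | Layout::Chunked(children) => {
                1 + children.iter().map(Layout::depth).max().unwrap_or(0)
            }
        }
    }

    /// Total rows: chunks add up, columns share the same rows.
    pub fn row_count(&self) -> usize {
        match self {
            Layout::Flat { rows } => *rows,
            Layout::Columnar(children) => children.first().map_or(0, Layout::row_count),
            Layout::Chunked(children) => children.iter().map(Layout::row_count).sum(),
        }
    }
}

/// Build a Parquet-style layout: row groups > columns > pages > flat.
pub fn parquet_style(row_groups: usize, columns: usize, pages: usize, rows_per_page: usize) -> Layout {
    let page = Layout::Flat { rows: rows_per_page };
    let column = Layout::Chunked(vec![page; pages]); // inner chunks are pages
    let row_group = Layout::Columnar(vec![column; columns]);
    Layout::Chunked(vec![row_group; row_groups]) // outer chunks are row groups
}
```

For example, `parquet_style(2, 3, 4, 100)` describes two row groups of three columns, each column split into four pages of 100 rows.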

All the chunks of a chunked layout and all the columns of a column layout need not use the same layout.

Anything implementing VortexReadAt, for example local files, byte buffers, and cloud storage, can be used as the “linear and contiguous memory”.

§Reading

Layout reading is implemented by VortexFileArrayStream. The VortexFileArrayStream should be constructed by a VortexReadBuilder, which first uses an InitialRead to read the footer (schema, layout, postscript, version, and magic bytes). In most cases, the entire footer can be read by a single read of the suffix of the file.

A VortexFileArrayStream internally contains a LayoutMessageCache which is shared by its layout reader and the layout reader’s descendants. The cache permits the reading system to “read” the bytes of a layout multiple times without triggering reads to the underlying storage. For example, the VortexFileArrayStream reads an array, evaluates the row filter, and then reads the array again with the filter mask.
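
The effect of such a cache can be sketched in isolation. The code below is not LayoutMessageCache: `MessageCache` and `CountingStorage` are hypothetical, single-threaded stand-ins, and keying the cache by byte range is an assumption made for illustration. It shows only the idea that a second "read" of the same bytes is served from memory without touching the underlying storage.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical storage that counts how many reads reach it.
pub struct CountingStorage {
    bytes: Vec<u8>,
    pub reads: usize,
}

impl CountingStorage {
    pub fn new(bytes: Vec<u8>) -> Self {
        Self { bytes, reads: 0 }
    }

    /// Read a byte range from the underlying storage (panics if out of bounds).
    pub fn read_at(&mut self, range: Range<usize>) -> Vec<u8> {
        self.reads += 1;
        self.bytes[range].to_vec()
    }
}

/// Hypothetical cache keyed by (start, end) byte offsets.
#[derive(Default)]
pub struct MessageCache {
    entries: HashMap<(usize, usize), Vec<u8>>,
}

impl MessageCache {
    /// Return the cached bytes for `range`, reading from storage only on a miss.
    pub fn get_or_read(&mut self, storage: &mut CountingStorage, range: Range<usize>) -> &[u8] {
        let key = (range.start, range.end);
        let entry = self
            .entries
            .entry(key)
            .or_insert_with(|| storage.read_at(range));
        entry.as_slice()
    }
}
```

In this sketch, reading an array once to evaluate a filter and again to apply the filter mask would cost one storage read rather than two. A cache shared between a reader and its descendants would additionally need shared ownership and synchronization, which is omitted here.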

A LayoutReader then assembles one or more Vortex arrays by reading the serialized data and metadata.

§Apache Arrow

If you ultimately seek Arrow arrays, VortexRecordBatchReader converts a VortexFileArrayStream into a RecordBatchReader.
