Crate vortex_file
Read and write Vortex layouts, a serialization of Vortex arrays.
A layout is a serialized array which is stored in some linear and contiguous block of memory. Layouts are recursively defined in terms of one of three kinds:
- The FlatLayout. A contiguously serialized array using the Vortex flatbuffer Batch message.
- The ColumnarLayout. Each column of a StructArray (vortex_array::array::StructArray) is sequentially laid out at known offsets. This permits reading a subset of columns in time linear in the number of kept columns.
- The ChunkedLayout. Each chunk of a ChunkedArray is sequentially laid out at known offsets. This permits reading a subset of rows in time linear in the number of kept rows.
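The "known offsets" property can be illustrated with a minimal sketch. `ColumnRanges` and `project` are hypothetical names invented for this example, not types from the crate; the sketch only assumes that each column's byte range is recorded somewhere, so that selecting columns never requires scanning the ones that are dropped.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a columnar layout: one byte range per column.
struct ColumnRanges {
    columns: Vec<Range<u64>>,
}

impl ColumnRanges {
    /// Returns the byte ranges to read for the kept columns.
    /// The work is proportional to `kept.len()`, not to the total column count.
    fn project(&self, kept: &[usize]) -> Vec<Range<u64>> {
        kept.iter().map(|&i| self.columns[i].clone()).collect()
    }
}

fn main() {
    let layout = ColumnRanges {
        columns: vec![0..100, 100..250, 250..300, 300..1000],
    };
    // Keep columns 1 and 3 only; columns 0 and 2 are never touched.
    let reads = layout.project(&[1, 3]);
    assert_eq!(reads, vec![100..250, 300..1000]);
    let bytes: u64 = reads.iter().map(|r| r.end - r.start).sum();
    println!("{} bytes in {} reads", bytes, reads.len());
}
```

The same idea applies to a chunked layout, with chunks (row ranges) in place of columns.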
A layout, alone, is not a standalone Vortex file because layouts are not self-describing. They
contain neither a description of the kind of layout (e.g. flat, column of flat, chunked of
column of flat) nor a data type (DType). A standalone Vortex file
comprises seven sections, the first of which is the serialized array bytes. The interpretation
of those bytes, i.e. which particular layout was used, is given in the fourth section: the
footer.
| Section | Size | Description |
|---|---|---|
| Data | In the Footer. | The serialized arrays. |
| Metadata | In the Footer. | A table per column with a row per chunk. Contains statistics. |
| Schema | In the Postscript. | A serialized data type. |
| Footer | In the Postscript. | A recursive description of the layout including the number of rows. |
| Postscript | 32 bytes | Two 64-bit offsets pointing at schema and the footer. |
| Version | 4 bytes | The file format version. |
| Magic bytes | 4 bytes | The ASCII bytes “VRTX” (86, 82, 84, 88; 0x56525458). |
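As a rough illustration of the last two rows of the table, the following sketch checks the trailing magic bytes and extracts the version that precedes them. `parse_trailer` is a hypothetical helper, not part of the crate, and the little-endian encoding of the version is an assumption: the table gives only its size.

```rust
/// The magic bytes named in the table: ASCII "VRTX".
const MAGIC: [u8; 4] = *b"VRTX";

/// Returns the version if the buffer ends with `version (4 bytes) + magic (4 bytes)`.
/// ASSUMPTION: the version is a little-endian u32; the table states only its size.
fn parse_trailer(file: &[u8]) -> Option<u32> {
    if file.len() < 8 {
        return None;
    }
    let (rest, magic) = file.split_at(file.len() - 4);
    if magic != &MAGIC[..] {
        return None;
    }
    let v = &rest[rest.len() - 4..];
    Some(u32::from_le_bytes([v[0], v[1], v[2], v[3]]))
}

fn main() {
    // Read big-endian, the four ASCII bytes are the constant given in the table.
    assert_eq!(u32::from_be_bytes(MAGIC), 0x5652_5458);

    // A fake file: 16 placeholder bytes, then version 1, then the magic bytes.
    let mut file = vec![0u8; 16];
    file.extend_from_slice(&1u32.to_le_bytes());
    file.extend_from_slice(&MAGIC);
    assert_eq!(parse_trailer(&file), Some(1));
    assert_eq!(parse_trailer(b"not a vortex file"), None);
}
```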
A Parquet-style file format is realized by using a chunked layout containing column layouts containing chunked layouts containing flat layouts. The outer chunked layout represents row groups. The inner chunked layout represents pages.
All the chunks of a chunked layout and all the columns of a column layout need not use the same layout.
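The recursive definition and the Parquet-style nesting described above can be modelled with a small enum. This is a hypothetical model for illustration only; `Layout`, `describe`, `flat_count`, and `parquet_style` are invented names and do not mirror the crate's own types.

```rust
/// Hypothetical model of the three layout kinds (not the crate's own types).
enum Layout {
    Flat,
    Columnar(Vec<Layout>),
    Chunked(Vec<Layout>),
}

impl Layout {
    /// Names the nesting by following the first child at each level.
    fn describe(&self) -> String {
        let (name, children) = match self {
            Layout::Flat => return "flat".to_string(),
            Layout::Columnar(children) => ("column", children),
            Layout::Chunked(children) => ("chunked", children),
        };
        match children.first() {
            Some(child) => format!("{} of {}", name, child.describe()),
            None => name.to_string(),
        }
    }

    /// Counts the flat leaves, i.e. the serialized arrays actually stored.
    fn flat_count(&self) -> usize {
        match self {
            Layout::Flat => 1,
            Layout::Columnar(c) | Layout::Chunked(c) => c.iter().map(Layout::flat_count).sum(),
        }
    }
}

/// Row groups (outer chunks) of columns of pages (inner chunks) of flat arrays.
fn parquet_style(row_groups: usize, columns: usize, pages: usize) -> Layout {
    let page_list = || Layout::Chunked((0..pages).map(|_| Layout::Flat).collect());
    let row_group = || Layout::Columnar((0..columns).map(|_| page_list()).collect());
    Layout::Chunked((0..row_groups).map(|_| row_group()).collect())
}

fn main() {
    let file = parquet_style(2, 3, 4);
    assert_eq!(file.describe(), "chunked of column of chunked of flat");
    assert_eq!(file.flat_count(), 24);

    // Children need not share a layout: one column is paged, the other is flat.
    let mixed = Layout::Columnar(vec![
        Layout::Chunked(vec![Layout::Flat, Layout::Flat]),
        Layout::Flat,
    ]);
    assert_eq!(mixed.flat_count(), 3);
}
```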
Anything implementing VortexReadAt, for example local files, byte buffers, and cloud storage,
can be used as the “linear and contiguous memory”.
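To make the idea of positional reads concrete, here is a simplified sketch. The `ReadAt` trait below is a hypothetical, synchronous stand-in written for this example; it is not the signature of VortexReadAt, which may differ (for instance by being asynchronous). The sketch shows only that any storage able to answer "give me `len` bytes at `offset`" can serve as the backing memory.

```rust
use std::convert::TryFrom;
use std::io;

/// Hypothetical, synchronous simplification of a positional-read interface.
trait ReadAt {
    /// Reads exactly `len` bytes starting at `offset`.
    fn read_at(&self, offset: u64, len: usize) -> io::Result<Vec<u8>>;
    /// Total size of the underlying storage in bytes.
    fn size(&self) -> u64;
}

/// An in-memory byte buffer is the simplest possible backing storage.
impl ReadAt for Vec<u8> {
    fn read_at(&self, offset: u64, len: usize) -> io::Result<Vec<u8>> {
        let start = usize::try_from(offset)
            .map_err(|_| io::Error::new(io::ErrorKind::InvalidInput, "offset too large"))?;
        let end = start
            .checked_add(len)
            .filter(|&e| e <= self.len())
            .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "read past end"))?;
        Ok(self[start..end].to_vec())
    }

    fn size(&self) -> u64 {
        self.len() as u64
    }
}

/// Reads up to the last `n` bytes of any source with one positional read.
fn read_suffix<R: ReadAt>(source: &R, n: usize) -> io::Result<Vec<u8>> {
    let size = source.size();
    let n = (n as u64).min(size);
    source.read_at(size - n, n as usize)
}

fn main() -> io::Result<()> {
    let buffer: Vec<u8> = (0u8..10).collect();
    assert_eq!(buffer.read_at(2, 3)?, vec![2, 3, 4]);
    assert_eq!(read_suffix(&buffer, 4)?, vec![6, 7, 8, 9]);
    assert!(buffer.read_at(8, 5).is_err());
    Ok(())
}
```

A file or object-store implementation would provide the same two methods over its own I/O primitives.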
§Reading
Layout reading is implemented by VortexFileArrayStream. The VortexFileArrayStream should
be constructed by a VortexReadBuilder, which first uses an InitialRead to read the footer
(schema, layout, postscript, version, and magic bytes). In most cases, the entire footer can
be read by a single read of the suffix of the file.
A VortexFileArrayStream internally contains a LayoutMessageCache which is shared by its
layout reader and the layout reader’s descendants. The cache permits the reading system to
“read” the bytes of a layout multiple times without triggering reads to the underlying storage.
For example, the VortexFileArrayStream reads an array, evaluates the row filter, and then
reads the array again with the filter mask.
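The read, filter, read-again pattern can be sketched with a toy cache. `RangeCache` is a hypothetical type keyed by byte range; it is not the crate's LayoutMessageCache, whose keys and API may differ. The point is only that the second "read" is served from memory.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical cache keyed by byte range (not the crate's LayoutMessageCache).
struct RangeCache<'a> {
    storage: &'a [u8],
    cached: HashMap<(usize, usize), Vec<u8>>,
    /// Number of times the underlying storage was actually read.
    storage_reads: usize,
}

impl<'a> RangeCache<'a> {
    fn new(storage: &'a [u8]) -> Self {
        RangeCache { storage, cached: HashMap::new(), storage_reads: 0 }
    }

    /// Returns the bytes for `range`, touching storage only on a cache miss.
    fn get(&mut self, range: Range<usize>) -> &[u8] {
        let key = (range.start, range.end);
        if !self.cached.contains_key(&key) {
            self.storage_reads += 1;
            self.cached.insert(key, self.storage[range].to_vec());
        }
        &self.cached[&key]
    }
}

fn main() {
    let storage = [5u8, 12, 7, 30, 2, 19];
    let mut cache = RangeCache::new(&storage);

    // First pass: read the array and evaluate a row filter (value > 10).
    let mask: Vec<bool> = cache.get(0..6).iter().map(|&v| v > 10).collect();

    // Second pass: "read" the same bytes again and keep only the selected rows.
    let kept: Vec<u8> = cache
        .get(0..6)
        .iter()
        .zip(&mask)
        .filter(|(_, m)| **m)
        .map(|(&v, _)| v)
        .collect();

    assert_eq!(kept, vec![12, 30, 19]);
    // Both passes together caused a single read of the underlying storage.
    assert_eq!(cache.storage_reads, 1);
}
```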
A LayoutReader then assembles one or more Vortex arrays by reading the serialized data and
metadata.
§Apache Arrow
If you ultimately seek Arrow arrays, VortexRecordBatchReader converts a
VortexFileArrayStream into a RecordBatchReader.
Modules§
Structs§
- A message that has had its bytes materialized onto the heap.
- A unique locator for a message, including its ID and byte range containing the message contents.
- A RowMask captures a set of selected rows offset by a range.
- Operation to apply to data returned by the layout
- An asynchronous Vortex file that returns a Stream of ArrayDatas.
- Builder for reading Vortex files.
Enums§
- A polling interface for reading a value from a LayoutReader.
- Result type for an attempt to prune rows from a LayoutReader.
Constants§
- The layout ID for a chunked layout
- The layout ID for a column layout
- The size of the EOF marker in bytes
- The layout ID for a flat layout
- The magic bytes for a Vortex file
- The maximum length of a Vortex footer in bytes
- The size of the footer in bytes in Vortex version 1
- The current version of the Vortex file format
- The extension for Vortex files
Traits§
- A reader for a layout, a serialized sequence of Vortex arrays.
Functions§
Type Aliases§
- Unique identifier for a message within a layout
- Path through layout tree to given message