Quickstart#
The reference implementation exposes both a Rust and Python API. A C API is currently in progress.
Python#
Install#
pip install vortex-array
Convert#
You can either use your own Parquet file or download the example used here.
Use Arrow to read a Parquet file and then use array()
to construct an uncompressed
Vortex array:
>>> import pyarrow.parquet as pq
>>> import vortex
>>> parquet = pq.read_table("_static/example.parquet")
>>> vtx = vortex.array(parquet)
>>> vtx.nbytes
141070
Compress#
Use compress()
to compress the Vortex array and check the relative size:
>>> cvtx = vortex.compress(vtx)
>>> cvtx.nbytes
16756
>>> cvtx.nbytes / vtx.nbytes
0.118...
Vortex uses nearly ten times fewer bytes than Arrow. Fewer bytes means more of your data fits in cache and RAM.
Write#
Use write_path()
to write the Vortex array to disk:
>>> vortex.io.write_path(cvtx, "example.vortex")
Small Vortex files (this one is just 71KiB) currently have substantial overhead relative to their size. This will be addressed shortly. On files with at least tens of megabytes of data, Vortex is similar to or smaller than Parquet.
>>> from os.path import getsize
>>> getsize("example.vortex") / getsize("_static/example.parquet")
2.0...
Read#
Use read_path()
to read the Vortex array from disk:
>>> cvtx = vortex.io.read_path("example.vortex")
Rust#
Install#
Install vortex and all the first-party array encodings:
cargo add vortex
Convert#
You can either use your own Parquet file or download the example used here.
Use Arrow to read a Parquet file and then construct an uncompressed Vortex array:
use std::fs::File;
use arrow_array::RecordBatchReader;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use vortex::array::ChunkedArray;
use vortex::arrow::FromArrowType;
use vortex::{Array, IntoArray};
use vortex::dtype::DType;
let reader =
ParquetRecordBatchReaderBuilder::try_new(File::open("_static/example.parquet").unwrap())
.unwrap()
.build()
.unwrap();
let dtype = DType::from_arrow(reader.schema());
let chunks = reader
.map(|x| Array::try_from(x.unwrap()).unwrap())
.collect::<Vec<_>>();
let vtx = ChunkedArray::try_new(chunks, dtype).unwrap().into_array();
Compress#
Use the sampling compressor to compress the Vortex array and check the relative size:
use std::collections::HashSet;
use vortex::sampling_compressor::{SamplingCompressor, DEFAULT_COMPRESSORS};
let compressor = SamplingCompressor::new(HashSet::from(*DEFAULT_COMPRESSORS));
let cvtx = compressor.compress(&vtx, None).unwrap().into_array();
println!("{}", cvtx.nbytes());
Write#
Reading and writing both require an async runtime; in this example we use Tokio. The VortexFileWriter knows how to write Vortex arrays to disk:
use std::path::Path;
use tokio::fs::File as TokioFile;
use vortex_serde::file::write::writer::VortexFileWriter;
let file = TokioFile::create(Path::new("example.vortex"))
.await
.unwrap();
let writer = VortexFileWriter::new(file)
.write_array_columns(cvtx.clone())
.await
.unwrap();
writer.finalize().await.unwrap();
Read#
use futures::TryStreamExt;
use vortex::sampling_compressor::ALL_COMPRESSORS_CONTEXT;
use vortex_serde::file::read::builder::{VortexReadBuilder, LayoutDeserializer};
let file = TokioFile::open(Path::new("example.vortex")).await.unwrap();
let builder = VortexReadBuilder::new(
file,
LayoutDeserializer::new(
ALL_COMPRESSORS_CONTEXT.clone(),
LayoutContext::default().into(),
),
);
let stream = builder.build().await.unwrap();
let dtype = stream.schema().clone().into();
let vecs: Vec<Array> = stream.try_collect().await.unwrap();
let cvtx = ChunkedArray::try_new(vecs, dtype)
.unwrap()
.into_array();
println!("{}", cvtx.nbytes());
C#
Coming soon!