Dataset#

Vortex files implement the Arrow Dataset interface, permitting efficient use of a Vortex file within query engines like DuckDB and Polars. In particular, Vortex reads data proportional to the number of rows passing a filter condition and to the number of columns in a selection. For most Vortex encodings, this property holds even when the filter condition matches only a single row.
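
For example, a Polars lazy query can run directly against a Vortex dataset, with its filters and projections pushed down into the Vortex reader. A minimal sketch, assuming ds is a VortexDataset over an already-opened Vortex file; the column names are hypothetical:

    import polars as pl

    # Polars accepts any object implementing the pyarrow.dataset.Dataset
    # interface; the filter and select below are pushed down, so only
    # matching rows and the two selected columns are read from the file.
    lazy = pl.scan_pyarrow_dataset(ds)  # ds: a VortexDataset (assumed)
    result = (
        lazy.filter(pl.col("age") > 35)   # hypothetical column
        .select(["name", "age"])          # hypothetical columns
        .collect()
    )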

• VortexDataset – Read Vortex files with row filter and column selection pushdown.

• VortexScanner – A PyArrow Dataset Scanner that reads from a Vortex Array.


class vortex.dataset.VortexDataset(dataset)#

Read Vortex files with row filter and column selection pushdown.

This class implements the pyarrow.dataset.Dataset interface which enables its use with Polars, DuckDB, Pandas and others.

count_rows(filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) int#

Not implemented.

filter(expression: Expression) VortexDataset#

Not implemented.

get_fragments(filter: Expression | None = None) Iterator[Fragment]#

Not implemented.

head(num_rows: int, columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) Table#

Load the first num_rows rows of the dataset.

Parameters:
  • num_rows (int) – The number of rows to load.

  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Rows for which this expression evaluates to null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

Returns:

table

Return type:

pyarrow.Table
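
For instance (a sketch, assuming ds is a VortexDataset; the column names and filter are hypothetical):

    import pyarrow.dataset as pads

    # Load at most the first 1_000 rows passing the filter, keeping
    # only the two named columns.
    tbl = ds.head(
        1_000,
        columns=["name", "age"],
        filter=pads.field("age") > 35,
    )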

join(right_dataset, keys, right_keys=None, join_type=None, left_suffix=None, right_suffix=None, coalesce_keys=True, use_threads: bool | None = None) InMemoryDataset#

Not implemented.

join_asof(right_dataset, on, by, tolerance, right_on=None, right_by=None) InMemoryDataset#

Not implemented.

replace_schema(schema: Schema)#

Not implemented.

scanner(columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) Scanner#

Construct a pyarrow.dataset.Scanner.

Parameters:
  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Rows for which this expression evaluates to null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

Returns:

scanner

Return type:

pyarrow.dataset.Scanner
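
For example, a scanner with a projection and an explicit batch size can be constructed and then consumed incrementally (a sketch, assuming ds is a VortexDataset; the column name is hypothetical):

    # Build a reusable scan with pushdown, then stream its batches.
    scanner = ds.scanner(columns=["age"], batch_size=8_192)
    for batch in scanner.to_batches():
        ...  # each batch is a pyarrow.RecordBatch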

property schema: Schema#

The common schema of the full dataset.

sort_by(sorting, **kwargs) InMemoryDataset#

Not implemented.

take(indices: Array | Any, columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) Table#

Load a subset of rows identified by their absolute indices.

Parameters:
  • indices (pyarrow.Array) – A numeric array of absolute indices into self indicating which rows to keep.

  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Rows for which this expression evaluates to null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

Returns:

table

Return type:

pyarrow.Table
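
Because Vortex reads data proportional to the rows requested, point lookups like the following avoid scanning the whole file (a sketch, assuming ds is a VortexDataset; the indices and column name are illustrative):

    import pyarrow as pa

    # Fetch three rows by absolute position, projecting one column.
    tbl = ds.take(pa.array([0, 5, 42]), columns=["name"])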

to_batches(columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) Iterator[RecordBatch]#

Construct an iterator of pyarrow.RecordBatch.

Parameters:
  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Rows for which this expression evaluates to null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

Returns:

batches

Return type:

iterator of pyarrow.RecordBatch
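
Streaming batches avoids materializing the whole dataset in memory. A sketch, assuming ds is a VortexDataset; the column name is hypothetical:

    import pyarrow.dataset as pads

    for batch in ds.to_batches(
        columns=["age"],
        filter=pads.field("age") > 35,
        batch_size=65_536,
    ):
        ...  # process each pyarrow.RecordBatch as it arrives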

to_record_batch_reader(columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) RecordBatchReader#

Construct a pyarrow.RecordBatchReader.

Parameters:
  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Rows for which this expression evaluates to null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

Returns:

reader

Return type:

pyarrow.RecordBatchReader
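
A RecordBatchReader is a convenient hand-off to engines that consume Arrow streams. For instance, DuckDB's Python API can query a reader bound to a local variable via its replacement scans (a sketch, assuming ds is a VortexDataset; the column name is hypothetical):

    import duckdb

    reader = ds.to_record_batch_reader(columns=["age"])
    # DuckDB resolves `reader` from the enclosing Python scope.
    n = duckdb.sql("SELECT count(*) FROM reader").fetchone()[0]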

to_table(columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) Table#

Construct an Arrow pyarrow.Table.

Parameters:
  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Rows for which this expression evaluates to null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

Returns:

table

Return type:

pyarrow.Table
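
A sketch, assuming ds is a VortexDataset; the column names and filter are hypothetical. Only matching rows and the selected columns are read before the table is materialized:

    import pyarrow.dataset as pads

    tbl = ds.to_table(
        columns=["name", "age"],
        filter=pads.field("name") == "Alice",
    )
    df = tbl.to_pandas()  # optionally continue in pandas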

class vortex.dataset.VortexScanner(dataset: VortexDataset, columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None)#

A PyArrow Dataset Scanner that reads from a Vortex Array.

Parameters:
  • dataset (VortexDataset) – The dataset to scan.

  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Rows for which this expression evaluates to null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

count_rows() int#

Count rows matching the scanner filter.

Returns:

count

Return type:

int
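
For example (a sketch, assuming ds is a VortexDataset; the column name and filter are hypothetical):

    import pyarrow.dataset as pads

    # Scanners are typically obtained from VortexDataset.scanner().
    scanner = ds.scanner(filter=pads.field("age") > 35)
    n = scanner.count_rows()  # number of rows matching the filter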

head(num_rows: int) Table#

Load the first num_rows rows of the dataset.

Parameters:

num_rows (int) – The number of rows to read.

Returns:

table

Return type:

pyarrow.Table

scan_batches() Iterator[TaggedRecordBatch]#

Not implemented.

to_batches() Iterator[RecordBatch]#

Construct an iterator of pyarrow.RecordBatch.

Returns:

batches

Return type:

iterator of pyarrow.RecordBatch

to_reader() RecordBatchReader#

Construct a pyarrow.RecordBatchReader.

Returns:

reader

Return type:

pyarrow.RecordBatchReader

to_table() Table#

Construct an Arrow pyarrow.Table.

Returns:

table

Return type:

pyarrow.Table
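
End to end, a filtered projection can be materialized as a single Arrow table (a sketch, assuming ds is a VortexDataset; the names are hypothetical):

    import pyarrow.dataset as pads

    scanner = ds.scanner(
        columns=["name", "age"],
        filter=pads.field("age") > 35,
    )
    tbl = scanner.to_table()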