Python API Reference

array_record.python.array_record_module.ArrayRecordWriter

ArrayRecordWriter(path: str, options: str)

  • path (str): File path where the ArrayRecord file will be written.

  • options (str, optional): Comma-separated options string. Default: "".

Options string format

The options string can contain the following comma-separated options:

  • group_size:N - Number of records per chunk (default: 1)

  • uncompressed - Disable compression

  • brotli[:N] - Use Brotli compression with level N (0-11, default: 6)

  • zstd[:N] - Use Zstd compression with level N (-131072 to 22, default: 3)

  • snappy - Use Snappy compression

  • window_log:N - LZ77 window size (10-31) for zstd and brotli.

  • pad_to_block_boundary:true/false - Pad chunks to 64KB boundaries (default: false)

Select at most one of the compression options zstd, brotli, snappy, or uncompressed. Specifying more than one raises an error.
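As a minimal sketch of how such an options string could be assembled, the helper below joins key:value pairs with commas and rejects combinations the list above rules out. The name build_writer_options and its parameters are hypothetical and not part of the array_record package; it only produces a string in the documented format and does not validate value ranges.

```python
# Hypothetical helper: not part of the array_record package.
_COMPRESSION_NAMES = ("zstd", "brotli", "snappy", "uncompressed")


def build_writer_options(compression=None, level=None, **extra):
    """Return a comma-separated options string for ArrayRecordWriter.

    compression: "zstd", "brotli", "snappy", "uncompressed", or None to
        leave the library default in place.
    level: optional compression level, accepted only for zstd and brotli.
    extra: other documented options, e.g. group_size=1024, window_log=20.
    """
    if level is not None and compression not in ("zstd", "brotli"):
        raise ValueError("a level is only accepted for zstd or brotli")
    if compression is not None and compression not in _COMPRESSION_NAMES:
        raise ValueError(f"unknown compression: {compression!r}")

    parts = []
    for key, value in extra.items():
        # Taking compression through one argument keeps it to a single choice.
        if key in _COMPRESSION_NAMES:
            raise ValueError("pass compression via the 'compression' argument")
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}:{value}")

    if compression is not None:
        parts.append(compression if level is None else f"{compression}:{level}")
    return ",".join(parts)


options = build_writer_options("zstd", level=3, group_size=1024)
```

The resulting string can be passed as the options argument of the constructor. Because the compression method is a single argument, conflicting choices such as zstd together with snappy cannot be expressed.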

ok() -> bool

Returns true when the writer is in a healthy state.

close()

Closes the file. May raise an error if the file cannot be closed.

is_open() -> bool

Returns true when the file is open.

write(record: bytes)

Writes a record to the file. May raise an error if the write fails.
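The writer methods above could be combined as in the following sketch, which writes a sequence of records and closes the writer even if a write fails. The function write_records and the InMemoryWriter stand-in are hypothetical names introduced here so the example runs without the library; only write() and close() come from this reference.

```python
def write_records(writer, records):
    """Write every record, then close the writer, even if a write fails.

    `writer` is any object exposing write(bytes) and close(), such as an
    ArrayRecordWriter. Returns the number of records written.
    """
    count = 0
    try:
        for record in records:
            writer.write(record)
            count += 1
    finally:
        # Closing in `finally` avoids leaving a half-written file open.
        writer.close()
    return count


class InMemoryWriter:
    """Stand-in used only so this sketch runs without the library."""

    def __init__(self):
        self.records = []
        self.closed = False

    def write(self, record):
        self.records.append(record)

    def close(self):
        self.closed = True


demo_writer = InMemoryWriter()
written = write_records(demo_writer, [b"first", b"second"])
```

With the library installed, an ArrayRecordWriter constructed as described above would take the place of InMemoryWriter.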

array_record.python.array_record_module.ArrayRecordReader

ArrayRecordReader(path: str, options: str)

  • path (str): File path to read from.

  • options (str, optional): Comma-separated options string. Default: "".

Options string format

The options string can contain the following comma-separated options:

  • readahead_buffer_size:N - Size of the read-ahead buffer per thread, in bytes (default: 0)

  • max_parallelism:N - Number of read-ahead threads.

  • index_storage_options:in_memory/offloaded - Specifies whether to store the record index in memory or on disk (default: in_memory)

ok() -> bool

Returns true when the reader is in a healthy state.

close()

Closes the file. May raise an error if the file cannot be closed.

is_open() -> bool

Returns true when the file is open.

num_records() -> int

Returns the number of records in the file.

record_index() -> int

Returns the current record index. This value is only relevant in sequential reading mode.

writer_options_string() -> str

Returns the writer options string that was used when creating the ArrayRecord file.
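To inspect individual settings, the returned string can be split into a dictionary. The sketch below assumes the string uses the same comma-separated key:value format as the constructor options, which this reference does not state explicitly; parse_options is a hypothetical helper, not a library function.

```python
def parse_options(options):
    """Split a comma-separated options string into a dict.

    Flag-style options without a value (e.g. "snappy") map to None.
    Values are kept as strings; no type conversion is attempted.
    """
    parsed = {}
    for item in options.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition(":")
        parsed[key.strip()] = value.strip() if sep else None
    return parsed


parsed = parse_options("group_size:1024,zstd:3,pad_to_block_boundary:false")
```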

seek(index: int)

Moves the cursor to the specified index. Raises an error if the index is out of bounds.

read() -> bytes

Reads a record and advances the cursor by one. Raises an error if the cursor has reached the end of the file.
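Sequential reading with seek() and read() might look like the sketch below, which bounds the loop with num_records() so that read() is never called past the end of the file. The generator iter_records and the FakeReader stand-in are hypothetical; FakeReader only mimics the cursor behavior described above so the example runs without the library.

```python
def iter_records(reader, start=0):
    """Yield records one by one from index `start` to the end of the file."""
    total = reader.num_records()
    if start >= total:
        return  # Nothing to read; also avoids an out-of-bounds seek.
    reader.seek(start)
    for _ in range(total - start):
        yield reader.read()


class FakeReader:
    """Stand-in that mimics the documented cursor behavior."""

    def __init__(self, records):
        self._records = list(records)
        self._cursor = 0

    def num_records(self):
        return len(self._records)

    def seek(self, index):
        if not 0 <= index < len(self._records):
            raise IndexError(index)
        self._cursor = index

    def read(self):
        if self._cursor >= len(self._records):
            raise IndexError("end of file")
        record = self._records[self._cursor]
        self._cursor += 1
        return record


tail = list(iter_records(FakeReader([b"a", b"b", b"c"]), start=1))
```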

read(indices: Sequence[int]) -> Sequence[bytes]

Reads the records at the given indices using an internal thread pool. Raises an error if any index is out of bounds.

read(start: int, end: int) -> Sequence[bytes]

Reads the records in the given range using an internal thread pool. Raises an error if either index is out of bounds.

read_all() -> Sequence[bytes]

Reads all records using an internal thread pool. Raises an error if an index is out of bounds.
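Since read_all() returns every record at once, a caller who wants smaller pieces could request the records in batches through read(indices). The following is a sketch under that assumption; read_in_batches and FakeBatchReader are hypothetical names, and explicit index lists are used so that nothing depends on whether the end of read(start, end) is inclusive, which this reference does not specify.

```python
def read_in_batches(reader, batch_size):
    """Yield lists of records, at most `batch_size` records per list."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total = reader.num_records()
    for start in range(0, total, batch_size):
        indices = list(range(start, min(start + batch_size, total)))
        yield list(reader.read(indices))


class FakeBatchReader:
    """Stand-in exposing only num_records() and read(indices)."""

    def __init__(self, records):
        self._records = list(records)

    def num_records(self):
        return len(self._records)

    def read(self, indices):
        return [self._records[i] for i in indices]


demo_reader = FakeBatchReader([b"r0", b"r1", b"r2", b"r3", b"r4"])
batches = list(read_in_batches(demo_reader, batch_size=2))
```

The last batch is shorter whenever the record count is not a multiple of batch_size.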

array_record.python.array_record_data_source.ArrayRecordDataSource

ArrayRecordDataSource(paths: Sequence[str], reader_options: str)

  • paths (Sequence[str]): File paths to read from.

  • reader_options (str, optional): Comma-separated options string. Default: "". See ArrayRecordReader constructor options for details.

__len__() -> int

Returns the total number of records across all ArrayRecord files specified in the constructor.

import glob
from array_record.python import array_record_data_source
ds = array_record_data_source.ArrayRecordDataSource(glob.glob("output.array_record*"))
len(ds)

__iter__() -> Iterator[bytes]

Iterator interface for data access.

import glob
from array_record.python import array_record_data_source
ds = array_record_data_source.ArrayRecordDataSource(glob.glob("output.array_record*"))
it = iter(ds)
record = next(it)

__getitem__(index: int) -> bytes

Reads a record at the specified index.

import glob
from array_record.python import array_record_data_source
ds = array_record_data_source.ArrayRecordDataSource(glob.glob("output.array_record*"))
idx = 0  # any index in range(len(ds))
ds[idx]

__getitems__(indices: Sequence[int]) -> Sequence[bytes]

Reads the records at the specified indices.

import glob
from array_record.python import array_record_data_source
ds = array_record_data_source.ArrayRecordDataSource(glob.glob("output.array_record*"))
indices = [0, 1, 2]  # each index must be in range(len(ds))
ds.__getitems__(indices)
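Because the data source supports len() and index-based access, random batches can be drawn from it. The sketch below is one possible way to do so; sample_batch is a hypothetical helper that works on any object with __len__ and __getitem__, and it uses __getitems__ when the object provides it. A plain list stands in for the data source so the example runs without the library.

```python
import random


def sample_batch(source, batch_size, seed=0):
    """Return up to `batch_size` records drawn without replacement."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    count = min(batch_size, len(source))
    indices = rng.sample(range(len(source)), k=count)
    if hasattr(source, "__getitems__"):
        # Batched lookup, as offered by ArrayRecordDataSource.
        return list(source.__getitems__(indices))
    return [source[i] for i in indices]


# A list of bytes stands in for an ArrayRecordDataSource here.
demo_source = [b"a", b"b", b"c", b"d"]
batch = sample_batch(demo_source, batch_size=2)
```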