Performance Guide¶

This guide covers how the configurations discussed in the previous section (core concepts) affects the data compression ratio and read performance. Besides the configurations discussed previously, the read performance can vary by other factors:

File system. Local file systems and remote file systems (such as GCS) can have different performance characteristics. Improving compression ratio on remote file systems reduces both the storage and data transmission rate. Conversely, the data transmission is basically free on a local file system.
Record data type. Text data is in general very compressible, but most media formats (PNG, JPG, MP4, MPEG) are precompressed which do not benefit from additional compression settings. Embedding data is moderately compressible, but users may benefit more by applying custom quantization algorithms instead.

In this guide we only consider the most basic form: a local file system with plain text generated by lorem ipsum python packages. Nevertheless, users should still benefit from the benchmark results.

The lorem ipsum text data is generated with the following simple program:

from lorem_text import lorem

num_words = 200
num_records = 65536
records = []
n_bytes = 0
for _ in range(num_records):
  record = lorem.words(num_words).encode("utf-8")
  records.append(record)
  n_bytes += len(record)

Compression ratio¶

We first consider the compression ratio which plays an important role in remote file systems. We tested out four compression algorithms (zstd, brotli, snappy, and uncompressed), different group_size, and optionally the compression levels if the algorithm supports it. Note that the uncompressed file size can be larger than the original data because of the metadata and indices.

`group_size:1`¶

Can be used by both ArrayRecordReader and ArrayRecordDataSource.

Compression Algorithm	Compression size
zstd:1	54.66%
zstd:3	54.12%
zstd:5	53.58%
zstd:7	53.52%
brotli:1	55.74%
brotli:3	53.11%
brotli:5	53.72%
brotli:7	53.79%
snappy	86.95%
uncompressed	103.87%

`group_size:256`¶

Can only be used by ArrayRecordReader

Compression Algorithm	Compression size
zstd:1	27.23%
zstd:3	25.88%
zstd:5	25.61%
zstd:7	24.87%
brotli:1	30.26%
brotli:3	27.50%
brotli:5	25.21%
brotli:7	24.06%
snappy	38.49%
uncompressed	100.29%

For textual data, the difference between zstd and brotli across various compression levels was found to be minimal. The group_size parameter constitutes a more critical factor influencing the compression ratio. However, for applications requiring random access, users are advised to maintain a group_size value of 1.

Random access¶

We now consider the read performance with random access. Datasets are created with the previous write benchmark, and we use only the compression level 3 since varying the compression level didn’t affect the compression ratio significantly.

The random access indices are generated with numpy random permutation:

import numpy as np
rng = np.random.default_rng(42)
num_records = 65536  # dataset size
indices = [int(v) for v in rng.permutations(num_records)]

compression	reader type	individual access (qps)	batch access (qps)
zstd	ArrayRecordReader	4,933	310,120
zstd	ArrayRecordDataSource	5,551	188,412
brotli	ArrayRecordReader	5,433	328,267
brotli	ArrayRecordDataSource	5,268	214,682
snappy	ArrayRecordReader	5,333	451,081
snappy	ArrayRecordDataSource	5,407	258,196
uncompressed	ArrayRecordReader	5,407	490,610
uncompressed	ArrayRecordDataSource	5,155	243,658

The benchmark clearly demonstrates the superior performance afforded by batch access compared to individual record access. The internal C++ thread pool employs atomic counters to efficiently manage and track workload distribution among threads. Consequently, even if individual access were implemented using Python threads or processes, it would be unable to attain the same level of efficiency.

It is important to note that although uncompressed data yields higher throughput in this specific benchmark, this result only reflects local file access efficiency. In remote file system environments, compressed records generally provide superior throughput performance.

Sequential access¶

Finally, we examine the sequential access APIs ArrayRecordReader provides. In contrast to ArrayRecordDataSource, ArrayRecordReader accepts group_size larger than 1 which affects the sequential access performance.

We compare the sequential access with repeated calls to read() that uses read-ahead threads for prefetching, and the read_all() API that uses a thread pool to process all items concurrently.

compression	`group_size`	`read_all` (qps)	sequential read (qps)
zstd	1	260,210	229,386
zstd	256	459,331	492,653
brotli	1	222,285	252,891
brotli	256	367,190	374,484
snappy	1	493,140	544,226
snappy	256	430,323	588,599
uncompressed	1	645,149	619,762
uncompressed	256	1,094,563	1,078,190

To our surprise, the repeated calls to read() performs better than the read_all() API. This may be due to the overhead of python objects creation of read_all(). In C++ benchmark we typically see the opposite result.

Summary¶

This guide has examined how the tuning parameters discussed within the core concepts section influence both file compression ratio and read performance. Although performance characteristics may exhibit variability dependent on the dataset (e.g., media type, record size, record count) and the underlying file system, setting group_size:1 and utilizing the batch access API is anticipated to provide optimal results for most users employing ArrayRecord for random access operations.