Performance Guide

This guide covers how the configurations discussed in the previous section (core concepts) affects the data compression ratio and read performance. Besides the configurations discussed previously, the read performance can vary by other factors:

  1. File system. Local file systems and remote file systems (such as GCS) can have different performance characteristics. Improving compression ratio on remote file systems reduces both the storage and data transmission rate. Conversely, the data transmission is basically free on a local file system.

  2. Record data type. Text data is in general very compressible, but most media formats (PNG, JPG, MP4, MPEG) are precompressed which do not benefit from additional compression settings. Embedding data is moderately compressible, but users may benefit more by applying custom quantization algorithms instead.

In this guide we only consider the most basic form: a local file system with plain text generated by lorem ipsum python packages. Nevertheless, users should still benefit from the benchmark results.

The lorem ipsum text data is generated with the following simple program:

from lorem_text import lorem

num_words = 200
num_records = 65536
records = []
n_bytes = 0
for _ in range(num_records):
  record = lorem.words(num_words).encode("utf-8")
  records.append(record)
  n_bytes += len(record)

Compression ratio

We first consider the compression ratio which plays an important role in remote file systems. We tested out four compression algorithms (zstd, brotli, snappy, and uncompressed), different group_size, and optionally the compression levels if the algorithm supports it. Note that the uncompressed file size can be larger than the original data because of the metadata and indices.

group_size:1

Can be used by both ArrayRecordReader and ArrayRecordDataSource.

Compression Algorithm

Compression size

zstd:1

54.66%

zstd:3

54.12%

zstd:5

53.58%

zstd:7

53.52%

brotli:1

55.74%

brotli:3

53.11%

brotli:5

53.72%

brotli:7

53.79%

snappy

86.95%

uncompressed

103.87%

group_size:256

Can only be used by ArrayRecordReader

Compression Algorithm

Compression size

zstd:1

27.23%

zstd:3

25.88%

zstd:5

25.61%

zstd:7

24.87%

brotli:1

30.26%

brotli:3

27.50%

brotli:5

25.21%

brotli:7

24.06%

snappy

38.49%

uncompressed

100.29%

For textual data, the difference between zstd and brotli across various compression levels was found to be minimal. The group_size parameter constitutes a more critical factor influencing the compression ratio. However, for applications requiring random access, users are advised to maintain a group_size value of 1.

Random access

We now consider the read performance with random access. Datasets are created with the previous write benchmark, and we use only the compression level 3 since varying the compression level didn’t affect the compression ratio significantly.

The random access indices are generated with numpy random permutation:

import numpy as np
rng = np.random.default_rng(42)
num_records = 65536  # dataset size
indices = [int(v) for v in rng.permutations(num_records)]

compression

reader type

individual access (qps)

batch access (qps)

zstd

ArrayRecordReader

4,933

310,120

zstd

ArrayRecordDataSource

5,551

188,412

brotli

ArrayRecordReader

5,433

328,267

brotli

ArrayRecordDataSource

5,268

214,682

snappy

ArrayRecordReader

5,333

451,081

snappy

ArrayRecordDataSource

5,407

258,196

uncompressed

ArrayRecordReader

5,407

490,610

uncompressed

ArrayRecordDataSource

5,155

243,658

The benchmark clearly demonstrates the superior performance afforded by batch access compared to individual record access. The internal C++ thread pool employs atomic counters to efficiently manage and track workload distribution among threads. Consequently, even if individual access were implemented using Python threads or processes, it would be unable to attain the same level of efficiency.

It is important to note that although uncompressed data yields higher throughput in this specific benchmark, this result only reflects local file access efficiency. In remote file system environments, compressed records generally provide superior throughput performance.

Sequential access

Finally, we examine the sequential access APIs ArrayRecordReader provides. In contrast to ArrayRecordDataSource, ArrayRecordReader accepts group_size larger than 1 which affects the sequential access performance.

We compare the sequential access with repeated calls to read() that uses read-ahead threads for prefetching, and the read_all() API that uses a thread pool to process all items concurrently.

compression

group_size

read_all (qps)

sequential read (qps)

zstd

1

260,210

229,386

zstd

256

459,331

492,653

brotli

1

222,285

252,891

brotli

256

367,190

374,484

snappy

1

493,140

544,226

snappy

256

430,323

588,599

uncompressed

1

645,149

619,762

uncompressed

256

1,094,563

1,078,190

To our surprise, the repeated calls to read() performs better than the read_all() API. This may be due to the overhead of python objects creation of read_all(). In C++ benchmark we typically see the opposite result.

Summary

This guide has examined how the tuning parameters discussed within the core concepts section influence both file compression ratio and read performance. Although performance characteristics may exhibit variability dependent on the dataset (e.g., media type, record size, record count) and the underlying file system, setting group_size:1 and utilizing the batch access API is anticipated to provide optimal results for most users employing ArrayRecord for random access operations.