Performance Guide¶
This guide covers how the configurations discussed in the previous section (core concepts) affect the data compression ratio and read performance. Besides those configurations, read performance also depends on other factors:
File system. Local and remote file systems (such as GCS) have different performance characteristics. On a remote file system, a better compression ratio reduces both the storage footprint and the amount of data transferred. On a local file system, by contrast, data transfer is essentially free.
Record data type. Text data is generally very compressible, but most media formats (PNG, JPG, MP4, MPEG) are already compressed and do not benefit from additional compression. Embedding data is moderately compressible, but users may benefit more from applying custom quantization algorithms instead.
In this guide we consider only the simplest setup: a local file system with plain text generated by a lorem ipsum Python package. The benchmark results should nevertheless be a useful reference for other setups.
The lorem ipsum text data is generated with the following simple program:
from lorem_text import lorem

num_words = 200      # words per record
num_records = 65536  # records in the dataset
records = []
n_bytes = 0          # total size of the raw (uncompressed) payload
for _ in range(num_records):
    record = lorem.words(num_words).encode("utf-8")
    records.append(record)
    n_bytes += len(record)
Compression ratio¶
We first consider the compression ratio, which matters most on remote
file systems. We tested four compression settings (zstd, brotli, snappy,
and uncompressed), different group_size values, and, where the algorithm
supports them, different compression levels. Note that the uncompressed file
size can be larger than the original data because of the metadata and indices.
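The percentages reported in this section express the on-disk file size relative to the raw payload size. A minimal sketch of that calculation follows. The helper name `compressed_size_percent` is hypothetical, and the demonstration uses a stand-in temporary file rather than a real ArrayRecord file.

```python
import os
import tempfile


def compressed_size_percent(file_path: str, raw_bytes: int) -> float:
    """Return the on-disk file size as a percentage of the raw payload size."""
    if raw_bytes <= 0:
        raise ValueError("raw_bytes must be positive")
    return 100.0 * os.path.getsize(file_path) / raw_bytes


# Stand-in file: 1,000 payload bytes plus 40 bytes that play the role of
# metadata and indices, so the result exceeds 100%.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 1040)
    demo_path = f.name

demo_percent = compressed_size_percent(demo_path, raw_bytes=1000)
os.remove(demo_path)
print(f"{demo_percent:.2f}%")  # 104.00%
```

With the generator program above, `raw_bytes` would correspond to `n_bytes`, and `file_path` to the file written by the benchmark.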
group_size:1¶
Can be used by both ArrayRecordReader and ArrayRecordDataSource.
| Compression algorithm | Compressed size (% of original data) |
|---|---|
| zstd:1 | 54.66% |
| zstd:3 | 54.12% |
| zstd:5 | 53.58% |
| zstd:7 | 53.52% |
| brotli:1 | 55.74% |
| brotli:3 | 53.11% |
| brotli:5 | 53.72% |
| brotli:7 | 53.79% |
| snappy | 86.95% |
| uncompressed | 103.87% |
group_size:256¶
Can only be used by ArrayRecordReader.
| Compression algorithm | Compressed size (% of original data) |
|---|---|
| zstd:1 | 27.23% |
| zstd:3 | 25.88% |
| zstd:5 | 25.61% |
| zstd:7 | 24.87% |
| brotli:1 | 30.26% |
| brotli:3 | 27.50% |
| brotli:5 | 25.21% |
| brotli:7 | 24.06% |
| snappy | 38.49% |
| uncompressed | 100.29% |
For text data, the difference between zstd and brotli across compression
levels is minimal. The group_size parameter has a much larger effect on the
compression ratio: with zstd:3, for example, the file shrinks from 54.12% of
the original data at group_size:1 to 25.88% at group_size:256. However,
applications that require random access are advised to keep group_size at 1.
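The effect of grouping can be reproduced in isolation. The sketch below is not ArrayRecord's implementation: it uses `zlib` from the Python standard library as a stand-in codec, a small hypothetical vocabulary instead of the `lorem_text` package, and it ignores file metadata and indices. It only illustrates why compressing many records together yields a smaller output than compressing each record on its own.

```python
import random
import zlib

# Hypothetical vocabulary standing in for lorem ipsum text.
VOCAB = ("lorem ipsum dolor sit amet consectetur adipiscing elit sed do "
         "eiusmod tempor incididunt ut labore et dolore magna aliqua").split()


def make_records(num_records: int, num_words: int, seed: int = 0) -> list[bytes]:
    """Generate deterministic pseudo-text records."""
    rng = random.Random(seed)
    return [" ".join(rng.choices(VOCAB, k=num_words)).encode("utf-8")
            for _ in range(num_records)]


def compressed_percent(records: list[bytes], group_size: int) -> float:
    """Compress records in chunks of group_size; return size as % of raw."""
    raw = sum(len(r) for r in records)
    compressed = 0
    for start in range(0, len(records), group_size):
        chunk = b"".join(records[start:start + group_size])
        compressed += len(zlib.compress(chunk, 6))
    return 100.0 * compressed / raw


records = make_records(num_records=1024, num_words=200)
per_record = compressed_percent(records, group_size=1)
per_group = compressed_percent(records, group_size=256)
print(f"group_size=1:   {per_record:.2f}%")
print(f"group_size=256: {per_group:.2f}%")
```

The trade-off is the one described above: a larger group gives the compressor more context, but a single record can no longer be decoded without decompressing its whole group.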
Random access¶
We now consider read performance with random access. The datasets are those created by the write benchmark above. We use only compression level 3, since varying the compression level did not change the compression ratio significantly.
The random access indices are generated with numpy random permutation:
import numpy as np

rng = np.random.default_rng(42)
num_records = 65536  # dataset size
# Generator.permutation returns a shuffled array of 0..num_records-1.
indices = [int(v) for v in rng.permutation(num_records)]
| compression | reader type | individual access (qps) | batch access (qps) |
|---|---|---|---|
| zstd | ArrayRecordReader | 4,933 | 310,120 |
| zstd | ArrayRecordDataSource | 5,551 | 188,412 |
| brotli | ArrayRecordReader | 5,433 | 328,267 |
| brotli | ArrayRecordDataSource | 5,268 | 214,682 |
| snappy | ArrayRecordReader | 5,333 | 451,081 |
| snappy | ArrayRecordDataSource | 5,407 | 258,196 |
| uncompressed | ArrayRecordReader | 5,407 | 490,610 |
| uncompressed | ArrayRecordDataSource | 5,155 | 243,658 |
The benchmark shows that batch access is far faster than individual record access, by a factor of roughly 34 to 91 in these runs. Batch access is served by an internal C++ thread pool that uses atomic counters to distribute and track work among threads. Consequently, even if individual access were parallelized with Python threads or processes, it could not reach the same level of efficiency.
It is important to note that although uncompressed data yields higher throughput in this specific benchmark, this result only reflects local file access efficiency. In remote file system environments, compressed records generally provide superior throughput performance.
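A queries-per-second comparison of this kind can be set up with a small timing harness. The following is a sketch under stated assumptions, not the benchmark used for the table: `measure_qps`, `individual_access`, and `batch_access` are hypothetical names, and an in-memory list stands in for a real reader so that the block runs without any ArrayRecord file.

```python
import operator
import time
from typing import Callable


def measure_qps(fetch: Callable[[], int], repeats: int = 3) -> float:
    """Run fetch() several times and return the best records-per-second rate.

    fetch must return the number of records it retrieved.
    """
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        count = fetch()
        elapsed = time.perf_counter() - start
        best = max(best, count / max(elapsed, 1e-9))
    return best


# Stand-in data source: a plain list. With a real reader, the two functions
# below would issue per-index lookups and one batched lookup respectively.
store = [f"record-{i}".encode("utf-8") for i in range(10_000)]
indices = list(reversed(range(len(store))))


def individual_access() -> int:
    return len([store[i] for i in indices])  # one lookup per index


def batch_access() -> int:
    return len(operator.itemgetter(*indices)(store))  # one call for the batch


individual_qps = measure_qps(individual_access)
batch_qps = measure_qps(batch_access)
print(f"individual: {individual_qps:,.0f} qps")
print(f"batch:      {batch_qps:,.0f} qps")
```

Taking the best of several repeats reduces noise from caching and scheduling. The numbers from this stand-in say nothing about ArrayRecord itself; only the measurement pattern carries over.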
Sequential access¶
Finally, we examine the sequential access APIs that ArrayRecordReader
provides. In contrast to ArrayRecordDataSource, ArrayRecordReader accepts a
group_size larger than 1, which affects sequential access performance.
We compare repeated calls to read(), which uses read-ahead threads for
prefetching, with the read_all() API, which uses a thread pool to process all
items concurrently.
| compression | group_size | read_all() (qps) | sequential read (qps) |
|---|---|---|---|
| zstd | 1 | 260,210 | 229,386 |
| zstd | 256 | 459,331 | 492,653 |
| brotli | 1 | 222,285 | 252,891 |
| brotli | 256 | 367,190 | 374,484 |
| snappy | 1 | 493,140 | 544,226 |
| snappy | 256 | 430,323 | 588,599 |
| uncompressed | 1 | 645,149 | 619,762 |
| uncompressed | 256 | 1,094,563 | 1,078,190 |
To our surprise, repeated calls to read() outperform the read_all() API in
most configurations. This may be due to the overhead of creating Python
objects in read_all(). In C++ benchmarks we typically see the opposite result.
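For readers unfamiliar with read-ahead, the idea behind prefetching during sequential reads can be sketched in pure Python. This is a simplified illustration, not ArrayRecord's C++ implementation: it uses a single background thread and a bounded queue, and `prefetching_reader` and `read_one` are hypothetical names.

```python
import queue
import threading
from typing import Callable, Iterator, Optional

_SENTINEL = object()  # marks the end of the stream


def prefetching_reader(read_one: Callable[[], Optional[bytes]],
                       buffer_size: int = 64) -> Iterator[bytes]:
    """Yield records from read_one() while a background thread reads ahead.

    read_one must return the next record, or None once the data is exhausted.
    """
    buffer: queue.Queue = queue.Queue(maxsize=buffer_size)

    def producer() -> None:
        while True:
            record = read_one()
            if record is None:
                buffer.put(_SENTINEL)
                return
            buffer.put(record)  # blocks while the buffer is full

    # Daemon thread: it does not keep the process alive if the consumer
    # stops iterating early.
    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _SENTINEL:
            return
        yield item


source = iter(f"record-{i}".encode("utf-8") for i in range(1000))
prefetched = list(prefetching_reader(lambda: next(source, None)))
print(len(prefetched))  # 1000
```

The consumer only waits when the buffer is empty, so reading and processing overlap. Record order is preserved because a single producer fills a FIFO queue.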
Summary¶
This guide has examined how the tuning parameters discussed within the core
concepts section influence both file compression ratio and read performance.
Although performance characteristics may exhibit variability dependent on the
dataset (e.g., media type, record size, record count) and the underlying file
system, setting group_size:1 and utilizing the batch access API is anticipated
to provide optimal results for most users employing ArrayRecord for random
access operations.