2 posts tagged with "operator"

Why Sort is row-based in Velox — A Quantitative Assessment

December 24, 2025 · 8 min read

Meng Duan (macduan)

Software Engineer @ ByteDance

Xiaoxuan Meng

Software Engineer @ Meta

TL;DR

Velox is a fully vectorized execution engine[1]. Its internal columnar memory layout enhances cache locality, exposes more inter-instruction parallelism to CPUs, and enables the use of SIMD instructions, significantly accelerating large-scale query processing.

However, some operators in Velox utilize a hybrid layout, where datasets can be temporarily converted to a row-oriented format. The OrderBy operator is one example, where our implementation first materializes the input vectors into rows, containing both sort keys and payload columns, sorts them, and converts the rows back to vectors.

In this article, we explain the rationale behind this design decision and provide experimental evidence for its implementation. We show a prototype of a hybrid sorting strategy that materializes only the sort-key columns, reducing the overhead of materializing payload columns. Contrary to expectations, the end-to-end performance did not improve—in fact, it was even up to 3× slower. We present the two variants and discuss why one is counter-intuitively faster than the other.

Multi-Round Lazy Start Merge

November 9, 2025 · 6 min read

Meng Duan (macduan)

Software Engineer @ ByteDance

Xiaoxuan Meng

Software Engineer @ Meta

Pedro Pedreira

Software Engineer @ Meta

Background

Efficiently merging sorted data partitions at scale is crucial for a variety of training data preparation workloads, especially for Generative Recommenders (GRs) a new paradigm introduced in the paper Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations. A key requirement is to merge training data across partitions—for example, merging hourly partitions into daily ones—while ensuring that all rows sharing the same primary key are stored consecutively. Training data is typically partitioned and bucketed by primary key, with rows sharing the same key stored consecutively, so merging across partitions essentially becomes a multi-way merge problem.

Normally, Apache Spark can be used for this sort-merge requirement — for example, via CLUSTER BY. However, training datasets for a single job can often reach the PB scale, which in turn generates shuffle data at PB scale. Although we typically apply bucketing and ordering by key when preparing training data in production, Spark can eliminate the shuffle when merging training data from multiple hourly partitions. However, each Spark task can only read the files planned from various partitions within a split sequentially, placing them into the sorter and spilling as needed. Only after all files have been read does Spark perform a sort-merge of the spilled files. This process produces a large number of small spill files, which further degrades efficiency.

TL;DR​

Background​

TL;DR

Background