Multi-Round Lazy Start Merge
Background
Efficiently merging sorted data partitions at scale is crucial for a variety of training data preparation workloads, especially for Generative Recommenders (GRs), a new paradigm introduced in the paper Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations. A key requirement is to merge training data across partitions—for example, merging hourly partitions into daily ones—while ensuring that all rows sharing the same primary key are stored consecutively. Because training data is typically partitioned and bucketed by primary key, with each bucket sorted by that key, merging across partitions is essentially a multi-way merge problem.
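The multi-way merge described above can be sketched in a few lines. This is a minimal illustration, not the system's actual implementation: each partition is assumed to be an iterator of (primary_key, row) tuples already sorted by key, standing in for a bucketed, key-sorted file.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Hypothetical stand-ins for three hourly partitions of one bucket,
# each already sorted by primary key.
hourly_partitions = [
    [(1, "h0-a"), (3, "h0-b")],
    [(1, "h1-a"), (2, "h1-b")],
    [(2, "h2-a"), (3, "h2-b")],
]

def multiway_merge(partitions):
    """Stream a k-way merge so that all rows sharing a primary key
    come out consecutively, grouped under that key."""
    merged = heapq.merge(*partitions, key=itemgetter(0))
    for key, rows in groupby(merged, key=itemgetter(0)):
        yield key, [row for _, row in rows]
```

Because every input is already sorted, heapq.merge only ever buffers one row per partition, so the merged (daily) output is produced in a single streaming pass with no extra sort.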
Normally, Apache Spark can be used for this sort-merge requirement — for example, via CLUSTER BY.
However, training datasets for a single job can often reach the PB scale, which in turn generates shuffle data at PB scale.
Because we typically apply bucketing and ordering by key when preparing training data in production,
Spark can eliminate the shuffle when merging training data from multiple hourly partitions.
However, each Spark task reads the files planned for its split from the various partitions
sequentially, placing their rows into the sorter and spilling to disk as needed. Only after all
files have been read does Spark sort-merge the spilled files. This process produces a large number
of small spill files, which further degrades efficiency.
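The contrast between that eager read-then-merge behavior and a lazy streaming merge can be illustrated with a toy simulation. This is a simplified sketch, not Spark's actual sorter: `files` stands in for the sorted files a task reads within one split, and `memory_limit` for the number of rows the in-memory sorter can hold before spilling.

```python
import heapq

def eager_read_then_merge(files, memory_limit=4):
    """Eager strategy: read every file sequentially into an in-memory
    buffer, spill when the buffer fills, then sort-merge all spills
    at the end. Returns the merged rows and the number of spill files."""
    spills, buffer = [], []
    for f in files:
        for row in f:
            buffer.append(row)
            if len(buffer) >= memory_limit:
                spills.append(sorted(buffer))  # one small spill file per flush
                buffer = []
    if buffer:
        spills.append(sorted(buffer))
    return list(heapq.merge(*spills)), len(spills)

def lazy_streaming_merge(files):
    """Lazy strategy: since every input file is already sorted,
    merge them directly with no buffering and no spills."""
    return list(heapq.merge(*files)), 0

files = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
```

Both strategies produce the same merged output, but the eager one creates a spill file for every `memory_limit` rows read, while the lazy one exploits the pre-sorted inputs and creates none.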


