<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Velox Blog</title>
        <link>https://velox-lib.io/blog</link>
        <description>Velox Blog</description>
        <lastBuildDate>Sun, 29 Mar 2026 00:00:00 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <item>
            <title><![CDATA[From flaky Axiom CI to a Velox bug fix: a cross-repo debugging story]]></title>
            <link>https://velox-lib.io/blog/debugging-flaky-ci-across-repos</link>
            <guid>https://velox-lib.io/blog/debugging-flaky-ci-across-repos</guid>
            <pubDate>Sun, 29 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[TL;DR]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="tldr">TL;DR<a href="https://velox-lib.io/blog/debugging-flaky-ci-across-repos#tldr" class="hash-link" aria-label="Direct link to TL;DR" title="Direct link to TL;DR">​</a></h2>
<p>When adding macOS CI to <a href="https://github.com/facebookincubator/axiom" target="_blank" rel="noopener noreferrer">Axiom</a>,
set operation tests kept failing intermittently — but only in macOS debug CI.
Linux CI (debug and release) passed consistently. Local runs always passed.
The root cause turned out to be a bug in
<a href="https://github.com/facebookincubator/velox" target="_blank" rel="noopener noreferrer">Velox</a> — a dependency managed as
a Git submodule. This post describes the process of debugging a CI-only failure
when the bug lives in a different repository.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-problem">The problem<a href="https://velox-lib.io/blog/debugging-flaky-ci-across-repos#the-problem" class="hash-link" aria-label="Direct link to The problem" title="Direct link to The problem">​</a></h2>
<p><a href="https://github.com/facebookincubator/axiom" target="_blank" rel="noopener noreferrer">Axiom</a> is a set of open-source
composable libraries for building SQL and dataframe processing front ends on
top of <a href="https://github.com/facebookincubator/velox" target="_blank" rel="noopener noreferrer">Velox</a>. It integrates
Velox as a Git submodule.</p>
<p>After adding macOS CI builds (<a href="https://github.com/facebookincubator/axiom/pull/1168" target="_blank" rel="noopener noreferrer">#1168</a>),
we couldn't get the test suite to pass reliably.
Various set operation tests (<code>SqlTest.set_*</code>) failed intermittently in macOS
debug CI — different tests on different runs, all involving EXCEPT ALL or
INTERSECT ALL. Linux debug and release CI passed consistently. Running the same
tests locally — same build type, same data, same configuration — always passed.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="why-ci-only-failures-are-hard-to-debug">Why CI-only failures are hard to debug<a href="https://velox-lib.io/blog/debugging-flaky-ci-across-repos#why-ci-only-failures-are-hard-to-debug" class="hash-link" aria-label="Direct link to Why CI-only failures are hard to debug" title="Direct link to Why CI-only failures are hard to debug">​</a></h2>
<p>The obvious approach — reproduce locally, attach a debugger, step through the
code — doesn't work. You are limited to CI runs, and each run takes about 10
minutes end-to-end: push the change, wait for CI to pick it up, build
(~8 minutes even with ccache), run tests, check results. That's 3-4 iterations
per hour at best. Every CI run has to be designed to extract maximum signal.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="step-1-make-ci-runs-count">Step 1: Make CI runs count<a href="https://velox-lib.io/blog/debugging-flaky-ci-across-repos#step-1-make-ci-runs-count" class="hash-link" aria-label="Direct link to Step 1: Make CI runs count" title="Direct link to Step 1: Make CI runs count">​</a></h2>
<p>The first change was to restructure CI to focus on the problem:</p>
<ul>
<li><strong>Disable unrelated builds.</strong> Linux debug, Linux release, and macOS release
were all passing. Disabling them reduced the CI cycle time.</li>
<li><strong>Run only <code>SqlTest.set_*</code> tests 5 times.</strong> The tests were flaky — sometimes
passing, sometimes failing. Running just the set operation tests (not the
full suite) 5 times increased the chances of triggering the failure together
with the debug logging while keeping the run fast. Without this, a CI run
could pass by luck, wasting 10 minutes with no useful signal.</li>
<li><strong>Run the full test suite after.</strong> The 5x set-test loop ran first. If it
passed, the full suite ran to check for regressions.</li>
</ul>
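<p>This kind of CI test step can be sketched with standard GoogleTest flags (the test binary path below is a placeholder, not Axiom's actual layout):</p>

```shell
# Repeat only the flaky subset; stop at the first failure so the logs for
# the failing run are easy to find. The flags are standard GoogleTest;
# the binary path is a placeholder.
./sql_test \
  --gtest_filter='SqlTest.set_*' \
  --gtest_repeat=5 \
  --gtest_break_on_failure

# Only if the flaky subset passes, run the full suite to check for regressions.
./sql_test
```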
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="step-2-iterate-on-velox-changes-using-axiom-ci">Step 2: Iterate on Velox changes using Axiom CI<a href="https://velox-lib.io/blog/debugging-flaky-ci-across-repos#step-2-iterate-on-velox-changes-using-axiom-ci" class="hash-link" aria-label="Direct link to Step 2: Iterate on Velox changes using Axiom CI" title="Direct link to Step 2: Iterate on Velox changes using Axiom CI">​</a></h2>
<p>The failing tests all involved EXCEPT ALL and INTERSECT ALL queries. Axiom's
SQL parser and optimizer translate these into Velox execution plans using
<em>counting joins</em> — a Velox-specific join type that, unlike regular semi/anti
joins that check whether a key exists, tracks how many times each key appears
on the build side. Since the implementation lives in Velox, the bug had to be
there too.</p>
<p>But the tests were in Axiom. Axiom uses Velox as a Git submodule pointing to a
specific commit on <code>facebookincubator/velox</code>. To add debug logging to Velox and
test it in Axiom CI, we needed Axiom to build from a modified Velox.</p>
<p>The key insight is that you don't have to land changes to Velox first. You can
point Axiom's submodule at a fork branch and let CI validate end-to-end.</p>
<p>The workflow:</p>
<ol>
<li><strong>Fork Velox and push changes to a branch.</strong> Create a branch on your fork
(e.g., <code>mbasmanova/velox</code>) with the debug logging or fix.</li>
<li><strong>Point Axiom's submodule at the fork.</strong> Two changes are needed:<!-- -->
<ul>
<li>Update <code>.gitmodules</code> to use the fork URL:<!-- -->
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">[submodule "velox"]</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    path = velox</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    url = https://github.com/mbasmanova/velox.git</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
</li>
<li>Update the submodule to the desired commit:<!-- -->
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">git -C velox checkout &lt;commit&gt;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">git add velox .gitmodules</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
</li>
</ul>
</li>
<li><strong>Push and let CI build.</strong> Axiom CI checks out submodules recursively, so
it picks up the modified Velox automatically.</li>
</ol>
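<p>Put together, the redirection looks roughly like this (the fork URL and branch name are the ones used here; note that after editing <code>.gitmodules</code>, <code>git submodule sync</code> propagates the new URL into the local Git config):</p>

```shell
# In the Axiom checkout, after editing .gitmodules to point at the fork:
git submodule sync velox            # propagate the fork URL to .git/config
git -C velox fetch origin
git -C velox checkout origin/fix-counting-join-merge-axiom
git add velox .gitmodules
git commit -m "Point Velox submodule at fork branch for CI debugging"
git push
```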
<p>One subtlety: the Velox PR and the Axiom submodule both pushed to branches on
the same fork (<code>mbasmanova/velox</code>). Initially we used the same branch name for
both, which caused force-pushes from one repo to overwrite the other. The fix
was to use separate branches — <code>fix-counting-join-merge</code> for the Velox PR and
<code>fix-counting-join-merge-axiom</code> for the Axiom submodule.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="step-3-form-a-hypothesis-then-add-targeted-logging">Step 3: Form a hypothesis, then add targeted logging<a href="https://velox-lib.io/blog/debugging-flaky-ci-across-repos#step-3-form-a-hypothesis-then-add-targeted-logging" class="hash-link" aria-label="Direct link to Step 3: Form a hypothesis, then add targeted logging" title="Direct link to Step 3: Form a hypothesis, then add targeted logging">​</a></h2>
<p>With limited CI iterations, random logging is wasteful. Each round of logging
must be designed to prove or disprove one or more hypotheses.</p>
<p>The hypothesis was:</p>
<blockquote>
<p>When multiple build drivers process overlapping keys, each driver builds its
own hash table with per-key counts. When these tables are merged, duplicate
keys are dropped without summing their counts.</p>
</blockquote>
<p>To test this, we added logging at three points in the counting join lifecycle,
logging full input/output data per driver including the driver ID:</p>
<ol>
<li><strong>Build input</strong> — log each batch of rows received by each build driver.</li>
<li><strong>Probe input</strong> — log each batch of probe rows.</li>
<li><strong>Probe output</strong> — log the rows emitted by each probe driver.</li>
</ol>
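<p>The log format we aimed for can be sketched as follows (a hypothetical helper, not Velox's actual logging code; the point is that every line carries the lifecycle point, the driver ID, and the full batch, so build inputs and probe outputs can be correlated offline):</p>

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical illustration: format one log line per batch, tagged with the
// lifecycle point ("build-input", "probe-input", "probe-output") and the
// driver ID, with the full key contents of the batch.
std::string formatBatchLog(
    const std::string& point,
    int driverId,
    const std::vector<int64_t>& keys) {
  std::ostringstream out;
  out << point << " driver=" << driverId << " rows=" << keys.size()
      << " keys=[";
  for (size_t i = 0; i < keys.size(); ++i) {
    out << (i ? "," : "") << keys[i];
  }
  out << "]";
  return out.str();
}
```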
<p>By comparing build inputs across drivers (to see which keys each driver
processed) with probe outputs (to see the final results), we could determine
whether the merged hash table had correct per-key counts.</p>
<p>Before pushing to CI, we ran the tests locally. The bug didn't reproduce
locally, but the logging code paths still executed. This validated that the log
output was readable, the format was correct, and the information needed to
confirm or reject the hypothesis would be present.</p>
<p>One CI run with this logging confirmed the hypothesis: multiple build drivers
processed overlapping keys, but the probe output showed counts as if only one
driver's keys were present — the merge had dropped duplicate key counts.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="step-4-the-fix">Step 4: The fix<a href="https://velox-lib.io/blog/debugging-flaky-ci-across-repos#step-4-the-fix" class="hash-link" aria-label="Direct link to Step 4: The fix" title="Direct link to Step 4: The fix">​</a></h2>
<p>The actual bug was straightforward once identified. Velox's <code>HashTable</code> has
three code paths for inserting rows during hash table merge:</p>
<ul>
<li><code>arrayPushRow</code> for array-based hash mode (small key domains)</li>
<li><code>buildFullProbe</code> with normalized-key comparison (medium key domains)</li>
<li><code>buildFullProbe</code> with full key comparison (complex types)</li>
</ul>
<p>All three paths handled duplicate keys correctly for regular joins (linking
rows via a "next" pointer), but for counting joins (which use a count instead
of a next pointer), duplicates were silently dropped.</p>
<p>The fix adds an <code>addCount</code> method to sum counts when a duplicate key is found
during merge. Since Axiom's build does not compile Velox tests, we built Velox
standalone to develop and run the test.</p>
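<p>The shape of the bug, reduced to a toy (this is not Velox's <code>HashTable</code> code, just an illustration of the duplicate-key merge): when merging per-driver tables for a counting join, a duplicate key must have its count added to the existing entry rather than being dropped.</p>

```cpp
#include <cstdint>
#include <unordered_map>

// Toy model: key -> per-key count, one table per build driver.
using CountingTable = std::unordered_map<int64_t, int64_t>;

void mergeCountingTables(CountingTable& target, const CountingTable& source) {
  for (const auto& [key, count] : source) {
    auto [it, inserted] = target.try_emplace(key, count);
    if (!inserted) {
      // The buggy behavior was equivalent to skipping this branch, silently
      // dropping the duplicate key's count. The fix is the "addCount" step.
      it->second += count;
    }
  }
}
```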
<p>Designing a test that exercises all three code paths was non-trivial. The hash
table mode is chosen automatically based on key types, number of distinct
values, and value ranges. Our first test attempts only hit array mode because
the key domain was small. We had to study the mode selection logic in
<code>decideHashMode()</code> to find key configurations that force each mode. This
analysis was time-consuming enough that we
<a href="https://facebookincubator.github.io/velox/develop/hash-table.html" target="_blank" rel="noopener noreferrer">documented the hash modes</a>
(<a href="https://github.com/facebookincubator/velox/pull/16953" target="_blank" rel="noopener noreferrer">#16953</a>) to spare
other developers the same exercise. The key configurations we found:</p>
<ul>
<li>Small integers <code>{1, 2}</code> → array mode</li>
<li>Two integer columns with 1500 distinct values each (combined cardinality
exceeds the 2M array threshold) → normalized-key mode</li>
<li><code>ARRAY(INTEGER)</code> keys (complex types don't support value IDs) → hash mode</li>
</ul>
<p>We verified each variant independently by reverting the fix for one code path
at a time and confirming the test failed.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="step-5-landing-the-fix">Step 5: Landing the fix<a href="https://velox-lib.io/blog/debugging-flaky-ci-across-repos#step-5-landing-the-fix" class="hash-link" aria-label="Direct link to Step 5: Landing the fix" title="Direct link to Step 5: Landing the fix">​</a></h2>
<p>With the fix validated in Axiom CI, we created two PRs:</p>
<ul>
<li><strong>Velox PR</strong> (<a href="https://github.com/facebookincubator/velox/pull/16949" target="_blank" rel="noopener noreferrer">#16949</a>):
The fix, test, documentation updates, and a new <code>hashtable.hashMode</code> runtime
stat.</li>
<li><strong>Axiom PR</strong> (<a href="https://github.com/facebookincubator/axiom/pull/1175" target="_blank" rel="noopener noreferrer">#1175</a>):
Update the Velox submodule and stop skipping set operation tests in macOS
debug CI (they were excluded via <code>GTEST_FILTER="-SqlTest.set_*"</code> as a
workaround).</li>
</ul>
<p>After the Velox PR merges, the Axiom submodule will be updated to point back to
<code>facebookincubator/velox</code>.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="why-only-macos-debug">Why only macOS debug?<a href="https://velox-lib.io/blog/debugging-flaky-ci-across-repos#why-only-macos-debug" class="hash-link" aria-label="Direct link to Why only macOS debug?" title="Direct link to Why only macOS debug?">​</a></h2>
<p>Honestly, we don't fully know.</p>
<p>The bug requires splits with overlapping keys to land on different build
drivers. The test uses 4 drivers and 3 data splits (one per input vector).
Velox assigns splits to drivers on demand — whichever driver is ready first
grabs the next split. If all splits happen to be processed by the same driver,
there is no merge and no bug.</p>
<p>Note that this is not a race condition. Given the same split-to-driver
assignment, the result is deterministic and reproducibly wrong. The
non-determinism is purely in which driver grabs which split, which depends on
thread scheduling.</p>
<p>What we can't explain is why macOS debug CI triggers this while the same macOS
debug build locally does not. Same OS, same build type, same compiler, same
test data, same number of drivers. Yet the thread scheduling differs enough
to change split distribution. Differences in CPU, load, or runtime environment
on the CI runner may play a role.</p>
<p>If you have insights into what could cause such a difference in thread
scheduling between local and CI macOS environments, please comment on
<a href="https://github.com/facebookincubator/axiom/issues/1170" target="_blank" rel="noopener noreferrer">axiom#1170</a>.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="takeaways">Takeaways<a href="https://velox-lib.io/blog/debugging-flaky-ci-across-repos#takeaways" class="hash-link" aria-label="Direct link to Takeaways" title="Direct link to Takeaways">​</a></h2>
<ul>
<li><strong>Don't dismiss "flaky" tests.</strong> A test that sometimes passes and sometimes
fails is not necessarily a bad test. In this case, the test was correct — the
production code was buggy. The flakiness was the only signal that something
was wrong. We initially excluded the set tests via <code>GTEST_FILTER</code> to land the
macOS CI work, but came back to investigate right away. Leaving them disabled
would have hidden a real bug in Velox affecting all users of EXCEPT ALL and
INTERSECT ALL.</li>
<li><strong>Design each CI run for maximum signal.</strong> When you get 3-4 iterations per
hour, you can't afford exploratory logging. Form a hypothesis first, design
logging to confirm or reject it, validate the logging locally, then push.</li>
<li><strong>Run flaky tests multiple times.</strong> Not to distinguish flaky from broken, but
to ensure you reliably trigger the failure together with your debug logging
in a single CI run.</li>
<li><strong>Use downstream CI to iterate on dependency fixes.</strong> When a bug lives in a
dependency, you can debug and validate fixes without landing anything. Point
the submodule at a fork branch and let downstream CI run end-to-end. See
Step 2 above for details.</li>
</ul>]]></content:encoded>
            <category>tech-blog</category>
            <category>debugging</category>
        </item>
        <item>
            <title><![CDATA[Adaptive Per-Function CPU Time Tracking]]></title>
            <link>https://velox-lib.io/blog/velox-adaptive-cpu-sampling</link>
            <guid>https://velox-lib.io/blog/velox-adaptive-cpu-sampling</guid>
            <pubDate>Thu, 26 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Context]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="context">Context<a href="https://velox-lib.io/blog/velox-adaptive-cpu-sampling#context" class="hash-link" aria-label="Direct link to Context" title="Direct link to Context">​</a></h2>
<p>Velox evaluates SQL expressions as trees of functions. A query like
<code>if(array_gte(a, b), multiply(x, y), 0)</code> compiles into a tree where each node
processes an entire vector of rows at a time. When a query runs slowly, the
first question usually is: which function is consuming the most CPU? Is it the
expensive array comparison, or the cheap arithmetic called millions of times?
This problem is even more prominent in use cases like training data loading,
where very long and deeply nested expression trees are common and jobs may run
for many hours or even days; in such cases, the CPU usage of even seemingly
short-lived functions can add up to substantial overhead. Without a detailed
per-function CPU usage breakdown, you may be left guessing — or worse,
optimizing the wrong thing.</p>
<p>Velox supports CPU usage tracking through the <code>expression.track_cpu_usage</code> query
config. When enabled, every function invocation is wrapped in a <code>CpuWallTimer</code>
— an RAII guard that makes the <code>clock_gettime()</code> system call at entry and exit,
recording the CPU and wall time spent in each function. This gives precise,
per-function cost attribution across the entire expression tree.</p>
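<p>The RAII pattern described above can be sketched like this (names are illustrative, not Velox's actual <code>CpuWallTimer</code> API): read the thread CPU clock at construction and destruction, and add the delta to an accumulator owned by the expression node.</p>

```cpp
#include <cstdint>
#include <ctime>

// Per-function accumulator (illustrative).
struct CpuStats {
  int64_t cpuNanos = 0;
};

// RAII guard: times the enclosing scope and charges it to `stats`.
class ScopedCpuTimer {
 public:
  explicit ScopedCpuTimer(CpuStats& stats) : stats_(stats) {
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start_);
  }
  ~ScopedCpuTimer() {
    timespec end;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);
    stats_.cpuNanos += (end.tv_sec - start_.tv_sec) * 1'000'000'000LL +
        (end.tv_nsec - start_.tv_nsec);
  }

 private:
  CpuStats& stats_;
  timespec start_;
};
```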
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-overhead-problem">The Overhead Problem<a href="https://velox-lib.io/blog/velox-adaptive-cpu-sampling#the-overhead-problem" class="hash-link" aria-label="Direct link to The Overhead Problem" title="Direct link to The Overhead Problem">​</a></h2>
<p>That precision comes at a price. Each <code>CpuWallTimer</code> invocation performs a heap
allocation, two <code>clock_gettime()</code> syscalls, and a heap deallocation — roughly
5μs of fixed overhead, depending on the platform. For a function like
<code>array_gte()</code>, which spends milliseconds per vector iterating over array
elements, 5μs is negligible noise. But for a function like <code>multiply</code>, whose
work amounts to a single instruction per row, the timer takes longer than the
work itself:</p>
<table><thead><tr><th>Vector Size</th><th>No Tracking</th><th>Full Tracking</th><th>Overhead</th></tr></thead><tbody><tr><td>100 rows</td><td>2.13ms</td><td>5.89ms</td><td><strong>2.8x</strong></td></tr><tr><td>1,000 rows</td><td>300μs</td><td>668μs</td><td><strong>2.2x</strong></td></tr><tr><td>10,000 rows</td><td>193μs</td><td>233μs</td><td>1.2x</td></tr></tbody></table>
<p><em>Benchmark for <code>multiply(a, b)</code> on double columns. Times are per 10K
iterations.</em></p>
<p>This overhead forced production systems to disable per-function CPU tracking
entirely, losing visibility into more granular CPU consumption. Selective
per-function tracking (<code>expression.track_cpu_usage_for_functions</code>) was added to
Velox in the past to let operators opt in specific functions, but even for
tracked functions the overhead remained — you just chose which functions to pay
the cost for.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="why-uniform-sampling-falls-short">Why Uniform Sampling Falls Short<a href="https://velox-lib.io/blog/velox-adaptive-cpu-sampling#why-uniform-sampling-falls-short" class="hash-link" aria-label="Direct link to Why Uniform Sampling Falls Short" title="Direct link to Why Uniform Sampling Falls Short">​</a></h2>
<p>The obvious next step is to sample: measure every Nth vector instead of every
one, and extrapolate. A uniform rate applied to all functions would be simple —
one counter, one config key. But a single rate cannot serve both cheap and
expensive functions.</p>
<p>Consider <code>array_gte</code>, where the timer overhead is less than 0.1% of execution
time. Sampling in this case throws away measurement precision for no benefit —
the function can afford full tracking. Now consider <code>multiply</code> on small vectors,
where the timer is 2500% of the function cost. Even a 1/1000 rate may not be
aggressive enough. A rate that suits one function is either too imprecise or
too costly for another.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="adaptive-per-function-sampling">Adaptive Per-Function Sampling<a href="https://velox-lib.io/blog/velox-adaptive-cpu-sampling#adaptive-per-function-sampling" class="hash-link" aria-label="Direct link to Adaptive Per-Function Sampling" title="Direct link to Adaptive Per-Function Sampling">​</a></h2>
<p>When a query starts running, adaptive sampling does not yet know which functions
are cheap and which are expensive. So it watches.</p>
<p>The first vector through each function is a warmup — caches and branch
predictors settle, and the system measures the cost of the <code>CpuWallTimer</code> itself
by creating and destroying a dummy one. Over the next 5 vectors, each function
runs with a lightweight stopwatch instead, quietly accumulating how long it
actually takes without any timer overhead.</p>
<p>By the sixth vector, the system has what it needs: the ratio of timer overhead
to function cost. If the ratio is within a configurable threshold (default 1%),
the function switches to full tracking: every future vector gets a
<code>CpuWallTimer</code>, with zero information loss. If the timer dominates, the function
switches to sampling at a rate that bounds overhead to the threshold:</p>
<table><thead><tr><th>Function</th><th>Per-Vector Cost</th><th>Timer Overhead</th><th>Overhead Ratio</th><th>Decision</th></tr></thead><tbody><tr><td><code>multiply(a,b)</code> 100 rows</td><td>~0.2μs</td><td>~5μs</td><td>2500%</td><td>Sample 1/2500</td></tr><tr><td><code>multiply(a,b)</code> 10K rows</td><td>~20μs</td><td>~5μs</td><td>25%</td><td>Sample 1/25</td></tr><tr><td><code>array_gte(a,b)</code> 100 rows</td><td>~3ms</td><td>~5μs</td><td>0.17%</td><td>Always track</td></tr><tr><td><code>array_gte(a,b)</code> 10K rows</td><td>~12ms</td><td>~5μs</td><td>0.04%</td><td>Always track</td></tr></tbody></table>
<p>Expensive functions get exact tracking. Cheap functions sample aggressively. No
per-function configuration needed — just a single "max acceptable overhead"
knob.</p>
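<p>The calibration decision in the table above can be sketched as follows (the names and nanosecond inputs are illustrative, not Velox's actual code):</p>

```cpp
#include <cmath>
#include <cstdint>

struct TrackingDecision {
  bool alwaysTrack;
  int64_t sampleInterval; // time 1 of every `sampleInterval` vectors
};

// Given measured per-vector cost and measured timer overhead, either track
// every vector or derive a sampling interval that keeps the amortized timer
// cost within the configured threshold.
TrackingDecision decideTracking(
    double perVectorCostNanos,
    double timerOverheadNanos,
    double maxOverheadPct) {
  double overheadRatioPct = 100.0 * timerOverheadNanos / perVectorCostNanos;
  if (overheadRatioPct <= maxOverheadPct) {
    return {true, 1};
  }
  // Timing 1 of every N vectors costs timerOverhead/N per vector on average;
  // pick N so that this amortized cost stays within the threshold.
  int64_t interval =
      static_cast<int64_t>(std::ceil(overheadRatioPct / maxOverheadPct));
  return {false, interval};
}
```

<p>Plugging in the numbers from the table: a ~0.2μs <code>multiply</code> batch with a ~5μs timer yields a 2500% ratio and a 1/2500 sampling interval, while a ~3ms <code>array_gte</code> batch stays in full tracking.</p>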
<p>When stats are read, sampled functions have their CPU and wall time scaled up by
the ratio of total vectors to timed vectors. To handle short queries where the
sampling counter might never fire, the first post-calibration vector is always
timed, guaranteeing at least one measurement to extrapolate from.</p>
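<p>The scale-up itself is simple arithmetic (a sketch, not Velox's code):</p>

```cpp
#include <cstdint>

// Extrapolate a sampled measurement: if only `timedVectors` out of
// `totalVectors` were timed, scale the measured time accordingly.
int64_t extrapolateNanos(
    int64_t measuredNanos,
    int64_t totalVectors,
    int64_t timedVectors) {
  return measuredNanos * totalVectors / timedVectors;
}
```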
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="results">Results<a href="https://velox-lib.io/blog/velox-adaptive-cpu-sampling#results" class="hash-link" aria-label="Direct link to Results" title="Direct link to Results">​</a></h2>
<p>So does it work? We benchmarked adaptive sampling against both no tracking and
full tracking, across cheap and expensive functions, at different vector sizes,
and with different adaptive rates (1% and 0.5%). Results are presented below:</p>
<table><thead><tr><th>Benchmark</th><th>No Tracking</th><th>Full Tracking</th><th>Adaptive 1%</th><th>Adaptive 0.5%</th></tr></thead><tbody><tr><td>multiply 100 rows</td><td>2.13ms</td><td>5.86ms (<strong>36%</strong>)</td><td>2.19ms (<strong>97%</strong>)</td><td>2.16ms (<strong>98%</strong>)</td></tr><tr><td>multiply 1K rows</td><td>299μs</td><td>671μs (<strong>44%</strong>)</td><td>304μs (<strong>98%</strong>)</td><td>302μs (<strong>99%</strong>)</td></tr><tr><td>multiply 10K rows</td><td>193μs</td><td>224μs (<strong>86%</strong>)</td><td>165μs (<strong>117%</strong>)</td><td>193μs (<strong>100%</strong>)</td></tr><tr><td>array_gte 100 rows</td><td>188ms</td><td>193ms (<strong>98%</strong>)</td><td>189ms (<strong>100%</strong>)</td><td>187ms (<strong>100%</strong>)</td></tr><tr><td>array_gte 1K rows</td><td>32ms</td><td>32ms (<strong>100%</strong>)</td><td>32ms (<strong>100%</strong>)</td><td>32ms (<strong>100%</strong>)</td></tr><tr><td>array_gte 10K rows</td><td>11ms</td><td>11ms (<strong>100%</strong>)</td><td>11ms (<strong>100%</strong>)</td><td>11ms (<strong>100%</strong>)</td></tr></tbody></table>
<p><em>(Percentages are relative to "No Tracking" — higher is better)</em></p>
<p>For <code>multiply</code> on 100-row vectors, overhead drops from 64% to under 3%. For
<code>array_gte</code>, overhead was already negligible and remains so — the function
ends up fully tracked rather than sampled.</p>
<p>The natural concern with sampling is accuracy. We ran a correctness benchmark
comparing the extrapolated CPU time from adaptive sampling against the true
value from full tracking over 1,000 vectors:</p>
<table><thead><tr><th>Function</th><th>100 rows</th><th>1,000 rows</th><th>10,000 rows</th></tr></thead><tbody><tr><td>multiply</td><td>108.8%</td><td>102.4%</td><td>103.7%</td></tr><tr><td>array_gte</td><td>99.8%</td><td>100.3%</td><td>99.6%</td></tr></tbody></table>
<p>Estimates land within 1–9% of the true value. For expensive functions that end
up in <em>always track</em> mode, accuracy is exact — every vector is timed.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="configuration">Configuration<a href="https://velox-lib.io/blog/velox-adaptive-cpu-sampling#configuration" class="hash-link" aria-label="Direct link to Configuration" title="Direct link to Configuration">​</a></h2>
<p>Two config properties control adaptive sampling:</p>
<table><thead><tr><th>Config</th><th>Type</th><th>Default</th><th>Description</th></tr></thead><tbody><tr><td><code>expression.adaptive_cpu_sampling</code></td><td>bool</td><td>false</td><td>Enable adaptive per-function sampling</td></tr><tr><td><code>expression.adaptive_cpu_sampling_max_overhead_pct</code></td><td>double</td><td>1.0</td><td>Target max overhead percentage</td></tr></tbody></table>
<p>Adaptive sampling interacts cleanly with existing tracking modes.
<code>expression.track_cpu_usage</code> (global tracking for all functions) takes highest
priority and always wins. <code>expression.track_cpu_usage_for_functions</code> (selective
per-function tracking) overrides adaptive for named functions. Adaptive sampling
applies to everything else.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="takeaway">Takeaway<a href="https://velox-lib.io/blog/velox-adaptive-cpu-sampling#takeaway" class="hash-link" aria-label="Direct link to Takeaway" title="Direct link to Takeaway">​</a></h2>
<p>Adaptive per-function CPU time tracking eliminates the tradeoff between
observability and performance. Every function gets the best tracking mode for
its cost profile — cheap functions run at near-native speed with minimal
overhead, expensive functions get exact tracking at negligible additional cost. The
system is self-tuning and requires no per-function configuration.</p>]]></content:encoded>
            <category>tech-blog</category>
            <category>expressions</category>
        </item>
        <item>
            <title><![CDATA[Accelerating Unicode string processing with SIMD in Velox]]></title>
            <link>https://velox-lib.io/blog/simd-capped-unicode-length</link>
            <guid>https://velox-lib.io/blog/simd-capped-unicode-length</guid>
            <pubDate>Fri, 13 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[TL;DR]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="tldr">TL;DR<a href="https://velox-lib.io/blog/simd-capped-unicode-length#tldr" class="hash-link" aria-label="Direct link to TL;DR" title="Direct link to TL;DR">​</a></h2>
<p>We optimized two Unicode string helpers — <code>cappedLengthUnicode</code> and
<code>cappedByteLengthUnicode</code> — by replacing byte-by-byte <code>utf8proc_char_length</code>
calls with a SIMD-based scanning loop. The new implementation processes
register-width blocks at a time: pure-ASCII blocks skip in one step, while
mixed blocks use bitmask arithmetic to count character starts. Both helpers now
share a single parameterized template, eliminating code duplication.</p>
<p>On a comprehensive benchmark matrix covering string lengths from 4 to 1024
bytes and ASCII ratios from 0% to 100%, we measured <strong>2–15× speedups</strong> across
most configurations, with no regressions on Unicode-heavy inputs. The
optimization benefits all callers of these helpers, including the Iceberg
truncate transform and various string functions.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="why-simd-why-here">Why SIMD, why here<a href="https://velox-lib.io/blog/simd-capped-unicode-length#why-simd-why-here" class="hash-link" aria-label="Direct link to Why SIMD, why here" title="Direct link to Why SIMD, why here">​</a></h2>
<p>Velox is a high-performance execution engine, and SIMD is one of the primary
ways we deliver that performance across compute-heavy paths. This is not a
one-off micro-optimization; it is a principle we apply consistently when the
data and algorithms make SIMD a natural fit. String processing is one of those
places: it is ubiquitous in query execution and often dominated by byte-wise
work that maps well to vectorized scanning.</p>
<p>In this post, we describe how we applied this principle to two Unicode helpers
that sit on many string processing paths, and how we balanced speed,
correctness, and maintainability.</p>
<p>The old implementation called <code>utf8proc_char_length</code> on every UTF-8
character. That is robust, but inefficient for long ASCII segments, where every
character is exactly one byte long.</p>
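<p>The core idea — counting character starts a register-width block at a time instead of decoding characters one by one — can be sketched portably with a 64-bit SWAR loop. This is an illustration of the technique, not the Velox implementation, and it ignores the <code>maxChars</code> cap for brevity. Every byte that is not a UTF-8 continuation byte (bit pattern <code>10xxxxxx</code>) starts a character:</p>

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Count UTF-8 character starts: total bytes minus continuation bytes.
size_t countUtf8CharStarts(const std::string& s) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
  const size_t n = s.size();
  const uint64_t kLowBits = 0x0101010101010101ULL;
  size_t continuations = 0;
  size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    uint64_t block;
    std::memcpy(&block, p + i, 8);
    if ((block & 0x8080808080808080ULL) == 0) {
      continue; // pure-ASCII block: no continuation bytes, skip in one step
    }
    // A byte is a continuation byte iff its top two bits are '10':
    // bit 7 set and bit 6 clear. Extract each byte's bit 7 and bit 6
    // into the low bit of its lane, then popcount.
    uint64_t bit7 = (block >> 7) & kLowBits;
    uint64_t bit6 = (block >> 6) & kLowBits;
    continuations += __builtin_popcountll(bit7 & ~bit6);
  }
  for (; i < n; ++i) { // scalar tail for the last < 8 bytes
    if ((p[i] & 0xC0) == 0x80) {
      ++continuations;
    }
  }
  return n - continuations;
}
```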
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="what-these-helpers-do">What these helpers do<a href="https://velox-lib.io/blog/simd-capped-unicode-length#what-these-helpers-do" class="hash-link" aria-label="Direct link to What these helpers do" title="Direct link to What these helpers do">​</a></h2>
<ul>
<li><strong><code>cappedLengthUnicode</code></strong> returns the number of UTF-8 characters in a
string, stopping early once <code>maxChars</code> is reached.</li>
<li><strong><code>cappedByteLengthUnicode</code></strong> returns the <em>byte position</em> of the Nth
character, also capped by <code>maxChars</code>. This is the key primitive for
truncating strings by character count without re-parsing the entire string.</li>
</ul>
<p>Consider a UTF-8 string that mixes multi-byte characters and ASCII:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">input = "你好abc世界"  // 15 bytes, 7 characters</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>With <code>maxChars = 3</code>:</p>
<ul>
<li><code>cappedLengthUnicode</code> returns <code>3</code>.</li>
<li><code>cappedByteLengthUnicode</code> returns <code>7</code> (the byte position after the third
character, <code>"a"</code>: 3 + 3 + 1 bytes).</li>
</ul>
<p>Both need to handle mixed ASCII and multi-byte UTF-8 inputs, invalid bytes,
and early-exit behavior. These helpers are called on the Unicode code path —
the path used when the input is <em>not</em> known to be pure ASCII at the call
site. They are hot in several places, most notably the
<a href="https://iceberg.apache.org/spec/#truncate-transform-details">Iceberg truncate transform</a>,
which must find the byte position of the Nth character to truncate partition
keys. When we profiled the Iceberg write benchmark, <code>cappedByteLengthUnicode</code>
stood out as a clear hot spot.</p>
<p>Byte-by-byte UTF-8 scanning becomes a bottleneck on large, ASCII-heavy
strings, but we cannot assume pure ASCII because inputs may mix ASCII and
multi-byte characters.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-simd-design">The SIMD design<a href="https://velox-lib.io/blog/simd-capped-unicode-length#the-simd-design" class="hash-link" aria-label="Direct link to The SIMD design" title="Direct link to The SIMD design">​</a></h2>
<p>We introduced a shared helper, <code>detail::cappedUnicodeImpl</code>, and parameterized
it by whether it should return characters or bytes. The key idea is to process
the string in SIMD-width blocks using
<a href="https://github.com/xtensor-stack/xsimd">xsimd</a>:</p>
<ol>
<li><strong>ASCII fast path.</strong> Load a register-width block and check whether any
byte has its high bit set. If not, the whole block is ASCII. We can skip the
entire block and advance the counters in one step.</li>
<li><strong>Mixed UTF-8 blocks.</strong> For a block containing non-ASCII bytes, we count
character starts using <code>(byte &amp; 0xC0) != 0x80</code>, convert the predicate into a
bitmask, and use <code>popcount</code> to get the number of UTF-8 characters in the
block.</li>
<li><strong>Finding the Nth character.</strong> For <code>cappedByteLengthUnicode</code>, if the
<code>maxChars</code> target falls inside a mixed block, we use <code>ctz</code> on the bitmask to
locate the Nth character start and add a small inline helper
(<code>getUtf8CharLength</code>) to compute the byte length of that UTF-8 character.</li>
<li><strong>Remainder handling.</strong> For the tail, we keep two loops:<!-- -->
<ul>
<li>A non-ASCII remainder loop that avoids ASCII fast-path mispredicts.</li>
<li>A branching ASCII loop that is cheaper than the branchless version for
short strings.</li>
</ul>
</li>
</ol>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-journey-what-we-tried-and-learned">The journey: what we tried and learned<a href="https://velox-lib.io/blog/simd-capped-unicode-length#the-journey-what-we-tried-and-learned" class="hash-link" aria-label="Direct link to The journey: what we tried and learned" title="Direct link to The journey: what we tried and learned">​</a></h2>
<p>The final implementation emerged through several rounds of iteration and
experimentation. Here are the key insights from that process.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="optimize-at-the-core-not-the-call-site">Optimize at the core, not the call site<a href="https://velox-lib.io/blog/simd-capped-unicode-length#optimize-at-the-core-not-the-call-site" class="hash-link" aria-label="Direct link to Optimize at the core, not the call site" title="Direct link to Optimize at the core, not the call site">​</a></h3>
<p>The initial version added the SIMD fast path directly in the Iceberg truncate
function. But the real bottleneck was in <code>cappedByteLengthUnicode</code> itself.
Moving the optimization into the shared helper meant every call site benefits
— not just Iceberg.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="avoid-switching-between-code-paths">Avoid switching between code paths<a href="https://velox-lib.io/blog/simd-capped-unicode-length#avoid-switching-between-code-paths" class="hash-link" aria-label="Direct link to Avoid switching between code paths" title="Direct link to Avoid switching between code paths">​</a></h3>
<p>The first SIMD version only accelerated the leading ASCII prefix: it scanned
SIMD blocks until it hit the first non-ASCII byte, then fell back to scalar
for the rest of the string. That provides no benefit when the first character is
non-ASCII, even if everything after it is ASCII. The obvious fix — re-entering the SIMD path
after each non-ASCII block — caused a ~20% regression on mixed and
Unicode-heavy inputs due to the overhead of switching between SIMD and scalar
modes. The solution was a single SIMD loop that handles <em>both</em> ASCII and mixed
blocks, avoiding frequent code-path switches and extra branching.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="unify-the-two-helpers">Unify the two helpers<a href="https://velox-lib.io/blog/simd-capped-unicode-length#unify-the-two-helpers" class="hash-link" aria-label="Direct link to Unify the two helpers" title="Direct link to Unify the two helpers">​</a></h3>
<p><code>cappedLengthUnicode</code> and <code>cappedByteLengthUnicode</code> have nearly identical
logic — the only difference is what they return. Factoring them into a shared
<code>detail::cappedUnicodeImpl&lt;returnByteLength&gt;</code> template eliminates the code
duplication and keeps both helpers in sync.</p>
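<p>In outline, the unified helper looks something like this (a scalar-only sketch with hypothetical wrapper names; the real <code>detail::cappedUnicodeImpl</code> also contains the SIMD loop):</p>

```cpp
#include <cstddef>
#include <cstdint>

// Simplified sketch of the shared implementation: one template, two thin
// wrappers that differ only in what they return.
template <bool kReturnByteLength>
static size_t cappedUnicodeImpl(const char* data, size_t size, size_t maxChars) {
  size_t chars = 0;
  size_t pos = 0;
  while (pos < size && chars < maxChars) {
    uint8_t c = static_cast<uint8_t>(data[pos]);
    // Byte length of this character from its lead byte (1 for invalid bytes).
    size_t len = c < 0x80 ? 1
        : (c & 0xE0) == 0xC0 ? 2
        : (c & 0xF0) == 0xE0 ? 3
        : (c & 0xF8) == 0xF0 ? 4
        : 1;
    pos += len;
    ++chars;
  }
  if constexpr (kReturnByteLength) {
    return pos < size ? pos : size;  // clamp if the last character is truncated
  } else {
    return chars;
  }
}

static size_t cappedLengthUnicodeSketch(const char* d, size_t n, size_t max) {
  return cappedUnicodeImpl<false>(d, n, max);
}

static size_t cappedByteLengthUnicodeSketch(const char* d, size_t n, size_t max) {
  return cappedUnicodeImpl<true>(d, n, max);
}
```

On the earlier example, the two wrappers return <code>3</code> and <code>7</code> respectively for <code>maxChars = 3</code>.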
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="short-strings-need-special-care">Short strings need special care<a href="https://velox-lib.io/blog/simd-capped-unicode-length#short-strings-need-special-care" class="hash-link" aria-label="Direct link to Short strings need special care" title="Direct link to Short strings need special care">​</a></h3>
<p>After the SIMD path was generalized, the expanded benchmark matrix revealed
regressions on very short strings (4 bytes). The fix was simple: skip the
SIMD loop entirely when the string is shorter than one register width. Short
strings go straight to the scalar path, which avoids the overhead of loading
and testing a partial SIMD register.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="a-branchless-getutf8charlength">A branchless <code>getUtf8CharLength</code><a href="https://velox-lib.io/blog/simd-capped-unicode-length#a-branchless-getutf8charlength" class="hash-link" aria-label="Direct link to a-branchless-getutf8charlength" title="Direct link to a-branchless-getutf8charlength">​</a></h3>
<p>The scalar remainder loop originally called <code>utf8proc_char_length</code>, which
involves several branches. A branchless alternative using <code>std::countl_zero</code>
compiles to a handful of instructions:</p>
<div class="language-cpp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-cpp codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">auto</span><span class="token plain"> n </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token function" style="color:#d73a49">countl_zero</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">(</span><span class="token operator" style="color:#393A34">~</span><span class="token generic-function function" style="color:#d73a49">static_cast</span><span class="token generic-function generic class-name operator" style="color:#393A34">&lt;</span><span class="token generic-function generic class-name keyword" style="color:#00009f">uint32_t</span><span class="token generic-function generic class-name operator" style="color:#393A34">&gt;</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">c</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">&lt;&lt;</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">24</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" 
style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">n </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">||</span><span class="token plain"> n </span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">4</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">?</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> n</span><span class="token punctuation" style="color:#393A34">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>This improved performance on the short-string and remainder paths.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="confidence-through-comprehensive-testing">Confidence through comprehensive testing<a href="https://velox-lib.io/blog/simd-capped-unicode-length#confidence-through-comprehensive-testing" class="hash-link" aria-label="Direct link to Confidence through comprehensive testing" title="Direct link to Confidence through comprehensive testing">​</a></h3>
<p>Since <code>cappedLengthUnicode</code> is widely used across Velox (unlike
<code>cappedByteLengthUnicode</code>, which is limited to Iceberg and Spark), the change
needed high confidence. The PR added tests for multi-byte UTF-8 sequences that
straddle SIMD boundaries, continuation bytes in the scalar tail, and strings
that are exactly one register width.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="benchmark-results">Benchmark results<a href="https://velox-lib.io/blog/simd-capped-unicode-length#benchmark-results" class="hash-link" aria-label="Direct link to Benchmark results" title="Direct link to Benchmark results">​</a></h2>
<p>We added <code>StringCoreBenchmark.cpp</code> to measure both helpers across a matrix of
string lengths (4, 16, 64, 256, 1024 bytes) and ASCII ratios (0%, 1%, 25%,
50%, 75%, 99%, 100%). Each benchmark processes 10,000 strings with
<code>maxChars = 128</code>.</p>
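<p>For a sense of how such inputs can be built, here is a deterministic generator sketch (our illustration, not the benchmark's actual code):</p>

```cpp
#include <cstddef>
#include <string>

// Sketch of a deterministic input generator: roughly asciiRatioPercent of the
// numChars characters are 1-byte ASCII, the rest a 3-byte UTF-8 character,
// interleaved evenly (Bresenham-style) rather than randomly.
static std::string makeInput(size_t numChars, size_t asciiRatioPercent) {
  const std::string multiByte = "\xE4\xB8\x96";  // '世', 3 bytes
  std::string out;
  for (size_t i = 0; i < numChars; ++i) {
    bool ascii =
        (i + 1) * asciiRatioPercent / 100 > i * asciiRatioPercent / 100;
    out += ascii ? "a" : multiByte;
  }
  return out;
}
```

Deterministic interleaving keeps runs comparable across configurations, which matters when the fast path depends on how ASCII and multi-byte characters are mixed.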
<p>ASCII ratio indicates the percentage of characters in each string that are
single-byte ASCII; the remaining characters are multi-byte UTF-8. The
following table highlights representative speedups from the full matrix
(measured on an Apple M3):</p>
<table><thead><tr><th>String Length</th><th>ASCII Ratio</th><th><code>cappedByteLengthUnicode</code></th><th><code>cappedLengthUnicode</code></th></tr></thead><tbody><tr><td>16</td><td>50%</td><td><strong>+9.7×</strong></td><td><strong>+7.7×</strong></td></tr><tr><td>64</td><td>50%</td><td><strong>+15.3×</strong></td><td><strong>+11.3×</strong></td></tr><tr><td>64</td><td>100%</td><td><strong>+6.3×</strong></td><td><strong>+7.3×</strong></td></tr><tr><td>256</td><td>25%</td><td><strong>+9.2×</strong></td><td><strong>+6.3×</strong></td></tr><tr><td>256</td><td>100%</td><td><strong>+9.0×</strong></td><td><strong>+7.2×</strong></td></tr><tr><td>1024</td><td>50%</td><td><strong>+6.1×</strong></td><td><strong>+7.4×</strong></td></tr><tr><td>1024</td><td>99%</td><td><strong>+2.3×</strong></td><td><strong>+2.4×</strong></td></tr><tr><td>1024</td><td>100%</td><td><strong>+4.9×</strong></td><td><strong>+7.6×</strong></td></tr><tr><td>4</td><td>0% (Unicode-only)</td><td>−1.25×</td><td>−1.08×</td></tr><tr><td>4</td><td>100% (short ASCII)</td><td>−1.07×</td><td>−1.02×</td></tr></tbody></table>
<p>The small regressions on 4-byte strings are within noise for such tiny
inputs. For strings ≥ 16 bytes — which dominate real workloads — speedups
range from 2× to over 15×, with the largest gains on mixed-content strings
where the SIMD loop handles both ASCII and non-ASCII blocks without falling
back to scalar.</p>
<p>We also validated the results on an x86_64 server (Intel Xeon Skylake with
AVX-512) and observed similar improvement patterns. A full set of raw numbers
is available in the
<a href="https://gist.github.com/PingLiuPing/889ec7d8cdffbf9081cb1af75e7f102f">perf data gist</a>
and summarized in the
<a href="https://github.com/facebookincubator/velox/pull/16428">PR</a>.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="takeaways">Takeaways<a href="https://velox-lib.io/blog/simd-capped-unicode-length#takeaways" class="hash-link" aria-label="Direct link to Takeaways" title="Direct link to Takeaways">​</a></h2>
<ul>
<li><strong>Measure first, optimize second.</strong> The optimization started from a real
hot spot found in the Iceberg write benchmark — not from speculation.</li>
<li><strong>Optimize the core, not the symptom.</strong> Rather than patching the call site,
we pushed the improvement into the shared helper so that all callers benefit.</li>
<li><strong>SIMD is a first-class tool</strong> in Velox for performance-critical paths.
String processing — ubiquitous in query execution and dominated by
byte-wise work — is a natural fit for vectorized scanning.</li>
<li><strong>Avoid switching between code paths.</strong> Re-entering the SIMD path after every
non-ASCII block introduced extra branching and switching overhead; a unified
loop that handles both ASCII and mixed blocks was consistently faster.</li>
<li><strong>Branchless is not always faster.</strong> A single branchless remainder loop was
slower than two tailored loops on short strings.</li>
<li><strong>Avoid duplication.</strong> The two helpers share a single parameterized
implementation, keeping behavior consistent and reducing maintenance burden.</li>
<li><strong>Data-driven iteration.</strong> We tried multiple strategies and compared them
against a realistic benchmark matrix. We kept the version that improved the
common case without regressing Unicode-heavy workloads.</li>
<li><strong>Comprehensive benchmarks</strong> across different string lengths and ASCII
mixing ratios are essential for validating correctness-sensitive
optimizations.</li>
</ul>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="acknowledgements">Acknowledgements<a href="https://velox-lib.io/blog/simd-capped-unicode-length#acknowledgements" class="hash-link" aria-label="Direct link to Acknowledgements" title="Direct link to Acknowledgements">​</a></h2>
<p>Special thanks to <a href="https://github.com/mbasmanova">Masha Basmanova</a> and
<a href="https://github.com/Yuhta">Yuhta</a> for their guidance and detailed
code reviews that shaped the SIMD design and pushed for comprehensive
benchmarking.</p>]]></content:encoded>
            <category>tech-blog</category>
            <category>performance</category>
        </item>
        <item>
            <title><![CDATA[The hidden traps of regex in LIKE and split]]></title>
            <link>https://velox-lib.io/blog/regex-hidden-traps</link>
            <guid>https://velox-lib.io/blog/regex-hidden-traps</guid>
            <pubDate>Sat, 07 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[SQL functions sometimes use regular expressions under the hood in ways that]]></description>
            <content:encoded><![CDATA[<p>SQL functions sometimes use regular expressions under the hood in ways that
surprise users. Two common examples are the
<a href="https://facebookincubator.github.io/velox/functions/presto/regexp.html#like">LIKE</a>
operator and Spark's
<a href="https://facebookincubator.github.io/velox/functions/spark/string.html#split">split</a>
function.</p>
<p>Both LIKE and Spark's split can silently produce wrong results and waste CPU
when used with column values instead of constants. Understanding why this
happens helps write faster, more correct queries — and helps engine developers
make better design choices.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="like-is-not-contains">LIKE is not contains<a href="https://velox-lib.io/blog/regex-hidden-traps#like-is-not-contains" class="hash-link" aria-label="Direct link to LIKE is not contains" title="Direct link to LIKE is not contains">​</a></h2>
<p>A very common query pattern is to check whether one string contains another:</p>
<div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">SELECT</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">*</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">FROM</span><span class="token plain"> t </span><span class="token keyword" style="color:#00009f">WHERE</span><span class="token plain"> name </span><span class="token operator" style="color:#393A34">LIKE</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'%'</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">||</span><span class="token plain"> search_term </span><span class="token operator" style="color:#393A34">||</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'%'</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>This looks intuitive: wrap <code>search_term</code> in <code>%</code> wildcards and you get a
"contains" check. But LIKE is <strong>not</strong> the same as substring matching.
LIKE treats <code>_</code> as a single-character wildcard and <code>%</code> as a multi-character
wildcard. If <code>search_term</code> comes from a column and contains these characters,
the results are silently wrong:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">SELECT url,</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       url LIKE '%' || search_term || '%' AS like_result,</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       strpos(url, search_term) &gt; 0 AS contains_result</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">FROM (VALUES</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    ('https://site.com/home'),</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    ('https://site.com/user_profile'),</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    ('https://site.com/username')</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">) AS t(url)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">CROSS JOIN (VALUES ('user_')) AS s(search_term);</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              url              | like_result | contains_result</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">-------------------------------+-------------+----------------</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">https://site.com/home          |    false    |     false</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">https://site.com/user_profile  |    true     |     true</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">https://site.com/username      |    true     |     false</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p><code>LIKE '%user_%'</code> matches <code>'https://site.com/username'</code> because <code>_</code> is a
wildcard that matches any single character — in this case, <code>n</code>. But
<code>strpos(url, 'user_') &gt; 0</code> treats <code>_</code> as a literal underscore and correctly
reports that <code>'https://site.com/username'</code> does not contain the substring
<code>'user_'</code>.</p>
<p>When the pattern is a constant, this distinction is visible and intentional.
But when users write <code>x LIKE '%' || y || '%'</code> where <code>y</code> is a column, the
values of <code>y</code> may contain <code>_</code> or <code>%</code> characters — and they will be silently
interpreted as wildcards, producing wrong results.</p>
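<p>To make the wildcard semantics concrete, here is a toy LIKE matcher (illustration only, not Velox's implementation):</p>

```cpp
#include <string>

// Toy LIKE matcher, for illustration only: '%' matches any (possibly empty)
// run of characters, '_' matches exactly one character.
static bool likeMatch(
    const std::string& s, const std::string& p, size_t si = 0, size_t pi = 0) {
  if (pi == p.size()) {
    return si == s.size();  // pattern exhausted: match iff string is too
  }
  if (p[pi] == '%') {
    // '%' can absorb any number of characters, including zero.
    for (size_t k = si; k <= s.size(); ++k) {
      if (likeMatch(s, p, k, pi + 1)) {
        return true;
      }
    }
    return false;
  }
  if (si == s.size()) {
    return false;
  }
  if (p[pi] == '_' || p[pi] == s[si]) {
    return likeMatch(s, p, si + 1, pi + 1);
  }
  return false;
}
// likeMatch("https://site.com/username", "%user_%") is true because '_'
// matches 'n', while a substring search for "user_" finds nothing there.
```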
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="sparks-split-treats-delimiters-as-regex">Spark's split treats delimiters as regex<a href="https://velox-lib.io/blog/regex-hidden-traps#sparks-split-treats-delimiters-as-regex" class="hash-link" aria-label="Direct link to Spark's split treats delimiters as regex" title="Direct link to Spark's split treats delimiters as regex">​</a></h2>
<p>In Presto, the <a href="https://facebookincubator.github.io/velox/functions/presto/string.html#split">split</a>
function takes a literal string delimiter, while
<a href="https://facebookincubator.github.io/velox/functions/presto/regexp.html#regexp_split">regexp_split</a>
is a separate function for regex-based splitting. This distinction makes the intent clear.</p>
<p>Spark's <a href="https://facebookincubator.github.io/velox/functions/spark/string.html#split">split</a>
function, however, always treats the delimiter as a regular expression.
Users rarely realize this, and a common pattern is to split a string using a
value from another column:</p>
<div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">select</span><span class="token plain"> split</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">dir_path</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> location_path</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">[</span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> partition_name </span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> t</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>Here, a table stores Hive partition metadata: <code>dir_path</code> is the full partition
path (e.g., <code>/data/warehouse/db.name/table/ds=2024-01-01</code>) and <code>location_path</code>
is the table path (e.g., <code>/data/warehouse/db.name/table</code>). The user wants to
strip the table path prefix to get the partition name.</p>
<p>This works for simple paths. But <code>location_path</code> is interpreted as a regular
expression, not a literal string. If it contains <code>.</code> — as in <code>db.name</code> — the
<code>.</code> matches <strong>any character</strong>, not a literal dot. Characters like <code>(</code>, <code>)</code>,
<code>[</code>, <code>+</code>, <code>*</code>, <code>?</code>, and <code>$</code> would also cause wrong results or errors.</p>
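<p>The effect is easy to reproduce with any regex engine; the sketch below uses <code>std::regex</code> purely for illustration:</p>

```cpp
#include <regex>
#include <string>

// Using a path fragment as a regex: '.' in "db.name" matches ANY character,
// so the delimiter also matches paths that contain no literal dot at all.
static bool delimiterMatches(const std::string& text, const std::string& delim) {
  return std::regex_search(text, std::regex(delim));
}
```

Escaping the dot (<code>db\.name</code>) restores literal matching, which is exactly what a regex-based split silently fails to do for column-supplied delimiters.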
<p>A correct alternative that also executes faster uses simple string operations:</p>
<div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">IF</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">starts_with</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">dir_path</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> location_path</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   substr</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">dir_path</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> length</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">location_path</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">+</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">2</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> partition_name</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span 
class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>This is a bit more verbose than <code>split(dir_path, location_path)[1]</code>, but it is
correct for all inputs and avoids regex compilation entirely.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="performance-trap">Performance trap<a href="https://velox-lib.io/blog/regex-hidden-traps#performance-trap" class="hash-link" aria-label="Direct link to Performance trap" title="Direct link to Performance trap">​</a></h2>
<p>Beyond correctness, there is a performance problem. Both LIKE and Spark's
split use <a href="https://github.com/google/re2">RE2</a> as the regex
engine. RE2 is fast and safe, but compiling a regular expression can take up
to 200x more CPU time than evaluating it.</p>
<p>When the pattern or delimiter is a constant, the regex is compiled once and
reused for every row. The cost is negligible. But when the pattern comes from
a column, a new regex may need to be compiled for every distinct value. A table
with thousands of distinct <code>location_path</code> values means thousands of regex
compilations — each one expensive and none of them necessary.</p>
<p>Velox limits the number of compiled regular expressions per function instance
per thread of execution via the
<a href="https://facebookincubator.github.io/velox/configs.html#expression-evaluation-configuration">expression.max_compiled_regexes</a>
configuration property (default: 100). When this limit is reached, the query fails with an
error.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="tempting-but-wrong-fix">Tempting but wrong fix<a href="https://velox-lib.io/blog/regex-hidden-traps#tempting-but-wrong-fix" class="hash-link" aria-label="Direct link to Tempting but wrong fix" title="Direct link to Tempting but wrong fix">​</a></h2>
<p>When users hit this limit, the natural reaction is to ask the engine developers
to raise or eliminate the cap. A recent
<a href="https://github.com/facebookincubator/velox/pull/15953">pull request</a>
proposed replacing the fixed-size cache with an evicting cache: when the limit
is reached, the oldest compiled regex is evicted to make room for the new one.</p>
<p>This sounds reasonable, and the motivation is understandable — users migrating
from Spark don't want to rewrite working queries. But it makes things worse:</p>
<ul>
<li><strong>It hides the correctness bug.</strong> The query no longer fails, so users never
discover that their LIKE pattern or split delimiter is being interpreted as
a regex and producing wrong results for inputs with special characters.</li>
<li><strong>It makes the performance problem worse.</strong> With thousands of distinct
patterns, the cache churns constantly — evicting one compiled regex only to
compile another. The query runs, but dramatically slower than necessary, and
the user has no indication why. In shared multi-tenant clusters, a single
slow query like this can consume excessive CPU and affect other users'
workloads.</li>
</ul>
<p>The error is a feature, not a bug. It is an early warning that catches misuse
before it leads to silently wrong results in production and prevents a single
query from wasting shared cluster resources.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="right-fix">Right fix<a href="https://velox-lib.io/blog/regex-hidden-traps#right-fix" class="hash-link" aria-label="Direct link to Right fix" title="Direct link to Right fix">​</a></h2>
<p><strong>For users:</strong> replace LIKE with literal string operations when checking for
substrings. Use <code>strpos(x, y) &gt; 0</code> or <code>contains(x, y)</code> instead of
<code>x LIKE '%' || y || '%'</code>. For Spark's split with literal delimiters, use
<code>substr</code> or other <a href="https://facebookincubator.github.io/velox/functions/spark/string.html">string functions</a> that don't involve regex.</p>
<p><strong>For engine developers:</strong> optimize the functions to avoid regex when it isn't
needed. Velox's LIKE implementation already does this. As described in
<a href="https://github.com/xumingming">James Xu</a>'s
<a href="https://velox-lib.io/blog/like">earlier blog post</a>, the engine analyzes each pattern
and uses fast paths — prefix match, suffix match, substring search — whenever
the pattern contains only regular characters and <code>_</code> wildcards. For simple patterns, this gives up to 750x speedup over regex.
Regex is compiled only for patterns that truly require it, and these optimized
patterns are not counted toward the compiled regex limit.</p>
<p>The same approach should be applied to Spark's split function. The engine can
check whether the delimiter contains any regex metacharacters. If it doesn't,
a simple string search can be used instead of compiling a regex. This would
make queries like <code>split(dir_path, location_path)</code> both fast and correct —
without users needing to change anything and without removing the safety net
for cases that genuinely require regex.</p>
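<p>As an illustration of that fast path (a sketch only, not Velox's implementation; the names are made up), the engine-side check could scan the delimiter for regex metacharacters and, if none are present, split with plain substring search:</p>

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Returns true if 's' contains any character commonly treated as a regex
// metacharacter. (Illustrative list; a real implementation would mirror
// the engine's regex syntax exactly.)
bool hasRegexMetacharacters(std::string_view s) {
  constexpr std::string_view kMeta = R"(\^$.|?*+()[]{})";
  return s.find_first_of(kMeta) != std::string_view::npos;
}

// Literal split by plain substring search: no regex compilation at all.
std::vector<std::string> splitLiteral(std::string_view input,
                                      std::string_view delim) {
  std::vector<std::string> out;
  if (delim.empty()) {  // guard: an empty delimiter would loop forever
    out.emplace_back(input);
    return out;
  }
  std::size_t pos = 0;
  while (true) {
    const std::size_t next = input.find(delim, pos);
    if (next == std::string_view::npos) {
      out.emplace_back(input.substr(pos));
      break;
    }
    out.emplace_back(input.substr(pos, next - pos));
    pos = next + delim.size();
  }
  return out;
}
```

<p>With this check in place, a delimiter like <code>s3://bucket/table/</code> never reaches the regex engine, while delimiters that genuinely use regex syntax still do.</p>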
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="takeaways">Takeaways<a href="https://velox-lib.io/blog/regex-hidden-traps#takeaways" class="hash-link" aria-label="Direct link to Takeaways" title="Direct link to Takeaways">​</a></h2>
<ul>
<li><code>LIKE</code> is not <code>contains</code>. The <code>_</code> and <code>%</code> wildcards can silently corrupt
results when the pattern comes from a column.</li>
<li>Spark's <code>split</code> treats delimiters as regex. Characters like <code>.</code> in column
values are interpreted as regex metacharacters, not literal characters.
Presto avoids this by separating <code>split</code> (literal) and <code>regexp_split</code> (regex).</li>
<li>When a query hits the compiled regex limit, the right response is to fix the
query, not to raise the limit.</li>
<li>Engine developers should optimize functions to avoid regex when the input
is a plain string, rather than making it easier to misuse regex at scale.</li>
</ul>]]></content:encoded>
            <category>tech-blog</category>
            <category>functions</category>
        </item>
        <item>
            <title><![CDATA[velox::StringView API Changes and Best Practices]]></title>
            <link>https://velox-lib.io/blog/stringview-api-changes</link>
            <guid>https://velox-lib.io/blog/stringview-api-changes</guid>
            <pubDate>Tue, 10 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Context]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="context">Context<a href="https://velox-lib.io/blog/stringview-api-changes#context" class="hash-link" aria-label="Direct link to Context" title="Direct link to Context">​</a></h2>
<p>Strings are ubiquitous in large-scale analytic query processing. From identifiers, names, labels,
and structured data (like JSON/XML) to plain descriptive text, such as a product description or the
contents of this very blog post, there is hardly a SQL query that does not require manipulating
string data.</p>
<p>This post describes in more detail how Velox handles columns of strings, the low-level C++ APIs
involved and some recent changes made to them, and presents best practices for string usage
throughout Velox's codebase.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="stringview">StringView<a href="https://velox-lib.io/blog/stringview-api-changes#stringview" class="hash-link" aria-label="Direct link to StringView" title="Direct link to StringView">​</a></h2>
<p>To efficiently enable string operations, Velox provides a specialized physical C++ type, called
<code>velox::StringView</code>. The main purpose of this layout, also called a <em>German String Layout</em>, is to
optimize common string operations over small strings, or operations that can be performed by only
accessing a string's prefix.</p>
<p>StringViews are 16-byte objects that provide a non-owning string-like API that refers to data stored
elsewhere in a columnar Velox Vector buffer. The lifetime of a <code>velox::StringView</code> is always tied to
the lifetime of the underlying Velox Vector buffer, in the sense that it only provides a pointer
(<em>view</em>) into the external buffer. It provides a similar abstraction to <code>std::string_view</code>, with
the following additional optimizations:</p>
<ul>
<li>
<p><strong>Small String Optimization (SSO):</strong> strings up to 12 characters are always stored inline, within
the object itself. Since strings are often small in practice, this avoids dereferencing the larger
data buffer, reducing cache misses and memory bus traffic.</p>
</li>
<li>
<p><strong>Prefix optimization:</strong> a 4-character prefix is always stored inline within the StringView object
so that failed comparisons can be short-circuited, also skipping accesses to the external buffer.</p>
</li>
</ul>
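<p>For readers curious about the shape of this layout, here is a simplified sketch (field names and the round-trip helper are illustrative; this is not Velox's actual <code>StringView</code>): a 16-byte object with a 4-byte size, a 4-byte always-inline prefix, and either 8 more inline bytes or a pointer to out-of-line data:</p>

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>

// Simplified 16-byte German-string layout. Strings up to 12 bytes are
// stored fully inline (prefix + inlined); longer strings keep only the
// 4-byte prefix inline plus a non-owning pointer to external data.
struct MiniStringView {
  uint32_t size;
  char prefix[4];     // always-inline prefix for fast failed comparisons
  union {
    char inlined[8];  // with prefix[4], 12 inline bytes total
    const char* data; // long strings: pointer into an external buffer
  };

  MiniStringView(const char* s, uint32_t n) : size(n) {
    std::memset(prefix, 0, sizeof(prefix));
    if (n <= 12) {
      std::memset(inlined, 0, sizeof(inlined));
      std::memcpy(prefix, s, std::min<uint32_t>(n, 4));
      if (n > 4) std::memcpy(inlined, s + 4, n - 4);
    } else {
      std::memcpy(prefix, s, 4);
      data = s;  // non-owning: lifetime tied to the external buffer
    }
  }

  bool isInline() const { return size <= 12; }

  std::string str() const {
    if (isInline()) {
      std::string out(prefix, std::min<uint32_t>(size, 4));
      if (size > 4) out.append(inlined, size - 4);
      return out;
    }
    return std::string(data, size);
  }
};
static_assert(sizeof(MiniStringView) == 16, "fits in two machine words");

// Demonstration helper: build a view over 's' and read it back.
inline std::string roundTrip(const std::string& s) {
  return MiniStringView(s.data(), static_cast<uint32_t>(s.size())).str();
}
```

<p>Note how the long-string case stores only a pointer, which is why the lifetime of a real <code>velox::StringView</code> is tied to its underlying Vector buffer.</p>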
<p>To enable better interoperability with other engines, a few years ago
<a href="https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/" target="_blank" rel="noopener noreferrer">we collaborated with the Arrow community</a>
in what became the
<a href="https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout" target="_blank" rel="noopener noreferrer">BinaryView Arrow format</a>
for string data. Ever since, we have seen increased adoption of this format in a series of
<a href="https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-1/" target="_blank" rel="noopener noreferrer">other systems and libraries</a>.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-old-unsafe-api">The Old Unsafe API<a href="https://velox-lib.io/blog/stringview-api-changes#the-old-unsafe-api" class="hash-link" aria-label="Direct link to The Old Unsafe API" title="Direct link to The Old Unsafe API">​</a></h2>
<p>For convenience, Velox used to allow developers to implicitly convert <code>velox::StringView</code> into
<code>std::string</code>, since many APIs are built around <code>std::string</code>. For example, one could naively do:</p>
<div class="language-cpp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-cpp codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic">// Given this signature.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">void</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">myFunction</span><span class="token punctuation" style="color:#393A34">(</span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token plain">string</span><span class="token operator" style="color:#393A34">&amp;</span><span class="token plain"> input</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic">// ...</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" 
style="color:#999988;font-style:italic">// All would silently result in a string copy</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic">// (valueAt() returns a velox::StringView):</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token function" style="color:#d73a49">myFunction</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">stringVector</span><span class="token operator" style="color:#393A34">-&gt;</span><span class="token function" style="color:#d73a49">valueAt</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token plain">string myStr </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> stringVector</span><span class="token operator" style="color:#393A34">-&gt;</span><span class="token function" style="color:#d73a49">valueAt</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">)</span><span class="token 
punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token plain">unordered_map</span><span class="token operator" style="color:#393A34">&lt;</span><span class="token plain">std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token plain">string</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> size_t</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> myMap</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">myMap</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">stringVector</span><span class="token operator" style="color:#393A34">-&gt;</span><span class="token function" style="color:#d73a49">valueAt</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path 
fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>While these usages seem harmless, they would all result in a full string copy, and potentially a
memory allocation, depending on the string size. We have found this anti-pattern to be commonly
used; in most cases, the developer did not intend for a copy to be performed, and this behavior
silently added unnecessary overhead.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="what-changed">What Changed<a href="https://velox-lib.io/blog/stringview-api-changes#what-changed" class="hash-link" aria-label="Direct link to What Changed" title="Direct link to What Changed">​</a></h2>
<p>To prevent such inadvertent string copies, in
<a href="https://github.com/facebookincubator/velox/pull/15946" target="_blank" rel="noopener noreferrer">#15946</a> we removed the implicit conversion from
<code>velox::StringView</code> to <code>std::string</code>. You can still convert if needed, but you now have to state it
explicitly, for clarity:</p>
<div class="language-cpp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-cpp codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic">// Compilation failure from now on:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token function" style="color:#d73a49">myFunction</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">stringVector</span><span class="token operator" style="color:#393A34">-&gt;</span><span class="token function" style="color:#d73a49">valueAt</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic">// Ok:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token function" style="color:#d73a49">myFunction</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token function" style="color:#d73a49">string</span><span class="token 
punctuation" style="color:#393A34">(</span><span class="token plain">stringVector</span><span class="token operator" style="color:#393A34">-&gt;</span><span class="token function" style="color:#d73a49">valueAt</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>Instead, conversion from <code>velox::StringView</code> to <code>std::string_view</code> is now available
implicitly. <code>std::string_view</code> is non-owning, so constructing one from a pointer and a size
is essentially free.</p>
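<p>The resulting conversion rules can be sketched with a toy class (not Velox's actual code): implicit to <code>std::string_view</code> because it is free, explicit to <code>std::string</code> because it copies, and both disabled on rvalues so a view can never outlive a temporary:</p>

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Toy class illustrating the conversion rules described in this post.
class SketchStringView {
 public:
  SketchStringView(const char* data, std::size_t size)
      : data_(data), size_(size) {}

  // Implicit and cheap: wraps the existing pointer and size.
  // The '&' ref-qualifier restricts the conversion to lvalues.
  operator std::string_view() const& { return {data_, size_}; }
  operator std::string_view() const&& = delete;

  // Explicit: the caller clearly requests a copy.
  explicit operator std::string() const& { return {data_, size_}; }
  explicit operator std::string() const&& = delete;

 private:
  const char* data_;
  std::size_t size_;
};

// Usage: both conversions from an lvalue compile; the same conversions on
// a temporary SketchStringView would be compilation errors.
inline std::string demo() {
  SketchStringView v("velox", 5);
  std::string_view sv = v;            // implicit, non-owning
  std::string copy = std::string(v);  // explicit copy
  return copy + "/" + std::string(sv);
}
```

<p>The deleted rvalue overloads are what turn the dangling-view mistake into a compile-time error instead of a runtime bug.</p>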
<p>A series of PRs was landed before this change to clean up dependencies on the old behavior
throughout Velox's codebase, making the conversions explicit where needed.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="follystringpiece">folly::StringPiece<a href="https://velox-lib.io/blog/stringview-api-changes#follystringpiece" class="hash-link" aria-label="Direct link to folly::StringPiece" title="Direct link to folly::StringPiece">​</a></h2>
<p><code>folly::StringPiece</code> is now superseded by <code>std::string_view</code>. As part of this work, we have also
cleaned up and removed every usage of <code>folly::StringPiece</code> in Velox; <strong>do not</strong> use
<code>folly::StringPiece</code> in future Velox code. Use <code>std::string_view</code> or <code>velox::StringView</code> instead.</p>
<p>If you need to use a library that takes a <code>folly::StringPiece</code>, for example, <code>folly::parseJson()</code>,
you can pass a <code>std::string_view</code> instead, since the <code>std::string_view</code> =&gt; <code>folly::StringPiece</code>
conversion can be done implicitly.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-rvalue-gotcha">The RValue Gotcha<a href="https://velox-lib.io/blog/stringview-api-changes#the-rvalue-gotcha" class="hash-link" aria-label="Direct link to The RValue Gotcha" title="Direct link to The RValue Gotcha">​</a></h2>
<p>A slightly unintuitive gotcha of <code>velox::StringView</code> is the handling of rvalues and temporary
values. Since <code>velox::StringView</code> may store small strings inline, taking a pointer to the string
contents of a temporary is dangerous: if the string happens to be small and inlined and the
<code>velox::StringView</code> object is destroyed, you end up with a dangling pointer. For example:</p>
<div class="language-cpp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-cpp codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic">// Unsafe: would result in a std::string_view pointing to</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic">// the temporary object that gets destroyed right away.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token plain">string_view sv </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> stringVector</span><span class="token operator" style="color:#393A34">-&gt;</span><span class="token function" style="color:#d73a49">valueAt</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token function" style="color:#d73a49">LOG</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">INFO</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">&lt;&lt;</span><span class="token plain"> </span><span class="token string" 
style="color:#e3116c">"Happily would have crashed: "</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">&lt;&lt;</span><span class="token plain"> sv</span><span class="token punctuation" style="color:#393A34">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>To prevent these unsafe usages, the rvalue overloads of these conversions, both explicit and
implicit, and for both <code>std::string</code> and <code>std::string_view</code>, are deleted. This means that the
snippet above will produce a compilation error.</p>
<p>The safe pattern is to create a non-temporary object on the stack:</p>
<div class="language-cpp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-cpp codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">velox</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token plain">StringView sv1 </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> stringVector</span><span class="token operator" style="color:#393A34">-&gt;</span><span class="token function" style="color:#d73a49">valueAt</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token plain">string_view sv2 </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> sv1</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"> </span><span class="token comment" style="color:#999988;font-style:italic">// all good, as long as sv2 is not used after</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                             </span><span class="token comment" style="color:#999988;font-style:italic">// sv1 is destroyed.</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span 
class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="best-practices">Best Practices<a href="https://velox-lib.io/blog/stringview-api-changes#best-practices" class="hash-link" aria-label="Direct link to Best Practices" title="Direct link to Best Practices">​</a></h2>
<p>To summarize, from now on, the conversions:</p>
<ul>
<li><code>velox::StringView</code> =&gt; <code>std::string</code>: won't be done implicitly anymore. This now needs to be
explicitly stated, for clarity.</li>
<li><code>velox::StringView</code> =&gt; <code>std::string_view</code>: can be done implicitly.</li>
<li><code>velox::StringView</code> =&gt; <code>folly::StringPiece</code>: do not use. If needed for legacy external libraries,
pass a <code>std::string_view</code> instead.</li>
<li>rvalue <code>velox::StringView&amp;&amp;</code> =&gt; any other type: compilation error to minimize risk of dangling
pointer.</li>
</ul>
<p>Please reach out on <a href="https://velox-oss.slack.com/" target="_blank" rel="noopener noreferrer">Slack</a> if you have any questions.</p>]]></content:encoded>
            <category>tech-blog</category>
            <category>strings</category>
            <category>api</category>
        </item>
        <item>
            <title><![CDATA[Task Barrier: Efficient Task Reuse and Streaming Checkpoints in Velox]]></title>
            <link>https://velox-lib.io/blog/task-barrier</link>
            <guid>https://velox-lib.io/blog/task-barrier</guid>
            <pubDate>Tue, 06 Jan 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[TL;DR]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="tldr">TL;DR<a href="https://velox-lib.io/blog/task-barrier#tldr" class="hash-link" aria-label="Direct link to TL;DR" title="Direct link to TL;DR">​</a></h2>
<p>Velox Task Barriers provide a synchronization mechanism that not only enables efficient task reuse,
important for workloads such as <a href="https://engineering.fb.com/2022/09/19/ml-applications/data-ingestion-machine-learning-training-meta/" target="_blank" rel="noopener noreferrer">AI training data loading</a>,
but also delivers the strict sequencing and checkpointing semantics required for streaming workloads.</p>
<p>By injecting a barrier split, users guarantee that no subsequent data is processed until the entire
DAG is flushed and the synchronization signal is unblocked. This capability serves two critical
patterns:</p>
<ol>
<li>
<p><strong>Task Reuse</strong>: Eliminates the overhead of repeated task initialization and teardown by safely
reconfiguring warm tasks for new queries. This is a recurring pattern in AI training data
loading workloads.</p>
</li>
<li>
<p><strong>Streaming Processing</strong>: Enables continuous data handling with consistent checkpoints, allowing
stateful operators to maintain context across batches without service interruption.</p>
</li>
</ol>
<p>See the <a href="https://facebookincubator.github.io/velox/develop/task-barrier.html" target="_blank" rel="noopener noreferrer">Task Barrier Developer Guide</a> for implementation details.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-problem">The Problem<a href="https://velox-lib.io/blog/task-barrier#the-problem" class="hash-link" aria-label="Direct link to The Problem" title="Direct link to The Problem">​</a></h2>
<p>Creating a new Velox task for every data request introduces overhead:</p>
<ul>
<li>Memory pool initialization and configuration</li>
<li>Pipeline and operator construction</li>
<li>Connector and data source setup</li>
<li>Thread pool registration</li>
</ul>
<p>For high-throughput workloads that process thousands of requests, this overhead may accumulate
and impact overall system performance.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-solution-task-barrier">The Solution: Task Barrier<a href="https://velox-lib.io/blog/task-barrier#the-solution-task-barrier" class="hash-link" aria-label="Direct link to The Solution: Task Barrier" title="Direct link to The Solution: Task Barrier">​</a></h2>
<p>Task Barrier introduces a lightweight synchronization mechanism that:</p>
<ol>
<li>
<p><strong>Drains all in-flight data</strong> - Ensures all buffered data flows through the pipeline to completion.</p>
</li>
<li>
<p><strong>Resets stateful operators</strong> - Clears internal state in operators like aggregations, joins, and sorts.</p>
</li>
<li>
<p><strong>Preserves task infrastructure</strong> - Keeps memory pools, thread registrations, and connector sessions intact.</p>
</li>
</ol>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="how-it-works">How It Works<a href="https://velox-lib.io/blog/task-barrier#how-it-works" class="hash-link" aria-label="Direct link to How It Works" title="Direct link to How It Works">​</a></h3>
<p>The barrier is triggered by a special <code>BarrierSplit</code> injected into the task's source operators via
<code>requestBarrier()</code>. When received:</p>
<ol>
<li>Source operators propagate the barrier signal downstream.</li>
<li>Each operator calls <code>finishDriverBarrier()</code> to flush and reset its state.</li>
<li>Once all drivers complete barrier processing, the task is ready for a new split.</li>
</ol>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="workflow-example-pseudo-code">Workflow Example (Pseudo-code)<a href="https://velox-lib.io/blog/task-barrier#workflow-example-pseudo-code" class="hash-link" aria-label="Direct link to Workflow Example (Pseudo-code)" title="Direct link to Workflow Example (Pseudo-code)">​</a></h3>
<p>The following example demonstrates the usage pattern: the user adds splits, requests a barrier,
and then must continue to drain results until the barrier is reached. This ensures all in-flight
data leaves the pipeline, allowing the barrier to complete.</p>
<div class="language-cpp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-cpp codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic">// 1. Initialize the task infrastructure once</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">auto</span><span class="token plain"> task </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token class-name">Task</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token function" style="color:#d73a49">create</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">config</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> memoryPool</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic">// 2. Loop to process multiple batches using the same task</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">while</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">scheduler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">hasMoreRequests</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic">// A. Get new data splits (e.g., from a file or query)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token keyword" style="color:#00009f">auto</span><span class="token plain"> splits </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> scheduler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">getNextSplits</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic">// B. Add splits to the running task</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  task</span><span class="token operator" style="color:#393A34">-&gt;</span><span class="token function" style="color:#d73a49">addSplits</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">sourcePlanNodeId</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> splits</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic">// C. Request a barrier to mark the end of this batch.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic">// This injects a BarrierSplit that flows behind the data.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic">// Note: Any splits added to the task AFTER this call will be blocked</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic">// and will not be processed until this barrier is fully cleared.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token keyword" style="color:#00009f">auto</span><span class="token plain"> barrierFuture </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> task</span><span class="token operator" style="color:#393A34">-&gt;</span><span class="token function" style="color:#d73a49">requestBarrier</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic">// D. Drain the pipeline until the barrier is passed.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic">// Critical: We must keep pulling results so the task can flush</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic">// its buffers and reach the barrier.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token keyword" style="color:#00009f">while</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token operator" style="color:#393A34">!</span><span class="token plain">barrierFuture</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">isReady</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">auto</span><span class="token plain"> result </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> task</span><span class="token operator" style="color:#393A34">-&gt;</span><span class="token function" style="color:#d73a49">next</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"> </span><span class="token comment" style="color:#999988;font-style:italic">// Get next result batch</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">result </span><span class="token operator" style="color:#393A34">!=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">nullptr</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      userOutput</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">consume</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">result</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic">// E. Barrier reached.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic">// The task has been reset (operators cleared, buffers flushed).</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic">// It is now ready for the next iteration immediately.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="operator-support">Operator Support<a href="https://velox-lib.io/blog/task-barrier#operator-support" class="hash-link" aria-label="Direct link to Operator Support" title="Direct link to Operator Support">​</a></h3>
<p>Task Barrier requires operators to implement <code>finishDriverBarrier()</code> to properly reset their state:</p>
<ul>
<li><strong>TableScan</strong>: Clears current split and reader state.</li>
<li><strong>HashBuild/HashProbe</strong>: Resets hash tables and probe state.</li>
<li><strong>Aggregation</strong>: Flushes partial aggregates and clears accumulators.</li>
<li><strong>OrderBy</strong>: Outputs sorted data and resets sort buffer.</li>
<li><strong>Exchange</strong>: Drains exchange queues.</li>
</ul>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="performance-impact">Performance Impact<a href="https://velox-lib.io/blog/task-barrier#performance-impact" class="hash-link" aria-label="Direct link to Performance Impact" title="Direct link to Performance Impact">​</a></h2>
<p>By eliminating repeated task creation overhead, Task Barrier provides significant performance
improvements for workloads with:</p>
<ul>
<li><strong>Short-lived Queries</strong>: Specifically those that recur frequently with the exact same plan DAG,
avoiding the need for repeated reconstruction.</li>
<li><strong>AI/ML Pipelines</strong>: High-frequency data loading requests where task initialization would
otherwise dominate execution time.</li>
<li><strong>Iterative Processing</strong>: Queries that can benefit from "warm" caches and pre-allocated memory pools.</li>
</ul>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="learn-more">Learn More<a href="https://velox-lib.io/blog/task-barrier#learn-more" class="hash-link" aria-label="Direct link to Learn More" title="Direct link to Learn More">​</a></h2>
<p>For detailed implementation information, API reference, and operator-specific behavior, see the
<a href="https://facebookincubator.github.io/velox/develop/task-barrier.html" target="_blank" rel="noopener noreferrer">Task Barrier</a> documentation.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="references">References<a href="https://velox-lib.io/blog/task-barrier#references" class="hash-link" aria-label="Direct link to References" title="Direct link to References">​</a></h2>
<ul>
<li><a href="https://engineering.fb.com/2022/09/19/ml-applications/data-ingestion-machine-learning-training-meta/" target="_blank" rel="noopener noreferrer">Data Ingestion for Machine Learning Training at Meta</a></li>
<li><a href="https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/stateful-stream-processing/" target="_blank" rel="noopener noreferrer">Flink Stateful Streaming Processing</a></li>
<li><a href="https://facebookincubator.github.io/velox/develop/task.html" target="_blank" rel="noopener noreferrer">Velox Task Model</a></li>
<li><a href="https://facebookincubator.github.io/velox/develop/task-barrier.html" target="_blank" rel="noopener noreferrer">Velox Task Barrier</a></li>
</ul>]]></content:encoded>
            <category>tech-blog</category>
            <category>task</category>
            <category>execution</category>
        </item>
        <item>
            <title><![CDATA[Why Sort is row-based in Velox — A Quantitative Assessment]]></title>
            <link>https://velox-lib.io/blog/why-row-based-sort</link>
            <guid>https://velox-lib.io/blog/why-row-based-sort</guid>
            <pubDate>Wed, 24 Dec 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[TL;DR]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="tldr">TL;DR<a href="https://velox-lib.io/blog/why-row-based-sort#tldr" class="hash-link" aria-label="Direct link to TL;DR" title="Direct link to TL;DR">​</a></h2>
<p>Velox is a fully vectorized execution engine[1]. Its internal columnar memory layout enhances cache
locality, exposes more inter-instruction parallelism to CPUs, and enables the use of SIMD instructions,
significantly accelerating large-scale query processing.</p>
<p>However, some operators in Velox utilize a hybrid layout, where datasets can be temporarily converted
to a row-oriented format. The <code>OrderBy</code> operator is one example, where our implementation first
materializes the input vectors into rows, containing both sort keys and payload columns, sorts them, and
converts the rows back to vectors.</p>
<p>In this article, we explain the rationale behind this design decision and provide experimental evidence
for its implementation. We show a prototype of a hybrid sorting strategy that materializes only the
sort-key columns, reducing the overhead of materializing payload columns. Contrary to expectations, the
end-to-end performance did not improve—in fact, it was even up to <strong>3×</strong> slower. We present the two
variants and discuss why one is counter-intuitively faster than the other.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="row-based-vs-non-materialized">Row-based vs. Non-Materialized<a href="https://velox-lib.io/blog/why-row-based-sort#row-based-vs-non-materialized" class="hash-link" aria-label="Direct link to Row-based vs. Non-Materialized" title="Direct link to Row-based vs. Non-Materialized">​</a></h2>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="row-based-sort">Row-based Sort<a href="https://velox-lib.io/blog/why-row-based-sort#row-based-sort" class="hash-link" aria-label="Direct link to Row-based Sort" title="Direct link to Row-based Sort">​</a></h3>
<p>The <code>OrderBy</code> operator in Velox’s current implementation uses a utility called <code>SortBuffer</code> to perform
the sorting, which consists of three stages:</p>
<ol>
<li>Input Stage: Serializes input Columnar Vectors into a row format, stored in a RowContainer.</li>
<li>Sort Stage: Sorts based on keys within the RowContainer.</li>
<li>Output Stage: Extracts output vectors column by column from the RowContainer in sorted order.</li>
</ol>
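<p>The three stages can be illustrated with a small, self-contained sketch (toy types, not Velox's actual <code>SortBuffer</code>/<code>RowContainer</code> API): columnar input is serialized into rows, the rows are sorted by key, and output columns are extracted from the sorted rows.</p>

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Toy sketch of the three SortBuffer stages (illustrative, not Velox code).
struct Row {
  long key;             // sort key
  std::string payload;  // payload column(s), serialized alongside the key
};

struct Columns {
  std::vector<long> keys;
  std::vector<std::string> payloads;
};

inline Columns rowBasedSort(const Columns& input) {
  // 1. Input stage: serialize the columnar input into rows.
  std::vector<Row> rows;
  rows.reserve(input.keys.size());
  for (size_t i = 0; i < input.keys.size(); ++i) {
    rows.push_back({input.keys[i], input.payloads[i]});
  }
  // 2. Sort stage: sort the rows by key.
  std::sort(rows.begin(), rows.end(),
            [](const Row& a, const Row& b) { return a.key < b.key; });
  // 3. Output stage: extract output columns from the sorted rows.
  Columns out;
  for (const auto& r : rows) {
    out.keys.push_back(r.key);
    out.payloads.push_back(r.payload);
  }
  return out;
}
```

<p>Note that the output stage reads each row once and emits every column from it, which is the locality property the rest of the article turns on.</p>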
<figure><img src="https://velox-lib.io/img/row-sort.png" height="100%" width="100%"></figure>
<p>While row-based sorting is more efficient than column-based sorting[2,3], what if we only
materialize the sort key columns? We could then use the resulting sort indices to gather
the payload data into the output vectors directly. This would save the cost of converting
the payload columns to rows and back again. More importantly, it would allow us to spill
the original vectors directly to disk rather than first converting rows back into vectors
for spilling.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="non-materialized-sort">Non-Materialized Sort<a href="https://velox-lib.io/blog/why-row-based-sort#non-materialized-sort" class="hash-link" aria-label="Direct link to Non-Materialized Sort" title="Direct link to Non-Materialized Sort">​</a></h3>
<p>We have implemented a <a href="https://github.com/facebookincubator/velox/pull/15157" target="_blank" rel="noopener noreferrer">non-materializing sort strategy</a> designed
to improve sorting performance. The approach materializes only the sort key columns and their original vector indices,
which are then used to gather the corresponding rows from the original input vectors into the output vector after the
sort completes. It replaces the <code>SortBuffer</code> with a <code>NonMaterializedSortBuffer</code>, which consists of three stages:</p>
<ol>
<li>Input Stage: Holds each input vector (via its shared pointer) in a list, and serializes the key columns
plus two additional index columns (VectorIndex and RowIndex) into rows stored in a RowContainer.</li>
<li>Sort Stage: Sorts based on keys within the RowContainer.</li>
<li>Output Stage: Extracts the VectorIndex and RowIndex columns and uses them together to gather
the corresponding rows from the original input vectors into the output vector.</li>
</ol>
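<p>A minimal sketch of this strategy (toy types and names, not the PR's actual API): only the keys and <code>(vectorIndex, rowIndex)</code> pairs are materialized and sorted; payloads stay in the original input vectors and are gathered afterwards.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy sketch of the non-materializing strategy (illustrative, not Velox code):
// only (key, vectorIndex, rowIndex) triples are materialized and sorted.
struct KeyEntry {
  long key;
  size_t vectorIndex;  // which input vector the row came from
  size_t rowIndex;     // row position inside that vector
};

// Each input "vector" holds parallel key and payload columns.
struct InputVector {
  std::vector<long> keys;
  std::vector<double> payload;
};

inline std::vector<double> nonMaterializedSort(
    const std::vector<InputVector>& inputs) {
  // Input stage: materialize keys plus the two index columns only.
  std::vector<KeyEntry> entries;
  for (size_t v = 0; v < inputs.size(); ++v) {
    for (size_t r = 0; r < inputs[v].keys.size(); ++r) {
      entries.push_back({inputs[v].keys[r], v, r});
    }
  }
  // Sort stage: sort the lightweight entries by key.
  std::sort(entries.begin(), entries.end(),
            [](const KeyEntry& a, const KeyEntry& b) { return a.key < b.key; });
  // Output stage: gather payloads from the untouched input vectors. Each
  // lookup jumps to an unpredictable (vectorIndex, rowIndex) location -- the
  // random access pattern analyzed later in the article.
  std::vector<double> out;
  out.reserve(entries.size());
  for (const auto& e : entries) {
    out.push_back(inputs[e.vectorIndex].payload[e.rowIndex]);
  }
  return out;
}
```

<p>With a real schema this gather loop runs once per payload column, which is where the access pattern becomes costly.</p>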
<figure><img src="https://velox-lib.io/img/column-sort.png" height="100%" width="100%"></figure>
<p>In theory, this should have significantly reduced the overhead of materializing payload columns,
especially for wide tables, since only sorting keys are materialized. However, the benchmark results
were the exact opposite of our expectations. Despite successfully eliminating expensive serialization
overhead and reducing the total instruction count, the end-to-end performance was up to <strong>3x slower</strong>.
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="benchmark-result">Benchmark Result<a href="https://velox-lib.io/blog/why-row-based-sort#benchmark-result" class="hash-link" aria-label="Direct link to Benchmark Result" title="Direct link to Benchmark Result">​</a></h2>
<p>To validate the effectiveness of our new strategy, we designed a benchmark with a varying number of
payload columns:</p>
<ul>
<li>Inputs: 1000 Input Vectors, 4096 rows per vector.</li>
<li>Number of payload columns: 64, 128, 256.</li>
<li>L2 cache: 80 MiB, L3 cache: 108 MiB.</li>
</ul>
<table><thead><tr><th>numPayloadColumns</th><th>Mode</th><th>Input Time</th><th>Sorting Time</th><th>Output Time</th><th>Total Time</th><th>Desc</th></tr></thead><tbody><tr><td>64</td><td>Row-based</td><td>4.27s</td><td>0.79s</td><td>4.23s</td><td>11.64s</td><td>Row-based is 3.9x faster</td></tr><tr><td></td><td>Columnar</td><td>0.28s</td><td>0.84s</td><td>42.30s</td><td>45.90s</td><td></td></tr><tr><td>128</td><td>Row-based</td><td>20.25s</td><td>1.11s</td><td>5.49s</td><td>31.43s</td><td>Row-based is 2.0x faster</td></tr><tr><td></td><td>Columnar</td><td>0.27s</td><td>0.51s</td><td>59.15s</td><td>64.20s</td><td></td></tr><tr><td>256</td><td>Row-based</td><td>29.34s</td><td>1.02s</td><td>12.85s</td><td>51.48s</td><td>Row-based is 3.0x faster</td></tr><tr><td></td><td>Columnar</td><td>0.87s</td><td>1.10s</td><td>144.00s</td><td>154.80s</td><td></td></tr></tbody></table>
<p>The benchmark results confirm that Row-based Sort is the superior strategy,
delivering a 2.0x to 3.9x overall speedup over Columnar Sort. While Row-based Sort
incurs a significantly higher upfront cost during the Input phase (up to 29.34s), it
maintains a stable and efficient Output phase (at most 12.85s). In contrast, Columnar
Sort suffers severe performance degradation in the Output phase as the payload widens,
with execution times surging from 42.30s to 144.00s, resulting in a much slower total
execution time despite its negligible input overhead.</p>
<p>To identify the root cause of the performance divergence, we utilized <code>perf stat</code> to analyze
micro-architectural efficiency and <code>perf mem</code> to profile memory access patterns during the critical
Output phase.</p>
<table><thead><tr><th>Metrics</th><th>Row-based</th><th>Columnar</th><th>Desc</th></tr></thead><tbody><tr><td>Total Instructions</td><td>555.6 Billion</td><td>475.6 Billion</td><td>Row +17%</td></tr><tr><td>IPC (Instructions Per Cycle)</td><td>2.4</td><td>0.82</td><td>Row 2.9x Higher</td></tr><tr><td>LLC Load Misses (Last Level Cache)</td><td>0.14 Billion</td><td>5.01 Billion</td><td>Columnar 35x Higher</td></tr></tbody></table>
<table><thead><tr><th>Memory Level</th><th>Row-based Output</th><th>Columnar Output</th></tr></thead><tbody><tr><td>RAM Hit</td><td>5.8%</td><td>38.1%</td></tr><tr><td>LFB Hit</td><td>1.7%</td><td>18.9%</td></tr></tbody></table>
<p>The results reveal a stark contrast in CPU utilization. Although the Row-based approach
executes 17% more instructions (due to serialization overhead), it maintains a high IPC of 2.4,
indicating a fully utilized pipeline. In contrast, the Columnar approach suffers from a low IPC
of 0.82, meaning the CPU is stalled for the majority of cycles. This is directly driven by the
35x difference in LLC Load Misses, which forces the Columnar implementation to fetch data from
main memory repeatedly. The memory profile further confirms this bottleneck: Columnar mode is
severely latency-bound, spending 38.1% of its execution time waiting for DRAM (RAM Hit) and
experiencing significant congestion in the Line Fill Buffer (18.9% LFB Hit), while Row-based
mode effectively utilizes the cache hierarchy.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-memory-access-pattern">The Memory Access Pattern<a href="https://velox-lib.io/blog/why-row-based-sort#the-memory-access-pattern" class="hash-link" aria-label="Direct link to The Memory Access Pattern" title="Direct link to The Memory Access Pattern">​</a></h2>
<p>Why does the non-materializing sort, specifically its gather method, cause so many cache misses?
The answer lies in its memory access pattern. Since Velox is a columnar engine, the output is
constructed column by column. For each column in an output vector, the gather process does the following:</p>
<ol>
<li>It iterates through all rows of the current output vector.</li>
<li>For each row, it locates the corresponding input vector via the sorted vector index.</li>
<li>It locates the source row within that input vector.</li>
<li>It copies the data from that single source cell to the target cell.</li>
</ol>
<p>The sorted indices, by nature, offer low predictability. This forces the gather operation for a single
output column to jump unpredictably across many different input vectors, fetching just one
value from each. This random access pattern has two devastating consequences for performance.</p>
<p>First, at the micro-level, every single data read becomes a "long-distance" memory jump.
The CPU's hardware prefetcher is rendered completely ineffective by this chaotic access pattern,
resulting in almost every lookup yielding a cache miss.</p>
<p>Second, at the macro-level, the problem compounds with each column processed. The sheer
volume of data touched potentially exceeds the size of the L3 cache. This ensures
that by the time we start processing the next payload column, the necessary vectors have already
been evicted from the cache. Consequently, the gather process must re-fetch the same vector
metadata and data from main memory over and over again for each of the 256 payload columns.
This results in 256 passes of cache-thrashing, random memory access, leading to a catastrophic
number of cache misses and explaining the severe performance degradation.</p>
<figure><img src="https://velox-lib.io/img/columnar-mem.png" height="100%" width="100%"></figure>
<p>In contrast, Velox’s current row-based approach serializes all input vectors into rows, with
each allocation producing a contiguous buffer that holds a subset of those rows. Despite the
serialization, the row layout preserves strong locality when materializing output
vectors: once rows are in the cache, they can be used to extract multiple output columns.
This leads to much better cache-line utilization and fewer cache misses than a columnar layout,
where each fetched line often yields only a single value per column. Moreover, the largely
sequential scans over contiguous buffers let the hardware prefetcher operate effectively,
boosting throughput even in the presence of serialization overhead.</p>
<figure><img src="https://velox-lib.io/img/row-mem.png" height="100%" width="100%"></figure>
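<p>The contrast can be reproduced in isolation with a toy microbenchmark, independent of Velox: copying the same values sequentially versus through a shuffled index array moves identical bytes but touches very different cache lines. The function name and sizes below are arbitrary choices for illustration.</p>

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Toy microbenchmark (not Velox code): copy n doubles sequentially, then
// through a shuffled index array. Both variants move the same data; only
// the access pattern differs. Once the working set exceeds the last-level
// cache, the gathered copy typically runs several times slower because the
// hardware prefetcher is defeated and most loads miss the cache.
inline std::pair<double, double> copyBench(std::size_t n) {
  std::vector<double> src(n);
  std::iota(src.begin(), src.end(), 0.0);

  std::vector<std::size_t> indices(n);
  std::iota(indices.begin(), indices.end(), std::size_t{0});
  std::shuffle(indices.begin(), indices.end(), std::mt19937_64{42});

  std::vector<double> dst(n);

  auto t0 = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < n; ++i) dst[i] = src[i];  // sequential copy
  auto t1 = std::chrono::steady_clock::now();
  double seqSum = std::accumulate(dst.begin(), dst.end(), 0.0);

  auto t2 = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < n; ++i) dst[i] = src[indices[i]];  // gather
  auto t3 = std::chrono::steady_clock::now();
  double gatherSum = std::accumulate(dst.begin(), dst.end(), 0.0);

  auto us = [](auto a, auto b) {
    return (long long)std::chrono::duration_cast<std::chrono::microseconds>(b - a)
        .count();
  };
  std::printf("n=%zu sequential=%lld us, gathered=%lld us\n",
              n, us(t0, t1), us(t2, t3));
  // Both copies moved the same multiset of values, so the checksums match.
  return {seqSum, gatherSum};
}
```

<p>With <code>n</code> large enough that the arrays exceed the last-level cache, the gathered loop is usually several times slower even though both loops execute the same number of instructions, matching the IPC and LLC-miss picture above.</p>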
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="https://velox-lib.io/blog/why-row-based-sort#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion">​</a></h2>
<p>This study reinforces the core principle of performance engineering: Hardware Sympathy.
Without understanding the characteristics of the memory hierarchy and optimizing for it,
simply reducing the instruction count usually does not guarantee better performance.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="reference">Reference<a href="https://velox-lib.io/blog/why-row-based-sort#reference" class="hash-link" aria-label="Direct link to Reference" title="Direct link to Reference">​</a></h2>
<ul>
<li>[1] <a href="https://velox-lib.io/blog/velox-primer-part-1/" target="_blank" rel="noopener noreferrer">https://velox-lib.io/blog/velox-primer-part-1/</a></li>
<li>[2] <a href="https://duckdb.org/pdf/ICDE2023-kuiper-muehleisen-sorting.pdf" target="_blank" rel="noopener noreferrer">https://duckdb.org/pdf/ICDE2023-kuiper-muehleisen-sorting.pdf</a></li>
<li>[3] <a href="https://duckdb.org/2021/08/27/external-sorting" target="_blank" rel="noopener noreferrer">https://duckdb.org/2021/08/27/external-sorting</a></li>
</ul>]]></content:encoded>
            <category>tech-blog</category>
            <category>sort</category>
            <category>operator</category>
        </item>
        <item>
            <title><![CDATA[Multi-Round Lazy Start Merge]]></title>
            <link>https://velox-lib.io/blog/multi-round-local-merge</link>
            <guid>https://velox-lib.io/blog/multi-round-local-merge</guid>
            <pubDate>Sun, 09 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Background]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="background">Background<a href="https://velox-lib.io/blog/multi-round-local-merge#background" class="hash-link" aria-label="Direct link to Background" title="Direct link to Background">​</a></h2>
<p>Efficiently merging sorted data partitions at scale is crucial for a variety of training data preparation
workloads, especially for Generative Recommenders (GRs), a new paradigm introduced in the paper
<em><a href="https://arxiv.org/pdf/2402.17152" target="_blank" rel="noopener noreferrer">Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations</a></em>.
A key requirement is to merge training data across partitions—for example, merging hourly partitions into
daily ones—while ensuring that all rows sharing the same primary key are stored consecutively. Training data is
typically partitioned and bucketed by primary key, with rows sharing the same key
stored consecutively, so merging across partitions essentially becomes a <a href="https://en.wikipedia.org/wiki/K-way_merge_algorithm#" target="_blank" rel="noopener noreferrer">multi-way merge problem</a>.</p>
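<p>For reference, the classic heap-based k-way merge the article alludes to can be sketched in a few lines (a toy in-memory version over integer keys; real implementations stream from file readers rather than vectors):</p>

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

// Minimal k-way merge over pre-sorted runs, using a min-heap keyed on the
// current head of each run. Because every run is sorted, rows sharing the
// same key come out consecutively in the merged output.
inline std::vector<int> kWayMerge(const std::vector<std::vector<int>>& runs) {
  // (value, runIndex, offset): the heap pops the smallest head first.
  using Head = std::tuple<int, std::size_t, std::size_t>;
  std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;
  for (std::size_t r = 0; r < runs.size(); ++r) {
    if (!runs[r].empty()) heap.emplace(runs[r][0], r, std::size_t{0});
  }
  std::vector<int> out;
  while (!heap.empty()) {
    auto [value, run, offset] = heap.top();
    heap.pop();
    out.push_back(value);
    if (offset + 1 < runs[run].size()) {
      heap.emplace(runs[run][offset + 1], run, offset + 1);  // advance that run
    }
  }
  return out;
}
```

<p>Merging k runs of n total rows costs O(n log k) comparisons, which is why a single streaming merge beats repeatedly sorting spilled fragments.</p>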
<p>Normally, <a href="https://spark.apache.org/" target="_blank" rel="noopener noreferrer">Apache Spark</a> can be used for this sort-merge requirement — for example, via <code>CLUSTER BY</code>.
However, training datasets for a single job can often reach the PB scale, which in turn generates shuffle data at PB scale.
Because we typically apply bucketing and ordering by key when preparing training data in production,
Spark can eliminate the shuffle when merging training data from multiple hourly partitions.
However, each Spark task can only read the files planned from various partitions within a split
sequentially, placing them into the sorter and spilling as needed. Only after all files have been read
does Spark perform a sort-merge of the spilled files. This process produces a large number of small
spill files, which further degrades efficiency.</p>
<p>Moreover, Spark’s spill is row-based with a low compression ratio, resulting in approximately 4 times
amplification compared to the original columnar training data in the data lake. These factors
significantly degrade task stability and performance. Velox has a <code>LocalMerge</code> operator that can be
introduced into Apache Spark via <a href="https://gluten.apache.org/" target="_blank" rel="noopener noreferrer">Gluten</a> or <a href="https://www.youtube.com/watch?v=oq5M2861WaQ" target="_blank" rel="noopener noreferrer">PySpark on Velox</a>.</p>
<p><em>Note: To keep the focus on merging, the remainder of this article also assumes that each partition’s
training data is already sorted by primary key—a common setup in training data pipelines.</em></p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="localmerge-operator">LocalMerge Operator<a href="https://velox-lib.io/blog/multi-round-local-merge#localmerge-operator" class="hash-link" aria-label="Direct link to LocalMerge Operator" title="Direct link to LocalMerge Operator">​</a></h2>
<p>The <code>LocalMerge</code> operator consolidates its sources’ outputs into a single, sorted stream of rows.
It runs single-threaded, while its upstream sources may run multi-threaded within the same task,
producing multiple sorted inputs concurrently. For example, when merging 24 hourly partitions into
a single daily partition (as shown in the figure below), the merge plan fragment is split into two pipelines:</p>
<ul>
<li>Pipeline 0: contains two operators, <code>TableScan</code> and <code>CallbackSink</code>. 24 drivers are instantiated to scan the 24 hourly partitions.</li>
<li>Pipeline 1: contains only a single operator, <code>LocalMerge</code>, with one driver responsible for performing the sort merge.</li>
</ul>
<p>A <code>CallbackSink</code> operator is installed at the end of each driver in Pipeline 0. It pushes the <code>TableScan</code>
operator’s output vectors into the queues backing the merge streams. Inside <code>LocalMerge</code>, a <code>TreeOfLosers</code>
performs a k-way merge over the 24 merge streams supplied by the Pipeline 0 drivers, producing a single,
globally sorted output stream.</p>
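<p>To make the mechanics concrete, here is a minimal, self-contained sketch of a k-way merge (an illustration only, not Velox code — <code>TreeOfLosers</code> uses a loser tree to reduce comparisons, while this sketch uses a simple min-heap; both produce the same single, globally sorted output):</p>

```cpp
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

// Minimal k-way merge sketch. Each "stream" is a pre-sorted vector and a
// min-heap tracks the current head of every stream.
std::vector<int> kWayMerge(const std::vector<std::vector<int>>& streams) {
  // Heap entries: (current value, stream index, offset within stream).
  using Entry = std::tuple<int, size_t, size_t>;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<>> heap;
  for (size_t i = 0; i < streams.size(); ++i) {
    if (!streams[i].empty()) {
      heap.emplace(streams[i][0], i, 0);
    }
  }
  std::vector<int> out;
  while (!heap.empty()) {
    auto [value, stream, offset] = heap.top();
    heap.pop();
    out.push_back(value);
    if (offset + 1 < streams[stream].size()) {
      // Advance the stream the popped value came from.
      heap.emplace(streams[stream][offset + 1], stream, offset + 1);
    }
  }
  return out;
}
```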
<figure><img src="https://velox-lib.io/img/merge.png" height="100%" width="100%"></figure>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="multi-round-spill-merge">Multi-Round Spill Merge<a href="https://velox-lib.io/blog/multi-round-local-merge#multi-round-spill-merge" class="hash-link" aria-label="Direct link to Multi-Round Spill Merge" title="Direct link to Multi-Round Spill Merge">​</a></h2>
<p>Although <code>LocalMerge</code> minimizes comparisons during merging, preserves row-ordering guarantees, and cleanly
isolates the single-threaded merge from the multi-threaded scan phase for predictable performance, it can
cause substantial memory pressure—particularly in training-data pipelines. In these workloads, extremely
wide tables are common, and even after column pruning, thousands of columns may remain.</p>
<p>Moreover, training data is typically stored in <a href="https://www.vldb.org/pvldb/vol17/p148-zeng.pdf" target="_blank" rel="noopener noreferrer">PAX-style formats such as Parquet, ORC, or DRWF</a>.
Using Parquet as an example, the reader often needs to keep at least one page per column in memory. As a result,
simply opening a Parquet file with thousands of columns can consume significant memory even before any
merging occurs. Wide schemas further amplify per-column metadata, dictionary pages, and decompression buffers,
inflating the overall footprint. In addition, the k-way merge must hold input vectors from multiple sources
concurrently, which drives peak memory usage even higher.</p>
<p>To cap memory usage and avoid OOM when merging a large number of partitions, we extend <code>LocalMerge</code> to
process fewer local sources at a time, leverage existing spill facilities to persist intermediate
results, and introduce lazy-start activation for merge inputs. Using the case of merging 24 hourly
partitions into a single daily partition, the process is organized into two phases:</p>
<p><strong>Phase 1</strong></p>
<ol>
<li>Break the scan-and-merge into multiple rounds (e.g., 3 rounds).</li>
<li>In each round, lazily start a limited number of drivers (e.g., drivers 0–7, eight at a time).</li>
<li>The started drivers scan data and push it into the queues backing their respective merge streams.</li>
<li>Perform an in-memory k-way merge and spill the results, producing a spill-file group (one or more spill files per group).</li>
<li>After all inputs from drivers 0–7 are consumed and spilled, the drivers are closed, the file streams opened by their <code>TableScan</code> operators are closed, and the associated memory is released.</li>
<li>Repeat the above steps for the remaining rounds (drivers 8–15, then drivers 16–23), ensuring peak memory stays within budget.</li>
</ol>
<p><strong>Phase 2</strong></p>
<ol>
<li>Create a concatenated file stream for each spill-file group produced in Phase 1.</li>
<li>Schedule one async callback for each concatenated stream to prefetch and push data into a merge stream.</li>
<li>Merge the outputs of the three merge streams using a k-way merge (e.g., a loser-tree), and begin streaming the final, globally sorted results to downstream operators.</li>
<li>The number of rows per output batch is limited adaptively by estimating the row size of each merge stream, using the average row size observed in its first batch.</li>
</ol>
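<p>The two phases above can be sketched in a few lines (an illustrative toy, with plain in-memory vectors standing in for spill files and none of the real operator's streaming, prefetching, or memory accounting):</p>

```cpp
#include <algorithm>
#include <iterator>
#include <utility>
#include <vector>

std::vector<int> multiRoundMerge(
    const std::vector<std::vector<int>>& sources,
    size_t maxSourcesPerRound) {
  // Merge a group of sorted runs into one sorted run.
  auto mergeSorted = [](const std::vector<std::vector<int>>& runs) {
    std::vector<int> out;
    for (const auto& run : runs) {
      std::vector<int> merged;
      std::merge(
          out.begin(), out.end(), run.begin(), run.end(),
          std::back_inserter(merged));
      out = std::move(merged);
    }
    return out;
  };

  // Phase 1: scan-and-merge in rounds, never holding more than
  // maxSourcesPerRound sources open at once; each round yields one
  // "spill group" (a plain vector here, spill files in the real operator).
  std::vector<std::vector<int>> spillGroups;
  for (size_t begin = 0; begin < sources.size(); begin += maxSourcesPerRound) {
    size_t end = std::min(begin + maxSourcesPerRound, sources.size());
    std::vector<std::vector<int>> round(
        sources.begin() + begin, sources.begin() + end);
    spillGroups.push_back(mergeSorted(round));
    // In the real operator, the round's drivers are closed here, releasing
    // their file streams and scan memory before the next round starts.
  }

  // Phase 2: k-way merge of the spill groups into the final sorted stream.
  return mergeSorted(spillGroups);
}
```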
<figure><img src="https://velox-lib.io/img/spill.merge.png" height="100%" width="100%"></figure>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="how-to-use">How To Use<a href="https://velox-lib.io/blog/multi-round-local-merge#how-to-use" class="hash-link" aria-label="Direct link to How To Use" title="Direct link to How To Use">​</a></h2>
<p>Set <code>local_merge_spill_enabled</code> to <code>true</code> to enable spilling for the <code>LocalMerge</code>
operator (it is <code>false</code> by default). Then, set <code>local_merge_max_num_merge_sources</code> to
control the number of merge sources per round according to your memory management strategy.</p>
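<p>Since Velox query configuration options are plain string key–value pairs, enabling the feature boils down to two entries in the session config. The helper name below is hypothetical, and how the map is attached to a query context depends on your integration:</p>

```cpp
#include <string>
#include <unordered_map>

// Hypothetical session-config snippet: the two LocalMerge spill settings
// as they would appear in a string key-value query config map.
std::unordered_map<std::string, std::string> makeLocalMergeSpillConfig() {
  return {
      // Off by default; turns on spilling in the LocalMerge operator.
      {"local_merge_spill_enabled", "true"},
      // Caps how many merge sources each round processes concurrently.
      {"local_merge_max_num_merge_sources", "8"},
  };
}
```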
<p><em>Note: An executor must be configured when spilling is enabled, as it is used to schedule an asynchronous
callback for each concatenated stream to prefetch data and push it into the merge stream.</em></p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="future-work">Future Work<a href="https://velox-lib.io/blog/multi-round-local-merge#future-work" class="hash-link" aria-label="Direct link to Future Work" title="Direct link to Future Work">​</a></h2>
<p>In the future, the number of merge sources could be adjusted dynamically based on available memory,
rather than being fixed by the <code>local_merge_max_num_merge_sources</code> parameter. The process would
start with a small number of sources, such as 2, and incrementally increase this number for subsequent
rounds (e.g., to 4) as long as sufficient memory is available, stopping once it reaches a
memory-constrained limit.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="acknowledgements">Acknowledgements<a href="https://velox-lib.io/blog/multi-round-local-merge#acknowledgements" class="hash-link" aria-label="Direct link to Acknowledgements" title="Direct link to Acknowledgements">​</a></h2>
<p>Thanks to <a href="https://www.linkedin.com/in/xiaoxuanmeng/" target="_blank" rel="noopener noreferrer">Xiaoxuan Meng</a> and <a href="https://www.linkedin.com/in/pedro-pedreira/" target="_blank" rel="noopener noreferrer">Pedro Pedreira</a> for their guidance, review, and brainstorming.
I also appreciate the excellent collaboration and work from my colleagues,
<a href="https://www.linkedin.com/in/%E7%BF%94-%E5%A7%9A-1b513a359/" target="_blank" rel="noopener noreferrer">Xiang Yao</a>,
<a href="https://github.com/zjuwangg" target="_blank" rel="noopener noreferrer">Gang Wang</a>,
and <a href="https://www.linkedin.com/in/xu-weixin-75b06786/" target="_blank" rel="noopener noreferrer">Weixin Xu</a>.</p>]]></content:encoded>
            <category>tech-blog</category>
            <category>spill</category>
            <category>operator</category>
        </item>
        <item>
            <title><![CDATA[Enabling Shared Library Builds in Velox]]></title>
            <link>https://velox-lib.io/blog/shared-build</link>
            <guid>https://velox-lib.io/blog/shared-build</guid>
            <pubDate>Sat, 01 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[In this post, I’ll share how we unblocked shared library builds in Velox, the challenges we encountered with our large CMake build system, and the creative solution that let us move forward without disrupting contributors or downstream users.]]></description>
            <content:encoded><![CDATA[<p>In this post, I’ll share how we unblocked shared library builds in Velox, the challenges we encountered with our large CMake build system, and the creative solution that let us move forward without disrupting contributors or downstream users.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-state-of-the-velox-build-system">The State of the Velox Build System<a href="https://velox-lib.io/blog/shared-build#the-state-of-the-velox-build-system" class="hash-link" aria-label="Direct link to The State of the Velox Build System" title="Direct link to The State of the Velox Build System">​</a></h2>
<p>Velox’s codebase was started in Meta’s internal monorepo, which still serves as the source of truth. Changes from pull requests in the GitHub repository are not merged directly via the web UI. Instead, the changes are imported into the internal review and CI tool <a href="https://developers.facebook.com/blog/post/2022/11/15/meta-developers-workflow-exploring-tools-used-to-code/" target="_blank" rel="noopener noreferrer">Phabricator</a> as a ‘diff’. There, the diff has to pass an additional set of CI checks before being merged into the monorepo and, in turn, exported to GitHub as a commit on the ‘main’ branch.</p>
<p>Internally, Meta uses the Buck2 build system which, like its predecessor Buck, promotes small, <a href="https://buck.build/concept/what_makes_buck_so_fast.html" target="_blank" rel="noopener noreferrer">granular build targets</a> to improve caching and distributed compilation. Externally, however, the open-source community relies on CMake as a build system. When the CMake build system was first added, its structure mirrored the granular nature of the internal build targets.</p>
<p>The result was hundreds of targets: over one hundred library targets, built as static libraries, and more than two hundred executables for tests and examples. While Buck(2) is optimized for managing such granular targets, the same cannot be said for CMake. Each subdirectory maintained its own <code>CMakeLists.txt</code>, and header includes were managed through global <code>include_directories()</code>, with no direct link between targets and their associated header files. This approach encouraged tight coupling across module boundaries. Over the years, dependencies accumulated organically, resulting in a highly interconnected, and in parts cyclic, dependency graph.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="static-linking-vs-shared-linking">Static Linking vs. Shared Linking<a href="https://velox-lib.io/blog/shared-build#static-linking-vs-shared-linking" class="hash-link" aria-label="Direct link to Static Linking vs. Shared Linking" title="Direct link to Static Linking vs. Shared Linking">​</a></h2>
<p>The combination of 300+ targets and static linking resulted in a massive build tree, dozens of GiB when building in release mode and several times larger when built with debug information. This grew to the point where we had to provision larger CI runners just to accommodate the build tree in debug mode!</p>
<p>Individual test executables could reach several GiB, making it impractical to transfer them between runners to parallelize test execution across different CI runners. This significantly delayed CI feedback for developers when pushing changes to a PR.</p>
<p>This size is the result of each static library containing a full copy of all of its dependencies' object files. With over 100 strongly connected libraries, we end up with countless redundant copies of the same objects scattered throughout the build tree, blowing up the total size.</p>
<p>CMake usually <a href="https://cmake.org/cmake/help/latest/command/target_link_libraries.html#id12" target="_blank" rel="noopener noreferrer">requires</a> the library dependency graph to be acyclic (a DAG). However, it allows circular dependencies between static libraries, because it’s possible to adjust the linker invocation to work around possible issues with missing symbols.</p>
<p>For example, say library <code>A</code> depends on <code>foo()</code> from library <code>B</code>, and <code>B</code> in turn depends on <code>bar()</code> defined in <code>A</code>. If we link them in order <code>A B</code>, the linker will find <code>foo()</code> in <code>B</code> for use in <code>A</code> but will fail to find <code>bar()</code> for use in <code>B</code>. This happens because the linker processes the libraries from left to right and symbols that are not (yet) required are effectively ignored.</p>
<p>The trick is now simply to repeat the entire circular group in the linker arguments <code>A B A B</code>. Previously ignored symbols are now in the list of required symbols and can be resolved when the linker processes the repeated libraries.</p>
<p>However, there is no such workaround for shared libraries. So the moment we attempted to build Velox as a set of shared libraries, CMake failed to configure. The textbook solution would involve refactoring dependencies between targets, explicitly marking each dependency as <code>PUBLIC</code> (forwarded to both direct and transitive dependents) or <code>PRIVATE</code> (required only for this target at build time), and manually breaking cycles. But with hundreds of targets and years of organic growth, this became an unmanageable task.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="solution-attempt-1-automate"><del>Solution</del> Attempt 1: Automate<a href="https://velox-lib.io/blog/shared-build#solution-attempt-1-automate" class="hash-link" aria-label="Direct link to solution-attempt-1-automate" title="Direct link to solution-attempt-1-automate">​</a></h2>
<p>Manually addressing this was unmanageable, so our first instinct as software engineers was, of course, to automate the process. I wrote a <a href="https://github.com/assignUser/cmake-refactor" target="_blank" rel="noopener noreferrer">Python package</a> that parses all <code>CMakeLists.txt</code> files in the repository using <a href="https://github.com/antlr/antlr4/" target="_blank" rel="noopener noreferrer">Antlr4</a>, tracks which source files and headers belong to which target, and reconstructs the library dependency graph. Based on the graph, and on whether headers are included from a source or a header file, the tool grouped the linked targets as either PRIVATE or PUBLIC.</p>
<p>While this worked locally and made it easier to resolve the circular dependencies, it proved far too brittle and disruptive to implement at scale. Velox's high development velocity meant many parts of the code base changed daily, making it impossible to land modifications across more than 250 CMake files in a timely manner while also allowing for thorough review and testing. Maintaining this carefully constructed DAG of dependencies would also have required an unsustainable amount of specialized reviewer attention, time that could be spent much more productively elsewhere.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="solution-monolithic-library-via-cmake-wrappers">Solution: Monolithic Library via CMake Wrappers<a href="https://velox-lib.io/blog/shared-build#solution-monolithic-library-via-cmake-wrappers" class="hash-link" aria-label="Direct link to Solution: Monolithic Library via CMake Wrappers" title="Direct link to Solution: Monolithic Library via CMake Wrappers">​</a></h2>
<p>Given the practical constraints, we needed a solution that would unblock shared builds without disrupting ongoing development. To achieve this we used a pattern common in larger CMake code bases: project-specific wrapper functions. A special thanks to <a href="https://github.com/cryos" target="_blank" rel="noopener noreferrer">Marcus Hanwell</a> for introducing me to the pattern he successfully used in <a href="https://www.kitware.com/vtk-modularization-and-modernization/" target="_blank" rel="noopener noreferrer">VTK</a>.</p>
<p>We created wrapper functions for a selection of the CMake commands that create and manage build targets: <code>add_library</code> became <code>velox_add_library</code>, <code>target_link_libraries</code> turned into <code>velox_link_libraries</code>, etc. These functions simply forward their arguments to the wrapped core command, unless the option to build Velox as a monolithic library is set. In the latter case, only a single library target is created instead of ~100.</p>
<p>Each call to <code>velox_add_library</code> just adds the relevant source files to the main <code>velox</code> target. To ensure drop-in compatibility, for every original library target, the wrapper creates an <code>ALIAS</code> target referring to <code>velox</code>. This way, all existing code, downstream consumers, old and new test targets, can continue to link to their accustomed <code>velox_xyz</code> targets, which effectively point to the same monolithic library. There is no need to adjust CMake files throughout the codebase, keeping the migration quick and entirely transparent to developers and users.</p>
<p>Some special-case libraries, such as those involving the CUDA driver, still needed to be built separately for runtime or integration reasons. The wrapper logic allows exemptions for these, letting them be built as standalone libraries even in monolithic mode.</p>
<p>Here's a simplified version of the relevant logic:</p>
<div class="language-cmake codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-cmake codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">function(velox_add_library TARGET)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  if(VELOX_MONO_LIBRARY)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      if(TARGET velox)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        # Target already exists, append sources to it.</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        target_sources(velox PRIVATE ${ARGN})</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      else()</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        add_library(velox ${ARGN})</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      endif()</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      # create alias for compatibility</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      add_library(${TARGET} ALIAS velox)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  else()</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    # Just call add_library as normal</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    add_library(${TARGET} ${ARGN})</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  endif()</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">endfunction()</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>This new, monolithic library obviously doesn’t suffer from cyclic dependencies, as it only has external dependencies. So after implementing the wrapper functions, building Velox as a shared library only required some minor adjustments!</p>
<p>We have now switched the default configuration to build the monolithic library and transitioned to building the shared library in our main <a href="https://github.com/facebookincubator/velox/actions/workflows/linux-build.yml" target="_blank" rel="noopener noreferrer">PR CI</a> and <a href="https://github.com/facebookincubator/velox/actions/workflows/macos.yml" target="_blank" rel="noopener noreferrer">macOS builds</a>.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="result">Result<a href="https://velox-lib.io/blog/shared-build#result" class="hash-link" aria-label="Direct link to Result" title="Direct link to Result">​</a></h2>
<p>The results of our efforts can be clearly seen in our <a href="https://facebookincubator.github.io/velox/bm-report" target="_blank" rel="noopener noreferrer">Build Metrics Report</a> with the size of the resulting binaries being the biggest win as seen in Table 1. The workflow that collects these metrics uses our regular CI image with GCC 12.2 and ld 2.38.</p>
<p><img decoding="async" loading="lazy" alt="Velox Build Metrics Report" src="https://velox-lib.io/assets/images/bm-report-f451875bacd9ea675d78d8322ed7e6cd.png" width="1440" height="849" class="img_ev3q"></p>
<p>The new executables built against the shared <code>libvelox.so</code> are a mere ~4% of their previous size! This unlocks improvements across the board: the local developer experience, packaging and deployments for downstream users, as well as CI.</p>
<table><thead><tr><th style="text-align:left">Size of executables</th><th style="text-align:left">Static</th><th style="text-align:left">Shared</th></tr></thead><tbody><tr><td style="text-align:left">Release</td><td style="text-align:left">18.31G</td><td style="text-align:left">0.76G</td></tr><tr><td style="text-align:left">Debug</td><td style="text-align:left">244G</td><td style="text-align:left">8.8G</td></tr></tbody></table>
<p>Table 1: Total size of test, fuzzer and benchmark executables</p>
<p>Less impressive but still notable is the reduction of the debug build time by almost two hours<sup><a href="https://velox-lib.io/blog/shared-build#user-content-fn-1-1fbaae" id="user-content-fnref-1-1fbaae" data-footnote-ref="true" aria-describedby="footnote-label">1</a></sup>. The improvement is entirely explained by the much faster link time for the shared build of ~30 minutes compared to the 2 hours and 20 minutes it takes to link the static library.</p>
<table><thead><tr><th style="text-align:left">Total CPU Time</th><th style="text-align:left">Static</th><th style="text-align:left">Shared</th></tr></thead><tbody><tr><td style="text-align:left">Release</td><td style="text-align:left">13:21h</td><td style="text-align:left">13:23h</td></tr><tr><td style="text-align:left">Debug</td><td style="text-align:left">12:50h</td><td style="text-align:left">11:07h</td></tr></tbody></table>
<p>Table 2: Total time spent on compilation and linking across all cores</p>
<p>In addition to unblocking shared builds, the wrapper function pattern enables future feature additions or build-system-wide adjustments with minimal disruption. We already use the wrappers to automate the installation of header files alongside the Velox libraries, with no changes to the targets themselves.</p>
<p>The wrapper functions will also help with defining and enforcing the public Velox API by allowing us to manage header visibility, component libraries like GPU and Cloud extensions, and versioning in a consistent and centralized manner.</p>
<!-- -->
<section data-footnotes="true" class="footnotes"><h2 class="anchor anchorWithStickyNavbar_LWe7 sr-only" id="footnote-label">Footnotes<a href="https://velox-lib.io/blog/shared-build#footnote-label" class="hash-link" aria-label="Direct link to Footnotes" title="Direct link to Footnotes">​</a></h2>
<ol>
<li id="user-content-fn-1-1fbaae">
<p>This is of course much less in terms of the wall time the user experiences (~ <code>CPU time/n</code> where n is the number of cores used for the build) <a href="https://velox-lib.io/blog/shared-build#user-content-fnref-1-1fbaae" data-footnote-backref="" aria-label="Back to reference 1" class="data-footnote-backref">↩</a></p>
</li>
</ol>
</section>]]></content:encoded>
            <category>tech-blog</category>
            <category>packaging</category>
        </item>
        <item>
            <title><![CDATA[Velox switches to C++20 standard]]></title>
            <link>https://velox-lib.io/blog/cpp20-standard</link>
            <guid>https://velox-lib.io/blog/cpp20-standard</guid>
            <pubDate>Sat, 30 Aug 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Background]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="background">Background<a href="https://velox-lib.io/blog/cpp20-standard#background" class="hash-link" aria-label="Direct link to Background" title="Direct link to Background">​</a></h2>
<p>The C++ standard used in a project determines what built-in functionality developers can use to ease development and extend the project’s capabilities.</p>
<p>Since its inception in August 2021, Velox has used the C++17 standard.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="benefits">Benefits<a href="https://velox-lib.io/blog/cpp20-standard#benefits" class="hash-link" aria-label="Direct link to Benefits" title="Direct link to Benefits">​</a></h2>
<p>Switching to C++20 contributes to a modern ecosystem that is attractive to use, to develop in, and to maintain.
Going forward, Velox is looking to enhance the codebase by making use of newer compiler functionality, such as sanitizer checks, and by moving to support newer compiler versions of both GCC and Clang.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="changes">Changes<a href="https://velox-lib.io/blog/cpp20-standard#changes" class="hash-link" aria-label="Direct link to Changes" title="Direct link to Changes">​</a></h2>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="new-minimum-compiler-versions">New minimum compiler versions<a href="https://velox-lib.io/blog/cpp20-standard#new-minimum-compiler-versions" class="hash-link" aria-label="Direct link to New minimum compiler versions" title="Direct link to New minimum compiler versions">​</a></h3>
<p>This change also raises the minimum required compiler versions for GCC and Clang.
The following minimum versions are now required:</p>
<ul>
<li>GCC 11 and later</li>
<li>Clang 15 and later</li>
</ul>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="c20-major-new-features-relevant-to-velox">C++20 major new features relevant to Velox<a href="https://velox-lib.io/blog/cpp20-standard#c20-major-new-features-relevant-to-velox" class="hash-link" aria-label="Direct link to C++20 major new features relevant to Velox" title="Direct link to C++20 major new features relevant to Velox">​</a></h3>
<ul>
<li>Coroutines (language feature)</li>
<li>Modules (language feature)</li>
<li>Calendar and time zone additions to the <code>&lt;chrono&gt;</code> library</li>
<li>Constraints and concepts (language feature)</li>
<li>Three-way comparison operator <code>&lt;=&gt;</code> (language feature)</li>
<li>Ranges library</li>
</ul>
<p>Please refer to the C++20 standard for a complete list of new features.</p>
<p>The minimum targeted compiler versions support</p>
<ul>
<li>coroutines</li>
<li>Calendar and Timezone library <code>&lt;chrono&gt;</code></li>
</ul>
<p>Newer compiler versions implement more and more C++20 features and library support.
Supporting Velox on newer compiler versions is a continuous effort.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="lessons">Lessons<a href="https://velox-lib.io/blog/cpp20-standard#lessons" class="hash-link" aria-label="Direct link to Lessons" title="Direct link to Lessons">​</a></h2>
<p>We observed some interesting compiler behavior along the way. Switching to the C++20 standard caused some compile errors in the existing code,
and one of these errors was caused by a compiler issue itself.</p>
<p>In GCC 12.2.1, which is used in the CI, the <code>std::string</code> <code>+</code> operator implementation ran into a known issue causing a warning.
Because all warnings are treated as errors, we had to explicitly add an exemption for this particular compiler version.</p>
<p>In general, however, the compile errors we found were reasonably easy to fix.
Most of the changes were compatible with C++17 as well, which means the code is now a bit cleaner.
However, one change caused slight trouble because it emitted warnings in C++17, causing build failures since all warnings are turned into errors.
This was the change to require <code>this</code> in the lambda capture where applicable.
On the other hand, not addressing the lambda captures caused errors in C++20.</p>
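<p>A small, illustrative example (not Velox code) of the capture change:</p>

```cpp
struct BatchProducer {
  int produced_ = 0;

  int scheduleNext() {
    int step = 1;
    // Before: `[=]` captured `step` by copy and `this` implicitly --
    // allowed in C++17, but deprecated in C++20 (while `[=, this]`, the
    // C++20 spelling, triggers a pedantic warning in C++17 builds).
    // Listing the captures explicitly is valid in every standard:
    auto next = [this, step] { return produced_ + step; };
    produced_ = next();
    return produced_;
  }
};
```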
<p>Overall, the switch to C++20 took some time but was not overly complicated. No changes to the CI pipelines used in the project were needed. It was limited to CMake and code changes.</p>]]></content:encoded>
            <category>build</category>
        </item>
        <item>
            <title><![CDATA[Extending Velox - GPU Acceleration with cuDF]]></title>
            <link>https://velox-lib.io/blog/extending-velox-with-cudf</link>
            <guid>https://velox-lib.io/blog/extending-velox-with-cudf</guid>
            <pubDate>Fri, 11 Jul 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[TL;DR]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="tldr">TL;DR<a href="https://velox-lib.io/blog/extending-velox-with-cudf#tldr" class="hash-link" aria-label="Direct link to TL;DR" title="Direct link to TL;DR">​</a></h2>
<p>This post describes the design principles and software components for extending Velox with hardware acceleration libraries like NVIDIA's <a href="https://github.com/rapidsai/cudf" target="_blank" rel="noopener noreferrer">cuDF</a>. Velox provides a flexible execution model for hardware accelerators, and cuDF's data structures and algorithms align well with core components in Velox.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="how-velox-and-cudf-fit-together">How Velox and cuDF Fit Together<a href="https://velox-lib.io/blog/extending-velox-with-cudf#how-velox-and-cudf-fit-together" class="hash-link" aria-label="Direct link to How Velox and cuDF Fit Together" title="Direct link to How Velox and cuDF Fit Together">​</a></h2>
<p>Extensibility is a key feature in Velox. The cuDF library integrates with Velox using the <a href="https://github.com/facebookincubator/velox/blob/2a9c9043264a60c9a1b01324f8371c64bd095af9/velox/experimental/cudf/exec/ToCudf.cpp#L293" target="_blank" rel="noopener noreferrer">DriverAdapter</a> interface to rewrite query plans for GPU execution. The rewriting process allows for operator replacement, operator fusion and operator addition, giving cuDF a lot of freedom when preparing a query plan with GPU operators. Once Velox converts a query plan into executable pipelines, the cuDF DriverAdapter rewrites the plan to use GPU operators:</p>
<figure><img src="https://velox-lib.io/img/2025-07-11-cudf-driveradapter.png" height="100%" width="100%"></figure>
<p>Generally the cuDF DriverAdapter replaces operators one-to-one. An exception happens when a GPU operator is not yet implemented in cuDF, in which case a <a href="https://github.com/facebookincubator/velox/blob/main/velox/experimental/cudf/exec/CudfConversion.h" target="_blank" rel="noopener noreferrer">CudfConversion</a> operator must be added to transfer from GPU to CPU, fall back to CPU execution, and then transfer back to GPU. For end-to-end GPU execution where cuDF replaces all of the Velox CPU operators, cuDF relies on Velox's <a href="https://facebookincubator.github.io/velox/develop/task.html" target="_blank" rel="noopener noreferrer">pipeline-based execution model</a> to separate stages of execution, partition the work across drivers, and schedule concurrent work on the GPU.</p>
<p>Velox and cuDF benefit from a shared data format, using columnar layout and <a href="https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/" target="_blank" rel="noopener noreferrer">Arrow-compatible buffers</a>. The <a href="https://github.com/facebookincubator/velox/tree/2a9c9043264a60c9a1b01324f8371c64bd095af9/velox/experimental/cudf/vector" target="_blank" rel="noopener noreferrer">CudfVector</a> type in the experimental cuDF backend inherits from Velox's <a href="https://facebookincubator.github.io/velox/develop/vectors.html#rowvector" target="_blank" rel="noopener noreferrer">RowVector</a> type, and manages a <code>std::unique_ptr</code> to a <code>cudf::table</code> (<a href="https://docs.rapids.ai/api/libcudf/stable/classcudf_1_1table" target="_blank" rel="noopener noreferrer">link</a>) which owns the GPU-resident data. CudfVector also maintains an <code>rmm::cuda_stream_view</code> (<a href="https://docs.rapids.ai/api/rmm/stable/librmm_docs/cuda_streams/" target="_blank" rel="noopener noreferrer">link</a>) to ensure that asynchronously launched GPU operations are <a href="https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/" target="_blank" rel="noopener noreferrer">stream-ordered</a>. Even when the data resides in GPU memory, CudfVector allows the operator interfaces to follow Velox's standard execution model of producing and consuming RowVector objects.</p>
<p>The first set of GPU operators in the experimental cuDF backend includes <a href="https://github.com/facebookincubator/velox/tree/2a9c9043264a60c9a1b01324f8371c64bd095af9/velox/experimental/cudf/connectors" target="_blank" rel="noopener noreferrer">TableScan</a>, <a href="https://github.com/facebookincubator/velox/blob/2a9c9043264a60c9a1b01324f8371c64bd095af9/velox/experimental/cudf/exec/CudfHashJoin.h" target="_blank" rel="noopener noreferrer">HashJoin</a>, <a href="https://github.com/facebookincubator/velox/blob/2a9c9043264a60c9a1b01324f8371c64bd095af9/velox/experimental/cudf/exec/CudfHashAggregation.h" target="_blank" rel="noopener noreferrer">HashAggregation</a>, <a href="https://github.com/facebookincubator/velox/blob/2a9c9043264a60c9a1b01324f8371c64bd095af9/velox/experimental/cudf/exec/CudfFilterProject.h" target="_blank" rel="noopener noreferrer">FilterProject</a>, <a href="https://github.com/facebookincubator/velox/blob/2a9c9043264a60c9a1b01324f8371c64bd095af9/velox/experimental/cudf/exec/CudfOrderBy.h" target="_blank" rel="noopener noreferrer">OrderBy</a> and more. cuDF's C++ API covers many features beyond this initial set, including broad support for list and struct types, high-performance string operations, and configurable null handling. These features are critical for running workloads in Velox with correct semantics for Presto and Spark users. Future work will integrate cuDF support for more Velox operators.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="gpu-execution-fewer-drivers-and-larger-batches">GPU execution: fewer drivers and larger batches<a href="https://velox-lib.io/blog/extending-velox-with-cudf#gpu-execution-fewer-drivers-and-larger-batches" class="hash-link" aria-label="Direct link to GPU execution: fewer drivers and larger batches" title="Direct link to GPU execution: fewer drivers and larger batches">​</a></h2>
<p>Compared to the typical settings for CPU execution, GPU execution with cuDF benefits from a lower driver count and larger batch size. Whereas Velox CPU performs best with ~1K rows per batch and driver count equal to the number of physical CPU cores, we have found Velox's cuDF backend to perform best with ~1 GiB batch size and 2-8 drivers for a single GPU. cuDF GPU operators launch one or more device-wide kernels and a single driver may not fully utilize GPU compute. Additional drivers improve throughput by queuing up more GPU work and avoiding stalling during host-to-device copies. Adding more drivers increases memory pressure linearly, and query processing throughput saturates once GPU compute is fully utilized. Please check out our talk, “<a href="https://www.youtube.com/watch?v=l1JEo-mTNlw" target="_blank" rel="noopener noreferrer">Accelerating Velox with RAPIDS cuDF</a>” from VeloxCon 2025 to learn more.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="call-to-action">Call to Action<a href="https://velox-lib.io/blog/extending-velox-with-cudf#call-to-action" class="hash-link" aria-label="Direct link to Call to Action" title="Direct link to Call to Action">​</a></h2>
<p>The <a href="https://github.com/facebookincubator/velox/tree/main/velox/experimental/cudf" target="_blank" rel="noopener noreferrer">experimental cuDF backend in Velox</a> is under active development. It is implemented entirely in C++ and does not require CUDA programming knowledge. There are dozens more APIs in cuDF that can be integrated as Velox operators, and many design areas to explore such as spilling, remote storage connectors, and expression handling.</p>
<p>If you're interested in bringing GPU acceleration to the Velox ecosystem, please join us in the #velox-oss-community channel on the Velox Slack workspace. Your contributions will help push the limits of performance in Presto, Spark and other tools powered by Velox.</p>]]></content:encoded>
            <category>tech-blog</category>
        </item>
        <item>
            <title><![CDATA[SEGFAULT due to Dependency Update]]></title>
            <link>https://velox-lib.io/blog/segfault-dependency-update</link>
            <guid>https://velox-lib.io/blog/segfault-dependency-update</guid>
            <pubDate>Mon, 07 Jul 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Background]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="background">Background<a href="https://velox-lib.io/blog/segfault-dependency-update#background" class="hash-link" aria-label="Direct link to Background" title="Direct link to Background">​</a></h2>
<p>Velox depends on several <a href="https://github.com/facebookincubator/velox/blob/main/CMake/resolve_dependency_modules/README.md" target="_blank" rel="noopener noreferrer">libraries</a>.
Some of these dependencies include open-source libraries from Meta, including <a href="https://github.com/facebook/folly" target="_blank" rel="noopener noreferrer">Folly</a> and
<a href="https://github.com/facebook/fbthrift" target="_blank" rel="noopener noreferrer">Facebook Thrift</a>. These libraries are in active development and also depend on each other, so they all have to be updated to the same version at the same time.</p>
<p>Updating these dependencies typically involves modifying the Velox code to align with any public API or semantic changes in these dependencies.
However, a recent upgrade of Folly and Facebook Thrift to version <em>v2025.04.28.00</em> caused a <em>SEGFAULT</em> only in one unit test in Velox
named <em>velox_functions_remote_client_test</em>.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="investigation">Investigation<a href="https://velox-lib.io/blog/segfault-dependency-update#investigation" class="hash-link" aria-label="Direct link to Investigation" title="Direct link to Investigation">​</a></h2>
<p>We immediately put on our <a href="https://en.wikipedia.org/wiki/GNU_Debugger" target="_blank" rel="noopener noreferrer">gdb</a> gloves and looked at the stack traces. This issue was also reproducible in a debug build.
The SEGFAULT occurred in Facebook Thrift's <em>ThriftServer</em> class during its initialization, but the offending call was invoking the destructor of a certain handler.
However, the corresponding source code pointed to an invocation of a different function, and this code was inside a Facebook Thrift
header called <em>AsyncProcessor.h</em>.</p>
<p>This handler (<em>RemoteServer</em>) was implemented in Velox as a Thrift definition. Velox compiled this Thrift file using Facebook Thrift, and the generated code
used the <em>ServerInterface</em> class in Facebook Thrift. This <em>ServerInterface</em> class in turn derived from both the <em>AsyncProcessorFactory</em> and
<em>ServiceHandlerBase</em> interfaces in Facebook Thrift.</p>
<p>One past cause of SEGFAULTs was a conflict between Apache Thrift and Facebook Thrift being used together.
However, this was not the root cause this time: we were able to reproduce the issue by building just the test, without the Apache Thrift dependency installed.
We were entering new territory, and we were not sure where to start.</p>
<p>We then compiled an example called <em>EchoService</em> in Facebook Thrift that was very similar to the <em>RemoteServer</em>, and it worked. Then we copied the Velox <em>RemoteServer</em>
into Facebook Thrift and compiled it there, and that worked too! So the culprit was likely in the compilation flags, which differed between the Facebook Thrift and Velox builds.
We enabled verbose logging for both builds and spotted one difference: the GCC <em>coroutines</em> flag was being used in the Facebook Thrift build.</p>
<p>We were also curious about the invocation of the destructor instead of the actual function. We put our gdb gloves back on and dumped the entire
<a href="https://en.wikipedia.org/wiki/Virtual_method_table" target="_blank" rel="noopener noreferrer">vtable</a> for the <em>RemoteServer</em> class and its base classes. The vtables were different when it was built in Velox vs. Facebook Thrift.
Specifically, the list of functions inherited from <em>ServiceHandlerBase</em> was different.</p>
<p>The vtable for the <em>RemoteServer</em> handler in the Velox build had the following entries:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">folly::SemiFuture&lt;folly::Unit&gt; semifuture_onStartServing()</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">folly::SemiFuture&lt;folly::Unit&gt; semifuture_onStopRequested()</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Thunk ServiceHandlerBase::~ServiceHandlerBase</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>The vtable for the <em>RemoteServer</em> handler in the Facebook Thrift build had the following entries:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">folly::coro::Task&lt;void&gt; co_onStartServing()</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">folly::coro::Task&lt;void&gt; co_onStopRequested()</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">folly::SemiFuture&lt;folly::Unit&gt; semifuture_onStartServing()</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">folly::SemiFuture&lt;folly::Unit&gt; semifuture_onStopRequested()</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Thunk ServiceHandlerBase::~ServiceHandlerBase</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>Tying together both pieces of evidence, we could conclude that Velox generated a different vtable structure from the one Facebook Thrift (and thus ThriftServer) was built with.
Looking further, we noticed that ServiceHandlerBase conditionally declares functions based on the <em>coroutines</em> compile flag, which controls the <em>FOLLY_HAS_COROUTINES</em> macro from the Folly portability header.</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">class ServiceHandlerBase {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  ....</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"> public:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">#if FOLLY_HAS_COROUTINES</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  virtual folly::coro::Task&lt;void&gt; co_onStartServing() { co_return; }</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  virtual folly::coro::Task&lt;void&gt; co_onStopRequested() {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    throw MethodNotImplemented();</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">#endif</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  virtual folly::SemiFuture&lt;folly::Unit&gt; semifuture_onStartServing() {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">#if FOLLY_HAS_COROUTINES</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    return co_onStartServing().semi();</span><br></span><span class="token-line" style="color:#393A34"><span class="token 
plain">#else</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    return folly::makeSemiFuture();</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">#endif</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  }</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
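#else">
<p>To make the slot shift concrete, here is a small standalone model (not the real ABI: the vectors simply stand in for the compiler-generated vtables, with entries following the dumps above):</p>

```cpp
#include <string>
#include <vector>

// Toy model of the two ServiceHandlerBase vtable layouts. Each vector stands
// in for the compiler-generated vtable under one setting of
// FOLLY_HAS_COROUTINES; the entries are the virtual functions in declaration
// order. This only illustrates the slot shift, not the real ABI.
std::vector<std::string> veloxLayout() { // built without the coroutines flag
  return {"semifuture_onStartServing",
          "semifuture_onStopRequested",
          "~ServiceHandlerBase"};
}

std::vector<std::string> thriftLayout() { // built with the coroutines flag
  return {"co_onStartServing",
          "co_onStopRequested",
          "semifuture_onStartServing",
          "semifuture_onStopRequested",
          "~ServiceHandlerBase"};
}

// A caller dispatching through "slot 3" (1-based, as in the post) gets
// whatever function happens to live at that offset in the object's vtable.
std::string slot3(const std::vector<std::string>& vtable) {
  return vtable.at(2);
}
```

<p>A caller compiled against one layout that dispatches through a slot number from the other layout lands on the wrong function, which is precisely how an initialization call can end up in a destructor.</p>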
<p>As a result, the ThriftServer would access an incorrect function (<em>~ServiceHandlerBase</em> destructor at offset 3 in the first vtable above) instead of the expected
initialization function (<em>semifuture_onStartServing</em> at offset 3 in the second vtable above), thus resulting in a SEGFAULT.
We recompiled the Facebook Thrift dependency for Velox with the <em>coroutines</em> compile flag disabled, and the test passed.</p>]]></content:encoded>
            <category>tech-blog</category>
            <category>build</category>
        </item>
        <item>
            <title><![CDATA[A Velox Primer, Part 3]]></title>
            <link>https://velox-lib.io/blog/velox-primer-part-3</link>
            <guid>https://velox-lib.io/blog/velox-primer-part-3</guid>
            <pubDate>Mon, 12 May 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[At the end of the [previous]]></description>
            <content:encoded><![CDATA[<p>At the end of the <a href="https://velox-lib.io/blog/velox-primer-part-2" target="_blank" rel="noopener noreferrer">previous
article</a>, we were halfway
through running our first distributed query:</p>
<div class="language-Shell language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">SELECT l_partkey, count(*) FROM lineitem GROUP BY l_partkey;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>We discussed how a query starts, how tasks are set up, and the interactions
between plans, operators, and drivers. We have also presented how the first
stage of the query is executed, from table scan to partitioned output - or the
producer side of the shuffle.&nbsp;&nbsp;</p>
<p>In this article, we will discuss the second query stage, or the consumer side
of the shuffle.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="shuffle-consumer">Shuffle Consumer<a href="https://velox-lib.io/blog/velox-primer-part-3#shuffle-consumer" class="hash-link" aria-label="Direct link to Shuffle Consumer" title="Direct link to Shuffle Consumer">​</a></h2>
<p>As presented in the previous post, on the first query stage each worker reads
the table, then produces a series of information packets (<em>SerializedPages</em>)
intended for different workers of stage 2. In our example, the lineitem table
has no particular physical partitioning or clustering key. This means that any
row of the table can have any value of <em>l_partkey</em> in any of the files forming
the table. So in order to group data based on <em>l_partkey</em>, we first need to make
sure that rows containing the same values for <em>l_partkey</em> are processed by the
same worker – the data shuffle at the end of stage 1.</p>
<p>The overall query structure is as follows:</p>
<figure><img src="https://velox-lib.io/img/velox-primer-part2-1.png" height="70%" width="70%"></figure>
<p>The query coordinator distributes table scan splits to stage 1 workers in no
particular order. The workers process these and, as a side effect, fill
destination buffers that will be consumed by stage 2 workers. Assuming there
are 100 stage 2 workers, every stage 1 Driver has its own PartitionedOutput
which has 100 destinations. When the buffered serializations grow large enough,
these are handed off to the worker's OutputBufferManager.&nbsp;</p>
<p>Now let’s focus on the stage 2 query fragment. Each stage 2 worker has the
following plan fragment: PartitionedOutput over Aggregation over LocalExchange
over Exchange.&nbsp;&nbsp;</p>
<p>Each stage 2 Task corresponds to one destination in the OutputBufferManager of
each stage 1 worker. The first stage 2 Task fetches destination 0 from all
the stage 1 Tasks, the second stage 2 Task fetches destination 1 from the first
stage Tasks, and so on.&nbsp; Everybody talks to everybody else. The shuffle proceeds
without any central coordination.</p>
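<p>A sketch of this wiring (function and variable names are made up): stage 2 Task <em>d</em> pulls destination <em>d</em> from every stage 1 Task, so the full shuffle is simply the cross product of producers and destinations:</p>

```cpp
#include <utility>
#include <vector>

// Hypothetical sketch of the all-to-all shuffle wiring: stage 2 Task `d`
// fetches destination `d` from every stage 1 Task. Returns the
// (stage1Task, destination) pairs that one stage 2 Task pulls from.
std::vector<std::pair<int, int>> fetchesFor(int stage2Task, int numStage1Tasks) {
  std::vector<std::pair<int, int>> fetches;
  for (int producer = 0; producer < numStage1Tasks; ++producer) {
    // Every producer is asked for the same destination index.
    fetches.emplace_back(producer, stage2Task);
  }
  return fetches;
}
```

<p>Since each Task knows its own destination index and the set of producers, no coordinator has to orchestrate the exchange.</p>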
<p>But let's zoom in to what actually happens at the start of stage 2.</p>
<figure><img src="https://velox-lib.io/img/velox-primer-part3-1.png" height="70%" width="70%"></figure>
<p>The plan has a LocalExchange node after the Exchange. This becomes two
pipelines: Exchange and LocalPartition on one side, and LocalExchange,
HashAggregation, and PartitionedOutput on the other side.&nbsp;</p>
<p>The Velox Task is intended to be multithreaded, with typically 5 to 30 Drivers.
There can be hundreds of Tasks per stage, thus amounting to thousands of
threads per stage. So, each of the 100 second stage workers is consuming 1/100
of the total output of the first stage. But it is doing this in a multithreaded
manner, with many threads consuming from the ExchangeSource. We will explain
this later.</p>
<p>In order to execute a multithreaded group by, we can either have a thread-safe
hash table or we can partition the data into <strong>n</strong> disjoint streams and then
aggregate each stream on its own thread. On a CPU, we almost always prefer to
have threads working on their own memory, so data will be locally partitioned
based on <em>l_partkey</em> using a local exchange. CPUs have complex cache coherence
protocols to give observers a consistent ordered view of memory, so
reconciliation after many cores have written the same cache line is both
mandatory and expensive. Strict memory-to-thread affinity is what makes
multicore scalability work.&nbsp;</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="localpartition-and-localexchange">LocalPartition and LocalExchange<a href="https://velox-lib.io/blog/velox-primer-part-3#localpartition-and-localexchange" class="hash-link" aria-label="Direct link to LocalPartition and LocalExchange" title="Direct link to LocalPartition and LocalExchange">​</a></h2>
<p>To create efficient and independent memory access patterns, the second stage
reshuffles the data using a local exchange. In concept, this is like the remote
exchange between tasks, but is scoped inside the Task. The producer side
(LocalPartition) calculates a hash on the partitioning column <em>l_partkey</em>, and
divides this into one destination per Driver in the consumer pipeline. The
consumer pipeline has a source operator LocalExchange that reads from the queue
filled by LocalPartition. See
<a href="https://github.com/facebookincubator/velox/blob/main/velox/exec/LocalPartition.h" target="_blank" rel="noopener noreferrer">LocalPartition.h</a>
for details. Also see the code in
<a href="https://github.com/facebookincubator/velox/blob/main/velox/exec/Task.cpp" target="_blank" rel="noopener noreferrer">Task.cpp</a>
for setting up the queues between local exchange producers and consumers.</p>
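<p>Conceptually, the producer side boils down to the following sketch (the toy hash and all names here are made up for illustration; the real LocalPartition uses Velox's vectorized hashing):</p>

```cpp
#include <cstdint>
#include <vector>

// Toy stand-in for the partitioning hash (the real code uses Velox's
// vectorized hashing; this is a Knuth-style multiplicative hash).
uint32_t toyHash(int64_t key) {
  return static_cast<uint32_t>(static_cast<uint64_t>(key) * 2654435761u);
}

// For each input row, pick the consumer-Driver queue that will receive it:
// hash(l_partkey) % numDrivers. Rows with equal keys always land in the
// same queue, which is what makes per-thread aggregation correct.
std::vector<int> destinations(const std::vector<int64_t>& partKeys, int numDrivers) {
  std::vector<int> out;
  out.reserve(partKeys.size());
  for (int64_t key : partKeys) {
    out.push_back(static_cast<int>(toyHash(key) % numDrivers));
  }
  return out;
}
```

<p>The guarantee that matters is not which queue a key maps to, but that the mapping is deterministic, so a given <em>l_partkey</em> is only ever aggregated by one consumer Driver.</p>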
<p>While remote shuffles work with serialized data, local exchanges pass in-memory
vectors between threads. This is also the first time we encounter the notion of
using columnar encodings to accelerate vectorized execution. Velox became known
for its extensive use of such techniques, which we call <em>compressed execution</em>. In
this instance, we use dictionaries to slice vectors across multiple
destinations – we will discuss this next.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="dictionary-magic">Dictionary Magic<a href="https://velox-lib.io/blog/velox-primer-part-3#dictionary-magic" class="hash-link" aria-label="Direct link to Dictionary Magic" title="Direct link to Dictionary Magic">​</a></h2>
<p>Query execution often requires changes to the cardinality (number of rows in a
result) of the data. This is essentially what filters and joins do – they
either reduce the cardinality of the data, e.g., filters or selective joins, or
increase the cardinality of the data, e.g. joins with multiple key matches.</p>
<p>Repartitioning in LocalPartition assigns a destination to each row of the input
based on the partitioning key. It then makes a vector for each destination with
just the rows for that destination. Suppose rows 2, 8 and 100 of the current
input hash to 1. Then the vector for destination 1 would only have rows 2, 8
and 100 from the original input. We could make a vector of three rows by
copying the data. Instead, we save the copy by wrapping each column of the
original input in a DictionaryVector with length 3 and indices 2, 8 and 100.
This is much more efficient than copying, especially for wide and nested data.</p>
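<p>A toy model of this wrapping (the struct below is only an illustration of the idea behind Velox's DictionaryVector) shows why nothing is copied: the wrapper holds just a reference to the base plus an indices buffer:</p>

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model of a dictionary-encoded vector: row i of the wrapper is
// (*base)[indices[i]]. Selecting rows 2, 8 and 100 costs three indices,
// not a copy of three (possibly wide, nested) rows.
template <typename T>
struct ToyDictionaryVector {
  const std::vector<T>* base;        // shared, read-only base vector
  std::vector<std::size_t> indices;  // maps wrapper rows to base rows

  std::size_t size() const { return indices.size(); }
  const T& valueAt(std::size_t row) const { return (*base)[indices[row]]; }
};
```

<p>For destination 1 in the example above, the wrapper would carry indices {2, 8, 100} over the unmodified input vector.</p>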
<p>Later on, the LocalExchange consumer thread running destination 1 will see a
DictionaryVector of 3 rows. When this is accessed by the HashAggregation
Operator, the aggregation sees a dictionary and follows the indirection: for
row 0 it accesses the value at index 2 in the base (inner) vector, for row 1
the value at index 8, and so forth. The consumer for destination 0 does the exact
same thing but will access, for example, rows 4, 9 and 50.</p>
<p>The base of the dictionaries is the same on all the consumer threads. Each
consumer thread just looks at a different subset. The cores read the same cache
lines, but because the base is not written to (read-only), there is no
cache-coherence overhead.&nbsp;</p>
<p>To summarize, a <em>DictionaryVector&lt;T&gt;</em> is a wrapper around any vector of T. The
DictionaryVector carries an indices buffer that maps each of its rows to a
position in the base vector. Dictionary encoding is typically used when there
are few distinct values in a column. Take the strings <em>“experiment”</em> and
<em>“baseline”</em>. If a column only has these values, it is far more efficient to
represent it as a base vector with <em>“experiment”</em> at 0 and <em>“baseline”</em> at 1,
and a DictionaryVector with, say, 1000 elements, each of which is either the
index 0 or 1.&nbsp;&nbsp;</p>
<p>Besides this, DictionaryVectors can also be used to denote a subset or a
reordering of elements of the base. Because all places that accept vectors also
accept DictionaryVectors, the DictionaryVector becomes the universal way of
representing selection. This is a central precept of Velox and other modern
vectorized engines. We will often come across this concept.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="aggregation-and-pipeline-barrier">Aggregation and Pipeline Barrier<a href="https://velox-lib.io/blog/velox-primer-part-3#aggregation-and-pipeline-barrier" class="hash-link" aria-label="Direct link to Aggregation and Pipeline Barrier" title="Direct link to Aggregation and Pipeline Barrier">​</a></h2>
<p>We have now arrived at the second pipeline of stage 2. It begins with
LocalExchange, which then feeds into HashAggregation. The LocalExchange picks a
fraction of the Task's input specific to its local destination, about
1/number-of-drivers of the task's input.</p>
<p>We will talk about hash tables, their specific layout and other tricks in a
later post. For now, we look at HashAggregation as a black box. This specific
aggregation is a final aggregation, which is a full pipeline barrier that only
produces output after all input is received.</p>
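<p>As a toy sketch of this barrier behavior for our count(*) query (class and method names are invented; the real operator is far richer), the aggregation accumulates across every input batch and exposes output only once told there is no more input:</p>

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy final aggregation for "count(*) GROUP BY l_partkey": it accumulates
// counts across every input batch and produces nothing until noMoreInput()
// has been called, modeling the pipeline barrier described above.
class ToyCountAggregation {
 public:
  void addInput(const std::vector<int64_t>& partKeys) {
    for (int64_t key : partKeys) {
      ++counts_[key];
    }
  }

  void noMoreInput() { finished_ = true; }

  // Output is only available once all input has been seen.
  bool hasOutput() const { return finished_; }

  int64_t countFor(int64_t key) const {
    auto it = counts_.find(key);
    return it == counts_.end() ? 0 : it->second;
  }

 private:
  std::unordered_map<int64_t, int64_t> counts_;
  bool finished_ = false;
};
```

<p>The interesting property is the one the barrier enforces: no count is final until every batch has arrived, so the operator cannot stream results incrementally.</p>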
<p>How does the aggregation know it has received all its input? Let's trace the
progress of the completion signal through the shuffles. A leaf worker knows
that it is at end if the Task has received the “no more splits” message in the
last task update from the coordinator.</p>
<p>So, if one DataSource inside a TableScan is at end and there are no more
splits, this particular TableScan is not blocked and it is at end. This makes
the Driver call <em>PartitionedOutput::noMoreInput()</em> on the TableScan's
PartitionedOutput, which causes any buffered data for any destination to be
flushed and moved over to OutputBufferManager with a note that no more is
coming. OutputBufferManager knows how many Drivers there are in the TableScan
pipeline. After it has received this many “no more data coming” messages, it
can tell all destinations that this Task will generate no more data for them.&nbsp;</p>
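<p>This bookkeeping can be sketched as a simple countdown (names here are illustrative; the real OutputBufferManager also manages the buffered data itself):</p>

```cpp
// Sketch of the completion bookkeeping: the manager knows how many Drivers
// feed it, counts their "no more data" messages, and only then can it tell
// every destination that this Task is finished. Names are illustrative.
class ToyOutputBufferManager {
 public:
  explicit ToyOutputBufferManager(int numProducerDrivers)
      : remainingProducers_(numProducerDrivers) {}

  // Called once per Driver when its PartitionedOutput sees noMoreInput().
  void noMoreData() {
    if (remainingProducers_ > 0) {
      --remainingProducers_;
    }
  }

  // True once every producer Driver has flushed its last buffer; only then
  // can all destinations be told that no more data is coming.
  bool allProducersDone() const { return remainingProducers_ == 0; }

 private:
  int remainingProducers_;
};
```

<p>Until the last producer Driver reports in, consumers cannot be notified, because any still-running Driver might yet emit data for any destination.</p>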
<p>Now, when stage 2 Tasks query stage 1 producers, they know they are at end
when all the producers have signalled that no more data is coming. The
response to a get-data request for a destination carries a flag identifying
the last batch. The ExchangeSource on the stage 2 Task sets the no-more-data
flag. Each of the Drivers queries this flag, so every Exchange Operator sees
it; noMoreInput is then called on the LocalPartition, which queues up a “no
more data” signal in the local exchange queues. When the LocalExchange at the
start of the second pipeline of stage 2 has seen a “no more data” from each of
its sources, it is at end and noMoreInput is called on the HashAggregation.&nbsp;</p>
<p>This is how the end of data propagates. Up until now, HashAggregation has
produced no output, since the counts are not known until all input is received.
Now, HashAggregation starts producing batches of output, which contain the
<em>l_partkey</em> value and the number of times this has been seen. This reaches the
last PartitionedOutput, which in this case has only one destination, the final
worker that produces the result set. This will be at end when all the 100
sources have reported their own end of data.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="recap">Recap<a href="https://velox-lib.io/blog/velox-primer-part-3#recap" class="hash-link" aria-label="Direct link to Recap" title="Direct link to Recap">​</a></h2>
<p>We have finally walked through the distributed execution of a simple query. We
presented how data is partitioned between workers in the cluster, and then a
second time over inside each worker.</p>
<p>Velox and Presto are designed to aggressively parallelize execution, which
means creating distinct, non-overlapping sets of data to process on each
thread. The more threads, the more throughput. Also remember that for CPU
threads to be effective, they must process tasks that are large enough (often
over 100 microseconds' worth of CPU time), and not communicate too much with
other threads or write to memory other threads are writing to. This is
accomplished with local exchanges.</p>
<p>Another important thing to remember from this article is how columnar encodings
(DictionaryVectors, in particular) can be used as a zero-copy way of
representing selection/reorder/duplication. We will see this pattern again with
filters, joins, and other relational operations.</p>
<p>Next we will be looking at joins, filters, and hash aggregations. Stay tuned!</p>]]></content:encoded>
            <category>tech-blog</category>
            <category>primer</category>
        </item>
        <item>
            <title><![CDATA[A Velox Primer, Part 2]]></title>
            <link>https://velox-lib.io/blog/velox-primer-part-2</link>
            <guid>https://velox-lib.io/blog/velox-primer-part-2</guid>
            <pubDate>Tue, 25 Mar 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[In this article, we will discuss how a distributed compute engine executes a]]></description>
            <content:encoded><![CDATA[<p>In this article, we will discuss how a distributed compute engine executes a
query similar to the one presented in
<a href="https://velox-lib.io/blog/velox-primer-part-1" target="_blank" rel="noopener noreferrer">our first article</a>:</p>
<div class="language-Shell language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">SELECT l_partkey, count(*) FROM lineitem GROUP BY l_partkey;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>We use the TPC-H schema to illustrate the example, and
<a href="https://github.com/prestodb/presto/tree/master/presto-native-execution" target="_blank" rel="noopener noreferrer">Prestissimo</a>
as the compute engine orchestrating distributed query execution. Prestissimo is
responsible for the query engine frontend (parsing, resolving metadata,
planning, optimizing) and distributed execution (allocating resources and
shipping query fragments), and Velox is responsible for the execution of plan
fragments within a single worker node. Throughout this article, we will present
which functions are performed by Velox and which by the distributed engine
(Prestissimo, in this example).</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="query-setup">Query Setup<a href="https://velox-lib.io/blog/velox-primer-part-2#query-setup" class="hash-link" aria-label="Direct link to Query Setup" title="Direct link to Query Setup">​</a></h2>
<p>Prestissimo first receives the query through a <em>coordinator</em> node, which is
responsible for parsing and planning the query. For our sample query, a
distributed query plan with three query fragments will be created:</p>
<ol>
<li>The first fragment reads the <em>l_partkey</em> column from the lineitem table and
divides its output according to a hash of <em>l_partkey</em>.</li>
<li>The second fragment reads the output from the first fragment and updates a
hash table from <em>l_partkey</em> containing the number of times the particular value
of <em>l_partkey</em> has been seen (the count(*) aggregate function implementation).</li>
<li>The final fragment then reads the content of the hash tables, once the
second fragment has received all the rows from the first fragment.</li>
</ol>
<figure><img src="https://velox-lib.io/img/velox-primer-part2-1.png" height="70%" width="70%"></figure>
<p>The shuffle between the first two fragments partitions the data according to
<em>l_partkey</em>. Suppose there are 100 instances of the second fragment. If the hash
of <em>l_partkey</em> modulo 100 is 0, the row goes to the first task in the second
stage; if it is 1, the row goes to the second task, and so forth. In this way,
each second stage Task gets a distinct subset of the rows. The shuffle between
the second and third stage is a <em>gather</em>, meaning that there is one Task in the
third stage that will read the output of all 100 tasks in the second stage.</p>
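<p>The routing rule just described can be sketched as a pure function. This is a simplified stand-in, not Velox's actual partition function (which handles multiple keys, nulls, and vector encodings); it shows only that a row's destination is a function of its partitioning key, so equal keys always land on the same second-stage Task.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Illustrative sketch: route a row to a destination Task by hashing the
// partitioning key (l_partkey in the example) modulo the number of
// Tasks in the consuming stage.
uint32_t destinationFor(int64_t partitionKey, uint32_t numTasks) {
  return static_cast<uint32_t>(
      std::hash<int64_t>{}(partitionKey) % numTasks);
}
```

<p>Because the destination depends only on the key, every producer sends rows with the same key to the same consumer, giving each consumer a distinct subset of the groups.</p>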
<p>A <em>Stage</em> is the set of Tasks that share the same plan fragment. A Task is the
main integration point between Prestissimo and Velox; it’s the Velox execution
instance that physically processes all or part of the data that passes through
the stage.</p>
<p>To set up the distributed execution, Prestissimo first selects the workers from
the pool of Prestissimo server processes it manages. Assuming that stage 1 is
10 workers wide, it selects 10 server processes and sends the first stage plan
to them. It then selects 100 workers for stage 2 and sends the second stage
plan to these. The last stage that gathers the result has only one worker, so
Prestissimo sends the final plan to only one worker. The set of workers for
each stage may overlap, so a single worker process may host multiple stages of
one query.</p>
<p>Let's now look more closely at what each worker does at query setup time.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="task-setup">Task Setup<a href="https://velox-lib.io/blog/velox-primer-part-2#task-setup" class="hash-link" aria-label="Direct link to Task Setup" title="Direct link to Task Setup">​</a></h2>
<p>In Prestissimo, the message that sets up a Task in a worker is called <em>Task
Update</em>. A Task Update has the following information: the plan, configuration
settings, and an optional list of splits. Splits are further qualified by what
plan node they are intended for, and whether more splits for the recipient plan
node and split group will be coming.</p>
<p>Since split generation involves enumerating files from storage (which may
take a while), Presto allows splits to be sent to workers asynchronously, such
that the generation of splits can run in parallel with the execution of the
first splits. Therefore, the first task update specifies the plan and the
config. Subsequent ones only add more splits.</p>
<p>Besides the plan, the coordinator provides configs as maps from string key to
string value, both top level and connector level. The connector configs have
settings for each connector; connectors are used by table scan and table writer
to deal with storage and file formats. These configs, along with other
information like thread pools and the top-level memory pool, are handed to the Task in a
QueryCtx object. See <code>velox/core/QueryCtx.h</code> in the Velox repo for details.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="from-plans-to-drivers-and-operators">From Plans to Drivers and Operators<a href="https://velox-lib.io/blog/velox-primer-part-2#from-plans-to-drivers-and-operators" class="hash-link" aria-label="Direct link to From Plans to Drivers and Operators" title="Direct link to From Plans to Drivers and Operators">​</a></h2>
<p>Once the Velox Task is created, a TaskManager hands it splits to work on. This
is done with <code>Task::addSplit()</code>, and can be done after the Task has started
executing. See <code>velox/exec/Task.h</code> for details.</p>
<p>Let's zoom into what happens at Task creation: A PlanNode tree specifying what
the Task does is given to the Task as part of a <em>PlanFragment</em>. The most
important step done at Task creation is splitting the plan tree into pipelines.
Each pipeline then gets a DriverFactory, which is the factory class that makes
Drivers for the pipeline. The Drivers, in their turn, contain the Operators
that do the work of running the query. The DriverFactories are made in
LocalPlanner.cpp. See <code>LocalPlanner::plan</code> for details.</p>
<figure><img src="https://velox-lib.io/img/velox-primer-part2-2.png" height="70%" width="70%"></figure>
<p>Following the execution model known as Volcano, the plan is represented by an
operator tree where each node consumes the output of its child operators, and
returns output to the parent operator. The root node is typically a
PartitionedOutputNode or a TableWriteNode. The leaf nodes are either
TableScanNode, ExchangeNode, or ValuesNode (used for query literals). The full
set of Velox PlanNodes can be found in <code>velox/core/PlanNode.h</code>.</p>
<p>The PlanNodes mostly correspond to Operators. PlanNodes are not executable as
such; they are only a structure describing how to make Drivers and Operators,
which do the actual execution. If the tree of nodes has a single branch, then
the plan is a single pipeline. If it has nodes with more than one child
(input), then the second input of the node becomes a separate pipeline.</p>
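<p>The pipeline-splitting rule can be sketched with a toy plan-node tree. This is a caricature of what LocalPlanner does, under the assumption that a node's first input continues the current pipeline and every additional input (e.g. the build side of a join) starts a new one:</p>

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy plan node: only the input edges matter for pipeline splitting.
struct PlanNodeSketch {
  std::vector<std::shared_ptr<PlanNodeSketch>> inputs;
};

// Walk down the tree following the first input; any extra input becomes
// the root of its own pipeline (collected recursively).
void collectPipelines(
    const std::shared_ptr<PlanNodeSketch>& root,
    std::vector<std::shared_ptr<PlanNodeSketch>>& pipelineRoots) {
  pipelineRoots.push_back(root);
  auto node = root;
  while (!node->inputs.empty()) {
    // Inputs beyond the first start separate pipelines.
    for (size_t i = 1; i < node->inputs.size(); ++i) {
      collectPipelines(node->inputs[i], pipelineRoots);
    }
    node = node->inputs[0];
  }
}
```

<p>A single-branch tree yields one pipeline; a tree containing a two-input node (a hash join, say) yields two.</p>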
<p><code>Task::start()</code> creates the DriverFactories, which then create the Drivers. To
start execution, the Drivers are queued on a thread pool executor. The main
function that runs Operators is <code>Driver::runInternal()</code>. See this function for
the details of how Operators and the Driver interact: <code>Operator::isBlocked()</code>
determines if the Driver can advance. If it cannot, it goes off thread until a
future is realized, which then puts it back on the executor.</p>
<p>getOutput() retrieves data from an Operator and addInputs() feeds data into
another Operator. The order of execution is to advance the last Operator which
can produce output and then feed this to the next Operator. if an Operator
cannot produce output, then getOutput() is called on the Operator before it
until one is found that can produce data. If no operator is blocked and no
Operator can produce output, then the plan is at end. The noMoreInput() method
is called on each operator. This can unblock production of results, for
example, an OrderBy can only produce its output after it knows that it has all
the input.</p>
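<p>That scheduling order can be condensed into a toy loop. Everything here is illustrative (a batch is just an int, and there are only two Operators); the point is the control flow: pull from the most downstream Operator first, feed inputs forward otherwise, and declare end-of-input when nobody can produce.</p>

```cpp
#include <cassert>
#include <optional>
#include <vector>

// A minimal caricature of the Driver loop, not the real
// Driver::runInternal(): each Operator may consume a batch and may
// produce one.
struct OpSketch {
  virtual ~OpSketch() = default;
  virtual void addInput(int batch) = 0;
  virtual std::optional<int> getOutput() = 0;
  virtual void noMoreInput() {}
};

// A source that emits the numbers it was constructed with (back first).
struct SourceOp : OpSketch {
  explicit SourceOp(std::vector<int> batches) : pending(std::move(batches)) {}
  void addInput(int) override {}
  std::optional<int> getOutput() override {
    if (pending.empty()) return std::nullopt;
    int b = pending.back();
    pending.pop_back();
    return b;
  }
  std::vector<int> pending;
};

// A "project" that doubles each batch it receives.
struct DoubleOp : OpSketch {
  void addInput(int batch) override { buffered = batch * 2; }
  std::optional<int> getOutput() override {
    auto out = buffered;
    buffered.reset();
    return out;
  }
  std::optional<int> buffered;
};

// Drive the two-Operator pipeline to completion, collecting output.
std::vector<int> runPipeline(SourceOp& source, DoubleOp& project) {
  std::vector<int> results;
  while (true) {
    if (auto out = project.getOutput()) {   // advance downstream first
      results.push_back(*out);
    } else if (auto batch = source.getOutput()) {
      project.addInput(*batch);             // feed the next Operator
    } else {
      project.noMoreInput();                // nobody can produce: at end
      break;
    }
  }
  return results;
}
```

<p>Blocking is omitted here; in the real loop, a blocked Operator returns a future and the Driver leaves the thread instead of spinning.</p>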
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-minimal-pipeline-table-scan-and-repartitioning">The Minimal Pipeline: Table Scan and Repartitioning<a href="https://velox-lib.io/blog/velox-primer-part-2#the-minimal-pipeline-table-scan-and-repartitioning" class="hash-link" aria-label="Direct link to The Minimal Pipeline: Table Scan and Repartitioning" title="Direct link to The Minimal Pipeline: Table Scan and Repartitioning">​</a></h2>
<p><strong>Table Scan.</strong> With the table scan stage of our sample query, we have one pipeline
with two operators: TableScan and PartitionedOutput. Assume this pipeline has
five Drivers, and that all these five Drivers go on the thread pool executor.
The PartitionedOutput cannot do anything because it has no input, so
TableScan::getOutput() is called. See <code>velox/exec/TableScan.cpp</code> for
details. The first action TableScan takes is to look for a Split with
Task::getSplitOrFuture(). If there is no split available, this returns a
future. The Driver will then park itself off thread and install a callback on
the future that will reschedule the Driver when a split is available.</p>
<p>It could also be the case that there is no split and the Task has been notified
that no more splits will be coming. In this case TableScan would be at the end.
Finally, if a Split is available, TableScan interprets it. Given a TableHandle
specification provided as part of the plan (list of columns and filters), the
Connector (as specified in the Split) makes a DataSource. The DataSource
handles the details of IO and file and table formats.</p>
<p>The DataSource is then given the split. After this, DataSource::next() can be
called repeatedly to get vectors (batches) of output from the file/section of
file specified by the Split. If the DataSource is at the end, TableScan looks
for the next split. See Connector.h for the Connector and DataSource
interfaces.</p>
<p><strong>Repartitioning.</strong> Now we have traced execution up to the TableScan returning its
first batch of output. The Driver feeds this to PartitionedOutput::addInput().
See PartitionedOutput.cpp for details. PartitionedOutput first calculates a
hash on the partitioning key, in this case, <em>l_partkey</em>, producing a destination
number for each row in the input batch (<code>RowVectorPtr input_</code>).</p>
<p>Each destination worker has a partly filled serialization buffer
(VectorStreamGroup). If there are 100 Tasks in the second stage, each
PartitionedOutput has 100 destinations, each with one VectorStreamGroup. The
main function of a VectorStreamGroup is append(), which takes a RowVectorPtr
and a set of row numbers in it. It serializes each value identified by the row
numbers and adds it to the partly formed serialization. When enough rows are
accumulated in the VectorStreamGroup, it produces a SerializedPage. See flush()
in PartitionedOutput.cpp.</p>
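<p>The buffering behavior can be sketched with strings standing in for serialized rows. This is not the real VectorStreamGroup; it only shows the accumulate-then-flush pattern that turns appended rows into self-contained pages:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Illustrative per-destination buffer: rows are appended to a partly
// formed serialization; once enough bytes accumulate, the buffer is
// flushed into a self-contained "page" that can travel over the wire.
// (The real classes are VectorStreamGroup and SerializedPage.)
class StreamGroupSketch {
 public:
  explicit StreamGroupSketch(size_t flushBytes) : flushBytes_(flushBytes) {}

  // Serialize one row; returns a flushed page when the buffer fills up.
  std::optional<std::string> append(const std::string& row) {
    buffer_ += row;
    if (buffer_.size() >= flushBytes_) {
      std::string page = std::move(buffer_);
      buffer_.clear();
      return page;
    }
    return std::nullopt;
  }

 private:
  size_t flushBytes_;
  std::string buffer_;
};
```

<p>Each destination holds one such buffer, so every flushed page contains rows for a single recipient only.</p>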
<p>The SerializedPage is a self contained serialized packet of information that
can be transmitted over the wire to the next stage. Each such page only
contains rows intended for the same recipient. These pages are then queued up
in the worker process' OutputBufferManager. Note the code with BlockingReason
in flush(). The buffer manager maintains separate queues of all consumers of
all Tasks. If a queue is full, adding output may block. This returns a future
that is realized when there is space to add more data to the queue. This
depends on when the Task's consumer Task fetches the data.</p>
<p>Shuffles in Prestissimo are implemented by PartitionedOutput at the producer
and Exchange at the consumer end. The OutputBufferManager keeps ready
serialized data for consumers to pick up. The binding of these to the Presto
wire protocols is in TaskManager.cpp for the producer side, and in
PrestoExchangeSource.cpp for the consumer side.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="recap">Recap<a href="https://velox-lib.io/blog/velox-primer-part-2#recap" class="hash-link" aria-label="Direct link to Recap" title="Direct link to Recap">​</a></h2>
<p>We presented how a plan becomes executable and how data moves between and
inside Operators. We discussed that a Driver can block (go off thread) to wait
for a Split to become available, or to wait for its output to be consumed. We
have now scratched the surface of running a leaf stage of a distributed query.
There is much more to Operators and vectors, though. In the next installment of
Velox Primer, we will look at what the second stage of our minimal sample query
does.</p>]]></content:encoded>
            <category>tech-blog</category>
            <category>primer</category>
        </item>
        <item>
            <title><![CDATA[A Velox Primer, Part 1]]></title>
            <link>https://velox-lib.io/blog/velox-primer-part-1</link>
            <guid>https://velox-lib.io/blog/velox-primer-part-1</guid>
            <pubDate>Mon, 17 Feb 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[This is the first part of a series of short articles that will take you through]]></description>
            <content:encoded><![CDATA[<p>This is the first part of a series of short articles that will take you through
Velox’s internal structures and concepts. In this first part, we will discuss
how distributed queries are executed, how data is shuffled among different
stages, and present Velox concepts such as Tasks, Splits, Pipelines, Drivers,
and Operators that enable such functionality.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="distributed-query-execution">Distributed Query Execution<a href="https://velox-lib.io/blog/velox-primer-part-1#distributed-query-execution" class="hash-link" aria-label="Direct link to Distributed Query Execution" title="Direct link to Distributed Query Execution">​</a></h2>
<p>Velox is a library that provides the functions that go into a <em>query fragment</em> in
a distributed compute engine. Distributed compute engines, like Presto and
Spark, run so-called <em>exchange parallel plans</em>. Exchange is also known as a data
shuffle, and allows data to flow from one stage to the next. Query fragments
are the things connected by shuffles, and comprise the processing that is
executed within a single worker node. A shuffle takes input from a set of
fragments and routes rows of input to particular consumers based on a
characteristic of the data, i.e., a partitioning key. The consumers of the
shuffle read from the shuffle and receive rows from any producer, such that the
partitioning key of these rows matches the consumer.</p>
<p>Take the following query as an example:</p>
<div class="language-Shell language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">SELECT key, count(*) FROM T GROUP BY key</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>Suppose it has <strong>n</strong> leaf fragments that scan different parts of <strong>T</strong> (the
green circles at stage 1). At the end of the leaf fragment, assume there is a
shuffle that shuffles the rows based on key. There are <strong>m</strong> consumer fragments
(the yellow circles at stage 2), where each gets a non-overlapping selection of
rows based on column <strong>key</strong>. Each consumer then constructs a hash table keyed
on <strong>key</strong>, where they store a count of how many times the specific value of
<strong>key</strong> has been seen.</p>
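<p>The per-consumer aggregation is, at its core, the following sketch, with a plain std::unordered_map standing in for Velox's much more sophisticated hash table:</p>

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative count(*) GROUP BY: a hash table keyed on `key` storing
// how many times each value has been seen.
std::unordered_map<int64_t, int64_t> countByKey(
    const std::vector<int64_t>& keys) {
  std::unordered_map<int64_t, int64_t> counts;
  for (auto k : keys) {
    ++counts[k];  // count(*) for this group
  }
  return counts;
}
```

<p>Each consumer runs this over its own slice of the keys, so the 100 hash tables together cover all groups without overlap.</p>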
<figure><img src="https://velox-lib.io/img/velox-primer-part1.png" height="100%" width="100%"></figure>
<p>Now, if there are 100B different values of <strong>key</strong>, the hash table would not
conveniently fit on a single machine. For efficiency, there is a point in
dividing this into 100 hash tables of 1B entries each. This is the point of
exchange parallel scale-out. Think of shuffle as a way for consumers to each
consume their own distinct slice of a large stream produced by multiple
workers.</p>
<p>Distributed query engines like Presto and Spark both fit the above description.
The difference is in, among other things, how they do the shuffle, but we will
get back to that later.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="tasks-and-splits">Tasks and Splits<a href="https://velox-lib.io/blog/velox-primer-part-1#tasks-and-splits" class="hash-link" aria-label="Direct link to Tasks and Splits" title="Direct link to Tasks and Splits">​</a></h2>
<p>Within a worker node, the Velox representation of a query fragment is called a
Task (<code>velox::exec::Task</code>). A Task is defined mainly by two things:</p>
<ul>
<li><strong>Plan</strong> - <code>velox::core::PlanNode</code>: specifies what the Task does.</li>
<li><strong>Splits</strong> - <code>velox::exec::Split</code>: specifies the data the Task operates on.</li>
</ul>
<p>The Splits correspond to the plan that the Task is executing. For the first
stage Tasks (table scanning), the Splits specify pieces of files to scan. For
the second stage Tasks (group by), their Splits identify the table scan Tasks
from which the group by reads its input. There are file splits
(<code>velox::connector::ConnectorSplit</code>) and remote splits
(<code>velox::exec::RemoteConnectorSplit</code>). The first identifies data to read, the
second identifies a running Task.</p>
<p>The distributed engine makes PlanNodes and Splits. Velox takes these and makes
Tasks. Tasks send back statistics, errors and other status information to the
distributed engine.&nbsp;</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="pipelines-drivers-and-operators">Pipelines, Drivers and Operators<a href="https://velox-lib.io/blog/velox-primer-part-1#pipelines-drivers-and-operators" class="hash-link" aria-label="Direct link to Pipelines, Drivers and Operators" title="Direct link to Pipelines, Drivers and Operators">​</a></h2>
<p>Inside a Task, there are <em>Pipelines</em>. Each pipeline is a linear sequence of
operators (<code>velox::exec::Operator</code>), and operators are the objects that implement
relational logic. In the case of the group by example, the first task has one
pipeline, with a TableScan (<code>velox::exec::TableScan</code>) and a PartitionedOutput
(<code>velox::exec::PartitionedOutput</code>). The second Task also has one
pipeline, with an Exchange, a LocalExchange, a HashAggregation, and a
PartitionedOutput.</p>
<p>Each pipeline has one or more <em>Drivers</em>. A Driver is a container for one linear
sequence of Operators, and typically runs on its own thread. The pipeline is
the collection of Drivers with the same sequence of Operators. The individual
Operator instances belong to each Driver. The Drivers belong to the Task. These
are interlinked with smart pointers such that they are kept alive for as long as
needed.</p>
<p>An example of a Task with two pipelines is a hash join, with separate pipelines
for the build and for the probe side. This makes sense because the build must
be complete before the probe can proceed. We will talk more about this later.</p>
<p>The Operators on each Driver communicate with each other through the Driver.
The Driver picks output from one Operator, keeps tabs on stats and passes it to
the next Operator. The data passed between Operators consists of vectors. In
particular, an Operator produces/consumes a RowVector, which is a vector with a
child vector for every column of the relation; it is the equivalent of
RecordBatch in Arrow. All vectors are subclasses of <code>velox::BaseVector</code>.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="operator-sources-sinks-and-state">Operator Sources, Sinks and State<a href="https://velox-lib.io/blog/velox-primer-part-1#operator-sources-sinks-and-state" class="hash-link" aria-label="Direct link to Operator Sources, Sinks and State" title="Direct link to Operator Sources, Sinks and State">​</a></h2>
<p>The first Operator in a Driver is called a <em>source</em>. The source operator takes
Splits to figure out the file or remote Task that provides its data, and
produces a sequence of RowVectors. The last Operator in the pipeline is called a <em>sink</em>. The
sink operator does not produce an output RowVector, but rather puts the data
somewhere for a consumer to retrieve. This is typically a PartitionedOutput.
The consumer of the PartitionedOutput is an Exchange in a different task, where
the Exchange is in the source position. Operators that are neither sources nor
sinks are things like FilterProject, HashProbe, and HashAggregation; more on
these later.</p>
<p>Operators also carry state: they can be blocked, accepting input, have
output to produce or not, or may be notified that no more input is coming.
Operators do not call the API functions of other Operators directly, but
instead the Driver decides which Operator to advance next. This has the benefit
that no part of the Driver's state is captured in nested function calls. The
Driver has a flat stack and therefore can go on and off thread at any time,
without having to unwind and restore nested function frames.&nbsp;</p>
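<p>The states and the resulting scheduling decision can be summarized in a small sketch. The enum values here are illustrative, not Velox's actual names; the point is that the Driver inspects Operator state between calls and decides, with a flat stack, whether to park itself off thread:</p>

```cpp
#include <cassert>

// Illustrative Operator states the Driver can observe. Operators never
// call each other; the Driver polls these between its own calls, so its
// stack stays flat.
enum class OpState { kBlocked, kCanProduce, kNeedsInput, kFinished };

// The Driver's decision: a blocked Operator sends the whole Driver off
// thread until a future is realized; anything else keeps it running.
bool goesOffThread(OpState state) {
  return state == OpState::kBlocked;
}
```

<p>Because no Operator state is captured in nested frames, resuming the Driver later is just putting it back on the executor.</p>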
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="recap">Recap<a href="https://velox-lib.io/blog/velox-primer-part-1#recap" class="hash-link" aria-label="Direct link to Recap" title="Direct link to Recap">​</a></h2>
<p>We have seen how a distributed query consists of fragments joined by shuffles.
Each fragment has a Task, which has pipelines, Drivers and Operators. PlanNodes
represent what the task is supposed to do. These tell the Task how to set up
Drivers and Operators. Splits tell the Task where to find input, e.g. from
files or from other Tasks. A Driver corresponds to a thread of execution.
It may be running or blocked, e.g. waiting for data to become available or for
its consumer to consume already produced data. Operators pass data between
themselves as vectors.&nbsp;</p>
<p>In the next article we will discuss the different stages in the life of a query.</p>]]></content:encoded>
            <category>tech-blog</category>
            <category>primer</category>
        </item>
        <item>
            <title><![CDATA[Velox Query Tracing]]></title>
            <link>https://velox-lib.io/blog/velox-query-tracing</link>
            <guid>https://velox-lib.io/blog/velox-query-tracing</guid>
            <pubDate>Sun, 15 Dec 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[TL;DR]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="tldr">TL;DR<a href="https://velox-lib.io/blog/velox-query-tracing#tldr" class="hash-link" aria-label="Direct link to TL;DR" title="Direct link to TL;DR">​</a></h2>
<p>The query trace tool helps analyze and debug query performance and correctness issues.
It avoids interference from external noise in a production environment (storage, network, etc.)
by allowing part of the query plan and dataset to be replayed in an isolated environment,
such as a local machine. This is much more efficient for performance analysis and debugging,
as it eliminates the need to replay the whole query in a production environment.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="how-tracing-works">How Tracing Works<a href="https://velox-lib.io/blog/velox-query-tracing#how-tracing-works" class="hash-link" aria-label="Direct link to How Tracing Works" title="Direct link to How Tracing Works">​</a></h2>
<p>The tracing process consists of two distinct phases: the tracing phase and the replaying phase. The
tracing phase is executed within a production environment, while the replaying phase is conducted in
a local development environment.</p>
<p><strong>Tracing Phase</strong></p>
<ol>
<li>The metadata required for replay, including the query plan fragment, query configuration,
and connector properties, is recorded during query task initiation.</li>
<li>Throughout query processing, each traced operator logs its input vectors or splits,
storing them in a designated storage location.</li>
<li>The metadata and splits are serialized in JSON format, and the operator data inputs are
serialized using a <a href="https://prestodb.io/docs/current/develop/serialized-page.html" target="_blank" rel="noopener noreferrer">presto serializer</a>.</li>
</ol>
<p><strong>Replaying Phase</strong></p>
<ol>
<li>Read and deserialize the recorded query plan, extract the traced plan node, and assemble a plan
fragment with customized source and sink nodes.</li>
<li>The source node reads the input from the serialized operator inputs on storage and the sink operator
prints or logs out the execution stats.</li>
<li>Build a task with the assembled plan fragment in step 1. Apply the recorded query configuration and
connector properties to replay the task with the same input and configuration setup as in production.</li>
</ol>
<p><strong>NOTE</strong>: The Presto serialization might lose input vector encodings, such as lazy vectors
and nested dictionary encoding, which affect the operator’s execution. Hence, replay might not
always behave exactly as in production.</p>
<figure><img src="https://velox-lib.io/img/tracing.png" height="100%" width="100%"></figure>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="tracing-framework">Tracing Framework<a href="https://velox-lib.io/blog/velox-query-tracing#tracing-framework" class="hash-link" aria-label="Direct link to Tracing Framework" title="Direct link to Tracing Framework">​</a></h2>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="trace-writers">Trace Writers<a href="https://velox-lib.io/blog/velox-query-tracing#trace-writers" class="hash-link" aria-label="Direct link to Trace Writers" title="Direct link to Trace Writers">​</a></h3>
<p>There are three types of writers: <code>TaskTraceMetadataWriter</code>, <code>OperatorTraceInputWriter</code>,
and <code>OperatorTraceSplitWriter</code>. They are used in the production or shadow environment
to record the real execution data.</p>
<ul>
<li>The <code>TaskTraceMetadataWriter</code> records the query metadata during task creation, serializes it,
and saves it into a file in JSON format.</li>
<li>The <code>OperatorTraceInputWriter</code> records the input vectors from the target operator.
It uses a Presto serializer to serialize each vector batch and flushes immediately to ensure
that replay is possible even if a crash occurs during execution.</li>
<li>The <code>OperatorTraceSplitWriter</code> captures the input splits from the target <code>TableScan</code> operator. It
serializes each split and immediately flushes it to ensure that replay is possible even if a crash
occurs during execution.</li>
</ul>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="storage-location">Storage Location<a href="https://velox-lib.io/blog/velox-query-tracing#storage-location" class="hash-link" aria-label="Direct link to Storage Location" title="Direct link to Storage Location">​</a></h3>
<p>It is recommended to store traced data in a remote storage system to ensure its preservation and
accessibility even if the computation clusters are reconfigured or encounter issues. This also
helps prevent nodes in the cluster from failing due to local disk exhaustion.</p>
<p>Users should start by creating a root directory. Writers will then create subdirectories within
this root directory to organize the traced data. A well-designed directory structure will keep
the data organized and accessible for replay and analysis.</p>
<p><strong>Metadata Location</strong></p>
<p>The <code>TaskTraceMetadataWriter</code> is set up during task creation, so it creates a trace
directory named <code>$rootDir/$queryId/$taskId</code>.</p>
<p><strong>Input Data and Split Location</strong></p>
<p>The node ID groups the tracing data for the same traced plan node. The pipeline ID isolates the
tracing data between operators created from the same plan node (e.g., HashProbe and HashBuild from
the HashJoinNode). The driver ID isolates the tracing data of peer operators in the same pipeline
from different drivers.</p>
<p>Correspondingly, to keep the tracing data organized and isolated, the <code>OperatorTraceInputWriter</code>
and <code>OperatorTraceSplitWriter</code> are set up during operator initialization and create a data or split
tracing directory in
<div class="language-Shell language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">$rootDir/$queryId/$taskId/$nodeId/$pipelineId/$driverId</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
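<p>Assembling that directory path is straightforward; the helper below is hypothetical (Velox builds these paths internally) and just makes the layout concrete:</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not a real Velox function: build the per-driver
// tracing directory from the IDs described above.
std::string traceDir(const std::string& rootDir, const std::string& queryId,
                     const std::string& taskId, int nodeId, int pipelineId,
                     int driverId) {
  return rootDir + "/" + queryId + "/" + taskId + "/" +
      std::to_string(nodeId) + "/" + std::to_string(pipelineId) + "/" +
      std::to_string(driverId);
}
```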
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="memory-management">Memory Management<a href="https://velox-lib.io/blog/velox-query-tracing#memory-management" class="hash-link" aria-label="Direct link to Memory Management" title="Direct link to Memory Management">​</a></h3>
<p>A new leaf system memory pool named <code>tracePool</code> is added for tracing memory usage, and is
exposed via <code>memory::MemoryManager::getInstance()-&gt;tracePool()</code>.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="trace-readers">Trace Readers<a href="https://velox-lib.io/blog/velox-query-tracing#trace-readers" class="hash-link" aria-label="Direct link to Trace Readers" title="Direct link to Trace Readers">​</a></h3>
<p>Three types of readers correspond to the query trace writers: <code>TaskTraceMetadataReader</code>,
<code>OperatorTraceInputReader</code>, and <code>OperatorTraceSplitReader</code>. The replayers typically use
them in the local environment, which will be described in detail in the Query Trace Replayer section.</p>
<ul>
<li>The <code>TaskTraceMetadataReader</code> can load the query metadata JSON file and extract the query
configurations, connector properties, and a plan fragment. The replayer uses these to build
a replay task.</li>
<li>The <code>OperatorTraceInputReader</code> reads and deserializes the input vectors in a tracing data file.
It is created and used by an <code>OperatorTraceScan</code> operator, which is described in detail in
the <strong>Trace Scan</strong> section.</li>
<li>The <code>OperatorTraceSplitReader</code> reads and deserializes the input splits in tracing split info files,
and produces a list of <code>exec::Split</code> for the query replay.</li>
</ul>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="trace-scan">Trace Scan<a href="https://velox-lib.io/blog/velox-query-tracing#trace-scan" class="hash-link" aria-label="Direct link to Trace Scan" title="Direct link to Trace Scan">​</a></h3>
<p>As outlined in the <strong>How Tracing Works</strong> section, replaying a non-leaf operator requires a
specialized source operator. This operator is responsible for reading the data recorded during the
tracing phase, and it integrates with Velox’s <code>LocalPlanner</code> through a customized plan node and
operator translator.</p>
<p><strong>TraceScanNode</strong></p>
<p>We introduce a customized <code>TraceScanNode</code> to replay a non-leaf operator. This node acts as
the source node and creates a specialized scan operator, called <code>OperatorTraceScan</code>, one per
driver, during replay. The <code>TraceScanNode</code> contains the trace directory of the
designated trace node, the associated pipeline ID, and a driver ID list passed by users during
replay, so that the <code>OperatorTraceScan</code> can locate the right trace input data or
split directory.</p>
<p><strong>OperatorTraceScan</strong></p>
<p>As described in the <strong>Storage Location</strong> section, a plan node may be split into multiple pipelines,
and each pipeline can be divided into multiple operators. Each operator corresponds to a driver, which
is a thread of execution. There may therefore be multiple tracing data files for a single plan node, one file
per driver.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="query-trace-replayer">Query Trace Replayer<a href="https://velox-lib.io/blog/velox-query-tracing#query-trace-replayer" class="hash-link" aria-label="Direct link to Query Trace Replayer" title="Direct link to Query Trace Replayer">​</a></h3>
<p>The query trace replayer is typically used in the local environment and works as follows:</p>
<ol>
<li>Load traced query configurations, connector properties,
and a plan fragment.</li>
<li>Extract the target plan node from the plan fragment using the specified plan node ID.</li>
<li>Use the target plan node from step 2 to create a replay plan node, and create a replay plan.</li>
<li>If the target plan node is a <code>TableScanNode</code>, add the replay plan node to the replay plan
as the source node. Get all the traced splits using the <code>OperatorTraceSplitReader</code>.
Use the splits as inputs for task replaying.</li>
<li>For a non-leaf operator, add a <code>TraceScanNode</code> as the source node to the replay plan and
then add the replay plan node.</li>
<li>Add a sink node, apply the query configurations (with tracing disabled) and connector properties,
and execute the replay plan.</li>
</ol>
<p><em>For detailed usage, please see the tracing documentation at <a href="https://facebookincubator.github.io/velox/develop/debugging/tracing.html" target="_blank" rel="noopener noreferrer">https://facebookincubator.github.io/velox/develop/debugging/tracing.html</a>.</em></p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="future-work">Future Work<a href="https://velox-lib.io/blog/velox-query-tracing#future-work" class="hash-link" aria-label="Direct link to Future Work" title="Direct link to Future Work">​</a></h2>
<ul>
<li>Add support for more operators</li>
<li>Customize the replay task execution instead of using AssertQueryBuilder</li>
<li>Support the IcebergHiveConnector</li>
<li>Add trace replay for an entire pipeline</li>
</ul>]]></content:encoded>
            <category>tech-blog</category>
            <category>tracing</category>
        </item>
        <item>
            <title><![CDATA[Optimizing and Migrating Velox CI Workloads to Github Actions]]></title>
            <link>https://velox-lib.io/blog/ci-migration</link>
            <guid>https://velox-lib.io/blog/ci-migration</guid>
            <pubDate>Fri, 23 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[TL;DR]]></description>
            <content:encoded><![CDATA[<figure><img src="https://velox-lib.io/img/abstract-white-background.jpg" height="100%" width="100%"></figure>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="tldr">TL;DR<a href="https://velox-lib.io/blog/ci-migration#tldr" class="hash-link" aria-label="Direct link to TL;DR" title="Direct link to TL;DR">​</a></h2>
<p>In late 2023, the Meta OSS (Open Source Software) Team requested that all Meta teams move their CI deployments from CircleCI to Github Actions. <a href="http://voltrondata.com/" target="_blank" rel="noopener noreferrer">Voltron Data</a> and Meta collaborated to migrate all the deployed Velox CI jobs. Velox CI spend for 2024 was on track to overshoot the allocated resources by a considerable amount. As part of this migration effort, the CI workloads were consolidated and optimized by Q2 2024, bringing the projected 2024 CI spend down by 51%.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="veloxs-continuous-integration-workload">Velox’s Continuous Integration Workload<a href="https://velox-lib.io/blog/ci-migration#veloxs-continuous-integration-workload" class="hash-link" aria-label="Direct link to Velox’s Continuous Integration Workload" title="Direct link to Velox’s Continuous Integration Workload">​</a></h2>
<p>Continuous Integration (CI) is crucial for Velox’s success as an open source project: it helps protect against bugs and errors, reduces the likelihood of conflicts, and increases community trust in the project. CI ensures that Velox builds and works well on a myriad of system architectures, operating systems, and compilers, including the ones used internally at Meta. The OSS build of Velox also supports additional features that aren't used internally at Meta (for example, support for cloud blob stores).</p>
<p>When a pull request is submitted to Velox, the following jobs are executed:</p>
<ol>
<li>Linting and Formatting workflows:<!-- -->
<ol>
<li>Header checks</li>
<li>License checks</li>
<li>Basic Linters</li>
</ol>
</li>
<li>Ensure Velox builds on various platforms<!-- -->
<ol>
<li>MacOS (Intel, M1)</li>
<li>Linux (Ubuntu/Centos)</li>
</ol>
</li>
<li>Ensure Velox builds under its various configurations<!-- -->
<ol>
<li>Debug / Release builds</li>
<li>Build default Velox build</li>
<li>Build Velox with support for Parquet, Arrow and External Adapters (S3/HDFS/GCS etc.)</li>
<li>PyVelox builds</li>
</ol>
</li>
<li>Run prerequisite tests<!-- -->
<ol>
<li>Unit Tests</li>
<li>Benchmarking Tests<!-- -->
<ol>
<li><a href="https://velox-conbench.voltrondata.run/runs/5bd139fffa9b4e0eb020da4d63211121/" target="_blank" rel="noopener noreferrer">Conbench</a> is used to store and compare results, and also alert users on regressions</li>
</ol>
</li>
<li>Various Fuzzer Tests (Expression / Aggregation/ Exchange / Join etc)</li>
<li>Signature Check and Biased Fuzzer Tests ( Expression / Aggregation)</li>
<li>Fuzzer Tests using Presto as source of truth</li>
</ol>
</li>
<li>Docker Image build jobs<!-- -->
<ol>
<li>If an underlying dependency is changed, a new Docker CI image is built for<!-- -->
<ol>
<li>Ubuntu Linux</li>
<li>Centos</li>
<li>Presto Linux image</li>
</ol>
</li>
</ol>
</li>
<li>Documentation build and publish Job<!-- -->
<ol>
<li>If underlying documentation is changed, Velox documentation pages are rebuilt and published</li>
<li>Netlify is used for publishing Velox web pages</li>
</ol>
</li>
</ol>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="velox-ci-optimization">Velox CI Optimization<a href="https://velox-lib.io/blog/ci-migration#velox-ci-optimization" class="hash-link" aria-label="Direct link to Velox CI Optimization" title="Direct link to Velox CI Optimization">​</a></h2>
<p>The previous CI implementation in CircleCI grew organically and was unoptimized, resulting in long build times and significantly higher costs. The migration to Github Actions was an opportunity to take a holistic view of the CI deployments and actively optimize them to reduce build times and CI spend. Note, however, that there is continued investment in reducing test times to further improve Velox reliability, stability, and developer experience. Some of the completed optimizations are:</p>
<ol>
<li>
<p><strong>Persisting build artifacts across builds</strong>: During every build, the object files and binaries produced are cached. In addition, artifacts such as scalar and aggregate function signatures are produced. These signatures are compared against the baseline version to determine whether the changes in the current PR are backwards incompatible or bias the newly added functions. Using a stash to persist these artifacts saves one build cycle.</p>
</li>
<li>
<p><strong>Optimizing our instances</strong>: Building Velox is a memory- and CPU-intensive job, so beefy instances (16-core machines) are used for the build. After the build, the artifacts are copied to smaller instances (4-core machines) to run fuzzer tests and other jobs. Since these fuzzers often run for an hour and are less intensive than the build process, this resulted in significant CI savings while increasing test coverage.</p>
</li>
</ol>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="instrumenting-velox-ci-builds">Instrumenting Velox CI Builds<a href="https://velox-lib.io/blog/ci-migration#instrumenting-velox-ci-builds" class="hash-link" aria-label="Direct link to Instrumenting Velox CI Builds" title="Direct link to Instrumenting Velox CI Builds">​</a></h2>
<p>Velox CI builds were instrumented with Conbench to capture various metrics about the builds:</p>
<ol>
<li>Build times at the translation unit, library, and project level.</li>
<li>Binary sizes produced at the translation unit (.o), library (.a/.so), and executable level.</li>
<li>Memory pressure.</li>
<li>How changes affect binary sizes over time.</li>
</ol>
<p>A nightly job captures these build metrics and uploads them to Conbench. The Velox build metrics report is available here: <a href="https://facebookincubator.github.io/velox/bm-report/" target="_blank" rel="noopener noreferrer">Velox Build Metrics Report</a></p>
<a href="https://facebookincubator.github.io/velox/bm-report/"><img src="https://velox-lib.io/img/velox-build-metrics.png" height="100%" width="100%"></a>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="acknowledgements">Acknowledgements<a href="https://velox-lib.io/blog/ci-migration#acknowledgements" class="hash-link" aria-label="Direct link to Acknowledgements" title="Direct link to Acknowledgements">​</a></h2>
<p>A large part of the credit goes to Jacob Wujciak and the team at Voltron Data. We would also like to thank other collaborators in the Open Source Community and at Meta, including but not limited to:</p>
<p><strong>Meta</strong>: Sridhar Anumandla, Pedro Eugenio Rocha Pedreira, Deepak Majeti, Meta OSS Team, and others</p>
<p><strong>Voltron Data</strong>: Jacob Wujciak, Austin Dickey, Marcus Hanwell, Sri Nadukudy, and others</p>]]></content:encoded>
            <category>tech-blog</category>
            <category>packaging</category>
        </item>
        <item>
            <title><![CDATA[Further Optimizing TRY_CAST and TRY]]></title>
            <link>https://velox-lib.io/blog/optimize-try-more</link>
            <guid>https://velox-lib.io/blog/optimize-try-more</guid>
            <pubDate>Fri, 31 May 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[TL;DR]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="tldr">TL;DR<a href="https://velox-lib.io/blog/optimize-try-more#tldr" class="hash-link" aria-label="Direct link to TL;DR" title="Direct link to TL;DR">​</a></h2>
<p>Queries that use TRY or TRY_CAST may experience poor performance and high CPU
usage due to excessive exception throwing. We optimized CAST to indicate
failure without throwing and introduced a mechanism for scalar functions to do
the same. A microbenchmark measuring the worst-case performance of CAST improved
100x. Samples of production queries show a 30x CPU time improvement.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="try-and-trycast">TRY and TRY(CAST)<a href="https://velox-lib.io/blog/optimize-try-more#try-and-trycast" class="hash-link" aria-label="Direct link to TRY and TRY(CAST)" title="Direct link to TRY and TRY(CAST)">​</a></h2>
<p>The TRY construct can be applied to any expression to suppress errors and turn them
into NULL results. TRY_CAST is a version of CAST that suppresses errors and
returns NULL instead.</p>
<p>For example, <code>parse_datetime('2024-05-', 'YYYY-MM-DD')</code> fails:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">	 Invalid format: "2024-05-" is too short</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>, but <code>TRY(parse_datetime('2024-05-', 'YYYY-MM-DD'))</code> succeeds and returns NULL.</p>
<p>Similarly, <code>CAST('foo' AS INTEGER)</code> fails:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">	Cannot cast 'foo' to INT</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>, but <code>TRY_CAST('foo' AS INTEGER)</code> succeeds and returns NULL.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="try_cast-vs-trycast">TRY_CAST vs. TRY(CAST)<a href="https://velox-lib.io/blog/optimize-try-more#try_cast-vs-trycast" class="hash-link" aria-label="Direct link to TRY_CAST vs. TRY(CAST)" title="Direct link to TRY_CAST vs. TRY(CAST)">​</a></h2>
<p>TRY can wrap any expression, so one can wrap CAST as well:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">  TRY(CAST('foo' AS INTEGER))</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>Wrapping CAST in TRY is similar to TRY_CAST, but not equivalent. TRY_CAST
suppresses only cast errors, while TRY suppresses any error in the expression
tree.</p>
<p>For example, <code>CAST(1/0 AS VARCHAR)</code> fails:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">  Division by zero</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>, <code>TRY_CAST(1/0 AS VARCHAR)</code> also fails:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">  Division by zero</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>, but <code>TRY(CAST(1/0 AS VARCHAR))</code> succeeds and returns NULL.</p>
<p>In this case, the error is generated by the division operation (1/0). TRY_CAST
cannot suppress that error, but TRY can. More generally, TRY(CAST(...))
suppresses all errors in all expressions that are evaluated to produce
an input for CAST, as well as errors in CAST itself, while TRY_CAST suppresses
errors in CAST only.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="what-happens-when-many-rows-fail">What happens when many rows fail?<a href="https://velox-lib.io/blog/optimize-try-more#what-happens-when-many-rows-fail" class="hash-link" aria-label="Direct link to What happens when many rows fail?" title="Direct link to What happens when many rows fail?">​</a></h2>
<p>In most cases only a fraction of rows generates an error. However, there are
queries where a large percentage of rows fail. In these cases, a lot of CPU
time goes into handling exceptions.</p>
<p>For example, one Prestissimo query used 3 weeks of CPU time, 93% of
which was spent processing try(date_parse(...)) expressions where most rows
failed. Here is a profile for that query that shows that all the time went into
stack unwinding:</p>
<figure><img src="https://velox-lib.io/img/optimize-try-more-profile1.png"></figure>
<p>This query processes 14B rows, ~70% of which fail in date_parse(...) function
due to the date string being empty.</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">    presto&gt; select try(date_parse('', '%Y-%m-%d'));</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">     _col0</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    -------</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">     NULL</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    (1 row)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    – TRY suppressed Invalid format: "" error and produced a NULL.</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>Velox tracks the number of suppressed exceptions per operator / plan node and
reports these as the numSilentThrow runtime stat. For this query, Velox reported
21B throws for a single FilterProject node that processed 14B rows: before the
optimizations, each failing row threw twice, which accounts for the count
(14B rows × ~70% failures × 2 throws ≈ 20B). An <a href="https://velox-lib.io/blog/optimize-try_cast">earlier blog post</a>
from <a href="https://www.linkedin.com/in/laith-sakka-629ab384/">Laith Sakka</a> explains why.
After the optimizations, this query’s CPU time dropped
to 17h, a 30x improvement over the original CPU time. Compared to
Presto Java, this query now uses 4x less CPU time (originally it used 6x more).</p>
<figure><img src="https://velox-lib.io/img/optimize-try-more-numSilentThrow.png"></figure>
<p>We observed similar issues with queries that use other functions that parse
strings, as well as with casts from strings.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="solution">Solution<a href="https://velox-lib.io/blog/optimize-try-more#solution" class="hash-link" aria-label="Direct link to Solution" title="Direct link to Solution">​</a></h2>
<p>To avoid the performance penalty of throwing exceptions, we need to report errors
differently. Google’s <a href="https://abseil.io/docs/cpp/guides/status">Abseil</a> library uses absl::Status to return errors from
void functions and absl::StatusOr to return a value or an error from non-void
functions. The <a href="https://arrow.apache.org/cookbook/cpp/basic.html#id8">Arrow</a> library
has similar Status and Result types. Our own Folly has <a href="https://github.com/facebook/folly/blob/main/folly/Expected.h">folly::Expected</a>.
Inspired by these examples, we introduced <a href="https://github.com/facebookincubator/velox/pull/8084">velox::Status</a> and <a href="https://github.com/facebookincubator/velox/pull/9858">velox::Expected</a>.</p>
<p>velox::Status holds a generic error code and an error message.</p>
<p><code>velox::Expected&lt;T&gt;</code> is a typedef for <code>folly::Expected&lt;T, velox::Status&gt;</code>.</p>
<p>For example, a non-throwing modulo operation can be implemented like this:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">  Expected&lt;int&gt; mod(int a, int b) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    if (b == 0) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      return folly::makeUnexpected(Status::UserError("Division by zero"));</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    return a % b;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  }</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="non-throwing-simple-functions">Non-throwing Simple Functions<a href="https://velox-lib.io/blog/optimize-try-more#non-throwing-simple-functions" class="hash-link" aria-label="Direct link to Non-throwing Simple Functions" title="Direct link to Non-throwing Simple Functions">​</a></h2>
<p>We extended the Simple Function API to allow authoring non-throwing scalar
functions. The function author can now define a ‘call’ method that returns
Status. Such a function can indicate an error by returning a non-OK status.</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">  Status call(result&amp;, arg1, arg2,..)</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>These functions are still allowed to throw, and exceptions will be handled
properly, but avoiding throws improves the performance of expressions that use TRY.</p>
<p>The modulo SQL function would look like this:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">    template &lt;typename TExec&gt;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    struct NoThrowModFunction {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      VELOX_DEFINE_FUNCTION_TYPES(TExec);</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      Status call(int64_t&amp; result, const int64_t&amp; a, const int64_t&amp; b) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        if (b == 0) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          return Status::UserError("Division by zero");</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        }</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        result = a % b;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        return Status::OK();</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      }</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    };</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" 
title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>We changed the date_parse, parse_datetime, and from_iso8601_date Presto functions
to use the new API and report errors without throwing.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="non-throwing-vector-functions">Non-throwing Vector functions<a href="https://velox-lib.io/blog/optimize-try-more#non-throwing-vector-functions" class="hash-link" aria-label="Direct link to Non-throwing Vector functions" title="Direct link to Non-throwing Vector functions">​</a></h2>
<p>Vector functions can implement non-throwing behavior by leveraging the new
<code>EvalCtx::setStatus(row, status)</code> API. However, nowadays we expect virtually all
functions to be written using the Simple Function API.</p>
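<p>To make the per-row idea concrete, here is a self-contained sketch. This is not Velox's actual EvalCtx: <code>RowErrors</code>, <code>modBatch</code>, and the error representation are invented for illustration. The point is the shape of the API: a failing row records a status and the batch keeps going, instead of an exception aborting everything.</p>

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for EvalCtx::setStatus(row, status): collect a
// per-row error instead of aborting the whole batch with an exception.
struct RowErrors {
  std::vector<std::optional<std::string>> errors;

  void setStatus(size_t row, std::string message) {
    if (errors.size() <= row) {
      errors.resize(row + 1);
    }
    errors[row] = std::move(message);
  }
};

// Batch modulo: failing rows get a status recorded, good rows get a result.
std::vector<int64_t> modBatch(
    const std::vector<int64_t>& a,
    const std::vector<int64_t>& b,
    RowErrors& ctx) {
  std::vector<int64_t> result(a.size(), 0);
  for (size_t row = 0; row < a.size(); ++row) {
    if (b[row] == 0) {
      ctx.setStatus(row, "Division by zero");
      continue;
    }
    result[row] = a[row] % b[row];
  }
  return result;
}
```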
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="non-throwing-cast">Non-throwing CAST<a href="https://velox-lib.io/blog/optimize-try-more#non-throwing-cast" class="hash-link" aria-label="Direct link to Non-throwing CAST" title="Direct link to Non-throwing CAST">​</a></h2>
<p>CAST is complex. A single name covers several dozen individual operations.
The full matrix of <a href="https://facebookincubator.github.io/velox/functions/presto/conversion.html#supported-conversions">supported conversions</a> is available in the Velox
documentation. Not all casts throw. For example, a cast from an integer to a
string does not throw. However, casts from strings may fail in multiple ways. A
common failure scenario is casting an empty string. Laith Sakka optimized
this use case earlier.</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">&gt; select cast('' as integer);</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Cannot cast '' to INT</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>However, we are also seeing failures in casting non-empty strings and NaN floating point values to integers.</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">&gt; select cast(nan() as bigint);</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Unable to cast NaN to bigint</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">&gt; select cast('123x' as integer);</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Cannot cast '123x' to INT</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>CAST from string to integer and floating point values is implemented using the
<code>folly::to</code> template. Luckily, there is a non-throwing version: <code>folly::tryTo</code>.
We changed our CAST implementation to use <code>folly::tryTo</code> to avoid throwing.
Not throwing improved the performance of TRY_CAST by 20x.</p>
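<p>The same non-throwing style exists in the C++ standard library, so the shape of the change can be sketched without folly: <code>std::from_chars</code> reports failure through a return code rather than an exception, much like <code>folly::tryTo</code> returns an expected-or-error. The helper below is a hypothetical illustration (<code>tryCastToInt</code> is not Velox code), not the actual CAST implementation.</p>

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>

// Illustrative non-throwing parse in the style of folly::tryTo: failure is a
// value (std::nullopt), not an exception.
std::optional<int32_t> tryCastToInt(std::string_view input) {
  int32_t value{};
  auto [ptr, ec] =
      std::from_chars(input.data(), input.data() + input.size(), value);
  // Reject both parse errors ('', '$') and trailing garbage ('123x').
  if (ec != std::errc{} || ptr != input.data() + input.size()) {
    return std::nullopt;
  }
  return value;
}
```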
<p>Still, the profile showed that there is room for further improvement.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="do-not-produce-or-store-error-messages-under-try">Do not produce or store error messages under TRY<a href="https://velox-lib.io/blog/optimize-try-more#do-not-produce-or-store-error-messages-under-try" class="hash-link" aria-label="Direct link to Do not produce or store error messages under TRY" title="Direct link to Do not produce or store error messages under TRY">​</a></h2>
<p>After switching to the non-throwing implementation, the profile showed that half the
CPU time went into folly::makeConversionError. folly::tryTo returns either a result
or a ConversionCode enum. CAST uses folly::makeConversionError to convert the
ConversionCode into a user-friendly error message. This involves allocating and
populating a string for the error message, copying it into a std::range_error
object, then copying it again into Status. This error message is very helpful
when it propagates all the way to the user, but it is not needed when the
error is suppressed via TRY or TRY_CAST.</p>
<figure><img src="https://velox-lib.io/img/optimize-try-more-profile2.png"></figure>
<p>To solve this problem we introduced a thread-local flag, threadSkipErrorDetails,
that indicates whether Status needs to include a detailed error message or not.
By default, this flag is ‘false’, but TRY and TRY_CAST set it to ‘true’. CAST
logic checks this flag to decide whether to call folly::makeConversionError or
not. This change gives a 3x performance boost to TRY_CAST and 2x
to TRY.</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">    if (threadSkipErrorDetails()) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      return folly::makeUnexpected(Status::UserError());</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    return folly::makeUnexpected(Status::UserError(</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        "{}", folly::makeConversionError(result.error(), "").what()));</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>After this optimization, we observed that TRY(CAST(...)) is up to 5x slower than
TRY_CAST when many rows fail.</p>
<p>The profile revealed that 30% of CPU time went to
EvalCtx::ensureErrorsVectorSize. For every row that fails, we call
EvalCtx::ensureErrorsVectorSize to resize the errors vector to accommodate that
row. When many rows fail, we end up resizing a lot: resize(1), resize(2),
resize(3), ..., resize(n). We fixed this by pre-allocating the errors vector in the TRY
expression.</p>
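<p>The effect of the fix can be pictured with a self-contained analogue. <code>ErrorTracker</code> and its method names are invented for illustration (Velox's errors vector holds richer values than booleans): the old path grows the vector on every failing row, the fixed path sizes it once for all rows the TRY may touch.</p>

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: grow-per-failure vs. a single up-front resize.
struct ErrorTracker {
  std::vector<bool> failed;

  // Before the fix: resize on every failing row -- resize(1), resize(2), ...
  void recordFailureGrowing(size_t row) {
    if (failed.size() <= row) {
      failed.resize(row + 1);
    }
    failed[row] = true;
  }

  // After the fix: size the vector once before evaluating the rows.
  void reserveRows(size_t numRows) {
    failed.resize(numRows);
  }

  void recordFailure(size_t row) {
    failed[row] = true;
  }
};
```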
<p>Another 30% of CPU time went into managing reference counts for the
<code>std::shared_ptr&lt;std::exception_ptr&gt;</code> values stored in the errors vector. We do not need
error details for TRY, hence there is no need to store these values. We fixed this by
making error values in the errors vector optional and updating EvalCtx::setStatus to
skip writing them under TRY.</p>
<figure><img src="https://velox-lib.io/img/optimize-try-more-profile3.png"></figure>
<p>After all these optimizations, the microbenchmark that measures performance of
casting invalid strings into integers showed 100x improvement. The benchmark
evaluates 4 expressions:</p>
<ul>
<li>TRY_CAST(‘’ AS INTEGER)</li>
<li>TRY(CAST(‘’ AS INTEGER))</li>
<li>TRY_CAST(‘$’ AS INTEGER)</li>
<li>TRY(CAST(‘$’ AS INTEGER))</li>
</ul>
<p>When we started, the benchmark results were:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">===============================================================</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">================================================================</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">cast##try_cast_invalid_empty_input                          2.40ms    417.47</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">cast##tryexpr_cast_invalid_empty_input                    402.63ms      2.48</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">cast##try_cast_invalid_nan                                392.14ms      2.55</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">cast##tryexpr_cast_invalid_nan                            827.09ms      1.21</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" 
d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>By the end, the numbers had improved by roughly 100x:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">cast##try_cast_invalid_empty_input                          2.16ms    463.62</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">cast##tryexpr_cast_invalid_empty_input                      4.29ms    232.95</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">cast##try_cast_invalid_nan                                  5.47ms    182.83</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">cast##tryexpr_cast_invalid_nan                              7.76ms    128.81</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>Note: The performance of TRY_CAST(‘’ AS INTEGER) hasn’t changed because this
particular use case has been optimized by Laith Sakka earlier.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="next-steps">Next steps<a href="https://velox-lib.io/blog/optimize-try-more#next-steps" class="hash-link" aria-label="Direct link to Next steps" title="Direct link to Next steps">​</a></h2>
<p>We can identify queries with a high percentage of numSilentThrow rows and
change throwing functions to not throw.</p>
<p>For simple functions, this involves changing the ‘call’ method to return Status
and replacing ‘throw’ statements with return Status::UserError(...). You get
extra points for producing error messages conditionally based on the thread-local
flag threadSkipErrorDetails().</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">template &lt;typename TExec&gt;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">struct NoThrowModFunction {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  VELOX_DEFINE_FUNCTION_TYPES(TExec);</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  Status call(int64_t&amp; result, const int64_t&amp; a, const int64_t&amp; b) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    if (b == 0) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      if (threadSkipErrorDetails()) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        return Status::UserError();</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      }</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      return Status::UserError("Division by zero");</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    result = a % b;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    return Status::OK();</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">};</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>We are changing CAST(varchar AS date) to not throw.</p>
<p>We provided a non-throwing ‘call’ API for simple functions that never return
NULL for a non-NULL input. This covers the majority of Presto functions. For
completeness, we would want to provide non-throwing ‘call’ APIs for all other
use cases:</p>
<ul>
<li><code>bool call()</code> for returning NULL sometimes</li>
<li><code>callAscii</code> for processing all-ASCII inputs</li>
<li><code>callNullable</code> for processing possibly NULL inputs</li>
<li><code>callNullFree</code> for processing complex inputs with all NULLs removed</li>
</ul>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="acknowledgements">Acknowledgements<a href="https://velox-lib.io/blog/optimize-try-more#acknowledgements" class="hash-link" aria-label="Direct link to Acknowledgements" title="Direct link to Acknowledgements">​</a></h2>
<p>Thank you <a href="https://www.linkedin.com/in/laith-sakka-629ab384/">Laith Sakka</a>
for doing the initial work to investigate and optimize TRY_CAST for empty strings
and sharing your findings in a blog post.</p>
<p>Thank you <a href="https://www.linkedin.com/in/orrierling">Orri Erling</a> for
adding numSilentThrow runtime stat to report number of suppressed exceptions.</p>
<p>Thank you <a href="https://www.linkedin.com/in/pedro-pedreira/">Pedro Eugenio Rocha Pedreira</a>
for introducing the velox::Status class.</p>
<p>Thank you <a href="https://www.linkedin.com/in/bikramjeet-vig/">Bikramjeet Vig</a>,
<a href="https://www.linkedin.com/in/gongchuo-lu-23bb251b/">Jimmy Lu</a>,
<a href="https://www.linkedin.com/in/orrierling">Orri Erling</a>,
<a href="https://www.linkedin.com/in/pedro-pedreira/">Pedro Eugenio Rocha Pedreira</a> and
<a href="https://www.linkedin.com/in/xiaoxuanmeng">Xiaoxuan Meng</a> for brainstorming and
helping with code reviews.</p>]]></content:encoded>
            <category>tech-blog</category>
            <category>expressions</category>
        </item>
        <item>
            <title><![CDATA[Improve LIKE's performance]]></title>
            <link>https://velox-lib.io/blog/like</link>
            <guid>https://velox-lib.io/blog/like</guid>
            <pubDate>Sat, 27 Jan 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[What is LIKE?]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="what-is-like">What is LIKE?<a href="https://velox-lib.io/blog/like#what-is-like" class="hash-link" aria-label="Direct link to What is LIKE?" title="Direct link to What is LIKE?">​</a></h2>
<p><a href="https://prestodb.io/docs/current/functions/comparison.html#like">LIKE</a> is a very useful SQL operator.
It is used to do string pattern matching. The following examples of LIKE usage are from the Presto documentation:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">SELECT * FROM (VALUES ('abc'), ('bcd'), ('cde')) AS t (name)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">WHERE name LIKE '%b%'</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">--returns 'abc' and  'bcd'</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">SELECT * FROM (VALUES ('abc'), ('bcd'), ('cde')) AS t (name)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">WHERE name LIKE '_b%'</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">--returns 'abc'</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">SELECT * FROM (VALUES ('a_c'), ('_cd'), ('cde')) AS t (name)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">WHERE name LIKE '%#_%' ESCAPE '#'</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">--returns 'a_c' and  '_cd'</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 
6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>These examples show the basic usage of LIKE:</p>
<ul>
<li>Use <code>%</code> to match zero or more characters.</li>
<li>Use <code>_</code> to match exactly one character.</li>
<li>If we need to match <code>%</code> and <code>_</code> literally, we can specify an escape char to escape them.</li>
</ul>
<p>When we use Velox as the backend to evaluate Presto queries, the LIKE operation is translated
into a Velox function call, e.g. <code>name LIKE '%b%'</code> is translated to
<code>like(name, '%b%')</code>. Internally, Velox converts the pattern string into a regular
expression and then uses the regular expression library <a href="https://github.com/google/re2">RE2</a>
to do the pattern matching. RE2 is a very good regular expression library: it is fast
and safe, which gives Velox's LIKE function good performance. But some popular simple patterns
can be handled with plain C++ string operations instead of a regex.
e.g. the pattern <code>hello%</code> matches inputs that start with <code>hello</code>, which can be implemented by a direct memory
comparison of the prefix ('hello' in this case) bytes of the input:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">// Match the first 'length' characters of string 'input' and prefix pattern.</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">bool matchPrefixPattern(</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    StringView input,</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    const std::string&amp; pattern,</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    size_t length) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  return input.size() &gt;= length &amp;&amp;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      std::memcmp(input.data(), pattern.data(), length) == 0;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>This is much faster than using RE2: our benchmark shows a 750x speedup. We can apply similar
optimizations to some other patterns:</p>
<ul>
<li><code>%hello</code>: matches inputs that end with <code>hello</code>. It can be optimized by direct memory comparison of suffix bytes of the inputs.</li>
<li><code>%hello%</code>: matches inputs that contain <code>hello</code>. It can be optimized by using <code>std::string_view::find</code> to check whether inputs contain <code>hello</code>.</li>
</ul>
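<p>These two fast paths can be sketched in the same style as the prefix matcher above. These helpers are illustrative, not Velox's actual implementation; like the prefix case, both operate on raw bytes, so they work for ASCII and Unicode inputs alike.</p>

```cpp
#include <cstring>
#include <string_view>

// Sketch of the '%hello' fast path: compare the trailing bytes of the input
// against the suffix.
bool matchSuffixPattern(std::string_view input, std::string_view suffix) {
  return input.size() >= suffix.size() &&
      std::memcmp(
          input.data() + input.size() - suffix.size(),
          suffix.data(),
          suffix.size()) == 0;
}

// Sketch of the '%hello%' fast path: a plain substring search instead of a
// regex match.
bool matchSubstringPattern(std::string_view input, std::string_view needle) {
  return input.find(needle) != std::string_view::npos;
}
```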
<p>These simple patterns are straightforward to optimize. There are some more relaxed patterns that
are not so straightforward:</p>
<ul>
<li><code>hello_velox%</code>: matches inputs that start with 'hello', followed by any single character, followed by 'velox'.</li>
<li><code>%hello_velox</code>: matches inputs that end with 'hello', followed by any single character, followed by 'velox'.</li>
<li><code>%hello_velox%</code>: matches inputs that contain 'hello' and 'velox' separated by a single character.</li>
</ul>
<p>Although these patterns look similar to the previous ones, they are not so straightforward
to optimize: <code>_</code> matches any single character, so we cannot simply use memory comparison to
do the matching. And if the user's input is not pure ASCII, <code>_</code> might match more than one byte, which
makes the implementation even more complex. Also note that the above patterns are just for
illustrative purposes; actual patterns can be more complex, e.g. <code>h_e_l_l_o</code>, so a trivial algorithm
will not work.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="optimizing-relaxed-patterns">Optimizing Relaxed Patterns<a href="https://velox-lib.io/blog/like#optimizing-relaxed-patterns" class="hash-link" aria-label="Direct link to Optimizing Relaxed Patterns" title="Direct link to Optimizing Relaxed Patterns">​</a></h2>
<p>We optimized these patterns as follows. First, we split the pattern into a list of sub-patterns, e.g.
<code>hello_velox%</code> is split into the sub-patterns <code>hello</code>, <code>_</code>, <code>velox</code>, <code>%</code>. Because there is
a <code>%</code> at the end, we classify it as a <code>kRelaxedPrefix</code> pattern, which means we need to do prefix
matching, but not trivial prefix matching: we need to match three sub-patterns:</p>
<ul>
<li>kLiteralString: hello</li>
<li>kSingleCharWildcard: _</li>
<li>kLiteralString: velox</li>
</ul>
<p>For <code>kLiteralString</code> we simply do a memory comparison:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">if (subPattern.kind == SubPatternKind::kLiteralString &amp;&amp;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    std::memcmp(</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        input.data() + start + subPattern.start,</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        patternMetadata.fixedPattern().data() + subPattern.start,</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        subPattern.length) != 0) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  return false;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>Note that since it is a memory comparison, it handles both pure ASCII inputs and inputs that
contain Unicode characters.</p>
<p>Matching <code>_</code> is more complex, considering that Unicode inputs contain variable-length
multi-byte characters. Fortunately, there is an existing library that provides Unicode-related operations: <a href="https://juliastrings.github.io/utf8proc/">utf8proc</a>.
It provides functions that tell us whether a byte in the input is the start of a character,
how many bytes the current character consists of, etc. So to match a sequence of <code>_</code>, our algorithm is:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">if (subPattern.kind == SubPatternKind::kSingleCharWildcard) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  // Match every single char wildcard.</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  for (auto i = 0; i &lt; subPattern.length; i++) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    if (cursor &gt;= input.size()) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      return false;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    auto numBytes = unicodeCharLength(input.data() + cursor);</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    cursor += numBytes;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 
19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>Here:</p>
<ul>
<li><code>cursor</code> is the index in the input we are trying to match.</li>
<li><code>unicodeCharLength</code> is a function that wraps a utf8proc function to determine how many bytes the current character consists of.</li>
</ul>
<p>So the logic is basically to repeatedly compute the size of the current character and skip it.</p>
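<p>For illustration, here is a minimal stand-in for such a helper that reads the character length from the UTF-8 lead byte. <code>unicodeCharLengthSketch</code> is a made-up name, and it assumes well-formed input; the utf8proc-based helper additionally validates continuation bytes, which this sketch skips.</p>

```cpp
#include <cstddef>
#include <cstdint>

// Length in bytes of the UTF-8 character starting at 'data', decoded from the
// lead byte alone (assumes well-formed UTF-8).
size_t unicodeCharLengthSketch(const char* data) {
  auto lead = static_cast<uint8_t>(*data);
  if (lead < 0x80) {
    return 1; // ASCII: 0xxxxxxx.
  }
  if ((lead >> 5) == 0b110) {
    return 2; // 2-byte sequence: 110xxxxx.
  }
  if ((lead >> 4) == 0b1110) {
    return 3; // 3-byte sequence: 1110xxxx.
  }
  return 4; // 4-byte sequence: 11110xxx.
}
```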
<p>This seems not that complex, but note that this logic is not efficient for pure ASCII input.
In pure ASCII input, every character is one byte, so to match a sequence of <code>_</code> we don't need to calculate the size
of each character in a for-loop. In fact, for pure ASCII input we don't need to explicitly match <code>_</code> at all.
We can use the following logic instead:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">for (const auto&amp; subPattern : patternMetadata.subPatterns()) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    if (subPattern.kind == SubPatternKind::kLiteralString &amp;&amp;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        std::memcmp(</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            input.data() + start + subPattern.start,</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            patternMetadata.fixedPattern().data() + subPattern.start,</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            subPattern.length) != 0) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      return false;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" 
d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>It matches only the kLiteralString sub-patterns, each at its fixed position in the input; every <code>_</code>
is matched automatically (in effect, skipped), so there is no need to match it explicitly. With this
optimization we get a 40x speedup for kRelaxedPrefix patterns and a 100x speedup for kRelaxedSuffix patterns.</p>
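<p>The fixed-offset idea can be sketched in a few lines of standalone C++. This is a simplified illustration with hypothetical types (<code>SubPattern</code> and <code>matchAsciiFixedLength</code> are made up for this sketch, not Velox's actual PatternMetadata machinery): for pure-ASCII input of the right length, only the literal chunks need comparing, each at its fixed offset.</p>

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Simplified sketch (not Velox's actual types): a sub-pattern is either a
// literal chunk or a run of '_' wildcards, at a fixed offset in the pattern.
struct SubPattern {
  bool isLiteral; // true for a literal chunk, false for a '_' run
  size_t start;   // offset within the pattern
  size_t length;  // number of characters
};

// For pure-ASCII input, every character is one byte, so a '_' run matches
// exactly `length` bytes by construction. Only literal chunks are compared,
// each at its fixed offset; '_' runs are skipped implicitly.
bool matchAsciiFixedLength(
    const std::string& input,
    const std::string& pattern,
    const std::vector<SubPattern>& subPatterns) {
  if (input.size() != pattern.size()) {
    return false;
  }
  for (const auto& sub : subPatterns) {
    if (sub.isLiteral &&
        std::memcmp(
            input.data() + sub.start,
            pattern.data() + sub.start,
            sub.length) != 0) {
      return false;
    }
  }
  return true;
}
```

<p>For example, the pattern <code>a_c</code> decomposes into a literal at offset 0, a one-character wildcard run at offset 1, and a literal at offset 2; the input <code>abc</code> matches while <code>xbc</code> does not, without ever inspecting the byte under the <code>_</code>.</p>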
<p>Thank you <a href="https://github.com/mbasmanova">Maria Basmanova</a> for spending a lot of time
reviewing the code.</p>]]></content:encoded>
            <category>tech-blog</category>
            <category>performance</category>
        </item>
        <item>
            <title><![CDATA[reduce_agg lambda aggregate function]]></title>
            <link>https://velox-lib.io/blog/reduce-agg</link>
            <guid>https://velox-lib.io/blog/reduce-agg</guid>
            <pubDate>Mon, 02 Oct 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Definition]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="definition">Definition<a href="https://velox-lib.io/blog/reduce-agg#definition" class="hash-link" aria-label="Direct link to Definition" title="Direct link to Definition">​</a></h2>
<p><a href="https://facebookincubator.github.io/velox/functions/presto/aggregate.html#reduce_agg">Reduce_agg</a>
is the only lambda aggregate function in Presto. It allows users to define arbitrary aggregation
logic using two lambda functions.</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">reduce_agg(inputValue T, initialState S, inputFunction(S, T, S), combineFunction(S, S, S)) → S</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Reduces all non-NULL input values into a single value. inputFunction will be invoked for</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">each non-NULL input value. If all inputs are NULL, the result is NULL. In addition to taking</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">the input value, inputFunction takes the current state, initially initialState, and returns the</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">new state. combineFunction will be invoked to combine two states into a new state. The final</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">state is returned. 
Throws an error if initialState is NULL or inputFunction or combineFunction</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">returns a NULL.</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>One can think of reduce_agg as using inputFunction to implement partial aggregation and
combineFunction to implement final aggregation. Partial aggregation processes a list of
input values and produces an intermediate state:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">auto s = initialState;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">for (auto x : input) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   s = inputFunction(s, x);</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">return s;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>Final aggregation processes a list of intermediate states and computes the final state.</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">auto s = intermediates[0];</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">for (auto i = 1; i &lt; intermediates.size(); ++i) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   s = combineFunction(s, intermediates[i]);</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">return s;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>For example, one can implement SUM aggregation using reduce_agg as follows:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">reduce_agg(c, 0, (s, x) -&gt; s + x, (s, s2) -&gt; s + s2)</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>Implementation of AVG aggregation is a bit trickier. For AVG, state is a tuple of sum and
count. Hence, reduce_agg can be used to compute (sum, count) pair, but it cannot compute
the actual average. One needs to apply a scalar function on top of reduce_agg to get the
average.</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">SELECT id, sum_and_count.sum / sum_and_count.count FROM (</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  SELECT id, reduce_agg(value, CAST(row(0, 0) AS row(sum double, count bigint)),</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    (s, x) -&gt; CAST(row(s.sum + x, s.count + 1) AS row(sum double, count bigint)),</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    (s, s2) -&gt; CAST(row(s.sum + s2.sum, s.count + s2.count) AS row(sum double, count bigint))) AS sum_and_count</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  FROM t</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  GROUP BY id</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">);</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>The examples of using reduce_agg to compute SUM and AVG are for illustrative purposes.
One should not use reduce_agg if a specialized aggregation function is available.</p>
<p>One use case for reduce_agg we see in production is to compute a product of input values.</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">reduce_agg(c, 1.0, (s, x) -&gt; s * x, (s, s2) -&gt; s * s2)</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>Another example is to compute a list of top N distinct values from all input arrays.</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">reduce_agg(x, array[],</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            (a, b) -&gt; slice(reverse(array_sort(array_distinct(concat(a, b)))), 1, 1000),</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            (a, b) -&gt; slice(reverse(array_sort(array_distinct(concat(a, b)))), 1, 1000))</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>Note that this is equivalent to the following query:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">SELECT array_agg(v) FROM (</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    SELECT DISTINCT v</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    FROM t, UNNEST(x) AS u(v)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    ORDER BY v DESC</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    LIMIT 1000</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">)</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
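<p>In C++ terms, the combine lambda above (concatenate, deduplicate, sort descending, keep the first N) can be sketched as follows. <code>combineTopN</code> is a hypothetical helper written just for this illustration, not part of Velox:</p>

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <vector>

// Sketch of the top-N combine lambda: merge two arrays, drop duplicates,
// order descending, and keep at most n values. A std::set with a
// greater<> comparator does the dedupe and the descending sort in one go.
std::vector<int> combineTopN(
    const std::vector<int>& a, const std::vector<int>& b, size_t n) {
  std::set<int, std::greater<int>> distinct(a.begin(), a.end());
  distinct.insert(b.begin(), b.end());
  std::vector<int> result(distinct.begin(), distinct.end());
  if (result.size() > n) {
    result.resize(n);
  }
  return result;
}
```

<p>For instance, combining <code>[5, 1, 3]</code> and <code>[3, 7, 2]</code> with n = 3 keeps the three largest distinct values: <code>[7, 5, 3]</code>.</p>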
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="implementation">Implementation<a href="https://velox-lib.io/blog/reduce-agg#implementation" class="hash-link" aria-label="Direct link to Implementation" title="Direct link to Implementation">​</a></h2>
<p>An efficient implementation of the reduce_agg lambda function is not straightforward. Let’s
consider the logic for partial aggregation.</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">auto s = initialState;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">for (auto x : input) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   s = inputFunction(s, x);</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>This is a data-dependent loop, i.e. the next loop iteration depends on the results of
the previous iteration. inputFunction needs to be invoked on each input value <code>x</code>
separately. Since inputFunction is a user-defined lambda, invoking inputFunction means
evaluating an expression. And since expression evaluation in Velox is optimized for
processing large batches of values at a time, evaluating expressions on one value at
a time is very inefficient. To optimize the implementation of reduce_agg we need to
reduce the number of times we evaluate user-defined lambdas and increase the number
of values we process each time.</p>
<p>One approach is to</p>
<ol>
<li>convert all input values into states by evaluating inputFunction(initialState, x);</li>
<li>split states into pairs and evaluate combineFunction on all pairs;</li>
<li>repeat step (2) until we have only one state left.</li>
</ol>
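<p>The three steps above can be sketched as follows. <code>reduceBatched</code> is a simplified, hypothetical helper written for this post, not the Velox implementation; in the real engine each round corresponds to one vectorized expression evaluation over all states at once.</p>

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Sketch of the batched reduce_agg strategy. Step 1 turns every input value
// into a state; step 2 combines states pairwise, halving their count per
// round, so there are about log2(N) combine rounds for N inputs.
template <typename S, typename T>
S reduceBatched(
    const std::vector<T>& input,
    S initialState,
    const std::function<S(S, T)>& inputFunction,
    const std::function<S(S, S)>& combineFunction) {
  // Step 1: convert all inputs into states in one pass.
  std::vector<S> states;
  states.reserve(input.size());
  for (const auto& x : input) {
    states.push_back(inputFunction(initialState, x));
  }
  // Steps 2 and 3: combine states pairwise until one remains. When the
  // count is odd, the extra state is carried over into the next round.
  while (states.size() > 1) {
    std::vector<S> next;
    size_t i = 0;
    for (; i + 1 < states.size(); i += 2) {
      next.push_back(combineFunction(states[i], states[i + 1]));
    }
    if (i < states.size()) {
      next.push_back(states[i]); // the leftover extra state
    }
    states = std::move(next);
  }
  return states.empty() ? initialState : states[0];
}
```

<p>With SUM-style lambdas, for example, summing {1, 2, 3, 4, 5} takes one inputFunction pass and three combine rounds ({3, 7, 5} → {10, 5} → {15}) instead of five sequential inputFunction calls.</p>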
<p>Let’s say we have 1024 values to process. Step 1 evaluates inputFunction expression
on 1024 values at once. Step 2 evaluates combineFunction on 512 pairs, then on 256
pairs, then on 128 pairs, 64, 32, 16, 8, 4, 2, finally producing a single state.
Step 2 evaluates combineFunction 9 times. In total, this implementation evaluates
user-defined expressions 10 times on multiple values each time. This is a lot more
efficient than the original implementation that evaluates user-defined expressions
1024 times.</p>
<p>In general, given N inputs, the original implementation evaluates expressions N times
while the new one does so only about log2(N) times.</p>
<p>Note that when N is not a power of two, splitting states into pairs may leave
an extra state. For example, splitting 11 states produces 5 pairs plus one extra state.
In this case, we set the extra state aside, evaluate combineFunction on the 5 pairs, then
bring the extra state back for a total of 6 states and continue.</p>
<p>A benchmark, velox/functions/prestosql/aggregates/benchmarks/ReduceAgg.cpp, shows that
the initial implementation of reduce_agg is 60x slower than SUM, while the optimized
implementation is only 3x slower. A specialized aggregation function will always be
more efficient than the generic reduce_agg; hence, reduce_agg should be used only when a
specialized aggregation function is not available.</p>
<p>Finally, a side effect of the optimized implementation is that it doesn't support
applying reduce_agg to sorted inputs. That is, one cannot use reduce_agg to compute an
equivalent of</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">	SELECT a, array_agg(b ORDER BY b) FROM t GROUP BY 1</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>The array_agg computation depends on the order of inputs. A comparable implementation
using reduce_agg would look like this:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">	SELECT a,</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        reduce_agg(b, array[],</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                    (s, x) -&gt; concat(s, array[x]),</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                    (s, s2) -&gt; concat(s, s2)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                    ORDER BY b)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    FROM t GROUP BY 1</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>To respect ORDER BY b, reduce_agg would have to apply inputFunction to each
input value one at a time using the data-dependent loop shown above. As we saw, this
is very expensive. The optimization we apply does not preserve the order of inputs and,
hence, cannot support the query above. Note that
<a href="https://github.com/facebookincubator/velox/issues/6434">Presto</a> doesn't
support applying reduce_agg to sorted inputs either.</p>
<p>Thank you <a href="https://www.linkedin.com/in/orrierling">Orri Erling</a> for brainstorming
and <a href="https://www.linkedin.com/in/xiaoxuanmeng">Xiaoxuan Meng</a> and
<a href="https://www.linkedin.com/in/pedro-pedreira/">Pedro Eugênio Rocha Pedreira</a> for
reviewing the code.</p>]]></content:encoded>
            <category>tech-blog</category>
            <category>functions</category>
        </item>
    </channel>
</rss>