Speeding Up Sumcheck
Presentation: https://www.youtube.com/watch?v=YXANnCp5118
The sum-check protocol is a fundamental tool in the design of modern succinct interactive proofs. It allows us to verify claims of the form
$$\sum_{x \in \{0,1\}^\ell} g(x) = c_0$$
for an $\ell$-variate polynomial $g$ of degree at most $d$ in each variable, by reducing them to checking the evaluation of $g$ at a random point, via a single oracle query at the random challenge $(r_1, \dots, r_\ell) \in \mathbb{F}^\ell$:
$$g(r_1, \dots, r_\ell) \stackrel{?}{=} c_\ell.$$
Recap of the Protocol
Setup: Assume that the verifier $\mathcal{V}$ has oracle access to the polynomial $g$, and aims to verify:
$$\sum_{x \in \{0,1\}^\ell} g(x) = c_0,$$
where $c_0$ is the claimed sum.
For each round $i = 1, \dots, \ell$:
$\mathcal{P}$ sends the univariate polynomial $s_i(X)$, claimed to equal:
$$s_i(X) = \sum_{x \in \{0,1\}^{\ell - i}} g(r_1, \dots, r_{i-1}, X, x).$$
$\mathcal{P}$ does so by sending the evaluations $s_i(0), s_i(1), \dots, s_i(d)$ to $\mathcal{V}$.
$\mathcal{V}$ sends a random challenge $r_i \in \mathbb{F}$ to $\mathcal{P}$.
$\mathcal{V}$ derives $s_i$ from these evaluations, checks that $s_i(0) + s_i(1) = c_{i-1}$, then sets $c_i := s_i(r_i)$. After $\ell$ rounds, we reduce to the claim that $g(r_1, \dots, r_\ell) = c_\ell$. $\mathcal{V}$ checks this claim by making a single oracle query to $g$.
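To make the round structure concrete, here is a minimal end-to-end sketch of the protocol for $g = p_1 \cdot p_2$ over a toy prime field. All names (`P`, `fold`, `round_poly`) and the modulus are my own choices for illustration; a real deployment would sample challenges from a large extension field rather than the base field.

```python
# Minimal sum-check sketch for g = p1 * p2 over a toy prime field.
import random

P = 2**31 - 1  # toy prime modulus; real systems draw challenges from an extension field

def fold(evals, r):
    """Bind the first (most significant) variable of a multilinear table to r."""
    half = len(evals) // 2
    return [(evals[i] + r * (evals[half + i] - evals[i])) % P for i in range(half)]

def round_poly(tables, d):
    """Evaluate s_i at the points 0, 1, ..., d by extrapolating each table
    in its first variable and summing the product over the remaining cube."""
    half = len(tables[0]) // 2
    evals = []
    for t in range(d + 1):
        acc = 0
        for x in range(half):
            prod = 1
            for tab in tables:
                # multilinear in the first variable: tab(t, x) = lo + t*(hi - lo)
                prod = prod * ((tab[x] + t * (tab[half + x] - tab[x])) % P) % P
            acc = (acc + prod) % P
        evals.append(acc)
    return evals

def interpolate_at(points, r):
    """Lagrange-evaluate the degree-d polynomial through (j, points[j]) at r."""
    total = 0
    for j, yj in enumerate(points):
        num, den = 1, 1
        for m in range(len(points)):
            if m != j:
                num = num * ((r - m) % P) % P
                den = den * ((j - m) % P) % P
        total = (total + yj * num % P * pow(den, P - 2, P)) % P
    return total

ell, d = 4, 2
p1 = [random.randrange(100) for _ in range(2**ell)]  # "small" evaluations
p2 = [random.randrange(100) for _ in range(2**ell)]
claim = sum(a * b for a, b in zip(p1, p2)) % P

tables = [p1[:], p2[:]]
rs = []
for i in range(ell):
    s = round_poly(tables, d)
    assert (s[0] + s[1]) % P == claim        # verifier's consistency check
    r = random.randrange(P)                  # verifier's challenge
    claim = interpolate_at(s, r)             # new claim: s_i(r_i)
    tables = [fold(tab, r) for tab in tables]
    rs.append(r)

# final check: one oracle query to g = p1 * p2 at (r_1, ..., r_l)
assert claim == tables[0][0] * tables[1][0] % P
print("sum-check verified at point", rs)
```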
Typically, the initial polynomial $g$ involves "small" values (e.g., from a base field $\mathbb{B}$), but for security, the protocol's challenges are drawn from a much larger set of possible values (e.g., from an extension field of $\mathbb{B}$). Below, I will denote this challenge field as $\mathbb{F}$. This creates a significant performance gap between three types of multiplication:
ss (small-small): base field by base field. Very fast.
sl (small-large): base field by extension field. Fast.
ll (large-large): extension field by extension field. Very slow.
Therefore, one of the central optimization ideas is to reduce the number of multiplications, and in particular to trade expensive ll multiplications for ss and sl ones wherever possible.
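To see where the gap comes from, here is a toy sketch (my own model, not from the paper) of a quadratic extension $\mathbb{F} = \mathbb{B}[u]/(u^2 - 7)$, counting base-field multiplications per product; the constant 7 is an assumed non-residue used purely for illustration.

```python
# Toy model (not from the paper): challenge field as a quadratic extension
# F = B[u]/(u^2 - 7), with 7 assumed to be a non-residue (illustrative only).
# Each function returns (result, number of base-field mults consumed).
P = 2**31 - 1   # toy base-field modulus

def mul_ss(a, b):
    """small * small: one base-field multiplication."""
    return a * b % P, 1

def mul_sl(a, x):
    """small * large: one base mult per extension coefficient."""
    x0, x1 = x
    return (a * x0 % P, a * x1 % P), 2

def mul_ll(x, y):
    """large * large via Karatsuba: three base mults (plus cheap constant mults)."""
    (x0, x1), (y0, y1) = x, y
    t0, t1, t2 = x0 * y0 % P, x1 * y1 % P, (x0 + x1) * (y0 + y1) % P
    # (x0 + x1*u)(y0 + y1*u) = (t0 + 7*t1) + (t2 - t0 - t1)*u
    return ((t0 + 7 * t1) % P, (t2 - t0 - t1) % P), 3

print(mul_ss(3, 5)[1], mul_sl(3, (5, 7))[1], mul_ll((3, 5), (7, 11))[1])  # 1 2 3
```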
Existing Algorithms
Consider the case where the polynomial $g$ is given as
$$g(x) = \prod_{k=1}^{d} p_k(x),$$
where $p_1, \dots, p_d$ are multilinear polynomials in $\ell$ variables such that $p_k(x)$ has small values for all $k \in [d]$ and $x \in \{0,1\}^\ell$.
Algorithm 1: Linear Time and Linear Space
The prover maintains $d$ arrays $A_1, \dots, A_d$, which initially store all evaluations of $p_1, \dots, p_d$ over $\{0,1\}^\ell$, i.e., we have $A_k[x] = p_k(x)$ for all $k \in [d]$ and $x \in \{0,1\}^\ell$. The prover halves the size of the arrays after each round $i$, via "binding" the arrays to the challenge $r_i$ received from the verifier:
$$A_k[x] \leftarrow A_k[0, x] + r_i \cdot \big(A_k[1, x] - A_k[0, x]\big).$$
Cost Analysis:
In round 1, since the initial evaluations are small, all multiplications are ss. Since we need to compute the evaluations $s_1(0), \dots, s_1(d)$, and each evaluation is a sum over $2^{\ell-1}$ products of $d$ small values, the total cost is $\approx d^2 \cdot 2^{\ell-1}$ ss multiplications. We then spend $d \cdot 2^{\ell-1}$ sl multiplications updating the arrays.
From round $i \geq 2$ onward, all array values are large, so we incur $\approx d^2 \cdot 2^{\ell-i}$ ll multiplications for computing all evaluations of $s_i$, and $d \cdot 2^{\ell-i}$ ll multiplications for updating the arrays.
Lemma. When the evaluations of $p_1, \dots, p_d$ are small, Algorithm 1's dominant cost is $\approx (d^2 + d) \cdot 2^{\ell-1}$ ll multiplications, followed by $\approx d^2 \cdot 2^{\ell-1}$ ss multiplications, $d \cdot 2^{\ell-1}$ sl multiplications, and $O(d^2 \cdot 2^\ell)$ field additions.
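The lemma's shape can be checked empirically. Below is a rough instrumentation sketch (toy field, my own tagging scheme) that runs Algorithm 1 and tallies the three multiplication tiers; ss and sl mults appear only in round 1, while ll mults dominate thereafter.

```python
# Tag each value as small (0) or large (1) and tally ss/sl/ll multiplications.
import random
from collections import Counter

P = 2**31 - 1
counts = Counter()

def mul(a, b):
    """Multiply tagged values (v, is_large), recording the multiplication tier."""
    counts[("ss", "sl", "ll")[a[1] + b[1]]] += 1
    return (a[0] * b[0] % P, a[1] or b[1])

def add(a, b):
    return ((a[0] + b[0]) % P, a[1] or b[1])

def sub(a, b):
    return ((a[0] - b[0]) % P, a[1] or b[1])

ell, d = 10, 3
tables = [[(random.randrange(100), 0) for _ in range(2**ell)] for _ in range(d)]

for i in range(ell):
    half = len(tables[0]) // 2
    for t in range(d + 1):                       # evaluate s_i at t = 0, 1, ..., d
        for x in range(half):
            vals = []
            for tab in tables:
                if t == 0:
                    vals.append(tab[x])          # free lookup
                elif t == 1:
                    vals.append(tab[half + x])   # free lookup
                else:                            # extrapolate: lo + t*(hi - lo)
                    vals.append(add(tab[x], mul((t, 0), sub(tab[half + x], tab[x]))))
            prod = vals[0]
            for v in vals[1:]:
                prod = mul(prod, v)
    r = (random.randrange(P), 1)                 # large challenge
    tables = [[add(tab[x], mul(r, sub(tab[half + x], tab[x])))
               for x in range(half)] for tab in tables]

print(dict(counts))   # round 1 contributes the ss/sl mults; rounds >= 2 are all ll
```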
Algorithm 2: Quasilinear Time and Square-Root Space
The main idea is to avoid updating the arrays in each round $i$, and instead compute $s_i$ directly using the following lemma.
Lemma. Let $p$ be a multilinear polynomial in $\ell$ variables. Then for any $i \in [\ell]$, any $r = (r_1, \dots, r_{i-1}) \in \mathbb{F}^{i-1}$, and any $x \in \{0,1\}^{\ell-i+1}$,
$$p(r_1, \dots, r_{i-1}, x) = \sum_{y \in \{0,1\}^{i-1}} \widetilde{\mathrm{eq}}(r, y) \cdot p(y, x),$$
where $\widetilde{\mathrm{eq}}(r, y) = \prod_{j=1}^{i-1} \big(r_j y_j + (1 - r_j)(1 - y_j)\big)$.
With this formula, we can calculate $s_i$ as follows:
$$s_i(X) = \sum_{x \in \{0,1\}^{\ell-i}} \prod_{k=1}^{d} \left( \sum_{y \in \{0,1\}^{i-1}} \widetilde{\mathrm{eq}}(r_{<i}, y) \cdot p_k(y, X, x) \right).$$
After $\ell/2$ rounds, as both the time and space needed for the evaluations $\widetilde{\mathrm{eq}}(r_{<i}, y)$ grow past $2^{\ell/2}$, we switch to Algorithm 1, which at that point incurs a negligible cost of $O(d^2 \cdot 2^{\ell/2})$ multiplications per round. Note that in practice, one could switch as soon as the prover is no longer space-constrained, which may happen well before round $\ell/2$.
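Here is a small sketch (toy field, helper names mine) of the lemma in action: the prover evaluates a partially-bound polynomial directly from the original hypercube table, which is what lets Algorithm 2 skip storing bound arrays.

```python
# Evaluate p(r_1,...,r_{i-1}, x) from the stored hypercube table, no binding.
import random

P = 2**31 - 1

def eq_table(rs):
    """All evaluations of eq(rs, y) for y in {0,1}^len(rs), MSB-first."""
    tab = [1]
    for r in rs:
        tab = [e * v % P for e in tab for v in ((1 - r) % P, r)]
    return tab

def eval_prefix(p_evals, rs, x, suffix_bits):
    """p(rs, x) = sum_y eq(rs, y) * p(y, x) over y in {0,1}^{len(rs)}."""
    E = eq_table(rs)
    stride = 2**suffix_bits
    return sum(E[y] * p_evals[y * stride + x] for y in range(len(E))) % P

# sanity check against naive folding
def fold(evals, r):
    half = len(evals) // 2
    return [(evals[i] + r * (evals[half + i] - evals[i])) % P for i in range(half)]

ell = 6
p = [random.randrange(100) for _ in range(2**ell)]
rs = [random.randrange(P) for _ in range(3)]
bound = p[:]
for r in rs:
    bound = fold(bound, r)
assert all(eval_prefix(p, rs, x, ell - 3) == bound[x] for x in range(2**(ell - 3)))
print("streaming evaluation matches folding")
```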
Cost Analysis:
In round $i \le \ell/2$: the inner sum takes $2^{i-1}$ sl multiplications between the $\widetilde{\mathrm{eq}}(r_{<i}, y)$ terms and the evaluations $p_k(y, X, x)$.
We then spend $d - 1$ ll multiplications for the product over $k = 1, \dots, d$.
The outer summation then gives us $\approx d \cdot 2^{\ell-1}$ sl multiplications for each evaluation, and we need to evaluate over $d + 1$ points $X \in \{0, 1, \dots, d\}$.
Besides these, we have lower-order costs of $2^{i-1}$ ll multiplications to compute the evaluations $\widetilde{\mathrm{eq}}(r_{<i}, y)$, and ss multiplications to compute the evaluations $p_k(y, X, x)$ for $X \in \{2, \dots, d\}$ and store them in an array.
Lemma. When the evaluations of $p_1, \dots, p_d$ are small, Algorithm 2's cost is dominated by $\approx d(d+1) \cdot 2^{\ell-1} \cdot \ell/2$ sl multiplications (spread over the first $\ell/2$ rounds), and $\approx d^2 \cdot 2^{\ell-1}$ ll multiplications. The lower-order costs include $O(d \cdot \ell \cdot 2^{\ell-1})$ ss multiplications and $O(d^2 \cdot \ell \cdot 2^{\ell})$ field additions.
Algorithm 3: Warm-up Attempt
In Algorithm 1, the prover performs exclusively ll multiplications starting from round 2, due to the need to bind the multilinear polynomials after round 1. If we want to reduce the number of ll multiplications, we must delay this binding process. A natural attempt at this leads to Algorithm 3, which is typically faster than Algorithm 1 in the first few rounds, and uses significantly less memory.
If we rearrange Equation (10) (the formula for $s_i$ from Algorithm 2), we get:
$$s_i(t) = \sum_{x \in \{0,1\}^{\ell-i}} \prod_{k=1}^{d} \left( \sum_{y_k \in \{0,1\}^{i-1}} \widetilde{\mathrm{eq}}(r_{<i}, y_k) \cdot p_k(y_k, t, x) \right).$$
Exchanging the product with the inner sums, i.e., distributing the product over the sums and expanding:
$$s_i(t) = \sum_{x \in \{0,1\}^{\ell-i}} \sum_{y_1, \dots, y_d \in \{0,1\}^{i-1}} \prod_{k=1}^{d} \widetilde{\mathrm{eq}}(r_{<i}, y_k) \cdot p_k(y_k, t, x).$$
Switching the inner sums with the outer one, and noting that the $\widetilde{\mathrm{eq}}$ part is invariant over $x$ so we can factor it out:
$$s_i(t) = \sum_{y_1, \dots, y_d \in \{0,1\}^{i-1}} \left( \prod_{k=1}^{d} \widetilde{\mathrm{eq}}(r_{<i}, y_k) \right) \cdot \left( \sum_{x \in \{0,1\}^{\ell-i}} \prod_{k=1}^{d} p_k(y_k, t, x) \right).$$
At this point, we have rewritten the computation of $s_i(t)$ as an inner product between two vectors of length $2^{d(i-1)}$ (indexed by tuples $(y_1, \dots, y_d)$): one dependent only on the challenges $r_1, \dots, r_{i-1}$, and the other dependent only on the multilinear polynomials $p_1, \dots, p_d$.
We can do slightly better by noticing that the challenge-dependent terms $\prod_k \widetilde{\mathrm{eq}}(r_{<i}, y_k)$ only depend on the per-coordinate counts $c_j = \sum_{k} y_{k,j}$ and not on the individual $y_k$:
$$\prod_{k=1}^{d} \widetilde{\mathrm{eq}}(r_{<i}, y_k) = \prod_{j=1}^{i-1} r_j^{c_j} (1 - r_j)^{d - c_j},$$
so there are only $(d+1)^{i-1}$ distinct such terms. By re-indexing based on $c = (c_1, \dots, c_{i-1}) \in \{0, \dots, d\}^{i-1}$, we can further rewrite:
$$s_i(t) = \sum_{c \in \{0, \dots, d\}^{i-1}} \left( \prod_{j=1}^{i-1} r_j^{c_j} (1 - r_j)^{d - c_j} \right) \cdot A_{i,t}[c], \qquad A_{i,t}[c] = \sum_{\substack{(y_1, \dots, y_d)\,:\; \sum_k y_{k,j} = c_j \;\forall j}} \; \sum_{x \in \{0,1\}^{\ell - i}} \prod_{k=1}^{d} p_k(y_k, t, x).$$
Notice that the above sum over count vectors can be thought of as an inner product of two vectors of size $(d+1)^{i-1}$.
Since the accumulators $A_{i,t}[c]$ do not depend on the challenges $r_1, \dots, r_{i-1}$, the prover can precompute them before the protocol begins. Moreover, we do not have to recompute the products $\prod_k p_k(y_k, t, x)$ for each evaluation point $t$ from scratch; instead, we can reuse evaluations across them. For this, we use the multilinear extension of each $p_k$ over the round variable:
$$p_k(y, t, x) = (1 - t) \cdot p_k(y, 0, x) + t \cdot p_k(y, 1, x),$$
and now we only need the evaluations of $p_1, \dots, p_d$ over the boolean hypercube.
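The following sketch (my own indexing and names, $d = 2$, toy field) checks the count-indexed accumulator identity end to end: the accumulators are built from small values only, and the challenges enter only in the final $(d+1)^{i-1}$-term combination.

```python
# Count-indexed accumulators for d = 2: group tuples (y1, y2) by their
# per-coordinate count vector c, then combine with challenge weights.
import random

P = 2**31 - 1
ell, d, i = 6, 2, 3          # compute the round-i polynomial for d = 2
n_pre, n_suf = i - 1, ell - i

p1 = [random.randrange(100) for _ in range(2**ell)]
p2 = [random.randrange(100) for _ in range(2**ell)]

def pval(p, y, t, x):
    """p(y, t, x): prefix bits y, round variable extended to t, suffix bits x."""
    lo = p[(y << (n_suf + 1)) | x]
    hi = p[(y << (n_suf + 1)) | (1 << n_suf) | x]
    return (lo + t * (hi - lo)) % P

def count_index(y1, y2):
    """Base-(d+1) index of the per-coordinate count vector of (y1, y2)."""
    c = 0
    for j in reversed(range(n_pre)):          # digit j <-> bit j of y
        c = c * (d + 1) + ((y1 >> j) & 1) + ((y2 >> j) & 1)
    return c

# accumulators A[t][c], built from small values only (ss multiplications)
A = [[0] * (d + 1) ** n_pre for _ in range(d + 1)]
for t in range(d + 1):
    for y1 in range(2**n_pre):
        for y2 in range(2**n_pre):
            c = count_index(y1, y2)
            for x in range(2**n_suf):
                A[t][c] = (A[t][c] + pval(p1, y1, t, x) * pval(p2, y2, t, x)) % P

# combine with the challenge weights prod_j r_j^{c_j} (1 - r_j)^{d - c_j}
rs = [random.randrange(P) for _ in range(n_pre)]

def weight(c):
    w = 1
    for j in range(n_pre):                    # bit j of y <-> variable rs[n_pre-1-j]
        cj, r = (c // (d + 1) ** j) % (d + 1), rs[n_pre - 1 - j]
        w = w * pow(r, cj, P) % P * pow((1 - r) % P, d - cj, P) % P
    return w

s_i = [sum(weight(c) * A[t][c] for c in range((d + 1) ** n_pre)) % P
       for t in range(d + 1)]

# sanity check against direct folding (Algorithm 1 style)
def fold(ev, r):
    h = len(ev) // 2
    return [(ev[x] + r * (ev[h + x] - ev[x])) % P for x in range(h)]

q1, q2 = p1[:], p2[:]
for r in rs:
    q1, q2 = fold(q1, r), fold(q2, r)
h = len(q1) // 2
direct = [sum((q1[x] + t * (q1[h + x] - q1[x])) * (q2[x] + t * (q2[h + x] - q2[x]))
              for x in range(h)) % P for t in range(d + 1)]
assert s_i == direct
print("count-indexed accumulators reproduce s_i")
```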
Cost Analysis. Suppose the accumulators are maintained for the first $\ell_0$ rounds (a small number in practice). Algorithm 3 then uses $O\big((d+1)^{\ell_0}\big)$ space for rounds $1$ to $\ell_0$ (accumulators and challenge vectors) and $O(2^{\ell - \ell_0})$ space thereafter (cached results of binding the arrays to $r_1, \dots, r_{\ell_0}$). Since $(d+1)^{\ell_0} \ll 2^{\ell - \ell_0}$ in practice, the overall space complexity is dominated by the latter term, representing a space saving factor of roughly $2^{\ell_0}$ compared to Algorithm 1.
Regarding runtime, Algorithm 3 is bottlenecked by the ss multiplications needed to compute the products $\prod_k p_k(y_k, t, x)$, whose count grows with the number of tuples $(y_1, \dots, y_d)$. We also incur roughly $d \cdot 2^{\ell}$ sl mults when using Algorithm 2 for the $(\ell_0 + 1)$-th round, and roughly $(d^2 + d) \cdot 2^{\ell - \ell_0}$ ll mults for the remaining rounds using Algorithm 1.
Algorithm 4: Optimizing with Toom-Cook Multiplication
While Algorithm 3 almost eliminates ll multiplications for the first few rounds and slashes memory usage, we still end up performing a lot of ss multiplications (namely, more by a factor of roughly $2^{(d-1)(\ell_0 - 1)}$ compared to the number of ss mults in Algorithm 1). To reduce this cost, we need another way to rewrite the products $\prod_{k} p_k(y_k, t, x)$.
The idea is to see the product $\prod_{k=1}^{d} p_k(r_1, \dots, r_{i-1}, t, x)$ as the evaluation of the polynomial:
$$h_{t,x}(Y_1, \dots, Y_{i-1}) := \prod_{k=1}^{d} p_k(Y_1, \dots, Y_{i-1}, t, x)$$
at the challenge point $(r_1, \dots, r_{i-1})$.
Since $h_{t,x}$ has individual degree at most $d$, we can apply Lagrange interpolation to each variable over a grid of $d+1$ points (e.g., $U = \{0, 1, \dots, d\}$) to rewrite $h_{t,x}$ as
$$h_{t,x}(Y_1, \dots, Y_{i-1}) = \sum_{b \in U^{i-1}} h_{t,x}(b) \cdot \prod_{j=1}^{i-1} L_{b_j}(Y_j),$$
where $L_{b_j}$ is the Lagrange basis polynomial over $U$ that equals $1$ at $b_j$ and $0$ on the rest of $U$.
As a result, when we plug in the evaluation point $(r_1, \dots, r_{i-1})$, we can rewrite the $i$-th sum-check polynomial as:
$$s_i(t) = \sum_{x \in \{0,1\}^{\ell-i}} h_{t,x}(r_1, \dots, r_{i-1}) = \sum_{b \in U^{i-1}} \left( \prod_{j=1}^{i-1} L_{b_j}(r_j) \right) \cdot \sum_{x \in \{0,1\}^{\ell-i}} h_{t,x}(b);$$
then we have:
$$s_i(t) = \sum_{b \in U^{i-1}} \left( \prod_{j=1}^{i-1} L_{b_j}(r_j) \right) \cdot A_{i,t}[b], \qquad A_{i,t}[b] = \sum_{x \in \{0,1\}^{\ell-i}} \prod_{k=1}^{d} p_k(b, t, x).$$
Just like with Algorithm 3, we can precompute the accumulators $A_{i,t}[b]$ by computing each of the products
$$\prod_{k=1}^{d} p_k(b, t, x)$$
over $b \in U^{i-1}$, $t \in \{0, \dots, d\}$, and $x \in \{0,1\}^{\ell-i}$ once, then inserting the result into the appropriate terms over all rounds. For details, see Section A.3 of the paper.
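Here is a checkable sketch of this rewriting (names mine, $d = 2$, grid $U = \{0, 1, 2\}$, toy field). The accumulators touch only small grid evaluations; the challenges enter only through the $(d+1)^{i-1}$ Lagrange weights.

```python
# Lagrange-grid (Toom-Cook style) accumulators for d = 2, grid U = {0, 1, 2}.
import random
from itertools import product

P = 2**31 - 1
ell, d, i = 6, 2, 3
n_pre, n_suf = i - 1, ell - i
U = tuple(range(d + 1))

p1 = [random.randrange(100) for _ in range(2**ell)]
p2 = [random.randrange(100) for _ in range(2**ell)]

def extend(p, b, t, x):
    """p(b_1, ..., b_{i-1}, t, x): bind prefix vars to small grid points."""
    vals = [p[(y << n_suf) | x] for y in range(2**(n_pre + 1))]
    for bj in b + (t,):
        h = len(vals) // 2
        vals = [(vals[m] + bj * (vals[h + m] - vals[m])) % P for m in range(h)]
    return vals[0]

# small-value accumulators A[(t, b)] = sum_x prod_k p_k(b, t, x)
A = {}
for t in U:
    for b in product(U, repeat=n_pre):
        A[(t, b)] = sum(extend(p1, b, t, x) * extend(p2, b, t, x)
                        for x in range(2**n_suf)) % P

def lagrange(bj, r):
    """L_{bj}(r) over the grid U."""
    num, den = 1, 1
    for m in U:
        if m != bj:
            num, den = num * ((r - m) % P) % P, den * (bj - m)
    return num * pow(den % P, P - 2, P) % P

rs = [random.randrange(P) for _ in range(n_pre)]
s_i = []
for t in U:
    acc = 0
    for b in product(U, repeat=n_pre):
        w = 1
        for rj, bj in zip(rs, b):
            w = w * lagrange(bj, rj) % P
        acc = (acc + w * A[(t, b)]) % P
    s_i.append(acc)

# sanity check against direct folding
def fold(ev, r):
    h = len(ev) // 2
    return [(ev[x] + r * (ev[h + x] - ev[x])) % P for x in range(h)]

q1, q2 = p1[:], p2[:]
for r in rs:
    q1, q2 = fold(q1, r), fold(q2, r)
h = len(q1) // 2
direct = [sum((q1[x] + t * (q1[h + x] - q1[x])) * (q2[x] + t * (q2[h + x] - q2[x]))
              for x in range(h)) % P for t in U]
assert s_i == direct
print("Lagrange-grid accumulators reproduce s_i")
```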
Cost Analysis. Similar to Algorithm 3, Algorithm 4 also saves a factor of roughly $2^{\ell_0}$ in space. Unlike Algorithm 3, however, it computes only $\approx (d+1)^{\ell_0}$-many accumulator products instead of $\approx 2^{d \ell_0}$, since the Lagrange grid $U^{i-1}$ replaces the tuple space $(\{0,1\}^{i-1})^d$. This ss mult count can be roughly 5 times smaller for common parameter settings of $d$ and $\ell_0$. The costs of the streaming round and of the remaining Algorithm 1 rounds are the same as in Algorithm 3.
Optimizations for Sum-Check involving EqPoly
In this section, the optimizations only apply to sum-checks where $g$ is of the form:
$$g(x) = \widetilde{\mathrm{eq}}(w, x) \cdot \prod_{k=1}^{d} p_k(x),$$
with $w \in \mathbb{F}^\ell$ a vector of verifier's challenges, and $p_1, \dots, p_d$ multilinear polynomials (e.g., the zero-check in Spartan).
Gruen's Optimization
The main idea of this optimization, due to Gruen, is to reduce the computation of the degree-$(d+1)$ polynomial $s_i$ sent by the prover in each round $i$ to the computation of a degree-$d$ factor $t_i$ and a linear factor $l_i$. This way, the linear factor is "free" to compute, and the prover saves an evaluation point when computing $t_i$ instead of $s_i$.
This is possible due to the special structure of the $\widetilde{\mathrm{eq}}$ polynomial, which for every $x \in \{0,1\}^{\ell-i}$ can be decomposed as:
$$\widetilde{\mathrm{eq}}(w, (r_1, \dots, r_{i-1}, X, x)) = \widetilde{\mathrm{eq}}(w_{<i}, r_{<i}) \cdot \widetilde{\mathrm{eq}}(w_i, X) \cdot \widetilde{\mathrm{eq}}(w_{>i}, x).$$
If we adapt the original definition of $s_i$ to this case, we get:
$$s_i(X) = \widetilde{\mathrm{eq}}(w_{<i}, r_{<i}) \cdot \widetilde{\mathrm{eq}}(w_i, X) \cdot \sum_{x \in \{0,1\}^{\ell-i}} \widetilde{\mathrm{eq}}(w_{>i}, x) \cdot \prod_{k=1}^{d} p_k(r_{<i}, X, x).$$
Thus, if we set
$$l_i(X) := \widetilde{\mathrm{eq}}(w_{<i}, r_{<i}) \cdot \widetilde{\mathrm{eq}}(w_i, X) \quad \text{and} \quad t_i(X) := \sum_{x \in \{0,1\}^{\ell-i}} \widetilde{\mathrm{eq}}(w_{>i}, x) \cdot \prod_{k=1}^{d} p_k(r_{<i}, X, x),$$
then we have $s_i = l_i \cdot t_i$.
Using this fact, the sum-check prover now instead computes the $d$ evaluations of $t_i$ at $X \in \{0, 2, 3, \dots, d\}$, then relies on the equality $s_i(0) + s_i(1) = c_{i-1}$ to compute:
$$t_i(1) = \frac{c_{i-1} - l_i(0) \cdot t_i(0)}{l_i(1)}.$$
The prover now has $d + 1$ evaluations of $t_i$, and thus can interpolate $t_i$ and compute $s_i = l_i \cdot t_i$. Note that computing $t_i$ can be done generically using either Algorithm 1 or 2; since our focus is on optimizing time rather than space, we assume that Algorithm 1 is used.
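A compact end-to-end sketch of Gruen's factorization (toy field, names mine; $d = 2$, so $t_i$ is evaluated at $\{0, 2\}$ and $t_i(1)$ is recovered from the running claim):

```python
# Sum-check for g = eq(w, .) * p1 * p2 using the factorization s_i = l_i * t_i.
import random

P = 2**31 - 1

def eq_table(ws):
    tab = [1]
    for w in ws:
        tab = [e * v % P for e in tab for v in ((1 - w) % P, w)]
    return tab

def fold(evals, r):
    half = len(evals) // 2
    return [(evals[x] + r * (evals[half + x] - evals[x])) % P for x in range(half)]

def interp3(y0, y1, y2, r):
    """Evaluate the degree-2 polynomial through (0,y0),(1,y1),(2,y2) at r."""
    inv2 = pow(2, P - 2, P)
    l0 = (r - 1) * (r - 2) % P * inv2 % P
    l1 = r * (2 - r) % P
    l2 = r * (r - 1) % P * inv2 % P
    return (y0 * l0 + y1 * l1 + y2 * l2) % P

ell = 5
w = [random.randrange(P) for _ in range(ell)]
p1 = [random.randrange(100) for _ in range(2**ell)]
p2 = [random.randrange(100) for _ in range(2**ell)]

E = eq_table(w)
claim = sum(E[x] * p1[x] % P * p2[x] for x in range(2**ell)) % P

eq_prefix = 1  # eq(w_{<i}, r_{<i}), maintained multiplicatively
for i in range(ell):
    E_suf = eq_table(w[i + 1:])          # eq(w_{>i}, x); memoize in practice
    half = len(p1) // 2
    t = {}
    for X in (0, 2):                      # degree-d factor, evaluated at {0, 2}
        acc = 0
        for x in range(half):
            a = (p1[x] + X * (p1[half + x] - p1[x])) % P
            b = (p2[x] + X * (p2[half + x] - p2[x])) % P
            acc = (acc + E_suf[x] * a % P * b) % P
        t[X] = acc
    l = lambda X: eq_prefix * (((1 - w[i]) * (1 - X) + w[i] * X) % P) % P
    # recover t(1) from s_i(0) + s_i(1) = claim (l(1) = 0 only with negligible prob.)
    t1 = (claim - l(0) * t[0]) % P * pow(l(1), P - 2, P) % P
    r = random.randrange(P)
    claim = l(r) * interp3(t[0], t1, t[2], r) % P   # s_i(r) = l_i(r) * t_i(r)
    eq_prefix = eq_prefix * (((1 - w[i]) * (1 - r) + w[i] * r) % P) % P
    p1, p2 = fold(p1, r), fold(p2, r)

assert claim == eq_prefix * p1[0] % P * p2[0] % P
print("Gruen-factored sum-check verified")
```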
Algorithm 5
We can further reduce the multiplications associated with the $\widetilde{\mathrm{eq}}$ polynomial. Our insight is that we can further split off the $\widetilde{\mathrm{eq}}(w_{>i}, x)$ terms for the first $\ell/2$ rounds, so that:
$$\widetilde{\mathrm{eq}}(w_{>i}, x) = \widetilde{\mathrm{eq}}(w^{(1)}, x_1) \cdot \widetilde{\mathrm{eq}}(w^{(2)}, x_2)$$
for all $x = (x_1, x_2) \in \{0,1\}^{\ell/2 - i} \times \{0,1\}^{\ell/2}$, where $w^{(1)} = (w_{i+1}, \dots, w_{\ell/2})$ and $w^{(2)} = (w_{\ell/2+1}, \dots, w_{\ell})$. The sum-check prover now computes all inner sums over $x_1$, then all outer sums over $x_2$:
$$t_i(X) = \sum_{x_2 \in \{0,1\}^{\ell/2}} \widetilde{\mathrm{eq}}(w^{(2)}, x_2) \cdot \left( \sum_{x_1 \in \{0,1\}^{\ell/2 - i}} \widetilde{\mathrm{eq}}(w^{(1)}, x_1) \cdot \prod_{k=1}^{d} p_k(r_{<i}, X, x_1, x_2) \right).$$
For rounds $i > \ell/2$, we no longer have the inner sum, and we proceed similarly to Algorithm 1. Note that we put the left part $\widetilde{\mathrm{eq}}(w^{(1)}, x_1)$ in the inner sum, since it leads to better memory locality in practice, assuming the evaluations of $p_k$ are streamed in big-endian order (so that each inner sum scans a contiguous chunk of memory).
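A minimal sanity check (toy field, names mine) that the two square-root-sized tables recover every entry of the full $\widetilde{\mathrm{eq}}$ table with one extra multiplication per entry:

```python
# Split-eq trick: two tables of size 2^{l/2} replace one table of size 2^l.
import random

P = 2**31 - 1

def eq_table(ws):
    tab = [1]
    for w in ws:
        tab = [e * v % P for e in tab for v in ((1 - w) % P, w)]
    return tab

ell = 8
w = [random.randrange(P) for _ in range(ell)]
E_full = eq_table(w)                                       # 2^l entries
E1, E2 = eq_table(w[:ell // 2]), eq_table(w[ell // 2:])    # 2 * 2^{l/2} entries

for x in range(2**ell):
    x1, x2 = x >> (ell // 2), x % 2**(ell // 2)
    assert E_full[x] == E1[x1] * E2[x2] % P
print("split eq tables verified")
```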
Cost Analysis. Our optimization avoids computing the $2^{\ell-i}$-sized table of evaluations $\widetilde{\mathrm{eq}}(w_{>i}, x)$ for each round $i$, and instead only computes two square-root-sized tables, which cost $2^{\ell/2 - i}$ mults for $\widetilde{\mathrm{eq}}(w^{(1)}, \cdot)$ and $2^{\ell/2}$ mults for $\widetilde{\mathrm{eq}}(w^{(2)}, \cdot)$. With memoization, these evaluations can be computed for all rounds in $O(2^{\ell/2})$ multiplications total. The (negligible) downside is that we need to compute an extra $2^{\ell/2}$ multiplications when computing the outer sum over $x_2$.
Using the same process for cost analysis as in Lemma 3.2, we have the following cost breakdown for Algorithm 5.
Lemma 5.1. Algorithm 5's dominant cost is $\approx d(d+1) \cdot 2^{\ell-1}$ ll multiplications, reducing from Algorithm 1's $\approx (d+1)(d+2) \cdot 2^{\ell-1}$ ll multiplications in the same setting by roughly a factor of $(d+2)/d$.
Algorithm 6: Combining SVO with Eq-Poly optimization
Our starting point is Equation (17) in the exposition of Algorithm 5. We then apply the multivariate Lagrange decomposition formula in Equation (14) to the evaluations $p_k(r_{<i}, X, x_1, x_2)$, which gives:
$$t_i(X) = \sum_{b \in U^{i-1}} \left( \prod_{j=1}^{i-1} L_{b_j}(r_j) \right) \cdot A_{i,X}[b],$$
where the accumulators are now of the form:
$$A_{i,X}[b] = \sum_{x_1 \in \{0,1\}^{\ell/2 - i}} \widetilde{\mathrm{eq}}(w^{(1)}, x_1) \cdot \sum_{x_2 \in \{0,1\}^{\ell/2}} \widetilde{\mathrm{eq}}(w^{(2)}, x_2) \cdot \prod_{k=1}^{d} p_k(b, X, x_1, x_2).$$
Unlike the small-value-only case, here, because of the presence of the $\widetilde{\mathrm{eq}}$ terms, we need to use sl (instead of ss) multiplications in computing the accumulators, and the additions are over the big field as well. We also note that we have swapped the roles of the vectors $x_1$ and $x_2$ in the accumulator formula: the inner sum now ranges over the fixed-length block $x_2$. This is so that we can compute the same inner sums across the accumulators over all rounds, and we can dispatch these inner sums to different accumulators in different rounds afterwards, with negligible cost. See Section A.4 of the paper for more details.
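To see the combination concretely, here is a simplified sketch (my own simplification: it collapses the inner/outer split into a single eq-weighted suffix sum; toy field, $d = 2$) showing that eq-weighted Lagrange-grid accumulators reproduce the Gruen factor $t_i$.

```python
# Algorithm 4's grid accumulators, with each suffix term weighted by eq(w_{>i}, x).
import random
from itertools import product

P = 2**31 - 1
ell, d, i = 6, 2, 3
n_pre, n_suf = i - 1, ell - i
U = tuple(range(d + 1))

w = [random.randrange(P) for _ in range(ell)]
p1 = [random.randrange(100) for _ in range(2**ell)]
p2 = [random.randrange(100) for _ in range(2**ell)]

def eq_table(ws):
    tab = [1]
    for wj in ws:
        tab = [e * v % P for e in tab for v in ((1 - wj) % P, wj)]
    return tab

def extend(p, b, t, x):
    """p(b, t, x): bind prefix vars to grid point b and round var to t."""
    vals = [p[(y << n_suf) | x] for y in range(2**(n_pre + 1))]
    for bj in b + (t,):
        h = len(vals) // 2
        vals = [(vals[m] + bj * (vals[h + m] - vals[m])) % P for m in range(h)]
    return vals[0]

E_suf = eq_table(w[i:])   # eq(w_{>i}, x) over the suffix cube
A = {}
for t in U:
    for b in product(U, repeat=n_pre):
        acc = 0
        for x in range(2**n_suf):
            small = extend(p1, b, t, x) * extend(p2, b, t, x) % P   # ss mults
            acc = (acc + E_suf[x] * small) % P                      # sl mult, big add
        A[(t, b)] = acc

def lagrange(bj, r):
    num, den = 1, 1
    for m in U:
        if m != bj:
            num, den = num * ((r - m) % P) % P, den * (bj - m)
    return num * pow(den % P, P - 2, P) % P

rs = [random.randrange(P) for _ in range(n_pre)]
t_i = []
for t in U:
    acc = 0
    for b in product(U, repeat=n_pre):
        wgt = 1
        for rj, bj in zip(rs, b):
            wgt = wgt * lagrange(bj, rj) % P
        acc = (acc + wgt * A[(t, b)]) % P
    t_i.append(acc)

# sanity check: t_i(t) = sum_x eq(w_{>i}, x) * p1(rs, t, x) * p2(rs, t, x)
def fold(ev, r):
    h = len(ev) // 2
    return [(ev[x] + r * (ev[h + x] - ev[x])) % P for x in range(h)]

q1, q2 = p1[:], p2[:]
for r in rs:
    q1, q2 = fold(q1, r), fold(q2, r)
h = len(q1) // 2
direct = [sum(E_suf[x] * ((q1[x] + t * (q1[h + x] - q1[x])) % P)
              * ((q2[x] + t * (q2[h + x] - q2[x])) % P) for x in range(h)) % P
          for t in U]
assert t_i == direct
print("eq-weighted accumulators match the direct Gruen factor t_i")
```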
Cost Analysis. Besides the extra subtleties noted above, the cost estimate of Algorithm 6 is fairly straightforward and follows directly from combining the estimates of Algorithm 4 and Algorithm 5. In particular, the small-value precomputation process is now dominated by sl multiplications (one per eq-weighted accumulator term). Next, the streaming round takes roughly $d \cdot 2^{\ell}$ sl mults, and the rest of the linear-time sum-check rounds cost roughly $d(d+1) \cdot 2^{\ell - \ell_0}$ ll mults. We thus summarize as follows.
Theorem 6.1. Algorithm 6's cost is dominated by $\approx d \cdot 2^{\ell}$ sl multiplications and $\approx d(d+1) \cdot 2^{\ell - \ell_0}$ ll multiplications.
Summary
Algorithm 1: the standard linear-time prover; from round 2 onward, nearly all multiplications are ll.
Algorithm 2: quasilinear time and square-root space, via streaming evaluation with $\widetilde{\mathrm{eq}}$ tables instead of binding.
Algorithm 3: delays binding using count-indexed accumulators; keeps the early rounds in ss multiplications, at the cost of many of them.
Algorithm 4: replaces the tuple sum with Lagrange-grid (Toom-Cook style) accumulators, shrinking the ss mult count.
Algorithm 5: Gruen's factorization $s_i = l_i \cdot t_i$, plus split square-root-sized $\widetilde{\mathrm{eq}}$ tables.
Algorithm 6: combines the small-value optimization (Algorithms 3-4) with the eq-poly optimizations (Algorithm 5).
References:
Speeding Up Sum-Check Proving, by Suyash Bagad et al., 2025.
Optimizing the Sumcheck Protocol, by Thomas Coratger, 2025.
Written by Batzorig Zorigoo of Fractalyze