Speeding Up Sumcheck
Presentation: https://www.youtube.com/watch?v=YXANnCp5118
The sum-check protocol is a fundamental tool in the design of modern succinct interactive proofs. It allows us to verify claims of the form

$$H = \sum_{x \in \{0,1\}^{\ell}} g(x)$$

for an ℓ-variate polynomial g of degree at most d in each variable, by reducing them to a single evaluation of g at a random challenge r ∈ F^ℓ, checked with one oracle query.
Recap of the Protocol
Setup: Assume that the verifier has oracle access to the polynomial g, and aims to verify the claim:

$$C_0 = \sum_{x \in \{0,1\}^{\ell}} g(x)$$
For each round i=1,…,ℓ:
P sends the univariate polynomial si(X) claimed to equal:

$$s_i(X) = \sum_{x' \in \{0,1\}^{\ell - i}} g(r_1, \dots, r_{i-1}, X, x')$$
P does so by sending the evaluations {si(u):u∈Ud} to V.
Here, the original protocol evaluates si over some (d+1)-point domain U′d, but we can take Ud = U′d ∖ {1}, since the verifier can recover si(1) from the running claim (see below).
V sends a random challenge ri←F to P.
V derives si(1):=Ci−1−si(0), then sets Ci:=si(ri). After ℓ rounds, we reduce to the claim that g(r1,…,rℓ)=Cℓ. V checks this claim by making a single oracle query to g.
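To make the round structure concrete, here is a minimal end-to-end sketch in Python for g = p1⋅p2 (so d = 2) over a toy prime field. The modulus and helper names are illustrative only; a real prover draws challenges from a large extension field and derives them via Fiat–Shamir, not via Python's `random` module.

```python
# Minimal sum-check sketch for g = p1 * p2 (d = 2) over a toy prime field.
import random

P = 2**31 - 1            # toy prime modulus (not cryptographically sized)
INV2 = pow(2, P - 2, P)  # 1/2 mod P, used for quadratic interpolation

def bind(table, r):
    """Fix the top variable of a multilinear evaluation table to r."""
    half = len(table) // 2
    return [(table[x] + r * (table[half + x] - table[x])) % P for x in range(half)]

def sumcheck(p1, p2, ell):
    C = sum(a * b for a, b in zip(p1, p2)) % P  # initial claim C_0
    for _ in range(ell):
        half = len(p1) // 2
        # Prover: send s_i on U_d = {0, 2}; the point 1 is skipped since the
        # verifier recovers s_i(1) = C_{i-1} - s_i(0).
        s0 = sum(p1[x] * p2[x] for x in range(half)) % P
        s2 = sum((2 * p1[half + x] - p1[x]) * (2 * p2[half + x] - p2[x])
                 for x in range(half)) % P
        # Verifier: derive s_i(1), sample a challenge, set C_i = s_i(r_i) by
        # interpolating the quadratic through (0, s0), (1, s1), (2, s2).
        s1 = (C - s0) % P
        r = random.randrange(P)
        C = (s0 * (r - 1) * (r - 2) % P * INV2
             + s1 * r * (2 - r)
             + s2 * r * (r - 1) % P * INV2) % P
        p1, p2 = bind(p1, r), bind(p2, r)  # prover binds X_i := r_i
    assert C == p1[0] * p2[0] % P  # final check: one query to g at (r_1,...,r_l)

ell = 4
p1 = [random.randrange(100) for _ in range(2**ell)]  # "small" base-field values
p2 = [random.randrange(100) for _ in range(2**ell)]
sumcheck(p1, p2, ell)
```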
Typically, the initial polynomials involve "small" values (e.g., from a base field Fp), but for security, the protocol's challenges are drawn from a much larger set (e.g., an extension field F_{p^k}). Below, I will denote this challenge field as Fc. This creates a significant performance gap between different types of multiplication:
ss (small-small): Base field by base field. Very fast.
sl (small-large): Base field by extension field. Fast.
ll (large-large): Extension field by extension field. Very slow.
Therefore, one key optimization idea is to reduce the number of ll multiplications.
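As a rough model (my own back-of-the-envelope figures, assuming schoolbook extension-field arithmetic): multiplying two elements of F_{p^k} costs about k² base-field multiplications, while multiplying a base-field element into F_{p^k} costs about k of them:

$$\text{ll} \approx k^2 \cdot \text{ss}, \qquad \text{sl} \approx k \cdot \text{ss} \qquad \Longrightarrow \qquad \text{ll} \approx 16\,\text{ss}, \quad \text{sl} \approx 4\,\text{ss} \quad \text{for } k = 4.$$

Karatsuba-style tricks shrink the k² factor, but the ordering ss < sl ≪ ll persists.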
Existing Algorithms
Consider the case where the polynomial g(x) is given as

$$g(x) = \prod_{k=1}^{d} p_k(x)$$
where p1,...,pd are d multilinear polynomials in ℓ variables such that pi(x) has small values for all i∈{1,...,d} and x∈{0,1}ℓ.
Algorithm 1:
The prover maintains d arrays P1,…,Pd which initially store all evaluations of p1,...,pd over {0,1}^ℓ, i.e., we have Pk[x]=pk(x) for all k∈[1,d] and x∈{0,1}^ℓ. The prover halves the size of the arrays after each round i by “binding” the arrays {Pk}k∈[1,d] to the challenge ri received from the verifier.
Setup: arrays P1(0),…,Pd(0) of size 2^ℓ, corresponding to evaluations of each polynomial over the boolean hypercube {0,1}^ℓ
For each round i=1,…,ℓ:
For u∈Ud, compute si(u) using the formula:

$$s_i(u) = \sum_{x' \in \{0,1\}^{\ell-i}} \prod_{k=1}^{d} \Big( (1-u)\cdot P_k^{(i-1)}[(0,x')] + u \cdot P_k^{(i-1)}[(1,x')] \Big)$$
Send evaluations {si(u):u∈Ud} to the verifier.
Receive challenge ri∈Fc from the verifier.
For k=1,...,d and x'∈{0,1}^{ℓ−i}, update the arrays:

$$P_k^{(i)}[x'] := (1-r_i)\cdot P_k^{(i-1)}[(0,x')] + r_i \cdot P_k^{(i-1)}[(1,x')]$$
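A sketch of one Algorithm 1 round for general d (helper names are mine; the variable being bound occupies the top index bit, and P is a toy modulus):

```python
P = 2**31 - 1  # toy prime modulus

def round_evals(tables, eval_points):
    """Evaluations s_i(u) for u in U_d, following the formula above."""
    half = len(tables[0]) // 2
    evals = []
    for u in eval_points:
        acc = 0
        for x in range(half):
            prod = 1
            for T in tables:  # d factors => d - 1 multiplications per term
                prod = prod * (T[x] + u * (T[half + x] - T[x])) % P
            acc += prod
        evals.append(acc % P)
    return evals

def bind_all(tables, r):
    """P_k^(i)[x'] = (1 - r) * P_k^(i-1)[(0,x')] + r * P_k^(i-1)[(1,x')]."""
    half = len(tables[0]) // 2
    return [[(T[x] + r * (T[half + x] - T[x])) % P for x in range(half)]
            for T in tables]
```

After round 1 the table entries are challenge-dependent, so every multiplication inside `round_evals` and `bind_all` becomes ll — exactly the cost profile tallied next.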
Cost Analysis:
In round 1, since the initial evaluations pk(x), k∈[1,d], x∈{0,1}^ℓ, are small, all multiplications are ss. Since we need to compute d evaluations s1(u), u∈Ud, and each evaluation is a sum over 2^{ℓ−1} products of d small values, the total cost is d⋅(d−1)⋅2^{ℓ−1} ss multiplications. We then spend d⋅2^{ℓ−1} sl multiplications updating the arrays.
From round i≥2 on, all array values are large, so we incur d⋅(d−1)⋅2^{ℓ−i} ll multiplications for computing all evaluations, and d⋅2^{ℓ−i} ll multiplications for updating the arrays.
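Summing these per-round counts over i = 2, …, ℓ gives the dominant ll term of the lemma below:

$$\sum_{i=2}^{\ell} \Big( \underbrace{d(d-1)\cdot 2^{\ell-i}}_{\text{evaluations}} + \underbrace{d \cdot 2^{\ell-i}}_{\text{binding}} \Big) = d^2 \sum_{i=2}^{\ell} 2^{\ell-i} = d^2\,(2^{\ell-1} - 1) \approx d^2 \cdot 2^{\ell-1}.$$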
Lemma. When the evaluations of p1,...,pd are small, Algorithm 1’s dominant cost is ≈ d²⋅2^{ℓ−1} ll multiplications, followed by d⋅2^{ℓ−1} sl multiplications, d(d−1)⋅2^{ℓ−1} ss multiplications, and O(n) field additions (with n = 2^ℓ).
Algorithm 2: Quasilinear Time and Square-Root Space
The main idea is to avoid updating the arrays P1,…,Pd in each round i≤ℓ/2, and instead compute ∏_{k=1}^{d} pk(r1,…,ri−1,u,x') directly using the following lemma.
Lemma. Let p : F^ℓ → F be a multilinear polynomial. Then for any 0≤i≤ℓ, any (r1,...,ri) ∈ Fc^i, and any x' ∈ {0,1}^{ℓ−i},

$$p(r_1,\dots,r_i,x') = \sum_{y \in \{0,1\}^{i}} eq(r_{[1,i]}, y) \cdot p(y, x'),$$

where $eq(r, y) := \prod_{j} \big( r_j y_j + (1-r_j)(1-y_j) \big)$.
With this formula, we can calculate si(u) as follows:

$$s_i(u) = \sum_{x' \in \{0,1\}^{\ell-i}} \prod_{k=1}^{d} \Big( \sum_{y' \in \{0,1\}^{i-1}} eq(r_{[1,i-1]}, y') \cdot p_k(y', u, x') \Big)$$
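The table of eq values used in this formula can be built incrementally, one variable at a time; a small sketch (toy modulus, helper name mine):

```python
P = 2**31 - 1  # toy prime modulus

def eq_table(rs):
    """Return [eq(rs, y) for y in {0,1}^i], y's first bit most significant.

    Each new challenge doubles the table at one multiplication per new
    entry, so the full table costs about 2^i multiplications.
    """
    table = [1]
    for r in rs:
        # eq(.., (y, 0)) = (1 - r) * eq(.., y); eq(.., (y, 1)) = r * eq(.., y)
        table = [v for t in table for v in ((1 - r) * t % P, r * t % P)]
    return table
```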
After ℓ/2 rounds, as both the time and space needed for the eq evaluations grow past 2^{ℓ/2}, we switch to Algorithm 1, which incurs a negligible cost of d²⋅2^{ℓ/2} ll multiplications. Note that in practice, one could switch as soon as the prover is no longer space-constrained, which may happen well before round ℓ/2.
For each round i=1,…,ℓ/2:
For u∈Ud, compute si(u) using the formula above.
Send evaluations {si(u):u∈Ud} to the verifier.
Receive challenge ri∈Fc from the verifier.
During round i=ℓ/2:
Store the evaluations of pk(r_{[1,ℓ/2−1]}, b, x') for all k=1,…,d, b∈{0,1} and x'∈{0,1}^{ℓ/2}, obtained while computing round ℓ/2.
For the remaining rounds:
Follow Algorithm 1.
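A sketch of the round-i evaluation without binding, directly implementing the eq-weighted formula above (toy modulus; tables hold the raw evaluations of each pk, indexed big-endian as (y', x_i, x')):

```python
P = 2**31 - 1  # toy prime modulus

def eq_table(rs):
    """[eq(rs, y) for y in {0,1}^len(rs)], first challenge most significant."""
    table = [1]
    for r in rs:
        table = [v for t in table for v in ((1 - r) * t % P, r * t % P)]
    return table

def s_i_eval(tables, rs, u, ell):
    """s_i(u) = sum_{x'} prod_k sum_{y'} eq(rs, y') * p_k(y', u, x')."""
    eqs = eq_table(rs)       # 2^(i-1) entries for the bound prefix
    m = ell - len(rs) - 1    # number of free suffix variables x'
    total = 0
    for xp in range(2**m):
        prod = 1
        for T in tables:
            acc = 0          # multilinear extension of p_k at (rs, u, xp)
            for yp, e in enumerate(eqs):
                t0 = T[(yp << (m + 1)) | xp]             # p_k(y', 0, x')
                t1 = T[(yp << (m + 1)) | (1 << m) | xp]  # p_k(y', 1, x')
                acc += e * (t0 + u * (t1 - t0))          # sl multiplications
            prod = prod * acc % P
        total += prod
    return total % P
```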
Cost Analysis:
The inner sum takes 2^{i−1} sl multiplications between the eq terms and the evaluations pk(y', u, x').
(d−1) ll multiplications for the product ∏_{k=1}^{d} pk(r_{[1,i−1]}, u, x').
The outer summation then gives 2^{ℓ−i}⋅(2^{i−1} sl + (d−1) ll) = 2^{ℓ−1} sl + 2^{ℓ−i}(d−1) ll multiplications for each evaluation si(u), and we need d such evaluations per round.
Besides these, we have lower-order costs of d(d−1)⋅2^{ℓ−1} ss multiplications to compute the evaluations {pk(y',u,x')}k,y',u,x' and 2^{i−1} ll multiplications to compute the evaluations {eq(r_{[1,i−1]}, y')} for y'∈{0,1}^{i−1} and store them in an array.
Lemma. When the evaluations of p1,…,pd are small, Algorithm 2’s cost is dominated by d⋅ℓ0⋅2^{ℓ−1} sl multiplications (spread over the first ℓ0 ≤ ℓ/2 rounds), and d⋅2^ℓ ll multiplications. The lower-order costs include d⋅ℓ0⋅(d−1)⋅2^{ℓ−1} ss multiplications and O(n) field additions.
Algorithm 3: Warm-up Attempt
In Algorithm 1, the prover performs exclusively ll multiplications from round 2 onward, due to the need to bind the multilinear polynomials after round 1. If we want to reduce the number of ll multiplications, we must delay this binding. A natural attempt leads to Algorithm 3, which is typically faster than Algorithm 1 in the first few rounds, and uses significantly less memory.
If we rearrange Equation (10), we can get:
Exchange the product with the inner sum
Distribute ∏
Expand eq
Switch inner ∏ with outer ∏
Switch the order of outer sums
The eq part is invariant over x', so we can factor it out.
At this point, we have rewritten the computation of si(u) as an inner product between two vectors of length 2^{d(i−1)} (indexed by (y'_1, ..., y'_d)): one dependent on the challenges r1,...,ri−1, and the other dependent on the multilinear polynomials p1,...,pd.
We can do slightly better by noticing that the challenge-dependent terms depend only on the coordinate-wise sums v_j := Σ_{k=1}^{d} y'_k[j] for j∈[1,i−1], and not on the individual y'_k — there are only (d+1)^{i−1} distinct such terms. Re-indexing by v∈{0,…,d}^{i−1}, we can further rewrite:

$$s_i(u) = \sum_{v \in \{0,\dots,d\}^{i-1}} \Big( \prod_{j=1}^{i-1} r_j^{v_j}\,(1-r_j)^{d-v_j} \Big) \cdot \sum_{x' \in \{0,1\}^{\ell-i}} \sum_{\substack{y'_1,\dots,y'_d \in \{0,1\}^{i-1} \\ \sum_k y'_k = v}} \prod_{k=1}^{d} p_k(y'_k, u, x')$$

Notice that the above sum over the hypercube v∈{0,…,d}^{i−1} can be thought of as an inner product of two vectors of size (d+1)^{i−1}:

r_j^{v_j}⋅(1−r_j)^{d−v_j}: this factor takes d+1 different values, as v_j ranges from 0 to d. If we store them in a vector, we can denote it as:

$$\big( r_j^{v}\,(1-r_j)^{d-v} \big)_{v=0}^{d}$$

The outer sum and product then take all possible products of one entry per coordinate j∈[1,i−1] and sum them; these product combinations are exactly the entries of a Kronecker product:

$$R_i := \bigotimes_{j=1}^{i-1} \big( r_j^{v}\,(1-r_j)^{d-v} \big)_{v=0}^{d}$$
Let us denote the challenge-independent side of the inner product as:

$$A_i(v, u) := \sum_{x' \in \{0,1\}^{\ell-i}} \sum_{\substack{y'_1,\dots,y'_d \in \{0,1\}^{i-1} \\ \sum_k y'_k = v}} \prod_{k=1}^{d} p_k(y'_k, u, x')$$

Then we have:

$$s_i(u) = \sum_{v \in \{0,\dots,d\}^{i-1}} R_i[v] \cdot A_i(v, u)$$
Since the accumulators Ai(v,u) do not depend on the challenges r1,...,ri−1, the prover can precompute them before the protocol begins. Moreover, the evaluations {pk(y'_k, u, x')} need not be recomputed for each u: they follow from the multilinear extension of pk in the variable u,

$$p_k(y'_k, u, x') = (1-u)\cdot p_k(y'_k, 0, x') + u \cdot p_k(y'_k, 1, x'),$$

so we only need the evaluations of pk over the boolean hypercube.
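A sketch of the challenge-side computation: per-round vectors are combined with Kronecker products into the length-(d+1)^{i−1} vector Ri that gets inner-producted with the accumulators (toy modulus; helper names mine):

```python
P = 2**31 - 1  # toy prime modulus
D = 2          # degree d

def challenge_vector(r):
    """[r^v * (1 - r)^(D - v) for v = 0..D]: the weight for v_j = v."""
    return [pow(r, v, P) * pow((1 - r) % P, D - v, P) % P for v in range(D + 1)]

def kron(a, b):
    """Kronecker product of two coefficient vectors."""
    return [x * y % P for x in a for y in b]

def challenge_tensor(challenges):
    """R_i = kron_j [r_j^v * (1 - r_j)^(D - v)]_v over j = 1..i-1."""
    R = [1]
    for r in challenges:
        R = kron(R, challenge_vector(r))
    return R

# s_i(u) is then sum(R[v] * A[v][u] for v in range((D + 1)**(i - 1))),
# where the accumulators A were precomputed from small values only.
```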
Cost Analysis. Algorithm 3 uses O((d+1)^{ℓ0}) space for rounds 1 to ℓ0 (accumulators and challenge vectors) and d⋅2^{ℓ−ℓ0} space thereafter (cached results). Since (d+1)^{ℓ0} ≪ d⋅2^{ℓ−ℓ0} in practice, the overall space complexity is dominated by the latter term, representing a space saving factor of roughly 2^{ℓ0} compared to Algorithm 1.
Regarding runtime, Algorithm 3 is bottlenecked by the (d−1)⋅2^{dℓ0}⋅2^{ℓ−ℓ0} ss multiplications needed to compute the products {∏_{k=1}^{d} pk(y'_k, x')}. We also incur roughly d⋅2^ℓ sl mults when using Algorithm 2 for the (ℓ0+1)-th round, and roughly d²⋅2^{ℓ−ℓ0−1} ll mults for the remaining rounds using Algorithm 1.
Algorithm 4: Optimizing with Toom-Cook Multiplication
While Algorithm 3 almost eliminates ll multiplications in the first few rounds and slashes memory usage, we still end up performing many ss multiplications (namely, more by a factor of ≈2^{(d−1)⋅ℓ0} than the number of ll mults in Algorithm 1). To reduce this cost, we need another way to rewrite the products ∏_{k=1}^{d} pk(r_{[1,i−1]}, u, x').
The idea is to view the product as the evaluation of the polynomial

$$F(Y_1, \dots, Y_{i-1}) := \prod_{k=1}^{d} p_k(Y_1, \dots, Y_{i-1}, u, x')$$

at the challenge point r_{[1,i−1]}.
Since F has individual degree d, we can apply Lagrange interpolation to each variable to rewrite F(Y1,…,Yi−1) as

$$F(Y_1,\dots,Y_{i-1}) = \sum_{v \in U_d^{\,i-1}} F(v) \cdot \prod_{j=1}^{i-1} L_{U_d, v_j}(Y_j),$$

where Ud here is a fixed interpolation set of size d+1 and L_{Ud,v} denotes the Lagrange basis polynomial over Ud that is 1 at v and 0 elsewhere on Ud.
As a result, when we plug in the evaluation point (r1,…,ri−1), we can rewrite the i-th sum-check polynomial si(X) as an inner product against accumulators

$$A_i(v, u) := \sum_{x' \in \{0,1\}^{\ell-i}} \prod_{k=1}^{d} p_k(v, u, x'), \qquad v \in U_d^{\,i-1},$$

and then we have:

$$s_i(u) = \sum_{v \in U_d^{\,i-1}} \Big( \prod_{j=1}^{i-1} L_{U_d, v_j}(r_j) \Big) \cdot A_i(v, u) = \sum_{v} R_i[v] \cdot A_i(v, u), \quad \text{where } R_i := \bigotimes_{j=1}^{i-1} \big( L_{U_d, k}(r_j) \big)_{k=0}^{d}.$$
Just like with Algorithm 3, we can precompute the accumulators Ai(v,u) by computing each of the ≈(d+1)^{ℓ0} products ∏_{k=1}^{d} pk(v, u, x') over v∈Ud^{i−1}, u∈Ud and x'∈{0,1}^{ℓ−i} once, then adding the result into the appropriate Ai(v,u) terms over all rounds i∈[1,ℓ0]. For details, see Section A.3 of the paper.
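The challenge side of Algorithm 4 mirrors Algorithm 3's, with Lagrange basis evaluations in place of the monomial weights; a sketch assuming the interpolation set Ud = {0, 1, …, d} (the paper may fix a different set):

```python
P = 2**31 - 1           # toy prime modulus
D = 2                   # degree d
U = list(range(D + 1))  # assumed interpolation set U_d = {0, ..., d}

def lagrange_row(r):
    """[L_{U,k}(r) for k in U]: the k-th entry is 1 at r = k, 0 on U \\ {k}."""
    row = []
    for k in U:
        num, den = 1, 1
        for m in U:
            if m != k:
                num = num * (r - m) % P
                den = den * (k - m) % P
        row.append(num * pow(den, P - 2, P) % P)
    return row

def kron(a, b):
    return [x * y % P for x in a for y in b]

def challenge_tensor(challenges):
    """Matches the update R_{i+1} = R_i (x) (L_{U_d,k}(r_i))_k used later."""
    R = [1]
    for r in challenges:
        R = kron(R, lagrange_row(r))
    return R
```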
Cost Analysis. Similar to Algorithm 3, Algorithm 4 saves a factor of 2^{ℓ0} in space. Unlike Algorithm 3, however, it incurs only (d−1)⋅(d+1)^{ℓ0}⋅2^{ℓ−ℓ0} ss multiplications, instead of (d−1)⋅2^{dℓ0}⋅2^{ℓ−ℓ0}. This factor can be roughly 4 times smaller for common parameters (e.g., d=2 and ℓ0=5, where 2^{dℓ0}/(d+1)^{ℓ0} = 1024/243 ≈ 4.2). The costs of sl and ll multiplications are the same as in Algorithm 3.
Optimizations for Sum-Check involving EqPoly
In this section, the optimizations only apply to sum-checks where g is of the form:

$$g(x) = eq(w, x) \cdot \prod_{k=1}^{d} p_k(x)$$

with w∈Fc^ℓ a vector of verifier’s challenges, and p1(X),…,pd(X) multilinear polynomials (e.g., the zero-check in Spartan).
Gruen's Optimization
The main idea of this optimization, due to Gruen, is to reduce the computation of the degree-(d+1) polynomial si(X) sent by the prover in each round i to the computation of a degree-d factor ti(X) and a linear factor li(X). This way, the linear factor is “free” to compute, and the prover saves an evaluation by computing ti(X) instead of si(X).
This is possible due to the special structure of the eq polynomial, which for every i∈[1,ℓ] can be decomposed as:

$$eq(w, x) = eq(w_{[1,i]}, x_{[1,i]}) \cdot eq(w_{[i+1,\ell]}, x_{[i+1,\ell]})$$
If we adapt the original definition of si(u) to this case, we get:

$$s_i(u) = eq(w_{[1,i-1]}, r_{[1,i-1]}) \cdot eq(w_i, u) \cdot \sum_{x' \in \{0,1\}^{\ell-i}} eq(w_{[i+1,\ell]}, x') \cdot \prod_{k=1}^{d} p_k(r_{[1,i-1]}, u, x')$$
Thus, if we set li(X) := eq(w_{[1,i−1]}, r_{[1,i−1]})⋅eq(wi, X) and

$$t_i(X) := \sum_{x' \in \{0,1\}^{\ell-i}} eq(w_{[i+1,\ell]}, x') \cdot \prod_{k=1}^{d} p_k(r_{[1,i-1]}, X, x'),$$

then we have si(X) = li(X)⋅ti(X).
Using this fact, the sum-check prover instead computes the d evaluations ti(u) at u∈Ud, then relies on the equality si(0)+si(1)=C_{i−1} to compute:

$$t_i(1) = \frac{C_{i-1} - l_i(0)\cdot t_i(0)}{l_i(1)}$$
The prover now has d+1 evaluations of ti, and thus can interpolate ti(X) and compute si(X). Note that computing {ti(u):u∈Ud} can be done generically using either Algorithm 1 or 2; since our focus is on optimizing time rather than space, we assume that Algorithm 1 is used.
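A sketch of the bookkeeping for one round of this trick (toy modulus; `eq_prefix` stands for eq(w_{[1,i−1]}, r_{[1,i−1]}), which the prover can maintain incrementally across rounds):

```python
P = 2**31 - 1  # toy prime modulus

def eq1(a, b):
    """eq on a single coordinate: a * b + (1 - a) * (1 - b)."""
    return (a * b + (1 - a) * (1 - b)) % P

def derive_t1(C_prev, t0, eq_prefix, w_i):
    """t_i(1) = (C_{i-1} - l_i(0) * t_i(0)) / l_i(1), since
    C_{i-1} = s_i(0) + s_i(1) and s_i = l_i * t_i.
    Assumes l_i(1) != 0, which holds w.h.p. for random w."""
    l0 = eq_prefix * eq1(w_i, 0) % P  # l_i(0)
    l1 = eq_prefix * eq1(w_i, 1) % P  # l_i(1)
    return (C_prev - l0 * t0) * pow(l1, P - 2, P) % P
```

Together with the d prover-computed evaluations {ti(u)}, this free value of ti(1) suffices to interpolate ti(X) and hence si(X) = li(X)⋅ti(X).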
Algorithm 5
We can further reduce the ≈2^{ℓ−1} ll multiplications associated with the eq polynomial. Our insight is that we can further split off the eq terms for the first i<ℓ/2 rounds, so that:

$$t_i(u) = \sum_{x_R} eq(w_R, x_R) \sum_{x_L} eq(w_L, x_L) \cdot \prod_{k=1}^{d} p_k(r_{[1,i-1]}, u, x_R, x_L)$$

for all u∈Ud, where wR = w_{[i+1,ℓ/2]} and wL = w_{[ℓ/2+1,ℓ]}. The sum-check prover now computes all inner sums over xL, then all outer sums over xR. For rounds i≥ℓ/2, we no longer have the inner sum, and proceed similarly to Algorithm 1. Note that we put eq(wL,xL) in the inner sum, since this leads to better memory locality in practice, assuming the evaluations of pk are streamed in big-endian order (so that the xL are low-order bits compared to xR, which means each inner-sum chunk is contiguous in memory).
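A sketch of the split inner/outer summation, with F[x] standing for the hypercube table of the products ∏k pk(r, u, x) at a fixed u, and xL occupying the low-order index bits (toy modulus; helper names mine):

```python
P = 2**31 - 1  # toy prime modulus

def eq_table(ws):
    """[eq(ws, x) for x in {0,1}^len(ws)], first coordinate most significant."""
    table = [1]
    for w in ws:
        table = [v for t in table for v in ((1 - w) * t % P, w * t % P)]
    return table

def split_eq_sum(F, wR, wL):
    """sum_{xR, xL} eq(wR, xR) * eq(wL, xL) * F[(xR, xL)].

    Only two square-root-sized eq tables are materialized, instead of one
    table of size 2^(l - i). Inner sums over xL touch contiguous chunks."""
    eR, eL = eq_table(wR), eq_table(wL)
    nL = len(eL)
    total = 0
    for xR, er in enumerate(eR):
        inner = sum(el * F[xR * nL + xL] for xL, el in enumerate(eL)) % P
        total += er * inner
    return total % P
```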
Initialize: length-2^ℓ arrays P1(0),…,Pd(0) such that Pk(0)[x] = pk(x) for all k∈[1,d] and x∈{0,1}^ℓ.
Pre-computation: Use Procedure 3 from the paper to compute the evaluations eq(wL, xL) for all xL∈{0,1}^{ℓ/2}, and eq(w_{[i+1,ℓ/2]}, xR) for all i∈[1,ℓ/2] and xR∈{0,1}^{ℓ/2−i}.
For each round i=1,…,ℓ:
For u∈Ud, if i<ℓ/2, compute ti(u) using the split-eq formula above.
If i≥ℓ/2, compute ti(u) directly from the definition of ti(X) above, as in Algorithm 1.
Use Procedure 8 from the paper to compute si(X) from ({ti(u)}, w, r_{[1,i−1]}).
Send evaluations of si over Ud
Receive challenge ri∈Fc from the verifier.
For k∈[1,d] and x'∈{0,1}^{ℓ−i}, update the arrays as in Algorithm 1:

$$P_k^{(i)}[x'] := (1-r_i)\cdot P_k^{(i-1)}[(0,x')] + r_i \cdot P_k^{(i-1)}[(1,x')]$$
Cost Analysis. Our optimization avoids computing the 2^{ℓ−i}-sized table of evaluations {eq(w_{[i+1,ℓ]}, x') : x'∈{0,1}^{ℓ−i}} for each round i, and instead only computes two square-root-sized tables, which cost 2^{ℓ/2} ll mults for eq(wL,xL) and 2^{ℓ/2−i} ll mults for eq(wR,xR). With memoization, these evaluations can be computed for all rounds i in 2^{ℓ/2} ll multiplications total. The (negligible) downside is that we need an extra 2^{ℓ/2} ll multiplications when computing the outer sum over xR.
Using the same process for cost analysis as in Lemma 3.2, we have the following cost breakdown for Algorithm 5.
Lemma 5.1. Algorithm 5’s dominant cost is d(d+1)/2⋅N ll multiplications (where N = 2^ℓ), reducing Algorithm 1’s cost in the same setting by roughly N ll multiplications.
Algorithm 6: Combining SVO (Small-Value Optimization) with the Eq-Poly Optimization
Our starting point is Equation (17) in the exposition of Algorithm 5. We then apply the multivariate Lagrange decomposition formula from Equation (14) to the products ∏_{k=1}^{d} pk(r, u, xL, xR), which gives:
where the accumulators Ai(v,u) are now of the form:
Unlike the small-value-only case, here, because of the presence of the eq(w,⋅) terms, we need sl (instead of ss) multiplications to compute the accumulators, and the additions are over the big field as well. We also note that we have swapped the lengths of the vectors xR and xL in the accumulator formula. This is so that we can compute the same inner sums across the accumulators of all rounds, and afterwards dispatch these inner sums to different accumulators in different rounds, with negligible cost. See Section A.4 of the paper for more details.
Pre-computation: Use Procedure 9 from the paper to compute the accumulators Ai(v,u) for all i∈[1,ℓ0], v∈Ud^{i−1}, and u∈Ud.
Initialize: R1=(1).
For each round i=1,…,ℓ0:
For u∈Ud, compute ti(u) as the inner product with the precomputed accumulators: ti(u) = Σ_{v} Ri[v]⋅Ai(v,u).
Use Procedure 8 from the paper to compute si(X) from ({ti(u)}, w, r_{[1,i−1]}).
Send evaluations of si over Ud
Receive challenge ri∈Fc from the verifier.
Compute R_{i+1} = R_i ⊗ (L_{Ud,k}(r_i))_{k=0}^{d}.
For round ℓ0+1: Follow Algorithm 2, storing the evaluations pk(r_{[1,ℓ0]}, b, x') for all k∈[1,d], b∈{0,1}, and x'∈{0,1}^{ℓ−ℓ0−1}.
Receive the challenge r_{ℓ0+1}∈Fc from the verifier, and perform updates to get pk(r_{[1,ℓ0+1]}, x') for all k and x', similar to Algorithm 1.
For rounds ℓ0+2,…,ℓ: Follow Algorithm 5.
Cost Analysis. Besides the extra subtleties noted above, the cost estimate of Algorithm 6 is fairly straightforward and follows directly from combining the estimates of Algorithms 4 and 5. In particular, the small-value precomputation is now dominated by (d+1)^{ℓ0}⋅2^{ℓ−ℓ0} sl multiplications. Next, the streaming round takes roughly d⋅2^ℓ sl, and the rest of the linear-time sum-check rounds cost roughly d²⋅2^{ℓ−ℓ0−1} ll. We summarize as follows.
Theorem 6.1. Algorithm 6’s cost is dominated by (((d+1)/2)^{ℓ0} + d)⋅N sl and (d²/2^{ℓ0+1})⋅N ll multiplications.
Summary
| Algorithm | Dominant multiplication cost | Space |
| --- | --- | --- |
| Algorithm 1 | d⋅2^{ℓ−1} sl + d⋅2^ℓ ll | d⋅2^ℓ |
| Algorithm 2 | d⋅ℓ0⋅2^{ℓ−1} sl + d⋅2^ℓ ll | d⋅2^{ℓ−ℓ0} |
| Algorithm 3 | (d−1)⋅2^{dℓ0}⋅2^{ℓ−ℓ0} ss + d⋅2^ℓ sl + d²⋅2^{ℓ−ℓ0−1} ll | d⋅2^{ℓ−ℓ0} |
| Algorithm 4 | (d−1)⋅(d+1)^{ℓ0}⋅2^{ℓ−ℓ0} ss + d⋅2^ℓ sl + d²⋅2^{ℓ−ℓ0−1} ll | d⋅2^{ℓ−ℓ0} |
| Algorithm 5 | d(d+1)/2⋅N ll | d⋅2^ℓ |
| Algorithm 6 | (((d+1)/2)^{ℓ0} + d)⋅N sl + (d²/2^{ℓ0+1})⋅N ll | d⋅2^{ℓ−ℓ0} |

Here N = 2^ℓ, and ℓ0 denotes the number of rounds handled before switching to the linear-time algorithm.
References:
Suyash Bagad et al. Speeding Up Sum-Check Proving. 2025.
Thomas Coratger. Optimizing the Sumcheck Protocol. 2025.
Written by Batzorig Zorigoo of Fractalyze