Bernstein-Yang's Inverse
Introduction
GCD-based algorithms are widely used as an efficient method for computing modular inverses. However, most of them involve input-dependent branching, which makes them vulnerable to side-channel attacks.
For example, an implementation of Stein’s binary GCD algorithm in OpenSSL was shown to be susceptible to an RSA key extraction attack. This kind of attack demonstrates that the execution path of conditional branches can leak secret information.
A simple countermeasure is to use Fermat’s Little Theorem, which computes the inverse as a−1≡ap−2modp. This method can be implemented in constant time, but it is generally less efficient.
This raises the question: Is there a way to compute the modular inverse that is both constant-time secure and faster than the Fermat-based approach?
Background
Limitations of Existing Constant-Time GCD Algorithms
The Binary GCD algorithm (also known as Stein’s Algorithm) involves branching based on the parity of the input integers or comparisons between them, which causes the execution path to vary. For example:
If a is even → right shift
If b is even → right shift
If a>b → a:=a−b
Branching based on whether a or b is even is relatively safe since it only inspects the least significant bit. However, branching on comparisons like a>b often requires examining multiple bits (or limbs), which makes it difficult to ensure constant-time execution.
Protocol Explanation
The original paper also covers the case of computing inverses of polynomials, but for simplicity, we will restrict ourselves to the case of integers. The explanation closely follows the implementation described in https://github.com/bitcoin-core/secp256k1/blob/master/doc/safegcd_implementation.md, while adhering to the original paper's formulation of the divstep update as much as possible.
GCD with divstep
divstepWhen a is an odd integer and b is any integer, the divstep function is defined as follows:
This operation is designed to execute in constant-time, relying only on fixed-bit values like δ and the least significant bit of b to determine behavior, ensuring that only a single bit needs to be checked.
Stages
The divstep operation can be viewed as a composition of two functions:
Conditional Swap
Condition: δ>0 and b is odd
Operation: (δ,a,b)→(−δ,b,a)
TODO(chokobole): explain the reason why δ helps to reduce b to 0.
Elimination
Always performed
Purpose: to reduce the size of b
Operation: (δ,a,b)→(1+δ,a,2(amod2)⋅b−(bmod2)⋅a)
Since a is always odd, amod2=1 is guaranteed, allowing simplification:
(δ,a,b)→(1+δ,a,2b−(bmod2)⋅a)This division by 2 is always valid because:
If b is odd, then b−a is even (odd - odd = even)
If b is even, the result is trivially divisible by 2
GCD Preservation
This logic is rooted in the classical Euclidean algorithm:
Thus, divstep preserves the GCD while continually reducing the size of b. Eventually, b reaches 0, and the GCD remains in a.
Simplified divstep
divstepThe behavior of divstep can be more concretely classified as:
As previously discussed, this operation ultimately returns the GCD.
Code
Modular Inverse with divstep
divstepThe idea of using divstep to compute the modular inverse becomes most clear when we assume that the modulus p is an odd prime. In this section, we explain the procedure for computing the inverse of an integer a∈Fp, i.e., a−1modp.
Initial Setup
We initialize the variables as u=p, v=a, x1=0, and x2=1, and maintain the following invariants throughout:
At each iteration, we apply a divstep transformation to the pair (u,v). Since p is a prime and a∈Fp×, we know that gcd(p,a)=1. This guarantees that repeated applications of divstep will eventually reduce v to 0, and u to either 1 or −1.
Formally, after a sufficient number of iterations, we reach:
At this point, because the algorithm maintains the invariant ax1≡umodp, we conclude:
Since u=±1, multiplying x1 by u gives the correct modular inverse of a.
The divstep function divstep(δ,u,v) branches into three cases depending on δ and the parity of v. We analyze how the invariants are preserved in each case.
Case 1: δ>0 and v is odd
We use the rule divstep(δ,a,b)=(1−δ,b,2a−b) for Case 1. Accordingly, the values of u and v are updated as:
Since u is guaranteed to be odd, the subtraction u−v is always even, making the division by 2 valid.
We apply the same update rule to x1 and x2 as follows:
Case 2: δ≤0 and v is odd
We use the rule divstep(δ,a,b)=(1+δ,a,2b−a) for Case 2 Accordingly, the values of u and v are updated as:
Since u is guaranteed to be odd, the subtraction u−v is always even, making the division by 2 valid.
We apply the same update rule to x1 and x2 as follows:
Case 3: v is even
We use the rule divstep(δ,a,b)=(1+δ,a,2b) for Case 3 Accordingly, the values of u and v are updated as:
We apply the same update rule to x1 and x2 as follows:
Code
Batching Multiple Divsteps
Motivation
The divstep operation involves division by 2 for the variables u,v,x1,x2. Instead of performing division by 2 on x1 and x2 at every iteration, we can batch N divsteps together and apply them to x1 and x2 all at once. This reduces N individual divisions into a single division.
Transition Matrix
A single divstep can be represented using a matrix multiplication. The one-step update is expressed as:
Here, α is 0 or p depending on x1 and x2 that make sure the expression inside the parentheses becomes divisible by 2.
Note that u and v are standard integers, while x1 and x2 are integers modulo p
The transition matrix T used to update u,v is the same one applied to x1,x2. The key idea is to compute the transition matrix from u and v over N steps, and then apply it to x1,x2 using matrix-vector multiplication as follows:
Here, m1p and m2p are correction terms chosen so that the expression inside the parentheses becomes divisible by 2N. This will be explained in here.
The transition matrix T(δ,u,v) is defined as follows:
Upper Bound on Iteration Count: Theorem 11.2
Let b=max(bit_length(u),bit_length(v)), which can usually be considered the bit length of the modulus. Then, the number of N-step matrix applications is bounded by:
Example: For the BN254 curve where the prime has 254 bits:
So if we set N to 62, it has been proven that a modular inverse can be computed with at most 12 iterations of the loop.
Code
transition_matrix: Computes the matrix afterNdivsteps, this also computes u(N),v(N)
div2n&update_x1x2: Applies the matrix to [x1,x2] to compute x1(N),x2(N)
Why div2n makes x divisible by 2N:
div2n makes x divisible by 2N:We can express an integer x as:
where r is the remainder mod 2N. To make x divisible by 2N, we need to eliminate the r part.
To do this, we first compute:
Then, we subtract m⋅p from x:
which is now cleanly divisible by 2N.
Additionally, since m⋅p≡rmod2N, the subtraction step preserves the modular equivalence:
modinv: Combines all components to compute the modular inverse
Avoiding Modulus Operations
The original modinv function already involved a modulus operation, but with the introduction of batched divstep, expensive modular operations now occur at every loop iteration. In this section, we aim to eliminate those costly modulus computations.
Removing Modulus Operation in div2n
div2nOriginal Purpose:
The div2n function is used to reduce results into the range [0,p).
Optimization Insight:
Instead of restricting outputs to modulo p, we can eliminate the mod operation if we ensure that all intermediate values remain within a wider range, such as (−2p,p).
✅ Version 1: Allowing Range Expansion
🔍 Range Analysis of Version 1
Each transition matrix T acts on the vector [x1,x2]T in one of the following ways:
After N iterations, we get:
Since x1,x2∈(−p,p) initially:
In update_x1x2_optimized_ver1, we subtract a multiple of p before division:
With 0≤mx1,mx2<2N, we obtain:
After division by 2N:
Hence, the result stays within (−2p,p).
⚠️ Drawback
After each divstep, the range expands:
This increasing range leads to:
Unnecessary arithmetic overhead
Larger intermediate values
Potential memory inefficiencies
✅ Version 2: Pre-Clamping Strategy
To prevent range blow-up, we can add p to negative values to bring the range back to (−p,p):
Avoiding Modulus Operation in modinv
modinvAfter the final iteration, the results x1(N),x2(N) lie in (−2p,p). We want to map x1 into the range [0,p) using the following normalization:
if v < 0: v += p
−p<v<p
0≤v<p
−p<v<p
0≤v<p
if sign == -1: v = -v
0<v<p
−p<v≤0
−p<v<p
0≤v<p
if v < 0: v += p
0<v<p
0≤v<p
0<v<p
0≤v<p
Final Implementation
Conclusion
Traditional GCD-based modular inverse algorithms are fast but suffer from a major drawback: they are vulnerable to side-channel attacks due to conditional branches and comparison operations. This vulnerability can be critical in environments where sensitive key material is involved, such as RSA or ECC.
This article introduced a robust alternative: the Bernstein–Yang divstep-based modular inverse algorithm. This approach offers several compelling advantages:
✅ Constant-time execution: Designed to ensure execution time does not vary based on input values
✅ High performance: Faster than Fermat’s method and comparable to (or better than) classical GCD-based methods
✅ Optimized computation: Improves efficiency through batch processing via matrix multiplication and by avoiding modulus operations ↪ For additional optimizations, refer to Bitcoin Core’s safegcd implementation guide
Ultimately, the divstep-based algorithm strikes an excellent balance between performance and security. It is especially well-suited for environments that demand resistance to timing attacks, such as cryptographic protocol implementations or hardware acceleration contexts.
References
Written by Ryan Kim of Fractalyze
Last updated