# CycloneMSM

## Introduction

**CycloneMSM** is an FPGA-accelerated architecture for Multi-Scalar Multiplication (MSM) over the BLS12-377 elliptic curve, developed by [Jump Trading](https://www.jumptrading.com/) and [Jump Crypto](https://jumpcrypto.com/). Optimized for the case of a **fixed base point**, this implementation achieves **high-performance execution**, completing 4 million-point MSMs in under one second and 64 million-point MSMs in approximately six seconds, delivering over a **6× speedup** compared to state-of-the-art software implementations. At its core, CycloneMSM leverages the **bucket method of the Pippenger algorithm** and applies aggressive **pipeline optimization and scheduling techniques** to perform curve operations without bottlenecks.

## Background

### FPGA

To dramatically accelerate MSM computation, **CycloneMSM leverages Field Programmable Gate Arrays (FPGAs)**. Unlike CPUs or GPUs, FPGAs offer a fundamentally different computation model, making them particularly well-suited for exploiting the **parallelism and pipeline structure** inherent in MSM.

#### Comparison: CPU vs GPU vs FPGA

<table><thead><tr><th width="112.8134765625">Architecture</th><th width="442.5908203125">Characteristics</th><th>Suitability for MSM</th></tr></thead><tbody><tr><td><strong>CPU</strong></td><td>General-purpose processor; optimized for sequential tasks</td><td>❌ Too slow</td></tr><tr><td><strong>GPU</strong></td><td>Designed for parallel computation with many execution units</td><td>⭕ Reasonable performance</td></tr><tr><td><strong>FPGA</strong></td><td>Custom logic design and deep pipelining capabilities</td><td>✅ Ideal choice</td></tr></tbody></table>

* **CPUs** excel at complex control flow and branching, but are inefficient for highly repetitive operations like point additions and multiplications.
* **GPUs** offer strong parallelism but struggle with **fine-grained scheduling and memory access control**.
* **FPGAs**, on the other hand, allow direct **circuit-level customization**, enabling **deeply pipelined designs** that sustain a continuous flow of operations with minimal control overhead.

### Coordinate Systems

MSM involves millions of point additions, so the **efficiency of the addition operation** plays a crucial role in overall performance. In particular, [**affine coordinates**](https://fractalyze.gitbook.io/intro/~/revisions/6ILOvDSvW4zAg8fUzupj/primitives/abstract-algebra/weierstrass-curve/coordinate-forms#affine) require an inverse operation per addition, which can significantly increase the total computation cost. (Reference: [Explicit Formulas Database](https://www.hyperelliptic.org/EFD))

<figure><img src="https://755218234-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Frwz1ZAZJtK5FHz4Y1esA%2Fuploads%2Fmo5fQDjy3cYK5d7IllmF%2Fimage.png?alt=media&#x26;token=c0569dfb-64a9-4212-9cf0-9bb0c9dd4110" alt=""><figcaption></figcaption></figure>

While [**Extended Jacobian coordinates**](https://fractalyze.gitbook.io/intro/~/revisions/6ILOvDSvW4zAg8fUzupj/primitives/abstract-algebra/weierstrass-curve/coordinate-forms#xyzz-extended-jacobian) for [**Weierstrass Curves**](https://fractalyze.gitbook.io/intro/~/revisions/6ILOvDSvW4zAg8fUzupj/primitives/abstract-algebra/elliptic-curve/weierstrass-curve) enable inverse-free operations, the formulas used depend on the context (e.g., addition vs doubling). This results in **variable computational cost**, making it difficult to pipeline efficiently in hardware.

In contrast, [**Extended Projective coordinates**](https://fractalyze.gitbook.io/intro/~/revisions/6ILOvDSvW4zAg8fUzupj/primitives/abstract-algebra/edwards-curve/coordinate-forms#extended-projective) for the [**Twisted Edwards**](https://fractalyze.gitbook.io/intro/~/revisions/6ILOvDSvW4zAg8fUzupj/primitives/abstract-algebra/elliptic-curve/edwards-curve) form of elliptic curves support both addition and doubling with a **single unified formula**, resulting in **constant computational cost**. This property makes them highly suitable for **hardware pipeline design**.

Accordingly, CycloneMSM adopts the following coordinate system strategy:

* **Software implementation**: Uses **Extended Jacobian coordinates** for inverse-free operations
* **FPGA hardware implementation**: Uses **Extended Projective coordinates** for pipelining optimization

This coordinate system selection is a key architectural decision aimed at maximizing computational efficiency in each environment.
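The unified formula that makes Twisted Edwards coordinates pipeline-friendly fits in a few lines. The sketch below uses Ed25519 parameters purely for illustration (CycloneMSM operates on the Twisted Edwards form of BLS12-377, not this curve); the formula is the extended-projective unified addition of Hisil–Wong–Carter–Dawson, which handles addition and doubling identically.

```python
# Unified twisted Edwards addition in extended projective coordinates
# (X : Y : Z : T) with T = XY/Z.  One formula covers both Add and Double,
# which is what gives the constant, pipeline-friendly operation cost.
# Curve parameters below are Ed25519's, used only as a concrete example.

Q = 2**255 - 19                            # field modulus (illustrative)
COEFF_A = Q - 1                            # curve coefficient a = -1
COEFF_D = (-121665 * pow(121666, -1, Q)) % Q  # curve coefficient d

def unified_add(p1, p2):
    """HWCD08 unified addition: also valid when p1 == p2 (doubling)."""
    x1, y1, z1, t1 = p1
    x2, y2, z2, t2 = p2
    a = x1 * x2 % Q
    b = y1 * y2 % Q
    c = COEFF_D * t1 * t2 % Q
    d = z1 * z2 % Q
    e = ((x1 + y1) * (x2 + y2) - a - b) % Q
    f = (d - c) % Q
    g = (d + c) % Q
    h = (b - COEFF_A * a) % Q
    return (e * f % Q, g * h % Q, f * g % Q, e * h % Q)  # X3, Y3, Z3, T3

def from_affine(x, y):
    return (x, y, 1, x * y % Q)

def to_affine(pt):
    x, y, z, _ = pt
    zinv = pow(z, -1, Q)
    return (x * zinv % Q, y * zinv % Q)
```

Because the same code path serves every input pair, a hardware pipeline built from it never stalls to select a doubling formula.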

### Pippenger Algorithm

The [**Pippenger algorithm**](https://fractalyze.gitbook.io/intro/~/revisions/6ILOvDSvW4zAg8fUzupj/primitives/abstract-algebra/elliptic-curve/msm/pippengers-algorithm) is a classic technique for efficiently performing MSM (Multi-Scalar Multiplication). Given $$N$$ pairs of (point, scalar) and a window size $$c$$, each window $$W$$ requires the following operations:

* **Bucket Accumulation**: $$N \times \mathsf{MixedAdd}$$
* **Bucket Reduction**: $$2^c \times \mathsf{Add}$$
* **Window Reduction**: $$c \times \mathsf{Double} + 1 \times \mathsf{Add}$$

Among these, **Bucket Accumulation accounts for the majority of the total computation**. Therefore, optimizing this stage is critical for improving overall MSM performance.
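A minimal sketch may help fix the three phases above. The group is modeled as ordinary integer addition so the example stays self-contained; in practice every `+=` below is a curve addition on the FPGA.

```python
# Structure of Pippenger's bucket method over an illustrative group
# (integers under addition).  Phases map directly to the list above:
# bucket accumulation, bucket reduction, and window reduction.

def pippenger_msm(scalars, points, c=4):
    """Compute sum(k_i * P_i) with window size c."""
    if not scalars:
        return 0
    num_windows = (max(scalars).bit_length() + c - 1) // c
    total = 0
    for j in reversed(range(num_windows)):         # most significant window first
        for _ in range(c):                         # window reduction: c x Double
            total += total
        buckets = [0] * (1 << c)                   # bucket accumulation: N x MixedAdd
        for k, p in zip(scalars, points):
            idx = (k >> (j * c)) & ((1 << c) - 1)  # reduced scalar for window j
            if idx:
                buckets[idx] += p
        running = window_sum = 0                   # bucket reduction: running sums
        for b in reversed(buckets[1:]):
            running += b                           # running = B_K + ... + B_idx
            window_sum += running                  # accumulates sum(idx * B_idx)
        total += window_sum
    return total
```

The running-sum trick in the bucket reduction is what lets each bucket be weighted by its index using only additions, with no scalar multiplications.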

## Protocol Explanation

### Scheduler: Optimizing Operation Order to Avoid Pipeline Conflicts

CycloneMSM is designed around an **FPGA-based architecture**, enabling **fully pipelined execution** of computations. In particular, the **Bucket Accumulation phase**, which dominates MSM computation, is highly accelerated by this pipelining structure.

#### How the Pipeline Works

The **Curve Adder**, a core computational unit of CycloneMSM, is a pipelined structure designed to efficiently perform point addition on Twisted Edwards curves. This module supports both $$\mathsf{MixedAdd}$$ and $$\mathsf{Add}$$ operations and is optimized for execution on FPGA hardware.

The Curve Adder takes $$T$$ cycles to complete a single addition. In CycloneMSM, $$T = 96$$. A new operation is fed into the pipeline every cycle, allowing multiple point additions across different buckets to run in parallel, as illustrated below:

<table data-header-hidden data-full-width="true"><thead><tr><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr></thead><tbody><tr><td></td><td><span class="math">t</span></td><td><span class="math">t + 1</span></td><td><span class="math">t + 2</span></td><td><span class="math">\dots</span></td><td><span class="math">t + {T-1}</span></td><td><span class="math">t + T</span></td></tr><tr><td><span class="math">B_{k^{(t)}}</span></td><td><span class="math">P^{(t)}</span></td><td></td><td></td><td></td><td></td><td><span class="math">P^{(t + T)}</span></td></tr><tr><td><span class="math">B_{k^{(t + 1)}}</span></td><td></td><td><span class="math">P^{(t + 1)}</span></td><td></td><td></td><td></td><td></td></tr><tr><td><span class="math">B_{k^{(t + 2)}}</span></td><td></td><td></td><td><span class="math">P^{(t + 2)}</span></td><td></td><td></td><td></td></tr><tr><td><span class="math">\vdots</span></td><td></td><td></td><td></td><td><span class="math">\ddots</span></td><td></td><td></td></tr><tr><td><span class="math">B_{k^{(t + T - 1)}}</span></td><td></td><td></td><td></td><td></td><td><span class="math">P^{(t + T - 1)}</span></td><td></td></tr></tbody></table>

The addition at each cycle is defined as:

$$
B_{k^{(t)}} \leftarrow B_{k^{(t)}} + P^{(t)}
$$

**Definition of Terms**

* $$t$$: The pipeline time step (i.e., cycle index). At each $$t$$, a new operation enters the pipeline.
* $$P^{(t)}$$: The point being added at time $$t$$.
* $$k^{(t)}$$: The reduced scalar index corresponding to point $$P^{(t)}$$. It determines the bucket.
* $$B_{k^{(t)}}$$: The bucket that accumulates all points whose reduced scalar index is $$k^{(t)}$$.

However, for this pipeline to operate correctly, **no two additions in the pipeline should access the same bucket** $$B_k$$ **within a** $$T$$**-cycle window**. This is enforced by the following constraint:

$$
k^{(t_1)} \ne k^{(t_2)} \quad \text{ if } |t_1 - t_2| < T
$$

#### Design Goals of the Scheduler

Based on this, the Scheduler must satisfy two key objectives:

1. **Correctness**: Prevent pipeline conflicts by adhering to the constraint above.
2. **Efficiency**: Avoid excessive reordering of points; most operations should proceed in near-original order.

#### Scheduler Implementation Strategies

CycloneMSM explores two strategies to meet these goals:

1. **Delayed Scheduler**

* **Idea**: When a conflict is detected, defer processing of that point by placing it in a **FIFO queue** and retry it later.
* **Estimated conflict count**: Approximately $$\frac{NT}{2^{c-1}}$$.
* Example: With $$N = 2^{26}$$, $$c = 16$$, $$T = 100$$, about **204,800** conflicts arise (\~0.3% of points).
* **In practice**: Most points are processed successfully within 2–3 passes.
* **Advantages**: Simple architecture and easy to implement in hardware
* **Drawbacks**:
  * Requires maintaining a queue of size proportional to the number of conflicts, which can be memory-intensive
  * In worst-case scenarios (e.g., non-uniform scalars), frequent collisions could degrade performance

> CycloneMSM adopts this approach in its actual FPGA implementation.

2. **Greedy Scheduler**

* **Idea**: Attempt to process delayed points at the earliest valid time (e.g., $$t + T + 1$$), even if conflicts occurred earlier.
* **Advantages**: Queue is drained quickly, leading to **smaller queue size**
* **Drawbacks**:
  * Requires tracking last access times for all buckets, which complicates state management
  * Increased memory updates and lookups each cycle place **heavy demands on hardware resources**

> Due to memory access overhead, this strategy was **not adopted** in CycloneMSM’s hardware environment.
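The Delayed Scheduler can be illustrated with a short software model. The function below is a simplification, not CycloneMSM's hardware: it assumes one pipeline slot per cycle (a deferred point leaves a bubble) and retries the FIFO in whole passes, but it enforces exactly the constraint $$|t_1 - t_2| \ge T$$ for same-bucket additions.

```python
from collections import deque

# Software model of the delayed (FIFO) scheduler: a point whose bucket was
# touched within the last T cycles is deferred to a queue and retried in a
# later pass.  T = 96 in CycloneMSM; bucket indices here are illustrative.

def delayed_schedule(bucket_indices, T=96):
    """Return (cycle, bucket) pairs such that no bucket repeats within T cycles."""
    pending = deque(bucket_indices)
    issued = []                        # schedule of issued additions
    last_issue = {}                    # bucket -> last cycle it entered the pipeline
    cycle = 0
    while pending:
        retry = deque()
        for k in pending:
            if cycle - last_issue.get(k, -T) >= T:
                last_issue[k] = cycle
                issued.append((cycle, k))
            else:
                retry.append(k)        # conflict: defer to the next pass
            cycle += 1                 # one pipeline slot elapses either way
        pending = retry
    return issued
```

With uniformly random scalars, conflicts are rare (the $$\frac{NT}{2^{c-1}}$$ estimate above), so almost everything is issued on the first pass and the queue stays small.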

### **Software-Side Affine Optimization**

When implementing MSM in software, **non-Affine coordinate systems** such as **Extended Jacobian** are typically used. The main reason is that **point addition in Affine coordinates requires field inversion**, which is computationally expensive. For instance, consider a typical Bucket Accumulation where multiple points are added sequentially:

$$
P \leftarrow P_1 + P_2 + \dots + P_{n-1} + P_n
$$

If Affine coordinates were used, each addition would involve an inverse operation, leading to significant performance overhead.

{% hint style="info" %}
[BATZORIG ZORIGOO](https://app.gitbook.com/u/lqk5Tx9zY4XYVRfF3ReRDiOhbxG3 "mention") mentioned that batch inversion can be applied using a tree-structured addition approach, which reduces the inversion count to $$\log_2 n \times \mathsf{Inv}$$.
{% endhint %}

#### Scheduler Enables the Use of Affine Coordinates

However, thanks to the **Scheduler** introduced in CycloneMSM, points can now be grouped into batches for addition **without conflicts between buckets**:

$$
B_{k^{(t)}} \leftarrow B_{k^{(t)}} + P^{(t)} \\
B_{k^{(t + 1)}} \leftarrow B_{k^{(t + 1)}} + P^{(t + 1)} \\
\vdots \\
B_{k^{(t + T - 1)}} \leftarrow B_{k^{(t + T - 1)}} + P^{(t + T - 1)}
$$

This structure allows for **batch addition of** $$T$$ **Affine points**, making it possible to apply **batch inversion** and significantly reduce the overall computation cost.

#### Comparison of Operation Costs

* Affine addition (1 time): $$3 \times \mathsf{Mul} + 1 \times \mathsf{Inv}$$
* $$T$$ Affine additions (naive): $$3T \times \mathsf{Mul} + T \times \mathsf{Inv}$$
* $$T$$ Affine additions (with batch inversion): $$6T \times \mathsf{Mul} + 1 \times \mathsf{Inv}$$
* $$T$$ Extended Jacobian additions: $$10T \times \mathsf{Mul}$$

Thus, **the number of inversion operations is reduced from** $$T$$ **to 1**, with the remaining work replaced by multiplications, leading to a dramatic boost in efficiency. Whenever $$4T \times \mathsf{Mul} > 1 \times \mathsf{Inv}$$ (the single inversion costs less than the $$4T$$-multiplication gap between the two approaches), the $$T$$ Affine additions with batch inversion are more efficient.
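The batched count above comes from Montgomery's batch-inversion trick: multiply all the values together, invert the product once, then peel the individual inverses back out. A sketch, with the modulus chosen arbitrarily for illustration:

```python
# Montgomery's batch-inversion trick: T inversions become one inversion plus
# about 3T extra multiplications, which is what makes affine batch addition
# over the scheduler's conflict-free groups worthwhile.

Q = 2**255 - 19   # any prime modulus works; this one is illustrative

def batch_inverse(values):
    """Invert every element of `values` mod Q using a single inversion."""
    n = len(values)
    prefix = [1] * (n + 1)
    for i, v in enumerate(values):
        prefix[i + 1] = prefix[i] * v % Q        # running products v_0 ... v_i
    inv_all = pow(prefix[n], -1, Q)              # the one true inversion
    out = [0] * n
    for i in reversed(range(n)):
        out[i] = prefix[i] * inv_all % Q         # 1 / v_i
        inv_all = inv_all * values[i] % Q        # strip v_i from the product
    return out
```

This is the same routine used when converting points between coordinate systems in bulk, e.g. during `MSM_INIT()`.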

#### Real-World Implementations

This Affine optimization has been adopted in actual open-source libraries:

* [gnark-crypto #261](https://github.com/Consensys/gnark-crypto/pull/261)
* [halo2curves #130](https://github.com/privacy-scaling-explorations/halo2curves/pull/130)

Thanks to this optimization, a **10–20% performance improvement** over Extended Jacobian-based implementations has been reported, significantly enhancing MSM computation efficiency in practice.

### FPGA Design

#### Field Arithmetic

<figure><img src="https://755218234-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Frwz1ZAZJtK5FHz4Y1esA%2Fuploads%2FXhEaX4KNySzyU5cradYN%2Fimage.png?alt=media&#x26;token=9194e8cf-ec2d-496d-8b76-6cd806c267d1" alt=""><figcaption></figcaption></figure>

Arithmetic in the finite field $$\mathbb{F}\_q$$ is performed using 377-bit operations in [**Montgomery representation**](https://fractalyze.gitbook.io/intro/~/revisions/6ILOvDSvW4zAg8fUzupj/modular-arithmetic/modular-reduction/montgomery-reduction#the-montgomery-representation), with a Montgomery parameter of $$R = 2^{384}$$. Field multiplication is implemented using **three 384-bit integer multiplications** based on the [**Karatsuba algorithm**.](https://fractalyze.gitbook.io/intro/~/revisions/6ILOvDSvW4zAg8fUzupj/primitives/multiplication/karatsuba-multiplication)

Additionally, since $$R \ge 4q$$, the system can safely operate on inputs $$a, b \in [0, 2q)$$ for multiplication. This characteristic allows certain modular reductions to be skipped when implementing the Curve Adder module:

1. **When** $$a, b \in [0, q)$$, the sum $$a + b \in [0, 2q)$$ → **modular reduction is unnecessary**
2. **To handle subtraction safely**, compute $$q + a - b \in [0, 2q)$$ → **modular reduction is also unnecessary**

By avoiding modular reductions in these cases, resource usage and processing delays in the pipeline are further minimized.
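Both reduction-skipping cases can be checked directly. In the toy model below, a small prime stands in for the 377-bit BLS12-377 modulus; the point is only that the lazy results stay inside $$[0, 2q)$$ while remaining congruent to the reduced values.

```python
# The two reduction-free tricks above, over a toy modulus: with inputs in
# [0, q), plain addition and the shifted subtraction q + a - b both land in
# [0, 2q), so no conditional subtraction of q is needed in the pipeline.

q = 101  # toy modulus for illustration; BLS12-377 uses a 377-bit q

def lazy_add(a, b):
    """a, b in [0, q): a + b is in [0, 2q) and congruent to (a + b) mod q."""
    return a + b

def lazy_sub(a, b):
    """a, b in [0, q): q + a - b is in [0, 2q) and congruent to (a - b) mod q."""
    return q + a - b
```

Keeping intermediate values in the wider range $$[0, 2q)$$ trades a few extra bits of width for the removal of reduction logic from the critical path.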

#### Constant Multiplier

<figure><img src="https://755218234-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Frwz1ZAZJtK5FHz4Y1esA%2Fuploads%2FjbIqE4JQ3wSw2xAhcEh7%2Fimage.png?alt=media&#x26;token=d9d714c0-250c-4855-b1b5-4a4d672c9d3d" alt=""><figcaption></figcaption></figure>

In Montgomery representation, two constant multiplications are required as follows:

$$
a \times (R^2 \mod q) \times (R^{-1} \mod q)
$$

Instead of using the standard double-and-add approach for multiplying constants, **CycloneMSM employs NAF (Non-Adjacent Form)**. This reduces the **Hamming weight** of the constant from $$\frac{1}{2}$$ to approximately $$\frac{1}{3}$$, resulting in several advantages:

* Fewer additions in the double-and-add process → **improved speed**
* Shallower and faster **adder trees** in the FPGA design
  * Specifically, **adder tree depth is reduced from** $$\log_2 W(b)$$ **to** $$\log_3 W(b)$$, where $$W(b)$$ denotes the Hamming weight of the constant $$b$$.

This optimization plays a critical role in improving both performance and resource efficiency in the CycloneMSM hardware pipeline.

#### Curve Arithmetic

<figure><img src="https://755218234-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Frwz1ZAZJtK5FHz4Y1esA%2Fuploads%2FdvJcVhVO9dnqAD7HENOw%2FScreenshot%202025-04-17%20at%201.28.31%E2%80%AFPM.png?alt=media&#x26;token=67900194-e5ab-4eb1-9adc-8a3587a20815" alt=""><figcaption></figcaption></figure>

<figure><img src="https://755218234-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Frwz1ZAZJtK5FHz4Y1esA%2Fuploads%2FqORpeYbZdfGu8CQVJNJr%2FScreenshot%202025-04-17%20at%201.44.52%E2%80%AFPM.png?alt=media&#x26;token=e9d06da7-fec3-49ba-a2f2-941e42a03234" alt=""><figcaption></figcaption></figure>

**Pipelined Structure**

* **Clock speed**: 250 MHz → 1 cycle = 4 ns
* **Cycle latency**:
  * $$\mathsf{MixedAdd}$$ outputs are produced **after 96 cycles (= 384 ns)**
  * $$\mathsf{Mul}$$ outputs are produced **after 39 cycles (= 156 ns)**

**MixedAdd Operation (Refer to the left side of Figure 4)**

<figure><img src="https://755218234-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Frwz1ZAZJtK5FHz4Y1esA%2Fuploads%2FtugxiCjR55sUb3cNNSVS%2Fimage.png?alt=media&#x26;token=eb450afb-20ae-4f86-a318-1646e97610ad" alt=""><figcaption></figcaption></figure>

In Twisted Edwards curves, $$\mathsf{MixedAdd}$$ uses the following two inputs:

* $$P_1 = (X_1 : Y_1 : Z_1 : T_1)$$ (extended projective)
* $$P_2 = (x_2, y_2, u_2 = kx_2y_2)$$ (extended affine)

This operation involves **7 field multiplications**, and it is optimized using formulas from \[[HWCD08, Sec. 4.2](https://eprint.iacr.org/2008/522.pdf#page=8)]. Due to its pipelined nature, the adder can accept new points every cycle.

**Full Add Operation (Refer to the right side of Figure 4)**

A general $$\mathsf{Add}$$ operation requires **9 multiplications**. CycloneMSM implements this as a **2-cycle pipelined process**. The processing flow is as follows:

1. **Stage 0**

   Inputs: $$Z_1$$, $$Z_2$$ → passed to the multiplier
2. **Stage 1**

   Inputs: $$T_2$$, curve constant $$k$$ → passed to the multiplier
3. **Stage 40** (after the 39-cycle latency of a single Montgomery multiplier)

   Invoke $$\mathsf{MixedAdd}$$ with the outputs below:

$$
\mathsf{MixedAdd}\left(P_1 = (X_1 : Y_1 : r_0 : T_1),\ P_2 = (X_2, Y_2, r_1)\right)
$$

where $$r_0 = Z_1 \cdot Z_2$$ and $$r_1 = k \cdot T_2$$

This design converts a full Add into a MixedAdd call, effectively **reusing the existing pipeline** while precomputing necessary multiplications to maintain efficiency.

#### MSM Acceleration

<figure><img src="https://755218234-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Frwz1ZAZJtK5FHz4Y1esA%2Fuploads%2FKhw88wDMqVxZTY0vc78y%2Fimage.png?alt=media&#x26;token=1cc125ab-d003-4204-9191-3156c5ac67c0" alt=""><figcaption><p>They use <span class="math">S_k</span> for bucket instead of <span class="math">B_k</span>.</p></figcaption></figure>

<figure><img src="https://755218234-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Frwz1ZAZJtK5FHz4Y1esA%2Fuploads%2FJ3JKUQaV6hMw2u8U0erQ%2Fimage.png?alt=media&#x26;token=07d41691-cbcf-40f5-b6c4-6f89e739bcae" alt=""><figcaption></figcaption></figure>

The MSM operation in CycloneMSM is executed through coordinated interaction between the FPGA and the host, following the flow of functions outlined below:

* `MSM_INIT()`:
  * The user provides $$N$$ points in **Weierstrass affine coordinates**
  * [The host converts these points to the **Twisted Edwards coordinate system**](https://fractalyze.gitbook.io/intro/~/revisions/6ILOvDSvW4zAg8fUzupj/primitives/abstract-algebra/elliptic-curve/edwards-curve/twisted-edwards-short-weierstrass-transformation)
    * Typically affine → extended affine: $$(x, y) \rightarrow (x, y, u = kxy)$$
  * The converted points are transferred to the FPGA and stored in **DDR memory**
  * Since the conversion involves inversion, **batch inversion** is used to optimize performance
* `MSM()`:
  * The host splits each scalar into **16-bit windows (**$$c = 16$$**)** using `PREPROCESS()`
    * A total of $$W = 16$$ windows are created
  * For each window, the following functions are executed:
    1. `ACCUMULATE(j)`: performs bucket accumulation
    2. `AGGREGATE(j)`: performs bucket reduction
  * The final result of all windows is computed by the host via **Window Reduction**
* `ACCUMULATE(j)`:
  * The host invokes `FPGA.ACCUMULATE(j)`
  * Scalars are reduced to **16-bit reduced scalars** and streamed to the FPGA
    * Data is batched and transmitted in 64-bit chunks (8 scalars per batch)
* `FPGA.ACCUMULATE(j)`:
  * The FPGA invokes `SCHED()` to read points from DDR memory
    * If a conflict is detected, the scalar and point are stored back into DDR via FIFO
    * Otherwise, the addition is performed and the result is stored in the corresponding bucket in SRAM
* `AGGREGATE()`:
  * The host invokes `FPGA.AGGREGATE()`
* `FPGA.AGGREGATE()`:
  * The FPGA accumulates $$2^{c-1}$$ buckets in a sequential manner using the following formula:

    $$R = B_K + (B_K + B_{K-1}) + \dots + (B_1 + \dots + B_K)$$
  * This operation is performed using a **2-cycle** $$\mathsf{Add}$$ **pipeline** implemented on the FPGA
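The aggregation formula above is the same running-sum pattern as the bucket reduction: by accumulating the partial sums $$B_K$$, $$B_K + B_{K-1}$$, and so on, bucket $$B_k$$ ends up counted exactly $$k$$ times. A software model with integer "buckets":

```python
# FPGA-side bucket aggregation as a running sum: accumulating each partial
# sum B_K + ... + B_k once yields R = sum(k * B_k) using only additions.
# Buckets are modeled as integers for illustration.

def aggregate(buckets):
    """buckets[k] holds B_k for k = 1..K (index 0 is unused)."""
    running = result = 0
    for b in reversed(buckets[1:]):
        running += b        # running = B_K + ... + B_k
        result += running   # result accumulates each partial sum once
    return result
```

On the FPGA this sequential dependence is what the 2-cycle $$\mathsf{Add}$$ pipeline serves, since each step consumes the previous running sum.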

## Conclusion

<figure><img src="https://755218234-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Frwz1ZAZJtK5FHz4Y1esA%2Fuploads%2FzzriPCJqdX8Mx95nShp8%2Fimage.png?alt=media&#x26;token=40be3ffd-d4e0-4aec-89dd-4a0b5a52bfbf" alt=""><figcaption></figcaption></figure>

<figure><img src="https://755218234-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Frwz1ZAZJtK5FHz4Y1esA%2Fuploads%2Fw50gVMPC1txH0OsHDj6l%2Fimage.png?alt=media&#x26;token=8bef8765-7bde-44bb-90ed-004630d44bb9" alt=""><figcaption></figcaption></figure>

CycloneMSM leverages the architectural strengths of FPGAs and the inherent parallelism of MSM operations to deliver a high-performance system that significantly outpaces traditional software-based MSM implementations. The system's achievements are underpinned by the following design philosophies and optimization strategies:

* **Optimized Coordinate System Selection**:

  On hardware, the Twisted Edwards coordinate system was chosen for its uniform operation count, which maximizes pipelining efficiency. On the software side, performance was improved through affine coordinates combined with batch inversion.
* **Pipelined Curve Adder**:

  A 96-stage pipeline running at 250 MHz accepts a new input every 4 ns, enabling $$\mathsf{MixedAdd}$$, the core operation of MSM, to be processed in every clock cycle.
* **Conflict-Free Scheduling**:

  By applying a Delayed Scheduler, most points were processed without collisions. This also enabled batch inversion in the affine coordinate system, contributing to performance gains on the software side.
* **Modular and Parallel MSM Flow**:

  The MSM process—initialization, bucket accumulation, reduction, and window reduction—was modularized and optimized, with roles distributed efficiently between the FPGA and host. This maximized pipeline utilization across the system.
* **Superior Performance and Energy Efficiency**:

  CycloneMSM completes a 64 million-point MSM in approximately six seconds, achieving **4–6× speedups** compared to software while consuming **only one-tenth the power**.

In summary, CycloneMSM is more than just a hardware implementation—it is a **deeply integrated system combining algorithmic and architectural design**, demonstrating the real potential of hardware-accelerated MSM for more complex ZK and large-scale proof systems in the future.

## References

* <https://www.youtube.com/watch?v=aZ3CzKZBK38>
* <https://eprint.iacr.org/2022/1396.pdf>

> Written by [Ryan Kim](https://app.gitbook.com/u/cPk8gft4tSd0Obi6ARBfoQ16SqG2 "mention") of Fractalyze
