zk-ECDSA Part 2: Under the Hood

May 4, 2022

zk-ECDSA Part 2: Under the Hood

by Yi Sun, Tony L., Wen-Ding L., and gubsheep

Earlier this year, we presented circom-ecdsa, an efficient proof-of-concept implementation of zkSNARK circuits for ECDSA algorithms in circom. This post goes more in-depth on some of the building blocks that go into the circuit constructions.

In a future post, we'll also discuss some of the more advanced optimizations we implemented for secp256k1 operations.

How does ECDSA work?

In Part 1 of our zk-ECDSA series, we covered the role that the Elliptic Curve Digital Signature Algorithm (ECDSA) plays in networks including Bitcoin and Ethereum. Now, we'll dive into the implementation details and see how to translate ECDSA operations into arithmetic circuits in circom, a programming language for ZK circuits. These circuits in turn specify a zkSNARK that we can use to implement proof of group membership, private airdrops, on-chain vote aggregation, etc. (For more details on how this "circuit to zkSNARK" translation is done, see Vitalik's post and a writeup from zCash.)

For our purposes, we'll focus on two ECDSA operations: (1) converting a private key to a public key; and (2) verifying an ECDSA signature. We won't need to implement signature generation itself, since zero-knowledge verification of a signature's validity accomplishes the same thing as zero-knowledge verification of the valid computation of a signature.

The ECDSA protocol takes several protocol parameters: an elliptic curve equation, a prime number specifying the field which (x, y) coordinates on the curve are drawn from, a generator point on the curve, and the order of the group generated by the generator point. We'll be operating on the secp256k1 curve. Specifically, the curve is defined by the equation:

y2x3+7 (mod p)

where the 256-bit prime p is

p=225623229282726241

The secp256k1 standard also specifies a particular value for generator point G and corresponding group order n. See here for more details on the parameters.

Additionally, Ethereum relies on the cryptographically secure hash function keccak256, which is what we'll be using here. We use a circom implementation of keccak from Vocdoni.

ECDSAPrivToPub

Given a private key, ECDSA tells us how to generate a corresponding public key that can be used for signature verification.

The private key d_A is a randomly generated integer in the range [1, n - 1] and the public key is defined as Q_A = d_A * G. In other words, we simply need to multiply our base point G by the private key, using elliptic curve multiplication by a scalar. This will be done using a combination of elliptic curve point addition and point doubling.

ECDSAVerify

Signing a message m with a public key Q_A produces a signature (r, s). Now, say we are given a signature (r, s) and public key Q_A. We want to check that this signature is valid.

You can find a complete description of the signature verification algorithm here. At a high level, this algorithm can be broken down into the following operations:

  • Hashing the message m (with keccak256)
  • Modular multiplication and inverse
  • Elliptic curve multiplication by a scalar, and elliptic curve point addition

Circuits and Techniques

Now that we have a good understanding of the operations involved, we're onto the fun part: let's translate these operations into arithmetic circuits. Compared to using a procedural programming language, writing circuits in circom both introduces additional constraints, but also allows us to use a few tricks to reduce the amount of computation required. We'll measure progress based on how many constraints our final circuits require.

As a brief refresher on arithmetic circuits, a constraint is simply an equation of the form A * B + C = 0, where A, B, and C are linear combinations of signals, which for our purposes can be thought of as variables.

Producing a constraint-satisfying witness (variable assignments for intermediate values) is relatively easy to do since we can perform any operations to generate the witness--not just + and *. However, ensuring that our constraints are sufficient is much harder. Our goal is to construct circuits such that the output signals are the correct function of the input signals. In other words, there should be no way to manipulate the values without violating the constraints. Currently, we rely on auditing our code and running tests to verify the correctness of these circuits.

BigInt Arithmetic

BigInt Representation

First, let's find a suitable representation of big integers. Recall that today's zkSNARK's proving systems generally use elliptic curves with a 254-bit prime that effectively limits the register[1] sizes to 254 bits. However, the prime p for secp256k1 is 256 bits, which creates a bit of awkwardness.

If we want to support multiplication "natively", we should limit our register size to at most (254 - 1) / 2 = 126 bits, since multiplying two k-bit numbers results in a 2k-bit number. However, this would require ceil(256 / 126) = 3 registers to represent integers in [0, p - 1]. Then we're "wasting" 3 * 126 - 256 = 122 bits, which is basically a full register![2]

Given that we're forced to use 3 registers already, we can simply pick the number of bits to be ceil(256 / 3) = 86. This way, we're only "wasting" 3 * 86 - 256 = 2 bits. Not terrible! This is the best we can do given the constraints.

Let's assume 86-bit registers going forward, keeping in mind that we need 3 of them to represent integer ranges we care about. We'll represent a[k] and b[k] as two integers, each using k registers.

BigInt Comparison/Addition/Subtraction

For comparisons, we can simply check register by register, e.g. checking a < b amounts to:

  • a[k - 1] < b[k - 1]
  • or (a[k - 1] == b[k - 1] and a[k - 2] < b[k - 2])
  • or (a[k - 1] == b[k - 1] and a[k - 2] = b[k - 2] and a[k - 3] < b[k - 3])
  • etc...

These can be done by simply composing the existing AND/OR circuits in circom.

For addition, we can simply sum up each pair of registers a[i] + b[i] and emulate the carry with a separate signal. Subtraction is similar, if we assume a >= b and calculate a[i] - b[i] with a borrow signal.

BigInt Multiplication

Normally, with k registers, a naive multiplication implementation uses two for-loops to compute intermediate products, for O(k^2) constraints, followed by a bunch of carries that takes O(nk) constraints (the main cost being a set of range checks that each of 2k registers is at most n bits). Our initial implementation of bigint multiplication required several thousand constraints to verify the multiplication of two 256-bit numbers. Can we do better?

Our strategy is as follows. First, we define a few terms:

  • Define the proper representation of a number n in base x to be an array a such that each value a[i] is a nonnegative integer less than x, and such that n = a[0] + a[1] * x + ... + a[k - 1] * x ** (k - 1). Note that for any nonnegative integer n and any base x, there is exactly one proper representation of n.
  • Define an m-overflow representation of n to be an array a such that each value a[i] is a nonnegative integer less than m * x, and such that n = a[0] + a[1] * x + ... + a[k - 1] * x ** (k - 1). Note that a nonnegative integer n may have many different m-overflow representations for the same base x.

In the first part of our approach, we find a specific k * 2**n-overflow representation (c[2k]) of the product of our two bigint numbers, represented as a[k] and b[k]. Importantly, we pick a particular k * 2**n-overflow representation that can be efficiently constrained, inspired by a trick from xJsnark. In the second part of our approach, we build a circuit that constrains an overflow representation of a number to equal its proper representation--a process we call reducing.

Let's break down the first part. Since our result is 2k registers, the best we can hope for is roughly 2k constraints, e.g. one for each register.

We can define the polynomials:

  • A(x) = a[0] + a[1] * x + ... + a[k - 1] * x ** (k - 1)
  • B(x) = b[0] + b[1] * x + ... + b[k - 1] * x ** (k - 1)
  • C(x) = c[0] + c[1] * x + ... + c[2 * k - 2] * x ** (2 * k - 2)

If we ignore carries (we'll get back to these later), verifying a multiplication a * b = c is the same as verifying the polynomial identity A(x) * B(x) = C(x) using x = 2 ** 86, where the coefficients c[j] may be up to k * (2 ** 86) ** 2. The two for-loop implementation amounts to calculating each of the coefficients c[j] = sum a[i] * b[j - i] separately. However, what if we just evaluate the two polynomials on 2k - 1 points?

The key insight here: if two degree-m polynomials agree at m + 1 distinct points, they must be equal. For instance, two points uniquely determine a line (degree-1), three points determine a parabola (degree-2), and so on. Since our polynomials have degree 2 * k - 2, the 2 * k - 1 constraints C(y) = A(y) * B(y) for y = 0, ..., 2 * k - 2 are enough to check polynomial equality. Because each of A(y), B(y), C(y) is simply a weighted sum of the signals a[i], b[i], c[i], each equality only costs a single constraint in our computational model.

This "polynomial trick" allows us to obtain a K * (2 ** n)-overflow representation of our desired product, reducing the set of constraints from O(k ** 2) to O(k)!

Now we are left with the problem of reducing this overflow representation into the proper representation. In other words, we'd like to prove the equivalence of our overflow representation c[2k] and the proper representation d[2k]. Our strategy here will be to initialize an array of intermediate signals carry[2k], and going register-by-register to verify that the difference between c[i] and d[i] (including carries from previous registers) is always an integer multiple of 2 ** n. We use the following constraints:

  • Initialize intermediate signals carry[2k].
  • Constrain carry[0] = c[0] - d[0] * (2 ** -n).
  • Constrain carry[i] = (c[i] - d[i] + carry[i-1]) * (2 ** -n) for all i > 0. We can do this with additions and multiplications by constraining 2 ** n * carry[i] = c[i] - d[i] + carry[i-1].
  • Range check that each value d[i] is at most n bits.
  • Range check that each value carry[i] is at most n + log(k) bits.
  • Constrain carry[2k-1] = 0.

BigInt Division/Modulo

As it turns out, division is only slightly more complicated than multiplication in terms of constructing constraints. If we write a / b as (q, r), where q is the quotient and r is the remainder, we simply need to constrain a = b * q + r and 0 <= r < b to hold. We still need to calculate the values using long division to produce a witness, but constraining this witness is much more straightforward as we can reuse the BigInt multiplication constraints.

With division, we can implement the a mod p operation, which is just the remainder upon division by p. We can use this circuit to build addition, subtractions, and multiplication mod p as well! The only operation left is modular inverse, which we can constrain as a_inv * a = 1 (mod p) and reuse our multiplication circuit. We can calculate a_inv for the witness by using Fermat's Little Theorem, i.e. a ** (p - 2) = a_inv (mod p). Once again, we can turn division into multiplication when writing constraints.

ECC Operations

In this section, we discuss how to go from BigInt arithmetic to elliptic curve operations over a non-native field. Constraint counts below are quoted for our v0.0.0 implementation. In the past few months we've implemented further optimizations, which we're still in the process of cleaning up and documenting.

Point Addition and Doubling

In the conventional computational model, modular inverse is substantially more expensive than modular multiplication. As a result, point addition and doubling are usually done in Jacobian form, which removes modular inversions at the cost of more modular multiplications. In our setting, modular inverse and multiplication cost the same number of constraints, so we use the more efficient standard formulas in Weierstrass form. This is another example of how programming zkSNARKs creates a new regime for performance optimization.

Scalar Multiplication

Implementing ECDSAPrivToPub requires scalar multiplication d_A * G of a base point G on secp256k1 by the private key d_A, a scalar. The standard algorithm for this is double-and-add, which uses the binary representation of d_A:

d_A = bits[0] + bits[1] * 2 + ... + bits[255] * (2 ** 255).

It works by successively doubling and adding the base point, with pseudocode:

def double_and_add(bits):
    res = O                   // O is the additive identity on secp256k1
    for i in [0, ..., 255]:
        res = res + res       // doubling on secp256k1
        if bits[i] == 1:
            res = res + G     // addition of unequal points on secp256k1
    return res

For a 256-bit private key, in the worst case this requires 256 point additions and doublings, which translates to 9037920 constraints. Because the base point G is known and fixed in our setting, we can reduce the number of point operations by precomputing and caching multiples of G by powers of two. We can then compute the public key as the sum of cached multiples of G corresponding to the binary digits of the private key.

Gpow[256] = precompute_G_powers()  // Gpow[i] = (2 ** i) * G

def cached_double_and_add(bits):
    let res = O                    // O is the additive identity on secp256k1
    for i in [0, ..., 255]:
        if bits[i] == 1:
            res = res + Gpow[i]    // addition of unequal points on secp256k1
    return res

This cached double-and-add method requires at most 256 point additions and no point doublings, reducing the number of constraints to 3733697.

In our implementation, we further reduce the number of point additions using a cached version of the windowed method. That is, we group binary digits of the private key into groups of size stride, cache multiples of G corresponding to numbers with non-zero binary digits only in a single group, and represent G as a sum of the cached elliptic curve points:

// Gpow[i][j] = j * (2 ** (i * stride)) * G for j = 1, ..., 2**stride - 1
Gpow[ceil(256 / stride)][2 ** stride] = precompute_G_powers(stride)  

def cached_windowed(bits, stride):
    let res = O                     // O is the additive identity on secp256k1
    for i in [0, ..., ceil(256 / stride) - 1]:
        j = bits[i * stride] + ... + bits[(i + 1) * stride - 1] * 2**(stride - 1)
        if j != 0:
            res = res + Gpow[i][j]  // addition of unequal points on secp256k1
    return res

This cached windowed method reduces the number of point additions to ceil(256 / stride), which decreases as stride increases. However, as stride increases, we must maintain a growing cache of size ceil(256 / stride) * 2**stride points and select the jth element from Gpow[i] using a multiplexer with 2 ** stride possible selections. Surprisingly, this is an expensive operation in a zkSNARK, requiring O(2 ** stride) constraints per selection.

Running an exhaustive search for the stride which balances constraint usage in point additions and multiplexers, we find the following relation between the stride and number of constraints:

Stride13891011
Constraints37336971237111482276436936416883432869

The number of constraints seems to be minimized at stride 10. The resulting stride 10 windowed method is what we use in our circuit with 416883 constraints, an over 20-fold decrease from our starting point!

Outlook and Future Work

Further optimizing the R1CS representation size

Beyond the strategy that we laid out above, there are still some opportunities for further optimizing our circuits in R1CS form. We'll briefly mention one common technique: lazy reduction.

After each big integer arithmetic operations which we apply during an ECC operation, we keep the big integer in range [0, p - 1] and keep all the registers in the desired range as well. As a result, we have two kinds of reduction which correspond to keeping these values in range. However, it is possible to skip some of these reductions, and only do the reduction when it is necessary. This is commonly called lazy reduction, and with this we are in fact able to further reduce the total number of constraints. In a future blog post, we'll go into some of these optimizations in more detail.

New Proving System - PLONK

One advantage of PLONK is that it doesn't require a trusted setup per application-specific circuit, and this will greatly ease the burden of developers who want to deploy ZK applications in real world. Our circuits in R1CS form can be trivially converted to the form for PLONK with just standard addition and multiplication gates. One caveat of such naive conversion is that addition gates do not come for free like in Groth16, so the circuits might be suboptimal.

We are aware that there are many recent developments of PLONKish proving systems with carefully designed custom gates or lookup tables which have very promising results on implementing ECC operations (e.g. Turbo Plonk). We are also thinking about optimize our circuit by taking advantage of the efficient range proof in PLONK and PLOOKUP, and redesign our circuit with smaller size of register. The bottom line right now however is that we'd like to deliver an implementation that actually works in today's circom ecosystem, as it can help developers to prototype and develop new zk applications more easily.

Pairing-Based Crypto and Other Primitives

It is possible to instead build an ECC module with a pairing-friendly curve besides the secp256k1 curve we used here since most of our techniques are general enough for other curves as well. Another 0xPARC team will soon be sharing an implementation of pairing based crypto inside a zkSNARK, which can potentially lead to additional applications like BLS signatures.

Circuit Auditing and Verification

As we noted in the first part of this serie, these circuits are unaudited and we do not recommend them for production use as of now. We did our best effort to have witness tests cover most circuit modules, but generally speaking, witness testing alone is not enough. There are a few directions we can take to address this issue:

  • Developing better testing method: Currently we only test whether witness generations is correct and satisfied the circuit constraints, it would be better to have a test that are able to catch "under-contrained" bugs such as the one report here.
  • Applying formal verfication: Besides testing, formal method is also suitable in our case because the spec for our circuits is easy to specified mathematically.
  • Submitting for professional audit: If there is strong demand, we can also submit them to professional auditting company for security audit.

Conclusion

We invite you to check out our repo and would love to hear any feedback, ideas for improvement / optimization, or application experiments you'd like to build on top!

This project was built during 0xPARC's Applied ZK Learning Group #1, and further optimized during the 0xPARC ZK ID Working Group.

We use a circom implementation of keccak from Vocdoni. We also use some circom utilities for converting an ECDSA public key to an Ethereum address implemented by lsankar4033, jefflau, and veronicaz41 for another ZK Learning Group project in the same cohort. We use an optimization for big integer multiplication from xJsnark.

Footnotes


  1. In the literature these are commonly referred to as "limbs." ↩︎

  2. Concretely, the "waste" here comes from needing to perform range checks for unnecessarily large registers. Range checks are one of the most common (and most expensive, constraints-wise) operations in ZK circuits. The constraint cost of a range check is proportional to the number of bits of the number we are checking; range-checking three 126-bit numbers is more expensive than range-checking three 86-bit numbers. ↩︎