FPT-Sized Zero-Knowledge Proofs for Table Edit Distance
The previous posts established a complete picture: table edit distance is NP-hard, treewidth controls the boundary, polymorphisms explain why, and coordinate descent is the practical workaround. This final post asks a different question: can the hardness be used constructively?
Zero-knowledge proofs let a prover convince a verifier that a statement is true without revealing why. “These two medical databases have edit distance at most 50” — provable without revealing a single patient record. “Two financial reports differ in fewer than 10 cells” — verifiable without exposing the numbers.
The challenge: generic ZKP compilation of NP-hard problems produces exponential-size proofs. The opportunity: bounded treewidth — the same structural property that enables FPT algorithms — also bounds the proof size. This post sketches how.
1. The Motivation
1.1 Privacy-Preserving Table Comparison
Consider these scenarios:
Medical data. Two hospitals want to verify that their patient databases are “similar” (edit distance below a threshold) before merging. Neither can reveal patient records to the other.
Financial auditing. An auditor needs to verify that a company’s internal report and its regulatory filing differ in fewer than $d$ cells. The company doesn’t want to reveal which cells differ (competitive sensitivity).
Federated learning. Multiple parties hold feature tables and want to verify data quality (e.g., schema consistency, low pairwise edit distance) before federated training. Raw data stays private.
In each case, the statement is: “the table edit distance $\tau(T_1, T_2)$ is at most $d$,” and the proof should reveal nothing beyond this fact.
1.2 The NP Relation
The statement “$\tau(T_1, T_2) \leq d$” is in NP. The witness is an edit script $S$ (a sequence of insert/delete/modify operations) with total cost $\leq d$ that transforms $T_1$ into $T_2$. Given $S$, verification is polynomial: apply each operation, check the result equals $T_2$, sum costs.
So zero-knowledge proofs exist (Goldreich-Micali-Wigderson, 1986). The question is efficiency.
2. Why Generic ZKP Is Too Expensive
2.1 The Compilation Pipeline
Modern ZKP systems (zk-SNARKs, STARKs) work by:
- Express the verification as an arithmetic circuit $C$.
- The prover computes a proof $\pi$ that $C$ accepts the witness.
- The verifier checks $\pi$ in time $O(|C|^{O(1)})$ or better.
Proof size and prover time are proportional to $|C|$ (circuit size).
2.2 Circuit Size for Edit Distance
For the NP relation “$S$ is a valid edit script of cost $\leq d$”:
- Applying $S$ to $T_1$: $O(|S|)$ gates.
- Checking the result equals $T_2$: $O(n \cdot m)$ gates.
- Summing costs: $O(|S|)$ gates.
Total: $|C| = O(|S| + n \cdot m) = O(n + m + \text{modifications})$.
This is linear in the input size — not exponential! The circuit is small because verification is easy, even though finding the optimal script is NP-hard.
2.3 So What’s the Problem?
The circuit is small. The prover’s burden is finding the witness:
- Prover computes the optimal edit script $S^*$. This is NP-hard — exponential time in the worst case.
- Prover generates the ZKP. This takes $O(|C|)$ time — polynomial.
- Verifier checks the ZKP. This takes $O(\text{polylog}(|C|))$ time for zk-SNARKs.
The bottleneck is step 1. In theory, the prover can always find $S^*$ given enough time. In practice, for large tables, finding the exact optimal script is infeasible.
But: this is exactly where FPT helps. When the table has few columns (bounded treewidth), the prover can find $S^*$ in $O(2^{2k} \cdot n^2)$ time — exponential in $k$ but polynomial in $n$.
3. The FPT Insight
3.1 Treewidth Controls Everything
The FPT algorithm for bounded-treewidth edit distance (Post 1) enumerates all $O(2^{2k})$ column alignments, solves each with an $O(n^2)$ row LCS, and takes the minimum. The total computation is $O(2^{2k} \cdot n^2)$.
This computation can be expressed as a circuit:
$$ |C_{\text{FPT}}| = O\bigl(2^{2k} \cdot n^2\bigr) $$
where $k$ = number of columns (treewidth) and $n$ = number of rows.
Compare with the generic approach (not exploiting treewidth):
$$ |C_{\text{generic}}| = O\bigl(2^{n \cdot m}\bigr) \quad \text{(worst case)} $$
For $k = 10$, $n = 10000$:
| Approach | Circuit size | Practical? |
|---|---|---|
| Generic | $2^{10000 \cdot 10}$ | No |
| FPT | $2^{20} \cdot 10^8 \approx 10^{14}$ | Borderline |
| Verification-only | $O(n + m) \approx 10000$ | Yes |
The FPT circuit is exponentially smaller than the generic circuit. But $10^{14}$ is still too large for practical zk-SNARKs (which typically handle circuits up to $\sim 10^8$ gates).
3.2 Structuring the Proof Along the Tree Decomposition
The key to reducing proof size further: don’t prove the entire FPT computation. Prove local optimality at each bag of the tree decomposition, then aggregate.
A tree decomposition of width $k$ for a table with $n$ rows and $k$ columns has $O(n)$ bags, each of size $O(k)$. The standard DP on tree decompositions:
- For each bag (corresponding to a row or group of rows), solve the local CSP: find the optimal alignment of the $O(k)$ columns for this bag’s rows.
- Combine children into parent via DP: for each assignment of the parent bag’s variables, take the best combination of children’s assignments.
The circuit for this DP:
- Each bag’s local solve: $O(d^k)$ where $d$ = domain size (small for edit distance — binary match/no-match).
- Combination at each bag: $O(d^{2k})$ (parent × children assignments).
- Number of bags: $O(n)$.
- Total: $O(n \cdot d^{2k})$.
For $d = 2$ (binary), $k = 10$, $n = 10000$:
$$ |C_{\text{decomp}}| = 10000 \cdot 2^{20} \approx 10^{10} $$
Still large, but the structure is regular and parallelizable. With modern zk-SNARK systems (which handle structured circuits efficiently via polynomial commitments), this is at the edge of feasibility.
3.3 The FPT-ZKP Property
Definition. A ZKP system has the FPT property with parameter $k$ if the proof size is $f(k) \cdot n^{O(1)}$, where $f$ is any computable function and $n$ is the input size.
For table edit distance with treewidth $k$:
$$ \text{proof size} = O\bigl(n \cdot d^{2k}\bigr) = f(k) \cdot n $$
where $f(k) = d^{2k}$.
This is exponentially better than the generic approach (proof size $O(2^{nm})$) and polynomially better than the naive FPT compilation (proof size $O(2^{2k} \cdot n^2)$).
4. Protocol Sketch
4.1 Setup
Public inputs: Table dimensions $(n_1, m_1)$ and $(n_2, m_2)$, threshold $d$, commitment to $T_1$ ($\text{com}_1$) and $T_2$ ($\text{com}_2$), column count $k$ (treewidth parameter).
Private inputs (prover only): $T_1$, $T_2$, edit script $S$ of cost $\leq d$.
4.2 Protocol
Phase 1: Commit.
The prover commits to the tables using a polynomial commitment scheme (e.g., KZG commitments). Let $\text{com}_1 = \text{Commit}(T_1)$ and $\text{com}_2 = \text{Commit}(T_2)$. These are made public.
Phase 2: Tree decomposition.
The prover constructs a tree decomposition of the constraint graph (width $k$). For tables, this is straightforward: the decomposition follows the row structure, with each bag containing the $k$ column variables for one row.
Phase 3: Local proofs.
For each bag $B_i$ (corresponding to row $i$), the prover generates a local proof that:
- The cell values at $B_i$’s columns are consistent with $\text{com}_1$ and $\text{com}_2$ (opening the commitments at specific positions).
- The local edit operations at $B_i$ are valid and have the claimed cost.
Each local proof has size $O(d^{2k})$.
Phase 4: Aggregation.
Local proofs are combined into a global proof using a recursive aggregation scheme (similar to Marlin’s sum-check-based aggregation):
$$ \pi_{\text{global}} = \text{Aggregate}\bigl(\pi_1, \ldots, \pi_n\bigr) $$
The aggregated proof has size $O(\text{polylog}(n) \cdot d^{2k})$ — logarithmic in $n$ (via recursive composition), exponential in $k$.
Phase 5: Verification.
The verifier checks:
- $\text{com}_1$ and $\text{com}_2$ are well-formed (commitment verification).
- $\pi_{\text{global}}$ proves: there exists an edit script of cost $\leq d$ consistent with the commitments.
- Total cost $\leq d$.
Verification time: $O(\text{polylog}(n) \cdot \text{poly}(k))$.
4.3 Properties
| Property | Guarantee |
|---|---|
| Completeness | Honest prover with valid script always convinces verifier |
| Soundness | No cheating prover can convince verifier of false statement (except with negligible probability) |
| Zero-knowledge | Verifier learns nothing beyond “$\tau(T_1, T_2) \leq d$” |
| Proof size | $O(\text{polylog}(n) \cdot d^{2k})$ — FPT in $k$ |
| Prover time | $O(n \cdot d^{2k})$ — FPT computation + proof generation |
| Verifier time | $O(\text{polylog}(n) \cdot \text{poly}(k))$ |
5. Applications Revisited
5.1 Medical Database Comparison
Two hospitals, each with a patient table (columns: patient ID, age, diagnosis, treatment — $k = 4$ columns, $n = 100000$ rows).
Without ZKP: Hospitals must exchange data to compute edit distance. Privacy violation.
With FPT-ZKP:
- Proof size: $O(\log(100000) \cdot 2^8) \approx 2^{8} \cdot 17 \approx 4000$ field elements — a few kilobytes.
- Prover time: $O(100000 \cdot 2^8) \approx 10^7$ — seconds on a modern CPU.
- Verifier time: $O(\log(100000) \cdot 8) \approx 140$ — milliseconds.
The hospitals prove their databases are similar without revealing any patient data.
5.2 Financial Auditing
A company proves its internal report and regulatory filing differ in $\leq 10$ cells. Columns: account, period, amount — $k = 3$.
- Proof size: $O(\log(n) \cdot 2^6) \approx$ a few hundred field elements.
- The auditor learns whether the difference is within tolerance, not which cells differ.
5.3 Federated Data Quality
Multiple parties verify pairwise edit distances are below a threshold before federated training. Each party generates a ZKP for its pairwise distance to a reference table. The coordinator checks all proofs without seeing any raw data.
6. Open Problems
6.1 Concrete Efficiency
The protocol sketch above is asymptotic. Turning it into a working system requires:
- Choice of polynomial commitment: KZG (trusted setup, small proofs) vs. FRI (transparent, larger proofs).
- Aggregation scheme: Marlin-style or Plonk-style?
- Implementation: No existing ZKP system natively supports tree decomposition-structured circuits. Building one is an engineering challenge.
6.2 Approximate ZKP
Can the prover prove “$\tau(T_1, T_2) \leq (1 + \epsilon) \cdot d$ for some threshold $d$” — an approximate distance bound — more efficiently than the exact version? This would leverage the FPT structure without requiring exact computation.
This connects to the PCP theorem and hardness of approximation: if the edit distance is hard to approximate within $n^{1-\epsilon}$ (via the biclique connection), then approximate ZKP also has limits.
6.3 Proving Optimality
The protocol above proves “there exists a script of cost $\leq d$.” Proving “the optimal cost is exactly $d$” (i.e., $d$ is the minimum) requires a lower-bound proof: no script of cost $< d$ exists. This is a coNP statement, harder for ZKP.
For bounded treewidth, the FPT algorithm computes the exact optimum, so the prover knows the true distance. But proving that no better script exists requires proving the optimality of the DP computation, not just the validity of a script.
6.4 Average-Case Hardness for Cryptographic Use
NP-hardness is worst-case. For cryptographic applications, we need average-case hardness: random table instances should be hard. Maximum Biclique is easy on random bipartite graphs (the expected biclique size is $O(\log n)$). This suggests that random table edit distance instances might be easy, limiting cryptographic applicability.
However, structured tables (real-world data with correlations) may be hard on average. Characterizing this is an open question.
7. Series Conclusion
This five-post series traced a path from a practical problem (diffing tables in Rust) to deep theory and back to application:
| Post | Question | Answer |
|---|---|---|
| 1 | How hard is table edit distance? | NP-hard (universal), treewidth controls boundary |
| 2 | Why is it hard? | No polymorphisms — solutions don’t compose |
| 3 | What about higher dimensions? | Same boundary, maps to tensor rank |
| 4 | How does tate cope? | Coordinate descent — provable convergence, no approximation guarantee |
| 5 | Can hardness be useful? | FPT-sized ZKP — treewidth controls proof size |
The unifying thread: treewidth. It determines when edit distance is tractable (Post 1), why DP works or fails (Post 2), how hardness scales with dimensions (Post 3), where the practical limits of heuristic algorithms lie (Post 4), and how zero-knowledge proofs can exploit the structure (Post 5).
The code is in tate. The theory is in these posts. The open problems are real.
Thanks for reading. If you work on ZKP systems, parameterized complexity, or structured diff algorithms, I’d love to hear from you.