jimmy-yang.dev

Previous posts established that table edit distance is NP-hard, that treewidth controls the boundary, and that the algebraic root cause is the absence of compositional structure. Theory is clean; reality is messy. Users of tate need to diff tables today, and they need the result in milliseconds, not in the exponential time that exact computation would require.

This post covers tate’s practical answer: coordinate descent — an alternating optimization scheme that exploits the one structural property the NP-hardness result leaves intact: for any fixed column alignment, the row alignment problem is polynomial (it’s 1D LCS).

We prove convergence, characterize the optimality gap, and show how the algorithm maps to tate’s Rust implementation.

1. The Coupled Problem

1.1 Why Direct Optimization Fails

Table edit distance requires simultaneously choosing:

A column alignment $C$: which columns of $T_1$ correspond to which columns of $T_2$.
A row alignment $R$: which rows of $T_1$ correspond to which rows of $T_2$.

The total cost is:

$$ \text{cost}(C, R) = \underbrace{\alpha \cdot \text{row_ops}(R) + \beta \cdot \text{col_ops}(C)}{\text{structural cost}} + \underbrace{\gamma \cdot \text{mismatches}(C, R)}{\text{cell cost}} $$

The difficulty: the cell cost depends on both $C$ and $R$ jointly. You cannot minimize over $C$ and $R$ independently.

1.2 The Key Insight

For any fixed column alignment $C$, the row alignment problem reduces to 1D LCS:

Build a row signature for each row: the concatenation of cell values at the matched column positions.
Align row signatures by standard LCS — $O(n_1 \cdot n_2)$.

Similarly, for any fixed row alignment $R$, the column alignment problem reduces to a matching problem solvable by greedy or Hungarian assignment.

This is the treewidth-1 escape: fixing one axis collapses the 2D grid into a 1D path, restoring polynomial solvability. Coordinate descent exploits this by alternating.

2. The Algorithm

2.1 Overview

Input: tables T₁, T₂

C₀ = align_columns_by_header(T₁, T₂)          // §2.2
R₀ = align_rows_by_lcs(T₁, T₂, C₀)            // §2.3

for i = 0, 1, ..., max_iters-1:
    C_{i+1} = align_columns_by_data(T₁, T₂, Rᵢ)   // §2.4
    R_{i+1} = align_rows_by_lcs(T₁, T₂, C_{i+1})   // §2.5

    if C_{i+1} == Cᵢ and R_{i+1} == Rᵢ:
        break  // converged

Output: (C_{final}, R_{final})

Each step is an exact solve of a subproblem. The alternation is where the heuristic lives.

2.2 Initial Column Alignment: Header LCS

Tate detects the header row on each side (the first row filling $\geq 80%$ of the grid width), normalizes header cells (trim + lowercase), and runs LCS on the header sequences:

LCS Equal → matched column slot (exists in both sides).
LCS Delete → column only in $T_1$ (removed).
LCS Insert → column only in $T_2$ (added).

If no usable header exists, columns are aligned positionally (1:1).

Weakness: Header LCS matches by text equality. Renamed columns (“name” → “full_name”) or missing headers produce incorrect alignment. This is what refinement fixes.

2.3 Initial Row Alignment: Signature LCS

For each row, extract cells at the matched column indices and join them with a NUL separator (\u{0}) to form a signature string:

$$ \text{sig}(\text{row}_i) = T[i, c_1] ; \texttt{\textbackslash0} ; T[i, c_2] ; \texttt{\textbackslash0} ; \cdots $$

Row signatures are aligned by LCS. Since signatures only use matched columns, row comparison is insensitive to inserted/removed columns.

2.4 Data-Driven Column Re-alignment

This is the key innovation. After the initial row alignment, we know which rows correspond. We can use this correspondence to build a per-column similarity matrix:

$$ \text{sim}(p, q) = \frac{|{(i, j) \in R : T_1[i, p] = T_2[j, q]}|}{|R|} $$

where $R$ is the set of matched row pairs. This is the fraction of matched rows where $T_1$’s column $p$ agrees with $T_2$’s column $q$.

Greedy assignment: repeatedly pick the highest-similarity column pair $(p^, q^)$ with $\text{sim} > 0.5$, mark both as used, create a matched slot. Remaining columns become one-sided (added/removed).

Why this catches what header LCS misses:

Renamed columns: Headers differ (“name” vs “full_name”), but if 95% of row data matches, $\text{sim} = 0.95 > 0.5$ → matched.
Missing headers: A column with no header text but consistent data still has high $\text{sim}$.
Reordered columns: Same data, different position → high $\text{sim}$ → matched, regardless of header order.

2.5 Row Re-alignment

With the improved column slots, recompute row signatures and re-run LCS. This may match rows that were previously unmatched (because the improved column alignment produces better signatures).

2.6 Convergence Check

If both the column alignment and row alignment are unchanged from the previous iteration, the algorithm has converged. Tate also enforces a maximum iteration count (refinement_iters = 2 by default).

3. Convergence Proof

Theorem. The coordinate descent converges to a local optimum in finite steps.

Proof. Define the cost function $\text{cost}(C, R)$ as above. At each step:

Column step: $C_{i+1}$ is chosen to minimize cost given $R_i$:

$$ \text{cost}(C_{i+1}, R_i) \leq \text{cost}(C_i, R_i) $$

(The greedy assignment finds the optimal column matching for the given row pairs. If no pair exceeds the similarity threshold, $C_{i+1} = C_i$.)

Row step: $R_{i+1}$ is chosen to minimize cost given $C_{i+1}$:

$$ \text{cost}(C_{i+1}, R_{i+1}) \leq \text{cost}(C_{i+1}, R_i) $$

(LCS on row signatures produces the minimum-edit row alignment for the given column slots.)

Combining:

$$ \text{cost}(C_{i+1}, R_{i+1}) \leq \text{cost}(C_i, R_i) $$

So the cost sequence is monotonically non-increasing.

Since the alignment space is finite (there are finitely many possible column matchings and row matchings), the cost cannot decrease forever. The algorithm must reach a state where:

$$ C_{i+1} = C_i \quad \text{and} \quad R_{i+1} = R_i $$

At this fixed point, neither coordinate can be improved unilaterally, which is the definition of a local optimum (Nash equilibrium of the two-player game between row-aligner and column-aligner). $\blacksquare$

3.1 Convergence Speed

The number of iterations is bounded by the number of distinct $(C, R)$ pairs, which is exponential in theory. In practice:

Each iteration either improves the cost or terminates.
Cost improvements are discrete (integer or rational with bounded denominator).
Typical tables converge in 1–2 refinement passes.

Tate’s default refinement_iters = 2 covers the vast majority of real-world cases.

4. The Optimality Gap

Coordinate descent finds a local optimum. How far can this be from the global optimum?

4.1 Unbounded Gap (Worst Case)

Construction. Consider two tables where the header row is misleading:

T₁: | name  | role  |          T₂: | name  | function |
    | Alice | admin |              | Alice | admin    |
    | Bob   | user  |              | Bob   | user     |

Header LCS fails to match “role” ↔ “function” (different text). Initial column alignment treats “role” as deleted and “function” as added. Row signatures use only the “name” column, so all rows appear identical → all matched as equal. The “role” → “function” change is invisible.

Refinement should fix this (the data is identical), but if the header fill ratio causes “role” to be treated as a non-header column, the similarity matrix might not trigger.

Result: Coordinate descent reports 0 cell modifications (everything looks equal on the “name” column). The true edit distance includes 1 column rename. The gap is $\frac{\text{reported}}{\text{true}} = 0$.

4.2 Small Gap (Typical Case)

On real-world structured tables (CDISC define.xml comparisons, financial reports, database snapshots), the gap is typically 0 (coordinate descent finds the global optimum) or within a small constant factor. This is because:

Real tables have informative headers (header LCS works well).
Data is consistent (similarity matrix reliably detects column correspondences).
Row signatures are discriminative (LCS finds the correct row alignment).

No formal approximation guarantee exists. The algorithm is a practical heuristic with provable convergence but no provable approximation ratio.

5. Implementation in tate

5.1 The Main Loop

From grid.rs, the core loop is approximately:

pub fn grid_diff(
    rows_a: &[Vec<String>],
    rows_b: &[Vec<String>],
    opts: &GridOptions,
) -> GridDiff {
    // Initial column alignment via header LCS
    let mut slots = align_columns_by_header(
        rows_a, rows_b, width_a, width_b, header_a, header_b
    );

    // Initial row alignment via signature LCS
    let mut row_pairs = align_rows_by_lcs(rows_a, rows_b, &slots, opts);

    // Iterative refinement (coordinate descent)
    for _ in 0..opts.refinement_iters {
        let new_slots = align_columns_by_data(
            rows_a, rows_b, &row_pairs, &slots, width_a, width_b
        );
        if new_slots == slots { break; }
        slots = new_slots;

        let new_pairs = align_rows_by_lcs(rows_a, rows_b, &slots, opts);
        if new_pairs == row_pairs { break; }
        row_pairs = new_pairs;
    }

    // Render aligned grid with per-cell status
    render(rows_a, rows_b, &slots, &row_pairs)
}

5.2 Configuration

Every heuristic is exposed via GridOptions:

pub struct GridOptions {
    pub lcs_row_budget: usize,           // 4000: skip LCS above this
    pub header_fill_ratio: f64,          // 0.8: header detection threshold
    pub row_similarity_threshold: f64,   // 0.5: promote del+ins → modified
    pub use_header_names: bool,          // true: use header text for col names
    pub refinement_iters: usize,         // 2: coordinate descent iterations
}

The lcs_row_budget is the practical treewidth limit: above 4000 rows, the LCS-based row alignment is replaced by positional alignment (the treewidth is too high for exact DP, so tate falls back to the cheapest heuristic).

5.3 Complexity per Iteration

Step	Algorithm	Complexity
Header LCS	1D LCS on headers	$O(m_1 \cdot m_2)$
Row signature LCS	1D LCS on signatures	$O(n_1 \cdot n_2)$
Column similarity matrix	Greedy assignment	$O(m_1 \cdot m_2 \cdot n_{\text{matched}})$
Render	Walk all cells	$O(n_{\text{out}} \cdot m_{\text{out}})$

Total per iteration: $O(n_1 \cdot n_2 + m_1 \cdot m_2 \cdot n_{\text{matched}})$. With refinement_iters = 2, the total cost is $O(n \cdot m \cdot n)$ where $n$ = rows, $m$ = columns. This is practical for tables up to a few thousand rows and a few hundred columns.

6. What Coordinate Descent Cannot Do

The algorithm inherits the limitations of its building blocks:

No column reordering detection (without refinement). Header LCS reports reordered columns as delete+add. Refinement fixes this if the data is consistent, but only after one full iteration.
No move detection. A row that moves from position 5 to position 50 is reported as deleted+inserted, not moved. (Move detection requires a second pass over the change list — see tate’s ALGORITHMS.md for the proposed approach.)
No guarantee of global optimality. The convergence proof guarantees a local optimum, but the gap to global can be unbounded on adversarial inputs.
Budget limits. Above lcs_row_budget rows, all alignments become positional. This is the practical manifestation of treewidth: when the constraint graph is too large, even 1D subproblems are abandoned.

What’s Next

Coordinate descent is what tate does today: navigate the NP-hard landscape heuristically, with provable convergence but no approximation guarantee. The final post in this series asks whether the hardness can be used constructively — not just worked around.

The treewidth structure that makes FPT algorithms possible also enables FPT-sized zero-knowledge proofs: proofs whose size depends on the treewidth, not the input size. This opens applications in privacy-preserving data comparison.