Code Gets Written Faster Than It Gets Reviewed
GenAI tools can generate entire features in minutes. A developer can scaffold a REST API, write comprehensive tests, and implement error handling faster than most teams can schedule a design review. The generated code often follows conventions more faithfully than rushed human implementations: consistent naming, proper abstractions, thorough documentation.
The problem is that all of that generated code still goes through human review processes designed for incremental, peer-authored changes. A 2,000-line pull request generated in 30 minutes sits in review for days while engineers try to understand code they didn't write.
This creates a bottleneck that Amdahl's Law predicts: when you accelerate one part of a system while leaving another unchanged, the unchanged part becomes the limiting factor. We've made code generation machine-fast, but verification remains human-slow.
For decades, the Software Development Life Cycle maintained a rough equilibrium: requirements, design, implementation, testing, and review all moved at human speed. GenAI has shattered that balance, turning the SDLC into a funnel where code pours in faster than it can be processed. Traditional pull requests assume human authorship and incremental change; they simply can't absorb machine-speed generation.
The Industrial Revolution Had This Problem
In the 1760s, the Spinning Jenny let one worker produce eight times more yarn than before. Suddenly, textile manufacturers had more yarn than their weavers could handle. The bottleneck shifted from spinning to weaving.
The solution wasn't to hire more weavers or make them work faster. It was to mechanize weaving with power looms that could match the yarn production rate.
GenAI is our Spinning Jenny. It produces code faster than humans can review it. We need a power loom for code verification.
Why Human Review Can't Scale
If GenAI can write code so quickly, why not just auto-approve it? The answer lies in the specific ways current AI systems fail.
First, GenAI produces significantly more bugs than human-written code. CodeRabbit's analysis found defect rates averaging 170% higher in AI-generated code, including logic errors, edge case failures, and integration problems that compilers can't catch.
Second, AI tends toward unnecessary complexity. Ask it to implement a simple feature and you'll often get an over-engineered solution with multiple abstraction layers, design patterns applied dogmatically, and configuration options you didn't need. The code works, but it's harder to maintain than a straightforward implementation.
Third, AI systems "forget" context as they generate longer code blocks. They'll implement the same utility function twice in different files, create redundant data structures, or duplicate business logic because they lose track of what they've already written. This leads to maintenance nightmares and subtle inconsistencies.
Finally, AI struggles with integration points. It makes optimistic assumptions about how external systems behave, implements error handling that looks comprehensive but misses real failure modes, and creates interfaces that work in isolation but break under production conditions.
These problems stem from the probabilistic nature of language models. AI systems infer behavior rather than observe it directly. They generate code based on statistical patterns in training data, not from understanding system requirements or constraints. Edge cases get guessed at. Error paths are assumed to be similar to common patterns. Integration points are modeled on incomplete examples.
These problems explain why human review remains essential, but they also point toward a solution: the failures aren't random. They're systematic patterns, and systematic patterns are exactly what automated systems can be built to catch and address.
Code review isn't just bug hunting. Experienced engineers evaluate multiple dimensions simultaneously:
- Syntax and type correctness
- Security vulnerabilities
- Performance implications
- Test coverage
- Architectural fit with existing systems
- Maintainability over time
- Appropriate use of abstractions
- Code complexity vs. business complexity
The mechanical checks are already better handled by automated tools like static analyzers for security issues, compilers for type errors, and performance profilers for bottlenecks.
The judgment calls are where humans add value, but they're also what makes review slow. These evaluations require context, experience, and subjective assessment that doesn't compress into simple rules.
Adversarial Verification
The solution borrows from machine learning: Generative Adversarial Networks. In a GAN, a generator creates outputs while a discriminator tries to tell them apart from real examples, and the generator improves under that constant adversarial pressure. This isn't a simple pass/fail review. The discriminator is structurally coupled to the generator, rejecting outputs until they meet its criteria. Its purpose isn't approval; it's refinement through resistance. Acceptance of code isn't granted by sign-off, but earned through survival.
Most AI code tools today use "judges" that evaluate finished code and render a verdict. Tools like Cursor's Bugbot exemplify this approach, providing helpful feedback on completed pull requests. But that's still human-paced thinking. We need discriminators that actively pressure the generation process.
Instead of one discriminator, use a collection. Each focuses on a specific concern:
- Complexity Discriminator: Rejects over-engineered solutions and unnecessary abstractions, pushing for simpler implementations
- Consistency Discriminator: Prevents code duplication and ensures the generator maintains context across the entire codebase
- Integration Discriminator: Validates assumptions about external systems and enforces realistic error handling
- Security Discriminator: Rejects code with auth bypasses, injection vulnerabilities, or data leaks
- Performance Discriminator: Rejects code with N+1 queries, memory leaks, or inefficient algorithms
The key difference from current approaches: these discriminators operate during generation, not after. Instead of generating flawed code and then trying to fix it, the system generates code under constant pressure to avoid known failure patterns.
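As a concrete sketch, every discriminator can share one small interface: take a candidate, return a verdict with actionable feedback. The Python snippet below is a minimal illustration of that interface; the `ComplexityDiscriminator` and its indentation heuristic are hypothetical stand-ins for what would in practice be a fine-tuned model.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Verdict:
    """Result of one discriminator pass: acceptance plus specific feedback."""
    passed: bool
    feedback: list[str] = field(default_factory=list)


class Discriminator(Protocol):
    """Shared interface: every discriminator evaluates a candidate."""
    name: str

    def evaluate(self, code: str) -> Verdict: ...


class ComplexityDiscriminator:
    """Toy stand-in for a complexity check: rejects deeply nested code.

    A real discriminator would be a fine-tuned model; this heuristic just
    shows the shape of the contract.
    """
    name = "complexity"

    def __init__(self, max_indent_levels: int = 4, indent_width: int = 4):
        self.max_indent_levels = max_indent_levels
        self.indent_width = indent_width

    def evaluate(self, code: str) -> Verdict:
        feedback = []
        for lineno, line in enumerate(code.splitlines(), start=1):
            stripped = line.lstrip(" ")
            depth = (len(line) - len(stripped)) // self.indent_width
            if stripped and depth > self.max_indent_levels:
                feedback.append(
                    f"line {lineno}: nesting depth {depth} exceeds "
                    f"{self.max_indent_levels}; flatten the control flow"
                )
        return Verdict(passed=not feedback, feedback=feedback)
```

The important design point is that a verdict carries specific, line-level feedback the generator can act on, not just a rejection.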
Building these discriminators is feasible through Supervised Fine-Tuning of frontier models paired with concise, focused prompts. Each discriminator operates within a narrow, bounded context (security vulnerabilities, performance anti-patterns, or architectural violations), which keeps the training data requirements manageable and the evaluation criteria clear.
This adversarial approach should theoretically produce higher-quality code than current GenAI systems. The generator learns to avoid the systematic problems that make AI code unreliable, while the discriminators ensure that speed doesn't come at the cost of correctness.
Unlike traditional code review, this process can happen hundreds of times per minute through iterative refinement where discriminators provide specific feedback about failures rather than simple rejections. The generator doesn't move on until it produces code that satisfies all discriminators simultaneously. In this model, code isn't accepted because someone glanced at it. It's accepted because it withstood sustained, multi-dimensional pressure from a system designed to find its weaknesses.
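That refinement loop can be sketched in a few lines: generate a candidate, collect specific failures, regenerate with that feedback, and accept only when every discriminator passes simultaneously. In the sketch below, `toy_generate` and the single integration check are hypothetical stand-ins for a real model and a real discriminator.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    passed: bool
    feedback: str = ""


def refine(generate, discriminators, max_rounds=10):
    """Regenerate under discriminator pressure until every check passes.

    `generate(feedback)` produces a candidate given the specific failure
    messages from the previous round; a candidate is accepted only when
    all discriminators pass it at once.
    """
    feedback: list[str] = []
    for round_no in range(1, max_rounds + 1):
        candidate = generate(feedback)
        failures = [
            f"[{name}] {verdict.feedback}"
            for name, check in discriminators.items()
            if not (verdict := check(candidate)).passed
        ]
        if not failures:
            return candidate, round_no
        feedback = failures  # specific failures, not a bare rejection
    raise RuntimeError(f"no candidate survived {max_rounds} rounds: {feedback}")


# Hypothetical stand-ins: a "generator" that repairs whatever was flagged,
# and one integration discriminator enforcing timeouts on external calls.
def toy_generate(feedback):
    code = "result = db.query(user_id)"
    if any("timeout" in note for note in feedback):
        code = "result = db.query(user_id, timeout=5)"
    return code


discriminators = {
    "integration": lambda code: Verdict(
        passed="timeout=" in code,
        feedback="external call lacks an explicit timeout",
    ),
}

code, rounds = refine(toy_generate, discriminators)  # accepted on round 2
```

Because each iteration is machine-paced, this loop can run many times in the span a human reviewer would need to read the diff once.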
The Economics of Adversarial Verification
Running multiple discriminators will cost more than current single-pass AI code generation. A full cost analysis hasn't been done, but the economics will likely favor this approach for several reasons.
First, discriminator specialization should enable cost optimization by selectively tuning for the correct model size and family. Mechanical evaluations like syntax checking or security pattern detection could potentially use nano-sized models or even Small Language Models running locally. Only the judgment-heavy discriminators, like architectural fit and maintainability assessment, would need expensive frontier models with advanced reasoning capabilities.
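One way to sketch that routing, with hypothetical tier names and per-call costs: map each discriminator to the cheapest model that can handle it, and run the cheap checks first so frontier models only see candidates that already survived the mechanical ones.

```python
# Hypothetical tier assignments: small local models for mechanical checks,
# frontier models only for judgment-heavy evaluations. Costs are arbitrary
# per-call units, used here only for ordering.
MODEL_TIERS = {
    "syntax": ("local-slm", 0.0),
    "security": ("nano", 0.01),
    "performance": ("nano", 0.01),
    "consistency": ("mid-tier", 0.10),
    "architecture": ("frontier", 1.00),
    "maintainability": ("frontier", 1.00),
}


def schedule(discriminator_names):
    """Order discriminators cheapest-first, so expensive frontier checks
    only run on candidates that already passed the mechanical ones."""
    return sorted(discriminator_names, key=lambda name: MODEL_TIERS[name][1])
```

Cheapest-first ordering is the same trick CI pipelines use: fail fast on lint before spending minutes on integration tests.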
Second, the upfront generation cost would likely be offset by dramatically faster development throughput. When verification is no longer human-paced, features could reach market weeks or months earlier. The business value of accelerated delivery would typically dwarf the computational costs.
Third, higher-quality generated code should reduce downstream costs. Less time spent debugging means more time building features. Fewer production incidents mean lower operational overhead. The cost of generating better code upfront is usually less than the cost of fixing bad code later.
Where Humans Fit
Humans don't disappear; they "shift left" and move upstream. Instead of reviewing implementations, they define intent and manage the verification system.
Intent Definition: Humans specify what the system should do, under what constraints, and within which boundaries. These become first-class inputs to the generation process, not comments buried in tickets.
System Stewardship: Humans tune discriminator sensitivity, monitor false positives, and adjust pressure based on context. Some projects need strict security discriminators. Others need permissive performance discriminators. Knowing when to apply which pressure requires human judgment.
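Intent as a first-class input could look like a small structured spec rather than prose buried in a ticket. The fields and example values below are hypothetical, just to show the shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    """Hypothetical intent spec: what to build, under which constraints,
    within which boundaries."""
    goal: str
    constraints: tuple[str, ...] = ()
    boundaries: tuple[str, ...] = ()

    def to_prompt(self) -> str:
        """Render the spec as structured input for the generation pass."""
        lines = [f"Goal: {self.goal}"]
        lines += [f"Constraint: {c}" for c in self.constraints]
        lines += [f"Boundary: {b}" for b in self.boundaries]
        return "\n".join(lines)


spec = Intent(
    goal="Export user invoices as CSV",
    constraints=("p95 latency under 200ms", "no PII in application logs"),
    boundaries=("read-only access to the billing database",),
)
```

A spec like this is machine-checkable: discriminators can be tuned against the stated constraints rather than guessing at intent from a ticket title.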
Trust shifts from "I've read every line" to "I've defined the intent clearly and the system enforces it reliably." That's uncomfortable for engineers used to line-by-line control, but it's similar to how we already trust automated testing frameworks to validate behavior without manually checking every execution path.
The Power Loom for Software
Pull requests assume human authorship and human review. They're artifacts of a development process where both generation and verification happened at human speed. When generation accelerates to machine speed, verification must follow.
Discriminator-driven verification is our power loom, matching the speed of AI code generation while maintaining the rigor of human review. Instead of relying on cursory human inspection, code earns approval by proving itself against rigorous automated challenges designed to expose every potential flaw.
This isn't about replacing human judgment. It's about encoding that judgment into systems that can operate at machine scale. The alternative is watching GenAI's potential get bottlenecked by processes designed for a slower world.
We've mechanized code generation. Now we need to mechanize code verification. The textile industry solved this problem 250 years ago with power looms. Software development needs the same breakthrough.