The autonomous engineering pipeline built on Karpathy's context engineering, Claude Code, NASA-grade quality gates, and cost-optimized model routing. Every script copy-paste ready. Every claim sourced.
AI agents produce at incredible speed. Knowing whether output is correct is the hard part. That's where the 100x engineer lives.
Each works independently. Together they create a fundamentally different operating level. This is what separates vibe coding from agentic engineering.
Agent = Model + Harness. Changing only the harness around a fixed model can improve performance by 6-10x on the same benchmark. Most agent failures are configuration problems, not model limitations. Stripe ships 1,300 AI-generated PRs per week with a heavily modified agent harness and zero model changes.
Route simple queries to cheap models and complex ones to expensive models. RouteLLM (ICLR 2025) reports that only 26% of calls need the expensive model while 95% of its quality is retained.
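A minimal sketch of complexity-based routing, to make the idea concrete. RouteLLM trains a learned router; the length-and-keyword heuristic below is a stand-in for it, and the model names, keyword list, and threshold are all hypothetical placeholders.

```python
from dataclasses import dataclass

# Hypothetical tier names; substitute whatever models your gateway exposes.
CHEAP_MODEL = "cheap-model"
EXPENSIVE_MODEL = "expensive-model"

# Illustrative signals only; a learned router would replace this heuristic.
HARD_KEYWORDS = ("refactor", "architecture", "race condition", "migrate", "security")


@dataclass(frozen=True)
class Route:
    model: str
    score: float


def complexity_score(prompt: str) -> float:
    """Return a rough 0..1 difficulty estimate from prompt length and keywords."""
    length_part = min(len(prompt) / 2000, 1.0) * 0.5
    lowered = prompt.lower()
    hits = sum(1 for keyword in HARD_KEYWORDS if keyword in lowered)
    keyword_part = min(hits / 2, 1.0) * 0.5
    return length_part + keyword_part


def route(prompt: str, threshold: float = 0.4) -> Route:
    """Pick the expensive model only when the score reaches the threshold."""
    score = complexity_score(prompt)
    model = EXPENSIVE_MODEL if score >= threshold else CHEAP_MODEL
    return Route(model=model, score=score)
```

Lowering `threshold` sends more traffic to the expensive model; tune it against your own evals rather than trusting the default.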
Cache reads cost 10% of standard input on Anthropic. DeepSeek cache hits are 50x cheaper than misses. Hit rates go from 7% naive to 84% optimized.
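The effect of hit rate on input spend can be worked out directly. The sketch below assumes cached reads are billed at 10% of the standard input price (the Anthropic figure above) and ignores any cache-write surcharge, so treat the result as a lower bound on real cost.

```python
def effective_input_cost(base_price: float, hit_rate: float, cached_ratio: float = 0.10) -> float:
    """Blend cached and uncached input pricing for a given cache hit rate.

    base_price   -- standard input price (any unit, e.g. USD per 1M tokens)
    hit_rate     -- fraction of input tokens served from cache, 0..1
    cached_ratio -- cached read price as a fraction of base_price (assumed 0.10)
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return base_price * (hit_rate * cached_ratio + (1.0 - hit_rate))
```

Under these assumptions, a 7% hit rate leaves input spend at about 94% of the uncached price, while an 84% hit rate brings it down to about 24%.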
NASA Power of 10 rules enforced by ESLint. 7-stage fail-fast gate: Prettier, ESLint, tsc, Vitest, Semgrep, Gitleaks, npm audit. Claude Code hooks auto-format every edit and block writes to protected files. Quality becomes automatic.
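A fail-fast gate can be sketched as an ordered list of commands that stops at the first non-zero exit code. The command lines below are assumptions about a typical Node/TypeScript project, not the pipeline's actual scripts; adjust flags and paths to your repository.

```python
import subprocess
from typing import Callable, List, Optional, Sequence, Tuple

# Assumed invocations for each stage; verify flags against each tool's docs.
STAGES: List[Tuple[str, List[str]]] = [
    ("prettier", ["npx", "prettier", "--check", "."]),
    ("eslint", ["npx", "eslint", "."]),
    ("tsc", ["npx", "tsc", "--noEmit"]),
    ("vitest", ["npx", "vitest", "run"]),
    ("semgrep", ["semgrep", "scan", "--config", "auto", "--error"]),
    ("gitleaks", ["gitleaks", "detect"]),
    ("npm audit", ["npm", "audit", "--audit-level=high"]),
]


def run_command(cmd: Sequence[str]) -> int:
    """Run one stage and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def run_gate(
    stages: Sequence[Tuple[str, List[str]]] = STAGES,
    runner: Callable[[Sequence[str]], int] = run_command,
) -> Optional[str]:
    """Run stages in order; return the first failing stage name, or None."""
    for name, cmd in stages:
        if runner(cmd) != 0:
            return name  # fail fast: later stages never run
    return None
```

Cheap checks come first so that a formatting error is reported in seconds instead of after a full test run. The `runner` parameter exists so the gate can be exercised without the tools installed.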
90% of agent projects fail within 30 days, and runaway costs are the #1 cause. Real incidents: $16-50K in 5 hours (recursion loop), $47K in 11 days (LangChain agent). The 5-layer budget system enforces limits at the gateway level. If the gateway checks the budget before forwarding a request, the agent cannot make a violating call.
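The gateway idea can be illustrated with a single layer: check the budget, then forward. This is a sketch of one layer only, not the 5-layer system itself; `BudgetGateway`, `BudgetExceeded`, and the per-call cost estimate are hypothetical names and inputs.

```python
from typing import Callable


class BudgetExceeded(Exception):
    """Raised when a call would push spend past the configured limit."""


class BudgetGateway:
    """Wrap a model-calling function and refuse calls that exceed the budget."""

    def __init__(self, limit_usd: float, forward: Callable[[str], str]) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0
        self._forward = forward

    def call(self, prompt: str, estimated_cost_usd: float) -> str:
        """Reserve the estimated cost first; forward only if it fits."""
        if self.spent_usd + estimated_cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"call of ${estimated_cost_usd:.2f} would exceed "
                f"${self.limit_usd:.2f} limit (spent ${self.spent_usd:.2f})"
            )
        self.spent_usd += estimated_cost_usd
        return self._forward(prompt)
```

Because the check runs before `forward`, a looping agent hits `BudgetExceeded` instead of the provider's API. A production version would also reconcile the estimate against actual billed usage.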
Karpathy's LLM Wiki: conversations flow into daily logs, compile into a wiki, inject into the next session. Creates a compounding brain, not a forgetful retriever. Every session makes the next one smarter.
The strongest pattern from production: the agent that writes code never reviews its own work. A separate reviewer catches blind spots. LLM-as-judge reaches about 80% agreement with human reviewers, on par with human-to-human agreement, and costs 500-5000x less.
Start with 20-50 tasks from real production failures. Use pass@k (at least one of k attempts succeeds) for capability and pass^k (all k attempts succeed) for reliability. Graduate capability evals into regression suites. If pass@1 is well below pass@3, add retries: the agent is capable but inconsistent.
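Both metrics can be estimated from the same set of trials. The sketch below uses the standard combinatorial estimators, assuming each task was run `n` times with `c` passes; the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts passes) from n trials with c passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts pass) from n trials with c passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)
```

For a task that passes 5 of 10 trials, pass@1 is 0.50, pass@3 is about 0.92, and pass^3 is about 0.08. That gap is the "capable but inconsistent" signature: retries help, but the task is not yet safe for a regression suite.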
Every task flows through 5 stages. Budget tracked at every step. Circuit breakers halt runaway agents. Quality gates block bad code automatically.
Understand codebase, identify files, create plan
Execute plan, write code, handle dependencies
Run tests, fix failures, add coverage
Security scan, quality check, LLM-as-judge
Commit, create PR, deploy
| Layer | Components |
|---|---|
| ORCHESTRATOR | Shell scripts · Cron · GitHub Actions · Routines |
| AGENT LAYER | PLAN (Opus) · CODE (Sonnet) · REVIEW (Opus) · TEST (Haiku) |
| MCP SERVERS | Memory · GitHub · Search · Browser · Context7 |
| QUALITY GATES | Prettier → ESLint → tsc → Vitest → Semgrep → Gitleaks |
| COST CONTROL | Model routing · Prompt caching · Token budgets |
| MEMORY | Conversations → Daily Logs → Compiled Wiki → Next Session |
From the 126K+ starred CLAUDE.md that started the agentic engineering movement. The operating system for how agents should write code.
Don't assume. Don't hide confusion. Surface tradeoffs. If multiple interpretations exist, present them — don't pick silently.
If something is unclear, stop. Name what's confusing. Ask.
Minimum code that solves the problem. Nothing speculative. No abstractions for single-use code. If 200 lines could be 50, rewrite it.
Would a senior engineer say this is overcomplicated? If yes, simplify.
Touch only what you must. Clean up only your own mess. Every changed line should trace directly to the user's request.
Don't "improve" adjacent code. Match existing style, even if you'd do it differently.
Define success criteria. Loop until verified. Transform tasks into verifiable goals, then loop independently until they pass.
Strong success criteria let you loop independently. Weak criteria require constant clarification.
"People who are very good at this can peak much higher than 10x. Vibe coding raises the floor. Agentic engineering extrapolates the ceiling."
(Andrej Karpathy, Sequoia AI Ascent 2026)
Not every task needs a frontier model. Route by complexity. DeepSeek V4 Flash trails Opus by 1.8 SWE-bench points but costs 35-100x less.
| Model | Input / 1M | Output / 1M | SWE-bench | vs Frontier |
|---|---|---|---|---|
| Claude Opus 4 | $15.00 | $75.00 | 72.5% | Baseline |
| Claude Sonnet 4 | $3.00 | $15.00 | 72.7% | 5x cheaper |
| DeepSeek V4 Pro (75% off) | $0.44 | $0.87 | 80.6% | ~7x cheaper |
| DeepSeek V4 Flash | $0.14 | $0.28 | ~79% | 35-100x cheaper |
| DS V4 Pro (cache hit) | $0.004 | — | — | ~1,400x cheaper |
Each multiplier compounds. Miss one, you're still 10x. Stack all seven, you operate at a fundamentally different level.
Incremental adoption. Each step works independently. No all-or-nothing commitment. Add a layer when you're ready.
All configs, scripts, and research in one repo.
Adapt the rules to your codebase. Keep under 150 lines. Add project-specific patterns.
Expand Claude's capabilities: memory, search, browser, GitHub, docs.
Auto-format on every edit. Block edits to .env files. Restore context after compaction.
4-stage pipeline: Analyze, Implement, Test, Review. Budget-tracked and JSON-logged.
150+ sources researched. All claims sourced. Unverified claims explicitly flagged.
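The "block edits to .env files" hook from the steps above could look roughly like this. It is a sketch under assumptions: that the hook receives a JSON payload on stdin with a `tool_input.file_path` field, and that exit code 2 tells Claude Code to block the tool call. Check both against the current hooks documentation before relying on it.

```python
import json
import sys
from pathlib import PurePosixPath

BLOCK_EXIT_CODE = 2  # assumed to mean "block this tool call"; verify in the hooks docs
ALLOW_EXIT_CODE = 0


def is_protected(file_path: str) -> bool:
    """Return True for .env and .env.* files, in any directory."""
    name = PurePosixPath(file_path.replace("\\", "/")).name
    return name == ".env" or name.startswith(".env.")


def run_hook(raw_payload: str) -> int:
    """Decide whether to allow an edit, given the hook's JSON payload."""
    payload = json.loads(raw_payload)
    # The payload shape is an assumption; missing fields fall through to "allow".
    file_path = payload.get("tool_input", {}).get("file_path", "")
    if file_path and is_protected(file_path):
        print(f"Blocked: {file_path} is a protected file", file=sys.stderr)
        return BLOCK_EXIT_CODE
    return ALLOW_EXIT_CODE


# To use as a hook script, end the file with:
#     sys.exit(run_hook(sys.stdin.read()))
```

Keeping the decision in `run_hook` and the process exit in a one-line wrapper makes the rule easy to unit test without spawning the agent.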