[P] Run Karpathy's Autoresearch for $0.44 instead of $24 — Open-source parallel evolution pipeline on SageMaker Spot
TL;DR: I built an open-source pipeline that runs Karpathy's autoresearch on SageMaker Spot instances — 25 autonomous ML experiments for $0.44 total (vs ~$24 on an H100). 4x parallel execution, 2.3x faster, 18x cheaper. Includes an 8-chapter vibe coding tutorial. GitHub
## The Problem
Karpathy's autoresearch is brilliant — an AI agent modifies training code, runs 5-minute experiments, keeps improvements, and repeats overnight. But it assumes you have an H100 sitting around for 8 hours. Most of us don't.
I wanted to know: can you get the same results on cheap cloud GPUs, paying only pennies per experiment?
## What I Built
A parallel evolution pipeline on SageMaker Managed Spot Training:
- Each generation: N candidates generated → N SageMaker Spot jobs run simultaneously → best val_bpb selected → next generation
- HUGI pattern (Hurry Up and Get Idle): GPUs spin up for 5 minutes, terminate immediately. Zero idle cost.
- Works with any GPU: H100, L40S, A10G — auto-detects and falls back gracefully
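In pseudocode, one evolution cycle looks roughly like this. This is a toy sketch, not the repo's actual code: `run_experiment` stands in for launching a SageMaker Spot job and reading back `val_bpb` (here faked with a synthetic fitness surface), and `mutate` perturbs a single hypothetical `embedding_lr` key.

```python
# Toy sketch of one evolution generation: mutate N candidates, "run" them in
# parallel, keep the best (lowest) val_bpb. Helper names are illustrative.
import random
from concurrent.futures import ThreadPoolExecutor

def mutate(config):
    # Conservative multiplicative LR mutation.
    child = dict(config)
    child["embedding_lr"] *= random.choice([0.8, 0.9, 1.1, 1.25])
    return child

def run_experiment(config):
    # Stand-in for a 5-minute SageMaker Spot job. Fake fitness surface
    # with an optimum at embedding_lr = 0.2; lower val_bpb is better.
    return config, 1.06 + abs(config["embedding_lr"] - 0.2)

def evolve(base, generations=5, population=4, seed=0):
    random.seed(seed)
    best, best_bpb = base, run_experiment(base)[1]
    for _ in range(generations):
        candidates = [mutate(best) for _ in range(population)]
        # In the real pipeline this is N simultaneous Spot jobs (HUGI pattern).
        with ThreadPoolExecutor(max_workers=population) as pool:
            results = list(pool.map(run_experiment, candidates))
        for cfg, bpb in results:
            if bpb < best_bpb:
                best, best_bpb = cfg, bpb
    return best, best_bpb

best, bpb = evolve({"embedding_lr": 0.1})
```

The real pipeline swaps the thread pool for N concurrent SageMaker training jobs, but the selection logic is the same: launch, wait, compare `val_bpb`, promote the winner to seed the next generation.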
Architecture: diagram
## Results
| | Original (H100, sequential) | This project (L40S Spot, parallel) |
|---|---|---|
| Cost for 83 experiments | ~$24 (on-demand) / ~$7 (spot) | ~$1.33 |
| Wall clock | ~8 hours | ~3.5 hours |
| GPU idle cost | ~50% wasted | $0 |
| Experiments in parallel | 1 | 4 |
My actual run: 25 experiments across 5 generations for $0.44 on L40S (ml.g6e.2xlarge Spot in us-east-1).
The pipeline autonomously discovered that EMBEDDING_LR is the most sensitive parameter, improving val_bpb from 1.0656 → 1.0643 through conservative LR evolution. Architecture changes (deeper models, bigger batches) all failed in the 5-minute budget.
## Surprises Along the Way
Some things I learned the hard way:
Spot capacity varies 1-9 by region. Same instance type: score 1 in us-west-2 (stuck for 30+ min), score 9 in us-east-1 (allocated in 2 min). Always run `aws ec2 get-spot-placement-scores` before choosing a region.
Flash Attention 3 doesn't work on L40S. Pre-compiled FA3 kernels only support Hopper (sm_90) and Ampere (sm_80/86). Ada Lovelace (sm_89) crashes at runtime. Had to add a PyTorch SDPA fallback — which halved MFU (20% vs 40%).
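The capability check behind that fallback can be written as a tiny helper (a hypothetical function, not the repo's API) that mirrors the support matrix above: FA3 on sm_90/sm_80/sm_86, SDPA everywhere else. In real code the `(major, minor)` pair would come from `torch.cuda.get_device_capability()`.

```python
# Hypothetical backend selector mirroring the FA3 -> SDPA fallback above.
FA3_CAPABILITIES = {(9, 0), (8, 0), (8, 6)}  # Hopper sm_90, Ampere sm_80/86

def pick_attention_backend(major, minor):
    """Return the attention backend for a GPU compute capability."""
    if (major, minor) in FA3_CAPABILITIES:
        return "flash_attention_3"
    # Ada Lovelace (sm_89) and everything else: PyTorch SDPA fallback.
    return "sdpa"

assert pick_attention_backend(9, 0) == "flash_attention_3"  # H100
assert pick_attention_backend(8, 9) == "sdpa"               # L40S
```

Checking the capability up front, instead of letting the FA3 kernel crash at runtime, is what turned this from a failed Spot job into a slower-but-working one.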
DEVICE_BATCH_SIZE ≠ throughput. Doubled batch size from 64→128, used 2x VRAM... and val_bpb got WORSE. Turns out with fixed TOTAL_BATCH_SIZE, larger micro-batches just reduce gradient accumulation steps without processing more tokens. The real lever is TOTAL_BATCH_SIZE.
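The arithmetic behind the trap, as a sketch (names are illustrative, not the repo's exact config keys):

```python
def tokens_per_step(total_batch, device_batch, seq_len, world_size=1):
    """Tokens per optimizer step and gradient-accumulation steps.

    With total_batch fixed, growing device_batch shrinks grad_accum but
    leaves tokens per optimizer step unchanged.
    """
    grad_accum = total_batch // (device_batch * world_size)
    return grad_accum * device_batch * world_size * seq_len, grad_accum

tokens_64, accum_64 = tokens_per_step(total_batch=512, device_batch=64, seq_len=2048)
tokens_128, accum_128 = tokens_per_step(total_batch=512, device_batch=128, seq_len=2048)
assert tokens_64 == tokens_128      # same tokens per step: no extra throughput
assert accum_128 == accum_64 // 2   # just fewer accumulation steps, at 2x VRAM
```

So the 64→128 change bought nothing except memory pressure; only raising `TOTAL_BATCH_SIZE` (or `seq_len`) puts more tokens through each optimizer step.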
Larger Spot instances can be cheaper. g6e.8xlarge ($0.93/hr) was cheaper than g6e.2xlarge ($1.82/hr) because of lower demand. Check the price history for all sizes.
Cheap GPU experiments transfer to expensive GPUs. Research confirms that architecture/optimizer rankings found on L40S ($0.04/experiment) transfer to H100 for production training. Absolute LR values need re-tuning, but "A beats B" conclusions are portable.
## The Vibe Coding Angle
The entire project was built through conversational AI coding (Claude Code) in a single ~13-hour session. I documented the full journey as an 8-chapter vibe coding tutorial — from initial idea through infrastructure debugging to autonomous evolution results. Every chapter includes the actual prompts used, the failures encountered, and the cost at each step.
## Try It
```bash
git clone https://github.com/roboco-io/serverless-autoresearch
cd serverless-autoresearch
cp config.yaml.example config.yaml
# Edit config.yaml with your AWS credentials
make setup    # IAM role
make prepare  # Data → S3
make dry-run  # Verify (free)
make run      # 10 gen × 4 pop = 40 experiments (~$0.70)
```
## Links
- GitHub: https://github.com/roboco-io/serverless-autoresearch
- Tutorial: 8-chapter vibe coding tutorial
- Comparison Report: Original vs Serverless
- Spot Capacity Guide: How to find available Spot GPUs
- Key Insights: 12 battle-tested lessons
What's your cheapest setup for running ML experiments? Anyone tried autoresearch on other cloud providers?
Update: I wrote a full step-by-step tutorial documenting how this was built.
If you want to learn by doing (not just read the code), I turned the entire
build process into an 8-chapter hands-on tutorial:
| Ch | What You'll Learn |
|----|------------------|
| 1 | How a single prompt + deep interview became the architecture |
| 2 | 23 files generated in one session with parallel AI agents |
| 3 | The region saga — Spot scores, quota wars, 3 region migrations |
| 4 | First experiment: FA3 CUDA crash → SDPA fallback → $0.02 success |
| 5 | The Batch Size Trap — why doubling BS made results WORSE |
| 6 | 5 generations of autonomous evolution (what worked vs what failed) |
| 7 | Turning lessons into a reusable Claude Code skill |
| 8 | Final scorecard: 18x cheaper, 2.3x faster |
Every chapter includes the actual prompt I used, what went wrong,
and exact commands to reproduce it. Total cost to follow along: ~$0.70.
The most educational part is probably Chapter 5 (The Batch Size Trap) —
I learned that DEVICE_BATCH_SIZE ≠ throughput the hard way ($0.07 lesson).
Start here: Chapter 1: The Idea