[P] Run Karpathy's Autoresearch for $0.44 instead of $24 — Open-source parallel evolution pipeline on SageMaker Spot
TL;DR: I built an open-source pipeline that runs Karpathy's autoresearch on SageMaker Spot instances — 25 autonomous ML experiments for $0.44 total (vs ~$24 on an H100). 4x parallel execution, 2.3x faster, 18x cheaper. Includes an 8-chapter vibe coding tutorial. GitHub
## The Problem
Karpathy's autoresearch is brilliant — an AI agent modifies training code, runs 5-minute experiments, keeps improvements, and repeats overnight. But it assumes you have an H100 sitting around for 8 hours. Most of us don't.
I wanted to know: can you get the same results on cheap cloud GPUs, paying only pennies per experiment?
## What I Built
A parallel evolution pipeline on SageMaker Managed Spot Training:
- Each generation: N candidates generated → N SageMaker Spot jobs run simultaneously → best val_bpb selected → next generation
- HUGI pattern (Hurry Up and Get Idle): GPUs spin up for 5 minutes, terminate immediately. Zero idle cost.
- Works with any GPU: H100, L40S, A10G — auto-detects and falls back gracefully
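In pseudocode, one evolution cycle looks roughly like this. This is a toy sketch, not the repo's actual code: `run_experiment` stands in for launching a SageMaker Spot job and reading back `val_bpb` (here faked with a synthetic fitness surface), and `mutate` perturbs a single hypothetical `embedding_lr` key.

```python
# Toy sketch of one evolution generation: mutate N candidates, "run" them in
# parallel, keep the best (lowest) val_bpb. Helper names are illustrative.
import random
from concurrent.futures import ThreadPoolExecutor

def mutate(config):
    # Conservative multiplicative LR mutation.
    child = dict(config)
    child["embedding_lr"] *= random.choice([0.8, 0.9, 1.1, 1.25])
    return child

def run_experiment(config):
    # Stand-in for a 5-minute SageMaker Spot job. Fake fitness surface
    # with an optimum at embedding_lr = 0.2; lower val_bpb is better.
    return config, 1.06 + abs(config["embedding_lr"] - 0.2)

def evolve(base, generations=5, population=4, seed=0):
    random.seed(seed)
    best, best_bpb = base, run_experiment(base)[1]
    for _ in range(generations):
        candidates = [mutate(best) for _ in range(population)]
        # In the real pipeline this is N simultaneous Spot jobs (HUGI pattern).
        with ThreadPoolExecutor(max_workers=population) as pool:
            results = list(pool.map(run_experiment, candidates))
        for cfg, bpb in results:
            if bpb < best_bpb:
                best, best_bpb = cfg, bpb
    return best, best_bpb

best, bpb = evolve({"embedding_lr": 0.1})
```

The real pipeline swaps the thread pool for N concurrent SageMaker training jobs, but the selection logic is the same: launch, wait, compare `val_bpb`, promote the winner to seed the next generation.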
Architecture: diagram
## Results
| | Original (H100, sequential) | This project (L40S Spot, parallel) |
|---|---|---|
| Cost for 83 experiments | ~$24 (on-demand) / ~$7 (spot) | ~$1.33 |
| Wall clock | ~8 hours | ~3.5 hours |
| GPU idle cost | ~50% wasted | $0 |
| Experiments in parallel | 1 | 4 |
My actual run: 25 experiments across 5 generations for $0.44 on L40S (ml.g6e.2xlarge Spot in us-east-1).
The pipeline autonomously discovered that EMBEDDING_LR is the most sensitive parameter, improving val_bpb from 1.0656 → 1.0643 through conservative LR evolution. Architecture changes (deeper models, bigger batches) all failed in the 5-minute budget.
## Surprises Along the Way
Some things I learned the hard way:
Spot capacity varies 1-9 by region. Same instance type: score 1 in us-west-2 (stuck for 30+ min), score 9 in us-east-1 (allocated in 2 min). Always run `aws ec2 get-spot-placement-scores` before choosing a region.
Flash Attention 3 doesn't work on L40S. Pre-compiled FA3 kernels only support Hopper (sm_90) and Ampere (sm_80/86). Ada Lovelace (sm_89) crashes at runtime. Had to add a PyTorch SDPA fallback — which halved MFU (20% vs 40%).
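The capability check behind that fallback can be written as a tiny helper (a hypothetical function, not the repo's API) that mirrors the support matrix above: FA3 on sm_90/sm_80/sm_86, SDPA everywhere else. In real code the `(major, minor)` pair would come from `torch.cuda.get_device_capability()`.

```python
# Hypothetical backend selector mirroring the FA3 -> SDPA fallback above.
FA3_CAPABILITIES = {(9, 0), (8, 0), (8, 6)}  # Hopper sm_90, Ampere sm_80/86

def pick_attention_backend(major, minor):
    """Return the attention backend for a GPU compute capability."""
    if (major, minor) in FA3_CAPABILITIES:
        return "flash_attention_3"
    # Ada Lovelace (sm_89) and everything else: PyTorch SDPA fallback.
    return "sdpa"

assert pick_attention_backend(9, 0) == "flash_attention_3"  # H100
assert pick_attention_backend(8, 9) == "sdpa"               # L40S
```

Checking the capability up front, instead of letting the FA3 kernel crash at runtime, is what turned this from a failed Spot job into a slower-but-working one.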
DEVICE_BATCH_SIZE ≠ throughput. Doubled batch size from 64→128, used 2x VRAM... and val_bpb got WORSE. Turns out with fixed TOTAL_BATCH_SIZE, larger micro-batches just reduce gradient accumulation steps without processing more tokens. The real lever is TOTAL_BATCH_SIZE.
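The arithmetic behind the trap, as a sketch (names are illustrative, not the repo's exact config keys):

```python
def tokens_per_step(total_batch, device_batch, seq_len, world_size=1):
    """Tokens per optimizer step and gradient-accumulation steps.

    With total_batch fixed, growing device_batch shrinks grad_accum but
    leaves tokens per optimizer step unchanged.
    """
    grad_accum = total_batch // (device_batch * world_size)
    return grad_accum * device_batch * world_size * seq_len, grad_accum

tokens_64, accum_64 = tokens_per_step(total_batch=512, device_batch=64, seq_len=2048)
tokens_128, accum_128 = tokens_per_step(total_batch=512, device_batch=128, seq_len=2048)
assert tokens_64 == tokens_128      # same tokens per step: no extra throughput
assert accum_128 == accum_64 // 2   # just fewer accumulation steps, at 2x VRAM
```

So the 64→128 change bought nothing except memory pressure; only raising `TOTAL_BATCH_SIZE` (or `seq_len`) puts more tokens through each optimizer step.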
Larger Spot instances can be cheaper. g6e.8xlarge ($0.93/hr) was cheaper than g6e.2xlarge ($1.82/hr) because of lower demand. Check the price history for all sizes.
Cheap GPU experiments transfer to expensive GPUs. Research confirms that architecture/optimizer rankings found on L40S ($0.04/experiment) transfer to H100 for production training. Absolute LR values need re-tuning, but "A beats B" conclusions are portable.
## The Vibe Coding Angle
The entire project was built through conversational AI coding (Claude Code) in a single ~13-hour session. I documented the full journey as an 8-chapter vibe coding tutorial — from initial idea through infrastructure debugging to autonomous evolution results. Every chapter includes the actual prompts used, the failures encountered, and the cost at each step.
## Try It
```bash
git clone https://github.com/roboco-io/serverless-autoresearch
cd serverless-autoresearch
cp config.yaml.example config.yaml
# Edit config.yaml with your AWS credentials
make setup    # IAM role
make prepare  # Data → S3
make dry-run  # Verify (free)
make run      # 10 gen × 4 pop = 40 experiments (~$0.70)
```
## Links
- GitHub: https://github.com/roboco-io/serverless-autoresearch
- Tutorial: 8-chapter vibe coding tutorial
- Comparison Report: Original vs Serverless
- Spot Capacity Guide: How to find available Spot GPUs
- Key Insights: 12 battle-tested lessons
What's your cheapest setup for running ML experiments? Anyone tried autoresearch on other cloud providers?
Update: I wrote a full step-by-step tutorial documenting how this was built.
If you want to learn by doing (not just read the code), I turned the entire
build process into an 8-chapter hands-on tutorial:
| Ch | What You'll Learn |
|----|------------------|
| 1 | How a single prompt + deep interview became the architecture |
| 2 | 23 files generated in one session with parallel AI agents |
| 3 | The region saga — Spot scores, quota wars, 3 region migrations |
| 4 | First experiment: FA3 CUDA crash → SDPA fallback → $0.02 success |
| 5 | The Batch Size Trap — why doubling BS made results WORSE |
| 6 | 5 generations of autonomous evolution (what worked vs what failed) |
| 7 | Turning lessons into a reusable Claude Code skill |
| 8 | Final scorecard: 18x cheaper, 2.3x faster |
Every chapter includes the actual prompt I used, what went wrong,
and exact commands to reproduce it. Total cost to follow along: ~$0.70.
The most educational part is probably Chapter 5 (The Batch Size Trap) —
I learned that DEVICE_BATCH_SIZE ≠ throughput the hard way ($0.07 lesson).
Start here: Chapter 1: The Idea