1 min read · from Machine Learning
[R] GPT-5.4-mini regressed 22pp on vanilla prompting vs GPT-5-mini. Nobody noticed because benchmarks don't test this. Recursive Language Models solved it.
GPT-5.4-mini produces shorter, terser outputs by default. Vanilla-prompting accuracy dropped from 69.5% to 47.2% across 12 tasks (1,800 evals), and the official RLM implementation dropped too, from 69.7% to 50.2%. Our implementation, where the model writes Python to query the data instead of attending to all of it and pattern-matching the task, fell only from 72.7% to 69.5%. The architecture absorbed what the model couldn't.
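To make the "writes Python to query data" idea concrete, here is a minimal sketch, not the official RLM code: the class name `ReplEnv`, the `data` variable, and the example snippets are all hypothetical. The long context lives in a Python variable inside a REPL, and the model emits short code snippets that probe it instead of attending to the whole thing.

```python
import io
import contextlib

class ReplEnv:
    """Hypothetical sketch: holds the long context as `data` and
    executes model-written snippets against it."""
    def __init__(self, data):
        self.scope = {"data": data}

    def run(self, code):
        # Capture whatever the snippet prints and return it as the
        # observation the model sees next. A real system would sandbox
        # this exec call rather than run it directly.
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, self.scope)
        return buf.getvalue().strip()

# A model would generate snippets like these instead of reading `data` whole:
env = ReplEnv(data=["error: disk full", "ok", "error: timeout", "ok"])
print(env.run("print(len(data))"))                              # size probe
print(env.run("print(sum('error' in line for line in data))"))  # targeted count
```

The point of the design is that only the snippet outputs enter the model's context, so accuracy no longer depends on the model attending over the full input.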
Also: on AIME 2025, this setup scores 80% vs 0% for vanilla prompting, the same pattern as GPT-5.2. Prompted vanilla, the model outputs a bare guess with no reasoning; the REPL forces it to compute the answer via code, reducing latency while increasing accuracy.
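The "forces it to compute via code" point can be illustrated with a toy counting question (not an actual AIME item): a terse model might emit a bare numeric guess, while one line of REPL code answers it exactly.

```python
# Toy illustration: "How many integers from 1 to 1000 are divisible
# by 3 or by 7?" A bare guess is error-prone; the REPL computes it.
count = sum(1 for n in range(1, 1001) if n % 3 == 0 or n % 7 == 0)
print(count)  # 333 + 142 - 47 by inclusion-exclusion = 428
```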
Our implementation also uses 5.1x fewer tokens than the official RLM and is 3.2x cheaper, and it works with every model.
Tagged with
#GPT-5.4-mini
#vanilla prompting
#accuracy
#recursive language models