Going from 3B/7B dense to Nemotron 3 Nano (hybrid Mamba-MoE) for multi-task reasoning — what changes in the fine-tuning playbook? [D]
Following up on something I posted a few days back about fine-tuning for multi-task reasoning. I've read a lot since then and moved past the dense 3B vs 7B question — landing on Nemotron 3 Nano (the 30B-A3B hybrid Mamba-Attention-MoE NVIDIA released recently) instead. The architecture seems to map onto the multi-task structure I'm trying to train better than a dense base would. Problem is I've only ever read about dense transformer fine-tuning, so I don't know what the hybrid Mamba+MoE arch actually breaks in the standard LoRA recipe.
Still self-taught, no formal ML background, been working with LLMs via API for about a year. First time actually fine-tuning anything end-to-end.
Why Nemotron 3 Nano specifically (in case the choice itself is the mistake):
- 23 Mamba-2 + 23 sparse MoE + 6 GQA attention layers, 128 experts per MoE layer with top-6 routing
- 30B total / ~3.6B active — capacity without per-token compute blowup
- Mamba-2 layers seemed like the right structural fit for state-aware reasoning across longer context
- Open weights under NVIDIA Open Model License, clean for what I want to do
What I'm trying to fine-tune for (LoRA, distilling reasoning traces from a stronger teacher):
- Reading what's structurally happening in a situation vs. what's being stated on the surface
- Holding multiple legitimate perspectives without collapsing to one too early
- Surfacing the load-bearing thread when input has multiple tangled problems
- Conditioning output on a small set of numeric input features describing context state
40-80k examples planned, generated by Sonnet 4.6, with Opus 4.7 reserved for the hardest 20%. Orca-style explanation tuning, not just I/O pairs (rough record shape sketched below).
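For concreteness, roughly the record shape I have in mind — field names are placeholders of mine, not from any spec:

```python
# One hypothetical training record: numeric context features serialized into
# the prompt, with the teacher's explanation trace as the target rather than
# a bare answer (the Orca-style part).
example = {
    "input_text": "Raw description of the situation to analyze ...",
    "context_features": {        # small numeric state vector, rendered into the prompt
        "urgency": 0.7,
        "stakeholder_count": 3,
        "prior_attempts": 1,
    },
    "teacher_trace": "The stated problem is X, but structurally the load-bearing issue is ...",
    "final_answer": "...",
}
```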
Hardware: dropping the M4 Mac plan from my last post — Nemotron 3 Nano needs more memory than 24 GB unified can hold even just for the weights (30B params at bf16 is ~60 GB before optimizer state touches anything). Renting an H100 80GB on RunPod for training. ~$120 budget across 5-6 iterations.
What I'm specifically worried about (because the hybrid arch isn't covered in any standard fine-tuning tutorial I've found):
- Router under LoRA. Can you LoRA the MoE router weights safely, or do you freeze the router and only LoRA the expert FFNs + attention? If you freeze, does multi-task specialization still emerge, or does everything pile into the same experts? (I sketch the freeze-the-router config after this list.)
- Mamba-2 layers under low-rank adaptation. Standard LoRA tutorials assume pure attention. Mamba-2 has selective SSM state and different projection structure — does standard LoRA on the input/output projections work cleanly, or are there gotchas (state init, recurrence stability under low-rank perturbation) that vanilla guides don't cover?
- Load-balancing loss + multi-task imbalance. If my 4 capabilities have different example counts, does the auxiliary load-balancing loss fight task-specific gradients? Known failure modes here? (The standard aux-loss formulation, as I understand it, is sketched after this list.)
- Catastrophic forgetting on a 30B sparse base. With LoRA adapters on the experts, does base reasoning degrade the way it does for dense fine-tunes, or does sparse routing structurally protect more of it?
- Eval granularity under expert specialization. A single capability could quietly degrade while aggregate metrics look fine if different experts handle different tasks. What's the right held-out eval design for sparse MoE under multi-task?
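To make the router question concrete, here's the freeze-the-router variant I'd try first. A minimal sketch with PEFT; the module names are guesses based on typical HF Mamba/MoE implementations, so you'd `print(model)` and substitute whatever Nemotron 3 Nano actually calls them:

```python
from peft import LoraConfig

# Sketch of the "freeze the router, adapt everything else" variant.
# Module names below are guesses from common HF implementations, not
# confirmed against Nemotron 3 Nano -- check print(model) for the real ones.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=[
        # GQA attention projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        # expert FFN projections inside the MoE layers
        "up_proj", "down_proj", "gate_proj",
        # Mamba-2 input/output projections (NOT the SSM state parameters)
        "in_proj", "out_proj",
    ],
    # Deliberately absent: the router/gating linear (often named "gate" or
    # "router"), so routing stays frozen and task gradients can't destabilize it.
)
```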
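And to pin down the load-balancing worry: this is my reading of the standard Switch-Transformer-style auxiliary loss, generalized to top-k routing — not Nemotron's actual training code, which I haven't seen. If one capability dominates a batch, f and P skew toward its preferred experts and this term pushes back on exactly the specialization I'm after; stratified per-capability batching is the first mitigation I'd test.

```python
import torch

def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 6, alpha: float = 0.01):
    """Switch-Transformer-style aux loss, generalized to top-k routing.

    loss = alpha * N * sum_i(f_i * P_i), where f_i is the fraction of routing
    slots dispatched to expert i and P_i is the mean router probability mass
    on expert i. Minimized when both are uniform across experts.
    """
    # router_logits: [num_tokens, num_experts]
    num_experts = router_logits.shape[-1]
    probs = torch.softmax(router_logits, dim=-1)
    top_idx = probs.topk(top_k, dim=-1).indices
    dispatch = torch.zeros_like(probs).scatter(-1, top_idx, 1.0)
    f = dispatch.mean(dim=0) / top_k   # fraction of slots per expert, sums to 1
    P = probs.mean(dim=0)              # mean prob mass per expert, sums to 1
    return alpha * num_experts * (f * P).sum()
```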
Stack: planning to use Unsloth (their Nemotron 3 Nano support shipped recently), per-capability held-out eval sets built and frozen before Batch 1, batch API + prompt caching on the teacher side to keep dataset cost in check.
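For the eval-granularity point above, the bookkeeping I'm planning looks roughly like this (names are mine; `score_fn` is a hypothetical task-appropriate scorer, e.g. a teacher-graded rubric): per-capability means with a regression gate on each capability, never just the aggregate.

```python
# Sketch of per-capability eval bookkeeping. All identifiers hypothetical.
CAPABILITIES = [
    "structural_reading",     # stated vs. structural
    "perspective_holding",    # multiple legitimate perspectives
    "thread_surfacing",       # load-bearing thread in tangled input
    "feature_conditioning",   # conditioning on numeric context features
]

def eval_all(model, eval_sets, score_fn):
    """eval_sets: capability -> frozen held-out examples. Returns per-capability means."""
    return {
        cap: sum(score_fn(model, ex) for ex in eval_sets[cap]) / len(eval_sets[cap])
        for cap in CAPABILITIES
    }

def regressions(before, after, tol=0.02):
    """Flag any capability that dropped more than tol, even if the aggregate improved."""
    return [cap for cap in CAPABILITIES if after[cap] < before[cap] - tol]
```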
Not looking for:
- "just try it and see" — first run is already going to be wrong, want to know which dimensions are most likely to surprise me
- "use a smaller dense model first" — already weighed; the hybrid arch is specifically why I want this one
- Generic LoRA tutorials — comfortable with the dense-transformer LoRA literature, the gap is Mamba+MoE specifics
Looking for:
- War stories from anyone who's actually fine-tuned Mamba+MoE hybrids (Nemotron, Jamba, or Mixtral for the MoE half if relevant) and can tell me where it went sideways
- Papers I might be missing on multi-task LoRA on sparse MoE specifically — most of the multi-task literature I've found assumes dense
- Pitfalls around router gradients under low-rank adaptation
- Whether the standard LoRA rank sweet spots (8-32) still hold, or if MoE+Mamba shifts what works
Happy to write up what I find — first-time projects produce useful negative results even when they fail, and there's basically no public writeup yet on solo-developer-scale Nemotron 3 fine-tuning.