Review of gemma-3-270M

#31
by Clemylia-LLMs - opened

Hello! This is a very good model for its size, and its benchmark performance is impressive.

However, I am used to training my own models of about the same size (270M) from scratch, without any problems.
When I try to fine-tune your Gemma, it crashes my session as if the model were much bigger.

So, for me, it is unusable for fine-tuning in an open-source workflow, sorry.
For original creations, it's better to start from scratch.

Hi @Clemylia-LLMs
Thanks for the feedback and for trying Gemma.
Just to clarify: from your message, it sounds like you usually train models from scratch at a similar size (~270M parameters), but when you attempt to fine-tune Gemma, your session crashes due to memory usage.
Even with similar parameter counts, different architectures can have very different training-time memory footprints. Factors such as the attention implementation, optimizer states, sequence length, and framework defaults can significantly affect memory usage.
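To give a rough sense of why full fine-tuning costs much more than the weights alone, here is a minimal back-of-the-envelope sketch (an illustration, not a measurement of Gemma specifically): with fp32 weights and Adam, each parameter carries a gradient plus two optimizer moments, and activations come on top of that.

```python
# Rough training-memory estimate for full fine-tuning with Adam.
# Ignores activations, which grow with batch size * sequence length
# and are often the part that actually triggers the OOM.
def full_finetune_state_bytes(n_params: int, dtype_bytes: int = 4) -> int:
    weights = n_params * dtype_bytes          # model weights (fp32)
    grads = n_params * dtype_bytes            # one gradient per weight
    adam_states = n_params * dtype_bytes * 2  # Adam keeps two moments per weight
    return weights + grads + adam_states

n = 270_000_000  # ~270M parameters
gib = full_finetune_state_bytes(n) / 2**30
print(f"~{gib:.1f} GiB before activations")  # ~4.0 GiB
```

So even a 270M model needs several GiB just for training state before a single activation is stored, which is why sequence length and batch size matter so much.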
To help understand what’s happening in your case, could you share a bit more detail about your setup?

  1. Which Gemma checkpoint are you using?
  2. What hardware / GPU and VRAM are you running on?
  3. Are you attempting full fine-tuning or PEFT?
  4. What sequence length and batch size are you using?
  5. Are you using the Transformers Trainer, Accelerate, or a custom script?
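If memory does turn out to be the limit, PEFT with gradient checkpointing is a common workaround. A minimal sketch, assuming the Hugging Face `transformers` and `peft` libraries; the `target_modules` names are a typical choice for Gemma-style attention layers and should be verified against the loaded checkpoint:

```python
# Sketch: LoRA fine-tuning setup to cut memory vs. full fine-tuning.
# Assumes `transformers` and `peft` are installed.
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")

lora = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the small LoRA adapters train

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch of 8
    gradient_checkpointing=True,     # trade compute for activation memory
    fp16=True,                       # T4 supports fp16, not bf16
)
```

With only the adapter weights trainable, the optimizer states shrink accordingly, and gradient checkpointing plus a small per-device batch keeps activation memory down.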

Thanks

Hello!
I used the Gemma shown here (gemma-3-270m).
I tried full fine-tuning on a T4 GPU.
I don't remember the other details.
I'm used to training much larger models from scratch, so I was surprised by the crash.
