Inference seems to be very slow on A100 even when flash_attn is enabled

#7
by boydcheung - opened

Could you help test the latency/inference speed of this 2B model?

Any suggestions on what might be causing the problem? I've used the same version of transformers as in the model card for inference.
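For reference, this is roughly how I'm measuring it; a minimal sketch, assuming a plain text generation call via transformers (the model ID and prompt are placeholders, and the actual model may need its own processor instead of a tokenizer):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-2b-model"  # placeholder: substitute the actual checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision, as recommended for A100
    device_map="cuda",
)

inputs = tokenizer("Describe the weather today.", return_tensors="pt").to(model.device)

# Warm-up run so kernel/graph setup is not counted in the measurement
model.generate(**inputs, max_new_tokens=16)

torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tok/s")
```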

There was an issue on GitHub saying it's related to torchvision. I downgraded torchvision and it's much better now, though still not very fast.

I enabled flash_attention_2 and it got noticeably faster; you could try that.
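In case it helps, this is roughly what enabling it looks like at load time (a sketch; the model ID is a placeholder, and the flash-attn package has to be installed separately):

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "your-org/your-2b-model"  # placeholder for the actual checkpoint

# FlashAttention-2 requires the flash-attn package and a half-precision dtype
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)
```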