Inference seems to be very slow on A100 even when flash_attn is enabled

#7
by boydcheung - opened

Could you help test the latency/inference speed of this 2B model?

Any suggestions on what might be causing the problem? I've used the same version of transformers as in the model card for inference.
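For reference, this is roughly how I'm measuring it; a minimal sketch, assuming a plain text generation call via transformers (the model ID and prompt are placeholders, and the actual model may need its own processor instead of a tokenizer):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-2b-model"  # placeholder: substitute the actual checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision, as recommended for A100
    device_map="cuda",
)

inputs = tokenizer("Describe the weather today.", return_tensors="pt").to(model.device)

# Warm-up run so kernel/graph setup is not counted in the measurement
model.generate(**inputs, max_new_tokens=16)

torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tok/s")
```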

There was an issue on GitHub saying it's related to torchvision. I downgraded torchvision and it's much better now, though still not very fast.

I enabled flash_attention_2 and it got noticeably faster; you could try that.
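In case it helps, this is roughly what enabling it looks like at load time (a sketch; the model ID is a placeholder, and the flash-attn package has to be installed separately):

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "your-org/your-2b-model"  # placeholder for the actual checkpoint

# FlashAttention-2 requires the flash-attn package and a half-precision dtype
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)
```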