Upload README.md with huggingface_hub
README.md
CHANGED
@@ -46,7 +46,7 @@ DPLM2 infers `type_ids` automatically from `input_ids` and `attention_mask` when
 | Backend | Key | Notes |
 | :--- | :--- | :--- |
 | PyTorch SDPA | `"sdpa"` | Default. Exact numerics, stable on all hardware. |
-| Flash Attention | `"kernels_flash"` | Fastest on Ampere/Hopper GPUs. Requires `pip install
+| Flash Attention | `"kernels_flash"` | Fastest on Ampere/Hopper GPUs. Requires `pip install kernels` (pre-built — no hours-long compilation). Outputs are not bitwise identical to SDPA due to online softmax reordering; differences are often small but not guaranteed to be inconsequential — use `"sdpa"` if exact numerics matter. |
 | Flex Attention | `"flex"` | Skips padding tokens via block mask — faster on variable-length batches. Near-exact numerics. First use compiles a Triton kernel (30–120 s). Best combined with `torch.compile`. |
 | Auto | `"auto"` | Picks the best available: `kernels_flash` → `flex` → `sdpa`. |
 
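The new Flash Attention row attributes the non-bitwise-identical outputs to online softmax reordering. The following is a minimal pure-Python sketch of that effect, not the kernel's actual implementation: the function names `softmax_two_pass` and `softmax_online` are illustrative and do not come from the README or any library. Both functions compute the same softmax mathematically, but the online variant rescales a running sum whenever a new maximum appears, so the floating-point operations happen in a different order and the last bits of the result may differ.

```python
import math


def softmax_two_pass(xs):
    """Reference softmax: find the global max first, then normalize."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_online(xs):
    """Streaming softmax: track a running max and rescale the running sum."""
    m = float("-inf")
    total = 0.0
    for x in xs:
        new_m = max(m, x)
        # Rescale the sum accumulated under the old max, then add the new term.
        total = total * math.exp(m - new_m) + math.exp(x - new_m)
        m = new_m
    return [math.exp(x - m) / total for x in xs]


if __name__ == "__main__":
    scores = [0.5, -1.25, 3.0, 2.75, -0.1]
    ref = softmax_two_pass(scores)
    onl = softmax_online(scores)
    # The two results agree to within rounding error, but equality is not guaranteed.
    print("max abs difference:", max(abs(a - b) for a, b in zip(ref, onl)))
```

For a single softmax the discrepancy is at the level of rounding error. Whether it stays negligible after many attention layers depends on the model and dtype, which is why the table recommends `"sdpa"` when exact numerics matter.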