Sharing: LTX2.3 Subtitles and Removal
Thanks for releasing LTX-2-3, but the generated videos often include subtitles, some of which have spelling errors or garbled Chinese characters.
Is there an option in LTX-2-3 to disable subtitle output entirely?
I'm hoping for a way to completely prevent subtitle generation.
Thanks!
Never had any subtitles myself, but I've seen a few others mention the same. I guess it's something rare?
But try a negative prompt and add "subtitles" there.
(If you use the low-step distilled model, you might also want to add the NAG LTX node from KJNodes so that the negative prompt has an effect. But you would need to update KJNodes, since that was added yesterday, I think.)
@RuneXX Based on my testing, the NAG LTX node does not work on LTX2.3 and encounters an error when saving the video.
I haven't tested yet, but I saw Kijai made some updates to the NAG node a few hours ago.
So do make sure you have the very latest KJNodes at least ;-) (do a git pull in the KJNodes folder)
But I'm curious why some people get subtitles. I never have. I wonder if it might also help to prompt differently. But that's all just speculation ;-)
At least make sure you put the text to be spoken in "...", and perhaps explicitly write: and then the person says "....".
At least that's how I prompt, and I've never had any subtitles. But I'm just thinking out loud; I really have no idea why some people get them ;-)
Had some issues with NAG that should be fixed now.
For this particular case it can work:
With the NAG prompt "cartoon, still image, bad quality, subtitles, text, watermark, overlay effects":
Spaghetti for the win ;-)
Yeah that seemed to remove the subtitles, nice ;-)
Thanks KJ, RuneXX, this worked for me! Nice ;-)
Thank you for NAG node. Subtitles are gone...... awesome!
Thanks @Kijai for the NAG implementation. Great fix. Wanted to confirm that NAG works outside ComfyUI too, for anyone running the official LTX-2 Python pipeline directly.
I'm using the ltx2.3-fp8 model and monkey-patched attn2.forward (cross-attention) on all 48 transformer blocks in Stage 1. Stage 2 (distilled refinement) doesn't need it.
In case it's useful to others, here's the algorithm per cross-attention call:
- Compute attention with positive context → x_pos
- Compute attention with NAG negative context → x_neg
- Extrapolate: guidance = x_pos * scale - x_neg * (scale - 1)
- L1-normalize (clamp by tau)
- Alpha-blend: result = guidance * alpha + x_pos * (1 - alpha)
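A minimal tensor-level sketch of the steps above (PyTorch). The exact L1-normalization details here are my reading of "clamp by tau" — capping the guidance's L1 norm at tau times that of x_pos — not the official implementation:

```python
import torch

def nag_combine(x_pos: torch.Tensor, x_neg: torch.Tensor,
                scale: float = 11.0, tau: float = 2.5,
                alpha: float = 0.25) -> torch.Tensor:
    """Combine positive/negative cross-attention outputs with NAG.

    x_pos: attention output computed with the positive prompt context
    x_neg: attention output computed with the NAG negative context
    """
    # Extrapolate away from the negative branch.
    guidance = x_pos * scale - x_neg * (scale - 1)
    # Clamp the L1 norm of the guidance to at most tau times that of x_pos.
    ratio = guidance.abs().sum(dim=-1, keepdim=True) / \
            (x_pos.abs().sum(dim=-1, keepdim=True) + 1e-8)
    guidance = guidance * torch.where(ratio > tau, tau / ratio,
                                      torch.ones_like(ratio))
    # Blend back toward the positive branch.
    return guidance * alpha + x_pos * (1 - alpha)
```

A handy sanity check: with scale=1.0 the extrapolation step is a no-op, so the function should return x_pos unchanged.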
Using your defaults: nag_scale=11.0, nag_alpha=0.25, nag_tau=2.5, prompt: "cartoon, still image, bad quality, subtitles, text, watermark, overlay effects" (although still getting some cartoons)
Important: NAG alone isn't enough. With cfg_scale=1.0 (NAG only, no CFG), heavy dialogue prompts still leaked text. NAG + CFG together (cfg_scale=3.0) produced no text overlays across diverse prompts, including ones that previously always produced garbled overlay text.
Performance: Zero measurable overhead. H100 SXM 80GB, FP8, 30 steps, 1088×1920: 120s avg. If using torch.compile, apply NAG patches before compilation.
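Since the monkey-patch approach came up: here's a rough sketch of what wrapping attn2.forward per block can look like. The block layout (block.attn2 with an encoder_hidden_states keyword) is an assumption about the pipeline, not confirmed API, and the combine step is my reading of the extrapolate/clamp/blend algorithm above. As noted, patch before any torch.compile call so the wrapper gets compiled too.

```python
import functools
import torch

def nag_combine(x_pos, x_neg, scale=11.0, tau=2.5, alpha=0.25):
    # Extrapolate / L1-clamp / alpha-blend, as in the algorithm above.
    g = x_pos * scale - x_neg * (scale - 1)
    ratio = g.abs().sum(-1, keepdim=True) / (x_pos.abs().sum(-1, keepdim=True) + 1e-8)
    g = g * torch.where(ratio > tau, tau / ratio, torch.ones_like(ratio))
    return g * alpha + x_pos * (1 - alpha)

def patch_attn2_with_nag(block, neg_context, **nag_kwargs):
    """Wrap one block's cross-attention so every call also runs the negative context."""
    original = block.attn2.forward

    @functools.wraps(original)
    def nag_forward(hidden_states, encoder_hidden_states=None, **kwargs):
        x_pos = original(hidden_states, encoder_hidden_states=encoder_hidden_states, **kwargs)
        x_neg = original(hidden_states, encoder_hidden_states=neg_context, **kwargs)
        return nag_combine(x_pos, x_neg, **nag_kwargs)

    block.attn2.forward = nag_forward

# Hypothetical usage (attribute names assumed): patch all Stage 1 blocks, then compile.
# for block in transformer.blocks:
#     patch_attn2_with_nag(block, neg_context)
# transformer = torch.compile(transformer)
```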