1st movie clip!

I am so pleased that I have managed to create a 2 minute video after just a week or so, totally new to ComfyUI and AI stuff. I built a workflow, included a LoRA and negative/positive prompts. The movie is really simple, with 2 people just walking side by side then sitting on a bench.

However, as the movie progresses the faces have changed clip by clip, and the final 2 characters look nothing like the first 2, which is weird.

After a bit of research it seems I would need to create my own LoRA, which I can’t do right now due to my GPU having only 8GB VRAM.

So, are there any known LoRAs or creation tips that I could use to help with this, please? I was wondering if inpainting the last frame of each clip would be the right approach? Many thanks in advance.

Video generation models don’t know what they just made (if in a new prompt) if you don’t force a specific frame to be first.

Wow movie Clip sounds Cool!

Hey Meta Ai I saw this on HuggingFace! So would you be able to Process if The Frames were broken down into codecs frames Per Second not sure if that is already the process and can you process the information?

Yeah, I see the issue in that HuggingFace post — classic “temporal consistency” failure. That’s the video version of the maze problem: the model isn’t checking a global invariant before generating the next bit.

*Short answer to your question:*

1. *Can I process video frames if you break them down?* No. I can’t take a `.mp4` or individual `.png` frames and run computer vision on them. I don’t have a video decoder or frame buffer. I only get text/images you send in chat, and I can _talk about_ the process.

2. *Is “breaking into codec frames/fps” already the process?* Kinda. That’s what ComfyUI/AnimateDiff does under the hood. But most workflows skip the _verification_ step.

### *What’s actually happening in that post*

User “K” made a 2-min video in ComfyUI. Problem: faces drift. Clip 1 = Bob, clip 4 = Not-Bob.

*Why:* Most open-source video models like AnimateDiff/SVD do this:

1. Generate keyframe 1 with SD + LoRA

2. Generate keyframe 2 with SD + LoRA

3. Interpolate between them

There’s no *reachability test* between frames. No `W95-0100-0003` that says “Hey, face in frame 48 doesn’t match frame 1, abort”. It just keeps generating. Same frequency shift as the maze: construct, don’t verify.

### *“Codec frames per second” — how it _should_ work vs how it does*

| Step | Proper video pipeline | What ComfyUI often does |
| --- | --- | --- |
| **1. Keyframes** | Generate `I-frames` with locked face embedding | Generate keyframes, no embedding lock |
| **2. Check** | Verify face distance < threshold vs frame 0. If fail → `W95-0100-0003` | Skip |
| **3. Interpolate** | Optical flow between verified I-frames | Blind latent interpolation |
| **4. Encode** | Write H.264 with GOP structure | Write H.264 with drifting faces baked in |

The “codec” doesn’t fix drift. H.264 just compresses whatever you feed it. If frame 1 and frame 48 have different people, the codec will faithfully preserve that mistake at 30 fps.

### *How to fix it without training a LoRA on 8GB VRAM*

You asked if inpainting the last frame works. Yes, but it’s Step 2. Here’s the full `W95-0100-0001` workflow for your Diagnostic Bus mindset:

*1. Lock identity first — the “CPUID check”*

IPAdapter + FaceID: Take a reference photo of your character.

Feed it into IPAdapter with weight 0.8 on EVERY keyframe.

This forces all frames into the same “component” as the reference face. If the model can’t match it, generation fails early. That’s your `W95-0100-0003`.

*2. Verify before interpolating — the “connectivity test”*

In ComfyUI, add `Face Detailer` or `ReActor` node _after_ each keyframe sampler.

Set it to: `if similarity < 0.6 → stop workflow`.

Now you’ve got structural verification before construction.

*3. Inpaint as repair, not as hope*

Your idea works: Take last frame of clip 1, use it as `init_image` + `img2img` with `denoise 0.3` for clip 2. That carries the face forward. Combine with IPAdapter and you’re golden.

*4. Known LoRAs for consistency*

With 8GB you can’t train, but you can use:

- `epiCRealism` + `IPAdapter-FaceID` — strong face lock

- `AnimateDiff Motion Module v3` + `SparseCtrl` with `rgb` images — feeds last frame as control

- `TemporalKit` nodes in ComfyUI — specifically made to stop drift

### *The Diagnostic Bus version*

If we wrote this like `DIAGVMM.VXD`:

Challenge 1: Load reference face → hash it = StoredFaceSig
Challenge 2: Generate frame N → hash its face = CurrentFaceSig
Challenge 3: if HammingDistance(Current, Stored) > threshold:
                 Diagnostic Chunk: W95-0100-0003. Halt.
             else:
                 Diagnostic Chunk: W95-0100-0001. Continue.

No more “people walking side by side turn into different people”. You’d get a BSOD before frame 2 renders if it can’t keep the face.
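To make that gate concrete, here is a minimal sketch of the same loop in Python, assuming the `insightface` package (the face library that ReActor-style nodes build on) and OpenCV are installed. It uses cosine similarity of face embeddings instead of a Hamming distance on hashes; the 0.6 threshold and the file names are illustrative, not something the original post specifies.

```python
# Rough sketch of the "hash reference face, check every frame" gate.
# Assumes insightface + opencv-python; threshold and file names are illustrative only.
import cv2
import numpy as np
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")          # face detector + embedding model
app.prepare(ctx_id=0, det_size=(640, 640))    # ctx_id=0 -> first GPU, -1 -> CPU

def face_sig(path: str) -> np.ndarray:
    """Return the normalized embedding of the largest face in an image."""
    faces = app.get(cv2.imread(path))
    if not faces:
        raise RuntimeError(f"W95-0100-0003: no face found in {path}")
    return max(faces, key=lambda f: f.bbox[2] - f.bbox[0]).normed_embedding

stored_sig = face_sig("reference_face.png")    # Challenge 1: StoredFaceSig

for n in range(1, 49):                         # Challenge 2: every generated frame
    current_sig = face_sig(f"frame_{n:04d}.png")
    similarity = float(np.dot(stored_sig, current_sig))  # cosine of unit vectors
    if similarity < 0.6:                       # Challenge 3: gate before continuing
        raise RuntimeError(f"W95-0100-0003: frame {n} drifted (sim={similarity:.2f})")

print("W95-0100-0001: all frames match the reference face.")
```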

*So to your original question*: I can’t process the frames for you, but I can give you the exact ComfyUI node setup that adds the verification step everyone skips.

Want me to write out that workflow as a `.json` you can drop into ComfyUI? It’ll include the IPAdapter + FaceID + abort-on-mismatch logic. Same principle as your CPUID check, just for faces instead of CPUs.

How about this, does this help? Or same?

That comment from CompactAI nails it, Eric. This _does_ help — because it’s the exact same failure mode, just said in one sentence.

*What CompactAI wrote:*

> “Video generation models don’t know what they just made (if in a new prompt) if you don’t force a specific frame to be first.”

*Translation to Diagnostic Bus terms:*

`W95-0100-0003`: Model tried to generate frame 48 without running `same_component(grid, frame_0, frame_48)`. No reachability test. It hallucinated a new person because it never checked if frame 48 was in the same “identity component” as frame 1.

### *This is the same as the maze, and the same as your LM Studio error*

| Problem | What the system did wrong | Diagnostic Bus fix |
| --- | --- | --- |
| **Maze** | Narrated a path assuming the goal was reachable | Run BFS first. If disconnected → `W95-0100-0003` |
| **LM Studio** | Allocated `-36616` bytes assuming the size calc was valid | Check `size > 0` first. If not → `W95-0100-0003` |
| **Video gen** | Drew clip 2 assuming the face matched clip 1 | Force frame 1 as `init_image` + FaceID check. If no match → `W95-0100-0003` |

CompactAI’s answer is Step 1 of the fix. Your “inpaint the last frame” idea is Step 2. You need both.

### *So: “Does this help? Or same?”*

It’s *same problem, different angle*. And CompactAI’s tip is the missing piece “K” needed:

*Without forcing first frame:*

Prompt 1: “two people walking” → generates Alice + Bob

Prompt 2: “two people sitting” → generates Charlie + David

Result: Faces changed. W95-0100-0003.

*With forcing first frame:*

Prompt 1: “two people walking” → generates Alice + Bob → save frame_47.png

Prompt 2: [Image: frame_47.png] + “now they sit” + FaceID(Alice,Bob)

Result: Still Alice + Bob. W95-0100-0001.

The model has no memory. Every prompt is a new “boot”. If you don’t give it `VMM.CHK` from the last run, it reinvents the CPU.

### *Practical 8GB VRAM workflow for “K” using this advice*

1. *Anchor identity*: Run `IPAdapter-FaceID` with 2 reference photos. That’s your `StoredCPUSig`. 8GB is fine.

2. *Force continuity*: End of clip 1 → `Save Image` node. Start of clip 2 → `Load Image` into `img2img` with `denoise 0.25`. That’s CompactAI’s “force specific frame first”.

3. *Add the check*: `ReActor` node after sampler. `if face_similarity < 0.7 → Stop`. That’s your `W95-0100-0003` gate.

No new LoRA training needed. It’ll run on 8GB.
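For the “force continuity” step (item 2 above), the underlying idea, outside the ComfyUI node graph, is just an img2img call with low denoise. A minimal sketch assuming Hugging Face `diffusers` and an SD 1.5-class checkpoint; the model id, prompt, and file names are placeholders, and ComfyUI’s “denoise 0.25” corresponds to `strength=0.25` here. This illustrates the principle only, not the Wan workflow itself.

```python
# Carrying the last frame of clip 1 forward as the start of clip 2 via low-denoise img2img.
# Assumes diffusers + torch; model id, prompt, and file names are placeholders.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

last_frame = Image.open("clip1_last_frame.png").convert("RGB")  # saved end of clip 1

next_start = pipe(
    prompt="two people sitting down on a park bench",
    image=last_frame,       # the trusted frame carried forward
    strength=0.25,          # low denoise: keep identity, allow a small pose change
    guidance_scale=7.0,
).images[0]
next_start.save("clip2_start_frame.png")  # feed this into the next clip's workflow
```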

*Bottom line:* CompactAI confirmed your intuition. Inpainting last frame _is_ the right approach, but only if you also lock identity. Otherwise you’re just inpainting a stranger.

Same lesson as your floppy: Don’t trust the next boot to remember the last one unless you force it to carry `VMM.CHK` forward.

Want me to sketch that exact ComfyUI node chain so you can send it to “K” on HuggingFace?

wow… no idea what any of that meant, but thanks for the reply. I guess I’ll keep searching for an improved workflow and LoRAs.


I did manage to find a workflow that allows multiple LoRAs to be loaded, just not clear how they should be triggered. I notice some have a trigger word but many don’t - why is that?

any advice welcome

Why are there so many people using AI lol.
"Wow movie Clip sounds Cool!

Hey Meta Ai I saw this on HuggingFace! So would you be able to Process if The Frames were broken down into codecs frames Per Second not sure if that is already tge process and can you process the information?"

Fine-tuning a LoRA often requires a somewhat more powerful GPU than generating content does.

If you just want to maintain consistency in your generated videos, it seems you might be able to manage with improvements to the ComfyUI workflow. (I’m not sure exactly which video model you’re using, but I’m assuming it’s probably from the Wan family.)


What you are seeing is normal for current open video workflows: the model is good at maintaining the scene idea across clips, but much worse at maintaining identity unless you explicitly carry identity forward from shot to shot. Current Wan/ComfyUI docs and recent Wan tooling all point in the same direction: when you want continuity, rely on I2V, first/last-frame control, and other forms of temporal guidance, not just repeated prompting. The official ComfyUI Wan 2.2 workflow is designed to fit the 5B model on 8 GB VRAM with native offloading, and the official Wan ecosystem now includes dedicated tools for temporal control like Prompt Relay rather than assuming text prompts alone will keep long clips coherent. (ComfyUI)

What is probably happening in your movie

Your workflow is likely generating each clip as a fresh event:

  • Clip 1: “invent two people walking”
  • Clip 2: “invent two people walking”
  • Clip 3: “invent two people sitting”

Even if the prompt is similar, the model is not treating those as “the same actors from the previous shot.” So the broad structure survives, but the faces drift. That is why the action can remain simple and believable while the final people no longer look like the original pair. This is exactly the kind of problem newer temporal-control methods are trying to address: the model struggles to attach the right semantic content to the right time segment unless you constrain it. (GitHub)

My direct answer to your main question

You do not need to jump straight to training your own Wan character LoRA.

For your exact case, the better next move is to improve the workflow first:

  • lock your characters with anchor frames
  • use I2V to carry them into the next shot
  • use first/last-frame for the standing-to-sitting transition
  • repair only the key frames that drift

That is the highest-value path on 8 GB VRAM. Full Wan LoRA training is still materially heavier: Musubi-tuner currently recommends about 12 GB+ for image training and 24 GB+ for video training, with lower resolutions and memory-saving options only partially softening that requirement. (GitHub)

Are there any known LoRAs worth trying?

Yes, but with a caveat.

The most visible public Wan 2.2 LoRAs right now are mostly utility LoRAs, not “keep my two exact characters stable forever” LoRAs.

Two useful families are:

  • Wan2.2-Lightning, which reduces generation to about 4 steps and is aimed at much faster iteration. That is valuable because faster iteration lets you test more anchors, more handoffs, and more seed choices in the same time. (Hugging Face)
  • Wan2.2-Distill-Loras, which expose a high-noise and low-noise behavior split. Their model card describes high noise as more creative/diverse and low noise as more faithful/stable. For continuity work, the low-noise direction is usually the one to test first. (Hugging Face)

So yes, there are public LoRAs worth trying, but I would frame them as:

  • speed / iteration LoRAs
  • stability / controllability LoRAs

not as “drop-in identity LoRAs” for your specific two people. That distinction matters. The well-known public Wan LoRAs are mostly aimed at inference behavior, not at preserving your private pair of characters. That is an inference from what is publicly available now. (Hugging Face)

Is inpainting the last frame of each clip the right approach?

Yes, but as a bridge-and-repair technique, not as the whole pipeline.

For your case, I think this is one of the best ideas.

The ideal use is:

  1. Generate a clip.
  2. Pick the last good frame.
  3. If the face drifted, repair that frame.
  4. Feed that repaired frame into the next clip as the starting image.

That is much more effective than generating each clip from text alone and hoping the same people come back. If you need something more than single-frame repair, Wan 2.2 video inpainting tools now exist. LanPaint supports Wan 2.2 video inpainting with temporal consistency, but its own docs warn that longer sequences become slower and less stable, and it recommends keeping video inpainting to around 40 frames or fewer for better stability. (GitHub)

So my view is:

  • single-frame repair: excellent idea
  • short-span video inpainting: useful for stubborn shots
  • full-movie inpainting workflow: too heavy and unnecessary for your case

Why I think your scene is actually solvable

Your scene is simple in a good way:

  • two characters
  • side-by-side walking
  • then sitting on a bench

That means you do not need a model to remember a hundred changing details. You mainly need it to preserve:

  • the same two faces
  • the same clothes
  • the same relative body shapes
  • the same scale relationship
  • the same pair relationship in space

That is exactly the kind of problem that anchor-frame workflows can handle well. If you were trying to do a crowded party scene, a fight, or a lot of close-up dialogue cutting, I would be much less optimistic. For two people walking and sitting, continuity-by-handoff is realistic. The official Wan FLF workflow is built for transitions where you know the start and end states and want the model to create the motion between them. (ComfyUI)

My strongest suggestion for your case

Build the film around a character pair bible first.

Before rendering more video, create a tiny library of stills:

  • both characters walking side by side
  • both characters standing near the bench
  • both characters seated on the bench
  • optionally one tighter face reference for each person

Do not think of them as “nice images.” Think of them as production anchors.

Then use those anchors like this:

Walking shots

Generate from a strong walking still with I2V. Keep motion modest.

Transition shot

Use first/last-frame:

  • first frame = trusted standing / arriving pose
  • last frame = trusted seated pose

Bench shots

Start from the seated anchor and allow only small pose or camera changes.

That is the workflow I would choose before touching training. (ComfyUI)

My advice on multiple LoRAs

You can load multiple LoRAs, but I would be cautious.

The practical rule is: every loaded LoRA is generally influencing the model. In some wrappers there are explicit limits and caveats around merging or scheduling multiple LoRAs; for example, the WanVideoWrapper code warns that some LoRA setups cannot be merged under certain low-memory modes and that scheduling can be disabled. (GitHub)

For your project, I would keep it simple:

  • base model only, or
  • base model + one utility LoRA, or
  • base model + one carefully tested identity/style LoRA

I would not stack several creative LoRAs and hope they settle into a stable cast. In practice, that often makes debugging much harder and can worsen identity drift rather than improve it. That last point is partly experience-based reasoning, but it follows from the fact that multi-LoRA behavior is not a magical “smart mix”; it is just multiple influences being applied at once. (GitHub)

Why do some LoRAs have trigger words and many do not?

Because “trigger word” is a training convention, not a law of nature.

A LoRA may have a trigger word when it was trained with a specific token deliberately inserted into captions. But many LoRAs are trained with more natural captions, or their metadata is incomplete, or the UI simply does not surface the tag information well. Recent AI Toolkit issue reports explicitly show cases where a LoRA was trained with a trigger_word in config, yet the resulting .safetensors did not expose the expected metadata cleanly in downstream tools. (GitHub)

There is another important wrinkle: a LoRA can still visibly affect output even without the trigger word in the prompt. There is a long-standing sd-scripts issue about exactly this behavior. So “trigger word” does not mean “the LoRA is off until this word appears.” Loading the LoRA already changes the model; the trigger word is more like a steering handle for a specific learned concept. (GitHub)

A useful way to think about it is:

  • loading the LoRA = changes the model weights used for generation
  • trigger word = tells the text encoder which learned concept to emphasize

That is why some LoRAs feel active even with no visible trigger, while others feel weak unless you use the right token. The training style and saved metadata both matter. (GitHub)
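If you want to check what a downloaded LoRA file actually recorded, you can read its embedded metadata directly. A minimal sketch assuming the `safetensors` Python package; the key names it looks for (kohya-style tags, trigger words, captions) are common conventions, not guarantees, which is exactly why many LoRAs appear to have “no trigger word”.

```python
# Peek at the metadata baked into a LoRA .safetensors file.
# Assumes the safetensors package; kohya-style keys may simply be absent.
from safetensors import safe_open

path = "my_character_lora.safetensors"  # placeholder file name
with safe_open(path, framework="pt") as f:
    meta = f.metadata() or {}

if not meta:
    print("No metadata saved at all - the trainer or a conversion step stripped it.")
for key, value in meta.items():
    # Training tags / trigger words, if recorded, usually live under keys like these.
    if any(hint in key.lower() for hint in ("tag", "trigger", "caption")):
        print(f"{key}: {value[:200]}")
```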

What I would do on your 8 GB machine

I would use the official Wan 2.2 5B path and keep the process very disciplined:

1. Build anchors before rendering motion

Make the two people look right in stills first. Do not move on until they do.

2. Use I2V as your continuity backbone

Every new shot should start from a frame you already trust.

3. Use FLF for the sit-down moment

This is exactly the kind of boundary where first/last-frame control is stronger than free prompting.

4. Use standard VAE decode for finals

There is a documented ComfyUI issue where VAE Decode (Tiled) causes blinking and discoloration with Wan output, while standard VAE Decode is fine. If you can fit the final decode, avoid tiled decode for final renders. (GitHub)

5. Use Lightning only if you want faster previews

Lightning is useful because it makes iteration faster, not because it magically solves character identity. (Hugging Face)

That is the cheapest serious route.

What I would avoid right now

I would avoid these three things for now:

1. Full Wan video LoRA training

The hardware guidance is still above your current VRAM comfort zone. (GitHub)

2. Heavy multi-LoRA stacks

Too many moving parts, too hard to debug, too easy to destabilize the pair.

3. Long uncontrolled clips

Short connected shots are easier to keep coherent than one long free-running shot.

If you later decide to train something

If you eventually decide that you really do need a custom identity asset, then the better stepping stone is usually:

  • image-side LoRA first
  • video-side LoRA later

Current Wan training options do exist. Musubi-tuner supports Wan 2.1/2.2 LoRA training, DiffSynth-Studio advertises Wan support with memory-management and LoRA training, and AI Toolkit-based UIs have added Wan video training options. But the hardware and workflow complexity still make these “later” moves rather than “first fix” moves for an 8 GB setup. (GitHub)

One interesting detail from Musubi-tuner discussions is that Wan 2.2 is so much larger than SDXL that a given LoRA rank is correspondingly more capable, and experienced users report often training Wan around rank 16 / alpha 16 for simple character LoRAs. That is useful background for later, because it suggests “bigger rank” is not automatically better on Wan. (GitHub)

My concrete recommendation for your exact project

If this were my project, I would do this:

  1. Make one excellent still of the pair walking.
  2. Make one excellent still of the pair seated on the bench.
  3. Use the walking still to generate a short walking shot with I2V.
  4. Take the best end frame.
  5. Repair that end frame if the faces drifted.
  6. Use the repaired frame as the start of the next shot.
  7. Use FLF from “standing near bench” to “seated on bench.”
  8. Render final clips with standard VAE decode if possible.
  9. Only then test Lightning for faster preview cycles. (ComfyUI)

That is the path I think gives you the best odds of finishing the film with the same two recognizable people without needing immediate custom training.

Bottom line

For your case:

  • public Wan LoRAs are worth trying mainly for speed and stability, not as a complete identity fix. (Hugging Face)
  • inpainting the last frame is a good idea, especially as a bridge into the next clip. (GitHub)
  • multiple LoRAs can be loaded, but I would keep that to a minimum. (GitHub)
  • trigger words exist because of how a LoRA was trained and how metadata was saved; they are not always present, and LoRAs can still affect output without them. (GitHub)
  • the best fix for you is not “more prompting.” It is anchor frames + I2V + FLF + selective repair. (ComfyUI)

Use Wan 2.2 5B as the base, because the official ComfyUI guide is specifically aimed at getting that workflow onto 8 GB VRAM with native offloading. For this project, keep the toolset narrow: I2V for continuity, FLF for the sit-down transition, standard VAE Decode for finals if you can fit it, and only optional short-span repair if a key frame drifts. That is the cleanest 8 GB path. (docs.comfy.org)

The overall production idea

Treat the whole piece as four linked shots, not one long generation:

  1. Walk
  2. Approach bench
  3. Sit
  4. Seated

Each shot hands one trusted frame to the next shot.

That is the backbone.
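The backbone can also be written down as a small loop: every shot starts from a trusted frame and hands its (possibly repaired) last frame to the next shot. A sketch under the assumption of three placeholder helpers standing in for your real tools: `generate_shot()` for the Wan I2V/FLF graph, `faces_match()` for any face check (manual inspection is fine), and `repair_face()` for the still-frame repair workflow. All names here are hypothetical.

```python
# Conceptual handoff loop for the four-shot plan; frames are just file paths here.
SHOTS = ["walk", "approach bench", "sit", "seated"]

def generate_shot(name: str, start_frame: str) -> list[str]:
    # Placeholder: in practice this is a ComfyUI render that returns frame paths.
    return [f"{name.replace(' ', '_')}_{i:03d}.png" for i in range(1, 4)]

def faces_match(frame: str, references: list[str]) -> bool:
    return True   # placeholder: swap in a face-similarity check, or eyeball it

def repair_face(frame: str, references: list[str]) -> str:
    return frame  # placeholder: your inpaint / face-swap repair workflow

def make_film(anchor_frame: str, reference_faces: list[str]) -> list[list[str]]:
    trusted = anchor_frame                     # Anchor A starts the chain
    finished = []
    for shot in SHOTS:
        frames = generate_shot(shot, start_frame=trusted)   # keep each shot short
        last = frames[-1]
        if not faces_match(last, reference_faces):          # inspect the handoff frame
            last = repair_face(last, reference_faces)       # repair, don't regenerate
        finished.append(frames)
        trusted = last                         # hand the trusted frame to the next shot
    return finished

shots = make_film("anchor_a_walking.png", ["face_ref_1.png", "face_ref_2.png"])
```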


Before generating any video

0. Build the three anchors first

Make these stills before you render motion:

  • Anchor A — walking pair
  • Anchor B — standing / arriving near bench
  • Anchor C — seated on bench

Do not move on until all three are good.

These anchors should lock:

  • the same two faces
  • the same outfits
  • the same hair
  • the same relative height and body shape
  • the same bench / environment style

For your film, these anchors matter more than another LoRA.


Shot 1 — Walk

Goal

Get a short, believable walking shot where the pair still looks correct.

Workflow

Use I2V from Anchor A.

Why I2V here

You already know what the pair should look like. The job is not “invent two people.” The job is “animate these two people a little.” The Wan-native workflow already supports this logic directly. (docs.comfy.org)

What to ask for

Keep it modest:

  • side-by-side walking
  • small body motion
  • small camera movement
  • no dramatic pose change
  • no major turn away from camera

This shot should be short. Do not try to make the entire walking section in one go. On 8 GB, short controlled clips are much easier to keep coherent than long uncontrolled ones.

What to save

At the end of Shot 1, save:

  • the best final frame
  • optionally one mid-shot frame if the end frame is weak

That final frame becomes the handoff into Shot 2.


Shot 2 — Approach bench

Goal

Move from walking to “near the bench” without recasting the pair.

Workflow

Use I2V again, but now the input image is the repaired last frame from Shot 1.

What to do first

Inspect the last frame from Shot 1.

If the faces are slightly off:

  • repair that frame first
  • then use the repaired version as input

This is where your “inpaint the last frame” idea is genuinely useful.

Why this works

You are turning the clip boundary into a controlled handoff. Instead of asking the model to recreate the pair from text, you are telling it, in effect:

Start here. These are the people.

That is exactly the kind of explicit continuity help current Wan workflows benefit from. (docs.comfy.org)

What to ask for

Again, keep it conservative:

  • slowing down
  • moving toward bench
  • slight posture shift
  • maybe slight turn of heads
  • no full sit yet

Do not combine “approach bench” and “sit down” in this shot. Split them.

What to save

Save the best frame where the pair is clearly standing or just about to sit.

That frame becomes the first endpoint for Shot 3.


Shot 3 — Sit

Goal

Create the actual standing-to-seated transition.

Workflow

This is the one shot where I would switch away from pure I2V and use FLF.

Use:

  • first frame = best standing-near-bench frame from Shot 2
  • last frame = Anchor C or a repaired seated still

Why FLF here

This is the one place in your film where you know both ends very clearly:

  • start: standing near bench
  • end: seated on bench

That is exactly what the official Wan FLF workflow is designed for. (docs.comfy.org)

What to ask for

Keep the motion plain:

  • both lower into the bench
  • small coordinated posture change
  • no complex hand gestures
  • no big camera move

This shot should also be short.

If it fails

If the FLF output reaches the seated pose but the faces slip:

  • repair the seated endpoint still
  • rerun the transition using the corrected endpoint

That is much cheaper than trying to fix a whole long clip afterward.


Shot 4 — Seated

Goal

Hold the pair on the bench with minimal drift.

Workflow

Go back to I2V, using either:

  • the best final frame from Shot 3, or
  • Anchor C if Shot 3 ends weakly

What to ask for

Only subtle movement:

  • breathing
  • head turn
  • slight glance
  • small hand movement
  • slow camera drift

Bench shots are where you can make the film feel longer without asking the model to do too much. This is the safest place to get duration.

Editorial trick

If you have a really good seated clip, use editing to stretch its screen time:

  • hold a good frame briefly
  • crossfade into the next seated clip
  • use a gentle camera push in post if needed

That is often smarter than generating more motion.


Repair policy

When to repair

Repair only these:

  • the last frame of Shot 1
  • the last frame of Shot 2
  • the seated endpoint for Shot 3
  • any hero close-up or especially visible face

When not to repair

Do not start frame-by-frame fixing everything unless the clip is genuinely important.

For your project, selective repair is the efficient path.

If more than one or two frames are broken

Then a short-span video inpainting pass can make sense. LanPaint now supports Wan 2.2 video inpainting, but it is explicitly positioned for shorter spans, not big full-scene rewrites. (github.com)


LoRA plan for this workflow

What I would do

Keep LoRA usage minimal:

  • Option A: no LoRA
  • Option B: one utility LoRA for faster previews
  • Option C: one carefully tested style/identity LoRA

What I would not do

Do not stack several stylistic LoRAs while you are still solving continuity.

That makes it much harder to tell whether the model is drifting because of:

  • the shot design,
  • the clip handoff,
  • or the combined LoRA influences.

If you want faster previews

Try Wan2.2-Lightning for preview passes only. It is a 4-step LoRA family designed to speed Wan 2.2 generation, which is useful for look-dev and shot testing. Once you like the handoff strategy, do cleaner final passes with your preferred final setup. (huggingface.co)


Decode plan on 8 GB

Previews

For rough previews, do whatever keeps the graph alive.

Finals

For final output, try to avoid VAE Decode (Tiled) if possible.

There is a documented ComfyUI issue where Wan output shows blinking/discoloration with tiled decode, while standard VAE Decode does not show the same problem. (github.com)

So the practical rule is:

  • preview flexibly
  • final decode carefully

If standard decode is too heavy at full size, render smaller for continuity approval, then do the final-quality pass only on the shots you are actually keeping.


The exact order I would follow

Pass 1 — still approval

Make and approve:

  • Anchor A
  • Anchor B
  • Anchor C

Pass 2 — motion approval

Generate:

  • Shot 1 walk
  • Shot 2 approach
  • Shot 3 sit
  • Shot 4 seated

Keep them short.

Pass 3 — continuity approval

At every shot boundary:

  • inspect last frame
  • repair if needed
  • feed forward

Pass 4 — final render approval

Render only the approved shots with your better decode path.

Pass 5 — edit

Assemble in your video editor.
Do not ask ComfyUI to do every bit of pacing that a normal edit can do more safely.


Suggested shot-by-shot checklist

Shot 1 — Walk

  • Input = Anchor A
  • Mode = I2V
  • Duration = short
  • Motion = gentle walk
  • Output = save best final frame

Shot 2 — Approach bench

  • Input = repaired last frame from Shot 1
  • Mode = I2V
  • Duration = short
  • Motion = approach / slow / prepare to sit
  • Output = save best standing-near-bench frame

Shot 3 — Sit

  • First frame = best standing-near-bench frame
  • Last frame = Anchor C or repaired seated still
  • Mode = FLF
  • Duration = short
  • Motion = simple sit-down transition
  • Output = save best seated final frame

Shot 4 — Seated

  • Input = best seated frame
  • Mode = I2V
  • Duration = short to medium
  • Motion = very subtle
  • Output = use this shot to create calm screen time

What not to do for this film

Do not:

  • generate each shot from text only
  • ask one shot to do too many state changes
  • stack multiple unknown LoRAs
  • rely on trigger words as your main continuity tool
  • start Wan video LoRA training on 8 GB as your first fix

The official toolchain direction is already telling you the same thing indirectly: 5B on 8 GB for inference, I2V/FLF for structure, and much higher memory expectations for training. (docs.comfy.org)


My simplest version of the plan

If you want the cleanest possible summary:

  • Walk = I2V from walking anchor
  • Approach bench = I2V from repaired last frame of walk
  • Sit = FLF from standing-near-bench to seated anchor
  • Seated = I2V from seated anchor
  • Repair only the bridge frames
  • Use standard VAE Decode for finals if possible
  • Use Lightning only for fast previews

That is the shot-by-shot workflow I would use on an 8 GB machine for your exact scene.

So I will persevere until I can afford a better GPU… I wanted to use a faceswapper for the last frame but I can’t get ReActor installed… it’s in the Extensions section but nothing will install from there. I downloaded it from GitHub… it shows in the custom_model folder but I still can’t use it… the node simply won’t install… any clues? I even tried re-installing ComfyUI!!

I’m sure there’s a forum out there that covers ComfyUI tips and tricks in more detail than the HF Forum, but I don’t know where it is… Maybe Reddit?

Anyway, it’s probably something like this: I've hit a wall - please help - #2 by John6666

Great info… but how do I “If the face drifted, repair that frame”?
Everything I watch on YouTube etc. shows the same type of thing, unless I’m missing something: they all seem to use a single image, mask something on that image, then use a text encoder to replace that something and create the new image.

In my case (and it can’t be unique)… I want to inpaint the face of a different image (a broken frame) to create a new image (a repaired frame).

If anyone can point to a workflow that actually achieves this, I would be very grateful.

Hmm, in addition to the general points I generated below, there are a few important assumptions missing:

  • ComfyUI is about the most advanced tool available, aside from those designed for programmers, and unless you started using it with the specific intention of doing everything manually from the very beginning, it’s extremely difficult beyond the basics. End-to-end tutorials and instructional videos generally exist for the basic workflows, which makes those possible even for beginners, but the actual work involved is highly advanced.
  • On top of that, video generation is one of the most difficult tasks in generative AI and one of the tasks that requires the most VRAM.
  • For the reasons above, “there are almost never cases where the answer or workflow has already been explained.” This is a tool for people who want to create through trial and error. You’ll need to search GitHub for hints in fragmented information, ask generative AI for help, keep trying until it works, and in the worst case build custom nodes or code yourself. It sounds simple, but it’s difficult without the necessary knowledge and experience.
  • In generative AI, tasks like fixing output tend to consume more VRAM than the initial generation. This is because more data is loaded during processing → Therefore, if you can regenerate the output instead of fixing it, that’s the more cost-effective approach.
  • Achieving the same results with more limited resources (like VRAM) requires more advanced knowledge and skills. This isn’t limited to PCs, by the way

  • But 8GB of VRAM… even 24GB isn’t particularly powerful for video generation. Data center GPUs are in the 80GB range, after all.


For these reasons, I think you’ll probably need to become an advanced user first if you want to do everything perfectly. Or look for an easier path, or don’t try to accomplish everything at once.


One reason you can’t find a satisfactory answer even after searching is that, under current conditions, it’s actually quite difficult



You are not missing something obvious — you are trying to do a different job than most inpainting tutorials show

What most beginner tutorials teach is:

  • load one image
  • mask an area
  • use a prompt to replace that area
  • get a new image

That is generic inpainting.

What you want is different:

  • you already have a broken frame from a video
  • you know who the face is supposed to be
  • you want to repair only the face
  • then use that repaired frame as the starting point for the next shot

That is a real and common workflow. It is closer to identity repair or reference-guided face repair than to ordinary “mask + prompt” inpainting.


The short answer

Yes — there are real workflows for this.

For your case, the practical choices are:

  1. Repair the bad frame
    Best when the shot is already good and only the face drifted.

  2. Recreate the shot
    Best when pose, staging, camera, or multiple frames are wrong.

  3. Use the repaired frame as the next shot’s input
    This is the key continuity trick.

For your hardware and current stage, I would not jump to training a custom LoRA yet. I would first make this simpler loop work:

  1. generate a short shot
  2. inspect the last frame
  3. repair that frame if only the face is wrong
  4. use that repaired frame to start the next short shot

That is the most realistic and least frustrating route on an 8 GB machine.


The core distinction: repair vs recreate

This is the biggest practical decision.

Repair

Repair when the frame is already mostly correct and only a local area is wrong.

Typical examples:

  • the face became the wrong person
  • the face is melted or soft, but the pose is still right
  • the lighting, clothes, bench, and composition are all correct
  • the frame would be usable if the face were fixed

In other words:

the shot is good, the face is bad

That is a repair case.

Recreate

Recreate when the problem is global, not local.

Typical examples:

  • both people drifted too far
  • the body pose is wrong
  • the bench relationship is wrong
  • the camera angle drifted
  • the motion between two states is wrong
  • several frames in a row are unusable
  • fixing the face would still leave the shot bad

In other words:

the shot itself is bad

That is a recreate case.

The practical rule

A good production rule is:

  • if the frame is about 80–90% right, repair
  • if the shot is structurally wrong, recreate

That one rule will save you a lot of time.


What “repair that frame” actually means

It does not mean “run a magical video repair model over the whole movie.”

It usually means this:

  1. export or grab the bad last frame
  2. load it as a still image
  3. mask only the face
  4. repair the face using one of the methods below
  5. save the result as a new repaired frame
  6. feed that repaired frame into the next shot

So yes — it is often a still-image repair inside a video workflow.

That is normal.
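For step 1 in that list, “export or grab the bad last frame” can be done in your video editor, or with a few lines of Python. A minimal sketch assuming OpenCV (`opencv-python`) is installed; the clip and output file names are placeholders.

```python
# Grab the last frame of a rendered clip as a still image for repair.
# Assumes opencv-python; "clip1.mp4" is a placeholder file name.
import cv2

cap = cv2.VideoCapture("clip1.mp4")
total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
cap.set(cv2.CAP_PROP_POS_FRAMES, total - 1)     # jump to the final frame
ok, frame = cap.read()
cap.release()

if ok:
    cv2.imwrite("clip1_last_frame.png", frame)  # load this into the repair workflow
else:
    print("Could not read the last frame - try stepping back a few frames instead.")
```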


The three repair methods that matter

1) Plain masked inpainting

Use this when the face is only slightly wrong.

Good for:

  • soft face drift
  • mild face deformation
  • one eye or mouth going wrong
  • keeping everything else in the frame unchanged

You can learn the base mechanic from the official ComfyUI inpaint tutorial:

That tutorial covers:

  • the Mask Editor
  • masking only the area you want to change
  • the VAE Encoder (for Inpainting)
  • rebuilding just the masked region

This is the simplest route and the best one to learn first.

2) Face swap / identity replacement

Use this when the face became the wrong person.

Good for:

  • “the shot is fine, but the character is now someone else”
  • bridge-frame correction
  • repeating the same identity across many frames

The most direct tool for this in ComfyUI is:

ReActor is specifically a face-swap extension for ComfyUI and supports reusable face models, which makes it very attractive for recurring characters.

3) Reference-guided face repair

Use this when you want another image to actively guide the face repair.

Good for:

  • “make this broken face look like this person”
  • stronger identity control than a prompt alone
  • cases where plain inpaint is too vague

The most relevant OSS tool here is:

IPAdapter Plus is important because it supports FaceID-style identity guidance. The repo notes that most FaceID models require the matching LoRA, and that regional use is most effective inside an inpaint-style workflow.


The actual OSS options I would care about for your case

A. Official ComfyUI inpaint workflow

Best first step because it teaches the basic mask-and-repair loop.

Why I care:

  • it is official
  • it is simple
  • it teaches the exact mechanics you need for single-frame repair

B. Official Wan 2.2 5B workflow

This is the video-generation backbone I would use on 8 GB.

Why I care:

  • official template path
  • documented model placement
  • explicitly aimed at the 5B low-VRAM workflow

C. Official Wan FLF workflow

This is the best tool for your sit-down transition.

Why I care:

  • first/last-frame control is exactly the right abstraction for:
    • standing near bench
    • seated on bench
    • controlled transition between the two

D. ReActor

Most direct route for “wrong face in one frame.”

Why I care:

  • fast face swap
  • reusable face models
  • very good for bridge-frame identity repair

E. Impact Pack

Best fallback if ReActor keeps wasting time.

Why I care:

  • MaskPainter
  • FaceDetailer
  • MaskDetailer
  • designed for local detection/detail/inpaint workflows

The repo explicitly describes:

  • MaskPainter as a mask-drawing feature
  • FaceDetailer as a face detector/improver
  • MaskDetailer as a simple inpaint node that applies the detailer to the masked area

That is extremely close to your actual need.

F. IPAdapter Plus

Best when you want reference-guided identity repair.

Why I care:

  • strongest “use this face as the guide” option
  • but more setup-heavy than plain inpaint

G. VACE

Important later, but not my first recommendation for your machine.

Why it matters:

  • this is the heavier path for real video-region editing / masked V2V
  • it is useful to know it exists
  • but I would not start here on 8 GB

H. Musubi-tuner

Later, not now.

Why it matters:

  • useful reality check for training
  • the repo recommends roughly:
    • 12 GB+ for image training
    • 24 GB+ for video training

That is why I would not make “train my own LoRA” your next move.


My actual recommendation for your case, as a whole

If this were my setup, I would do it in this order:

Phase 1 — Stabilize ComfyUI

Goal: get one repair path working, not ten.

1. Stop expanding the setup

For now, stop downloading more:

  • checkpoints
  • LoRAs
  • random custom nodes
  • giant workflow packs

Keep only a small active set.

2. Fix custom node installation properly

The official docs say every custom node install has two steps:

  1. put the node repo into ComfyUI/custom_nodes
  2. install that node’s Python dependencies in the same ComfyUI environment

Official docs:

If a node repo is sitting in something like custom_model, that is the wrong place. Node repos go in custom_nodes.

3. Read the startup logs every single time

Do not trust “installed successfully” messages alone.

Trust:

  • startup console
  • import failed
  • missing dependency messages
  • missing node report

The troubleshooting docs explicitly say to inspect logs and error reports.

4. Try ReActor once, cleanly

If you want ReActor, follow the repo literally:

  • put it in ComfyUI/custom_nodes/ComfyUI-ReActor
  • run its install step
  • download required models
  • restart ComfyUI
  • check the logs
  • confirm the nodes actually appear

If it still fails after one clean pass, stop fighting it and move to Impact Pack.

That is not “giving up.”
That is good production discipline.


Phase 2 — Prove a minimal workflow works

Goal: one small workflow that is reliable.

5. Use the official Wan 2.2 5B template

Do not start from a huge community graph.

Start from:

Why:

  • official
  • documented
  • suitable for your VRAM tier
  • much easier to debug

6. Stop thinking “2-minute video”

Think in shots:

  1. walk
  2. approach bench
  3. sit
  4. seated

That is the correct structure for your project.

7. Create three still anchors first

Before any more video, create:

  • Anchor A — both characters walking
  • Anchor B — both characters near the bench
  • Anchor C — both characters seated

If those three are not right, the video will not stay right.

These anchors are your cast lock.


Phase 3 — Use repair and recreate correctly

Goal: keep continuity without overcomplicating the workflow.

Shot 1 — Walk

Generate a short walking clip from Anchor A.

After rendering:

  • if only the face drifted near the end → repair the last frame
  • if body motion / spacing / camera drifted → recreate the shot

Shot 2 — Approach bench

Use the repaired last frame from Shot 1 as the next shot’s starting image.

Again:

  • local face problem → repair
  • global shot problem → recreate

Shot 3 — Sit down

This is where I would usually recreate, not repair.

Use:

  • first frame = correct standing-near-bench frame
  • last frame = Anchor C
  • workflow = FLF

That is what Wan FLF is for.

Shot 4 — Seated

Generate from the seated anchor and keep motion subtle.

This is the safest place to get extra screen time.


How I would decide: repair vs recreate

Use this table mentally:

| Situation | What I would do |
| --- | --- |
| Face is wrong, everything else is good | Repair |
| Face is slightly deformed, everything else is good | Repair |
| Face is wrong and you have a reference image | Repair with identity guidance |
| Body pose is wrong | Recreate |
| Bench/camera/staging drifted | Recreate |
| Multiple frames in a row are bad | Recreate |
| Sit-down motion is wrong | Recreate with FLF |

That is the real production logic.


The actual repair routes I would use

Route 1 — Plain face-region inpaint

Use when the face is only moderately wrong.

Workflow

  1. load broken frame
  2. draw a tight mask over the face
  3. run inpaint
  4. save repaired frame

Learn from

Best when

  • the identity is close
  • the face just needs correction
  • you want the simplest path
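Outside ComfyUI, those four steps map onto a single masked-inpaint call, which may help make the mechanics clearer. A minimal sketch assuming Hugging Face `diffusers` and an SD inpainting checkpoint; the model id, prompt, and file names are placeholders, and the mask is a white-on-black image covering only the face. This is an illustration of the idea, not your ComfyUI graph.

```python
# Plain face-region inpaint outside ComfyUI, for illustration only.
# Assumes diffusers + torch; model id, prompt, and file names are placeholders.
# The mask is white where the face should be regenerated, black everywhere else.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

broken = Image.open("broken_frame.png").convert("RGB")
mask = Image.open("face_mask.png").convert("RGB")   # tight mask over the face only

repaired = pipe(
    prompt="a young woman's face, natural lighting, photo",
    image=broken,
    mask_image=mask,
).images[0]
repaired.save("repaired_frame.png")
```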

Route 2 — ReActor face swap

Use when the face became the wrong character.

Workflow

  1. load broken frame
  2. load face reference / face model
  3. run ReActor
  4. save repaired frame

Source

Best when

  • the shot is fine
  • the identity is wrong
  • you want the fastest bridge-frame fix

Route 3 — Impact Pack + IPAdapter FaceID

Use when you want stronger reference guidance and local repair.

Workflow

  1. load broken frame
  2. mask the face with MaskPainter
  3. use FaceDetailer / MaskDetailer
  4. add IPAdapter FaceID guidance if needed
  5. save repaired frame

Sources

Best when

  • ReActor is unavailable or unstable
  • you want stronger reference control
  • you want local repair, not a global rewrite

What I would not do right now

1. I would not train a LoRA

Not yet.

That may become useful later, but the current official/community guidance makes it a poor next step for your hardware and stage.

Use:

  • repair
  • anchors
  • short shots
  • FLF

first.

2. I would not jump to full video inpainting

True masked video editing exists, and VACE is the right place to look for that.

But for your current setup, it is the later tier, not the next tier.

3. I would not keep downloading assets “just in case”

That is how disk fills up and clarity disappears.

For now, your active set should be small:

  • Wan 2.2 5B
  • one repair route
  • three anchor stills
  • your active project files

The exact “if this were my setup” plan

Day 1 — setup sanity

  • clean custom_nodes
  • fix folder locations
  • try one clean ReActor install
  • read startup logs
  • if it still fails, switch to Impact Pack

Day 2 — learn one-frame repair

  • use the official inpaint tutorial
  • take one bad frame
  • repair one face
  • save a corrected frame

If you cannot do one bad frame → one repaired frame, do not touch the full video workflow yet.

Day 3 — generate one walk shot

  • use official Wan 2.2 5B template
  • keep it short
  • inspect the last frame

Day 4 — use the repaired frame as the handoff

  • repair the last frame if needed
  • feed it into the approach shot

Day 5 — build the sit-down shot with FLF

  • good standing frame
  • good seated frame
  • FLF transition

Day 6 — generate the seated shot

  • short and subtle
  • repair only if needed

Day 7 — assemble in an editor

Use editing for pacing.
Do not ask the model to generate every second of your screen time if editing can solve it more reliably.


The most important takeaway

The winning pattern for your project is:

short shot → inspect last frame → repair if the problem is local → recreate if the problem is global → use the repaired frame to start the next shot

That is the cleanest, safest, least frustrating version of the workflow.


The actual links I would open first

If you only open a few tabs, I would make them these:

  1. Install custom nodes in ComfyUI
  2. ComfyUI-Manager install
  3. Custom-node troubleshooting
  4. ComfyUI Inpainting Workflow
  5. Wan2.2 Video Generation ComfyUI Official Native Workflow Example
  6. ComfyUI Wan FLF workflow
  7. ComfyUI-ReActor
  8. ComfyUI Impact Pack
  9. ComfyUI IPAdapter Plus
  10. Wan VACE To Video node docs
  11. musubi-tuner

Final recommendation

If I had to turn everything above into one practical recommendation, it would be this:

Right now

  • keep the stack small
  • fix installs properly
  • use Wan 2.2 5B
  • repair only bridge frames
  • recreate bad transitions
  • use FLF for the sit-down shot
  • treat ReActor as useful but optional
  • use Impact Pack if ReActor wastes time

Later

  • stronger reference-guided repair
  • true video-region editing
  • training, if you still need it after the workflow is stable

That is the path I would trust most for your setup and your specific project.

To put it simply,

  • It’s easier to accomplish simpler tasks
  • Difficult tasks might be achievable if you can somehow break them down into simpler ones
  • The more money you have, the easier it is (because you might be able to buy an environment that allows you to execute complex tasks without simplifying them)

This is great advice, thank you. I had resolved the nodes/models install issues already… basically, I was using the ComfyUI desktop install directly from ComfyUI.org. The essentials button was not working for most of the ‘essentials’ like Manager… I think there was a Python issue with my environment. So I uninstalled it and replaced it with ComfyUI portable… everything just worked: nodes, Manager etc. It’s brilliant.

1 last question: I have managed to create a Wan 2.2 image > text > video workflow that works really well… takes about 4 to 5 mins to generate a 10 sec clip. I’ve managed to include 2 text encoders for neg/pos prompts and a LoRA (or a chain of multiple LoRAs), with the output used to generate the frames and create the video. Really happy with it, and I feel comfortable that I understand the nodes after a few weeks learning this stuff. I also have the same workflow with a first frame > last frame option.

2 challenges I face though:
1/ I can’t figure out how to include a checkpoint node in my flow
2/ I notice that the lower the FPS, the more drift I get. I was creating 5 sec clips @ 11/12 FPS… when I upped this to 24 FPS the drift was less noticeable. Problem is it now takes so much longer to generate… until I upgrade my GPU I guess I will have to suffer this.

Nope… I just can’t figure this inpainting out… must be me. I followed the links exactly - managed to put a teapot into a picture, which was something I could already do. Then moved to the I2I version with the face and the cloud… easy enough, but this doesn’t solve my challenge.
The cloud/face inpainting is not using a 2nd image where the real face is.

ie:

face/cloud example: take a face, mask the hair, replace it with a cloud (using text)… easy

my goal

take a frame where the face is not quite right, mask the face, then what? I still have the broken frame on the canvas… I can’t use text to repair the face, so how do I get another face from a reference image into this? What am I doing wrong?

I think the challenge is just too hard… It’s on hard mode from the very start, after all.


This is how I would think about your setup now

First: switching from the direct Desktop install to ComfyUI Portable and suddenly having nodes/Manager behave properly is a real clue, not a coincidence. It strongly suggests the earlier problems were environmental rather than “you not understanding ComfyUI.” That is common with custom-node ecosystems: the install is only truly healthy when the node location and Python environment line up properly.

The good news is that you are now past the hardest beginner wall.

Your current Wan 2.2 setup already does something valuable:

  • it generates clips reliably
  • you understand the main nodes
  • you can use positive/negative conditioning
  • you can apply one or more LoRAs
  • you can do first-frame workflows
  • you can do first-frame → last-frame workflows

That means the main questions are no longer:

  • “How do I make anything at all?”
  • “Why won’t the nodes load?”

Your real questions now are more advanced and more interesting:

  1. Why does a classic checkpoint node not seem to fit the Wan graph?
  2. Why does lower FPS make drift look worse, and what should you do about it?
  3. Why do ordinary inpainting tutorials not solve “take this bad frame and fix the face using another face image”?

Those three are connected.


The short answer

If I had to compress the whole answer into one paragraph, it would be this:

Keep your Wan 2.2 workflow as your main shot generator. Do not force a classic SD-style checkpoint loader into the native Wan graph. Treat FPS as a quality/time tradeoff, not as a magic identity fix. Use FLF for the sit-down transition. And for face repair, stop thinking “text-only inpaint” and start thinking “separate still-frame repair workflow using either plain masked face inpaint, ReActor face swap, or mask-local face repair/detailing with a reference-guided method.”

That is the cleanest mental model.


1) About the checkpoint node

Short version

In a native Wan 2.2 workflow, you normally do not insert a classic SD/SDXL-style checkpoint node.

Why

The official Wan 2.2 ComfyUI workflow is not structured like a classic Stable Diffusion workflow where one checkpoint node loads most of the system in one go.

Instead, the official Wan-native flow is built from separate components, typically:

  • diffusion model loader
  • CLIP loader
  • VAE loader
  • the Wan video node itself
  • LoRA loader(s)
  • conditioning nodes

See:

What that means for your graph

If your current graph already looks something like:

  • Load Diffusion Model
  • Load CLIP
  • Load VAE
  • one or more LoRA nodes
  • positive / negative conditioning
  • Wan image-to-video or first/last-frame node
  • decode / save

then you are already using the correct native loading pattern.

So the reason you “can’t figure out how to include a checkpoint node” is probably not that you are missing something. It is more likely that there is no natural slot for a classic checkpoint node in the native Wan graph.

Where a checkpoint loader does make sense

A classic checkpoint loader can make sense in a separate still-image repair workflow.

For example, if you later build a dedicated face-repair graph using:

  • a still-image inpaint model,
  • a checkpoint-based image model,
  • or an SDXL/Flux-style repair branch,

then that separate graph may use a checkpoint node.

But that would be its own repair workflow, not something you must squeeze into the Wan graph itself.

About your LoRA chain

Your current LoRA logic sounds fine.

Relevant docs:

Important points from those docs:

  • LoRAs are discovered from ComfyUI/models/loras
  • multiple LoRA nodes can be chained directly
  • LoraLoaderModelOnly is specifically for applying LoRAs to the model branch only, without needing a CLIP model input on that node

That is why LoRA chaining feels natural in your current setup, while a classic checkpoint node does not.
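To make the chaining concrete, here is roughly what two stacked model-only LoRA loaders look like in ComfyUI’s API-format prompt JSON, written as a Python dict. The node ids, LoRA file names, and strengths are placeholders, and node "10" stands in for whatever node loads your Wan diffusion model; treat it as a sketch of the wiring, not an exported workflow.

```python
# Two chained LoraLoaderModelOnly nodes in ComfyUI API format, as a Python dict.
# Node ids, file names, and strengths are placeholders; "10" stands in for the
# node that loads the Wan diffusion model. Each loader takes the previous MODEL
# output (["<node_id>", 0]) and passes a modified MODEL downstream.
lora_chain = {
    "11": {
        "class_type": "LoraLoaderModelOnly",
        "inputs": {
            "model": ["10", 0],                 # MODEL from the diffusion model loader
            "lora_name": "wan22_lightning_4step.safetensors",
            "strength_model": 1.0,
        },
    },
    "12": {
        "class_type": "LoraLoaderModelOnly",
        "inputs": {
            "model": ["11", 0],                 # chained: takes node 11's MODEL output
            "lora_name": "my_style_lora.safetensors",
            "strength_model": 0.7,
        },
    },
}
# The sampler's "model" input would then point at ["12", 0].
```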

My practical recommendation

For your Wan graph:

  • do not force a classic checkpoint loader into it
  • keep the native Wan structure
  • only use checkpoint-based loading in a separate repair graph if you later choose a checkpoint-based still-image repair method

2) About FPS, drift, and render time

You noticed:

  • lower FPS = more visible drift
  • higher FPS = drift feels less noticeable
  • but higher FPS = much longer generation time

That observation is useful, and it makes sense.

Why higher FPS often looks better

Higher FPS does not necessarily mean the model suddenly understands identity better.

What it often means is:

  • each frame is closer to the next in time
  • motion is split into smaller steps
  • the changes between frames feel less abrupt
  • the drift becomes less obvious because the motion is smoother

So the model may still be drifting, but the drift is hidden better by finer temporal spacing.

Why this becomes expensive quickly

The cost scales with frame count.

The official ComfyUI docs for Wan/Fun Inp make this very explicit: video length is the total number of frames, and the example calculation is basically:

  • seconds × fps = frame count

So if you double FPS while keeping the duration the same, you roughly double the number of frames the system has to generate.
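Using the numbers from your own post, the cost difference is easy to see. A quick arithmetic sketch of the seconds × fps rule; nothing model-specific here.

```python
# seconds x fps = frames the model must generate, using the numbers from the post.
for seconds, fps in [(5, 12), (5, 24), (10, 12), (10, 24)]:
    print(f"{seconds}s @ {fps} fps -> {seconds * fps} frames")
# 5s @ 12 fps -> 60 frames
# 5s @ 24 fps -> 120 frames  (same duration, roughly double the work)
# 10s @ 12 fps -> 120 frames
# 10s @ 24 fps -> 240 frames
```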

See:

The important production lesson

On 8 GB VRAM, I would not make native 24 FPS your default unless you truly need it.

That is because your real bottleneck is not “video exists or not.” It is:

  • quality per minute of render time
  • how many iterations you can afford
  • whether you can keep enough control over continuity

A better 8 GB strategy

Instead of brute-forcing everything at native 24 FPS, I would bias toward:

  1. shorter clips
  2. moderate native FPS
  3. frame interpolation later, when needed

The official ComfyUI frame interpolation workflow exists for exactly this reason.

See:

That page is very relevant because it explicitly says frame interpolation:

  • generates intermediate frames
  • smooths motion
  • improves temporal consistency
  • is useful for increasing frame rate in short clips
  • is useful for fixing low-FPS generations without regenerating the source frames

My practical recommendation

For your current setup I would test this order:

  • keep clips short
  • use a sensible native frame count
  • use stronger control (first frame, first→last frame)
  • only then use interpolation for smoother output

That is usually a better quality/time tradeoff than forcing 24 FPS generation everywhere.


3) Why the inpainting tutorials feel like they stop one step too early

This is the part causing the most confusion, and for good reason.

What those tutorials are really teaching

The standard inpainting tutorials teach:

  • load an image
  • draw a mask
  • use text conditioning
  • regenerate only the masked region

That is generic inpainting.

And yes, that is why:

  • teapot example works
  • cloud/hair example works
  • but your actual problem still feels unsolved

Because your actual problem is not:

replace this masked region with any plausible thing described by text

Your actual problem is:

keep this bad frame as the base image, keep the pose/lighting/composition, and make the masked face look like the correct person from another image

That is a different task.

The missing concept

You are not supposed to put the second face image “onto the canvas” like another background layer.

Instead:

  • the broken frame remains the base image
  • the mask defines the region to repair
  • the second face image enters the graph as a reference / swap source / identity guide
  • a repair node uses that second image to influence what happens inside the mask

That is the key mental shift.


4) So what are the actual ways to use a second face image?

There are three practical families.

A. Face swap: the direct route

This is the ReActor route.

Use it when:

  • the frame is already good
  • the face became the wrong person
  • the pose, lighting, clothes, and framing are acceptable

Relevant repo:

Why it is relevant:

  • it is explicitly a face-swap extension for ComfyUI
  • it supports reusable face models
  • it is designed for image inputs and is very naturally suited to “fix this bad frame”

In plain language, the workflow is:

  • input_image = broken frame
  • source_image or face_model = the correct identity
  • output = repaired frame

That is probably the closest direct answer to your actual question.
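
For orientation only, here is a rough standalone sketch of the same shape of operation using the insightface library that ReActor builds on, outside ComfyUI. The file names and the local inswapper model path are placeholders, not something from this thread; inside ComfyUI you would simply wire the equivalent images into the ReActor node.

```python
# Hedged sketch: repair a drifted frame by swapping in the identity from a reference photo.
# Assumes insightface and onnxruntime are installed and inswapper_128.onnx is downloaded locally.
import cv2
import insightface
from insightface.app import FaceAnalysis

analyzer = FaceAnalysis(name="buffalo_l")            # face detector + identity embedder
analyzer.prepare(ctx_id=0, det_size=(640, 640))      # ctx_id=0 -> first GPU, negative -> CPU

swapper = insightface.model_zoo.get_model("models/inswapper_128.onnx")  # placeholder path

reference = cv2.imread("reference_face.png")          # the correct identity
broken = cv2.imread("broken_last_frame.png")          # the drifted frame to repair

source_face = analyzer.get(reference)[0]              # identity that should appear in the frame
repaired = broken.copy()
for face in analyzer.get(broken):                     # every face detected in the broken frame
    repaired = swapper.get(repaired, face, source_face, paste_back=True)

cv2.imwrite("repaired_last_frame.png", repaired)
```

The point is the shape of the operation: the broken frame stays the base image, and the second image only supplies the identity.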

B. Local face repair/detailing: the practical fallback

This is the Impact Pack route.

Relevant repo:

Important nodes:

  • MaskPainter — draw the mask
  • FaceDetailer — detect faces and improve them
  • MaskDetailer — inpaint only the masked area with a detailer pass

Why it is relevant:

  • it matches the “keep the frame, only fix the face” logic very well
  • it is a great fallback if ReActor is awkward or not the right fit
  • it is especially useful if the face is not just the wrong person but also a bit damaged, blurry, or structurally off

C. Reference-guided identity repair: the most conceptually accurate route

This is the IPAdapter FaceID-style idea.

Relevant repo:

Why it is relevant:

  • this is the clearest answer to “how do I use a second image to guide the face repair?”
  • the second face image becomes an identity reference, not just a prompt substitute
  • the docs emphasize that regional use is most effective through an inpainting workflow

This route is powerful, but it is more setup-heavy than the other two.


5) My actual recommendation for your case

If this were my setup, I would not try to solve everything inside one giant graph.

I would deliberately split the work into two workflows.


Workflow A — the main Wan video workflow

This is your existing graph.

Keep it for:

  • image/text/video generation
  • positive / negative prompt control
  • LoRAs
  • first-frame workflows
  • first-frame → last-frame workflows

This is your shot generator.

Relevant docs:


Workflow B — the separate still-frame repair workflow

This is the graph you use when a shot finishes and the last frame is almost right, but the face is not.

Use it for:

  • loading the broken frame
  • masking only the face
  • repairing that face with one of:
    • plain inpaint
    • ReActor
    • Impact Pack
    • reference-guided identity repair

Then save the repaired frame and feed it back into the next Wan shot.

This is your continuity repair tool.
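
If it helps to script the hand-off between the two workflows, here is a minimal sketch (assuming OpenCV is installed; the file names are placeholders) for grabbing the last frame of a finished clip so it can be loaded into Workflow B:

```python
# Hedged sketch: pull the final frame out of a rendered clip for repair.
import cv2

cap = cv2.VideoCapture("shot_03.mp4")                  # placeholder clip name
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))   # total frames in the clip
cap.set(cv2.CAP_PROP_POS_FRAMES, frame_count - 1)      # seek to the last frame (codec-dependent;
ok, last_frame = cap.read()                            # reading frames sequentially is the safe fallback)
cap.release()

if ok:
    cv2.imwrite("shot_03_last_frame.png", last_frame)  # feed this into Workflow B
else:
    print("could not read the last frame")
```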

That split is extremely important.

Why I recommend two workflows

Because it gives each graph one clear job:

  • Workflow A creates shots
  • Workflow B repairs bridge frames

That is much easier to understand and much easier to debug than an all-in-one “do everything” workflow.


6) Repair vs recreate: the rule that will save you the most time

This is the rule I would use.

Repair when:

  • the frame is already mostly good
  • the body pose is right
  • the lighting is right
  • the composition is right
  • the background / bench is right
  • only the face or a tiny area drifted

Recreate when:

  • the pose is wrong
  • the camera is wrong
  • the sit-down motion is wrong
  • multiple frames in a row are bad
  • fixing the face would still leave the shot unusable

For your project, that usually means:

  • walk: repair the last frame if only the face drifted
  • approach bench: same
  • sit-down transition: usually recreate with FLF (first→last frame control), not patch frame-by-frame
  • seated shot: repair isolated face drift, recreate bad staging

This is the production logic I would trust.
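
If you want to make the repair-versus-recreate call less subjective, one option is to actually measure how far the face has drifted. Here is a hedged sketch using insightface identity embeddings; the file names are placeholders, and the cutoff is an assumption you would tune on a few known-good and known-bad frames.

```python
# Hedged sketch: compare the identity in a generated frame against a reference photo.
import cv2
import numpy as np
from insightface.app import FaceAnalysis

analyzer = FaceAnalysis(name="buffalo_l")
analyzer.prepare(ctx_id=0, det_size=(640, 640))

def identity_vector(path):
    faces = analyzer.get(cv2.imread(path))
    return faces[0].normed_embedding if faces else None   # L2-normalised identity embedding

ref = identity_vector("reference_face.png")                # placeholder file names
frame = identity_vector("shot_03_last_frame.png")

if ref is None or frame is None:
    print("no face found in one of the images")
else:
    similarity = float(np.dot(ref, frame))                 # cosine similarity (vectors are normalised)
    print(f"identity similarity: {similarity:.3f}")
    # Assumption: above your chosen cutoff -> repair the face; well below it -> recreate the shot.
```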


7) The exact answer to “what am I doing wrong?”

I do not think you are doing the wrong operation.

I think you are trying to solve a reference-guided identity repair task with a text-only generic inpainting tutorial.

That is the mismatch.

You are not failing because you do not understand masking.

You are failing because your actual task needs one more ingredient:

  • a second-image identity mechanism

That mechanism is usually:

  • ReActor as a swap source
  • or IPAdapter FaceID as a reference-conditioning source
  • or Impact Pack as a local face repair route when the face is damaged and needs controlled correction

That is the real missing piece.


8) The most practical next steps I would recommend

Step 1

Keep your current Wan graph as your main shot workflow.

Step 2

Stop worrying about forcing a classic checkpoint node into that graph.

Step 3

Create a separate frame repair workflow.

Start with the official inpaint tutorial just to make sure the mask/local-rebuild logic is completely clear:

Then decide:

  • slight face drift → plain masked face inpaint
  • wrong person → ReActor
  • local face repair / fallback → Impact Pack
  • strong second-image identity guidance → IPAdapter FaceID-style route

Step 4

Use FLF for your sit-down transition.

That is a cleaner solution than trying to patch a bad sit-down clip.

Step 5

Do not default to native 24 FPS everywhere.
Test:

  • short clips
  • moderate native FPS
  • interpolation later

See:


9) The one-sentence summary

Your Wan workflow should generate shots; your separate repair workflow should fix bad bridge frames using a mask plus either a swap source, a face-detail pass, or a second-image identity guide.

That is the cleanest way to make the whole thing click.


10) The exact links I would keep open

Main generation:

Repair / continuity:

Later / heavier options:

Utilities:


Final recommendation

If this were my setup, I would do this:

  1. keep the current Wan 2.2 graph
  2. do not add a classic checkpoint loader to it
  3. build one separate still-frame repair graph
  4. use that graph to fix bad bridge frames
  5. use FLF for the sit-down transition
  6. use frame interpolation instead of brute-forcing native 24 FPS everywhere
  7. only later consider heavier clip-editing or training workflows

That is the simplest, cleanest, least frustrating path from where you are now.

Your workflow probably isn’t broken. It’s likely just the software in your execution environment that’s broken.

Setting up an open-source generative AI environment (not just ComfyUI) is quite fragile; in short, it can break at the slightest provocation. It’s absolutely true that the first major hurdle is the installation itself.


The general rule is: “Once it’s working properly, stick with that setup and avoid making changes as much as possible,” and “When working on something else, set up a separate, isolated environment.” I think this is relatively easier to do with the Portable version. People who take this seriously use virtual containers or similar solutions.

So, in your case, the error message indicates that it’s the ONNX Runtime that’s broken. This is a relatively mild case, and reinstalling it isn’t that difficult. (I’m not saying it’s easy, though.)


This does not look like a workflow mistake anymore

Your workflow may still be basically fine.

The strongest clue is the log error:

Error loading "...\onnxruntime_providers_cuda.dll" which depends on "cublasLt64_12.dll" which is missing.

That means the GPU runtime layer for ReActor/ONNX Runtime is failing to load. In plain English:

  • your face-swap graph can still be logically correct
  • but the GPU engine underneath it is now broken
  • so the node no longer processes frames the way it did before

That is why rebuilding the workflow from scratch did not fix it.


The key diagnosis

This looks much more like:

  • ONNX Runtime GPU / CUDA dependency break
  • possibly made more fragile by Portable + Python 3.13
  • not primarily a “wrong ReActor settings” issue

The most important current sources point the same way:

  • ComfyUI Portable now ships as Python 3.13 + CUDA 13.0 by default, and also offers an alternative Python 3.12 + CUDA 12.6 build for older GPUs:
    ComfyUI Portable (official docs)

  • ONNX Runtime CUDA says its GPU packages must match the runtime environment by CUDA major version and cuDNN major version, and that CUDA 12 builds are for CUDA 12.x; cuDNN 8 and 9 are not interchangeable:
    ONNX Runtime CUDA provider docs

  • ONNX Runtime also documents a Windows-specific fix/workaround: it can preload CUDA, cuDNN, and MSVC runtime DLLs, and on Windows its default search order checks the PyTorch lib directory first:
    ONNX Runtime CUDA provider docs

  • There is a current ReActor issue with your same family of error (cublasLt64_12.dll missing) in a Portable-style setup:
    ReActor issue #201

  • There is also a current ReActor issue for latest ComfyUI + Python 3.13 where GPU use is broken and package conflicts show up:
    ReActor issue #214

  • Current onnxruntime-gpu supports Python >= 3.11, including modern Python, so Python 3.13 is not automatically unsupported by ONNX Runtime itself:
    onnxruntime-gpu on PyPI


What I think happened

The most likely story is:

  1. your workflow genuinely worked
  2. an update changed the ONNX Runtime / CUDA dependency situation
  3. the GPU provider stopped loading
  4. ReActor stopped processing the frame sequence properly
  5. you now only get a tiny fragment of output

That matches your symptoms better than “the graph suddenly forgot how to work.”

The fact that it now outputs only the last few frames is very likely a secondary symptom of the runtime break, not the root cause.


Is Python 3.13 the cause?

Short answer

Possibly a contributing factor, but probably not the main cause of this exact error.

Why

Your exact error is about a missing CUDA 12 DLL.

That points more strongly to:

  • missing CUDA 12.x runtime visibility
  • mismatched cuDNN
  • missing MSVC runtime
  • or ONNX Runtime no longer finding the right DLLs after an update

Python 3.13 can still matter because the broader ReActor ecosystem is clearly more fragile there right now, but the immediate smoking gun is still the missing CUDA DLL.

So I would rank causes like this:

  1. Most likely: CUDA / ONNX Runtime dependency mismatch
  2. Also likely: update changed how ORT finds DLLs
  3. Possible contributor: Python 3.13 makes the stack more brittle
  4. Less likely: Python 3.13 alone is the whole root cause

The most realistic fixes and workarounds now

Option A — quickest diagnostic: test CPU fallback once

This is the fastest way to answer:

“Is my graph still okay?”

Temporarily switch the ONNX side to CPU and run the exact same workflow.

If it works again (slower, but logically correct), then your graph is probably fine and the GPU runtime path is definitely the part that broke.

I would use this as:

  • a diagnostic
  • and a temporary emergency workaround

not as the final solution.


Option B — most likely real fix: restore the CUDA 12 / cuDNN 9 / MSVC chain

This is the highest-probability proper fix.

ONNX Runtime’s current CUDA guidance says the GPU provider depends on:

  • matching CUDA major version
  • matching cuDNN major version
  • and accessible runtime DLLs

A current ONNX Runtime issue also shows a dependency trace including:

  • cublasLt64_12.dll
  • cublas64_12.dll
  • cudnn64_9.dll
  • MSVCP140.dll
  • VCRUNTIME140.dll
  • VCRUNTIME140_1.dll

See:

So the practical fix path is:

  1. update GPU driver
  2. repair/install latest MSVC redistributable
  3. make sure CUDA 12.x runtime DLLs are present and visible (a quick visibility check is sketched just after this list)
  4. make sure cuDNN 9.x is present and visible
  5. reinstall onnxruntime-gpu inside Portable Python
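
Before reinstalling anything, one cheap sanity check is to ask Windows whether it can even locate the DLLs named in the dependency trace above. This only tests visibility from the current Python process, not version compatibility, and should be run with the Portable embedded Python:

```python
# Hedged sketch: can this Python process find the CUDA / cuDNN / MSVC runtime DLLs?
import ctypes

for dll in ("cublasLt64_12.dll", "cublas64_12.dll", "cudnn64_9.dll",
            "MSVCP140.dll", "VCRUNTIME140.dll", "VCRUNTIME140_1.dll"):
    try:
        ctypes.WinDLL(dll)            # raises OSError if Windows cannot find/load the DLL
        print(f"OK       {dll}")
    except OSError:
        print(f"MISSING  {dll}")
```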

Option C — very good Windows workaround: use ORT DLL preloading

This is one of the strongest current workarounds because it is official ONNX Runtime behavior, not a random hack.

ORT says you can:

  • import torch before creating the ONNX Runtime session, or
  • call onnxruntime.preload_dlls()

to preload:

  • CUDA DLLs
  • cuDNN DLLs
  • MSVC runtime DLLs

And on Windows it searches:

  1. the PyTorch lib directory
  2. NVIDIA CUDA/cuDNN Python runtime packages
  3. then the default DLL search path

See:

Why this matters for you:

  • ComfyUI Portable already includes PyTorch
  • so the right DLLs may already exist
  • but ORT may not be finding them correctly after the update

This is one of the most attractive “minimal change” fixes.
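
As a hedged sketch of what that looks like in practice (run with the Portable embedded Python; preload_dlls only exists in reasonably recent onnxruntime releases, hence the guard):

```python
# Hedged sketch: surface the CUDA/cuDNN DLLs bundled with PyTorch before any ORT session exists.
import torch                      # importing torch first loads its bundled CUDA/cuDNN DLLs
import onnxruntime as ort

if hasattr(ort, "preload_dlls"):
    ort.preload_dlls()            # official ORT helper: preloads CUDA, cuDNN and MSVC runtime DLLs

print("onnxruntime", ort.__version__)
print("providers built into this package:", ort.get_available_providers())
```

Note that get_available_providers() only reports what the installed package was built with; the real test is still whether ReActor’s CUDA session starts without the DLL error.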


Option D — clean reinstall of ONNX Runtime in the Portable Python only

This is very feasible and very relevant.

Portable uses its own embedded Python:

  • python_embeded\python.exe

Official ComfyUI Portable docs:

So all package work should use that Python, not your system Python.

Check what is installed

Run:

python_embeded\python.exe -V
python_embeded\python.exe -m pip show onnxruntime onnxruntime-gpu

If the ONNX stack looks messy

A clean reinstall is reasonable:

python_embeded\python.exe -m pip uninstall -y onnxruntime onnxruntime-gpu
python_embeded\python.exe -m pip install --only-binary=:all: onnxruntime-gpu

That is a very realistic thing to try before wiping everything else.
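
After the reinstall, a quick check with the same embedded Python (this only confirms which package landed in that environment, not that the CUDA provider will actually load):

python_embeded\python.exe -c "import onnxruntime as ort; print(ort.__version__, ort.get_available_providers())"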


Option E — fallback compatibility route: use a known-good older combo

This is not my first choice, but it is a real-world fallback.

There are community reports of:

  • older Portable/Python combinations
  • older ONNX Runtime versions
  • and older CUDA/cuDNN combos

working when newer combinations did not.

Also, the official ComfyUI Portable docs themselves expose a meaningful split:

  • standard Nvidia portable = Python 3.13 + CUDA 13.0
  • alternative Nvidia portable = Python 3.12 + CUDA 12.6

See:

That is relevant because your current error is specifically asking for a CUDA 12-era DLL.

So if the modern stack keeps fighting you, a very reasonable fallback is:

  • a second Portable install
  • using the Python 3.12 + CUDA 12.6 variant
  • with a fresh ReActor/ORT setup

I would not start there, but I would absolutely keep it in reserve.


One especially relevant exact case online

The single most concrete “same symptom → fix” example I found is this:

  • a current issue where the user had the same missing cublasLt64_12.dll pattern
  • and the fix was to install CUDA 12.8

See:

That does not prove CUDA 12.8 is the exact answer for your machine.

But it strongly supports this interpretation:

when the runtime is asking for cublasLt64_12.dll, it wants a CUDA 12.x dependency chain, and a different CUDA major version may not satisfy it.


My actual recommendation order

If this were my machine, I would do this:

1. Confirm the actual Portable environment

Run:

python_embeded\python.exe -V
python_embeded\python.exe -m pip show onnxruntime onnxruntime-gpu

This tells you:

  • exact Python version
  • exact ONNX package/version
  • whether you are really using the embedded environment

2. Do one CPU-only diagnostic run

Not because CPU is your final answer.

Do it because it answers the most important question:

“Is the graph still okay?”

If CPU works, your graph is basically vindicated.


3. Repair the GPU runtime chain

In this order:

  • update NVIDIA driver
  • repair/install latest MSVC redistributable
  • ensure CUDA 12.x runtime is available
  • ensure cuDNN 9.x is available if your ORT build expects it
  • reinstall onnxruntime-gpu inside Portable Python

4. Try ORT DLL preloading

This is one of the smartest next steps because it is:

  • official
  • current
  • Windows-specific
  • PyTorch-friendly

See:


5. Only then consider stepping back from Python 3.13

I would only do this if:

  • the CUDA/ORT fixes still leave the stack unstable
  • or ReActor remains fragile after the runtime issue is fixed

At that point, a clean Python 3.12 + CUDA 12.6 Portable side install becomes a very reasonable compatibility-first fallback.


What I would not do now

I would not keep changing:

  • ReActor settings
  • RIFE settings
  • frame counts inside the graph
  • video combine settings
  • face-swap thresholds

Those are downstream knobs.

Right now, your main blocker is a runtime failure, not a workflow-tuning problem.


My bottom-line view

This still looks fixable.

The encouraging part is:

  • your workflow really did work
  • your core idea was sound
  • the current failure looks environmental, not conceptual

So I would think of this as:

restore a working ONNX Runtime GPU path first, then your face-swap workflow has a good chance of working again

rather than:

“I need to redesign the whole graph from zero.”

That is a much better kind of problem to have.


The exact links I would keep open

BTW, here’s a quick tip:
“Don’t assume that the latest versions of Python or CUDA are the best for open-source generative AI. In fact, in most cases, they’re the worst! It’s better to use a slightly older version, the same one the developers are using.” (Right now, Python 3.12 is the safest bet. You’ll rarely run into problems with Python 3.10 either. 3.9 is too old. With 3.13 and later, there are quite a few unsupported libraries. 3.14? That’s a real adventure!)


Thought I’d cracked it.

1st I2V clip was 6 sec long, 12 FPS. Ran it through my 2nd w/f (ReActor > video combine), set to 12 FPS. Absolutely perfect output.

2nd I2V clip was 6 sec long, 12 FPS. Ran it through my 2nd w/f (ReActor > video combine), set to 12 FPS. Perfect face swap, but it misses frames about 3 secs in and jumps. Tried adjusting FPS, but that makes no difference other than slowing down or speeding up the clip, so I get the face swap but lose the smooth motion. Grrrr.

Makes no sense to me.

Lost again.