## ⚠️ Ethical Use & Disclaimer
This model is a technical tool designed for Digital Identity Research, Professional VFX Workflows, and Cinematic Prototyping.
By downloading or using this LoRA, you acknowledge and agree to the following:
- Intended Use: Designed for filmmakers, VFX artists, and researchers exploring high-fidelity video identity transformation.
- Consent & Rights: You must possess explicit legal consent and all necessary rights from any individual whose likeness is being processed.
- Legal Compliance: You are fully responsible for complying with all local and international laws regarding synthetic media.
- Liability Waiver: This model is provided "as is." As the creator (Alissonerdx), I assume no responsibility for misuse. Any legal, ethical, or social consequences are solely the responsibility of the end user.
## 📺 Video Examples
### V1 Examples
Generated using the Frame 0 Anchoring Technique. All examples follow the guide video's motion while preserving the identity provided in the first frame.
| Example 1 | Example 2 |
|---|---|
| Example 3 | Example 4 |
| Example 5 | |
### V3 Examples
If you want to see the full setup in practice, watch here:
https://www.youtube.com/watch?v=HBp03iu7wLA
The following examples demonstrate the new persistent-template workflow used in V3:
| Example 6 | Example 7 |
|---|---|
| Example 8 | |
The image references for the versions are stored under:
ltx-2.3/...
## 🔍 Technical Background (V1)
To achieve this level of identity transfer, I heavily modified the official LTX-2 training scripts.
### Key Improvements
- Novel Conditioning Injection: Custom latent injection methods for reference identity stabilization.
- Noise Distribution Overhaul: Implemented a custom High-Noise Power Law timestep distribution, forcing the model to prioritize target identity reconstruction over guide-video context.
- Training Compute: 60+ hours of training on NVIDIA RTX PRO 6000 Blackwell GPUs, iterating through 300GB+ of experimental checkpoints.
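The exact timestep distribution used in training is not published. As an illustration only, a "high-noise power law" sampler could be sketched as below: uniform draws are warped toward t = 1.0 (assumed here to be the high-noise end), so training spends more steps where the model must reconstruct the target identity rather than copy guide-video context. The function name and the `alpha` exponent are assumptions, not the author's actual implementation.

```python
import numpy as np

def sample_timesteps_power_law(batch_size, alpha=2.0, rng=None):
    """Sample normalized timesteps in [0, 1] biased toward high noise.

    Warping u -> u**(1/(alpha+1)) gives density (alpha+1) * t**alpha,
    which concentrates samples near t = 1.0 for alpha > 0.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.random(batch_size)
    return u ** (1.0 / (alpha + 1.0))
```

With `alpha=2.0` the expected timestep is (alpha+1)/(alpha+2) = 0.75, i.e. strongly skewed toward the noisy end compared with a uniform schedule's 0.5.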
## 📊 Dataset Specifications
### V1 Dataset
- 300 high-quality head swap video pairs
- Trained on 512x512 buckets
- Primarily landscape format
- Optimized for close-up framing
Wide shots may reduce identity fidelity.
## 💡 Inference Guide (V1)
### 🔴 CRITICAL – Frame 0 Requirement
This version was trained to use Frame 0 as the identity anchor.
You must prepare the first frame correctly.
### Recommended Workflow
- Perform a high-quality head swap on Frame 0.
- Use that processed frame as conditioning input.
- Run the full video generation.
For best results, prepare Frame 0 using my previous BFS Image Models.
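The data-preparation half of the steps above can be sketched with plain arrays. This is a minimal illustration, not the actual workflow code: the function name is hypothetical, frames are assumed to be a `(T, H, W, C)` uint8 array, and wiring the result into the video pipeline is workflow-specific.

```python
import numpy as np

def anchor_first_frame(guide_frames: np.ndarray, swapped_frame0: np.ndarray) -> np.ndarray:
    """Replace frame 0 of the guide clip with the head-swapped frame.

    guide_frames: (T, H, W, C) uint8 video frames.
    swapped_frame0: (H, W, C) frame produced by an image head-swap model.
    Returns a copy of the clip whose first frame carries the new identity,
    ready to be used as the conditioning input for generation.
    """
    if swapped_frame0.shape != guide_frames.shape[1:]:
        raise ValueError("swapped frame must match guide resolution")
    out = guide_frames.copy()
    out[0] = swapped_frame0
    return out
```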
### Optimization
#### LoRA Strength
- 1.0 → Best motion fidelity
- >1.0 → Stronger identity and hair capture, but may distort the original motion
#### Multi-Pass Workflows
You can experiment with multiple passes using different strengths.
#### Prompting
Detailed prompts currently have no effect; the trigger remains `head swap`.
## ⚠️ Known Issues (V1 – Alpha)
- Identity Leakage: Hair from the guide video may reappear.
- Hard Cuts: Jump cuts can reset identity.
- Portrait Format: Performance is significantly better in landscape.
## 🚀 Version 2 – Major Update
V2 introduces a complete redesign of conditioning strategy and masking logic, significantly improving identity robustness and reducing leakage.
### 🔹 Multiple Conditioning Modes (Using First Frame)
V2 supports multiple identity injection approaches:
#### 1️⃣ Direct Photo Conditioning
Use a clean photo of the new face as reference input.
This method works and can produce strong results. However, because the model must internally reconcile lighting, perspective, depth, and occlusion differences, it may need to fight to correctly integrate the new identity into the guide video. In some cases, this can reduce stability or identity consistency.
#### 2️⃣ First-Frame Head Swap (Recommended)
Applying a proper head swap on Frame 0 still produces extremely strong and reliable results.
Because the first frame is already structurally correct (pose, lighting, depth, and occlusions), the model has significantly less work to do. Instead of forcing alignment from a static photo, it simply propagates and stabilizes the identity through time.
This approach generally:
- Produces higher identity fidelity
- Reduces deformation
- Minimizes integration artifacts
- Improves overall temporal stability
#### 3️⃣ Automatic Magazine-Style Overlay
The new face is automatically cut and positioned over the guide face using mask alignment. This simulates a magazine-cutout-style overlay, but performed automatically based on mask positioning.
#### 4️⃣ Manual Overlay
Advanced users may manually composite the new face over Frame 0 before running inference.
### 🔹 Facial Motion Behavior (Important Change)
Unlike V1:
V2 does not follow the original guide face's facial micro-movements.
The guide face is fully masked to prevent identity leakage.
This makes masking quality critical.
#### Mask Requirements
- The guide face must be completely covered.
- Mask color must be a magenta tone.
- Any visible guide identity may leak into the final output.
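The masking requirement above can be sketched in a few lines. This is a hedged illustration, not the workflow's own node code: the exact magenta tone your setup expects is configured on the ComfyUI side, and (255, 0, 255) here is an assumption; the function name is hypothetical.

```python
import numpy as np

MAGENTA = np.array([255, 0, 255], dtype=np.uint8)  # assumed tone; match your workflow's mask color

def mask_guide_face(frame: np.ndarray, face_mask: np.ndarray) -> np.ndarray:
    """Paint the guide face solid magenta so no source identity survives.

    frame: (H, W, 3) uint8 guide frame.
    face_mask: (H, W) bool, True over the ENTIRE guide head. Err on the
    side of over-covering: any visible pixel of the original identity
    can leak into the final output.
    """
    out = frame.copy()
    out[face_mask] = MAGENTA
    return out
```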
### 🔹 Mask Types
Users can choose between two mask styles:
#### ▪ Square Masks
- More stable identity
- Better consistency
- Often produce stronger overall results
- May generate slightly oversized heads due to spatial padding
In most scenarios, square masks tend to perform better because they provide additional spatial context for the model to reconstruct structure and hair.
#### ▪ Tight / Adjusted Masks
- More natural head proportions
- May deform if guide head shape differs significantly
- Sensitive to long-hair mismatches
If the original guide has long hair and the new identity does not, deformation risk increases.
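The trade-off between the two mask styles can be illustrated by deriving a padded square mask from a tight one. This is a hypothetical sketch (function name and `pad` fraction are assumptions): the square variant gives up exact head proportions in exchange for the extra spatial context that the notes above report as more stable.

```python
import numpy as np

def square_mask_from_tight(face_mask: np.ndarray, pad: float = 0.15) -> np.ndarray:
    """Expand a tight boolean face mask into a padded square mask.

    Takes the tight mask's bounding box, grows it to a square centered on
    the same point, then pads each side by `pad` of the square's length,
    leaving room for the model to reconstruct structure and hair.
    """
    ys, xs = np.nonzero(face_mask)
    if ys.size == 0:
        return face_mask.copy()
    cy, cx = (ys.min() + ys.max()) / 2.0, (xs.min() + xs.max()) / 2.0
    side = max(ys.max() - ys.min(), xs.max() - xs.min()) + 1
    half = int(round(side * (1.0 + pad) / 2.0))
    out = np.zeros_like(face_mask)
    y0, y1 = max(0, int(cy) - half), min(face_mask.shape[0], int(cy) + half + 1)
    x0, x1 = max(0, int(cx) - half), min(face_mask.shape[1], int(cx) + half + 1)
    out[y0:y1, x0:x1] = True
    return out
```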
### 🔹 Dataset & Training Improvements (V2)
- 800+ video pairs
- Trained at 768 resolution
- 768 is the recommended inference resolution
- Improved hair stability
- Reduced identity leakage compared to V1
- More robust identity transfer under motion
### 🔹 First Pass vs Second Pass
You may:
- Run a single pass at 768 (recommended)
- Or run a downscaled first pass plus a second upscale pass
⚠️ Important: a second pass may alter the identity established in the first pass and reduce consistency in some cases.
### 🔹 Trigger
The trigger remains `head swap`.
## 🚀 Version 3 – Persistent Template Workflow
V3 introduces a new persistent-template conditioning workflow.
Unlike previous versions, which established the identity from Frame 0 only, V3 uses a custom guide-video construction step that keeps the new face visible throughout the entire guide sequence.
This results in a much stronger and more persistent identity signal during inference.
## 🙏 Acknowledgements
Special thanks to facy.ai for sponsoring the GPU used to train this model.
If you want to check their platform, you can use my referral link:
### 🔹 How V3 Works
V3 uses a custom node from ComfyUI-BFSNodes to prepare the guide video before inference.
Repository:
https://github.com/alisson-anjos/ComfyUI-BFSNodes
Workflow file:
`workflows/workflow_ltx2_head_swap_drag_and_drop_v3.0`
The guide-video preparation process works like this:
- Start from the original guide video
- Add a vertical green chroma-key strip on the side
- Place the reference face image inside that strip
- Apply this composition to every frame of the original video
- Use this new composite video as the actual inference guide
This means the new identity remains fully visible during all frames of the guide video, instead of appearing only in Frame 0 like in previous versions.
That is the main reason V3 can achieve better consistency than earlier versions.
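The five preparation steps above can be sketched with plain arrays. This is an illustration only: the real construction is performed by the ComfyUI-BFSNodes node, and the chroma color, strip width, and vertical centering used here are assumptions (the sketch also assumes the reference face fits inside the frame height and strip width).

```python
import numpy as np

GREEN = np.array([0, 255, 0], dtype=np.uint8)  # assumed chroma tone

def build_persistent_template(guide_frames, ref_face, strip_w=None):
    """Append a green side strip holding the reference face to every frame.

    guide_frames: (T, H, W, 3) uint8 guide video.
    ref_face: (h, w, 3) uint8 reference identity image.
    Returns a (T, H, W + strip_w, 3) composite video in which the new
    identity stays visible in all frames, used as the actual inference guide.
    """
    T, H, W, _ = guide_frames.shape
    h, w = ref_face.shape[:2]
    strip_w = w if strip_w is None else strip_w
    strip = np.broadcast_to(GREEN, (H, strip_w, 3)).copy()
    y0 = (H - h) // 2
    strip[y0:y0 + h, :w] = ref_face  # paste the reference face into the strip
    # Repeat the identical strip across all T frames and attach it on the side.
    return np.concatenate(
        [guide_frames, np.broadcast_to(strip, (T, H, strip_w, 3))], axis=2
    )
```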
### 🔹 Why V3 Is Different
Because the identity reference stays visible during the full guide sequence, V3 gives the model a much more stable conditioning signal across time.
In practice, this can improve:
- Identity consistency
- Temporal stability
- Resistance to identity drift
- Facial motion continuity
- Lip sync behavior
- Expressive facial movement preservation
This version is especially useful for shots where the face remains visible for longer periods, or where dialogue, mouth movement, and facial acting matter more.
V3 is not just a refinement of the first-frame method. It changes the conditioning logic by giving the model access to a persistent identity template across the entire inference sequence.
### 🔹 Final Output Behavior
Even though the guide video used during inference contains the vertical chroma-key side strip, the final generated result does not include that strip.
The generated video is returned in the original resolution and framing of the source guide video.
So in practice:
- The green side strip exists only in the internal guide/template video
- It is used only to improve inference conditioning
- It does not appear in the final output
### 🔹 Prompting for V3
For V3, users can also pass the composite guide video into a vision-capable model to extract a structured prompt.
This is useful because the composite video contains two different information sources:
- the reference identity inside the side strip
- the performance and scene information in the main video area
This helps keep identity and action description separated more cleanly.
#### Recommended Prompt Template
Analyze this composite video.
The video contains:
1. a side chroma-key panel with a reference face image
2. a main performance video showing the body, clothing, movement, hand actions, objects, framing, and environment
Your task is to extract:
- the target face identity from the side panel
- the performance/action from the main video
Critical rules:
- The side-panel face is the only valid source for identity traits and head-level accessories.
- Ignore the visible face and head appearance in the main video completely.
- Do not describe any face, hair, hairstyle, hair color, eye color, makeup, facial features, facial expression, attractiveness, headwear, hood, hat, or accessories from the main video.
- In the ACTION section, describe the performer only as "a person" and focus only on body movement, clothing, hand actions, objects, framing, and environment.
- Do not mention the chroma panel, green background, split layout, or editing structure.
- Be factual and non-creative.
- Do not guess uncertain details. If a detail is not clearly visible, omit it.
Return exactly in this format:
head_swap:
FACE:
A brief but detailed objective identity description from the side-panel face only. Include, when clearly visible: apparent gender, apparent ethnicity, skin tone or complexion, approximate age range, head shape, hair or baldness pattern, hair color, eye color, facial hair, visible skin details, headwear or head covering, visible facial accessories, and any especially distinctive facial trait. Prioritize the eyes when they are a strong defining feature.
ACTION:
A concise performance description from the main video. Include only: visible clothing, body position, movement, hand actions, objects being shown or handled, camera-facing behavior, framing, and environment. Do not include any face or head appearance from the main video.
Good example:
FACE:
Female, fair skin, approximately 20-30 years old, oval head shape, long wavy vivid blue-violet hair, bright golden-amber eyes with dark defined pupils, no facial hair, smooth skin, and pink flower hair accessories as a distinctive head adornment.
ACTION:
A person in a dark top faces the camera indoors, holds a package of false eyelashes close to the lens, peels one lash from the backing, brings it near the eye area, and examines it while making small hand movements.
Bad example:
ACTION:
A person with long curly blonde braids holds a pair of false eyelashes...
#### How to Use
- Generate the V3 composite guide video using the node
- Pass that composite video into a vision-capable model
- Extract the structured FACE and ACTION prompt
- Use that output as the base prompt for the V3 workflow
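Once the vision model replies in the template's `head_swap: / FACE: / ACTION:` format, the sections can be split apart before assembling the final prompt. This parser is a hypothetical helper (function name and return shape are assumptions), tolerant of text appearing either on the same line as the labels or on the following lines.

```python
def parse_head_swap_prompt(text: str) -> dict:
    """Split a vision model's reply into FACE and ACTION sections.

    Expects the template format:
        head_swap:
        FACE:
        <identity description>
        ACTION:
        <performance description>
    Returns {'face': ..., 'action': ...} with whitespace trimmed.
    """
    face, action = "", ""
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.upper().startswith("FACE:"):
            current = "face"
            stripped = stripped[5:].strip()
        elif stripped.upper().startswith("ACTION:"):
            current = "action"
            stripped = stripped[7:].strip()
        elif stripped.lower().startswith("head_swap"):
            continue  # header line carries no content
        if current == "face" and stripped:
            face += stripped + " "
        elif current == "action" and stripped:
            action += stripped + " "
    return {"face": face.strip(), "action": action.strip()}
```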
### 🔹 Captions / Descriptions for V3
If you want automatic captions or prompt extraction from video, you can also use my Ollama nodes.
Repository:
https://github.com/alisson-anjos/ComfyUI-Ollama-Describer
A useful node for this workflow is:
Ollama Video Describer
This can help generate structured descriptions from the composite guide video and make it easier to build the final prompt for V3.
### 🔹 V3 Trigger
Trigger remains:
head_swap:
FACE:
....
ACTION:
....
## 🔴 Critical Success Factor (V2 / V3)
Mask and preparation quality still matter enormously.
Even with improved conditioning, final quality depends on:
- Proper face coverage
- Clean compositing
- Strong alignment
- Good source and reference quality
If any portion of the original guide identity remains visible where it should not, the model may still reintroduce unwanted traits.
Take time to refine your inputs. Better preparation consistently produces better output than simply increasing LoRA strength.
## 🔧 Advanced Technique: Combine with LTX-2 Inpainting
Advanced users can experiment with combining this LoRA with the native LTX-2 inpainting workflow.
This can help:
- Refine problematic areas
- Correct small deformation zones
- Improve edge blending
- Recover detail in hair or jaw regions
When properly combined, inpainting can significantly enhance final output quality, especially in challenging frames.
### 🔹 Recommendation
I strongly recommend testing both LoRAs and comparing the final behavior.
Depending on the guide clip, framing, facial motion, and the kind of result you want, some users may prefer the look or motion style of one version over the other.
In general:
- V2 may still be preferred for some first-frame-driven workflows
- V3 is better when you want a stronger persistent identity signal, better consistency, and better facial/lip motion continuity
The best version will often depend on the shot and on personal preference.
## 💜 Support
Maintaining R&D and renting Blackwell GPUs is expensive.
If this project helps you, consider supporting the development of:
- V3 improvements
- Advanced conditioning pipelines
- SAM 3 integration
- Full reference-photo-only workflows
Support here:
Model: botp/BFS-Best-Face-Swap-Video (base model: Lightricks/LTX-2)