## ⚠️ Ethical Use & Disclaimer
This model is a technical tool designed for Digital Identity Research, Professional VFX Workflows, and Cinematic Prototyping.
By downloading or using this LoRA, you acknowledge and agree to the following:
- Intended Use: Designed for filmmakers, VFX artists, and researchers exploring high-fidelity video identity transformation.
- Consent & Rights: You must possess explicit legal consent and all necessary rights from any individual whose likeness is being processed.
- Legal Compliance: You are fully responsible for complying with all local and international laws regarding synthetic media.
- Liability Waiver: This model is provided "as is." As the creator (Alissonerdx), I assume no responsibility for misuse. Any legal, ethical, or social consequences are solely the responsibility of the end user.
## 📺 Video Examples
### V1 Examples
Generated using the Frame 0 Anchoring Technique. All examples follow the guide video's motion while preserving the identity provided in the first frame.
| Example 1 | Example 2 |
|---|---|
| Example 3 | Example 4 |
| Example 5 | |
### V3 Examples
If you want to see the full setup in practice, watch here:
https://www.youtube.com/watch?v=HBp03iu7wLA
The following examples demonstrate the new persistent-template workflow used in V3:
| Example 6 | Example 7 |
|---|---|
| Example 8 | |
The image references for the versions are stored under:
ltx-2.3/...
## 🔍 Technical Background (V1)
To achieve this level of identity transfer, I heavily modified the official LTX-2 training scripts.
### Key Improvements
- Novel Conditioning Injection: Custom latent injection methods for reference identity stabilization.
- Noise Distribution Overhaul: Implemented a custom High-Noise Power Law timestep distribution, forcing the model to prioritize target identity reconstruction over guide-video context.
- Training Compute: 60+ hours of training on NVIDIA RTX PRO 6000 Blackwell GPUs, iterating through 300GB+ of experimental checkpoints.
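The exact timestep distribution used in training is not published. As an illustration only, a "high-noise power law" sampler could be sketched as below: uniform draws are warped toward t = 1.0 (assumed here to be the high-noise end), so training spends more steps where the model must reconstruct the target identity rather than copy guide-video context. The function name and the `alpha` exponent are assumptions, not the author's actual implementation.

```python
import numpy as np

def sample_timesteps_power_law(batch_size, alpha=2.0, rng=None):
    """Sample normalized timesteps in [0, 1] biased toward high noise.

    Warping u -> u**(1/(alpha+1)) gives density (alpha+1) * t**alpha,
    which concentrates samples near t = 1.0 for alpha > 0.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.random(batch_size)
    return u ** (1.0 / (alpha + 1.0))
```

With `alpha=2.0` the expected timestep is (alpha+1)/(alpha+2) = 0.75, i.e. strongly skewed toward the noisy end compared with a uniform schedule's 0.5.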
## 📊 Dataset Specifications
### V1 Dataset
- 300 high-quality head swap video pairs
- Trained on 512x512 buckets
- Primarily landscape format
- Optimized for close-up framing
Wide shots may reduce identity fidelity.
## 💡 Inference Guide (V1)
### 🔴 CRITICAL – Frame 0 Requirement
This version was trained to use Frame 0 as the identity anchor.
You must prepare the first frame correctly.
### Recommended Workflow
- Perform a high-quality head swap on Frame 0.
- Use that processed frame as conditioning input.
- Run the full video generation.
For best results, prepare Frame 0 using my previous BFS Image Models.
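The data-preparation half of the steps above can be sketched with plain arrays. This is a minimal illustration, not the actual workflow code: the function name is hypothetical, frames are assumed to be a `(T, H, W, C)` uint8 array, and wiring the result into the video pipeline is workflow-specific.

```python
import numpy as np

def anchor_first_frame(guide_frames: np.ndarray, swapped_frame0: np.ndarray) -> np.ndarray:
    """Replace frame 0 of the guide clip with the head-swapped frame.

    guide_frames: (T, H, W, C) uint8 video frames.
    swapped_frame0: (H, W, C) frame produced by an image head-swap model.
    Returns a copy of the clip whose first frame carries the new identity,
    ready to be used as the conditioning input for generation.
    """
    if swapped_frame0.shape != guide_frames.shape[1:]:
        raise ValueError("swapped frame must match guide resolution")
    out = guide_frames.copy()
    out[0] = swapped_frame0
    return out
```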
### Optimization
#### LoRA Strength
- 1.0 → Best motion fidelity
- >1.0 → Stronger identity and hair capture, but may distort the original motion
#### Multi-Pass Workflows
You can experiment with multiple passes using different strengths.
#### Prompting
Detailed prompts currently have no effect; the trigger remains `head swap`.
## ⚠️ Known Issues (V1 – Alpha)
- Identity Leakage: Hair from the guide video may reappear.
- Hard Cuts: Jump cuts can reset identity.
- Portrait Format: Performance is significantly better in landscape.
## 🚀 Version 2 – Major Update
V2 introduces a complete redesign of conditioning strategy and masking logic, significantly improving identity robustness and reducing leakage.
### 🔹 Multiple Conditioning Modes (Using First Frame)
V2 supports multiple identity injection approaches:
#### 1️⃣ Direct Photo Conditioning
Use a clean photo of the new face as reference input.
This method works and can produce strong results. However, because the model must internally reconcile lighting, perspective, depth, and occlusion differences, it may need to fight to correctly integrate the new identity into the guide video. In some cases, this can reduce stability or identity consistency.
#### 2️⃣ First-Frame Head Swap (Recommended)
Applying a proper head swap on Frame 0 still produces extremely strong and reliable results.
Because the first frame is already structurally correct (pose, lighting, depth, and occlusions), the model has significantly less work to do. Instead of forcing alignment from a static photo, it simply propagates and stabilizes the identity through time.
This approach generally:
- Produces higher identity fidelity
- Reduces deformation
- Minimizes integration artifacts
- Improves overall temporal stability
#### 3️⃣ Automatic Magazine-Style Overlay
The new face is automatically cut and positioned over the guide face using mask alignment. This simulates a magazine-cutout-style overlay, but performed automatically based on mask positioning.
#### 4️⃣ Manual Overlay
Advanced users may manually composite the new face over Frame 0 before running inference.
### 🔹 Facial Motion Behavior (Important Change)
Unlike V1:
V2 does not follow the original guide face's facial micro-movements.
The guide face is fully masked to prevent identity leakage.
This makes masking quality critical.
#### Mask Requirements
- The guide face must be completely covered.
- Mask color must be a magenta tone.
- Any visible guide identity may leak into the final output.
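The masking requirement above can be sketched in a few lines. This is a hedged illustration, not the workflow's own node code: the exact magenta tone your setup expects is configured on the ComfyUI side, and (255, 0, 255) here is an assumption; the function name is hypothetical.

```python
import numpy as np

MAGENTA = np.array([255, 0, 255], dtype=np.uint8)  # assumed tone; match your workflow's mask color

def mask_guide_face(frame: np.ndarray, face_mask: np.ndarray) -> np.ndarray:
    """Paint the guide face solid magenta so no source identity survives.

    frame: (H, W, 3) uint8 guide frame.
    face_mask: (H, W) bool, True over the ENTIRE guide head. Err on the
    side of over-covering: any visible pixel of the original identity
    can leak into the final output.
    """
    out = frame.copy()
    out[face_mask] = MAGENTA
    return out
```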
### 🔹 Mask Types
Users can choose between two mask styles:
#### ▪ Square Masks
- More stable identity
- Better consistency
- Often produce stronger overall results
- May generate slightly oversized heads due to spatial padding
In most scenarios, square masks tend to perform better because they provide additional spatial context for the model to reconstruct structure and hair.
#### ▪ Tight / Adjusted Masks
- More natural head proportions
- May deform if guide head shape differs significantly
- Sensitive to long-hair mismatches
If the original guide has long hair and the new identity does not, deformation risk increases.
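The trade-off between the two mask styles can be illustrated by deriving a padded square mask from a tight one. This is a hypothetical sketch (function name and `pad` fraction are assumptions): the square variant gives up exact head proportions in exchange for the extra spatial context that the notes above report as more stable.

```python
import numpy as np

def square_mask_from_tight(face_mask: np.ndarray, pad: float = 0.15) -> np.ndarray:
    """Expand a tight boolean face mask into a padded square mask.

    Takes the tight mask's bounding box, grows it to a square centered on
    the same point, then pads each side by `pad` of the square's length,
    leaving room for the model to reconstruct structure and hair.
    """
    ys, xs = np.nonzero(face_mask)
    if ys.size == 0:
        return face_mask.copy()
    cy, cx = (ys.min() + ys.max()) / 2.0, (xs.min() + xs.max()) / 2.0
    side = max(ys.max() - ys.min(), xs.max() - xs.min()) + 1
    half = int(round(side * (1.0 + pad) / 2.0))
    out = np.zeros_like(face_mask)
    y0, y1 = max(0, int(cy) - half), min(face_mask.shape[0], int(cy) + half + 1)
    x0, x1 = max(0, int(cx) - half), min(face_mask.shape[1], int(cx) + half + 1)
    out[y0:y1, x0:x1] = True
    return out
```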
### 🔹 Dataset & Training Improvements (V2)
- 800+ video pairs
- Trained at 768 resolution
- 768 is the recommended inference resolution
- Improved hair stability
- Reduced identity leakage compared to V1
- More robust identity transfer under motion
### 🔹 First Pass vs Second Pass
You may:
- Run a single pass at 768 (recommended)
- Or run a downscaled first pass plus a second upscale pass
⚠️ Important: a second pass may alter the identity established in the first pass and reduce consistency in some cases.
### 🔹 Trigger
The trigger remains `head swap`.
## 🚀 Version 3 – Persistent Template Workflow
V3 introduces a new persistent-template conditioning workflow.
Unlike previous versions, which established the identity from Frame 0 only, V3 uses a custom guide-video construction step that keeps the new face visible throughout the entire guide sequence.
This results in a much stronger and more persistent identity signal during inference.
## 🙏 Acknowledgements
Special thanks to facy.ai for sponsoring the GPU used to train this model.
If you want to check their platform, you can use my referral link:
### 🔹 How V3 Works
V3 uses a custom node from ComfyUI-BFSNodes to prepare the guide video before inference.
Repository:
https://github.com/alisson-anjos/ComfyUI-BFSNodes
Workflow file:
`workflows/workflow_ltx2_head_swap_drag_and_drop_v3.0`
The guide-video preparation process works like this:
- Start from the original guide video
- Add a vertical green chroma-key strip on the side
- Place the reference face image inside that strip
- Apply this composition to every frame of the original video
- Use this new composite video as the actual inference guide
This means the new identity remains fully visible during all frames of the guide video, instead of appearing only in Frame 0 like in previous versions.
That is the main reason V3 can achieve better consistency than earlier versions.
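The five preparation steps above can be sketched with plain arrays. This is an illustration only: the real construction is performed by the ComfyUI-BFSNodes node, and the chroma color, strip width, and vertical centering used here are assumptions (the sketch also assumes the reference face fits inside the frame height and strip width).

```python
import numpy as np

GREEN = np.array([0, 255, 0], dtype=np.uint8)  # assumed chroma tone

def build_persistent_template(guide_frames, ref_face, strip_w=None):
    """Append a green side strip holding the reference face to every frame.

    guide_frames: (T, H, W, 3) uint8 guide video.
    ref_face: (h, w, 3) uint8 reference identity image.
    Returns a (T, H, W + strip_w, 3) composite video in which the new
    identity stays visible in all frames, used as the actual inference guide.
    """
    T, H, W, _ = guide_frames.shape
    h, w = ref_face.shape[:2]
    strip_w = w if strip_w is None else strip_w
    strip = np.broadcast_to(GREEN, (H, strip_w, 3)).copy()
    y0 = (H - h) // 2
    strip[y0:y0 + h, :w] = ref_face  # paste the reference face into the strip
    # Repeat the identical strip across all T frames and attach it on the side.
    return np.concatenate(
        [guide_frames, np.broadcast_to(strip, (T, H, strip_w, 3))], axis=2
    )
```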
### 🔹 Why V3 Is Different
Because the identity reference stays visible during the full guide sequence, V3 gives the model a much more stable conditioning signal across time.
In practice, this can improve:
- Identity consistency
- Temporal stability
- Resistance to identity drift
- Facial motion continuity
- Lip sync behavior
- Expressive facial movement preservation
This version is especially useful for shots where the face remains visible for longer periods, or where dialogue, mouth movement, and facial acting matter more.
V3 is not just a refinement of the first-frame method. It changes the conditioning logic by giving the model access to a persistent identity template across the entire inference sequence.
### 🔹 Final Output Behavior
Even though the guide video used during inference contains the vertical chroma-key side strip, the final generated result does not include that strip.
The generated video is returned in the original resolution and framing of the source guide video.
So in practice:
- The green side strip exists only in the internal guide/template video
- It is used only to improve inference conditioning
- It does not appear in the final output
### 🔹 Prompting for V3
For V3, users can also pass the composite guide video into a vision-capable model to extract a structured prompt.
This is useful because the composite video contains two different information sources:
- the reference identity inside the side strip
- the performance and scene information in the main video area
This helps keep identity and action description separated more cleanly.
#### Recommended Prompt Template
Analyze this composite video.
The video contains:
1. a side chroma-key panel with a reference face image
2. a main performance video showing the body, clothing, movement, hand actions, objects, framing, and environment
Your task is to extract:
- the target face identity from the side panel
- the performance/action from the main video
Critical rules:
- The side-panel face is the only valid source for identity traits and head-level accessories.
- Ignore the visible face and head appearance in the main video completely.
- Do not describe any face, hair, hairstyle, hair color, eye color, makeup, facial features, facial expression, attractiveness, headwear, hood, hat, or accessories from the main video.
- In the ACTION section, describe the performer only as "a person" and focus only on body movement, clothing, hand actions, objects, framing, and environment.
- Do not mention the chroma panel, green background, split layout, or editing structure.
- Be factual and non-creative.
- Do not guess uncertain details. If a detail is not clearly visible, omit it.
Return exactly in this format:
head_swap:
FACE:
A brief but detailed objective identity description from the side-panel face only. Include, when clearly visible: apparent gender, apparent ethnicity, skin tone or complexion, approximate age range, head shape, hair or baldness pattern, hair color, eye color, facial hair, visible skin details, headwear or head covering, visible facial accessories, and any especially distinctive facial trait. Prioritize the eyes when they are a strong defining feature.
ACTION:
A concise performance description from the main video. Include only: visible clothing, body position, movement, hand actions, objects being shown or handled, camera-facing behavior, framing, and environment. Do not include any face or head appearance from the main video.
Good example:
FACE:
Female, fair skin, approximately 20-30 years old, oval head shape, long wavy vivid blue-violet hair, bright golden-amber eyes with dark defined pupils, no facial hair, smooth skin, and pink flower hair accessories as a distinctive head adornment.
ACTION:
A person in a dark top faces the camera indoors, holds a package of false eyelashes close to the lens, peels one lash from the backing, brings it near the eye area, and examines it while making small hand movements.
Bad example:
ACTION:
A person with long curly blonde braids holds a pair of false eyelashes...
#### How to Use
- Generate the V3 composite guide video using the node
- Pass that composite video into a vision-capable model
- Extract the structured FACE and ACTION prompt
- Use that output as the base prompt for the V3 workflow
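Once the vision model replies in the template's `head_swap: / FACE: / ACTION:` format, the sections can be split apart before assembling the final prompt. This parser is a hypothetical helper (function name and return shape are assumptions), tolerant of text appearing either on the same line as the labels or on the following lines.

```python
def parse_head_swap_prompt(text: str) -> dict:
    """Split a vision model's reply into FACE and ACTION sections.

    Expects the template format:
        head_swap:
        FACE:
        <identity description>
        ACTION:
        <performance description>
    Returns {'face': ..., 'action': ...} with whitespace trimmed.
    """
    face, action = "", ""
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.upper().startswith("FACE:"):
            current = "face"
            stripped = stripped[5:].strip()
        elif stripped.upper().startswith("ACTION:"):
            current = "action"
            stripped = stripped[7:].strip()
        elif stripped.lower().startswith("head_swap"):
            continue  # header line carries no content
        if current == "face" and stripped:
            face += stripped + " "
        elif current == "action" and stripped:
            action += stripped + " "
    return {"face": face.strip(), "action": action.strip()}
```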
### 🔹 Captions / Descriptions for V3
If you want automatic captions or prompt extraction from video, you can also use my Ollama nodes.
Repository:
https://github.com/alisson-anjos/ComfyUI-Ollama-Describer
A useful node for this workflow is:
Ollama Video Describer
This can help generate structured descriptions from the composite guide video and make it easier to build the final prompt for V3.
### 🔹 V3 Trigger
Trigger remains:
head_swap:
FACE:
....
ACTION:
....
## 🔴 Critical Success Factor (V2 / V3)
Mask and preparation quality still matter enormously.
Even with improved conditioning, final quality depends on:
- Proper face coverage
- Clean compositing
- Strong alignment
- Good source and reference quality
If any portion of the original guide identity remains visible where it should not, the model may still reintroduce unwanted traits.
Take time to refine your inputs. Better preparation consistently produces better output than simply increasing LoRA strength.
## 🔧 Advanced Technique: Combine with LTX-2 Inpainting
Advanced users can experiment with combining this LoRA with the native LTX-2 inpainting workflow.
This can help:
- Refine problematic areas
- Correct small deformation zones
- Improve edge blending
- Recover detail in hair or jaw regions
When properly combined, inpainting can significantly enhance final output quality, especially in challenging frames.
### 🔹 Recommendation
I strongly recommend testing both LoRAs and comparing the final behavior.
Depending on the guide clip, framing, facial motion, and the kind of result you want, some users may prefer the look or motion style of one version over the other.
In general:
- V2 may still be preferred for some first-frame-driven workflows
- V3 is better when you want a stronger persistent identity signal, better consistency, and better facial/lip motion continuity
The best version will often depend on the shot and on personal preference.
## 💜 Support
Maintaining R&D and renting Blackwell GPUs is expensive.
If this project helps you, consider supporting the development of:
- V3 improvements
- Advanced conditioning pipelines
- SAM 3 integration
- Full reference-photo-only workflows
Support here:
Model: botp/BFS-Best-Face-Swap-Video (base model: Lightricks/LTX-2)