# Modification List (Before/After Comparison)

## Scope
Only minimal "get it running" fixes; the original logic and structure are preserved wherever possible.

## 1) `multi-shot/multi_view/datasets/videodataset.py`
**Define the missing variables while keeping the original return structure**

**Before**
```python
return {
    "global_caption": None,
    "shot_num": 3,
    "pre_shot_caption": ["xxx", "xxx", "xxx"],
    # "single_caption": meta_prompt["single_prompt"],
    "video": input_video,
    "ref_num": ID_num * 3, ###TODO: get the ID_num = 1 case running first
    "ID_num": ID_num,
    "ref_images": [[Image0, Image1, Image2]],
    "video_path": video_path
}
```

**After**
```python
ID_num = 1
Image0, Image1, Image2 = ref_images[:3]
return {
    "global_caption": None,
    "shot_num": 3,
    "pre_shot_caption": ["xxx", "xxx", "xxx"],
    # "single_caption": meta_prompt["single_prompt"],
    "video": input_video,
    "ref_num": ID_num * 3, ###TODO: get the ID_num = 1 case running first
    "ID_num": ID_num,
    "ref_images": [[Image0, Image1, Image2]],
    "video_path": video_path
}
```
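The `After` block assumes `ref_images` always holds at least three images. If the dataset can yield fewer, a small guard avoids an unpack error; `take_three` is a hypothetical helper, not part of the patch:

```python
def take_three(ref_images):
    """Return exactly three reference images: truncate longer lists,
    repeat the last image when fewer than three are available."""
    if not ref_images:
        raise ValueError("ref_images must not be empty")
    padded = list(ref_images) + [ref_images[-1]] * (3 - len(ref_images))
    return padded[:3]
```

The bare unpack then becomes `Image0, Image1, Image2 = take_three(ref_images)`.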

## 2) `multi-shot/multi_view/DiffSynth-Studio-main/diffsynth/pipelines/wan_video_new.py`
### 2.1 Prompt encoding (fix typos and object calls)
**Before**
```python
prompt =  pip.text_encoder.process_prompt(prompt, positive=positive)
output =  pip.text_encoder.tokenizer(prompt, return_mask=True, add_special_tokens=True)
ids = output['input_ids'].to(device)
mask = output['attention_mask'].to(device)
prompt_emb = self.text_encoder(ids, mask)
...
prompt_shot_all = pip.text_encoder.process_prompt(prompt_shot_all, positive=positive)
...
for shot_index, shot_cut_end in enmurate(shot_cut_ends):
    start_pos = shot_cut_starts[shot_index]
    end_pos = shot_cut_end
    shot_text = cleaned_prompt[start_pos: end_pos + 1].strip()
```

**After**
```python
prompt = pipe.text_encoder.process_prompt(prompt, positive=positive)
output = pipe.text_encoder.tokenizer(prompt, return_mask=True, add_special_tokens=True)
ids = output['input_ids'].to(device)
mask = output['attention_mask'].to(device)
prompt_emb = pipe.text_encoder(ids, mask)
...
prompt_shot_all = pipe.text_encoder.process_prompt(prompt_shot_all, positive=positive)
cleaned_prompt = prompt_shot_all
...
for shot_index, shot_cut_end in enumerate(shot_cut_ends):
    start_pos = shot_cut_starts[shot_index]
    end_pos = shot_cut_end
    shot_text = cleaned_prompt[start_pos: end_pos + 1].strip()
```

### 2.2 Shot mask construction (fix undefined variables)
**Before**
```python
S_shots = len(shot_text_ranges[0]) ###TODO: batch size is currently 1
...
for sid, (s0, s1) in enumerate(shot_ranges):
    s0 = int(s0)
    s1 = int(s1)
    shot_table[sid, s0: s1 + 1] = True
...
allow_all = torch.cat([allow_shot, allow_ref_image], dim = 1)
assert allow_all.shape == x.shape[2] "The shape is something wrong"
```

**After**
```python
shot_ranges = shot_text_ranges[0]
if isinstance(shot_ranges, dict):
    shot_ranges = shot_ranges.get("shots", [])
S_shots = len(shot_ranges)
for sid, span in enumerate(shot_ranges):
    if span is None:
        continue
    s0, s1 = span
    s0 = int(s0)
    s1 = int(s1)
    shot_table[sid, s0: s1 + 1] = True
...
allow_all = torch.cat([allow_shot, allow_ref_image], dim = 1)
assert allow_all.shape[1] == S_q, "The shape is something wrong"
```
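The fixed loop fills a per-shot boolean table. The same logic can be sketched without torch (plain lists, hypothetical spans) to see which token positions each shot claims, including the `None`-span skip:

```python
def build_shot_table(shot_ranges, seq_len):
    """One boolean row per shot; True where the token index lies inside
    that shot's inclusive (s0, s1) span. None spans are skipped, as in
    the fixed loop; spans are clamped to the sequence length."""
    table = [[False] * seq_len for _ in shot_ranges]
    for sid, span in enumerate(shot_ranges):
        if span is None:
            continue
        s0, s1 = int(span[0]), int(span[1])
        for pos in range(max(s0, 0), min(s1, seq_len - 1) + 1):
            table[sid][pos] = True
    return table
```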

### 2.3 `shot_rope` branch: fix broken line continuation (stray whitespace after `\`)
**Before**
```python
for shot_index, num_frames in enumerate(shots_nums):
    f = num_frames
    rope_s = freq_s[shot_index] \
        .view(1, 1, 1, -1) \  
        .expand(f, h, w, -1)
    ...
    freqs = freqs.reshape(f * h * w, 1, -1)
```

**After**
```python
for shot_index, num_frames in enumerate(shots_nums):
    f = num_frames
    rope_s = freq_s[shot_index].view(1, 1, 1, -1).expand(f, h, w, -1)
    ...
    freqs = freqs.reshape(f * h * w, 1, -1)
```

### 2.4 Fix the `model_fn_wan_video` function-signature syntax
**Before**
```python
ID_2_shot: None ###### which IDs each shot contains: a list like [batch0: [shot0: [0,1], shot1: [2]], batch1: []]
**kwargs,
```

**After**
```python
ID_2_shot=None, ###### which IDs each shot contains: a list like [batch0: [shot0: [0,1], shot1: [2]], batch1: []]
**kwargs,
```

### 2.5 Add the missing `WanVideoUnit_SpeedControl` class
**Before**
```python
WanVideoUnit_SpeedControl(),  # referenced in the units list, but the class is never defined
```

**After**
```python
class WanVideoUnit_SpeedControl(PipelineUnit):
    def __init__(self):
        super().__init__(input_params=("motion_bucket_id",))

    def process(self, pipe: WanVideoPipeline, motion_bucket_id):
        if motion_bucket_id is None:
            return {}
        motion_bucket_id = torch.Tensor((motion_bucket_id,)).to(dtype=pipe.torch_dtype, device=pipe.device)
        return {"motion_bucket_id": motion_bucket_id}
```

### 2.6 Use the prompter for prompt processing (fix missing `process_prompt`)
**Before**
```python
prompt = pipe.text_encoder.process_prompt(prompt, positive=positive)
output = pipe.text_encoder.tokenizer(prompt, return_mask=True, add_special_tokens=True)
...
prompt_shot_all = pipe.text_encoder.process_prompt(prompt_shot_all, positive=positive)
...
enc_output = pipe.text_encoder(
    text,
    return_mask=True,
    add_special_tokens=True,
    return_tensors="pt"
)
```

**After**
```python
prompt = pipe.prompter.process_prompt(prompt, positive=positive)
output = pipe.prompter.tokenizer(prompt, return_mask=True, add_special_tokens=True)
...
prompt_shot_all = pipe.prompter.process_prompt(prompt_shot_all, positive=positive)
...
enc_output = pipe.prompter.tokenizer(
    text,
    return_mask=True,
    add_special_tokens=True,
    return_tensors="pt"
)
```

### 2.7 Handle the tokenizer returning either a tuple or a dict
**Before**
```python
output = pipe.prompter.tokenizer(prompt, return_mask=True, add_special_tokens=True)
ids = output['input_ids'].to(device)
mask = output['attention_mask'].to(device)
...
enc_output = pipe.prompter.tokenizer(..., return_mask=True, ...)
ids = enc_output['input_ids'].to(device)
mask = enc_output['attention_mask'].to(device)
```

**After**
```python
output = pipe.prompter.tokenizer(prompt, return_mask=True, add_special_tokens=True)
if isinstance(output, tuple):
    ids, mask = output
else:
    ids = output['input_ids']
    mask = output['attention_mask']
ids = ids.to(device)
mask = mask.to(device)
...
enc_output = pipe.prompter.tokenizer(..., return_mask=True, ...)
if isinstance(enc_output, tuple):
    ids, mask = enc_output
else:
    ids = enc_output['input_ids']
    mask = enc_output['attention_mask']
ids = ids.to(device)
mask = mask.to(device)
```
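The tuple/dict branch repeats at every call site. It could be factored into a single helper (a hypothetical refactor, not part of the patch) so each site shrinks to one line:

```python
def unpack_tokenizer_output(output):
    """Normalize a tokenizer result to (ids, mask): accepts either a
    (ids, mask) tuple or a dict with 'input_ids'/'attention_mask'."""
    if isinstance(output, tuple):
        ids, mask = output
    else:
        ids = output["input_ids"]
        mask = output["attention_mask"]
    return ids, mask
```

Each call site then reads `ids, mask = unpack_tokenizer_output(output)` followed by the two `.to(device)` moves.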

### 2.8 Use the prompter's `text_len` (fix missing attribute)
**Before**
```python
pad_len = pipe.text_encoder.text_len - total_len
```

**After**
```python
pad_len = pipe.prompter.text_len - total_len
```

## 3) `multi-shot/multi_view/DiffSynth-Studio-main/diffsynth/models/wan_video_dit.py`
### 3.1 Fix the ID token slice in `attention_per_batch_with_shots`
**Before**
```python
ID_token_start = shot_token_all_num + id_idx * pre_ID_token_num
ID_token_end   = start + pre_ID_token_num
assert end <= k.shape[2], (
    f"ID token slice out of range: start={start}, end={end}, "
    f"K_len={k.shape[2]}"
)
id_token_k = k[bi, :, start:end, :] 
id_token_v = v[bi, :, start:end, :]
```

**After**
```python
start = shot_token_all_num + id_idx * pre_id_token_num
if start >= k.shape[2]:
    continue
end = min(start + pre_id_token_num, k.shape[2])
id_token_k = k[bi, :, start:end, :]
id_token_v = v[bi, :, start:end, :]
```
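The bounds logic in the `After` block can be isolated into a pure function (names mirror the fix, but the function itself is hypothetical) and exercised on the edge cases:

```python
def id_token_slice(shot_token_all_num, id_idx, pre_id_token_num, k_len):
    """(start, end) slice for one ID's tokens along K's sequence axis,
    or None when the slice starts past the end of K (the `continue`
    case in the fix). `end` is clamped to k_len."""
    start = shot_token_all_num + id_idx * pre_id_token_num
    if start >= k_len:
        return None
    end = min(start + pre_id_token_num, k_len)
    return start, end
```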

### 3.2 Add `attn_mask` to `CrossAttention.forward`
**Before**
```python
def forward(self, x: torch.Tensor, y: torch.Tensor):
    ...
    x = self.attn(q, k, v)
```

**After**
```python
def forward(self, x: torch.Tensor, y: torch.Tensor, attn_mask=None):
    ...
    x = self.attn(q, k, v, attn_mask=attn_mask)
```

## 4) `multi-shot/multi_view/DiffSynth-Studio-main/diffsynth/trainers/utils.py`
**Add a new argument to match the pipeline**

**Before**
```python
# (no --shot_rope argument)
```

**After**
```python
parser.add_argument("--shot_rope", type=bool, default=False, help="Whether to apply shot rope for multi-shot video")
```
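One caveat with the new flag: argparse's `type=bool` applies `bool()` to the raw string, so `--shot_rope False` still parses as `True` (any non-empty string is truthy). A `store_true` flag is the usual safer pattern; a minimal demonstration:

```python
import argparse

parser = argparse.ArgumentParser()
# store_true: flag absent -> False, flag present -> True; no string conversion
parser.add_argument("--shot_rope", action="store_true",
                    help="Whether to apply shot rope for multi-shot video")

print(parser.parse_args([]).shot_rope)               # False
print(parser.parse_args(["--shot_rope"]).shot_rope)  # True
```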

## 5) New files
**`multi-shot/MULTI_SHOT_CORE_SUMMARY.md`**
- Before: file did not exist
- After: new summary document added

**`multi-shot/MODIFICATION_LOG.md`**
- Before: file did not exist
- After: new modification list added (this file)

## 6) `multi-shot/dry_run_train.py`
**Force-move the model to CUDA to match the device of the inputs**

**Before**
```python
device = "cuda" if torch.cuda.is_available() else "cpu"
model.pipe.device = device
model.pipe.torch_dtype = torch.bfloat16
```

**After**
```python
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.pipe.device = device
model.pipe.torch_dtype = torch.bfloat16
```

## Verification
```bash
python -m py_compile multi-shot/multi_view/datasets/videodataset.py
python -m py_compile multi-shot/multi_view/train.py
python -m py_compile multi-shot/multi_view/DiffSynth-Studio-main/diffsynth/pipelines/wan_video_new.py
python -m py_compile multi-shot/multi_view/DiffSynth-Studio-main/diffsynth/models/wan_video_dit.py
python -m py_compile multi-shot/multi_view/DiffSynth-Studio-main/diffsynth/trainers/utils.py
```