# Modification List (Before/After Comparison)


## Scope
Only the minimal fixes needed to get things running; the original logic and structure are preserved as far as possible.


## 1) `multi-shot/multi_view/datasets/videodataset.py`
**Define the missing variables while keeping the original return structure**

**Before**
```python
return {
    "global_caption": None,
    "shot_num": 3,
    "pre_shot_caption": ["xxx", "xxx", "xxx"],
    # "single_caption": meta_prompt["single_prompt"],
    "video": input_video,
    "ref_num": ID_num * 3, ###TODO: get the ID_num = 1 case running first
    "ID_num": ID_num,
    "ref_images": [[Image0, Image1, Image2]],
    "video_path": video_path
}
```

**After**
```python
ID_num = 1
Image0, Image1, Image2 = ref_images[:3]
return {
    "global_caption": None,
    "shot_num": 3,
    "pre_shot_caption": ["xxx", "xxx", "xxx"],
    # "single_caption": meta_prompt["single_prompt"],
    "video": input_video,
    "ref_num": ID_num * 3, ###TODO: get the ID_num = 1 case running first
    "ID_num": ID_num,
    "ref_images": [[Image0, Image1, Image2]],
    "video_path": video_path
}
```
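The `ref_images[:3]` unpacking in the After snippet assumes the sample carries at least three reference frames. A standalone sketch of that step with an explicit guard (the function name and placeholder values are invented for illustration):

```python
def unpack_first_three(ref_images):
    # Fail with a clear message instead of an unpacking error downstream.
    if len(ref_images) < 3:
        raise ValueError(f"expected at least 3 reference images, got {len(ref_images)}")
    image0, image1, image2 = ref_images[:3]
    return image0, image1, image2

# Placeholder strings stand in for PIL images here:
first, second, third = unpack_first_three(["img_a", "img_b", "img_c", "img_d"])
print(first, second, third)  # → img_a img_b img_c
```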


## 2) `multi-shot/multi_view/DiffSynth-Studio-main/diffsynth/pipelines/wan_video_new.py`
### 2.1 Prompt encoding (fix typos / wrong object references)
**Before**
```python
prompt = pip.text_encoder.process_prompt(prompt, positive=positive)
output = pip.text_encoder.tokenizer(prompt, return_mask=True, add_special_tokens=True)
ids = output['input_ids'].to(device)
mask = output['attention_mask'].to(device)
prompt_emb = self.text_encoder(ids, mask)
...
prompt_shot_all = pip.text_encoder.process_prompt(prompt_shot_all, positive=positive)
...
for shot_index, shot_cut_end in enmurate(shot_cut_ends):
    start_pos = shot_cut_starts[shot_index]
    end_pos = shot_cut_end
    shot_text = cleaned_prompt[start_pos: end_pos + 1].strip()
```

**After**
```python
prompt = pipe.text_encoder.process_prompt(prompt, positive=positive)
output = pipe.text_encoder.tokenizer(prompt, return_mask=True, add_special_tokens=True)
ids = output['input_ids'].to(device)
mask = output['attention_mask'].to(device)
prompt_emb = pipe.text_encoder(ids, mask)
...
prompt_shot_all = pipe.text_encoder.process_prompt(prompt_shot_all, positive=positive)
cleaned_prompt = prompt_shot_all
...
for shot_index, shot_cut_end in enumerate(shot_cut_ends):
    start_pos = shot_cut_starts[shot_index]
    end_pos = shot_cut_end
    shot_text = cleaned_prompt[start_pos: end_pos + 1].strip()
```
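The loop in the After snippet slices `cleaned_prompt` with inclusive end positions. A self-contained sketch of just that slicing logic (the prompt text and cut positions below are made up):

```python
def split_shots(cleaned_prompt, shot_cut_starts, shot_cut_ends):
    shots = []
    for shot_index, shot_cut_end in enumerate(shot_cut_ends):
        start_pos = shot_cut_starts[shot_index]
        # the end position is inclusive, hence the +1 in the slice
        shots.append(cleaned_prompt[start_pos: shot_cut_end + 1].strip())
    return shots

text = "shot one. shot two."
print(split_shots(text, [0, 10], [8, 18]))  # → ['shot one.', 'shot two.']
```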


### 2.2 Shot mask construction (fix undefined variables)
**Before**
```python
S_shots = len(shot_text_ranges[0]) ###TODO: current batch size is 1
...
for sid, (s0, s1) in enumerate(shot_ranges):
    s0 = int(s0)
    s1 = int(s1)
    shot_table[sid, s0: s1 + 1] = True
...
allow_all = torch.cat([allow_shot, allow_ref_image], dim = 1)
assert allow_all.shape == x.shape[2] "The shape is something wrong"
```

**After**
```python
shot_ranges = shot_text_ranges[0]
if isinstance(shot_ranges, dict):
    shot_ranges = shot_ranges.get("shots", [])
S_shots = len(shot_ranges)
for sid, span in enumerate(shot_ranges):
    if span is None:
        continue
    s0, s1 = span
    s0 = int(s0)
    s1 = int(s1)
    shot_table[sid, s0: s1 + 1] = True
...
allow_all = torch.cat([allow_shot, allow_ref_image], dim = 1)
assert allow_all.shape[1] == S_q, "The shape is something wrong"
```
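`shot_table` in the snippet above is a per-shot boolean mask over token positions (the snippet assumes it was allocated earlier). The same layout can be sketched torch-free with nested lists; the ranges and sequence length here are illustrative only:

```python
def build_shot_table(shot_ranges, seq_len):
    # One row per shot; True marks the token positions belonging to that shot.
    table = [[False] * seq_len for _ in shot_ranges]
    for sid, span in enumerate(shot_ranges):
        if span is None:  # tolerate missing ranges, as in the fix above
            continue
        s0, s1 = int(span[0]), int(span[1])
        for pos in range(s0, s1 + 1):  # inclusive end position
            table[sid][pos] = True
    return table

table = build_shot_table([(0, 2), None, (3, 4)], seq_len=5)
```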


### 2.3 Fix the variable-name conflict in the `shot_rope` branch
**Before**
```python
for shot_index, num_frames in enumerate(shots_nums):
    f = num_frames
    rope_s = freq_s[shot_index] \
        .view(1, 1, 1, -1) \
        .expand(f, h, w, -1)
...
freqs = freqs.reshape(f * h * w, 1, -1)
```

**After**
```python
for shot_index, num_frames in enumerate(shots_nums):
    f = num_frames
    rope_s = freq_s[shot_index].view(1, 1, 1, -1).expand(f, h, w, -1)
...
freqs = freqs.reshape(f * h * w, 1, -1)
```


### 2.4 Fix the `model_fn_wan_video` function-signature syntax
**Before**
```python
ID_2_shot: None ###### which IDs each shot contains; a list like [ batch0: [shot0: [0,1], shot1: [2]], batch1: [] ]
**kwargs,
```

**After**
```python
ID_2_shot=None, ###### which IDs each shot contains; a list like [ batch0: [shot0: [0,1], shot1: [2]], batch1: [] ]
**kwargs,
```
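For context on why the Before line breaks: in a `def`, `ID_2_shot: None` is only a type annotation, not a default value, so the parameter would still be required even where the syntax parses. A minimal self-contained comparison of the two forms:

```python
import inspect

def annotated_only(ID_2_shot: None, **kwargs):
    # `: None` is just an annotation; the parameter has no default and is required
    return ID_2_shot

def with_default(ID_2_shot=None, **kwargs):
    # `=None` is a real default, matching the After snippet
    return ID_2_shot

assert with_default() is None  # callable with no arguments
assert inspect.signature(annotated_only).parameters["ID_2_shot"].default is inspect.Parameter.empty
```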

### 2.5 Add the missing `WanVideoUnit_SpeedControl` class
**Before**
```python
WanVideoUnit_SpeedControl(), # referenced in the units list, but the class is not defined
```

**After**
```python
class WanVideoUnit_SpeedControl(PipelineUnit):
    def __init__(self):
        super().__init__(input_params=("motion_bucket_id",))

    def process(self, pipe: WanVideoPipeline, motion_bucket_id):
        if motion_bucket_id is None:
            return {}
        motion_bucket_id = torch.Tensor((motion_bucket_id,)).to(dtype=pipe.torch_dtype, device=pipe.device)
        return {"motion_bucket_id": motion_bucket_id}
```

### 2.6 Route prompt processing through the prompter (fix missing `process_prompt`)
**Before**
```python
prompt = pipe.text_encoder.process_prompt(prompt, positive=positive)
output = pipe.text_encoder.tokenizer(prompt, return_mask=True, add_special_tokens=True)
...
prompt_shot_all = pipe.text_encoder.process_prompt(prompt_shot_all, positive=positive)
...
enc_output = pipe.text_encoder(
    text,
    return_mask=True,
    add_special_tokens=True,
    return_tensors="pt"
)
```

**After**
```python
prompt = pipe.prompter.process_prompt(prompt, positive=positive)
output = pipe.prompter.tokenizer(prompt, return_mask=True, add_special_tokens=True)
...
prompt_shot_all = pipe.prompter.process_prompt(prompt_shot_all, positive=positive)
...
enc_output = pipe.prompter.tokenizer(
    text,
    return_mask=True,
    add_special_tokens=True,
    return_tensors="pt"
)
```
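The underlying issue is that in this pipeline the prompt utilities (`process_prompt`, the tokenizer, `text_len`) live on a prompter wrapper rather than on the raw text-encoder module. A toy sketch of that split; the class bodies below are hypothetical and only illustrate the shape of the API, not DiffSynth's actual code:

```python
class TextEncoder:
    """Raw model: only runs the forward pass, no prompt utilities."""
    def __call__(self, ids, mask):
        return [float(i) for i in ids]  # stand-in for an embedding

class Prompter:
    """Wrapper that owns preprocessing and tokenization."""
    text_len = 512

    def process_prompt(self, prompt, positive=True):
        return prompt.strip()

    def tokenizer(self, prompt, return_mask=True, add_special_tokens=True):
        ids = list(range(len(prompt.split())))
        return ids, [1] * len(ids)

prompter = Prompter()
prompt = prompter.process_prompt("  a two-shot video  ")
ids, mask = prompter.tokenizer(prompt)
# The raw encoder has no process_prompt, which is why the Before code failed:
assert not hasattr(TextEncoder(), "process_prompt")
```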


### 2.7 Handle tokenizers that return either a tuple or a dict
**Before**
```python
output = pipe.prompter.tokenizer(prompt, return_mask=True, add_special_tokens=True)
ids = output['input_ids'].to(device)
mask = output['attention_mask'].to(device)
...
enc_output = pipe.prompter.tokenizer(..., return_mask=True, ...)
ids = enc_output['input_ids'].to(device)
mask = enc_output['attention_mask'].to(device)
```

**After**
```python
output = pipe.prompter.tokenizer(prompt, return_mask=True, add_special_tokens=True)
if isinstance(output, tuple):
    ids, mask = output
else:
    ids = output['input_ids']
    mask = output['attention_mask']
ids = ids.to(device)
mask = mask.to(device)
...
enc_output = pipe.prompter.tokenizer(..., return_mask=True, ...)
if isinstance(enc_output, tuple):
    ids, mask = enc_output
else:
    ids = enc_output['input_ids']
    mask = enc_output['attention_mask']
ids = ids.to(device)
mask = mask.to(device)
```
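Since the same tuple/dict branching appears twice above, it could be factored into one helper. A standalone sketch with plain lists standing in for tensors (the helper name is invented):

```python
def normalize_tokenizer_output(output):
    # Tuple-style tokenizers return (ids, mask); HF-style ones return a dict.
    if isinstance(output, tuple):
        ids, mask = output
    else:
        ids = output['input_ids']
        mask = output['attention_mask']
    return ids, mask

ids, mask = normalize_tokenizer_output(([1, 2, 3], [1, 1, 1]))
ids2, mask2 = normalize_tokenizer_output({"input_ids": [4, 5], "attention_mask": [1, 1]})
```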


### 2.8 Use the prompter's `text_len` (fix missing attribute)
**Before**
```python
pad_len = pipe.text_encoder.text_len - total_len
```

**After**
```python
pad_len = pipe.prompter.text_len - total_len
```

## 3) `multi-shot/multi_view/DiffSynth-Studio-main/diffsynth/models/wan_video_dit.py`
### 3.1 Fix the ID token slice in `attention_per_batch_with_shots`
**Before**
```python
ID_token_start = shot_token_all_num + id_idx * pre_ID_token_num
ID_token_end = start + pre_ID_token_num
assert end <= k.shape[2], (
    f"ID token slice out of range: start={start}, end={end}, "
    f"K_len={k.shape[2]}"
)
id_token_k = k[bi, :, start:end, :]
id_token_v = v[bi, :, start:end, :]
```

**After**
```python
start = shot_token_all_num + id_idx * pre_id_token_num
if start >= k.shape[2]:
    continue
end = min(start + pre_id_token_num, k.shape[2])
id_token_k = k[bi, :, start:end, :]
id_token_v = v[bi, :, start:end, :]
```
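The span arithmetic in the After snippet can be checked in isolation. A small sketch using a plain integer for the key length (the values below are made up):

```python
def id_token_span(shot_token_all_num, id_idx, pre_id_token_num, k_len):
    # Mirrors the fix above: skip IDs whose tokens start beyond the key length,
    # and clamp the end so the slice never runs out of range.
    start = shot_token_all_num + id_idx * pre_id_token_num
    if start >= k_len:
        return None
    return start, min(start + pre_id_token_num, k_len)

print(id_token_span(100, 0, 16, k_len=110))  # → (100, 110), clamped at the key length
print(id_token_span(100, 1, 16, k_len=110))  # → None, the start is out of range
```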


### 3.2 Add an `attn_mask` parameter to `CrossAttention.forward`
**Before**
```python
def forward(self, x: torch.Tensor, y: torch.Tensor):
    ...
    x = self.attn(q, k, v)
```

**After**
```python
def forward(self, x: torch.Tensor, y: torch.Tensor, attn_mask=None):
    ...
    x = self.attn(q, k, v, attn_mask=attn_mask)
```

## 4) `multi-shot/multi_view/DiffSynth-Studio-main/diffsynth/trainers/utils.py`
**Add an argument to match the pipeline**

**Before**
```python
# (no --shot_rope argument)
```

**After**
```python
parser.add_argument("--shot_rope", type=bool, default=False, help="Whether apply shot rope for multi-shot video")
```
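One caveat about the new flag: argparse applies `type=bool` to the raw command-line string, and `bool()` of any non-empty string is `True`, so `--shot_rope False` still enables the option. A quick demonstration (for a pure on/off switch, `action="store_true"` is the usual idiom):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--shot_rope", type=bool, default=False,
                    help="Whether apply shot rope for multi-shot video")

args = parser.parse_args([])
print(args.shot_rope)  # → False (the default)

args = parser.parse_args(["--shot_rope", "False"])
print(args.shot_rope)  # → True, because bool("False") is True
```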

## 5) New files
**`multi-shot/MULTI_SHOT_CORE_SUMMARY.md`**
- Before: file did not exist
- After: new summary document

**`multi-shot/MODIFICATION_LOG.md`**
- Before: file did not exist
- After: new modification list (this file)

## 6) `multi-shot/dry_run_train.py`
**Move the model to CUDA so it matches the input device**

**Before**
```python
device = "cuda" if torch.cuda.is_available() else "cpu"
model.pipe.device = device
model.pipe.torch_dtype = torch.bfloat16
```

**After**
```python
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.pipe.device = device
model.pipe.torch_dtype = torch.bfloat16
```

## Verification
```bash
python -m py_compile multi-shot/multi_view/datasets/videodataset.py
python -m py_compile multi-shot/multi_view/train.py
python -m py_compile multi-shot/multi_view/DiffSynth-Studio-main/diffsynth/pipelines/wan_video_new.py
python -m py_compile multi-shot/multi_view/DiffSynth-Studio-main/diffsynth/models/wan_video_dit.py
python -m py_compile multi-shot/multi_view/DiffSynth-Studio-main/diffsynth/trainers/utils.py
```