Space: JackIsNotInTheBox/Generate_Audio_for_Video

BoxOfColors Claude Sonnet 4.6 committed 20 days ago

Commit 2e7f9a4

1 Parent(s): 894b188

Fix xregen truncating total audio duration when clip < segment window

When an xregen model generates a clip shorter than the original segment
window (e.g. an 8 s MMAudio clip for a 15 s HunyuanFoley segment),
_stitch_wavs trims the last segment's wav on the assumption that it
covers the full window. A short wav is clipped at its own, shorter
length, so the final stitched audio ends up shorter than total_dur_s.

Fix: in _xregen_splice, after prepending leading silence to align to
seg_start, append trailing silence to pad the wav to the full original
segment duration (seg_end - seg_start). _stitch_wavs then trims it
correctly and the output is always exactly total_dur_s long.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
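
The truncation described in the commit message can be reproduced with a toy model. This is only a sketch: `toy_stitch` is a hypothetical stand-in for `_stitch_wavs` (whose real implementation is not shown here), the segments are assumed not to overlap, and the 1000 Hz sample rate is chosen purely to keep the numbers small.

```python
import numpy as np

def toy_stitch(wavs, segments, sr):
    """Hypothetical stand-in for _stitch_wavs.

    Keeps at most one segment window of samples from each wav and
    concatenates the pieces. Arrays are (channels, samples).
    """
    parts = []
    for wav, (seg_start, seg_end) in zip(wavs, segments):
        window = int(round((seg_end - seg_start) * sr))
        # Slicing past the end of a short wav silently returns fewer samples.
        parts.append(wav[:, :window])
    return np.concatenate(parts, axis=1)

sr = 1000  # illustrative sample rate, not the app's real one
segments = [(0.0, 15.0), (15.0, 30.0)]
full_wav = np.zeros((1, 15 * sr), dtype=np.float32)   # covers its 15 s window
short_wav = np.zeros((1, 8 * sr), dtype=np.float32)   # 8 s clip for a 15 s window

# Before the fix: the short clip shortens the stitched result.
truncated = toy_stitch([full_wav, short_wav], segments, sr)

# After the fix: trailing silence pads the clip to the full window first.
tail = np.zeros((1, 7 * sr), dtype=np.float32)
padded_wav = np.concatenate([short_wav, tail], axis=1)
fixed = toy_stitch([full_wav, padded_wav], segments, sr)

print(truncated.shape[1] / sr)  # 23.0
print(fixed.shape[1] / sr)      # 30.0
```

With the padding in place the stitched length matches the 30 s total, which is the behaviour the commit aims for with total_dur_s.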

Files changed (1)

app.py +25 -11

@@ -1785,20 +1785,34 @@ def _xregen_splice(new_wav_raw: np.ndarray, src_sr: int,
     slot_wavs = _load_seg_wavs(meta["wav_paths"])
     new_wav   = _resample_to_slot_sr(new_wav_raw, src_sr, slot_sr, slot_wavs[0])
-    # Align new_wav so sample index 0 corresponds to seg_start in video time.
-    # _stitch_wavs trims using seg_start as the time origin, so if the clip
-    # started AFTER seg_start (clip_start_s > seg_start), we prepend silence
-    # equal to (clip_start_s - seg_start) to shift the audio back to seg_start.
     if clip_start_s is not None:
-        seg_start = meta["segments"][seg_idx][0]
-        offset_s  = seg_start - clip_start_s   # negative when clip starts after seg_start
         if offset_s < 0:
             pad_samples = int(round(abs(offset_s) * slot_sr))
-            silence = np.zeros(
-                (new_wav.shape[0], pad_samples) if new_wav.ndim == 2 else pad_samples,
-                dtype=new_wav.dtype,
-            )
-            new_wav = np.concatenate([silence, new_wav], axis=1 if new_wav.ndim == 2 else 0)
     video_path, audio_path, updated_meta, waveform_html = _splice_and_save(
         new_wav, seg_idx, meta, slot_id

     slot_wavs = _load_seg_wavs(meta["wav_paths"])
     new_wav   = _resample_to_slot_sr(new_wav_raw, src_sr, slot_sr, slot_wavs[0])
+    # Align new_wav so sample index 0 corresponds to seg_start in video time,
+    # and the wav is long enough to cover the full original segment window.
+    #
+    # _stitch_wavs trims each wav relative to its seg_start, expecting the wav
+    # to cover the full segment window (seg_end - seg_start).  xregen models
+    # may generate a shorter clip (e.g. MMAudio 8 s on a 15 s segment), which
+    # causes _stitch_wavs to trim short and produce truncated output.
+    #
+    # Steps:
+    #   1. Prepend silence if the clip started after seg_start.
+    #   2. Append silence if the wav is still shorter than the full segment window.
     if clip_start_s is not None:
+        seg_start, seg_end = meta["segments"][seg_idx]
+        full_seg_samples = int(round((seg_end - seg_start) * slot_sr))
+        # Step 1: prepend silence to align to seg_start
+        offset_s = seg_start - clip_start_s   # negative when clip starts after seg_start
         if offset_s < 0:
             pad_samples = int(round(abs(offset_s) * slot_sr))
+            silence = np.zeros((new_wav.shape[0], pad_samples), dtype=new_wav.dtype)
+            new_wav = np.concatenate([silence, new_wav], axis=1)
+        # Step 2: append silence to fill the full segment window
+        current_samples = new_wav.shape[1]
+        if current_samples < full_seg_samples:
+            tail = np.zeros((new_wav.shape[0], full_seg_samples - current_samples),
+                            dtype=new_wav.dtype)
+            new_wav = np.concatenate([new_wav, tail], axis=1)
     video_path, audio_path, updated_meta, waveform_html = _splice_and_save(
         new_wav, seg_idx, meta, slot_id
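
The two padding steps in the new hunk can be sketched as a standalone function. This is not the code in app.py: `pad_to_segment` and its parameters are hypothetical names used for illustration. The new hunk indexes `new_wav.shape[1]`, so it appears to assume a (channels, samples) array. As a precaution, the sketch below also accepts 1-D mono input, as the removed lines did.

```python
import numpy as np

def pad_to_segment(wav, sr, seg_start, seg_end, clip_start_s):
    """Hypothetical helper mirroring the two padding steps in the diff.

    wav is either (samples,) or (channels, samples); time is the last axis.
    """
    def silence(n):
        # Same leading dimensions and dtype as wav, n samples long.
        return np.zeros(wav.shape[:-1] + (n,), dtype=wav.dtype)

    # Step 1: prepend silence if the clip started after seg_start.
    lead = int(round(max(clip_start_s - seg_start, 0.0) * sr))
    if lead > 0:
        wav = np.concatenate([silence(lead), wav], axis=-1)

    # Step 2: append silence if the wav is still shorter than the window.
    full_seg_samples = int(round((seg_end - seg_start) * sr))
    missing = full_seg_samples - wav.shape[-1]
    if missing > 0:
        wav = np.concatenate([wav, silence(missing)], axis=-1)
    return wav

# Usage: an 8 s clip starting 2 s into a 15 s segment, at an illustrative 1000 Hz.
stereo = np.ones((2, 8000), dtype=np.float32)
out_stereo = pad_to_segment(stereo, 1000, 10.0, 25.0, 12.0)
mono = np.ones(8000, dtype=np.float32)
out_mono = pad_to_segment(mono, 1000, 10.0, 25.0, 12.0)

print(out_stereo.shape)  # (2, 15000)
print(out_mono.shape)    # (15000,)
```

Like the diff, the sketch only pads and never trims, so a wav that is already longer than the window is returned unchanged and left for the stitching step to cut.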