Preprint · 2026

ϕ-Noise:
Training-Free Temporal Video Conditioning
via Phase-Based Noise Manipulation

Ofir Abramovich* Nadav Z. Cohen* Adi Rosenthal* Ariel Shamir
Canvas-Lab  ·  Department of Computer Science, Reichman University
* Equal Contribution
arXiv  ·  PDF  ·  Hugging Face  ·  Code

Simple, zero-training video motion control through
frequency-domain phase injection into diffusion noise latents.

Applications

I2V Motion Transfer

Motion Transfer (+ Structure Conditioning)

Cut & Drag

Abstract

Latent video diffusion models generate videos by progressively transforming Gaussian noise into realistic samples conditioned on text or visual inputs. However, existing conditioning methods often require additional training and computational overhead. Motivated by recent findings on the importance of frequency components in generative models, we propose a simple, training-free approach for motion-conditioned video generation by injecting low-frequency phase information from a reference video directly into the diffusion noise latents.

Our method transfers motion cues without modifying the model architecture or inference pipeline. Using several applications, we demonstrate effective control over both appearance and dynamics in generated videos, while achieving competitive or superior results compared to more complex conditioning approaches.

How Does It Work?

ϕ-Noise transforms both the reference video latent and the Gaussian noise latent into the frequency domain via the Discrete Fourier Transform (DFT). The low-frequency phase components, which encode coarse spatial structure and dominant temporal dynamics, are replaced in the noise with those from the reference. A Spectral-Temporal Energy Balancing Mask Φ is then applied to preserve total signal energy and ensure stable denoising.
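The phase-substitution step described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it treats a latent as a single-channel array of shape (frames, height, width), applies the DFT over all three axes, keeps the noise magnitudes unchanged, and selects "low frequency" with a hypothetical radial `cutoff`. The function names and the cutoff value are invented for this example.

```python
import numpy as np


def low_frequency_mask(shape, cutoff):
    """Boolean mask of DFT bins whose normalized frequency radius is <= cutoff."""
    axes = [np.fft.fftfreq(n) for n in shape]  # cycles per sample, in [-0.5, 0.5)
    grids = np.meshgrid(*axes, indexing="ij")
    radius = np.sqrt(sum(g ** 2 for g in grids))
    return radius <= cutoff


def inject_low_frequency_phase(noise, reference, cutoff=0.15):
    """Keep the noise magnitudes; take the low-frequency phase from the reference."""
    noise_spec = np.fft.fftn(noise)
    ref_spec = np.fft.fftn(reference)
    mask = low_frequency_mask(noise.shape, cutoff)
    phase = np.where(mask, np.angle(ref_spec), np.angle(noise_spec))
    mixed = np.abs(noise_spec) * np.exp(1j * phase)
    # The mask is symmetric under k -> -k, so `mixed` stays Hermitian and the
    # inverse transform is real up to floating-point error.
    return np.fft.ifftn(mixed).real


rng = np.random.default_rng(0)
noise = rng.standard_normal((8, 16, 16))      # toy (frames, height, width) latent
reference = rng.standard_normal((8, 16, 16))  # stand-in for an encoded reference latent
conditioned = inject_low_frequency_phase(noise, reference)
```

In this sketch `conditioned` would take the place of the initial noise latent handed to the sampler. How the real method handles latent channels, which axes it transforms, and how it picks the frequency band are not specified here and are assumptions of the example.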

ϕ-Noise injects low-frequency phase information from a reference video into Gaussian noise latents, enabling training-free motion and structural conditioning across diverse video generation tasks — with no changes to the diffusion model.

Analysis

Phase Substitution & Energy Balancing

Figure 2. Phase and Energy Analysis — phase distributions, latent energy evolution across denoising steps, and qualitative comparison with/without energy balancing.

Naïve phase injection disrupts the expected energy profile of the noise, causing saturation artifacts and out-of-distribution denoising. ϕ-Noise applies a frequency-dependent scaling mask Φ that attenuates the injected low-frequency magnitudes by a factor of 1/γ and compensates by scaling the high-frequency magnitudes by β, guaranteeing exact energy preservation: E(z̃_k ⊙ Φ) = E(z̃).
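One way to satisfy the energy constraint above is to solve for β in closed form. The sketch below assumes that energy E is the sum of squared spectral magnitudes, that Φ takes only two values (1/γ on the low band, β elsewhere), and that γ ≥ 1. Under those assumptions, E_low/γ² + β²·E_high = E_low + E_high gives β = sqrt(1 + E_low·(1 − 1/γ²)/E_high). The function name, the cutoff, and this particular formula for β are illustrative and may differ from the paper's Spectral-Temporal mask.

```python
import numpy as np


def energy_balancing_mask(spectrum, low_mask, gamma):
    """Return (phi, beta): phi is 1/gamma on low_mask and beta elsewhere."""
    if gamma < 1.0:
        raise ValueError("this sketch assumes gamma >= 1 (attenuation)")
    power = np.abs(spectrum) ** 2
    e_low = power[low_mask].sum()
    e_high = power[~low_mask].sum()
    if e_high == 0.0:
        raise ValueError("no high-frequency energy available to rescale")
    # Solve e_low / gamma**2 + beta**2 * e_high = e_low + e_high for beta.
    beta = np.sqrt(1.0 + e_low * (1.0 - 1.0 / gamma ** 2) / e_high)
    phi = np.where(low_mask, 1.0 / gamma, beta)
    return phi, beta


rng = np.random.default_rng(0)
spectrum = np.fft.fftn(rng.standard_normal((8, 16, 16)))
freqs = np.meshgrid(*[np.fft.fftfreq(n) for n in spectrum.shape], indexing="ij")
low_mask = np.sqrt(sum(f ** 2 for f in freqs)) <= 0.15  # hypothetical cutoff

phi, beta = energy_balancing_mask(spectrum, low_mask, gamma=2.0)
energy_before = np.sum(np.abs(spectrum) ** 2)
energy_after = np.sum(np.abs(spectrum * phi) ** 2)  # equal up to float error
```

Because the low band holds only a small share of the bins for white noise, β stays close to 1 in this toy setting: the high frequencies are barely changed while the low band is halved in magnitude.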

Because the DFT and its inverse cost far less than a single diffusion step, ϕ-Noise adds negligible runtime and memory overhead and requires no changes to the model architecture or inference pipeline.

Comparisons

ϕ-Noise supports three distinct video generation tasks within a single unified framework, all using Wan (Wan2.2-14B) as the base model with no additional training.

Given a reference video and a text prompt, generate a video matching the prompt while preserving input motion dynamics. Spatial ϕ-Noise achieves strong alignment with both textual content and motion patterns.

Video columns: Reference  ·  Ours  ·  DMT  ·  DiTFlow

Prompt: "A monkey is climbing a climbing wall."
Prompt: "A motorcycle is driving on the road."
Prompt: "A dolphin is jumping in the air into the water."

Align with both a text prompt and a first-frame image condition while following the motion of the reference video. ϕ-Noise successfully transfers motion across varying subjects and handles complex dynamics while preserving identity coherence.

Video columns: Reference  ·  Ours  ·  MotionClone  ·  Wan

Prompt: "The person is performing a backflip."
Prompt: "A motorcycle is driving on the road."
Prompt: "The person is sitting, and then suddenly flies magically."
Prompt: "The shark is swimming in the ocean."

Users cut object patches from an image or add sprites, then animate them with rigid drag paths. ϕ-Noise generates coherent videos following prescribed motion while producing plausible non-rigid dynamics (e.g., fire breath, tentacle movement).

Video columns: Input  ·  Ours  ·  Time-to-Move  ·  Go-with-the-Flow

Quantitative Results

We evaluate on a diverse benchmark of 60 high-quality videos: 20 from the TTM dataset, 30 from LOVEU-TGVE-2023, and 10 in-the-wild videos for real-world generalization. ϕ-Noise achieves competitive or state-of-the-art results at negligible additional cost.

Model            CLIP-T ↑  Aes ↑  Img ↑  LPIPS-T ↓  Flow-E ↓  Subj-C ↑  Smooth ↑  Dyn-D ↑
Wan-I2V          0.308     0.652  0.644  0.116      181.10    0.942     0.978     0.647
GWTF             0.314     0.620  0.637  0.097      152.81    0.942     0.981     0.647
TTM              0.311     0.647  0.653  0.110      102.39    0.948     0.978     0.705
Ours (ϕ-Noise)   0.313     0.637  0.627  0.171      101.49    0.918     0.964     0.764

ϕ-Noise achieves the best Flow-Error (motion fidelity) and Dynamics-Degree scores — the two metrics most directly measuring motion transfer quality — while requiring no additional training.
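The per-metric ranking behind this statement can be checked directly from the table. The short script below uses the values as read from the results table above and treats the arrows as the better direction; the variable names are illustrative.

```python
metrics = ["CLIP-T", "Aes", "Img", "LPIPS-T", "Flow-E", "Subj-C", "Smooth", "Dyn-D"]
# Arrow direction from the table header: True means higher is better.
higher_is_better = [True, True, True, False, False, True, True, True]

scores = {
    "Wan-I2V":        [0.308, 0.652, 0.644, 0.116, 181.10, 0.942, 0.978, 0.647],
    "GWTF":           [0.314, 0.620, 0.637, 0.097, 152.81, 0.942, 0.981, 0.647],
    "TTM":            [0.311, 0.647, 0.653, 0.110, 102.39, 0.948, 0.978, 0.705],
    "Ours (ϕ-Noise)": [0.313, 0.637, 0.627, 0.171, 101.49, 0.918, 0.964, 0.764],
}

best = {}
for i, (metric, higher) in enumerate(zip(metrics, higher_is_better)):
    pick = max if higher else min
    # Iterating a dict yields model names; rank them by the i-th score.
    best[metric] = pick(scores, key=lambda name: scores[name][i])

for metric in metrics:
    print(f"{metric:8s} best: {best[metric]}")
```

On these numbers ϕ-Noise ranks first on Flow-E and Dyn-D, while the remaining six metrics are led by one of the baselines.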

BibTeX

If you find this work useful, please cite:

@article{abramovich2025phinoise,
  title   = {ϕ-Noise: Training-Free Temporal Video Conditioning
            via Phase-Based Noise Manipulation},
  author  = {Abramovich, Ofir and Cohen, Nadav Z. and
            Rosenthal, Adi and Shamir, Ariel},
  journal = {arXiv preprint},
  year    = {2025},
}