Positional encodings are essential to transformer-based generative models, yet their behavior in multimodal and attention-sharing settings is not fully understood. In this work, we present a principled analysis of Rotary Positional Embeddings (RoPE), showing that RoPE naturally decomposes into frequency components with distinct positional sensitivities.
We demonstrate that this frequency structure explains why shared-attention mechanisms, where a target image is generated while attending to tokens from a reference image, can lead to reference copying, in which the model reproduces content from the reference instead of extracting only its stylistic cues. Our analysis reveals that the high-frequency components of RoPE dominate the attention computation, forcing queries to attend mainly to spatially aligned reference tokens and thereby inducing this unintended copying behavior.
Building on these insights, we introduce a method for selectively modulating RoPE's frequency bands so that attention reflects semantic similarity rather than strict positional alignment. Applied to modern transformer-based diffusion architectures, where all tokens share attention, this modulation restores stable and meaningful shared attention. As a result, it enables effective control over the degree of style transfer versus content copying, yielding a proper style-aligned generation process in which stylistic attributes are transferred without duplicating reference content.
Transformers are permutation-equivariant, so position must be injected explicitly. Modern diffusion transformers such as Flux do this with Rotary Positional Embeddings (RoPE): each query and key vector is split into 2-dimensional chunks, and each chunk is rotated by an angle proportional to the token's position. The angular frequency \(\theta_d\) varies geometrically across chunks, from very high to very low.
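As a rough illustration of the rotation described above, the following NumPy sketch applies a 1-D RoPE to a small batch of vectors. It assumes the common parameterization \(\theta_d = 10000^{-2d/D}\) and an interleaved chunk layout (dimensions \(2d\) and \(2d+1\) form chunk \(d\)); diffusion transformers for images generally apply RoPE per spatial axis, which this sketch does not reproduce. The function names `rope_frequencies` and `apply_rope` are hypothetical.

```python
import numpy as np

def rope_frequencies(dim, base=10000.0):
    """Per-chunk angular frequencies, geometric from 1 down toward 1/base."""
    # One frequency per 2-dimensional chunk; chunk 0 has the highest frequency.
    return base ** (-np.arange(0, dim, 2) / dim)

def apply_rope(x, positions, base=10000.0):
    """Rotate each 2-D chunk of x (shape [n, dim]) by positions * theta_d."""
    theta = rope_frequencies(x.shape[-1], base)       # [dim/2]
    angles = positions[:, None] * theta[None, :]      # [n, dim/2]
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]            # the two coordinates of each chunk
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin         # standard 2-D rotation
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Usage: four tokens at positions 0..3 with an 8-dimensional embedding.
x = np.random.default_rng(0).normal(size=(4, 8))
rotated = apply_rope(x, np.arange(4.0))
```

Because each chunk is only rotated, the norm of every token vector is unchanged; position enters purely through the angles.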
A key property of RoPE is that the inner product between a rotated query and a rotated key depends only on the relative displacement between their positions. This is what gives RoPE its locality: two tokens at very different positions tend to have a smaller inner product, and therefore lower attention. But, as we show next, this locality is not uniform across the embedding — some chunks of RoPE care a lot about position, and others barely at all.
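The relative-position property can be checked chunk by chunk. As a short worked derivation (the notation \(R(\phi)\), \(q_d\), \(k_d\), \(m\), \(n\) is introduced here for illustration), let \(R(\phi)\) be the \(2\times 2\) rotation by angle \(\phi\), and let \(q_d\) and \(k_d\) be the \(d\)-th chunks of a query at position \(m\) and a key at position \(n\). Since \(R(\phi)^\top R(\psi) = R(\psi - \phi)\), \[ \langle R(m\theta_d)\,q_d,\; R(n\theta_d)\,k_d \rangle = q_d^\top R\big((n-m)\,\theta_d\big)\,k_d , \] which depends on the positions only through \(n - m\). Summing over chunks and taking identical vectors \(q = k\) gives \[ \sum_d \|q_d\|^2 \cos\big((n-m)\,\theta_d\big), \] so each chunk contributes a cosine whose sensitivity to displacement is set by its own frequency \(\theta_d\).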
One of the most useful tricks in diffusion models is shared attention: tokens of a target image are allowed to attend not only to themselves, but also to tokens of a reference image. This unlocks training-free image manipulation tasks — appearance transfer, style transfer, reference-based editing — by implicitly aligning semantic regions between the two images. A representative use case is style-aligned generation (StyleAligned): generate a set of images that share a consistent visual style while depicting different content.
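A minimal single-head sketch of shared attention, under simplifying assumptions: no batching, no multi-head split, and none of the additional normalization steps that specific methods may add. The function names `softmax` and `shared_attention` and the argument names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref):
    """Target queries attend over target tokens and reference tokens jointly."""
    k = np.concatenate([k_tgt, k_ref], axis=0)        # [n_tgt + n_ref, dim]
    v = np.concatenate([v_tgt, v_ref], axis=0)
    scores = q_tgt @ k.T / np.sqrt(q_tgt.shape[-1])   # scaled dot-product scores
    weights = softmax(scores, axis=-1)                # each row sums to 1
    return weights @ v, weights

# Usage: 5 target tokens and 7 reference tokens with 16-dimensional heads.
rng = np.random.default_rng(0)
q_t, k_t, v_t = (rng.normal(size=(5, 16)) for _ in range(3))
k_r, v_r = (rng.normal(size=(7, 16)) for _ in range(2))
out, weights = shared_attention(q_t, k_t, v_t, k_r, v_r)
```

The last 7 columns of `weights` hold the attention each target token pays to the reference, which is the part the rest of this discussion is concerned with.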
While shared attention works well in UNet-based diffusion models, translating this idea to RoPE-based diffusion transformers is surprisingly fragile. Because target and reference tokens are placed on the same positional grid, RoPE's locality bias kicks in: each target query is pulled toward the reference token sitting at the same spatial position. Instead of borrowing style, the model ends up copying content.
RoPE's per-chunk angular frequency \(\theta_d\) follows a geometric series ranging from \(1\) down to \(1/10000\). Plotting the mean attention similarity between two identical vectors as a function of their positional shift \(\Delta\), split by frequency band, reveals a striking picture: high-frequency components drop sharply with even small displacements (strong locality), while low-frequency components are nearly insensitive to position (global, semantic interactions).
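The band-wise behavior can be reproduced with a small sketch. For two identical vectors whose chunks all have unit norm, the contribution of chunk \(d\) at shift \(\Delta\) is \(\cos(\Delta\,\theta_d)\), so averaging cosines within each band gives the mean similarity. The parameterization \(\theta_d = 10000^{-2d/D}\), the dimension of 64, and the split of the spectrum into two equal halves are assumptions made for illustration, and `band_similarity` is a hypothetical name.

```python
import numpy as np

def band_similarity(shifts, dim=64, base=10000.0, split=0.5):
    """Mean per-chunk similarity cos(shift * theta_d) for high and low bands."""
    theta = base ** (-np.arange(0, dim, 2) / dim)     # high to low frequency
    cos = np.cos(np.outer(shifts, theta))             # [n_shifts, dim/2]
    n_high = int(len(theta) * split)                  # assumed band boundary
    high = cos[:, :n_high].mean(axis=1)
    low = cos[:, n_high:].mean(axis=1)
    return high, low

# Usage: compare the two bands over a range of positional shifts.
shifts = np.arange(0, 33)
high, low = band_similarity(shifts)
for s in (0, 1, 4, 16, 32):
    print(f"shift={s:2d}  high-band={high[s]:+.3f}  low-band={low[s]:+.3f}")
```

Under these assumptions the high band falls off within a few positions, while the low band stays close to 1 over the same range.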
We see the same effect inside Flux. Splitting RoPE into high- and low-frequency subsets and zeroing the rest, we visualize the attention from selected query points (red dots). With only high-frequency components, attention concentrates on positionally aligned tokens. With only low-frequency components, attention becomes global and follows semantic structure. The same trend holds for cross-image attention: scaling down the high-frequency components of the reference keys shifts the model from positional copying toward semantically aligned attention.
Our fix is to continuously modulate RoPE's frequency bands on the reference keys: attenuate the high frequencies (\(s_{hf}<1\)) to break the positional dominance that causes copying, and amplify the low frequencies (\(s_{lf}>1\)) to encourage global, semantic cross-image attention.
Concretely, for each RoPE chunk \(d = 0, \dots, D/2 - 1\), ordered from highest to lowest frequency, we set \[ s_d = s_{hf} + (s_{lf} - s_{hf})\,\tilde{d}^{\,\beta}, \qquad \tilde{d} = \frac{d}{D/2 - 1}, \] so that \(s_0 = s_{hf}\) and \(s_{D/2-1} = s_{lf}\), with the exponent \(\beta\) shaping the transition in between. This polynomial schedule transitions smoothly across the frequency spectrum, keeping attention stable. The two scalars \(s_{hf}\) and \(s_{lf}\) are the only knobs of the method; together they give a continuous handle on the tradeoff between structural fidelity and free-form style transfer.
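A minimal sketch of the per-chunk scale schedule and its application to reference keys. It assumes an interleaved chunk layout (dimensions \(2d\) and \(2d+1\) form chunk \(d\)) and that modulation means multiplying each chunk of the key by \(s_d\); the default \(\beta = 1\) and the example values of \(s_{hf}\) and \(s_{lf}\) are placeholders, not values from this work. The function names are hypothetical.

```python
import numpy as np

def frequency_scales(dim, s_hf, s_lf, beta=1.0):
    """s_d = s_hf + (s_lf - s_hf) * d_tilde**beta for each of the dim/2 chunks."""
    n_chunks = dim // 2
    d_tilde = np.arange(n_chunks) / (n_chunks - 1)    # 0 (highest freq) .. 1 (lowest)
    return s_hf + (s_lf - s_hf) * d_tilde ** beta

def modulate_reference_keys(k_ref, s_hf, s_lf, beta=1.0):
    """Scale each 2-D chunk of the reference keys (shape [n, dim]) by s_d."""
    scales = frequency_scales(k_ref.shape[-1], s_hf, s_lf, beta)
    # Repeat each scale twice so both coordinates of a chunk share it.
    return k_ref * np.repeat(scales, 2)

# Usage: attenuate high frequencies, amplify low ones (placeholder values).
k_ref = np.random.default_rng(0).normal(size=(7, 16))
k_mod = modulate_reference_keys(k_ref, s_hf=0.5, s_lf=1.5, beta=2.0)
```

Because \(s_d\) is a scalar per chunk, it commutes with the chunk's rotation, so in this sketch it makes no difference whether the scaling is applied before or after RoPE.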
Early denoising steps fix global structure; later steps add texture and style. We linearly increase both \(s_{hf}\) and \(s_{lf}\) over time: a fixed low scale fails to transfer fine stylistic detail, while a fixed high scale leaks too much structure. Our scheduled modulation lets the model first establish the correct global layout, then sharpen attention onto the reference for fine-grained style.
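The linear schedule over denoising steps can be sketched as a simple interpolation. The start and end values below are placeholders chosen only to show the shape of the schedule, and `scheduled_scale` is a hypothetical helper.

```python
def scheduled_scale(step, num_steps, start, end):
    """Linearly interpolate a scale from start (first step) to end (last step)."""
    if num_steps < 2:
        return float(end)
    t = step / (num_steps - 1)                        # 0 at the first step, 1 at the last
    return start + (end - start) * t

# Usage: both scales grow as denoising proceeds (placeholder endpoints).
num_steps = 5
for step in range(num_steps):
    s_hf = scheduled_scale(step, num_steps, start=0.2, end=0.8)
    s_lf = scheduled_scale(step, num_steps, start=1.0, end=1.6)
    print(f"step {step}: s_hf={s_hf:.2f}  s_lf={s_lf:.2f}")
```

Each step's pair of scales would then be passed to whatever routine modulates the reference keys at that step.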
Our frequency-aware modulation gives a continuous handle on how strongly the reference style is transferred. Increasing \(s_{lf}\) amplifies the reference's low-frequency components in shared attention, progressively pulling more of its visual style into the generated image while preserving the target prompt.
For each reference image we generate three target prompts and compare our method against AlignedGen, StyleAligned, IPAdapter, and B-LoRA. Our approach achieves faithful style transfer while maintaining structural coherence, and it avoids leaking reference content that is irrelevant to the target prompt into the generated image.