Untwisting RoPE: Frequency Control for Shared Attention in DiTs

Anonymous Authors

TL;DR: Shared attention in RoPE-based diffusion transformers often collapses into reference copying, reproducing the reference image instead of transferring its style. We trace this back to RoPE's high-frequency components and introduce a simple, training-free frequency-aware modulation that restores meaningful shared attention and enables controllable style-aligned generation.

Teaser figure: each reference image is shown with three style-aligned generations for the prompts listed.

Reference 1: The Cathedral of Santa Maria del Fiore; The Statue of Liberty; The Colosseum
Reference 2: Two knights playing chess; A man eating pizza; A clown dancing
Reference 3: A koala bear sitting on a tree; A wolf howling at the moon; A monkey swinging from a branch
Reference 4: A woman playing guitar; The Seattle Space Needle; A cat playing with a ball of yarn
Reference 5: A majestic lion; A beautiful butterfly; A scenic waterfall
Reference 6: The New York skyline; An ice skater performing a jump; A woman reading a book in a park
Reference 7: A marathon runner; A fruit basket on a table; Two kittens playing
Reference 8: A bicycle; A seashell; A swan
Reference 9: A fireman; A wooden house; A gold chest
Reference 10: A pine tree; A pineapple; An eagle in flight
Reference 11: A dog catching a frisbee; A bear eating honey; A man riding a bicycle
Reference 12: A firefighter figurine; A doctor figurine; A police woman figurine

Abstract

Positional encodings are essential to transformer-based generative models, yet their behavior in multimodal and attention-sharing settings is not fully understood. In this work, we present a principled analysis of Rotary Positional Embeddings (RoPE), showing that RoPE naturally decomposes into frequency components with distinct positional sensitivities.

We demonstrate that this frequency structure explains why shared-attention mechanisms, where a target image is generated while attending to tokens from a reference image, can lead to reference copying, in which the model reproduces content from the reference instead of extracting only its stylistic cues. Our analysis reveals that the high-frequency components of RoPE dominate the attention computation, forcing queries to attend mainly to spatially aligned reference tokens and thereby inducing this unintended copying behavior.

Building on these insights, we introduce a method for selectively modulating RoPE's frequency bands so that attention reflects semantic similarity rather than strict positional alignment. Applied to modern transformer-based diffusion architectures, where all tokens share attention, this modulation restores stable and meaningful shared attention. As a result, it enables effective control over the degree of style transfer versus content copying, yielding a proper style-aligned generation process in which stylistic attributes are transferred without duplicating reference content.

A Quick Look at RoPE

Transformers are permutation-equivariant, so position must be injected explicitly. Modern diffusion transformers such as Flux do this with Rotary Positional Embeddings (RoPE): each query and key vector is split into 2-dimensional chunks, and each chunk is rotated by an angle proportional to the token's position. The angular frequency \(\theta_d\) varies geometrically across chunks, from very high to very low.
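The rotation described above can be sketched in a few lines. This is a minimal one-dimensional illustration, not the Flux implementation: the base of 10000, the pairing of adjacent coordinates into chunks, and the function names `rope_frequencies` and `apply_rope` are assumptions made for the example, and image models typically apply RoPE along two spatial axes.

```python
import cmath

def rope_frequencies(dim, base=10000.0):
    """Angular frequency of each 2-D chunk: theta_d = base^(-2d/dim), d = 0 .. dim/2 - 1."""
    return [base ** (-2.0 * d / dim) for d in range(dim // 2)]

def apply_rope(vec, pos, base=10000.0):
    """Rotate chunk d, i.e. (vec[2d], vec[2d+1]), by the angle pos * theta_d."""
    out = []
    for d, theta in enumerate(rope_frequencies(len(vec), base)):
        chunk = complex(vec[2 * d], vec[2 * d + 1])      # treat the chunk as a complex number
        rotated = chunk * cmath.exp(1j * pos * theta)    # rotation = multiplication by e^{i*angle}
        out.extend([rotated.real, rotated.imag])
    return out

# Toy 8-dimensional query (4 chunks) placed at position 3.
q = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
print(rope_frequencies(len(q)))   # highest frequency first, lowest last
print(apply_rope(q, pos=3))
```

Because each chunk is only rotated, the norm of the vector is unchanged; position affects attention solely through the angles between rotated queries and keys.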

A key property of RoPE is that the inner product between a rotated query and a rotated key depends only on the relative displacement between their positions. This is what gives RoPE its locality: two tokens at very different positions tend to have a smaller inner product, and therefore lower attention. But, as we show next, this locality is not uniform across the embedding — some chunks of RoPE care a lot about position, and others barely at all.
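A short derivation makes the relative-position property explicit. The notation here is introduced only for illustration: \(q_d\) and \(k_d\) are the 2-dimensional chunks of a query and a key written as complex numbers, \(m\) and \(n\) are their positions, and \(R(\phi)\) denotes rotation by the angle \(\phi\), which in complex form is multiplication by \(e^{i\phi}\).

\[ \langle R(m\theta_d)\,q_d,\; R(n\theta_d)\,k_d \rangle = \mathrm{Re}\!\left[ q_d\, e^{i m \theta_d}\; \overline{k_d\, e^{i n \theta_d}} \right] = \mathrm{Re}\!\left[ q_d\, \overline{k_d}\; e^{i (m-n) \theta_d} \right] \]

Summing over chunks, the full inner product depends on the positions only through \(m - n\). For identical content, \(q_d = k_d\), it reduces to

\[ \sum_{d} |q_d|^2 \cos\!\big((m-n)\,\theta_d\big), \]

so each chunk contributes a cosine whose period is set by its own frequency. A chunk with \(\theta_d = 1\) already turns negative at a shift of 2 positions, whereas a chunk with \(\theta_d = 1/10000\) still gives \(\cos(64/10000) \approx 0.99998\) at a shift of 64.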

Shared Attention & the Reference Copying Problem

One of the most useful tricks in diffusion models is shared attention: tokens of a target image are allowed to attend not only to themselves, but also to tokens of a reference image. This unlocks training-free image manipulation tasks — appearance transfer, style transfer, reference-based editing — by implicitly aligning semantic regions between the two images. A representative use case is style-aligned generation (StyleAligned): generate a set of images that share a consistent visual style while depicting different content.

While this works well on UNet-based diffusion models, translating the idea to RoPE-based diffusion transformers is surprisingly fragile. Because target and reference tokens are placed on the same positional grid, RoPE's locality bias kicks in: each target query is pulled toward the reference token sitting at the same spatial position. Instead of borrowing style, the model ends up copying content.
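The positional pull can be reproduced in a toy setting. The sketch below is an assumption-laden illustration rather than the actual model: it uses a 1-D position grid, a single attention head, and tokens with identical content so that only position distinguishes the keys. The helper names (`rope`, `softmax`, `shared_attention_weights`) are hypothetical.

```python
import cmath
import math

def rope(vec, pos, base=10000.0):
    """Rotate each 2-D chunk of vec by pos * theta_d, with theta_d = base^(-2d/dim)."""
    dim, out = len(vec), []
    for d in range(dim // 2):
        z = complex(vec[2 * d], vec[2 * d + 1]) * cmath.exp(1j * pos * base ** (-2.0 * d / dim))
        out += [z.real, z.imag]
    return out

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def shared_attention_weights(query, q_pos, target_keys, reference_keys):
    """Attention of one target query over the target keys followed by the reference keys.

    Both images share the same positional grid: token i of either image sits at position i.
    """
    q = rope(query, q_pos)
    keys = [rope(k, i) for i, k in enumerate(target_keys)]
    keys += [rope(k, i) for i, k in enumerate(reference_keys)]
    scale = 1.0 / math.sqrt(len(query))
    return softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])

n, dim = 16, 16
token = [1.0] * dim                       # identical content everywhere
weights = shared_attention_weights(token, 5, [token] * n, [token] * n)
ref_weights = weights[n:]                 # attention mass on the reference tokens
most_attended = max(range(n), key=ref_weights.__getitem__)
print(most_attended)                      # index of the most-attended reference token
```

With content held constant, the query at position 5 puts its largest reference weight on the reference token at position 5, which is the spatially aligned behavior described above.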

Figure: a reference image (a wooden bull figurine) and a target generated without shared attention.

So the question is: why does RoPE pull attention toward spatially aligned tokens so aggressively, and can we surgically tame it without throwing positional information away? For that, we need to look at RoPE one frequency band at a time.

Frequency Bands of RoPE

Mean attention similarity vs. positional shift across RoPE frequency bands

RoPE's per-chunk angular frequency \(\theta_d\) follows a geometric progression ranging from \(1\) down to roughly \(1/10000\). Plotting the mean attention similarity between two identical vectors as a function of their positional shift \(\Delta\), split by frequency band, reveals a striking picture: high-frequency components drop sharply with even small displacements (strong locality), while low-frequency components are nearly insensitive to position (global, semantic interactions).
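A rough numerical version of this plot can be computed directly from the frequencies. For two identical unit-norm chunks, the similarity after a shift \(\Delta\) is \(\cos(\theta_d \Delta)\), so the mean over a band is just an average of cosines. The dimension of 64, the base of 10000, and the half-and-half split into "high" and "low" bands are assumptions chosen for the example; the split point used in the figure is not stated here.

```python
import math

def rope_frequencies(dim, base=10000.0):
    """theta_d = base^(-2d/dim) for d = 0 .. dim/2 - 1, highest frequency first."""
    return [base ** (-2.0 * d / dim) for d in range(dim // 2)]

def band_similarity(thetas, shift):
    """Mean of cos(theta * shift) over a band: similarity of a vector with its shifted copy."""
    return sum(math.cos(t * shift) for t in thetas) / len(thetas)

thetas = rope_frequencies(64)             # 32 chunks
high, low = thetas[:16], thetas[16:]      # assumed split: first half vs. second half

for shift in (0, 1, 4, 16, 64):
    print(shift,
          round(band_similarity(high, shift), 3),
          round(band_similarity(low, shift), 3))
```

In this toy setup the high band loses most of its similarity within a few positions, while the low band stays close to 1 across the shifts shown.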

Figure: attention visualization with high- vs. low-frequency RoPE components. Panels: query, all frequencies, high frequencies only, low frequencies only.

We see the same effect inside Flux. Splitting RoPE into high- and low-frequency subsets and zeroing the rest, we visualize the attention from selected query points (red dots). With only high-frequency components, attention concentrates on positionally aligned tokens. With only low-frequency components, attention becomes global and follows semantic structure. The same trend holds for cross-image attention: scaling down the high-frequency components of the reference keys shifts the model from positional copying toward semantically aligned attention.

Frequency-Aware Modulation

Our fix is to continuously modulate RoPE's frequency bands on the reference keys: attenuate the high frequencies (\(s_{hf}<1\)) to break the positional dominance that causes copying, and amplify the low frequencies (\(s_{lf}>1\)) to encourage global, semantic cross-image attention.

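Concretely, for each RoPE chunk \(d \in \{0, \dots, D/2 - 1\}\), ordered from the highest frequency to the lowest, we set \[ s_d = s_{hf} + (s_{lf} - s_{hf})\,\tilde{d}^{\,\beta}, \qquad \tilde{d} = \frac{d}{D/2 - 1}, \] where \(D\) is the dimension of the rotated query and key vectors (so there are \(D/2\) chunks) and the exponent \(\beta\) controls how quickly the scale moves from \(s_{hf}\) to \(s_{lf}\). This polynomial schedule transitions smoothly across the frequency spectrum, keeping attention stable. The two scalars \(s_{hf}\) and \(s_{lf}\) are the only knobs of the method. Together they give a continuous handle on the tradeoff between structural fidelity and free-form style transfer.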
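The schedule above is straightforward to evaluate. The following sketch computes \(s_d\) for every chunk and applies it to a reference key; the function names and the numeric values of \(s_{hf}\), \(s_{lf}\), and \(\beta\) are illustrative assumptions, not settings reported by the authors. Since scaling a chunk by a scalar commutes with rotating it, the scale can be applied either before or after the RoPE rotation.

```python
def band_scales(num_chunks, s_hf, s_lf, beta=1.0):
    """s_d = s_hf + (s_lf - s_hf) * (d / (num_chunks - 1)) ** beta.

    Chunk 0 is the highest frequency (scale s_hf), the last chunk the lowest (scale s_lf).
    """
    if num_chunks == 1:
        return [s_hf]
    return [s_hf + (s_lf - s_hf) * (d / (num_chunks - 1)) ** beta
            for d in range(num_chunks)]

def modulate_reference_key(key, scales):
    """Multiply both coordinates of chunk d of a reference key by scales[d]."""
    return [x * scales[i // 2] for i, x in enumerate(key)]

# Illustrative values only: attenuate high frequencies, amplify low ones.
scales = band_scales(num_chunks=8, s_hf=0.5, s_lf=1.5, beta=2.0)
reference_key = [1.0] * 16                # 8 chunks of 2 coordinates
modulated = modulate_reference_key(reference_key, scales)
print([round(s, 3) for s in scales])
```

Only the reference keys are modulated in this sketch, matching the description above; target queries and keys are left untouched.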

Timestep Scheduling

Early denoising steps fix global structure; later steps add texture and style. We therefore linearly increase both \(s_{hf}\) and \(s_{lf}\) over the denoising steps: a fixed low scale fails to transfer fine stylistic detail, while a fixed high scale leaks too much structure. Our scheduled modulation lets the model first establish the correct global layout, then sharpen attention onto the reference for fine-grained style.
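A linear ramp of this kind could look as follows. The start and end values, the number of steps, and the helper name `linear_schedule` are all assumptions for illustration; the text only states that both scales increase linearly during denoising.

```python
def linear_schedule(start, end, step, num_steps):
    """Linearly interpolate from start (at step 0) to end (at the last step)."""
    if num_steps <= 1:
        return end
    t = step / (num_steps - 1)
    return start + (end - start) * t

num_steps = 5                              # toy number of denoising steps
for step in range(num_steps):
    s_hf = linear_schedule(0.2, 0.8, step, num_steps)   # hypothetical range, stays below 1
    s_lf = linear_schedule(1.1, 1.6, step, num_steps)   # hypothetical range, stays above 1
    print(step, round(s_hf, 3), round(s_lf, 3))
```

The per-step values of \(s_{hf}\) and \(s_{lf}\) would then feed the per-chunk schedule, so early steps see weak reference influence and late steps see stronger influence.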

Controlling the Style–Content Tradeoff

Our frequency-aware modulation gives a continuous handle on how strongly the reference style is transferred. Increasing \(s_{lf}\) amplifies the reference's low-frequency components in shared attention, progressively pulling more of its visual style into the generated image while preserving the target prompt.

Figure: generated output as a function of \(s_{lf}\), ranging from weaker to stronger style transfer.

Comparisons

Figure: qualitative comparison. Columns: Reference, Ours, AlignedGen, StyleAligned, IPAdapter, B-LoRA. Each reference image is paired with three target prompts:

Reference 1 (Crossing the Delaware): A lighthouse on a snowy shore; A top down view of a pond with two ducks swimming; A wooden cottage in a forest clearing
Reference 2: A telescope; An airplane; A steam boat
Reference 3 (Persian miniature): The Taj Mahal; A busy pizza restaurant; A majestic lion resting on a rock
Reference 4 (Modernist village): A sailboat on the ocean; A monkey swinging from a tree branch in the jungle; A dolphin jumping out of the water
Reference 5 (Palette-knife woman): New York skyline; An ice skater performing a jump; A woman reading a book in a park
Reference 6 (Impressionist bazaar): The Eiffel Tower; A man fishing by a river; A majestic eagle soaring in the sky
Reference 7 (Watercolor beach): Two Roman soldiers fighting; The Tower of Pisa; The Statue of Liberty
Reference 8 (Ink-and-watercolor park): A Roman emperor in a park; A traffic jam on a highway; A pirate with a parrot on his shoulder at the helm of his ship
Reference 9 (Minimalist ballerina): A jazz musician playing saxophone on stage; A soccer player kicking a ball; A hot air balloon flying over mountains
Reference 10 (Pen-and-ink portrait): A portrait of a cat; A palm tree; A steam train
Reference 11: A marathon runner; A fruit basket on a table; Two kittens playing
Reference 12 (Pencil-sketch bear): A cute puppy; A football helmet sketch; A wine glass sketch
Reference 13 (Plush-toy cottage): A grand piano; An old television; A snowman
Reference 14: A man eating pizza; Two knights playing chess; A clown dancing
Reference 15 (Color-pencil kiss): A woman playing guitar; The Seattle needle; A cat playing with a ball of yarn

For each reference image we generate images for three target prompts and compare our method against AlignedGen, StyleAligned, IPAdapter, and B-LoRA. Our approach achieves faithful style transfer while maintaining structural coherence, and it avoids leaking reference content that is irrelevant to the target prompt into the generated image.