Recently, Microsoft (Tianzhu Ye et al.) released a paper called Differential Transformer, which introduces a differential attention mechanism intended to cancel noise in the attention module. The concept is similar to how noise-cancelling headphones work.
In the attention block, $Q$ and $K$ are projected to $Q_1$, $Q_2$, $K_1$, $K_2$. The idea here is that $Q_1$ and $K_1$ can be treated as coming from "one microphone" and $Q_2$ and $K_2$ as coming from the "other microphone". The attention is modified as follows:

$$\text{DiffAttn}(X) = \left(\text{softmax}\!\left(\frac{Q_1 K_1^\top}{\sqrt{d}}\right) - \lambda\,\text{softmax}\!\left(\frac{Q_2 K_2^\top}{\sqrt{d}}\right)\right) V$$

Here, $\lambda$ is a learnable scalar and the authors have defined it as

$$\lambda = \exp(\lambda_{q_1} \cdot \lambda_{k_1}) - \exp(\lambda_{q_2} \cdot \lambda_{k_2}) + \lambda_{\text{init}}$$

where $\lambda_{q_1}$, $\lambda_{k_1}$, $\lambda_{q_2}$, $\lambda_{k_2}$ are learned through backprop. Whereas, $\lambda_{\text{init}}$ is defined as

$$\lambda_{\text{init}} = 0.8 - 0.6 \times \exp(-0.3 \cdot (l - 1)),$$

with $l$ being the layer index.
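A minimal sketch of how this could look in PyTorch, assuming a single head and omitting the per-head GroupNorm and output scaling used in the paper; the class and parameter names (`DiffAttention`, `layer_idx`, etc.) are my own, not from the authors' code:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttention(nn.Module):
    """Single-head sketch of differential attention (hypothetical naming)."""

    def __init__(self, d_model: int, d_head: int, layer_idx: int = 1):
        super().__init__()
        # Q and K get twice the head dim, then split into (Q1, Q2) and (K1, K2)
        self.w_q = nn.Linear(d_model, 2 * d_head, bias=False)
        self.w_k = nn.Linear(d_model, 2 * d_head, bias=False)
        self.w_v = nn.Linear(d_model, d_head, bias=False)
        self.d_head = d_head
        # learnable vectors that re-parameterize the scalar lambda
        self.lambda_q1 = nn.Parameter(torch.randn(d_head) * 0.1)
        self.lambda_k1 = nn.Parameter(torch.randn(d_head) * 0.1)
        self.lambda_q2 = nn.Parameter(torch.randn(d_head) * 0.1)
        self.lambda_k2 = nn.Parameter(torch.randn(d_head) * 0.1)
        # lambda_init = 0.8 - 0.6 * exp(-0.3 * (l - 1)), l = layer index
        self.lambda_init = 0.8 - 0.6 * math.exp(-0.3 * (layer_idx - 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q1, q2 = self.w_q(x).chunk(2, dim=-1)  # the two query "microphones"
        k1, k2 = self.w_k(x).chunk(2, dim=-1)  # the two key "microphones"
        v = self.w_v(x)
        scale = 1.0 / math.sqrt(self.d_head)
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)
        # lambda = exp(lq1 . lk1) - exp(lq2 . lk2) + lambda_init
        lam = (torch.exp(self.lambda_q1 @ self.lambda_k1)
               - torch.exp(self.lambda_q2 @ self.lambda_k2)
               + self.lambda_init)
        # subtracting the second map cancels attention "noise" common to both
        return (a1 - lam * a2) @ v


# quick shape check with random inputs
attn = DiffAttention(d_model=64, d_head=32, layer_idx=1)
out = attn(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 32])
```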
My thoughts
- Based on the equation, this does seem promising for minimizing noise in attention blocks. I'm eager to try this out in an encoder-decoder model and visualize the attention maps.
- Not sure how the authors arrived at the initialization of $\lambda_{\text{init}}$, but they claim the results are robust to the choice of initialization (the quick calculation below shows how the default schedule varies with depth).
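For context on that point, here is a quick check (my own, not from the paper's code) of how the $\lambda_{\text{init}} = 0.8 - 0.6 \times \exp(-0.3 \cdot (l - 1))$ schedule behaves across layers:

```python
import math

# lambda_init starts near 0.2 in the first layer and approaches 0.8 in deep layers
for layer_idx in (1, 4, 12, 24):
    lam_init = 0.8 - 0.6 * math.exp(-0.3 * (layer_idx - 1))
    print(f"layer {layer_idx:2d}: lambda_init = {lam_init:.3f}")
# layer  1: lambda_init = 0.200
# layer  4: lambda_init = 0.556
# layer 12: lambda_init = 0.778
# layer 24: lambda_init = 0.799
```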