Attention Is All You NeedπŸ˜€

A. Self Attention:

A1. Vectorization:

  1. One Hot Encoding (OHE): a 1 at the position of the word, 0 elsewhere.
  2. Bag of Words (BoW): a count of each word in the sentence.
s1. ram sita ram
s2. sita ram hari

BoW:
		ram sita hari
s1 [ 2   1    0  ]
s2 [ 1   1    1  ]

OHE:
		  ram sita hari
ram  [ 1   0    0  ]
sita [ 0   1    0  ]
hari [ 0   0    1  ]
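
A minimal NumPy sketch of both encodings for the two example sentences; the vocabulary order (ram, sita, hari) and the use of NumPy are assumptions for illustration:

```python
import numpy as np

vocab = ["ram", "sita", "hari"]            # assumed vocabulary order
sentences = [["ram", "sita", "ram"],       # s1
             ["sita", "ram", "hari"]]      # s2

# One Hot Encoding: a 1 in the word's own position, 0 elsewhere
ohe = {w: np.eye(len(vocab), dtype=int)[i] for i, w in enumerate(vocab)}
print(ohe["ram"])    # [1 0 0]

# Bag of Words: count of each vocabulary word per sentence
bow = np.array([[s.count(w) for w in vocab] for s in sentences])
print(bow)           # [[2 1 0]
                     #  [1 1 1]]
```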
  3. Word Embeddings: capture semantic meaning (averaged across contexts); similar words have close embedding values.
Apple as fruit: 9000 sentences
Apple as company: 1000 sentences 

Final SAMPLE embedding of word "Apple":

		       taste technology -> (assuming only two embedding components)
	Apple:  [ 0.9      0.1  ] -> (static) embedding tilted towards fruit, since 9000/(9000+1000) = 0.9 of its occurrences use the fruit sense
	
	So, there is a **need for smart contextual embeddings**, achieved through self attention.

A2. Contextual Embeddings:

  1. Contextual Embedding:

    Example sentences:

    s1. money bank grows:

    eβ‚˜β‚’β‚™β‚‘α΅§(new) = 0.7 eβ‚˜β‚’β‚™β‚‘α΅§ + 0.2 eᡦₐₙₖ + 0.1 eπ“°α΅£β‚’π“Œβ‚›

    eᡦₐₙₖ(new) = 0.25 eβ‚˜β‚’β‚™β‚‘α΅§ + 0.7 eᡦₐₙₖ + 0.05 eπ“°α΅£β‚’π“Œβ‚›

    s2. river bank flows:

    eᡦₐₙₖ(new) = 0.3 eα΅£α΅’α΅₯β‚‘α΅£ + 0.6 eᡦₐₙₖ + 0.1 eπ’»β‚—β‚’π“Œβ‚›

    β†’ β€œbank” refers to different things in the two sentences.

    Here, the embeddings are combined per sentence to capture the context as well. The word β€œbank” now gets a different embedding depending on whether it appears with β€œriver” or β€œmoney”. The same applies to every word in the sentence.

    β†’ We can say the weighting factors are based on the similarity between words, obtained from the dot product of the old static embeddings. E.g.:

    eᡦₐₙₖ(new) = [eᡦₐₙₖ.eα΅€β‚˜β‚’β‚™β‚‘α΅§] eβ‚˜β‚’β‚™β‚‘α΅§ + [eᡦₐₙₖ.eᡀᡦₐₙₖ] eᡦₐₙₖ + [eᡦₐₙₖ.eα΅€π“°α΅£β‚’π“Œβ‚›] eπ“°α΅£β‚’π“Œβ‚› β€”β€”β€”β€”β€”β€”(eq1)

    Assume [eᡦₐₙₖ.eα΅€β‚˜β‚’β‚™β‚‘α΅§] = s₂₁, [eᡦₐₙₖ.eᡀᡦₐₙₖ] = sβ‚‚β‚‚, [eᡦₐₙₖ.eα΅€π“°α΅£β‚’π“Œβ‚›] = s₂₃ β€”β€”β€”β€”β€”β€”β€”(focus1)

    then,

    β†’ a softmax operation is performed on them to get W₂₁, Wβ‚‚β‚‚, W₂₃ respectively.

    So,

    Yᡦₐₙₖ(say) = eᡦₐₙₖ(new) = W₂₁ eβ‚˜β‚’β‚™β‚‘α΅§ + Wβ‚‚β‚‚ eᡦₐₙₖ + W₂₃ eπ“°α΅£β‚’π“Œβ‚› = General Contextual Word Embeddings β€”β€”β€”β€”β€”β€”(eq2)

    β†’ NOTE that the contextual embeddings of all words can be computed in parallel (see the sketch below). However, positional/sequential information is lost: the computation does not capture which word comes before or after which.
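
    A minimal NumPy sketch of these general (parameter-free) contextual embeddings; the 2-dimensional static embeddings for β€œmoney”, β€œbank”, β€œgrows” are made-up values for illustration:

    ```python
    import numpy as np

    # made-up 2-dimensional static embeddings (rows: money, bank, grows)
    E = np.array([[0.9, 0.1],    # money
                  [0.7, 0.3],    # bank
                  [0.2, 0.8]])   # grows

    S = E @ E.T                                            # dot-product similarities s_ij (focus1)
    W = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)   # row-wise softmax -> weights W_ij
    Y = W @ E                                              # each row = weighted sum of embeddings (eq2)

    print(Y[1])   # new contextual embedding of "bank" in "money bank grows"
    ```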

    β†’ Limitation of General Contextual Word Embeddings:

    1. β€œpiece of cake”: may refer to either an easy task or an actual portion of cake; the general scheme has no learnable parameters to steer the embedding toward the sense the task needs.
    2. So, there is a need for Task Specific Contextual Embeddings: introduce learnable parameters.

β†’ focus1 can be termed as a query probing a key to get a similarity, where

β†’ the old (static) embeddings play all three roles of query, key and value in general contextual embeddings.

  2. Task Specific Contextual Embeddings:

    β†’ Each static embedding is now projected by three learnable weight matrices Wq, Wβ‚–, Wα΅₯ to produce its query (q), key (k) and value (v) vectors; all remaining operations are the same as for general contextual embeddings. For our previous example, eq1 becomes:

    Yᡦₐₙₖ = [qᡦₐₙₖ.kα΅€β‚˜β‚’β‚™β‚‘α΅§] vβ‚˜β‚’β‚™β‚‘α΅§ + [qᡦₐₙₖ.kᡀᡦₐₙₖ] vᡦₐₙₖ + [qᡦₐₙₖ.kα΅€π“°α΅£β‚’π“Œβ‚›] vπ“°α΅£β‚’π“Œβ‚›

    Hence, the operations so far can be summarized in compact form as:

    Attention (Q, K, V) = Softmax( Q Kα΅€ ) V
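
    A minimal sketch of the projection step, assuming toy dimensions and random matrices standing in for the learned Wq, Wβ‚–, Wα΅₯:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    d_model = d_k = 2                          # assumed toy dimensions

    X = np.array([[0.9, 0.1],                  # money  (static embeddings, one row per word)
                  [0.7, 0.3],                  # bank
                  [0.2, 0.8]])                 # grows

    # random stand-ins for the learned projection matrices
    W_q = rng.normal(size=(d_model, d_k))
    W_k = rng.normal(size=(d_model, d_k))
    W_v = rng.normal(size=(d_model, d_k))

    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # task-specific query, key, value vectors
    ```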

  3. Scaled Dot Product Attention:

    $$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V $$

    where dβ‚– is the dimensionality of the key (k) vectors.
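
    Continuing the sketch above (reusing numpy and the Q, K, V matrices from it), a minimal scaled dot-product attention function:

    ```python
    def scaled_dot_product_attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)        # scaled dot-product similarities
        weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
        return weights @ V                     # task-specific contextual embeddings

    Y = scaled_dot_product_attention(Q, K, V)  # one row per word of "money bank grows"
    ```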

    β†’ Nature of dot product: for large dβ‚–, the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients; dividing by √dβ‚– counteracts this.

    Extra Note:

B. Multi-head Attention

  1. Multi-head Attention:

    Assume the sentence β€œThe man saw the astronomer with a telescope”.

    What does it mean? Did the man use the telescope to see the astronomer, or did the astronomer have the telescope? The sentence is ambiguous.

    β†’ A single self-attention can capture only one meaning/perspective, so multi-head attention is needed.

    β†’ Multi-head attention is simply multiple self attentions in parallel, i.e. multiple different Wq, Wβ‚–, Wα΅₯ matrices that capture different meanings/perspectives of the sentence. This results in multiple q, k, v vectors for each word. Then, the same steps as in self attention are followed for each respective set of q, k, v vectors to get multiple contextual embeddings. Example: multiple eᡦₐₙₖ(new), i.e. Yᡦₐₙₖ, values for the word β€œbank”: say Y¹ᡦₐₙₖ and Y²ᡦₐₙₖ for two self attentions. Finally, they are concatenated together (see the sketch below).
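
    A minimal NumPy sketch of two-head attention under the same toy assumptions: each head gets its own Wq, Wβ‚–, Wα΅₯ (random stand-ins here), and the head outputs are concatenated.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    X = np.array([[0.9, 0.1],                  # toy static embeddings, one row per word
                  [0.7, 0.3],
                  [0.2, 0.8]])
    d_model, d_k, n_heads = 2, 2, 2

    def attention(Q, K, V):
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        w = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
        return w @ V

    heads = []
    for _ in range(n_heads):                   # one (W_q, W_k, W_v) triple per head
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        heads.append(attention(X @ W_q, X @ W_k, X @ W_v))

    Y_multi = np.concatenate(heads, axis=-1)   # row for β€œbank” = [Y¹ᡦₐₙₖ | Y²ᡦₐₙₖ], shape (3, n_heads*d_k)
    ```

    (In the paper, the concatenated heads are additionally projected by an output matrix Wβ‚’; that step is omitted in this sketch.)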

C. Positional Encoding

To maintain positional information,

  1. First Approach:

    1. simply concatenate the position index (1, 2, 3, …) to the end of every word's embedding.
    2. Disadvantage: different words occur at different positions in different sentences, and there is no fixed sentence length, so the appended position values are unbounded.
  2. Second Approach:

    1. Use a sine function of the position and concatenate its value to the end of the embedding.
    2. Disadvantage: sine is a periodic function, so words at different positions may get the same value.
  3. Third Approach:

    1. Use multiple combinations of sine and cosine functions at different frequencies.

    β†’ In the paper, the dimension of the positional encodings equals the dimension of the embeddings, and the two are added element-wise instead of being concatenated. This keeps the dimensionality down, which reduces the number of parameters, the computational resources required, and the training time (see the sketch below).
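
    A minimal NumPy sketch of the sinusoidal positional encodings from the paper (sine on even dimensions, cosine on odd dimensions, frequencies 1/10000^(2i/d_model)), added to placeholder word embeddings; the toy sizes are assumptions:

    ```python
    import numpy as np

    def positional_encoding(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]               # positions 0 .. seq_len-1
        i = np.arange(0, d_model, 2)[None, :]           # even embedding dimensions
        angles = pos / np.power(10000.0, i / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)                    # sine on even dimensions
        pe[:, 1::2] = np.cos(angles)                    # cosine on odd dimensions
        return pe

    d_model = 4                                         # assumed toy embedding size (even)
    X = np.zeros((3, d_model))                          # placeholder embeddings for a 3-word sentence
    X_pos = X + positional_encoding(3, d_model)         # added element-wise, not concatenated
    ```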