Attention Is All You NeedπŸ˜€

A. Self Attention:

A1. Vectorization:

  1. One Hot Encoding (OHE): a 1 at the position of the word, 0 elsewhere.
  2. Bag of Words (BoW): a count of each word in the sentence.
s1. ram sita ram
s2. sita ram hari

BoW:
		ram sita hari
s1 [ 2   1    0  ]
s2 [ 1   1    1  ]

OHE:
		  ram sita hari
ram  [ 1   0    0  ]
sita [ 0   1    0  ]
hari [ 0   0    1  ]
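
A minimal NumPy sketch of both encodings for the two example sentences; the vocabulary order (ram, sita, hari) and the use of NumPy are assumptions for illustration:

```python
import numpy as np

vocab = ["ram", "sita", "hari"]            # assumed vocabulary order
sentences = [["ram", "sita", "ram"],       # s1
             ["sita", "ram", "hari"]]      # s2

# One Hot Encoding: a 1 in the word's own position, 0 elsewhere
ohe = {w: np.eye(len(vocab), dtype=int)[i] for i, w in enumerate(vocab)}
print(ohe["ram"])    # [1 0 0]

# Bag of Words: count of each vocabulary word per sentence
bow = np.array([[s.count(w) for w in vocab] for s in sentences])
print(bow)           # [[2 1 0]
                     #  [1 1 1]]
```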
  3. Word Embeddings: capture semantic meaning (averaged across contexts); similar words have close embedding values.
Apple as fruit: 9000 sentences
Apple as company: 1000 sentences 

Final SAMPLE embedding of word "Apple":

		       taste technology -> (assuming only two embedding components)
	Apple:  [ 0.9      0.1  ] -> (static) embedding tilted towards fruit, since 9000/(9000+1000) = 0.9 of its occurrences use the fruit sense
	
	So, there is a **need for smart contextual embeddings**, achieved through self attention.

A2. Contextual Embeddings:

  1. Contextual Embedding:

    Example sentences:

    s1. money bank grows:

    eβ‚˜β‚’β‚™β‚‘α΅§(new) = 0.7 eβ‚˜β‚’β‚™β‚‘α΅§ + 0.2 eᡦₐₙₖ + 0.1 eπ“°α΅£β‚’π“Œβ‚›

    eᡦₐₙₖ(new) = 0.25 eβ‚˜β‚’β‚™β‚‘α΅§ + 0.7 eᡦₐₙₖ + 0.05 eπ“°α΅£β‚’π“Œβ‚›

    s2. river bank flows:

    eᡦₐₙₖ(new) = 0.3 eα΅£α΅’α΅₯β‚‘α΅£ + 0.6 eᡦₐₙₖ + 0.1 eπ’»β‚—β‚’π“Œβ‚›

    β†’ β€œbank” refers to different things in the two sentences.

    Here, the embeddings are combined per sentence to capture the context as well. The word β€œbank” now gets a different embedding depending on whether it appears with β€œriver” or β€œmoney”. The same applies to every word in the sentence.

    β†’ We can say the weighting factors are based on the similarity between words, obtained from the dot product of the old static embeddings. E.g.:

    eᡦₐₙₖ(new) = [eᡦₐₙₖ.eα΅€β‚˜β‚’β‚™β‚‘α΅§] eβ‚˜β‚’β‚™β‚‘α΅§ + [eᡦₐₙₖ.eᡀᡦₐₙₖ] eᡦₐₙₖ + [eᡦₐₙₖ.eα΅€π“°α΅£β‚’π“Œβ‚›] eπ“°α΅£β‚’π“Œβ‚› β€”β€”β€”β€”β€”β€”(eq1)

    Assume [eᡦₐₙₖ.eα΅€β‚˜β‚’β‚™β‚‘α΅§] = s₂₁, [eᡦₐₙₖ.eᡀᡦₐₙₖ] = sβ‚‚β‚‚, [eᡦₐₙₖ.eα΅€π“°α΅£β‚’π“Œβ‚›] = s₂₃ β€”β€”β€”β€”β€”β€”β€”(focus1)

    then,

    β†’ a softmax operation is performed on them to get W₂₁, Wβ‚‚β‚‚, W₂₃ respectively.

    So,

    Yᡦₐₙₖ(say) = eᡦₐₙₖ(new) = W₂₁ eβ‚˜β‚’β‚™β‚‘α΅§ + Wβ‚‚β‚‚ eᡦₐₙₖ + W₂₃ eπ“°α΅£β‚’π“Œβ‚› = General Contextual Word Embeddings β€”β€”β€”β€”β€”β€”(eq2)

    β†’ NOTE that the contextual embeddings of all words can be computed in parallel (see the sketch below). However, positional/sequential information is lost: the computation does not capture which word comes before or after which.
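
    A minimal NumPy sketch of these general (parameter-free) contextual embeddings; the 2-dimensional static embeddings for β€œmoney”, β€œbank”, β€œgrows” are made-up values for illustration:

    ```python
    import numpy as np

    # made-up 2-dimensional static embeddings (rows: money, bank, grows)
    E = np.array([[0.9, 0.1],    # money
                  [0.7, 0.3],    # bank
                  [0.2, 0.8]])   # grows

    S = E @ E.T                                            # dot-product similarities s_ij (focus1)
    W = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)   # row-wise softmax -> weights W_ij
    Y = W @ E                                              # each row = weighted sum of embeddings (eq2)

    print(Y[1])   # new contextual embedding of "bank" in "money bank grows"
    ```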

    β†’ Limitation of General Contextual Word Embeddings:

    1. β€œpiece of cake”: may refer to either an easy task or an actual portion of cake; the general scheme has no learnable parameters to steer the embedding toward the sense the task needs.
    2. So, there is a need for Task Specific Contextual Embeddings: introduce learnable parameters.

β†’ focus1 can be termed as a query probing a key to get a similarity, where

β†’ the old (static) embeddings play all three roles of query, key and value in general contextual embeddings.

  2. Task Specific Contextual Embeddings:

    β†’ Each static embedding is now projected by three learnable weight matrices Wq, Wβ‚–, Wα΅₯ to produce its query (q), key (k) and value (v) vectors; all remaining operations are the same as for general contextual embeddings. For our previous example, eq1 becomes:

    Yᡦₐₙₖ = [qᡦₐₙₖ.kα΅€β‚˜β‚’β‚™β‚‘α΅§] vβ‚˜β‚’β‚™β‚‘α΅§ + [qᡦₐₙₖ.kᡀᡦₐₙₖ] vᡦₐₙₖ + [qᡦₐₙₖ.kα΅€π“°α΅£β‚’π“Œβ‚›] vπ“°α΅£β‚’π“Œβ‚›

    Hence, the operations so far can be summarized in compact form as:

    Attention (Q, K, V) = Softmax( Q Kα΅€ ) V
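
    A minimal sketch of the projection step, assuming toy dimensions and random matrices standing in for the learned Wq, Wβ‚–, Wα΅₯:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    d_model = d_k = 2                          # assumed toy dimensions

    X = np.array([[0.9, 0.1],                  # money  (static embeddings, one row per word)
                  [0.7, 0.3],                  # bank
                  [0.2, 0.8]])                 # grows

    # random stand-ins for the learned projection matrices
    W_q = rng.normal(size=(d_model, d_k))
    W_k = rng.normal(size=(d_model, d_k))
    W_v = rng.normal(size=(d_model, d_k))

    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # task-specific query, key, value vectors
    ```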

  3. Scaled Dot Product Attention:

    $$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V $$

    where dβ‚– is the dimensionality of the key (k) vectors.
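
    Continuing the sketch above (reusing numpy and the Q, K, V matrices from it), a minimal scaled dot-product attention function:

    ```python
    def scaled_dot_product_attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)        # scaled dot-product similarities
        weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
        return weights @ V                     # task-specific contextual embeddings

    Y = scaled_dot_product_attention(Q, K, V)  # one row per word of "money bank grows"
    ```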

    β†’ Nature of dot product: for large dβ‚–, the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients; dividing by √dβ‚– counteracts this.

    Extra Note:

B. Multi-head Attention

  1. Multi-head Attention:

    Assume the sentence β€œThe man saw the astronomer with a telescope”.

    What does it mean? Did the man use the telescope to see the astronomer, or did the astronomer have the telescope? The sentence is ambiguous.

    β†’ A single self-attention can capture only one meaning/perspective, so multi-head attention is needed.

    β†’ Multi-head attention is simply multiple self attentions in parallel, i.e. multiple different Wq, Wβ‚–, Wα΅₯ matrices that capture different meanings/perspectives of the sentence. This results in multiple q, k, v vectors for each word. Then, the same steps as in self attention are followed for each respective set of q, k, v vectors to get multiple contextual embeddings. Example: multiple eᡦₐₙₖ(new), i.e. Yᡦₐₙₖ, values for the word β€œbank”: say Y¹ᡦₐₙₖ and Y²ᡦₐₙₖ for two self attentions. Finally, they are concatenated together (see the sketch below).
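
    A minimal NumPy sketch of two-head attention under the same toy assumptions: each head gets its own Wq, Wβ‚–, Wα΅₯ (random stand-ins here), and the head outputs are concatenated.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    X = np.array([[0.9, 0.1],                  # toy static embeddings, one row per word
                  [0.7, 0.3],
                  [0.2, 0.8]])
    d_model, d_k, n_heads = 2, 2, 2

    def attention(Q, K, V):
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        w = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
        return w @ V

    heads = []
    for _ in range(n_heads):                   # one (W_q, W_k, W_v) triple per head
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        heads.append(attention(X @ W_q, X @ W_k, X @ W_v))

    Y_multi = np.concatenate(heads, axis=-1)   # row for β€œbank” = [Y¹ᡦₐₙₖ | Y²ᡦₐₙₖ], shape (3, n_heads*d_k)
    ```

    (In the paper, the concatenated heads are additionally projected by an output matrix Wβ‚’; that step is omitted in this sketch.)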

C. Positional Encoding

To maintain positional information,

  1. First Approach:

    1. simply concatenate the position index (1, 2, 3, …) to the end of every word's embedding.
    2. Disadvantage: different words occur at different positions in different sentences, and there is no fixed sentence length, so the appended position values are unbounded.
  2. Second Approach:

    1. Use a sine function of the position and concatenate its value to the end of the embedding.
    2. Disadvantage: sine is a periodic function, so words at different positions may get the same value.
  3. Third Approach:

    1. Use multiple combinations of sine and cosine functions at different frequencies.

    β†’ In the paper, the dimension of the positional encodings equals the dimension of the embeddings, and the two are added element-wise instead of being concatenated. This keeps the dimensionality down, which reduces the number of parameters, the computational resources required, and the training time (see the sketch below).
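
    A minimal NumPy sketch of the sinusoidal positional encodings from the paper (sine on even dimensions, cosine on odd dimensions, frequencies 1/10000^(2i/d_model)), added to placeholder word embeddings; the toy sizes are assumptions:

    ```python
    import numpy as np

    def positional_encoding(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]               # positions 0 .. seq_len-1
        i = np.arange(0, d_model, 2)[None, :]           # even embedding dimensions
        angles = pos / np.power(10000.0, i / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)                    # sine on even dimensions
        pe[:, 1::2] = np.cos(angles)                    # cosine on odd dimensions
        return pe

    d_model = 4                                         # assumed toy embedding size (even)
    X = np.zeros((3, d_model))                          # placeholder embeddings for a 3-word sentence
    X_pos = X + positional_encoding(3, d_model)         # added element-wise, not concatenated
    ```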