s1. ram sita ram
s2. sita ram hari
BoW (Bag of Words):

        ram  sita  hari
s1   [   2     1     0  ]
s2   [   1     1     1  ]

OHE (One-Hot Encoding):

         ram  sita  hari
ram   [   1     0     0  ]
sita  [   0     1     0  ]
hari  [   0     0     1  ]
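For concreteness, a minimal numpy sketch of both representations for this toy corpus (vocabulary order assumed as in the tables above):

```python
import numpy as np

corpus = ["ram sita ram", "sita ram hari"]
vocab = ["ram", "sita", "hari"]            # fixed vocabulary order, as in the tables above

# Bag of Words: one row per sentence, counting each vocabulary word
bow = np.array([[s.split().count(w) for w in vocab] for s in corpus])
print(bow)                                 # [[2 1 0]
                                           #  [1 1 1]]

# One-Hot Encoding: one row per vocabulary word (an identity matrix over the vocabulary)
ohe = np.eye(len(vocab), dtype=int)
print(dict(zip(vocab, ohe.tolist())))      # {'ram': [1, 0, 0], 'sita': [0, 1, 0], 'hari': [0, 0, 1]}
```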
Apple as fruit: 9000 sentences
Apple as company: 1000 sentences
Final SAMPLE embedding of the word "Apple" (assuming only two embedding components, taste and technology):

            taste   technology
Apple   [    0.9        0.1   ]   -> the (static) embedding is tilted towards the fruit sense, roughly in proportion to usage (9000/10000 ≈ 0.9)
So, there is a **need for smart contextual embeddings**, achieved through self-attention.
Contextual Embedding:
Example sentences:
s1. money bank grows:

$$e_{money}^{new} = 0.7\,e_{money} + 0.2\,e_{bank} + 0.1\,e_{grows}$$

$$e_{bank}^{new} = 0.25\,e_{money} + 0.7\,e_{bank} + 0.05\,e_{grows}$$
s2. river bank flows:

$$e_{bank}^{new} = 0.3\,e_{river} + 0.6\,e_{bank} + 0.1\,e_{flows}$$

→ "bank" refers to something different in each sentence.
Here the embeddings are combined per sentence so that the context is captured as well. Now the word "bank" differs depending on whether it appears alongside "river" or "money". The same holds for every word in the sentence.
→ We can say the multiplication factors are based upon the similarity between words, obtained from the dot product of the old static embeddings, e.g.:

$$e_{bank}^{new} = [e_{bank} \cdot e_{money}^{\top}]\, e_{money} + [e_{bank} \cdot e_{bank}^{\top}]\, e_{bank} + [e_{bank} \cdot e_{grows}^{\top}]\, e_{grows} \quad \text{(eq1)}$$

Assume $[e_{bank} \cdot e_{money}^{\top}] = s_{21}$, $[e_{bank} \cdot e_{bank}^{\top}] = s_{22}$, $[e_{bank} \cdot e_{grows}^{\top}] = s_{23}$ (focus1)
then,
→ a softmax operation is performed on them to get $W_{21}$, $W_{22}$, $W_{23}$ respectively.
So,

$$Y_{bank}\ (\text{say}) = e_{bank}^{new} = W_{21}\,e_{money} + W_{22}\,e_{bank} + W_{23}\,e_{grows} = \text{General Contextual Word Embedding} \quad \text{(eq2)}$$
→ NOTE that the contextual embeddings of all words can be computed in parallel. But positional or sequential information is lost, meaning it does not capture which word comes before and which comes after.
→ General Contextual Word Embeddings:
→ focus1 can be seen as the query (here $e_{bank}$) querying the keys ($e_{money}$, $e_{bank}$, $e_{grows}$) to get similarities, where
→ the old static embeddings play all three roles of query, key, and value in general contextual embeddings.
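For concreteness, a minimal numpy sketch of eq1, the softmax, and eq2 for "money bank grows" (the 2-D static embedding values are made up purely for illustration):

```python
import numpy as np

def softmax(x):
    x = x - x.max()                        # numerical stability
    return np.exp(x) / np.exp(x).sum()

# Made-up 2-D static embeddings for "money bank grows"
E = np.array([[0.9, 0.2],                  # e_money
              [0.8, 0.4],                  # e_bank
              [0.1, 0.7]])                 # e_grows

S = E @ E.T                                # eq1: similarity scores s_ij = e_i . e_j
W = np.apply_along_axis(softmax, 1, S)     # softmax over each row -> weights W_ij
Y = W @ E                                  # eq2: Y_i = sum_j W_ij * e_j, for all words in parallel

print(Y[1])                                # general contextual embedding of "bank"
```

Note that the same static embeddings E supply the query, key, and value roles here; no learnable weights are involved yet.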
Task Specific Contextual Embeddings:
→ Here, each word's static embedding is first projected through three learnable weight matrices $W_q$, $W_k$, $W_v$ to obtain its query, key, and value vectors ($q = e\,W_q$, $k = e\,W_k$, $v = e\,W_v$); all remaining operations are the same as for general contextual embeddings. For our previous example, eq1 becomes:
$$Y_{bank} = [q_{bank} \cdot k_{money}^{\top}]\, v_{money} + [q_{bank} \cdot k_{bank}^{\top}]\, v_{bank} + [q_{bank} \cdot k_{grows}^{\top}]\, v_{grows}$$

Hence, the operations so far can be summarized in compact form as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(QK^\top\right)V$$
Scaled Dot Product Attention:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V $$
where $d_k$ is the dimensionality of the key (k) vectors.
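A minimal numpy sketch of scaled dot-product attention; the embedding dimension and the random $W_q$, $W_k$, $W_v$ matrices below are stand-ins purely for illustration (in practice they are learned):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (n, n) scaled similarity scores
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V                     # (n, d_v) task-specific contextual embeddings

# Example for a 3-word sentence with d_model = 4
rng = np.random.default_rng(0)
E = rng.normal(size=(3, 4))                                # static embeddings
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))   # stand-ins for learned matrices
Y = scaled_dot_product_attention(E @ Wq, E @ Wk, E @ Wv)
print(Y.shape)                             # (3, 4): one contextual vector per word
```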
→ Nature of dot product:
Extra Note:
Multi-head Attention:
Assume the sentence "The man saw the astronomer with a telescope".
What does it mean?
→ Self-attention with a single head can capture only one meaning/perspective, so we need multi-head attention.
→ Multi-head attention is simply multiple self-attentions, i.e. multiple different $W_q$, $W_k$, $W_v$ matrices to capture the different meanings of the sentence. This results in multiple q, k, v vectors for each word. Then the same steps as in self-attention are followed for each respective set of q, k, v vectors to get multiple contextual embeddings. Example: multiple $e_{bank}^{new}$ (or $Y_{bank}$) for the word "bank": say $Y^1_{bank}$ and $Y^2_{bank}$ for two attention heads. Finally, they are concatenated together.
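A minimal sketch of two-head attention along these lines (the head count, dimensions, and random projection matrices are assumed for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

n, d_model, n_heads = 3, 4, 2
d_head = d_model // n_heads

rng = np.random.default_rng(1)
E = rng.normal(size=(n, d_model))          # static embeddings of a 3-word sentence

heads = []
for h in range(n_heads):
    # Each head has its own W_q, W_k, W_v (random stand-ins here), projecting down to d_head
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(E @ Wq, E @ Wk, E @ Wv))   # Y^h for this head, shape (n, d_head)

Y_multi = np.concatenate(heads, axis=-1)   # concatenate Y^1, Y^2, ... -> (n, d_model)
print(Y_multi.shape)                       # (3, 4)
```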
To maintain positional information:
First Approach:
Second Approach:
Third Approach:
→ In the paper, the dimension of the positional encodings is equal to the dimension of the embeddings, and they are numerically added instead of concatenated. This keeps the dimensionality from growing, which would otherwise increase the number of parameters, the computation required, and the training time.
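A minimal sketch of this addition, assuming the sinusoidal positional encoding from the original Transformer paper (the sentence length and embedding values are made up):

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(n_positions)[:, None]                 # (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]                 # even dimension indices
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

n, d_model = 3, 4
rng = np.random.default_rng(2)
E = rng.normal(size=(n, d_model))                         # word embeddings for a 3-word sentence
X = E + sinusoidal_positional_encoding(n, d_model)        # added element-wise, dimension unchanged
print(X.shape)                                            # (3, 4)
```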