Decoding NLP Attention Mechanisms
In this blog post series, we will walk you through the rise of the transformer architecture. Our first stop is the attention mechanism, the key component of this architecture. We will then move on to the transformer itself in part II, and finally introduce BERT in part III.
To understand the transformer and its motivation, we need to dive into its core idea: a novel paradigm called attention. This paradigm made its grand entrance into the NLP landscape (specifically in translation systems) in 2014, three years before the transformer itself, in an iconic paper by Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate."
Before going any further, let's recall the basic architecture of a neural machine translation system.