The Dark Secrets Of BERT

  • May 7, 2020

BERT stands for Bidirectional Encoder Representations from Transformers. This model is basically a multi-layer bidirectional Transformer encoder (Devlin, Chang, Lee, & Toutanova, 2019), and there are multiple excellent guides about how it works generally, including the Illustrated Transformer. What we focus on is one specific component of the Transformer architecture known as self-attention.

In a nutshell, self-attention is a way to weigh the components of the input and output sequences so as to model relations between them, even long-distance dependencies. As a brief example, let’s say we need to create a representation of the sentence “Tom is a black cat”. BERT may choose to pay more attention to “Tom” while encoding the word “cat”, and less attention to the words “is”, “a”, and “black”.

This attention can be represented as a vector of weights, one weight per word in the sentence. One such vector is computed for each word the model encodes, and stacking these vectors yields a square matrix, which we refer to as the self-attention map.
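To make the idea concrete, here is a minimal sketch of how one self-attention map could be computed for the example sentence, using the standard scaled dot-product formulation. Everything numeric in it is an assumption for illustration: the embeddings and the projection matrices `W_q` and `W_k` are random stand-ins, not weights from a trained BERT, and the sizes `d_model` and `d_head` are arbitrary. The printed numbers therefore say nothing about what BERT actually attends to; the point is only the shape of the result, one row of weights per word.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(vec, W):
    """Multiply a vector (length d_in) by a matrix W (d_in x d_out)."""
    return [sum(v * W[i][j] for i, v in enumerate(vec)) for j in range(len(W[0]))]


def self_attention_map(X, W_q, W_k):
    """Return an n x n matrix; row i holds the weights word i puts on every word."""
    Q = [project(x, W_q) for x in X]  # queries, one per word
    K = [project(x, W_k) for x in X]  # keys, one per word
    scale = math.sqrt(len(K[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
        for q in Q
    ]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


tokens = ["Tom", "is", "a", "black", "cat"]
rng = random.Random(0)
d_model, d_head = 8, 4  # illustrative sizes, far smaller than BERT's

X = random_matrix(len(tokens), d_model, rng)  # stand-in word embeddings
W_q = random_matrix(d_model, d_head, rng)     # stand-in query projection
W_k = random_matrix(d_model, d_head, rng)     # stand-in key projection

attention_map = self_attention_map(X, W_q, W_k)

# The row for "cat" is the weight vector described in the text.
cat_row = attention_map[tokens.index("cat")]
for token, weight in zip(tokens, cat_row):
    print(f"{token:>5}: {weight:.3f}")
```

Each row of `attention_map` sums to 1, and the matrix is square because every word is scored against every word, itself included. In BERT, each attention head in each layer produces its own map of this kind.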

Source: topbots.com

