Blog posts

2021

What Attention Does Not Look At Is All You Need

5 minute read

Published:

Motivation

  • This post proposes a hypothesis about how BERT learns syntactic representations, mainly based on an analysis of BERT’s attention (Clark et al., 2019).
  • As shown in Figure 1, attention heads often focus on special tokens: early heads attend to [CLS]; middle heads attend to [SEP]; deep heads attend to periods and commas.
  • Inspired by this, we hypothesize that the heads fall into three phases, each focusing on a different kind of representation.