

What Attention Does Not Look At Is All You Need




  • This post proposes a hypothesis about how BERT learns syntactic representations, mainly based on an analysis of BERT’s attention (Clark et al., 2019).
  • As shown in Figure 1, attention heads often focus on special tokens: early heads attend to [CLS]; middle heads attend to [SEP]; deep heads attend to periods and commas.
  • Inspired by this, we hypothesize that the heads fall into three phases, each focusing on a different kind of representation.
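
To make the pattern described above concrete, here is a minimal sketch of how one might measure, layer by layer, how much attention lands on a given set of tokens. It is not the analysis code from Clark et al. (2019); the function name `attention_share`, the nested-list layout `[layer][head][query][key]`, and the toy weights are all assumptions made for illustration.

```python
def attention_share(attn, tokens, targets):
    """Return, for each layer, the mean attention weight that queries
    place on key positions whose token is in `targets`.

    attn: nested lists indexed as [layer][head][query][key] (assumed layout).
    """
    key_idx = [i for i, tok in enumerate(tokens) if tok in targets]
    shares = []
    for layer in attn:
        total, rows = 0.0, 0
        for head in layer:
            for row in head:  # one row of weights per query position
                total += sum(row[i] for i in key_idx)
                rows += 1
        shares.append(total / rows)
    return shares


tokens = ["[CLS]", "hello", ".", "[SEP]"]

# Toy weights (hypothetical): 2 layers, 1 head each, every row sums to 1.
# Layer 0 mostly attends to [CLS]; layer 1 mostly attends to [SEP].
toy_attn = [
    [[[0.7, 0.1, 0.1, 0.1]] * 4],
    [[[0.1, 0.1, 0.1, 0.7]] * 4],
]

cls_share = attention_share(toy_attn, tokens, {"[CLS]"})
sep_share = attention_share(toy_attn, tokens, {"[SEP]"})
print(cls_share, sep_share)
```

With real attention maps in place of the toy weights, plotting these per-layer shares for [CLS], [SEP], and punctuation would be one way to check whether the early, middle, and deep layers behave as described.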