What Attention Does Not Look At Is All You Need
Published:
Motivation
- This post proposes a hypothesis about how BERT learns syntactic representations, mainly based on an analysis of BERT’s attention (Clark et al., 2019).
- As shown in Figure 1, attention heads often focus on special tokens: early heads attend to [CLS]; middle heads attend to [SEP]; deep heads attend to periods and commas.
- Inspired by this, we hypothesize that the heads fall into three phases, each focusing on a different kind of representation.
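To make the observation about special tokens concrete, here is a minimal sketch of how one might measure, per layer and head, how much attention mass lands on chosen token positions. It is not the analysis code from Clark et al. (2019): the function name `attention_to_positions` is hypothetical, and the attention tensor is random toy data standing in for real BERT attention maps, so the numbers it produces say nothing about BERT itself.

```python
import numpy as np

def attention_to_positions(attn, positions):
    """Mean attention mass each head puts on `positions`.

    attn: array of shape (layers, heads, seq_len, seq_len), where each
          row along the last axis is a distribution over attended tokens.
    positions: list of token indices (e.g. where [CLS] or [SEP] sit).
    Returns an array of shape (layers, heads).
    """
    attn = np.asarray(attn, dtype=float)
    # Sum the mass on the chosen key positions for every query token.
    mass = attn[..., positions].sum(axis=-1)  # (layers, heads, seq_len)
    # Average over query tokens.
    return mass.mean(axis=-1)

# Toy input: a tokenized sentence and random attention (assumption: 12 layers, 12 heads).
tokens = ["[CLS]", "the", "cat", "sat", ".", "[SEP]"]
rng = np.random.default_rng(0)
logits = rng.normal(size=(12, 12, len(tokens), len(tokens)))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # row-wise softmax

cls_pos = [i for i, t in enumerate(tokens) if t == "[CLS]"]
sep_pos = [i for i, t in enumerate(tokens) if t == "[SEP]"]
punct_pos = [i for i, t in enumerate(tokens) if t in {".", ","}]

cls_mass = attention_to_positions(attn, cls_pos)
sep_mass = attention_to_positions(attn, sep_pos)
punct_mass = attention_to_positions(attn, punct_pos)

# Per-layer averages over heads; with real attention maps these would show
# which layers favor which kind of token.
for layer in range(attn.shape[0]):
    print(layer, cls_mass[layer].mean(), sep_mass[layer].mean(), punct_mass[layer].mean())
```

With real attention maps in place of the random tensor, plotting the three per-layer averages against layer depth is one way to check whether the early, middle, and deep pattern described above shows up.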