On Identifiability in Transformer Neural Networks

Self-attention layers in transformer neural networks produce attention distributions. These distributions are not always identifiable: several different attention distributions can produce the same output. This occurs when the left null space of the learnt weights is non-trivial, that is, when it contains more than the zero vector. Low-rank bottlenecks in the attention heads can also reduce model scalability. We verify the theory we develop using NanoGPT.
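
The non-identifiability claim can be illustrated with a small numerical sketch. It is not taken from the NanoGPT experiments; it assumes a hypothetical random matrix T standing in for the values that the attention weights multiply, with more tokens (6) than head dimensions (4), so that a non-trivial left null space exists. The names T, A, A_alt and the sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_v = 6, 4  # assumption: sequence length exceeds head dimension

# Hypothetical value matrix T: one d_v-dimensional row per token.
T = rng.standard_normal((n_tokens, d_v))

# A valid attention matrix: row-wise softmax of random scores.
scores = rng.standard_normal((n_tokens, n_tokens))
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

# Left null space of [T, 1]: directions u with u^T T = 0 and u^T 1 = 0.
# The extra ones column keeps every perturbed row summing to 1.
M = np.hstack([T, np.ones((n_tokens, 1))])
U, S, _ = np.linalg.svd(M, full_matrices=True)
rank = int((S > 1e-10).sum())
null_basis = U[:, rank:]  # shape (n_tokens, n_tokens - rank)

# Add a small multiple of a null-space direction to every row of A.
# eps is chosen so that all entries stay strictly positive.
u = null_basis[:, 0]
eps = 0.5 * A.min() / np.abs(u).max()
A_alt = A + eps * np.outer(np.ones(n_tokens), u)

print("max |A_alt - A|      :", np.abs(A_alt - A).max())
print("max |A_alt T - A T|  :", np.abs(A_alt @ T - A @ T).max())
print("row sums of A_alt    :", A_alt.sum(axis=1))
```

Under these assumptions, A and A_alt are both valid attention distributions (non-negative rows summing to 1) that differ from each other, yet they yield the same head output up to floating-point error. When the number of tokens does not exceed the rank of [T, 1], `null_basis` has no columns and no such perturbation exists.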