Abstract
Recent studies (Blevins et al. 2018; Tenney et al. 2019; among others) have presented evidence that linguistic information, such as Part-of-Speech (PoS), is stored in the word representations (embeddings) learned by neural networks trained on next-word prediction and other NLP tasks. In this work, we focus on so-called probing tasks, or diagnostic classifiers, which train linguistic feature classifiers on the activations of a trained neural model and interpret the accuracy of such classifiers on a held-out set as a measure of the amount of linguistic information captured by that model. In particular, we show that the overlap between training and test set vocabulary in such experiments can lead to over-optimistic results, as the effect of memorization on the linguistic classifier's performance is overlooked.
We then present our technique to split the vocabulary across the linguistic classifier's training and test sets, so that any given word type may occur in either the training set or the test set, but never in both. This technique makes probing tasks more informative and consequently allows a more accurate assessment of how much linguistic information is actually stored in the token representation.
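As a concrete illustration, the following is a minimal sketch of such a type-disjoint split in Python; the function name, the 70/30 ratio, and the seeding are our own illustrative assumptions rather than the exact procedure used in the experiments.

```python
import random

def split_by_word_type(tokens, tags, test_fraction=0.3, seed=0):
    """Partition parallel lists of (token, PoS tag) pairs so that each
    word type occurs exclusively in the training OR the test portion.
    Names and the split ratio are illustrative, not the paper's setup."""
    vocab = sorted(set(tokens))          # all word types, deterministic order
    random.Random(seed).shuffle(vocab)   # reproducible random assignment
    n_train = int(len(vocab) * (1 - test_fraction))
    train_vocab = set(vocab[:n_train])   # types reserved for training

    train, test = [], []
    for tok, tag in zip(tokens, tags):
        (train if tok in train_vocab else test).append((tok, tag))
    return train, test
```

Because word types are assigned to the two sets at random, probe accuracy can vary noticeably from seed to seed, which relates to the deviations across random runs discussed below.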
To the best of our knowledge, only a few studies, such as Bisazza and Tump (2018), have reported on the effect of vocabulary splitting in this context, and we corroborate their findings.
In our experiments on PoS classification, applying this technique clearly exposes the effect of memorization when the vocabulary is not split, especially at the word-type representation level (that is, the context-independent embeddings, or layer 0).
For our experiments, we trained a language model on next-word prediction and extracted the word representations from every layer of its encoder. These representations then served as input to a logistic regression model trained on PoS classification. The classifier was run in two settings, with and without vocabulary splitting, and its outputs were analysed and compared across the two settings.
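A minimal sketch of such a diagnostic classifier, using scikit-learn, is shown below; the variable names and the per-layer loop are illustrative assumptions, not the exact experimental code.

```python
from sklearn.linear_model import LogisticRegression

def probe_pos(X_train, y_train, X_test, y_test):
    """Fit a logistic regression probe on frozen LM activations and
    report held-out PoS accuracy, the usual probing-task metric."""
    probe = LogisticRegression(max_iter=1000)  # simple linear probe
    probe.fit(X_train, y_train)                # X: (n_tokens, hidden_dim) arrays
    return probe.score(X_test, y_test)         # accuracy on held-out tokens

# Run once per encoder layer; layer 0 corresponds to the
# context-independent word-type embeddings discussed above:
# for layer, (X_tr, X_te) in enumerate(layer_reprs):
#     print(layer, probe_pos(X_tr, y_train, X_te, y_test))
```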
Across all layers, the full-vocabulary setting gave high accuracy values (85–90%), compared to when the vocabulary split was enforced (35–50%). To further substantiate that this gap is due to memorization, we also compared the results to those from an LM with randomly initialized embeddings. The difference of around 70% further suggests that the model is memorizing words rather than truly learning syntax.
Our work provides evidence that the results of linguistic probing tasks only partially account for the linguistic information stored in neural word representations. Splitting the vocabulary provides a solution to this problem, but is not itself a trivial task and comes with its own set of issues, such as large deviations across random runs.
We conclude that more care must be taken when setting up probing task experiments and, even more so, when interpreting them.
| Original language | English |
| --- | --- |
| Publication status | Published - Jan-2020 |
| Event | 30th Meeting of the Computational Linguistics in the Netherlands: CLIN30, Drift 21, Utrecht, Netherlands. Duration: 30-Jan-2020 → 30-Jan-2020. https://clin30.sites.uu.nl/ |
Conference
| Conference | 30th Meeting of the Computational Linguistics in the Netherlands |
| --- | --- |
| Abbreviated title | CLIN30 |
| Country/Territory | Netherlands |
| City | Utrecht |
| Period | 30/01/2020 → 30/01/2020 |
| Internet address | https://clin30.sites.uu.nl/ |