Analyzing, Designing, and Evaluating Linguistic Probes.
This article is authored by Keyur Faldu and Dr. Amit Sheth. This article elaborates on a niche aspect of the broader cover story on “Rise of Modern NLP and the Need of Interpretability!” At Embibe, we focus on developing interpretable and explainable Deep Learning systems, and we survey the current state of the art techniques to answer some open questions on linguistic wisdom acquired by NLP models.
This article continues the previous article (Discovering the Encoded Linguistic Knowledge in NLP models) on what linguistic knowledge is encoded in NLP models. The previous article covers what probing is, how it differs from multi-task learning, and two types of probes: representation-based probes and attention-weights-based probes. It also sheds light on how a probe task (or auxiliary task) is used to assess the linguistic ability of NLP models trained on some other primary task(s).
Naturally, the prediction performance of probes on linguistic tasks, or supporting patterns that correlate or compare neural network mechanics with linguistic phenomena, gives insight into what linguistic knowledge is encoded and how. Prediction performance could be classification accuracy, correlation coefficients, or the mean reciprocal rank of the gold label. Note that the performance of the model on the probe task can be compared with the state-of-the-art performance of a model trained explicitly on the same task as its primary task, to understand the extent of encoded linguistic knowledge. However, there are other aspects worth analyzing in more depth, including the following.
- Is bigger better? How does the linguistic knowledge captured by a model vary with its complexity, e.g., dimension sizes or number of parameters? Comparing probing classifier performance on an auxiliary task across models of different complexity would answer this question.
- Generalization ability over complex test data should be assessed before attributing the success of probes to encoded linguistic knowledge. For example, what if the training data generally contains the ‘main auxiliary’ as the first verb, but the generalization data deliberately contains distractors, so that the ‘main auxiliary’ is no longer the first verb? In such cases, if probes can still detect ‘main auxiliary’ verbs, the success can be attributed to linguistic features like parse trees rather than sequential positional features.
- Ability to decode linguistic knowledge: Classification tasks are relatively less complex than tasks that decode or construct linguistic knowledge, e.g., can we build a complete dependency parse tree using internal representations? It is intriguing to discover approaches to recover latent linguistic knowledge.
- Limitations and source of the linguistic knowledge: When probes perform well on auxiliary linguistic tasks, is it because of some correlation, or is there causation? Because a deep and complex probe model can also memorize, it can overfit to the sought-after linguistic knowledge. So, how can we establish the source of the linguistic knowledge expressed by the probes? Designing a ‘control task’, so that the predictive performance of probes on the linguistic task can be compared with their performance on the control task, could reveal the effectiveness of the probes and the source of the linguistic knowledge.
- Infuse linguistic knowledge: If neural network models learn linguistic knowledge in the process of training for an end-to-end task, would it be possible to infuse linguistic knowledge, e.g., syntax parse trees, to boost the performance?
- Does encoded linguistic knowledge capture meaning? Linguistic knowledge like POS tagging, dependency trees, etc. is syntactic in nature. Real-world applications would expect NLP models to understand semantic meaning. So, it’s of utmost importance to assess the capability of a model to encode semantic meaning.
The above considerations help us understand probes better and draw meaningful conclusions about encoded linguistic knowledge in NLP models. Let us dive deeper into examples and surveys of research papers on these topics.
One of the earliest works to formally investigate the problem of probing encoded linguistic knowledge is “Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks”, where Adi et al. aim for a better understanding of encoded sentence representations.
Three auxiliary tasks related to sentence structures were considered:
- Sentence length: Does the sentence embedding encode information on sentence length?
- Word-content: Is it possible to predict whether a word is contained in the sentence on the basis of the sentence embedding?
- Word-order: Given a sentence embedding and two words, can the order of the two words be determined?
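To make the three auxiliary tasks concrete, here is a minimal sketch of how labelled probe examples could be built from a tokenised sentence. The function names, the length-bin edges, and the toy vocabulary are assumptions made for illustration and are not taken from the paper. In an actual probe, each example would be paired with the sentence embedding as input.

```python
import random

def length_bin(n_tokens, bin_edges=(5, 9, 13, 17, 21, 26, 33)):
    """Map a sentence length to a bin index (the bin edges are illustrative)."""
    for i, edge in enumerate(bin_edges):
        if n_tokens < edge:
            return i
    return len(bin_edges)

def make_probe_examples(tokens, vocabulary, rng):
    """Build one set of labelled examples per auxiliary task for a sentence."""
    # Word-content: one word that is in the sentence and one that is not.
    inside = rng.choice(tokens)
    outside = rng.choice([w for w in vocabulary if w not in tokens])
    # Word-order: two distinct positions; the label says whether the first
    # listed word occurs earlier in the sentence than the second.
    i, j = rng.sample(range(len(tokens)), 2)
    return {
        "length": {"label": length_bin(len(tokens))},
        "content": [
            {"word": inside, "label": 1},
            {"word": outside, "label": 0},
        ],
        "order": {"words": (tokens[i], tokens[j]), "label": int(i < j)},
    }

rng = random.Random(0)
sentence = "the cat that can meow will sleep".split()
vocab = sentence + ["dog", "queen", "bee"]  # hypothetical toy vocabulary
examples = make_probe_examples(sentence, vocab, rng)
print(examples)
```

A probing classifier would then be trained on many such examples, with the sentence embedding (plus the word embeddings for the content and order tasks) as its input features.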
These probes operate on the sentence embedding, which is computed as the average of the final representations produced by the encoder-decoder model and the CBOW (continuous bag of words) model. The key findings of the paper on whether bigger models are better at encoding the above linguistic knowledge are as follows:
- Increasing the number of dimensions benefits some tasks more than others. As shown in figure 2, the (a) length and (c) order tests benefit from bigger representation dimensions, whereas the content test peaks at a representation with 750 dimensions.
- On the other hand, CBOW models, which have far fewer parameters than encoder-decoder models, also perform well on the ‘word-content’ task with fewer dimensions.
Models can be tested on generalization data to verify the extent of model learning. Deliberately designed, complex generalization data can test the limits of the linguistic wisdom learned by NLP models. Generalization over such complex data shows real linguistic ability, as opposed to memorization of surface-level patterns.
Figure 3. The examples of training and development data, which are simpler in nature. Generalization data are more complex with the presence of distractors. (i) Main auxiliary task: “Will” is the target word, and “can” is a distractor added in the generalization data (ii) Subject noun task: “bee” is a target word, and “queen” is a distractor added in the generalization data. (Lin et al., ACL 2019)
Lin et al. carried out such experiments in the paper “Open Sesame: Getting Inside BERT’s Linguistic Knowledge”. Figure 3 shows how generalization data can contain deliberate distractors to stress-test the model’s encoded linguistic knowledge.
- The ‘Main Auxiliary Task’ is to identify the main auxiliary verb (helping verb) in a sentence. Training and development data contain the ‘main auxiliary verb’ as the first verb in each sentence, whereas the generalization dataset contains it deeper in the sentence.
- Similarly, the ‘Subject Noun Task’ is to identify the noun acting as the subject, which is the first noun in the training and development data but not in the generalization set.
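The following sketch illustrates why such generalization data is informative. It implements a purely positional heuristic ("pick the first auxiliary verb") that is not part of the paper: the auxiliary list and the function name are assumptions for illustration. The heuristic succeeds on a training-style sentence and fails once a distractor appears.

```python
# Hypothetical, hand-picked list of auxiliaries, for illustration only.
AUXILIARIES = {"will", "can", "does", "do", "would", "could"}

def first_auxiliary(tokens):
    """Surface heuristic: return the first auxiliary verb in linear order."""
    for tok in tokens:
        if tok in AUXILIARIES:
            return tok
    return None

# (tokens, gold main auxiliary) pairs
training_like = ("the cat will sleep".split(), "will")
generalization = ("the cat that can meow will sleep".split(), "will")

for name, (tokens, gold) in [("train/dev", training_like),
                             ("generalization", generalization)]:
    guess = first_auxiliary(tokens)
    print(f"{name}: guessed={guess!r} gold={gold!r} correct={guess == gold}")
```

A probe that still finds “will” in the second sentence cannot be relying on this positional shortcut alone, which is the argument for attributing its success to structural information.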
The takeaways are:
- The main auxiliary verb in the training sentence “the cat will sleep” is “will”, whereas “The cat that can meow will sleep” is a complex generalization sentence. Here, predicting the main auxiliary verb “will” is difficult because of the presence of the distractor “can”. Probe performance in figure 4 (left) shows that BERT layers encode linguistic information well enough to detect the ‘main auxiliary verb’ on generalization data as well.
- Similarly, generalization on a progressive dataset for the ‘subject noun’ task is relatively difficult. However, an increase in encoded linguistic information can be noticed, as probing classifier performance increases in successive layers.
This paper further investigates the model’s ‘attention mechanism’ and how sensitive it is to such distractors. It proposes a ‘confusion score’, which is the binary cross-entropy of the attention of candidate tokens to the target token.
Figure 5: Confusion Score
We can see how confusion drops as the distractor becomes less complex in the cases below.
- Confusion dropped from 0.97 in A1 to 0.93 in A2, because the distractor in A2, ‘the dogs’, is relatively easier to catch, as it does not agree in number with the singular verb ‘does’. Similarly, confusion dropped from 0.85 in A3 to 0.81 in A4 for the same reason.
- Confusion dropped from 0.97 in A1 to 0.85 in A3 (and similarly from A2 to A4) because of the presence of an additional relative clause, which possibly resulted in better identification of the hierarchical syntactic structure.
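As a rough sketch of how such a score behaves, the snippet below assumes the confusion score is the binary entropy (in bits) of the attention that a target token places on two candidate tokens, after renormalising the two weights to sum to 1. This is one reading of the description above, not necessarily the paper’s exact formula, and the attention values are made up.

```python
import math

def confusion_score(attn_to_first, attn_to_second):
    """Binary entropy (bits) of attention split between two candidates.

    Assumption: the two raw attention weights are renormalised to sum to 1.
    A score of 1.0 means the target attends to both candidates equally
    (maximal confusion); 0.0 means all attention goes to one candidate.
    """
    total = attn_to_first + attn_to_second
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    p = attn_to_first / total
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Hypothetical attention weights from a verb to two candidate nouns.
print(round(confusion_score(0.30, 0.30), 3))  # evenly split attention
print(round(confusion_score(0.45, 0.05), 3))  # attention mostly on one noun
```

Under this reading, scores close to 1 (such as 0.97) indicate that attention barely distinguishes the true subject from the distractor.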
As classifier probes are comparatively less complex, it is interesting to investigate whether we can decode encoded linguistic knowledge in its totality. For instance, can we build entire dependency parse trees relying only on encoded representations?
Hewitt and Manning  propose “Structural Probe” in the paper “A Structural Probe for Finding Syntax in Word Representations”, where it can be empirically concluded that it is possible to transform the space of internal representations to the space of linguistic knowledge. The probe identifies a linear transformation under which the squared L2 distance of transformed representations encodes the distance between words in the parse tree, and one in which the squared L2 norm of transformed representations encodes depth in the parse tree.
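The two quantities used by the structural probe can be sketched in a few lines. In this minimal illustration the matrix `B` plays the role of the linear transformation and the vectors stand in for word representations. All values are toy numbers chosen by hand; in the paper the transformation is learned from gold parse trees, which is not shown here.

```python
def matvec(B, h):
    """Multiply matrix B (list of rows) with vector h."""
    return [sum(b * x for b, x in zip(row, h)) for row in B]

def squared_distance(B, h_i, h_j):
    """Squared L2 distance between two words after transforming by B.

    The probe is trained so that this approximates the distance between
    the two words in the parse tree.
    """
    diff = [a - b for a, b in zip(h_i, h_j)]
    return sum(v * v for v in matvec(B, diff))

def squared_norm(B, h):
    """Squared L2 norm of a transformed word vector (approximates tree depth)."""
    return sum(v * v for v in matvec(B, h))

# Toy example: project 3-dimensional representations down to 2 dimensions.
B = [[1.0, 0.0, 0.0],
     [0.0, 2.0, 0.0]]
h_cat = [1.0, 1.0, 5.0]    # hypothetical representation of "cat"
h_sleep = [0.0, 0.0, 5.0]  # hypothetical representation of "sleep"

print(squared_distance(B, h_cat, h_sleep))  # 5.0
print(squared_norm(B, h_cat))               # 5.0
print(squared_norm(B, h_sleep))             # 0.0
```

Note that the third input dimension is ignored by `B`, which mirrors the idea that syntax may live in a low-dimensional subspace of the representation.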
As can be seen, linguistic knowledge is learned by the model layer after layer, and it fades in the top layers because these layers are more tuned towards the primary objective function. The paper also studies whether increasing the dimensions of the transformed space helps in expressing linguistic knowledge. The experiments suggest that the linguistic knowledge for a dependency parse tree can be expressed in about 32 or 64 dimensions, and that adding further dimensions does not add further value.
Probes, supervised models trained to predict linguistic properties, have achieved high accuracy on a range of linguistic tasks. But does this mean that the representations encode linguistic structure, or just that the probe has learned the linguistic task? Can we meaningfully compare the linguistic properties of layers of a model using linguistic task accuracy? A sufficiently deep probe model can memorize the linguistic information. So how can we address this limitation?
Hewitt and Liang propose “selectivity” as a measure of the effectiveness of probes in the paper “Designing and Interpreting Probes with Control Tasks”. Control tasks are designed to measure how well a probe can learn a task independently of the linguistic information in the encoded representations. Selectivity is defined as the difference between linguistic task accuracy and control task accuracy.
As can be seen in figure 9, a control task for part-of-speech prediction independently assigns some word type (or identity) to a set of words, and a POS tag is predicted based on word types (ignoring encoded representations altogether). So, if a deep probe is able to memorize, it should be able to perform well on the control task as well. Probe model complexity and the accuracy achieved on the auxiliary part-of-speech task and its control task can be seen in the right part of the figure. It is of utmost importance to choose a probe with high selectivity and high accuracy when drawing conclusions.
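A control task and the selectivity measure can be sketched as follows. This is a simplified illustration: the assumption here is that each word type receives one random tag, sampled in proportion to a tag distribution, and keeps it wherever the word occurs. The tag distribution and the accuracy numbers are hypothetical.

```python
import random

def make_control_labels(word_types, tag_distribution, rng):
    """Assign every word type one random tag, ignoring its real syntax."""
    tags = list(tag_distribution)
    weights = [tag_distribution[t] for t in tags]
    return {w: rng.choices(tags, weights=weights, k=1)[0] for w in word_types}

def selectivity(linguistic_accuracy, control_accuracy):
    """Selectivity = linguistic task accuracy - control task accuracy."""
    return linguistic_accuracy - control_accuracy

rng = random.Random(0)
tokens = "the cat that can meow will sleep".split()
tag_distribution = {"NOUN": 0.4, "VERB": 0.3, "DET": 0.2, "AUX": 0.1}  # made up

control_labels = make_control_labels(sorted(set(tokens)), tag_distribution, rng)
# Every occurrence of a word gets the same control label, so a probe can
# only solve the control task by memorising word identities.
print([control_labels[t] for t in tokens])

# Hypothetical accuracies for two probes of different capacity.
print(round(selectivity(0.97, 0.71), 2))  # simple probe: high selectivity
print(round(selectivity(0.98, 0.93), 2))  # deep probe: low selectivity
```

In this made-up comparison the deep probe is slightly more accurate on the linguistic task, but most of that accuracy is also achievable on the control task, so its results say less about the representations.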
Adi et al. investigate the source of sentence structure knowledge in the paper “Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks”. In spite of the CBOW model being oblivious to the surrounding context, the probe achieved high accuracy on the auxiliary task of predicting sentence length. However, it was found that the norm of the sentence embedding alone was indicative of sentence length (figure 10 (right)), so the information did not come from the encoded representations of individual tokens. Rather, when these representations are aggregated, the norm tends to move towards 0 as more words are included, as established by the central limit theorem and Hoeffding’s inequality. It can be noticed in figure 10 (left) that the length prediction accuracy for synthetic sentences (random words chosen to form a synthetic sentence) was also close to that for legitimate sentences. So, the actual source of knowledge for determining sentence length was just a statistical property of the aggregation of random variables.
Hence, in-depth study and analysis are required to draw inferences from the outcomes of probes.
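This statistical effect is easy to reproduce with random vectors. The sketch below is a simulation under simple assumptions (independent word vectors with zero-mean uniform entries and a made-up dimension of 200), not a re-run of the paper’s experiment. It averages random word vectors and shows that the norm of the average shrinks as the sentence gets longer.

```python
import math
import random

def mean_embedding_norm(n_words, dim, rng):
    """Norm of the average of n_words random zero-mean word vectors."""
    total = [0.0] * dim
    for _ in range(n_words):
        for k in range(dim):
            total[k] += rng.uniform(-1.0, 1.0)
    return math.sqrt(sum((v / n_words) ** 2 for v in total))

rng = random.Random(0)
norms = {n: mean_embedding_norm(n, dim=200, rng=rng) for n in (5, 20, 80)}
for n, norm in norms.items():
    print(f"sentence length {n:3d}: norm of averaged embedding = {norm:.2f}")
```

Because the norm decreases roughly with the square root of the number of words, a probe can read sentence length off the norm even when the words are random.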
Now that we have surveyed techniques to analyze probes for encoded linguistic knowledge, a follow-up question is “can we infuse explicit linguistic knowledge for desired outcomes?”. There is an interesting study on paraphrase generation, “Syntax-guided Controlled Generation of Paraphrases”. Kumar et al. have shown how the syntax of an exemplar sentence can be leveraged to paraphrase a source sentence. A generated paraphrase should preserve the meaning of the source sentence, but its syntactic structure should be similar to that of the exemplar sentence.
Figure 11 shows generated paraphrases guided by the syntax of different exemplar sentences. We can observe how the model is able to take guidance from the syntax of the exemplar sentences. Note that only the syntax of the exemplar sentence is given as an input; the actual individual tokens are not fed to the model. The syntax tree of an exemplar sentence can be extracted at different heights H and fed as an input to the encoder-decoder model. A smaller height gives more flexibility in paraphrasing, while a greater height controls the syntactic structure of the paraphrase more explicitly.
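Extracting a syntax tree at height H can be sketched as a simple truncation. The tree encoding as `(label, children)` tuples, the convention that the root alone has height 1, and the example tree are assumptions for illustration. The paper’s exact representation may differ.

```python
def truncate(tree, height):
    """Keep only the top `height` levels of a (label, children) tree."""
    label, children = tree
    if height <= 1:
        return (label, [])
    return (label, [truncate(child, height - 1) for child in children])

def to_brackets(tree):
    """Render a tree in bracketed form, e.g. (S NP VP)."""
    label, children = tree
    if not children:
        return label
    return "(" + label + " " + " ".join(to_brackets(c) for c in children) + ")"

# Hypothetical constituency tree of an exemplar sentence (tokens omitted,
# since only the syntax is passed to the model).
exemplar = ("S", [
    ("NP", [("DT", []), ("NN", [])]),
    ("VP", [("VBZ", []), ("NP", [("DT", []), ("NN", [])])]),
])

print(to_brackets(truncate(exemplar, 2)))  # (S NP VP)
print(to_brackets(truncate(exemplar, 3)))  # (S (NP DT NN) (VP VBZ NP))
```

The height-2 template only fixes the top-level clause structure, whereas the height-3 template also constrains how the noun phrase and verb phrase are built.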
Encoded linguistic knowledge is essential to understanding the meaning of natural language. Most of the probes we have seen deal with syntactic linguistic knowledge, but the semantic meaning captured in the text also needs to be understood. We need to develop frameworks to assess this capability in NLP models like BERT. Reading comprehension, text similarity, question answering, neural machine translation, etc. are some examples where the true performance of the model depends on its ability to encode semantic meaning.
Benchmarks like GLUE and SuperGLUE were developed to assess the ability of fine-tuned NLP models to perform tasks based on natural language understanding. Generally, the performance of NLP models is compared using validation accuracy. Validation accuracy has inherent limitations, such as overfitting and the data distribution of the validation set. The paper “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList” presents a framework for assessing a model’s performance beyond validation accuracy.
“CheckList” suggests three different test types: Minimum Functionality Tests (MFT), where simple examples are generated with expected gold labels; Invariance tests (INV), where label-preserving perturbations are applied to given examples and the model’s prediction is expected to stay the same; and Directional Expectation tests (DIR), where perturbations are applied for which the prediction is expected to change in a known direction, positive or negative. Examples of each are given below:
It was surprising to notice that models like RoBERTa and BERT, while surpassing human baselines (with accuracies of 91.1% and 91.3%), fail badly on simple rule-based generalizations of the validation dataset. That said, there is a long road ahead to achieve human-level natural language understanding.
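As a rough illustration of the three test types, the sketch below runs them against a deliberately naive keyword-based sentiment model. It does not use the CheckList library: every function name, word list, and test sentence here is hypothetical.

```python
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "terrible", "hate"}

def toy_sentiment_score(text):
    """Naive score: positive keyword count minus negative keyword count."""
    tokens = text.lower().replace(".", "").split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

def predict(text):
    score = toy_sentiment_score(text)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def mft_failures(model, cases):
    """MFT: simple examples with an expected gold label."""
    return [text for text, gold in cases if model(text) != gold]

def inv_failures(model, pairs):
    """INV: a label-preserving perturbation must not change the prediction."""
    return [(a, b) for a, b in pairs if model(a) != model(b)]

def dir_failures(score, pairs):
    """DIR: appending a negative phrase must not increase the score."""
    return [(a, b) for a, b in pairs if score(b) > score(a)]

mft = mft_failures(predict, [
    ("I love this flight.", "positive"),
    ("I do not love this flight.", "negative"),  # negation
])
inv = inv_failures(predict, [
    ("The flight to Chicago was great.", "The flight to Dallas was great."),
])
dirs = dir_failures(toy_sentiment_score, [
    ("The food was good.", "The food was good. The service was terrible."),
])
print("MFT failures:", mft)
print("INV failures:", inv)
print("DIR failures:", dirs)
```

The toy model passes the invariance and directional tests but fails the negation case of the minimum functionality test, which is the kind of targeted failure that a single accuracy number would hide.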
We have gone through probes to assess encoded linguistic knowledge in NLP models. We have found that:
- NLP models do encode linguistic knowledge in order to solve some downstream NLP tasks.
- Bigger models or representations do not necessarily encode better linguistic knowledge.
- Linguistic knowledge encoded for syntactic tasks generalizes over test data with complex sentence structure, which can be attributed to the model’s capacity to encode linguistic grammar.
- Deeper probes can overfit and potentially memorize the auxiliary tasks, which leads to an overestimation of encoded linguistic knowledge. Hence, it is recommended to design control tasks for probes.
- When linguistic knowledge is supplied, models can do better on tasks seeking guidance from such knowledge.
- Syntactic linguistic knowledge is not enough to capture the meaning of natural language. State-of-the-art models are far from achieving the understanding needed for NLP tasks.
The encoded linguistic knowledge is primarily syntactic in nature, and as demonstrated by “CheckList”, models fail on generalization that is semantic in nature. State-of-the-art NLP models are primarily pre-trained in a self-supervised fashion on unlabelled data and fine-tuned on limited labeled data for downstream tasks. It is certainly difficult to acquire semantic knowledge related to tasks or domains from unlabelled data or limited labeled data.
Infusing semantic knowledge and domain knowledge improves the ability of an NLP model to encode semantic and domain knowledge. As a result, it inherently develops the ability to reason and generate plausible and faithful explanations. Gaur et al. describe how knowledge graphs can help make deep learning systems more interpretable and explainable.
- Belinkov, Y. and Glass, J., 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7, pp.49–72.
- Clark, K., Khandelwal, U., Levy, O. and Manning, C.D., 2019. What does bert look at? an analysis of bert’s attention. arXiv preprint arXiv:1906.04341.
- Tenney, I., Das, D. and Pavlick, E., 2019. BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950.
- Hewitt, J. and Liang, P., 2019. Designing and interpreting probes with control tasks. arXiv preprint arXiv:1909.03368.
- Hewitt, J. and Manning, C.D., 2019, June. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4129–4138).
- Goldberg, Y., 2019. Assessing BERT’s syntactic abilities. arXiv preprint arXiv:1901.05287.
- Hofmann, V., Pierrehumbert, J.B. and Schütze, H., 2020. Generating Derivational Morphology with BERT. arXiv preprint arXiv:2005.00672.
- Coenen, A., Reif, E., Yuan, A., Kim, B., Pearce, A., Viégas, F. and Wattenberg, M., 2019. Visualizing and measuring the geometry of bert. arXiv preprint arXiv:1906.02715.
- Tenney, I., Xia, P., Chen, B., Wang, A., Poliak, A., McCoy, R.T., Kim, N., Van Durme, B., Bowman, S.R., Das, D. and Pavlick, E., 2019. What do you learn from context? probing for sentence structure in contextualized word representations. arXiv preprint arXiv:1905.06316.
- Peters, M.E., Neumann, M., Zettlemoyer, L. and Yih, W.T., 2018. Dissecting contextual word embeddings: Architecture and representation. arXiv preprint arXiv:1808.08949.
- Adi, Y., Kermany, E., Belinkov, Y., Lavi, O. and Goldberg, Y., 2016. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. arXiv preprint arXiv:1608.04207.
- Stickland, A.C. and Murray, I., 2019. Bert and pals: Projected attention layers for efficient adaptation in multi-task learning. arXiv preprint arXiv:1902.02671.
- Zhou, J., Zhang, Z., Zhao, H. and Zhang, S., 2019. LIMIT-BERT: Linguistic informed multi-task bert. arXiv preprint arXiv:1910.14296.
- Jawahar, G., Sagot, B. and Seddah, D., 2019, July. What does BERT learn about the structure of language?
- Lin, Y., Tan, Y.C. and Frank, R., 2019. Open Sesame: Getting Inside BERT’s Linguistic Knowledge. arXiv preprint arXiv:1906.01698.
- Kumar, A., Ahuja, K., Vadapalli, R. and Talukdar, P., 2020. Syntax-guided Controlled Generation of Paraphrases. arXiv preprint arXiv:2005.08417.
- de Vries, W., van Cranenburgh, A. and Nissim, M., 2020. What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models. arXiv preprint arXiv:2004.06499.
- Ribeiro, M.T., Wu, T., Guestrin, C. and Singh, S., 2020. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. arXiv preprint arXiv:2005.04118.
- Gaur, M., Faldu, K. and Sheth, A., 2020. Semantics of the Black-Box: Can knowledge graphs help make deep learning systems more interpretable and explainable?. arXiv preprint arXiv:2010.08660.