December 5, 2020

8 Links of Separation

Linguistics Wisdom of NLP Models. Analyzing, Designing, and Evaluating… | by Keyur Faldu | Nov, 2020

Analyzing, Designing, and Evaluating Linguistic Probes.

This article is authored by Keyur Faldu and Dr. Amit Sheth. It elaborates on a niche aspect of the broader cover story on “Rise of Modern NLP and the Need of Interpretability!” At Embibe, we focus on developing interpretable and explainable Deep Learning systems, and we survey the current state-of-the-art techniques to answer some open questions on the linguistic wisdom acquired by NLP models.

This article continues the previous article (Discovering the Encoded Linguistic Knowledge in NLP models) in order to understand what linguistic knowledge is encoded in NLP models. The previous article covers what probing is, how it differs from multi-task learning, and two types of probes: representation-based probes and attention-weights-based probes. It also sheds light on how a probe task (or auxiliary task) is used to assess the linguistic ability of NLP models trained on some other primary task(s).

Figure 1. An illustration of probes on the BERT model. It shows how input tokens are contextualized in successive layers using attention mechanisms. Two types of probes are shown: (1) representation-based and (2) attention-based. Note that the diagram is for broader illustration, so special tokens like CLS and SEP are not shown.

Naturally, the prediction performance of probes on linguistic tasks, or supporting patterns that correlate or compare neural network mechanics with linguistic phenomena, gives insight into what linguistic knowledge is encoded and how. Prediction performance could be classification accuracy, correlation coefficients, or the mean reciprocal rank of the gold label. Note that the model's performance on the probe task can be compared with the state-of-the-art performance of a model trained explicitly on that task as its primary task, to understand the extent of the encoded linguistic knowledge. However, there are other aspects worth analyzing more deeply in such probes, including the following.

  • Bigger, the better? How does the linguistic knowledge captured by a model vary with its complexity, i.e., dimension sizes, number of parameters, etc.? Comparing probing classifier performance on an auxiliary task across models of different complexity would answer this question.
  • Generalization ability over complex test data should be assessed before attributing the success of probes to encoded linguistic knowledge. For example, what if the training data generally contains the ‘main auxiliary’ as the first verb, but the generalization data deliberately contains distractors, so that the ‘main auxiliary’ is no longer the first verb? In such cases, if probes can still detect the ‘main auxiliary’ verb, this can be attributed to linguistic features like parse trees rather than to sequential positional features.
  • Ability to decode linguistic knowledge: Classification tasks are relatively less complex than tasks that decode or construct linguistic knowledge, e.g., can we build a complete dependency parse tree using internal representations? It is intriguing to discover approaches that recover latent linguistic knowledge.
  • Limitations and source of the linguistic knowledge: When probes perform well on auxiliary linguistic tasks, is it because of some correlation, or is there causation? Because a deep and complex probe model can also memorize, it can overfit to the sought-after linguistic knowledge. So, how can we establish the source of the linguistic knowledge expressed by the probes? Designing a ‘control task’, and comparing the predictive performance of probes on the linguistic task with their performance on the control task, could reveal the effectiveness of the probes and the source of the linguistic knowledge.
  • Infuse linguistic knowledge: If neural network models learn linguistic knowledge in the process of training for an end-to-end task, would it be possible to infuse linguistic knowledge, e.g., syntax parse trees, to boost performance?
  • Does encoded linguistic knowledge capture meaning? Linguistic knowledge like POS tagging, dependency trees, etc. is syntactic in nature. Real-world applications expect NLP models to understand semantic meaning. So, it is of utmost importance to assess the capability of a model to encode semantic meaning.

The above considerations help us understand probes better and draw meaningful conclusions about the linguistic knowledge encoded in NLP models. Let us dive deeper into examples and surveys of research papers on these topics.

Adi et al. [11] considered three auxiliary tasks related to sentence structure:

  1. Sentence length: Does the sentence embedding encode information on sentence length?
  2. Word content: Is it possible to predict whether a word is contained in the sentence on the basis of the sentence embedding?
  3. Word order: Given the sentence embedding and two words, can the order of the two words be determined?
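
As a rough illustration of how examples for these three probe tasks could be assembled, the sketch below builds one labelled example per task from a tokenized sentence. The toy embedding table `EMB`, the helper names, the length bins, and the choice to concatenate word vectors to the sentence embedding are all assumptions made for this example, not the exact setup of the paper.

```python
# Toy word vectors (hypothetical values, for illustration only).
EMB = {
    "the": [0.1, 0.3], "cat": [0.7, -0.2], "will": [-0.4, 0.5],
    "sleep": [0.2, 0.9], "dog": [0.6, -0.1],
}

def sentence_embedding(tokens):
    """Average the word vectors of a sentence (a simple bag-of-words encoder)."""
    dim = len(next(iter(EMB.values())))
    return [sum(EMB[t][d] for t in tokens) / len(tokens) for d in range(dim)]

def length_example(tokens, bins=((1, 2), (3, 4), (5, 8))):
    """Input: sentence embedding. Label: index of the (assumed) length bin."""
    label = next(i for i, (lo, hi) in enumerate(bins) if lo <= len(tokens) <= hi)
    return sentence_embedding(tokens), label

def content_example(tokens, word):
    """Input: sentence embedding + word vector. Label: is the word in the sentence?"""
    return sentence_embedding(tokens) + EMB[word], int(word in tokens)

def order_example(tokens, w1, w2):
    """Input: sentence embedding + two word vectors. Label: does w1 precede w2?"""
    features = sentence_embedding(tokens) + EMB[w1] + EMB[w2]
    return features, int(tokens.index(w1) < tokens.index(w2))

sentence = ["the", "cat", "will", "sleep"]
print(length_example(sentence)[1])                  # 1 (four tokens fall in bin 3-4)
print(content_example(sentence, "dog")[1])          # 0 ("dog" is not in the sentence)
print(order_example(sentence, "cat", "sleep")[1])   # 1 ("cat" precedes "sleep")
```

A probing classifier would then be trained on many such (features, label) pairs, with the sentence encoder kept frozen.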

These probes are based on the sentence embedding, which is computed as the average of the final representations produced by the encoder-decoder model and by the CBOW (continuous bag of words) model. The key findings in the paper on whether bigger models are better at encoding the above linguistic knowledge are as follows:

Figure 2: Accuracy vs. representation dimensions for auxiliary tasks: (a) length test, (b) word content test, and (c) word order test. (Adi et al. [11], ICLR 2017)

  • Increasing the number of dimensions benefits some tasks more than others. As shown in Figure 2, the (a) length and (c) order tests benefit from bigger representation dimensions, whereas the content test peaks at a representation of 750 dimensions.
  • On the other hand, CBOW models, which have far fewer parameters than encoder-decoder models, also perform well on the ‘word-content’ task with fewer dimensions.

Figure 3. Examples of training and development data, which are simpler in nature. The generalization data is more complex because of the presence of distractors. (i) Main auxiliary task: “will” is the target word, and “can” is a distractor added in the generalization data. (ii) Subject noun task: “bee” is the target word, and “queen” is a distractor added in the generalization data. (Lin et al. [15], ACL 2019)

Lin et al. [15] carried out such experiments in the paper “Open Sesame: Getting Inside BERT’s Linguistic Knowledge”. Figure 3 shows how the generalization data can contain deliberate distractors to stress-test the model’s encoded linguistic knowledge.

  • The ‘Main Auxiliary Task’ is to identify the main auxiliary verb (helping verb) in a sentence. The training and development data contain the main auxiliary verb as the first verb in each sentence, whereas the generalization dataset places it deeper in the sentence.
  • Similarly, the ‘Subject Noun Task’ is to identify the noun acting as the subject, which is the first noun in the training and development data, but this no longer holds in the generalization set.

Figure 4: The classification accuracy of probes based on the internal representations at different layers. “bbu” stands for “BERT base uncased”, and “blu” stands for “BERT large uncased”. (Lin et al. [15], ACL 2019)

The takeaways are:

  • The main auxiliary verb in the training sentence “the cat will sleep” is “will”, whereas “the cat that can meow will sleep” is a more complex generalization sentence. Here, predicting the main auxiliary verb “will” is difficult because of the distractor “can”. The probe performance in Figure 4 (left) shows that BERT layers encode enough linguistic information to detect the main auxiliary verb very well on the generalization data too.
  • Similarly, generalization on a progressive dataset for the ‘subject noun’ task is relatively difficult. However, an increase in encoded linguistic information can be seen, as probing classifier performance increases across successive layers.
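
To see why distractors matter, consider a purely positional shortcut that simply picks the first auxiliary verb it encounters. The sketch below is only an illustration: the `AUXILIARIES` word list and the `first_auxiliary` helper are hypothetical, and the sentences are modelled on the main auxiliary examples discussed above.

```python
# A small, hand-picked set of auxiliary verbs (assumption for this example).
AUXILIARIES = {"will", "can", "would", "could", "does", "do"}

def first_auxiliary(tokens):
    """Positional shortcut: return the index of the first auxiliary verb."""
    for i, tok in enumerate(tokens):
        if tok.lower() in AUXILIARIES:
            return i
    return None

# Each item is (tokens, index of the gold main auxiliary verb).
train = ("the cat will sleep".split(), 2)
generalization = ("the cat that can meow will sleep".split(), 5)

for name, (tokens, gold) in [("train", train), ("generalization", generalization)]:
    pred = first_auxiliary(tokens)
    print(name, tokens[pred], "correct" if pred == gold else "wrong")
```

The shortcut is right on the training-style sentence but selects the distractor “can” on the generalization sentence, so a probe that still succeeds there must rely on something more than token position.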

The paper further investigates the model’s attention mechanism and how sensitive it is to such distractors. It proposes a ‘confusion score’, which is the binary cross-entropy of the attention of candidate tokens to the target token.

Figure 5: Confusion Score

Figure 6. The examples for the ‘Subject-Verb Agreement’. In A1, the target token is the verb ‘does’, whereas candidate tokens are ‘the cat’ and ‘the dog’. Confusion score depends on the binary cross-entropy of agreement between ‘does’ and ‘the cat’, and ‘does’ and ‘the dog’. (Lin et al.[15], ACL 2019)
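
One plausible reading of the confusion score can be sketched as follows. This is an assumption rather than the paper's exact formula: the attention weights between the target token and each candidate are renormalized to sum to 1, and the entropy of that distribution is measured in bits. With two candidates, the score then ranges from 0 (all attention on one candidate) to 1 (attention split evenly). The function name and the attention values are hypothetical.

```python
import math

def confusion(attention_to_candidates):
    """Entropy (in bits) of attention renormalized over the candidate tokens."""
    total = sum(attention_to_candidates)
    probs = [a / total for a in attention_to_candidates]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical attention weights for two candidates, e.g. 'the cat' and 'the dog'.
print(round(confusion([0.5, 0.5]), 2))   # 1.0: the model cannot tell the candidates apart
print(round(confusion([0.9, 0.1]), 2))   # lower: attention favours one candidate
```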

We can see how confusion drops as the distractor becomes less complex in the cases below.

  • Confusion dropped from 0.97 in A1 to 0.93 in A2, because the distractor in A2, ‘the dogs’, is relatively easier to catch, as it does not agree in number with the singular verb ‘does’. Similarly, confusion dropped from 0.85 in A3 to 0.81 in A4 for the same reason.
  • Confusion dropped from 0.97 in A1 to 0.85 in A3 (and similarly from A2 to A4) because of the presence of an additional relative clause, which possibly resulted in better identification of the hierarchical syntactic structure.

Hewitt and Manning [5] propose the “Structural Probe” in the paper “A Structural Probe for Finding Syntax in Word Representations”, which shows empirically that it is possible to transform the space of internal representations into a space that expresses linguistic knowledge. The probe identifies a linear transformation under which the squared L2 distance between transformed representations encodes the distance between words in the parse tree, and another under which the squared L2 norm of a transformed representation encodes the word’s depth in the parse tree.
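
The two quantities described above can be sketched in a few lines. The matrix `B` and the word vectors below are made-up values; in practice the transformation is learned so that these quantities match the gold parse tree, and the distance and depth probes use separate matrices (a single `B` is reused here only for brevity).

```python
def matvec(B, h):
    """Multiply matrix B (list of rows) by vector h."""
    return [sum(b * x for b, x in zip(row, h)) for row in B]

def squared_distance(B, h_i, h_j):
    """Squared L2 distance between two word vectors after projecting with B."""
    diff = [a - b for a, b in zip(h_i, h_j)]
    return sum(v * v for v in matvec(B, diff))

def squared_norm(B, h):
    """Squared L2 norm of a projected word vector (read as predicted tree depth)."""
    return sum(v * v for v in matvec(B, h))

B = [[1.0, 0.0, 0.0],
     [0.0, 2.0, 0.0]]                 # hypothetical 2x3 probe matrix
h_cat = [1.0, 0.5, 3.0]               # hypothetical contextual word vectors
h_sleeps = [0.0, 1.5, -1.0]

print(squared_distance(B, h_cat, h_sleeps))   # 5.0
print(squared_norm(B, h_cat))                 # 2.0
```

Note that the third input dimension is ignored by this `B`: a low-rank projection keeps only the directions that are useful for expressing tree structure.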

Figure 7. UUAS (Undirected Unlabeled Attachment Score) measures how well the relationships between pairs of tokens in the dependency tree are predicted. DSpr is the Spearman correlation between the predicted distances between tokens and their distances in the gold dependency parse tree. The x-axis in the left figure denotes the hidden layer of the BERT-large model. In the right figure, the x-axis represents the dimension of the transformed space. (Hewitt and Manning [5], NAACL 2019)

As can be seen, linguistic knowledge is learned by the model layer after layer, and it fades in the top layers because these layers are more tuned towards the primary objective function. The paper also studies whether increasing the dimensions of the transformed space helps in expressing linguistic knowledge. The experiments indicate that the linguistic knowledge for a dependency parse tree can be expressed in about 32 or 64 dimensions; adding further dimensions does not add further value.
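
For intuition, UUAS can be computed by building a minimum spanning tree over the predicted pairwise distances and counting how many gold dependency edges it recovers. The sketch below uses a small Prim's algorithm in plain Python; the distance matrices and the three-word sentence are hypothetical, and details such as punctuation handling in the actual evaluation are not modelled.

```python
def minimum_spanning_tree(dist):
    """Prim's algorithm over a symmetric distance matrix; returns undirected edges."""
    n = len(dist)
    in_tree = {0}
    edges = set()
    while len(in_tree) < n:
        # Cheapest edge that connects the current tree to a new node.
        i, j = min(((a, b) for a in in_tree for b in range(n) if b not in in_tree),
                   key=lambda e: dist[e[0]][e[1]])
        edges.add(frozenset((i, j)))
        in_tree.add(j)
    return edges

def uuas(pred_dist, gold_edges):
    """Fraction of gold edges recovered by the tree built from predicted distances."""
    gold = {frozenset(e) for e in gold_edges}
    return len(minimum_spanning_tree(pred_dist) & gold) / len(gold)

# "the cat sleeps": gold undirected edges the-cat and cat-sleeps.
gold_edges = [(0, 1), (1, 2)]
good_pred = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
bad_pred = [[0, 3, 1], [3, 0, 1], [1, 1, 0]]

print(uuas(good_pred, gold_edges))   # 1.0
print(uuas(bad_pred, gold_edges))    # 0.5
```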

Figure 8. An example of a constructed dependency parse tree (Hewitt et al. [5], NAACL 2019)

Hewitt and Liang [4] propose “Selectivity” as a measure of the effectiveness of probes in the paper “Designing and Interpreting Probes with Control Tasks”. Control tasks are designed to measure how much a probe can learn independently of the encoded representations. Selectivity is defined as the difference between linguistic task accuracy and control task accuracy.

Figure 9: (left) The control task, designed by assigning random identifiers to clusters of words. (right) The comparison of accuracy and complexity of the probe task with the control task. (Hewitt and Liang [4], EMNLP 2019)

As can be seen in Figure 9, a control task for part-of-speech prediction assigns a word type (or identity) to each set of words independently, and a POS tag is predicted based on the word type alone (ignoring the encoded representations altogether). So, if a deep probe is able to memorize, it should perform well on the control task as well. The probe model complexity and the accuracy achieved on the part-of-speech auxiliary task and its control task can be seen in the right figure. It is of utmost importance to choose a probe with high selectivity and high accuracy in order to draw conclusions.
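
The control-task idea can be sketched as follows: every word type receives a fixed random label, so the only way to do well is to memorize word identities. The helper names, the tiny vocabulary, and the accuracy numbers are hypothetical and serve only to show how selectivity is computed.

```python
import random

def make_control_labels(vocab, num_labels, seed=0):
    """Assign each word type a fixed random label, independent of context."""
    rng = random.Random(seed)
    return {word: rng.randrange(num_labels) for word in sorted(vocab)}

def selectivity(linguistic_acc, control_acc):
    """Selectivity = linguistic task accuracy - control task accuracy."""
    return linguistic_acc - control_acc

vocab = {"the", "cat", "will", "sleep"}
control = make_control_labels(vocab, num_labels=3)

sentence = ["the", "cat", "will", "sleep", "the"]
print([control[w] for w in sentence])   # both occurrences of "the" share one label

# Hypothetical accuracies for a simple and a highly expressive probe.
print(round(selectivity(0.95, 0.70), 2))   # 0.25: accuracy reflects the representations
print(round(selectivity(0.97, 0.92), 2))   # 0.05: the probe mostly memorizes
```

A probe with slightly lower accuracy but much higher selectivity gives stronger evidence that the knowledge comes from the representations rather than from the probe itself.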

Adi et al. [11] investigate the source of sentence structure knowledge in the paper “Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks”. In spite of the CBOW model being oblivious to the surrounding context, the probe achieved high accuracy on the auxiliary task of predicting sentence length. However, it was found that the norm of the sentence embedding alone was indicative of sentence length (Figure 10 (right)), so the information did not come from the encoded representation of any individual token. Rather, when these representations are aggregated, the norm tends to move towards 0 as more tokens are averaged, as established by the central limit theorem and Hoeffding’s inequality. Figure 10 (left) shows that the length prediction accuracy for synthetic sentences (formed from randomly chosen words) was also close to that for legitimate sentences. So, the actual source of the knowledge used to determine sentence length was just a statistical property of the aggregation of random variables.

Figure 10. (Left) Sentence length prediction accuracy vs representation dimensions. (Right) Norm vs sentence length. (Adi et al. [11], ICLR 2017)
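
This statistical effect is easy to reproduce with a small simulation. The sketch below averages random vectors (standing in for word embeddings, which is an assumption of this example) and shows that the norm of the average shrinks as the synthetic “sentence” gets longer. The dimension, value range, and sentence lengths are arbitrary choices.

```python
import math
import random

def mean_vector_norm(num_words, dim, rng):
    """Norm of the average of `num_words` random zero-mean word vectors."""
    words = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_words)]
    mean = [sum(w[d] for w in words) / num_words for d in range(dim)]
    return math.sqrt(sum(x * x for x in mean))

rng = random.Random(0)
for length in (5, 20, 80):
    norms = [mean_vector_norm(length, dim=100, rng=rng) for _ in range(50)]
    # The average norm decreases as the sentence length grows.
    print(length, round(sum(norms) / len(norms), 3))
```

Because the norm falls predictably with length, a probe can recover sentence length from the norm alone, without the embedding having learned anything about sentence structure.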

Hence, in-depth study and analysis are required to draw inferences from the outcomes of probes.

Figure 11: Examples of generated paraphrases for a source sentence using syntactic exemplars. (Kumar et al. [16], TACL 2020)

Figure 11 shows paraphrases generated with guidance from the syntax of different exemplar sentences. We can observe how the model takes guidance from the syntax of each exemplar. Note that only the syntax of the exemplar sentence is given as input; its actual tokens are not fed to the model. The syntax tree of an exemplar sentence can be extracted at different heights H and fed as input to the encoder-decoder model. A smaller height gives more flexibility in paraphrasing, while a greater height controls the syntactic structure of the paraphrase more explicitly.

Figure 12. The left figure shows the syntax tree of a sentence. The right figure shows how different paraphrases are generated for the source sentence (S) when syntax trees of an exemplar sentence (E) at different heights (H=4 to 7) are given as input. (Kumar et al. [16], TACL 2020)
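
Extracting a syntax tree at a given height can be pictured as pruning everything below the top H levels. The sketch below is a simplified illustration: the `(label, children)` tree encoding, the helper names, and the convention that the root is at height 1 are assumptions of this example, not the paper's implementation.

```python
def truncate(tree, height):
    """Keep only the top `height` levels of a (label, children) tree."""
    label, children = tree
    if height <= 1:
        return (label, [])
    return (label, [truncate(child, height - 1) for child in children])

def bracketed(tree):
    """Render a tree in bracketed notation."""
    label, children = tree
    if not children:
        return label
    return "(" + label + " " + " ".join(bracketed(c) for c in children) + ")"

# Constituency-style tree of an exemplar with the word tokens left out (syntax only).
exemplar = ("S", [("NP", [("DT", []), ("NN", [])]),
                  ("VP", [("MD", []), ("VP", [("VB", [])])])])

for h in (1, 2, 3):
    print(h, bracketed(truncate(exemplar, h)))
# 1 S
# 2 (S NP VP)
# 3 (S (NP DT NN) (VP MD VP))
```

A small H leaves only a coarse template for the paraphrase, while a larger H pins down more of the phrase structure.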

Benchmarks like GLUE and SuperGLUE were developed to assess the ability of fine-tuned NLP models to perform natural language understanding tasks. Generally, the performance of NLP models is compared using validation accuracy. There are inherent limitations in using validation accuracy, such as overfitting and the data distribution of the validation set. The paper “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList” [18] presents a framework for assessing a model’s performance beyond validation accuracy.

“CHECKLIST” suggests three different test types: Minimum Functionality Tests (MFT), where examples are generated with expected gold labels; Invariance tests (INV), where new examples are created from given examples using perturbations that should leave the gold label unchanged; and Directional Expectation tests (DIR), where the perturbation is expected to change the prediction in a known positive or negative direction. Examples of each are given below:

Figure 13. Examples of test cases generated using MFT, INV, and DIR rules for the task of “Quora Question Pairs”, which is to detect whether two questions are duplicates or not. (Ribeiro et al. [18], ACL 2020)
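
The three test types can be mimicked with a few lines of plain Python. The sketch below does not use the CheckList library; `predict_duplicate` is a toy token-overlap scorer standing in for a real model, and the questions, perturbations, and threshold are hypothetical. Each runner returns the list of failing cases.

```python
def predict_duplicate(q1, q2):
    """Toy stand-in for a model: Jaccard overlap of lowercase tokens."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b)

def run_mft(cases, threshold=0.5):
    """Minimum functionality: each case carries a known gold label."""
    return [(q1, q2) for q1, q2, gold in cases
            if (predict_duplicate(q1, q2) >= threshold) != gold]

def run_inv(pairs, perturb, threshold=0.5):
    """Invariance: a label-preserving perturbation must not flip the prediction."""
    return [(q1, q2) for q1, q2 in pairs
            if (predict_duplicate(q1, q2) >= threshold)
            != (predict_duplicate(perturb(q1), q2) >= threshold)]

def run_dir(pairs, perturb):
    """Directional: the perturbation should not raise the duplicate score."""
    return [(q1, q2) for q1, q2 in pairs
            if predict_duplicate(perturb(q1), q2) > predict_duplicate(q1, q2)]

pair = ("how can I become more vocal ?", "how can I become more outspoken ?")
mft_cases = [pair + (True,),
             ("is the cat asleep ?", "is the cat awake ?", False)]

print(run_mft(mft_cases))                                          # the antonym pair fails
print(run_inv([pair], lambda q: q.replace("become", "becmoe")))    # typo: no failures
print(run_dir([pair], lambda q: q + " in 2020"))                   # extra detail: no failures
```

Even this toy scorer fails the antonym case, which is the kind of simple, targeted failure that aggregate validation accuracy tends to hide.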

It was surprising to notice that while models like BERT and RoBERTa surpass human baselines (with accuracies of 91.1% and 91.3%), they fail badly on simple rule-based generalizations of the validation dataset. That said, there is a long road ahead to achieve human-level natural language understanding.

We have gone through probes that assess the linguistic knowledge encoded in NLP models. We have found that:

  • NLP models do encode linguistic knowledge in order to solve some downstream NLP tasks.
  • Bigger models or representations do not necessarily encode better linguistic knowledge.
  • Linguistic knowledge encoded for syntactic tasks generalizes to test data with complex sentence structure, which can be attributed to the model’s capacity to encode grammar.
  • Deeper probes can overfit and potentially memorize the auxiliary tasks, which leads to an overestimation of the encoded linguistic knowledge. Hence, it is recommended to design control tasks for probes.
  • When linguistic knowledge is supplied, models can do better on tasks that seek guidance from such knowledge.
  • Syntactic linguistic knowledge is not enough to capture the meaning of natural language. State-of-the-art models are far from achieving the understanding needed for NLP tasks.

The encoded linguistic knowledge is primarily syntactic in nature and, as demonstrated by “CHECKLIST”, models fail on generalization that is semantic in nature. State-of-the-art NLP models are primarily pre-trained in a self-supervised fashion on unlabelled data and fine-tuned on limited labelled data for downstream tasks. It is certainly difficult to acquire semantic knowledge related to tasks or domains from unlabelled data or limited labelled data.


  1. Belinkov, Y. and Glass, J., 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7, pp.49–72.
  2. Clark, K., Khandelwal, U., Levy, O. and Manning, C.D., 2019. What does bert look at? an analysis of bert’s attention. arXiv preprint arXiv:1906.04341
  3. Tenney, I., Das, D. and Pavlick, E., 2019. BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950.
  4. Hewitt, J. and Liang, P., 2019. Designing and interpreting probes with control tasks. arXiv preprint arXiv:1909.03368.
  5. Hewitt, J. and Manning, C.D., 2019, June. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4129–4138).
  6. Goldberg, Y., 2019. Assessing BERT’s syntactic abilities. arXiv preprint arXiv:1901.05287.
  7. Hofmann, V., Pierrehumbert, J.B. and Schütze, H., 2020. Generating Derivational Morphology with BERT. arXiv preprint arXiv:2005.00672.
  8. Coenen, A., Reif, E., Yuan, A., Kim, B., Pearce, A., Viégas, F. and Wattenberg, M., 2019. Visualizing and measuring the geometry of bert. arXiv preprint arXiv:1906.02715.
  9. Tenney, I., Xia, P., Chen, B., Wang, A., Poliak, A., McCoy, R.T., Kim, N., Van Durme, B., Bowman, S.R., Das, D. and Pavlick, E., 2019. What do you learn from context? probing for sentence structure in contextualized word representations. arXiv preprint arXiv:1905.06316.
  10. Peters, M.E., Neumann, M., Zettlemoyer, L. and Yih, W.T., 2018. Dissecting contextual word embeddings: Architecture and representation. arXiv preprint arXiv:1808.08949.
  11. Adi, Y., Kermany, E., Belinkov, Y., Lavi, O. and Goldberg, Y., 2016. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. arXiv preprint arXiv:1608.04207.
  12. Stickland, A.C. and Murray, I., 2019. Bert and pals: Projected attention layers for efficient adaptation in multi-task learning. arXiv preprint arXiv:1902.02671.
  13. Zhou, J., Zhang, Z., Zhao, H. and Zhang, S., 2019. LIMIT-BERT: Linguistic informed multi-task bert. arXiv preprint arXiv:1910.14296.
  14. Jawahar, G., Sagot, B. and Seddah, D., 2019, July. What does BERT learn about the structure of language?
  15. Lin, Y., Tan, Y.C. and Frank, R., 2019. Open Sesame: Getting Inside BERT’s Linguistic Knowledge. arXiv preprint arXiv:1906.01698.
  16. Kumar, A., Ahuja, K., Vadapalli, R. and Talukdar, P., 2020. Syntax-guided Controlled Generation of Paraphrases. arXiv preprint arXiv:2005.08417.
  17. de Vries, W., van Cranenburgh, A. and Nissim, M., 2020. What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models. arXiv preprint arXiv:2004.06499.
  18. Ribeiro, M.T., Wu, T., Guestrin, C. and Singh, S., 2020. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. arXiv preprint arXiv:2005.04118.
  19. Gaur, M., Faldu, K. and Sheth, A., 2020. Semantics of the Black-Box: Can knowledge graphs help make deep learning systems more interpretable and explainable?. arXiv preprint arXiv:2010.08660.