Interpretability of Pre-trained Language Models: A Survey

HAO Yaru1, DONG Li1, XU Ke2*, LI Xianxian3   

    1. Microsoft Research Asia, Beijing 100191, China;
    2. School of Computer Science and Engineering, Beihang University, Beijing 100083, China;
    3. Guangxi Key Laboratory of Multi-Source Information Mining and Security (Guangxi Normal University), Guilin, Guangxi 541004, China
  • Received: 2022-03-08 Revised: 2022-05-05 Online: 2022-09-25 Published: 2022-10-18

Abstract: Large-scale pre-trained language models based on deep neural networks have achieved great success in a wide range of natural language processing tasks, such as text classification, reading comprehension, and machine translation, and are widely used in industry. However, these models are generally hard to interpret: it is difficult to understand why particular model structures and pre-training methods are effective, or to explain the internal mechanisms by which a model arrives at its predictions. The resulting uncertainty and lack of controllability hinder the wider adoption of artificial intelligence models. It is therefore crucial to design sound interpretation methods, which can both explain model behavior and guide researchers in improving the models. This paper surveys recent research on the interpretability of large-scale pre-trained language models, reviews the related methods, and analyzes the shortcomings of existing methods along with possible directions for future research.

Key words: language model, pre-training, interpretability, natural language processing, neural networks

