Between September 2019 and Juin 2021, the “Deep text” research project was carried out by the DHLab of the MSI and the BCL laboratory.

That close collaboration led to the development of new deep learning tools for text mining. In more detail, some original techniques were introduced to inspect the hidden layers of a (feed-forward, convolutional or recurrent) neural net, trained for classification.  Those techniques are referred to as Weighted Deconvolution Saliency (WTDS) and allow researchers and practitioners to detect in a paragraph of a corpus the key-patterns used by the artificial intelligence to classify the paragraph (e.g. author or domain attribution, sentiment analysis).

The main contributions of the proposed approach are:

  1. the extracted key-patterns account for the context. For instance, the same token at two different locations in a sentence can be considered as a key-token only once;
  2. the activation of a pattern toward all the classes is computed (see Figure 1);
  3. no randomization/sampling step is required (as it is the case for LIME and SHAP, for instance) thus reducing the computational burden of the patterns extraction.



Figure 1 : Key-patterns in Macron’s speech attributed to the classes “Trump” and “Obama”.
Figure 1 : Key-patterns in Macron’s speech attributed to the classes “Trump” and “Obama”. Figure 1 : Key-patterns in Macron’s speech attributed to the classes “Trump” and “Obama”.


Publications

The main findings emerging from “Deep text” were the subject of several works, including a published book chapter (1), a conference proceedings (2) and a recently submitted pre-print (3). Moreover, WTDS was implemented as part of the text data mining software Hyperbase and it is currently used at BCL, in support of several research projects and publications.

  1. Laurent Vanni, Marco Corneli, Dominique Longrée, Damon Mayaffre, Frédéric Precioso. Key Passages : From statistics to Deep Learning. Domenica Fioredistella Iezzi; Damon Mayaffre; Michelangelo Misuraca. Text Analytics. Advances and Challenges, Springer, pp.41-54, 2020, 978-3-030-52679-5. 10.1007/978-3-030-52680-1_4. hal-03099658
  2. Laurent Vanni, Marco Corneli, Dominique Longrée, Damon Mayaffre, Frédéric Precioso. Hyperdeep : deep learning descriptif pour l'analyse de données textuelles. JADT 2020 - 15èmes Journées Internationales d'Analyse statistique des Données Textuelles, Jun 2020, Toulouse, France. hal-02926880
  3. Laurent Vanni, Marco Corneli, Damon Mayaffre, Frédéric Precioso. From text saliency to linguistic objects: learning linguistic interpretable markers with a multi-channels convolutional architecture. 2021. hal-03142170
 

Project Members

Camille Bouzereau (Ph.D. student, Bases, Corpus and Language – BCL Laboratory), Magali Guaresi (Université Libre de Bruxelles Post-Doctoral fellow), Dominique Longrée (Professor, Université de Liège), Damon MAYAFFRE, (CNRS [National Center for Scientific Research] Researcher, Céline Poudat (Associate Professor, Bases, Corpus and Language – BCL Laboratory), Frédéric Précioso (Professor, I3S Laboratory), Laurent VANNI, (CNRS [National Center for Scientific Research] Research Engineer), Marco CORNELI, MSI Data Scientist.