Recognition of Printed and Handwritten Chinese Texts in Ancient Documents: The Numerica Sinologica Research Project
- Science outreach
- Research
- IDEX
on the July 21, 2023
Frédéric Constant, university professor in legal history at Université Côte d’Azur, presents the challenges addressed by the Numerica Sinologica research project and the role played by Azzurra’s computing resources and dedicated MSI staff.
In recent decades, numerous libraries and heritage institutions around the world have digitized and disseminated their collections, providing historians with vast amounts of information still to be exploited. Converting these images into text allows for the development of powerful analysis tools, especially crucial when dealing with significant amounts of data. While optical character recognition (OCR) techniques are relatively effective for modern prints, recognizing ancient prints or handwritten scripts (HTR - Handwritten Text Recognition) requires the development of specific models. Recently, the field has seen significant progress by leveraging deep learning and neural networks. The Numerica Sinologica project aims to develop OCR and HTR models tailored to Chinese historical sources, using the open-source software kraken combined with the eScriptorium platform, which has already proven its worth for Latin and Hebrew handwritten scripts.
Chinese writing has unique characteristics that required adapting existing tools. The vertical arrangement of writing lines poses specific challenges for line recognition (middle image, photo 1). Classical Chinese is composed of over 30,000 sinograms, some of which appear only in a few occurrences, compared to about a hundred for alphabetic languages.
We trained an initial kraken model using data from the Imperial Collection of the Four Repositories (photo 2), a vast corpus of works compiled by order of Emperor Qianlong in the 18th century, for which we have both transcriptions and images. The collection contains a total of 3,461 works, comprising over 2 million pages and 800 million characters.
Our first kraken model, trained on approximately two million lines, covers nearly all existing sinograms, i.e. all commonly used characters.
The next step in our work is to train new models on a wide diversity of writings, enabling them not only to recognize a broad range of characters but also a large number of writing styles, an essential condition for handwritten text recognition.
Model training requires considerable computing power, which only computing centers like Azzurra can provide to research teams. The generic models are trained on another server, that of IN2P3, affiliated with the National Institute of Nuclear and Particle Physics, which other project members use. Personally, I utilize Azzurra's resources to train specific models for particular corpora and conduct parameter and data testing.
Azzurra provides me with a powerful and accessible working environment. Coming from the humanities and social sciences without extensive computer skills, I have benefited from the regular assistance of Maeva Antoine (MSI Engineer), responsible for HPC computing resources, making it possible for research labs without dedicated internal resources to use the computing center.
The Numerica Sinologica project receives support from France's three main research units on imperial China: CCJ (UMR 8173 China, Korea, Japan), CRCAO (Research Center on Civilizations of East Asia, UMR 8155), and IAO (Institute of East Asian Studies, UMR 5062). The models are trained by Colin Brisson (EPHE, CRCAO UMR 8155), Marc Bui (Paris 8, EPHE, AOrOc UMR 8546), and Frédéric Constant (Université Côte d’Azur, Ermes UPR 1198, and IAO UMR 5062).
To contact the Azzurra Computing Center Team
- centre-calculs.ct@univ-cotedazur.fr (technical questions)
- centre-calculs.cs@univ-cotedazur.fr (scientific questions)