Translation with Large Language Models

TraLaLaM Project

Within the span of six short years (2017-2023), the field of Natural Language Processing (NLP) has been deeply transformed by advances in general-purpose neural architectures, which are used both to learn deep representations of linguistic units and to generate high-quality textual content. These architectures are now ubiquitous in NLP applications; trained at scale, these “large language models” (LLMs) offer multiple services (summarisation, writing aids, translation) in a single model through human-like conversations and prompting techniques.

In this project, our aim is to analyse the new state of play from the perspective of machine translation (MT) and ask two main questions:

Prompting techniques make it straightforward to inject various types of contextual information that could help an MT system take context into account, for instance to adapt to a domain, a genre, a style, a client’s translation memory, or the readers’ language proficiency. Is prompting equally effective in all these situations, assuming good prompts can be generated, or is it hopeless to expect improvements without (instruction) fine-tuning?
As LLMs can be trained without any parallel data, they open up the prospect of improved MT for multiple language pairs, domains and styles for which such resources are scarce, if they exist at all. Can this promise be kept, especially for low-resource dialects and regional languages?
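
As an illustration of the first question, the sketch below shows one way contextual information could be injected into a translation prompt. It is a minimal example under stated assumptions, not the project's method: the function name build_translation_prompt, its parameters and the prompt wording are all hypothetical, and no particular LLM or API is assumed. The function only assembles the prompt text.

```python
from typing import Optional, Sequence, Tuple


def build_translation_prompt(
    source_text: str,
    src_lang: str,
    tgt_lang: str,
    domain: Optional[str] = None,
    style: Optional[str] = None,
    reader_level: Optional[str] = None,
    translation_memory: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a translation prompt with optional contextual hints.

    Hypothetical helper: the prompt wording is an assumption made for
    illustration, not a template taken from the project.
    """
    lines = [f"Translate the following text from {src_lang} to {tgt_lang}."]
    # Each piece of context is added only when the caller provides it.
    if domain:
        lines.append(f"Domain: {domain}.")
    if style:
        lines.append(f"Style: {style}.")
    if reader_level:
        lines.append(f"Target readers' language proficiency: {reader_level}.")
    if translation_memory:
        # Translation-memory entries are given as (source, target) pairs.
        lines.append("Reuse the terminology of these earlier translations:")
        for src, tgt in translation_memory:
            lines.append(f"- {src} => {tgt}")
    lines.append("")
    lines.append(f"Text: {source_text}")
    lines.append("Translation:")
    return "\n".join(lines)


# Example call: a legal sentence with one translation-memory entry.
example_prompt = build_translation_prompt(
    "Le contrat est résilié.",
    "French",
    "English",
    domain="legal",
    translation_memory=[("résilier", "terminate")],
)
print(example_prompt)
```

The resulting string would then be sent to whichever LLM is under study; whether such added context actually improves the translation is precisely what the first question asks.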

To address these two questions, project TraLaLaM will also:
Collect data for low-resource languages and use them to extend existing LLMs.
Develop new testing corpora and associated evaluation strategies.

TraLaLaM Partners

The consortium is composed of two academic research teams, ISIR/MLIA (Sorbonne-Université and CNRS) and ALMAnaCH (Inria, Paris), and one SME (SYSTRAN).

ISIR is a joint laboratory of Sorbonne-Université and CNRS; within ISIR, the MLIA team conducts research in the field of Statistical Machine Learning (ML), with an emphasis on algorithmic aspects and on applications involving semantic data analysis and the modelling of complex physical systems.

Inria is the French National Institute for Research in Digital Science and Technology. ALMAnaCH is the NLP research team of Inria Paris, working on Natural Language Processing and Computational Humanities.

Since its creation in 1968, SYSTRAN has been a pioneer in MT technology. The company is an industry leader and has, over the years, developed many innovative solutions commonly used by businesses and the general public. Strongly focused on research and development, SYSTRAN has approximately 100 employees and a turnover of €20M.