Legal work often involves searching through large quantities of text for specific information, and many other disciplines in the humanities and social sciences also use texts as their primary data source. Artificial intelligence is a promising tool for conducting such large-scale searches, but many existing AI-based tools are highly specialised and are intended for use only with particular types of content, such as contracts. Moreover, as documents change over time, a tool adapted for use with data from one point in time may be less useful for working with older or newer material. This project aims to find avenues for improvement by building on insights from the discipline of theoretical linguistics, which not only makes frequent use of large textual data sources but also uses the data obtained to understand the mapping between form and meaning, and how such relationships change over time.

Building on previous research both in legal tech and in linguistics, we are developing a system for Retrieval-Augmented Generation (RAG), a technique that allows AI models to extend beyond their initial training in order to search and interpret new data. The aim is to develop a system that will be flexible enough to handle data from a variety of time periods and allow multiple aspects of the data to be queried using natural language. Such a system will build in new ways upon existing specialised tools, and integrate them so that users can make use of their multiple specialisms through a single, unified interface.
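
To make the idea of RAG concrete, the following is a minimal sketch in Python and not the project's actual implementation. It assumes a toy word-overlap score in place of a real retrieval model, and it stops at building the prompt instead of calling a language model. All names (Passage, retrieve, build_prompt) are hypothetical, and the sample passages are invented for illustration, not taken from any real corpus.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Passage:
    """Hypothetical record: one excerpt of text and the year it dates from."""
    text: str
    year: int


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, k=2, year_range=None):
    """Return the k passages most similar to the query.

    year_range is an optional (start, end) pair, inclusive, that restricts
    the search to one period before ranking.
    """
    q = Counter(tokenize(query))
    candidates = [
        p for p in passages
        if year_range is None or year_range[0] <= p.year <= year_range[1]
    ]
    ranked = sorted(
        candidates,
        key=lambda p: cosine(q, Counter(tokenize(p.text))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, retrieved):
    """Combine retrieved passages and the question into one prompt string."""
    context = "\n".join(f"[{p.year}] {p.text}" for p in retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Invented sample passages, for illustration only.
PASSAGES = [
    Passage("The prisoner was indicted for stealing a silver watch.", 1720),
    Passage("The witness said the horse was sold at the market.", 1805),
    Passage("The defendant was charged with the theft of a gold watch.", 1890),
]

QUERY = "Who was accused of stealing a watch?"

if __name__ == "__main__":
    # Search the whole collection, then only the nineteenth-century material.
    print(build_prompt(QUERY, retrieve(QUERY, PASSAGES)))
    print(build_prompt(QUERY, retrieve(QUERY, PASSAGES, year_range=(1800, 1913))))
```

In a full system the prompt would be passed to a language model, and the word-overlap score would be replaced by a retriever that copes with spelling and vocabulary that change over time, which is exactly where exact word matching is weakest.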

The projected system will be able to query data from multiple sources and in multiple formats. In this first phase of the project, our dataset is based on Old Bailey Online, which contains proceedings of trials from 1674 to 1913. This corpus spans more than two centuries, a period during which there were numerous changes both in the nature of legal proceedings and in the language used to represent them, and has previously been used to study topics ranging from grammatical change to the evolution of concepts and ideologies. In subsequent phases we plan to add further datasets, potentially including material such as law reports and contractual data, so that users will be able to synthesise and compare data from multiple sources.

The ultimate aim of this project is a tool that will be of use both to practising lawyers, for research in areas such as case law, and to academic researchers in fields ranging from legal history to cultural history and linguistics. To this end, the development process will incorporate feedback from users in different disciplines and ensure that relevance and ease of use are prioritised throughout. Our goal is for the tool to evolve continually through a cyclical feedback process, so that it will be of greater benefit to an increasing number of users.

Project research team