As tech companies struggle to get their hands on large amounts of unstructured legal data, many law firms and legal services providers grapple with the concepts of how the algorithms “learn” and what happens to an algorithm once it’s trained.
A machine learning algorithm is basically a computer program that can take inputs, produce outputs, and then incorporate those outputs (in whole or in part) back into itself. You can think of spellcheck as a sort of machine learning algorithm. In many ways, it's already trained — it knows how to spell the words. It is able to look at a word that you’ve typed, see if it recognises it, and then, if it doesn't, give you some suggestions as to what it probably should be. And, in instances of an unfamiliar word, such as a surname or other proper noun, you can “label” a piece of data, such as that surname, as something it should know and tell it to “learn” that spelling.
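To make the spellcheck analogy concrete, here is a minimal Python sketch. It is an illustration only, not how any real spellchecker is built: the `SpellChecker` class, its `check` and `learn` methods, and the tiny word list are all invented for this example. The suggestions come from the standard library's `difflib.get_close_matches`.

```python
import difflib


class SpellChecker:
    """Toy spellchecker: a set of known words that a user can add to."""

    def __init__(self, known_words):
        # The "already trained" part: the words it knows from the start.
        self.known = {w.lower() for w in known_words}

    def check(self, word):
        """Return (is_known, suggestions) for a single word."""
        w = word.lower()
        if w in self.known:
            return True, []
        # Unrecognised word: offer up to three similar known words.
        suggestions = difflib.get_close_matches(w, sorted(self.known), n=3)
        return False, suggestions

    def learn(self, word):
        # The user "labels" this word as correct, so it is known from now on.
        self.known.add(word.lower())


checker = SpellChecker(["contract", "clause", "governing", "law"])
print(checker.check("claus"))    # not known; "clause" is offered as a suggestion
print(checker.check("Schultz"))  # a surname it has never seen
checker.learn("Schultz")         # tell it to "learn" that spelling
print(checker.check("Schultz"))  # now recognised
```

The point of the sketch is the last three lines: once a person labels the surname as correct, the program's behaviour changes for every later input.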
Advances in machine learning algorithms and computer science have allowed us to create these systems that can perform certain types of tasks or analysis with minimal human intervention.
While several different approaches to machine learning exist, we'll focus on the “supervised” and “semi-supervised” machine learning models where people must intervene to “train” the algorithm by “labelling” data. For instance, I could take a handful of contracts and pull out the clauses that discuss which state or nation’s laws govern the contract. I’ll call this a “Governing Law” clause. I could call it a blueberry pancakes clause — the machine doesn’t care. So, we feed the algorithm 100 “Governing Law” clauses so that it can analyse them to find the common elements or whatever the algorithm does to “learn” what the clause is. This is part of training the algorithm.
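As a rough sketch of what "feeding" labelled clauses to an algorithm can look like, the Python below trains a deliberately crude word-counting classifier. Everything in it is an assumption made for illustration: the function names, the four sample clauses, and the scoring rule are invented here, and real contract-analysis tools use far more sophisticated models and far more data.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z]+", text.lower())


def train(labelled_examples):
    """Count how often each word appears under each human-supplied label."""
    model = {}
    for text, label in labelled_examples:
        model.setdefault(label, Counter()).update(tokenize(text))
    return model


def predict(model, text):
    """Pick the label whose training vocabulary overlaps most with the text."""
    words = tokenize(text)
    scores = {}
    for label, counts in model.items():
        total = sum(counts.values())
        # Normalise by the label's total word count so that a label with
        # more training text does not win automatically.
        scores[label] = sum(counts[w] for w in words) / total
    return max(scores, key=scores.get)


# The label is just a string; "Blueberry Pancakes" would work equally well.
LABEL = "Governing Law"

# Invented sample clauses, labelled by a person.
examples = [
    ("This Agreement shall be governed by the laws of the State of New York.", LABEL),
    ("This contract is governed by and construed in accordance with the laws of England and Wales.", LABEL),
    ("The Supplier shall deliver the goods within thirty days of the order.", "Other"),
    ("Either party may terminate this Agreement on sixty days written notice.", "Other"),
]

model = train(examples)
print(predict(model, "This Agreement is governed by the laws of Ontario."))
```

Note that `model` is nothing more than word counts derived from the labelled documents. That detail matters later, when we ask who owns what the algorithm has learned.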
The next step would be to show the algorithm a bunch of contracts it has never seen so it can try to identify the “governing law” clauses. It’s important that we know how many governing law clauses are in this body, this corpus of documents, so we can see how good the algorithm is at finding them. At this stage, we’re measuring a couple of things, like how often it misses a governing law clause or how often it identifies something as a governing law clause that isn’t in fact one. We’ll then train it more, tweak it, massage it, and finally get it to a point where we’re fairly confident that it works to a certain, articulable standard. If you read into this further, you’ll see these measurements described as accuracy, precision, recall, and F1 score.
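For readers who want to see how those measurements are calculated, the sketch below uses the standard definitions of accuracy, precision, recall, and F1 score. The `evaluate` function and the ten predicted and actual labels are made up for illustration; they are not results from any real system.

```python
def evaluate(predicted, actual, positive="Governing Law"):
    """Compare the algorithm's labels with the known correct labels."""
    pairs = list(zip(predicted, actual))
    # Correctly found governing law clauses.
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    # Flagged as governing law, but actually something else.
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    # Governing law clauses the algorithm missed.
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    correct = sum(1 for p, a in pairs if p == a)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical test set: 4 real governing law clauses among 10 passages.
actual = ["Governing Law"] * 4 + ["Other"] * 6
# Hypothetical output: one clause missed, one false alarm.
predicted = ["Governing Law"] * 3 + ["Other"] + ["Governing Law"] + ["Other"] * 5

print(evaluate(predicted, actual))
```

Recall answers "how often does it miss a governing law clause?", precision answers "how often is something it flags really a governing law clause?", and F1 combines the two into a single number.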
Instead of doing this for just one type of clause, we could do this for dozens or hundreds of types of clauses. However, in order to train an algorithm to recognise hundreds of types of clauses, we would need to get our hands on hundreds of examples of hundreds of different types of clauses! Access to large datasets, large corpora of legal documents, presents one of the most significant hurdles to training an algorithm for application in law and legal practice.
So, if a tech company trains up its algorithm on a law firm’s library of legal documents, this newer, smarter version of the algorithm is different than it was before. It contains within it more information than it did before. It, in a sense, knows more than it did before. So, who owns the intellectual property (IP) associated with the algorithm? Is it the company that built the algorithm, or is it the firm whose data trained it? Or is it a combination of the two?
From a legal perspective, and this is not legal advice, this would (or should) typically be governed by or covered in whatever contract or license agreement was hashed out between the parties.
Greater questions emerge when algorithms are trained on data in ways not permitted by the holder of the data. For instance, if a firm fed all of Ernest Hemingway’s writings into an algorithm so that the system would write like Ernest Hemingway, should the holders of Hemingway’s intellectual property rights have some sort of rights or ownership over that algorithm or, at the very least, have some say in how it's used?
NBC’s Olivia Solon broke a story in March 2019 with the gripping headline “Facial recognition’s ‘dirty little secret’: Millions of online photos scraped without consent.” The story centred on IBM’s use of photos taken from Flickr in a facial recognition dataset. To Solon, it appeared scandalous that “[r]esearchers often just grab whatever images are available in the wild,” as Prof. Jason Schultz of NYU School of Law put it. “That’s the dirty little secret of AI training sets[,]” he explained to Solon. The journalist seemed so taken with this notion that she borrowed the term for the title of the piece.
However, in a piece published a few days later, MIT Technology Review’s Karen Hao explained that “[r]eally, for industry insiders, IBM did nothing out of the ordinary.” In fact, Hao had mentioned IBM’s research in a newsletter at the end of January 2019, but focused on concerns about mass surveillance rather than the consent of IP rights holders.
What I haven't really seen (and, admittedly, haven't really been looking for) is a robust discussion of who owns the IP created in the process. Does IBM rightly hold all the IP in its Watson systems, given that it has clearly trained and improved its algorithms on the backs of somebody else's IP? I think an argument can be made either way. I also think that more computer scientists and software engineers should talk to the lawyers before doing the sorts of things that could expose their companies to civil liability down the road.