This year, CIPA Congress is tackling all things AI. Together with Ben Hoyle (Hoyle IP Services Ltd), Coreena Brinck (Two IP) and Julio Fonseca (ASML), this Kat has the pleasure of speaking on a panel at Congress focused on the intersection between data, IP and AI, "Data is the new oil", chaired by Greg Corcoran (Greg Corcoran IP). For the avoidance of doubt, the following are this Kat's own views and do not represent the views of the rest of the panel.
The concept of data as oil has been around for a number of years, but does the analogy still hold? Coming from the perspective of the pharma and biotech industry, this Kat sees that there is now a shift away from thinking of data as a bulk commodity of raw material, towards the pursuit of high-quality data that can improve the performance of AI models. This shift has important implications for IP strategy and licensing provisions relating to data.
Data was the new oil
Data is a very broad term. At a very basic level, data can be defined as a collection of facts, figures, and statistics. As the Google Books Ngram Viewer reveals, the term "data" only really came to prominence in the 1990s, followed by a steep rise in usage up to the early years of the millennium. The increased use of the term "data" correlated with the vast increase in information that became available to society from the 90s onwards, brought about partly by the internet and partly by technological advances in computing and scientific discovery (such as genetic sequencing technologies).
The concept of data as "the new oil" was brought to popular attention by a headline in The Economist back in 2017 that proclaimed that The world’s most valuable resource is no longer oil, but data. Interestingly, since then, much of the discussion about data and its use in the economy has often focused on the negative connotations of that phrase, particularly the exploitation of personal information by big tech. However, for the science and technology sector that the IP industry serves, the analogy of data as oil holds a different meaning. The explosion in the scale and complexity of information that arose at the start gave rise to its own field of big data (which became a buzzword around 2007) and associated specialisms devoted to analysing and processing this data (anyone remember systems biology?).
Importantly, this big data provided an essential bedrock for the AI systems which were to follow. The modern field of machine learning is fundamentally dependent on vast amounts of data to learn, improve, and make accurate predictions. There would be no Nobel Prize-winning AlphaFold without publicly available databases of protein sequences and structures (IPKat), and there would be no Large Language Models (LLMs) without the vast quantity of language data available on the internet. These huge repositories of information were essential to train the models that are the foundation of modern AI.
*From oil to gemstones*
The different types of data
The training data that has fuelled the field of AI spans numerous formats and scientific domains, depending on the type of AI. For early visual AI models like Convolutional Neural Networks (CNNs), the primary data consists of images. This could include, for example, medical imagery such as breast scans and MRI scans used for detecting cancer, as well as visual, infrared, and radar images collected from aerial surveillance for applications like disaster management. Importantly, the development of AI models used in scientific discovery relies on vast datasets such as genomic and proteomic information from gene sequencing, protein structure data from the Protein Data Bank, and simulation results from semiconductor chip design. For generative AI like LLMs, the training data is predominantly text and code, often scraped from the public internet.
Beyond the initial training data, another crucial category of data in machine learning revolves around the performance and refinement of the AI models themselves. This includes model optimization data, such as the human feedback gathered during techniques like Reinforcement Learning from Human Feedback (RLHF). In this process, human reviewers rank AI-generated responses or identify errors, and this feedback is then used as data to update and fine-tune the model. The output of the models also constitutes a key data type, which can range from a ranked list of documents generated by a prior art search tool to the predicted 3D structures of proteins produced by AlphaFold. Finally, statistics about a model's effectiveness are a critical form of data used to evaluate performance. For instance, research comparing the accuracy of machine learning systems against human doctors provides quantitative data on the model's utility.
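To make the RLHF step a little more concrete for the technically curious, the minimal sketch below (written in Python, with purely hypothetical data structures and field names, and not representing any particular provider's pipeline) shows how a single human ranking of model responses might be expanded into the pairwise "preferred versus rejected" records that are then used as data to fine-tune a model.

```python
# Minimal, illustrative sketch of turning RLHF-style human rankings into
# training data. All class and field names are hypothetical.
from dataclasses import dataclass
from itertools import combinations


@dataclass
class RankedResponses:
    prompt: str
    responses: list[str]  # model outputs for the prompt
    ranking: list[int]    # indices into `responses`, best first, per the human reviewer


@dataclass
class PreferencePair:
    prompt: str
    preferred: str
    rejected: str


def to_preference_pairs(item: RankedResponses) -> list[PreferencePair]:
    """Expand one human ranking into pairwise 'preferred vs. rejected' records."""
    pairs = []
    # Every earlier position in the ranking beats every later position.
    for better, worse in combinations(item.ranking, 2):
        pairs.append(PreferencePair(
            prompt=item.prompt,
            preferred=item.responses[better],
            rejected=item.responses[worse],
        ))
    return pairs


# Example: a reviewer ranks three draft answers; the resulting pairs are the
# "performance data" that feeds the next round of model optimization.
example = RankedResponses(
    prompt="Summarise claim 1 in plain English.",
    responses=["Accurate summary...", "Vague summary...", "Hallucinated summary..."],
    ranking=[0, 1, 2],
)
print(to_preference_pairs(example))
```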
Searching for the data gemstones
We can all accept that big data has been the oil that has fuelled the engine of AI. In the same way an engine cannot run without fuel, AI models are powerless without vast quantities of information to train on. However, the analogy begins to break down when we consider quality. Unlike crude oil, which is a relatively uniform commodity, data varies enormously, and in often unappreciated ways, in both type and quality.
However, the enduring popularity of the "data as oil" analogy feeds a persistent misconception that all data is valuable. It is true that, in the early days of big data, the focus was predominantly on volume. However, we now have so much data that a lot of it is not only worthless but can actively harm an AI model by introducing noise and exacerbating biases. Just one example is the patent data used to train LLMs (IPKat). Google Patents contains a huge number of badly written patent applications (and granted patents). As a result, training an LLM on more of this data is unlikely to improve its performance for drafting patent applications.
Consequently, AI software developers are refocusing their efforts on mining high-quality data. Cleaner and more relevant datasets are becoming increasingly valuable, and attention is turning to how they can be obtained. In other words, we are now looking for the data gemstones. A gemstone is rare, precisely formed, and valuable. For an AI model for predicting tumours, a data gemstone might take the form of a curated medical dataset in which thousands of MRI scans have been meticulously annotated by multiple expert radiologists to identify tumours. For companies developing self-driving cars, valuable data is no longer another million miles of uneventful motorway driving. Instead, it is the rare video footage of a near-miss accident, a complex intersection in heavy rain, or a child running into the road. These high-quality, often rare, data points are the key to improving the performance of AI models so that they can perform more expert tasks with greater accuracy.
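As a purely illustrative example of what mining for data gemstones can involve in practice, the short Python sketch below (hypothetical data and field names, with an assumed consensus-based quality bar) keeps only those scans that have been labelled by a minimum number of independent expert annotators who all agree, discarding the rest as not gemstone-grade.

```python
# Minimal sketch of one data-curation step: keep only scans where enough
# independent expert annotators agree on the label. Names are hypothetical.
from collections import Counter
from dataclasses import dataclass


@dataclass
class Annotation:
    scan_id: str
    annotator: str
    label: str  # e.g. "tumour" or "no_tumour"


def curate(annotations: list[Annotation],
           min_annotators: int = 3,
           min_agreement: float = 1.0) -> dict[str, str]:
    """Return {scan_id: consensus_label} for scans that meet the quality bar."""
    by_scan: dict[str, list[str]] = {}
    for a in annotations:
        by_scan.setdefault(a.scan_id, []).append(a.label)

    curated = {}
    for scan_id, labels in by_scan.items():
        if len(labels) < min_annotators:
            continue  # too few expert opinions to trust the label
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            curated[scan_id] = label
    return curated


# Example: only scan_001 has three annotators in full agreement.
scans = [
    Annotation("scan_001", "radiologist_A", "tumour"),
    Annotation("scan_001", "radiologist_B", "tumour"),
    Annotation("scan_001", "radiologist_C", "tumour"),
    Annotation("scan_002", "radiologist_A", "tumour"),
    Annotation("scan_002", "radiologist_B", "no_tumour"),
]
print(curate(scans))  # {'scan_001': 'tumour'}
```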
The shift from data as oil to searching for the data gemstones is just as applicable to data relating to model optimization and performance, where the focus is likewise moving towards identifying the rare, high-value pieces of performance data that can improve the quality of a model's output. In the context of LLMs, for example, hallucinations in highly technical areas are unlikely to be identified and ironed out with a sledgehammer approach of simply feeding the model more of the same generalist data it was trained on. Instead, the focus is now on expert annotation and optimization using RLHF (IPKat), and on developing methods by which the errors and hallucinations generated by the models can be identified.
Implications for IP strategy
The shift from data as a bulk commodity to curated gemstones of data has important implications for IP strategy. The value of data now lies not in the raw information itself, but in the intellectual effort, investment, and expertise applied to curating, annotating, and structuring it. This is also where protectable IP can and does reside.
The most relevant IP rights for data include database rights, copyright, and trade secrets. In the UK and EU, sui generis database rights protect the substantial investment made in obtaining, verifying, or presenting the contents of a database (IPKat). Database rights can protect the investment put into creating valuable data, for example, by funding an extensive clinical trial and analysing the results with bioinformatic techniques, or by employing experts to annotate thousands of images.
Additionally, in many cases, the most powerful protection lies in treating these high-value datasets as trade secrets. A curated dataset derives its economic value from not being generally known and can be subject to reasonable steps to keep it secret. This is not an entirely new concept. The pharmaceutical industry, for example, has long treated its high-quality clinical trial data, compound libraries, and proprietary assay results as fiercely protected crown jewels.
However, a critical point for businesses to grasp is that the act of identifying, cleaning, annotating, and structuring data for AI model training is a significant value-add activity in and of itself. Many non-tech companies, including pharma companies, may be sitting on vast repositories of raw data but lack the internal expertise to refine the data so as to make it useful for training an AI model. This disconnect creates a significant risk that a company may undervalue its assets in collaborations with tech partners. On the flip side, it may also be the reason why an early partnership with an AI company is necessary to extract the true value from the data.
Implications for licensing
From an IP licensing perspective, it is important to capture the distinction between raw, low-quality data and processed, annotated data in licence and collaboration agreements. A standard data licence that grants broad rights to a dataset for a nominal fee is no longer fit for purpose. Agreements must now be far more sophisticated and use precise definitions to clearly delineate what constitutes the licensed curated data as distinct from any raw inputs. Furthermore, contractual terms must address the outputs and applications of the data. This includes asserting ownership of, or at least rights of access to, any improvements, models, or insights derived from the licensed data. Concurrently, strict field-of-use restrictions are essential to limit the licensee's application of the data to a specific purpose, preventing them from using the asset to train other models that could compete with the licensor's core business. Finally, the valuation of the licence must be recalibrated. The licence fees, royalties, or equity stake should reflect the true value of the curated data as a critical enabling asset, not merely as a cost-of-goods-sold commodity.
Final thoughts
As we will discuss on the Congress panel, the phrase "data is the new oil" remains relevant to our understanding of how today's AI models were developed. However, in a supply-led economy, the sheer amount of data that we now have, together with the fact that most of it is of low quality, has decreased the value of generic new data as a bulk commodity that can be mined from the masses. IP strategy needs to recognise this shift and move the focus to protecting the data that possesses the true value: the rubies, emeralds and sapphires that are the new data gemstones.
Further reading