Many in the IP profession remain deeply sceptical of AI. The argument goes that AI may be useful for checking for typos and simple deadline calculations, but that it cannot replace in-depth human reasoning about complex scientific and legal issues. However, the data suggests otherwise.
How do we know LLMs are any good?
Generative AI is an exceedingly competitive field, possibly one of the most competitive commercial arenas today. The foundational labs are in a race to produce the best models, which means that what counts as "the best model" matters to a great many people. The competitive nature of the field and the widespread adoption of AI mean that there is considerable interest in evaluating and comparing models on every possible metric. Consequently, there is a wealth of data at our disposal for comparing the models.
The website Artificial Analysis is a good source of independent LLM evaluation data. As a casual perusal of Artificial Analysis shows, there are many possible ways to evaluate an LLM, including hallucination rate, speed, coding ability, instruction-following, and mathematical reasoning. The MATH-500 benchmark, for example, is a collection of 500 maths problems spanning algebra, geometry, intermediate algebra, number theory, precalculus, and probability, requiring step-by-step solutions and precise mathematical reasoning (GPT-5 is the current leader, scoring 99.4%). Given the focus on coding in the LLM world, it is unsurprising that many of the benchmarks evaluate this type of ability. However, what we need the models to do in the IP profession is very different from writing code or solving maths problems.
Evaluating AI for patent work
The LLM evaluation that is most relevant to the patent profession, in this Kat's opinion, is the assessment of the models' ability to perform long-context reasoning (LCR). If an LLM is going to be of any use to the patent profession, it will need to understand and reason about long, complex documents. As luck would have it, there is an evaluation designed to test exactly this ability.
The current LCR evaluation for LLMs from Artificial Analysis consists of 100 questions relating to diverse document types, including academic papers, company financials, government consultations, legal documents, industry reports, and marketing materials. The documents average 100,000 tokens each (i.e. about 75,000 words, or 300 pages). The 23 legal documents in the test contributed the most tokens, with an average of 116,000 tokens. According to the Artificial Analysis summary, the LCR evaluation requires "genuine reasoning" rather than simple data extraction: multi-step analysis to synthesise information from dispersed sections, an understanding of complex domain-specific content, and clear, unambiguous answers free from errors and hallucinations.
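For readers who want to sanity-check those figures, here is a minimal sketch of the conversion. The ratios are assumptions: roughly 0.75 words per token and 250 words per page are common rules of thumb, and both vary with the tokeniser and the document layout.

```python
# Back-of-the-envelope conversion behind the document sizes quoted above.
# Assumes ~0.75 words per token and ~250 words per page; both ratios are
# rules of thumb and vary with the tokeniser and document layout.

TOKENS_PER_DOC = 100_000
WORDS_PER_TOKEN = 0.75   # rule of thumb, not exact
WORDS_PER_PAGE = 250     # typical for dense prose

words = TOKENS_PER_DOC * WORDS_PER_TOKEN   # 75,000 words
pages = words / WORDS_PER_PAGE             # 300 pages

print(f"{TOKENS_PER_DOC:,} tokens ≈ {words:,.0f} words ≈ {pages:,.0f} pages")
```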
According to Artificial Analysis, humans have traditionally outscored LLMs on LCR by a wide margin. Indeed, up until around 2024, even the best frontier models, such as ChatGPT, Claude and Gemini, achieved less than 50% accuracy on the LCR evaluation. Given this, it is entirely unsurprising that, if you tried to use an LLM for patent work in early 2024, it probably wasn't very good. In 2024, it was indeed a huge struggle to get LLMs to achieve good outcomes on complex tasks such as patent drafting, prosecution or prior art analysis without a highly sophisticated workflow, many separate prompting steps and complex coding loops, all of which required a great deal of underlying programming and software engineering (this Kat knows, she tried). In the world of 2024, AI-wrapper software made a lot of sense. It was also at this time that many of the AI wrapper companies for IP were founded. After all, in early 2024, we needed them.
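To illustrate the kind of scaffolding that was required, here is a minimal sketch of a 2024-style chunk-and-synthesise pipeline. It is a hypothetical illustration only: `call_llm` is a stand-in for whichever provider API a wrapper actually used, and real products layered far more engineering on top.

```python
# A minimal sketch of the 2024-era workaround for weak long-context
# reasoning: split the document, extract from each chunk separately,
# then ask the model to synthesise the partial results.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a provider's chat/completion API call."""
    raise NotImplementedError("wire up your provider's client here")

def chunk(text: str, size: int = 8_000) -> list[str]:
    """Naively split a long document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def analyse_long_document(document: str, question: str) -> str:
    # Step 1: extract anything relevant from each chunk in isolation.
    notes = [
        call_llm(f"Extract any passages relevant to: {question}\n\n{part}")
        for part in chunk(document)
    ]
    # Step 2: a second call to synthesise the per-chunk notes into one
    # answer; real pipelines often added review loops on top of this.
    return call_llm(
        f"Question: {question}\n\nNotes from document sections:\n"
        + "\n".join(notes)
        + "\n\nAnswer the question using only these notes."
    )
```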
The rate of change
According to the data, therefore, in 2024 LLMs were fairly bad at understanding and reasoning about long, complicated documents. However, in AI, things change and they change fast, and the LCR evaluation data for the latest models tells us exactly how much. According to the independent assessment of Artificial Analysis, the best models currently available (ChatGPT 5, Claude Opus 4.6, Gemini Pro 3.1) score around 75% on the LCR test, a big jump from the sub-50% scores of early 2024.
But 75% is still pretty far off 100%, I hear readers cry. However, the key piece of comparative information is how well humans perform on this test. According to Artificial Analysis, human domain experts also struggle with it. Whilst human evaluation confirmed that every question could be answered correctly, expert humans typically scored only 40-60% on the first attempt. In other words, with average scores of 75%, the best models are now better at long-context reading than the average human domain expert (and in a fraction of the time).
The current ability of the frontier models on LCR tasks means that a lot of the programming and software solutions on which the AI-wrapper companies were originally built are no longer needed. The frontier models now just do this by default. Indeed, it is important to keep up to date with the capabilities of the underlying models, so as to avoid over-engineered solutions that actually prevent the models from performing well (IPKat). The skill of prompt engineering has become as much about what you don't need to say as what you do.
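By way of contrast with the 2024-style pipeline sketched earlier, the modern equivalent is often just one plainly worded request over the whole document. A minimal sketch, with `call_llm` again a hypothetical stand-in for a provider API:

```python
# With current long-context frontier models, the chunking pipeline
# largely collapses into a single plain request over the full document.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a provider's chat/completion API call."""
    raise NotImplementedError("wire up your provider's client here")

def analyse_long_document(document: str, question: str) -> str:
    # No chunking, no synthesis loop: state the task plainly and pass
    # the whole document in one call.
    return call_llm(f"{question}\n\n---\n\n{document}")
```

The design point is the one made above: much of prompt engineering now consists of deciding which scaffolding to leave out.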
*Beyond the wrapper*
Another key message from the LLM eval data on LCR is that, for patent work, we need to be using the best models. Interestingly, on the LCR benchmark, Grok 4.20 languishes at 58%, whilst DeepSeek v3.2 scores a respectable 65%. If you are using a free version of an LLM, or a "fast" non-reasoning model, it will be far worse, and probably no better than 50%. However, the better models are also far more expensive per token. If you are speaking to an AI software wrapper company, "what model are you using?" is therefore one of the first questions to ask.
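To see why per-token pricing matters at these document sizes, here is a toy cost calculation. The prices are purely illustrative assumptions, not any provider's actual rates, which change frequently.

```python
# Toy cost comparison for reading one ~100,000-token document.
# The per-million-token prices are illustrative assumptions only;
# check your provider's current price list before relying on them.

DOC_TOKENS = 100_000

illustrative_prices_usd_per_m = {   # assumed, not real rates
    "budget/fast model": 0.50,
    "frontier reasoning model": 15.00,
}

for model, price in illustrative_prices_usd_per_m.items():
    cost = DOC_TOKENS / 1_000_000 * price
    print(f"{model}: ~${cost:.2f} per full document read")
```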
The demise of the AI wrapper?
A previous IPKat post discussed the increasing redundancy of AI wrapper software for IP (IPKat: Is AI software for IP just expensive wrapping paper?). There is not much difference these days between the output of such tools and the output of a foundational LLM such as ChatGPT, Gemini or Claude combined with prompt engineering by an experienced patent attorney. In many cases, the wrapper's output is worse, as it will lack specific technical-field and jurisdictional expertise. However, one strong argument for using a wrapper has been that they generally provide a user-friendly interface with, for example, the ability to combine prompting and tracked changes.
That all changed last weekend, with Anthropic's release of the Claude plug-in for Word. The plug-in (which, interestingly, appears to be marketed at lawyers specifically) allows Claude users to prompt within Word, incorporating tracked-changes functionality. Word is, by all accounts, a horrible piece of clunky software to deal with, and it is notable that Microsoft themselves haven't yet worked out how to combine Copilot prompting and tracked changes in a usable way. Whilst version control combined with prompting has been available for ages for code, for text editing some were even predicting a shift away from Word to markdown or LaTeX editors. Anthropic has, however, clearly recognised the importance of Word integration as a bottleneck for AI adoption, and thrown everything at providing its own plug-in. As with all AI use, the Claude plug-in uses Claude itself, and users therefore need to have the appropriate confidentiality provisions in place (IPKat).
Final thoughts
It is clear from the benchmark data that not only are LLMs now very good, but their abilities are also increasing very rapidly. In this Kat's view, the launch of the Claude Word plug-in removes one of the few remaining arguments for relying on AI wrapper software instead of upskilling ourselves as attorneys to use AI effectively. Whilst the proud dinosaurs may be content to wait out the rise of AI until they can take early retirement, those of us who are enthusiastic about the future of the profession should view LLMs as an opportunity to learn something new. To do that, we need to learn how to use AI to enhance our own expert capabilities, not to rely on someone else's.
Acknowledgements: Thanks, as always, to Mr PatKat (Laurence Aitchison, Head of Reasoning at Mistral) for his invaluable AI-industry insights.