35 Comments

This is a fine article, and almost convincing. But before delegating new decisions to LLMs, one should see whether one can write briefs that force them to a desired conclusion. That is, the article is convincing that LLMs do well on historical data, but there is plenty of evidence that LLMs are somewhat fragile against hostile input. Litigants would be negligent (to be provocative) if they did not prepare a brief that would be 'convincing' to the LLM.


I am wondering how an AI would have decided Plessy v. Ferguson, Dred Scott, or Brown v. Board of Education based on the legal work available at the time.


In this sense, AI may be more relevant as a tool for asking if we have written the law to say what we want it to say, as opposed to stopping at the question of what thing it says.


After seeing these impressive results, I experimented with Claude before my most recent oral argument. Unfortunately, it (1) gave different results in different conversations and (2) made basic legal errors (e.g. inexplicably applying the wrong standard of review, when that question was extensively briefed). And that was with pretty good briefs by specialist appellate lawyers. To the best I could tell, it was just summarizing one party's or the other's brief, claiming it's correct, and assigning random probabilities of reversal. The arguments it thought were good and the odds it thought each side had of winning diverged hugely across conversations.


As noted in my other comment, that’s not even close to the results I’ve been getting when using Opus. The level of accuracy in deciding Supreme Court cases has been remarkable. I’m curious which briefs / cases have yielded these bad outcomes and whether hand-holding helped.


The case I was playing with was an appeal from a directed verdict in a breach of contract trial. It was a pretty straightforward case, albeit a record-heavy one given the procedural posture. It’s possible that the program would be better at more purely-legal cases like those SCOTUS takes. I’ll have to do some more experiments after your latest post.

Opus did seem to get better with handholding, though I’m no prompt engineer. For example, the specific bizarre procedural posture error it made was stating that you construe all factual disputes in the light most favorable to the directed verdict. When I asked “you sure about that one?,” it admitted the error and revised its probability estimate by 30% or so. But again, I was getting really different opinions in different conversations.


A few things to consider:

1) LLM output is based on a random number generator. A random source may be unbiased in a statistical sense, but any single result can be an outlier. Single examples are interesting to show what an LLM *could* do, but useless to determine what it does *on average*. You need many samples to take an average. You need more to understand how much the results can vary. Don't base a study (or a judgement) on a single random draw!

Are judges prepared to do statistical studies to determine answers to questions? This might not save as much time as you thought.

2) When asking an LLM why it came to a previously written conclusion, it's always making up the answer. LLMs have no access to a history of their previous inner workings and don't remember anything about what they were "thinking" when outputting previous words (which may have happened on a different machine altogether). Their short-term memory is using the text alone. Their justifications are imitations of human justifications, role-playing how a human would justify themselves in writing based on the millions of examples they've seen. But they're not actually human, so these can't be the real reasons.

If you want to have a chance at making an LLM come to a conclusion based on reasoning, you need to make it write the reasons down first (called "thinking step by step"). If you do it the other way, it will pick an answer (often randomly) and reason backwards to invent justifications for it. They won't be very good justifications if the answer is wrong, but it won't point out a mistake unless you ask. Again, it's playing a role. As an experiment, programmers who like to tinker can force an LLM to output a certain answer and then ask it to invent reasons for it, and it will happily do so.

Consider how well it role-plays when we *know* it's making it up.

I wrote a blog post about this last year:

https://skybrian.substack.com/p/ai-chatbots-dont-know-why-they-did

What's a good use of AI chat? It's a hint generator. Ask it for ideas and then verify them yourself, through other means. If the hints don't work out, it's wasted a bit of time, but no harm done. (It's particularly useful for writing computer programs because we have relatively easy ways to test whatever we do.)

When using AI chat, verification is your job. "Distrust and verify" should be your motto. Think about how you would do it. If you have no way of verifying the results, you're in trouble.
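The first point above, that a single draw says little about average behavior, can be sketched in a few lines of Python. This is a minimal simulation, not a real LLM call: `simulated_llm_estimate`, its `true_mean` and `spread` parameters, and the 200-sample count are all hypothetical stand-ins chosen for illustration.

```python
import random
import statistics

def simulated_llm_estimate(rng, true_mean=0.6, spread=0.2):
    """Hypothetical stand-in for one LLM conversation: returns a noisy
    probability-of-reversal estimate, clamped to the range [0, 1]."""
    return min(1.0, max(0.0, rng.gauss(true_mean, spread)))

def summarize(samples):
    """Return (mean, sample standard deviation) of a list of estimates."""
    return statistics.mean(samples), statistics.stdev(samples)

rng = random.Random(0)  # fixed seed so the run is repeatable

# One conversation: a single draw that may land far from the average.
single = simulated_llm_estimate(rng)

# Many conversations: enough draws to estimate the average and the spread.
many = [simulated_llm_estimate(rng) for _ in range(200)]
mean, stdev = summarize(many)

print(f"single draw: {single:.2f}")
print(f"mean of {len(many)} draws: {mean:.2f}, std dev: {stdev:.2f}")
```

With a real model, each call to the stand-in function would be a fresh conversation with the same briefs, and the spread across conversations is the number to look at before trusting any one answer.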


There is certainly a random factor to these outputs, but there is a random factor to the outputs of human judging, both in terms of judicial assignment and in terms of the outputs of a particular judge. To me there is an empirical question (in a double blind randomized study, would an expert human lawyer think Claude was as good as a human judge?) and a philosophical question (if so, should we speed things up with AI judges?). I think empirical testing would produce a “yes” answer to the first question, and while the philosophical issue is debatable, it’s not obvious that it’s inherently better to have humans do it.


Sure, that's fair. I think I will wait for the actual study, though.

There are certainly economic reasons to speed things up, and that shades over into better justice from speedier trials.


Thank you, that was quite impressive. You mention AI replacing lawyering - maybe I need to try with Claude Opus, or maybe I need to try with US law, but I have tried with a (frontended) version of GPT-4 Turbo on Swiss law and got nothing very promising.


To me, an advocate who tries to advance the law in a positive direction, this is a dystopian and frightening post. I’m imagining a world where the law is forever frozen, and where deep training data biases are forever replicated without even the ability to be examined or questioned. Control of these biases would be everything. Woe to criminal defendants, civil rights plaintiffs, and other low status litigants.

Furthermore, the fundamental premise, stated at the top, that a creative judge is necessarily a bad judge, is a false one. I think you underestimate the extent to which reasoning by analogy is a creative exercise. And most great judges, widely respected, were clearly creative minds (see, eg, Posner, Friendly, Hand).

These criticisms are not meant to undermine a very good post. You’re right that there is a lot of legal work that LLMs can and will pick up. You’re also probably right that LLMs will be doing the work of magistrate judges in mine-run civil litigation pretty soon. And I’m reasonably certain that the empirical work you suggest, in cases with good briefing, will show pretty good performance by the LLMs.

But in the law, that unusual case where the outcome can be surprising is very important. Miranda, the “reasonable expectation of privacy” test, the shift to strict product liability, a million other examples come to mind. These changes often reflect a changing world, changes that might not be evident or adequately incorporated in training data.

I appreciated the post. I’m far from confident that I’m right. Just wanted to share some thoughts that came to mind reading it.


You've convinced me, this sounds awesome.

We would only be able to change the law by changing the law. Or, we keep only the SC and be explicit that we value its political function.


Brilliant and provocative. I wonder how AI would have decided Bruen, Dobbs, or the Trump disqualification case using standard originalism. Critics say all would have come out the opposite way, arguing this shows that originalism is camouflage for result-oriented hypocrisy. An AI opinion could test these hypotheses. Maybe Adam will be tempted to do this. I would not know how.


An LLM can only process text to the extent it's been programmed to by a human. This includes some self learning ("machine learning") but the depth of this self learning is still bounded by human understanding of applied math. For any given LLM, we don't know how well, or how deeply (or shallowly) the LLM can accurately and logically process inputs. I get that you want the Academy to do the work. But you're premising it on an assumption that the logic applied is the correct logic. I do not. I trust a judge and her clerks' logical decision making over an LLM.

Secondly, in the law, we have to cite our sources. All assertions must be backed up with citations. An AI does not cite its sources (or not all of them). We don't know what text inputs generated the reasoning leading to the output. For this reason alone, I look skeptically at LLM output.

Third, an LLM is biased. It's biased by the quality and genesis of its inputs. Train an LLM off Reddit or 4Chan or random_user_generated_content and who knows how good or bad the inputs are.

More broadly, maybe we should ask: do we want this? I admit I'm an idealist; I believe in judges, I believe in clerks, and I believe in the overall system of jurisprudence we've built. It's not perfect. But it's predictable, which is super important in things like business.


Human brains are also neural networks that hear lots and lots of human speech and then magically can learn to reason logically. Humans are also biased by the quality and genesis of our inputs. In my view this is a purely empirical issue. If the AI does as well as human judges, then we should treat it as magic just as we treat human brains as magic.


There are definitely some interesting mysteries for future research (the field is called "mechanistic interpretability"), but we do know some things about how AI chatbots are built. The systems that surround them, which are used to make a chat interface out of an API, are built using ordinary programming. So there are some things we can usefully say about how they differ from people, and that can dispel the magic trick somewhat. See my other responses for more.

It's a bit off-topic, but if you're interested, Anthropic's recent interpretability research is wild:

https://www.anthropic.com/news/mapping-mind-language-model


Am an incoming law student and was wondering if you'd had anyone take you up on your law review article challenge? Did some brief searching and social media scans but was unable to find any. Seems like a relatively simple (and fairly huge ROI) task. If you have, would love to know who it is so I can connect. If not, may have to buttonhole one of my professors and see if I can't make it happen. Love the article.

Cheers,

Sam


Send me an email!


Sent!


Great article and I agree with everything you say except on one point.

AI is not necessarily unbiased. It would be more accurate to say that AI has a standardized perspective, one that proactively conforms well to the expectations of a judicial opinion, which in theory is supposed to be unbiased. But it will search for "cues" regarding "who usually loses cases" just as readily as it searches for legal precedents on how to justify an opinion.

Still, I share your perspective that in practice it can likely be less biased than judges, and has the advantage that we can supplement its opinion with rapidly generated introspection.


To emphasize something said in a previous post: LLMs are next-word predictors with a little bit of randomness sprinkled atop. The amount of randomness depends on a parameter usually called "temperature", but that's a different story. Ignoring that little randomness, an LLM "just" predicts, based on all previous input and output in that session, what the next word would be. And then the next word, based on all previous input and output including the last word predicted. And so on and on and on.

The "prediction" tries to mimic the data used during the training, i.e. the LLM tries to predict a next word that most closely resembles a sort-of "average" next word in the training data.
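A toy sketch of that prediction step, including the "temperature" knob, might look like the following. It is a simplification under stated assumptions: real models score tokens rather than whole words, the three-word `vocab` and its `logits` are made up for illustration, and the function names are hypothetical rather than any library's API.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_word(vocab, logits, temperature, rng):
    """Pick the next word: greedy at temperature 0, weighted random otherwise."""
    if temperature == 0:
        return vocab[logits.index(max(logits))]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Made-up scores for the word following "The judgment is ..."
vocab = ["affirmed", "reversed", "vacated"]
logits = [2.0, 1.5, 0.5]
rng = random.Random(0)  # fixed seed so the run is repeatable

# Temperature 0 always returns the top-scoring word.
greedy = [sample_next_word(vocab, logits, 0, rng) for _ in range(5)]
# Temperature 1.0 lets lower-scoring words through some of the time.
sampled = [sample_next_word(vocab, logits, 1.0, rng) for _ in range(5)]

print(greedy)
print(sampled)
```

Generating a full reply would repeat this step in a loop, appending each chosen word to the text before scoring the next one.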

As you observed, the training data is the internet, so it tries to predict what the internet would have answered. Of course, by giving the right prompt, you can, well, prompt it to focus on what a specific part of the internet would have answered. For example, the part of the internet containing all supreme court cases.

If you ask an LLM to explain something after the fact, it does not actually revisit its own inner state at that time and reason based on that. Instead, it tries to predict how "the internet" would answer a request for an after-the-fact explanation of reasoning, without actually understanding why the internet answered the way it did. So it basically generates a text that resembles an after-the-fact explanation of reasoning, but no actual after-the-fact reasoning took place. You can get basically the same answer by prompting the LLM with the previous text in a new session and then asking it to do an after-the-fact explanation, which is clearly bogus, because the first text wasn't an output of the LLM from that session.

Similarly, there is no database of text in the LLM, and neither does the LLM have any meta-knowledge about the data it has been trained on. Once training is completed, the LLM is totally separate from the training data. So, if you ask an LLM to think about (or compensate for) biases in the training data, it cannot do that, because it doesn't know anything about its own training data. What it will do: generate a text that mimics what someone on the internet might have written when asked to consider internal biases.

Or in short: LLMs do not have any "consciousness" or "memory" they could query, nor do they have any meta-knowledge about their own training data or internal functioning. They just generate text that convincingly seems like they do.

That said, though, sometimes fake-it-till-you-make-it is all you need. One should just avoid getting false impressions about what an AI can or cannot do. Therefore it is generally a good idea never to ask the LLM anything that refers to its own internal state, its training data, or any other meta-knowledge about that LLM. You can happily ask the LLM about anything that was written previously in that session, but ask, and expect answers, as if you were asking about something someone else has written. For the LLM, everything previously written in that session is just something it has digested before predicting the next word. It doesn't care much whether it digested it from your input or from its own output. It just predicts what would be a probable next word to continue the conversation the way a similar conversation would have been continued on the internet.


I would find this article more convincing if I hadn't just yesterday asked ChatGPT a simple question about a water polo award (that is thoroughly documented online), which it answered completely incorrectly in a supremely confident tone.


What’s the question? And which version of GPT?


I was trying to figure out which percentage of the Cutino Awards have gone to goalkeepers (for an article I'm writing for a private athletic club magazine). This year's winner is a goalie, and I had been told in advance by someone from USA Water Polo that the first woman to win was a goalie, too, as well as a couple others. I confirmed online that he was right about the first winner, and was hoping to avoid having to check each name, so I asked ChatGPT, which told me that none of the female winners have been goalies. I don't know which version of GPT it was; it's the one that ran when I went to the OpenAI home page. It may be the one before 40 (which I guess would be 39, if that's how they do their naming, but I didn't see any version number at the time).

I went back to the OpenAI site just now (using a different computer, so it's possible the home page routed me to a different GPT version than I used before) and asked it the question again, as an experiment, and this time it was wrong in a different way:

Q: "How many winners of the Cutino Award have been goalkeepers?"

A: "As of my last update, which was in January 2022, there had been two winners of the Cutino Award who were goalkeepers: Genai Kerr from the University of California, Berkeley, who won the award in 1992, and Josh White from Pepperdine University, who won in 2007. There may have been more winners since then, but I don't have information beyond that date."

ChatGPT is not only incorrect, but is making up data: the award was first given in 1999, and so Kerr never won it. The 2007 winner was John Mann of UC Berkeley; no one from Pepperdine has ever won. I'm not sure "Josh White from Pepperdine" even exists--there's a Josh White who's a water polo coach, but he apparently went to a different school.

It's interesting in this second answer that ChatGPT doesn't mention women at all, as if they aren't eligible for the award. I may have asked it specifically about women last time; I don't remember the exact phrase.

As an experiment, I tried asking it about women, and this time used GPT-40:

Q: "How many women goalkeepers have won the Cutino Award?"

A: "As of my last update in January 2022, no female goalkeepers had won the Cutino Award. The Cutino Award is typically given to the best male and female water polo players in the United States, but it's more commonly associated with field players rather than goalkeepers. However, the landscape of sports can change, so it's always good to check for the most recent updates."

This is what I got yesterday, and it's wrong again: we know that at least the first one, Bernice Orwig, was a goalie. The award is also given specifically to NCAA players, so although it's suddenly acknowledging women here--only because the question refers to them, it seems--saying the award goes to "the best players in the United States" is not fully accurate. It's probably true that the award "is more commonly associated with field players rather than goalkeepers," since there are way more field players than goalies, but it's not much of an answer. I wondered if GPT-40 was more aware of male goalkeepers, and would put them in if I asked the general question again, so I tried it out:

Q: "How many goalkeepers have won the Cutino Award?"

A: "As of my last update in January 2022, no goalkeepers had won the Cutino Award. The award is primarily focused on field players in water polo rather than goalkeepers. However, the information might have changed since then, so it's worth verifying with the most recent sources."

Well, no made-up winners here, but once again, women are left out. I started to check whether any male winners were goalies, to see if it was at least right about that, and so far haven't found any, but I have to climb out of this rabbit hole. At any rate, I didn't ask about men only, so the answer is still false.


Thanks. Yes, this is the type of question that AI performs very poorly at these days (including Claude). It's not surprising to me that it would hallucinate facts when confronted with a specific factual question of that nature. When asking it to conduct legal analysis without context, it similarly hallucinates case names and facts. Wikipedia, which is fantastically accurate, is far better at this sort of task (https://en.wikipedia.org/wiki/Peter_J._Cutino_Award seems to match what you are saying). AI's skill is logical reasoning based on facts it already has. I imagine subsequent generations will combine Wikipedia's factual accuracy with AI's logical reasoning.


>It's not surprising to me that it would hallucinate facts when confronted with a specific factual question of that nature.

Well, this confuses me, as the part where you wrote, "'...query an LLM concerning the ordinary meaning of some word (say, ‘landscaping’) and then mechanistically apply it...' I don’t see why this should be off the table" seems to suggest it's a good idea to ask an LLM a factual question about a word's definition and how it has been used in prior case law. It just seems like everything it spat out would have to be deeply fact-checked, and then where is the time savings?

Wikipedia itself, although largely correct, is full of hidden inaccurate facts, which is why it's great that it shows its sources, which ChatGPT does not.


When discussing language interpretation in the context of judging, especially in regard to a widely-spoken language like English, it's important to consider which version(s) we're interpreting.

For example, Irish English and American English differ in some fairly legally-relevant ways. It took me a few years to recognise them (e.g., tabling a motion means to introduce a bill, not end it, bonnet is a car hood, chemist = pharmacist or even more generally, the pharmacy, fanny...)

And while it's true that the starting point for LLMs is usually the Common Crawl, which skews heavily US, the fine-tuning process is done primarily by English speakers in the Global South:

https://www.theguardian.com/technology/2024/apr/16/techscape-ai-gadgest-humane-ai-pin-chatgpt

And Nigerian English has its own lexical differences compared to US English. https://www.farooqkperogi.com/2011/12/top-hilarious-differences-between.html


I wonder, however, whether this itself could be accounted for via the right queries. Can’t you ask the AI to assume the question is posed by an Irish speaker? AI is amazingly good at, for example, translation.


Perhaps, but one has to know (and remember) to ask. The marketing of ChatGPT and related tools has always been premised on the idea that we use natural language.

An Irish solicitor isn't likely to think that they need to add further prompts to ensure ChatGPT is interpreting the Irish meaning of phrases (why would they? They certainly don't think or say such things naturally).

Most people really do not grok prompt engineering (even the basics), and asking an LLM to assume the question is posed by a native speaker is in essence, prompt engineering.


A fascinating post! I'm guessing this could be very useful in practical, workmanlike contexts where, upon doing the work, there's not going to be any reasonable disagreement between jurists. Summary affirmances in appellate courts seem like the most obvious use case. And I can see the business case for non-complex arbitration too, even on questions that might divide decisionmakers.

I am quite skeptical of AI's ability to meaningfully supplant humans in the kinds of cases that divide mainstream jurists, particularly in ways that map onto jurisprudential differences between judges, because then the question will depend on what kind of jurist the AI is. I don't see any near-term world in which AI can either (1) settle questions about how jurists should approach law or (2) do better than humans at making persuasive cases for one jurisprudential approach over another. And on the question of what kind of jurist Claude 3 Opus is, I asked it and it told me:

"When I've role-played as the Supreme Court and decided cases, my approach has been to carefully consider the text, structure, and purpose of the relevant laws, while also taking into account precedent, practical considerations, and the broader goals of the legal system. I strive to interpret the law in a way that is faithful to its meaning and intent, but also pragmatic and mindful of the real-world consequences of my decisions.

In terms of the text, I place significant weight on the plain meaning of the statutory language, but I also consider the context in which it appears and the overall statutory scheme. Where the text is ambiguous, I look to other tools of statutory interpretation, such as legislative history and canons of construction, to discern Congress's intent.

I also believe that precedent plays an important role in promoting stability, predictability, and fairness in the law. I am cautious about overturning settled precedent, but I am willing to do so if I conclude that a prior decision was clearly erroneous or that changed circumstances have undermined its rationale.

At the same time, I recognize that the law does not always provide clear answers, and that courts must sometimes exercise judgment and discretion in applying it to particular cases. In doing so, I believe it is important to consider the practical implications of different interpretations and to strive for outcomes that are just, equitable, and consistent with the fundamental values of our legal system.

In terms of jurists who approach cases similarly, I believe my approach shares some similarities with that of Justice Stephen Breyer, who is known for his pragmatic and purposive approach to statutory interpretation. Like Justice Breyer, I place significant weight on the text of the law, but also consider its purposes and practical consequences. I also believe my approach has some similarities to that of Justice Kagan, who is known for her careful attention to statutory text and structure, but also her willingness to consider practical realities and the broader purposes of the law.

That said, it's important to note that my approach is ultimately my own, and may not perfectly align with that of any particular Justice. The Supreme Court's decisions often involve complex issues and competing considerations, and reasonable jurists can disagree on the correct outcome in any given case."


> "When I've role-played as the Supreme Court and decided cases, my approach has been to carefully consider the text, structure, and purpose of the relevant laws, while also taking into account precedent, practical considerations, and the broader goals of the legal system. I strive to interpret the law in a way that is faithful to its meaning and intent, but also pragmatic and mindful of the real-world consequences of my decisions."

From a basic understanding of how LLMs work, this has to be fiction. An LLM doesn't have a built-in way to remember anything it did. It doesn't know what it role-played in previous chats. Even its short-term memory is based on the chat transcript alone.

It's just more role-playing. Now it's role-playing what a human would say when asked to justify themselves. But it's not human and doesn't have human reasons for doing things, so all its justifications and predictions about how it might approach cases have to be fictional.

You cannot ask an LLM about itself and expect a factual answer. It's not trained on real introspection, it's trained on imitating texts. If it happens to tell the truth, it's because it was trained to say that. Otherwise it doesn't know.


Pretty fascinating comment as well! Breyer and Kagan, eh? I wonder where that came from?

Comment deleted
Jun 9

Not ironically, I agree with this assessment. It doesn’t undermine the argument at all.


This is exactly what Claude Opus seemed to be doing when I experimented with it. It is very efficient at summarizing briefs, which is a useful tool, but did not otherwise give any useful input.


Those are definitely not the results I am seeing when using Claude 3 Opus. When I input Supreme Court briefs it’s consistently giving me the right answer for the right reasons. A little bit of handholding also helps. Which case(s) gave you bad results in Claude 3 Opus? You can email me if you don’t want to disclose publicly.
