Playing Chess Against LLMs And The Mystery Of Instruct Models

November 16, 2024

At first glance, trying to play chess against a large language model (LLM) seems like a daft idea, as its weighted nodes have, at most, been trained on some chess-adjacent texts. It has no concept of board state, stratagems, or even whatever a ‘rook’ or ‘knight’ piece is. This daftness is indeed demonstrated by [Dynomight] in a recent blog post (Substack version), where the Stockfish chess AI is pitted against a range of LLMs, from a small Llama model to GPT-3.5. Although the outcomes (see featured image) are largely as you’d expect, there is one surprise: the gpt-3.5-turbo-instruct model, which seems quite capable of giving Stockfish a run for its money, albeit on Stockfish’s lower settings.

Each model was given the same query, telling it to be a chess grandmaster, to use standard notation, and to choose its next move. The stark difference between the instruct model and the others calls for investigation. OpenAI describes the instruct model as an ‘InstructGPT 3.5 class model’, which leads us to this page on OpenAI’s site and an associated 2022 paper that describes how InstructGPT is effectively the standard GPT model heavily fine-tuned using human feedback.
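
To make the setup concrete, here is a minimal sketch of how such a query could be assembled. The wording of the instructions, the `build_prompt` name, and the numbered move-list layout are all assumptions for illustration; the exact prompt used in the experiment is not reproduced here.

```python
from typing import List


def build_prompt(moves_so_far: List[str]) -> str:
    """Assemble a hypothetical chess prompt from the moves played so far."""
    # The three instructions mirror the description in the text; the exact
    # phrasing is an assumption.
    header = (
        "You are a chess grandmaster. "
        "Use standard notation. "
        "Choose your next move."
    )
    # Group the moves into numbered turns: "1. e4 e5 2. Nf3 ..."
    turns = []
    for i in range(0, len(moves_so_far), 2):
        pair = " ".join(moves_so_far[i:i + 2])
        turns.append(f"{i // 2 + 1}. {pair}")
    history = " ".join(turns)
    return f"{header}\n\nGame so far: {history}\n\nNext move:"


if __name__ == "__main__":
    print(build_prompt(["e4", "e5", "Nf3"]))
```

The same string would be sent to every model under test, so any difference in playing strength comes from the model rather than from the query.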

Ultimately, it seems that instruct models do better with instruction-based queries because they have been programmed that way using extensive tuning. A [Hacker News] thread from last year discusses the Turbo vs Instruct versions of GPT 3.5. That thread also uses chess as a comparison point. Meanwhile, ChatGPT is a sibling of InstructGPT, per OpenAI, using Reinforcement Learning from Human Feedback (RLHF), with ChatGPT users presumably now providing most of said feedback.

OpenAI notes repeatedly that neither InstructGPT nor ChatGPT provides correct responses all the time. However, within the limited problem space of chess, it would seem that the instruct model is good enough not to bore a dedicated chess AI into digital oblivion.

If you want a digital chess partner, try your Postscript printer. Chess software doesn’t have to be as large as an AI model.

17 thoughts on “Playing Chess Against LLMs And The Mystery Of Instruct Models”

Tito Ferreira Figueiredo says:

November 16, 2024 at 10:15 am

Stockfish is not AI.

1. Gravis says:
  
  November 16, 2024 at 5:50 pm
  
  It is AI. The problem you are having is that you have completely bought into the marketing department’s idea of AI when it’s actually an entirely subjective designation. Too bad for you.
  
  1. S O says:
    
    November 16, 2024 at 8:02 pm
    
    Sorry, you are the one that bought into it. LLMs aren’t AI either.
    
    1. Greg A says:
      
      November 17, 2024 at 7:11 am
      
      i wouldn’t mind hearing your definition
      
      fwiw i tend to retcon it…once we know that AI is nothing but a pattern matching system, i am inclined to say that natural intelligence is nothing but a pattern matching system. what do you say
      
      1. Dave Bruce Rozee says:
        
        November 17, 2024 at 10:42 pm
        
        That may be. In humans the size of the weighted model is just significantly more vast than current llms.
        
Titus431 says:

November 16, 2024 at 10:28 am

Stockfish is definitely AI as defined by NIST and other widely accepted definitions of AI.

It is not “generative AI”.

For that matter, spellcheck, autocorrect and spam filters are all “AI”.

None are generative AI.

Software that can be considered AI has existed for at least 60 years.

1. S O says:
  
  November 16, 2024 at 8:05 pm
  
  You mean the DOD definition? NIST doesn’t have one and the DOD version can be satisfied by a slide rule. None of these things are AI, that’s marketing. Applied models were called expert systems before the current boom.
  
Anonymous says:

November 16, 2024 at 10:48 am

“For the open models I manually generated the set of legal moves and then used grammars to constrain the models, so they always generated legal moves. Since OpenAI is lame and doesn’t support full grammars, for the closed (OpenAI) models I tried generating up to 10 times and if it still couldn’t come up with a legal move, I just chose one randomly.”

In my limited experience of trying to get an LLM to play chess, the main problem is that they will very frequently give an illegal move, as they have no concept of the board state. Constraining the results to legal moves really doesn’t seem like a “fair” assessment of the LLM’s playing strength to me.

It also seems like a possible vector for corruption of the results, although I’m still trying to figure out how that happened in this case. For example, if the legal moves were generated by reading the output from Stockfish in multi-variation mode then they would be ordered from best to worst and so an LLM picking a move nearer to the start of the list would perform better.

“Update: OK, I actually think I’ve figured out what’s causing this. I’ll explain in a future post, but in the meantime, here’s a hint: I think NO ONE has hit on the correct explanation!”

It seems like the author has an idea what’s going on but isn’t saying… yet.
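
The procedure quoted above from the original post can be sketched in a few lines. This is only an illustration under stated assumptions: `generate` is a hypothetical callable standing in for whichever model API is used, the legal-move list is assumed to come from a separate chess library, and the grammar string follows the GBNF style used by llama.cpp, which may or may not be the tool the author actually used.

```python
import random
from typing import Callable, Optional, Sequence, Tuple


def legal_move_grammar(legal_moves: Sequence[str]) -> str:
    """Build a GBNF-style rule that only admits the given moves.

    Assumption: a llama.cpp-like grammar format; the original post does not
    say which grammar tooling was used for the open models.
    """
    if not legal_moves:
        raise ValueError("no legal moves to choose from")
    alternatives = " | ".join(f'"{move}"' for move in legal_moves)
    return f"root ::= {alternatives}"


def choose_move(
    generate: Callable[[], str],
    legal_moves: Sequence[str],
    max_tries: int = 10,
    rng: Optional[random.Random] = None,
) -> Tuple[str, bool]:
    """Ask the model up to max_tries times, then fall back to a random move.

    Returns (move, from_model); from_model is False when the fallback fired.
    """
    rng = rng or random.Random()
    legal = set(legal_moves)
    for _ in range(max_tries):
        candidate = generate().strip()
        if candidate in legal:
            return candidate, True
    # Every attempt was illegal: pick uniformly at random from the legal moves.
    return rng.choice(list(legal_moves)), False


if __name__ == "__main__":
    replies = iter(["Ke9", "e4"])  # first reply is illegal, second is fine
    print(choose_move(lambda: next(replies), ["e4", "d4", "Nf3"]))
    print(legal_move_grammar(["e4", "d4", "Nf3"]))
```

Returning the `from_model` flag alongside the move makes it possible to count how often the random fallback fired, which speaks to the fairness concern raised in this comment.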

1. arjundev says:
  
  November 17, 2024 at 7:55 pm
  
  Can you let me know about your post? Thanks!
  
2. AySz88 says:
  
  November 18, 2024 at 8:50 am
  
  It doesn’t look like this kicked in yet, but I’d think eventually the instruction to be “a chess grandmaster” becomes more hindrance than good: GMs are weaker than Stockfish at any reasonable strength. The Stockfish settings used were very weak though.
  
  (There’s also the fact that the LLM is, in effect, a single node search depth – “policy only”, if I have the jargon right. But that’s not really distinguishing as far as I know.)
  
rclark says:

November 16, 2024 at 10:50 am

Whatever you want to call all of this is just a list of instructions written by a programmer to do a task given a data set. No more no less. Handy for some jobs. AI is just a glowing marketing term to get the masses to spend money on subscriptions or whatever.

1. Cad the Mad says:
  
  November 16, 2024 at 12:17 pm
  
  Golly I had no idea the self-hosted LLM RAG I have been using was costing me money. /s
  
  I love the comments because they quickly reveal who does and does not comprehend how machine learning works.
  
2. Titus431 says:
  
  November 16, 2024 at 12:29 pm
  
  Again, AI is a very broadly defined term. Look at NIST or ISO or OWASP resources. Heck, read a Wikipedia article on Alan Turing and Marvin Minsky and perceptrons and the (re-)birth of deeply layered neural nets.
  
  AI, machine learning, deep learning, generative AI, RAG are not just “marketing terms” and in some cases have very very different meanings.
  
  Saying they’re just “instructions written by a programmer” is a very inaccurate way to describe LLM, RNN, CNN, etc. outputs.
  
  In fact that’s the whole point and has nothing to do with marketing or subscriptions.
  
Jonathan Whitaker says:

November 16, 2024 at 11:47 am

The mystery is less of a mystery – that specific model was trained on millions of chess games! They specifically included all games above some ELO threshold, in PGN notation IIRC. I’m having trouble finding a public source, I’ll update if I find a link so you don’t have to trust me :)

Dude says:

November 16, 2024 at 2:26 pm

“Ultimately, it seems that instruct models do better with instruction-based queries because they have been programmed that way using extensive tuning.”

Sounds to me like a roundabout way of creating expert system AIs. It’s the 1970’s all over again.

1. physiii says:
  
  November 16, 2024 at 7:25 pm
  
  Expertise is what we learned not to use. For decades, every step of the way, smart people like yourself desired to program or guide the model based on expertise, but it turns out that was the bottleneck.
  
  But we are mostly blind to why we do things, even experts. If you pause and think sincerely to why you make decisions, you will start to see that most of what happens is automatic and emotional where conclusions come first. And you will start noticing this in others.
  
  Instead of asking an expert to explain their reasoning, all we needed was a way to compress information (attention) and to leverage compute and data.
  
  I think a lot about this too because I’ve been around a lot of experts. I have a masters in electromagnetics and photonics, went to medical school, worked on high level projects at national labs, etc.
  
  I often feel like an imposter even when I have solved problems that blocked others obviously more expert than me, and continue to do so. It feels like there is something else besides the thing we call intelligence or expertise that is more critical. And I think this other thing is what we will start valuing more as opposed to expertise. I see it happening now in software and in medicine.
  
  1. Daniel Larrosa says:
    
    November 17, 2024 at 7:12 am
    
    Very well said!
    
    It is that kind of “thing” that is / will be very difficult (impossible?) to replicate/automate.
    
    We can call it creativity, “thinking out of the box”, serendipity, you-name-it…
    
    Maybe, some day, who knows… we must keep trying, and hoping, and (humbly) learning as we go…
    
    Best regards,
    Daniel F. Larrosa
    Montevideo – Uruguay
    

Hackaday
