At first glance, trying to play chess against a large language model (LLM) seems like a daft idea, as its weighted nodes have, at most, been trained on some chess-adjacent texts. It has no concept of board state, stratagems, or even what a ‘rook’ or ‘knight’ piece is. This daftness is indeed demonstrated by [Dynomight] in a recent blog post (Substack version), where the Stockfish chess AI is pitted against a range of LLMs, from a small Llama model to GPT-3.5. Although the outcomes (see featured image) are largely as you’d expect, there is one surprise: the gpt-3.5-turbo-instruct model seems quite capable of giving Stockfish a run for its money, albeit on Stockfish’s lower settings.
Each model was given the same query, telling it to be a chess grandmaster, to use standard notation, and to choose its next move. The stark difference between the instruct model and the others calls for investigation. OpenAI describes the instruct model as an ‘InstructGPT 3.5 class model’, which leads us to this page on OpenAI’s site and an associated 2022 paper describing how InstructGPT is effectively the standard GPT LLM heavily fine-tuned using human feedback.
Ultimately, it seems that instruct models do better with instruction-based queries because they have been extensively fine-tuned on exactly that kind of input. A [Hacker News] thread from last year discusses the Turbo versus Instruct versions of GPT 3.5, also using chess as a comparison point. Meanwhile, ChatGPT is a sibling of InstructGPT, per OpenAI, using Reinforcement Learning from Human Feedback (RLHF), with ChatGPT users now presumably providing most of said feedback.
OpenAI notes repeatedly that neither InstructGPT nor ChatGPT provides correct responses all the time. However, within the limited problem space of chess, it would seem that it’s good enough not to bore a dedicated chess AI into digital oblivion.
If you want a digital chess partner, try your PostScript printer. Chess software doesn’t have to be as large as an AI model.
Stockfish is not AI.
Stockfish is definitely AI as defined by NIST and other widely accepted definitions of AI.
It is not “generative AI”.
For that matter, spellcheck, autocorrect and spam filters are all “AI”.
None are generative AI.
Software that can be considered AI has existed for at least 60 years.
In my limited experience of trying to get an LLM to play chess, the main problem is that they will very frequently give an illegal move, as they have no concept of the board state. Constraining the results to legal moves really doesn’t seem like a “fair” assessment of the LLM’s playing strength to me.
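A common mitigation for illegal moves (a hypothetical sketch, not necessarily what [Dynomight] did) is to re-prompt the model until it names a move that is actually legal, falling back to something deterministic after a few tries. Here `llm_suggest` and the legal-move set are assumptions: stand-ins for a real LLM call and a move generator such as python-chess.

```python
import random

def pick_move(llm_suggest, legal_moves, max_tries=3):
    """Ask the LLM (llm_suggest: a callable returning a SAN string such
    as "Nf3") for a move, retrying until it names something in
    legal_moves (the set of legal SAN moves for the current position,
    e.g. from python-chess). After max_tries failures, fall back to a
    random legal move rather than forfeiting."""
    for _ in range(max_tries):
        candidate = llm_suggest().strip()
        if candidate in legal_moves:
            return candidate
    return random.choice(sorted(legal_moves))
```

Note that even the fallback policy affects the measured strength: a random legal move plays very differently from, say, handing the model an explicit list to pick from, which is exactly the fairness concern above.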
It also seems like a possible vector for corruption of the results, although I’m still trying to figure out how that happened in this case. For example, if the legal moves were generated by reading the output from Stockfish in multi-variation mode then they would be ordered from best to worst and so an LLM picking a move nearer to the start of the list would perform better.
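If the candidate list really did come straight from Stockfish’s multi-variation output, one simple control (again a sketch; the function name and the best-to-worst ordering are assumptions based on the scenario above) would be to shuffle the list before it ever reaches the prompt:

```python
import random

def moves_for_prompt(multipv_moves, seed=None):
    """multipv_moves: candidate moves in the order Stockfish's
    multi-variation (MultiPV) mode emits them, i.e. best to worst.
    Shuffling before they go into the prompt means an LLM that happens
    to favor early list entries gets no hidden strength boost."""
    shuffled = list(multipv_moves)  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled
```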
It seems like the author has an idea what’s going on but isn’t saying… yet.
Whatever you want to call all of this, it’s just a list of instructions written by a programmer to do a task given a data set. No more, no less. Handy for some jobs. AI is just a glowing marketing term to get the masses to spend money on subscriptions or whatever.
The mystery is less of a mystery – that specific model was trained on millions of chess games! They specifically included all games above some Elo threshold, in PGN notation IIRC. I’m having trouble finding a public source, I’ll update if I find a link so you don’t have to trust me :)