The Right Benchmark For GPT

Dan Maloney wanted to design a part for 3D printing. OpenSCAD is a coding language for generating 3D objects. ChatGPT can write code. What could possibly go wrong? You should go read his article because it’s enlightening and hilarious, but the punchline is that the AI-generated code ran afoul of syntax errors, yet gave him enough of a foothold that he could teach himself enough OpenSCAD to get the project done anyway. As with many people who have asked the AI to create some code, Dan finds that it’s not as good as asking someone who knows what they’re doing, but that it’s also better than nothing.

And this is where I start grumbling. When you type your desires into the word-follower machine, your alternative isn’t nothing. Your alternative is to fire up a search engine instead and type “openscad tutorial”. That, for nearly any human endeavor, will get you a few good guides, written by humans who are probably expert in the subject in question, and which are aimed at teaching you the thing that you want to learn. It doesn’t get better than that. You’ll be up and running with your design in no time.

Indeed, if you think about the relevant source material that the LLM was trained on, it’s exactly these tutorials. It can’t possibly do better than the best of them, although the resulting average tutorial might be better than the worst you’ll find. (Some have speculated on what happens when the entire Internet is filled with these generated texts – what will future AIs learn from?)

In Dan’s case, though, he didn’t necessarily want to learn OpenSCAD – he just wanted the latch designed. But in the end, he had to learn enough OpenSCAD to get the AI code compiling without error. He spent an hour learning OpenSCAD and now he’s good to go on his next project too.

So the next time you hear someone say that they got an answer back from a large language model that wasn’t perfect, but it was “better than nothing”, think critically if “nothing” is really the right benchmark.

Do you really want to learn nothing? Do you really have no resources to get started with? I would claim that we have the most amazing set of tutorial resources the world has ever known at our fingertips. Compared to the ability to teach millions of humans to achieve their own goals, that makes the LLM party tricks look kinda weak, in my opinion.

40 thoughts on “The Right Benchmark For GPT”

  1. “It can’t possibly do better than the best of them”

    That’s not true in general. It’s possible for an LLM to read the sources and use those to create interconnecting deep patterns that result in unique output not found anywhere in the source material, and possibly of better quality. We’re not there yet, but that’s a matter of scale and technique.

    1. ‘We’re not there yet, but that’s a matter of scale and technique.’

      A paraphrase of what was said during the first era of AI enthusiasm, around 1960.

      Of course it is true, if “technique” is a broad enough term: technique = solve the strong-AI problem.

    2. It’s possible in the same sense as the million monkeys on typewriters. Of course it may happen, because the LLM is a large Markov chain, but the chances of it coming up with something truly unique and original and CORRECT all at the same time are slim to none. The probability of it generating unique but nonsensical outputs is far greater.

      In the end, how would you even notice that it has? You have to be the judge, because the LLM isn’t self-critical. It doesn’t differentiate between information and nonsense.
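
The Markov-chain framing in the comment above can be made concrete with a toy bigram model. This is a deliberately tiny sketch, not how a transformer-based LLM actually works; it only illustrates the “pick the next word from observed followers” idea:

```python
import random
from collections import defaultdict

def train_bigram(text):
    """Count word -> next-word transitions (a toy bigram Markov chain)."""
    model = defaultdict(list)
    words = text.split()
    for a, b in zip(words, words[1:]):
        model[a].append(b)
    return model

def generate(model, start, length=8, seed=None):
    """Walk the chain, sampling each next word from the observed followers."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        followers = model.get(out[-1])
        if not followers:  # dead end: no observed continuation
            break
        out.append(rng.choice(followers))
    return " ".join(out)

model = train_bigram("the cat sat on the mat and the cat ran")
print(generate(model, "the", seed=1))
```

Everything the chain can emit is a recombination of observed transitions, which is the commenter’s point: novelty comes from random recombination, and nothing in the walk checks whether the output is correct or even sensible.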

      1. Can you tell me a source that does? Because people don’t. You are far more likely to get bullshit answers going around asking random people than an AI: people who think they know, but are grossly mistaken. The internet sure doesn’t differentiate, so what does? It’s also not random chance if it generates good output, not in the sense that the monkeys’ output is; it’s based on our AI advancements, and it will likely be there sooner than anyone thinks.

        1. People can tell you that they’re not experts in the subject in question. In fact, if you go around asking people, most will say “I don’t know”, while the AI will confidently generate you bullshit and even insist it is correct.

          > It’s also not random chance if it generates good output in the sense the monkeys are

          Yes it is. Even when you ask the LLM the same question, it will generate different variations for an answer randomly. For many questions there is only one right answer, and given the multitude of possible answers that ChatGPT can give, it’s unlikely to pick the right answer.

          Just like with the monkeys, since they can’t read what they type, there’s no feedback as to whether the answer is correct. In practice, it has been shown that monkeys prefer to type the letter S more than anything.

    3. Sorry mate, but that’s total rubbish. As someone else pointed out, it’s a Markov chain. That means it’s a complex statistical system that works out what words go well together in what order, given input weights. The costs of the models used in LLMs are eye-watering, but the early hype is just that: hype. They don’t score brilliantly on IQ tests (unless you turn off all the reasoning bits) and focus on language; they have zero comprehension, and there is no mechanism in them to ensure that meaning/veracity/factuality is interpreted/conveyed/retained. LLMs are a dead end that will someday be the language-based front-end to a real machine intelligence. None of the “emergent” abilities that were touted early on (like Hinglish comprehension when not trained on Hinglish) are actually emergent; they were measured using binary rather than graduated metrics, which means they will ALWAYS appear to be emergent when you are increasing your model size by two orders of magnitude.

  2. GPT answers absolutely can be worse than nothing. Just recently on Space Exploration StackExchange we got “GPT says knots become useless without gravity, like I suspected.” If you don’t have a firm grasp on how things actually work, there’s no end to the lies you can be told by a confident liar.

      1. This is what agitates me about the current “AI” chatter frenzy. We /already know/ how the corpus of software gets polluted by people copy-pasting code without understanding it; and that’s when at least the original code is written by someone who knows what they’re doing and why they’re doing it that way.

        Github, MS et al are now actively pushing people to use code that has never been understood by /anyone/, and that’s going to pollute the whole ecosystem, and then future “AI” will be trained on the sewage, and the whole programming literature will be irreversibly degraded.

        It’s not going to affect the “elite” code that Google or Microsoft produce, because they wouldn’t touch these tools with a ten-foot pole. But for scientists, businesses and hobbyists, programming is just going to get less accessible as public resources dissolve into an ocean of spam. In 20 years, web development will be about as easy to get into as Verilog.

        1. Web development is already a Rube Goldberg nightmare. On all stacks.

          It’s not where anybody should start. And if they don’t start there, they won’t want to go there. Think of it as 21st-century COBOL: somebody’s got to do it… poor bastards. Better them than me.

          I guess my point is that nobody should be in that ghetto when they’re still copying and pasting random solutions from Stackoverflow, like some wetware version of ChatGPT. That’s child cruelty, like teaching a kid JS.

  3. I asked chatGPT to sketch for me the spectral response of a silicon detector in R. See what it came up with:

    # Creating wavelength sequence
    wavelengths <- seq(300, 1100, by = 10)

    # Creating response sequence
    response <- ifelse(wavelengths <= 800, 0.00125*(wavelengths-300), -0.00167*(wavelengths-800) + 1)

    # Bounding response by 0 and 1
    response <- pmax(pmin(response, 1), 0)

    # Creating data frame
    silicon_response <- data.frame(
      Wavelength_nm = wavelengths,
      Response = response
    )

    # Plotting the response curve
    plot(silicon_response, type = "l")


  4. I mean, personally I think there is another way to look at all this– I mean I don’t use it extensively at this point, just now and then, but mostly for languages I *do* know– Or as a tool for translation to at least get my pre-written code ‘part of the way’ to some other language I am less familiar with.

    Is it/can it solve any of my ‘big existential coding problems’? No, and I am not sure it ever will. But, thus far, as long as you are very careful about how you specify it, it seems to me pretty darn good at writing regexes without my brain getting totally tied in a knot. Or say I have to write a long set of case statements (or anything else similar): I find it saves me a lot of boring mindless typing.

    And for simple routines for languages I do know, yes, quite often it will err on this or that– But I can just read along until I reach the part I know is wrong, think ‘Silly ChatGPT’ and then easily fix it.

    I think there is another bigger question here though. Though the selected input sources are not entirely clear, we presume most of it is the ‘mixed bag of potatoes’ of the internet.

    However, the question relates to the *underlying technology*– I mean I wondered what you guys thought; If instead of ‘potatoes’ what if we trained such a model on ‘pure steak’ ? Would the same failings be there ?

    All the talk around bias and other things behind these models… I think a great deal of it is because they just vacuumed the Internet with this thing, with little regard to selectivity. I mean, I hardly believe such a bot might start saying ‘Heil Hitler !’ if no one ever told it who Hitler was, which is to say I don’t think it would just *make him up*. But somewhere, on a webpage…


    1. Training someone else’s neural net to do fun things is fun!

      Training an AI with an uncurated dataset is like running an internet poll with open options.

      You know you’re getting boatymcboatface, at best. More likely you’re getting ‘Mao did nothing wrong’ (or something equally inflammatory).
      Can’t blame anybody but yourself. btards/farkers etc are gonna btard. It’s just who they are.

      1. I wonder why ChatGPT didn’t observe all of Silicon Valley hammering away for a few years to learn to code.

        It’s almost like it’s just supposed to be a chatbot,
        it’s almost like it’s just supposed to be good at things like customer service, spambotting, and stuff like that,
        it’s almost like it was trained on a bunch of banter.

        It’s not general AI. Everyone needs to stop expecting it to be the singularity. IT’S JUST A CHATBOT

      2. My take on all the LLMs is that they’re bloody good random generators. Do not expect them to produce anything “reasonable” and “meaningful”, but in my use cases it’s not needed and would have been rather detrimental.

        The real “intelligence”, the real discovery ability comes from a feedback loop:

        – generate rubbish
        – test this rubbish against multiple criteria in simulations
        – collect the feedback and apply to the initially generated rubbish
        – repeat

        And up until recently, the “generate rubbish” step was the weakest part in my experiments. Now, having LLMs with an ability to generate nearly correct and sometimes correct structures in a language you can easily teach them is a ground-breaking feature.

        Forget about all the AI babble – this is the real power of LLMs. Embrace the fact that they’re just glorified Markov chains, don’t shy away from it.
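
The four-step loop described above is essentially generate-and-test search. Here is a minimal sketch in Python, with a made-up numeric target standing in for the “multiple criteria in simulations”; all the function names are illustrative, not any real API:

```python
import random

def generate(rng):
    """Stand-in for the LLM: emit a random candidate (here, a list of digits)."""
    return [rng.randint(0, 9) for _ in range(5)]

def score(candidate, target):
    """Stand-in for the simulation step: how close is the candidate to what we want?"""
    return -sum(abs(c - t) for c, t in zip(candidate, target))

def refine(candidate, rng):
    """Vary the candidate crudely: change one element at random."""
    mutated = list(candidate)
    mutated[rng.randrange(len(mutated))] = rng.randint(0, 9)
    return mutated

def search(target, rounds=2000, seed=0):
    rng = random.Random(seed)
    best = generate(rng)                                    # 1. generate rubbish
    for _ in range(rounds):
        candidate = refine(best, rng)                       # 3. apply feedback, vary
        if score(candidate, target) > score(best, target):  # 2. test against criteria
            best = candidate                                # 4. repeat with the survivor
    return best

print(search([3, 1, 4, 1, 5]))  # converges to [3, 1, 4, 1, 5]
```

Note that the generator’s role in the commenter’s setup is only the generate/vary step; all the selection pressure comes from the external tests, which is why better raw generators make the whole loop stronger.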

  5. I just asked ChatGPT to write very simple OpenSCAD code for a cube with a cylindrical hole running horizontally through the cube. After six corrections where I explained the errors, it still didn’t get it right. I finally gave up and told it the answer.

    1. Well, openscad isn’t terribly popular compared to other languages, and the past two years of code examples wouldn’t be in its training data.

      That, and your description of what you want drawn is vague, and while most real draftsmen might guess what you meant, it’s still ambiguous.

      Running horizontally through the cube? Which axis is horizontal? Perpendicular to one of its faces? Centered on that face?

  6. When I first saw the title of this article I thought it was going to be about the recent nerfing of chatGPT4’s brain power.

    A few months ago it scored well on a bunch of benchmarks, including one that asked if a given number was prime. A recent paper came out showing that chatGPT4 now scores abysmally on most of them, with no further comment from openAI. (it now scores around 2% on that prime answering question)

    The theory is that it’s openAI cheaping out on processing power, replacing chatGPT4 1.0 with smaller models behind the scenes and hoping nobody will notice. Another reason to develop/operate your model in-house.

    But then you get the question, is this a fair benchmark (math) to apply to an LLM text-transformer?

    And really, the answer to that question is “benchmark how you plan to use it.” If it turns out to be bad at math, you shouldn’t use it for that.

    1. The other theory is that when they increase the training set size and include more stuff in it, and don’t increase the model size to match, they get more and more “diluted” results, with more generic answers that lose detail because it can’t retain as much information.

      1. The benefit is that if you have a basic understanding (for instance, I can code), you have a choice: I could spend hours writing all the basic to-dos of whatever I am doing and then debugging my little mistakes, or I can ask it to make the app, have the whole thing down in 5 minutes, and then spend an hour debugging it. Now I’ve done, in a little over an hour, what would have taken me a full day. Or if you need to make something in a language you aren’t familiar with, it sets you up, and error reports and Google will help you find the errors.

        Even in the case you give, that person would have spent AT LEAST an hour doing said tutorial, probably would have come away with only a very basic understanding of the tech, and would have had to fumble for many hours more making the part. Instead he spent a single hour getting it done. It was objectively much faster than going about it without AI, so it wasn’t just better than nothing; it was better than any other option he had for the specific goal he had.

        1. Your posted ratio of debug time vs. code writing time has never been observed in nature.

          Especially when you’re being asked to debug someone else’s code. That does not in fact work; it never has. Approach broken; the algorithm can’t complete before the heat death of the universe. Good luck with ‘debugging’.

          I hereby sentence you to 5 years of maintenance coding on a steaming pile of JS mixed with Access VBA. I know, cruel and unusual.

        2. If you ask ChatGPT to generate an amount of code that would take you an hour to specify and write, you’re unlikely to even be able to read the generated code within an hour.

          You won’t be able to find its nontrivial bugs within an hour. If you don’t know what the code is or what it does, you can’t know what’s going wrong. You certainly won’t know the edge cases where nontrivial bugs happen.

          You probably won’t even know if the code does what you want within an hour. It’s not like you took the time to write a set of requirements or build a set of compliance tests to automate that problem. You’ll have to poke around randomly to see if it responds in a way you don’t want. Then you’ll have to spend time defining what you do want in that case.

    2. “A recent paper came out showing that chatGPT4 now scores abysmally on most of them, ”

      So, it is just as lazy as the rest of us.

    3. That “paper” is the Stanford one, right?

      They included the markup language for code blocks in the code itself and said “ope, i guess the code doesn’t work lol” when leetcode failed to run it.

      They also *only* included prime numbers in their dataset when asking ChatGPT if the numbers were prime or not, resulting in both false positives and false negatives, as each version of ChatGPT would sometimes initially say ‘yes’ or ‘no’ regardless of whether or not the number was prime.

      That paper was absolutely awful, and was ripped apart by proper testing and methodology before it was ever officially peer reviewed.
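
The complaint about a primes-only dataset is worth spelling out: against a test set containing nothing but primes, a “model” that always answers yes scores perfectly, so reported accuracy tells you nothing. A quick sanity check (the always_yes model is a deliberate strawman, not anything from the paper):

```python
def is_prime(n):
    """Trial-division primality test, fine for small benchmark numbers."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def always_yes(n):
    """A useless 'model' that claims every number is prime."""
    return True

def accuracy(model, numbers):
    """Fraction of numbers on which the model's yes/no answer is correct."""
    return sum(model(n) == is_prime(n) for n in numbers) / len(numbers)

primes_only = [2, 3, 5, 7, 11, 13]              # flawed dataset: positives only
balanced = primes_only + [4, 6, 8, 9, 10, 12]   # mix in composites

print(accuracy(always_yes, primes_only))  # 1.0 (looks perfect)
print(accuracy(always_yes, balanced))     # 0.5 (no better than a coin flip)
```

A benchmark of only positive examples cannot distinguish a good classifier from one with a constant bias toward “yes”, which is exactly the methodological hole being described.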

  7. I’m an experienced and good developer. I use ChatGPT for like a dozen or more hours per week. It’s not for churning out code, it’s for uber-rubber-duck debugging along with increasing my capabilities. I’m just as confident in ChatGPT as I am in your random human-written tutorial you can find via search…which is to say not super-confident. However, I’m experienced enough that with either I have the nose to sniff out the problems. The benefit of ChatGPT is that I can ask it questions and get answers much faster than I could from a human tutorial writer.

    1. If you asked a person to pair-program alongside you for those 12+ hours a week, you’d both benefit. As it stands, you’re getting a benefit, the other person gets nothing, while OpenAI gets a bunch of free training data from you.

      1. >If you asked a person to pair-program along side you for those 12+ hours a week, you’d both benefit.
        Mkay both benefit

        >As it stands, you’re getting a benefit, the other person gets nothing, while OpenAI gets a bunch of free training data from you.

        So you benefit, AND EVERY OTHER USER OF CHATGPT who ever uses any of the things OpenAI gleans from that “bunch of free training data from you” benefits

        Are you trying to say using ChatGPT is better in the long run?

      2. Who is this person pair programming with me? I’m at work, doing work, and I am able to do it better and more efficiently because AI handles the busy work. If there’s “the other person” somewhere being left in the dust, they’re welcome to hang out with all the people boomers keep complaining we’re not talking to anymore while we’re on our phones.

        Ultimately, this article was clearly biased and straw-manning to make a point no one was asking anyone to make. It’s EXACTLY like when Wikipedia came out all over again and everyone thought it was going to ruin education, haha.

  8. I have been using langchain to make a few LLM-based apps with great success. I even hired an intern who is doing great too, as we leverage chatgpt to fill in his gaps in programming knowledge.

    Ways of doing things get frozen in time and we repeat the same modes of operating. When you spark the imagination, people start thinking and things start happening.

    For example, have y’all seen the new mono-tile discovery that was a big deal, but the ugly part was you had to use a chiral tile as well? Finding the achiral mono-tile was thought to be orders of magnitude harder, yet soon after the chiral version was found, the achiral one was also found, because it sparked imagination.

    Of course you have to understand your tool before you try to bang in a nail with a screwdriver. But a lot of the comments here show a lack of imagination. Sure, some of it is hype, just like with literally any technology ever.

    People have naive expectations, are sad when reality doesn’t match those expectations, and then show negative sentiment towards this inanimate tool.

  9. There has been more than one occasion where someone has asked: “how did you learn that?”

    The internet is underrated. Any topic you would like to learn is there and available basically for free. The biggest caveat is that some topics get hidden by ads or by being similar to other, “popular” topics, and sometimes even the correct terms get stolen/misused. But there is usually an alternative term that will find you good results instead, and folks of similar interests will have picked up on this pattern, so it will work out either way.

  10. In the original article, the user never recognized his magical thinking, though a few commenters did, sharing how they approach prompting. i don’t think anyone mentioned that one can ask chatgpt for better prompts!

    try: give 20 example prompts better than “rewrite”.

    another thing is, though it may not have been trained on brainf*k, if you tell it the rules of the language while asking: is this complete? do you see loopholes? what questions do you have about the spec?

    then it will help you create a definition for the language/syntax that you can build on. open a new query, and iterate by refining the first entry. converse about that first entry until it starts losing the spec, then refine from the top again.. open additional queries to refine aspects or to work on your actual spec. use the best spec that you got from your spec queries by pasting/replacing it in the feature spec.

    that works well using openai/google.

    if you make an account at you can make 2 bots, one to define/refine the language spec with, another to define/refine your project with. making and editing bots is just like working with prompts, except that the instructions don’t get lost.

    what i’m trying to say is that just because you think something is magic, … no, i mean, the rule of GIGO (garbage in, garbage out) doesn’t go away just because you think something is magic. using the above you can teach chatgpt any esolang you want. realize that these things, like all genies, are extremely good at giving you what you want. just work on the question.. and please hack at it. learn to do it right!

  11. chatGPT is orders of magnitude better than what most Google searches turn up for tutorials. Try asking a tutorial a question, rephrasing something, giving example code that isn’t there, asking it to translate from something you know to this new topic, reading your code and critiquing it, etc. An on-demand interactive tutorial with a nearly endless supply of data to draw from…

    I am a pretty accomplished developer who picks up most languages quickly, and I get frustrated by the pacing of most tutorials (and don’t get me started on YouTube videos)… learning things from chatGPT has been life-changing from a productivity standpoint.

    Its code writing is about as good as an average coder would do at a whiteboard without a fancy IDE to back them up, and you should have about the same expectations about that code working as written. But it’s still great for getting over the writer’s block I am sure most developers have run into.

  12. This is a bizarrely Luddite-ish, unproductive addition to the AI conversation. We had this conversation when Google first got big, then again when Wikipedia got big, then again when religious people thought the Mayan calendar knew when the world would end.

    Look. AI is just fine as a means to do work faster, or to teach yourself to do things more efficiently. I would argue that even finding a tutorial written by an expert would be a bigger chore than just using the AI to explain. That way, I don’t have to read an article swollen with ads only to find out the tutorial doesn’t exist, the expert isn’t an expert, or the expert has no concept of how to write a concise tutorial, and now I’ve wasted time. With AI I can just ask the damn thing the specific question, and I’ll very likely get an accurate answer, provided I don’t ask it to code the entirety of the thing itself.

    So… in my personal experience, nah. Every time I think “let me just Google this question,” the answer is almost always “nah, that’ll take forever and I might never find it.” If I ask AI to explain X in a bullet list using concise language, then I’ll get exactly that.

    If ya don’t like it you’re fine not using it, but it’s perfectly fine at teaching a thing better than a random article will be, 100%. Although whatever OpenAI has done to it the past month or two has been pretty bizarre and funny to see in action.
