Revisiting Using AI Coding Assistants: You’re Holding It Wrong Edition

After scathing accusations of skimping on due diligence, as well as other feedback to my article on trying to use an ‘AI coding assistant’ for the first time, the only rational, academic response is to lick one’s wounds following a particularly bruising peer review and try to address the raised issues. Reality after all does not care about one’s feelings, and there may be more to this AI assistant technology that can be coaxed out with a more in-depth look.

To this end I’ll do my best to try and work through each raised point, criticism and accusation, to see what I – and perhaps others – can learn from this endeavor. Said points include the use of the wrong frontend – i.e. Copilot – and the wrong model – being Claude Haiku 4.5 – as well as the egregious flaw on my end of ‘prompting wrong’.

For the sake of due diligence the best frontend and models will be investigated for particular tasks, with finally the verbal minefield of ‘prompt engineering’ examined for industry-standard approaches.

Junior Developer

The exact way to refer to an LLM coding assistant is still in flux, with some comparing it to pair programming, while others see the assistant more as a glorified search engine with code-completion features, a kind of merger of a web search engine and IntelliSense in Visual Studio. How to look at this relationship is consequently the cause of a lot of contention.

Another perspective is that of these assistants being more like junior developers. After all, they can apparently do all the basic boilerplate stuff, write unit tests and perform a range of other basic tasks that are beneath more senior developers. The corollary here is then of course why companies would even want to hire another junior developer if the LLM can fill these jobs. Unsurprisingly, it is already being reported that this is happening.

The million dollar question that remains is whether, if all of this is true, a junior developer still has value. The answer appears to be ‘yes’, even if you ask Microsoft. The argument would appear to boil down to these assistants supposedly automating away a lot of the tedium that used to get pushed onto junior developers, leaving them free to develop more advanced skills, naturally supported by the same coding assistants.

Fancier Automation

This gets us to the question of whether these assistants are really much better than the automation tools that have existed in IDEs for many decades now with arguable improvements over time. They certainly do seem to be more capable, but they’ll still never exceed their programming, and require a lot of finagling to make them do the right thing.

Returning to junior developers for a moment: bad apples aside, they will let you know if they didn’t understand something correctly, ask for clarification, admit that they don’t know something and offer to look it up in the documentation. None of these are things that these glorified chatbots are capable of, which makes a comparison with IDE automation tools rather fair, especially since junior developers tend to get fired if they screw up as badly as the LLM tools seem to regularly do.

While it’s true that these newfangled coding assistants do have a context window in which they “remember” previous details, you’re still dealing with the limitations of the underlying model no matter how good your prompt engineering skills are. They will also regularly confabulate and you have to accept that they generate code and documentation that is just as likely to be correct as completely wrong, even if many users of these tools seem to believe that they are actually more performant.

Self-reported and observed AI coding assistant performance. (Credit: Joel Becker et al., METR, 2025)

Ergo you’ll be writing test cases for the test cases and generated code, while also pulling code review duty, as there is no possibility of ever establishing a level of trust. Especially not after it deletes your entire hard drive or the production database for the second time that week.

If that sounds like the kind of junior developer or automation tool you’d love to be paired up with, then you’re quite the adventurous spirit. Meanwhile I have had enough fun with even code completion tools like the aforementioned IntelliSense or its equivalents in the various other IDEs that I have used over the years to never use them again. It’s bad enough when a code completion tool gets it wrong; it’s worse when the human in the loop fails to catch the glaring mistake.

Model Frontends

Although we generally refer to ChatGPT, Claude or Copilot as an LLM, this is technically incorrect, as these are merely the chatbot frontends that are written to provide a natural language interface experience. The choice here is naturally quite dizzying, as you have a range of major players including the aforementioned, each of which offers a web interface as well as integration with various IDEs and use on the CLI for easy automation.

Hence the claim that one should never use the web frontend for coding, as the assistant needs access to your code and local environment, which makes sense if you want more of the pair programming experience. Since my ‘IDEs’ of choice are Notepad++ and Vim, my options here are of course rather limited. There is a third-party OpenAI integration plugin for NP++ called NPPOpenAI, but that would seem to be it.

The cool kids are of course all using Visual Studio Code with direct integration of all the frontends, but that option seems about as appealing as ripping half the RAM out of my PC and smashing my fingers with a hammer. Even as a former avid Visual Studio Pro user I feel insulted on a fundamental level at the mere thought.

Maybe one of the CLI tools like Copilot CLI is a better match for me here, as suggested, but this would appear to be more of a way to automate various GitHub tasks. Despite searching around I could not find an objective comparison of the different frontends, just many strong opinions and various pricing plans for model access, so for all intents and purposes they’re being treated as the same.

The Model Catwalk

It was further suggested that I take a look at LiveBench.ai for a comparison of how models perform on various tasks. This does indeed appear to be a valuable resource, if only for providing what appears to be a fairly objective way to compare these individual models against each other.

When sorting by the heading Coding Average, it puts OpenAI’s GPT-5.2 Codex at the top, with Claude 4.7 Opus Thinking High Effort close behind, both at a hair over 83%. The Haiku 4.5 model that I was using comes in at a mere 72.17%, which is still much better than the sub-60 percent models near the bottom. Of the free models Haiku 4.5 at least would seem to be not too terrible, with Anthropic marketing it in October of 2025 as equivalent to Sonnet 4 when it comes to coding performance:

Haiku 4.5 model comparison at launch. (Credit: Anthropic)

Consequently, it would be expected to perform at least decently at given tasks, but we can take a look at what other models are available via GitHub Copilot, for instance.

If you’re not into shelling out any clams for a purported improved experience with at least the Pro+ – not Pro – subscription, you get access to quite a few models to pick from that are apparently not ‘premium’. Of note here is that new sign-ups are currently ‘paused’ as usage-based billing is being introduced.

Of these available free models the following would have theoretically performed better according to the aforementioned benchmarks:

  • GPT-5.4 Mini, at 74.70%.
  • GPT-5 Mini, at 76.07%.

Following the ‘fast and cost-efficient’ category things get a bit dicey to compare, due to Anthropic’s awesome naming scheme and apparently an additional mode you can use these models in, which may or may not apply here:

  • Claude Sonnet 4.6 (Thinking High Effort), 79.9%.
  • Claude Sonnet 4.5 (Thinking), 80.36%.

These are apparently ‘more versatile and highly intelligent’, which doesn’t seem to bump up their total score too much compared to the mini models. Following this we get the ‘most powerful at complex tasks’ models:

  • GPT-5.4 (Thinking High Effort), 78.18%.
  • GPT 5.3 Codex (High), 78.18%.

Taken at face value, the 72.17% for the Haiku 4.5 model is indeed somewhat worse than the other two mini models, yet as this points system relies on a specific methodology it’s important to consider what this means. From the underlying coding tests we can see that they are all Python-based programming examples, which is great if you’re testing Python coding assistants, but rather useless for my purposes as I program in just about any language except Python.

Perhaps more worrying here is the statistic that even in this scenario the best model (GPT-5.2 Codex) only managed to score a rather pitiful 83.62%, so your choice would seem to be roughly between ‘atrocious’ and ‘very bad’. Within the free model selection you’re choosing between roughly 28% and 22% of the answers being incorrect, or roughly a 3/4 chance of getting what you were asking for.

Statistically, this wouldn’t seem to make much of a difference when picking either model.
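
The arithmetic behind these percentages is simple enough to sketch. The snippet below is only an illustration: it takes a handful of the Coding Average scores quoted above as given, and treats each score as the share of benchmark tasks passed, which is an assumption about how the average is composed.

```python
# Coding Average scores as quoted in this article (percent of tests passed).
# Treating 100 - score as a "failure rate" is a simplifying assumption.
SCORES = {
    "GPT-5.2 Codex": 83.62,
    "Claude Sonnet 4.5 (Thinking)": 80.36,
    "GPT-5 Mini": 76.07,
    "GPT-5.4 Mini": 74.70,
    "Claude Haiku 4.5": 72.17,
}


def failure_rate(score_percent: float) -> float:
    """Share of benchmark tasks that were not solved, in percent."""
    return round(100.0 - score_percent, 2)


def report(scores: dict[str, float]) -> list[tuple[str, float, float]]:
    """Return (model, score, failure rate) rows, best score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, score, failure_rate(score)) for name, score in ranked]


if __name__ == "__main__":
    for name, score, failed in report(SCORES):
        print(f"{name:<32} {score:6.2f}% passed, {failed:6.2f}% failed")
```

For Haiku 4.5 and the two mini models this works out to failure rates between roughly 24% and 28%, so a gap of about four percentage points.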

Prompt Engineering

On my last foray, I was also accused of “prompting the wrong way”, which brings us to the topic of prompt engineering, where you must learn to follow specific rules in order to “correctly” use one of these coding assistants. A crucial aspect that was not obvious to me is that you absolutely must use so-called ‘environmental prompting’, where you set the equivalent of global variables.
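
As far as I understand the term, such an environment block is just a fixed chunk of text that gets repeated at the top of every prompt. The following is a minimal sketch of that idea; the `Environment` class, its field names, the `### Environment:` heading and the example values are all hypothetical and not taken from any particular tool or guide.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    """Hypothetical project-wide facts to repeat at the top of every prompt."""
    language: str
    toolchain: str
    target: str
    rules: tuple[str, ...] = ()


def build_preamble(env: Environment) -> str:
    """Render the environment as a plain-text block that precedes the task."""
    lines = [
        "### Environment:",
        f"Language: {env.language}",
        f"Toolchain: {env.toolchain}",
        f"Target: {env.target}",
    ]
    # Each rule becomes its own line so that it reads like a global setting.
    lines += [f"Rule: {rule}" for rule in env.rules]
    return "\n".join(lines)


if __name__ == "__main__":
    # Example values are made up for illustration only.
    env = Environment(
        language="C++17",
        toolchain="arm-none-eabi-gcc",
        target="STM32F4 with CMSIS headers",
        rules=("Do not use dynamic allocation.",),
    )
    print(build_preamble(env))
```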

To this you then add the task itself, such as in the absolute gem that is used by the LiveBench code test for an array test:

### Instructions: You are an expert Python programmer. You will be given a question (problem specification) and will generate a correct Python program that matches the specification and passes all tests. You will NOT return anything except for the program.
### Question:
You are given an integer array nums and an integer k.
The frequency of an element x is the number of times it occurs in an array.
An array is called good if the frequency of each element in this array is less than or equal to k.
Return the length of the longest good subarray of nums.
A subarray is a contiguous non-empty sequence of elements within an array.

Example 1:

Input: nums = [1,2,3,1,2,3,1,2], k = 2
Output: 6
Explanation: The longest possible good subarray is [1,2,3,1,2,3] since the values 1, 2, and 3 occur at most twice in this subarray. Note that the subarrays [2,3,1,2,3,1] and [3,1,2,3,1,2] are also good.
It can be shown that there are no good subarrays with length more than 6.

Example 2:

Input: nums = [1,2,1,2,1,2,1,2], k = 1
Output: 2
Explanation: The longest possible good subarray is [1,2] since the values 1 and 2 occur at most once in this subarray. Note that the subarray [2,1] is also good.
It can be shown that there are no good subarrays with length more than 2.

Example 3:

Input: nums = [5,5,5,5,5,5,5], k = 4
Output: 4
Explanation: The longest possible good subarray is [5,5,5,5] since the value 5 occurs 4 times in this subarray.
It can be shown that there are no good subarrays with length more than 4.


Constraints:

1 <= nums.length <= 10^5
1 <= nums[i] <= 10^9
1 <= k <= nums.length

### Format: You will use the following starter code to write the solution to the problem and enclose your code within delimiters.
```python
class Solution:
    def maxSubarrayLength(self, nums: List[int], k: int) -> int:

```

### Answer: (use the provided format with backticks)

With this kind of preamble and explicit instructions to the ‘coding assistants’, you may as well just write the code yourself. Even if a briefer prompt is usually ‘good enough’, you still need to spend all that time and effort just to get the answer that you already knew you were looking for. Even a junior developer wouldn’t need this much hand-holding.
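
For comparison, a hand-written answer to the quoted problem is shorter than the prompt itself. The following is a minimal sketch using a sliding window, which is my own assumption of a reasonable approach and not LiveBench’s reference solution; only the class and method names come from the starter code in the prompt.

```python
from collections import defaultdict
from typing import List


class Solution:
    def maxSubarrayLength(self, nums: List[int], k: int) -> int:
        counts = defaultdict(int)  # occurrences of each value in the window
        left = 0                   # left edge of the current window
        best = 0
        for right, value in enumerate(nums):
            counts[value] += 1
            # Shrink from the left until `value` occurs at most k times again.
            while counts[value] > k:
                counts[nums[left]] -= 1
                left += 1
            best = max(best, right - left + 1)
        return best
```

Each element enters and leaves the window at most once, so this runs in linear time, which should be comfortable for the stated limit of 10^5 elements.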

In this regard, the other uses that people have mentioned, such as bouncing ideas off the chatbot (“rubber ducking”), have merit, but often even talking to potted plants or going on that walk around the block can do just about as much for getting one’s thoughts in order, along with random web searches.

Whether to use “micro prompts” or larger tasks, whether to use the chatbot as a search engine or not, and whether to correct answers to previous prompts are all details that seem to be highly divisive among users of these tools, as is the topic of vibe coding, which some seem to embrace, while others dismiss it as an insult to their artisanal craftwork.

Local Models

There are many more things to cover here, such as the use of local models vs these hosted ones, with all the gotchas of subscriptions, private data harvesting and the like that this entails, but that will have to wait for another article. It’s also interesting how much the subscription and usage fees (and limitations) are currently going up across the various services, making the idea of local models seem more attractive, if they are even worth it with such limited inference capacity available.

Suffice it to say that I have learned some things along the way of writing this article, while not changing my overall premise and conclusion of the previous article. Although I could have certainly picked a theoretically better model, this is hard to substantiate without pitting the models against each other in STM32 CMSIS and Ada coding challenges. Based on the results in Python it’s hard to make the claim that it would have made an amazing distinction, but maybe not using a ‘mini’ model makes all the difference here?

Hopefully the better models won’t be removed from free access before I can even give this idea a shot.

 

3 thoughts on “Revisiting Using AI Coding Assistants: You’re Holding It Wrong Edition”

  1. Truth be told, most of the work in programming was already boiling over to simply writing lines and lines of slop. The only change that LLM makes, is that it’s now all automated. Anyway, the ship has long sailed and it’s probably joever for teens entering IT courses now. (Universities won’t mention that thou, it would be bad for their business.)

    I’m glad I got the so-called “assistance package” during the pandemic (essentially a gov’t handout) and became a retro motorcycle influencer. Work in IT is probably even more inhumane than working at Auchan. Now I support myself just from making the videos instead of doing C#.

  2. yeah 2025 was bad. Opus 4.5 was the turning point for me – the point at which it was faster to use assistants. So don’t use old studies!

    The new hotness is: spawn 20 worktrees with 20 agents in parallel, working on 20 different things. But beware burnout. And I don’t care what the marketing says, I can leave Opus 4.6 or 4.8 alone for an hour and it will make good progress, esp with the 1M context window. So “work” these days is really wild, and burnout is a huge problem, since it becomes addictive to solve problems so fast. But you’re context-switching 100% of the time which is crazy tiring.

    1. if you want a real evaluation of what software development is going to look like, you’re gonna need the paid models. Full stop. I’m sure there’s somewhere on IRC you could get some API keys for…not much.
