We have all seen the advertisements and glossy flyers for coding assistants like GitHub Copilot, which promised to use ‘AI’ to help you write code and complete programming tasks faster than ever. Yet how much of that has worked out since Copilot’s introduction in 2021? According to a recent report by code analysis firm Uplevel, there are no significant benefits, while GitHub Copilot also introduced 41% more bugs. Commentary from development teams suggests that while the coding assistant speeds up the writing of code, debugging or maintaining that code afterwards is often not realistic.
None of this should be a surprise, of course, as this mirrors what we already found when covering this topic back in 2021. With GitHub Copilot and kin being effectively Large Language Models (LLMs) that are trained on codebases, they are best considered to be massive autocomplete systems targeting code. Much like with autocomplete on e.g. a smartphone, the experience is often jarring and full of errors. Perhaps the most fair assessment of GitHub Copilot is that it can be helpful when writing repetitive, braindead code that requires very little understanding of the code to get right, while it’s bound to helpfully carry in a bundle of sticks and a dead rodent like an overly enthusiastic dog when all you wanted was for it to grab that spanner.
Until Copilot and kin develop actual intelligence, it would seem that software developer jobs are still perfectly safe from being taken over by our robotic overlords.
Is this any different than when C++ was taking hold? Software got more bugs, things got vastly more bloated and slower, and development cycles got longer. Yet computer scientists loved it.
Has it stopped taking hold? There are still plenty of use cases in the embedded world for regular C.
That is a false comparison. Say after me, correlation does not imply causation. “Modern” languages are an attempt to fix the symptom and not the cause: bad software development practices. People need to learn what software engineering is and actually do it instead of pretending they did. Coding is just brick laying.
“Say after me, correlation does not imply causation.”
this is an article about AI, they haven’t figured that out yet
Exact same comparison. Give people an easy convenient way to spew buggy crap code and they will use it.
Spot on
hahah my mind didn’t go to C++ but it went in the same direction. our jobs are not perfectly safe because our skills are not in demand. a lot of organizations are satisfied with extremely low quality code and a complete lack of troubleshooting. a lot of people manage to take unsupported prototypes to market.
Some middle managers mindlessly spew technical jargon to impress the C-suite while suppressing the input of working professional developers. Then, they act surprised when there are massive layoffs of middle managers, disguised as “organization restructuring”, following major security breaches and data center failures.
Yes
Another great one by Joe Kim. I’d love to see one of these creations being made in real time. Do any videos exist?
Imagine the pressure of trying to make a video of you drawing… or me replying to comments!
There is this talk he gave at Supercon: https://hackaday.com/2017/12/05/joe-kim-where-technology-and-art-collide/
I get it. But thanks!
All they need is the ability to run the code and check the error logs and output to see if it meets the specs.
sufficient test coverage for what you’re hoping for is impossible in any meaningful sense.
According to middle managers, testing is a waste of time. ;-)
yes, surely it’s easy to test all cases to determine if code will halt
When geeks make jokes….
There’s a paper for that. Check out “Voyager: An Open-Ended Embodied Agent with Large Language Models”. It’s an LLM that scripts itself to play Minecraft. It includes a loop of writing, debugging, and running.
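In rough outline, that loop looks something like this. This is a minimal Python sketch, not anything from the paper, and ask_llm() is a hypothetical stand-in for whatever model API or local model you actually call:

import os
import subprocess
import tempfile

def ask_llm(prompt):
    """Hypothetical stand-in for whatever LLM API or local model you use."""
    raise NotImplementedError

def generate_until_it_runs(task, max_attempts=5):
    # The write/debug/run loop: generate a script, execute it, and feed any
    # error output back to the model until it runs cleanly or we give up.
    prompt = "Write a Python script that " + task + ". Output only the code."
    for _ in range(max_attempts):
        code = ask_llm(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(["python", path], capture_output=True, text=True, timeout=60)
        os.unlink(path)
        if result.returncode == 0:
            return code  # it ran without errors; still needs a human to review it
        # Otherwise, show the model its own error output and ask for a fix.
        prompt = ("This script failed:\n" + code + "\n\nError output:\n" +
                  result.stderr + "\nFix it. Output only the code.")
    raise RuntimeError("no working script after " + str(max_attempts) + " attempts")

Even this toy version shows the catch: “runs without errors” is not the same as “meets the specs”, which is where the test coverage objection above comes in.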
I really enjoy having Copilot in my IDE, but four out of five suggestions are worthless – it’s that one-in-five that makes it useful enough to be worthwhile. The 41% error rate reported in the study suggests that the users they studied were far too trusting.
If you’re quick to recognize and discard the garbage, it’s net-positive. If you’re not, it’s not. It’s only net-positive by a small amount, so it doesn’t take much to go negative. But one-in-five of those one-in-five valid suggestions is so good that it puts a smile on my face, so I’m hooked.
I have the same experience. I’m actively comparing a couple local models hosted on my workstation with a Copilot subscription, and I get useful results often enough to keep going even with the overhead of having to read and evaluate the suggestions. Suggestions are often terrible, but hitting ESC once is a small cost to pay. When it does fill in boilerplate it can be much faster than typing it all out.
I think I get slightly better than 1/5 from Copilot, and about the same from llama 3.1 7b-instruct (q6) for code-complete, but I find the “chat and discuss how to diagnose and correct errors” to be… pretty much never worth the time with my local models. Copilot does a decent job, but by the time you have explained the problem you’ve probably also “rubber ducked” enough to get yourself the rest of the way to solving it anyway.
Hitting ESC is a small cost.
Evaluating randomly generated code is the cost.
Kernighan’s law:
https://www.laws-of-software.com/laws/kernighan/
Finding the bugs in a piece of code will take you at least twice as long as writing it yourself.
Generated code will have bugs. You will have the cost of finding them.
Which would you rather do?
Write a program
Debug a program
I’ve done enough to know that debugging somebody else’s code sucks.
You’re not wrong at all, but you’re overlooking the obvious solution: just don’t accept suggestions that you don’t understand. That’s where the “Copilot also introduced 41% more bugs” problem came from – people were accepting suggestions that they didn’t understand.
When I’m writing code, I know what I want to do. If Copilot’s suggestion does what I want, I accept it. If it doesn’t, I just keep typing. Most of the time its suggestions are so wrong that I don’t even break my stride. But like I said, the ones that work save me time, and sometimes I learn something new from it.
Don’t knock it ’til you’ve tried it. You’re just worrying about problems that don’t matter, and you’re not seeing the value that’s actually there.
It’s called job security…especially if you are a civil servant
So put another way, you have to read four lines of bad code for every passable line that the machine produces?
Read and understand 5 lines of code, reject 4 and keep 1 as possibly OK.
Sounds like a lot of work for little gain to me.
Yeah, but consider how many times you ignore the autocomplete suggestions when you’re typing on your phone. It’s kinda like that.
I can tell at a glance when Copilot is barking up the wrong tree, so I just keep typing. Once in a while it suggests what I was going to type anyway, so I accept the suggestion and move on. For repetitive stuff the probability of a usable suggestion goes way up – for example the other day I had to translate a half-dozen enum values from one API to another and it did most of the work for me.
And sometimes it suggests something better than what I had in mind, and those are the moments that are worth the subscription price.
It’s not game-changing, but it’s nice to have.
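For a sense of what that kind of repetitive translation looks like, here is a minimal Python sketch. The enum names are invented for the example and don’t come from any real API:

from enum import Enum

class OldStatus(Enum):
    OK = 0
    WARN = 1
    FAIL = 2

class NewStatus(Enum):
    SUCCESS = "success"
    WARNING = "warning"
    ERROR = "error"

# The repetitive mapping an assistant tends to fill in correctly
# once it has seen the first entry or two.
OLD_TO_NEW = {
    OldStatus.OK: NewStatus.SUCCESS,
    OldStatus.WARN: NewStatus.WARNING,
    OldStatus.FAIL: NewStatus.ERROR,
}

def translate(status):
    return OLD_TO_NEW[status]

Mechanical, low-risk code like that mapping table is exactly where a suggestion is easy to verify at a glance.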
I find Copilot really useful for writing boilerplate code.
Say you need to open a file, collect some statistics, print said statistics to some format or another. Copilot can do this for you with one or two simple prompts, and it can do it for you in whatever language you want.
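For illustration, the kind of output such a prompt tends to produce looks roughly like this minimal Python sketch; the file path and column name are placeholders you would supply:

import csv
import json
import sys
from statistics import mean, median

def summarize(path, column):
    # Read one numeric column from a CSV file and collect basic statistics.
    with open(path, newline="") as f:
        values = [float(row[column]) for row in csv.DictReader(f)]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }

if __name__ == "__main__":
    # Usage: python summarize.py data.csv column_name
    print(json.dumps(summarize(sys.argv[1], sys.argv[2]), indent=2))

It’s nothing you couldn’t type yourself, which is exactly why an assistant producing it in one go saves time.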
If you’re writing boilerplate code over and over…
write other code to make it not boilerplate
If you’re writing boilerplate over and over, you have other problems that this won’t fix
Yeah I find it useful for small scripts in languages/frameworks that I’m not familiar with. I write Windows batch scripts maybe once every 2 years … throw it in the LLM. I give it pseudo-code and it gives me DOS syntax.
Copilot just sucks, I haven’t seen any big improvement using it over just searching stuff.
BUT Cline exists now (https://github.com/cline/cline formerly claude-dev) and THAT really boosted my productivity. It can do complex tasks involving multiple files and gets a ton of context from your project, but targeted. When you ask it to perform a task, like adding a field to a REST resource, it will build a strategy, figure out which files it needs to look in, then change the files as needed and even check its work afterwards. It’s helped me a lot with common tasks that involve editing a lot of files, or with adding new features that I don’t know how to do off the bat. It does get things wrong about as often as any AI, but often I can just reprompt it on that specific file and it will correct itself.
It’s an amazing tool, and just having it do SOMETHING whenever you don’t know how to start on a task has proven immensely helpful in keeping me going instead of getting stuck on something and googling for an hour.
this looks great for when my brain is fried and I don’t have motivation, thanks.
10 USD/month is expensive for a hobbyist
Lol wut? A single round of golf in a month costs $50. I go through $10 a month in road bike tires alone – $50 tire every 2000 miles at 400 miles a month.
Those numbers aren’t actually absurd – $10 a month is a more than reasonable amount for many hobbies. But did you notice that of all the examples you could use, you managed to pick ones that are perfectly stereotypical for the, shall we say, financially secure? Who probably aren’t the lowest common denominator for hobbyists. “A single month’s bus pass costs $50 and I go through $10 a month in prepaid minutes alone” would read differently, you know?
To make it clear to anyone reading with different cultural contexts: Golf, all over the U.S., has this whole context of mediating social networking between those of power, wealth, status, etc. And there’s the little fees to rent this or enter into that, the opportunity for investment in various things, the dedication of some of this money and time on an ongoing basis, the whole social circle thing… I’m not saying that anyone has to fit the stereotype, just that it’s an ironic example to use. And then the second one is buying (if I remember my prices well) fairly decent tires at a rapid pace because of how much time you spend on something that does not earn any money. It depends on where you live as to the meaning since perhaps that’s your only means of transportation. But in the sorts of places where golf has its status symbology, spending lots of time on a leisure endurance activity like running or cycling says “white collar” more than it says “I don’t have a car”.
We don’t need more AI and we don’t need more languages. We need better training of coders. Many people who call themselves software engineers don’t know the first thing about software engineering. Calling a coder a software engineer is like calling a bricklayer an architect. Would you live in a house that some bricklayer randomly threw together? Because that is the state of most software systems today. If you want fewer bugs, train more software engineers. It is that simple. A bit of planning and foresight go a long way toward creating bug-free code that can be maintained and extended.
The ubiquity of Python in lots of open-source, and hugely distributed, projects is proof of your point.
I’d say the popularity of javascript proves the point, much like PHP in earlier years of the internet.
All large programs (> 100 lines) contain bugs. AI is a big help in finding those bugs and fixing them at the same time. Great times lie ahead, resistance is futile.
Not sure how you could misspell “writing those bugs and hiding them” as “finding those bugs and fixing them,” but otherwise, great point
“According to a recent report by code analysis firm Uplevel there are no significant benefits, while GitHub Copilot also introduced 41% more bugs.”
Headline: Hallucinating AI coding bots guarantee future human coder jobs.
So GitHub Copilot isn’t using the latest, shiniest models? If a good developer is working with the bot, a lot fewer of those minor mistakes get incorporated, because the human is working in small chunks of functionality instead of trying to get a final complex result from a single prompt.
i have seen a ton of examples of software built rigidly to a plan — “brick laying” style development — and frankly it all sucks. maybe i’m just thinking of object oriented design patterns: deeply nested hierarchies for shallow interfaces, a triad of related classes as an alternative to a single use of ‘static’ keyword, massive boilerplate of all kinds, the product “Eclipse”. so i’m skeptical of your claim but i’ve seen a very biased set of examples!
what software engineering principle do you believe in?
i love analogies :)
on the one hand, mistakes in the foundation are very difficult to fix. just like in a real house. there’s definitely a software engineering equivalent to spending hours replacing drywall just to access a junction box for 5 minutes. to me, that labor is usually re-factoring. i’m trying to make a simple change but i can’t because the actual invariants are obscured by the ‘pile of bricks’ structure of the code. so i pick up related bricks and merge or stack them, and i discard red herring bricks or at least put them in a segregated pile.
but re-factoring is possible in the software in a way that isn’t possible with bricks. i can suspend reality. i can remove the foundation and leave the building floating on air without building temporary supports. if i can find it, i can replace the junction box without touching the drywall (i can make the one-line change in the middle of the rat’s nest). i can use test-driven-development, i can watch the building fall over without losing my progress.
i think the greatest effect of the unreal nature of programming is the diversity of requirements. in brick-laying, there are a zillion possibilities but almost everyone just wants a certain optimization of land area vs wall/roof surface area vs volume. so a few patterns get awful close to optimal in almost every situation. and trying something new in bricks is expensive — if the building falls over, the entire project is a loss! in software, the field is not so well-defined. and to try something new is often just the programmer’s labor — no one has to waste a million dollars worth of concrete just to find out the flaws in an architect’s drawing on the back of a napkin. diverse constraints and cheap experimentation make it much harder to generalize the technique, and less rewarding.
so i would say, my software engineering principle is maintenance. i’ve seen a lot of different approaches to try to formalize this — prototyping mock-ups, testing, user feedback cycles — but it comes down to, you get a good result iff the actual flaws are addressed. there’s no way to avoid the flaws so far as i can tell. and the best software i’ve worked with is software that has been subject to relentless and ongoing refactoring over decades.
The problem IS the ‘trained architect’ type halfwits that don’t code (often never have or have one language…and it’s the worst possible one…JS).
Systems do need good flexible design.
Theoreticians can’t do that.
Architects are often like database ‘experts’ that go on about 24th normal form. Useless, helpless clusters of cells.
It is a convenient trash filter…But that requires clueful HR.
A species not yet discovered.
Also: Apparently ‘Architect’ is now a job title. You’ll have projects with 8 architects and 3 senior architects.
Run away!
There are MBAs about.
Yep said it before and I’ll say it again. I would rather understand what I am trying to do when debugging as opposed to trying to figure out what the AI drivel is doing so I can then figure out where it screwed up.
AI can be your buddy. Just give it a code block and ask it to explain it, for instance line by line. You can also ask it to look for bugs or security risks, and you can ask how the code could be optimized.
hahaha what if AI replaces the ubiquitous low-quality comment.
i += 4; /* add 3 to i */
If you don’t understand the code, how can you tell if the chatbot has properly explained it? You have to trust that it gives you the correct explanation – and chatbots are known to just make crap up. The words and sentences look reasonable, but the meaning is gone.
If you don’t understand the code, how will you know if the chatbot explained it properly or if it just pulled some metaphorical crap out of its metaphorical backside?
You can’t know, so just don’t do it.
If middle managers applied that rule to their engineering staff, I’d never have gotten away with anything.
Boo.
‘If you can’t dazzle with brilliance, baffle with bullshit.’ anon.
The fact is I could have explained, but the manager got more out of the bullshit.
It’s all about housetraining management.
Never hit your manager.
You don’t want them to fear your hand.
Use rolled up newspaper, say ‘bad MBA’ and whack them gently across the bridge of their nose.
After a few repetitions, just reaching for the newspaper will stop unwanted behavior.
I asked GPT-4 to write some Python code to create a simple 3D object (a sphere, I think) using SolveSpace (CAD). It spit out some lovely looking Python code, including function calls for geometry creation and an extrusion or lathe. I think it used lathe on a rectangle (rather than on a semicircle), so theoretically it might have created a cylinder rather than a sphere? The geometric mistake is not the point though. The problem is that there is no API for doing this in SolveSpace at all. All that code was a hallucination.
I was about to say “oh there’s a Python API for SolveSpace? That’s way more useful than some phony AI demo!”
I’ve had this too. The LLM is jumping the gun. I suggest asking if something is possible first. Give the LLM a chance to say “no” or “yes but”. Otherwise it’ll start writing code immediately, and then you’re stuck b/c all subsequent text is based on the previous text. Also hint at the technology that you want it to use – it’s more likely to bring up text from relevant docs than from irrelevant docs.
My example from today:
“I’m writing an Azure DevOps pipeline with YAML. The pipeline can pick from multiple repositories.
I want to make a pipeline parameter that is a complex object. It should include the repo name, default branch, and output directory. Is this possible?”
and it replies with a decent workaround
This helps to corral the LLM into the right mindset. It can still fail in a myriad of other ways though, like trying to use syntax from other languages, or features that don’t exist in a restricted subset (SOQL vs SQL).
I’m of the opinion that asking AI to help professional developers is like trying to help a race car driver drive to the grocery store.
However, for the occasional coder AI is great. It gets the gist of the thing pretty easily and tuning it to the exact requirement is much easier than writing from scratch.
So they didn’t really account for all use cases in their study.
I’m of the experience that it’s better than you think.
I’ve been in the industry for decades and I get paid absurd amounts for it. My employer offered Copilot for anybody who wanted to try it, so I tried it. After a couple/few weeks of learning to work with it, I liked it enough to buy a subscription to use at home.
I think code assistants are an XY situation (users asking for one thing when the real solution is something else). You can converse your way to a 1980s-style video game with some LLMs now, having never written a single line of Java, Python, or C yourself. As the models get better at interpreting fine distinctions in how you converse with them, this phenomenon of writing less code but getting a useful result will expand. Yes, there will probably always be better code written by humans, but the industry demonstrates over and over again that good code isn’t what pays the bills. Good-enough code is the point.
So, an assistant that sits there trying to “help” you code is just not how software engineering will become a niche skill. The underlying belief seems to be that the assistants will make us so productive that there will be less demand for us overall. However, that’s not the dynamic at all, and so it’s irrelevant that they don’t work that well for that purpose. Rather, software engineering will possibly become a niche skill because the demand for high quality hand coding will go way down, and the software engineering demand will be mostly satisfied with conversational AI that writes its own bad code, but nobody will care.
It’s analogous to horses being replaced by cars. Horses are far more ecologically sound, and they don’t cause global warming…. but nobody cared. Cars took over transportation anyway. If it happens, this is how AI will take over software engineering. It won’t be better, it’ll be ubiquitous and adequate.
Imagine replacing all of the 97000 trucks that make deliveries into New York with horses, as was done in the old days. (Source for truck traffic: https://www.msllegal.com/blog/delivery-truck-traffic-in-nyc-heavy-and-getting-heavier/)
Just for numbers, say you replace one truck with one horse (not going to be enough for the transport needed, but it’ll do to illustrate the problems.)
30 pounds of manure per day from each of 97,000 “trucks” is over 1,400 tons of horse manure a day that has to be disposed of. (Source for pounds of horse crap per day: https://cavvysavvy.tsln.com/blog/the-good-old-days-in-new-york-city/)
Unfortunately, horses don’t drop their stuff someplace convenient. They dump it wherever they are. You need an army of street sweepers to keep up. You need a way to transport the manure back out to farms where it can be used as fertilizer.
Horses are not “ecologically more sound.” They are a catastrophe in a city of any size.
Imagine every city buried under mountains of manure and flooded with rivers of urine. Is that “ecologically sound?”
(wandering off topic) Now do trucks! ALL the raw materials and energy needed to make the trucks in the first place, to keep them running (fuel, maintenance, replacement parts), the ongoing impacts of noxious by-products from engine exhaust to disposal of used oil, used tires, etc… Now add in the same costs again for every truck that needs to be eventually replaced. Imagine the pyramid of dependencies required to make a truck – raw metals, rubber, plastics, glass, glues, etc… to say nothing of the tools required – casting and forming of metals, fine machining of high-tolerance parts, molding of plastics, rubber, glass and so on.
Just because all of these externalities required for a truck to exist and function are not immediately visible to you, doesn’t mean they don’t exist.
But yes, a very visible by-product of using horses is manure. That manure is a great fertilizer for crops. Can your truck do that?
Another thing horses do – they make new horses! Without any help from humans. Can your truck do that?
Picking one metric (manure) and using that to claim horses are an ecological “catastrophe” is not a very good argument.
Sorry, I’m just soooo tired of these arguments that amount to “Technology X (whose externalities I am ignoring) is superior to technology Y whose externalities I am expounding in great detail.”
Did AI hallucinate this response? Let’s back up to the core concept here. A horse is not fit for purpose for delivery in the modern era. If you attempted to use horses to replace local delivery, you’re looking at 5-to-1 to perhaps 10-to-1 horses per truck in town, assuming horses never get tired.
Horses do not make new horses without humans while still delivering goods for humans.
All of that manure is going to go to fertilize crops used to feed the horses, and to feed the horses that carry the feed into the cities, where those horses deliver goods and take the manure back out. They aren’t doing this without the direction of humans, so no horses will survive, let alone reproduce, in a city delivering goods without humans.
A single horse living more or less wild has a smaller impact than a working truck and is self-sufficient in that environment. However, that horse is doing zero useful work.
i can’t tell whether you’re getting to the heart of jevons paradox or just skirting around the edge of it? anyways if you aren’t familiar you should look up jevons paradox…it’s on display everywhere
Sometimes it is better than search; it can find you the APIs or system you are looking for. Terrible coder, though: it could theoretically compile its own code and check the result against the pretty-looking comments, but it does not.
Gonna call BS on this one. No Copilot experience yet, but I have found GPT-3.5 to be an indispensable sidekick. Its initial attempts are pseudocode or buggy. But if you have it write the implementation and the unit tests, I’ve found it takes about 4 or 5 iterations until the tests pass. The quality of your interaction and knowledge of design patterns are critical here. Sometimes the LLM makes a weird design choice, but all you have to do is ask and it will explore other options. I’ve never been as productive in so many languages. I come from Ruby but am successfully building Python and PHP tools that are a key part of my business. None of it would have been possible without ChatGPT. I did have issues early on because it kept forgetting stuff, but then I learned you can turn on and manage memory. My strategy is to have multiple chats open so the little questions don’t clutter up the big-picture API or implementation. Sometimes I will have a fresh GPT review and optimize the code. FWIW, I’m a solo developer but it feels like I’m running a team effort. You could not pay me to go back to the old days of Google and Stack Overflow. That ship has sailed. That’s been my experience at least.
“the old days of Google and stack overflow”
you
you do realize those aren’t the old days, right?
like, we wrote software before Google
there was documentation
you read it
And now in the worst cases with documentation, you scrape all the hundreds of pages of docs into a custom bot and then it will create software specific solutions based on the documentation examples. Sometimes it will even help add in advanced functionality hidden deep within the docs.
I don’t agree that it doesn’t help productivity. I feel like in the past, time was taken up with trying to figure out which library to use to do some simple UI thing that you know exists because every other app out there does it. Now I am 4 weeks into creating a relatively complex app with all the features of a big name brand, because basic features like logging in and API management shouldn’t take 40 each to do something that is standard and has specific security best practices that can be included even if you don’t know the exact term. When you have programmers using AI without having the faintest clue what they want the program to do, and with no ability to follow the logic themselves, that is where bugs get introduced with copypasta code.
haha i don’t even disagree with your premise but citing specific security best practices as a place where AI saves you from learning or thinking too much. wow.
I feel like using Copilot is the wrong way to think about these LLMs for coding tasks. It’s like trying to make your horse cart go faster by getting a faster horse. Instead, I’ve been rethinking how I do development, and have included some LLMs for code generation, and it works wonders. Especially if I’m working on areas of code I’m not an expert at, like shaders.
I don’t use it for working on individual lines of code. I have it work with me one file/class at a time. I’ll also have an LLM generate code while I’m not in a place to write code. For example, use voice commands to have it generate a script while I’m working on level design. Instead of having to interrupt my level design, it’s doing something in tandem, and I can refine it later if I have to.