Measuring The Impact Of LLMs On Experienced Developer Productivity

Recently AI risk and benefit evaluation company METR ran a randomized control test (RCT) on a gaggle of experienced open source developers to gain objective data on how the use of LLMs affects their productivity. Their findings were that using LLM-based tools like Cursor Pro with Claude 3.5/3.7 Sonnet reduced productivity by about 19%, with the full study by [Joel Becker] et al. available as PDF.

This study was also intended to establish a methodology to assess the impact from introducing LLM-based tools in software development. In the RCT, 16 experienced open source software developers were given 246 tasks, after which their effective performance was evaluated.

A large focus of the methodology was on creating realistic scenarios instead of using canned benchmarks. This included adding features to code, bug fixes and refactoring, much as they would do in the work on their respective open source projects. The observed increase in the time it took to complete tasks with the LLM’s assistance was found to be likely due to a range of factors, including over-optimism about the LLM tool capabilities, LLMs interfering with existing knowledge on the codebase, poor LLM performance on large codebases, low reliability of the generated code and the LLM doing very poorly on using tacit knowledge and context.

Although METR suggests that this poor showing may improve over time, it seems fair to argue whether LLM coding tools are at all a useful coding partner.

AI Is Only Coming For Fun Jobs

In the past few years, what marketers and venture capital firms term “artificial intelligence” but is more often an advanced predictive text model of some sort has started taking people’s jobs and threatening others. But not tedious jobs that society might like to have automated away in the first place. These AI tools have generally been taking rewarding or enjoyable jobs like artist, author, filmmaker, programmer, and composer. This project from a research team might soon be able to add astronaut to that list.

The team was working within the confines of the Kerbal Space Program Differential Game Challenge, an open-source plugin from MIT that allows developers to test various algorithms and artificial intelligences in simulated spacecraft situations. Generally, purpose-built models are used here with many rounds of refinement and testing, but since this process can be time consuming and costly the researchers on this team decided to hand over control to ChatGPT with only limited instructions. A translation layer built by the researchers allows generated text to be converted to spacecraft controls.

We’ll note that, at least as of right now, large language models haven’t taken the jobs of any actual astronauts yet. The game challenge is generally meant for non-manned spacecraft like orbital satellites which often need to make their own decisions to maintain orbits and avoid obstacles. This specific model was able to place second in a recent competition as well, although we’ll keep rooting for humans in certain situations like these.