Don't Finish My Sandwiches

When they first came along, I was sceptical about the merits of using an AI (specifically, a Large Language Model) to help you code. However, the world doesn’t stand still, especially the world of technology, and it’s been a busy two and a half years. Coding assistants have become more sophisticated and widespread, and for a while I’ve been dipping my toe into the water and trying them out myself. As such, I thought it was time for an update.
My assistant of choice is GitHub Copilot, not through any sophisticated evaluation process, but purely through convenience. As a user of both GitHub and VS Code, I’ve reached the point where it’s more effort not to try it out in some capacity. I’ve been doing so on and off for a few months, on small tasks.
One thing that quickly struck me was poor performance in the feature most people would first associate with coding assistants: autocomplete. As with many things LLM, it produces results that at first appear beguiling, verging on the miraculous, but fall apart on closer inspection.
To illustrate, here’s a recent example from my book log (which I happen to edit in VS Code). First, an example of it producing a result that genuinely impressed me (the final line is the autocomplete suggestion):
- title: "All Systems Red: The Murderbot Diaries" author: Martha Wells url: https://www.marthawells.com/murderbot1.htm start: 2025-01-28 end: 2025-01-30 - title: "Artificial Condition: The Murderbot Diaries"
Here, Copilot has suggested the title of the next book I’m reading based on the data in the previous entry. Not only has it formatted it correctly, it’s also factually accurate: “Artificial Condition” is indeed the next entry in Martha Wells’s excellent Murderbot series. It’s possible that someone else keeps a log of the books they’ve read in a YAML file, using the same keys and structure as me, is also reading their way through The Murderbot Diaries, and that Copilot is just regurgitating an item from its training set, but that seems unlikely. More likely, it’s actually working as advertised, weaving together several different patterns (YAML syntax, properties of books, and the sequence of titles in that particular series) in a complex, multi-dimensional way. This is a great demonstration of how LLMs go way beyond simpler statistical techniques.
It turned out that in this case the prediction, while entirely reasonable, was incorrect, as I wasn’t reading the next Murderbot book immediately (I like to spread these things out). No big deal: even if autocomplete doesn’t get it right every time, it’s still useful, and I can just ignore the suggestion and be no worse off. However, when I supplied the title of the next book I was reading, the suggestion was somewhat less helpful:
- title: "Monsters: What Do We Do with Great Art by Bad People?" author: "Claire Dederer" url: https://www.penguinrandomhouse.com/books/669579/monsters-by-claire-dederer/
If Copilot had done what it appeared to do, and found a URL for a site about the book in question, it would have been at least as impressive as the previous example, if not more so. Fortunately, I had sufficient remaining scepticism to check the link, and confirmed that this was not the case. Instead, it produced something that looks like the right URL (this is the format of URLs used by Penguin), with the title and author inserted in the way you’d expect them to appear. However, all of this was done without any reference to what’s on the other end of that link; the sequence of digits is essentially random, and happens to resolve to “The Count of Monte Cristo”.
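(This kind of check is easy to script, incidentally. Here’s a minimal sketch, assuming the third-party requests library is available: it fetches the suggested URL and reports what the page actually calls itself, rather than trusting the slug in the link.)

```python
# Fetch a suggested URL and report the page's own <title>, as a crude check
# that the link actually points at the book it appears to.
import re
import requests  # third-party; pip install requests

url = "https://www.penguinrandomhouse.com/books/669579/monsters-by-claire-dederer/"
response = requests.get(url, timeout=10)
match = re.search(r"<title>(.*?)</title>", response.text, re.IGNORECASE | re.DOTALL)
print(response.status_code, match.group(1).strip() if match else "no <title> found")
```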
This is absolutely the way you’d expect an LLM to fail: in their basic form, the way they work is to produce a probable (and thus plausible) continuation of their input, with no reference to external reality. There’s some correlation between this probability and truth, but the correspondence is far from complete. More recent techniques ground the LLM with factual information in various ways, but as this example (by no means unusual) shows, the gap remains.
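To see “probable continuation” in miniature, here’s a toy sketch. A bigram model is a drastic simplification of an LLM, of course, but it makes the point: the output is chosen because it’s likely, not because it’s true.

```python
# A toy "language model": pick the most probable next word given only the
# previous one. Fluent-looking output, with no notion of external reality.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat ate the fish".split()
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

word, output = "the", ["the"]
for _ in range(6):
    word = bigrams[word].most_common(1)[0][0]  # most probable continuation
    output.append(word)
print(" ".join(output))  # -> "the cat sat on the cat sat"
```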
The above example is perhaps unfair, as it’s not really the kind of coding task that Copilot is targeted at. However, in my experience, using Copilot’s autocomplete while programming is worse. The kinds of errors that turn up are the same, but they have the potential to be both subtler and more widespread in their impact. More damningly, LLM-based autocompletion compares very poorly with what it’s replacing.
I clearly remember code completion coming into the mainstream with the advent of IntelliSense in the late 90s. Those of us who were already used to programming without it derided the idea as wasting more time than it saved, and we had a point: the predictions were often wrong, either in absolute terms or in being non-idiomatic and overly verbose. However, Microsoft and many others didn’t stand still, and the quality of these systems steadily improved. Fast forward twenty-five years, and traditional (non-LLM) code completion is astounding: fast, reliable and idiomatic. Advances in areas like type systems and static analysis (even in dynamically typed languages like Python, and in JavaScript via TypeScript) mean that suggestions are not only correct but informative. While it doesn’t obviate the need for documentation or understanding, code completion can serve as a valuable aide-mémoire, avoiding the need to look up minor details and giving you the confidence that your recollection is correct.
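To make that concrete, here’s a hypothetical example (the function and its type hints are invented for illustration). A static language server reads the parameter list straight off the definition, so everything it suggests is guaranteed to exist:

```python
# Completion grounded in static analysis: suggestions come from the actual
# signature below, not from a prediction of what that signature might be.
def add_entry(title: str, author: str, url: str, start: str | None = None) -> dict:
    """Build a single book-log entry (a made-up helper for illustration)."""
    return {"title": title, "author": author, "url": url, "start": start}

# Typing `add_entry(` in an editor backed by a static analyser lists exactly
# these four parameters, with their types and defaults; nothing is invented.
print(add_entry("All Systems Red", "Martha Wells",
                "https://www.marthawells.com/murderbot1.htm"))
```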
Replacing the traditional, structured approach with an LLM completely undermines this benefit. If it offers a list of named parameters for a function I’m calling, I can no longer view that as a reference, as it might have made some of them up. It will, of course, seem plausible, but that makes it more dangerous, not less. After a few days of trying to do actual programming work with Copilot completions, I turned them off and went back to the old kind. I’ve been very happy with that decision.
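To illustrate the failure mode (a hypothetical example, not actual Copilot output): the call below looks perfectly idiomatic, but the create_missing parameter is invented, and nothing tells you that until it blows up at runtime.

```python
# `open()` and its first two arguments are real; `create_missing` is exactly
# the kind of plausible-but-fictional keyword an LLM completion can produce.
try:
    handle = open("booklog.yaml", "r", create_missing=True)
except TypeError as error:
    print(error)  # 'create_missing' is an invalid keyword argument for open()
```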
Does that mean that, for me, Copilot is a bust? Not at all. There’s another mode of interaction that feels like a far better fit to the strengths and limitations of generative AI, and the more I use it the more I’m convinced that it could live up to the hype. I’ve become a big fan of Copilot Chat.
Here, instead of code completion, you interact with Copilot in a separate dialogue in a sidebar. “Chat” is absolutely the right name, as it looks and feels like a text conversation with a human being (and, absurd as it may sound, treating it as such may well yield better results). You can ask it how to perform a task or use an API, and it will produce a lucid, coherent answer with examples. Because it’s running inside the IDE, it has access to your own code as context; this is evident both in the incidentals (variable names and so on) and in the fact that it incorporates relevant information not explicitly stated in the question, such as the libraries you’re already using.
Crucially, this isn’t a one-shot process. You can ask questions about the results, ask for changes or clarifications, and even correct mistakes. This last point is, I think, the one that makes it feel like a better interface to Copilot. Today’s LLM-based systems are prone to hallucinations, and it may well be that this is a fundamental property that will never go away. With autocomplete, these render individual suggestions useless and the system as a whole untrustworthy. In the context of a chat, you still need to be aware of them, but there’s a way forward with that awareness.
The first attempt to exploit a new technology is usually to use it as a drop-in replacement for an existing application. Ben Thompson frequently cites the example of advertising on the Internet, which initially aped print advertising in placing display ads alongside content. In most cases, it’s a poor substitute, and the new technology only really takes off with the advent of a different approach that plays to the strengths of the new medium (in that example, feed and search ads).
Of course, it’s not a given that such an approach will be found; most technological developments turn out to have little or no long-term impact, whatever their boosters may think. For a while, I thought generative AI might fall into that camp, and my experience with code completion backed that theory. However, taking a step back, I’m now coming round to the idea that this is just the technology being misapplied. Chat interfaces may or may not turn out to be the best way to take advantage of the capabilities of LLMs, but they’re enough to suggest that there’s some there there.
The header image is a combination of photos by Sara Cervera (sandwich) and Matt Artz (spanner) from Unsplash.