I wrote ebook-notes.el, an Emacs Lisp package, to streamline the process of importing highlights and notes from an Amazon Kindle’s “My Clippings.txt” file directly into Org mode files. It automatically handles the association of notes with their corresponding highlights and prevents the import of duplicate entries.
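For context, entries in “My Clippings.txt” look roughly like this (the exact metadata wording and date format vary by device and firmware; this sample is illustrative rather than taken from my file):

```
Example Book Title (Example Author)
- Your Highlight on page 25 | Location 380-382 | Added on Sunday, 1 January 2023 10:15:42

The text of the highlighted passage appears here.
==========
Example Book Title (Example Author)
- Your Note on page 25 | Location 382 | Added on Sunday, 1 January 2023 10:16:03

The text of an attached note appears here, as a separate entry.
==========
```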
To make life interesting, I decided to try using an LLM to “help”. I used Google’s Gemini 2.5 Flash model. Don’t judge me. This was research!
Here are my insights - the Good, the Bad and the Ugly. See the conclusion in the Overall? section.
Background - if you don’t know about Emacs
Emacs is a highly extensible, customisable and powerful text editor.
What makes Emacs different from what you might think of as an editor (Microsoft Word, Notepad, VS Code) is that the core of Emacs is an interpreter for the Emacs Lisp programming language.
Everything in Emacs is, under the hood, a program. A program inserted the letter p when I wrote that sentence; a different program runs when I paste text. You don’t need to know, or care, to use Emacs. However, this means that Emacs is a massively powerful tool for doing whatever you want with text. What I wanted to do was take the clippings from my Kindle and make them useful.
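As an aside, you can ask Emacs itself which program a key runs; with the default keybindings:

```elisp
;; With default bindings, typing the letter p runs `self-insert-command',
;; and pasting with C-y runs `yank' -- both are ordinary Lisp functions.
(key-binding "p")      ; => self-insert-command
(key-binding "\C-y")   ; => yank
```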
The Good
Starting from an idea, some sample input data and a desired output, I was able to generate a POC really quickly. I was pleasantly surprised at how well it worked. Emacs Lisp is not a very common language. You will most likely have heard of Python. You might have heard of Java, even Rust. Emacs Lisp is a niche language, so I didn’t expect much.
Why did Gemini work better than expected?
I have two hypotheses why Gemini did as well as it did. Both are consistent with each other and also stand by themselves. Both or either may also be completely wrong.
Hypothesis 1: Well written examples give good results
This is actually a much deeper point. LLMs have no way of judging the quality of the work they ingest. With the proliferation of plagiarised copy-paste slop on sites like dev[DOT]to, geeksforgeeks, substack and LinkedIn there are a lot of low quality, incorrect examples for more popular languages.
Conversely, the smaller pool of Emacs Lisp examples, combined with its almost fanatical fan base, means that poor quality or incorrect code is unlikely to be copied further, whereas high quality code is copied and discussed. (I hope this package falls closer to the latter than the former!)
Hypothesis 2: Languages with a standard are easier to learn
Languages with a standard - C, Fortran, Common Lisp - have the advantage that code written in them 30 years ago will still run today. Languages which are rapidly evolving break frequently. There have been 15 main versions of Python since 1991 and 25 versions of Java since 1996.
A compiler for the most recent C standard (C23) will still build programs written in 1990 and before. I’ve had Python programs break between versions 3.8 and 3.11.
What I suspect this means is that code and articles written for languages with a standard remain valid. Examples in evolving languages age poorly. This is not to say that learning from users and improving a programming language is a bad thing; it’s only a bad thing for models like LLMs, which consume everything without applying quality filters.
We should remember that having a standard doesn’t make a language less able than an evolving one; operating systems - how your computer actually works - are written in C. Languages with a standard have, at the very least, features that have been thought about more deeply. Common Lisp has the almost unique ability to use macros to extend the language with new features, like an object system, that are as efficient as the built-in ones.
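Emacs Lisp has the same mechanism, so here is a small sketch of the idea (the macro name is hypothetical and not part of ebook-notes.el): a macro defines a new construct that expands into ordinary code before it runs.

```elisp
;; A macro adds a new construct to the language; it expands into plain
;; Emacs Lisp at expansion time, so it is as efficient as writing the
;; expansion by hand. The name is hypothetical.
(defmacro my/with-file-contents (file &rest body)
  "Evaluate BODY in a temporary buffer containing the contents of FILE."
  `(with-temp-buffer
     (insert-file-contents ,file)
     ,@body))

;; Inspect the expansion: it is just ordinary `with-temp-buffer' code.
(macroexpand '(my/with-file-contents "My Clippings.txt" (buffer-string)))
```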
Having said all this, Emacs Lisp doesn’t have a standard. It does, however, change slowly.
Writing tests and function documentation
Gemini did really quite well at producing tests and function documentation (docstrings). This is one area where many coders do really poorly. Going forward, I think this will be a minimum expectation for my teams if they want to use assisted coding.
The tests written weren’t comprehensive and did have some mistakes, but purely from a nothing vs. something perspective it was a win.
Similarly, pasting the code of a function and asking for documentation was surprisingly helpful for the most part. The format was consistent, and it showed a pretty good understanding of the broad logic flow.
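For a sense of the boilerplate involved, here is a minimal sketch; the function and test names are hypothetical, not taken from ebook-notes.el:

```elisp
;; A docstring plus an ERT test of the kind the LLM was happy to generate.
;; `my/clipping-separator-p' is a made-up example function.
(require 'ert)

(defun my/clipping-separator-p (line)
  "Return non-nil if LINE is the separator between Kindle clippings."
  (string-prefix-p "==========" line))

(ert-deftest my/clipping-separator-p-recognises-separator ()
  "Separator lines are detected; ordinary text is not."
  (should (my/clipping-separator-p "=========="))
  (should-not (my/clipping-separator-p "Some highlighted text")))
```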
The Bad
Coding LLMs do poorly abstracting ideas
I wanted to flatten accented characters in the typesetting system LaTeX to their ASCII equivalent. That is, the way to flatten ë is to map \"e to e. The same for ä (\"a), etc.
When I asked Gemini to write this code it went through and built a series of string replacement queries, which would be slow. I decided to go with simply removing the \" instead.
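A rough sketch of the difference (the function names are mine, for illustration, and not part of ebook-notes.el; the verbose version only approximates what Gemini produced):

```elisp
;; Sketch of what the LLM produced: one replacement per accented character.
(defun my/flatten-accents-verbose (string)
  "Flatten LaTeX umlaut sequences in STRING one replacement at a time."
  (dolist (pair '(("\\\"a" . "a") ("\\\"e" . "e")
                  ("\\\"o" . "o") ("\\\"u" . "u"))
           string)
    (setq string (replace-regexp-in-string
                  (regexp-quote (car pair)) (cdr pair) string t t))))

;; What I went with instead: drop the \" accent command, keep the base letter.
(defun my/flatten-accents-simple (string)
  "Flatten STRING by removing the LaTeX umlaut command."
  (replace-regexp-in-string "\\\\\"" "" string t t))
```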
This is a general issue that I’ve seen crop up. LLMs don’t generalise concepts well. This isn’t surprising. They are random number generators with a probabilistic overlay.
If you have someone writing code for you who:
- Doesn’t understand your problem well; or
- Doesn’t understand the coding language well; and
- Suffers from Dunning-Kruger (highly correlated with #1 and #2)
then you are in for a world of hurt: slow, bug riddled, unmaintainable, convoluted code.
Extending beyond the basics or the commonplace
Once I had a working POC I decided to enhance it and resolve some of the edge cases. The LLM failed miserably at this. Asking for anything other than simple solutions to simple problems was a failure.
Why can some people create whole solutions easily? Because they’re not creating anything NEW.
“Build a login popup for Google/Apple accounts in javascript.” It’s been done numerous times before.
“Use a reactive framework to allow click-drag moving of boxes/tickets/tasks”? Same.
When I asked for something somewhat novel, the code was significantly rewritten (rather than extended), it did not achieve what I intended, and it was a mess. This leads to the next point.
Use better prompts
Now, someone will undoubtedly think “you need better prompts”. The whole concept of a Prompt Engineer or training people to write “better prompts” is ridiculous.
Learning how to write “better” prompts, and writing them, is predicated on several flawed arguments.
Models change - what is better for Gemini is likely not to be the same for ChatGPT or Claude. These models also change and (arguably) improve. The prompt you learn now for one spicy autocomplete is going to be superseded by the next model.
Writing a prompt uses natural language. Writing code uses a formal language. Where there is boilerplate, natural language prompts can help. When there is a complex concept to convey, a domain-specific (and often formal) language is needed.
Writing clear concise commands to get the computer to do something is called programming. Learn to program.
I asked Gemini to contrast the two. It didn’t do too badly.
The comparison between human languages and formal languages like mathematics or programming is often made, but it’s a bit like comparing a hammer and a screwdriver. Both are tools, but they serve different purposes.
- Formal Languages
- These are characterized by their strict syntax and semantics. They are designed for precision and logical consistency. Examples include:
- Mathematics: Every symbol and operation has a single, well-defined meaning.
- Programming Languages: Code must follow rigid rules to be executed by a computer. Ambiguity leads to errors.
- Logic: Systems of logical notation are built to eliminate ambiguity and paradox.
- Natural Languages
- These are complex, evolving systems with rich, often ambiguous, syntax and semantics. They are characterized by:
- Redundancy: Spoken language is highly redundant, with extra sounds and words that help listeners understand the message even with background noise or a slight accent.
- Pragmatics: The meaning of an utterance often depends on the social context, the speaker’s intent, and the listener’s background knowledge.
- Metaphor and Analogy: Humans rely heavily on figurative language to convey complex ideas, something that is difficult to formalise.
The Ugly
Incorrect information
I don’t like the term hallucination for the errors that come from an LLM.
A hallucination is “an experience involving the apparent perception of something not present” according to the OED.
An LLM neither experiences nor perceives anything. It’s lazy to anthropomorphise LLMs.
However, the presentation of incorrect information in the same confident language as factual information (often called hallucination!) is a known issue with LLMs. I would argue it’s not actually an issue with the LLMs themselves but with our interpretation of the output of these random number generators - but that’s a different discussion…
Incorrect information is, unsurprisingly, a problem with generated elisp code. Be it:
- missing brackets(!)
- incorrect use of built-in functions (like the argument order of puthash - see the sketch after this list)
- trouble with regular expressions (however, who doesn’t?)
- knowing when to use cl-flet and cl-letf
- errors where functions are made up, examples are given, and advanced usage and common pitfalls are all confidently presented. When challenged the response was “I’ve hallucinated the name of a function that would be intuitive but is not actually implemented.”
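For reference, the built-in signature is (puthash KEY VALUE TABLE), and the rough distinction is that cl-flet defines a new local function while cl-letf temporarily overrides an existing binding. A small sketch:

```elisp
;; Correct argument order: (puthash KEY VALUE TABLE).
(let ((table (make-hash-table :test #'equal)))
  (puthash "title" "Dune" table)   ; key, then value, then table
  (gethash "title" table))         ; => "Dune"

(require 'cl-lib)

;; `cl-flet' introduces a new local function for the body...
(cl-flet ((shout (s) (upcase s)))
  (shout "hello"))                 ; => "HELLO"

;; ...while `cl-letf' temporarily replaces an existing binding (here the
;; function cell of `message') and restores it afterwards.
(cl-letf (((symbol-function 'message) (lambda (&rest _) 'silenced)))
  (message "this is swallowed"))   ; => silenced
```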
However, most of this can be caught through tests.
Overall?
I’d probably use the LLM again to generate a quick starting point for some code and I’d definitely use it to generate the starting boilerplate for tests and function docstrings.
For generating small, self-contained functions, the output of an LLM doesn’t seem too bad.
However, for anything that is slightly complicated, novel or mission critical, where you need to ensure accuracy, safety or security, there is no substitute for the craft of coding.