Introduction

First I’ll say that this entire blog post was 100% written by a human without AI assistance. The code we’re going to talk about, however, is another story.

The recent explosion of quality open-source Large Language Models you can run at home got me interested in the prospect of leveraging them to write code. I’ve been writing code myself for over 30 years, so it seemed like a problem for which I might have some intuition. It’s common knowledge now that leveraging GPT-4 to write code is very effective, but can it be done at home entirely offline?

With heavy use of GPT-4, I set to work on the supercharger project, aiming to find and implement new techniques that make automatic code generation succeed more often. I was successful both at automatically generating code and at finding some new ways to improve the rate of success.

Preliminaries: What are LLMs? What is prompt engineering?

Large Language Models are just auto-complete engines. To use them to write code, you need to provide a starting point. The starting point is a template, which presents a conversation between two people. The first person is the user, and the second person is the computer. The user asks the computer to do something, and the computer responds with a piece of code that does the job.

Writing these prompts is called prompt engineering. An effective strategy is to start with a narrator describing the situation, and providing examples of back and forth conversation. The prompt ends with the LLM poised to write the first line of code.

Here’s an example narration line:

The following is a Python conversation between Human and Coder. Human and Coder take turns chatting. Coder always considers responses carefully and thinks step by step before answering. Coder always writes syntactically correct Python code.

This is followed by several back and forth examples such as:

[|User|]: Write a function that adds two numbers.
[|Coder|]: ```
def add_numbers(x, y):
    return x + y
```

It ends with:

[|Coder|]:

And then the LLM is prompted to auto-complete the line.
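
To make this concrete, here is a rough sketch of how such a prompt might be assembled in Python. The narration, example, and marker strings here are illustrative stand-ins, not the actual supercharger templates:

    # Illustrative prompt assembly; strings below are made-up stand-ins,
    # not the real supercharger templates.
    NARRATION = (
        "The following is a Python conversation between Human and Coder. "
        "Coder always writes syntactically correct Python code."
    )

    # One worked example of a [|User|]/[|Coder|] exchange, with the code fenced in ```
    EXAMPLE = (
        "[|User|]: Write a function that adds two numbers.\n"
        "[|Coder|]: ```\n"
        "def add_numbers(x, y):\n"
        "    return x + y\n"
        "```"
    )

    def build_prompt(request: str) -> str:
        # End with "[|Coder|]:" and no trailing space, so the model is poised
        # to complete the code turn.
        return f"{NARRATION}\n\n{EXAMPLE}\n\n[|User|]: {request}\n[|Coder|]:"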

What kind of hardware do you need to run LLMs?

I have found there to be a very large difference between the quality of code written by 7B, 13B, and 30B models. The 30B models are the best, but they are also the slowest. To run them in real-time I’m using the Baize-30B model running with model parallelism across two RTX4090 GPUs in 8-bit mode. Since I’ve been doing a lot of ML lately, I already had the hardware. Here’s my GPU cluster, which is 4 machines each with 2x 4090 GPUs:

[Photo: the GPU cluster]

I’ve also built 3x3090 GPU machines, and found that the training throughput of the 2x4090 machines is 2x higher, so I’d recommend anyone building systems for this to seriously consider the 4090 GPUs.

That being said, you can also run this same stuff on a single CPU - it will just take longer. The Jetson Orin 64 GB platform is also interesting because it has a large pool of unified memory shared with the GPU, so it can also run 30B models in a small form factor, just much more slowly.

Some of the ideas in this blog post work fine without a lot of compute power, so don’t worry if you don’t have a lot of hardware. The supercharger project is currently designed for 2x3090 or 2x4090 GPUs per node, but it could be trivially adapted to run smaller models with just one GPU if there is more interest.

Code Generating Tricks and Techniques

I’ve identified 6 techniques that can be used to improve the quality of code generated by a language model:

  1. Improve the prompt used to generate the code.
  2. Parse the output of the LLM and clean it up.
  3. Ask the LLM to improve code that it previously generated.
  4. Generate unit tests for the code and run them to verify that the generated code works.
  5. Ask the LLM to score the quality of code + unit test, and do not stop until the quality hits a threshold.
  6. Check all pairs of generated code and unit tests to improve the odds of success (birthday attack it!).

Technique 1: Improving the Prompt

For the supercharger project, I wrote and tested several different prompt templates to work towards auto-generating code.

  • Generate code that satisfies a given function prototype and description.
  • Generate a unit test that would test a given function prototype and description.
  • Improve code that was previously written.
  • Improve a unit test that was previously written.
  • Judge code quality on a scale from 0..1.
  • Judge code+test pair quality on a scale from 0..1.

These are available here: Supercharger Ask Templates

Some key prompt details that helped qualitatively:

  • Use GPT-4 interactions as examples.
  • Mention Python several times.
  • Surround example code in the ``` fences commonly used in markdown, so the code in the LLM’s output is easier to parse out.
  • Use [|Coder|] to refer to the LLM.
  • “Coder always writes syntactically correct Python”
  • “Think step by step”
  • “After careful consideration”
  • Give it about 2 examples.
  • Do not add a space after the colon in the final [|Coder|] prompt.

When running the LLM, you can specify a temperature for the LLM to use. A higher temperature means the LLM will be more creative, but also more likely to make mistakes. A lower temperature means the LLM will be more conservative, but also more likely to repeat itself. Repetitions are particularly bad because they are hard to detect and correct, so when generating code it’s a bad idea to use temperature = 0. I did not fine-tune the temperature and simply picked a default value of 1.0 since we are going to be generating a lot of candidates and can afford to make some mistakes. However, when judging code quality, I used a temperature of 0 because we want code judgements to be less “random” and we do not care if the LLM repeats itself for judging purposes since we only take the first few tokens.
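
As a rough illustration of those two settings, here is how they might look with a Hugging Face transformers model (assuming `model`, `tokenizer`, and `prompt` are already loaded; this is a sketch, not supercharger’s actual inference code):

    # Assumes `model`, `tokenizer`, and `prompt` already exist.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generating candidates: sample at temperature 1.0 for more creative output
    code_ids = model.generate(**inputs, do_sample=True, temperature=1.0,
                              max_new_tokens=512)

    # Judging: greedy decoding (effectively temperature 0), and only a few
    # tokens are needed since we just parse the score off the front
    judge_ids = model.generate(**inputs, do_sample=False, max_new_tokens=8)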

Technique 2: Parsing the Output of the LLM

There are a few challenges with using the output of the LLM directly.

The output of an LLM tends to be conversational, mixing English sentences with code for example. This adds some additional challenge to parsing the output. By prompting the LLM to generate code surrounded with ``` quotes, it makes it easier to parse out the code from any text around it.

LLMs tend to make typos like forgetting colons, or forgetting to indent code.

I evaluated two approaches to solve this problem. First I tried using regex to parse the code generated by the LLM. Unfortunately, while it is pleasantly first-principles, it was also error-prone in practice because Python’s syntax rules are more subtle than you might expect. For example, consider the following code:

    def bar():
    if a:
        if b:
            return 1
    else:
        return 0

The first indentation error is obvious because at least one statement must be inside the def, but it may be equally valid not to indent the remaining lines. Just keeping track of “intentional” indents or de-indents doesn’t yield the right answer, and it becomes very complicated to think through all the cases.

So, I came up with a really nice hack using the ast module, which can parse Python code and throws a syntax error at the first issue it finds. My solution was to interpret the syntax errors ast throws and then brute force a bunch of variations of the code until the offending line parses correctly, or at least produces a new error. For example, for an “unexpected indent” error, the fix_ast_error module in supercharger will try one-space indents up to 8 characters, and then one-space de-indents up to 8 characters, until the error goes away. Interestingly this also seems to work for missing colons, parentheses, and other syntax errors.

If nothing works, just deleting the error line is not a bad fallback since it might be something silly.
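
Here is a minimal sketch of that idea (illustrative only; the real fix_ast_error module in supercharger handles more cases):

    import ast

    def fix_syntax_errors(source: str, max_rounds: int = 20) -> str:
        # Sketch of the idea: let ast.parse locate the first syntax error,
        # then brute force re-indenting the offending line (or deleting it)
        # until the code parses, or the error at least moves past that line.
        lines = source.splitlines()
        for _ in range(max_rounds):
            try:
                ast.parse("\n".join(lines))
                return "\n".join(lines)  # parses cleanly, we're done
            except SyntaxError as err:
                bad = (err.lineno or 1) - 1
                stripped = lines[bad].lstrip()
                for indent in range(0, 9):  # try indents from 0 to 8 spaces
                    trial = lines[:bad] + [" " * indent + stripped] + lines[bad + 1:]
                    try:
                        ast.parse("\n".join(trial))
                        lines = trial
                        break
                    except SyntaxError as err2:
                        # also accept a fix that pushes the error further down
                        if (err2.lineno or 0) - 1 > bad:
                            lines = trial
                            break
                else:
                    # nothing helped: drop the offending line as a last resort
                    lines = lines[:bad] + lines[bad + 1:]
        return "\n".join(lines)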

I added a few other code cleanup heuristics and wrote thorough unit tests to make sure the code cleanup works pretty robustly. Feel free to use this submodule or any others in your project if you’d find this useful!

The overall code cleanup workflow I ended up with is:

  1. Extract code from the first markdown block (a sketch of this step follows the list).
  2. Fix errors using the ast walking approach mentioned above.
  3. Strip any globals and some imports, since we are only expecting to generate functions.
  4. Strip any leading comments since we want to use our own comments to prompt the LLM.
  5. Run the code through autoimport to add any missing imports.
  6. Format the code using Google yapf to make it look nice after all the wild indenting hacks etc.
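
For step 1, a minimal sketch of the markdown extraction might look like this (the regex is illustrative, not the exact one supercharger uses):

    import re

    def extract_first_code_block(reply: str) -> str:
        # Grab the contents of the first ``` fenced block in the LLM's reply,
        # tolerating a missing closing fence; fall back to the whole reply.
        match = re.search(r"```(?:python)?\s*\n(.*?)(?:```|$)", reply, re.DOTALL)
        return match.group(1).strip() if match else reply.strip()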

Technique 3: Asking the LLM to Improve Code

This one I went through a few iterations on. Initially I asked the LLM to generate a list of improvements, and then asked the LLM to incorporate the improvements. This worked, but it was slow (required 2x calls instead of 1) and the LLM would often get stuck in a loop of generating the same improvement over and over again.

Instead, I found that it was much more effective to ask the LLM to generate a single improvement, and then ask the LLM to incorporate that improvement. This was much faster, and interestingly the LLM would fairly consistently generate a new improvement each time.

Interestingly, it’s very common that just asking for an improved version of the code succeeds.

For supercharger I alternate between generating novel code (relying on higher temperature to come up with creative new solutions) and improving existing code.
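
A rough sketch of that alternation (the `ask_llm` call here is a stand-in for rendering one of the ask templates and running the model; this is not supercharger’s actual scheduler):

    import random

    def next_candidate(ask_llm, task, existing_candidates):
        # Flip a coin between writing fresh code at high temperature and
        # asking for a single improvement to a previously generated candidate.
        if existing_candidates and random.random() < 0.5:
            base = random.choice(existing_candidates)
            return ask_llm("improve_code", task=task, code=base)
        return ask_llm("write_code", task=task)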

Technique 4: Generating Unit Tests

This one is a bit tricky because to use the unit tests, you have to actually run pytest against the generated code on your computer, which opens up security problems where the script could try to erase your filesystem or something. So, running the code inside some kind of sandbox is a good idea.

I came up with the idea of using docker for the sandboxing aspect of the unit test execution, and it works great. The trick is to create a docker container based on the Python 3.10 image, which mounts a single local scratch folder for input/output and runs bash -c 'while true; do sleep 1; done' to keep the container alive. Then to run tests, I just use container.exec_run to run pytest inside the container. This was pretty slick because it’s several times faster than restarting the container for each test.

The other issue is that the Python code might require some packages from pip in order to execute. To solve that, I used pipreqs to generate a requirements.txt file and install it prior to running a batch of tests.

Finally, the generated code could run in an infinite loop, so setting a timeout is important. I found that 10 seconds was a good timeout for most code.
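
Here is a sketch of that setup using the docker Python SDK (the host path and test filename are examples, and the coreutils `timeout` wrapper is one simple way to enforce the 10-second limit):

    import docker

    client = docker.from_env()

    # One long-lived sandbox container based on the python:3.10 image, with a
    # local scratch folder mounted for input/output (host path is an example)
    container = client.containers.run(
        "python:3.10",
        ["bash", "-c", "while true; do sleep 1; done"],
        detach=True,
        volumes={"/home/user/scratch": {"bind": "/workspace", "mode": "rw"}},
    )

    # Install the requirements generated by pipreqs, then run pytest inside
    # the container, killing it after 10 seconds if it hangs
    container.exec_run("pip install -r /workspace/requirements.txt")
    exit_code, output = container.exec_run("timeout 10 pytest /workspace")
    print(exit_code, output.decode())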

Technique 5: Asking the LLM to Score Code Quality

But even if the tests pass, maybe the tests were not adequate?

This is one of the ideas that should be explored more. I found that it was very effective to ask the LLM to score the quality of the code, and then to ask the LLM to improve the code until the quality score was above a threshold.

Also interestingly, it only takes about 2-3 seconds on my hardware to score a piece of code, even if it’s fairly long. So, a key takeaway from this blog post is that you can use even very large models to score code quality in real-time.
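
A sketch of what the scoring call can look like (the `generate` function here is a stand-in for whatever inference wrapper you use; supercharger’s actual judging code differs in the details):

    import re

    def judge_quality(generate, judge_prompt: str) -> float:
        # Ask the judge template for a 0..1 score at temperature 0 and parse
        # the first number off the front of the reply
        reply = generate(judge_prompt, temperature=0.0, max_new_tokens=8)
        match = re.search(r"\d*\.?\d+", reply)
        if not match:
            return 0.0
        return min(max(float(match.group()), 0.0), 1.0)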

In fact an interesting project would be to just focus on this aspect of the LLMs to score the quality of functions in a code-base to try to identify functions that could be improved in isolation. I’ve decided that this should be my next project after supercharger is done. Imagine using such a program to score every function in the Linux kernel, and using it to zero in on functions that may be chained together to create security vulnerabilities.

Technique 6: Checking All Pairs of Code and Test

This one was a bit fun for me, because I was able to pull out my old friend: The Birthday Attack (aka Paradox)! This is fun to explain so indulge me for a moment.

Imagine that you’re in a room of 23 people and you want to find out whether any two people in the room share the same birthday. You can do this by asking each person in the room for their birthday, and then checking whether anyone else in the room has the same birthday. If you do this, you will find that there is about a 50% chance of finding two people with the same birthday. But wait, how is that possible? There are only 365 days in a year, so the odds of two people sharing a birthday should be 1/365, right? Well, the trick is that you’re not just checking two people, you’re checking 23 * 22 / 2 = 253 pairs of people, and the number of pairs grows with the square of the number of people.
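
You can check the number yourself:

    from math import prod

    # Probability that at least two of 23 people share a birthday
    p = 1 - prod((365 - k) / 365 for k in range(23))
    print(round(p, 3))  # ~0.507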

If you’re with me, then you can see how this applies to code generation. If you have a bunch of generated code and unit tests, the odds that any particular code/test pair passes and is of good quality are low, but if you check all pairs, the odds of finding at least one good pair are much higher.

So, even if the code and unit tests do not pass at first, if we keep old attempts around and check them against new attempts, we can improve the odds of success.
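
A back-of-the-envelope way to see the effect (assuming each pairing succeeds independently, which is optimistic, but it shows why keeping old attempts around pays off):

    def p_any_pair_works(p_single: float, n_codes: int, n_tests: int) -> float:
        # Chance that at least one of the n_codes * n_tests pairings both
        # passes and is judged good, given a per-pair success probability
        return 1 - (1 - p_single) ** (n_codes * n_tests)

    # e.g. a 10% per-pair success rate with 5 codes and 5 tests:
    print(p_any_pair_works(0.10, 5, 5))  # ~0.93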

Results

Well after all that… it works! Here’s an example of having it design a Python function based on a description and a prototype:

(supercharger) ➜  supercharger git:(main) ✗ python codegen/codegen.py
INFO:root:Input comments: # A function that calculates the factorial of a given non-negative integer
INFO:root:Function prototype: def factorial(n):
INFO:root:Function name: factorial
INFO:root:Setting up VM...
INFO:root:Starting LLM workers...
Deleted sources/factorial/test_factorial_0.py
INFO:root:Work queue empty and detected only 0/4 workers active.  Adding job...
INFO:root:Adding a job to write more tests (tests asked/completed=0/0, codes asked/completed=0/0)
INFO:root:Worker idle... (2 seconds)
INFO:root:Worker idle... (2 seconds)
INFO:root:Worker idle... (2 seconds)
INFO:root:Work queue empty and detected only 1/4 workers active.  Adding job...
INFO:root:Adding a job to write more code (tests asked/completed=0/0, codes asked/completed=1/0)
INFO:root:Work queue empty and detected only 2/4 workers active.  Adding job...
INFO:root:Adding a job to write more tests (tests asked/completed=1/0, codes asked/completed=1/0)
INFO:root:Worker idle... (2 seconds)
INFO:root:Work queue empty and detected only 3/4 workers active.  Adding job...
INFO:root:Adding a job to write more code (tests asked/completed=1/0, codes asked/completed=2/0)
INFO:root:Work queue depth = 0 active workers = 4/4
...
INFO:root:Work queue depth = 0 active workers = 4/4
INFO:root:Generated code len=187 in 22.965782165527344 seconds, with score 0.9 (scored in 1.946894884109497 seconds)
Task ID 1: Generated code (improved=False) with score 0.9 and len=187
INFO:root:Adding a job to improve the code with self-reflection
INFO:root:Work queue depth = 1 active workers = 3/4
...
INFO:root:Work queue depth = 0 active workers = 4/4
INFO:root:Generated test len=246 in 28.84612274169922 seconds
Task ID 2: Generated test (improved=False) len=246
INFO:root:Test passed: code 1 <-> test 2 - Asking judge if we are done
INFO:root:Adding a job to improve the test with self-reflection
INFO:root:Work queue depth = 2 active workers = 3/4
INFO:root:Generated test len=307 in 32.69654178619385 seconds
Task ID 0: Generated test (improved=False) len=307
INFO:root:Test passed: code 1 <-> test 0 - Asking judge if we are done
INFO:root:Adding a job to improve the test with self-reflection
INFO:root:Work queue depth = 2 active workers = 4/4
...
INFO:root:Work queue depth = 2 active workers = 4/4
INFO:root:Generated code len=168 in 68.91890406608582 seconds, with score 0.9 (scored in 1.9692518711090088 seconds)
Task ID 3: Generated code (improved=False) with score 0.9 and len=168
INFO:root:Test passed: code 3 <-> test 2 - Asking judge if we are done
INFO:root:Test passed: code 3 <-> test 0 - Asking judge if we are done
INFO:root:Adding a job to improve the code with self-reflection
INFO:root:Work queue depth = 4 active workers = 4/4
INFO:root:Judged code/test pair with score 1.0 in 42.62114644050598 seconds
Task 5 complete: Judged pair code=1 test=2 with score=1.0
INFO:root:Work queue depth = 3 active workers = 4/4
INFO:root:Found a good code/test pair: code=1 test=2 score=1.0
INFO:root:Wrote final code and test to disk. Exiting...

Note that this is doing one LLM query per node in the GPU cluster of 4 nodes, so it can be generating 2 candidate codes and 2 candidate tests simultaneously.

Generated Code

# A function that calculates the factorial of a given non-negative integer
def factorial(n: int) -> int:
    result = 1
    for i in range(1, n + 1):
        result *= i
    return result
import pytest
from factorial import factorial
def test_factorial():
    assert factorial(1) == 1
    assert factorial(2) == 2
    assert factorial(3) == 6
    assert factorial(4) == 24
    assert factorial(5) == 120
    assert factorial(6) == 720

When asked “What’s your opinion of this code and unit test?”, GPT-4 has this to say about the code:

The code implementation for the factorial function is good. It's an iterative approach, which can be more efficient in terms of memory usage compared to the recursive version. It's also more suitable for larger inputs as it does not have the risk of reaching the recursion limit.

Regarding the unit test, it is also good but has some room for improvement:

  • Add a test case for the base case (0), which is missing in the current test cases. The factorial of 0 is defined to be 1.
  • Add a test case for negative numbers to ensure the function behaves correctly with invalid input. The current implementation does not handle negative numbers, and ideally, it should raise an error in such cases.
  • Add more test cases for larger numbers to ensure the function works correctly for a wider range of input values.
With these improvements, the test cases would be more comprehensive and cover a wider range of scenarios.

Check out the code at https://github.com/catid/supercharger/ to see how it works in more detail.

Future work

I ran out of time to implement everything I had in mind, but here are some ideas for future work:

  • Add a planning module that breaks up a problem into several functions and generates code for each function.
  • Read the output of unit testing and use it to refine the code/tests.
  • Fine-tune the temperature, context-length, and max-tokens parameters to improve success rate.
  • Check if we can use smaller, faster models to improve code generation speed.
  • Use the OpenAI API for some of the tasks, as a hybrid of free and paid models.