April 5th, 2023 @ justine's web page

Edge AI Just Got Faster

When Meta released LLaMA back in February, many of us were excited to see a high-quality Large Language Model (LLM) become available for public access. Many of us who signed up, however, had difficulty getting LLaMA to run on our edge devices and personal computers. One month ago, Georgi Gerganov started the llama.cpp project to provide a solution to this, and since then his project has been one of the hottest things on GitHub, having earned itself 19k stars. I spent the last few weeks volunteering for this project, and I've got some great news to share about its recent progress.

We modified llama.cpp to load weights using mmap() instead of C++ standard I/O. That enabled us to load LLaMA 100x faster using half as much memory. Our changes have just been made available in the latest release. The benefits are as follows:

More Processes
You can now run multiple LLaMA processes simultaneously on your computer. Here's a video of Georgi having a conversation with four chatbots powered by four independent llama.cpp processes running on the same Mac. So llama.cpp is not only going to be a better friend to you, it can also serve as your artificial circle of friends. The trick that makes it possible is that mmap() lets us map the read-only weights using MAP_SHARED, which is the same technique that's traditionally been used for loading executable software. So we figured, why aren't we using it to load neural network software too? Now we can.
Bigger Models
It's now safe to load models that are 2x larger without compromising system stability. Meta gave us the LLaMA models 7B, 13B, 30B, and 65B, where bigger numbers usually mean better artificial intelligence that's hungrier for RAM. If you needed 40GB of RAM before to safely load a 20GB model, then now you need 20GB (please note your computer still needs another 8GB or so on top of that for memory that isn't weights). The reason our changes make an improvement is that mmap() avoids the need to copy pages. Copying pages is bad, because you don't want copied memory to compete with the kernel file cache. When too much copied memory gets created, the kernel reacts by evicting cache entries, which means LLaMA will load slowly from disk each time. Since we reduced memory requirements, users have been telling wonderful stories, like running LLaMA-13B on an old Android phone. For PCs with 32GB of RAM, you should be able to comfortably run LLaMA-30B, since it's 20GB with 4-bit quantized weights.
Faster Loading
Remember that progress bar which made you wait for weights to load each time you ran the command? We got rid of that. Linux users should expect a 100x improvement in load time. Windows and macOS users should expect a 10x improvement. What this means is that tokens will start being produced effectively instantaneously when you run LLaMA, providing almost the same UX as ChatGPT on the shell. It's important to note these improvements are due to an amortized cost. The first time you load a model after rebooting your computer, it's still going to go slow, because it has to load the weights from disk. However, each time it's loaded afterwards, it should be fast (at least until memory pressure causes your file cache to be evicted). This is great news for anyone wanting to use an LLM to generate text from a shell script, similar to the cat command. Likewise, if your use case requires frequently restarting inference for reasons of context or quality, then you'll now have a quicker road to recovery. There is, however, a catch: after your weights file instantly loads, you still need to wait for your prompt to load. That's something you can expect to see addressed soon.

One of the reasons llama.cpp attracted so much attention is that it lowers the barrier to entry for running large language models. That's great for making the benefits of these models more widely accessible to the public. It's also helping businesses save on costs. Thanks to mmap() we're much closer to both these goals than we were before. Furthermore, the reduction of user-visible latency has made the tool more pleasant to use.

The new mmap() based loader is now available in the llama.cpp project, which is released under the MIT license on GitHub in both source and binary forms:


Existing users will need to convert their GGML weights to the new file format:

less migrate-ggml-2023-03-30-pr613.py            # view manual
python migrate-ggml-2023-03-30-pr613.py SRC DST  # run tool

New users should request access from Meta and read Simon Willison's blog post for an explanation of how to get started. Please note that, with our recent changes, some of the steps in his 13B tutorial relating to multiple .1, etc. files can now be skipped. That's because our conversion tools now turn multi-part weights into a single file.

How We Did It

When the llama.cpp project received feedback that we should be using mmap(), the first idea that came to mind was to find a way to make it work within the confines of our C++ library abstractions. @apaz-cli was the person who got the ball rolling on this. The basic idea we tried was to see how much better mmap() could make the loading of weights if we wrote a new implementation of std::ifstream. This meant that, rather than having the underlying I/O implementation call read(), it would instead call mmap() from the constructor, and then the our_ifstream::read() function would just do a memcpy() under the hood.

We determined that this would improve load latency by 18%. This was a big deal, since it's user-visible latency. However it turned out we were measuring the wrong thing. Please note that I say "wrong" in the best possible way; being wrong makes an important contribution to knowing what's right. I don't think I've ever seen a high-level library that's able to do what mmap() does, because it defies attempts at abstraction. After comparing our solution to dynamic linker implementations, it became obvious that the true value of mmap() was in not needing to copy the memory at all. The weights are just a bunch of floating point numbers on disk. At runtime, they're just a bunch of floats in memory. So what mmap() does is it simply makes the weights on disk available at whatever memory address we want. We simply must ensure that the layout on disk is the same as the layout in memory.
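
To make the difference concrete, here is a minimal POSIX sketch, not the llama.cpp code. The names MappedFile, floats_at, and write_floats are hypothetical. The read() method mirrors the first attempt, where the file is mapped but every byte is still copied with memcpy(). The floats_at() method shows the zero-copy idea, which only works because the bytes on disk already have the layout the program wants in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: maps a whole file read-only with MAP_SHARED.
struct MappedFile {
    const unsigned char *base = nullptr;  // start of the mapping, or nullptr
    size_t size = 0;                      // file size in bytes
    size_t pos = 0;                       // cursor used by read()

    explicit MappedFile(const char *path) {
        int fd = open(path, O_RDONLY);
        if (fd == -1) return;
        struct stat st;
        if (fstat(fd, &st) == 0 && st.st_size > 0) {
            void *p = mmap(nullptr, (size_t)st.st_size, PROT_READ,
                           MAP_SHARED, fd, 0);
            if (p != MAP_FAILED) {
                base = static_cast<const unsigned char *>(p);
                size = (size_t)st.st_size;
            }
        }
        close(fd);  // the mapping stays valid after the descriptor is closed
    }
    ~MappedFile() {
        if (base) munmap(const_cast<unsigned char *>(base), size);
    }
    MappedFile(const MappedFile &) = delete;
    MappedFile &operator=(const MappedFile &) = delete;

    // First attempt: a stream-style read() that still copies every byte.
    bool read(void *dst, size_t n) {
        if (!base || n > size - pos) return false;
        memcpy(dst, base + pos, n);
        pos += n;
        return true;
    }

    // Zero-copy: hand out a pointer straight into the mapping.
    // The offset is assumed to be suitably aligned for float.
    const float *floats_at(size_t offset) const {
        assert(base && offset + sizeof(float) <= size);
        return reinterpret_cast<const float *>(base + offset);
    }
};

// Demo helper: writes n raw floats to a file, exactly as they sit in memory.
bool write_floats(const char *path, const float *v, size_t n) {
    FILE *f = fopen(path, "wb");
    if (!f) return false;
    bool ok = fwrite(v, sizeof(float), n, f) == n;
    return fclose(f) == 0 && ok;
}
```

With read(), the data ends up in memory twice: once in the page cache and once in the destination buffer. With floats_at(), the page cache is the only copy.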


After going back to the drawing board, the tricky thing here was that the C++ loading process appeared to reshape the tensors after reading them. If we added printf statements to the old loading code, we'd get results like:

moving 0x640 bytes from offset 0x4a607 to offset 0 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4ac47 to offset 0xc80 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4b287 to offset 0x1900 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4b8c7 to offset 0x2580 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4bf07 to offset 0x3200 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4c547 to offset 0x3e80 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4cb87 to offset 0x4b00 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4d1c7 to offset 0x5780 (n_dims=2 n_parts=2)
... and so forth, for another 200k+ lines

There were also a number of C++ STL containers that got populated with information during the loading process. It became clear that, in order to have a mappable file whose memory layout was the same as what evaluation wanted at runtime, we'd need to not only create a new file, but also serialize those STL data structures too. The only way around it would have been to redesign the file format, rewrite all our conversion tools, and ask our users to migrate their model files. We'd already earned an 18% gain, so why give that up to go so much further, when we didn't even know for certain the new file format would work?

I ended up writing a quick and dirty hack to show that it would work. I used a C library override trick where I started with code like this:

int main(int argc, char **argv) {
    gpt_vocab vocab;
    llama_model model;
    llama_model_load(model, vocab);
    for (;;) {
        llama_eval(model, vocab);
    }
}

Then I modified the code above to avoid using the stack or static memory, and instead rely on the heap. On platforms like Linux, I was able to easily override the libc allocators by doing something like this:

struct magic *mag;  // header at the start of the mapping; holds the root pointers

int main(int argc, char **argv) {
    gpt_vocab *vocab;
    llama_model *model;
    long len = 100L*1024*1024*1024;
    int fd = open("magic.dat", O_RDWR|O_CREAT, 0644);
    ftruncate(fd, len);
    mag = (struct magic *)mmap((void *)0x330000000000, len,
                               PROT_READ|PROT_WRITE,
                               MAP_SHARED|MAP_FIXED, fd, 0);
    if (!mag->vocab) {
        // first run: load normally; every allocation lands in the mapping
        vocab = new gpt_vocab;
        model = new llama_model;
        llama_model_load(*model, *vocab);
        msync((void *)0x330000000000, len, MS_SYNC);
        mag->model = model;
        mag->vocab = vocab;
    } else {
        // later runs: the object graph is already sitting in the file
        vocab = mag->vocab;
        model = mag->model;
    }
    for (;;) {
        llama_eval(*model, *vocab);
    }
}

void *memalign(size_t a, size_t n) {
    if (n < 1) n = 1;
    if (a < 16) a = 16;
    while (a & (a - 1)) ++a;  // round a up to a power of two
    // set p to next chunk in *mag on a
    ((size_t *)p)[-1] = n;    // stash the size below the chunk for realloc()
    return p;
}

void *malloc(size_t n) {
    return memalign(16, n);
}

void *calloc(size_t n, size_t z) {
    void *p;
    if ((p = malloc((n *= z))))
        memset(p, 0, n);
    return p;
}

void *realloc(void *p, size_t n) {
    void *q;
    if (!p) return malloc(n);
    if (!n) { free(p); return 0; }
    if ((q = malloc(n))) {
        size_t m = ((size_t *)p)[-1];  // old size
        memcpy(q, p, m < n ? m : n);   // never copy past the new chunk
    }
    return q;
}

void free(void *p) {}  // memory is never reclaimed

Pseudo-C++ adapted from 5b8023d935401072b73b63ea995aaae040d57b87

The cool thing about the C library is that just about everything depends on it. If you override functions like malloc() on platforms like Linux, then all the languages and tools downstream of C (e.g. C++) will use it too. So the code above not only captures the GGML library's use of malloc(), but also the STL vectors and maps that were being created. The only thing I had to do was make sure the stack-allocated memory got placed on the heap, which was basically just the model and vocab objects. The pointers to those of course needed to be stored in the magically mapped region, so that upon the process loading a second time, it'd have access to the root of the object graph.

This hack is how I made the case that loading could in fact be instantaneous. I didn't need to know much about the implementation details of the loader. I just redefined the heap so that it was a memory mapped file rather than the anonymous mapping it would use normally. Please note the above code does not follow any best practices. I think my code even deserves the honor of being called an abomination, which makes it the very best kind of experimental code. The correct and proper way of doing things is obviously to change the file format. But that would take 10x more effort. Now we knew for sure that it was worth doing. So the code you see above was eventually tossed away, so we could focus on the file format.
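
The "set p to next chunk" step in the pseudocode can be filled in with a bump allocator. What follows is a minimal sketch under stated assumptions, not the code from the commit: Arena, arena_open, and arena_memalign are hypothetical names, the allocator is called explicitly rather than overriding the libc malloc(), and the mapping is placed wherever the kernel likes. Because the address can differ between runs, this sketch stores the root as an offset. The hack above could store raw pointers only because MAP_FIXED pinned the mapping to the same address every time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical header living at the very start of the mapped file.
struct Arena {
    size_t used;  // bytes handed out so far, header included
    size_t root;  // offset of the root object from the arena base, 0 if none
};

// Maps `capacity` bytes of a file read/write; a fresh file reads as zeros.
Arena *arena_open(const char *path, size_t capacity) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd == -1) return nullptr;
    if (ftruncate(fd, (off_t)capacity) == -1) { close(fd); return nullptr; }
    void *p = mmap(nullptr, capacity, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return nullptr;
    Arena *a = static_cast<Arena *>(p);
    if (a->used == 0) a->used = sizeof(Arena);  // first run: skip the header
    return a;
}

// Bump allocation: round up to the alignment, record the size just below
// the chunk, and advance the cursor. Nothing is ever freed.
// Alignments above the page size are not honored by this sketch.
void *arena_memalign(Arena *a, size_t capacity, size_t align, size_t n) {
    assert(a != nullptr);
    if (n < 1) n = 1;
    if (align < 16) align = 16;
    while (align & (align - 1)) ++align;  // round up to a power of two
    size_t start = (a->used + sizeof(size_t) + align - 1) & ~(align - 1);
    if (start > capacity || n > capacity - start) return nullptr;
    char *p = reinterpret_cast<char *>(a) + start;
    reinterpret_cast<size_t *>(p)[-1] = n;  // size header, as in the hack
    a->used = start + n;
    return p;
}
```

Anything allocated this way is still there the next time the file is mapped, which is the whole trick: the second run skips loading and just follows the root.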

Mapping Memory

About a week later, the first code we ended up putting in the main branch that calls the mmap() function was Slaren's change. This might surprise some of the people who've been following my work. Managers and celebrities are usually the ones who get all the kudos. The tech industry isn't used to having its key collaborators on landmark technical achievements be anonymous people from 4chan, but that's exactly what happened here. While bringing the benefits of mmap() was a team effort, you could say that @Slaren was the person who added mmap() support. He did that by pointing out something very smart, which is that the 7B model only had 1-dimensional tensors, and as a result, didn't need to be unsharded, and therefore required no file format change. So he wrote the code and updated the project to map the file. Then he changed the loader so that it simply assigns a pointer to tensor->data instead of calling read(), whenever the tensor is 1-d. In doing this, Slaren showed us that it was possible to bring the benefits of instant load times to LLaMA 7B users immediately.

The hardest thing about introducing support for a function like mmap() though, is figuring out how to get it to work on Windows. I wouldn't be surprised if many of the people who had the same idea in the past, about using mmap() to load machine learning models, ended up not doing it because they were discouraged by Windows not having it. It turns out that Windows has a set of nearly, but not quite identical functions, called CreateFileMapping() and MapViewOfFile(). @oKatanaaa is the person most responsible for helping us figure out how to use them to create a wrapper function. Thanks to him, we were able to delete all of the old standard I/O loader code at the end of the project, because every platform in our support vector was able to be supported by mmap(). That meant we actually had a net negative impact on the number of lines of C++ code! I think coordinated efforts like this are rare, yet really important for maintaining the attractiveness of a project like llama.cpp, which is surprisingly able to do LLM inference using only a few thousand lines of code and zero dependencies. We also had some help from @CoderRC who had previously designed his own set of POSIX functions for Mingw32 and knew the best technique for mmap feature detection.
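
A wrapper of that kind might look roughly like the sketch below. This is an illustration under assumptions rather than the wrapper that landed in llama.cpp: map_file_readonly and unmap_file are hypothetical names, and the Windows branch uses the ANSI variants CreateFileA() and CreateFileMappingA() for brevity.

```cpp
#include <cassert>
#include <cstddef>
#ifdef _WIN32
#include <windows.h>
#else
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#endif

// Hypothetical wrapper: maps a whole file read-only.
// Returns nullptr on failure; on success stores the byte count in *size.
const void *map_file_readonly(const char *path, size_t *size) {
    assert(path != nullptr && size != nullptr);
#ifdef _WIN32
    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return nullptr;
    LARGE_INTEGER len;
    if (!GetFileSizeEx(file, &len)) { CloseHandle(file); return nullptr; }
    // Zero for both size arguments means "the current size of the file".
    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    CloseHandle(file);
    if (!mapping) return nullptr;
    void *addr = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    CloseHandle(mapping);  // the view keeps the mapping object alive
    if (!addr) return nullptr;
    *size = (size_t)len.QuadPart;
    return addr;
#else
    int fd = open(path, O_RDONLY);
    if (fd == -1) return nullptr;
    struct stat st;
    if (fstat(fd, &st) == -1 || st.st_size <= 0) { close(fd); return nullptr; }
    void *addr = mmap(nullptr, (size_t)st.st_size, PROT_READ,
                      MAP_SHARED, fd, 0);
    close(fd);  // the mapping outlives the descriptor
    if (addr == MAP_FAILED) return nullptr;
    *size = (size_t)st.st_size;
    return addr;
#endif
}

// Releases a mapping returned by map_file_readonly().
void unmap_file(const void *addr, size_t size) {
#ifdef _WIN32
    (void)size;
    UnmapViewOfFile(addr);
#else
    munmap(const_cast<void *>(addr), size);
#endif
}
```

The main difference a caller would notice is that Windows needs two objects (a file handle and a mapping handle) where POSIX needs one descriptor. Both can be closed once the view exists.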

Changing the File Format

At this point, we'd nailed down mmap() support for 7B. However, we were still using the old C++ standard I/O code for the larger models. So the only thing left to do was to change the file format, so that mmap() generalized to all the models we were using. That was the part I was responsible for doing.

In order to do inference, we need to load a few hundred tensors out of .pth files using torch, inside our conversion script. With the 7B model this was relatively simple. We only needed to iterate over the tensors in a single file, and produce a single file of output. The tensors in 7B were perfect already, and fully contiguous.

$ ls -hal models/7B/
-rw-r--r--   1 jart  staff   3.9G Mar 29 17:45 ggml-model-q4_0.bin

The issue was that, for models larger than 7B, the tensors were sharded into multiple files. Under our old way of doing things, we were simply doing a 1:1 copy when converting from .pth to GGML. As a result, the ugliness of loading from multiple files was preserved. Here's what it looked like on disk, for instance, with the LLaMA-65B model:

$ ls -hal models/65B/
-rw-r--r--   1 jart  staff   4.8G Mar 16 13:42 ggml-model-q4_0.bin
-rw-r--r--   1 jart  staff   4.8G Mar 16 13:43 ggml-model-q4_0.bin.1
-rw-r--r--   1 jart  staff   4.8G Mar 16 13:43 ggml-model-q4_0.bin.2
-rw-r--r--   1 jart  staff   4.8G Mar 16 13:44 ggml-model-q4_0.bin.3
-rw-r--r--   1 jart  staff   4.8G Mar 16 13:45 ggml-model-q4_0.bin.4
-rw-r--r--   1 jart  staff   4.8G Mar 16 13:45 ggml-model-q4_0.bin.5
-rw-r--r--   1 jart  staff   4.8G Mar 16 13:46 ggml-model-q4_0.bin.6
-rw-r--r--   1 jart  staff   4.8G Mar 16 13:46 ggml-model-q4_0.bin.7

Each file had the same structure, except the tensor data itself was like interlaced movie frames.


To make matters more challenging, different tensors are split apart in different ways, depending on the name. Some were split across columns, and some were split across rows. mmap() is a powerful system call, but it doesn't let you create overlapping mappings that interleave tensors appropriately. Even if we were willing to use hundreds of thousands of mmap() calls to reassemble the read/write operations in a copyless manner, mmap() requires file offsets to be aligned to the page size (typically 4096 bytes), which is too coarse for the tensors in this format. We had to rewrite the converter tool to put them back together by hand, into a much larger unified file, as an upfront one-time cost.

$ ls -hal models/65B/
-rw-r--r--   1 jart  staff    38G Mar 16 13:42 ggml-model-q4_0.bin2
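
The reassembly itself is just index arithmetic. Here is a minimal sketch, assuming row-major float tensors and equally sized shards; merge_shards and its parameters are hypothetical names, and the real converter is a Python script that also has to deal with quantized data. For a column split, the destination stride is n_parts times the slice size, which matches the pattern in the printf output earlier (0x640-byte moves landing 0xc80 bytes apart with n_parts=2).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: rebuilds one row-major tensor from its shards.
// Each shard holds rows * cols_per_shard floats.
//  - split_by_columns: every shard has all the rows but only a slice of
//    each row, so slices must be interleaved row by row.
//  - otherwise: every shard is a contiguous block of rows, so the shards
//    are simply appended one after another.
std::vector<float> merge_shards(const std::vector<std::vector<float>> &shards,
                                size_t rows, size_t cols_per_shard,
                                bool split_by_columns) {
    const size_t n_parts = shards.size();
    const size_t per_shard = rows * cols_per_shard;
    std::vector<float> full(per_shard * n_parts);
    for (size_t part = 0; part < n_parts; ++part) {
        assert(shards[part].size() == per_shard);
        if (!split_by_columns) {
            memcpy(full.data() + part * per_shard, shards[part].data(),
                   per_shard * sizeof(float));
            continue;
        }
        for (size_t r = 0; r < rows; ++r) {
            // Row r of the full tensor is n_parts slices wide.
            memcpy(full.data() + (r * n_parts + part) * cols_per_shard,
                   shards[part].data() + r * cols_per_shard,
                   cols_per_shard * sizeof(float));
        }
    }
    return full;
}
```

For example, two column shards {1, 2, 5, 6} and {3, 4, 7, 8} of a 2x4 tensor interleave into 1 through 8 in order, while the same shards treated as a row split are just concatenated.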

The C++ loader was already doing the necessary conversion. All I had to do was simply move that code into the Python conversion script instead. That ensured the same commands people used before would automatically use the new format. Once I patched that, all that remained was writing a migration script. That was important since many people deleted Meta's original .pth files to save hard disk space, and they needed a tool to convert from the old format to the new format. This tool is the script that was recommended above, called migrate-ggml-2023-03-30-pr613.py. It was relatively straightforward to make, since it follows logic similar to that of the conversion tool. Except in this case, I didn't need Torch, pickle, or anything like that. All that was needed was plain old NumPy combined with seek, read, and write system calls. That's nice, since my favorite distro Alpine can't even run Torch!

The interesting thing about the seek() function is that operating systems let us seek past the end of a file. So it creates a convenient framework for unsharding tensors from multi-part files, since the I/O can be performed by writing tensors to disk in such a way that the tensors have holes. We can then fill those in over multiple passes as the remaining shards are processed. Doing that raises interesting questions of course, about how the file system might allocate blocks in the underlying physical medium. It's something that's not necessarily within our control, but I'd still love to learn more about it. For example, on some file systems I've noticed that, after converting a file, it might load from disk faster if cp is used afterwards to produce a copy.
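
Here is a minimal sketch of that hole-filling idea, written in C++ with POSIX calls rather than the Python the migration script uses. The function name write_shard_rows is hypothetical, and it assumes a column-split float tensor that starts at offset 0 of the output file (the real format has a header and vocabulary first).

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: writes one shard's slice of every row to its final
// position in the unified file. Slices owned by other shards are skipped,
// which leaves holes whenever a write lands past the current end of file.
bool write_shard_rows(int fd, const float *shard, size_t rows,
                      size_t cols_per_shard, size_t n_parts, size_t part) {
    assert(shard != nullptr && part < n_parts);
    const size_t slice = cols_per_shard * sizeof(float);
    for (size_t r = 0; r < rows; ++r) {
        off_t where = (off_t)((r * n_parts + part) * slice);
        if (lseek(fd, where, SEEK_SET) == (off_t)-1) return false;
        if (write(fd, shard + r * cols_per_shard, slice) != (ssize_t)slice)
            return false;
    }
    return true;
}
```

Calling this once per shard, in any order, leaves a complete tensor on disk. Until the last pass finishes, the unwritten regions simply read back as zeros.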

There's one last important benefit to the new file format. It ensures tensors are aligned on a 32-byte boundary. The old file format didn't perform a roundup after writing the model vocabulary to disk. As a result, floats were being mmap()'d to odd addresses half the time, which would trigger UBSAN errors. It also potentially left some meat on the table when it comes to SIMD instructions. Alignment generally isn't a problem on modern microarchitectures of the two major architectures. In practice, the only time misalignment is completely forbidden is with semaphores on ARM. However just because it seems to work doesn't mean misalignment won't consume additional resources under the hood, or cause other problems in sneaky ways. One example would be x86, where misaligned semaphores will seem to work until you have the unlucky chance of your unsigned int overlapping a 64-byte cacheline boundary. For that reason, the new file format takes a more conservative approach, and it may potentially open some doors in the future for certain kinds of optimizations.
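
The roundup in question is the usual power-of-two trick. A minimal sketch, with hypothetical names (align_up, padding_for, kTensorAlign) that are not taken from the llama.cpp source:

```cpp
#include <cassert>
#include <cstddef>

// The boundary named in the text; the constant's name is hypothetical.
constexpr size_t kTensorAlign = 32;

// Rounds offset up to the next multiple of align.
// align must be a power of two for the mask trick to work.
constexpr size_t align_up(size_t offset, size_t align) {
    return (offset + align - 1) & ~(align - 1);
}

// Number of padding bytes a writer would emit before the next tensor.
constexpr size_t padding_for(size_t offset, size_t align) {
    return align_up(offset, align) - offset;
}
```

Taking the first offset from the printf output earlier as an example, align_up(0x4a607, kTensorAlign) gives 0x4a620, so a writer would emit 25 bytes of padding before the tensor data.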

For further details, please see 78ca9838ee36660a776e97e3391b6fb5dcaacf7f and ee0c40dd6de8c3c658ae43199939ef40bb1cf408.


Many sources of information on the world wide web that explain how to use mmap() will also insist upon the use of madvise() as though its benefits were established fact. I couldn't measure any evidence that it'd be helpful in our case, since transformer models like LLaMA need to immediately fault every single memory page as soon as the weights are loaded. The madvise() system call is probably only helpful in situations where only a subset of pages are needed for a nontrivial amount of time, during which the disk would otherwise become underutilized.

posix_fadvise(POSIX_FADV_SEQUENTIAL) would be an example of a kind of advice that'd be potentially more helpful to users of LLMs. One of the downsides of the Linux cp command is that copying a file larger than RAM will destroy every existing entry in the file cache. Under normal circumstances this is a good thing, since a least recently used strategy usually works. However, it can be problematic if you're just organizing your files on a production system where you don't want to disrupt performance. As far as I know, no standard command line utility offers a way to exploit this functionality. So we may provide an example of how to replace the cp command that could address this use case too. Another feature such a command could offer would be the use of copy_file_range(), which enables files to be copied within the same partition 2x faster than the sendfile() technique that's used by standard utilities.
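
For the curious, the core of such a command could be sketched as follows. This is a Linux-only illustration with a hypothetical function name (copy_weights_file), not a tool the project ships. It passes the advice named above, tries copy_file_range() first, and falls back to a plain read/write loop when the kernel or file system refuses. It makes no claim about how much either call helps on a given system.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: copies src to dst. Returns true on success.
bool copy_weights_file(const char *src, const char *dst) {
    assert(src != nullptr && dst != nullptr);
    int in = open(src, O_RDONLY);
    if (in == -1) return false;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out == -1) { close(in); return false; }

    struct stat st;
    bool ok = fstat(in, &st) == 0;
    off_t left = ok ? st.st_size : 0;

    // Advisory only: if the kernel ignores or rejects it, we carry on.
    posix_fadvise(in, 0, 0, POSIX_FADV_SEQUENTIAL);

    // In-kernel copy. With null offset pointers, both file positions advance.
    while (ok && left > 0) {
        ssize_t n = copy_file_range(in, nullptr, out, nullptr,
                                    (size_t)left, 0);
        if (n <= 0) break;  // unsupported or failed: use the fallback below
        left -= n;
    }

    // Fallback: finish whatever remains with ordinary read() and write().
    char buf[1 << 16];
    while (ok && left > 0) {
        ssize_t n = read(in, buf, sizeof(buf));
        if (n <= 0) { ok = false; break; }
        for (ssize_t off = 0; off < n;) {
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w < 0) { ok = false; break; }
            off += w;
        }
        left -= n;
    }

    close(in);
    if (close(out) == -1) ok = false;
    return ok;
}
```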