Dec 4th, 2023 @ justine's web page
I spent the last month working with Mozilla to launch a new project called llamafile which lets you distribute and run LLMs with a single file. We had a successful launch five days ago. The project earned 4.4k stars on GitHub, 1064 upvotes on Hacker News, and received press coverage from Hackaday.
Many people are excited about how the project puts control of chat bots into the hands of everyday people. However, my favorite thing to focus on is how it can be genuinely useful in helping me get work done as an old-school UNIX hacker who would rather write shell scripts than tame wild-eyed condas in containers.
I've never used things like the OpenAI API before. As much as I love browsing the web and using online services like Twitter, I simply wouldn't want to use a text generator that talks to the Internet, any more than I'd choose to use a terminal emulator that's built on Electron so it can integrate with Full Story.
I don't want my daily life working in the terminal to become a permanently recorded artifact in someone's cloud. Tools that don't talk to the network have no business being on the network. What I love about llamafile is that, since it's a locally-running dependency-free command line tool that's fast, I finally feel comfortable enough with LLMs to start learning how to use them.
That's because llamafile behaves just like classic commands, e.g. cat (which is basically what an LLM is if you think about it). What I like about the old technology is that, if you're able to run a command, you can automate it using shell scripts. That ability to automate machines is the foundation of all power in tech, and llamafile gives you that power over AI, in such a way that no one can ever take it away from you. So even if you're fearful of AI, please give this a try anyway, because if you do, then you're going to feel like all those scary new powers you've read about in the news will now be yours to command, using the llamafile command.
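For instance, here's a minimal sketch of the idea, reusing the mistral-7b-instruct-v0.1-Q4_K_M-main.llamafile command and the flags that appear later in this post (the prompt itself is just an illustration):

# pipe an instruction into the model the same way you'd pipe text into cat
echo '[INST]Explain in one sentence what the cat command does.[/INST]' |
  ./mistral-7b-instruct-v0.1-Q4_K_M-main.llamafile \
    --temp 0 -n 1000 -f /dev/stdin --silent-prompt 2>/dev/null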
Lastly, and even more importantly, you have the freedom to use these tools to enrich yourself, on your own terms (so long as you're not a patent troll), since both llamafile and the Mistral weights are licensed Apache 2.0. That means you can fork it, whitelabel it, or build on top of it; and the things you create can be as purely open or strictly confidential as you want them to be.
The easiest way to get a local LLM to be useful is to use it to summarize text. With llamafile, you can turn a web page with a long wall of text into a few short paragraphs. Consider the following URL:
Ed Post; USENET 1983
Real Programmers Don't Use Pascal
https://www.pbm.com/~lindahl/real.programmers.html
It's worth the ~20 minutes to read all 3,774 words, but if you need an executive summary, then your llamafile, powered by Mistral 7B, can read the text from the above URL to generate the following:
The article "Real Programmers Don't Use Pascal" is a humorous essay that argues that Real Programmers use Fortran, while Quiche Eaters use Pascal. The author claims that there are two types of programmers: Real Programmers and Quiche Eaters. Real Programmers are those who can understand and write programs in Fortran, while Quiche Eaters are those who cannot.
The article goes on to describe the characteristics of Real Programmers and Quiche Eaters. Real Programmers are said to be able to write programs quickly and efficiently, while Quiche Eaters are said to be slow and inefficient. Real Programmers are also said to be able to understand complex algorithms and data structures, while Quiche Eaters are not.
The article concludes by arguing that the future of programming is bright for Real Programmers, as Fortran remains a popular language and structured coding constructs have failed to replace it. The author also argues that Unix, despite its popularity among hackers, is not a serious operating system for Real Programmers.
Here we see Mistral provides a faithful overview of Ed Post's essay. Mistral also managed to identify the one point Ed Post made that was both offensive and wrong, which is his belief that UNIX is a toy.
Mistral is probably smart enough to read an article if you just curl the raw HTML into your prompt. What makes links useful for LLMs is that it removes all the unrelated <html> tags which take up space in memory.
The PASCAL essay is 3000+ words. Each word is normally broken up into
multiple tokens. Mistral only has an 8,000 token context window. So if
we include the HTML, we might get an error that we've run out of space.
links -codepage utf-8 -force-html -dump -width 500 \
      https://www.pbm.com/~lindahl/real.programmers.html
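If you're curious how much dead weight the markup adds, a rough sanity check (my own aside, not part of the pipeline) is to compare the size of the raw HTML with the size of the links dump. Bytes aren't tokens, but the ratio gives you a feel for how much of the context window the tags would have eaten:

curl -s https://www.pbm.com/~lindahl/real.programmers.html | wc -c
links -codepage utf-8 -force-html -dump -width 500 https://www.pbm.com/~lindahl/real.programmers.html | wc -c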
One of the links flags that's particularly helpful is -width 500, which turns off line wrapping. With the way LLMs tokenize text, spaces are usually free, since LLM tokens are often chopped up to include spaces, depending on the arbitrary alphabet each model defines. On the other hand, line breaks and long strings of repeating spaces will always be less efficient, and if they're being used purely to reflow paragraphs for human readability, then those added tokens provide no additional value from the LLM's perspective.
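If the dump still contains long runs of repeating spaces (tables and indentation tend to produce them), one optional trick, and this is my own suggestion rather than something links does for you, is to squeeze them with sed before they reach the model. The pattern below collapses any run of two or more spaces into a single space:

links -codepage utf-8 -force-html -dump -width 500 \
      https://www.pbm.com/~lindahl/real.programmers.html | sed 's/   */ /g'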
The links command is pretty common. For example, most macOS users just need to run:
brew install links
If you don't have a package manager, then here's a prebuilt APE binary of links v2.29:
links (7.7mb)
wget https://cosmo.zip/pub/cosmos/bin/links
For AMD64+ARM64 on Linux+Mac+Windows+FreeBSD+NetBSD+OpenBSD
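One thing wget won't do for you is set the executable bit, so before running the downloaded binary for the first time you'll want to:

chmod +x links
./links https://justine.lol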
I built it myself to test summarization with my Nvidia graphics card on Windows. Plus it gives me the ability to browse the web using PowerShell, which is a nice change of pace from Chrome and Firefox.
curl -o links.exe https://cosmo.zip/pub/cosmos/bin/links
.\links.exe https://justine.lol
Regardless of your OS, if you have issues running my links binary, then see the gotchas section below.
The llamafile itself is just another command you can run from your shell:

./mistral-7b-instruct-v0.1-Q4_K_M-main.llamafile

So without further ado, here's how you'd do the above in one piece.
(echo "[INST]Summarize the following article:"; links -codepage utf-8 -force-html -width 500 -dump https://www.pbm.com/~lindahl/real.programmers.html; echo "[/INST]") | ./mistral-7b-instruct-v0.1-Q4_K_M-main.llamafile --temp 0 -c 7000 -n 1000 -f /dev/stdin --silent-prompt 2>/dev/null
Gotchas

On macOS with Apple Silicon you need to have Xcode installed for llamafile to be able to bootstrap itself.
If you use zsh and have trouble running llamafile, try saying sh -c ./llamafile. This is due to a bug that was fixed in zsh 5.9+. The same is the case for Python subprocess, old versions of Fish, etc.
On some Linux systems, you might get errors relating to run-detectors or WINE. This is due to binfmt_misc registrations. You can fix that by adding an additional registration for the APE file format llamafile uses:
sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
sudo chmod +x /usr/bin/ape
sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
As mentioned above, on Windows you may need to rename your llamafile by adding .exe to the filename.
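For example, from cmd or PowerShell (the new name is only an illustration; what matters is that it ends in .exe):

ren mistral-7b-instruct-v0.1-Q4_K_M-main.llamafile mistral-7b-instruct-v0.1-Q4_K_M-main.llamafile.exe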
Also as mentioned above, Windows also has a maximum file size limit of 4GB for executables. The LLaVA server executable above is just 30MB shy of that limit, so it'll work on Windows, but with larger models like WizardCoder 13B, you need to store the weights in a separate file. An example is provided above; see "Using llamafile with external weights."
On WSL, it's recommended that the WIN32 interop feature be disabled:
sudo sh -c "echo -1 >/proc/sys/fs/binfmt_misc/WSLInterop"On any platform, if your llamafile process is immediately killed, check if you have CrowdStrike and then ask to be whitelisted.