Oct 31st, 2024 @ justine's web page
I just learned 42 programming languages this month to build a new syntax highlighter for llamafile. I feel like I'm up to my eyeballs in programming languages right now. Now that it's Halloween, I thought I'd share some of the spookiest, most surprising syntax I've seen.
The languages I decided to support are Ada, Assembly, BASIC, C, C#, C++, COBOL, CSS, D, FORTH, FORTRAN, Go, Haskell, HTML, Java, JavaScript, Julia, JSON, Kotlin, ld, LISP, Lua, m4, Make, Markdown, MATLAB, Pascal, Perl, PHP, Python, R, Ruby, Rust, Scala, Shell, SQL, Swift, Tcl, TeX, TXT, TypeScript, and Zig. That crosses off pretty much everything on the TIOBE Index except Scratch, which can't be highlighted, since it uses blocks instead of text.
It's really not difficult to implement a syntax highlighter. You could probably write one over the course of a job interview. My favorite tools for doing this have been C++ and GNU gperf. The hardest problem here is avoiding the need to do a bunch of string comparisons to determine if something is a keyword or not. Most developers would just use a hash table, but gperf lets you create a perfect hash table. For example:
%{
#include <string.h>
%}
%pic
%compare-strncmp
%language=ANSI-C
%readonly-tables
%define lookup-function-name is_keyword_java_constant
%%
true
false
null
gperf was originally invented for gcc and it's a great way to squeeze out every last drop of performance. If you run the gperf command on the code above, it'll generate this .c file. You'll notice its hash function only needs to consider a single character in a string. No two keywords ever collide, which is what makes it perfect, and perfect means better performance. I'm not sure who wants to be able to syntax highlight C at 35 MB per second, but I am now able to do so, even though I've defined about 4,000 keywords for the language. Thanks to gperf, those keywords don't slow things down.
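To make the idea concrete, here's a rough Python sketch of a perfect hash over the same three keywords. It isn't the C that gperf generates: the table size, the `keyword_hash` function, and the name `is_java_constant` are all assumptions picked by hand for illustration.

```python
# A toy perfect hash for the three keywords in the gperf input above.
# The table size and hash function are assumptions picked by hand for
# this sketch; they are not what gperf actually emits.
KEYWORDS = ("true", "false", "null")
TABLE_SIZE = 5


def keyword_hash(token: str) -> int:
    """Hash on a single character: the first one."""
    return ord(token[0]) % TABLE_SIZE


# Build the table, checking that no two keywords share a slot.
TABLE = [None] * TABLE_SIZE
for word in KEYWORDS:
    slot = keyword_hash(word)
    assert TABLE[slot] is None, "collision: the hash would not be perfect"
    TABLE[slot] = word


def is_java_constant(token: str) -> bool:
    """Return True if token is one of the keywords."""
    if not token:
        return False
    # One table probe and at most one string comparison.
    return TABLE[keyword_hash(token)] == token
```

Since every keyword lands in its own slot, a lookup never has to walk a collision chain. It hashes, probes once, and compares once.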
The rest just boils down to finite state machines. You don't really need flex, bison, or ragel to build a basic syntax highlighter. You simply need a for loop and a switch statement. At least for my use case, where I've really only been focusing on strings, comments, and keywords. If I wanted to highlight things like C function names, well, then I'd probably need to do actual parsing. But focusing on the essentials, we're only really doing lexing at most. See highlight_ada.cpp as an example.
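For a feel of what that loop looks like, here's a minimal Python sketch of a state machine that only knows about strings, `//` comments, and keywords. It's an illustration rather than a port of highlight_ada.cpp: the state names, the `highlight` function, and the tiny keyword set are all made up for this example.

```python
KEYWORDS = {"if", "else", "return", "while"}  # tiny hypothetical keyword set


def highlight(src: str) -> list[tuple[str, str]]:
    """Split src into (kind, text) spans using a hand-written state machine."""
    spans = []
    state = "normal"
    buf = ""

    def flush(kind: str) -> None:
        nonlocal buf
        if buf:
            spans.append((kind, buf))
        buf = ""

    i = 0
    while i < len(src):
        ch = src[i]
        if state == "normal":
            if ch.isalpha() or ch == "_":
                flush("text")
                state = "word"
            elif ch == '"':
                flush("text")
                state = "string"
            elif ch == "/":
                flush("text")
                state = "slash"
            buf += ch
        elif state == "word":
            if ch.isalnum() or ch == "_":
                buf += ch
            else:
                flush("keyword" if buf in KEYWORDS else "text")
                state = "normal"
                continue  # look at ch again in the normal state
        elif state == "slash":
            if ch == "/":
                buf += ch
                state = "comment"
            else:
                state = "normal"
                continue  # a lone slash is just text
        elif state == "comment":
            if ch == "\n":
                flush("comment")
                state = "normal"
                continue  # the newline itself is plain text
            buf += ch
        elif state == "string":
            buf += ch
            if ch == "\\":
                state = "escape"
            elif ch == '"':
                flush("string")
                state = "normal"
        elif state == "escape":
            buf += ch
            state = "string"
        i += 1

    # Emit whatever is left over when the input ends.
    final = {"word": "keyword" if buf in KEYWORDS else "text",
             "string": "string", "escape": "string", "comment": "comment"}
    flush(final.get(state, "text"))
    return spans
```

Joining the text of every span gives back the original input, so a renderer only has to wrap each span in the color for its kind.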
All the research you're about to read on this page went into making one thing, which is llamafile's new syntax highlighter. This is probably the strongest advantage that llamafile has over ollama these days, since ollama doesn't do syntax highlighting at all. Here's a demo of it running on Windows 10, using the Meta LLaMA 3.2 3B Instruct model. Please note, these llamafiles will run on MacOS, Linux, FreeBSD, and NetBSD too.
The new highlighter and chatbot interface has made llamafile so pleasant for me to use, combined with the fact that open weights models like gemma 27b it have gotten so good, that it's become increasingly rare that I'll feel tempted to use Claude these days.
So let's talk about the kinds of lexical syntax that surprised me while writing this highlighter.
The C programming language, despite claiming to be simple, actually has some of the weirdest lexical elements of any language. For starters, we have trigraphs, which were probably invented to help Europeans use C when using keyboards that didn't include #, [, \, ], ^, {, |, }, and ~. You can replace those characters with ??=, ??(, ??/, ??), ??', ??<, ??!, ??>, and ??-. Intuitive, right? That means, for example, the following is perfectly valid C code.
int
main(int argc, char* argv??(??))
??<
    printf("hello world\n");
??>
That is, at least until trigraphs were removed in the C23 standard. However, compilers will be supporting this syntax forever for legacy software, so a good syntax highlighter ought to too. But just because trigraphs are officially dead doesn't mean the standards committees haven't thought up other weird syntax to replace them. Consider universal characters:
int \uFEB2 = 1;
This feature is useful for anyone who wants, for example, variable names with Arabic characters while still keeping the source code pure ASCII. I'm not sure why anyone would use it. I was hoping I could abuse this to say:
int
main(int argc, char* argv\u005b\u005d)
\u007b
    printf("hello world\n");
\u007d
But alas, GCC raises an error if universal characters aren't used on the specific UNICODE planes that've been blessed by the standards committee.
This next one is one of my favorites. Did you know that a single line comment in C can span multiple lines if you use backslash at the end of the line?
//hi\
there
Most other languages don't support this. Even languages that allow backslash escapes in their source code (e.g. Perl, Ruby, and Shell) don't have this particular feature from C. The ones that do support this too, as far as I can tell, are Tcl and GNU Make. Tools for syntax highlighting oftentimes get this wrong, like Emacs and Pygments. Although Vim seems to always be right about backslash.
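Here's a small Python sketch of how a lexer might find the end of a `//` comment while honoring backslash continuation. The function name `line_comment_end` is hypothetical, and the sketch assumes plain `\n` line endings (no `\r\n`, and no `??/` trigraph standing in for the backslash).

```python
def line_comment_end(src: str, start: int) -> int:
    """Return the index just past a // comment that begins at start.

    A backslash immediately before the newline splices the next line
    into the comment, following C's line-continuation rule.
    """
    i = start
    while i < len(src):
        if src[i] == "\\" and src.startswith("\n", i + 1):
            i += 2  # skip the backslash-newline pair; the comment goes on
        elif src[i] == "\n":
            break   # an ordinary newline ends the comment
        else:
            i += 1
    return i
```

A highlighter that stops at the first newline would color `there` in the example above as code, which is the mistake described here.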
Every C programmer knows you can't embed a multi-line comment in a multi-line comment. For example:
/*
hello
/* again */
nope nope nope
*/
However with Haskell, you can. They finally fixed the bug. Although they did adopt a different syntax.
-- Test nested comments within code blocks
let result3 = {- This comment contains {- a nested comment -} -} 10 - 5
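Lexing nested comments takes a depth counter rather than a search for the first closing marker. A minimal Python sketch, with `nested_comment_end` being a made-up name for this example:

```python
def nested_comment_end(src: str, start: int) -> int:
    """Return the index just past the {- ... -} comment that opens at start."""
    assert src.startswith("{-", start)
    depth = 0
    i = start
    while i < len(src):
        if src.startswith("{-", i):
            depth += 1          # one level deeper
            i += 2
        elif src.startswith("-}", i):
            depth -= 1          # one level back out
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    return len(src)  # unterminated: the comment runs to the end of input
```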
The thing that surprised me most about Tcl is that identifiers can have quotes in them. For example, this program will print a"b:
puts a"b
You can even have quotes in your variable names, however you'll only be able to reference them if you use the ${a"b} notation, rather than $a"b.
set a"b doge
puts ${a"b}
JavaScript has a builtin lexical syntax for regular expressions. However it's easy to lex it wrong if you aren't paying attention. Consider the following:
var foo = /[/]/g;
When I first wrote my lexer, I would simply scan for the closing slash, and assume that any slashes inside the regex would be escaped. That turned out to be wrong when I highlighted some minified code. If a slash is inside the square brackets of a character set, then that slash doesn't need to be escaped!
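The fix is one extra bit of state that tracks whether we're inside a character set. Here's a Python sketch. `regex_literal_end` is a hypothetical helper, and it assumes the caller has already decided that the slash at `start` opens a regex rather than a division.

```python
def regex_literal_end(src: str, start: int) -> int:
    """Return the index just past the regex literal whose opening / is at start."""
    i = start + 1
    in_class = False                    # True while inside [...]
    while i < len(src):
        ch = src[i]
        if ch == "\\":
            i += 2                      # skip the escaped character
            continue
        if ch == "\n":
            break                       # regex literals can't span lines
        if ch == "[":
            in_class = True
        elif ch == "]":
            in_class = False
        elif ch == "/" and not in_class:
            i += 1
            while i < len(src) and src[i].isalpha():
                i += 1                  # consume flags such as g or i
            return i
        i += 1
    return min(i, len(src))             # unterminated literal
```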
Now onto the even weirder.
There's some invisible UNICODE characters called the LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029). I don't know what the use case is for these codepoints, but the ECMAScript standard defines them as line terminators, which effectively makes them the same thing as \n. Since these are Trojan Source characters, I configure my Emacs to render them as ↵ and ¶. However most software hasn't been written to be aware of these characters, and will oftentimes render them as question marks. Also as far as I know, no other language does this. I was able to use that to my advantage for SectorLISP, since it let me create C + JavaScript polyglots.
//¶`
... C only code goes here ...
//`
That's how I'd insert C code into JavaScript files.
//¶`
#if 0
//`
... JavaScript only code goes here ...
//¶`
#endif
//`
And that's how I'd insert JavaScript into my C source code. An example of a piece of production code where I did this is lisp.js which is what powers my SectorLISP blog post. It both runs in the browser, and you can compile it with GCC and run it locally too. llamafile is able to correctly syntax highlight this stuff, but I've yet to find another syntax highlighter that does too. Not that it matters, since I doubt an LLM would ever print this. But it sure is fun to think about these corner cases.
We're all familiar with the heredoc syntax of shell scripts, e.g.
cat <<EOF
this is kind of
a multi-line
string
EOF
The above syntax allows you to put $foo in your heredoc string, although there's a quoted syntax which disables variable substitution.
cat <<'END'
this won't print the contents of $var
END
If you ever want to confuse your coworkers, then one great way to abuse this syntax is by replacing the heredoc marker with an empty string, in which case the heredoc will end on the next empty line. For example, this program will print "hello" and "world" on two lines:
cat <<''
hello

echo world
It's also possible in languages that support heredocs (Shell, Ruby, and Perl) to have multiple heredocs on the same line.
cat /dev/fd/3 3<< E1 /dev/fd/4 4<< E2
foo
E1
bar
E2
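Both tricks, the empty marker and several heredocs on one line, fall out naturally if the lexer queues up every marker it sees on a command line and then consumes the bodies in order. Here's a rough Python sketch. The `HEREDOC` pattern and `heredoc_bodies` function are assumptions for illustration, and they ignore quoting context, `<<-`, and arithmetic shifts.

```python
import re

# A << that isn't part of <<<, followed by a quoted or bare marker.
HEREDOC = re.compile(r"""(?<!<)<<(?!<)\s*(?:'([^']*)'|"([^"]*)"|(\w+))""")


def heredoc_bodies(script: str) -> list[tuple[str, str]]:
    """Return (marker, body) for each heredoc in script, in order."""
    lines = script.split("\n")
    found = []
    i = 0
    while i < len(lines):
        # Queue every marker on this line; an empty '' marker is valid.
        markers = [next(g for g in m.groups() if g is not None)
                   for m in HEREDOC.finditer(lines[i])]
        i += 1
        for marker in markers:
            body = []
            while i < len(lines) and lines[i] != marker:
                body.append(lines[i])
                i += 1
            i += 1  # step over the marker line itself
            found.append((marker, "\n".join(body)))
    return found
```

With an empty marker, the terminating line is simply the first empty line, which is why `echo world` in the example above runs as a command.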
Another thing to look out for with shell, is it's like Tcl in the sense that special characters like #, which you might think would always begin a comment, can actually be valid code depending on the context. For example, inside a variable reference, # can be used to strip a prefix. The following program will print "there".
x=hi-there
echo ${x#hi-}
Did you know that, from a syntax highlighting standpoint, a Kotlin string can begin with " but end with the { character? That's the way its string interpolation syntax works. Many languages let you embed variable name references in strings, but TypeScript, Swift, Kotlin, and Scala take string interpolation to the furthest extreme of encouraging actual code being embedded inside strings.
val s2 = "${s1.replace("is", "was")}, but now is $a"
So to highlight a string with Kotlin, Scala, and TypeScript, one must count curly brackets and maintain a stack of parser states. With TypeScript, this is relatively trivial, and only requires a couple states to be added to your finite state machine. However with Kotlin and Scala it gets real hairy, since they support both double quote and triple quote syntax, and either one of them can have interpolated values. So that ended up being about 13 independent states the FSM needs for string lexing alone. Swift also supports triple quotes for its "\(var)" interpolated syntax, however that only needed 10 states to support.
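Here's a stripped-down Python sketch of the bracket-counting idea for Kotlin-style `"${...}"` strings. It handles plain double-quoted strings only (no triple quotes, char literals, or comments), and the `classify` function and its two span kinds are invented for this example, so it's far simpler than the 13-state machine described above.

```python
from itertools import groupby


def classify(src: str) -> list[tuple[str, str]]:
    """Split src into ("code", text) and ("string", text) spans."""
    kinds = []       # one kind per character of src
    stack = []       # one brace counter per open ${ ... } interpolation
    in_string = False
    i = 0
    while i < len(src):
        ch = src[i]
        if in_string:
            if ch == "\\" and i + 1 < len(src):
                kinds += ["string", "string"]   # escaped character
                i += 2
                continue
            if src.startswith("${", i):
                kinds += ["string", "string"]   # the string "ends" at {
                stack.append(0)
                in_string = False
                i += 2
                continue
            kinds.append("string")
            if ch == '"':
                in_string = False
        elif ch == '"':
            kinds.append("string")
            in_string = True
        elif ch == "}" and stack and stack[-1] == 0:
            stack.pop()                         # interpolation is over
            kinds.append("string")
            in_string = True
        else:
            if stack and ch == "{":
                stack[-1] += 1                  # nested code block
            elif stack and ch == "}":
                stack[-1] -= 1
            kinds.append("code")
        i += 1
    return [(kind, "".join(c for _, c in group))
            for kind, group in groupby(zip(kinds, src), key=lambda p: p[0])]
```

Run on the `val s2` line above, the first string span is `"${` and the last one is `}, but now is $a"`, with `"is"` and `"was"` showing up as separate strings nested inside the interpolated code.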
Swift has its own unique approach to the problem of embedding strings inside a string. It allows "double quote", """triple quote""", and /regex/ strings to all be surrounded with an arbitrary number of #hash# marks, which must be mirrored on each side. This makes it possible to write code like the following:
let threeMoreDoubleQuotationMarks = #"""
Here are three more double quotes: """
"""#
let threeMoreDoubleQuotationMarks = ##"""
Here are three more double quotes: #"""#
"""##
C# supports Python's triple quote multi-line string syntax, but with an interesting twist that's unique to this language. The way C# solves the "embed a string inside a string" problem, is they let you do quadruple quoted strings, or even quintuple quoted strings if you want. However many quotes you put on the lefthand side, that's what'll be used to terminate the string at the other end.
Console.WriteLine("");
Console.WriteLine("\"");
Console.WriteLine("""""");
Console.WriteLine("""""");
Console.WriteLine(""" yo "" hi """);
Console.WriteLine("""" yo """ hi """");
Console.WriteLine(""""First """100 Prime""" Numbers: """");
This is the way if you ask me, because it's actually simpler for a finite state machine to decode. With classic Python triple quoted strings, you need extra rules, to ensure it's either one double-quote character, or exactly three. By letting it be an arbitrary number, there's fewer rules to validate. So you end up with a more powerful expressive language that's simpler to implement. This is the kind of genius we've come to expect from Microsoft.
What will they think of next?
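As a rough Python sketch of why the arbitrary-count rule is easy to lex: count the opening quotes, then look for the same number of quotes again. `csharp_string_end` is a hypothetical name. The sketch treats one quote as an ordinary escaped string and two as the empty string, and it doesn't validate anything else the C# compiler would.

```python
def csharp_string_end(src: str, start: int) -> int:
    """Return the index just past the string literal whose first quote is at start."""
    n = 0
    while src.startswith('"', start + n):
        n += 1                          # count the opening quotes
    if n == 2:
        return start + 2                # "" is just the empty string
    if n == 1:
        i = start + 1                   # ordinary string with \ escapes
        while i < len(src):
            if src[i] == "\\":
                i += 2
            elif src[i] == '"':
                return i + 1
            else:
                i += 1
        return len(src)
    # Three or more quotes: the string ends at the next run of n quotes.
    close = src.find('"' * n, start + n)
    return len(src) if close < 0 else close + n
```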
Normally when code is simpler for a computer to decode, it's more difficult for a human to understand, and FORTH is proof of that. FORTH is probably the simplest language there is, because it tokenizes everything on whitespace boundaries. Even the syntax for starting a string is a token. For example:
c" hello world"
Would mean the same thing as saying "hello world" in every other language.
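A Python sketch of that tokenizer fits in a couple dozen lines. The `forth_tokens` function and the small `STRING_WORDS` set are assumptions for this example; a real FORTH system has more string-starting words than the three listed here.

```python
STRING_WORDS = {'c"', 's"', '."'}  # words that start a string (partial list)


def forth_tokens(src: str) -> list[tuple[str, str]]:
    """Split FORTH source into ("word", text) and ("string", text) tokens."""
    tokens = []
    i = 0
    while i < len(src):
        if src[i].isspace():
            i += 1
            continue
        j = i
        while j < len(src) and not src[j].isspace():
            j += 1                      # a word runs until the next whitespace
        word = src[i:j]
        i = j
        if word in STRING_WORDS:
            # Skip the single delimiter space, then read up to the next quote.
            close = src.find('"', i + 1)
            if close < 0:
                close = len(src)
            tokens.append(("string", word + src[i:close + 1]))
            i = close + 1
        else:
            tokens.append(("word", word))
    return tokens
```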
One of the use cases I envision for llamafile is that it can help the banking system not collapse once all the FORTRAN and COBOL programmers retire. Let's say you've just been hired to maintain a secretive mainframe full of confidential information written in the COmmon Business-Oriented Language. Thanks to llamafile, you can ask an air-gapped AI you control, like Gemma 27b, to write your COBOL and FORTRAN code for you. It can't print punch cards, but it can highlight punch card syntax. Here's what FORTRAN code looks like, properly syntax highlighted:
*
*     Quick return if possible.
*
      IF ((M.EQ.0) .OR. (N.EQ.0) .OR.
     +    (((ALPHA.EQ.ZERO).OR. (K.EQ.0)).AND. (BETA.EQ.ONE))) RETURN
*
*     And if alpha.eq.zero.
*
      IF (ALPHA.EQ.ZERO) THEN
          IF (BETA.EQ.ZERO) THEN
              DO 20 J = 1,N
                  DO 10 I = 1,M
                      C(I,J) = ZERO
   10             CONTINUE
   20         CONTINUE
          ELSE
              DO 40 J = 1,N
                  DO 30 I = 1,M
                      C(I,J) = BETA*C(I,J)
   30             CONTINUE
   40         CONTINUE
          END IF
          RETURN
      END IF
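FORTRAN has the following fixed column rules: a C or * in column 1 makes the whole line a comment, columns 1-5 hold an optional numeric statement label, any character other than a space or zero in column 6 continues the previous line, and the statement itself lives in columns 7-72.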
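A rough Python sketch of the standard fixed-form layout (comment marker in column 1, label in columns 1-5, continuation mark in column 6, statement text in columns 7-72) might look like this. `parse_fixed_form` is a hypothetical helper, and it ignores tab-formatted source.

```python
def parse_fixed_form(line: str) -> tuple[str, str, str]:
    """Return (kind, label, text) for one fixed-form FORTRAN source line."""
    if line[:1] in ("C", "c", "*", "!"):
        return ("comment", "", line[1:].strip())
    padded = line.ljust(72)[:72]        # columns 73 and up are ignored
    label = padded[0:5].strip()         # columns 1-5: statement label
    text = padded[6:].strip()           # columns 7-72: the statement
    if padded[5] not in (" ", "0"):     # column 6: continuation mark
        return ("continuation", "", text)
    return ("statement", label, text)
```

Since the decision only depends on where a character sits in the line, the highlighter needs a column counter more than it needs a grammar.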
Now here's some properly syntax highlighted COBOL code.
000100*Hello World in COBOL
000200 IDENTIFICATION DIVISION.
000300 PROGRAM-ID. HELLO-WORLD.
000400
000500 PROCEDURE DIVISION.
000600     DISPLAY 'Hello, world!'.
000700     STOP RUN.
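With COBOL, the rules are: columns 1-6 hold a sequence number, column 7 is an indicator where * marks a comment line and - continues the previous line, and the code itself goes in columns 8-72.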
Zig has its own unique solution for multi-line strings, which are prefixed with two backslashes.
const copyright =
    \\ Copyright (c) 2024, Zig Incorporated
    \\ All rights reserved.
;
What I like about this syntax is that it eliminates the need we've always had for calling textwrap.dedent() with Python's triple quoted strings. The tradeoff is that the semicolon is ugly. This is a string syntax that really ought to be considered by one of the languages that don't need semicolons, e.g. Go, Scala, Python, etc.
Lua has a unique multi-line string syntax, and it uses an approach similar to C# and Swift when it comes to solving the "embed a string inside a string" problem. It works by using double square brackets, and it lets you put an arbitrary number of equal signs in between them.
-- this is a comment
[[hi [=[]=] ]] there
[[hi [=[]=] ]] there
[==[hi [=[]=] ]==] hello
[==[hi ]=]==] hello
[==[hi ]===]==] hello
[====[hi ]===]====] hello
What's really interesting is that it lets you do this with comments too.
--[[
comment #1
]]
print("hello")
--[==[
comment [[#2]]
]==]
print("world")
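Lexing these long brackets comes down to counting the equal signs on the way in and searching for the matching closer. Here's a Python sketch, where `lua_long_bracket_end` is a made-up name. For a comment, the caller would check for `--` first and then call it on the bracket that follows.

```python
def lua_long_bracket_end(src: str, start: int) -> int:
    """Return the index just past the long bracket opening at start.

    Returns -1 if start isn't the beginning of a [[ or [=*[ opener.
    """
    if not src.startswith("[", start):
        return -1
    i = start + 1
    level = 0
    while src.startswith("=", i):
        level += 1                      # count the equal signs
        i += 1
    if not src.startswith("[", i):
        return -1
    closer = "]" + "=" * level + "]"    # must use the same number of = signs
    close = src.find(closer, i + 1)
    return len(src) if close < 0 else close + len(closer)
```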
One of the most challenging languages to syntax highlight is assembly, due to the fragmentation of all its various dialects. I've sought to build something with llamafile that does a reasonably good job with AT&T, nasm, etc. syntax. Here's nasm syntax:
section .data
    message db 'Hello, world!', 0xa ; The message string, ending with a newline

section .text
    global _start

_start:
    ; Write the message to stdout
    mov rax, 1          ; System call number for write
    mov rdi, 1          ; File descriptor for stdout
    mov rsi, message    ; Address of the message string
    mov rdx, 13         ; Length of the message
    syscall

    ; Exit the program
    mov rax, 60         ; System call number for exit
    xor rdi, rdi        ; Exit code 0
    syscall
And here's AT&T syntax:
/ syscall

.globl  _syscall,csv,cret,cerror

_syscall:
        jsr     r5,csv
        mov     r5,r2
        add     $04,r2
        mov     $9f,r3
        mov     (r2)+,r0
        bic     $!0377,r0
        bis     $sys,r0
        mov     r0,(r3)+
        mov     (r2)+,r0
        mov     (r2)+,r1
        mov     (r2)+,(r3)+
        mov     (r2)+,(r3)+
        mov     (r2)+,(r3)+
        mov     (r2)+,(r3)+
        mov     (r2)+,(r3)+
        sys     0; 9f
        bec     1f
        jmp     cerror
1:
        jmp     cret

        .data
9:
        .=.+12.
And here's GNU syntax:
/ setjmp() for x86-64
// this is a comment too
; so is this
# this too!
! hello sparc

setjmp: lea     8(%rsp),%rax
        mov     %rax,(%rdi)
        mov     %rbx,8(%rdi)
        mov     %rbp,16(%rdi)
        mov     %r12,24(%rdi)
        mov     %r13,32(%rdi)
        mov     %r14,40(%rdi)
        mov     %r15,48(%rdi)
        mov     (%rsp),%rax
        mov     %rax,56(%rdi)
        xor     %eax,%eax
        ret
With keywords I've found the simplest thing is to just treat the first identifier on the line (that isn't followed by a colon) as a keyword. That tends to make most of the assembly I've tried look pretty reasonable.
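That heuristic is short enough to sketch in Python. The `IDENT` pattern and `asm_keyword` function are assumptions for illustration, and they skip any number of leading labels before picking the keyword.

```python
import re

# An identifier-ish run, optionally followed by a colon that makes it a label.
IDENT = re.compile(r"\s*([\w.$]+)\s*(:?)")


def asm_keyword(line: str):
    """Return the first identifier on the line that isn't a label, or None."""
    pos = 0
    while True:
        m = IDENT.match(line, pos)
        if not m:
            return None                 # comment, blank, or nothing left
        if m.group(2) != ":":
            return m.group(1)           # not followed by a colon: keyword
        pos = m.end()                   # it was a label; keep looking
```

Lines that start with a comment character simply fail to match and produce no keyword, so the comment rules can be handled separately.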
The comment syntax is real hairy. I really like the original UNIX comments which only needed a single slash. GNU as still supports those to this day, but only if they're at the beginning of the line (UNIX could originally put them anywhere, since as didn't have the ability to do arithmetic back then). Clang doesn't support fixed comments at all, so they're sadly not practical anymore to use in open source code.
But this story gets even better. Another weird thing about the original UNIX assembler is that it didn't use a closing quote on character literals. So where we'd say 'x' to get 0x78 for x, in the original UNIX source code, you'd say 'x. This is another thing GNU as continues to support, but sadly not LLVM. In any case, since a lot of code exists that uses this syntax, any good syntax highlighter needs to support it.
The GNU assembler allows identifiers to be quoted, so you can put pretty much any character in a symbol.
Finally, it's not enough to just highlight assembly when highlighting assembly. The assembler is usually used in conjunction with either the C preprocessor, or m4. Trust me, lots of open source code does this. Therefore lines starting with dnl, m4_dnl, or C should be taken as comments too.
Ada is a remarkably simple language to lex, but there's one thing I haven't quite wrapped my head around yet, which is its use of the single quotation mark. Ada can have character literals like C, e.g. 'x'. But single quote can also be used to reference attributes, e.g. Foo'Size. Single quote even lets you embed expressions and call functions. For example, the program:
with Ada.Text_IO;

procedure main is
   S : String := Character'(')')'Image;
begin
   Ada.Text_IO.Put_Line("The value of S is: " & S);
end main;
Will print out:
The value of S is: ')'
Because we're declaring a character, giving it a value, and then sending it through the Image function, which converts it to a String representation.
Let's talk about the Beginner's All-purpose Symbolic Instruction Code. While digging through the repos I've git cloned, I came across this old Commodore BASIC program that broke many of my assumptions about syntax highlighting.
10 rem cbm basic v2 example
20 rem comment with keywords: for, data
30 dim a$(20)
35 rem the typical space efficient form of leaving spaces out:
40 fort=0to15:poke646,t:print"{revers on} ";:next
50 geta$:ifa$=chr$(0):goto40
55 rem it is legal to omit the closing " on line end
60 print"{white}":print"bye...
70 end
We'll notice that this particular BASIC implementation didn't require a closing quote on strings, variable names have these weird sigils, and keywords like goto are lexed eagerly out of identifiers.
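Here's a Python sketch of what eager keyword lexing means in practice. At every position it tries to match a keyword before falling back to anything else. The keyword list is a small subset chosen for this example and is tried longest first, which is an assumption of the sketch and not a claim about how Commodore's tokenizer orders its table.

```python
# Hypothetical subset of keywords, longest first so "goto" wins over "to".
KEYWORDS = sorted(
    ["for", "to", "goto", "get", "if", "next", "poke", "print", "dim", "end", "rem"],
    key=len, reverse=True)


def basic_tokens(line: str) -> list[tuple[str, str]]:
    """Split one BASIC statement line into keyword/string/comment/other spans."""
    tokens = []
    other = ""
    i = 0
    while i < len(line):
        if line[i] == '"':
            # A string runs to the closing quote, or to end of line without one.
            close = line.find('"', i + 1)
            end = len(line) if close < 0 else close + 1
            kind, text = "string", line[i:end]
        else:
            kw = next((k for k in KEYWORDS if line.startswith(k, i)), None)
            if kw is None:
                other += line[i]        # not a keyword: accumulate as-is
                i += 1
                continue
            if kw == "rem":
                kind, text = "comment", line[i:]   # rest of line is a remark
            else:
                kind, text = "keyword", kw
        if other:
            tokens.append(("other", other))
            other = ""
        tokens.append((kind, text))
        i += len(text)
    if other:
        tokens.append(("other", other))
    return tokens
```

So `fort=0to15` comes out as `for`, `t=0`, `to`, `15`, even though no human would read `fort` that way at first glance.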
Visual BASIC also has this weird date literal syntax:
Dim v As Variant    ' Declare a Variant
v = #1/1/2024#      ' Hold a date
That's tricky to lex, because VB also uses # to begin preprocessor directives.
#If DEBUG Then
<WebMethod()>
Public Function SomeFunction() As String
#Else
<WebMethod(CacheDuration:=86400)>
Public Function SomeFunction() As String
#End If
One of the trickier languages to highlight is Perl. It exists in the spiritual gulf between shells and programming languages, and inherits the complexity of both. Perl isn't as popular today as it once was, but its influence continues to be prolific. Perl made regular expressions a first class citizen of the language, and the way regex works in Perl has since been adopted by many other programming languages, such as Python. However the regex lexical syntax itself continues to be somewhat unique.
For example, in Perl, you can replace text similar to sed as follows:
my $string = "HELLO, World!";
$string =~ s/hello/Perl/i;
print $string;  # Output: Perl, World!
Like sed, Perl also allows you to replace the slashes with an arbitrary punctuation character, since that makes it easier for you to put slashes inside your regex.
$string =~ s!hello!Perl!i;
What you might not have known, is that it's possible to do this with mirrored characters as well, in which case you need to insert an additional character:
$string =~ s{hello}{Perl}i;
However s/// isn't the only weird thing that needs to be highlighted like a string. Perl has a wide variety of other magic prefixes.
/case sensitive match/
/case insensitive match/i
y/abc/xyz/e
s!hi!there!
m!hi!i
m;hi;i
qr!hi!u
qw!hi!h
qq!hi!h
qx!hi!h
m-hi-
s-hi-there-g
s"hi"there"g
s@hi@there@ yo
s{hi}{there}g
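Here's a rough Python sketch of scanning one of these quote-like operators once the lexer has decided that a prefix such as `s` or `qw` really is one. The `MIRROR` and `PARTS` tables and both function names are assumptions for this example. Mirrored delimiters nest, and for two-part operators they get a fresh opener for the second part.

```python
MIRROR = {"(": ")", "[": "]", "{": "}", "<": ">"}
# How many delimited parts follow each prefix.
PARTS = {"s": 2, "y": 2, "tr": 2, "m": 1, "q": 1, "qq": 1, "qr": 1, "qw": 1, "qx": 1}


def scan_part(src: str, i: int, opener: str, closer: str) -> int:
    """Scan from i to just past the closer that balances an already-read opener."""
    depth = 1
    while i < len(src):
        ch = src[i]
        if ch == "\\":
            i += 2                      # skip the escaped character
            continue
        if ch == closer:                # checked first so s/a/b/ works
            depth -= 1
            if depth == 0:
                return i + 1
        elif ch == opener:
            depth += 1                  # mirrored delimiters can nest
        i += 1
    return len(src)


def perl_quote_end(src: str, start: int, op: str) -> int:
    """Return the index just past the quote-like operator op at start."""
    i = start + len(op)
    opener = src[i:i + 1]
    closer = MIRROR.get(opener, opener)
    i = scan_part(src, i + 1, opener, closer)
    if PARTS[op] == 2:
        if opener in MIRROR:
            # s{a}{b}: the second part brings its own opening bracket.
            while i < len(src) and src[i].isspace():
                i += 1
            opener = src[i:i + 1]
            closer = MIRROR.get(opener, opener)
            i += 1
        i = scan_part(src, i, opener, closer)
    while i < len(src) and src[i].isalpha():
        i += 1                          # trailing flags
    return i
```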
One thing that makes this tricky to highlight is that you need to take context into consideration, so you don't accidentally think that y/x/y/ is a division formula. Thankfully, Perl makes this relatively easy, because variables can always be counted upon to have sigils, which are usually $ for scalars, @ for arrays, and % for hashes.
my $greeting = "Hello, world!";

# Array: A list of names
my @names = ("Alice", "Bob", "Charlie");

# Hash: A dictionary of ages
my %ages = ("Alice" => 30, "Bob" => 25, "Charlie" => 35);

# Print the greeting
print "$greeting\n";

# Print each name from the array
foreach my $name (@names) {
    print "$name\n";
}
This helps us avoid the need for parsing the language grammar.
Perl also has this goofy convention for writing man pages in your source code. Basically, any =word at the start of the line will get it going, and =cut will finish it.
#!/usr/bin/perl

=pod

=head1 NAME

my_silly_script - A Perl script demonstrating =cut syntax

=head1 SYNOPSIS

 my_silly_script [OPTIONS]

=head1 DESCRIPTION

This script does absolutely nothing useful, but it showcases
the quirky =cut syntax for POD documentation in Perl.

=head1 OPTIONS

There are no options.

=head1 AUTHOR

Your Name <your.email@example.com>

=head1 COPYRIGHT

Copyright (c) 2023 Your Name. All rights reserved.

=cut

print "Hello, world!\n";
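Because POD is line-oriented, a highlighter can treat it as a block comment with a one-bit state. Here's a Python sketch, where `pod_lines` and the `POD_START` pattern are names invented for this example:

```python
import re

POD_START = re.compile(r"=[A-Za-z]")    # any =word at the start of a line


def pod_lines(source: str) -> list[tuple[bool, str]]:
    """Pair each line with True if it belongs to a POD block."""
    in_pod = False
    out = []
    for line in source.split("\n"):
        if not in_pod and POD_START.match(line):
            in_pod = True               # =pod, =head1, ... gets it going
        out.append((in_pod, line))
        if in_pod and line.startswith("=cut"):
            in_pod = False              # =cut finishes it
    return out
```

Note that the `=cut` in the NAME paragraph above sits in the middle of a line, so it doesn't end the block.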
Of all the languages, I've saved the best for last, which is Ruby. Now here's a language whose syntax evades all attempts at understanding. Ruby is the union of all earlier languages, and it's not even formally documented. Their manual has a section on Ruby syntax, but it's very light on details. Whenever I try to test my syntax highlighting, by concatenating all the .rb files on my hard drive, there's always another file that finds some way to break it.
def `(command)
  return "just testing a backquote override"
end
Since Ruby supports backquote syntax like var = `echo hello`, I'm not exactly sure how to tell that the backquote above isn't meant to be highlighted as a string. Another example is this:
when /\.*\.h/
  options[:includes] <<arg; true
when /--(\w+)=\"?(.*)\"?/
  options[$1.to_sym] = $2; true
Ruby has a << operator, and it also supports heredocs (just like Perl and Shell). So I'm not exactly sure how to tell that the code above isn't a heredoc. Yes that code actually exists in the wild. Even Emacs gets this wrong. Out of all 42 languages I've evaluated, that's probably the biggest shocker so far. It might be the case that Ruby isn't possible to lex without parsing. Even with parsing, I'm still not sure how it's possible to make sense of that.
If I were to rank the complexity of programming languages by how many lines of code each one takes to syntax highlight, then FORTH would be the simplest language, and Ruby would be the most complicated.
125 highlight_forth.cpp      266 highlight_lua.cpp
132 highlight_m4.cpp         282 highlight_csharp.cpp
149 highlight_ada.cpp        282 highlight_rust.cpp
160 highlight_lisp.cpp       297 highlight_python.cpp
163 highlight_test.cpp       300 highlight_java.cpp
166 highlight_matlab.cpp     321 highlight_haskell.cpp
186 highlight_cobol.cpp      335 highlight_markdown.cpp
199 highlight_basic.cpp      337 highlight_js.cpp
200 highlight_fortran.cpp    340 highlight_html.cpp
211 highlight_sql.cpp        371 highlight_typescript.cpp
216 highlight_tcl.cpp        387 highlight_kotlin.cpp
218 highlight_tex.cpp        387 highlight_scala.cpp
219 highlight.cpp            447 highlight_asm.cpp
220 highlight_go.cpp         449 highlight_c.cpp
225 highlight_css.cpp        455 highlight_swift.cpp
225 highlight_pascal.cpp     560 highlight_shell.cpp
230 highlight_zig.cpp        563 highlight_perl.cpp
235 highlight_make.cpp       624 highlight_ruby.cpp
239 highlight_ld.cpp
263 highlight_r.cpp
llamafile is a Mozilla project, and Mozilla sponsors me to work on it. My work on open source is also made possible by my GitHub sponsors and Patreon subscribers. Thank you for giving me the opportunity to serve you all these last four years. Since you've read this far, I'd like to invite you to join both the Mozilla AI Discord and the Redbean Discord servers where you can chat with me and other people who love these projects.