This is the second post about my C compiler, SCC. Look here for an index.
In the past couple of months I have been working on the compiler fairly regularly, in manageable 30-40 minute intervals about every day or so, and it has gotten to the point where I know that I can get a working C compiler by just putting in the hours.
This project has been like drinking from a firehose. There were many concepts which I understood but had not internalized before now. Things like how the stack grows backwards, by decrementing the stack pointer instead of incrementing as we normally think of stacks. Or even things like endianness, where I understood them at an intellectual level, but I very seldom had to actually deal with it. The only time I remember caring about endianess before the compiler was briefly when I was working on compression stuff, and even then it was a matter of getting my code to work and then never thinking about it ever again. I was even employed at Intel for a very short time, and even though I will be the first to admit that I didn’t do squat while I was there if we’re being honest, I did do a lot of stepping through assembly and twiddling various CPU registers in whichever way some spec dictated. But even then I didn’t really have to think about endianness that much.
Not that endianness is a particularly complicated thing to know, but like anything else, there is some practice that you need to have with an idea before you are comfortable working with it without conscious effort. That’s what I mean by internalizing. Working on a compiler is a great way to get practice, and get comfortable with, bit twiddling concepts that are always useful but not used often enough to create muscle memory.
The code in scc
is still fairly simple, only recently did it get above the 5000 LOC mark (counted using wc
), and as of this moment my unofficial goal is to get to a working compiler without breaking the 10,000 lines-of-code mark.
I was planning on using Fabrice Bellard’s tcc
as a reference, and there is one thing in scc
that was 100% inspired by tcc
: having a single header file with most of the compiler’s structs and function declarations. There are other similitudes (other than the name), but they were just (extremely validating) coincidences. At this point I’m considering looking at tcc
to be cheating and I won’t be looking at it anymore. At least not until the project is at a much more advanced stage. 90% of the fun of this project is to figure things out on my own.
The latest big change was to introduce a machine abstraction. The first version of the compiler looked something like this:
int main() {
printf("mov eax, 42\n");
}
and every step in the design has been an iterative step from that point towards a working C compiler.
A machine abstraction is the first time where I think I ran into something that I should have done from the start. Before the machine abstraction, the entire code generator was in a file called codegen.c
, and the code inside looked something like the following. No need to pay too much attention at the code, just look at how we were printing an instruction before, and then we were calling some weird new functions.
Before:
if (target == Target_STACK) {
instructionPrintf("push rax");
}
and now it looks more like:
if (target == Target_STACK) {
machStackPushReg(m, machAccumInt64()->location.reg);
}
where machStackPushReg
‘s body is pretty much just the instructionPrintf call.
In the current state, codegen.c
doesn’t ever emit x64 instructions directly. Any time that there used to be an x64 instruction, there is now a function call with the mach
prefix. In theory, if I ever decided to add a separate backend (like LLVM IR), I would just have to implement all the mach*
functions for the corresponding LLVM IR mappings. Currently there is only one backend, in x64.c
, where all the mach*
functions are implemented.
The reason for the machine abstraction is not to have multiple backends, although it will certainly be useful for that. The reason for it is that there are different x64 instructions for floating point data and integer data. When I was starting to implement floating point variables, I started noticing that I had a bunch of switch statements over the type of some value. It was common enough that it felt necessary to abstract away arithmetic instructions.
So now the codegen code for adding just calls a machAdd
call for some pair of ExprType
objects, which might point to a floating point or int variable, or even wide types in the future.
And the machAdd implementation looks something like:
void
machAdd(Machine* m, ExprType* dst, ExprType* src) {
u32 bits = typeBits(&dst->c);
if (!isRealType(&dst->c))) {
instructionPrintf("add %s, %s",
locationString(m, dst->location, bits),
locationString(m, src->location, bits));
}
else {
instructionPrintf("addps %s, %s",
locationString(m, dst->location, bits),
locationString(m, src->location, bits));
}
}
I actually think I will rename the ExprType
struct to something like TypedRegister
. So the machine abstraction will be almost like a flexible ad-hoc typed assembly language. I wish this kind of thing had a name. Something like “intermediate representation” might work :)
It is very fun to do incremental changes when designing a program from scratch, trying not to look at other solutions, and end up reinventing some concept that has already been in use for ever. In the best case, you do something in a different enough way that you might actually contribute something novel, and in the worst case it’s a good learning experience. It’s all good.
The compiler project has pretty much no fat. My other project that is still alive, Milton, has a lot of fat. Not because of bad design but because the kind of problem it wants to solve has a lower interesting-problems to boring-problems ratio. I highly recommend anyone reading who has interest in low level programming to write a toy compiler.
Thanks for reading my ramblings! I hope it doesn’t take me a year to publish another post.
>> Home