5.1 Code that writes code
As we near the conclusion of the course, we'll turn our attention to a paradigm that underlies all of the programming languages we use and interact with. This lesson will feel a little more philosophical, and it begins to ask and answer questions like, "How are programming languages written?"
What is metaprogramming?
Metaprogramming concerns itself with programs that read a program as input and generate another program as output. In other words, it's code that writes code! This is an odd idea to wrap your head around at first, but if you've ever begun to ask yourself how programming languages are actually built, well, it's with metaprogramming!
Programming languages, whether interpreted or compiled, are programs that read text files (text written according to a specific programming syntax) and either interpret code at runtime (e.g. Python and JavaScript), or produce a compiled binary artifact that can be run as an executable program (e.g. Go, C, and C++). The latter type of program is called a compiler, which we'll discuss in greater detail below.
Compilers
More generally, a compiler is a program that translates source code written in one programming language into another programming language. Compilers usually translate higher-level languages to lower-level languages, so that programmers can write code with stronger abstractions and therefore enjoy a simpler development experience.
However, there's always a trade-off: a compiler ideally produces efficient code in the low-level representation, but its output is often not as efficient as carefully hand-written low-level code. This isn't always the case, though; in a specific area of genetic programming, compiled Common Lisp was found to be faster than C. Compiler optimization, which makes results like this possible, is a rich area of research.
Compilers are typically written using a phased strategy. The four traditional phases, listed in the order they occur, are:
Lexical analysis
Syntax analysis
Semantic analysis
Code generation
Lexical analysis
Lexical analysis is the process of scanning the input and categorizing each lexeme, or sequence of characters that matches a particular pattern (as determined by the programming language). This process creates an ordered list of tokens that represent the individual components of the input.
For example, consider the simplest Go program:
With this program as input, the lexer would produce a series of tokens that contains something along the lines of the following (where each element is an individual token):
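For the empty program above, the token stream would look roughly like this (each bracketed element is one token):

```
[package] [main] [func] [main] [(] [)] [{] [}]
```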
As you can see, certain elements are grouped together (such as alphanumerics), whereas others aren't (such as the individual separator elements). Now that we have tokenized the program into a simple list, we can parse it with syntax analysis.
Syntax analysis
Syntax analysis is the process of constructing an abstract syntax tree (AST) from the tokenized input. The AST is used to create a hierarchy of the tokens and compose them in a way that represents a logical ordering of the input program. This phase also determines if the tokens actually form a valid expression with respect to the program's grammar. If not, the parser will fail and (hopefully) notify the user with a helpful error message.
For example, suppose we made an invalid change to the program above, where the func keyword was replaced with fun:
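Sticking with the minimal program, the broken version would look like this:

```go
package main

fun main() {}
```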
If we tried to compile this code with go build, we'd see the following:
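The exact message varies by Go version, but it's something along the lines of:

```
./main.go:3:1: syntax error: non-declaration statement outside function body
```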
In this case, the Go compiler reports a syntax error because it interprets the fun token as the start of a non-declaration statement, which isn't allowed outside of a function body.
When the parser is successful, it will produce a valid AST. There's a cool tool, which can be found here, that generates an AST from valid Go code. Make sure to give it a shot!
Semantic analysis
Once we have successfully parsed the input source code, we proceed to semantic analysis, which performs type checking, enforces scope rules, and generally verifies the validity of the program. This often requires building a symbol table, which is used to look up the definitions of program constructs (e.g. variables, functions, etc.) so the compiler can verify on the fly whether certain operations are allowed within the program.
Thus, the semantic analysis phase often requires two passes over the AST: one pass to create a symbol table that can be consulted in the second pass to actually enforce the programming language's rules.
For example, suppose we made yet another invalid change to the program above, where we tried to return a value that didn't conform to the signature of its surrounding function:
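One way to do that with the minimal program is to return a value from main, which declares no return values:

```go
package main

func main() {
	return 42
}
```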
If we tried to compile this code with go build, we'd see the following:
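Again, the exact wording varies by Go version, but the error is something like:

```
./main.go:4:2: too many return values
	have (untyped int constant)
	want ()
```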
In this case, the program was successfully parsed, but it didn't pass the Go programming language's rules for semantic analysis!
Code generation
Finally, once our program has passed the semantic analysis step, we can generate a valid program from the AST. Most compilers generate code directly to some assembly language for a particular machine architecture, such as x86.
Aside from raw binary, this is as low-level as it gets. The entire Go program is translated into a series of assembly instructions specific to a particular architecture, which is then used to create an executable binary file.
With the original, valid program we have above, we can inspect the resulting assembly code by issuing the following command:
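One way to do this is to ask the Go compiler to print the assembly it generates, via the -S compiler flag:

```
go build -gcflags=-S main.go
```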
And from there, the assembly is compiled into an executable that we can run on our machine!
Fortunately, the go tool does all of this for us with a simple run of go build. Suppose that we had the following program:
Now we can compile the main.go file into an executable file named hello-world:
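The -o flag names the output binary:

```
go build -o hello-world main.go
```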
Then run it by executing the file on the command line!
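On a Unix-like shell, that's:

```
./hello-world
```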
Further reading
We've covered a lot of topics here, but there is so much more to learn in each of these domains; we've barely scratched the surface. There are a couple of books focused on building an interpreter and compiler in Go, so we recommend checking them out if you find this topic interesting!