This is the first in a 5-part tutorial on how to implement a programming language. It is intended for people with some programming experience, who want to know how their compiler, interpreter or virtual machine works. Hint: it's not magic.
In the introduction I explained why you might want to make your own programming language, and why you might want to learn it from me. Now we'll start actually doing it. I encourage you to experiment; don't follow my examples blindly. By the end of this installment, we should have a working and useful (if simplistic) interpreter.
I'll spare you a boring lecture on how computers understand only ones and zeroes, and they have to somehow translate your high-level instructions into something they can digest. Let's just say that source code we humans perceive as a structure:
function blah (...) { while (...) { if (...) { ... } else { ... } } }
is just a long string of characters for the computer:
f u n c t i o n b l a h ( . . . ) { w h i l e ( . . . ) . . . }
Imagine having to read a book through a little hole, one letter at a time. It's not easy being a machine.
So the first order of business is to split up the input into words, numbers and so on. Here I'll use a very simple rule: just split at whitespace. If this seems like a cheap cop-out, look at your own source code: most likely, you have whitespace everywhere, anyway.
function ScratchLexer(text) { var words = text.split(/\s+/); var next = 0; this.nextWord = function () { if (next >= words.length) return null; return words[next++]; }; }
I'm taking advantage of the regular expression support in Javascript. Languages that lack it, such as ActionScript 2.0, require a different approach, which I'll demonstrate in part 2. In any event, the lexer's job (that's short for "lexical analyzer") is to supply the next word of input on demand.
We have some input, now to make sense of it. Most programming languages have baroque rules for chaining keywords and punctuation into ellaborate constructs. (Just for kicks, check out the Wikipedia article on Recursive descent parsers.) But since we started simple, let's keep it simple:
As an extra feature, I also make sure that words can be recognized regardless of case.
function Scratch () { var dictionary = {}; this.stack = []; this.addWords = function (new_dict) { for (var word in new_dict) dictionary[word.toUpperCase()] = new_dict[word]; }; this.run = function (text) { var lexer = new ScratchLexer(text); var word; var num_val; while (word = lexer.nextWord()) { word = word.toUpperCase(); num_val = parseFloat(word); if (dictionary[word]) { dictionary[word](this); } else if (!isNaN(num_val)) { this.stack.push(num_val); } else { throw "Unknown word"; } } }; }
Now you can create an interpreter and check that it does indeed push numbers on the stack.
var terp = new Scratch(); // 'terp is short for interpreter, btw... terp.run("1 2 3 45 678"); alert(terp.stack);
Trouble is, without any words in the dictionary, there is nothing you can actually do to those numbers. Booo-ring!
What's the most basic thing you can do with things on the stack? Why, print them, of course!
var PrintingWords = { // Print and discard top of stack. "PRINT": function (terp) { if (terp.stack.length < 1) throw "Not enough items on stack"; var tos = terp.stack.pop(); alert(tos); }, // Print out the contents of the stack. "PSTACK": function (terp) { alert(terp.stack); } };
Add these words to the interpreter like this:
terp.addWords(PrintingWords);
and check that they do indeed work:
10 pstack
1 2 3 print print print
Not yet useful, but now we're getting somewhere.
No comment here. What could be simpler?
var MathWords = { "+": function (terp) { if (terp.stack.length < 2) throw "Not enough items on stack"; var tos = terp.stack.pop(); var _2os = terp.stack.pop(); terp.stack.push(_2os + tos); }, "-": function (terp) { if (terp.stack.length < 2) throw "Not enough items on stack"; var tos = terp.stack.pop(); var _2os = terp.stack.pop(); terp.stack.push(_2os - tos); }, "*": function (terp) { if (terp.stack.length < 2) throw "Not enough items on stack"; var tos = terp.stack.pop(); var _2os = terp.stack.pop(); terp.stack.push(_2os * tos); }, "/": function (terp) { if (terp.stack.length < 2) throw "Not enough items on stack"; var tos = terp.stack.pop(); var _2os = terp.stack.pop(); terp.stack.push(_2os / tos); }, "SQRT": function (terp) { if (terp.stack.length < 1) throw "Not enough items on stack"; var tos = terp.stack.pop(); terp.stack.push(Math.sqrt(tos)); } };
With the math words in place (add them to the interpreter like before), you can say things like:
2 2 + print
or even
3 3 * 4 4 * + sqrt print
Look, ma, a real interpreter!
This is all fine and dandy, but what if you need more complex manipulations? Here are some stack operations that might prove useful.
var StackWords = { // Duplicate the top of stack (TOS). "DUP": function (terp) { if (terp.stack.length < 1) throw "Not enough items on stack"; var tos = terp.stack.pop(); terp.stack.push(tos); terp.stack.push(tos); }, // Throw away the TOS -- the opposite of DUP. "DROP": function (terp) { if (terp.stack.length < 1) throw "Not enough items on stack"; terp.stack.pop(); }, // Exchange positions of TOS and second item on stack (2OS). "SWAP": function (terp) { if (terp.stack.length < 2) throw "Not enough items on stack"; var tos = terp.stack.pop(); var _2os = terp.stack.pop(); terp.stack.push(tos); terp.stack.push(_2os); }, // Copy 2OS on top of stack. "OVER": function (terp) { if (terp.stack.length < 2) throw "Not enough items on stack"; var tos = terp.stack.pop(); var _2os = terp.stack.pop(); terp.stack.push(_2os); terp.stack.push(tos); terp.stack.push(_2os); }, // Bring the 3rd item on stack to the top. "ROT": function (terp) { if (terp.stack.length < 3) throw "Not enough items on stack"; var tos = terp.stack.pop(); var _2os = terp.stack.pop(); var _3os = terp.stack.pop(); terp.stack.push(_2os); terp.stack.push(tos); terp.stack.push(_3os); }, };
Of these, DUP
is probably the most useful (now you
can print the top of the stack whitout losing it). The others
can come in handy, too. By the way, if the names seem cryptic, feel
free to come up with something better. But they are pretty much
standard for stack operations.
Congratulations, you have a... calculator?
Of course, using Scratch that way would be pointless. But now you have a base for defining your own words. Imagine the users of your programs (who might be other computer programs, hint, hint) being able to say things like:
bottom left origin brown pen 0 0 to 100 60 box red pen 0 60 to 50 110 line 50 110 to 100 60 line blue pen 10 40 to 20 50 box 80 40 to 90 50 box 50 0 to 60 40 box
Which, assuming a suitable graphics API, draws a stick figure house. A crooked one, to be sure, but that's not the language's fault.
If that's enough for you, feel free to stop here. But you'd be missing out. In the next installment, I'll start turning Scratch into a real programming language, with variables, strings, comments and so on.
P.S. The informed reader might have noticed that Scratch is essentially
Forth
with more friendly words. This is deliberate. Using "."
for print
made sense in 1969 when Forth was invented.
Nowadays it just scares people away.