, 2008-10-09 (DRAFT 2)

Make Your Own Programming Language
Part 1

This is the first in a 5-part tutorial on how to implement a programming language. It is intended for people with some programming experience, who want to know how their compiler, interpreter or virtual machine works. Hint: it's not magic.

In the introduction I explained why you might want to make your own programming language, and why you might want to learn it from me. Now we'll start actually doing it. I encourage you to experiment; don't follow my examples blindly. By the end of this installment, we should have a working and useful (if simplistic) interpreter.

Introduction

I'll spare you a boring lecture on how computers understand only ones and zeroes, and they have to somehow translate your high-level instructions into something they can digest. Let's just say that source code we humans perceive as a structure:

function blah (...) {
	while (...) {
		if (...) {
			...
		} else {
			...
		}
	}
}

is just a long string of characters for the computer:

f u n c t i o n  b l a h   ( . . . )  {  w h i l e  ( . . . )  . . . }

Imagine having to read a book through a little hole, one letter at a time. It's not easy being a machine.

Splitting up the input

So the first order of business is to split up the input into words, numbers and so on. Here I'll use a very simple rule: just split at whitespace. If this seems like a cheap cop-out, look at your own source code: most likely, you have whitespace everywhere, anyway.

function ScratchLexer(text) {
	var words = text.split(/\s+/);
	var next = 0;
	this.nextWord = function () {
		if (next >= words.length) return null;
		return words[next++];
	};
}

I'm taking advantage of the regular expression support in Javascript. Languages that lack it, such as ActionScript 2.0, require a different approach, which I'll demonstrate in part 2. In any event, the lexer's job (that's short for "lexical analyzer") is to supply the next word of input on demand.

Interpreting stuff

We have some input, now to make sense of it. Most programming languages have baroque rules for chaining keywords and punctuation into ellaborate constructs. (Just for kicks, check out the Wikipedia article on Recursive descent parsers.) But since we started simple, let's keep it simple:

  1. First, look up the word in a dictionary. If found, act upon it.
  2. Else, check whether it's a number. If so, push it on a stack.
  3. Finally, if it's neither, throw up.

As an extra feature, I also make sure that words can be recognized regardless of case.

function Scratch () {
	var dictionary = {};

	this.stack = [];

	this.addWords = function (new_dict) {
		for (var word in new_dict)
			dictionary[word.toUpperCase()] = new_dict[word];
	};

	this.run = function (text) {
		var lexer = new ScratchLexer(text);
		var word;
		var num_val;

		while (word = lexer.nextWord()) {
			word = word.toUpperCase();
			num_val = parseFloat(word);
			if (dictionary[word]) {
				dictionary[word](this);
			} else if (!isNaN(num_val)) {
				this.stack.push(num_val);
			} else {
				throw "Unknown word";
			}
		}
	};
}

Now you can create an interpreter and check that it does indeed push numbers on the stack.

var terp = new Scratch(); // 'terp is short for interpreter, btw...
terp.run("1 2 3 45 678");
alert(terp.stack);

Trouble is, without any words in the dictionary, there is nothing you can actually do to those numbers. Booo-ring!

Basic words

What's the most basic thing you can do with things on the stack? Why, print them, of course!

var PrintingWords = {
	// Print and discard top of stack.
	"PRINT": function (terp) {
		if (terp.stack.length < 1) throw "Not enough items on stack";
		var tos = terp.stack.pop(); alert(tos);
	},
	// Print out the contents of the stack.
	"PSTACK": function (terp) {
		alert(terp.stack);
	}
};

Add these words to the interpreter like this:

terp.addWords(PrintingWords);

and check that they do indeed work:

10 pstack
1 2 3 print print print

Not yet useful, but now we're getting somewhere.

A little math

No comment here. What could be simpler?

var MathWords = {
	"+": function (terp) {
		if (terp.stack.length < 2) throw "Not enough items on stack";
		var tos = terp.stack.pop();
		var _2os = terp.stack.pop();
		terp.stack.push(_2os + tos);
	},
	"-": function (terp) {
		if (terp.stack.length < 2) throw "Not enough items on stack";
		var tos = terp.stack.pop();
		var _2os = terp.stack.pop();
		terp.stack.push(_2os - tos);
	},
	"*": function (terp) {
		if (terp.stack.length < 2) throw "Not enough items on stack";
		var tos = terp.stack.pop();
		var _2os = terp.stack.pop();
		terp.stack.push(_2os * tos);
	},
	"/": function (terp) {
		if (terp.stack.length < 2) throw "Not enough items on stack";
		var tos = terp.stack.pop();
		var _2os = terp.stack.pop();
		terp.stack.push(_2os / tos);
	},
	"SQRT": function (terp) {
		if (terp.stack.length < 1) throw "Not enough items on stack";
		var tos = terp.stack.pop();
		terp.stack.push(Math.sqrt(tos));
	}
};

With the math words in place (add them to the interpreter like before), you can say things like:

2 2 + print

or even

3 3 * 4 4 * + sqrt print

Look, ma, a real interpreter!

Fiddling with the stack

This is all fine and dandy, but what if you need more complex manipulations? Here are some stack operations that might prove useful.

var StackWords = {
	// Duplicate the top of stack (TOS).
	"DUP": function (terp) {
		if (terp.stack.length < 1) throw "Not enough items on stack";
		var tos = terp.stack.pop();
		terp.stack.push(tos);
		terp.stack.push(tos);
	},
	// Throw away the TOS -- the opposite of DUP.
	"DROP": function (terp) {
		if (terp.stack.length < 1) throw "Not enough items on stack";
		terp.stack.pop();
	},
	// Exchange positions of TOS and second item on stack (2OS).
	"SWAP": function (terp) {
		if (terp.stack.length < 2) throw "Not enough items on stack";
		var tos = terp.stack.pop();
		var _2os = terp.stack.pop();
		terp.stack.push(tos);
		terp.stack.push(_2os);
	},
	// Copy 2OS on top of stack.
	"OVER": function (terp) {
		if (terp.stack.length < 2) throw "Not enough items on stack";
		var tos = terp.stack.pop();
		var _2os = terp.stack.pop();
		terp.stack.push(_2os);
		terp.stack.push(tos);
		terp.stack.push(_2os);
	},
	// Bring the 3rd item on stack to the top.
	"ROT": function (terp) {
		if (terp.stack.length < 3) throw "Not enough items on stack";
		var tos = terp.stack.pop();
		var _2os = terp.stack.pop();
		var _3os = terp.stack.pop();
		terp.stack.push(_2os);
		terp.stack.push(tos);
		terp.stack.push(_3os);
	},
};

Of these, DUP is probably the most useful (now you can print the top of the stack whitout losing it). The others can come in handy, too. By the way, if the names seem cryptic, feel free to come up with something better. But they are pretty much standard for stack operations.

The point of it all

Congratulations, you have a... calculator?

Of course, using Scratch that way would be pointless. But now you have a base for defining your own words. Imagine the users of your programs (who might be other computer programs, hint, hint) being able to say things like:

bottom left origin
brown pen
0 0 to 100 60 box
red pen
0 60 to 50 110 line
50 110 to 100 60 line
blue pen
10 40 to 20 50 box
80 40 to 90 50 box
50 0 to 60 40 box

Which, assuming a suitable graphics API, draws a stick figure house. A crooked one, to be sure, but that's not the language's fault.

If that's enough for you, feel free to stop here. But you'd be missing out. In the next installment, I'll start turning Scratch into a real programming language, with variables, strings, comments and so on.

Try Scratch
Stack content:

P.S. The informed reader might have noticed that Scratch is essentially Forth with more friendly words. This is deliberate. Using "." for print made sense in 1969 when Forth was invented. Nowadays it just scares people away.

Creative Commons License
Make Your Own Programming Language by Felix Pleşoianu is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.