parent
fccc5b95fd
commit
65c1d5ab10
@ -0,0 +1,352 @@ |
|||||||
|
--- |
||||||
|
title: >- |
||||||
|
Ginger |
||||||
|
description: >- |
||||||
|
Yes, it does exist. |
||||||
|
--- |
||||||
|
|
||||||
|
This post is about a programming language that's been bouncing around in my head |
||||||
|
for a _long_ time. I've tried to actually implement the language three or more |
||||||
|
times now, but everytime I get stuck or run out of steam. It doesn't help that |
||||||
|
everytime I try again the form of the language changes significantly. But all |
||||||
|
throughout the name of the language has always been "Ginger". It's a good name. |
||||||
|
|
||||||
|
In the last few years the form of the language has somewhat solidified in my |
||||||
|
head, so in lieu of actually working on it I'm going to talk about what it |
||||||
|
currently looks like. |
||||||
|
|
||||||
|
## Abstract Syntax Lists |
||||||
|
|
||||||
|
_In the beginning_ there was assembly. Well, really in the beginning there were |
||||||
|
punchcards, and probably something even more esoteric before that, but it was |
||||||
|
all effectively the same thing: a list of commands the computer would execute |
||||||
|
sequentially, with the ability to jump to odd places in the sequence depending |
||||||
|
on conditions at runtime. For the purpose of this post, we'll call this class of |
||||||
|
languages "abstract syntax list" (ASL) languages. |
||||||
|
|
||||||
|
Here's a hello world program in my favorite ASL language, brainfuck: |
||||||
|
|
||||||
|
``` |
||||||
|
++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.++ |
||||||
|
+.------.--------.>>+.>++. |
||||||
|
``` |
||||||
|
|
||||||
|
(If you've never seen brainfuck, it's deliberately unintelligible. But it _is_ |
||||||
|
an ASL, each character representing a single command, executed by the brainfuck |
||||||
|
runtime from left to right.) |
||||||
|
|
||||||
|
ASLs did the job at the time, but luckily we've mostly moved on past them. |
||||||
|
|
||||||
|
## Abstract Syntax Trees |
||||||
|
|
||||||
|
Eventually programmers upgraded to C-like languages. Rather than a sequence of |
||||||
|
commands, these languages were syntactically represented by an "abstract syntax |
||||||
|
tree" (AST). Rather than executing commands in essentially the same order they |
||||||
|
are written, an AST language compiler reads the syntax into a tree of syntax |
||||||
|
nodes. What it then does with the tree is language dependent. |
||||||
|
|
||||||
|
Here's a program which outputs all numbers from 0 to 9 to stdout, written in |
||||||
|
(slightly non-idiomatic) Go: |
||||||
|
|
||||||
|
```go |
||||||
|
i := 0 |
||||||
|
for { |
||||||
|
if i == 10 { |
||||||
|
break |
||||||
|
} |
||||||
|
fmt.Println(i) |
||||||
|
i++ |
||||||
|
} |
||||||
|
``` |
||||||
|
|
||||||
|
When the Go compiler sees this, it's going to first parse the syntax into an |
||||||
|
AST. The AST might look something like this: |
||||||
|
|
||||||
|
``` |
||||||
|
(root) |
||||||
|
|-(:=) |
||||||
|
| |-(i) |
||||||
|
| |-(0) |
||||||
|
| |
||||||
|
|-(for) |
||||||
|
|-(if) |
||||||
|
| |-(==) |
||||||
|
| | |-(i) |
||||||
|
| | |-(10) |
||||||
|
| | |
||||||
|
| |-(break) |
||||||
|
| |
||||||
|
|-(fmt.Println) |
||||||
|
| |-(i) |
||||||
|
| |
||||||
|
|-(++) |
||||||
|
|-(i) |
||||||
|
``` |
||||||
|
|
||||||
|
Each of the non-leaf nodes in the tree represents an operation, and the children |
||||||
|
of the node represent the arguments to that operation, if any. From here the |
||||||
|
compiler traverses the tree depth-first in order to turn each operation it finds |
||||||
|
into the appropriate machine code. |
||||||
|
|
||||||
|
There's a sub-class of AST languages called the LISP ("LISt Processor") |
||||||
|
languages. In a LISP language the AST is represented using lists of elements, |
||||||
|
where the first element in each list denotes the operation and the rest of the |
||||||
|
elements in the list (if any) represent the arguments. Traditionally each list |
||||||
|
is represented using parenthesis. For example `(+ 1 1)` represents adding 1 and |
||||||
|
1 together. |
||||||
|
|
||||||
|
As a more complex example, here's how to print numbers 0 through 9 to stdout |
||||||
|
using my favorite (and, honestly, only) LISP, Clojure: |
||||||
|
|
||||||
|
```clj |
||||||
|
(doseq |
||||||
|
[n (range 10)] |
||||||
|
(println n)) |
||||||
|
``` |
||||||
|
|
||||||
|
Much smaller, but the idea is there. In LISPs there is no differentiation |
||||||
|
between the syntax, the AST, and the language's data structures; they are all |
||||||
|
one and the same. For this reason LISPs generally have very powerful macro |
||||||
|
support, wherein one uses code written in the language to transform code written |
||||||
|
in that same language. With macros users can extend a language's functionality |
||||||
|
to support nearly anything they need to, but because macro generation happens |
||||||
|
_before_ compilation they can still reap the benefits of compiler optimizations. |
||||||
|
|
||||||
|
### AST Pitfalls |
||||||
|
|
||||||
|
The ASL (assembly) is essentially just a thin layer of human readability on top |
||||||
|
of raw CPU instructions. It does nothing in the way of representing code in the |
||||||
|
way that humans actually think about it (relationships of types, flow of data, |
||||||
|
encapsulation of behavior). The AST is a step towards expressing code in human |
||||||
|
terms, but it isn't quite there in my opinion. Let me show why by revisiting the |
||||||
|
Go example above: |
||||||
|
|
||||||
|
```go |
||||||
|
i := 0 |
||||||
|
for { |
||||||
|
if i > 9 { |
||||||
|
break |
||||||
|
} |
||||||
|
fmt.Println(i) |
||||||
|
i++ |
||||||
|
} |
||||||
|
``` |
||||||
|
|
||||||
|
When I understand this code I don't understand it in terms of its syntax. I |
||||||
|
understand it in terms of what it _does_. And what it does is this: |
||||||
|
|
||||||
|
* with a number starting at 0, start a loop. |
||||||
|
* if the number is greater than 9, stop the loop. |
||||||
|
* otherwise, print the number. |
||||||
|
* add one to the number. |
||||||
|
* go to start of loop. |
||||||
|
|
||||||
|
This behavior could be further abstracted into the original problem statement, |
||||||
|
"it prints numbers 0 through 9 to stdout", but that's too general, as there |
||||||
|
are different ways for that to be accomplished. The Clojure example first |
||||||
|
defines a list of numbers 0 through 9 and then iterates over that, rather than |
||||||
|
looping over a single number. These differences are important when understanding |
||||||
|
what code is doing. |
||||||
|
|
||||||
|
So what's the problem? My problem with ASTs is that the syntax I've written down |
||||||
|
does _not_ reflect the structure of the code or the flow of data which is in my |
||||||
|
head. In the AST representation if you want to follow the flow of data (a single |
||||||
|
number) you _have_ to understand the semantic meaning of `i` and `:=`; the AST |
||||||
|
structure itself does not convey how data is being moved or modified. |
||||||
|
Essentially, there's an extra implicit transformation that must be done to |
||||||
|
understand the code in human terms. |
||||||
|
|
||||||
|
## Ginger: An Abstract Syntax Graph Language |
||||||
|
|
||||||
|
In my view the next step is towards using graphs rather than trees for |
||||||
|
representing our code. A graph has the benefit of being able to reference |
||||||
|
"backwards" into itself, where a tree cannot, and so can represent the flow of |
||||||
|
data much more directly. |
||||||
|
|
||||||
|
I would like Ginger to be an ASG language where the language is the graph, |
||||||
|
similar to a LISP. But what does this look like exactly? Well, I have a good |
||||||
|
idea about what the graph _structure_ will be like and how it will function, but |
||||||
|
the syntax is something I haven't bothered much with yet. Representing graph |
||||||
|
structures in a text file is a problem to be tackled all on its own. For this |
||||||
|
post we'll use a made-up, overly verbose, and probably non-usable syntax, but |
||||||
|
hopefully it will convey the graph structure well enough. |
||||||
|
|
||||||
|
### Nodes, Edges, and Tuples |
||||||
|
|
||||||
|
All graphs have nodes, where each node contains a value. A single unique value |
||||||
|
can only have a single node in a graph. Nodes are connected by edges, where |
||||||
|
edges have a direction and can contain a value themselves. |
||||||
|
|
||||||
|
In the context of Ginger, a node represents a value as expected, and the value |
||||||
|
on an edge represents an operation to take on that value. For example: |
||||||
|
|
||||||
|
``` |
||||||
|
5 -incr-> n |
||||||
|
``` |
||||||
|
|
||||||
|
`5` and `n` are both nodes in the graph, with an edge going from `5` to `n` that |
||||||
|
has the value `incr`. When it comes time to interpret the graph we say that the |
||||||
|
value of `n` can be calculated by giving `5` as the input to the operation |
||||||
|
`incr` (increment). In other words, the value of `n` is `6`. |
||||||
|
|
||||||
|
What about operations which have more than one input value? For this Ginger |
||||||
|
introduces the tuple to its graph type. A tuple is like a node, except that it's |
||||||
|
anonymous, which allows more than one to exist within the same graph, as they do |
||||||
|
not share the same value. For the purposes of this blog post we'll represent |
||||||
|
tuples like this: |
||||||
|
|
||||||
|
``` |
||||||
|
1 -> } -add-> t |
||||||
|
2 -> } |
||||||
|
``` |
||||||
|
|
||||||
|
`t`'s value is the result of passing a tuple of two values, `1` and `2`, as |
||||||
|
inputs to the operation `add`. In other words, the value of `t` is `3`. |
||||||
|
|
||||||
|
For the syntax being described in this post we allow that a single contiguous |
||||||
|
graph can be represented as multiple related sections. This can be done because |
||||||
|
each node's value is unique, so when the same value is used in disparate |
||||||
|
sections we can merge the two sections on that value. For example, the following |
||||||
|
two graphs are exactly equivalent (note the parenthesis wrapping the graph which |
||||||
|
has been split): |
||||||
|
|
||||||
|
``` |
||||||
|
1 -> } -add-> t -incr-> tt |
||||||
|
2 -> } |
||||||
|
``` |
||||||
|
|
||||||
|
``` |
||||||
|
( |
||||||
|
1 -> } -add-> t |
||||||
|
2 -> } |
||||||
|
|
||||||
|
t -incr-> tt |
||||||
|
) |
||||||
|
``` |
||||||
|
|
||||||
|
(`tt` is `4` in both cases.) |
||||||
|
|
||||||
|
A tuple with only one input edge, a 1-tuple, is a no-op, semantically, but can |
||||||
|
be useful structurally to chain multiple operations together without defining |
||||||
|
new value names. In the above example the `t` value can be eliminated using a |
||||||
|
1-tuple. |
||||||
|
|
||||||
|
``` |
||||||
|
1 -> } -add-> } -incr-> tt |
||||||
|
2 -> } |
||||||
|
``` |
||||||
|
|
||||||
|
When an integer is used as an operation on a tuple value then the effect is to |
||||||
|
output the value in the tuple at that index. For example: |
||||||
|
|
||||||
|
``` |
||||||
|
1 -> } -0-> } -incr-> t |
||||||
|
2 -> } |
||||||
|
``` |
||||||
|
|
||||||
|
(`t` is `2`.) |
||||||
|
|
||||||
|
### Operations |
||||||
|
|
||||||
|
When a value sits on an edge it is used as an operation on the input of that |
||||||
|
edge. Some operations will no doubt be builtin, like `add`, but users should be |
||||||
|
able to define their own operations. This can be done using the `in` and `out` |
||||||
|
special values. When a graph is used as an operation it is scanned for both `in` |
||||||
|
and `out` values. `in` is set to the input value of the operation, and the value |
||||||
|
of `out` is used as the output of the operation. |
||||||
|
|
||||||
|
Here we will define the `incr` operation and then use it. Note that we set the |
||||||
|
`incr` value to be an entire sub-graph which represents the operation's body. |
||||||
|
|
||||||
|
``` |
||||||
|
( in -> } -add-> out |
||||||
|
1 -> } ) -> incr |
||||||
|
|
||||||
|
5 -incr-> n |
||||||
|
``` |
||||||
|
|
||||||
|
(`n` is `6`.) |
||||||
|
|
||||||
|
The output of an operation may itself be a tuple. Here's an implementation and |
||||||
|
usage of `double-incr`, which increments two values at once. |
||||||
|
|
||||||
|
``` |
||||||
|
( in -0-> } -incr-> } -> out |
||||||
|
} |
||||||
|
in -1-> } -incr-> } ) -> double-incr |
||||||
|
|
||||||
|
1 -> } -double-incr-> t -add-> tt |
||||||
|
2 -> } |
||||||
|
``` |
||||||
|
|
||||||
|
(`t` is a 2-tuple with values `2`, and `3`, `tt` is `5.) |
||||||
|
|
||||||
|
### Conditionals |
||||||
|
|
||||||
|
The conditional is a bit weird, and I'm not totally settled on it yet. For now |
||||||
|
we'll use this. The `if` operation expects as an input a 2-tuple whose first |
||||||
|
value is a boolean and whose second value will be passed along. The `if` |
||||||
|
operation is special in that it has _two_ output edges. The first will be taken |
||||||
|
if the boolean is true, the second if the boolean is false. The second value in |
||||||
|
the input tuple, the one to be passed along, is used as the input to whichever |
||||||
|
branch is taken. |
||||||
|
|
||||||
|
Here is an implementation and usage of `max`, which takes two numbers and |
||||||
|
outputs the greater of the two. Note that the `if` operation has two output |
||||||
|
edges, but our syntax doesn't represent that very cleanly. |
||||||
|
|
||||||
|
``` |
||||||
|
( in -gt-> } -if-> } -0-> out |
||||||
|
in -> } -> } -1-> out ) -> max |
||||||
|
|
||||||
|
1 -> } -max-> t |
||||||
|
2 -> } |
||||||
|
``` |
||||||
|
|
||||||
|
(`t` is `2`.) |
||||||
|
|
||||||
|
It would be simple enough to create a `switch` macro on top of `if`, to allow |
||||||
|
for multiple conditionals to be tested at once. |
||||||
|
|
||||||
|
### Loops |
||||||
|
|
||||||
|
Loops are tricky, and I have two thoughts about how they might be accomplished. |
||||||
|
One is to literally draw an edge from the right end of the graph back to the |
||||||
|
left, at the point where the loop should occur, as that's conceptually what's |
||||||
|
happening. But representing that in a text file is difficult. For now I'll |
||||||
|
introduce the special `recur` value, and leave this whole section as TBD. |
||||||
|
|
||||||
|
`recur` is cousin of `in` and `out`, in that it's a special value and not an |
||||||
|
operation. It takes whatever value it's set to and calls the current operation |
||||||
|
with that as input. As an example, here is our now classic 0 through 9 printer |
||||||
|
(assume `println` outputs whatever it was input): |
||||||
|
|
||||||
|
``` |
||||||
|
// incr-1 is an operation which takes a 2-tuple and returns the same 2-tuple |
||||||
|
// with the first element incremented. |
||||||
|
( in -0-> } -incr-> } -> out |
||||||
|
in -1-> } ) -> incr-1 |
||||||
|
|
||||||
|
( in -eq-> } -if-> out |
||||||
|
in -> } -> } -0-> } -println-> } -incr-1-> } -> recur ) -> print-range |
||||||
|
|
||||||
|
0 -> } -print-range-> } |
||||||
|
10 -> } |
||||||
|
``` |
||||||
|
|
||||||
|
## Next Steps |
||||||
|
|
||||||
|
This post is long enough, and I think gives at least a basic idea of what I'm |
||||||
|
going for. The syntax presented here is _extremely_ rudimentary, and is almost |
||||||
|
definitely not what any final version of the syntax would look like. But the |
||||||
|
general idea behind the structure is sound, I think. |
||||||
|
|
||||||
|
I have a lot of further ideas for Ginger I haven't presented here. Hopefully as |
||||||
|
time goes on and I work on the language more some of those ideas can start |
||||||
|
taking a more concrete shape and I can write about them. |
||||||
|
|
||||||
|
The next thing I need to do for Ginger is to implement (again) the graph type |
||||||
|
for it, since the last one I implemented didn't include tuples. Maybe I can |
||||||
|
extend it instead of re-writing it. After that it will be time to really buckle |
||||||
|
down and figure out a syntax. Once a syntax is established then it's time to |
||||||
|
start on the compiler! |
Loading…
Reference in new issue