mediocre-blog/_posts/2021-01-09-ginger.md

353 lines
12 KiB
Markdown
Raw Permalink Normal View History

2021-01-09 18:21:29 +00:00
---
title: >-
Ginger
description: >-
Yes, it does exist.
---
This post is about a programming language that's been bouncing around in my head
for a _long_ time. I've tried to actually implement the language three or more
times now, but everytime I get stuck or run out of steam. It doesn't help that
everytime I try again the form of the language changes significantly. But all
throughout the name of the language has always been "Ginger". It's a good name.
In the last few years the form of the language has somewhat solidified in my
head, so in lieu of actually working on it I'm going to talk about what it
currently looks like.
## Abstract Syntax Lists
_In the beginning_ there was assembly. Well, really in the beginning there were
punchcards, and probably something even more esoteric before that, but it was
all effectively the same thing: a list of commands the computer would execute
sequentially, with the ability to jump to odd places in the sequence depending
on conditions at runtime. For the purpose of this post, we'll call this class of
languages "abstract syntax list" (ASL) languages.
Here's a hello world program in my favorite ASL language, brainfuck:
```
++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.++
+.------.--------.>>+.>++.
```
(If you've never seen brainfuck, it's deliberately unintelligible. But it _is_
an ASL, each character representing a single command, executed by the brainfuck
runtime from left to right.)
ASLs did the job at the time, but luckily we've mostly moved on past them.
## Abstract Syntax Trees
Eventually programmers upgraded to C-like languages. Rather than a sequence of
commands, these languages were syntactically represented by an "abstract syntax
tree" (AST). Rather than executing commands in essentially the same order they
are written, an AST language compiler reads the syntax into a tree of syntax
nodes. What it then does with the tree is language dependent.
Here's a program which outputs all numbers from 0 to 9 to stdout, written in
(slightly non-idiomatic) Go:
```go
i := 0
for {
if i == 10 {
break
}
fmt.Println(i)
i++
}
```
When the Go compiler sees this, it's going to first parse the syntax into an
AST. The AST might look something like this:
```
(root)
|-(:=)
| |-(i)
| |-(0)
|
|-(for)
|-(if)
| |-(==)
| | |-(i)
| | |-(10)
| |
| |-(break)
|
|-(fmt.Println)
| |-(i)
|
|-(++)
|-(i)
```
Each of the non-leaf nodes in the tree represents an operation, and the children
of the node represent the arguments to that operation, if any. From here the
compiler traverses the tree depth-first in order to turn each operation it finds
into the appropriate machine code.
There's a sub-class of AST languages called the LISP ("LISt Processor")
languages. In a LISP language the AST is represented using lists of elements,
where the first element in each list denotes the operation and the rest of the
elements in the list (if any) represent the arguments. Traditionally each list
is represented using parenthesis. For example `(+ 1 1)` represents adding 1 and
1 together.
As a more complex example, here's how to print numbers 0 through 9 to stdout
using my favorite (and, honestly, only) LISP, Clojure:
```clj
(doseq
[n (range 10)]
(println n))
```
Much smaller, but the idea is there. In LISPs there is no differentiation
between the syntax, the AST, and the language's data structures; they are all
one and the same. For this reason LISPs generally have very powerful macro
support, wherein one uses code written in the language to transform code written
in that same language. With macros users can extend a language's functionality
to support nearly anything they need to, but because macro generation happens
_before_ compilation they can still reap the benefits of compiler optimizations.
### AST Pitfalls
The ASL (assembly) is essentially just a thin layer of human readability on top
of raw CPU instructions. It does nothing in the way of representing code in the
way that humans actually think about it (relationships of types, flow of data,
encapsulation of behavior). The AST is a step towards expressing code in human
terms, but it isn't quite there in my opinion. Let me show why by revisiting the
Go example above:
```go
i := 0
for {
if i > 9 {
break
}
fmt.Println(i)
i++
}
```
When I understand this code I don't understand it in terms of its syntax. I
understand it in terms of what it _does_. And what it does is this:
* with a number starting at 0, start a loop.
* if the number is greater than 9, stop the loop.
* otherwise, print the number.
* add one to the number.
* go to start of loop.
This behavior could be further abstracted into the original problem statement,
"it prints numbers 0 through 9 to stdout", but that's too general, as there
are different ways for that to be accomplished. The Clojure example first
defines a list of numbers 0 through 9 and then iterates over that, rather than
looping over a single number. These differences are important when understanding
what code is doing.
So what's the problem? My problem with ASTs is that the syntax I've written down
does _not_ reflect the structure of the code or the flow of data which is in my
head. In the AST representation if you want to follow the flow of data (a single
number) you _have_ to understand the semantic meaning of `i` and `:=`; the AST
structure itself does not convey how data is being moved or modified.
Essentially, there's an extra implicit transformation that must be done to
understand the code in human terms.
## Ginger: An Abstract Syntax Graph Language
In my view the next step is towards using graphs rather than trees for
representing our code. A graph has the benefit of being able to reference
"backwards" into itself, where a tree cannot, and so can represent the flow of
data much more directly.
I would like Ginger to be an ASG language where the language is the graph,
similar to a LISP. But what does this look like exactly? Well, I have a good
idea about what the graph _structure_ will be like and how it will function, but
the syntax is something I haven't bothered much with yet. Representing graph
structures in a text file is a problem to be tackled all on its own. For this
post we'll use a made-up, overly verbose, and probably non-usable syntax, but
hopefully it will convey the graph structure well enough.
### Nodes, Edges, and Tuples
All graphs have nodes, where each node contains a value. A single unique value
can only have a single node in a graph. Nodes are connected by edges, where
edges have a direction and can contain a value themselves.
In the context of Ginger, a node represents a value as expected, and the value
on an edge represents an operation to take on that value. For example:
```
5 -incr-> n
```
`5` and `n` are both nodes in the graph, with an edge going from `5` to `n` that
has the value `incr`. When it comes time to interpret the graph we say that the
value of `n` can be calculated by giving `5` as the input to the operation
`incr` (increment). In other words, the value of `n` is `6`.
What about operations which have more than one input value? For this Ginger
introduces the tuple to its graph type. A tuple is like a node, except that it's
anonymous, which allows more than one to exist within the same graph, as they do
not share the same value. For the purposes of this blog post we'll represent
tuples like this:
```
1 -> } -add-> t
2 -> }
```
`t`'s value is the result of passing a tuple of two values, `1` and `2`, as
inputs to the operation `add`. In other words, the value of `t` is `3`.
For the syntax being described in this post we allow that a single contiguous
graph can be represented as multiple related sections. This can be done because
each node's value is unique, so when the same value is used in disparate
sections we can merge the two sections on that value. For example, the following
two graphs are exactly equivalent (note the parenthesis wrapping the graph which
has been split):
```
1 -> } -add-> t -incr-> tt
2 -> }
```
```
(
1 -> } -add-> t
2 -> }
t -incr-> tt
)
```
(`tt` is `4` in both cases.)
A tuple with only one input edge, a 1-tuple, is a no-op, semantically, but can
be useful structurally to chain multiple operations together without defining
new value names. In the above example the `t` value can be eliminated using a
1-tuple.
```
1 -> } -add-> } -incr-> tt
2 -> }
```
When an integer is used as an operation on a tuple value then the effect is to
output the value in the tuple at that index. For example:
```
1 -> } -0-> } -incr-> t
2 -> }
```
(`t` is `2`.)
### Operations
When a value sits on an edge it is used as an operation on the input of that
edge. Some operations will no doubt be builtin, like `add`, but users should be
able to define their own operations. This can be done using the `in` and `out`
special values. When a graph is used as an operation it is scanned for both `in`
and `out` values. `in` is set to the input value of the operation, and the value
of `out` is used as the output of the operation.
Here we will define the `incr` operation and then use it. Note that we set the
`incr` value to be an entire sub-graph which represents the operation's body.
```
( in -> } -add-> out
1 -> } ) -> incr
5 -incr-> n
```
(`n` is `6`.)
The output of an operation may itself be a tuple. Here's an implementation and
usage of `double-incr`, which increments two values at once.
```
( in -0-> } -incr-> } -> out
}
in -1-> } -incr-> } ) -> double-incr
1 -> } -double-incr-> t -add-> tt
2 -> }
```
(`t` is a 2-tuple with values `2`, and `3`, `tt` is `5.)
### Conditionals
The conditional is a bit weird, and I'm not totally settled on it yet. For now
we'll use this. The `if` operation expects as an input a 2-tuple whose first
value is a boolean and whose second value will be passed along. The `if`
operation is special in that it has _two_ output edges. The first will be taken
if the boolean is true, the second if the boolean is false. The second value in
the input tuple, the one to be passed along, is used as the input to whichever
branch is taken.
Here is an implementation and usage of `max`, which takes two numbers and
outputs the greater of the two. Note that the `if` operation has two output
edges, but our syntax doesn't represent that very cleanly.
```
( in -gt-> } -if-> } -0-> out
in -> } -> } -1-> out ) -> max
1 -> } -max-> t
2 -> }
```
(`t` is `2`.)
It would be simple enough to create a `switch` macro on top of `if`, to allow
for multiple conditionals to be tested at once.
### Loops
Loops are tricky, and I have two thoughts about how they might be accomplished.
One is to literally draw an edge from the right end of the graph back to the
left, at the point where the loop should occur, as that's conceptually what's
happening. But representing that in a text file is difficult. For now I'll
introduce the special `recur` value, and leave this whole section as TBD.
`recur` is cousin of `in` and `out`, in that it's a special value and not an
operation. It takes whatever value it's set to and calls the current operation
with that as input. As an example, here is our now classic 0 through 9 printer
(assume `println` outputs whatever it was input):
```
// incr-1 is an operation which takes a 2-tuple and returns the same 2-tuple
// with the first element incremented.
( in -0-> } -incr-> } -> out
in -1-> } ) -> incr-1
( in -eq-> } -if-> out
in -> } -> } -0-> } -println-> } -incr-1-> } -> recur ) -> print-range
0 -> } -print-range-> }
10 -> }
```
## Next Steps
This post is long enough, and I think gives at least a basic idea of what I'm
going for. The syntax presented here is _extremely_ rudimentary, and is almost
definitely not what any final version of the syntax would look like. But the
general idea behind the structure is sound, I think.
I have a lot of further ideas for Ginger I haven't presented here. Hopefully as
time goes on and I work on the language more some of those ideas can start
taking a more concrete shape and I can write about them.
The next thing I need to do for Ginger is to implement (again) the graph type
for it, since the last one I implemented didn't include tuples. Maybe I can
extend it instead of re-writing it. After that it will be time to really buckle
down and figure out a syntax. Once a syntax is established then it's time to
start on the compiler!