Brian Picciano 65c1d5ab10 Ginger!
2021-01-09 14:06:51 -07:00

title: Ginger
description: Yes, it does exist.

This post is about a programming language that's been bouncing around in my head for a long time. I've tried to actually implement the language three or more times now, but every time I get stuck or run out of steam. It doesn't help that every time I try again the form of the language changes significantly. But through it all the name of the language has always been "Ginger". It's a good name.

In the last few years the form of the language has somewhat solidified in my head, so in lieu of actually working on it I'm going to talk about what it currently looks like.

Abstract Syntax Lists

In the beginning there was assembly. Well, really in the beginning there were punchcards, and probably something even more esoteric before that, but it was all effectively the same thing: a list of commands the computer would execute sequentially, with the ability to jump to odd places in the sequence depending on conditions at runtime. For the purpose of this post, we'll call this class of languages "abstract syntax list" (ASL) languages.

Here's a hello world program in my favorite ASL language, brainfuck:

++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.++
+.------.--------.>>+.>++.

(If you've never seen brainfuck, it's deliberately unintelligible. But it is an ASL, each character representing a single command, executed by the brainfuck runtime from left to right.)
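
To make the "list of commands with jumps" idea concrete, here is a minimal sketch of a brainfuck interpreter in Go. The names (runBrainfuck, helloWorld) are made up for illustration, the 30000-cell tape is just the conventional size, and the input command is ignored.

```go
package main

import (
	"fmt"
	"strings"
)

// runBrainfuck executes prog one command at a time, left to right, and
// returns everything the program printed. The "," (input) command is
// ignored in this sketch.
func runBrainfuck(prog string) string {
	// Match up brackets ahead of time so the main loop can jump directly.
	jump := make(map[int]int)
	var open []int
	for pc := 0; pc < len(prog); pc++ {
		switch prog[pc] {
		case '[':
			open = append(open, pc)
		case ']':
			if len(open) == 0 {
				panic("unmatched ]")
			}
			start := open[len(open)-1]
			open = open[:len(open)-1]
			jump[start], jump[pc] = pc, start
		}
	}
	if len(open) != 0 {
		panic("unmatched [")
	}

	var out strings.Builder
	tape := make([]byte, 30000) // the conventional tape size
	ptr := 0
	for pc := 0; pc < len(prog); pc++ {
		switch prog[pc] {
		case '>':
			ptr++
		case '<':
			ptr--
		case '+':
			tape[ptr]++
		case '-':
			tape[ptr]--
		case '.':
			out.WriteByte(tape[ptr])
		case '[':
			if tape[ptr] == 0 {
				pc = jump[pc] // condition failed: jump past the loop body
			}
		case ']':
			if tape[ptr] != 0 {
				pc = jump[pc] // jump back to the start of the loop
			}
		}
	}
	return out.String()
}

// helloWorld is the program from above, joined back onto one line.
const helloWorld = "++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]" +
	">>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++."

func main() {
	fmt.Print(runBrainfuck(helloWorld))
}
```

Running it prints "Hello World!" followed by a newline. The only structure the interpreter ever sees is a position in the list and the occasional jump to another position.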

ASLs did the job at the time, but luckily we've mostly moved on past them.

Abstract Syntax Trees

Eventually programmers upgraded to C-like languages. Rather than a sequence of commands, these languages were syntactically represented by an "abstract syntax tree" (AST). Instead of executing commands in essentially the same order they are written, an AST language compiler reads the syntax into a tree of syntax nodes. What it then does with the tree is language dependent.

Here's a program which outputs all numbers from 0 to 9 to stdout, written in (slightly non-idiomatic) Go:

i := 0
for {
    if i == 10 {
        break
    }
    fmt.Println(i)
    i++
}

When the Go compiler sees this, it's going to first parse the syntax into an AST. The AST might look something like this:

(root)
   |-(:=)
   |   |-(i)
   |   |-(0)
   |
   |-(for)
       |-(if)
       |  |-(==)
       |  |  |-(i)
       |  |  |-(10)
       |  |
       |  |-(break)
       |
       |-(fmt.Println)
       |       |-(i)
       |
       |-(++)
           |-(i)

Each of the non-leaf nodes in the tree represents an operation, and the children of the node represent the arguments to that operation, if any. From here the compiler traverses the tree depth-first in order to turn each operation it finds into the appropriate machine code.
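
As a rough sketch of that traversal, here is the tree above built by hand in Go and walked depth-first. The Node type and the helper names are hypothetical, and the real Go compiler's AST is far richer than this; the sketch only lists the nodes in visit order rather than emitting machine code.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical AST node: an operation (or a leaf value) plus the
// arguments to that operation.
type Node struct {
	Label    string
	Children []*Node
}

func node(label string, children ...*Node) *Node {
	return &Node{Label: label, Children: children}
}

// walk visits n and then each of its children, depth-first.
func walk(n *Node, depth int, visit func(n *Node, depth int)) {
	visit(n, depth)
	for _, child := range n.Children {
		walk(child, depth+1, visit)
	}
}

// loopAST builds the tree drawn above for the 0-through-9 loop.
func loopAST() *Node {
	return node("root",
		node(":=", node("i"), node("0")),
		node("for",
			node("if", node("==", node("i"), node("10")), node("break")),
			node("fmt.Println", node("i")),
			node("++", node("i")),
		),
	)
}

// labels returns every node label in the order walk visits them.
func labels(root *Node) []string {
	var out []string
	walk(root, 0, func(n *Node, _ int) { out = append(out, n.Label) })
	return out
}

func main() {
	walk(loopAST(), 0, func(n *Node, depth int) {
		fmt.Println(strings.Repeat("  ", depth) + "(" + n.Label + ")")
	})
}
```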

There's a sub-class of AST languages called the LISP ("LISt Processor") languages. In a LISP language the AST is represented using lists of elements, where the first element in each list denotes the operation and the rest of the elements in the list (if any) represent the arguments. Traditionally each list is represented using parentheses. For example (+ 1 1) represents adding 1 and 1 together.
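
The "first element is the operation, the rest are arguments" rule is small enough to sketch in Go. This is only an illustration, not how any real LISP is implemented: eval and list are made-up names, and only + and * on integers are supported.

```go
package main

import "fmt"

// list is a hypothetical stand-in for a LISP list.
type list = []interface{}

// eval evaluates an expression that is either an int or a list whose first
// element names the operation and whose remaining elements are arguments.
func eval(expr interface{}) int {
	switch e := expr.(type) {
	case int:
		return e
	case []interface{}:
		if len(e) == 0 {
			panic("empty list has no operation")
		}
		op, ok := e[0].(string)
		if !ok {
			panic(fmt.Sprintf("operation must be a symbol, got %v", e[0]))
		}
		switch op {
		case "+":
			sum := 0
			for _, arg := range e[1:] {
				sum += eval(arg) // arguments may themselves be lists
			}
			return sum
		case "*":
			product := 1
			for _, arg := range e[1:] {
				product *= eval(arg)
			}
			return product
		}
		panic("unknown operation: " + op)
	}
	panic(fmt.Sprintf("cannot evaluate %v", expr))
}

func main() {
	fmt.Println(eval(list{"+", 1, 1}))               // (+ 1 1)
	fmt.Println(eval(list{"+", 1, list{"*", 2, 3}})) // (+ 1 (* 2 3))
}
```

The two calls in main print 2 and 7.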

As a more complex example, here's how to print numbers 0 through 9 to stdout using my favorite (and, honestly, only) LISP, Clojure:

(doseq
    [n (range 10)]
    (println n))

Much smaller, but the idea is there. In LISPs there is no differentiation between the syntax, the AST, and the language's data structures; they are all one and the same. For this reason LISPs generally have very powerful macro support, wherein one uses code written in the language to transform code written in that same language. With macros users can extend a language's functionality to support nearly anything they need to, but because macro expansion happens before compilation they can still reap the benefits of compiler optimizations.

AST Pitfalls

The ASL (assembly) is essentially just a thin layer of human readability on top of raw CPU instructions. It does nothing in the way of representing code in the way that humans actually think about it (relationships of types, flow of data, encapsulation of behavior). The AST is a step towards expressing code in human terms, but it isn't quite there in my opinion. Let me show why by revisiting the Go example above:

i := 0
for {
    if i > 9 {
        break
    }
    fmt.Println(i)
    i++
}

When I understand this code I don't understand it in terms of its syntax. I understand it in terms of what it does. And what it does is this:

  • with a number starting at 0, start a loop.
  • if the number is greater than 9, stop the loop.
  • otherwise, print the number.
  • add one to the number.
  • go to start of loop.

This behavior could be further abstracted into the original problem statement, "it prints numbers 0 through 9 to stdout", but that's too general, as there are different ways for that to be accomplished. The Clojure example first defines a list of numbers 0 through 9 and then iterates over that, rather than looping over a single number. These differences are important when understanding what code is doing.

So what's the problem? My problem with ASTs is that the syntax I've written down does not reflect the structure of the code or the flow of data which is in my head. In the AST representation if you want to follow the flow of data (a single number) you have to understand the semantic meaning of i and :=; the AST structure itself does not convey how data is being moved or modified. Essentially, there's an extra implicit transformation that must be done to understand the code in human terms.

Ginger: An Abstract Syntax Graph Language

In my view the next step is towards using graphs rather than trees for representing our code. A graph has the benefit of being able to reference "backwards" into itself, where a tree cannot, and so can represent the flow of data much more directly.

I would like Ginger to be an ASG language where the language is the graph, similar to a LISP. But what does this look like exactly? Well, I have a good idea about what the graph structure will be like and how it will function, but the syntax is something I haven't bothered much with yet. Representing graph structures in a text file is a problem to be tackled all on its own. For this post we'll use a made-up, overly verbose, and probably non-usable syntax, but hopefully it will convey the graph structure well enough.

Nodes, Edges, and Tuples

All graphs have nodes, where each node contains a value. A single unique value can only have a single node in a graph. Nodes are connected by edges, where edges have a direction and can contain a value themselves.

In the context of Ginger, a node represents a value as expected, and the value on an edge represents an operation to take on that value. For example:

5 -incr-> n

5 and n are both nodes in the graph, with an edge going from 5 to n that has the value incr. When it comes time to interpret the graph we say that the value of n can be calculated by giving 5 as the input to the operation incr (increment). In other words, the value of n is 6.
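
One way this could be modeled, sketched in Go. None of these names (Graph, Edge, Op, Value) come from an actual Ginger implementation; the sketch assumes nodes are identified by their text, that a node whose text parses as an integer is that integer, and that every other node has exactly one incoming edge.

```go
package main

import (
	"fmt"
	"strconv"
)

// Op is the value sitting on an edge: an operation applied to the edge's input.
type Op func(in int) int

// Edge is the incoming edge of a node: where it comes from and which
// operation it carries.
type Edge struct {
	From string
	Op   Op
}

// Graph maps each computed node to the edge pointing into it. Using the
// node's value as the map key means a value can only appear once.
type Graph map[string]Edge

// Value interprets the graph: a literal node is its own value, any other
// node is the edge's operation applied to the value of the edge's source.
func (g Graph) Value(node string) int {
	if n, err := strconv.Atoi(node); err == nil {
		return n
	}
	edge, ok := g[node]
	if !ok {
		panic("node has no incoming edge: " + node)
	}
	return edge.Op(g.Value(edge.From))
}

func incr(in int) int { return in + 1 }

func main() {
	// 5 -incr-> n
	g := Graph{"n": {From: "5", Op: incr}}
	fmt.Println(g.Value("n"))
}
```

This prints 6, matching the interpretation described above.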

What about operations which have more than one input value? For this Ginger introduces the tuple to its graph type. A tuple is like a node, except that it's anonymous, which allows more than one to exist within the same graph, as they do not share the same value. For the purposes of this blog post we'll represent tuples like this:

1 -> } -add-> t
2 -> }

t's value is the result of passing a tuple of two values, 1 and 2, as inputs to the operation add. In other words, the value of t is 3.

For the syntax being described in this post we allow that a single contiguous graph can be represented as multiple related sections. This can be done because each node's value is unique, so when the same value is used in disparate sections we can merge the two sections on that value. For example, the following two graphs are exactly equivalent (note the parentheses wrapping the graph which has been split):

1 -> } -add-> t -incr-> tt
2 -> }
(
    1 -> } -add-> t
    2 -> }

    t -incr-> tt
)

(tt is 4 in both cases.)

A tuple with only one input edge, a 1-tuple, is a no-op, semantically, but can be useful structurally to chain multiple operations together without defining new value names. In the above example the t value can be eliminated using a 1-tuple.

1 -> } -add-> } -incr-> tt
2 -> }

When an integer is used as an operation on a tuple value then the effect is to output the value in the tuple at that index. For example:

1 -> } -0-> } -incr-> t
2 -> }

(t is 2.)
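
The tuple examples in this section can be mimicked in Go by letting a value be either an int or a tuple. This is a sketch under my own assumptions: Value, Tuple, Op, chain, and index are hypothetical names, and following a chain of edges is modeled as applying operations left to right.

```go
package main

import "fmt"

// Value is either an int or a Tuple in this sketch.
type Value interface{}

// Tuple is an anonymous grouping of values used as a single input.
type Tuple []Value

// Op is an operation sitting on an edge.
type Op func(in Value) Value

// add expects a 2-tuple of ints and outputs their sum.
func add(in Value) Value {
	t := in.(Tuple)
	return t[0].(int) + t[1].(int)
}

// incr expects a single int and outputs it plus one.
func incr(in Value) Value { return in.(int) + 1 }

// index models an integer used as an operation: it outputs the value at
// that position in the input tuple.
func index(i int) Op {
	return func(in Value) Value { return in.(Tuple)[i] }
}

// chain follows a series of edges, feeding each operation's output to the
// next. A 1-tuple between two operations is a no-op, so it needs no code.
func chain(in Value, ops ...Op) Value {
	for _, op := range ops {
		in = op(in)
	}
	return in
}

func main() {
	fmt.Println(chain(Tuple{1, 2}, add))            // t
	fmt.Println(chain(Tuple{1, 2}, add, incr))      // tt
	fmt.Println(chain(Tuple{1, 2}, index(0), incr)) // index, then incr
}
```

The three lines print 3, 4, and 2, the same values given for t and tt in the examples above.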

Operations

When a value sits on an edge it is used as an operation on the input of that edge. Some operations will no doubt be builtin, like add, but users should be able to define their own operations. This can be done using the in and out special values. When a graph is used as an operation it is scanned for both in and out values. in is set to the input value of the operation, and the value of out is used as the output of the operation.

Here we will define the incr operation and then use it. Note that we set the incr value to be an entire sub-graph which represents the operation's body.

( in -> } -add-> out
   1 -> }            ) -> incr

5 -incr-> n

(n is 6.)
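
Here is a sketch in Go of how a sub-graph might be turned into an operation via in and out. Everything here is an assumption made for illustration (the Node, Graph, Eval, and AsOp names, and storing each value's inputs alongside the operation that produces it); it is not the graph type Ginger will actually use.

```go
package main

import (
	"fmt"
	"strconv"
)

type Value interface{}
type Tuple []Value
type Op func(in Value) Value

// Node describes how one named value is produced: gather the inputs (a
// single value, or a tuple if there are several) and apply Op to them.
type Node struct {
	Inputs []string
	Op     Op // nil means the input is passed through unchanged
}

type Graph map[string]Node

// Eval computes the value called name. The special value "in" resolves to
// the input the graph was called with; integer names are literals.
func (g Graph) Eval(name string, in Value) Value {
	if name == "in" {
		return in
	}
	if n, err := strconv.Atoi(name); err == nil {
		return n
	}
	node, ok := g[name]
	if !ok {
		panic("undefined value: " + name)
	}
	var input Value
	if len(node.Inputs) == 1 {
		input = g.Eval(node.Inputs[0], in)
	} else {
		t := make(Tuple, len(node.Inputs))
		for i, src := range node.Inputs {
			t[i] = g.Eval(src, in)
		}
		input = t
	}
	if node.Op == nil {
		return input
	}
	return node.Op(input)
}

// AsOp uses the graph as an operation: set in, then read out.
func (g Graph) AsOp() Op {
	return func(in Value) Value { return g.Eval("out", in) }
}

// add is assumed to be a builtin.
func add(in Value) Value {
	t := in.(Tuple)
	return t[0].(int) + t[1].(int)
}

// incrOp builds the incr operation from the sub-graph defined above.
func incrOp() Op {
	body := Graph{"out": {Inputs: []string{"in", "1"}, Op: add}}
	return body.AsOp()
}

func main() {
	// 5 -incr-> n
	program := Graph{"n": {Inputs: []string{"5"}, Op: incrOp()}}
	fmt.Println(program.Eval("n", nil))
}
```

This prints 6. The point of the sketch is that incr is not special: it is an ordinary graph that happens to mention in and out.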

The output of an operation may itself be a tuple. Here's an implementation and usage of double-incr, which increments two values at once.

( in -0-> } -incr-> } -> out
                    }
  in -1-> } -incr-> }        ) -> double-incr

1 -> } -double-incr-> t -add-> tt
2 -> }

(t is a 2-tuple with values 2 and 3; tt is 5.)

Conditionals

The conditional is a bit weird, and I'm not totally settled on it yet. For now we'll use this. The if operation expects as an input a 2-tuple whose first value is a boolean and whose second value will be passed along. The if operation is special in that it has two output edges. The first will be taken if the boolean is true, the second if the boolean is false. The second value in the input tuple, the one to be passed along, is used as the input to whichever branch is taken.

Here is an implementation and usage of max, which takes two numbers and outputs the greater of the two. Note that the if operation has two output edges, but our syntax doesn't represent that very cleanly.

( in -gt-> } -if-> } -0-> out
     in -> }    -> } -1-> out ) -> max

1 -> } -max-> t
2 -> }

(t is 2.)
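
The two-output-edge behavior of if is easier to see in code than in the graph syntax. Below is a sketch in Go where the two output edges are modeled as two operations handed to a hypothetical ifOp constructor; gt is assumed to be a builtin, and maxOp mirrors the max graph above.

```go
package main

import "fmt"

type Value interface{}
type Tuple []Value
type Op func(in Value) Value

// index models an integer used as an operation on a tuple.
func index(i int) Op {
	return func(in Value) Value { return in.(Tuple)[i] }
}

// gt expects a 2-tuple of ints and outputs whether the first is greater.
func gt(in Value) Value {
	t := in.(Tuple)
	return t[0].(int) > t[1].(int)
}

// ifOp expects a 2-tuple of (boolean, value). The value is passed to
// onTrue when the boolean is true and to onFalse otherwise; these two
// operations stand in for the two output edges.
func ifOp(onTrue, onFalse Op) Op {
	return func(in Value) Value {
		t := in.(Tuple)
		if t[0].(bool) {
			return onTrue(t[1])
		}
		return onFalse(t[1])
	}
}

// maxOp mirrors the max graph: the input tuple is both compared with gt
// and passed along, then the branch taken picks index 0 or index 1.
func maxOp(in Value) Value {
	branch := ifOp(index(0), index(1))
	return branch(Tuple{gt(in), in})
}

func main() {
	fmt.Println(maxOp(Tuple{1, 2}))
}
```

This prints 2, as in the example above.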

It would be simple enough to create a switch macro on top of if, to allow for multiple conditionals to be tested at once.

Loops

Loops are tricky, and I have two thoughts about how they might be accomplished. One is to literally draw an edge from the right end of the graph back to the left, at the point where the loop should occur, as that's conceptually what's happening. But representing that in a text file is difficult. For now I'll introduce the special recur value, and leave this whole section as TBD.

recur is a cousin of in and out, in that it's a special value and not an operation. It takes whatever value it's set to and calls the current operation with that as input. As an example, here is our now classic 0 through 9 printer (assume println outputs whatever it was input):

// incr-1 is an operation which takes a 2-tuple and returns the same 2-tuple
// with the first element incremented.
( in -0-> } -incr-> } -> out
            in -1-> }        ) -> incr-1

( in -eq-> } -if-> out
     in -> }    -> } -0-> } -println-> } -incr-1-> } -> recur ) -> print-range

0  -> } -print-range-> }
10 -> }
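
Since recur calls the current operation again with a new input, the closest analogue in Go is plain recursion. The sketch below is one reading of the print-range graph, with assumptions of my own: the 2-tuple is a fixed-size Pair, println is handed the first element, and incr-1 is applied to the whole 2-tuple before recurring. The emit parameter is a hypothetical stand-in for println.

```go
package main

import "fmt"

// Pair is the 2-tuple of (current number, end of range).
type Pair [2]int

// incr1 returns the same 2-tuple with the first element incremented.
func incr1(in Pair) Pair { return Pair{in[0] + 1, in[1]} }

// printRange mirrors the print-range graph. When both elements are equal
// the input becomes the output; otherwise the first element is emitted and
// the operation is called again (recur) with the incremented tuple.
func printRange(in Pair, emit func(int)) Pair {
	if in[0] == in[1] {
		return in // the "true" edge of if: in -> out
	}
	emit(in[0])
	return printRange(incr1(in), emit) // the "false" edge ends in recur
}

func main() {
	printRange(Pair{0, 10}, func(n int) { fmt.Println(n) })
}
```

Run as written, this prints the numbers 0 through 9, one per line.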

Next Steps

This post is long enough, and I think gives at least a basic idea of what I'm going for. The syntax presented here is extremely rudimentary, and is almost definitely not what any final version of the syntax would look like. But the general idea behind the structure is sound, I think.

I have a lot of further ideas for Ginger I haven't presented here. Hopefully as time goes on and I work on the language more some of those ideas can start taking a more concrete shape and I can write about them.

The next thing I need to do for Ginger is to implement (again) the graph type for it, since the last one I implemented didn't include tuples. Maybe I can extend it instead of re-writing it. After that it will be time to really buckle down and figure out a syntax. Once a syntax is established then it's time to start on the compiler!