From 65c1d5ab1032a09e8f96b0daeae84a5c6dc10b26 Mon Sep 17 00:00:00 2001 From: Brian Picciano Date: Sat, 9 Jan 2021 11:21:29 -0700 Subject: [PATCH] Ginger! --- _posts/2021-01-09-ginger.md | 352 ++++++++++++++++++++++++++++++++++++ 1 file changed, 352 insertions(+) create mode 100644 _posts/2021-01-09-ginger.md diff --git a/_posts/2021-01-09-ginger.md b/_posts/2021-01-09-ginger.md new file mode 100644 index 0000000..3a97d7f --- /dev/null +++ b/_posts/2021-01-09-ginger.md @@ -0,0 +1,352 @@ +--- +title: >- + Ginger +description: >- + Yes, it does exist. +--- + +This post is about a programming language that's been bouncing around in my head +for a _long_ time. I've tried to actually implement the language three or more +times now, but everytime I get stuck or run out of steam. It doesn't help that +everytime I try again the form of the language changes significantly. But all +throughout the name of the language has always been "Ginger". It's a good name. + +In the last few years the form of the language has somewhat solidified in my +head, so in lieu of actually working on it I'm going to talk about what it +currently looks like. + +## Abstract Syntax Lists + +_In the beginning_ there was assembly. Well, really in the beginning there were +punchcards, and probably something even more esoteric before that, but it was +all effectively the same thing: a list of commands the computer would execute +sequentially, with the ability to jump to odd places in the sequence depending +on conditions at runtime. For the purpose of this post, we'll call this class of +languages "abstract syntax list" (ASL) languages. + +Here's a hello world program in my favorite ASL language, brainfuck: + +``` +++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.++ ++.------.--------.>>+.>++. +``` + +(If you've never seen brainfuck, it's deliberately unintelligible. But it _is_ +an ASL, each character representing a single command, executed by the brainfuck +runtime from left to right.) + +ASLs did the job at the time, but luckily we've mostly moved on past them. + +## Abstract Syntax Trees + +Eventually programmers upgraded to C-like languages. Rather than a sequence of +commands, these languages were syntactically represented by an "abstract syntax +tree" (AST). Rather than executing commands in essentially the same order they +are written, an AST language compiler reads the syntax into a tree of syntax +nodes. What it then does with the tree is language dependent. + +Here's a program which outputs all numbers from 0 to 9 to stdout, written in +(slightly non-idiomatic) Go: + +```go +i := 0 +for { + if i == 10 { + break + } + fmt.Println(i) + i++ +} +``` + +When the Go compiler sees this, it's going to first parse the syntax into an +AST. The AST might look something like this: + +``` +(root) + |-(:=) + | |-(i) + | |-(0) + | + |-(for) + |-(if) + | |-(==) + | | |-(i) + | | |-(10) + | | + | |-(break) + | + |-(fmt.Println) + | |-(i) + | + |-(++) + |-(i) +``` + +Each of the non-leaf nodes in the tree represents an operation, and the children +of the node represent the arguments to that operation, if any. From here the +compiler traverses the tree depth-first in order to turn each operation it finds +into the appropriate machine code. + +There's a sub-class of AST languages called the LISP ("LISt Processor") +languages. In a LISP language the AST is represented using lists of elements, +where the first element in each list denotes the operation and the rest of the +elements in the list (if any) represent the arguments. Traditionally each list +is represented using parenthesis. For example `(+ 1 1)` represents adding 1 and +1 together. + +As a more complex example, here's how to print numbers 0 through 9 to stdout +using my favorite (and, honestly, only) LISP, Clojure: + +```clj +(doseq + [n (range 10)] + (println n)) +``` + +Much smaller, but the idea is there. In LISPs there is no differentiation +between the syntax, the AST, and the language's data structures; they are all +one and the same. For this reason LISPs generally have very powerful macro +support, wherein one uses code written in the language to transform code written +in that same language. With macros users can extend a language's functionality +to support nearly anything they need to, but because macro generation happens +_before_ compilation they can still reap the benefits of compiler optimizations. + +### AST Pitfalls + +The ASL (assembly) is essentially just a thin layer of human readability on top +of raw CPU instructions. It does nothing in the way of representing code in the +way that humans actually think about it (relationships of types, flow of data, +encapsulation of behavior). The AST is a step towards expressing code in human +terms, but it isn't quite there in my opinion. Let me show why by revisiting the +Go example above: + +```go +i := 0 +for { + if i > 9 { + break + } + fmt.Println(i) + i++ +} +``` + +When I understand this code I don't understand it in terms of its syntax. I +understand it in terms of what it _does_. And what it does is this: + +* with a number starting at 0, start a loop. +* if the number is greater than 9, stop the loop. +* otherwise, print the number. +* add one to the number. +* go to start of loop. + +This behavior could be further abstracted into the original problem statement, +"it prints numbers 0 through 9 to stdout", but that's too general, as there +are different ways for that to be accomplished. The Clojure example first +defines a list of numbers 0 through 9 and then iterates over that, rather than +looping over a single number. These differences are important when understanding +what code is doing. + +So what's the problem? My problem with ASTs is that the syntax I've written down +does _not_ reflect the structure of the code or the flow of data which is in my +head. In the AST representation if you want to follow the flow of data (a single +number) you _have_ to understand the semantic meaning of `i` and `:=`; the AST +structure itself does not convey how data is being moved or modified. +Essentially, there's an extra implicit transformation that must be done to +understand the code in human terms. + +## Ginger: An Abstract Syntax Graph Language + +In my view the next step is towards using graphs rather than trees for +representing our code. A graph has the benefit of being able to reference +"backwards" into itself, where a tree cannot, and so can represent the flow of +data much more directly. + +I would like Ginger to be an ASG language where the language is the graph, +similar to a LISP. But what does this look like exactly? Well, I have a good +idea about what the graph _structure_ will be like and how it will function, but +the syntax is something I haven't bothered much with yet. Representing graph +structures in a text file is a problem to be tackled all on its own. For this +post we'll use a made-up, overly verbose, and probably non-usable syntax, but +hopefully it will convey the graph structure well enough. + +### Nodes, Edges, and Tuples + +All graphs have nodes, where each node contains a value. A single unique value +can only have a single node in a graph. Nodes are connected by edges, where +edges have a direction and can contain a value themselves. + +In the context of Ginger, a node represents a value as expected, and the value +on an edge represents an operation to take on that value. For example: + +``` +5 -incr-> n +``` + +`5` and `n` are both nodes in the graph, with an edge going from `5` to `n` that +has the value `incr`. When it comes time to interpret the graph we say that the +value of `n` can be calculated by giving `5` as the input to the operation +`incr` (increment). In other words, the value of `n` is `6`. + +What about operations which have more than one input value? For this Ginger +introduces the tuple to its graph type. A tuple is like a node, except that it's +anonymous, which allows more than one to exist within the same graph, as they do +not share the same value. For the purposes of this blog post we'll represent +tuples like this: + +``` +1 -> } -add-> t +2 -> } +``` + +`t`'s value is the result of passing a tuple of two values, `1` and `2`, as +inputs to the operation `add`. In other words, the value of `t` is `3`. + +For the syntax being described in this post we allow that a single contiguous +graph can be represented as multiple related sections. This can be done because +each node's value is unique, so when the same value is used in disparate +sections we can merge the two sections on that value. For example, the following +two graphs are exactly equivalent (note the parenthesis wrapping the graph which +has been split): + +``` +1 -> } -add-> t -incr-> tt +2 -> } +``` + +``` +( + 1 -> } -add-> t + 2 -> } + + t -incr-> tt +) +``` + +(`tt` is `4` in both cases.) + +A tuple with only one input edge, a 1-tuple, is a no-op, semantically, but can +be useful structurally to chain multiple operations together without defining +new value names. In the above example the `t` value can be eliminated using a +1-tuple. + +``` +1 -> } -add-> } -incr-> tt +2 -> } +``` + +When an integer is used as an operation on a tuple value then the effect is to +output the value in the tuple at that index. For example: + +``` +1 -> } -0-> } -incr-> t +2 -> } +``` + +(`t` is `2`.) + +### Operations + +When a value sits on an edge it is used as an operation on the input of that +edge. Some operations will no doubt be builtin, like `add`, but users should be +able to define their own operations. This can be done using the `in` and `out` +special values. When a graph is used as an operation it is scanned for both `in` +and `out` values. `in` is set to the input value of the operation, and the value +of `out` is used as the output of the operation. + +Here we will define the `incr` operation and then use it. Note that we set the +`incr` value to be an entire sub-graph which represents the operation's body. + +``` +( in -> } -add-> out + 1 -> } ) -> incr + +5 -incr-> n +``` + +(`n` is `6`.) + +The output of an operation may itself be a tuple. Here's an implementation and +usage of `double-incr`, which increments two values at once. + +``` +( in -0-> } -incr-> } -> out + } + in -1-> } -incr-> } ) -> double-incr + +1 -> } -double-incr-> t -add-> tt +2 -> } +``` + +(`t` is a 2-tuple with values `2`, and `3`, `tt` is `5.) + +### Conditionals + +The conditional is a bit weird, and I'm not totally settled on it yet. For now +we'll use this. The `if` operation expects as an input a 2-tuple whose first +value is a boolean and whose second value will be passed along. The `if` +operation is special in that it has _two_ output edges. The first will be taken +if the boolean is true, the second if the boolean is false. The second value in +the input tuple, the one to be passed along, is used as the input to whichever +branch is taken. + +Here is an implementation and usage of `max`, which takes two numbers and +outputs the greater of the two. Note that the `if` operation has two output +edges, but our syntax doesn't represent that very cleanly. + +``` +( in -gt-> } -if-> } -0-> out + in -> } -> } -1-> out ) -> max + +1 -> } -max-> t +2 -> } +``` + +(`t` is `2`.) + +It would be simple enough to create a `switch` macro on top of `if`, to allow +for multiple conditionals to be tested at once. + +### Loops + +Loops are tricky, and I have two thoughts about how they might be accomplished. +One is to literally draw an edge from the right end of the graph back to the +left, at the point where the loop should occur, as that's conceptually what's +happening. But representing that in a text file is difficult. For now I'll +introduce the special `recur` value, and leave this whole section as TBD. + +`recur` is cousin of `in` and `out`, in that it's a special value and not an +operation. It takes whatever value it's set to and calls the current operation +with that as input. As an example, here is our now classic 0 through 9 printer +(assume `println` outputs whatever it was input): + +``` +// incr-1 is an operation which takes a 2-tuple and returns the same 2-tuple +// with the first element incremented. +( in -0-> } -incr-> } -> out + in -1-> } ) -> incr-1 + +( in -eq-> } -if-> out + in -> } -> } -0-> } -println-> } -incr-1-> } -> recur ) -> print-range + +0 -> } -print-range-> } +10 -> } +``` + +## Next Steps + +This post is long enough, and I think gives at least a basic idea of what I'm +going for. The syntax presented here is _extremely_ rudimentary, and is almost +definitely not what any final version of the syntax would look like. But the +general idea behind the structure is sound, I think. + +I have a lot of further ideas for Ginger I haven't presented here. Hopefully as +time goes on and I work on the language more some of those ideas can start +taking a more concrete shape and I can write about them. + +The next thing I need to do for Ginger is to implement (again) the graph type +for it, since the last one I implemented didn't include tuples. Maybe I can +extend it instead of re-writing it. After that it will be time to really buckle +down and figure out a syntax. Once a syntax is established then it's time to +start on the compiler!