I purchased an EVB001 eval board for the GA144 chip from
GreenArrays a year ago. I wasn't sure what I wanted to do with it at the time but eventually I realised the GA144 might be more appropriate for low power digital signal processing apps than the FPGA apps I'd been developing.
A couple of months ago I started trying to program it. Conceptually the GA144 seems simple: 144 asynchronous cores, CPU instruction set is a variant of Forth called colorForth, multiple ADCs, DACs and I/O ports. What could go wrong? :)
Well, for starters I hadn't written in Forth before. One of my first jobs (about 35 years ago) was writing assembly language code for 8080 and Z80 chips and I remembered many of the problems with low-level programming. Memory management was paramount. Speed of operation was next most important.
I later became a C programmer. Once again 90% of my time was spent managing memory. When I discovered a language called Perl in 1989 which had automatic memory management I was in Heaven.
A recent
blog reminded me of the sheer elegance Perl is capable of. Here is a minimal Quicksort script that had me simply gazing in admiration:
sub quicksort {
@_ and return (
quicksort (grep { $_ < $_[0] } @_),
grep ({ $_ == $_[0] } @_),
quicksort (grep { $_ > $_[0] } @_)
);
();
}
$, = ','; say quicksort(10,2,5,3,1,9,7,4,2,3,4,8,6);
The algorithm takes full advantage of Perl's automated memory management. Copies of lists created by the greps and the recursive calls to quicksort probably eat up huge chunks of RAM. But the "simplicity" and "obviousness" of the algorithm are outstanding.
So could I implement Quicksort in colorForth in the arrayForth IDE simulator?
A quick scan of the online literature brought up
this. The problem for me is that it is written in "standard" Forth. I hadn't looked at Forth until about a year ago and to complicate matters colorForth and Forth have almost nothing in common syntactically. Even when the instructions are the same (e.g. 'or') they do different things ('or' is "inclusive OR" in Forth, "exclusive OR" in colorForth). So unless I was prepared to thoroughly learn Forth in addition to colorForth there was no way I could use Will Baden's algorithm.
So I decided to implement the algorithm from scratch using the GreenArrays
manuals and user guides.
I quickly discovered I was back in memory management hell.
Programming the GA144 is mostly about memory management. Each core (GreenArrays call it a 'node') has only 64 18-bit words of RAM plus 64 18-bit words of ROM library. The 32 colorForth instructions are sufficient to implement a viable version of Forth. There is a 10-word data stack and a 9-word return stack, a general purpose register, A, and a write-only B register used mostly to talk to the I/O ports but which can address all of RAM and ROM.
There are 144 nodes in the GA144 chip running asynchronously but able to synchronise with the other nodes via the I/O port communication mechanism.
So the first decision I had to make was whether to use one node or multiple? The Perl example used only 13 elements in the list and since
Quicksort uses O(log N) stack elements it seemed possible to fit the code and the data in the same node. (The example below runs out of stack space with 18 or more elements in the data array. It also runs out of instruction space with more than 19 elements.)
The second decision was to follow the
wikipedia example using in-place element exchanges rather than creating multiple sub-lists for later merging as used in the Perl example.
The following code is mostly a line-for-line transliteration of the
wikipedia example. I did steal one idea from Will Baden's example. I used his midpoint pivot selection code to bypass Quicksort's weakness when sorting already sorted lists.
The arrayForth IDE is a PITA. It's a steep learning curve for a total beginner like me. And it can't import code from outside sources. Every character has to be hand-typed. So "literate" documentation is virtually impossible and largely a waste of time. But eventually I found that the ability to quickly change an instruction and see its effect in the Softsim simulator provided me with all I needed to learn the colorForth instruction set. Admittedly the IDE won't teach me the idioms Forth programmers have used for decades. Nor will it teach me the idioms colorForth programmers have discovered. But, again, the IDE contains the full source code for colorForth, arrayForth, polyForth and a heap of tests displaying how one uses colorForth in "real life".
I also found useful idioms in Chuck Moore's notes about colorForth's
instructions.
And GreenArrays' email support has been superb. Every question of mine, no matter how dumb, has been answered quickly and comprehensively.
The program code below is in coloured text (red for method definitions, green for instructions and yellow for data) and I've interspersed my comments in black text.
The code and data fill 56 of the available 64 words of RAM. Not much room left for experimentation. The starting line (at the end of the listing) loads the data stack with the beginning and ending address of the data array and calls
qsort which calls
part to divide the data array into partitions of smaller and larger elements around a pivot value and then recursively calls itself on each of these subarrays.
part partitions the array into three sections. The "left" section contains all array elements less than or equal to the pivot value. The "pivot" section contains the pivot value. And the "right" section contains all array elements greater than the pivot value. At completion the data stack contains (left, right, pivot) addresses. (Pivot address is in T).
But most of my time was spent trying to keep the use of the data stack to a minimum because of the recursive calls to
qsort. So often I had to store data in A or R rather than leaving it on the data stack for fear that a recursive call would push previous start and end addresses off the end of the 10-element stack. In fact this does happen when there's more than 18 elements in the data array.
503 node 0 org
I picked node (CPU) 503 randomly. Nothing special about it.
19 , 4 , 7 , 1 , 12 , 13 , 0 , 11 , 3 , 9 ,
17 , 16 , 10 , 6 , 2 , 13 , 5 ,
I put the data array at the start because I always know its start address (0).
setab over over b! a! ;
Duplicate S and T then move T into B and S into A.
exch setab @ @b ! !b ;
Array addresses are in S and T. Call
setab to load A and B with array element addresses, then swap array elements pointed to by A and B.
1+ 1 . + ; 1- -1 . + ;
Increment/decrement T.
part exch drop
Exchange the pivot element (address in T) with the end element (address in S). The pivot address in T can now be dropped from the stack as it is no longer needed.
over over - . + 1+ - push
S and T contain start and end addresses of array to sort. Calculate length of array less 1 for loop counter. Push loop cntr into R register.
setab over
setab puts array start address (S) into A and end address (T) into B. The end array element contains the pivot value. And at start of loop S is also the address of where values less than or equal to the pivot value will be swapped into (referred to as sx, switchindex, hereafter) so put a copy of sx into T (
over).
begin
push a
Save sx onto return stack (nowhere else to safely store it) and load A into data stack
@ @b - . +
Load contents pointed to by A onto data stack, load contents pointed to by B into stack, subtract.
-if drop pop exch 1+ dup push
If T is negative *A (array element pointed to by A) is less than or equal to *SX (pivot value, pointed to by SX) so pop SX into T, exchange *A and *SX, increment SX and push it back into R. The
dup creates a dummy value for the next line to drop
then drop
Drops either result of subtract or dummy value.
1+ a! dup b! pop
Drop leaves A in T. Increment T and store it in A so A points to next element in array. T now holds the value of B. Make a copy and pop it into B because the element exchange destroys the previous contents of B. Then pop SX into T.
next exch ;
When loop finishes T holds pointer to pivot value and S holds SX pointer. So can swap pivot value into its correct place prior to return. After exchange stack holds (left, right, sx).
qsort over - over . +
Duplicate the start and end addresses and subtract the start address from the end address.
-if drop ;
If the array addresses are equal there is nothing more to sort so return with stack effectively unchanged by dropping the result of the subtraction.
then 2/ push over pop . +
If the array addresses are unequal there is something to sort. After the subtraction T contains the length of the array so halve it with a left shift (
2/) and add it to the start address giving the address of the midpoint of the array.
part
part is called with the data stack containing (left, right, mid) addresses.
a! push a 1- a 1+ pop
part returns (left, right, sx) so pop SX into A, push right index into R, retrieve A, decrement it, retrieve another copy of A, increment it and then pop R into T. This leaves (left, pivot-1, pivot+1, right) in stack.
qsort drop drop qsort ;
First call to
qsort will sort right partition. Returns when right partition is (recursively) sorted. So upon return drop pivot+1 and right. Second call to
qsort will (recursively) sort left partition. Upon return whole array is sorted.
0 16 qsort r---
Call
qsort with start and end addresses of data array.
Code without comments:
503 node 0 org
19 , 4 , 7 , 1 , 12 , 13 , 0 , 11 , 3 , 9 ,
17 , 16 , 10 , 6 , 2 , 13 , 5 ,
setab over over b! a! ;
exch setab @ @b ! !b ;
1+ 1 . + ; 1- -1 . + ;
part exch drop
over over - . + 1+ - push
setab over
begin
push a
@ @b - . +
-if drop pop exch 1+ dup push
then drop
1+ a! dup b! pop
next exch ;
qsort over - over . +
-if drop ;
then 2/ push over pop . +
part
a! push a 1- a 1+ pop
qsort drop drop qsort ;
0 16 qsort r---
I wonder what a multi-core version of this algorithm would look like? And it's also pretty obvious that a recursive algorithm is not feasible with such a limited data stack. Might try a comb sort next.