What does optimised pure lisp look like?

In my previous post on Calculating a DOT product I mentioned making a fast pure lisp implementation of the dot product function. Let’s dig into that.

Let’s see if we can compete with the magicl code, or the highly optimised C code in numpy. My goals to explore are twofold:

  1. Make pure lisp code comparable in speed to the R and python best cases
  2. Make the code readable, we’re not going to use fancy tricks or algorithms, or reduce readability (too highly optimised code leads to hard to identify bugs and difficulties maintaining!)

This post is heavily motivated by the magicl dot product code and it appears in the following categories:

Initial setup

As we did previously, let’s create two lists, each being 10 million random numbers between 0 and 1.

(defun randvect (length)
  "A simple list (length LENTH) of random numbers [0,1)"
  (loop for i below length collect (random 1.0)))


(defvar n 10000000      "The length of our test data")
(defvar a (randvect n)  "A list, length N, of random numbers")
(defvar b (randvect n)  "Another list, length N, of random numbers")

Thinking about how we should store our data

lists versus arrays
lists aren’t ideal for storing vectors, lookup speed grows with length, whilst arrays have a constant time. This is important when building models, where the list/array might be 10 million (or more) observations of our experiment
(defparameter array-a (make-array (list (length a)) :initial-contents a))
(defparameter array-b (make-array (list (length b)) :initial-contents b))

Previous version of the code

This is the code I had for naïve-dot-v2 (which I’m calling v1 here)

(loop for i below (length array-a)
      sum (* (aref array-a i) (aref array-b i)))
v1 - 270.0504ms

Let’s turn this into a function so we can optimise it…

(defun dot-v2 (array1 array2)
  (loop for i below (length array1)
        sum (* (aref array1 i) (aref array2 i))))
v2 - 234.21423ms

Adding a check

Before we optimise the code, we should add a check to make sure the arrays are the same size:

(defun dot-v3 (array1 array2)
  (let ((size1 (length array1))
        (size2 (length array2)))
    (assert (cl:= size1 size2))
    (loop for i below size1
          sum (* (aref array1 i) (aref array2 i)))))
v3 - 220.24509ms

I’ve created two new variables size1 and size2 and assert that they are equal.

I’m not really sure why this was faster, it was probably a function of whatever else was (or wasn’t) running on my machine at the time, but it’s not really material.

I want speed!

Let’s be clear - we want to optimise for SPEED, so be explicit in line 2.

1
2
3
4
5
6
7
(defun dot-v4 (array1 array2)
  (declare (optimize speed))
  (let ((size1 (length array1))
        (size2 (length array2)))
    (assert (cl:= size1 size2))
    (loop for i below size1
          sum (* (aref array1 i) (aref array2 i)))))
v4 - 222.48372ms

No improvement!

Why? We haven’t given any hints to the how the compiler/interpreter can improve things. Let’s remove the optimise command and add some type hints.

Adding data type information

Now let’s use lisp’s gradual typing features. I’m planning to write about this in another post (and I’ll update this one to point to it).

data type
Python and R are making assumptions on the data we’re storing, they’re both assuming floating point numbers.

How do I this in lisp? Easily. We just add :element-type 'single-float to the make-array command. This is exactly comparable to the dtype=float32 flag you’d use in python’s NumPy/TensorFlow/etc.

(defparameter array-a-v2 (make-array (list (length a))
                                     :initial-contents a
                                     :element-type 'single-float))
(defparameter array-b-v2 (make-array (list (length b))
                                     :initial-contents b
                                     :element-type 'single-float))

Gradual typing is where we can define the type of data in a particular function - but we don’t have to everywhere. (Most) Other languages have an all-or-nothing approach, but lisp allows you to choose.

Below, in line 2, we simply declare that the type of the inputs array1 and array2 are simple-array single-float (*), which means:

simple-array
just take this as it reads - a simple array - or dig into the CLHS for details
single-float
elements are floating point numbers (as opposed to, say double precision or complex numbers)
(*)
indicates this is a single dimensional array, i.e. vector, of unknown size
1
2
3
4
5
6
7
(defun dot-v5 (array1 array2)
  (declare (type (simple-array single-float (*)) array1 array2))
  (let ((size1 (length array1))
        (size2 (length array2)))
    (assert (cl:= size1 size2))
    (loop for i below size1
          sum (* (aref array1 i) (aref array2 i)))))
v5 - 120.991104ms

We’ve almost halved the run time - because the multiplication on line 7 now can be simplified to only handle single-float arguments rather than a generic operation that works for integers, single and double floats, ratios, complex numbers, …

Adding type information to the loop: the accumulator

We have now optimised our multiplication to work for out single-float arguments, but the accumulator (i.e. the sum part) in the loop function doesn’t know the type of it’s arguments. However, we do. The sum of two values of type single-float is another single-float. In line 7 we use that information:

1
2
3
4
5
6
7
8
9
(defun dot-v6 (array1 array2)
  (declare (type (simple-array single-float (*)) array1 array2))
  (let ((size1 (length array1))
        (size2 (length array2)))
    (assert (cl:= size1 size2))
    (loop for i below size1
          with s of-type single-float = 0.0
          do (incf s (* (aref array1 i) (aref array2 i)))
          finally (return s))))

We’ve also had to add lines 8 and 9 to replicate what the sum keyword is providing in the loop.

The results were staggering when I first saw them.

v6 - 11.116839ms

We took an order of magnitude off the run-time and are now in the realms of the NumPy - hand tuned, optimised and mostly unreadable, C.

Optimise for speed, again

Again, let’s be explicit that we want SPEED.

(defun dot-v7 (array1 array2)
  (declare (optimize speed)
           (type (simple-array single-float (*)) array1 array2))
  (let ((size1 (length array1))
        (size2 (length array2)))
    (assert (cl:= size1 size2))
    (loop for i below size1
          with s of-type single-float = 0.0
          do (incf s (* (aref array1 i) (aref array2 i)))
          finally (return s))))

The lisp implementation I’m using (SBCL) is already a high performance Common Lisp compiler, so there’s not much to be squeezed out here.

v7 - 10.600901ms

Adding type information to the loop: array index

We can help the compiler out a little more by letting it know that the loop variable i is for indexing arrays:

(defun dot-v8 (array1 array2)
  (declare (optimize speed)
           (type (simple-array single-float (*)) array1 array2))
  (let ((size1 (length array1))
        (size2 (length array2)))
    (assert (cl:= size1 size2))
    (loop for i of-type alexandria:array-index below size1
          with s of-type single-float = 0.0
          do (incf s (* (aref array1 i) (aref array2 i)))
          finally (return s))))

With this we can squeeze out a little more speed, but to be honest this result fluctuated depending on what else my computer was doing.

v8 - 10.57616ms

Almost the end

At this point I would look into defgeneric where I can define multiple dot functions each optimised for different inputs, so double-float, complex-float, etc. This is similar to overloading in other languages. However, I’ll leave that for another day.

Now, if this was a language benchmark post, I’d stop here. However, I want to use this dot function in a practical use case. That’s what I do in some following articles where I build a AI/ML model (a neural network in fact) in lisp.

Finally

Let’s compare the initial function and the optimised one:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
(defun final-dot (array1 array2)
  (declare (optimize speed)
           (type (simple-array single-float (*)) array1 array2))
  (let ((size1 (length array1))
        (size2 (length array2)))
    (assert (cl:= size1 size2))
    (loop for i of-type alexandria:array-index below size1
          with s of-type single-float = 0.0
          do (incf s (* (aref array1 i) (aref array2 i)))
          finally (return s))))


(defun initial-dot (array1 array2)
  (loop for i below (length array1)
        sum (* (aref array1 i) (aref array2 i))))
Line 1
Defining the function [Same as line 13]
Line 2
Tell the compiler we want fast code
Line 3
Tell the compiler the arguments are single-float
Lines 4 – 6
Check to see that the vectors are the same length
Line 7
Create our iterating variable [Effectively, same as line 14]
Line 8
Tell the compiler that our accumulator variable s is a single-float
Lines 9 & 10
Sum up the product of the array elements [Same as line 15]

Effectively, by adding lines 3 and 8 we have taken the time to calculate the dot product from 270ms down to 10.576ms, speeding it up by more than 25 times and ending with a run time that is comparable to Numpy (11.154ms - previously ).

An aside…

Why did I say the lines 9 & 10 are the same as line 15?

The loop macro in common lisp is incredibly versatile.

The code generated by these two loops is virtually identical

;; Loop version 1
(loop for i below 5
      sum i)

;; Loop version 2
(loop for i below 5
      with s = 0
      do (incf s i)
      finally (return s))

The only real difference is the variable #:LOOP-SUM-708 in this first code block acting as a proxy for S in the second loop.

This is the expansion of the first loop (using (macroexpand-1 '(loop for i below 5 sum i))):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
(block nil
  (let ((i 0))
    (declare (ignorable i)
             (type (and number real) i))
    (;; some sbcl specific defintions of the loop
     ;; variable #:loop-sum-708
     (tagbody
      sb-loop::next-loop
        (when (>= i '5) (go sb-loop::end-loop))
        (setq #:loop-sum-708 (+ #:loop-sum-708 i))
        (sb-loop::loop-desetq i (1+ i))
        (go sb-loop::next-loop)
      sb-loop::end-loop
        (return-from nil #:loop-sum-708)))))

The expansion of the second loop, where lines 5 & 6 (defining the variable S) are the only material differences.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
(block nil
  (let ((i 0))
    (declare (ignorable i)
             (type (and number real) i))
    (let ((s 0))
      (declare (ignorable s))
      (tagbody
       sb-loop::next-loop
         (when (>= i '5) (go sb-loop::end-loop))
         (incf s i)
         (sb-loop::loop-desetq i (1+ i))
         (go sb-loop::next-loop)
       sb-loop::end-loop
         (return s)))))

In case you were wondering about the (incf s i) in line 10

(macroexpand-1 '(incf s i))
(setq s (+ i s))

which is the same form as line 10 in the first expansion.

Aren’t lisp macros amazing!?