It should go without saying: Choose the right programming language for your problem.

tl;dr - You can get fast numerical code without resorting to bug-prone, highly tuned and optimised, unreadable implementations.

Here I explore the code and efficiency (or more accurately, run-time) of taking the dot product of two vectors that are each ten million elements long. I’ll look at python, R and common lisp, and in each case implement the dot product myself as well as using an optimised library. The fastest implementation may surprise you.

This is not a post on benchmarking; the timings in the final section aren’t taken on a particularly powerful machine[1] or intended to be rigorous. They’re simply to show that rapid prototyping, followed by optimising what matters, can be worthwhile. It was, however, motivated by my question on Reddit.


What’s a dot product and why do I care?

A dot product between two vectors \(A\) and \(B\) can be thought of as the sum of the products of their corresponding elements, i.e.

\begin{align} A \cdot B &= \sum_{i=1}^{n}a_{i}\,b_{i} \\ &= a_{1}\,b_{1} + a_{2}\,b_{2} + \ldots \end{align}

So why do I (or should you) care? Dot products are pretty fundamental to machine learning - as I’ll show in some upcoming posts. That means, when you have lots of data, you’re doing this operation lots of times. So how well your language implements this matters.
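To make the formula concrete, here’s a tiny worked example (in python, since that’s the first language below):

```python
import numpy as np

# (1, 2, 3) . (4, 5, 6) = 1*4 + 2*5 + 3*6 = 4 + 10 + 18 = 32
a_small = np.array([1.0, 2.0, 3.0])
b_small = np.array([4.0, 5.0, 6.0])

print(np.dot(a_small, b_small))  # 32.0
```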

Let’s look at two common (and one not so common[2]) languages and compare timings.

A python implementation

Python is currently the go-to language for machine-learning practitioners. There are many extremely powerful frameworks, such as PyTorch and TensorFlow, and the scientific python ecosystem underneath them is built upon NumPy. If you’re reading this post, you probably know all of this.

Let’s set up our random vectors…

import time
import numpy as np

# How big do we want the vector to be?  10 million!
n = 10000000

# Make two arrays of 10 million random numbers
a = np.random.rand(n)
b = np.random.rand(n)

Naïve loop

If we were to evaluate the dot product with a loop over each of the elements it would look something like

def dot_prod_loop():
    n = len(a)
    tic = time.time()
    c = 0
    for i in range(n):
        c += a[i] * b[i]
    toc = time.time()
    print(f"Python: loop version:  {1000*(toc-tic):.3f}ms")

# Run it
dot_prod_loop()
Python: loop version:  3632.499ms
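As an aside, the same loop can be written more idiomatically with zip and a generator expression. This is only a sketch (with a much smaller n, for illustration); in my experience it stays in the same ballpark as the indexed loop, since the multiply-and-add still happens element by element in the interpreter:

```python
import numpy as np

n = 1000  # reduced from 10 million, just for illustration
a = np.random.rand(n)
b = np.random.rand(n)

# Same element-by-element work as the indexed loop, without explicit indices
c = sum(x * y for x, y in zip(a, b))

# Agrees with NumPy up to floating-point rounding
print(abs(c - np.dot(a, b)) < 1e-9)  # True
```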

NumPy

Naturally, we can do better by using the NumPy implementation:

def dot_prod_numpy():
    tic = time.time()
    c = np.dot(a,b)
    toc = time.time()
    print(f"Python: numpy version:  {1000*(toc-tic):.3f}ms")

# Run it
dot_prod_numpy()
Python: numpy version:  11.154ms

So the “vectorised” python version is about 330 times faster!

Of course this is a somewhat unfair comparison, because the NumPy version is actually highly optimised and compiled C code. It’s not pretty or concise (and that’s coming from someone who chooses to code in C!), but it’s fast.
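If you wanted slightly sturdier numbers than a single pair of time.time() calls, python’s standard timeit module repeats the measurement and lets you take the best of several runs. A sketch (again with a smaller vector than the ten million used above):

```python
import timeit
import numpy as np

n = 1_000_000  # smaller than 10 million, to keep this quick
a = np.random.rand(n)
b = np.random.rand(n)

# Best-of-5, 10 calls per run; the minimum is least affected by system noise
per_call = min(timeit.repeat(lambda: np.dot(a, b), number=10, repeat=5)) / 10
print(f"Python: numpy version: {1000 * per_call:.3f}ms")
```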

An alternative implementation in R

I used to write a lot in R, so I thought I’d dust it off for a comparison.

Setting our random vectors again

n <- 10000000
a <- runif(n)
b <- runif(n)

Naïve loop

As with the python case, we loop over each of the elements:

# Loop
tic = Sys.time()
result = 0
for (i in 1:length(a))
  result = result + a[i] * b[i]
toc = Sys.time()
cat("R: loop version:  ", round(1000*(toc-tic), 3), "ms\n")
R: loop version:   533.698 ms

Around 7 times faster than the naïve python implementation. Not bad.

Sum the element-wise multiplication

An alternative is to use the built-in element-wise multiplication feature and sum up the results

# Another simplistic approach
tic = Sys.time()
result = sum(a*b)
toc = Sys.time()
cat("R: summing element-wise multiplication version:  ", round(1000*(toc-tic), 3), "ms\n")
R: summing element-wise multiplication version:   1599.384 ms

Ouch. This is not what I would have expected from native, vectorised functions - presumably because a*b allocates a full ten-million-element temporary vector before the sum.
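For what it’s worth, NumPy has the same pattern and, in my experience, a similar (if milder) penalty: np.sum(a * b) has to materialise the full intermediate array a * b before reducing it, whereas np.dot does the multiply-and-accumulate in a single pass. A quick sketch of the equivalence (not a benchmark):

```python
import numpy as np

n = 1000  # small, for illustration only
a = np.random.rand(n)
b = np.random.rand(n)

# Same mathematical result...
print(np.isclose(np.sum(a * b), np.dot(a, b)))  # True
# ...but a * b allocates a full temporary array of n elements first,
# which np.dot avoids.
```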

Built-in matrix multiplication

As R is a language built for mathematical operations, we could just use the built-in matrix multiplication operator.

# Built in dot product
tic = Sys.time()
result = a %*% b
toc = Sys.time()
cat("R: Base matrix multiplication version:  ", round(1000*(toc-tic), 3), "ms\n")
R: Base matrix multiplication version:   32.981 ms

As expected, this looks to have been optimised reasonably well.

Geometry library

How about a library that provides a dot function? It’s compiled code so you’d expect it to be fast. I saw this suggestion on StackOverflow.

library(geometry)
tic = Sys.time()
result = dot(a, b)
toc = Sys.time()
cat("R: geometry dot version:  ", round(1000*(toc-tic), 3), "ms\n")
R: geometry dot version:   386.445 ms

This is unexpectedly slow.

Like python, a lot of R’s packages aren’t actually written in R; they’re written in C, C++ or Fortran to get the speed you want.

Common lisp implementations

How about common lisp?

Set up our random vectors yet again. It’s lisp, so let’s just use a list.

(defun randvect (length)
  "A simple list (length LENGTH) of random numbers [0,1)"
  (loop for i below length collect (random 1.0)))

(defparameter n 10000000)
(defparameter a (randvect n))
(defparameter b (randvect n))

Now, let’s use some of the power of common lisp and write a macro to make generating the timing functions easier. If that sentence doesn’t make sense, this is just a quick way to duplicate the same code across all of my timing functions.

(defmacro mytime (name description &body body)
  "Create a function called NAME to get timing when running BODY"
  `(defun ,name ()
     (let ((tic)
           (toc))
       (setf tic (get-internal-real-time))
       ,@body
       (setf toc (get-internal-real-time))
       (format t "lisp: ~a version: ~fms~&"
               ,description
               (/ (- toc tic) (/ internal-time-units-per-second 1000))))))

Naïve loop

Let’s try two different looping approaches.

v1

One of the simplest data types in lisp is the list. It’s not a numerical type like in other languages; it’s just a linked list of elements…

(mytime naive-dot-v1 "naive loop v1"
  (loop for i in a
        for j in b
        sum (* i j)))

v2

Let’s test an array object instead; this would perhaps be preferable?

;; Let's make our lists arrays
(defparameter aa (make-array (list (length a)) :initial-contents a))
(defparameter ab (make-array (list (length b)) :initial-contents b))

(mytime naïve-dot-v2 "naïve loop v2"
  (loop for i below (length aa)
        sum (* (aref aa i) (aref ab i))))


(naive-dot-v1)

(naïve-dot-v2)
lisp: naive loop v1 version: 163.385ms
lisp: naïve loop v2 version: 270.050ms

Slower than the best python and R implementations, but not bad for a first-pass approach in a language not designed for numerics.

Magicl library

We can optimise pure lisp to make it fast…

;; Assumes MAGICL has been loaded, e.g. via (ql:quickload :magicl)
(defparameter ma (magicl:from-list a (list n)))
(defparameter mb (magicl:from-list b (list n)))

(mytime magicl-dot "magicl pure lisp"
  (magicl:dot ma mb))

(magicl-dot)
lisp: magicl pure lisp version: 11.042ms

Ever so slightly faster than the NumPy version (by less than 1 ms) - not bad at all for pure lisp.

So, what does “optimised” pure lisp look like?

Well, that’s for another write-up on Making (pure) lisp functions faster. I’ve put quotes around “optimised” as this isn’t a fancy algorithm; it’s simply putting variable type hints in the code. The implementation I’ll discuss is a bit of a cheat compared to MAGICL: I haven’t generalised it to work for complex numbers.

What I will do in this post is share the result…

lisp: array-dot-lisp version: 10.576ms

Python vs. R vs. lisp - timing

What does the following table show us? Python (NumPy), R and common lisp can all produce code that is comparably fast, but in each case you need to think a little bit about your problem set and choose the right approach.

The other take-away that I think is important is that (for my use case!) it appears to take considerably less effort and development skill - and to be less bug-prone - to turn a rapid prototype in lisp into a fast, production-ready implementation than it would be in python or R.

Language   Implementation           Time (ms)   Rel. to NumPy
lisp       Pure lisp                   10.576     0.9
Python     NumPy                       11.154     1.0
lisp       MAGICL pure lisp            11.042     1.0
R          Matrix multiplication       32.981     3.0
lisp       Naïve loop (v1)            163.385    14.6
lisp       Naïve loop (v2)            270.050    24.2
R          Geometry library           386.445    34.6
R          Naïve loop                 533.698    47.8
R          Sum element product       1599.384   143.4
Python     Naïve loop                3632.499   325.7
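(The “Rel. to NumPy” column is simply each implementation’s time divided by the NumPy time; a quick check of a few entries in python:)

```python
# Reproducing a few entries of the "Rel. to NumPy" column from the table
numpy_ms = 11.154
for name, ms in [("Pure lisp", 10.576),
                 ("Matrix multiplication (R)", 32.981),
                 ("Naïve loop (Python)", 3632.499)]:
    print(f"{name}: {ms / numpy_ms:.1f}")
# Pure lisp: 0.9
# Matrix multiplication (R): 3.0
# Naïve loop (Python): 325.7
```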

  1. The computer was an Early 2015 MacBook Air, 2.2 GHz Dual-Core Intel Core i7 with 8 GB of memory. ↩︎

  2. This wasn’t intended as a pun, but I’ll take it! ↩︎