# My Julia workflow: Readability vs. Performance

Hi!

After my last post, I got many questions like: “What’s Julia? Why you are using Julia? Is it worth it?”

Well, I will try to answer the last two questions by showing my workflow: how I code and what kind of gains I get (for the first question I will point you to the official website 🙂 )

Required background

It is required a little knowledge of C and Julia so that you can fully understand this post.

## My use case

Many of my activities at INPE require analysis that can only be accomplished by optimization algorithms or Monte Carlo simulations. For example, during the Pre-Phase A studies at the Space Mission Integrated Design Center (CPRIME), we sometimes use evolutionary algorithms to obtain optimal solutions for the spacecraft design. It is also common to use Monte Carlo simulations to find the worst case scenarios, e.g. the initial state of the spacecraft (attitude and angular velocity) that yields the longest time to acquire the Sun pointing. Thus, the algorithms that I use are often required to run a thousand, even a million, times. So performance is something that really matters for me.

On the other hand, we are mostly engineers that are used to MATLAB. The algorithms are written to solve equations and simulate dynamic systems. Hence, the closest the algorithm is to the mathematical equations the better.

In the following, I will describe my workflow: how do I write Julia code from the initial validation up to the version that is meant to be used inside an optimizer. I tried many approaches over the years, but this one was what gave me the shortest time to obtain the results.

## Numerical example

To illustrate, let’s assume that we want an algorithm to perform the Kalman filter covariance update of a dynamic system. Do not give too much importance to those equations, I just selected a random scenario to improve the clarity of what we are doing.

Initialization

\begin{aligned}\mathbf{F}_{k-1} &= \exp\left[\begin{array}{c c c} \mathbf{w}^{\times} & -\mathbf{I}_3 & \mathbf{0}_{3 \times 12} \\ \mathbf{0}_{3 \times 3} & \lambda \cdot \mathbf{I}_3 & \mathbf{0}_{3 \times 12} \\ \mathbf{0}_{12 \times 3} & \mathbf{0}_{12 \times 3} & \mathbf{0}_{12 \times 12 } \end{array} \right] \\ \mathbf{H}_k &= \left[ \begin{array}{c c c c c c} \mathbf{s}^{\times} & \mathbf{0}_{3 \times 3} & \mathbf{s}^{\times} & \mathbf{I}_3 & \mathbf{0}_{3 \times 3} & \mathbf{0}_{3 \times 3} \\ \mathbf{B}^{\times} & \mathbf{0}_{3 \times 3} & \mathbf{0}_{3 \times 3} & \mathbf{0}_{3 \times 3} & \mathbf{B}^{\times} & \mathbf{I}_3 \end{array} \right] \\ \mathbf{P}_p &= \mathbf{I}_{18} \\ \mathbf{P}_u &= \mathbf{I}_{18} \end{aligned}

Simulation loop

For $$k = 2$$ to $$60000$$, do:

\begin{aligned} \mathbf{P}_p &= \mathbf{F}_{k-1} \mathbf{P}_p \mathbf{F}_{k-1}^{T} + \mathbf{Q} \\ \mathbf{K}_k &= \mathbf{P}_p\mathbf{H}_k^T(\mathbf{R} + \mathbf{H}_{k} \mathbf{P}_p \mathbf{H}_{k}^{T}) \\ \mathbf{P}_u &= (\mathbf{I}_{18} – \mathbf{K}_k\mathbf{H}_k)\mathbf{P}_p(\mathbf{I}_{18}-\mathbf{K}_k\mathbf{H}_k)^T + \mathbf{K}_k\mathbf{R}\mathbf{K}_k^T \end{aligned}

Every time $$\mathbf{P}_u$$ is modified, store $$\mbox{trace}(\mathbf{P}_u)$$.

Note

$$\mathbf{w}^{\times}$$, $$\mathbf{s}^{\times}$$, and $$\mathbf{B}^{\times}$$ are constant matrices, $$\mathbf{0}_{n \times m}$$ is a $$n \times m$$ matrix filled with 0s, and $$\mathbf{I}_n$$ is an $$n \times n$$ identify matrix.

## First: Code fast, as readable as possible

The very first thing is to code an initial version of the algorithm you want without worrying too much about the performance. In this phase, readability is paramount, because it will make debugging easier.

Our first version of the code looks like:

using LinearAlgebra

# Constants
# ==============================================================================

const O3x3   = zeros(3,3)
const O3x12  = zeros(3,12)
const O12x18 = zeros(12,18)

function kalman_filter()

# Constants
# ==========================================================================

λ = 1/100
Q = 1e-20I
R = 1e-2I

# Dynamic model
# ==========================================================================

wx = Float64[ 0 -1  0;
1  0  0;
0  0  0;]

Ak_1 = [  wx    -I   O3x12;
O3x3  λ*I   O3x12;
O12x18      ;]

Fk_1 = exp(Ak_1)

# Measurement model
# ==========================================================================

sx = Float64[ 0  0  0;
0  0 -1;
0  1  0;]

Bx = Float64[ 0  0 -1;
0  0  0;
1  0  0;]

Hk = [sx O3x3   sx    I O3x3 O3x3;
Bx O3x3 O3x3 O3x3   Bx    I;]

# Kalman filter initialization
# ==========================================================================

Pu = Matrix{Float64}(I,18,18)

# Simulation
# ==========================================================================

result = Vector{Float64}(undef, 60000)

result[1] = tr(Pu)

for k = 2:60000
Pp = Fk_1*Pu*Fk_1' + Q
K  = Pp*Hk'*pinv(R + Hk*Pp*Hk')
Pu = (I - K*Hk)*Pp*(I - K*Hk)' + K*R*K'

result[k] = tr(Pu)
end

return result
end

Is it possible to see how close to the mathematical equation it is? Even the symbols like λ can be used in Julia, that supports any UTF-8 character in the variables name. Now, imagine something with hundreds of equations. It will be way easier to validate the code and find problems when you write using a computational language that is close to the mathematical one.

Information

The I in Julia is not a 3 x 3 identity matrix as in the equations above. It is called the UniformScalling operator. In our code, it turns into an identity matrix with the size it must have to make sense, i.e 3 x 3.

In this very first version of the code, we have two tasks: 1) see if it works properly; and 2) see if the performance is good enough for our needs. We know that after 60,000 iterations the trace of the covariance matrix $$P_u$$ is approximately 7.00089. In Julia, we can run the code in REPL and see if this is true:

julia> result = kalman_filter();

julia> result[end]
7.000894875771719

Good 🎊🎉! At least the code does what it is supposed to. The next thing is to measure the performance and verify if it is enough. This is very important: Julia is a very fast language for me not because it is faster than C, but because the time I spend between the initial code and the final result is way shorter. Sometimes you just do not need a performant code and want to code as fast as possible, Julia can do this as I just show. Sometimes you need speed, Julia can do this as well as I will show.

The correct way to benchmark a Julia code is to put inside a function (we have already done that) and use the excellent package BenchmarkTools.jl.

julia> using BenchmarkTools

julia> @btime kalman_filter()
1.296 s (2640271 allocations: 2.83 GiB)

Now we know that the mean execution time of the function (after the compilation step) is 1.296s and it has allocated 2.83 GiB of memory. At this moment I ask myself: “Is this speed enough for what I need to do?” For example, if I need to run this code just once, then I can live with its current state. However, if I need this code to be called inside an optimizer that will run 1,000,000 times, then I would really want to optimize it.

## Second: Modify your code to improve the performance if necessary

There are some things you need to take care to increase the performance of Julia code. The documentation is very good and I truly suggest you read. For starters, you must look for three things:

• Code type-stability: Julia must know the type of the variables given only the type of the inputs. This can be checked with the macro @code_warntype.
• Allocations: the process to allocate is slow, you want to have the least amount possible of allocations to improve the performance.
• Avoid duplicate computations: the compilers are getting smarter every day and sometimes they got it, but it does not harm to help them!

We can easily see that our code is type stable by calling @code_warntype kalman_filter(). You will see that no variable can have multiple types. I will not go deeper here, because this is a subject for an entire post. For more information, see the documentation.

There are some methods to check which parts of the code are allocating. The “official” one is to start Julia with the option --track-allocation=user, execute the function, and then analyze the .mem file that will be created. This is described here. However, there is a method that I personally find better. It is based on the package TimerOutputs.jl. It provides a way to analyze just specific lines. In our case, we should be concerned about the code executed inside the for-loop. The previous file should be modified as follows:

using LinearAlgebra
using TimerOutputs

...

reset_timer!()
for k = 2:60000
@timeit "Pp" Pp = Fk_1*Pu*Fk_1' + Q
@timeit "K"  K  = Pp*Hk'*pinv(R + Hk*Pp*Hk')
@timeit "Pu" Pu = (I - K*Hk)*Pp*(I - K*Hk)' + K*R*K'

result[k] = tr(Pu)
end

print_timer()

return result
end

Now, when you run kalman_filter();, you will see on the screen:

 ──────────────────────────────────────────────────────────────────
Time                   Allocations
──────────────────────   ───────────────────────
Tot / % measured:      2.01s / 69.0%           2.86GiB / 99.2%

Section   ncalls     time   %tot     avg     alloc   %tot      avg
──────────────────────────────────────────────────────────────────
K          60.0k    673ms  48.5%  11.2μs   1.08GiB  38.3%  19.0Ki
Pu         60.0k    511ms  36.9%  8.52μs   1.29GiB  45.4%  22.5KiB
Pp         60.0k    203ms  14.7%  3.39μs    472MiB  16.3%  8.06KiB
─────────────────────────────────────────────────────────────────

OK, now we have something to work with. We will perform four modifications:

• Those three variables Pp, K, and Pu are being allocated at every iteration. One thing we should do is pre-allocate them before the loop and tell Julia to use the same space to store the new values. The latter can be accomplished by using the operator .= that performs a broadcasted assignment. In other words, each element of the left-side variable will be equal to the corresponding element of the right-side expression, leading to no allocation in this operation. It is also good to replace the + for .+ for the same reason, since summing matrices is an operation performed element-by-element.
• The UniformScaling variable will be exchanged by a pre-allocated matrices. This also helped to reduce the allocations.
• Notice that (I - K*Hk) is being computed twice inside the loop. We will use an auxiliary variable to avoid this. It is important that this auxiliary variable is also pre-allocated before the loop for the same reason we discussed on the first point.
• Since we can ensure that all arrays are accessed by valid indices inside the for, then we can use the @inbounds macro as described in the documentation. This avoid bound checking operations, improving performance.

Given that, our code now should look like:

using LinearAlgebra

# Constants
# ==============================================================================

const O3x3   = zeros(3,3)
const O3x12  = zeros(3,12)
const O12x18 = zeros(12,18)

function kalman_filter()

# Constants
# ==========================================================================

I6  = Matrix{Float64}(I,6,6)
I18 = Matrix{Float64}(I,18,18)
λ   = 1/100
Q   = 1e-20I18
R   = 1e-2I6

# Dynamic model
# ==========================================================================

wx = Float64[ 0 -1  0;
1  0  0;
0  0  0;]

Ak_1 = [  wx     -I   O3x12;
O3x3   λ*I   O3x12;
O12x18     ;]

Fk_1 = exp(Ak_1)

# Measurement model
# ==========================================================================

sx = Float64[ 0  0  0;
0  0 -1;
0  1  0;]

Bx = Float64[ 0  0 -1;
0  0  0;
1  0  0;]

Hk = [sx O3x3   sx    I O3x3 O3x3;
Bx O3x3 O3x3 O3x3   Bx    I;]

# Kalman filter initialization
# ==========================================================================

Pu = Matrix{Float64}(I,18,18)
Pp = similar(Pu)

# Simulation
# ==========================================================================

result = Vector{Float64}(undef, 60000)

result[1] = tr(Pu)

# Auxiliary variables.
K    = zeros(18,6)
aux1 = zeros(18,18)

@inbounds for k = 2:60000
Pp   .= Fk_1*Pu*Fk_1' .+ Q
K    .= Pp*Hk'*pinv(R .+ Hk*Pp*Hk')
aux1 .= I18 .- K*Hk
Pu   .= aux1*Pp*aux1' .+ K*R*K'

result[k] = tr(Pu)
end

return result
end

The benchmark using this new version provides:

julia> @btime kalman_filter();
1.109 s (2340283 allocations: 2.06 GiB)

We just decreased the computational time 14% with only minor modifications! If this function is used in an optimizer that calls it 1,000,000 times, then we just save more than 2 days.

## Third: Going extreme!

If it is really necessary to reduce the computational time, then we can still do things here. However, spoiler alert, this will hurt the readability. Hence, I advise going this path only after your code has been tested and validated and if you really need more speed.

Looking at the last benchmark, we still see that the code is allocating 2GiB. This is coming from the temporary arrays Julia must allocate to store in-between multiplications. For example, in the multiplication Fk_1*Pu*Fk_1' .+ Q, it needs to allocate at least one temporary array to store Fk_1*Pu or Pu*Fk_1'. The compiler cannot (yet?) guess that this can be allocated before the loop instead of allocating it at every iteration. Hence, if we really need to reduce the computation time, then we need to drastically reduce those allocations.

The approach is quite similar for the second phase: create temporary matrices outside the loop and use them to store the in-between operations. However, here we cannot use the broadcast multiplication operator .*, since it will perform an element-wise multiplication instead of a matrix product. Thus, the solution is to use the following function:

mul!(Y, A, B)

that computes the matrix product A*B and stores in the matrix Y. This function also returns the matrix Y to make operation concatenation easier. Thus, the modified code is:

using LinearAlgebra

# Constants
# ==============================================================================

const O3x3   = zeros(3,3)
const O3x12  = zeros(3,12)
const O12x18 = zeros(12,18)

function kalman_filter()

# Constants
# ==========================================================================

I6  = Matrix{Float64}(I,6,6)
I18 = Matrix{Float64}(I,18,18)
λ   = 1/100
Q   = 1e-20I18
R   = 1e-2I6

# Dynamic model
# ==========================================================================

wx = Float64[ 0 -1  0;
1  0  0;
0  0  0;]

Ak_1 = [  wx     -I   O3x12;
O3x3   λ*I   O3x12;
O12x18     ;]

Fk_1 = exp(Ak_1)

# Measurement model
# ==========================================================================

sx = Float64[ 0  0  0;
0  0 -1;
0  1  0;]

Bx = Float64[ 0  0 -1;
0  0  0;
1  0  0;]

Hk = [sx O3x3   sx    I O3x3 O3x3;
Bx O3x3 O3x3 O3x3   Bx    I;]

# Kalman filter initialization
# ==========================================================================

Pu = Matrix{Float64}(I,18,18)
Pp = similar(Pu)

# Simulation
# ==========================================================================

result = Vector{Float64}(undef, 60000)

result[1] = tr(Pu)

# Auxiliary variables.
K    = zeros(18,6)
aux1 = zeros(18,18)
aux2 = zeros(18,18)
aux3 = zeros(18,6)
aux4 = zeros(6,6)
aux5 = zeros(18,18)
aux6 = zeros(18,18)

for k = 2:60000
# Pp = Fk_1*Pu*Fk_1' + Q
mul!(aux2, mul!(aux1, Fk_1, Pu), Fk_1')
@. Pp = aux2 + Q

# K = Pp*Hk'*pinv(R + Hk*Pp*Hk')
mul!(aux4, Hk, mul!(aux3, Pp, Hk'))
mul!(K, aux3, pinv(R + aux4))

# Pu = (I - K*Hk)*Pp*(I - K*Hk)' + K*R*K'
mul!(aux1, K, Hk)
@. aux2 = I18 - aux1
mul!(aux6, mul!(aux5, aux2, Pp), aux2')
mul!(aux5, mul!(aux3, K, R), K')
@. Pu = aux6 + aux5

result[k] = tr(Pu)
end

return result
end

Well… this is… ugly 😅 However, let’s benchmark it:

julia> @btime kalman_filter()
848.334 ms (1680299 allocations: 915.17 MiB)

Wow! It’s 23% faster than the previous version and 34% faster than the first version. Compared to the latter, this code would save us more than 5 days if we perform 1,000,000 evaluations. The trade-off here was to have a less readable code, mathematically speaking.

Information

Julia has superb meta-programming capabilities. One can easily construct a macro that will expand the multiplications to use the auxiliary variables while maintaining the readability. However, this is a somewhat advanced topic to cover for now.

## OK Ronan, what about C++?

This is the most asked question when I talk about Julia. Of course, C++ can be faster in execution time. However, since Julia is a dynamic language (seems interpreted), it is way easier to code and test the results. That’s one of the reasons why languages like MATLAB and Python were created. Thus, if you account for the total time (developing, testing, and executing), in my experience, you can beat C++ using Julia (unless, of course, you are a C++ magician, which most engineers I know, including me, are not).

Anyway, let’s just write the very same algorithm in C++ to see how far we are. Notice that these are my C++ skills. I am almost sure that this code can be optimized (can anyone do this for me, please?). However, I am not a very experienced Julia programmer either. Thus, I have a feeling that the Julia code can also be optimized (can anyone also do this for me, please?).

The C++ version uses Eigen to handle matrices operations. Otherwise, it will be a very, very long code. The code can be seen as follows:

#include <iostream>
#include "Eigen/Core"
#include "Eigen/QR"
#include "unsupported/Eigen/MatrixFunctions"

using namespace Eigen;

int main(int argc, char *argv[])
{
double l = 1.0/100.0;
Matrix<double, 18, 18> Q;
Matrix<double,  6,  6> R;
Matrix<double, 18, 18> Ak_1;
Matrix<double, 18, 18> Fk_1;
Matrix<double,  6, 18> Hk;
Matrix<double, 18, 18> Pp;
Matrix<double, 18, 18> Pu;
Matrix<double, 18,  6> K;

VectorXd result(60000);

// Auxiliary variables.
MatrixXd I6(6,6);
MatrixXd I18(18,18);

MatrixXd aux1(6,6);
MatrixXd aux2(18,18);
MatrixXd pinv(6,6);

// Constants
// =========================================================================

I6  = MatrixXd::Identity(6,6);
I18 = MatrixXd::Identity(18,18);
Q   = I18*1e-20;
R   = I6*1e-2;

// Dynamic model
// =========================================================================

Ak_1 << 0, -1,  0, -1,  0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1,  0,  0,  0, -1,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,  0,  0,  0,  0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,  0,  0,  l,  0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,  0,  0,  0,  l,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,  0,  0,  0,  0,  l, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,  0,  0,  0,  0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,  0,  0,  0,  0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,  0,  0,  0,  0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,  0,  0,  0,  0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,  0,  0,  0,  0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,  0,  0,  0,  0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,  0,  0,  0,  0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,  0,  0,  0,  0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,  0,  0,  0,  0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,  0,  0,  0,  0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,  0,  0,  0,  0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,  0,  0,  0,  0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0;

// WARNING: Matrix exponential is unsupported in Eigen 3.3.7.
Fk_1 = Ak_1.exp();

// Measurement model
// =========================================================================

Hk << 0,  0,  0,   0, 0, 0,  0,  0,  0,   1, 0, 0,   0,  0,  0,  0, 0, 0,
0,  0, -1,   0, 0, 0,  0,  0, -1,   0, 1, 0,   0,  0,  0,  0, 0, 0,
0,  1,  0,   0, 0, 0,  0,  1,  0,   0, 0, 1,   0,  0,  0,  0, 0, 0,
0,  0, -1,   0, 0, 0,  0,  0,  0,   0, 0, 0,   0,  0, -1,  1, 0, 0,
0,  0,  0,   0, 0, 0,  0,  0,  0,   0, 0, 0,   0,  0,  0,  0, 1, 0,
1,  0,  0,   0, 0, 0,  0,  0,  0,   0, 0, 0,   1,  0,  0,  0, 0, 1;

// Kalman filter initialization
// =========================================================================

Pp.setIdentity();
Pu.setIdentity();

// Simulation
// =========================================================================

result[0] = Pu.trace();

for(int k = 2; k <= 60000; k++) {
Pp   = Fk_1*Pu*Fk_1.transpose() + Q;
aux1 = R + Hk*Pp*Hk.transpose();
pinv = aux1.completeOrthogonalDecomposition().pseudoInverse();
K    = Pp*Hk.transpose()*pinv;
aux2 = I18 - K*Hk;
Pu   = aux2*Pp*aux2.transpose() + K*R*K.transpose();

result[k-1] = Pu.trace();
}

return 0;
}

In my machine, the mean execution time using -O3 is:

\$ time ./a.out
./a.out  0,83s user 0,00s system 99% cpu 0,836 total

which is very, very close to the fastest Julia version of this code.

Information

Some may argue that it would be faster if I used -march=native. Indeed, it is! However, for a fair comparison, I would have also to recompile Julia system image with the same flag, which, unfortunately, I do not have time now. This is an easy procedure if you use the package PackageCompiler.jl.

## Conclusion

I hope that you understood how Julia is helping me to achieve my objectives faster. We now have a language that is easy to code and test algorithms, and that with minor adaptations can reach C performance. I have no more need to translate all the algorithms I need to be executed really fast to a compiled language, which turns out to be very time consuming for an engineer that does not code very well using C/C++.