Writing scientific software by yourself

25 Jul 2021

Today many scientists have problems that are solved by developing new programs. Sadly, most of them are not software developers, so the quality of the resulting programs is largely a matter of luck and individual interest. Usually they only take basic programming courses taught by other scientists, and the professors often teach straight from books because the last programs they wrote themselves were simple ones from their own Ph.D. time, 30 years ago.

I was in the situation of having to write some software for theoretical physics, except that I had studied computer science before and had worked as a programmer on a larger project. In this article I will write about some ideas and techniques I found along the way.

I will only look at programs written by a single researcher working alone, as was common in my community. Projects with many researchers and software distributed to end users are a much more complex problem. I will also only look at program development, not project management.

Why care about this?

Scientists might ask why they should care. They might have some Fortran files they copied from someone else and mess around in until things mostly work. They are here to do science, not software; they might even dismiss their programs as mere “codes”. I will try to give some reasons why you might want to care anyway.

An example of a program

To make the kind of programs discussed here clearer, I will explain my own research. I look at the time evolution of electrons in quantum wires described by differential equations. There were multiple numerical schemes to solve these equations and many ways to combine them with different models of electrodynamics. At first I looked at some aspects only in 1D systems and later extended them to 2D. These simulations take hours of calculation time and produce up to a few gigabytes of raw data.

The next step was a program to analyze the raw data and figure out how to extract the relevant information. I needed a lot of experimentation to find out what I could get out of the data. Here every run might take a few minutes and output less than a megabyte of data.

The last step is to take the results and make good-looking plots with clear labeling. This takes a few seconds and does not contain any ideas relevant to physics.

Infrastructure

Let’s begin with something seemingly boring that everyone working in software knows anyway: there is some infrastructure around your programming. Let’s look at a few pieces of it.

I know all of this sounds like tedious work, but it is all useful. It will pay off in the end!

Modules

It is good practice to split the software into modules and reuse these. The idea is to have the components in a library with well-structured implementations and use these to build multiple programs. This prevents copying of code and enforces good interfaces.

In my project there is a library, simcore, that is used by multiple executables built on top of it. This library contains all the shared numerical and utility code.

The library has a few thousand lines of code, while each of the binaries is below 200 lines. If I am doing something in multiple binaries, I look for an abstraction and put it into the library. Trivial examples of this are writing a matrix to disk, showing the progress of the program, or compressing output data.
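
As an illustration of the kind of helper that ends up in the library, a sketch of the matrix-to-disk example could be as small as this (using Eigen, which also appears later in this article; the function name is my own):

#include <fstream>
#include <string>
#include <Eigen/Dense>

/* A typical shared helper that lives in the library instead of being
   copied into every binary: dump a matrix as whitespace-separated text. */
void write_matrix(const Eigen::MatrixXd &m, const std::string &filename)
{
  std::ofstream out(filename);
  out << m << "\n";
}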

I solved this by building simcore as a static library that is linked into all the binaries, so that no time is wasted compiling the same code multiple times. This is very easy with the build system I use, CMake, but there are many more possibilities with different infrastructure.

The reason for having multiple binaries is to investigate your system in different environments: first the most basic setup possible, then with more interesting components added. Of course you could incrementally improve a single program instead, but then you can’t go back to the basic program when you eventually don’t understand your complex problem and need to get back to the basics.

The only hard part is that you must make sure you don’t break your old basic programs while improving the advanced ones. You should build and run all of them regularly.

Automatic testing

These days it is usual to test software automatically. The classical approach is to test small parts of the software with so-called unit tests. These can be very helpful, and Kent Beck has written an excellent book about them.

In my experience, there is another component to testing scientific software. The underlying models often have special cases with known (approximate) solutions, so you can run your general program on these cases and compare the results to the analytic ones.

For these tests, perfect is the enemy of the good. It is better to have an incomplete test case than none at all just because a complete one would be hard. Some examples from my work are:

If you structured your program into modules as shown above, this is all very easy to do. Your tests can simply be another binary next to the other programs that exercises all the complicated parts of the library.

Just run a simplified simulation, check the results in your program, and print the deviation if something is wrong, or a green checkmark or smiley face if everything worked.
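
As a sketch of what such a test binary can look like, with placeholder functions standing in for the real simulation and the known analytic solution:

#include <cmath>
#include <cstdio>

/* Placeholders: in a real test these would call into the library to run
   a reduced simulation and evaluate the known analytic solution. */
static double simulate_special_case() { return 0.9999993; }
static double analytic_solution()     { return 1.0; }

int main()
{
  const double deviation = std::fabs(simulate_special_case() - analytic_solution());
  if (deviation < 1e-5) {
    std::printf("special case: OK ✓ (deviation %g)\n", deviation);
    return 0;
  }
  std::printf("special case: FAILED, deviation %g\n", deviation);
  return 1;
}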

If at all possible, make sure your tests run in a few seconds. Otherwise you might get frustrated by the waiting and stop running them regularly.

Configuration

Almost every scientific program needs many external constants to work. These might be physical constants, material parameters, setup constants, times, or numerical parameters. Managing them in a well-structured manner helps significantly to keep an overview of everything, both at the time of development and years later. For me, a good configuration system should fulfill the following requirements:

One solution is to use a scripting language like Python, Lua, or Scheme for configuration. In my experience these are somewhat complicated to integrate with a program, but otherwise I like this solution a lot and it is sometimes used. Something like F2PY or Boost.Python might make it a little easier to expose your numerics to Python.

I prefer the simpler approach of defining my own small configuration format. It is based on .ini files, but variables can contain formulas and it is possible to include base files. A configuration might look like this:

Base.conf

c = 3e8

# space 
x = 300
Nx = 100
dx = x/Nx

# time
Nt = T/dt
dt = 0.98/c/sqrt(1/(dx^2)+1/(dy^2))

Sim.conf

{Base.conf}
T = 30
x = 500
pulse_width = x/5
pulse_energy = exp(5*pi*acos(T/4))

In my implementation (source code on GitHub) I would use these files like this:

#define AFPCFG_IMPL
#include "avcfg.h"

typedef struct {
  double pulse_energy;
  double pulse_location;
  /* ... */
} config;

config load_config()
{
  config mycfg;
  avcfg_config *cfgfile = avcfg_load("Sim.conf", NULL);
  mycfg.pulse_energy = avcfg_getf(cfgfile, "pulse_energy");
  /* Give default value if the variable is not defined in the configfile */
  mycfg.pulse_location = avcfg_getf_op(cfgfile, "pulse_location", 0.0);
  /* ... */
  avcfg_free(cfgfile);
  return mycfg;
}

A nice trick is that you can write the processed configuration (with only numerical values and no formulas, in a single file) to disk together with the output of your simulation. There you can check whether the values are as expected, and you can rerun exactly the same configuration later. Because it is basically a .ini file, there is a parser for almost any programming language you could want to use for analysis.
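
A minimal sketch of that, assuming you keep the resolved values in the config struct from above and pick an .ini-style file name yourself:

#include <fstream>
#include <string>

/* Write the fully resolved values (no formulas left) next to the
   simulation output, so the run can be checked and reproduced later. */
void write_resolved_config(const config &cfg, const std::string &output_dir)
{
  std::ofstream out(output_dir + "/used-config.ini");
  out << "pulse_energy = " << cfg.pulse_energy << "\n";
  out << "pulse_location = " << cfg.pulse_location << "\n";
  /* ... */
}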

Data storage

Simulations can create huge quantities of data. In most cases it is preferable to save as much data as possible, so that any imaginable analysis can be done without having to rerun the slow simulations.

It took me a long time to figure out a nice structure for storing the simulation output and the derived analysis. I think the optimum would be a versioned database that can run analyses in the background and dynamically move data between fast local storage and large network shares. But that would be a large system with many moving parts and would take many months of design and development by skilled programmers.

In the absence of that the best structure I found was something like:

/raw-data/ (Many gigabytes)
|- /Simulation run 1/
|- /Simulation run 2/
'- ...
/analysis-results/ (A few Megabytes)
|- /Simulation run 1/
|- /Simulation run 2/
'- ...
/plots/ (Some pictures)
|- /Simulation run 1/
|- /Simulation run 2/
'- ...

It might be counterintuitive to group the data by stage instead of by simulation run, but this layout has some nice properties.

It exactly replicates the different time scales described above, and also the different iteration counts: you will presumably do many more iterations of your analysis than of your simulation, and hopefully even more of your plots!
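
As a small sketch, creating the matching directories for a new run could look like this (C++17 std::filesystem; the helper name is made up):

#include <filesystem>
#include <string>

namespace fs = std::filesystem;

/* Create the raw-data, analysis-results and plots directories for one run. */
void prepare_run_dirs(const std::string &run_name)
{
  for (const char *stage : {"raw-data", "analysis-results", "plots"})
    fs::create_directories(fs::path(stage) / run_name);
}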

Numerics and algorithms

This is the hard part you have to solve and I can’t help you. Sorry.

Neat tricks

Finally, I will share a few small ideas you can integrate into your program without too much hassle to make life a little easier.

Floating point precision

Almost any computer you will use has 32-bit (float, single precision) and 64-bit (double precision) floating-point numbers. In science most people use 64 bit for more precision and fewer numerical problems. This is mostly a good idea, but in many cases these numbers are (roughly) twice as slow, which can waste a lot of time while developing your program. Sadly, most compilers don’t let you switch automatically, but in C and C++ you can easily build that yourself. For example, I use:

#if 0 /*set to 1 for full precision */
typedef float F;
typedef std::complex<F> C;
typedef Eigen::ArrayXf VecF;
typedef Eigen::ArrayXcf VecC;
#else
typedef double F;
typedef std::complex<F> C;
typedef Eigen::ArrayXd VecF;
typedef Eigen::ArrayXcd VecC;
#endif

And you get very short type names for free!

Also, see if -ffast-math or something similar can speed up your test runs in development.

Simulation progress

Sometimes you wait eagerly in front of your running simulation for results. I built myself a little progress reporter for the command line. It shows the percentage of progress and the extrapolated time remaining, so you know whether there is currently enough idle time for making tea or even lunch.

My system needs to know the total number of simulation steps at the beginning of the program and has to be notified after every step. To avoid using too much computation time, I print the information only if at least 5 seconds have elapsed or 5% of the total progress has been made since the last report. There are many more ways to make this fancier, but this was enough for me.

config.ini   0.1% done in 0:06. Expected remaining time: 75:51
config.ini   0.3% done in 0:12. Expected remaining time: 76:34
config.ini   0.4% done in 0:18. Expected remaining time: 76:32
config.ini   0.5% done in 0:24. Expected remaining time: 76:04

My (not that advanced) implementation of this idea is on GitHub.
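
If you want to build something similar yourself, here is a minimal sketch of the throttled reporting described above (the class and its interface are my own invention, not the implementation linked above):

#include <chrono>
#include <cstdio>

/* Prints progress only if at least 5 seconds or 5% of the total
   progress have passed since the last report. */
class ProgressReporter {
  using Clock = std::chrono::steady_clock;
  long total_steps_;
  Clock::time_point start_ = Clock::now();
  Clock::time_point last_print_ = start_;
  double last_fraction_ = 0.0;

public:
  explicit ProgressReporter(long total_steps) : total_steps_(total_steps) {}

  void step_done(long step)
  {
    const auto now = Clock::now();
    const double fraction = static_cast<double>(step) / total_steps_;
    const double since_print = std::chrono::duration<double>(now - last_print_).count();
    if (since_print < 5.0 && fraction - last_fraction_ < 0.05)
      return;
    const double elapsed = std::chrono::duration<double>(now - start_).count();
    const double remaining = fraction > 0.0 ? elapsed * (1.0 - fraction) / fraction : 0.0;
    std::printf("%5.1f%% done in %.0fs. Expected remaining time: %.0fs\n",
                100.0 * fraction, elapsed, remaining);
    last_print_ = now;
    last_fraction_ = fraction;
  }
};

In the simulation loop you would construct it with the total number of steps and call step_done(i) after every step.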

Program version in data

At the beginning of this article, I told you about some boring infrastructure stuff and now it is time to build something cool with it!

If you use version control, your code always has a revision ID (a kind of version number). I found instructions on the internet for how to embed that ID into the code and the resulting executable with my build system. I then told my program to always put a file with this revision number next to the output files.
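
One way to set this up, sketched here, is to have the build system pass the revision as a compile definition (for example filled from the output of git rev-parse --short HEAD) and write it next to the output; the GIT_REVISION macro and the file name are my own choices:

#include <fstream>
#include <string>

/* GIT_REVISION is expected to be passed in by the build system,
   e.g. as a compile definition filled from git rev-parse --short HEAD. */
#ifndef GIT_REVISION
#define GIT_REVISION "unknown"
#endif

/* Drop a small file with the revision next to the output files. */
void write_version_file(const std::string &output_dir)
{
  std::ofstream out(output_dir + "/VERSION.txt");
  out << GIT_REVISION << "\n";
}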

If you also saved your configuration file as described above, you now know exactly which program and which configuration were used to generate a simulation output. With this you can trivially and confidently reproduce the output if you mess it up, or investigate it if it turns out to be strange a few months later.

This might be useless in most cases, but it might also save you from a few sleepless, panicked nights before a deadline.