Writing scientific software by yourself
25 Jul 2021
- Why care about this?
- An example of a program
- Infrastructure
- Modules
- Automatic testing
- Configuration
- Data storage
- Numerics and algorithms
- Neat tricks
Today many scientists have problems that are solved by developing new programs. Sadly most of them are not software developers, so the quality of the resulting programs is mostly up to luck and individual interest. They mostly only have to take basic courses into programming by other scientists. Often the professors mostly teach from books because they only wrote basic programs 30 years ago in their own Ph.D. time.
I was in the situation to write some software for theoretical physics, except I studied computer science before and did some work as a programmer on a larger project. In this article I will write about some ideas and techniques I found along the way.
I will only look at programs written by a single researcher by himself, as it was done in my community. Projects with many researchers and distributed to end-users are a much more complex problem. I will only look at program development, not project management.
Why care about this?
Scientists might ask why they should care. They might have some Fortran files they copied from someone else and mess around in till it mostly works. They are here to do science, not software. They might even dismiss their program as mere “codes”. I will try to give some reasons you might want to care.
- In this linked article, Freeman Dyson discusses if tools are as important as ideas to science. Even if you disagree, it makes clear that tools are important and programs are clearly tools for science.
- The programs are a relevant result of the research. They often contain all the messy details and tricks left out in articles and I think they should even be published with the articles. Sadly not even computer science does that so it might take decades for natural sciences to get there.
- You will be faster and more comfortable using and modifying your programs for your research.
- You will be (a little) more confident in your programs if they are less messy.
- These days there are many jobs involving software if you ever leave academia.
- Making better programs is fun.
An example of a program
To make the kind of programs discussed here more clear I will explain my own research. I look at the time evolution of electrons in quantum wires described by differential equations. There were multiple numerical schemas to solve these equations and many ways to combine them with different models of electrodynamics. At first, I looked at some aspects only in 1D systems and later extended them to 2D. These simulations take hours of calculation time and produced up to a few gigabytes of raw data.
The next step was a program to analyze the raw data and figure out how to get the relevant information. I needed to do much experimentation to figure out what I could get out of the data. Here every run might take a few minutes and output less than a megabyte of data.
The last step is to take the results and make good-looking plots with clear labeling. This takes a few seconds and does not contain any ideas relevant to physics.
Infrastructure
Let’s begin with something seemingly boring, that everyone working in software knows anyway. There is some infrastructure around your programming. Let’s look at some stuff.
-
Programming language: Look at the programming language you use. Spend some hours to learn what it can do. There are usually good tutorials online on their official website. I use C++ for the computationally complex part and python for the rest because I know them well.
- Text editor: You will spend much time programming in a text editor or integrated development environment. Spend an hour to test a few of them and stay with the one you like. Don’t blindly use what everyone else in your group uses; it might be worse than what was used in the 90s. I like Visual Studio Code because it works on most operating systems and has many cool features through plugins.
-
Version control systems: Every halfway reasonable software developer uses version control to put his software into and to have access to all previous changes. You should also do this. I think you should use git. It is the industry standard, it works and it is easy to get help with it. There are even multiple graphical interfaces to use git, but I don’t know much about them.
Git allows you to have a remote server to sync with. This is especially useful if you work from different computers on the same program. You can check if your university provides such a server or use a (free) commercial service like github. It might even work as (hopefully another) backup.
-
Build system: You probably need to compile your software. Coping the command to do that from the first line of your source file or your shell history is a bad idea. You should use a system for that as everyone else in software does for 40 years. At least use Make or some shell script.
I use CMake for my C++ programs. It is very flexible, works fine, and even has integration into my editor.
- Compilers: If there are multiple good compilers for your programming language make your code work with more than one of them. This usually finds bugs in your programs and improves the chances that they conform to standards and will continue to work in a few years or on the cluster of your organization. I made sure my programs work with GCC and Clang on Linux and Visual C++ on Windows.
I know all of this sounds like tedious work, but it is all useful. You will see it in the end!
Modules
It is good practice to split the software into modules and reuse these. The idea is to have the components in a library with well-structured implementations and use these to build multiple programs. This prevents copying of code and enforces good interfaces.
In the diagram above you can see a library simcore
, that is used by multiple executables in the row below in my project. This library contains
- Multiple implementations of physical systems with different numerical implementation or dimensionality
- Basic mathematical helper functions an
- Many predefined laser pulses
- Helpers for file output
- …
The library has a few 1000 lines of code, while all the binaries are below 200 lines of code. If I am doing something in multiple binaries, I search for an abstraction and put it into the libraries. Trivial examples of this are writing a matrix to disk, showing the progress of the program, or compression of output data.
I solved this problem by building simcore
as a static library that is linked to all the binaries, that there is no time wasted with compiling code multiple times. This is very easy with the build system CMake I use, but there are many more possibilities with different
infrastructure.
The reason for having multiple binaries is to investigate your system in different environments. First, it is just the most basic thing possible and then you can add more interesting components. Of course, you can incrementally improve one program, but then you can’t go back to the basic program if you eventually don’t understand your complex problem and need to get back to the basics.
The only hard part is that you must sometimes make sure you don’t break your old basic programs while improving the advanced ones. You should build all of them regularly.
Automatic testing
These days it is usual to test software automatically. The classical approach is to test little parts of the software with so-called unit tests. These can be very helpful and Kent Beck has written an excellent book about these.
In my experience, there is another component to testing scientific software. The underlying models often have special cases with know (approximate) solutions, so you can run your general program to handle these cases and compare them to analytic results.
For these tests perfect is the enemy of the good. It is better to have an incomplete test case than none because it would be hard. Some example from my work are:
- I looked at wave packets moving with some speed. So I initialized a gaussian wavepacket and look where its maximum value was after some time and checked if it had the expected velocity. There was no fit to get the exact center of the wave or ensure its shape. Just a check if a rough approximation of the velocity is in 10% of the expected value
- There are complex formulas to predict the state of a two-level quantum system excited with different lasers. I just looked at the simplest perfectly tuned pulse with the simplest formula to see if it worked the same with my numerics
Now if you structured your program into modules, as shown above, this is all very easy to do. Your tests can just be another binary besides the other programs to test all the complicated stuff in the library.
Just run a simplified simulation, check the results in your program and print the deviation if something wrong and a green checkmark or smiley face if everything worked.
If at all possible make sure your tests can run in a few seconds. Otherwise, you might get frustrated to wait and stop using them regularly.
Configuration
Almost every scientific program needs many external constants to work. These might be physical constants, material parameters, setup constants, times, numerical parameters. Managing these in a well-structured manner can significantly help to keep an overview of everything at the time of development and even years later. For me, a good configuration system should fulfill the following requirements
- Options should be named. Otherwise, you have to remember the 5. line is the speed of light and you will be confused eventually.
- No information should be redundant/be copied. If you just want to run another simulation, you should not always have to reenter the speed of light.
- Move some (simple) logic from your program into your configuration.
One solution to this is use a scripting language like Python, Lua or Scheme for configuration. In my experience, these are kind of complicated to integrate with a program, but otherwise I like this solution a lot and it is sometimes used. It might be possible to make it a little easier to expose your numerics to python by something like F2PY or Boost.Python.
I prefer a simpler approach to define my own simple configuration format. I base the format on .ini files, but variables can contain formulas and it is possible to include base files. A configuration might look like this:
Base.conf
c = 3e8
# space
c = 300
Nx = 100
dx = x/Nx
#time
Nt = T/dt
dt = 0.98/c/sqrt(1/(dx^2)+1/(dy^2))
Sim.conf
{Base.conf}
T = 30
x = 500
pulse_width = x/5
pulse_energy = exp(5*pi*acos(T/4))
In my implementation (Sourcecode on github) I would use these file as
#define AFPCFG_IMPL
#include "avcfg.h"
typedef struct {
double pulse_energy;
double pulse_location;
/* ... */
} config;
config load_config()
{
config mycfg;
avcfg_config *config = avcfg_load("Sim.conf", NULL);
mycfg.pulse_energy = avcfg_getf(config, "pulse_energy");
/* Give default value if the variable is not defined in the configfile */
mycfg.pulse_location = avcfg_getf_op(config, "pulse_location", 0.0);
/* ... */
avcfg_free(config);
return mycfg;
}
A nice trick is that you can now write the processed config (with only numerical values and no formulas in a single file) to disk with the output of your simulation. Here you can check if the values are as expected and you can rerun exactly the same configuration. Because it is basically a .ini file there is a parser for almost any programming language you could want to use for analysis.
Data storage
Simulations can create huge quantities of data. It is in most cases favored to save as much data as possible to do any imaginable analysis without having to rerun the slow simulations.
It took me a long time to figure out a nice structure to store that simulation output and derived analysis. I think in the end it would be optimal to have a versioned database that can run analysis in the background and dynamically move data between fast local storage to large network shares. But that would be a large system with many moving parts that would take many months of design and development from skilled programmers.
In the absence of that the best structure I found was something like:
/raw-data/ (Many gigabytes)
|- /Simulation run 1/
|- /Simulation run 2/
'- ...
/analysis-results/ (A few Megabytes)
|- /Simulation run 1/
|- /Simulation run 2/
'- ...
/plots/ (Some pictures)
|- /Simulation run 1/
|- /Simulation run 2/
'- ...
It might be counterintuitive to interleave the data of different simulation runs, but it has some nice properties.
- The raw data can live on a slow (high latency) storage elsewhere, but it is trivial to just copy the analysis result onto your local computer to make plots faster.
- You can also take the analysis data onto your laptop for a conference or home office where you might not fast (or any) internet.
- If you want to write a paper you can simply copy all the plots folder without putting too much irrelevant stuff next to your Latex documents. If you redo your sim/analysis it is trivial to replace them without confusion and messing up.
This exactly replicates my description of time scales from above and also the different iteration numbers. You will presumably, do much more iteration on your analysis than your simulation and hopefully even more iteration on your plots!
Numerics and algorithms
This is the hard part you have to solve and I can’t help you. Sorry.
Neat tricks
At last, I will share a few little ideas you can integrate into your program without too much hassle to make life a little easier.
Floating point precision
Almost any computer you will use has 32bit (float, single) and 64bit (double, full) floating-point numbers. I science most people use 64 bit for more precision and fewer numerical problems. This is mostly a good idea, but in most cases these numbers are (kind of) twice as slow. This can be rather time-wasting while developing your program. Sadly most compilers don’t let you switch automatically, but in C and C++ you can build that yourself easily. For example, I use:
#if 0 /*set to 1 for full precision */
typedef float F;
typedef std::complex<F> C;
typedef Eigen::ArrayXf VecF;
typedef Eigen::ArrayXcf VecC;
#else
typedef double F;
typedef std::complex<F> C;
typedef Eigen::ArrayXd VecF;
typedef Eigen::ArrayXcd VecC;
#endif
And you get very short type names for free!
Also, see if -ffast-math or something similar can speed up your test runs in development.
Simulation progress
Sometimes you wait eagerly in front of your running simulation for results. I build myself a little progress reporter for the command line. It can show percentage progress and extrapolated time remaining, so you know if there is enough idle time for making tea or even lunch at the moment.
My systems need to know the number of the total simulation steps at the beginning of the program and has to be notified after every step. To not use too much computation time I print the information only if 5 seconds elapsed or 5% of total progress were made. There are many more possibilities how you can make this more awesome, but this was enough for me.
config.ini 0.1% done in 0:06. Expected remaining time: 75:51
config.ini 0.3% done in 0:12. Expected remaining time: 76:34
config.ini 0.4% done in 0:18. Expected remaining time: 76:32
config.ini 0.5% done in 0:24. Expected remaining time: 76:04
My (not that advanced) implementation of this idea is on GitHub.
Program version in data
At the beginning of this article, I told you about some boring infrastructure stuff and now it is time to build something cool with it!
If you use version control your code always has a revision ID (kind of version number). I found on the internet how to always put that ID into your code and resulting executable with my build system. Now I told my program to always put a file with this revision number next to the output files.
If you also saved your configuration file as I did you now know exactly what program and what configuration you used to generate a simulation output. With this, you can trivially and with confidence reproduce the output if you mess it up or investigate if it turns out to be strange a few months later.
This might be useless in most cases, but it also might save you from a few sleepless panicked nights before a deadline.