Written on 2019-10-14
In the past seven posts of this series, we've looked at how to make software understandable, reliable, and extendable. We've seen techniques for dealing with errors, reducing complexity, and developing in teams. We've touched on different programming languages and paradigms, and conventions for documenting code. Of course, these have been very cursory glances; but hopefully enough to give a brief overview of what to think about when developing software in a scientific context. Now, in closing, I want to mention two last topics and give a few pointers on where to go from here.
Thankfully, we are no longer in an era where every byte of memory is precious and programmer time is cheaper than computer cycles. Today, most programmers rarely have to worry about performance issues when writing their code.
Unfortunately, computational scientists often still do. Whether we deal with huge data sets (such as genomic sequences or imaging data) or construct complicated mechanistic models, we are often pushing the boundaries of what our computers can do. So if we find our program's memory consumption exceeds the RAM we have available, or the run-time starts to be measured in weeks instead of minutes, we need to optimise. This takes some skill.
The first rule of optimisation is: do it last! In the words of Donald Knuth: “Premature optimisation is the root of all evil.” First make sure your program is correct, then figure out where the performance bottlenecks are, then try to make them more efficient.
Figuring out performance bottlenecks first, before you start tweaking your program, is incredibly important. Usually, one small part of a program is responsible for the lion's share of its run-time or memory consumption. Improving this bottleneck can gain you orders of magnitude in efficiency. And as long as the bottleneck remains, improving all the other pieces is virtually useless.
To figure out where these bottlenecks are, you can use a program known as a “profiler” (every language has at least one available). This will tell you in detail which function calls use how much time and memory. Alternatively, if you just need a quick-and-dirty overview, you can use your language's built-in timing functions to calculate function run-times yourself.
(Keep in mind that you can optimise for processing speed and a low memory footprint – but at some point, you are usually going to have to settle for a trade-off between the two.)
Three common types of bottlenecks that you will face are those related to algorithms, data structures, and I/O (input/output) calls.
Algorithms and data structures are a huge component of computer science as a discipline that I won't even try to cover here. Generally, you don't need an in-depth knowledge of the various algorithms anyway, but it does help to at least be aware of tried-and-tested solutions to common problems like sorting or searching. Data structures are something you should try to learn more about, as the right choice of data structure can make a world of difference to your program's performance. You can have a look at this online resource, or check your local library for a relevant textbook on the topic.
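A small Python sketch of how much the choice of data structure can matter: testing membership in a list scans every element, while a set uses a hash table. The numbers here are illustrative, not benchmarks:

```python
import timeit

items = list(range(100_000))
needle = 99_999  # worst case for the list: the very last element

as_list = items        # membership test scans the list: O(n)
as_set = set(items)    # membership test hashes the key: O(1) on average

t_list = timeit.timeit(lambda: needle in as_list, number=100)
t_set = timeit.timeit(lambda: needle in as_set, number=100)
print(f"list: {t_list:.4f} s, set: {t_set:.4f} s")
```

On any reasonable machine the set lookup is orders of magnitude faster, and the gap grows with the size of the data, which is exactly the regime scientific code tends to live in.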
Whereas algorithmic optimisation is a very mathematical endeavour, I/O optimisation is much more about the hardware you're working on. Often, it isn't the actual calculations that make a program slow, but the time it takes the computer to move the bits and bytes around. Common operations that take a lot of time are network connections, disk reads/writes, and screen output. If you do a lot of these in succession, the effect will be noticeable. Here, it often helps to cache data in memory (instead of re-reading it every time it's needed), or to buffer output data and flush it out in one go (instead of setting up a new connection for every little bit).
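The two tricks mentioned above can be sketched in a few lines of Python. This example buffers output in memory and writes it in a single call, then reads the file back once and keeps the in-memory copy for all later lookups (the file name and contents are invented for illustration):

```python
import os
import tempfile

lines = [f"record {i}\n" for i in range(10_000)]

# Buffering: collect all output in memory, then write it in one go,
# instead of issuing 10,000 separate small writes.
with tempfile.NamedTemporaryFile("w", delete=False) as f:
    path = f.name
    f.write("".join(lines))

# Caching: read the file once and keep the result in memory,
# instead of re-reading it every time a record is needed.
with open(path) as f:
    cached = f.read().splitlines()

# All subsequent lookups hit `cached`, not the disk.
print(len(cached), cached[0])
os.remove(path)
```

The same pattern applies to network requests: batch them where you can, and hold on to responses you'll need again.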
Overall, optimisation is a huge topic that quickly leads into highly advanced techniques based on very specialised knowledge. Therefore, I will say no more about it, except to encourage you to keep looking and learning, to find out what works in your scenario in your language on your machine.
Strictly speaking, licenses are not an aspect of software development, but if you work with or create open source software, you'll have to know at least a bit about them.
“Open source” means that, unlike for most commercial software, the source code for a program may be freely inspected, modified, and redistributed. There's a whole range of licenses that allow this, each with slightly differing conditions. It can all be a bit confusing, but fortunately, there are good overviews available that can be consulted.
Perhaps the two most important are the GNU General Public License (GPL) and the MIT license. The GPL is what is known as a “copyleft” license – not only must the source code for the work itself be made freely available, but all future derivative works must be published under the same license. The MIT license is less restrictive: it only stipulates that any redistribution of the software must retain the original license notice, but derivative works may choose a different license model.
But don't be too worried: if you're choosing a license for your own software, it basically boils down to a matter of preference (and software-political opinion) – unless your employer has a stated policy on the matter. And if you're joining an existing project, that choice has already been made anyway.
Slowly, the importance of good software development practices is being recognised by the scientific community. Recently, the topic was even covered by Nature. Various groups are forming to train, support, and connect scientists who write software – examples being the Software Sustainability Institute or the Research Software Engineer Association.
As I have previously said, the purpose of this series of articles is to give an overview of topics that must be considered when developing software, especially in a scientific context. I hope this in itself proves useful, but must stress that everything I have touched I have touched only very briefly. Much more could and perhaps should be said – but diving deeper remains an exercise for the reader.
In closing, I can only repeat my mantra from the beginning: to become a better programmer, you must read, write, and repeat. In that spirit, happy hacking!
– – –
Apart from the books cited before and the associations mentioned above, here are three more sources for further reading:
Wilson et al. (2014) “Best Practices for Scientific Computing” – Things to keep in mind when developing code (largely similar to the topics covered in this series).
Wilson et al. (2016) “Good Enough Practices in Scientific Computing” – Things to keep in mind when working with computers in science generally. Includes a brief section on writing code, but also touches on things like data management and project organisation.
Netherlands eScience Center guide – A comprehensive guide to software development produced by the Dutch expertise center for research software. Goes into a lot more detail on many topics covered in this series.