Written on 2019-07-01
Programmers love to debate programming languages. Almost everyone has a favourite, so discussions of their relative merits often degrade into “holy wars”. Although no one will ever find the “perfect programming language” (despite numerous claims to the title), it is nonetheless instructive to compare different languages. After all, not all languages were created equal, and each has its own strengths and weaknesses. Knowing these can help you choose just the right language for just the right situation – in short, to use the right tools.
Of course, being able to choose the right language for the job assumes that you know more than one language. This is advisable anyway: knowing more than one language makes you more flexible, as you can collaborate on a wider range of projects. Also, each language embodies a slightly different philosophy of programming – so knowing multiple languages gives you a broader perspective on software development in general. And you don't want the saying “If all you have is a hammer, everything looks like a nail” to apply to you, do you? 😉
Unfortunately, one is not always able to choose the language one would like for any given job. Perhaps you are joining an existing project, or you have to listen to what your boss and your collaborators say. But if you do have the choice, here are some things to keep in mind, and a selection of good languages for scientific computing:
There are certain design decisions every programming language has to make. These will determine its general approach to programming, and make it more or less useful for different tasks. The most important of these decisions are:
Compiled vs. interpreted. In the end, the objective of any programming language is to take human-readable source code and turn it into computer-readable machine code. Interpreted languages do this “on the fly”, as the program is executing. This means that they can be used interactively (with a REPL), and one doesn't have to recompile a whole system before testing a new version of it. This makes development in such a language easier and faster. However, because a program cannot be run without simultaneously running the interpreter, performance suffers. Compiled languages, on the other hand, separate the “translation” from the execution. A program is compiled once and can then be run on its own, without requiring further intervention from the compiler. This cuts out the middle man, generally making such languages much faster. However, developing is somewhat more tedious. Also, a compiled executable will only run on the architecture and system it was compiled for, a portability problem interpreted languages can usually avoid.
Statically vs. dynamically typed. This is the question of how variable types (strings, integers, etc.) are recognized by the program. Dynamically typed languages don't care about variable types initially. A variable may be assigned values of various types over its lifetime without any problems. A type error is only raised when the programmer attempts to call an invalid type-specific operation on a value – such as indexing an int rather than a list. This makes the programming side of things easier and quicker, but can lead to some hard-to-diagnose bugs that only manifest themselves at runtime. Statically typed languages avoid these bugs by forcing the programmer to assign a type to every variable, and throw an error if this variable is ever assigned a value of a different type. Importantly, this type checking can be carried out during compilation, so potential bugs are caught much sooner. The core rule is: in dynamic typing, only values have types; in static typing, variables have types too.
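To make this concrete, here is a minimal Python sketch of dynamic typing in action – the rebinding is perfectly legal, and the error only appears when an invalid operation is actually attempted:

```python
# A variable can be rebound to values of different types without complaint.
x = 42        # x holds an int
x = "hello"   # now it holds a str; no error is raised

# The type error only surfaces at runtime, when a type-specific
# operation is attempted on an incompatible value:
try:
    x + 3     # adding an int to a str is invalid
except TypeError as err:
    print("Caught at runtime:", err)
```

A statically typed language would reject the equivalent program at compile time, before it ever ran.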
Memory management. All programs manipulate a computer's memory. While a program is executed, objects are created, stored in memory, and eventually have to be deleted again once they are no longer needed. In most modern languages, the task of deciding when an object can be deleted is carried out automatically by the “garbage collector”, so the programmer doesn't have to worry about inadvertently running out of memory. In older (and especially compiled) languages, this was still a manual task; i.e. the allocation and deallocation of memory had to be written explicitly into the source code. Of course, this is a tricky task, and there are whole classes of bugs stemming from small mistakes in memory management. However, there are still some applications in which the relative performance overhead of garbage collection is so great that manual memory management is the only viable choice. (This is especially the case in embedded systems programming.)
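As a small illustration of automatic memory management, Python's garbage collector can be watched at work using a weak reference – a reference that tracks an object without keeping it alive (a toy sketch, using only the standard library):

```python
import gc
import weakref

class Blob:
    """A dummy object whose collection we want to observe."""
    pass

b = Blob()
r = weakref.ref(b)   # a weak reference doesn't count as a "use"
assert r() is b      # the object is still alive

del b                # drop the last strong reference...
gc.collect()         # ...and give the collector a nudge
assert r() is None   # the object has been reclaimed automatically
```

In C, by contrast, the programmer would have to pair every malloc with exactly one free by hand – forget the free and you leak memory; free twice and you corrupt it.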
So, after this theoretical introduction, let's have a look at eight languages that are in some way important to computational biology. Of course, this is not an exhaustive list – amongst others, I have completely ignored web-based languages like PHP and JavaScript. But it does contain a brief introduction to a group of languages that are either already common in scientific computing, or have great potential in this area.
C. Most commonly used languages nowadays are in some way or another descended from C. C itself is a compiled, statically typed language with manual memory management. Executables produced with C can be very small, fast, and light on memory, making it a good choice for environments in which performance is critical (such as microcontrollers). However, it is infamous for being difficult to write well.
C++. C++ is a near-superset of the C language that provides facilities for Object Oriented Programming (OOP) and expands the standard library. It is often used for performance-sensitive desktop software, such as games or simulations. Although it is somewhat easier to handle than pure C, it retains the difficulty of manual memory management.
Java. Java looks like C/C++, but isn't. It is a compiled, statically typed language, but includes a garbage collector (so has automatic memory management). It enforces strict OOP and has several features that make it well suited to development practices in large teams – though rather less so for an individual developer. A big bonus is its huge standard library (possibly the biggest of any language), which even includes a full GUI framework. An interesting feature is that it doesn't actually compile to native machine code, but to a “byte code” that needs the Java Virtual Machine (JVM) to execute. This means that it is slower than “true” compiled languages. On the upside, this byte code can be executed anywhere that has a JVM installed, making it extremely portable. Java is often used for desktop software.
Python. Python is a dynamically typed, interpreted language with automatic memory management. The syntax is very clean and easy to learn, making it an excellent beginners' language. It's not in any way a toy language, though, enjoying immense popularity both for scripting and larger applications. The huge community means that there is a wealth of documentation and third-party libraries to be found online, including many for computational science. It is a pleasure to work in, not only because of its large standard library (it comes “batteries included”, as they say). Unfortunately, it lacks the performance for computationally intensive tasks.
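For a taste of that clean syntax, here is a toy function of my own (not from any particular library) computing the GC content of a DNA sequence – the sort of small scripting task Python excels at in computational biology:

```python
def gc_content(seq: str) -> float:
    """Return the fraction of bases in a DNA sequence that are G or C."""
    seq = seq.upper()  # accept lower-case input, too
    return (seq.count("G") + seq.count("C")) / len(seq)

print(gc_content("ATGCGCAT"))  # 0.5
```

Four lines, readable almost as plain English – which is precisely why Python works so well as a beginners' language.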
Perl. Perl's popularity preceded that of Python, and it is a similar language in many respects. It is designed particularly with an eye for string manipulation, which has made it widespread in fields like bioinformatics that do a lot of string processing. It began to decline with the advent of Python, as the latter is better suited to general programming (and Perl's syntax is notoriously obscure).
R. This is a scripting language designed specifically for data analysis and statistics. (Depending on which field you work in, you may use Matlab or SPSS for this purpose – biologists tend to go with R.) From a linguistic perspective, it is an excruciatingly ugly language; its plethora of mutually incompatible data types are a right pain to work with. Its analysis and graph plotting libraries (both inbuilt and third-party) do however mean that it is a good choice for data analysis. Just don't do anything else with it. (Its performance is terrible anyway.)
Julia. Julia is a young language still in very active development, but it has begun to gain a sizeable following in the scientific computing community. It manages to combine the clean syntax of Python with the scientific libraries of R and a performance close to C – making it an excellent choice for many applications in scientific computing. (It achieves this performance by being just-in-time compiled: it has an interface like an interpreter, but actually produces machine code.) Its downsides are that the language is still evolving and changing very rapidly, and that there are few resources available online. Still very much worth the effort to learn, though.
Common Lisp. This is perhaps a strange choice for inclusion in the list, as it is generally considered a pretty arcane language nowadays. However, it served as the standard language of AI research for several decades, and is still an advanced, powerful programming language. Notably, it was the first language to include a garbage collector, as well as originating a whole host of features that are slowly finding their way into more “mainstream” languages. (Julia has incorporated a surprisingly large number of these.) Common Lisp code can be both interpreted and compiled, making it both easy to develop and fast to run. Its biggest drawbacks are the small pool of potential collaborators (as hardly anybody knows it anymore) and a comparatively small number of available libraries.
Before I close, I'd like to add a thought on third-party libraries:
Using libraries is generally a great idea. You don't have to reinvent the wheel for every new project; instead, you build on what other (better?) developers have created before you. In many cases, you won't have the know-how to build an equivalent library yourself anyway (think image manipulation). And if you need standard algorithms, such as path-finding or compression, it is much better to go with a thoroughly tested public implementation than to hack up your own.
However, beware of dependency hell. The more libraries you use, the more problems you can have with unmet dependencies, out-of-date versions and similar troubles. This is doubly true if you're installing the libraries manually from source, instead of using a centralised package manager (like CRAN for R). Also, not all libraries are good libraries. In general, the more users a library has, the safer and easier it should be to use. Remember that although one should avoid reinventing the wheel, not every go-kart needs F1 wheels – in other words, the library you're thinking about using may be far more powerful than your purposes require. So use your discretion, and don't be afraid to forgo a library if doing so makes your program smaller, easier to read, and easier to port.
Tagged as computers, programming