On optimization and Python
Here’s another interesting discussion that originated on reddit, and which also led me to create a proper visualization of the latency tables floating around the web, because not a single one really helped to understand the times involved. Enjoy!
The golden rule of optimization is: benchmark.
Most of the time you assume things about your code that don’t really hold in the context where it runs; the interpreter, the platform, the hardware and the data all affect performance in different ways, and the result is very hard to predict.
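For quick measurements the standard timeit module is usually enough. Here’s a minimal sketch comparing two ways of building the same list; the exact numbers (and sometimes even the winner) change with your interpreter and hardware, which is exactly why you measure instead of guessing:

```python
import timeit

# Two ways of building a list of squares; run each statement 10,000 times.
loop_time = timeit.timeit("[x * x for x in range(1000)]", number=10_000)
map_time = timeit.timeit("list(map(lambda x: x * x, range(1000)))", number=10_000)

print(f"list comprehension: {loop_time:.3f}s")
print(f"map + lambda:       {map_time:.3f}s")
```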
Regardless, some theory and principles hold true in all cases (which doesn’t mean they’ll always give you faster code), and the most important thing to know is the difference between CPU-bound tasks and I/O-bound tasks.
CPU-bound tasks are those that perform lots of calculations on data in memory; computing the factorial of a large number is a good example. If you ask Python for
math.factorial(10**6) it’ll take a while, and while it’s running you can observe (in a process manager or similar) that one of the machine’s CPUs is at 100% capacity. This matters because while a CPU is busy with one thing it cannot work on anything else.
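You can reproduce this easily. The snippet below just times the call; how long it takes depends on your machine, but one core will be pinned at 100% for the whole run:

```python
import math
import time

start = time.time()
math.factorial(10**6)  # pure CPU work: a huge exact-integer multiplication
elapsed = time.time() - start
print(f"factorial(10**6) took {elapsed:.1f}s")  # one core at 100% meanwhile
```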
There are several ways to deal with this. The first thing to realize is that dynamic languages are slow at raw computation compared to static languages, so a common approach in Python is to replace the part of the code that is the bottleneck (previously identified through profiling and benchmarks) with an extension written in C.
Of course, writing optimized C code and embedding it into Python isn’t trivial, but it’s an option; it’s what the NumPy library does, for instance.
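To get a feel for the difference, here’s a small (unscientific) comparison between a pure-Python sum and NumPy’s C-implemented one; assuming NumPy is installed, the latter typically wins by a wide margin:

```python
import timeit

import numpy as np

# Summing a million numbers: a pure-Python builtin over a list vs. NumPy's
# C loop over a contiguous array.
data = list(range(1_000_000))
array = np.arange(1_000_000)

py_time = timeit.timeit(lambda: sum(data), number=10)
np_time = timeit.timeit(lambda: array.sum(), number=10)

print(f"pure Python sum: {py_time:.3f}s")
print(f"NumPy sum:       {np_time:.3f}s")
```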
Another option is to split the task and parallelize the computation. Traditionally this was achieved with threads, because each thread can run on a different core, making full use of multicore processors; unfortunately the default Python interpreter (CPython) has an issue known as the GIL (Global Interpreter Lock) which prevents multithreaded applications from using several cores at the same time.
There have been several attempts to remove the GIL from everyday Python; nowadays the most promising is PyPy’s STM (software transactional memory) branch.
An alternative to threads is the multiprocessing module, which runs several processes instead. This of course adds overhead, since there will be several instances of the interpreter running at the same time, but since Python is relatively lightweight compared with, say, Java, this isn’t usually considered a problem.
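A minimal sketch with multiprocessing.Pool: each worker is a separate interpreter with its own GIL, so the CPU-bound calls below can genuinely run in parallel (the inputs to busy are arbitrary, just big enough to keep the cores busy for a moment):

```python
import math
from multiprocessing import Pool

def busy(n):
    # CPU-bound work; each call runs in its own process with its own GIL.
    return len(str(math.factorial(n)))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # The four factorials are computed on (up to) four cores at once.
        results = pool.map(busy, [50_000, 50_000, 50_000, 50_000])
    print(results)
```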
Sometimes a single machine isn’t enough, no matter how powerful it is. In that case it’s necessary to extend the concept and build a cluster to spread the load; this is what nowadays gets called “Big Data”, and there are lots of frameworks built to work this way, most notably Apache Hadoop.
Another strategy worth mentioning is caching. It isn’t really an optimization in itself: it consists in saving the result of an expensive computation so it can be reused on the next request instead of being recalculated. It’s a tradeoff between CPU time and memory consumed, but it can become a true optimization for algorithms that perform repetitive calculations (when applied to a deterministic function this is known as “memoization”).
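The standard library even ships a memoization decorator, functools.lru_cache. A classic demonstration is the naive recursive Fibonacci, which goes from exponential to linear time once results are remembered:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # trade memory for CPU: remember every result
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(300))  # instant; uncached, the naive recursion would never finish
```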
The other type of performance bottleneck is I/O, which is simply any operation that communicates with a device: reading from or writing to disk, connecting to the Internet, using a socket, etc.
This is a problem because those operations are really, really slow compared to in-memory work; here’s a chart comparing how long some of them take.
Whenever a thread or process asks for I/O it gets “blocked” waiting for the data, which means it just sits there doing nothing while the OS uses that time to run other programs. That’s why it’s important to use strategies that keep the CPU busy while waiting for the data to arrive.
Interestingly, in this scenario you can use threads without being hurt by the GIL, because CPython releases the lock during blocking I/O calls: threads that are waiting for I/O don’t block the one doing CPU work (and of course processes and clusters are also ways to deal with this situation).
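Here’s a sketch of I/O-bound work spread across threads with concurrent.futures; the URL is just a placeholder to swap for real ones. Since each thread releases the GIL while it waits on the network, the five downloads overlap instead of running back to back:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

urls = ["http://example.com"] * 5  # placeholder; use real URLs

def fetch(url):
    # Blocks on the network, releasing the GIL while it waits.
    return len(urlopen(url).read())

with ThreadPoolExecutor(max_workers=5) as executor:
    sizes = list(executor.map(fetch, urls))
print(sizes)
```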
Besides that, there are libraries that implement “async I/O”, such as gevent or Twisted; they provide another way to make use of the CPU while waiting for data to arrive. This is the core principle behind the performance of Node.js, and it was known in several other languages long before that.
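gevent and Twisted each have their own APIs, but the underlying idea can be sketched with the standard asyncio module (which arrived later, in Python 3.4): a single thread juggles many waits at once. The sleep below is just a stand-in for a slow network call:

```python
import asyncio

async def fake_request(i):
    # await hands control back to the event loop while "waiting on I/O",
    # so one thread can keep many requests in flight at once.
    await asyncio.sleep(1)  # stand-in for a slow network call
    return i

async def main():
    results = await asyncio.gather(*(fake_request(i) for i in range(10)))
    print(results)  # ten 1-second "requests" finish in about 1 second total

asyncio.run(main())
```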