blog2samus!

Hear me rant.

Apr 9

On optimization and Python

Here’s another interesting discussion that originated on reddit. It also led me to create a proper visualization of the latency tables found around the web, because there really wasn’t a single one that helped understand the times involved. Enjoy!


The golden rule of optimization is: benchmark.

Most of the time you assume things about the code that don’t really apply to the context in which it runs; the interpreter, the platform, the hardware and the data all affect performance in different ways that are very hard to predict.

Regardless, some theory and principles hold true in all cases (which doesn’t mean they’ll always give you faster code), and the most important thing to know is the difference between CPU-intensive tasks and I/O-bound tasks.

CPU-intensive

CPU-intensive tasks are those that perform lots of calculations with data in memory. An example would be computing the factorial of a large number: if you ask Python for math.factorial(1e6) it’ll take some time, and while it’s running you can observe (in a process manager or similar) that one of the CPUs of the machine is at 100% capacity. This is important because while a CPU is doing something it cannot work on anything else.
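a minimal sketch to watch this happen on your own machine (the input size here is arbitrary, adjust it to your hardware):

import math
import time

start = time.time()
math.factorial(100000)  # big enough to keep one core busy for a moment
print 'took %.2f seconds' % (time.time() - start)

while it runs, one core should sit at 100% in your process manager.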

There are several ways to deal with this. The first thing to realize is that dynamic languages are slow for computing compared to static languages, so one common approach in Python is to replace the part of the code that is the bottleneck (previously identified with profiling and benchmarks) with an extension in C.

Of course, writing optimized C code and embedding it into Python isn’t trivial, but it’s an option; that is what the NumPy library does, for instance.
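to get a feel for the difference, here’s a rough sketch comparing a sum of squares in a plain Python loop against the same computation delegated to NumPy’s compiled internals (the array size is arbitrary, and per the golden rule you’d benchmark both on your own machine):

import numpy as np

# pure-Python loop: every iteration goes through the interpreter
total = 0
for i in xrange(1000000):
    total += i * i

# same computation done inside NumPy's C code
a = np.arange(1000000, dtype=np.int64)
total_np = (a * a).sum()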

Another option is to split the task and parallelize the computation. Traditionally this was achieved using threads, because each thread can use a different core, thus making full use of multicore processors; unfortunately the default Python interpreter (CPython) has an issue known as the GIL which prevents multithreaded applications from using several cores at the same time.

Other interpreters (based on different VMs) don’t have this problem; Jython, for instance, which runs on Java’s JVM, doesn’t have a GIL.

There have been several attempts to remove the GIL from everyday Python; nowadays the most promising is PyPy’s STM.

An alternative to threads is to use the multiprocessing module to run several processes instead. This of course means some overhead, because there will be several instances of the interpreter running at the same time, but since Python is relatively lightweight compared with, let’s say, Java, this isn’t considered a problem.
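a minimal sketch of spreading CPU work across processes (the pool size and the workload are made up for illustration, pick them according to your cores and your actual task):

from multiprocessing import Pool
import math

def digits_of_factorial(n):
    # return something small instead of the huge number itself,
    # to keep the inter-process communication cheap
    return len(str(math.factorial(n)))

if __name__ == '__main__':
    pool = Pool(4)  # roughly one worker per core
    print pool.map(digits_of_factorial, [20000, 20000, 20000, 20000])
    pool.close()
    pool.join()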

Sometimes a single machine isn’t enough no matter how powerful it is; in that case it’s necessary to extend this concept and build a cluster to spread the load. This is what nowadays is known as “Big Data”, and there are lots of frameworks made to work in this fashion, most notably Apache Hadoop.

Another strategy worth mentioning is caching. This isn’t really an optimization; it consists in saving the result of an expensive computation to be reused on the next request instead of recalculating it all. It’s a tradeoff between CPU time and memory consumed, but it can become a true optimization in algorithms that perform repetitive calculations (when this is applied to a deterministic function it’s known as “memoization”).
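the simplest form is a dict keyed by the function’s input; a minimal sketch (the computation here is just a stand-in for something costly, and there’s a whole post on memoization decorators further down this page):

cache = {}

def expensive(n):
    if n not in cache:
        cache[n] = sum(i * i for i in xrange(n))  # pretend this is expensive
    return cache[n]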

I/O-bound

The other type of performance bottleneck is I/O, which is simply any operation that communicates with a device: reading or writing to the disk, connecting to the Internet, using a socket, etc.

This is a problem because those operations are really, really slow compared to other tasks; here's a chart comparing the time it takes to do some of them.

Whenever a thread or process asks for I/O it gets “blocked” waiting for the data, which means it just sits there doing nothing while the OS uses that time to run other programs, so it’s important to use strategies that keep the CPU working while waiting for the data to arrive.

Interestingly, in this scenario you can use threads and not be affected by the GIL, because it only blocks active threads: those that are waiting for I/O release it and won’t block the one doing CPU work (of course processes and clusters are also ways to deal with this situation).
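a minimal sketch of I/O-bound threading (the URLs are just placeholders):

import threading
import urllib2

def fetch(url):
    # the GIL is released while this thread blocks on the network
    data = urllib2.urlopen(url).read()
    print '%s: %d bytes' % (url, len(data))

urls = ['http://example.com', 'http://example.org']
threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()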

Besides that, there are some libraries that implement “async I/O”, such as gevent or Twisted; they also provide a way to make use of the CPU while waiting for data to arrive. This is the core principle behind the performance of Node.js, and it was known in several other languages for a long time before that.
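with gevent, for instance, the same concurrent download could look like this (a sketch assuming the stdlib is monkey-patched to cooperate; the URLs are placeholders again):

from gevent import monkey
monkey.patch_all()  # make blocking stdlib calls cooperative

import gevent
import urllib2

def fetch(url):
    return len(urllib2.urlopen(url).read())

jobs = [gevent.spawn(fetch, u) for u in ('http://example.com', 'http://example.org')]
gevent.joinall(jobs)
print [job.value for job in jobs]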


Feb 5

Global Variables and Shared State in bash (and beyond!)

another one from reddit, this time on the bash subreddit; while the conversation started around a particular script, it transitioned to general programming concepts that apply elsewhere, in this case global variables and shared state (original thread).


a global variable is one that exists and is accessible from any part of your program; the variables you’re using (and probably the only ones you know about) are globals. notice for instance that the $upper variable is first created at the top of the script, but then playgame() reads it and start() updates its value: this is shared state.

ideally, functions should be self-contained, which means they shouldn’t access variables that exist outside their own code nor create new ones that would “leak” to the outside after they return. in bash all variables are global by default, but inside a function you can use the local keyword to create variables that exist only there and not outside.

that solves the last half of our predicament, which is to not “leak” data to the outside world; the other half is solved pretty easily: instead of reading global variables, use parameters when calling the function.
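a minimal bash sketch of both fixes (the function and variable names mirror the ones from the thread’s script, so treat the details as illustration):

start() {
    local upper=100      # local: exists only inside this function
    playgame "$upper"    # pass the value explicitly instead of sharing state
}

playgame() {
    local upper=$1       # read the parameter, not a global
    echo "guess a number between 1 and $upper"
}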

with these two things we can have a self-contained function, that is, one that does not depend on the state of the external world to function properly. this has several advantages:

your functions become more reusable and more robust

the problem with globals is that your function depends on the existing conditions of the environment to work properly. imagine you had a typo and instead of “upper=100” it was “uper=100”: now calling start() would break, because it expects the variable $upper to exist and have a numeric value.

of course, if you’re using parameters and don’t pass anything it’ll also break, and in the same way as before; however, when you debug and find that the problem is that $upper is missing, you’ll be able to trace the function call and know exactly at which point the call being made does not include the desired argument.

if you have the same situation but with a global variable instead, not only do you have to scan the whole program for the definition of the variable (since it could be ANYWHERE), but you also have to trace the program execution from the start up to the point where the missing variable is required. this is because even if the variable were correctly defined where you expect it, there could be another point in your program (again, ANYWHERE) where it’s being assigned a wrong value or deleted entirely. always using parameters helps you here because it creates a traceable path for you to verify and locate the error.

that’s it for robustness; now, your parameter-based functions are also more reusable because, when called appropriately, they’ll have everything they need to operate regardless of which part of the code or even which program they’re being executed from! that’s what self-contained means here: the function can operate as long as it’s called appropriately, it does not depend on certain environmental conditions to work.

your code is more readable

when I was studying your script I got confused at some point, because the start() function and the playgame() function communicate implicitly through the $upper variable.

the thing is, when you have shared state the code of a function is not just its body: in order to understand it you need to understand everything that is happening on the outside, and this is a really big problem; imagine this same situation but in a large-scale system with hundreds of files and thousands of lines of code.

on the other hand, when you have a function that is self-contained you just need to focus on it to understand what it does; you won’t know what it’s used for or where, but you’ll be able to tell what it does because everything the function needs is right there, so it’s easier for you to follow and understand. being able to break large problems into smaller pieces is key: that’s how we humans can build projects and work on systems so complex that no single person would be able to comprehend them entirely on their own. “divide and conquer”: it works.


Jan 15

Classes and inheritance in Python

A follow-up on the previous post about classes, also from reddit.


The idea of inheritance is that you can compose new classes on top of existing ones, which allows you to reuse their code.

Here’s an example, let’s say we have a class to represent a rectangle:

class Rectangle(object):
    def __init__(self, length, width):
        self.length = length
        self.width  = width

    def getArea(self):
        return self.length * self.width

    def getPerimeter(self):
        return 2*(self.length+self.width)

and a class to represent a square:

class Square(object):
    def __init__(self, length):
        self.length = length

    def getArea(self):
        return self.length**2

    def getPerimeter(self):
        return 4*self.length

but a square is a special case of a rectangle (one in which both sides have the same length), so we can reuse Rectangle’s code in our Square class like this:

class Square(Rectangle):
    def __init__(self, length):
        super(Square, self).__init__(length, length)

and with just these three lines we have a Square class with all the same functionality, including methods, as before; i.e. you can do mysquare = Square(3) and then mysquare.getArea() returns 9, even though getArea isn’t defined on the Square class itself.
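to check it works:

mysquare = Square(3)
print mysquare.getArea()       # 9, inherited from Rectangle
print mysquare.getPerimeter()  # 12, also inherited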

Let’s analyze the three lines in more detail. First, class Square(Rectangle): tells Python that this class definition is based on the definition of Rectangle, and so all its methods and attributes will be available here (unless we redefine them).

def __init__(self, length): is the redefinition of the __init__ method; we do this because we want the constructor to take only one argument, as opposed to Rectangle’s two. Then super(Square, self).__init__(length, length) is the notation for calling the parent’s method, in this case Rectangle’s __init__: we call it because we rely on it for the initialization, we’re just passing it the parameters.

Read about new-style classes and why base classes should inherit from object (lowercase); note that super() only works with new-style classes, which is why Rectangle inherits from object above.

You should also know that it’s possible to inherit from multiple classes at once, as in MyClass(ParentA, ParentB):, but you probably won’t need this anytime soon. The idea is the same, MyClass receives code from all its parents, but the inheritance rules are more complicated because the parents may both define the same methods or attributes.
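a quick sketch of how Python resolves such a conflict (the names are made up; lookup goes left to right through the parents, following what’s called the method resolution order):

class ParentA(object):
    def greet(self):
        return 'hello from A'

class ParentB(object):
    def greet(self):
        return 'hello from B'

class MyClass(ParentA, ParentB):
    pass

print MyClass().greet()  # 'hello from A', ParentA is found first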


Jan 11

Classes and initialization in Python

another one from reddit, this time explaining the basics of object-oriented programming and object initialization in Python.


it’s a big topic in programming in general. I should tell you first that while most programming languages support classes and objects, they implement them differently, so while the idea is more or less the same, what you learn about objects in Python won’t apply directly to objects in Java or any other language.

ok, so there are two concepts to discuss here: the concept of “Python classes” (and the class keyword), and then the concept of “class instances”, which is related to the previous one; together they form the idea of “objects” and “object-oriented programming” (OOP).

a class is a datatype, just like strings or integers or even dictionaries, but unlike most plain or container datatypes, classes can also contain executable code (methods) along with pure data (attributes).

so the class keyword allows you to create a new class datatype with the definition you provide. this definition specifies what the instances (the actual variables of this new datatype) will initially have, and I say initially because Python allows you to modify them after they’re created, so a given instance may later be different from the original definition.

to make this clearer, think about the following code:

a = 2

here you have a variable ‘a’ with the value 2, which is an integer, so ‘integer’ is the datatype of the value associated with the name ‘a’, ok? now let’s see the following:

class MyClass(object):
    pass

here we’ve declared a new datatype (class) which we’ve called ‘MyClass’, but at this point there’s no variable of this type in existence yet; we’ve just defined the type.

this particular class doesn’t have anything else besides what’s in object (by means of ‘inheritance’, another important concept).

so if we want to create an instance of this class, that is, an actual variable of this type, we do:

b = MyClass()

and now we have ‘b’ referencing this object instance that resides in memory, just like ‘a’ references 2.
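and as mentioned before, instances can diverge from their class definition after creation:

b.color = 'red'  # add a brand new attribute to this instance only
print b.color    # 'red', even though MyClass never declared it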

so that’s how you declare and use them, but what’s so great about them? the idea of classes is (among other things) to allow you to structure your code by modelling real-world objects: a class defines the common properties and behaviors of a kind of object, and an instance fills in those values with the state of a particular one.

this is a bit too abstract, so let’s see another example. suppose we’re a company that sells vending machines (these ones, not sure what else to call them); each unit keeps track of the stock it contains and updates it automatically whenever a soda is sold.

so, we can create a class to model the machines’ attributes and behaviors, declaring the things they have in common in the class and then adjusting each one’s attributes as they’re in use.

the class:

class VendingMachine(object):
    def __init__(self, serial_number):
        """ initialize machine """
        self.serial_number = serial_number
        self.stock = {
            'coca-cola': 0,
            'sprite': 0,
            'fanta': 0,
        }

    def sale(self, item):
        """ register a sale (or crash if empty) """
        stock = self.stock
        if stock[item] > 0:
            stock[item] -= 1
        else:
            raise Exception('empty')

    def replenish(self, item, amount):
        """ replenish stock of given item """
        self.stock[item] += amount

    def get_stock(self):
        """ check current stock of drinks """
        return self.stock

new things! :) here I’ve introduced ‘methods’, which are functions associated with a particular instance of a class. this is important, but before seeing how it works let’s see how you could’ve implemented this without classes (because it’s totally possible, just a bit messier).

def create_vending_machine(serial_number):
    """ returns an object (dict) representing a vending machine """
    return {
        'serial_number': serial_number,
        'stock': {
            'coca-cola': 0,
            'sprite': 0,
            'fanta': 0,
        },
    }

def perform_sale(vending_machine, item):
    """ register a sale on the given machine (or crash if empty) """
    stock = vending_machine['stock']
    if stock[item] > 0:
        stock[item] -= 1
    else:
        raise Exception('empty')

def replenish_machine(vending_machine, item, amount):
    """ replenish stock of given item on the given machine """
    vending_machine['stock'][item] += amount

def get_machine_stock(vending_machine):
    """ check current stock of drinks on this machine """
    return vending_machine['stock']

this code has EXACTLY the same functionality as the class-based version and you would use it like this:

# all the machines I own
my_machines = []

for serial_n in ('001', '002', '003'):
    # create new representation for this serial number
    new_machine = create_vending_machine(serial_n)

    # fill it with drinks
    replenish_machine(new_machine, 'coca-cola', 50)
    replenish_machine(new_machine, 'sprite', 50)
    replenish_machine(new_machine, 'fanta', 50)

    # ready to ship, add to my list of owned machines
    my_machines.append(new_machine)

and then when a machine sells a drink it would register the sale like this:

# machine #2 registers a sale
machine_2 = my_machines[1]
perform_sale(machine_2, 'fanta')

to get a report of the drinks for sale:

# check machine #1 stock
machine_1 = my_machines[0]
print get_machine_stock(machine_1)

so far so good, now let’s compare it with the class-based version (I’ll explain the methods in a second):

# all the machines I own
my_machines = []

for serial_n in ('001', '002', '003'):
    # create new representation for this serial number
    # in this case an instance of the VendingMachine class
    new_machine = VendingMachine(serial_n)

    # fill it with drinks
    # methods operate on the instance directly and apply to its attributes
    new_machine.replenish('coca-cola', 50)
    new_machine.replenish('sprite', 50)
    new_machine.replenish('fanta', 50)

    # ready to ship, add to my list of owned machines
    my_machines.append(new_machine)

a sale:

# machine #2 registers a sale
machine_2 = my_machines[1]
machine_2.sale('fanta')

a report:

# check machine #1 stock
machine_1 = my_machines[0]
print machine_1.get_stock()

it doesn’t look much different, right? the actual difference is that the class groups the methods and the attributes within itself and applies them to each instance in particular; in the procedural version (the one without classes) we were tossing the dict with the data around to the functions that worked with it, while with classes you group the variables and the functions that work on them into a single concept: the class.

there are more concepts to learn from here, such as encapsulation, inheritance and a few others, but this is all you need to know to start using classes: think of them as a way to group related variables and functions into a single concept you can manipulate as a variable of its own.

about __init__

as you’ve seen, methods (i.e. functions that operate on the class instance) are just functions declared within the class statement; they receive a special first argument, called ‘self’ by convention, which is the instance they belong to (the instance, not the class).

__init__ is a special method used for initialization purposes: it is automatically called when you create an instance of a class, in our example every time we executed VendingMachine(serial_n), and as you can see it receives the arguments passed when the object is created. it’s just a way to set defaults for the attributes and perform any other initialization, no big deal.


Jan 7

On Decorators, Closures and Memoization in Python

this was originally posted on reddit as a reply to someone asking about using decorators to cache the results of a method; I went on a lengthy explanation and even found out something I didn’t know about closures in Python, so here it is.


let’s say you have:

def calculate():
    return 7**7**7

a simple unobtrusive memoization decorator would be:

def memoize(f):
    cached = []  # I'll explain why a list in a minute
    def wrapper():
        if not cached:
            cached.append(f())
        return cached[0]
    return wrapper

we’re using several interesting concepts here. first, the decorator itself: now, I don’t know how familiar you are with these, but basically when you decorate a function:

@memoize
def calculate():
    return 7**7**7

Python replaces the original def calculate with the result of calling memoize, and that’s why memoize returns a function (return wrapper): it is the function named wrapper that will be called when you do calculate().

the decorator (the function called memoize) receives the original function as a parameter (in this case I’ve named it f), so it has the chance of invoking the original function if it needs to.

here you can see that we create the variable cached and call the original function to assign it a value, but only if it’s empty; after that we just use the cached value, thus calling the original function only once.
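in use it looks like this:

result = calculate()  # first call: runs 7**7**7 and stores the result
result = calculate()  # second call: returns the stored value, no recomputation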

notice the @memoize notation is just syntactic sugar; we could’ve written this and had the exact same results:

def calculate():
    return 7**7**7

calculate = memoize(calculate)

now, the reason the value persists across calls is another concept called “closure”, but before getting into that it’s important to understand the execution order of this code; let’s see this simplified example to demonstrate (save it as a file, it won’t work on the shell):

def mydecorator(f):
    print 'inside decorator'
    def wrapper():
        print 'enter wrapper'
        f()
        print 'exit wrapper'
    return wrapper

@mydecorator
def myfunction():
    print 'inside function'

if __name__ == '__main__':
    print 'started program'
    myfunction()
    print 'program running'
    myfunction()
    print 'exit program'

now run it and try to understand the flow (also, be careful with snippets from the web: if there’s any syntax you’re not familiar with, look it up first).

Python programs have two phases of evaluation: the first happens when the file is read, in which the text is turned into actual executable structures; the second is when you run the program, in which, depending on the logic (ifs, fors, function calls, etc.), some or all of the previously interpreted code runs.

the function and class definitions, as well as any top-level code in the file, are evaluated when the file is first read; that’s why we guard the start of the program with if __name__ == '__main__': that way the code inside the if branch will only execute when the file is invoked as a script and not when it gets imported as a module.

the function and class definitions are evaluated but their bodies do not run, because they haven’t been called; Python only creates the appropriate references in memory so they’re available by the time you actually want to run them. that also applies to the definition of mydecorator, BUT the @mydecorator syntax performs a call to the function in order to replace the original definition of myfunction, and this happens when the file is read.

that’s why you see the text “inside decorator” before anything else in the output of this code, even before we call myfunction; afterwards, every call to it goes to the returned function, in this case wrapper.
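for reference, running the file should print:

inside decorator
started program
enter wrapper
inside function
exit wrapper
program running
enter wrapper
inside function
exit wrapper
exit program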

ok, back to the concept of closures. this is a bit hard to explain, but think of it as “a function that carries a context”: in this case it’s the wrapper function that has access to the context in which it was created, even after that context is no longer active.

in practical terms, the wrapper function will have access to the cached variable and the f variable every time it is called (and it’ll be the same variable, meaning it’ll hold its value across calls); keep in mind that the memoize function is only called once, when the line with @memoize is evaluated, and after that every call to calculate goes to wrapper.

now the tricky part, and the reason we’ve used a list, is that in Python closures are read-only; this is something I didn’t know until now and had to look up, but it turns out you can create closures and read the values of those vars but not assign to them. using a mutable object (in this case a list) is a bit of a hack: the list itself doesn’t change (i.e. the object referenced by the name “cached”), so Python doesn’t complain, but the contents of that object are a different story and we can modify them.

honestly I don’t know why it works like this; my original thought (which works in Javascript and many other languages) was:

def memoize(f):
    cached = None
    def wrapper():
        if cached is None:
            cached = f()
        return cached
    return wrapper

which is much clearer, but in Python it fails with UnboundLocalError, so we’re forced to work around it using a list.
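worth noting: Python 3 added the nonlocal keyword precisely for this, so there the clearer version works with one extra line (this sketch is Python 3 only, the rest of this post is Python 2):

def memoize(f):
    cached = None
    def wrapper():
        nonlocal cached  # allow assignment to the enclosing variable
        if cached is None:
            cached = f()
        return cached
    return wrapper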

that’s all you need to solve the problem, but when decorating methods instead of functions there are a few considerations worth addressing as well.

bound methods (i.e. methods of a class instance) are functions with an implicit first parameter, which we call self by convention; creating decorators for methods is simple if you keep this in mind.

so if you have:

class Example(object):
    def calculate(self):
        return 7**7**7

you would call it like:

example = Example()
example.calculate()

the decorator would be:

def memoize(f):
    cached = []  # note: this list is shared by every instance of the class
    def wrapper(self):
        if not cached:
            cached.append(f(self))  # f is the original method, it needs self
        return cached[0]
    return wrapper

but since we’ve got self, which is a persistent object already, we can drop the list and write to it directly (as in the un-decorated original example from SO):

def memoize(f):
    def wrapper(self):
        cached_name = '_cached_{}'.format(f.__name__)
        if not hasattr(self, cached_name):
            setattr(self, cached_name, f(self))  # compute once, store on the instance
        return getattr(self, cached_name)        # then always read it back
    return wrapper

we’re introspecting the function name and using it to dynamically set an attribute on the instance with the cached value; this is meant to prevent collisions in case you decorate several methods of the same class.

the example on SO also had @property, which is another decorator, meant to allow you to call the method as if it were an attribute (i.e. without parentheses); here you need to be careful with the order in which the decorators get applied, and the right way would be:

class Example(object):
    @property
    @memoize
    def calculate(self):
        return 7**7**7

so memoize gets applied first to calculate, and property on top of that.
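usage then looks like this (note the missing parentheses, thanks to @property):

example = Example()
print example.calculate  # first access computes and stores _cached_calculate
print example.calculate  # later accesses read the stored attribute back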

