Crafting pelican-export in 6 hours.

Over the past two or three days, I spent some deep work time writing pelican-export, a tool to export posts from the Pelican static site generator to WordPress (with some easy hooks to add more targets). Overall I was happy with the project, not only because it was successful, but because I was able to get to something complete in a pretty short period of time: 6 hours. Reflecting, I owe this to the techniques I’ve learned to prototype quickly.

Here’s a timeline of how I iterated, with some analysis.

[20 minutes] Finding Prior Art

Before I start any project, I try to at least do a few quick web searches to see if what I want already exists. Searching for “pelican to wordpress” pulled up this blog post:

https://code.zoia.org/2016/11/29/migrating-from-pelican-to-wordpress/

Which pointed at a git repo:

https://github.com/robertozoia/pelican-to-wordpress

Fantastic! Something exists that I can use. Even if it doesn’t work off the bat, I can probably fix it, use it, and be on my way.

[60m] Trying to use pelican-to-wordpress

I started by cloning the repo and looking through the code. From here I got some great ideas to quickly build this integration (e.g. discovering the python-wordpress-xmlrpc library). Unfortunately the code only supported markdown (my posts are in restructuredtext), and there were a few things I wasn’t a fan of (constants, including the password, living in a file), so I decided to start doing some light refactoring.

I started organizing things into a package structure, and tried to use the Pelican Python package itself to do things like read the file contents (saves me the need to parse the text myself). While looking for those docs, I stumbled upon some issues in the pelican repository, suggesting that for exporting, one would want to write a plugin:

https://github.com/getpelican/pelican/issues/2143

At this point, I decided to explore plugins.

[60m] Scaffolding and plugin structure.

Looking through the plugin docs, it seemed much easier than trying to read in the pelican posts myself. I had limited success with instantiating a pelican reader object directly, as it expects specific configuration variables.

So I started authoring a real package. Copying in the package scaffolding like setup.py from another repo, I added the minimum integration I needed to actually install the plugin into pelican and run it.
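
For reference, the integration surface of a pelican plugin is small. Here is a rough sketch of what the entry point looks like (the actual pelican-export code differs in its details):

# pelican discovers a plugin through a module-level register() function,
# which hooks callbacks into pelican's signals.
from pelican import signals

def export_posts(pelican_object):
    # "finalized" fires after pelican has finished generating the site,
    # so all content has been read and is available to export.
    print("pelican run finished, ready to export")

def register():
    signals.finalized.connect(export_posts)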

[60m] Rapid iteration with pdb.

At that point, I added a pdb statement into the integration, so I could quickly look at the data structures. Using that I crafted the code to migrate post formats in a few minutes:

    def process_post(self, content) -> Optional[WordPressPost]:
        """Create a wordpress post based on pelican content"""
        if content.status == "draft":
            return None
        post = WordPressPost()
        post.title = content.title
        post.slug = content.slug
        post.content = content.content
        # this conversion is required, as pelican uses a SafeDateTime
        # that python-wordpress-xmlrpc doesn't recognize as a valid date.
        post.date = datetime.fromisoformat(content.date.isoformat())
        post.term_names = {
            "category": [content.category.name],
        }
        if hasattr(content, "tags"):
            post.term_names["post_tag"] = [tag.name for tag in content.tags]
        return post

I added a similar pdb statement to the handler for the “finalized” pelican signal, and tested the client with hard-coded values. I was done as far as functionality was concerned!
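
On the client side, python-wordpress-xmlrpc keeps the hard-coded test short. A sketch of what that call looks like (the URL and credentials here are placeholders):

from wordpress_xmlrpc import Client
from wordpress_xmlrpc.methods.posts import NewPost

# the xmlrpc endpoint lives at /xmlrpc.php on a wordpress install
client = Client("https://example.com/xmlrpc.php", "username", "password")

# `post` is a WordPressPost like the one built in process_post above;
# NewPost returns the id of the newly created post.
post_id = client.call(NewPost(post))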

[180m] Code cleanup and publishing

The bulk of my time after that was just smaller cleanup that I wanted to do from a code hygiene standpoint. Things like:

  • [70m] making the wordpress integration an interface, so it’s easy to hook in other exporters.
  • [40m] adding a configuration pattern to enable hooking in other exporters.
  • [10m] renaming the repo to its final name of pelican-export
  • [30m] adding readme and documentation.
  • [30m] publishing the package to pypi.

This was half of my time! Interesting how much time is spent just ensuring the right structure and practices for the long term.

Takeaways

I took every shortcut in my book to arrive at something functional, as quickly as I could. Techniques that saved me tons of time were:

  • Looking for prior art. Brainstorming how to do the work myself would have meant investigating potential avenues and evaluating how long each would take. Having an existing example, even if it didn’t work for me, helped me ramp up on the problem quickly.
  • Throwing code away. I had a significant amount of modified code in my forked exporter. But continuing down that route would have involved a significant investment in hacking on and understanding the pelican library. Seeing that the plugin route existed, and testing it out, saved me several hours of trying to hack an interface to private pelican APIs.
  • Using pdb to live-write code. In Python especially, there’s no replacement for just introspecting and trying things. Authoring just enough code to integrate as a plugin gave me a fast feedback loop, and throwing in a pdb statement to quickly learn the data structures helped me find the ideal structure in about 10 minutes.

There was also a fair bit of Python expertise that I used to drive down the coding time, but what’s interesting is that the biggest contributors to the time savings were process: knowing the tricks for picking the right approach to the code, and iterating quickly, helped me get this done in effectively a single work day.

Aiohttp vs Multithreaded Flask for High I/O Applications

Over the past year, my team has been making the transition from Flask to
aiohttp. We’re making this
transition because there are a lot of situations where non-blocking I/O
theoretically scales better:

  • large numbers of simultaneous connections
  • remote http requests with long response times

There is agreement that asyncio scales better memory-wise: a coroutine
in Python consumes less memory than a system thread.

However, performance for latency and load is a bit more contentious. The best way to find
out is to run a practical experiment.

To find out, I forked py-frameworks-benchmark, and designed an experiment.

The Experiment

The conditions of the web application, and the work performed, are identical:

  • a route on a web server that: 1. makes an http request to an nginx server returning back html, and 2. returns the response as json (a sketch of the handlers follows this list)
  • a wrk benchmark run, with 400 concurrent requests for 20 seconds
  • running under gunicorn, with two worker processes.
  • python3.6
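
The actual handlers live in the benchmark fork; roughly, the two variants under test look like the sketch below (the upstream URL and the response shape are placeholders):

# aiohttp variant: awaiting the upstream request yields the event loop
# to other connections instead of blocking a worker thread.
import aiohttp
from aiohttp import web

async def fetch(request):
    async with aiohttp.ClientSession() as session:
        async with session.get("http://127.0.0.1:8080/") as resp:
            body = await resp.text()
    return web.json_response({"length": len(body)})

app = web.Application()
app.router.add_get("/test", fetch)

# flask variant: the worker thread blocks until the upstream responds.
import json
import requests
from flask import Flask, Response

flask_app = Flask(__name__)

@flask_app.route("/test")
def fetch_sync():
    body = requests.get("http://127.0.0.1:8080/").text
    return Response(json.dumps({"length": len(body)}), mimetype="application/json")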

The Variants

The variants are:

  • aiohttp
  • flask + meinheld
  • flask + gevent
  • flask + multithreading, varying from 10 to 1000.

Results

variant          min (ms)  p50 (ms)  p99 (ms)  p99.9 (ms)  max (ms)  mean (ms)  duration (s)  requests
aiohttp          163.27    247.72    352.75    404.59      1414.08   257.59     20.10         30702
flask:gevent     85.02     945.17    6587.19   8177.32     8192.75   1207.66    20.08         7491
flask:meinheld   124.99    2526.55   6753.13   6857.55     6857.55   3036.93    20.10         190
flask:10         163.05    4419.11   4505.59   4659.46     4667.55   3880.05    20.05         1797
flask:20         110.23    2368.20   3140.01   3434.39     3476.06   2163.02    20.09         3364
flask:50         122.17    472.98    3978.68   8599.01     9845.94   541.13     20.10         4606
flask:100        118.26    499.16    4428.77   8714.60     9987.37   556.77     20.10         4555
flask:200        112.06    459.85    4493.61   8548.99     9683.27   527.02     20.10         4378
flask:400        121.63    526.72    3195.23   8069.06     9686.35   580.54     20.06         4336
flask:800        127.94    430.07    4503.95   8653.69     9722.19   514.47     20.09         4381
flask:1000       184.76    732.21    1919.72   5323.73     7364.60   786.26     20.04         4121

You can probably get a sense that aiohttp can serve more requests than any
other variant. To get a real sense of how threads scale, we can put the request count on
a chart:

[chart: requests completed per variant]

The interesting note is that the meinheld worker didn’t scale very well at all.
Gevent handled requests faster than any threading implementation.

But nothing handled nearly as many requests as aiohttp.

These are the results on my machine. I’d strongly suggest you try the experiment
for yourself: the code is available in my fork.

If anyone has any improvements on the multithreading side, or can explain the discrepancy in performance, I’d love to understand more.

KeyError in self._handlers: a journey deep into Tornado’s internals

If you’ve worked with tornado, you may have encountered a traceback of
a somewhat bewildering error:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/tornado/ioloop.py", line 832, in start
    fd_obj, handler_func = self._handlers[fd]
KeyError: 16

A few other people have been confused as well. After some digging, and a combination
of learning about the event loop, fork, and epoll, the answer finally came into focus.

TLDR

If you’re looking for the solution: don’t call or start IOLoops before
an os.fork. Forking happens in web servers like gunicorn, as well as in
tornado.multiprocess, so be aware of that caveat there as well.

But why does this happen?

As I mentioned previously, this is a combination of behaviour across the
whole stack: the system, Python, and tornado. Let’s start with
learning more about that error specifically.

The code the traceback refers to occurs in the IOLoop:

# tornado/ioloop.py
self._events.update(event_pairs)
while self._events:
    fd, events = self._events.popitem()
    try:
        fd_obj, handler_func = self._handlers[fd]
        handler_func(fd_obj, events)

What are these variables? You can read the IOLoop code yourself, but effectively:

  • _handlers is a mapping of file descriptors to the callbacks that should be called once an async event completes.
  • _events is a collection of events that have occurred and still need to be handled.

What is an FD?

The handlers and events are both keyed off of file descriptors. In a
few words, file descriptors represent a handle to some open file. In
unix, a pattern has propagated where a lot of resources (devices,
cgroups, active/inactive state) are referenced via file descriptors:
it became a lingua franca for low level resources because a lot of
tooling knows how to work with file descriptors, and writing and
reading to a file is simple.

They’re useful for tornado because sockets also have a file descriptor
representing them. So the tornado ioloop can wait for an event
affecting a socket, then pass that socket to a handler when a socket
event fires (e.g. some new data arrived in the socket buffer).

What modifies the events and handlers?

A KeyError on _handlers means there’s a key in _events that is not in
_handlers: some code is causing events to be added to the ioloop
without registering a handler for them at the same time. So how does that
happen in the code?

A good starting point is looking at where _handlers and _events are
modified in the code. In all of the tornado code, there are only a
couple of places:

# tornado/ioloop.py
def add_handler(self, fd, handler, events):
    fd, obj = self.split_fd(fd)
    self._handlers[fd] = (obj, stack_context.wrap(handler))
    self._impl.register(fd, events | self.ERROR)
# tornado/ioloop.py
def remove_handler(self, fd):
    fd, obj = self.split_fd(fd)
    self._handlers.pop(fd, None)
    self._events.pop(fd, None)
    try:
        self._impl.unregister(fd)
    except Exception:
        gen_log.debug("Error deleting fd from IOLoop", exc_info=True)

Looking at these pieces, the code is pretty solid:

  • handlers are added only in add_handler, where they are also registered with _impl.register
  • handlers are only removed in remove_handler, where they are removed from _events, _handlers, and _impl.
  • events are added to _events in _impl.poll()

So removing a handler always makes sure that _events no longer has
it, and removes it from this impl thing too.

But what is impl? Could impl be adding fd’s for events that don’t have handlers?

impl: polling objects

It turns out _impl is chosen based on the OS. There is a little bit of
indirection here, but the IOLoop class in tornado extends a configurable object,
which selects the class based on the method configurable_default:

# tornado/ioloop.py
@classmethod
def configurable_default(cls):
    if hasattr(select, "epoll"):
        from tornado.platform.epoll import EPollIOLoop
        return EPollIOLoop
    if hasattr(select, "kqueue"):
        # Python 2.6+ on BSD or Mac
        from tornado.platform.kqueue import KQueueIOLoop
        return KQueueIOLoop
    from tornado.platform.select import SelectIOLoop
    return SelectIOLoop

And each of these loop implementations passes its own polling object in as the impl argument:

class EPollIOLoop(PollIOLoop):
    def initialize(self, **kwargs):
        super(EPollIOLoop, self).initialize(impl=select.epoll(), **kwargs)

Looking at select.epoll, it follows the interface of a polling object: a
class in the Python standard library that has the ability to poll for
changes to file descriptors. If something happens to a file descriptor
(e.g. a socket receiving data), the polling object will return
the file descriptor that was triggered.

Different architectures have different polling objects
implemented. The available ones in tornado by default are:

  • epoll (Linux)
  • kqueue (OSX / BSD)
  • select (Windows)

In our case, this was happening on Linux, so we’ll look at epoll.

epoll

So what is epoll? It’s documented in the Python standard library, but
it’s a wrapper around the epoll Linux system calls.

The ioloop code actually looks like:

  • wait for epoll to return a file descriptor that has an event
  • execute the handler (which will presumably register another handler if another step is required, or not if it’s complete)
  • repeat.

epoll has two different configurations, but the one tornado uses is
edge-polling: it only triggers when a CHANGE occurs, vs when a
specific level is hit. In other words, it will only trigger when new
data is available: if the user decides to do nothing with the data,
epoll will not trigger again.

epoll works by registering file descriptors for the epoll object to
listen to. You can also stop listening to file descriptors as well.
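
As a rough, Linux-only illustration of that interface (a standalone sketch, not tornado code):

import select
import socket

# create a listening socket so we have a file descriptor to watch
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

poller = select.epoll()
# register the socket's fd; EPOLLIN means "data is available to read"
poller.register(server.fileno(), select.EPOLLIN)

# poll() blocks (here for at most one second) and returns
# (fd, event_mask) pairs for every fd that had an event
for fd, event in poller.poll(1):
    print(fd, event)

# stop listening to the fd when we're done with it
poller.unregister(server.fileno())
poller.close()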

So epoll works great for an event loop. But is it possible to somehow
register file descriptors to the epoll/impl object without using the
method above?

epoll and os.fork

It isn’t possible to register things outside of the impl
object. But, os.fork can cause some weird behaviour here. See, the way
that one interfaces with epoll is using file descriptors: you have an
fd to the epoll object, and you can use Linux system calls to work
with that:

As mentioned previously, file descriptors are a common way to reference
some object when using Linux kernel system calls.

Another common system call is fork. The
documentation of fork specifies that fork is equivalent to:

  • copying the memory of the current process to a new space
  • spawning a new process that uses the new copy.

This is fine for most objects in memory, but what about file
descriptors, which reference some object outside of the memory space
of the current process?

In the case of file descriptors, the file descriptor is also cloned to
the new fork. In other words, both the parent and the child process
will have a reference to the same file descriptor.

So, what does this mean for epoll, which is just another file
descriptor under the hood? Well, you can probably guess.

It gets shared.
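
A tiny demonstration of that sharing: a pipe's file descriptors survive the fork in both processes, and both copies still point at the same kernel object.

import os

read_fd, write_fd = os.pipe()

if os.fork() == 0:
    # child: write through the fd it inherited from the parent
    os.write(write_fd, b"hello from the child")
    os._exit(0)
else:
    # parent: read from its copy of the fd; both copies refer
    # to the same underlying pipe
    os.waitpid(-1, 0)
    print(os.read(read_fd, 1024))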

How the bug works

So this is the crux of the issue. When an os.fork occurs, the parent
and the child share the SAME epoll. So for an IOLoop that was created
by the parent process, the child process uses the same epoll as well!

So, that allows a condition like this:

  1. the parent creates an IOLoop loop_1, with an epoll epoll_1
  2. the parent calls os.fork; the child’s copy of the loop (call it loop_2) shares the same epoll_1
  3. the parent starts its ioloop, and waits on epoll_1.poll()
  4. the child adds a handler for fd_2, registering fd_2 with the shared epoll_1
  5. the parent gets back fd_2 from the poll, but doesn’t have a handler for it, and raises the KeyError.

So this will pretty much happen at some point anytime a new ioloop is not created for a child process.

Here’s a repro script. I couldn’t figure out a good way to kill this
gracefully, so be warned this will need to be killed externally.

import logging
import select
import socket
import os
import time
import tornado.ioloop
import tornado.httpclient
import tornado.web

serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
serversocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
serversocket.bind(('127.0.0.1', 8080))
serversocket.listen(1)

logging.basicConfig()

# the ioloop (and its epoll fd) is created BEFORE the fork,
# so the parent and the child end up sharing the same epoll object.
loop = tornado.ioloop.IOLoop.current()

if os.fork():
    # parent: register a handler for the listening socket. only the
    # parent's _handlers dict learns about the fd, but the shared epoll
    # learns about it too. then trigger an event by connecting a client.
    handler = lambda *args, **kwargs: None
    loop.add_handler(serversocket.fileno(), handler, select.EPOLLIN)
    time.sleep(0.1)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(('127.0.0.1', 8080))
    client.send(b"foo")
else:
    # child: start the loop. the shared epoll reports the fd registered
    # by the parent, but the child's _handlers has no entry for it,
    # which raises the KeyError.
    loop.start()

How about gunicorn or tornado.multiprocess?

So how do you avoid this in gunicorn or tornado.multiprocess, which use
an os.fork? The best practice is to not create or start the ioloop until AFTER
the fork: calling IOLoop.instance() or IOLoop.current() before the fork creates an
ioloop whose epoll object will be shared by every child process, unless you explicitly clear it.
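
A minimal sketch of the safe ordering, using tornado's own fork helper (the 0 argument to fork_processes means one worker per CPU):

import tornado.ioloop
import tornado.process

# fork the worker processes FIRST...
tornado.process.fork_processes(0)

# ...then create and start the IOLoop, so each child gets its own
# loop (and therefore its own epoll file descriptor).
loop = tornado.ioloop.IOLoop.current()
loop.start()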

Gunicorn calls fork as it’s spawning a worker:

# gunicorn/arbiter.py
def spawn_worker(self):
    self.worker_age += 1
    worker = self.worker_class(self.worker_age, self.pid, self.LISTENERS,
                               self.app, self.timeout / 2.0,
                               self.cfg, self.log)
    self.cfg.pre_fork(self, worker)
    pid = os.fork()
    if pid != 0:
        self.WORKERS[pid] = worker
        return pid

Summary

Tornado is an awesome framework, but it’s not simple. However, thanks
to well documented pieces, it’s possible to diagnose even complex
issues like this, and do a bit of learning along the way.

Also, os.fork is not a complete guarantee that you’ll get a unique
instance of every object you use. Beware file descriptors.

Introducing transmute-core: quickly create documented, input validating APIs for any web framework

A majority of my career has been spent on building web services in
Python. Specifically, internal ones that have minimal or no UIs, and
speak REST (or
at least are rest-ish).

With each new service, I found myself re-implementing work to
make user-friendly REST APIs:

  • validation of incoming data, and descriptive errors when a field does not
    match the type or is otherwise invalid.
  • documenting said schema, providing UIs or wiki pages allowing users to
    understand what the API provides.
  • handling serialization to and from multiple content types (json, yaml)

This is maddening work to do over and over again, and details are
often missed: sometimes yaml is not supported for a particular API, or
there is a specific field that is not validated. Someone will ask about
an API that you changed, and forgot to document a new parameter. It’s hard to
scale API maintenance when you’re dealing with forgetting some minute boilerplate.

This was further exacerbated by using different web frameworks for
different projects. Every framework provides their own REST plugin or
library, and often there’s a lack of functional parity, or declaring
an API is completely different and requires learning multiple
approaches.

So with this monumental pain in mind, what if I told you that you can get an API that:

  • validates incoming data types
  • supports multiple content types
  • has a fully documented UI

Just by writing a vanilla Python function? And what if I told you that
this can work for YOUR Python framework of choice, with about 100 statements
of Python code?

Well, that’s what the transmute framework is.

How it works

transmute-core is
a library that provides tools to quickly implement rest APIs. It’s
designed to be consumed indirectly, through a thin layer that adapts
it to the style of the individual framework.

HTTP Endpoints

Here is an example of a GET endpoint in flask:

import flask_transmute

# flask-like decorator.
@flask_transmute.route(app, paths='/multiply')
# tell transmute what types are, which ensures validations
@flask_transmute.annotate({"left": int, "right": int, "return": int})
# the function is a vanilla Python function
def multiply(left, right):
    return left * right

And one in aiohttp, the web framework that uses Python 3’s asyncio:

import aiohttp_transmute

@aiohttp_transmute.describe(paths='/multiply')
# tell transmute what types are, which ensures validations
# Python3.5+ supports annotations natively
#
# request is provided by aiohttp.
def multiply(request, left: int, right: int) -> int:
    return left * right

aiohttp_transmute.route(app, multiply)

Both do the following:

  • generate a valid route in the target framework
  • detect the content type (yaml or json, and parse the body)
  • verify that the input parameters match the parameters specified, returning a 400 status
    code and details if not
  • write back yaml or json, depending on the content type

Note that we don’t have to deal with the content type serialization,
read from request objects, or returning a valid response object:
that’s all handled by transmute. This keeps the functions cleaner in
general: it looks similar to any other Python function.

Complex Schemas via Schematics (or any validation framework)

Primitive types in the parameters are OK, but it’s often the case that
more complex types are desired.

Schema declaration and validation already has multiple solutions,
so transmute defers this to other libraries. By default, transmute uses
schematics:

from schematics.models import Model
from schematics.types import StringType, IntType

class Card(Model):
    name = StringType()
    price = IntType()


# passing in a schematics model as the type enables
# validation and creation of the object when converted
# to an API.
@annotate({"card": Card})
def submit_card(card):
    db.save_card(card)

Of course, some may prefer other solutions, like marshmallow. In that
case, transmute-core provides a TransmuteContext for users to customize and use
their own implementation of transmute’s serializers:

from transmute_core import TransmuteContext, default_context

context = TransmuteContext(serializers=MySerializer())

route(app, fn, context=context)

# alternatively, you could modify the default context directly
# (be careful about where this code is called: it needs
# to happen before any routes are constructed)
default_context.serializers = MySerializer()

Documentation via Swagger

Swagger / OpenAPI allows one to define a REST API using json. Transmute generates
swagger json files based on the transmute routes added to an app, and transmute-core provides the static CSS and JavaScript
files required to render a nice documentation interface for it:

from flask_transmute import add_swagger

# reads all the transmute routes that have been added, extracts their
# swagger definitions, and generates a swagger json and an HTML page that renders it.
add_swagger(app, "/swagger.json", "/swagger")

This also means clients can be auto-generated as well: swagger has a
large number of open source projects dedicated to parsing and
generating swagger clients. However, I haven’t explored this too
deeply.

Lightweight Framework Implementations

Earlier in this post, it was mentioned that there should be a wrapper
around transmute-core for your framework, as the style of how to add
routes and how to extract values from requests may vary.

A goal of transmute was to make the framework-specific code as thin as
possible: this allows more re-use and common behavior across the
frameworks, enabling developers across frameworks to improve
functionality for everyone.

Two reference implementations exist, and they are very thin. As of this writing, they are at:

  • flask-transmute: 166 lines of code, 80 statements
  • aiohttp-transmute: 218 lines of code, 103 statements (a little bloated to support legacy APIs)

A one-page example for flask integration is also provided, to
illustrate what is required to create a new one. That’s 200 LOC with
comments, a little more than 100 without.

http://transmute-core.readthedocs.io/en/latest/creating_a_framework.html

Impressions

Frameworks are always a means to an end: it’s about reducing that
effort between what you want to build and actually building it.

I love great, well designed APIs. And dealing with the minutiae of
some detail I missed in boilerplate content type handling or object
serialization was draining the enjoyment out of authoring them. Since
I’ve started using transmute for all of my projects, it’s let me focus
on what I care about most: actually writing the functional code, and
designing the great interfaces that let people use them. For the most part,
it feels like just writing another function in Python.

The auto-documentation is freeing from both sides: as an author I can
keep my documentation in line with my implementation, because my
implementation is the source. For consumers, they’re immediately
provided with a simple UI where they can rapidly iterate with the API
call they would like to make.

It’s also great knowing I can use transmute in the next framework,
whatever that may be: I can take all the work and behavior that’s
embedded in transmute, with a module or two’s worth of code.

Conclusion

Give it a shot! Issues
and PRs
are welcome, and I’d love to see someone apply transmute to
another framework.

Global logging with flask

As of December 2016, Flask has a built-in
logger that it instantiates for you. Unfortunately, this misses
errors and other log messages from other libraries, which may also be
valuable.

It would be nice to have a single logger, one that captures BOTH
library AND app logs. For those that want a global logger, this may
take a few concepts to get right. You have to:

  1. undo flask’s logging
  2. set up your own logging
  3. set log levels, as the default may not suit you.

Combined, this ends up looking like:

import logging
import sys
from flask import Flask, current_app

LOG = logging.getLogger("my_log")
LOG2 = logging.getLogger(__name__ + ".toheunateh")
app = Flask(__name__)


@app.route("/")
def route():
    current_app.logger.info("flask logger: foo")
    LOG.info("log: foo")
    LOG2.info("log2: foo")
    return "hello!"


# create your own custom handler and formatter.
# you can also use logging.basicConfig() to get
# the python default.
out_hdlr = logging.StreamHandler(sys.stdout)
fmt = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
out_hdlr.setFormatter(fmt)
out_hdlr.setLevel(logging.INFO)
# append to the global logger.
logging.getLogger().addHandler(out_hdlr)
logging.getLogger().setLevel(logging.INFO)
# removing the handler and
# re-adding propagation ensures that
# the root handler gets the messages again.
app.logger.handlers = []
app.logger.propagate = True
app.run()

And you get the right messages. Voila!

The Dangers of Patching

If you’ve ever used Mock (or the built-in mock in python 3), you’ll
know how powerful a tool it can be for making unit tests of
state-modifying functions sane. Mocks in Python are effectively a probe
that you can send into a deep, dark function:

import mock

def test_write_hello_world():
    my_filehandle = mock.Mock()
    write_hello_world_to_handle(my_filehandle)
    my_filehandle.write.assert_called_with("hello world")

You can send in a fake object, have it experience what it’s like to be
a real object, and then ask it questions about what it was like.

The above example doesn’t really test a lot, but for more complex
cases, it can be a lifesaver: you know exactly what was called and
what wasn’t, and if your object modifies some real world state that
you don’t want to (such as a database), it prevents you
from performing dangerous operations.

Another well-known feature of the mock module is patch: a function that
gives you the ability to replace any object in python (in any module)
with a mocked object. An example usage is like this:

import mock

def test_linux():
    with mock.patch('platform.system') as system:
        system.return_value = 'Linux'
        import platform
        assert platform.system() == 'Linux'

Patch is powerful: it actually lets you replace modules, functions, and
values, even if they’re not imported in the current context!

But just because a tool is powerful, doesn’t mean you should use
it. In reality, patch should be a last resort: you should only use it
if there’s no other way to test your code.

But why? Patch basically makes mock even more flexible: you can
literally mock anything you are aware exists. But there are a couple of glaring issues:

It’s not foolproof

Let’s say I have a couple files like this:

# mock_test.py

from mymodule import is_my_os
try:
    from unittest import mock  # py3
except ImportError:
    import mock  # py2

with mock.patch('platform.system', return_value="my os"):
    assert is_my_os()
# mymodule.py
from platform import system

def is_my_os():
    return system() == "my os"

Now patch is patching the platform.system function, so this should pass. Let’s try it:

$ python mock_test.py
Traceback (most recent call last):
  File "./bin/python", line 42, in <module>
    exec(compile(f.read(), __file__, "exec"))
  File "/Users/tsutsumi/sandbox/mock_test.py", line 11, in <module>
    assert is_my_os()
AssertionError

That’s not what we expected! So what happened here?

Internally, every python module contains its own scope. Every import,
method declaration, variable declaration, and expression modifies
that scope in some way. So when you import anything, you are actually
adding a reference to that object into the module’s global scope. By the
time we actually mock ‘platform.system’, mymodule already contains
its own reference to the original ‘system’ function:

$ python
>>> import platform
>>> from platform import system
>>> import mock
>>> with mock.patch('platform.system') as mock_system:
...     print(mock_system)
...     print(system)
...     print(platform.system)
...
<MagicMock name='system' id='4307612752'>
<function system at 0x100bf9c80>
<MagicMock name='system' id='4307612752'>
>>>

So even if you do patch a method, you won’t necessarily patch all the
uses of that method, depending on how they’re imported. This
means your patching must directly match how the object you want to
mock is imported into the code under test.

For example, we can fix the mock_test.py file above by changing the patch:

# mock_test.py

from mymodule import is_my_os
try:
    from unittest import mock  # py3
except ImportError:
    import mock  # py2

with mock.patch('mymodule.system', return_value="my os"):
    assert is_my_os()

So in order to use patch effectively, you have to be aware of the exact
semantics by which a method is both imported and invoked. And this
leads to the ultimate problem with patch:

Really tightly coupling tests with implementation

Patching in general, regardless of the implementation, tightly couples
your test code and your regular code beyond the typical bounds of unit
testing. Once you get patching involved, you have to be conscious not
only of the effect of your code, but also of its
implementation. Modifying the internal code of the method also
requires modifying the test code. And if changing your code requires
changing your test code, you’re no longer guaranteed that the behavior
is identical just because the same tests pass.

Ultimately however, we don’t live in an ideal world. Times will come
when you have to test code that is hard to refactor into a method that
works with only mocks or actual objects. But with code you control,
it’s almost completely avoidable.

So how do we avoid patching?

Patching is the result of coupled complex state, relying on multiple
global variables. We can remedy this by doing the exact opposite:

  • decouple complex state
  • don’t rely on global variables

Let’s take a look at some practices to help with this:

Don’t use global variables

For example, let’s look at an object that creates a persistent db
connection based on configuration parameters:

db_connection = db_connect(DB_URL)

class MyObject:

    def __init__(self, name):
        self.name = name

    def save(self):
        db_connection.write(self.to_dict())

    def to_dict(self):
        return { 'name': self.name }

To test this object’s save method, you would have to either patch the
db_connection object, or replace DB_URL to point at a test
database. Either approach is an extra step away from what you really
want to test about the save method: that the db’s write method is called with the
dictionary representation of the object.

You can accomplish this without patch by passing in objects as you
need them: by explicitly passing them in, it makes it really easy to mock:

class MyObject:

    def __init__(self, name):
        self.name = name

    def save(self, db):
        db.write(self.to_dict())

    def to_dict(self):
        return { 'name': self.name }


def test_myobject_save():
    import mock
    my_object = MyObject("foo")
    db = mock.Mock()
    my_object.save(db)
    db.write.assert_called_with({
        'name': 'foo'
    })

Decouple complex state

Complex state coupling occurs when you attempt to hide a lot of the
difficulty with creating objects from a user. Using the database above, as an example:

class MyObject:

    def __init__(self, db_url, name):
        self._db = db_connection(db_url)
        self.name = name

    def save(self):
        self._db.write(self.to_dict())

    def to_dict():
        return { 'name': self.name }

Now the only way to actually test this save method (aside from a full
stack test) is to mock the db_connection method. It wouldn’t work to
assign the db attribute afterward (my_object._db = Mock()) because
the object would already be instantiated: your db
connection already exists, creating extra overhead you won’t use.

Instead of trying to hide the complex state from the user of your
class, let them actually choose the db object to pass in:

class MyObject:

    def __init__(self, db, name):
        self._db = db
        self.name = name

    def save(self):
        self._db.write(self.to_dict())

    def to_dict(self):
        return { 'name': self.name }


def test_myobject_save():
    import mock
    db = mock.Mock()
    my_object = MyObject(db, "foo")
    my_object.save()
    db.write.assert_called_with({
        'name': 'foo'
    })

This not only allows us to test operations on complex objects, but
also makes the class more flexible (e.g. compatible with more
db objects than just the one that db_connection returns).

Final thoughts

Once again, patch exists for a reason. It’s almost like a magic wand
that allows you to test otherwise untestable code. But this magic wand
comes at the cost of making your life harder the more you use it.

So all in all: beware the dangers of patching.

Pycon 2014: Day 2 Recap

The second day of PyCon was just as amazing as the first. The day
started with a couple keynote speakers:

Keynote: Jessica McKellar

Jessica McKellar gave a talk about the current state of the world
for programming classes in U.S. grade schools, and the numbers weren’t
pretty. The number of programming classes is thin as it is, and the
prospect for girls was even worse: female students made up less than a
third of the test takers, and two states didn’t have any female
students take AP computer science at all.

It’s a bit disheartening to know that this is the state of
C.S. education in the U.S., but I think that it’s not a hopeless
situation: a third of the attendees at PyCon this year were women,
which is phenomenal. In addition, there’s a lot of people who discover
the joys of programming after high school (including myself).

Ultimately though, the lesson of the talk was that we need more
professional programmers fighting against this wave. Unfortunately all
of my free time is spent on several other projects, but I’ll definitely
remember that education needs some help when I have a spare second.

Keynote: Fernando Pérez

Geez, iPython is amazing. It’s so much more than just a fancier python
interpreter. The science community made it into more of a matlab
Frankenstein, complete with math, data analysis, and more.

Fernando demoed the iPython notebook, which is leagues ahead of
anything I’ve seen in the scientific note taking community. Rich ways
to express data, easily extensible (a lightning talk speaker today added d3
support, which only makes it look all that more amazing).

My limited experience in the academic world makes me cringe at bad
programming practices: lack of version control and no good documentation
(and I was definitely a part of that) made life incredibly
difficult. I think Pérez and the rest of the iPython community members
are definitely turning this trend around, with a system that allows
live documentation and easily modifiable data, blowing past anything
we have on the private side. I’d love to take the concepts iPython is
pushing and see if I can’t make something new and powerful out of it.

Talk: Designing Django’s Migrations

I can definitely tell Andrew Godwin is a smart guy. His extensive time
maintaining Django has really given him a good sense of what works and
what doesn’t for a general web framework. His talk on designing
migrations was basically explaining how it works, but he did share
some of his insights from looking at the previous migration tool, south:

  • simplify the logic and abstract it out to a python construct: this
    allows for less hard-coded files in a Python file, which is what South does.
  • load the whole migration path into memory to build a meaningful
    context: with an idea of the whole upgrade path, it’s easy to see
    what sort of deltas exist. This will probably be ok for 90% of the common migrations, but
    data re-formatting still needs a very human touch.
  • keeping the layer that performs migrations functionally separate
    from the database translation layer: effectively ensuring that the
    migration layer only uses methods exposed by the ORM. Good idea in
    my humble opinion: keeping the architecture separate blocks the
    possibly that a database feature could someday have to be
    implemented twice for both the model and the migration.

All in all, a good way to dive into an Open Source maintainer’s head.

Talk: Designing Poetic APIs

Wow. Erik Rose really gets what
software should be. When people say that programming is like writing,
I have to say I was skeptical. I could see how being a good writer
helps from a documentation perspective, but I couldn’t say it directly
affected one’s programming ability. Well, this talk threw everything I
thought I knew out the window.

Erik talks about seven principles that can help one design really
clean apis from a consumer perspective:

  • Don’t predict the future: stop coding against what you think is
    going to happen years from now. It rarely leads down a good path.
  • Consistency: make sure that not only are all your methods performing
    actions that are consistent with the behavior of other methods in
    your library, but take it a step further and be consistent with all
    of Python as well. The principle of least astonishment applies just as much to your internal library as it does to a web application.
  • Brevity: your methods should return minimal data structures that
    perform what you want, and should require as few arguments as
    possible. Not doing so leads to incredibly complicated, possibly
    untestable, most likely incomprehensible code.
  • Composability: instead of having methods that do five things, try to
    decompose it in a way that allows consumers to chain operations
    together. Not only does this help with brevity, but it helps complex
    work-flows become simpler as well.
  • Plain Data: use built-in types when possible. Not only dict and
    list, but also constructs built-in to the language, such as
    ConfigParser. This allows an astounding level of compatibility
    across all sort of code.
  • Grooviness: I’m not sure what he means 100% by this, but I think
    grooviness has to do with how easily you can fall into a groove
    while coding. Things like checking docs, or digging through deep
    stack traces, really hamper a good work-flow.
  • Safety: states that are impossible shouldn’t be representable, and
    throwing an exception is better than a return value if the case is
    truly exceptional.

Seriously a mind-blowing talk. I’ve had feelings about a lot of these,
but to have someone qualify it with words really makes the point
clear. And this is where it all ties in to writing: poetry and coding
good apis require a very similar set of skills: having the ability to
express yourself in an eloquent, clear way. To be successful at either,
this ability is an absolute necessity.

Talk: Fast Python, Slow Python

I’m sure Alex Gaynor is an
incredibly smart person, but maybe his talk went over my head a bit. I
originally thought this was going to be a talk about practices that
would allow me to optimize Python, but he made me remember that
CPython is not the only Python around. His talk was actually about
making implementation-agnostic Python faster. He gave a few tips on
how to do this, but of course he didn’t really explain why it would be
faster. He gave using classes over dicts as an example of a performance
increase, arguing that dicts are object-object mappings and a class
provides a more explicit key. I’m not 100% sure why that would be any
better, considering one could optimize a dictionary in a very similar
way if you know what all the keys and values are going to be.

Not really a lot to be gleaned from this talk from my perspective,
unless you want to follow practices that would make you faster in PyPy
and possibly CPython (if you upgrade to the most recent version). Of
course that’s still not an implementation-agnostic performance increase.

Talk: What is Coming in Python Packaging

Great state of the world talk from Noah Kantrowitz. A lot of big
changes coming in the Python packaging world, including:

  • new PyPi, known as warehouse.
  • Wheels will take over at some point, but quite a few packages can’t
    use them due to their reliance on native dependencies and wheel
    can’t handle that for multiple platforms.
  • virtualenv is now in Python as of 3.4 (pyvenv)
  • twine is the defacto way to submit packages now, no longer using setup.py directly
  • devpi is apparently a popular local proxy for hosting packages internally.
  • the Python Packaging Authority (PyPA) is now responsible for all packaging technologies, including:
    • setuptools
    • pip
    • wheel
    • warehouse
    • virtualenv

So definitely a good talk to check out if you want to know what’s the way to go in today’s Python world.

Summary

Another great day at PyCon. It’s awesome being able to hear from the
horse’s mouth (so to speak) about the state of the world in
Python. Also an amazing set of lightning talks too. Learning a lot all
over the place really.

Really excited for day 3, and the dev sprints to follow!

Pycon 2014: Day 1 Recap

Today was my first day at a PyCon conference, and the first day of
PyCon 2014. I wanted to talk about some of my favorite events:

Keynote: John Perry Barlow

I didn’t know that the lyricist for The Grateful Dead also founded the
Electronic Frontier Foundation. I
think it shows you how technology merges paths of those from all walks
of life. An interesting talk about how exposing more information is
ultimately putting more power into the hands of the people: a lack of
information is how corporations are allowed to continue policies and
practices that are unfair and don’t benefit society.

Talk: All Your Ducks in A Row: Data Structures in the Standard Library and Beyond

Long title, but great talk from Brandon Rhodes. Coming from an embedded-ish
background, I’ve had curiosities about how Python’s data structures
work internally. This talk discussed very interesting concepts like:

  • C Structs exists in Python
  • how Python can build c-like arrays
  • how Python’s built-in arrays aren’t very good because they require
    converting the data into Python objects (and hence are only really good
    as a compact storage mechanism), and one should use NumPy’s arrays instead.
  • Python’s built-in binary search
  • Anything in queue is thread-safe

He’s done a lot of talks on Python’s Data Structures before as well,
so I definitely have to catch up there.

Talk: Twisted Mixing

Laurens Van Houtven gave a good talk on
how one can mix Twisted Python
into a variety of things. It seems like libraries exist to mix
anything into Twisted and Twisted into anything:

  • crochet is a library
    that creates a Twisted reactor that you can use whenever you
    need. Basically a Twisted-on-demand type model.
  • geventreactor to run Gevent and Twisted side-by-side.

So it just makes me think that a lot of people are pushing Twisted
forward. Definitely speaks in spades about a technology. Twisted very
well may be the future for async (for Python 2 at least)

Talk: Real-time Predictive Analytics using Scikit-learn and RabbitMQ

Decided to diverge a bit and go to a machine learning talk from
Michael Becker. Really awesome
stuff. The description for the talk talks about how he’s going to make
a simple service that detects what language a block of text is written
in. He shows you how it’s done, and it’s crazy simple: Scikit-learn
for the machine learning, RabbitMQ to maintain the task queue, and a
worker to pull from the queue (and the client server).

The crazy thing here was how powerful Scikit-learn really is, packing
complex algorithms such as various implementations of K-nearest-neighbors. It
makes me realize that academia is an incredibly powerful ally: getting
buy in from a community which solves very hard complex problems ends
up with amazing technology at the tip of your fingertips.

I’m definitely going to try something with Scikit-learn very soon.

Talk: Castle Anthrax: Dungeon Generation Techniques

Listened in on a talk from James King about how to generate
dungeons. Like a lot of pieces in game programming, a heavy algorithm
is required to generate the best results. James discussed a ton of generation algorithms, from:

  • taking a square and cutting it a bunch of different ways randomly.
  • generating random noise and then connecting with a minimal spanning tree.

He discussed methods such as Poisson disks, cellular automata,
and Perlin noise. I haven’t had time to grok all of it just yet, but
I’m definitely going to investigate those the next time (or really the first
time) I make a rogue-like game.

Talk: Fan-In Fan-Out

Brett Slatkin’s discussion was mainly supposed to be about the
advantages of a map-reduce type architecture of delegating work to
multiple machines and retrieving and aggregating data, but it felt
more like a demo of how awesome asyncio (the new async library in
Python 3.4) really is.

Regardless of what his goal was, it’s really cool to see an Async
library in Python. I’m just reading into it now, and I see that it’s
incredibly powerful: easily customizable, providing a lot of the
facilities that you see in some of the more traditionally concurrent
languages (such as the future idea). Definitely one huge reason to
move to Python3 if possible.

Dinner: PyCon Dinner with Brandon Rhodes

I signed up for the dinner not really knowing who Brandon Rhodes is,
but my more community-literate Python colleagues tell me he’s one of
the must-see/hears. The dinner with him was definitely enjoyable. I
ended up at a table with a lot of Pythonistas way more knowledgeable
than me, which was really important as the dinner involved a
three-round python trivia game. A lot of fun questions, and I learned
a lot about Python 3 and python in general.

Conclusions

My first day at PyCon (albeit a jet-lagged one, with less than five hours of
sleep), was awesome. Aside from all the great talks, the breakfasts
and lunches spent talking to other Python enthusiasts was an eye
opener. I definitely learned a lot, including:

  • flask, django, and pyramid are definitely the current trend of web
    frameworks that the Python community is using
  • Python 3’s addition of async (and the lack of it in Python 2) is a
    strong reason to move to three, and it’s only going to get stronger
    as Python 2 continues to stagnate.
  • Despite this, almost everyone is sticking to 2: it’s really hard to
    migrate your code to Python 3.

And that’s it for Day 1. Next, day 2!

Python Internals

The internals of Python are actually pretty straightforward, but it’s
still worth a dive. I recently gave a talk at Zillow about it, so I’d
thought I’d share some points here as well.

Everything here prefixed with >>> can be typed into the python
interpreter (activated by typing ‘python’ in your shell if you have
python installed). I strongly encourage playing and trying some of
this stuff out yourself!

Basics

At the core, everything in python is an object. Each object has three properties:

  • a unique identifier of the object via ‘id()’
  • a type of the object via ‘type()’
  • and its value

The base object is represented by the keyword ‘object’ in python:

>>> object
<type 'object'>

And you can always find the methods available on any object (i.e. anything) using ‘dir’:

>>> dir(object)
['__class__',
 '__delattr__',
 '__doc__',
 '__format__',
 '__getattribute__',
 '__hash__',
 '__init__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__']

So let’s talk a little bit about the more interesting ones:

  • __class__ returns the type of an object. If the object is a type, it returns the type ‘type’
  • __doc__ is the docstring attached to an object. These are the triple quotes contained directly below a method or class declaration.
  • __new__ is called whenever a new instance of an object is created. It almost always calls __init__.
  • __sizeof__ gets the size of the object. One can also use sys.getsizeof. This isn’t the most reliable number because it doesn’t include the size of referenced objects, just the size of the references themselves.
  • __delattr__, __getattribute__, and __setattr__ are used to manage the attributes of a particular object. However, you should use the (set|get|has)attr built-in functions instead of calling these directly.

Types

Types are a special kind of object in Python, designed to be
constructors for classes. It’s not possible to create a new object
(aside from built-in shorthand like {} for dictionaries and [] for
lists) without using a type object and instantiating something with it:

>>> object()
<object object at 0x7f1e14eee080>

exec, eval, and compile

exec, eval, and compile are also built-in functions in Python. They
compile and evaluate code.

‘exec’ executes a particular string of code

>>> exec("print 'hello world'")
hello world

‘eval’ evaluates an expression. Note: this cannot be a statement, e.g. assigning a value.

>>> eval("1")
1

‘compile’ compiles an expression or statement into a ‘code’ object,
which actually contains the byte-compiled executable code, and is
what ultimately gets executed by Python.

Note that you have to choose whether to ‘eval’ or ‘exec’ the string
passed: that mode is compile’s third argument, and the second argument
is the filename to associate with the code (it shows up in tracebacks).

>>> compile('print "hello world"', './test.py', 'exec')

Functions

Functions (or methods) consist of two objects:

  • a code object, containing the bytecode for a particular object
  • a globals dictionary, containing the global variables necessary

One can’t instantiate functions directly, so we have to get the type of a function first:

>>> ftype = type(lambda: None)
>>> fn = ftype(compile('print test', '', 'exec'), {'test': "hello world"})
>>> fn
<function <module>>
>>> fn()
hello world

So what’s actually going on here?

  • I get the type object of a function. The easiest method to do this is
    to get the type of a lambda which returns None. Since the
    type of the lambda is a ‘function’, it’s the quickest way to get
    what we need.
  • I compile a code object from the string 'print test'.
  • I pass in a globals dictionary that supplies the ‘test’ variable the
    code refers to; the function type builds a callable out of the two.

If you wanted to modify a function directly, you can! There’s a large
number of attributes available that you can play with.

>>> filter(lambda x: x.startswith('func'), dir(fn))
['func_closure',
 'func_code',
 'func_defaults',
 'func_dict',
 'func_doc',
 'func_globals',
 'func_name']
>>> fn.func_name
'<module>'
>>> fn.func_name = 'hello_world'
>>> fn.func_name
'hello_world'
>>> fn.func_code = compile('print "not " + test', '', 'exec')
>>> fn()
not hello world
>>> fn.func_globals['test'] = "goodbye world"
>>> fn()
not goodbye world

Classes

Classes are basically just custom types. How can you tell? They’re made by using the ‘type’ constructor!

The ‘type’ method can not only return the type of an object, it can
create one for you too! Since ‘type’ is a type object, it can be used
to instantiate new types.

>>> a = type('MyClassType', (), {'test': lambda self: 1 })
>>> b = a()
>>> b.test
<bound method MyClassType.<lambda> of <__main__.MyClassType object at 0x7f524b71e510>>
>>> b.test()
1

The syntax is:

type(name, parents, attributes)

  • name: the name of the new type
  • parents: a tuple of references to the parent classes
  • attributes: a dictionary mapping the names of the class’s attributes to their values.

Python’s objects are incredibly malleable. You can actually modify class methods directly:

>>> a.test = lambda self: "noooo!"
>>> b.test()
'noooo!'

Although you can also override the method on the instance directly (note that a function assigned on the instance isn’t bound, so it doesn’t receive self):

>>> b.test = lambda: "yes!"
>>> b.test()
'yes!'

So how does this work? Well, every python object whose type isn’t a
built-in (think str, int) contains a dictionary-like object with all
of its attributes. This can be viewed via the “__dict__” attribute of an object:

>>> class ABC:
...     pass
...
>>> instance = ABC()
>>> print instance
<__main__.ABC instance at 0x19879e0>
>>> instance.__dict__
{}

So how does Python know which attribute to use? This is actually
dictated by a method! If you noticed, when I ran dir() on the
object, there was an attribute ‘__getattribute__’. This method
defaults to:

  • if the attribute is in the object’s own __dict__, then use that value.
  • if not, the lookup falls through to the parent’s __getattribute__ method,
    which of course recurses up its own parents if it can’t find the attribute.

One of the things about __dict__ is it’s not directly writable. If you
want to modify attributes on an object, python provides built-in
functions for this:

  • hasattr(foo, ‘bar’) returns true if the object foo has the attribute ‘bar’
  • getattr(foo, ‘bar’) returns the attribute foo.bar
  • setattr(foo, ‘bar’, val) is equivalent to foo.bar = val
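
A quick illustration of these in the interpreter:

>>> class Foo(object):
...     pass
...
>>> foo = Foo()
>>> setattr(foo, 'bar', 42)    # same as foo.bar = 42
>>> hasattr(foo, 'bar')
True
>>> getattr(foo, 'bar')
42
>>> foo.__dict__
{'bar': 42}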

Back to classes/types, there are some interesting hidden features as well:

You can find out all the superclasses of a ‘type’ with .__bases__:

>>> a.__bases__
(object,)

And all subclasses:

>>> str.__subclasses__()
[<class 'apt.package.__dstr'>]

So how could I find all the classes in my scope? Since everything is
an object, we just find all subclasses of it.

>>> object.__subclasses__()

Pop Quiz: Is object a subclass of type, or vice versa?

Answer: both are instances of each other (kind of), but only type is a subclass of object:

>>> isinstance(object, type)
True
>>> isinstance(type, object)
True
>>> issubclass(object, type)
False
>>> issubclass(type, object)
True

Frames

Want to look at the stack frames within python? That’s possible too.

>>> import sys
>>> sys._getframe()

Will get you an instance of the existing frame, with references to the variables in the inner scope, outer scope, and more!
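
For example, a frame’s f_back attribute points at the caller’s frame, and f_locals exposes that frame’s local variables (a small sketch):

>>> import sys
>>> def outer():
...     secret = "outer value"
...     return inner()
...
>>> def inner():
...     caller = sys._getframe().f_back
...     return caller.f_locals['secret']
...
>>> outer()
'outer value'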

Conclusion

There’s a lot of interesting stuff going on under the hood of Python,
way beyond the brief discussion I covered here. The interpretive
nature of python is one that promotes exploration, so don’t hesitate!
Explore the wonderful world of python internals.

DRY Principles through Python Decorators

Python decorators are a
powerful tool to remove redundancy. Along with modularizing
functionality into appropriate bite-sized methods, it makes even the
most complex workflows into concise functionality.

For example, let’s look at the Django web framework, which handles
requests with methods that receive a request object and return a
response object:

def handle_request(request):
    return HttpResponse("Hello, World")

A case I ran into recently was having to write several api methods
which must:

  • return json responses
  • some must return an error code if it’s a GET request vs a POST

As an example, for a register api endpoint, I would write something like this:

def register(request):
    result = None
    # check for post only
    if request.method != 'POST':
        result = {"error": "this method only accepts posts!"}
    else:
        try:
            user = User.objects.create_user(request.POST['username'],
                                            request.POST['email'],
                                            request.POST['password'])
            # optional fields
            for field in ['first_name', 'last_name']:
                if field in request.POST:
                    setattr(user, field, request.POST[field])
            user.save()
            result = {"success": True}
        except KeyError as e:
            result = {"error": str(e) }
    response = HttpResponse(json.dumps(result))
    if "error" in result:
        response.status_code = 500
    return response

However, I’m going to need JSON responses and error handling in pretty
much every API method I create, which would mean reproducing most of
this logic over and over again. Let’s try implementing some DRY principles with decorators.

Decorator Introduction

If you’re not familiar with decorators, they are effectively function
wrappers that are applied when the Python interpreter loads the function,
and they can modify what the function receives and returns. For example, if
I wanted to always return an integer one larger than whatever the wrapped
function returned, I could write my decorator like so:

# a decorator receives the method it's wrapping as a variable 'f'
def increment(f):
    # we use arbitrary args and keywords to
    # ensure we grab all the input arguments.
    def wrapped_f(*args, **kw):
        # note we call f with the arguments passed into the wrapper,
        # then cast the result to an int and increment it.
        return int(f(*args, **kw)) + 1
    return wrapped_f  # the wrapped function gets returned.

And now we can use it to decorate another method using the ‘@’ symbol:

@increment
def plus(a, b):
    return a + b

result = plus(4, 6)
assert result == 11, "We wrote our decorator wrong!"

Decorators modify the existing function and rebind its name to whatever
the decorator returns. In this case, ‘plus’ really refers to the result
of increment(plus), as the equivalent form below shows.
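
In other words, the ‘@’ syntax is just shorthand for reassigning the name by hand; this is equivalent to the decorated version above:

def plus(a, b):
    return a + b

plus = increment(plus)  # exactly what the @increment line does for us

assert plus(4, 6) == 11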

Return an error on non-post requests

Now let’s apply decorators to something useful. Let’s make a decorator
that returns an error response if the request received isn’t a POST request in
django:

def post_only(f):
    """ Ensures a method is post only """
    def wrapped_f(request):
        if request.method != "POST":
            response = HttpResponse(json.dumps(
                {"error": "this method only accepts posts!"}))
            response.status_code = 500
            return response
        return f(request)
    return wrapped_f

Now, we can apply this to our register API method above:

@post_only
def register(request):
    result = None
    try:
        user = User.objects.create_user(request.POST['username'],
                                        request.POST['email'],
                                        request.POST['password'])
        # optional fields
        for field in ['first_name', 'last_name']:
            if field in request.POST:
                setattr(user, field, request.POST[field])
        user.save()
        result = {"success": True}
    except KeyError as e:
        result = {"error": str(e) }
    response = HttpResponse(json.dumps(result))
    if "error" in result:
        response.status_code = 500
    return response

And now we have a reusable decorator we can apply to every API method we have.

Send the response as json

To send the response as JSON (and also handle the 500 status code
while we’re at it), we can just create another decorator:

def json_response(f):
    """ Return the response as json, and return a 500 error code if an error exists """
    def wrapped(*args, **kwargs):
        result = f(*args, **kwargs)
        response = HttpResponse(json.dumps(result))
        if isinstance(result, dict) and 'error' in result:
            response.status_code = 500
        return response
    return wrapped

Now we can remove the JSON code from our methods, and add a decorator instead:

@post_only
@json_response
def register(request):
    try:
        user = User.objects.create_user(request.POST['username'],
                                        request.POST['email'],
                                        request.POST['password'])
        # optional fields
        for field in ['first_name', 'last_name']:
            if field in request.POST:
                setattr(user, field, request.POST[field])
        user.save()
        return {"success": True}
    except KeyError as e:
        return {"error": str(e) }

Now, if I need to write a new method, I can just use these decorators
to handle the redundant work. If I need to make a sign-in method, I
only have to write the code that’s actually specific to signing in:

@post_only
@json_response
def login(request):
    if request.user.is_authenticated:
        return {"error": "User is already authenticated!"}
    user = auth.authenticate(username=request.POST['username'],
                             password=request.POST['password'])
    if user is not None:
        if not user.is_active:
            return {"error": "User is inactive"}
        auth.login(request, user)
        return {"success": True, "id": user.pk}
    else:
        return {"error": "User does not exist with those credentials"}

BONUS: parameterizing your request method

I’ve used the TurboGears
framework for Python, and something I’ve fallen in love with is the
way query parameters are interpreted and passed directly into the
method. So how can I mimic this behaviour in Django? Well, a decorator
is one way!

Here’s one:

def parameterize_request(types=("POST",)):
    """
    Parameterize the request instead of parsing the request directly.
    Only the types specified will be added to the query parameters.

    e.g. convert a=test&b=cv in request.POST to
    f(a=test, b=cv)
    """
    def wrapper(f):
        def wrapped(request):
            kw = {}
            if "GET" in types:
                for k, v in request.GET.items():
                    kw[k] = v
            if "POST" in types:
                for k, v in request.POST.items():
                    kw[k] = v
            return f(request, **kw)
        return wrapped
    return wrapper

Note that this is an example of a parameterized decorator. In this
case, calling the outer function returns the actual decorator, which is
then applied to the view (see the equivalent form below).
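
Without the ‘@’ syntax, the two-step application would look like this (the intermediate name is just for illustration):

post_parameterizer = parameterize_request(["POST"])  # calling the outer function builds the decorator
register = post_parameterizer(register)              # the decorator then wraps the view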

Now, I can write my methods with parameterized arguments! I can even
choose whether to allow GET and POST, or just one type of
query parameter.

@post_only
@json_response
@parameterize_request(["POST"])
def register(request, username, email, password,
             first_name=None, last_name=None):
    user = User.objects.create_user(username, email, password)
    user.first_name=first_name
    user.last_name=last_name
    user.save()
    return {"success": True}

Now, we have a succinct, easily understandable API!

BONUS #2: Using functools.wraps to preserve docstrings and function name

(Credit goes to Wes Turner for pointing this out)

Unfortunately, one of the side effects of using decorators is that the
method’s name (__name__) and docstring (__doc__) values are not
preserved:

def increment(f):
    """ Increment a function result """
    def wrapped_f(a, b):
        return f(a, b) + 1
    return wrapped_f

@increment
def plus(a, b):
    """ Add two things together """
    return a + b

plus.__name__  # this is now 'wrapped_f' instead of 'plus'
plus.__doc__   # this is now None instead of 'Add two things together'

This causes issues for applications which use reflection, like Sphinx,
a library that automatically generates documentation for your
Python code.

To resolve this, you can use a ‘wraps’ decorator to attach the name and docstring:

from functools import wraps

def increment(f):
    """ Increment a function result """
    @wraps(f)
    def wrapped_f(a, b):
        return f(a, b) + 1
    return wrapped_f

@increment
def plus(a, b):
    """ Add two things together """
    return a + b

plus.__name__  # this returns 'plus'
plus.__doc__   # this returns 'Add two things together'

BONUS #3: Using the ‘decorator’ decorator

(Credit goes to LeszekSwirski for
this awesome tip.)

** NOTE ** : Elghinn mentions in the comments that there are caveats
to using this decorator.

If you look at the decorators above, there is a lot of repetition
going on there as well, in declaring and returning the wrapper functions.

You can install the Python package ‘decorator’, which provides a ‘decorator’
decorator that handles this boilerplate for you!

With easy_install:

$ sudo easy_install decorator

Or Pip:

$ pip install decorator

Then you can simply write:

from decorator import decorator

@decorator
def post_only(f, request):
    """ Ensures a method is post only """
    if request.method != "POST":
        response = HttpResponse(json.dumps(
            {"error": "this method only accepts posts!"}))
        response.status_code = 500
        return response
    return f(request)

What’s even more awesome about this decorator is that it preserves
__name__ and __doc__, giving you the functionality of functools.wraps
as well!
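
For instance, applying the decorator-built post_only to a view keeps the view’s own metadata (the view below is just a stand-in):

@post_only
def register(request):
    """ Register a new user """
    ...

register.__name__  # 'register', not a generic wrapper name
register.__doc__   # ' Register a new user '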