Crafting pelican-export in 6 hours.

Over the past two or three days, I spent some deep work time on writing pelican-export, a tool to export posts from the Pelican static site generator to WordPress (with some easy hooks to add more). Overall I was happy with the project, not only because it was successful, but because I was able to get to something complete in a pretty short period of time: 6 hours. Reflecting, I owe this to the techniques I’ve learned to prototype quickly.

Here’s a timeline of how I iterated, with some analysis.

[20 minutes] Finding Prior Art

Before I start any project, I try to at least do a few quick web searches to see if what I want already exists. Searching for “pelican to wordpress” pulled up this blog post:

https://code.zoia.org/2016/11/29/migrating-from-pelican-to-wordpress/

Which pointed at a git repo:

https://github.com/robertozoia/pelican-to-wordpress

Fantastic! Something exists that I can use. Even if it doesn’t work off the bat, I can probably fix it, use it, and be on my way.

[60m] Trying to use pelican-to-wordpress

I started by cloning the repo and looking through the code. From here I got some great ideas for quickly building this integration (e.g. discovering the python-wordpress-xmlrpc library). Unfortunately the code only supported markdown (my posts are in reStructuredText), and there were a few things I wasn’t a fan of (constants, including the password, hard-coded in a file), so I decided to start doing some light refactoring.

I started organizing things into a package structure, and tried to use the Pelican Python package itself to do things like read the file contents (saves me the need to parse the text myself). While looking for those docs, I stumbled upon some issues in the pelican repository, suggesting that for exporting, one would want to write a plugin:

https://github.com/getpelican/pelican/issues/2143

At this point, I decided to explore plugins.

[60m] Scaffolding and plugin structure.

Looking through the plugin docs, writing a plugin seemed much easier than trying to read in the pelican posts myself; I had limited success with instantiating a pelican reader object directly, as it expects specific configuration variables.

So I started authoring a real package. Copying in the package scaffolding like setup.py from another repo, I added the minimum integration I needed to actually install the plugin into pelican and run it.
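
To give a sense of what that minimal integration involves, here is a rough sketch of a pelican plugin hookup. The module name and the handler body are illustrative rather than the actual pelican-export code, but register() and pelican.signals are the real extension points pelican plugins use:

    # illustrative sketch of a pelican plugin module, not the actual
    # pelican-export code.
    from pelican import signals


    def export_articles(article_generator):
        # article_generator.articles holds the parsed content objects
        for article in article_generator.articles:
            print(article.title, article.slug)


    def register():
        # pelican calls register() on each configured plugin; connect
        # the handler to a generator signal here.
        signals.article_generator_finalized.connect(export_articles)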

[60m] Rapid iteration with pdb.

At that point, I added a pdb statement into the integration, so I could quickly look at the data structures. Using that I crafted the code to migrate post formats in a few minutes:

    def process_post(self, content) -> Optional[WordPressPost]:
        """Create a wordpress post based on pelican content"""
        if content.status == "draft":
            return None
        post = WordPressPost()
        post.title = content.title
        post.slug = content.slug
        post.content = content.content
        # this conversion is required, as pelican uses a SafeDateTime
        # that python-wordpress-xmlrpc doesn't recognize as a valid date.
        post.date = datetime.fromisoformat(content.date.isoformat())
        post.term_names = {
            "category": [content.category.name],
        }
        if hasattr(content, "tags"):
            post.term_names["post_tag"] = [tag.name for tag in content.tags]
        return post

I added a similar pdb statement to the “finalized” pelican signal, and tested the client with hard-coded values. I was done as far as functionality was concerned!
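
To illustrate, the hard-coded client test can look roughly like the sketch below; the URL and credentials are placeholders, and the client calls come from python-wordpress-xmlrpc:

    # illustrative sketch; URL and credentials are placeholders.
    from wordpress_xmlrpc import Client
    from wordpress_xmlrpc.methods.posts import NewPost


    def publish(posts):
        client = Client("https://example.com/xmlrpc.php", "user", "password")
        for post in posts:
            if post is not None:
                client.call(NewPost(post))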

[180m] Code cleanup and publishing

The bulk of my time after that was just smaller cleanup that I wanted to do from a code hygiene standpoint. Things like:

  • [70m] making the wordpress integration an interface, so it’s easy to hook in other exporters.
  • [40m] adding a configuration pattern to enable hooking in other exporters (a sketch of what this could look like follows this list).
  • [10m] renaming the repo to its final name of pelican-export
  • [30m] adding readme and documentation.
  • [30m] publishing the package to pypi.
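
To give a flavor of that configuration pattern, the hookup in pelicanconf.py might look roughly like the sketch below. PLUGINS is a standard pelican setting; the exporter settings are hypothetical names used purely for illustration, not the published configuration keys:

    # pelicanconf.py -- the exporter settings here are hypothetical,
    # illustrating the pattern rather than the actual configuration keys.
    PLUGINS = ["pelican_export"]

    EXPORTERS = {
        "wordpress": {
            "url": "https://example.com/xmlrpc.php",
            "username": "user",
            "password": "password",
        },
    }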

This was half of my time! Interesting how much time is spent just ensuring the right structure and practices for the long term.

Takeaways

I took every shortcut in my book to arrive at something functional, as quickly as I could. Techniques that saved me tons of time were:

  • Looking for prior art. Brainstorming how to do the work myself would have meant investigating potential avenues and evaluating how long each would take. Having an existing example, even if it didn’t work for me, helped me ramp up on the problem quickly.
  • Throwing code away. I had a significant amount of modified code in my forked exporter. But continuing down that route would have meant a significant investment in hacking on and understanding the pelican library. Seeing that the plugin route existed, and testing it out, saved me several hours of trying to hack an interface to private pelican APIs.
  • Using pdb to live-write code. In Python especially, there’s no replacement for just introspecting and trying things. Authoring just enough code to integrate as a plugin gave me a fast feedback loop, and throwing in a pdb statement to quickly learn the data structures helped me find the ideal structure in about 10 minutes.

There was also a fair bit of Python expertise that I used to drive down the coding time, but what’s interesting is the biggest contributors to time savings were process: knowing the tricks on taking the right code approach, and iterating quickly, helped me get this done in effectively a single work day.

Aiohttp vs Multithreaded Flask for High I/O Applications

Over the past year, my team has been making the transition from Flask to
aiohttp. We’re making this transition because of the situations where
non-blocking I/O theoretically scales better:

  • large numbers of simultaneous connections
  • remote http requests with long response times

There is agreement that asyncio scales better memory-wise: a green thread
in Python consumes less memory than a system thread.

However, performance for latency and load is a bit more contentious. The best
way to find out is a practical experiment, so I forked py-frameworks-benchmark
and designed one.

The Experiment

The conditions of the web application, and the work performed, are identical:

  • a route on a web server that queries an nginx server over http (which
    returns back html), and returns the response as json (see the sketch after this list)
  • a wrk benchmark run, with 400 concurrent requests for 20 seconds
  • running under gunicorn, with two worker processes.
  • python3.6
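
To make the setup concrete, here is a rough sketch of what the aiohttp variant of that route can look like; this is not the exact benchmark code from the fork, and the nginx URL and route path are placeholders:

import aiohttp
from aiohttp import web


async def handler(request):
    # fetch the html page from the local nginx server...
    async with aiohttp.ClientSession() as session:
        async with session.get("http://127.0.0.1:8080/") as resp:
            body = await resp.text()
    # ...and return a json response based on it.
    return web.json_response({"length": len(body)})


app = web.Application()
app.router.add_get("/test", handler)

# run under gunicorn with something like:
#   gunicorn module:app --worker-class aiohttp.GunicornWebWorker --workers 2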

The Variants

The variants are:

  • aiohttp
  • flask + meinheld
  • flask + gevent
  • flask + multithreading, with thread counts varying from 10 to 1000.

Results

variant         min (ms)  p50 (ms)  p99 (ms)  p99.9 (ms)  max (ms)  mean (ms)  duration (s)  requests
aiohttp           163.27    247.72    352.75      404.59   1414.08     257.59         20.10     30702
flask:gevent       85.02    945.17   6587.19     8177.32   8192.75    1207.66         20.08      7491
flask:meinheld    124.99   2526.55   6753.13     6857.55   6857.55    3036.93         20.10       190
flask:10          163.05   4419.11   4505.59     4659.46   4667.55    3880.05         20.05      1797
flask:20          110.23   2368.20   3140.01     3434.39   3476.06    2163.02         20.09      3364
flask:50          122.17    472.98   3978.68     8599.01   9845.94     541.13         20.10      4606
flask:100         118.26    499.16   4428.77     8714.60   9987.37     556.77         20.10      4555
flask:200         112.06    459.85   4493.61     8548.99   9683.27     527.02         20.10      4378
flask:400         121.63    526.72   3195.23     8069.06   9686.35     580.54         20.06      4336
flask:800         127.94    430.07   4503.95     8653.69   9722.19     514.47         20.09      4381
flask:1000        184.76    732.21   1919.72     5323.73   7364.60     786.26         20.04      4121

You can probably get a sense that aiohttp can serve more requests than any
other variant. To get a real sense of how threads scale, we can put the request
count on a chart:

[chart: requests served per variant]

The interesting note is that the meinheld worker didn’t scale very well at all.
Gevent handled requests faster than any threading implementation.

But nothing handled nearly as many requests as aiohttp.

These are the results on my machine. I’d strongly suggest you try the experiment
for yourself: the code is available in my fork.

If anyone has any improvements on the multithreading side, or can explain the discrepancy in performance, I’d love to understand more.

KeyError in self._handlers: a journey deep into Tornado’s internals

If you’ve worked with tornado, you may have encountered a traceback of
a somewhat bewildering error:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/tornado/ioloop.py", line 832, in start
    fd_obj, handler_func = self._handlers[fd]
KeyError: 16

A few other people have been confused as well. After some digging, and a combination
of learning about the event loop, fork, and epoll, the answer finally came into focus.

TLDR

If you’re looking for the solution: don’t call or start IOLoops before
an os.fork. Forking happens in web servers like gunicorn, as well as in
tornado.multiprocess, so be aware of that caveat there as well.

But why does this happen?

As I mentioned previously, this is a combination of behaviour across the
system, Python, and tornado layers of the stack. Let’s start with
learning more about that error specifically.

The code the traceback refers to occurs in the IOLoop:

# tornado/ioloop.py
self._events.update(event_pairs)
while self._events:
    fd, events = self._events.popitem()
    try:
        fd_obj, handler_func = self._handlers[fd]
        handler_func(fd_obj, events)

What are these variables? You can read the IOLoop code yourself, but effectively:

  • _handlers holds the callbacks that should be called once an async event is complete.
  • _events holds the events that have occurred and still need to be handled.

What is an FD?

The handlers and events are both keyed off of file descriptors. In a
few words, file descriptors represent a handle to some open file. In
unix, a pattern has propagated where a lot of resources (devices,
cgroups, active/inactive state) are referenced via file descriptors:
it became a lingua franca for low level resources because a lot of
tooling knows how to work with file descriptors, and writing and
reading to a file is simple.

They’re useful for tornado because sockets also have a file descriptor
representing them. So the tornado ioloop can wait for an event
affecting a socket, then pass that socket to a handler when a socket
event fires (e.g. some new data arrives in the socket buffer).

What modifies the events and handlers?

A KeyError on _handlers means there’s a key in _events that is not in
_handlers: some code is causing events to be added to the ioloop
without registering a handler for them at the same time. So how does
that happen in the code?

A good starting point is looking where _handlers and _events are
modified in the code. In all of the tornado code, there’s only a
couple places:

# tornado/ioloop.py
def add_handler(self, fd, handler, events):
    fd, obj = self.split_fd(fd)
    self._handlers[fd] = (obj, stack_context.wrap(handler))
    self._impl.register(fd, events | self.ERROR)

# tornado/ioloop.py
def remove_handler(self, fd):
    fd, obj = self.split_fd(fd)
    self._handlers.pop(fd, None)
    self._events.pop(fd, None)
    try:
        self._impl.unregister(fd)
    except Exception:
        gen_log.debug("Error deleting fd from IOLoop", exc_info=True)

Looking at these pieces, the code is pretty solid:

  • handlers are added only in add_handler, where they are also registered with _impl.register
  • handlers are only removed in remove_handler, where they are removed from _events, _handlers, and _impl
  • events are added to _events in _impl.poll()

So removing a handler always makes sure that _events no longer contains
its fd, and it removes the fd from this impl thing too.

But what is impl? Could impl be adding fd’s for events that don’t have handlers?

impl: polling objects

It turns out _impl is chosen based on the OS. There is a little bit of
indirection here, but the IOLoop class in tornado extends a configurable object,
which selects the class based on the method configurable_default:

# tornado/ioloop.py
@classmethod
def configurable_default(cls):
    if hasattr(select, "epoll"):
        from tornado.platform.epoll import EPollIOLoop
        return EPollIOLoop
    if hasattr(select, "kqueue"):
        # Python 2.6+ on BSD or Mac
        from tornado.platform.kqueue import KQueueIOLoop
        return KQueueIOLoop
    from tornado.platform.select import SelectIOLoop
    return SelectIOLoop

And each of these loop implementations passes its own polling object as the impl argument:

class EPollIOLoop(PollIOLoop):
    def initialize(self, **kwargs):
        super(EPollIOLoop, self).initialize(impl=select.epoll(), **kwargs)

Looking at select.epoll, it follows the interface of a polling object: a
class in the Python standard library that has the ability to poll for
changes to file descriptors. If something happens to a file descriptor
(e.g. a socket receiving data), the polling object will return
back the file descriptor that was triggered.

Different architectures have different polling objects
implemented. The ones available in tornado by default are:

  • epoll (Linux)
  • kqueue (OSX / BSD)
  • select (Windows)

In our case, this was happening on Linux, so we’ll look at epoll.

epoll

So what is epoll? It’s documented in the Python standard library, but
it’s a wrapper around the epoll Linux system calls.

The ioloop code actually looks like:

  • wait for epoll to return a file descriptor that has an event
  • execute the handler (which will presumably register another handler if another step is required, or not if it’s complete)
  • repeat.

epoll has two different modes, and the one tornado uses is
edge-triggered: it only triggers when a change occurs, as opposed to
whenever a specific level is held. In other words, it will only trigger
when new data becomes available: if the user decides to do nothing with
the data, epoll will not trigger again.

epoll works by registering file descriptors for the epoll object to
listen to; you can also unregister them to stop listening.
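
As a minimal illustration of that interface, here is what registering, polling, and unregistering looks like when using Python’s select.epoll directly (Linux only); the socket here is just a stand-in for something with a file descriptor:

import select
import socket

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)

ep = select.epoll()
ep.register(server.fileno(), select.EPOLLIN)  # listen for "readable" events on this fd
events = ep.poll(0.1)                         # [(fd, event_mask), ...] for triggered fds
ep.unregister(server.fileno())                # stop listening to this fd
ep.close()
server.close()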

So epoll works great for an event loop. But is it possible to somehow
register file descriptors to the epoll/impl object without using the
method above?

epoll and os.fork

It isn’t possible to register things outside of the impl
object. But os.fork can cause some weird behaviour here. The way
that one interfaces with epoll is via file descriptors: you have an
fd to the epoll object, and you use Linux system calls to work
with it.

As mentioned previously, file descriptors are a common way to reference
some object when using Linux kernel system calls.

Another common system call is fork. The
documentation of fork specifies that fork is equivalent to:

  • copying the memory of the current process to a new space
  • spawning a new process that uses the new copy.

This is fine for most objects in memory, but what about file
descriptors, which reference some object outside of the memory space
of the current process?

In the case of file descriptors, the file descriptor is also cloned to
the new fork. In other words, both the parent and the child process
will have a reference to the same file descriptor.

So, what does this mean for epoll, which is just another file
descriptor under the hood? Well, you can probably guess.

It gets shared.

How the bug works

So this is the crux of the issue. When an os.fork occurs, the parent
and the child share the SAME epoll. So for an IOLoop that was created
by the parent process, the child process uses the same epoll as well!

So, that allows a condition like this:

  1. parent creates an IOLoop loop_1, with an epoll epoll_1
  2. parent calls os.fork; the child’s copy of loop_1 (call it loop_2) shares the same epoll_1
  3. parent starts loop_1, and waits on epoll_1.poll()
  4. child adds a handler for fd_2 to epoll_1
  5. parent gets back fd_2, but doesn’t have a handler for it, and raises the KeyError.

So this will pretty much happen at some point anytime a new ioloop is not created for a child process.

Here’s a repro script. I couldn’t figure out a good way to kill this
gracefully, so be warned this will need to be killed externally.

import logging
import select
import socket
import os
import time
import tornado.ioloop
import tornado.httpclient
import tornado.web

serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
serversocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
serversocket.bind(('127.0.0.1', 8080))
serversocket.listen(1)

logging.basicConfig()

loop = tornado.ioloop.IOLoop.current()

if os.fork():
    # parent: register a handler on the (shared) epoll, then poke the
    # socket so an event fires.
    handler = lambda *args, **kwargs: None
    loop.add_handler(serversocket.fileno(), handler, select.EPOLLIN)
    time.sleep(0.1)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(('127.0.0.1', 8080))
    client.send(b"foo")
else:
    # child: start the loop; it sees the event via the shared epoll, but
    # has no handler registered for that fd, and raises the KeyError.
    loop.start()

How about gunicorn or tornado.multiprocess?

So how to avoid this in gunicorn or tornado.multiprocess, which uses
an os.fork? The best practice is to not start the ioloop until AFTER
the fork: calling ioloop.Instance() or current() will create an ioloop whose ioloop will be shared
by any child ioloop, without explicitly clearing it.
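
A minimal sketch of that ordering with a plain os.fork, where each process only touches the IOLoop after the fork and therefore gets its own epoll fd:

import os
import tornado.ioloop

pid = os.fork()

if pid == 0:
    # child: the loop (and its epoll fd) is created here, after the fork,
    # so it is not shared with the parent.
    loop = tornado.ioloop.IOLoop.current()
    loop.start()
else:
    # parent: likewise, only create/start an IOLoop after the fork;
    # here it simply waits for the child.
    os.waitpid(pid, 0)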

Gunicorn calls a fork as it’s spawning a worker:

# gunicorn/arbiter.py
def spawn_worker(self):
    self.worker_age += 1
    worker = self.worker_class(self.worker_age, self.pid, self.LISTENERS,
                               self.app, self.timeout / 2.0,
                               self.cfg, self.log)
    self.cfg.pre_fork(self, worker)
    pid = os.fork()
    if pid != 0:
        self.WORKERS[pid] = worker
        return pid

Summary

Tornado is an awesome framework, but it’s not simple. However, thanks
to well documented pieces, it’s possible to diagnose even complex
issues like this, and do a bit of learning along the way.

Also, os.fork is not a complete guarantee that you’ll get a unique
instance of every object you use. Beware file descriptors.

Introducing transmute-core: quickly create documented, input validating APIs for any web framework

A majority of my career has been spent on building web services in
Python. Specifically, internal ones that have minimal or no UIs, and
speak REST (or
at least are rest-ish).

With each new service, I found myself re-implementing work to
make user-friendly REST APIs:

  • validation of incoming data, and descriptive errors when a field does not
    match the type or is otherwise invalid.
  • documenting said schema, providing UIs or wiki pages allowing users to
    understand what the API provides.
  • handling serialization to and from multiple content types (json, yaml)

This is maddening work to do over and over again, and details are
often missed: sometimes yaml is not supported for a particular API, or
there is a specific field that is not validated. Someone will ask about
an API that you changed, where you forgot to document a new parameter. It’s hard to
scale API maintenance when it depends on remembering every minute piece of boilerplate.

This was further exacerbated by using different web frameworks for
different projects. Every framework provides its own REST plugin or
library, and often there’s a lack of functional parity, or declaring
an API is completely different and requires learning multiple
approaches.

So with this monumental pain, what if I told you that you can get an API that:

  • validates incoming data types
  • supports multiple content types
  • has a fully documented UI

Just by writing a vanilla Python function? And what if I told you
this can work for YOUR Python framework of choice in 100 statements
of Python code?

Well, that’s what the transmute framework is.

How it works

transmute-core is
a library that provides tools to quickly implement rest APIs. It’s
designed to be consumed indirectly, through a thin layer that adapts
it to the style of the individual framework.

HTTP Endpoints

Here is an example of a GET endpoint in flask:

import flask_transmute

# flask-like decorator.
@flask_transmute.route(app, paths='/multiply')
# tell transmute what types are, which ensures validations
@flask_transmute.annotate({"left": int, "right": int, "return": int})
# the function is a vanilla Python function
def multiply(left, right):
    return left * right

And one in aiohttp, the web framework that uses Python 3’s asyncio:

import aiohttp_transmute

@aiohttp_transmute.describe(paths='/multiply')
# tell transmute what types are, which ensures validations
# Python3.5+ supports annotations natively
#
# request is provided by aiohttp.
def multiply(request, left: int, right: int) -> int:
    return left * right

aiohttp_transmute.route(app, multiply)

Both do the following:

  • generate a valid route in the target framework
  • detect the content type (yaml or json), and parse the body
  • verify that input parameters match the types specified, returning a 400 status
    code and details if not.
  • write back yaml or json, depending on the content type

Note that we don’t have to deal with the content type serialization,
read from request objects, or returning a valid response object:
that’s all handled by transmute. This keeps the functions cleaner in
general: it looks similar to any other Python function.
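
For a sense of what consuming such an endpoint looks like, here is a hedged sketch using requests; the host, port, and exact response body are placeholders, since they depend on how the app is run and on transmute’s defaults:

import requests

# a valid call: both parameters parse as ints, so we expect a 200 and a json body
resp = requests.get("http://localhost:5000/multiply",
                    params={"left": 3, "right": 4})
print(resp.status_code, resp.json())

# an invalid call: "left" is not an int, so we expect a 400 with a descriptive error
bad = requests.get("http://localhost:5000/multiply",
                   params={"left": "not-a-number", "right": 4})
print(bad.status_code, bad.text)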

Complex Schemas via Schematics (or any validation framework)

Primitive types in the parameters are OK, but it’s often true that
more complex types are desired.

Schema declaration and validation have multiple existing solutions,
so transmute defers this to other libraries. By default transmute uses
schematics:

from schematics.models import Model
from schematics.types import StringType, IntType

class Card(Model):
    name = StringType()
    price = IntType()


# passing in a schematics model as the type enables
# validation and creation of the object when converted
# to an API.
@annotate({"card": Card})
def submit_card(card):
    db.save_card(card)

Of course, some may prefer other solutions like marshmallow. In that
case, transmute-core provides a TransmuteContext for users to customize and use
their own implementation of transmute’s serializers:

from transmute_core import TransmuteContext, default_context

context = TransmuteContext(serializers=MySerializer())

route(app, fn, context=context)

# alternatively, you could modify the default context directly
# (be careful about where this code is called: it needs
# to happen before any routes are constructed)
default_context.serializers = MySerializer()

Documentation via Swagger

Swagger / OpenAPI allows one to define a REST API using json. Transmute generates
swagger json files based on the transmute routes added to an app, and transmute-core provides the static CSS and JavaScript
files required to render a nice documentation interface for it:

from flask_transmute import add_swagger

# reads all the transmute routes that have been added, extracts their
# swagger definitions, and generates a swagger json and an HTML page that renders it.
add_swagger(app, "/swagger.json", "/swagger")

This also means clients can be auto-generated as well: swagger has a
large number of open source projects dedicated to parsing and
generating swagger clients. However, I haven’t explored this too
deeply.

Lightweight Framework Implementations

Earlier in this post, it is mentioned that there should be a wrapper
around transmute-core for your framework, as the style of how to add
routes and how to extract values from requests may vary.

A goal of transmute was to make the framework-specific code as thin as
possible: this allows more re-use and common behavior across the
frameworks, enabling developers across frameworks to improve
functionality for everyone.

Two reference implementations exist, and they are very thin. As of this writing, they are at:

  • flask-transmute: 166 lines of code, 80 statements
  • aiohttp-transmute: 218 lines of code, 103 statements (a little bloated to support legacy APIs)

A one-page example for flask integration is also provided, to
illustrate what is required to create a new one. That’s 200 LOC with
comments, a little more than 100 without.

http://transmute-core.readthedocs.io/en/latest/creating_a_framework.html

Impressions

Frameworks are always a means to an end: it’s about reducing that
effort between what you want to build and actually building it.

I love great, well designed APIs. And dealing with the minutiae of
some detail I missed in boilerplate content type handling or object
serialization was draining the enjoyment out of authoring them. Since
I’ve started using transmute for all of my projects, it’s let me focus
on what I care about most: actually writing the functional code, and
designing the great interfaces that let people use them. For the most part,
it feels like just writing another function in Python.

The auto-documentation is freeing from both sides: as an author I can
keep my documentation in line with my implementation, because my
implementation is the source. For consumers, they’re immediately
provided with a simple UI where they can rapidly iterate with the API
call they would like to make.

It’s also great knowing I can use transmute in the next framework,
whatever that may be: I can take all the work and behavior that’s
embedded in transmute, with a module or two’s worth of code.

Conclusion

Give it a shot! Issues
and PRs
are welcome, and I’d love to see someone apply transmute to
another framework.

Global logging with flask

As of December 2016, Flask has a built-in
logger that it instantiates for you. Unfortunately, this misses
errors and other log messages from other libraries that may also be
valuable.

It would be nice to have a single logger, one that captures BOTH
library AND app logs. For those that want a global logger, this may
take a few concepts to get right. You have to:

  1. undo flask’s logging
  2. set up your own logging
  3. set log levels, as the default may not suit you.

Combined, this ends up looking like:

import logging
import sys
from flask import Flask, current_app

LOG = logging.getLogger("my_log")
LOG2 = logging.getLogger(__name__ + ".toheunateh")
app = Flask(__name__)


@app.route("/")
def route():
    current_app.logger.info("flask logger: foo")
    LOG.info("log: foo")
    LOG2.info("log2: foo")
    return "hello!"


# create your own custom handler and formatter.
# you can also use logging.basicConfig() to get
# the python default.
out_hdlr = logging.StreamHandler(sys.stdout)
fmt = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
out_hdlr.setFormatter(fmt)
out_hdlr.setLevel(logging.INFO)
# append to the global logger.
logging.getLogger().addHandler(out_hdlr)
logging.getLogger().setLevel(logging.INFO)
# removing the handler and
# re-adding propagation ensures that
# the root handler gets the messages again.
app.logger.handlers = []
app.logger.propagate = True
app.run()

And you get the right messages. Voila!

The Dangers of Patching

If you’ve ever used Mock (or the built-in unittest.mock in Python 3),
you’ll know how powerful a tool it can be toward making unit tests of
state-modifying functions sane. Mocks in Python are effectively a probe
that you can send into a deep, dark function:

import mock

def test_write_hello_world():
    my_filehandle = mock.Mock()
    write_hello_world_to_handle(my_filehandle)
    my_filehandle.write.assert_called_with("hello world")

You can send in a fake object, have it experience what it’s like to be
a real object, and then ask it questions about what it was like.

The above example doesn’t really test a lot, but for more complex
cases, it can be a lifesaver: you know exactly what was called and
what wasn’t, and if your object modifies some real world state that
you don’t want to (such as a database), it prevents you
from performing dangerous operations.

Another well-known feature of the mock module is patch: a function that
gives you the ability to replace any object in python (in any module)
with a mocked object. An example usage is like this:

import mock

def test_linux():
    with mock.patch('platform.system') as system:
        system.return_value = 'Linux'
        import platform
        assert platform.system() == 'Linux'

Patch is powerful: it actually lets you replace modules, functions, and
values, even if they’re not imported in the current context!

But just because a tool is powerful, doesn’t mean you should use
it. In reality, patch should be a last resort: you should only use it
if there’s no other way to test your code.

But why? Patch is basically making mock even more flexible: you can
literally mock anything you are aware exists. But there are a couple of glaring issues:

It’s not foolproof

Let’s say I have a couple files like this:

# mock_test.py

from mymodule import is_my_os
try:
    from unittest import mock  # py3
except ImportError:
    import mock  # py2

with mock.patch('platform.system', return_value="my os"):
    assert is_my_os()

# mymodule.py
from platform import system

def is_my_os():
    return system() == "my os"

Now patch is patching the platform.system function, so this should pass. Let’s try it:

$ python mock_test.py
Traceback (most recent call last):
  File "./bin/python", line 42, in <module>
    exec(compile(f.read(), __file__, "exec"))
  File "/Users/tsutsumi/sandbox/mock_test.py", line 11, in <module>
    assert is_my_os()
AssertionError

That’s not what we expected! So what happened here?

Internally, every python module contains its own scope. Every import,
method declaration, variable declaration, and expression modifies
that scope in some way. So when you import anything, you are actually
adding a reference to that object into the module’s global scope. By the
time we actually patch ‘platform.system’, our module already holds its
own reference to the original ‘system’ function:

$ python
>>> import platform
>>> from platform import system
>>> import mock
>>> with mock.patch('platform.system') as mock_system:
...     print(mock_system)
...     print(system)
...     print(platform.system)
...
<MagicMock name='system' id='4307612752'>
<function system at 0x100bf9c80>
<MagicMock name='system' id='4307612752'>
>>>

So even if you do patch a method, you won’t necessarily patch all the
uses of that method, depending on how they’re imported. This
means your patching must directly match how the object you want to
mock is imported into the code under test.

For example, we can fix the mock_test.py file above by changing the patch:

# mock_test.py

from mymodule import is_my_os
try:
    from unittest import mock  # py3
except ImportError:
    import mock  # py2

with mock.patch('mymodule.system', return_value="my os"):
    assert is_my_os()

So in order to use patch effectively, you have to be aware of the exact
semantics by which a method is both imported and invoked. And this
leads to the ultimate problem with patch:

Really tightly coupling tests with implementation

Patching in general, regardless of the implementation, tightly couples
your test code and your regular code beyond the typical bounds of unit
testing. Once patching is involved, you have to be conscious not only
of the effect of your code, but also of its implementation. Modifying
the internal code of a method also requires modifying the test code.
And when the tests have to change whenever the implementation changes,
passing tests no longer guarantee that the behavior is unchanged.

Ultimately, however, we don’t live in an ideal world. Times will come
when you have to test code that is hard to refactor into a form that
works with only mocks or actual objects. But with code you control,
patching is almost completely avoidable.

So how do we avoid patching?

Patching is the result of coupled complex state, relying on multiple
global variables. We can remedy this by doing the exact opposite:

  • decouple complex state
  • don’t rely on global variables

Let’s take a look at some practices to help with this:

Don’t use global variables

For example, let’s look at an object that creates a persistent db
connection based on configuration parameters:

db_connection = db_connect(DB_URL)

class MyObject:

    def __init__(self, name):
        self.name = name

    def save(self):
        db_connection.write(self.to_dict())

    def to_dict(self):
        return { 'name': self.name }

To test this object’s save method, you would have to either patch the
db_connection object, or replace the DB_URL to point at a test
database. Either approach is an extra step beyond testing what you
really want from the save method: that the db’s write method is called
with the dictionary representation of the object.

You can accomplish this without patch by passing in objects as you
need them: explicitly passing them in makes them really easy to mock:

class MyObject:

    def __init__(self, name):
        self.name = name

    def save(self, db):
        db.write(self.to_dict())

    def to_dict(self):
        return { 'name': self.name }

def test_myobject_save():
    import mock
    my_object = MyObject("foo")
    db = mock.Mock()
    my_object.save(db)
    # assert_called_with raises AssertionError itself if the call doesn't match
    db.write.assert_called_with({
        'name': 'foo'
    })

Decouple complex state

Complex state coupling occurs when you attempt to hide from the user a
lot of the difficulty of creating objects. Using the database example above:

class MyObject:

    def __init__(self, db_url, name):
        self._db = db_connection(db_url)
        self.name = name

    def save(self):
        self._db.write(self.to_dict())

    def to_dict(self):
        return { 'name': self.name }

Now the only way to actually test this save method (aside from a full
stack test) is to mock the db_connection method. It wouldn’t work to
assign the db attribute afterward (my_object._db = Mock()), because by
then the object has already been instantiated: your db connection
already exists, creating extra overhead you won’t use.

Instead of trying to hide the complex state from the user of your
class, let them actually choose the db object to pass in:

class MyObject:

    def __init__(self, db, name):
        self._db = db
        self.name = name

    def save(self):
        self._db.write(self.to_dict())

    def to_dict(self):
        return { 'name': self.name }

def test_myobject_save():
    import mock
    db = mock.Mock()
    my_object = MyObject(db, "foo")
    my_object.save()
    # assert_called_with raises AssertionError itself if the call doesn't match
    db.write.assert_called_with({
        'name': 'foo'
    })

This not only allows us to test operations on complex objects, but
also makes the class more flexible (e.g. compatible with more
db objects than just the one that db_connection returns).

Final thoughts

Once again, patch exists for a reason. It’s almost like a magic wand
that allows you to test otherwise untestable code. But this magic wand
makes your life harder the more you use it.

So all in all: beware the dangers of patching.

Pycon 2014: Day 2 Recap

The second day of PyCon was just as amazing as the first. The day
started with a couple keynote speakers:

Keynote: Jessica McKellar

Jessica McKellar gave a talk about the current state of programming
classes in U.S. grade schools, and the numbers weren’t pretty. The
number of programming classes is thin as it is, and the prospects for
girls were even worse: female students made up less than a third of the
AP computer science test takers, and in two states no female students
took the exam at all.

It’s a bit disheartening to know that this is the state of
C.S. education in the U.S., but I think it’s not a hopeless
situation: a third of the attendees at PyCon this year were women,
which is phenomenal. In addition, there are a lot of people who discover
the joys of programming after high school (including myself).

Ultimately though, the lesson of the talk was that we need more
professional programmers fighting against this wave. Unfortunately all
of my free time is spent on several other projects, but I’ll definitely
remember that education needs some help when I have a spare second.

Keynote: Fernando Pérez

Geez, iPython is amazing. It’s so much more than just a fancier python
interpreter. The science community made it into more of a matlab
Frankenstein, complete with math, data analysis, and more.

Fernando demoed the iPython notebook, which is leagues ahead of
anything I’ve seen in the scientific note-taking community: rich ways
to express data, and easily extensible (a lightning talk speaker today
added d3 support, which only makes it look all the more amazing).

My limited experience in the academic world makes me cringe at bad
programming practices: lack of version control and no good documentation
(and I was definitely a part of that) made life incredibly
difficult. I think Pérez and the rest of the iPython community members
are definitely turning this trend around, with a system that allows
live documentation and easily modifiable data, blowing past anything
we have on the private side. I’d love to take the concepts iPython is
pushing and see if I can’t make something new and powerful out of it.

Talk: Designing Django’s Migrations

I can definitely tell Andrew Godwin is a smart guy. His extensive time
maintaining Django has really given him a good sense of what works and
what doesn’t for a general web framework. His talk on designing
migrations was basically explaining how it works, but he did share
some of his insights from looking at the previous migration tool, south:

  • simplify the logic and abstract it out to a Python construct: this
    avoids the heavily hard-coded migration files that South generates.
  • load the whole migration path into memory to build a meaningful
    context: with an idea of the whole upgrade path, it’s easy to see
    what sort of deltas exist. This will probably be ok for 90% of the common migrations, but
    data re-formatting still needs a very human touch.
  • keeping the layer that performs migrations functionally separate
    from the database translation layer: effectively ensuring that the
    migration layer only uses methods exposed by the ORM. Good idea in
    my humble opinion: keeping the architecture separate blocks the
    possibly that a database feature could someday have to be
    implemented twice for both the model and the migration.

All in all, a good way to dive into an Open Source maintainer’s head.

Talk: Designing Poetic APIs

Wow. Erik Rose really gets what
software should be. When people say that programming is like writing,
I have to say I was skeptical. I could see how being a good writer
helps from a documentation perspective, but I couldn’t say it directly
affected one’s programming ability. Well, this talk threw everything I
thought I knew out the window.

Erik talks about seven principles that can help one design really
clean APIs from a consumer perspective:

  • Don’t predict the future: stop coding against what you think is
    going to happen years from now. It rarely leads down a good path.
  • Consistency: make sure that not only are all your methods performing
    actions that are consistent with the behavior of other methods in
    your library, but take it a step further and be consistent with all
    of Python as well. The principle of least astonishment applies just as much to your internal library as it does to a web application.
  • Brevity: your methods should return minimal data structures that
    perform what you want, and should require as few arguments as
    possible. Not doing so leads to incredibly complicated, possibly
    untestable, most likely incomprehensible code.
  • Composability: instead of having methods that do five things, try to
    decompose it in a way that allows consumers to chain operations
    together. Not only does this help with brevity, but it helps complex
    work-flows become simpler as well.
  • Plain Data: use built-in types when possible. Not only dict and
    list, but also constructs built-in to the language, such as
    ConfigParser. This allows an astounding level of compatibility
    across all sort of code.
  • Grooviness: I’m not 100% sure what he means by this, but I think
    grooviness has to do with how easily you can fall into a groove
    while coding. Things like checking docs, or digging through deep
    stack traces, really hamper a good work-flow.
  • Safety: states that are impossible shouldn’t be representable, and
    throwing an exception is better than a return value if the case is
    truly exceptional.

Seriously a mind-blowing talk. I’ve had feelings about a lot of these,
but having someone put them into words really makes the point
clear. And this is where it all ties in to writing: poetry and coding
good APIs require a very similar set of skills: the ability to
express yourself in an eloquent, clear way. To be successful at either,
this ability is an absolute necessity.

Talk: Fast Python, Slow Python

I’m sure Alex Gaynor is an
incredibly smart person, but maybe his talk went over my head a bit. I
originally thought this was going to be a talk about practices that
would allow me to optimize Python, but he made me remember that
CPython is not the only Python around. His talk was actually about
making implementation-agnostic Python faster. He gave a few tips on
how to do this, but of course he didn’t really explain why it would be
faster. He gave using classes over dicts as an example of a performance
increase, arguing that dicts are object-object mappings and a class
provides a more explicit key. I’m not 100% sure why that would be any
better, considering one could optimize a dictionary in a very similar
way if you know what all the keys and values are going to be.

Not really a lot to be gleaned from this talk from my perspective,
unless you want to follow practices that would make your code faster in
PyPy and possibly CPython (if you upgrade to the most recent version). Of
course, that’s still not an implementation-agnostic performance increase.

Talk: What is Coming in Python Packaging

Great state-of-the-world talk from Noah Kantrowitz. A lot of big
changes are coming in the Python packaging world, including:

  • a new PyPI, known as Warehouse.
  • Wheels will take over at some point, but quite a few packages can’t
    use them yet because of their reliance on native dependencies, which
    wheel can’t handle for multiple platforms.
  • virtualenv is now in Python as of 3.4 (pyvenv)
  • twine is now the de facto way to upload packages, rather than using setup.py directly
  • devpi is apparently a popular local proxy for hosting packages internally.
  • the Python Packaging Authority (PyPA) is now responsible for all packaging technologies, including:
    • setuptools
    • pip
    • wheel
    • warehouse
    • virtualenv

So definitely a good talk to check out if you want to know what’s the way to go in today’s Python world.

Summary

Another great day at PyCon. It’s awesome being able to hear from the
horse’s mouth (so to speak) about the state of the world in
Python. There was also an amazing set of lightning talks. I’m learning
a lot all over the place, really.

Really excited for day 3, and the dev sprints to follow!