Network Automation ramblings by Kristian Larsson
26 Jul 2019

Writing a Python background worker for Cisco NSO - part deux

1 Writing a Python background worker for Cisco NSO - part deux

We ended the first part with a chunk of code that implements all the functionality we had listed as our requirements. You'll find the code in this commit in case you want to look at it in more detail. How it then evolved is what I'll detail in this post.

First thing I did, after having played around with this code myself for some time, was to hand it over to my colleague Marko. Marko just happened to be implementing an NSO component that needed to run in the background and so he was in need of the background_process design that I had been working on. I couldn't find a better beta tester than Marko, as he was in fact involved in the design of this background worker process pattern and so was familiar with the whole idea but hadn't, up to this point, written any code for it. I expected to get some feedback on whether the interface was any good or how the documentation was lacking. Instead, he came back saying NSO hangs.

1.1 Debugging a hung Python VM

At work we have a rather sleek way of starting up a development environment for NSO. We call it a playground and it consists of one or more containers running NSO and a number of virtual routers or other support containers that together form a small topology. Starting a playground is just a matter of executing one script, waiting a few minutes (both NSO and virtual routers are slow to start up) and then you can start coding. There are different topologies you can choose from as well as options for controlling persistence of data or restarting playgrounds until tests fail (for finding bug reproduction cases). We start the same topologies in CI for testing. All in all it gives a rather good guarantee that the environments we develop and test in are uniform.

When Marko said his NSO had hung I was perplexed as it had worked so well for me. We started looking into differences but there were none, or very few. We ran the same playground topology using the same code versions and we were even running it on the same physical machine. Yet mine worked fine, consistently redeploying the bgworker package, while his didn't. It was clear we needed to find out what was going on and fix it.

To reproduce the hang with the bgworker as it looked then, all that was required was to redeploy the package a few times and you were likely to encounter the hang. The background worker would periodically log a message and when it was hung there would simply be no message. We tried adding print statements but it made little difference. We then tried writing some data to a temporary file instead, but that didn't work either.

Increasing the log level to debug from a lower level seemed to increase the likelihood of the hang but we still didn't understand why. We added a stack dumper and a signal handler so that when we sent a USR1 signal we would dump the stack to a file, which we could then inspect to understand what the program was doing. The stack:

File "/usr/lib/python3.5/logging/__init__.py", line 1487, in callHandlers
  hdlr.handle(record)

File "/usr/lib/python3.5/logging/__init__.py", line 853, in handle
  self.acquire()

File "/usr/lib/python3.5/logging/__init__.py", line 804, in acquire
  self.lock.acquire()

Finally some progress. We repeated this, taking a stack dump for multiple invocations when Python had hung and they all showed the same stack. We were waiting on acquiring a lock!
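
For reference, a stack dumper along these lines can be wired up with the signal, sys and traceback modules. This is just a sketch of the idea, not the exact code we used, and the output path is only an example:

import signal
import sys
import traceback

def dump_stacks(signum, frame):
    # Write the current stack of every thread to a file for later inspection.
    with open('/tmp/py-stacks.txt', 'w') as f:
        for thread_id, stack in sys._current_frames().items():
            f.write("Thread {}:\n".format(thread_id))
            traceback.print_stack(stack, file=f)
            f.write("\n")

# kill -USR1 <pid> triggers the dump
signal.signal(signal.SIGUSR1, dump_stacks)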

1.2 A ten year old bug

The stack dump further reveals that the lock is being acquired in the logging module, which simply takes a lock around the output in the handler when it is emitting a log message. I didn't actually know the logging module took locks - mostly because I never thought about how it worked internally - but it's fairly natural since we would otherwise risk two log messages emitted at the same time overwriting each other.

The bgworker uses a mix of logging, threads and multiprocessing. As it turns out, Marko had found a bug related to this exact mix. The bug was ten years old and hadn't yet been fixed.

The multiprocessing module uses fork to create a child process. fork() works by creating a copy of the running process, so that after the fork() call there are two processes, one parent and one child. The child process thus has a copy of all the file handles, sockets and other things in memory... and if a lock happened to be held while the child was forked, then the lock would be held in both the parent and the child, although at that point they are different locks since they have been copied.

Just using multiprocessing and logging is safe since a single-threaded program will only do one thing at a time; it will either be logging a message or starting a new child process through fork. As we also mix in the threading module and run multiple threads, there is a possibility that we fork at the exact moment when a separate thread is logging a message, and then the lock will be held! In the parent process the thread emitting the log message will release the lock as soon as it is done, but in the child process that never happens since the logging thread isn't running there, and the lock is thus held indefinitely. When we then try to log something in the child we first try to acquire the lock, but since it is held indefinitely, our Python VM hangs.
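
A minimal standalone reproduction of the race could look something like the following. This is a sketch under the assumption of Linux and the default fork start method; the names, the scratch log file and the timings are illustrative, and it may take a few runs before the child actually hangs:

import logging
import multiprocessing
import threading
import time

# log to a scratch file so the console stays readable
logging.basicConfig(filename='/tmp/forktest.log', level=logging.DEBUG)
log = logging.getLogger()

def chatter():
    # A thread that logs constantly, holding the handler lock much of the time.
    while True:
        log.debug("logging from a thread")

def child():
    # If the fork happened while chatter() held the lock, this call never returns.
    log.debug("hello from the child")

if __name__ == '__main__':
    threading.Thread(target=chatter, daemon=True).start()
    time.sleep(0.1)
    p = multiprocessing.Process(target=child)
    p.start()
    p.join(timeout=5)
    print("child still alive after 5 seconds (likely hung):", p.is_alive())
    p.terminate()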

When bgworker starts up, it starts multiple threads and forks off the child all around the same time. It also logs quite a few things on startup, which means the likelihood of forking while emitting a log message is actually quite high. Raising the log level to debug naturally increases the likelihood even more.

It now also became clear why this hadn't manifested itself for me as I was developing bgworker in a standalone instance whereas Marko was integrating it into a larger package that also ran a number of other threads - also emitting log messages on startup, thus increasing the likelihood of a hang.

1.3 Incredible timing

Note that the version of bgworker that we are talking about was committed on the 5th of July. After that it took us a few days, until the 8th, to hunt this bug down. The 8th of July is also the day that Python 3.7.4rc2 was released, and 3.7.4rc2 should address this issue!

We find a bug that is over ten years old and on the day we find it, a fixed version of Python is released. What are the odds?

1.4 Revamping the NSO docker container

We run NSO in a docker container so switching Python version is relatively simple, but we still have various components and NEDs that pull in a couple too many dependencies, which does make upgrading trickier. We also try to rely on official packages from the distribution repositories rather than pulling down and compiling our own Python build… but we wanted to make progress, so that had to go out the window.

Marko and I each rewrote the Dockerfile producing our NSO image, giving us multiple alternatives for how we could get a newer Python version in there. We tried pyenv, a manual install and some PPA (which wasn't on 3.7.4 yet, but we figured we could have waited a few days).

In the end, however, it turns out 3.7.4 doesn't really fix our issue. While it does bring improvements, it ultimately does not address the underlying problem. More code would be needed for that (you can't just remove the locks, so logging to the same file from what are now two UNIX processes would instead require inter-process locks). Anyway, we needed to fix the real problem through a redesign of our application code.

1.5 A fresh start

The multiprocessing library uses fork() by default on UNIX-like operating systems, but this can be influenced through an argument: instead of fork() we can tell the multiprocessing library to use spawn, which I'm pretty sure maps to posix_spawn() under the hood. Unlike fork(), posix_spawn() doesn't copy all of the memory of the parent to the child process, so we won't get a copy of the held locks - instead we start off fresh. It also means we don't have any loggers at all, so we have to set those up. What we want to provide to the user of our background_process micro-framework is a smooth experience, and it's just nice if the loggers are already set up for you so you can focus on your own code instead of overhead stuff like logging.
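
Choosing the start method is a small change with the multiprocessing API; a minimal sketch, where the worker function is just a placeholder:

import multiprocessing

def worker(q):
    # Runs in a freshly spawned process: no inherited locks, but also no loggers.
    q.put("hello from the spawned child")

if __name__ == '__main__':
    ctx = multiprocessing.get_context('spawn')
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(q,))
    p.start()
    print(q.get())
    p.join()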

We wanted to continue using the same log files that NCS uses per default, like ncs-python-vm-bgworker.log, so that the operational aspects remain the same regardless of whether you are writing standard Python components for NSO or using the background process design. Since we couldn't write to the same log file from two processes, we would have to ship the logs from the child process to the parent process, which could then write the messages to the file.

Marko quickly put together a queue listener and emitter so we could send log messages over a multiprocessing.Queue. The logging tree in Python is a singleton, so we can attach a queue emitting handler in the child process and wrap all this away so that the user of our framework doesn't have to think about it. We do all this through a wrapping function, like this:

import logging
import logging.handlers

def _bg_wrapper(bg_fun, q, log_level, *bg_fun_args):
    """Internal wrapper for the background worker function.
    Used to set up logging via a QueueHandler in the child process. The other end
    of the queue is observed by a QueueListener in the parent process.
    """
    queue_hdlr = logging.handlers.QueueHandler(q)
    root = logging.getLogger()
    root.setLevel(log_level)
    root.addHandler(queue_hdlr)
    logger = logging.getLogger(bg_fun.__name__)
    bg_fun(logger, *bg_fun_args)

The queue log handler is set up, and after this we run the bg_fun function provided to us by the user of the framework.
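
On the parent side, the other end of the queue can be drained by a logging.handlers.QueueListener that hands records to whatever handlers the parent already has configured. A rough sketch of that idea (this is not the actual bgworker code, and the names are illustrative):

import logging
import logging.handlers
import multiprocessing

log_q = multiprocessing.Queue()

# Forward records arriving on the queue to the parent's existing handlers,
# e.g. the one writing to ncs-python-vm-bgworker.log.
listener = logging.handlers.QueueListener(
    log_q, *logging.getLogger().handlers, respect_handler_level=True)
listener.start()

# ... start the child process with _bg_wrapper(bg_fun, log_q, log_level, ...) ...

# on shutdown
listener.stop()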

1.6 The promise of efficient logging

Logging can be tricky. We often want to add log messages in various places to easily understand what our program is doing. However, logging itself comes at a cost; not just writing the messages but actually doing the string formatting of them can be relatively expensive. Python's standard logging module is pretty clever and will only format and emit a message if the log level is set high enough. If you have a tight loop with a log.debug() statement, the message won't actually be formatted or emitted unless the log level is set to debug. This makes it possible to leave the logging statements in your code and know it will normally run fast. You only incur a performance penalty when you actually turn on debugging.

We can easily show this using a simple program:

#!/usr/bin/env python3

import logging
import timeit

log = logging.getLogger()
def noop():
    a = 1

def log_some():
    a = 1
    log.debug("foo")

if __name__ == '__main__':
    print("Noop                     : {}".format(timeit.timeit(noop, number=100000)))
    log.setLevel(logging.INFO)
    print("Without debug log level  : {}".format(timeit.timeit(log_some, number=100000)))
    log.setLevel(logging.DEBUG)
    print("With debug log level     : {}".format(timeit.timeit(log_some, number=100000)))
    ch = logging.FileHandler('foo.log')
    ch.setLevel(logging.DEBUG)
    log.addHandler(ch)
    print("With debug log level + FH: {}".format(timeit.timeit(log_some, number=100000)))
    log.info("foo")

kll@nuc:~$ python3 slowlog.py
Noop                     : 0.005788944661617279
Without debug log level  : 0.03270535729825497
With debug log level     : 0.7911096690222621
With debug log level + FH: 1.8663266659714282
kll@nuc:~$ 

As we can see, calling a function that doesn't call log.debug() at all is vastly faster than calling a function that does call log.debug() - it's roughly an order of magnitude. Then enabling the DEBUG log level makes it roughly an order of magnitude slower and finally, actually writing the messages to a file slows it down to about half the speed.

I think of this as a promise to the programmer. You should be able to put log statements in tight loops that need to run fast (obviously not the tightest of loops - there you simply need to strip out your log statements for running in production). The log calls, when debug logging isn't enabled, should be very, very cheap.
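
For messages whose arguments are themselves expensive to compute, the standard library also lets you guard the call explicitly with isEnabledFor(); a small illustrative example:

import logging

log = logging.getLogger(__name__)

def expensive_summary():
    # Stand-in for something genuinely costly to compute.
    return "lots of state"

# Only build the expensive argument if debug logging is actually enabled.
if log.isEnabledFor(logging.DEBUG):
    log.debug("current state: %s", expensive_summary())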

Reconfiguring log levels in NCS will only reconfigure the log handler level in the parent process. The child process will remain oblivious. For us to be able to capture all log messages the child process must therefore always emit all log messages, including debug messages, to the queue and then we can filter them away in the parent process in case the current log level doesn't include debug messages.

This is however a rather naive implementation design and it breaks the promise of cheap logging. We need to do better.

1.7 Log control queue

To uphold the promise of cheap logging in an environment with multiple processes like ours, we need to propagate the log level configuration to the child process so that it doesn't format and emit log messages unless they are enabled by the currently configured logging level.

I implemented a CDB subscriber that listens for changes to the logging configuration and passes them through a queue to the child process, which then reconfigures its logging handler with the appropriate level.
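
The child side of that control queue can be as simple as a small thread that blocks on the queue and applies whatever level the parent pushes; a sketch of the idea (the names and the sentinel are illustrative, not the actual bgworker code):

import logging
import threading

def _log_level_listener(level_q):
    # Runs as a thread in the child process: apply log levels pushed by the parent.
    while True:
        new_level = level_q.get()
        if new_level is None:  # sentinel to shut the listener down
            break
        logging.getLogger().setLevel(new_level)

# started in the child, next to the QueueHandler setup:
# threading.Thread(target=_log_level_listener, args=(level_q,), daemon=True).start()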

Mission accomplished!

  • No more hangs of the Python VM
  • Consistent logging ergonomics with and without the background process framework
  • Cheap logging calls when debug is disabled

Read the final part on how to avoid busy polling and efficiently implement process monitoring!

Tags: NSO
