November 11, 2015

Building an analysis monitor for a genomics pipeline: Tracking remote analyses

A typical genomics data analysis usually lasts several hours or even days, depending on the type of analysis performed. A common problem with this kind of analysis is that it’s sometimes hard, or at least inconvenient, to know its current status. In most cases it involves SSH-ing into remote machines, looking for logs and grepping through them.

bcbio-nextgen is a Python toolkit providing best-practice pipelines for fully automated high-throughput sequencing analysis.

The problem

As mentioned before, tracking long-running analyses is not always easy in this kind of workflow, and bcbio-nextgen is no exception. When running an analysis, the only way to know its status is by looking into the logs. Wouldn’t it be nice, and more convenient, to have a simple frontend layer that displayed the progress of the current analysis? What if it could do so independently of the pipeline itself?

The solution

That’s precisely what bcbio-monitor does. Now, to build a frontend for a piece of software like bcbio-nextgen, there are a couple of architectural decisions one needs to make. The very first question that came to my mind was:

Coupled or decoupled?

Do you want the tracker to be tied to the software it’s tracking, in the sense of requiring it to be installed in order to function? You may think that’s the logical choice at first, but think about this: bcbio-nextgen, like many complex analysis pipelines, has a lot of dependencies and is tricky to install. Even with all the facilities that Brad Chapman gives us to install bcbio, it still takes time and computational resources. I thought that, to be a practical tool, the monitor should be quick for potential users to install on a personal computer, so they can start tracking analyses wherever they’re running (which is generally a remote HPC or cloud environment).

Every programmer should know the importance of good logging practices for any application, right? bcbio-nextgen does this really well. So I thought I could just use the information from the log that bcbio-nextgen generates, instead of inserting tracking code in the pipeline itself. This makes the monitor independent of the pipeline: it can read the log files of ongoing analyses to build a status graph, or even read the logs of finished analyses to get a global summary.
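To make the log-driven idea concrete, here is a minimal sketch of extracting pipeline steps from log lines. The log line shape, the `Timing:` marker and the `parse_steps` helper are assumptions made for illustration; they are not taken from bcbio-monitor’s actual code.

```python
import re
from datetime import datetime

# Assumed log line shape (hypothetical, for illustration only):
#   [2015-11-11T10:00Z] Timing: alignment
TIMING_LINE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+Timing:\s+(?P<step>.+)$")


def parse_steps(lines):
    """Return (timestamp, step name) pairs for every line that marks a new step."""
    steps = []
    for line in lines:
        match = TIMING_LINE.match(line.strip())
        if match:
            when = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%MZ")
            steps.append((when, match.group("step").strip()))
    return steps


sample_log = [
    "[2015-11-11T10:00Z] Timing: organize samples",
    "[2015-11-11T10:01Z] Some other informative message",
    "[2015-11-11T10:05Z] Timing: alignment",
]
steps = parse_steps(sample_log)
```

Because the parser only needs text lines, the same function works on a log that is still being written and on the log of a finished analysis.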

Updating the web client

Having decided to use the log file, we need a way to update the client application on every pipeline step. There are mainly two options:

Client-Server communication through polling

Though this would work fine, I wanted to avoid it mainly for two reasons:

You have to define a polling interval yourself, and it has to be sensible given the update rate of the pipeline. It’s not easy to settle on a sensible interval, since the duration of different steps ranges from seconds to hours.
Even with a good enough polling interval, polling by definition implies hundreds or thousands of useless requests to the server.
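A quick back-of-the-envelope calculation shows the scale of the second problem. The numbers below (a one-day analysis, a 10-second interval, 15 pipeline steps) are made up for illustration.

```python
def polling_requests(analysis_seconds, interval_seconds):
    """Number of requests a client sends when polling at a fixed interval."""
    return analysis_seconds // interval_seconds


def wasted_requests(analysis_seconds, interval_seconds, n_steps):
    """Requests that return nothing new, assuming one useful request per step."""
    return max(polling_requests(analysis_seconds, interval_seconds) - n_steps, 0)


ONE_DAY = 24 * 60 * 60  # hypothetical analysis duration, in seconds

total = polling_requests(ONE_DAY, 10)        # 8640 requests in total
wasted = wasted_requests(ONE_DAY, 10, 15)    # 8625 of them carry no news
```

Raising the interval reduces the waste, but then short steps are reported late, which is exactly the trade-off described in the first point.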

Server-Client communication through HTML5 Server-Sent Events (SSE)

The ideal solution would be to update the client only when a new step has finished, or when some relevant info (e.g. an error in the analysis) needs to be presented to the user.

Doing some research, I learned about the SSE protocol in HTML5, and so I decided to go for that. SSE does precisely that: the client application subscribes to the server, which publishes messages that all subscribers will read. The protocol is robust: even if you kill bcbio-monitor, the client will automatically reconnect once the server is started again.

With SSE, the client is updated only on important events, and no useless requests to the server are needed.
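On the wire, an SSE message is plain text: an optional `event:` line, a `data:` line, and a blank line that ends the message. The sketch below formats such messages in Python; the `format_sse` and `event_stream` helpers and the `step` event name are hypothetical names chosen for this example, not bcbio-monitor’s actual API.

```python
import json


def format_sse(data, event=None):
    """Serialize one update as a Server-Sent Events message."""
    message = ""
    if event is not None:
        message += f"event: {event}\n"
    # json.dumps without indentation yields a single line, as a data field requires
    message += f"data: {json.dumps(data)}\n\n"
    return message


def event_stream(updates):
    """Yield one SSE message per pipeline update."""
    for update in updates:
        yield format_sse(update, event="step")


messages = list(event_stream([{"step": "alignment"}, {"step": "variant calling"}]))
```

A generator like `event_stream` can be returned as a streaming response with the `text/event-stream` content type by any web framework that supports streaming; in the browser, an `EventSource` object receives the messages.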

This is what the final architecture looks like:

[Image: architecture]

bcbio-monitor: Features

Once the backend architecture was decided, the “only” thing remaining was to build the informative frontend application. As of version 1.0, this is what bcbio-monitor looks like:

Progress flowchart and steps table

[Image: monitor]

The table on the right gives the user information about the finished and running steps, as well as human-readable timing information (thanks to Moment.js). The flowchart on the left is, to be honest, not very informative in this version. The idea for the flowchart is to create branches for every computing node where the current step is running.

Percentage of time used by every step

[Image: progress]

For a quick visual reference of which steps take the longest.
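The percentages can be derived from the step start times alone: each step lasts until the next one begins. Here is a minimal sketch; the `time_share` helper and the sample timestamps are invented for illustration.

```python
from datetime import datetime


def time_share(steps, finished_at):
    """Percentage of the total run time spent in each step.

    `steps` is a chronological list of (start time, step name) pairs;
    a step ends when the next one starts, and the last one at `finished_at`.
    """
    if not steps:
        return {}
    total = (finished_at - steps[0][0]).total_seconds()
    if total <= 0:
        return {}
    ends = [start for start, _ in steps[1:]] + [finished_at]
    shares = {}
    for (start, name), end in zip(steps, ends):
        elapsed = (end - start).total_seconds()
        shares[name] = shares.get(name, 0.0) + 100.0 * elapsed / total
    return shares


example_steps = [
    (datetime(2015, 11, 11, 10, 0), "organize samples"),
    (datetime(2015, 11, 11, 10, 30), "alignment"),
    (datetime(2015, 11, 11, 12, 0), "variant calling"),
]
shares = time_share(example_steps, finished_at=datetime(2015, 11, 11, 14, 0))
```

For an analysis that is still running, passing the current time as `finished_at` gives the shares so far.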

Constant feedback (optional)

Since the time between steps can be very long, and the user would otherwise get no feedback for several hours, I added the option of showing the last log message read in a small panel:

[Image: message]

This can be deactivated by starting bcbio-monitor with --no-update.
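Fetching the most recent message from a log file takes only a few lines. The `last_message` helper below is a hypothetical sketch, not the monitor’s actual implementation.

```python
from collections import deque


def last_message(path):
    """Return the last non-empty line of a log file, or None if there is none."""
    with open(path, encoding="utf-8") as handle:
        non_empty = (line.strip() for line in handle if line.strip())
        # A deque with maxlen=1 keeps only the most recent item it is fed
        tail = deque(non_empty, maxlen=1)
    return tail[0] if tail else None
```

This reads the whole file on every call, which is fine for a sketch; for very large logs, seeking from the end of the file would be cheaper.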

Error detection

[Image: error]

In case something goes wrong, the error is presented to the user.
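Since everything comes from the log, error detection can be as simple as scanning for telltale strings. The markers and the `find_errors` helper below are assumptions for illustration; the real monitor may use different criteria.

```python
# Assumed markers: a log level name and the header Python prints before a traceback
ERROR_MARKERS = ("ERROR", "Traceback (most recent call last)")


def find_errors(lines):
    """Return (line number, text) pairs for log lines that look like errors."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(lines, start=1)
        if any(marker in line for marker in ERROR_MARKERS)
    ]


sample = [
    "[2015-11-11T10:00Z] Timing: alignment",
    "[2015-11-11T10:20Z] ERROR: alignment failed",
    "Traceback (most recent call last):",
]
errors = find_errors(sample)
```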

Summary for the analysis

[Image: summary]

At the end of the analysis, you get a summary of it. As with the flowchart, this will contain more information in future versions.

Final remarks

Lots of features could (and hopefully will) be added to bcbio-monitor, but this very first version 1.0 is already a working prototype for tracking bcbio-nextgen analyses. I’m creating a series of issues for version 1.1; please feel free to add whatever you think could be useful :-).

And as always, share if you liked!

Guillermo Carrasco

In automation, we trust.