How I Use Jug & IPython Notebooks

May 21, 2014

Having just released Jug 1.0 and having recently started using Ipython notebooks for data analysis, I want to describe how I combine these two technologies:

Jug is for heavy computation. This should run in batch mode so that it can take advantage of a computer cluster.
The IPython notebook is for visualization of the results. From the notebook, I will load the results of the jug run and plot them.

I am going to use, as an example, the sort of work I did for classifying images with local features that I did for my Bioinformatics paper last year That code did not use IPython notebook, but I already used a split between heavy computation and plotting[1].

I write a jugfile.py with my heavy computation, in this case, feature computation and classification [2]:

from jug import TaskGenerator from features import computefeatures from ml import classification # computefeatures takes an image path and returns features computefeatures = TaskGenerator(computefeatures) # crossvalidation returns a confusion matrix crossvalidation = TaskGenerator(crossvalidation) images,labels = load_images() # This loads all the images features = [computefeatures(im) for im in images] results = crossvalidation(features, labels)  This way, if I have 1000 images, the computefeatures step can be run in parallel and use many cores. § When the computation is finished, I will want to look at the results and display them. For example, graphically plot a confusion matrix. The only non-obvious trick is how to load the results from jug: from jug import value, set_jugdir import jugfile set_jugdir('jugfile.jugdata') results = value(jugfile.results)  And, boom!, results is a variable in our notebook with all the data from the computations (if the computation is not finished, an exception will be raised). Let's unpack this one by one: from jug import value, set_jugdir  Imports from jug. Nothing special. You are just importing jug in a Python notebook. import jugfile  Here you import your jugfile. set_jugdir('jugfile.jugdata')  This is the important step! You need to tell jug where its data is. Here I assumed you used the defaults, otherwise just pass a different argument to this function. results = value(jugfile.results)  You now use the value function to load the value from disk. Done. Now, use a second cell to plot: from matplotlib import pyplot as plt from matplotlib import cm plt.imshow(results, interpolation='nearest', cmap=cm.OrRd) § I find this division of labour to maximize the value of each tool: jug does well for long computations and ensures that the results are consistent while making it easy to use the cluster; ipython is nicer at exploring the results and tweaking the graphical outputs.    [1] I would save the results from jug to a text file and load it from another script.       [2] This is a very simplified form of what the original actually looked like. I started to write this post trying to make it realistic, but the complexity was too much. The plot is from real data, though.

Rabbit Thoughts

Discussion about this post