"Default" Bayesian Models

written by Eric J. Ma on 2017-03-11

As a positive distraction from my thesis writing, I've been thinking a bit about the statistical crisis in biomedical sciences and psychology research, and how it might be mitigated.

A number of opponents of p-values and proponents of Bayesian inference have influenced my thinking on this issue. As a result, I have come to the conclusion that Bayesian estimation and inference should be more widely used, because interpretable uncertainty is essentially built into the inference philosophy.

I think one thing preventing wider adoption of Bayesian inference methods is their flexibility (read: complexity). How does someone with little grounding in statistics compose a model?

To address this problem, I've started putting together Jupyter notebooks that walk through common problems in the experimental sciences, each paired with a sensible default model one can use for that kind of problem.

A recurring (and very interesting) theme came up as I put these together. The nature of probabilistic graphical models is such that if we can forward-simulate how the data may have been generated, then, given the data and a loss function, fitting the model is merely a matter of optimization. The core idea behind these notebooks, then, is that a small number of "generic" data-generating models can cover a large proportion of scenarios, particularly where we don't have sufficiently good theory to forward-simulate the complex data-generating distribution underlying the data.
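To make this concrete, here's a minimal sketch of the kind of "default" model I have in mind: a robust comparison of two experimental groups. The library choice (PyMC3), the toy data, and the prior choices below are my own illustrative assumptions, not something prescribed by any one notebook.

import numpy as np
import pymc3 as pm

# Toy measurements for two hypothetical experimental groups.
control = np.random.normal(loc=1.0, scale=0.5, size=30)
treatment = np.random.normal(loc=1.5, scale=0.5, size=30)

with pm.Model() as model:
    # Weakly-informative priors on each group's location and spread.
    mu_c = pm.Normal('mu_control', mu=0, sd=10)
    mu_t = pm.Normal('mu_treatment', mu=0, sd=10)
    sigma_c = pm.HalfCauchy('sigma_control', beta=1)
    sigma_t = pm.HalfCauchy('sigma_treatment', beta=1)
    nu = pm.Exponential('nu_minus_one', lam=1/29.) + 1  # shared degrees of freedom

    # Student-T likelihoods make the estimates robust to outliers.
    pm.StudentT('obs_control', nu=nu, mu=mu_c, sd=sigma_c, observed=control)
    pm.StudentT('obs_treatment', nu=nu, mu=mu_t, sd=sigma_t, observed=treatment)

    # The quantity we actually care about: the difference between groups.
    pm.Deterministic('difference', mu_t - mu_c)

    trace = pm.sample(2000)

The posterior over 'difference' then carries the uncertainty we care about directly, rather than a point estimate plus a p-value.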


Boston's Geospatial Data

written by Eric J. Ma on 2017-02-13

I finally did it!

The city of Boston recently released its data in the open. I wanted an excuse to play with geospatial data, so as a distraction from thesis & proposal writing, I hunkered down for about a workday's worth of time and put together a Bokeh app.

In this web app, the end user can select one of the released geospatial datasets and visualize the distribution of its data points. There's no added metadata right now, but if I get fed up with thesis & proposal writing I might go back and add metadata hover tooltips.
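Under the hood, it's essentially the standard Bokeh server pattern: a Select widget that swaps the data source feeding a scatter plot. Here's a rough sketch of that pattern (not the actual app code; the dataset names, file paths, and column names are placeholders):

import pandas as pd
from bokeh.io import curdoc
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, Select
from bokeh.plotting import figure

# Placeholder dataset names and file paths.
datasets = {
    'trees': pd.read_csv('data/trees.csv'),
    'hydrants': pd.read_csv('data/hydrants.csv'),
}

source = ColumnDataSource(datasets['trees'])

p = figure(title='Boston geospatial data')
p.circle(x='longitude', y='latitude', source=source, size=3, alpha=0.5)

def update(attr, old, new):
    # Swap in the newly selected dataset.
    source.data = ColumnDataSource.from_df(datasets[new])

select = Select(title='Dataset', options=sorted(datasets.keys()), value='trees')
select.on_change('value', update)

curdoc().add_root(column(select, p))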

I hope you have fun cloning the repo and running it!


Numba: My first attempt at being serious with it

written by Eric J. Ma on 2017-02-08

This evening, I saw a tweet about using Numba, and I thought it was about time I gave it a proper shot. I had been solving some dynamic programming problems just for fun, and I thought this would be a good test case for Numba's capabilities.

The DP problem I was trying to solve was that of collecting apples on a grid. Here's how the problem is posed:

I have a number of apples distributed randomly on a grid. I start at the top-left hand corner, and I'm only allowed to move downwards or to the right. Along the way, I pick up apples. What's the maximum number of apples I can pick up along the way?

This is a classic 2-dimensional DP problem. I simulated some random integers:

import numpy as np

n = 200
arr = np.random.randint(low=0, high=100, size=n**2).reshape(n, n)

I then wrote out my solution as two versions of the same function: one plain Python and one Numba-JIT'd.

# Let's collect apples.

from numba import jit

@jit
def collect_apples(arr):
    # sum_apples[row, col] holds the maximum number of apples collectible
    # on any down/right path from the top-left corner to (row, col).
    sum_apples = np.zeros(shape=arr.shape)
    for row in range(arr.shape[0]):
        for col in range(arr.shape[1]):
            if col != 0:
                val_left = sum_apples[row, col - 1]
            else:
                val_left = 0
            if row != 0:
                val_up = sum_apples[row - 1, col]
            else:
                val_up = 0
            sum_apples[row, col] = arr[row, col] + max(val_left, val_up)
    return sum_apples

def collect_apples_nonjit(arr):
    # Identical logic, left un-decorated for comparison.
    sum_apples = np.zeros(shape=arr.shape)
    for row in range(arr.shape[0]):
        for col in range(arr.shape[1]):
            if col != 0:
                val_left = sum_apples[row, col - 1]
            else:
                val_left = 0
            if row != 0:
                val_up = sum_apples[row - 1, col]
            else:
                val_up = 0
            sum_apples[row, col] = arr[row, col] + max(val_left, val_up)
    return sum_apples
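The answer to the original question is then just the bottom-right entry of the accumulated table:

# Maximum number of apples collectible on any down/right path
# across the whole grid.
max_apples = collect_apples(arr)[-1, -1]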

Here are the performance results:

%%timeit
collect_apples(arr)

The slowest run took 4.27 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 99.7 µs per loop

%%timeit
collect_apples_nonjit(arr)

10 loops, best of 3: 50.3 ms per loop

Wow! Over 500-fold speedup! All obtained for free using the @jit decorator.


Data Diagnostics: missingno

written by Eric J. Ma on 2017-02-06

Sometimes, all you need is a visual cue on whether the data you have on hand are complete. Looking at a table can be dizzying at times, so I'm very glad I found this package called missingno! It provides a way to quickly visualize the "nullity" of your dataset. See an example below:

[Figure: displaying the nullity of a data set.]

It's built on top of matplotlib, and takes in pandas DataFrames, which means it plays very nicely with the rest of the PyData stack. I recently took it for a tour when I did a quick stats consult with Mia Lieberman (DCM); the above plot was made using her data, used with permission.
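For reference, the basic usage is essentially a one-liner. Here's a minimal sketch with a toy DataFrame of my own (the plot above was made with Mia's data, not this one):

import numpy as np
import pandas as pd
import missingno as msno

# A toy DataFrame with some missing values, standing in for real data.
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, 2.0, np.nan, 4.0],
    'c': [1.0, 2.0, 3.0, 4.0],
})

# Nullity matrix: one column per DataFrame column; gaps mark missing values.
msno.matrix(df)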

Highly recommended package!


Verily Interview!

written by Eric J. Ma on 2017-01-25

I have finished my first complete job interview! I'm happy for the experience no matter what the result is, as it was a very eye-opening one. I've been in an ivory tower for quite a while, so going out and seeing what the experience is like is worthwhile enough, even though it'd also be very nice icing on the cake to have a job offer too.

If I'm not mistaken, I'm bound by NDA not to write or speak about the specifics of the interview, but I can probably speak about it in general terms. Overall, they were very welcoming! I was lodged in a very nice hotel just a six-minute walk from Verily, with breakfast included (yum!). Five people served as my interviewers, and I had lunch with a more senior person to learn more about the company's history; it turns out two of them have a Boston connection as well.

There was a good mix of coding and biology, and after the interview, I immediately went back and coded up my answer to my second interviewer's questions... only to find out I had done it wrong! Oh well :-). What's done is done, and I now know the combinatorics functions in the Python standard library much better.

I also went back and thought more about the other coding problem, asked by my final interviewer. It was a frequentist statistics question, and I had been very much in a Bayesian frame of mind when he asked it, so I was tripped up by my own biases. (There are foundational philosophical differences between the Bayesian and frequentist statistical mindsets, but I won't go in that direction here.)

Three of my other interviewers were very interested in the biological question behind my work, and so I had a much smoother time (I think) with them. One was a wet-bench scientist, and the other two do computational work as part of the computational biology team. My time with them played to my strengths, which lie in translating biological problems into code rather than in making algorithms efficient: I was able to walk them through how I turned a biological problem into a computable one, and to highlight the limitations of what I did. Not that I'm ruling out learning how to write efficient algorithms - I'd definitely love to master that! My third and fourth interviewers took a deep interest in my work, and I was really, really happy to share what I'd been doing with them.

Overall, a very fun experience! Big thanks to the HR and Comp Bio teams there; it was an intense day, but a good one. I'm now just waiting for my red-eye flight back to Boston to teach a Data Carpentry workshop at Tufts Medical. Hope I survive till noon tomorrow!