Always Check Your Data

written by Eric J. Ma on 2017-10-31

True story, just happened today. I was trying to fit a Poisson likelihood to estimate event cycle times (in discrete weeks). For certain columns, everything went perfectly fine. Yet for other columns, I was getting negative-infinity likelihoods, and I banged my head against the problem for over an hour and a half.

As things turned out, the columns that gave me negative-infinity likelihoods at initialization did so because of negative values in the data. Try fitting a Poisson likelihood, which only has support on the non-negative integers, on that!

This lost hour and a half was a good lesson in data checking/testing: always sanity-check the basic stats of the data - bounds (min/max), central tendency (mean/median/mode), and spread (variance, interquartile range). Always check!
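
If I were writing that check as code today, a minimal sketch might look like this (assuming the cycle-time columns live in a pandas DataFrame named df, a name I'm using purely for illustration):

import pandas as pd

def sanity_check(df: pd.DataFrame):
    # Bounds, central tendency and spread, all in one shot.
    print(df.describe())  # count, mean, std, min, quartiles, max
    # A Poisson likelihood has support only on the non-negative integers,
    # so any negative value should fail loudly before fitting begins.
    assert (df.min() >= 0).all(), "Negative values found; check your data!"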



Random Forests: A Good Default Model?

written by Eric J. Ma on 2017-10-27

I've been giving this some thought, and wanted to go out on a limb to put forth this idea:

I think Random Forests (RF) are a good "baseline" model to try, after establishing a "random" baseline case.

(Clarification: I'm using RF as a shorthand for "forest-based ML algorithms", including XGBoost etc.)

Before I go on, let me first provide some setup.

Let's say we have a two-class classification problem, and assume the classes are balanced. One "dumb baseline" case is a coin flip. The other "dumb baseline" is predicting everything to be one class. Once we have these established, we can move on to a "baseline" machine learning model.
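
For concreteness, here's a minimal sketch of what those two dumb baselines might look like with scikit-learn's DummyClassifier (the toy dataset and variable names are mine, purely for illustration):

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# A balanced two-class toy problem.
X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Dumb baseline 1: a coin flip.
coin_flip = DummyClassifier(strategy="uniform")
# Dumb baseline 2: predict everything to be one class.
one_class = DummyClassifier(strategy="most_frequent")

for baseline in (coin_flip, one_class):
    baseline.fit(X_train, y_train)
    print(baseline.score(X_test, y_test))  # accuracy on held-out data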

Usually, people will say "go do logistic regression (LR)" as your first baseline model for classification problems. It's certainly a principled choice! Logistic regression is geared towards classification problems, assumes only a linear relationship (in the log-odds) between the features and the outcome, and identifies directional effects as well. From a practical perspective, it's also very fast to train.

But I've found myself more and more being oriented towards using RFs as my baseline model instead of logistic regression. Here are my reasons:

  1. Practically speaking, any modern computer can train an RF model with ~1000+ trees in not much more time than it would need for an LR model.
  2. By using RFs, we do not make linearity assumptions about the data.
  3. Additionally, we don't have to scale the data (one less thing to do).
  4. RFs will automatically learn non-linear interaction terms in the data, which is not possible without further feature engineering in LR.
  5. As such, the out-of-the-box performance of large RFs with default settings is often very good, which makes beating that classifier a much more intellectually interesting challenge.
  6. With scikit-learn, it's a one-liner change to swap out LR for RF. The API is what matters, and as such, drop-in replacements are easily implemented (see the sketch below).
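
To make point 6 concrete, here's a minimal sketch of the swap, reusing the toy data (X_train and friends) from the dumb-baseline sketch above; again, an illustration rather than a recipe:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Baseline: logistic regression.
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Drop-in replacement: a random forest with ~1000 trees.
rf = RandomForestClassifier(n_estimators=1000)
rf.fit(X_train, y_train)

print(lr.score(X_test, y_test), rf.score(X_test, y_test))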

Just to be clear, I'm not advocating for throwing away logistic regression altogether. There are moments when interpretability is needed, and LR makes that easier. In those cases, LR can be the "baseline model", or it can simply be back-filled in after training the baseline RF model, for comparison.

Random Forests were the darling of the machine learning world before neural networks came along, and even now, remain the tool-of-choice for colleagues in the cheminformatics world. Given how easy they are to use now, why not just start with them?



PyPy: Impressive!

written by Eric J. Ma on 2017-10-11

A few years after trying out PyPy for the first time and wrestling with it, I still find it to be pretty awesome.

Now that PyPy officially supports numpy, I'm going to profile two simple statistical simulation tasks:

  1. Computing the mean of 10 million random number draws.
  2. Simulating 10 million biased coin flips (with p = 0.3).

I'll profile each of the tasks four ways:

  1. A pure Python implementation, run under CPython.
  2. A pure Python implementation, run under PyPy.
  3. A numpy implementation, run under CPython.
  4. A numpy implementation, run under PyPy.

So, how do PyPy and CPython fare? Let's show the results up front.

(Figure: profiling results for each task and implementation under CPython and PyPy. The raw recorded measurements can be found on Google Sheets.)

Here's a description of what's happening:

It's pretty clear that when PyPy is dealing with "pure" data (i.e. not having to pass data between Python and C), it runs very, very fast; at least in the scenarios tested here, it's faster than the CPython interpreter. This is consistent with my previous observations, and probably explains why PyPy is very good for highly repetitive code: the JIT tracer really speeds things up.

That last plot (bottom-right) is a big curiosity. Using the code below, I measured that the random number generation itself is just as fast as it is under CPython, but that PyPy fared badly when I passed the resulting numpy array to the Counter() object (from the standard library's collections module). I'm not sure what is happening behind the scenes, but I have reached out to the PyPy developers to ask what's going on, and will update this post at a later date.

UPDATE: I heard back from the PyPy devs on Bitbucket, and this is indeed explained by data transfer across the C-to-PyPy interface. It's analogous to the latency that arises from transferring data between the CPU and GPU, or between compute nodes.

So, what does this mean? It means that for pure Python code, PyPy can be a very powerful way to accelerate your code. One example I can imagine is agent-based simulations using Python objects. Another example that comes to mind is running a web server that only ever deals with strings, floats and JSONs (in contrast to matrix-heavy scientific computing).
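
To give a flavour of what I mean by "pure Python code", here's the kind of object-heavy, loop-heavy toy agent-based simulation I have in mind. This is a sketch I'm making up purely for illustration; I haven't benchmarked this particular snippet:

from random import random

class Agent:
    """A toy agent that accumulates wealth by chance."""
    def __init__(self):
        self.wealth = 1.0

    def step(self):
        # Pure Python arithmetic and attribute access; no C extensions involved.
        if random() < 0.5:
            self.wealth += 1.0

agents = [Agent() for _ in range(10000)]
for _ in range(100):  # 100 simulation steps
    for agent in agents:
        agent.step()

print(sum(agent.wealth for agent in agents) / len(agents))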

Now, for those who are curious, here's the source code for the pure Python implementation of the mean of random numbers.

# Mean of 10 million random number draws.
from time import time
from random import random

start = time()
n = 1E7
rnds = 0
for i in range(int(n)):
    rnds += random()

print(rnds / n)
end = time()
print("{} seconds".format(end - start))

And here's the source code for the numpy implementation of the mean of random numbers.

import numpy as np
from time import time

start = time()
mean = np.mean(np.random.random(size=int(1E7)))
print(mean)
end = time()

print('{} seconds'.format(end - start))

Next, here's the source code for coin flips in pure Python:

# Simulate 10 million biased coin flips with p = 0.3
from random import random
from time import time
from collections import Counter

start = time()


def bernoulli(p):
    rnd = random()
    if rnd < p:
        return True
    else:
        return False


p = 0.3
result = [bernoulli(p) for i in range(int(1E7))]
print(Counter(result))
end = time()

print('{} seconds'.format(end - start))

And finally, source code for coin flips using numpy:

from numpy.random import binomial
from time import time
from collections import Counter

start = time()
coinflips = binomial(n=1, p=0.3, size=int(1E7))
end = time()
print('Time for numpy coinflips: {} seconds'.format(end - start))

# Tallying the numpy array with Counter() is the step that trips up PyPy,
# thanks to the C-to-PyPy data transfer described above.
print(Counter(coinflips))
end = time()

# Total elapsed time, including the Counter() tally.
print('{} seconds'.format(end - start))



PyData NYC 2017

written by Eric J. Ma on 2017-10-10

I'm seriously looking forward to PyData NYC this year -- there's a great lineup of talks that I can't wait to hear! The theme for my set of must-see talks this year is "Bayesian machine learning" - there's much for me to learn!

The first is by my fellow Boston Bayesian Colin Carroll, with his talk titled Two views on regression with PyMC3 and scikit-learn. Colin is a mathematician at heart, even though he does software engineering for a living now, and I can't wait to hear about regularization strategies!

The second is by Nicole Carlson, with her talk titled Turning PyMC3 into scikit-learn. Nicole's talk is of interest to me because I've implemented models in PyMC3 before, and now would like to know how to make them reusable!

The third is by Chaya Stern, with her talk titled Bayesian inference in computational chemistry. Super relevant to my work at Novartis!

The fourth is by my fellow Boston Pythonista Joe Jevnik, who will be speaking on the first day about his journey into deep learning on some really cool time-series data. He works at Quantopian, BUT the spoiler here is that his talk is NOT about financial data! (I've heard his talk outline already.)

The fifth is a tutorial by Jacob Schreiber, titled pomegranate: fast and flexible probabilistic modeling in python. pomegranate's API is modeled after scikit-learn's; with the API being the user-facing interface, and scikit-learn being the de facto go-to library for machine learning, I'd be interested to see how much pomegranate adds to the ecosystem, particularly w.r.t. Bayesian models.

There's a swathe of other good talks that I'm expecting to be able to catch online later on. Matt Rocklin, the lead developer of Dask, has done a ton of work on speeding Python up through parallelism. His talk will be on the use of Cython & Dask to speed up GeoPandas.

Also, Thomas Caswell, one of the matplotlib lead devs who helped guide my first foray into open source contributions, is giving a tutorial on developing interactive figures in matplotlib. Highly recommended if you're into the visualization world!

Finally, the always-interesting, always-entertaining en zyme will be speaking on an intriguing topic.

Looking forward to being at the conference, and meeting old and new friends there!



Recursive Programming and DAGs

written by Eric J. Ma on 2017-10-10

Over the past few days, I've found myself using recursive programming to implement a "model specification" system with inheritance for deep learning. The goal is to enable reproducible computational experiments for particular sets of deep learning hyperparameters. Reproducibility is something I learned from the Software/Data Carpentry initiative, so I wanted to ensure that my own work was reproducible, even if (for corporate reasons) it can't be opened up, because it's the right thing to do.

So, how do these "model spec" files work? I call them "experiment profiles", and they specify a bunch of things: model architecture, training parameters, and data tasks. These experiment profiles are stored in YAML files on disk. A profile essentially looks like the following (dummy examples provided, naturally):

# Name: default.yaml
parent: null
data_tasks:
    tasks: [task1, task2, task3]
model_architecture:
    hidden_layers: [20, 20, 20]
    hidden_dropouts: [0.1, 0.2, 0.3]
training_parameters:
    optimizer: "sgd"
    optimizer_options:
        n_epochs: 20

In this YAML file, the key-value pairs essentially match the API of the tooling I've built on top of Keras to make myself more productive. (From the example, it should be clear that we're dealing with only feed-forward neural networks and nothing more complicated.) The key here (pun unintended) is the parent key-value pair, which specifies another experiment profile that I can inherit from.

Let's call the above example default.yaml. Let's say I want to run another computational experiment that uses the adam optimizer instead of plain vanilla sgd. Instead of re-specifying the entire YAML file, the inheritance scheme lets me re-specify only the part that changes; in this case, just the optimizer under training_parameters.

# Name: adam.yaml
parent: "default.yaml"
training_parameters:
    optimizer: "adam"

Finally, let's say I find out that 20 epochs (inherited from default.yaml) is too much for Adam - after all, Adam is one of the most efficient gradient descent algorithms out there - and I want to change it to 3 epochs instead. I can do the following:

# Name: adam-3.yaml
parent: "adam.yaml"
training_parameters:
    optimizer_options:
        n_epochs: 3

Okay, so specifying YAML files with inheritance is all good, but how do I ensure that I get the entire parameter set out correctly, without writing verbose code? This is where the power of recursive programming comes in. Using recursion, I can solve this problem with a single function that calls itself on one condition, and returns a result on another condition. That's a recursive function in its essence.

The core of this problem is traversing the inheritance path, from adam-3.yaml to adam.yaml to default.yaml. Once I have the inheritance path specified, loading the YAML files as a dictionary becomes the easy part.

What would this look like in code? Let's take a look at an implementation.

import yaml

def inheritance_path(yaml_file, path):
    """
    Recursively trace the chain of parent YAML files up to the root.

    :param str yaml_file: The path to the yaml file of interest.
    :param list path: A list specifying the existing inheritance path. The first
        entry is the file of interest, and parents are recursively appended to
        the end of the list.
    """
    with open(yaml_file, 'r') as f:
        p = yaml.safe_load(f)  # safe_load avoids PyYAML's unsafe default loader
        if p['parent'] is None:
            return path
        else:
            path.append(p['parent'])
            return inheritance_path(p['parent'], path)

The most important part of the function is the if/else block. If I have reached the "root" of the inheritance path (that is, I have hit default.yaml, which has no parent), then I return the path traversed. Otherwise, I call inheritance_path again, but with an updated path list and a different yaml_file to read. It's a bit like a while loop, but, in my opinion, aesthetically a bit more elegant.

Once I've gotten the path list, I can finally load the parameters using a single function that calls on inheritance_path.

def load_params(yaml_file):
    path = inheritance_path(yaml_file, [yaml_file])
    p = dict(data_tasks=dict(),
             model_architecture=dict(),
             training_parameters=dict())
    for fn in path[::-1]:  # go in reverse: from the root down to the file of interest!
        with open(fn, 'r') as f:
            for k, v in yaml.safe_load(f).items():
                if k in p.keys():
                    # dict.update() is a shallow merge, so nested dicts
                    # (e.g. optimizer_options) are replaced wholesale.
                    p[k].update(v)
    return p
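
As a quick sanity check, here's how I'd expect this to behave on the three profiles above, assuming they all sit in the working directory:

params = load_params("adam-3.yaml")
print(params["training_parameters"])
# {'optimizer': 'adam', 'optimizer_options': {'n_epochs': 3}}
print(params["model_architecture"])
# {'hidden_layers': [20, 20, 20], 'hidden_dropouts': [0.1, 0.2, 0.3]}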

This is the equivalent of traversing a Directed Acyclic Graph (DAG), or in some special cases, a tree data structure, but in a way where we don't have to know the entire tree structure ahead of time. The goal is to reach the root from any node:

root
    |- A
        |- B
        |- C
            |- D
            |- E
    |- F
        |- G
        |- H
        |- I 
            |- J

Also, because each YAML file holds just one pointer to its parent, we have effectively created a "Linked List" that we can use to trace a path back to the "root" node, collecting the information we need along the way. With this method of traversal, we only need to know each node's immediate neighbour, and at some point (however long it takes), we will reach the root.

D -> C -> A -> root
E -> C -> A -> root
J -> I -> F -> root

If you were wondering why linked lists, trees and other data structures might be useful to a data scientist, I hope this illustrates one productive example!
