Bayesian Learning and Overfitting

written by Eric J. Ma on 2017-11-16

Yesterday, after my dry-run talk at Boston Bayesians, someone raised a point that I had heard only once before: Bayesian learning methods don't overfit, which means we're allowed to use all the data on hand. The point holds for simple Bayesian networks, and for more complicated deep neural nets.

Though I was inclined to believe it, I wasn't 100% convinced myself, so I decided to look it up. I managed to get my hands on Radford Neal's book, Bayesian Learning for Neural Networks, and found the following quotable paragraphs:

It is a common belief, however, that restricting the complexity of the models used for such tasks is a good thing, not just because of the obvious computational savings from using a simple model, but also because it is felt that too complex a model will overfit the training data, and perform poorly when applied to new cases. This belief is certainly justified if the model parameters are estimated by maximum likelihood. I will argue here that concern about overfitting is not a good reason to limit complexity in a Bayesian context.

A few paragraphs later, after explaining the frequentist procedure:

From a Bayesian perspective, adjusting the complexity of the model based on the amount of training data makes no sense. A Bayesian defines a model, selects a prior, collects data, computes the posterior, and then makes predictions. There is no provision in the Bayesian framework for changing the model or the prior depending on how much data was collected. If the model and prior are correct for a thousand observations, they are correct for ten observations as well (though the impact of using an incorrect prior might be more serious with fewer observations). In practice, we might sometimes switch to a simpler model if it turns out that we have little data, and we feel that we will consequently derive little benefit from using a complex, computationally expensive model, but this would be a concession to practicality, rather than a theoretically desirable procedure.

Finally, in the following section after describing how neural networks are built:

In a Bayesian model of this type, the role of the hyperparameters controlling the priors for weights is roughly analogous to the role of a weight decay constant in conventional training. With Bayesian training, values for these hyperparameters (more precisely, a distribution of values) can be found without the need for a validation set.

This seems to dovetail well with the following convoluted intuition that I've had: if I fit a Bayesian model on the "training" set of the data, then update it with the "test" set, it's equivalent to just training with the whole dataset. With wide priors, if I fit with a smaller dataset, my posterior distribution will be wider than if I fit with the entire dataset. So... where possible, just train with the entire dataset. That said, I've not had sufficient grounding in Bayesian stats (after all, still a newcomer) to justify this.
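As a minimal sketch of that intuition, here is a conjugate Beta-Binomial example in which the posterior is exact, so the two routes can be compared directly. The coin-flip data, the flat Beta(1, 1) prior, and the helper names `update_beta` and `beta_variance` are all assumptions made for illustration; the sketch only shows that sequential and all-at-once updating agree in this simple case, not that Bayesian models never overfit.

```python
from fractions import Fraction


def update_beta(alpha, beta, observations):
    """Conjugate update of a Beta(alpha, beta) prior with 0/1 observations."""
    successes = sum(observations)
    failures = len(observations) - successes
    return alpha + successes, beta + failures


def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) distribution, as an exact fraction."""
    total = alpha + beta
    return Fraction(alpha * beta, total * total * (total + 1))


# Hypothetical coin-flip data, split into "training" and "test" portions.
train = [1, 0, 1, 1, 0, 1]
test = [1, 1, 0, 1]

prior = (1, 1)  # flat Beta(1, 1) prior, standing in for a "wide" prior

# Route 1: fit on the training set, then update that posterior with the test set.
posterior_train = update_beta(*prior, train)
posterior_sequential = update_beta(*posterior_train, test)

# Route 2: fit once on the whole dataset.
posterior_full = update_beta(*prior, train + test)

print("sequential:", posterior_sequential)
print("all at once:", posterior_full)
print("variance, training set only:", float(beta_variance(*posterior_train)))
print("variance, full dataset:", float(beta_variance(*posterior_full)))
```

Both routes end at the same Beta posterior, and the posterior fitted on the training portion alone has the larger variance. With approximate inference (MCMC or variational methods) the agreement would only be approximate.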

I certainly have more reading/learning to do here. Looks like something neat to explore in the short-term.

Did you enjoy this blog post? Let's discuss more!

The Value of Thinking Simply

written by Eric J. Ma on 2017-11-14

Einstein has a famous quote that most people don't hear about.

It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience.

Albert Einstein

Instead, most people hear the misquote:

Everything should be made as simple as possible, but no simpler.

Misquoted version

Though a misquote, it's still a fair (though lopsided -- missing a sufficient translation of the latter half) simplification of the original.

In my work, I'm reminded of this point. I can choose to go for the complex fancy thing, but if I don't start from first principles, or start with simplistic approximations, I will struggle to have a sufficiently firm grasp on a problem to start tackling it. And therein lies the key, I think, in making progress on creative, intellectual work.

The past week, I've noticed myself not wasting time on mindless coding (which usually amounts to re-running code with tweaks), and instead devoting more time to strategic thinking. As an activity, strategic thinking isn't just sitting there and thinking. For me, it involves writing and re-writing what I'm thinking, drawing and re-drawing what I'm seeing, and arranging and composing the pieces that are floating in my mind. During that time of writing, drawing, arranging and composing, I'm questioning myself, "What if I didn't have this piece?". Soon enough, the "simplest complex version" (SCV) of whatever I'm working on begins to emerge -- but it never really is the final version! I go back and prototype it in code, and then get stuck on something, and realize I left something out in that SCV, and re-draw the entire SCV from scratch.

Here's my misquote, then, offered up:

Sufficiently simple, and only necessarily complex.

A further mutated version.

Boston Bayesians Talk: An Attempt at Demystifying Bayesian Deep Learning

written by Eric J. Ma on 2017-11-03

It's confirmed! I will be rehearsing my PyData NYC talk at Boston Bayesians, held at McKinsey's office.

This time round, I've challenged myself with making the slides without using PowerPoint or Keynote, and I think I've successfully done it! Check them out:

Side note, I'm starting to really love what we can do with the web!

Always Check Your Data

written by Eric J. Ma on 2017-10-31

True story, just happened today. I was trying to fit a Poisson likelihood to estimate event cycle times (in discrete weeks). For certain columns, everything went perfectly fine. Yet for other columns, I was getting negative-infinity likelihoods, and was banging my head over this problem for over an hour and a half.

As things turned out, those columns that gave me negative-infinity likelihood initializations were doing so because of negative values in the data. Try fitting a Poisson likelihood, which only has support on the non-negative integers, on that!

This lost hour and a half was a good lesson in data checking/testing: always be sure to sanity check basic stats associated with the data - bounds (min/max), central tendency (mean/median/mode) and spread (variance, interquartile range) - always check!
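A minimal sketch of that kind of check, using only the Python standard library. The two columns, the rate of 4.0, and the helper names (`poisson_logpmf`, `poisson_loglike`, `summarize`) are made up for illustration and are not the original analysis code.

```python
import math
import statistics


def poisson_logpmf(k, rate):
    """Log-probability of count k under Poisson(rate); -inf outside the support."""
    if k < 0 or k != int(k):
        return -math.inf
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def poisson_loglike(values, rate):
    """Total Poisson log-likelihood of a column of counts."""
    return sum(poisson_logpmf(k, rate) for k in values)


def summarize(values):
    """Bounds, central tendency, and spread for a column of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),
        "iqr": q3 - q1,
    }


# Hypothetical cycle times in weeks; the second column hides a negative value.
good_column = [3, 5, 4, 6, 2, 4]
bad_column = [3, 5, -1, 6, 2, 4]

for name, column in [("good", good_column), ("bad", bad_column)]:
    stats = summarize(column)
    print(name, "min:", stats["min"], "log-likelihood:", poisson_loglike(column, rate=4.0))
```

A single negative entry is enough to drive the whole column's log-likelihood to negative infinity, and the `min` in the summary exposes it immediately.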

Random Forests: A Good Default Model?

written by Eric J. Ma on 2017-10-27

I've been giving this some thought, and wanted to go out on a limb to put forth this idea:

I think Random Forests (RF) are a good "baseline" model to try, after establishing a "random" baseline case.

(Clarification: I'm using RF as a shorthand for "forest-based ML algorithms", including XGBoost etc.)

Before I go on, let me first provide some setup.

Let's say we have a two-class classification problem. Assume everything is balanced. One "dumb baseline" case is a coin flip. The other "dumb baseline" is predicting everything to be one class. Once we have these established, we can go to a "baseline" machine learning model.
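The two "dumb baselines" can be sketched in a few lines of plain Python. The labels below are synthetic and perfectly balanced, and the `accuracy` helper is a hypothetical name introduced only for this illustration.

```python
import random
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


rng = random.Random(42)

# Hypothetical balanced two-class labels.
y_true = [0] * 500 + [1] * 500

# Dumb baseline 1: flip a fair coin for every sample.
coin_flip = [rng.randint(0, 1) for _ in y_true]

# Dumb baseline 2: predict a single class for every sample.
majority_label = Counter(y_true).most_common(1)[0][0]
constant = [majority_label] * len(y_true)

print("coin flip accuracy:", accuracy(y_true, coin_flip))
print("single-class accuracy:", accuracy(y_true, constant))
```

On balanced data both baselines hover around 50% accuracy, which is the bar any "real" baseline model has to clear.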

Usually, people might say, "go do logistic regression (LR)" as your first baseline model for classification problems. It sure is a principled choice! Logistic regression is geared towards classification problems, makes only linear assumptions about the data, and identifies directional effects as well. From a practical perspective, it's also very fast to train.

But I've found myself more and more being oriented towards using RFs as my baseline model instead of logistic regression. Here are my reasons:

  1. Practically speaking, any modern computer can train an RF model with ~1000+ trees in not much more time than it would need for an LR model.
  2. By using RFs, we do not make linearity assumptions about the data.
  3. Additionally, we don't have to scale the data (one less thing to do).
  4. RFs will automatically learn non-linear interaction terms in the data, which is not possible without further feature engineering in LR.
  5. As such, the out-of-the-box performance using large RFs with default settings is often very good, making for a much more intellectually interesting challenge in trying to beat that classifier.
  6. With scikit-learn, it's a one-liner change to swap out LR for RF. The API is what matters, and as such, drop-in replacements are easily implemented!
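To illustrate the last point, here is a hedged sketch of the swap on a synthetic dataset. The dataset, the hyperparameters (`n_estimators=100`, `max_iter=1000`), and the variable names are assumptions chosen for a quick demonstration, not a benchmark.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class problem, purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Both estimators share the same fit/score API, so swapping one for the
# other only changes the line that constructs the model.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on held-out data
    print(name, scores[name])
```

Because the rest of the pipeline is untouched, it costs very little to keep both models around and compare them side by side.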

Just to be clear, I'm not advocating for throwing away logistic regression altogether. There are moments where interpretability is needed, and that is more easily achieved with LR. In those cases, LR can be the "baseline model", or even just back-filled in after training the baseline RF model for comparison.

Random Forests were the darling of the machine learning world before neural networks came along, and even now, remain the tool-of-choice for colleagues in the cheminformatics world. Given how easy they are to use now, why not just start with them?
