Offline Time

written by Eric J. Ma on 2018-01-10

I was locked out of my work computer because of a password issue. (It wasn't human error - something about the corporate management tools locked me out. Okay, well, that's a form of human error too.) As a result, I inadvertently gained a full two hours of offline time on Monday.

Those two hours turned out to be pretty productive. I spent some time sketching out my work projects, trying to make better sense of how the project could fit into a disease area researcher's workflow, and figuring out derivative analyses that could enhance its value to them. This was something I probably wouldn't have been able to accomplish with the regular distractions of my computer nearby.

Seems like Cal Newport's "digital distraction de-cluttering" is a good thing to do. I must do more of it.



Bayesian Uncertainty: A More Nuanced View

written by Eric J. Ma on 2018-01-08

The following thought hit me just last night.

Bayesian inference requires computing uncertainty, and computing that uncertainty is expensive compared to simply computing point estimates or summary statistics. But when exactly is uncertainty useful, and more importantly, actionable? That's something I haven't really appreciated in the past. It's probably not productive to be dogmatic about always computing uncertainty if that uncertainty isn't actionable.
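
To make the trade-off concrete, here's a toy sketch of my own (assuming PyMC3) contrasting a point estimate of a mean with a full posterior over it: the point estimate is essentially free, while the posterior requires MCMC sampling but carries the uncertainty.

```python
import numpy as np
import pymc3 as pm

data = np.random.normal(loc=2.0, scale=1.0, size=50)

# Point estimate: essentially free to compute.
print("sample mean:", data.mean())

# Full posterior over the mean: requires MCMC sampling,
# but quantifies how uncertain that estimate is.
with pm.Model():
    mu = pm.Normal("mu", mu=0, sd=10)
    pm.Normal("obs", mu=mu, sd=1, observed=data)
    trace = pm.sample(1000, tune=1000)

print("posterior mean +/- sd:", trace["mu"].mean(), trace["mu"].std())
```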



Visual Studio Code: A New Microsoft?

written by Eric J. Ma on 2017-12-13

During my week attending PyData NYC 2017, which was effectively a mini-mini-sabbatical from work, I got a chance to try out Visual Studio Code. Part of it was curiosity, having seen so many PyData participants using it; part of it was because Steve Dower, a core CPython developer who works at Microsoft, mentioned the Python-friendly tooling they have added to VS Code.

I think VSCode is representative of a new Microsoft.

But first, let me describe what using it is like.

User Interface

First off, the UI is beautiful. I can't emphasize enough how important the UI is. With minimal configuration, I got it to basically match Atom's UI, which I had grown used to. It has an integrated terminal, and the colours are... wow. The shades of green, blue, and red are amazing, just ever so slightly muted compared to Terminal or iTerm. The background shade of black matches well with the rest of VS Code, and the colour scheme can be changed to match Atom's. The design feels... just right. Wow!

Git Integration

Secondly, the Git integration rivals Atom's; in fact, there's a one-click "sync" button! It also has a nice git commit -am analog, with which I can add and commit all of the files simultaneously.

IntelliSense

Thirdly, IntelliSense is just amazing! I like how I can look up a function signature just by mousing over the function name.

Open Source

Finally, it's fully open source and hackable, in the same vein as Atom, minus the bloat that usually comes with building on top of Electron. Impressive stuff!

Other Thoughts

Now, on the new Microsoft.

Only at this recent PyData NYC did I learn that Microsoft has hired almost half of the core CPython developers! Not only that, they are encouraged to continue contributing to the CPython code base. In my view, that's a pretty awesome development! It means the Python programming language will continue to have strong corporate backing while also enjoying community support. It's a sign of a healthy ecosystem, IMO, and also a sign of Microsoft's support for open source software!

I'm more and more impressed by what Microsoft is doing for the open source community. I'm hoping they'll keep this up!



PyData NYC 2017 Recap

written by Eric J. Ma on 2017-11-30

With that, we've finished PyData NYC! Here are some of my highlights from the conference.

Keynotes

There were three keynotes, one each by Kerstin Kleese van Dam, Thomas Sargent, and Andrew Gelman. Interestingly enough, they didn't do what I would expect most academics to do -- give talks highlighting the accomplishments of their research groups. Rather, Kerstin gave a talk highlighting the use of PyData tools at Brookhaven National Laboratory. Thomas Sargent gave a philosophical talk on what economic models really are (they're "games", in a mathematical sense), and what I took away was the importance of being able to implement models -- otherwise, "you're just bull*****ing".

Andrew Gelman surprised me the most - he gave a wide-ranging talk about the problems we have in statistical analysis workflows. He emphasized that "robustness checks" are basically scams, because they're methods whose purpose is reassurance. He had a really cool example highlighting that we need to understand our models by modifying them, perhaps even using a graph of models to identify perturbations that help us understand the original one. He also peppered his talk with anecdotes about mistakes he has made in his own analysis workflows. I took home a different philosophy of data analysis: when we evaluate how "good" a model is, the operative question is, "compared against what?"

Talks

The talks were, for me, the highlight of the conference - a lot of good learning material all around. Here are the talks from which I learned actionable new material.

Analyzing NBA Foul Calls using Python

This talk, by the prolific PyMC blogger Austin Rochford, was one I really enjoyed. The take-home for me came towards the end of his talk, where I picked up three ways to diagnose probabilistic programming models.

The first was the use of residuals - which I now know can be used for classification problems as well as regression problems.

The second was the use of the energy plots in PyMC3, where if the "energy transition" and "marginal energy distribution" plots match up (especially in the tails), then we know that the NUTS sampler did a great job.

The third was the use of the Gelman-Rubin statistic, which compares within-chain and between-chain variation; values close to 1 are generally considered good.
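
Here's a minimal sketch of my own (not from the talk, assuming PyMC3 3.x) of how the second and third diagnostics can be pulled up on a toy model:

```python
import pymc3 as pm

with pm.Model():
    mu = pm.Normal("mu", mu=0, sd=10)
    pm.Normal("obs", mu=mu, sd=1, observed=[0.2, -0.4, 0.7, 1.1])
    # Multiple chains are needed for the Gelman-Rubin statistic.
    trace = pm.sample(2000, tune=1000, chains=4)

# Energy plot: the "marginal energy" and "energy transition" distributions
# should roughly overlap (especially in the tails) if NUTS sampled well.
pm.energyplot(trace)

# Gelman-Rubin statistic: values close to 1 mean within-chain and
# between-chain variation agree.
print(pm.gelman_rubin(trace))
```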

Check out the talk slides here.

scikit-learn-compatible model stacking

This talk was a great one because it showed how to use model stacking (also known as "ensembling", a technique commonly used in Kaggle competitions) to get better predictions.

Conceptually, model stacking works like this: I train a set of models individually on a problem, then use the predictions from those models as features for a meta-model. The meta-model should perform, at worst, on par with the best model in the ensemble, and may perform better. This idea was first explored in Polley and van der Laan's work, which is available online.
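
Here's a bare-bones sketch of the stacking idea (my own illustration, not the Civis implementation): out-of-fold predictions from the base models become the features for a meta-model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    LogisticRegression(),
]

# Out-of-fold predicted probabilities from each base model,
# stacked column-wise to form the meta-model's feature matrix.
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])

# The meta-model learns how to weight the base models' predictions.
meta_model = LogisticRegression()
meta_model.fit(meta_features, y)
```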

Civis Analytics has released their implementation of model stacking in their GitHub repository, and it's available on PyPI. The best part? They didn't try to invent a new API; they kept a scikit-learn-compatible one. Kudos to them!

Check out the repository here.

Stream Processing with Dask

Matthew Rocklin, as usual, gave an entertaining and informative talk on Streamz, a lightweight library he built to explore the use of Dask for streaming applications. The examples he gave were amazing showcases of the library's capabilities. Given the right project, I'd love to try this out!
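
For flavour, here's a tiny sketch of my own (not from the talk) of Streamz's core idea: declare a pipeline up front, then push data through it as it arrives.

```python
from streamz import Stream

source = Stream()
source.map(lambda x: x * 2).sink(print)  # each element is doubled, then printed

for i in range(5):
    source.emit(i)  # emitting data drives the whole pipeline
```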

Check out his slides here.

Asynchronous Python: A Gentle Introduction

This talk, delivered by James Cropcho, laid out what asynchronous programming is all about. For me, things finally clicked at the end, when I asked him for an example of how asynchronous programming would be used in data analytics workflows, to which he responded: "If you're querying for data and then performing a calculation, then async is a good idea."

The idea is this: web queries written serially are often "blocking", meaning we can't do anything while we wait for the query to return. If we want to do a calculation on the returned data, we have to wait for it to arrive first. Written asynchronously, we could instead do a calculation on the previous data point while waiting for the current result to return, potentially shaving a considerable fraction off the total time.
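
Here's a toy sketch of that idea (my own, using asyncio with a simulated query rather than anything from the talk): because the "queries" overlap while awaiting, three of them take roughly one second rather than three.

```python
import asyncio

async def fetch(i):
    await asyncio.sleep(1)  # stand-in for a web query that would otherwise block
    return i * 10

async def fetch_and_compute(i):
    data = await fetch(i)
    return data + 1  # the calculation on the returned data

async def main():
    # The three queries overlap while each awaits, so this takes ~1 s, not ~3 s.
    results = await asyncio.gather(*(fetch_and_compute(i) for i in range(3)))
    print(results)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
```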

Turning PyMC3 into scikit-learn

This talk was by Nicole Carlson, and she did a tremendously good job delivering it. In it, she walked through how to wrap a PyMC3 model inside a scikit-learn estimator, including details on how to implement the .fit(), .predict(), and .predict_proba() methods. The code in her repository provides a great base example to copy from.
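
The pattern looks roughly like this (a stripped-down sketch of my own, not Nicole's code): a class following scikit-learn's estimator conventions that hides the PyMC3 sampling inside .fit().

```python
import numpy as np
import pymc3 as pm
from sklearn.base import BaseEstimator

class BayesianLinearRegression(BaseEstimator):
    """Toy Bayesian linear regression with a scikit-learn-style API."""

    def fit(self, X, y):
        with pm.Model() as self.model_:
            w = pm.Normal("w", mu=0, sd=10, shape=X.shape[1])
            b = pm.Normal("b", mu=0, sd=10)
            sigma = pm.HalfNormal("sigma", sd=1)
            mu = pm.math.dot(X, w) + b
            pm.Normal("y", mu=mu, sd=sigma, observed=y)
            self.trace_ = pm.sample(1000, tune=1000)
        return self

    def predict(self, X):
        # Posterior-mean point predictions, to keep the sketch short.
        w = self.trace_["w"].mean(axis=0)
        b = self.trace_["b"].mean()
        return np.dot(X, w) + b
```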

One thing I can't emphasize enough is that from a user experience standpoint, it's super important to follow idioms that people are used to. What Nicole did in this talk is to show how we can provide such idioms to end-users, rather than inventing a slightly modified wheel. Props to her for that!

Her slides are online here.

An Attempt at Demystifying Bayesian Deep Learning

This talk was my own, scheduled at the end of the second day. The title definitely contributed to the hype. I popped into the room early, but then left for the restroom; when I got back, there was a lineup at both the front door and the back door. Totally unexpected. That said, big credit goes to the Boston Bayesians organizers, Jordi and Colin, who let me do the talk as a rehearsal for PyData, so I felt very grounded.

The talk went mostly smoothly. I think I was channeling my colleague Brant Peterson's sense of humour. There was one really hilarious hiccup - right after mentioning that I wouldn't overdo the "math" and "equations", I accidentally opened an adjacent tab with an alternate version of the slides... with, surprise surprise, a math equation on it! During the Q&A, when I shared the point about not needing train/test splits in Bayesian analysis, I could sense jaws dropping and eyes widening in disbelief; more than a handful of people came up and asked for the reference later on.

Having done the talk, I now realize how much people will appreciate a lighthearted and lightweight introduction to a topic that's very dense and filled with jargon. Conference speakers, we need to do more of this!

From an emotional standpoint, many people brought me joy with their positive comments on the visuals and structure of the talk. Others posted positive comments on Twitter, which I collected into a Twitter Moment. It was very encouraging, especially on this deep learning journey that I'm on right now.

Tutorials

Interactive matplotlib Figures

This was something I totally didn't realize was possible before: we can create interactive matplotlib figures very easily! I have cloned the repository, and I think it'll be neat to hack on some projects at work using this.
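
As a taste of what's possible (a minimal sketch of my own, not from the tutorial), matplotlib's event-handling API is one way to make a figure respond to clicks:

```python
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.plot(np.random.randn(100))

def on_click(event):
    # event.xdata / event.ydata hold the click location in data coordinates.
    if event.xdata is not None:
        ax.set_title("clicked at x={:.1f}, y={:.1f}".format(event.xdata, event.ydata))
        fig.canvas.draw_idle()

fig.canvas.mpl_connect("button_press_event", on_click)
plt.show()
```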

The tutorial repository can be found here.

Linear Regression Three Ways

This one was by Colin Carroll, a software engineer at the MIT Media Lab (previously at Kensho). I sat in and learned a good deal of math from him. One new thing I learned was how we can specify a model without requiring any observed variables; any sampling we do will still respect the hierarchical and mathematical relationships we've specified. This makes it neat to implement Bayesian nets and test how things will look under different scenarios!
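
Here's a small sketch of my own (not from the tutorial) of what that looks like in PyMC3: no observed variables anywhere, yet the samples respect the hierarchical and deterministic relationships.

```python
import pymc3 as pm

with pm.Model():
    group_mu = pm.Normal("group_mu", mu=0, sd=1)
    # Each unit's mean is tied to the group-level mean (the hierarchical link).
    unit_mu = pm.Normal("unit_mu", mu=group_mu, sd=0.5, shape=5)
    # A purely mathematical relationship, recorded in the trace.
    total = pm.Deterministic("total", unit_mu.sum())
    trace = pm.sample(1000)

# With nothing observed, this is effectively sampling from the prior,
# so `total` reflects both the hierarchy and the deterministic relation.
print(trace["total"].mean(), trace["total"].std())
```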

His tutorial repository can be found here.

Top-To-Bottom, Line-By-Line

This one was led by the ever-entertaining, ever-surprising James Powell. I wish the tutorial had been recorded, because even though it was billed as a "novice" tutorial, it was nonetheless an eye-opener. Anybody who thinks they know Python should go listen to James' talks whenever he gives them live - they're bound to be entertaining and eye-opening!

Overall Thoughts

I'm glad I made it to PyData NYC this year. I made many new friends and connections, caught up with old friends in the PyData community, and, as always, learned a ton!



Bayesian Learning and Overfitting

written by Eric J. Ma on 2017-11-16

Yesterday, after my Boston Bayesians dry-run talk, a point was raised that I had only heard once before: Bayesian learning methods don't overfit, which means we're allowed to use all the data on hand. The point holds for simple Bayesian networks as well as for more complicated deep neural nets.

Though I believed it, I wasn't 100% convinced myself, so I decided to look it up. I managed to get my hands on Radford Neal's book, Bayesian Learning for Neural Networks, and found the following quotable paragraphs:

It is a common belief, however, that restricting the complexity of the models used for such tasks is a good thing, not just because of the obvious computational savings from using a simple model, but also because it is felt that too complex a model will overfit the training data, and perform poorly when applied to new cases. This belief is certainly justified if the model parameters are estimated by maximum likelihood. I will argue here that concern about overfitting is not a good reason to limit complexity in a Bayesian context.

A few paragraphs later, after explaining the frequentist procedure:

From a Bayesian perspective, adjusting the complexity of the model based on the amount of training data makes no sense. A Bayesian defines a model, selects a prior, collects data, computes the posterior, and then makes predictions. There is no provision in the Bayesian framework for changing the model or the prior depending on how much data was collected. If the model and prior are correct for a thousand observations, they are correct for ten observations as well (though the impact of using an incorrect prior might be more serious with fewer observations). In practice, we might sometimes switch to a simpler model if it turns out that we have little data, and we feel that we will consequently derive little benefit from using a complex, computationally expensive model, but this would be a concession to practicality, rather than a theoretically desirable procedure.

Finally, in the following section after describing how neural networks are built:

In a Bayesian model of this type, the role of the hyperparameters controlling the priors for weights is roughly analogous to the role of a weight decay constant in conventional training. With Bayesian training, values for these hyperparameters (more precisely, a distribution of values) can be found without the need for a validation set.

This seems to dovetail well with a (admittedly convoluted) intuition I've had: if I fit a Bayesian model on the "training" set of the data, then update it with the "test" set, it's equivalent to just training with the whole dataset. With wide priors, if I fit with a smaller dataset, my posterior distribution will be wider than if I fit with the entire dataset. So, where possible, just train with the entire dataset. That said, I haven't had sufficient grounding in Bayesian stats (I'm still a newcomer, after all) to justify this rigorously.
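
As a sanity check on the "sequential update equals batch update" part, here's a toy worked example of my own (not from Neal's book) using a conjugate Beta-Binomial model, where the posterior can be written down exactly:

```python
# Wide Beta(1, 1) prior on a coin's heads probability.
alpha, beta = 1, 1

train = [1, 0, 1, 1, 0, 1]   # "training" flips
test = [1, 1, 0, 1]          # "test" flips

# Sequential update: the posterior after train becomes the prior for test.
a_seq = alpha + sum(train) + sum(test)
b_seq = beta + (len(train) - sum(train)) + (len(test) - sum(test))

# Batch update: all the data at once.
full = train + test
a_full = alpha + sum(full)
b_full = beta + len(full) - sum(full)

assert (a_seq, b_seq) == (a_full, b_full)  # identical posteriors

# And with fewer data points, the posterior is wider (larger variance).
def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

print(beta_var(alpha + sum(train), beta + len(train) - sum(train)))  # train only: wider
print(beta_var(a_full, b_full))                                      # full data: narrower
```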

I certainly have more reading and learning to do here. It looks like something neat to explore in the short term.
