I first learned GPs about two years back, and have been fascinated by the idea. I learned it through a video by David MacKay, and managed to grok it enough that I could put it to use in simple settings. That was reflected in my Flu Forecaster project, in which my GPs were trained only on individual latent spaces.

Recently, though, I decided to seriously sit down and try to grok the math behind GPs (and other machine learning models). To do so, I worked through Nando de Freitas' YouTube videos on GPs. (Super thankful that he has opted to put these videos up online!)

The product of this learning is two-fold. Firstly, I have added a GP notebook to my Bayesian analysis recipes repository.

Secondly, I have also put together some hand-written notes on GPs. (For those who are curious, I first hand-wrote them on paper, then copied them into my iPad mini using a Wacom stylus. We don't have the budget at the moment for an iPad Pro!) They can be downloaded here.

Some lessons learned:

- Algebra is indeed a technology of sorts (to quote Jeremy Kun's book). Being less sloppy than I used to be gives me the opportunity to connect ideas on the page to ideas in my head, and express them more succinctly.
- Grokking the math behind GPs at the minimum requires one thing: remembering, or else knowing how to derive, the formula for the distribution parameters of a multivariate Gaussian conditioned on some of its variables.
- Once I grokked the math, implementing a GP using only NumPy was trivial; also, extending it to higher dimensions was similarly trivial!
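To make the conditioning lesson concrete, here is a minimal NumPy sketch. It is not the code from my notebook; the squared-exponential kernel, the zero mean, the jitter value, and all function names are assumptions picked for illustration.

```python
import numpy as np


def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)


def condition_gaussian(mu_a, mu_b, K_aa, K_ab, K_bb, b_observed):
    """Parameters of p(a | b = b_observed) for jointly Gaussian (a, b).

    mean = mu_a + K_ab K_bb^{-1} (b_observed - mu_b)
    cov  = K_aa - K_ab K_bb^{-1} K_ba
    """
    solved = np.linalg.solve(K_bb, K_ab.T)  # K_bb^{-1} K_ba
    mean = mu_a + solved.T @ (b_observed - mu_b)
    cov = K_aa - K_ab @ solved
    return mean, cov


def gp_predict(x_train, y_train, x_test, jitter=1e-6):
    """Zero-mean GP posterior at x_test (assumed kernel and jitter)."""
    K_bb = rbf_kernel(x_train, x_train) + jitter * np.eye(len(x_train))
    K_ab = rbf_kernel(x_test, x_train)
    K_aa = rbf_kernel(x_test, x_test)
    mu_a = np.zeros(len(x_test))
    mu_b = np.zeros(len(x_train))
    return condition_gaussian(mu_a, mu_b, K_aa, K_ab, K_bb, y_train)


# Made-up training data, purely for illustration.
x_train = np.array([-2.0, 0.0, 1.5])
y_train = np.sin(x_train)
x_test = np.linspace(-3.0, 3.0, 7)

mean, cov = gp_predict(x_train, y_train, x_test)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))  # pointwise uncertainty
```

The GP itself is just `condition_gaussian` applied to kernel matrices. Going to higher dimensions only means swapping in a kernel that computes distances between vectors.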

*Did you enjoy this blog post? Let's discuss more!*

deep learning bayesian math data science

Last week, I picked up Jeremy Kun's book, "A Programmer's Introduction to Mathematics". In it, I finally found an explanation for my frustrations when reading math papers:

What programmers would consider “sloppy” notation is one symptom of the problem, but there are other expectations on the reader that, for better or worse, decelerate the pace of reading. Unfortunately I have no solution here. Part of the power and expressiveness of mathematics is the ability for its practitioners to overload, redefine, and omit in a suggestive manner. Mathematicians also have thousands of years of “legacy” math that require backward compatibility. Enforcing a single specification for all of mathematics - a suggestion I frequently hear from software engineers - would be horrendously counterproductive.

Reading just *that* paragraph explained, in such a lucid manner, how my frustrations with reading mathematically-oriented papers stemmed from mismatched expectations. I come into a paper thinking like a software engineer: I expect descriptive variable names (as encouraged by Python), which are standardized as well, with structured abstractions providing a hierarchy of logic between chunks of code... No, mathematicians are more like Shakespeare - or perhaps linguists - in that they will take a symbol and imbue it with a subtly new meaning or interpretation inspired by a new field. That "L" you see in one field of math doesn't always mean exactly the same thing in another field.

The contrast is stark when compared against reading a biology paper. With a biology paper, if you know the key wet-bench experiment types (and there aren't that many), you can essentially get the gist of a paper by reading the abstract and dissecting the figures, which, granted, are described and labelled with field-specific jargon, but at least carry descriptive names. With a math-oriented paper, the equations are the star, and one has to really grok each element of the equations to know what they mean. It means taking the time to dissect each equation and ask what each symbol is, what each group of symbols means, and how those underlying ideas connect with one another and with other ideas. It's not unlike a biology paper, but it requires a different kind of patience, one that I wasn't trained in.

As Jeremy Kun wrote in his book, programmers do have some sort of a leg-up when it comes to reading and understanding math. It's a bit more than what Kun wrote, I think - yes, many programming ideas have deep mathematical connections. But I think there's more.

One thing we know from research into how people learn is that teaching someone something is an incredible way to learn that something. From my prior experience, the less background a student has in a subject, the more demands are placed on the teacher's understanding of it, as we work through the multiple representations in our heads to try to communicate it to them.

As it turns out, we programmers have the ultimate dumb "student" available at our fingertips: Our computers! By implementing mathematical ideas in code, we are essentially "teaching" the computer to do something mathematical. Computers are not smart; they are programmed to do exactly what we input to them. If we get an idea wrong, our implementation of the math will likely be wrong. That fundamental law of computing shows up again: Garbage in, garbage out.

More than just that, when we programmers implement a mathematical idea in code, we can start putting our "good software engineering" ideas into place! It helps the math become stickier when we can see, through code, the hierarchy of concepts that are involved.

An example, for me, comes from the deep learning world. I attempted to dissect two math-y deep learning papers last week. Skimming through the papers didn't do much good for my understanding of them. Neither did trying to read them like I do a biology paper. Sure, I could perhaps just read the ideas that the authors were describing in prose, but I had no intuition on which to base a proper critique of the ideas' usefulness. It took implementing those papers in Python code, writing tests for them, and using abstractions that I had previously written, to come to a place where I felt like the ideas in the papers were a flexibly wieldable tool in my toolkit.

Reinventing the wheel, such that we can learn the wheel, can in fact help us decompose the wheel so that we can do other new things with it. Human creativity is such a wonderful thing!

data science insight data science

There's a quote by John Tukey that has been a recurrent theme at work.

It's better to solve the right problem approximately than to solve the wrong problem exactly.

Continuing on the theme of quoting two Georges:

All models are wrong, but some are more useful than others.

H/T Allen Downey for pointing out that our minds think alike.

I have been working on a modelling effort for colleagues at work. There were two curves involved, and the second depended on the first one.

In both cases, I started with a simple model, and made judgment calls along the way as to whether to continue improving the model, or to stop there because the current iteration of the model was good enough to act on. With the first curve, the first model was actionable for me. With the second curve, the first model I wrote clearly wasn't good enough to be actionable, so I spent many more rounds of iteration on it.

But wait, how does one determine "actionability"?

**For myself**, it has generally meant that I'm confident enough in the results to take the next modelling step. My second curves depended on the first curves, and after double-checking in multiple ways, I judged that the first curve fits, though not perfect, were good enough across a large number of samples for me to move on to the second curves.

**For others**, particularly at my workplace, it generally means a scientist can make a decision about what next experiment to run.

Going through Insight Data Science drilled into us an instinct for developing an MVP for our problem before going on to perfect it. I think that general model works well. My project's final modelling results will be the result of chains of modelling assumptions at every step. Documenting those steps clearly, and then being willing to revisit those assumptions, is always a good thing.

Having used Black for quite a while now, I have a hunch that its popularity amongst projects will continue to grow.

It's one thing to be opinionated about things that matter for a project, but don't matter personally. Like code style. It's another thing to actually build a tool that, with one command, realizes those opinions in (milli)seconds. That's exactly what Black does.

At the end of the day, it was, and still is, a tool that has a very good human API - that of convenience.

By being opinionated about what code *ought* to look like, `black` has very few configurable parameters. Its interface is very simple. *Convenient.*

By automagically formatting *every* Python file in subdirectories (unless configured otherwise), it makes code formatting quick and easy. *Convenient.*

In particular, by being opinionated about conforming to community standards for code style with Python, `black` ensures that formatted code is consistently formatted and thus easy to read. *Convenient!*

Because of this, I highly recommend the use of `black` for code formatting.

```
pip install black
```

bayesian data science statistics

It’s definitely not easy work; anybody trying to tell you that you can "just apply this model and just be done with it" is probably wrong.

Let me clarify: I agree that doing the first half of the statement, "just apply this model", is a good starting point, but I disagree with the latter half, "and just be done with it". I have found that writing and fitting a very naive Bayesian model to the data I have is a very simple thing. But doing the right thing is not. Let’s not be confused: I don’t mean a Naive Bayes model, I mean naively writing down a Bayesian model that is structured very simply with the simplest of priors that you can think of.

Write down the model, including any transformations that you may need on the variables, and then lazily put in a bunch of priors. For example, you might just start with Gaussians everywhere a parameter could take on negative to positive infinity values, or a bounded Half Gaussian if it can only take values above (or below) a certain value. You might assume Gaussian-distributed noise in the output.

Let’s still not be confused: Obviously this would not apply to a beta-bernoulli/binomial model!
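The naive "Gaussians everywhere" starting point described above can be sketched with plain NumPy by drawing from the priors and simulating data. This is only an illustrative sketch: the linear link, the prior scales, and the function name are assumptions, not a model from any of my projects.

```python
import numpy as np


def sample_prior_predictive(x, n_draws=500, seed=0):
    """Simulate outputs from a naively specified linear model.

    Priors (all scales are arbitrary, assumed values):
      intercept, slope ~ Gaussian      (support on the whole real line)
      noise_sd         ~ Half Gaussian (support on positive values only)
    Likelihood: Gaussian noise around intercept + slope * x.
    """
    rng = np.random.default_rng(seed)
    intercept = rng.normal(loc=0.0, scale=10.0, size=n_draws)
    slope = rng.normal(loc=0.0, scale=10.0, size=n_draws)
    # Taking the absolute value of a zero-centred Gaussian gives a Half Gaussian.
    noise_sd = np.abs(rng.normal(loc=0.0, scale=5.0, size=n_draws))
    mu = intercept[:, None] + slope[:, None] * x[None, :]
    return rng.normal(loc=mu, scale=noise_sd[:, None])


x = np.linspace(0.0, 1.0, 20)
simulated = sample_prior_predictive(x)

# Range covered by 95% of simulated outputs under these priors.
lower, upper = np.percentile(simulated, [2.5, 97.5])
```

Comparing `lower` and `upper` against what I already know about the measurement is a cheap first critique: if the simulated range is wildly implausible, the lazy priors need revisiting before any fitting happens.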

Doing the right thing, however, is where the tricky parts come in. To butcher and mash-up two quotes:

All models are wrong, but some are useful (Box), yet some models are more wrong than others (modifying from Orwell).

When doing modeling, a series of questions comes up:

- Do my naive assumptions about "Gaussians everywhere" hold?
- Given that my output data are continuous, is there a better distribution that can describe the likelihood?
- Is there a more principled prior for some of the variables?
- Does my link function, which joins the input data to the output parameters, properly describe their relationship?
- Instead of independent priors per group, would a group prior be justifiable?
- Does my model yield posterior distributions that are within bounds of reasonable ranges, which come from my prior knowledge? If it does not, do I need to bound my priors instead of naively assuming the full support for those distributions?

I am quite sure that this list is non-exhaustive, and probably only covers the bare minimum we have to think about.

Doing these model critiques is not easy. Yet, if we are to work towards truthful and actionable conclusions, it is a necessity. We want to know ground truth, so that we can act on it appropriately.

I have experienced this modeling loop that Mike Betancourt describes (in his Principled Bayesian Workflow notebook) more than once. One involved count data, with a data scientist from TripAdvisor last year at the SciPy conference; another involved estimating cycle time distributions at work, and yet another involved a whole 4-parameter dose-response curve. In each scenario, model fitting and critique took hours at the minimum; I’d also note that with real world data, I didn’t necessarily get to the "win" I was looking for.

With the count data, the TripAdvisor data scientist and I reached a point where, after 5 rounds of tweaking his model, we had a model that fit the data and described a data generating process that closely mimics what we would expect given his process. It took us 5 rounds, and 3 hours of staring at his model and data, to get there!

Yet with cycle time distributions from work, a task ostensibly much easier ("just fit a distribution to the data"), none of my distribution choices, which reflected what I thought would be the data generating process, gave me a "good fit" to the data. I checked by many means: K-S tests, visual inspection, etc. I ended up abandoning the fitting procedure, and used empirical distributions instead.
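For readers unfamiliar with the fallback, here is a rough NumPy-only sketch of what working with an empirical distribution and checking a candidate fit can look like. The cycle times are made-up numbers, the exponential candidate is an arbitrary choice, and the function names are hypothetical; none of this is my work code.

```python
import numpy as np


def ecdf(samples):
    """Sorted values and the fraction of samples at or below each value."""
    x = np.sort(samples)
    p = np.arange(1, len(x) + 1) / len(x)
    return x, p


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF and a candidate CDF."""
    x = np.sort(samples)
    n = len(x)
    f = cdf(x)
    d_plus = np.max(np.arange(1, n + 1) / n - f)
    d_minus = np.max(f - np.arange(0, n) / n)
    return max(d_plus, d_minus)


def sample_empirical(samples, size, seed=0):
    """Draw new values by resampling the observed data with replacement."""
    rng = np.random.default_rng(seed)
    return rng.choice(samples, size=size, replace=True)


# Made-up cycle times (arbitrary units), for illustration only.
cycle_times = np.array([3.1, 4.7, 2.2, 9.8, 5.0, 4.1, 12.3, 3.9])

# Candidate: exponential distribution with the sample mean as its scale.
scale = cycle_times.mean()
distance = ks_statistic(cycle_times, lambda t: 1.0 - np.exp(-t / scale))

values, probs = ecdf(cycle_times)
resampled = sample_empirical(cycle_times, size=1000)
```

Resampling makes no claim about the data generating process, which is exactly the trade-off: it is honest about what was observed, but it cannot extrapolate beyond it.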

With a 4-parameter dose-response curve, it took me 6 hours to go through 6 rounds of modeling to get to a point where I felt comfortable with the model. I started with a simplifying "Gaussians everywhere" assumption. Later, though, I hesitantly put in bounded priors, because I knew some posterior distributions were completely out of range under the naive assumptions of the first model, likely as a result of insufficient range in the concentrations tested. Yet even that model remained unsatisfying: I was stuck with some compounds that didn’t change the output regardless of concentration, and such data are fundamentally very hard to fit with a dose-response curve. Thus, the next afternoon, I modeled the dose-response relationship using a Gaussian Process instead. Neither model is completely satisfying to the degree that the count data model was, but both the GP and the dose-response curve are and will be roughly correct modeling choices (with the GP probably being more flexible), and importantly, both are actionable by the experimentalists.
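To show why flat compounds are so awkward, here is a sketch of one common parameterization of a four-parameter dose-response curve. I am assuming the standard four-parameter logistic (Hill) form and made-up parameter values; the exact form used at work may differ.

```python
import numpy as np


def four_param_logistic(conc, bottom, top, ec50, hill):
    """Response at positive concentration `conc`.

    bottom: response as concentration goes to zero
    top:    response as concentration goes to infinity
    ec50:   concentration giving a response halfway between bottom and top
    hill:   steepness of the transition
    """
    return bottom + (top - bottom) / (1.0 + (ec50 / conc) ** hill)


# Made-up concentrations spanning several orders of magnitude.
conc = np.logspace(-3, 2, 11)

# A responsive compound: the curve moves from bottom to top.
responsive = four_param_logistic(conc, bottom=0.0, top=100.0, ec50=1.0, hill=1.5)

# A flat compound: top equals bottom, so ec50 and hill have no effect at all.
flat_a = four_param_logistic(conc, bottom=20.0, top=20.0, ec50=0.01, hill=1.0)
flat_b = four_param_logistic(conc, bottom=20.0, top=20.0, ec50=50.0, hill=4.0)
```

`flat_a` and `flat_b` are identical even though their `ec50` and `hill` values are very different. With flat data, the likelihood carries no information about those two parameters, so their posteriors are driven almost entirely by the priors.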

As you probably can see, whenever we either (1) don’t know ground truth, and/or (2) have
messy, real world data that don’t fit idealized assumptions about the data generating
process, **getting the model "right" is a very hard thing to do!** Moreover, data are
insufficient on their own to critique the model; we will always need to bring in prior
knowledge. Much as all probability is conditional probability (Venn), all modeling involves
prior knowledge. Sometimes it comes up in non-modellable ways, though as far as possible,
it’s a good exercise to try incorporating that into the model definition.

Even with that said, I’m still a fan of canned models, such as those provided by
`pymc-learn` and `scikit-learn`, provided we recognize their "canned" nature and are
equipped to critique and modify said models. Yes, they provide easy, convenient baselines
that we can get started with. We can "just apply this model". But we can’t "just be done
with it": the hard part of getting the model right takes much longer and much more hard
work. *Veritas!*
