It's been about two weeks since SciPy 2018 ended, and I've finally found some breathing room to write about it.

SciPy 2018 was the fourth year I've made it to the conference; my first was SciPy 2015 (not 2014, as I had originally misremembered). The conference has grown over the years I've attended!

This year, I served again as the Financial Aid Co-Chair with Scott Collis and Celia Cintas. It brings me joy to be able to help bring others to the conference, much as I was helped a few years back when I was still a graduate student.

Building on last year's application process, in which we implemented blinded reviews, this year we made the review process much less tedious and more user-friendly for the reviewers: myself, Celia, Scott, and our two committee reviewers, Kasia and Patrick.

The review process can always be improved; we still have some work to do. One improvement would be making the application less intimidating for under-represented individuals. Another might be reworking how we quantify the quality of our selections: rather than some aggregate "total" score, we might optimize for a breadth of worthy contributions to the community. Finally, we definitely want to keep our focus on FinAid's mission: bringing deserving newcomers to the conference who otherwise might not have the resources!

I did two tutorials, one with Hugo and one with Mridul. The one with Mridul was on Network Analysis, and the one with Hugo was on Bayesian statistics. Over the years, I've developed muscle memory on Network Analysis, so it felt very natural to me. Bayesian statistics and probabilistic programming were newer topics for Hugo and me, so I spent proportionally more time preparing for that tutorial.

What was pleasantly surprising for me was that Bayesian statistics was gaining a ton of popularity, and this tutorial just happened to be there at the right time. I had a lot of one-on-one chats with tutorial participants after the tutorial and during the conference days, where we talked about the application of Bayesian methods to problems that they had encountered. There's a lot to do until people can generally communicate about data problems using Bayesian methods, but I think we're at an upwards-inflection point right now!

I missed the talks mostly because I was doing the Tejas Room track (sit in the Tejas room and chat with people). I nonetheless had a very fruitful and fun time doing so!

To make up for lost time, I put together a playlist of things I'd like to catch up on later.

For the first time ever, I stayed on to sprint! However, I also caught a conference bug, so I was basically knocked out for the second day of sprints. For this year's sprints, I implemented a declarative interface for geographic graph visualizations in nxviz, where node placement is prioritized according to geographic information. The intent here isn't to replace geospatial analysis packages, but rather to provide a quick, `seaborn`-like view into a graph's geographic structure. Once a user has a feel for the data, they can use the graph as-is if nothing more is needed; otherwise, they can move on to a different package.

*Did you enjoy this blog post? Let's discuss more!*

Tags: bayesian statistics, data science

Over the past year, having learned about Bayesian inference methods, I finally see how estimation, group comparison, and model checking build upon each other into this really elegant framework for data analysis.

The foundation of this is "estimating a parameter". In a typical situation, we are most concerned with the parameter of interest. It could be a population mean, or a population variance. If there's a mathematical function that links the input variables to the output (a.k.a. "link function"), then the parameters of the model are that function's parameters. The key point here is that the atomic activity of Bayesian analysis is the estimation of a parameter, and its associated uncertainty.
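As a minimal sketch of this atomic activity (with made-up coin-flip data, and a simple grid approximation standing in for a full probabilistic-programming workflow), here's estimating one parameter together with its uncertainty:

```python
import numpy as np
from scipy import stats

# Hypothetical data: 30 coin flips, 21 heads.
n, heads = 30, 21

# Grid approximation of the posterior over p, with a flat prior.
p_grid = np.linspace(0.001, 0.999, 1000)
likelihood = stats.binom.pmf(heads, n, p_grid)
posterior = likelihood / likelihood.sum()

# Posterior mean, and a crude central 95% credible interval from the CDF.
post_mean = np.sum(p_grid * posterior)
cdf = np.cumsum(posterior)
lower = p_grid[np.searchsorted(cdf, 0.025)]
upper = p_grid[np.searchsorted(cdf, 0.975)]
```

The output isn't a single number but a whole distribution over the parameter; the interval `(lower, upper)` is the uncertainty that travels with the estimate.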

Building on that, we can then estimate parameters for more than one group of things. As a first pass, we could assume that each of the groups are unrelated, and thus "independently" (I'm trying to avoid overloading this term) estimate parameters per group under this assumption. Alternatively, we could assume that the groups are related to one another, and thus use a "hierarchical" model to estimate parameters for each group.

Once we've done that, what's left is the comparison of parameters between the groups. The simplest activity is to compare the posterior distributions' 95% highest posterior densities and check whether they overlap. Usually this is done for the mean, or for regression parameters, but the variance might be important to check as well.
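A minimal sketch of that comparison step (with made-up success/trial counts, conjugate Beta posteriors, and central credible intervals standing in for HPD intervals):

```python
import numpy as np
from scipy import stats

# Hypothetical data: (successes, trials) for two groups.
data = {"control": (12, 50), "treatment": (30, 50)}

intervals = {}
for group, (successes, trials) in data.items():
    # Beta(1, 1) prior -> Beta(successes + 1, failures + 1) posterior.
    posterior = stats.beta(successes + 1, trials - successes + 1)
    # Central 95% credible interval, standing in for the HPD interval.
    intervals[group] = posterior.ppf([0.025, 0.975])

def overlaps(a, b):
    """Do two (lower, upper) intervals overlap?"""
    return a[0] <= b[1] and b[0] <= a[1]
```

If the two intervals don't overlap, that's quick (if crude) evidence that the groups differ; in practice, tools such as PyMC and ArviZ compute proper HPD intervals from posterior samples.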

Rounding off this framework is model checking: how do we test that the model is a good one? The bare minimum that we should do is simulate data from the model - it should generate samples whose distribution looks like the actual data itself. If it doesn't, then we have a problem, and need to go back and rework the model structure until it does.
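A minimal sketch of such a posterior predictive check, with simulated Poisson count data and faked posterior samples (in a real analysis, the rate samples would come from the fitted model):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed data: daily counts we modeled as Poisson.
observed = rng.poisson(lam=4.0, size=200)

# Stand-in for posterior samples of the Poisson rate from a fitted
# model; here they're faked as draws concentrated near the data mean.
rate_samples = rng.normal(loc=observed.mean(), scale=0.1, size=1000)

# Posterior predictive: for each posterior draw, simulate a dataset
# and record a summary statistic (here, the mean).
simulated_means = np.array([
    rng.poisson(lam=rate, size=observed.size).mean()
    for rate in rate_samples
])

# The observed mean should fall comfortably inside the simulated
# distribution of means; if it sits in the extreme tails, the model
# is missing something and needs reworking.
pvalue = (simulated_means >= observed.mean()).mean()
```

The same recipe works with any summary statistic: swap the mean for the variance, the maximum, or the fraction of zeros, depending on which aspect of the data the model must capture.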

Could it be that we have a model that only fits the data on hand (overfitting)? Potentially so - and if this is the case, then our best check is to have an "out-of-sample" group. While a train/test/validation split might be useful, the truest test of a model is new data that has been collected.

These three major steps in Bayesian data analysis workflows did not come together until recently; they each seemed disconnected from the others. Perhaps this was just an artefact of how I was learning them. However, I think I've finally come to a unified realization: Estimation is necessary before we can do comparison, and model checking helps us build confidence in the estimation and comparison procedures that we use.

When doing Bayesian data analysis, the key steps that we're performing are:

- Estimation
- Comparison
- Model Checking


Tags: statistics, visualization, data science

In my two SciPy 2018 co-taught tutorials, I made the case that ECDFs provide richer information compared to histograms. My main points were:

- We can more easily identify central tendency measures, in particular, the median, compared to a histogram.
- We can much more easily identify other percentile values, compared to a histogram.
- We become less susceptible to outliers arising from binning issues.
- It is more difficult to hide multiple modes.
- We can easily identify repeat values.

What are ECDFs? ECDF stands for the "empirical cumulative distribution function". An ECDF maps every data point in the dataset to a quantile: a number between 0 and 1 that indicates the cumulative fraction of data points less than or equal to that data point.

To illustrate, let's take a look at the following plots.

```
import numpy as np
import matplotlib.pyplot as plt

# Generate a mixture of two normal distributions,
# but with very few data points.
np.random.seed(3)
mx1 = np.random.normal(loc=0, scale=1, size=20)
mx2 = np.random.normal(loc=2, scale=1, size=20)
mx = np.concatenate([mx1, mx2, [5], [-4]])  # one high and one low outlier

def ecdf(data):
    x, y = np.sort(data), np.arange(1, len(data) + 1) / len(data)
    return x, y

fig = plt.figure(figsize=(8, 4))
ax_ecdf = fig.add_subplot(121)
ax_hist = fig.add_subplot(122)
ax_hist.set_title('histogram')
ax_hist.hist(mx)
x, y = ecdf(mx)
ax_ecdf.scatter(x, y)
ax_ecdf.set_title('ecdf')
```

Let's compare the ECDF and the histogram for this data.

**Is the central tendency measure easily discoverable?** On the histogram, we might say there's a peak at just above zero on the x-axis, but is that the mode, the median, or the mean? And what is its exact value? On the ECDF, at least the median is easily discoverable: draw a horizontal line from 0.5 on the y-axis until it crosses a data point, then drop a line down to the x-axis to read off the median value.
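That graphical recipe for reading off the median translates directly into code (a sketch using freshly generated data, not the mixture above):

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(loc=1.0, scale=2.0, size=101)

# ECDF coordinates: sorted values vs. cumulative fractions.
x, y = np.sort(data), np.arange(1, len(data) + 1) / len(data)

# "Draw a horizontal line at 0.5 and drop down to the x-axis":
# the first sorted value whose cumulative fraction reaches 0.5.
median_from_ecdf = x[np.searchsorted(y, 0.5)]
```

For an odd number of points, this recovers exactly the same value as `np.median(data)`.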

**Are percentiles easily discoverable?** It's much clearer that the answer is "yes" for the ECDF, and "no" for the histogram.

**What is the value of the potential outlier?** It's difficult to tell on the histogram: it could be anywhere from 4 to 5 for the high outlier, and maybe -3 to -4 for the low one. On the ECDF, just drop a line down from each suspected outlier to the x-axis to read off its value.

**Is this a mixture distribution or is this a single Normal distribution?** If you looked at the histogram, you might be tempted to think that the data are normally distributed with mean 0.5 and standard deviation about 2. However, if you look at the ECDF, it's clear that there are multiple modes, as shown by two or three sigmoidal-like curves. This should give us pause to see if there's a mixture distribution at play here.

**Are there repeat values?** You can't tell from a histogram. On the ECDF scatterplot, however, it's clear that there are no repeat values -- they would show up as vertical stacks of dots. (Repeat values might be important when working with, say, a zero- or X-inflated distribution.)

I hope this post showed you why ECDFs contain richer information than histograms. They're taught less commonly than histograms, so people will have a harder time interpreting them at first glance. However, a bit of guidance and orientation will bring out the rich information on the ECDFs.

I credit Justin Bois (Caltech) for teaching me about ECDFs, and Hugo Bowne-Anderson (DataCamp) for reinforcing the idea.


Tags: git, version control, code snippets

I learned a new thing this weekend: we apparently can apply a patch onto a branch/fork using `git apply [patchfile]`.

There are a few things to unpack here. First off, what's a `patchfile`?

The long story cut short is that a `patchfile` is nothing more than a plain text file that contains all the information about the `diff` between one commit and another. If you've ever used the `git diff` command, you'll know that it outputs a `diff` between the current state of the repository and the last committed state. Let's take a look at an example.

Say we have a file called `my_file.txt`. In a real-world example, this would be parallel to, say, a `.py` module that you've written. After a bunch of commits, I have a directory structure that looks like this:

```
$ ls -lah
total 8
drwxr-xr-x   4 ericmjl  staff   128B Jun 17 10:26 ./
drwx------@ 19 ericmjl  staff   608B Jun 17 10:26 ../
drwxr-xr-x  12 ericmjl  staff   384B Jun 17 10:27 .git/
-rw-r--r--   1 ericmjl  staff    68B Jun 17 10:26 my_file.txt
```

The contents of `my_file.txt` are as follows:

```
$ cat my_file.txt
Hello! This is a text file.
I have some text written inside here.
```

Now, let's say I edit the text file by adding a new line and removing one line.

```
$ cat my_file.txt
Hello! This is a text file.
This is a new line!
```

If I look at the "diff" between the current state of the file and the previously committed state:

```
$ git diff my_file.txt
diff --git a/my_file.txt b/my_file.txt
index a594a37..d8602e1 100644
--- a/my_file.txt
+++ b/my_file.txt
@@ -1,4 +1,4 @@
 Hello! This is a text file.
-I have some text written inside here.
+This is a new line!
```

While this may look intimidating at first, the key things to look at are the `+` and `-` markers: a `+` signals the addition of a line, and a `-` signals the removal of a line.

Turns out, I can export this as a file.

```
$ git diff my_file.txt > /tmp/patch1.txt
$ cat /tmp/patch1.txt
diff --git a/my_file.txt b/my_file.txt
index a594a37..d8602e1 100644
--- a/my_file.txt
+++ b/my_file.txt
@@ -1,4 +1,4 @@
 Hello! This is a text file.
-I have some text written inside here.
+This is a new line!
```

Now, let's simulate the scenario where I accidentally discarded those changes in the repository. A real-world analogue happened to me while contributing to CuPy: I had a really weird commit history, and couldn't remember how to rebase, so I exported the patch from my GitHub pull request (more on this later) and applied it following the same conceptual steps below.

```
$ git checkout -- my_file.txt
```

Now, the repository is in a "cleaned" state -- there are no changes made:

```
$ git status
On branch master
nothing to commit, working tree clean
```

Since I have saved the diff as a file, I can apply it onto my project:

```
$ git apply /tmp/patch1.txt
$ git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   my_file.txt

no changes added to commit (use "git add" and/or "git commit -a")
```

Looking at the diff again, I've recovered the changes that were lost!

```
$ git diff
diff --git a/my_file.txt b/my_file.txt
index a594a37..d8602e1 100644
--- a/my_file.txt
+++ b/my_file.txt
@@ -1,4 +1,4 @@
 Hello! This is a text file.
-I have some text written inside here.
+This is a new line!
```

Don't forget to commit and push!

I mentioned earlier that I had exported the patch file from GitHub and then applied it on a re-forked repository. How does one do that? It's not as hard as you think.

Here are the commands, with comments.

```
# Download the patch from the pull request URL.
# Replace the curly-braced elements with the appropriate names.
# Export it to /tmp/patch.txt.
$ wget https://github.com/{repo_owner}/{repo}/pull/{pr_number}.patch -O /tmp/patch.txt

# Now, apply the patch to your project.
$ git apply /tmp/patch.txt
```


Tags: data science, machine learning, deep learning, causal inference, graph theory, probability

It took reading Judea Pearl's "The Book of Why", and Jonas Peters' mini-course on causality, for me to finally figure out why I had this lingering dissatisfaction with modern machine learning. It's because modern machine learning (deep learning included) is most commonly used as a tool in the service of finding correlations, and is not concerned with understanding systems.

Perhaps this is why Pearl writes of modern ML as basically being "curve fitting". I tend to believe he didn't write those words in a dismissive way, though I might be wrong about it. Regardless, I think there is an element of truth to that statement.

Linear models seek a linear combination of correlations between input variables and their targets. Tree-based models essentially seek combinations of splits in the data, while deep learning models are just stacked compositions of linear models with nonlinear functions applied to their outputs. As Keras author Francois Chollet wrote, deep learning can be thought of as basically geometric transforms of data from one data manifold to another.

(For convenience, I've divided the ML world into linear models, tree-based models, and deep learning models. Ensembles, like Random Forest, are just that: ensembles composed of these basic models.)

Granted, curve fitting is actually very useful: deep learning on images has found pragmatic use in image search, digital pathology, self-driving cars, and more. Yet in none of these models is the notion of causality important. This is where these models are dissatisfying: they do not provide the tools to help us interrogate such questions in a structured fashion. I think it's reasonable to say that these models are essentially concerned with conditional probabilities. As Ferenc Huszár has written, conditional probabilities are different from interventional probabilities (ok, I mutilated that term).

Humans are innately wired to recognize and ask questions about causality; consider it part of our innate makeup. That is, of course, unless that has been drilled out of our minds by our life experiences. (I know of a person who insists that causes do not exist. An extreme Hume-ist, I guess? As I'm not a student of philosophy much, I'm happy to be corrected on this point.) As such, I believe that part of being human involves asking the question, "Why?" (and its natural extension, "How?"). Yet, modern ML is still stuck at the question of, "What?"

To get at why and how, we test our understanding of a system by perturbing it (i.e. intervening in it), or asking about "what if" scenarios (i.e. thinking about counterfactuals). In the real world of biological research (which I'm embedded in), we call this "experimentation". Inherent in a causal view of the world is a causal model. In causal inference, these things are structured and expressed mathematically, and we are given formal procedures for describing an intervention and thinking about counterfactual scenarios. From what I've just learned (baby steps at the moment), these are the basic ingredients, and their mathematical mappings:

- Causal model: a directed acyclic graph
- Variables: nodes in the graph
- Relationships: the structural causal model's equations (mathematical transforms of incoming variables, with a noise distribution added on top, embedded in each node)
- Interventions: removal of edges in the graph ("do-calculus")
- Counterfactuals: fix the causal model's noise terms based on an observation, then perform do-calculus.
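To make these mappings concrete, here's a toy structural causal model (entirely made up for illustration): X causes Y, and performing the intervention do(X = 3) amounts to replacing X's structural equation with a constant before re-running the model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Structural causal model:  X := N_x ;  Y := 2 * X + N_y
def sample_observational(n):
    x = rng.normal(size=n)           # X's structural equation
    y = 2 * x + rng.normal(size=n)   # Y's structural equation
    return x, y

# Intervention do(X = 3): sever X from its own noise (remove the
# incoming mechanism) by pinning it to a constant, then re-run.
def sample_do_x(n, value):
    x = np.full(n, value)
    y = 2 * x + rng.normal(size=n)
    return x, y

x_obs, y_obs = sample_observational(n)
x_do, y_do = sample_do_x(n, 3.0)
```

In this confounder-free toy model, conditioning on X = 3 and intervening with do(X = 3) give the same answer for Y; the two diverge precisely when the graph contains confounders, which is where the do-calculus earns its keep.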

Having just learned this, I think there's a way out of this latent dissatisfaction that I have with modern ML. A neat thing about ML methods is that we can use them as tools to help us better identify the important latent factors buried inside our (observational) data, which we can use to construct a better model of our data generating process. Better yet, we can express the model in a structured and formal sense, which would expose our assumptions more explicitly for critique and reasoning. Conditioned on that, perhaps we may be able to write better causal models of the world!

