Git Tip: Apply a Patch

written by Eric J. Ma on 2018-06-17

I learned a new thing this weekend: we apparently can apply a patch onto a branch/fork using git apply [patchfile].

There's a few things to unpack here. First off, what's a patchfile?

Long story short, a patchfile is nothing more than a plain text file that contains all of the information about the diff between one commit and another. If you've ever used the git diff command, you'll know that it outputs the diff between the current state of a repository and the last committed state. Let's take a look at an example.

Say we have a file called my_file.txt. In a real-world project, this would be, say, a .py module that you've written. After a bunch of commits, I have a directory structure that looks like this:

$ ls -lahF
total 8
drwxr-xr-x   4 ericmjl  staff   128B Jun 17 10:26 ./
drwx------@ 19 ericmjl  staff   608B Jun 17 10:26 ../
drwxr-xr-x  12 ericmjl  staff   384B Jun 17 10:27 .git/
-rw-r--r--   1 ericmjl  staff    68B Jun 17 10:26 my_file.txt

The contents of my_file.txt are as follows:

$ cat my_file.txt
Hello! This is a text file.

I have some text written inside here.

Now, let's say I edit the text file by adding a new line and removing one line.

$ cat my_file.txt
Hello! This is a text file.

This is a new line!

If I look at the "diff" between the current state of the file and the previously committed state of the file, here's what I get:

$ git diff my_file.txt
diff --git a/my_file.txt b/my_file.txt
index a594a37..d8602e1 100644
--- a/my_file.txt
+++ b/my_file.txt
@@ -1,4 +1,4 @@
 Hello! This is a text file.

-I have some text written inside here.
+This is a new line!

While this may look intimidating at first, the key thing to look at is the lines prefixed with + and -. A + signals the addition of a line, and a - signals the removal of a line.

Turns out, I can export this as a file.

$ git diff my_file.txt > /tmp/patch1.txt
$ cat /tmp/patch1.txt
diff --git a/my_file.txt b/my_file.txt
index a594a37..d8602e1 100644
--- a/my_file.txt
+++ b/my_file.txt
@@ -1,4 +1,4 @@
 Hello! This is a text file.

-I have some text written inside here.
+This is a new line!

Now, let's simulate the scenario where I accidentally discarded those changes in the repository. A real-world analogue happened to me while contributing to CuPy: my commit history had gotten into a really weird state, and I couldn't remember how to rebase, so I exported the patch from my GitHub pull request (more on this later) and applied it to a fresh fork, following the same conceptual steps as below.

$ git checkout -- my_file.txt

Now, the repository is in a "clean" state, with no uncommitted changes:

$ git status
On branch master
nothing to commit, working tree clean

Since I have saved the diff as a file, I can apply it onto my project:

$ git apply /tmp/patch1.txt
$ git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   my_file.txt

no changes added to commit (use "git add" and/or "git commit -a")

Looking at the diff again, I've recovered the changes that were lost!

$ git diff
diff --git a/my_file.txt b/my_file.txt
index a594a37..d8602e1 100644
--- a/my_file.txt
+++ b/my_file.txt
@@ -1,4 +1,4 @@
 Hello! This is a text file.

-I have some text written inside here.
+This is a new line!

Don't forget to commit and push!
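As an aside, git apply has a few flags that are worth knowing about before you apply a patch for real. A quick sketch, using the same /tmp/patch1.txt from above:

# Summarize which files the patch touches, without applying it.
$ git apply --stat /tmp/patch1.txt

# Dry run: check whether the patch applies cleanly.
$ git apply --check /tmp/patch1.txt

# Applied a patch by mistake? Reverse it.
$ git apply -R /tmp/patch1.txt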

How to export a patch from GitHub

I mentioned earlier that I had exported the patch file from GitHub and then applied it on a re-forked repository. How does one do that? It's not as hard as you think.

Here are the commands, with comments.

# Download the patch from the pull request URL.
# Replace curly-braced elements with the appropriate names.
# Export it to /tmp/patch.txt.
$ wget https://github.com/{repo_owner}/{repo}/pull/{pr_number}.patch -O /tmp/patch.txt

# Now, apply the patch to your project
$ git apply /tmp/patch.txt
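A note on this: the .patch file that GitHub serves is in mailbox format, with one entry per commit on the pull request, including commit messages and authorship (swapping .patch for .diff in the URL gives a single plain diff instead). git apply only modifies the working tree; if you'd rather replay the pull request's commits with their metadata intact, git am does that:

# Alternative: recreate the PR's commits, preserving messages and authors.
$ git am /tmp/patch.txt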

Did you enjoy this blog post? Let's discuss more!


My Latent Dissatisfaction with Modern ML

written by Eric J. Ma on 2018-06-05

It took reading Judea Pearl's "The Book of Why", and Jonas Peters' mini-course on causality, for me to finally figure out why I had this lingering dissatisfaction with modern machine learning. It's because modern machine learning (deep learning included) is most commonly used as a tool in the service of finding correlations, and is not concerned with understanding systems.

Perhaps this is why Pearl writes of modern ML as basically being "curve fitting". I tend to believe he didn't write those words in a dismissive way, though I might be wrong about it. Regardless, I think there is an element of truth to that statement.

Linear models seek a linear combination of correlations between input variables and their targets. Tree-based models essentially seek combinations of splits in the data, while deep learning models are just stacked compositions of linear models with nonlinear functions applied to their outputs. As Keras author François Chollet wrote, deep learning can be thought of as basically geometric transforms of data from one data manifold to another.

(For convenience, I've divided the ML world into linear models, tree-based models, and deep learning models. Ensembles, like Random Forest, are just that: ensembles composed of these basic models.)

Granted, curve fitting is actually very useful: image-based deep learning has found pragmatic uses in image search, digital pathology, self-driving cars, and more. Yet, in none of these models is the notion of causality important. This is where these models are dissatisfying: they do not provide the tools to help us interrogate causal questions in a structured fashion. I think it's reasonable to say that these models are essentially concerned with conditional probabilities. As Ferenc Huszár has written, conditional probabilities are different from interventional probabilities (ok, I mutilated that term).

Humans are wired to recognize and ask questions about causality; consider it part of our innate makeup. That is, of course, unless it has been drilled out of our minds by our life experiences. (I know of a person who insists that causes do not exist. An extreme Hume-ist, I guess? As I'm not much of a student of philosophy, I'm happy to be corrected on this point.) As such, I believe that part of being human involves asking the question, "Why?" (and its natural extension, "How?"). Yet, modern ML is still stuck at the question of, "What?"

To get at why and how, we test our understanding of a system by perturbing it (i.e. intervening in it), or asking about "what if" scenarios (i.e. thinking about counterfactuals). In the real world of biological research (which I'm embedded in), we call this "experimentation". Inherent in a causal view of the world is a causal model. In causal inference, these things are structured and expressed mathematically, and we are given formal procedures for describing an intervention and thinking about counterfactual scenarios. From what I've just learned (baby steps at the moment), these are the basic ingredients and their mathematical mappings: a causal model maps onto a graph of variables and the structural equations relating them; an intervention maps onto Pearl's do-operator, so that the interventional P(y | do(x)) is distinct from the observational P(y | x); and a counterfactual maps onto a query against a modified version of the structural model.
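To make the conditional-versus-interventional distinction concrete, here is a toy simulation of my own devising (it is not from the book or the lectures, and the structural model is entirely made up): a common cause z drives both x and y, and we compare conditioning on x with intervening on x.

import numpy as np

# A made-up structural causal model:
#   z ~ Normal(0, 1)        # a common cause
#   x = z + small noise     # x listens to z
#   y = 2*x + 3*z + noise   # y listens to both x and z
rng = np.random.default_rng(42)
n = 1_000_000

z = rng.normal(0.0, 1.0, n)
x = z + rng.normal(0.0, 0.1, n)
y = 2 * x + 3 * z + rng.normal(0.0, 0.1, n)

# Conditioning: average y over samples where x happens to land near 1.
# Selecting on x also implicitly selects on z, because z drives x.
e_y_given_x = y[np.abs(x - 1.0) < 0.05].mean()

# Intervening: reach in and *set* x = 1 everywhere, leaving z untouched.
# This severs the z -> x link, which conditioning cannot do.
y_do = 2 * np.ones(n) + 3 * z + rng.normal(0.0, 0.1, n)

print(f"E[y | x = 1]     ~ {e_y_given_x:.2f}")   # roughly 5
print(f"E[y | do(x = 1)] ~ {y_do.mean():.2f}")   # roughly 2

The two answers differ because conditioning on x drags information about z along with it, while the intervention breaks that link. That gap is precisely the difference between "What?" and "Why?" that I am complaining about.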

Having just learned this, I think there's a way out of the latent dissatisfaction that I have with modern ML. A neat thing about ML methods is that we can use them as tools to help us identify the important latent factors buried inside our (observational) data, which we can then use to construct a better model of our data-generating process. Better yet, we can express that model in a structured and formal way, which exposes our assumptions more explicitly for critique and reasoning. Conditioned on that, perhaps we may be able to write better causal models of the world!

Did you enjoy this blog post? Let's discuss more!


Causal Modelling

written by Eric J. Ma on 2018-05-26

Finally, I have finished Judea Pearl's latest work, "The Book of Why"! Having read it, I have come to appreciate how much work had to go into formalizing the very intuitions we have for causal reasoning into, essentially, a modelling language.

"The Book of Why" is geared towards the layman reader. Thus, unlike a textbook, it does not contain "simplest complex examples" that a reader can walk through and do calculations by hand (or through simulation). Thankfully, there is a lecture series by Jonas Peters, organized by the Broad Institute and held at MIT, that are available freely online.

After viewing just the first of the four lectures, I am thoroughly enjoying Jonas' explanations of the core ideas in causal modelling. Indeed, Jonas is a very talented lecturer! He builds up the ideas from simple examples, finally culminating in a "simple complex example" that we can simulate on a computer. Having freshly read "The Book of Why" also helps immensely; it's clear to me that people in the world of causal modelling share the same talking points. For those interested in learning more about causal modelling, I highly recommend both the book and the lecture series!

Did you enjoy this blog post? Let's discuss more!


Model Baselines Are Important

written by Eric J. Ma on 2018-05-06

For any problem that we think is machine learnable, having a sane baseline model is really important. It is even more important to establish one early.

Today at ODSC, I had a chance to meet both Andreas Mueller and Randy Olson. Andreas leads scikit-learn development, while Randy was the lead developer of TPOT, an AutoML tool. To both of them, I told a variation of the following story:

I had spent about 1.5 months building and testing a graph convolutional neural network model to predict RNA cleavage by an enzyme. I was suffering from a generalization problem: this model class would never generalize beyond the training samples for the problem at hand, even though I had seen the same model class perform admirably well for small molecules and proteins.

Together with an engineer at NIBR, we brainstormed a baseline with some simple features and threw a random forest model at it. Three minutes after implementing everything, we had a model that generalized and outperformed my implementation of graph CNNs. Three days later, we had an AutoML (TPOT) model that beat the random forest. After further discussion, we realized that the work we had done was sufficiently publishable even without the fancy graph CNNs.
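For flavor, here's a minimal sketch of what that kind of baseline looks like; the features and the regression target below are made-up stand-ins, not the real enzyme- and RNA-specific features we used:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Stand-ins for simple, hand-crafted features and measured cleavage values.
X = np.random.rand(500, 8)
y = np.random.rand(500)

baseline = RandomForestRegressor(n_estimators=100, random_state=42)
scores = cross_val_score(baseline, X, y, cv=5)
print(f"Baseline R^2: {scores.mean():.3f} +/- {scores.std():.3f}")

The point is not this particular model; it is that a baseline like this takes minutes to stand up and immediately tells you whether the fancy model is earning its complexity.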

I think there’s a lesson in establishing baselines and MVPs early on!

Did you enjoy this blog post? Let's discuss more!


Consolidate your scripts using click

written by Eric J. Ma on 2018-03-30

Overview

click is amazing! It's a Python package that allows us to easily add a command-line interface (CLI) to our Python scripts. This is a data scientist-oriented post on how we can use click to build useful tools for ourselves; in particular, I want to focus on how we can better organize our scripts.

I sometimes find myself writing custom scripts to deal with custom data transforms. Refactoring them into a library of modular functions can really help with maintenance. However, I still end up with multiple scripts that have no natural, logical organization... except for the fact that they are scripts I run from time to time! Rather than leave them scattered in multiple places, why not put them together into a single .py file, with commands callable from the command line?

Template

Here's a template for organizing all those messy scripts using click.

#!/usr/bin/env python
import click


@click.group()
def main():
    pass


@main.command()
def script1():
    """Makes stuff happen."""
    # do stuff that was originally in script 1
    click.echo('script 1 was run!')  # click.echo is recommended by the click authors.


@main.command()
def script2():
    """Makes more stuff happen."""
    # do stuff that was originally in script 2.
    print('script 2 was run!')  # we can use print instead of click.echo as well!


if __name__ == '__main__':
    main()  # this must match the name of the group function defined above.

How to use

Let's call this new meta-script jobs.py, and make it executable. (The #!/usr/bin/env python line at the top of the template is what allows us to run it directly as ./jobs.py.)

$ chmod +x jobs.py

Now, when we execute it at the command line, we get a help page for free:

$ ./jobs.py --help
Usage: jobs.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  script1  Makes stuff happen.
  script2  Makes more stuff happen.

We can also use just one script with varying commands to control the execution of what was originally two different .py files.

$ ./jobs.py script1
script 1 was run!
$ ./jobs.py script2
script 2 was run!

Instead of versioning multiple .py files, we now only have to keep track of one file, where all of the custom stuff goes!

Details

Here's what's going on under the hood.

With the decorator @click.group(), we expose the main() function at the command line as a "group" of commands. Under the hood, @click.group() wraps main() in a click Group object; that group's command() method is itself a decorator, which is why we can use @main.command() to register script1 and script2 as subcommands of the group.
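From here, the rest of click's machinery hangs off the same pattern. For instance (a hypothetical script3, not part of the template above), @click.option adds command-line options to a subcommand:

@main.command()
@click.option('--n-times', default=1, help='Number of times to run.')
def script3(n_times):
    """Makes stuff happen repeatedly."""
    # click maps the --n-times option onto the n_times argument.
    for _ in range(n_times):
        click.echo('script 3 was run!')

This would then be callable as ./jobs.py script3 --n-times 3.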

Recap

With click, a pile of loose scripts can be consolidated into a single, version-controlled command-line tool: one group function, one @main.command() per script, and a help page for free.

Did you enjoy this blog post? Let's discuss more!