Joint, conditional, and marginal probability distributions

written by Eric J. Ma on 2018-08-07

statistics probability bayesian data science

Joint probability, conditional probability, and marginal probability... These are three central terms when learning about probability, and they show up in Bayesian statistics as well. However... I never really could remember what they were, especially since we were usually taught them using formulas, rather than pictures.

Well, for those who learn a bit better using pictures... if you know what a probability distribution is, then hopefully these help with remembering what these terms mean. (Clicking on the image will bring you to the original, hosted on GitHub.)
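For those who find code easier to remember than formulas, here is a minimal NumPy sketch of the same three terms. The numbers in the table and the variable names are made up purely for illustration; any non-negative table that sums to 1 would do.

```python
import numpy as np

# Hypothetical joint distribution P(X, Y): rows index X, columns index Y.
joint = np.array([
    [0.1, 0.2],
    [0.3, 0.4],
])
assert np.isclose(joint.sum(), 1.0)  # a joint distribution sums to 1

# Marginal: sum the joint over the variable we are not interested in.
marginal_x = joint.sum(axis=1)  # P(X), summing out Y
marginal_y = joint.sum(axis=0)  # P(Y), summing out X

# Conditional: take one slice of the joint and renormalize it.
# P(Y | X = 0) = P(X = 0, Y) / P(X = 0)
cond_y_given_x0 = joint[0] / marginal_x[0]
```

In words: the joint is the whole table, a marginal is a row or column total, and a conditional is a single row (or column) rescaled so that it sums to 1.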

Did you enjoy this blog post? Let's discuss more!

d-separation in causal inference

written by Eric J. Ma on 2018-08-06

causal inference bayesian data science

Yesterday evening, I had an empty block of time during which I finally did a worked example of finding whether two nodes are "d-separated" in a causal graph. It was pretty instructive to implement the algorithm. It also reminded me yet again: there's this weird thing about me where I need programming to learn math!

Anyways, if you're interested in seeing the implementation, it's available on GitHub.
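For readers who would rather see the idea in code right here, below is a minimal sketch of one standard way to test d-separation. It is not the implementation linked above: it uses the moralized ancestral graph criterion, and the function names and the edge-list input format are assumptions made for illustration. It also assumes the two query nodes are not themselves in the conditioning set.

```python
from collections import deque


def ancestors_of(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def d_separated(edges, x, y, given=()):
    """Check whether x and y are d-separated given `given` in a DAG.

    `edges` is an iterable of (parent, child) pairs.
    """
    given = set(given)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)

    # Step 1: keep only x, y, the conditioning set, and their ancestors.
    keep = ancestors_of(parents, {x, y} | given)

    # Step 2: moralize, i.e. drop edge directions and "marry" co-parents.
    undirected = {node: set() for node in keep}
    for child in keep:
        child_parents = sorted(parents.get(child, ()), key=repr)
        for p in child_parents:
            undirected[p].add(child)
            undirected[child].add(p)
        for i, a in enumerate(child_parents):
            for b in child_parents[i + 1:]:
                undirected[a].add(b)
                undirected[b].add(a)

    # Step 3: search from x to y without passing through conditioned nodes.
    queue = deque([x])
    seen = {x}
    while queue:
        node = queue.popleft()
        if node == y:
            return False  # a connecting path exists
        for neighbor in undirected[node]:
            if neighbor not in seen and neighbor not in given:
                seen.add(neighbor)
                queue.append(neighbor)
    return True


chain = [("A", "B"), ("B", "C")]     # A -> B -> C
collider = [("A", "C"), ("B", "C")]  # A -> C <- B
```

With these toy graphs, `d_separated(chain, "A", "C", given={"B"})` holds because conditioning on the middle of a chain blocks it, whereas conditioning on the collider in `collider` opens the path between `A` and `B`.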


nxviz 0.5 released!

written by Eric J. Ma on 2018-08-01

nxviz visualization data science software open source

A new version of nxviz is released!

In this update, I have added a declarative interface for visualizing geographically-constrained graphs. Here, nodes in a graph have their placement constrained by longitude and latitude.

An example of how to use it is below:

[Figure: nxviz geoplots]

In the GeoPlot constructor API, the keyword arguments node_lon and node_lat specify which node metadata are used to place nodes on the x- and y-axes, respectively.
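To make the axis mapping concrete, here is a rough sketch written in plain networkx rather than nxviz itself. The helper `geo_positions`, the node attribute names `latitude` and `longitude`, and the coordinates are all made up for illustration; only the keyword names `node_lat` and `node_lon` come from the description above.

```python
import networkx as nx


def geo_positions(G, node_lat, node_lon):
    """Map each node to (x, y) = (longitude, latitude).

    `node_lat` and `node_lon` name the node attributes holding the
    coordinates. This is a hypothetical helper, not part of nxviz.
    """
    return {
        node: (data[node_lon], data[node_lat])
        for node, data in G.nodes(data=True)
    }


# Toy graph with approximate, illustrative coordinates.
G = nx.Graph()
G.add_node("Boston", latitude=42.36, longitude=-71.06)
G.add_node("Austin", latitude=30.27, longitude=-97.74)
G.add_edge("Boston", "Austin")

pos = geo_positions(G, node_lat="latitude", node_lon="longitude")
# `pos` can then be handed to a drawing routine, e.g. nx.draw(G, pos=pos),
# which requires matplotlib to be installed.
```

The point is only that longitude plays the role of x and latitude the role of y; everything else about the drawing is left to the plotting layer.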

By no means do I intend for GeoPlot to replace more sophisticated analysis methods. Like seaborn, the interface is declarative, and the intent is to provide a very quick-and-dirty way for an end user to visualize graphs with spatially constrained nodes.

Please enjoy!


pyjanitor 0.3 released!

written by Eric J. Ma on 2018-07-27

open source pyjanitor data science

A new release of pyjanitor is out!

The two new features I have added are:

  1. Concatenating column names into a single column, such that each item is separated by a delimiter.
  2. Deconcatenating a column into multiple columns, separating on the basis of a delimiter.

Both of these tasks come up frequently in data preparation.

For example, concatenating a few columns together oftentimes lets us create a unique index based on sample properties.

On the other hand, deconcatenating columns into multiple columns can be useful when our index is used to store metadata. (This really shouldn't be happening, but... sometimes that's just how the world works right now...)

Here's an example of how it works:
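As a minimal sketch of the two operations, the snippet below uses plain pandas. The helper names `concat_cols` and `deconcat_col`, their parameters, and the toy data are hypothetical and chosen for illustration; they are not pyjanitor's API.

```python
import pandas as pd


def concat_cols(df, column_names, new_column_name, sep="-"):
    """Join several columns into one string column, separated by `sep`."""
    out = df.copy()
    out[new_column_name] = (
        out[column_names]
        .astype(str)
        .apply(lambda row: sep.join(row), axis=1)
    )
    return out


def deconcat_col(df, column_name, new_column_names, sep="-"):
    """Split one string column into several columns on `sep`."""
    parts = df[column_name].str.split(sep, expand=True)
    parts.columns = new_column_names
    return pd.concat([df, parts], axis=1)


# Toy data: build a unique sample ID from two properties, then split it back.
df = pd.DataFrame({"site": ["A", "B"], "year": [2017, 2018]})
with_id = concat_cols(df, ["site", "year"], "sample_id")
split_back = deconcat_col(with_id, "sample_id", ["site_from_id", "year_from_id"])
```

Note that the split columns come back as strings, so a numeric field such as the year would need to be converted again after deconcatenating.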

To install pyjanitor, grab it from PyPI:

$ pip install pyjanitor

The conda-forge build will be coming soon!


SciPy 2018

written by Eric J. Ma on 2018-07-26

scipy conferences python

It's been about two weeks since SciPy 2018 ended, and I've finally found some breathing room to write about it.

SciPy 2018 is the 4th year I've made it to the conference, my first one being SciPy 2015 (not 2014, as I had originally incorrectly remembered). The conference has grown over the years that I've attended it!


This year, I served again as the Financial Aid Co-Chair with Scott Collis and Celia Cintas. It brings me joy to be able to help bring others to the conference, much as I was helped a few years back when I was still a graduate student.

Building upon last year's application process, where we implemented blinded reviews, this year, we improved the review process so that it was much less tedious and more user-friendly for reviewers, i.e. myself, Celia, Scott and our two committee reviewers, Kasia and Patrick.

The review process can always be improved; we still have some work to do. One would be making the application less intimidating for under-represented individuals. Two might be reworking how we quantify how good our selections are; rather than some aggregate "total" score, it might be that we ought to optimize for a breadth of worthy contributions to the community. Finally, we definitely want to ensure that our focus is on FinAid's mission: to enable us to bring deserving and new people to the conference who otherwise might not have the resources!


I did two tutorials, one with Hugo and one with Mridul. The one with Mridul was on Network Analysis, and the one with Hugo was on Bayesian statistics. Over the years, I've developed muscle memory on Network Analysis, so it felt very natural to me. Bayesian statistics and probabilistic programming was a new topic for Hugo and me; as such, I spent proportionally more time preparing for that tutorial.

What was pleasantly surprising for me was that Bayesian statistics was gaining a ton of popularity, and this tutorial just happened to be there at the right time. I had a lot of one-on-one chats with tutorial participants after the tutorial and during the conference days, where we talked about the application of Bayesian methods to problems that they had encountered. There's a lot to do until people can generally communicate about data problems using Bayesian methods, but I think we're at an upwards-inflection point right now!


I missed the talks mostly because I was doing the Tejas Room track (sit in the Tejas room and chat with people). I nonetheless had a very fruitful and fun time doing so!

To make up for lost time, I put together a playlist of things I'd like to catch up on later.


For the first time ever, I stayed on to sprint! However, I also simultaneously caught a conference bug, so I was basically knocked out for the second day of sprints. For this year's sprints, I implemented a declarative interface for geographic graph visualizations in nxviz, where node placement is prioritized according to geographic information. The intent here isn't to replace geospatial analysis packages, but rather to provide a quick, seaborn-like view into a graph's geographical structure. Once a user has a feel for the data, if nothing more is needed, they can use the graph as is; otherwise, they can move on to a different package.
