Next Steps

written by Eric J. Ma on 2017-08-10

Signed and done! I will be joining the Novartis Institutes for Biomedical Research (NIBR) in September, as part of the Scientific Data Analysis (SDA) team under Novartis Informatics (NX).

NIBR is the research arm of Novartis, and the SDA team is essentially a "Data Special Ops" team inside NIBR. The nature of the position involves both internal consulting and the development of new initiatives across teams.

The role I'm being hired into focuses on statistical learning, which is the general direction I've been moving in during my time in grad school. I picked up and implemented a number of useful and interesting deep learning algorithms back then, and over the past half a year I have finally gotten under the hood of graph & image convolutions, variational autoencoders, and Gaussian processes. It's really fun stuff at its core, and to me, it's even more fun translating biological and chemical data problems into that language, and back.

After a summer learning lots and networking with industry professionals and fellow Fellows at Insight, I'm ready for a bit more structure in my life. Looking forward to starting there!


Open Source Software

written by Eric J. Ma on 2017-08-02

Open source software is awesome, and I've just been thoroughly reminded of why.

Today, I put in a PR to PyMC3: a bug fix to the multinomial distribution's random variate generator, which uses numpy's multinomial under the hood and was tripping over floating point precision errors.

I first encountered this bug last week, when I started trying out PyMC3 on my GPU tower. GPU stuff is tricky, and one of the tricky bits is floating point precision. I'm not well-versed enough to write intelligently about the underlying causes, but one thing I learned is that GPUs prefer 32-bit floating point numbers (float32), while modern CPUs can handle 64-bit (float64). (I'm sure this will change in the future.) For numbers of ordinary magnitude this is no big deal, but when we deal with small numbers (decimals in the thousandths range and smaller), addition errors can crop up.
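As a toy illustration (my own, not code from either library), float32 simply can't represent a tiny increment near 1.0, while float64 can:

```python
import numpy as np

# float32 resolves roughly 7 decimal digits, so a tiny increment
# near 1.0 gets rounded away entirely; float64 keeps it.
print(np.float32(1.0) + np.float32(1e-8))  # 1.0 - the increment vanishes
print(np.float64(1.0) + np.float64(1e-8))  # 1.00000001
```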

This sort of error was exactly the problem I was facing. I had numbers crunched on the GPU in float32 space. Then I had to pass them to numpy's multinomial, which implicitly converts everything to float64. Because multinomial takes in a list of ps (probabilities) that must sum to one, I was getting errors: my list of ps summed to just infinitesimally (in computation land) more than one. I dug around on and off for about a week looking for a solution, but none came. Instead, I relied on a small hack I didn't like: adding one millionth to the sum and renormalizing the probabilities. It worked, but it felt hacky and unprincipled.
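Here's a minimal sketch of the failure mode, using synthetic probabilities rather than my actual model's output: round many small probabilities to float32, and their float64 sum can drift just above one.

```python
import numpy as np

# Many small probabilities, rounded to float32 (as they would be
# coming off the GPU).
rng = np.random.RandomState(42)
ps = rng.dirichlet(np.ones(1000)).astype(np.float32)

print(ps.sum(dtype=np.float64))  # often a hair above 1.0

try:
    np.random.multinomial(100, ps)  # numpy upcasts ps to float64 here
except ValueError as err:
    print(err)  # "sum(pvals[:-1]) > 1.0"
```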

The fix was inspired by someone else's problem discussed on numpy's repository. The trick was to convert the numbers from float32 to float64 first and re-compute the probabilities at float64 precision. I implemented that locally, and everything worked! I quickly ran two of the most relevant tests in the test suite, and they both passed. So I pushed up to GitHub and submitted a PR (after checking in with the lead devs on their issue tracker) - and it was merged tonight!
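The essence of the fix, sketched outside of PyMC3 (the real patch lives inside the multinomial distribution's random variate generator), looks something like this:

```python
import numpy as np

def renormalize(ps):
    # Upcast to float64 first, then renormalize so the probabilities
    # sum to one at float64 precision before numpy sees them.
    ps = np.asarray(ps, dtype=np.float64)
    return ps / ps.sum()

rng = np.random.RandomState(42)
ps32 = rng.dirichlet(np.ones(1000)).astype(np.float32)
draws = np.random.multinomial(100, renormalize(ps32))  # no more ValueError
```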

If PyMC3's and numpy's code bases were not open source, with issues discussed openly, I would not have been able to figure out a possible fix with the help of other people. I also wouldn't have been able to patch the codebase locally first to see if it solved my own problem, and I wouldn't have had access to the test suite to check that nothing was broken. All in all, working with an open source codebase was instrumental to getting this fix implemented.

Big shout-out to the PyMC devs I interacted with on this - Colin & Junpeng. Thank you for being so encouraging and helpful!


Bayesian Neural Networks

written by Eric J. Ma on 2017-07-22

This week, while we Insight Fellows begin going out to interview with companies, my "side hustle" has been working on my Bayesian Analysis Recipes repository.

Two particularly interesting problems I've wanted to write my own implementation for are multinomial classification and Bayesian deep learning. I finally got both of them done today, after about 2-3 days of hacking on them.

Multinomial classification (notebook here) is the problem of classifying an item as one of multiple classes - the natural extension of binary classification (done by logistic regression). To tackle it, I took the forest cover dataset and used PyMC3 to implement multinomial logistic regression. Seeing how to do it with PyMC3 was the most important aspect of this; actual accuracy wasn't much of a concern for me.
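For a flavor of what this looks like, here's a boiled-down sketch; the toy data stands in for the forest cover dataset, and the names and sizes are illustrative rather than the notebook's actual code.

```python
import numpy as np
import pymc3 as pm
import theano.tensor as tt

# Toy stand-in data: 200 samples, 4 features, 3 classes.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = rng.randint(0, 3, size=200)

with pm.Model():
    # One weight column per class, plus per-class intercepts.
    W = pm.Normal("W", mu=0, sd=1, shape=(4, 3))
    b = pm.Normal("b", mu=0, sd=1, shape=3)
    # Softmax turns the per-class linear scores into probabilities.
    p = tt.nnet.softmax(tt.dot(X, W) + b)
    pm.Categorical("y_obs", p=p, observed=y)
    trace = pm.sample(1000, tune=1000)
```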

However, after seeing the classification report (at the bottom of the notebook) and reading that the dataset was originally classified using neural networks, I immediately thought of doing a Bayesian neural network for multi-class classification, having seen one implemented for binary classification on the PyMC3 website.

Bayesian neural networks are not hard to intuit: rather than learning point estimates for the weights, we place priors on them. In doing so, we are able to propagate uncertainty forward to the predictions. Speaking as a non-expert in the field, I think the tricky part is the sampling algorithms needed.

One nice thing about the field of Bayesian deep learning is the use of variational inference to approximate the true posterior with a mathematically more tractable distribution (e.g. a Gaussian). In doing so, we gain a fast way to approximately learn the uncertainty in predictions - essentially, we trade a little bit of accuracy for a lot of speed. For complex models like neural nets, this can be very valuable: the number of parameters to learn grows very, very quickly with model complexity, so anything fast makes iteration easier.

Starting with the code from Thomas Wiecki's website, I hacked together a few utility functions and boiled the example down to its essentials. Feed-forward neural nets aren't difficult to write - just a bunch of matrix ops and we're done. That notebook is available as well. One nice bonus: going with a deep neural network bought us additional predictive accuracy!
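A condensed sketch of the pattern (again on toy data; the real notebook differs in details like layer sizes and priors):

```python
import numpy as np
import pymc3 as pm
import theano.tensor as tt

# Toy data: 200 samples, 4 features, 3 classes.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = rng.randint(0, 3, size=200)
n_hidden = 16

with pm.Model():
    # Priors on the weights instead of point estimates.
    w1 = pm.Normal("w1", mu=0, sd=1, shape=(4, n_hidden))
    w2 = pm.Normal("w2", mu=0, sd=1, shape=(n_hidden, 3))
    # The feed-forward pass really is just matrix ops plus a nonlinearity.
    act = tt.tanh(tt.dot(X, w1))
    p = tt.nnet.softmax(tt.dot(act, w2))
    pm.Categorical("y_obs", p=p, observed=y)
    # ADVI fits a tractable approximation instead of sampling exactly,
    # trading a little accuracy for a lot of speed.
    approx = pm.fit(n=30000, method="advi")
    trace = approx.sample(1000)
```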

Moving forward, I'd like to improve that notebook a bit more by implementing a visualization of multi-class classification uncertainty - the thing we gain from going Bayesian. Hopefully I'll be able to get to that next week, though it's shaping up to be quite hectic!

As a side note, I found a bug in PyMC3's multinomial distribution implementation, and am working with one of the core developers to get it fixed in PyMC3's master branch. (Thanks a ton, Junpeng, if you ever get to read this!) In the meantime, I simply took his patch, modified mine a little bit, and used the patched-up PyMC3 for my own purposes.

This is why I think open source is amazing - I can literally patch the source code to get it to do what I need, correctly! Wherever I work next has to be supportive of things like this, and has to allow re-release of generally useful code that I touch - it's the right thing to do!


Lessons Learned During Insight

written by Eric J. Ma on 2017-07-17

(a) Solving healthcare goes beyond solving the science underlying it.

Healthcare delivery is, at its core, a human problem. Even choosing what to optimize for is hard: do we optimize for changing human behaviour, or for more precise treatments?

(b) Healthcare is complex

The biggest thing preventing us from "solving healthcare" is misaligned incentives.

(c) I like scientific data

Regardless of the lesson that solving healthcare takes more than science, I still found myself naturally much more engaged with companies dealing with scientific data as part of their data science problems. Teams dealing with other types of data - insurance claims, financial, marketing, platform product analytics, click streams - were much less engaging. I know my best fit now, though I won't rule out other teams.

(d) People can change the equation.

I met some people whose intellect and breadth of knowledge I really admire! Additionally, passion is infectious. It helps to work with colleagues who energize one another, rather than drain each other's energy.

(e) Some Insight alumni are awesome

And I want to be like them when I help mentor the next batch. If I get a chance to interview others, I'd like to model how I interview after the alumni mentors.

The biggest shout-out goes to George Leung, who works for Vectra. He tailored his mentoring session by first asking about my Insight project, which involved Gaussian processes and variational autoencoders (VAEs), then asking me what VAEs were, and finally having me solve a Bayes problem on the board. I could tell he was building his questions on the fly.

The other shout-out goes to Ramsey Kamar, who went through the "Big 4" questions: tell me about yourself, what are your previous accomplishments, tell me about a conflict you faced, and what's your biggest weakness. His feedback to me was direct, positive, and, most importantly, always encouraging.

(f) Humanities tools are needed

On reflection, I think that if we’re going to solve the “human” portion of healthcare, we’re going to need tools from the humanities - the tools that let us qualitatively and quantitatively study human behaviour. While data science can provide a quantitative path towards a solution, the qualitative side of it will remain as important as ever.


Insight Week 7

written by Eric J. Ma on 2017-07-15

Aaand with that, week 7 of Insight is done!

I had a short week because of SciPy 2017, and I'm thankful that I got a chance to head out there - I had the opportunity to reconnect with many friends from the SciPy community.

The two days of Week 7 that I experienced were probably the weirdest week 7 any Fellow has had to date. Because I had missed a demo on account of SciPy, and because the company didn't want to just watch the pre-recorded demo video, I made a trek up to Cambridge to demo on-site. What was initially meant to be a 30-minute session turned into a 1.5-hour demo.

I have two more demo obligations to fulfill next week. Other than that, it's going to be mostly interview preparation with other Fellows: more data and coding challenges, and more studying of topics we're not familiar with. I'm trying to brush up on SQL in particular, as I can see it being a useful tool for querying data out of databases.

Now that we're done with Week 7, we're going to be alumni soon. As such, I've begun thinking about how I could give back as an alumnus. Some ideas have come to mind, inspired by what others have done.

Firstly, I think I can help standardize future Fellows' coding environments by providing a set of annotated instructions for installing the Anaconda distribution of Python. An evening workshop on the first Thursday might be useful, too.

Secondly, I've come to recognize that the biggest bottleneck in Fellows' projects is the web deployment and design portion. Model training to obtain an MVP is fairly fast - one of scikit-learn's models is often good enough. However, most of us didn't know HTML or Bootstrap CSS, and the deadline made it stressful to pick them up on the fly. (The stress is probably compounded by the fact that the web app/blog post is not the most intellectually interesting portion of the project.) A workshop at the end of Week 2 or the beginning of Week 3 might be good.

Thirdly, I see a trend in which many more projects are going to start using deep learning. Putting a workshop together with, say, Jigar could be really useful.

Finally, my interview simulator questions have become famous for being a "hybrid" of stats, ML, and CS. They're very much in the same vein as what I got when I interviewed with Verily.

Until we get hired, we are allowed (and one might even say, expected) to continue coming into the office to help each other prepare for upcoming interviews. We're all looking forward to getting hired and solving data problems!

With this post, I think I'll end the regular blog post series here. I hope this series has been an informative insight into Insight! The next post will be a summary of lessons learned from my time as an Insight Health Data Fellow.