SciPy 2017

written by Eric J. Ma on 2017-07-12

I just got back from SciPy 2017! This was a fruitful conference, and I'm glad I managed to make it.

Monday was the first day. I wanted to get a better feel for the Jupyter widgets ecosystem, and as such I sat in on the corresponding tutorial. That happened to be the only tutorial I sat in live.

Nonetheless, one nice thing about the tutorials is that they are recorded live, so we can watch the ones we missed on our own time once we're back home. These are the ones I hope to catch, partly out of interest, partly from recommendations by other conference attendees who sat in them:

Looking at the list, I kind of realize now how much of a Continuum Analytics fanboy I've become...

On the second day, I delivered my own tutorial, Network Analysis Made Simple. I collected some feedback right at the end of the tutorial, and responses looked overall very positive. Many liked the whiteboard illustrations that I added on. If I deliver this at PyCon, I think it would benefit from having a whiteboard of sorts as well.

The third day was the start of the conference talks. There were many, many great talks! I also had the opportunity to connect with new people over breakfast, lunch, coffee and dinner. I tried hosting "office hours", like Matt Davis did last year, but I think I announced it a bit too late.

All-in-all, I think it was great to attend SciPy 2017 this year. I'm happy to have not broken the chain of attendance. Looking forward to serving on next year's organizing committee again, and I hope to have a new tutorial in the works!

Insight Week 6

written by Eric J. Ma on 2017-07-08

We had a short week this week because of the long July 4th weekend (Happy Birthday, America!).

Wednesday was my second demo day, this time at MGH. There were 8 of us demoing at MGH's Clinical DS team, and I really enjoyed the interaction with them. The team asked me two technical questions about Flu Forecaster, both of which were analogous to other questions I had heard before. After the demo, we hung out with the team and chatted a bit about their latest projects.

In the afternoon, I focused on doing the data challenge and leetcode exercises; in the evening, I (at the last minute) signed up for back-to-back behavioral and ML interview practice sessions. It was good to chat with the alumni helping with the sessions, as I learned much more about their thought process. In the future, I'll probably be called on to interview other people, and I will definitely draw on my experiences here.

On Thursday we had more prep. I helped with mock interviewing by being an observer for Xi and an interviewer for Angela. The role-playing with Angela was an interesting one for me. I tried playing the role of a conversational but technically-competent interviewer, and I also asked questions out of genuine curiosity. I think that, combined with Angela's outgoing personality, kept the conversation enjoyable for all three of our spectators.

In the late afternoon, an NYC session alum came by and gave us a session on data challenges. The exercise he gave was quite neat - basically, given one categorical output column and a slew of other feature columns, train the best model that has the highest accuracy score. Oh, the twist? Do it in 25 minutes.

The key point from this exercise was to have us get prepared for an on-site data challenge. The on-site data challenge mainly helps the hiring team check that we have the necessary coding chops to work with the team. It also lets them see how we perform under time constraints. The most important thing is to deliver a model with some form of results. Iterating fast is very important. Thus, it helps to quickly push out one model that works.

On Friday, we did another round of the interview simulator. I thought it was better run this time round. The mutual feedback from one another is very helpful. I was tasked with a stats question, which I melded into a hybrid stats + CS question, thus modelling what I had received when I was interviewed at Verily. FWIW, the question I asked was to define bootstrap resampling (sampling with replacement), implement it using the Python standard library, and discuss the scenarios where it is useful.
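For reference, here's a minimal stdlib-only sketch of bootstrap resampling along those lines. The function name and the toy data are my own illustration, not what any candidate actually wrote:

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=10_000, alpha=0.05, seed=42):
    """Estimate a confidence interval for `stat` via bootstrap resampling."""
    rng = random.Random(seed)
    # Each bootstrap sample draws len(data) points WITH replacement.
    estimates = sorted(
        stat(rng.choices(data, k=len(data)))
        for _ in range(n_boot)
    )
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

lo, hi = bootstrap_ci([2.1, 2.5, 2.7, 3.0, 3.2, 3.8, 4.1])
```

The discussion half of the question is where it gets interesting: bootstrapping shines when the sampling distribution of a statistic is hard to derive analytically.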

If tasked with a similar one next time, I will probably ask about writing a function to sample from a Bernoulli distribution using only the Python standard library. It's useful to know how to implement these statistical draws when it's difficult or impossible to use other libraries. (I had to do it when trying out the PyPy interpreter a few years back, and didn't want to mess with installing numpy for PyPy.)
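A stdlib-only implementation is short; this sketch leans on the standard trick of comparing a uniform draw against p:

```python
import random

def bernoulli(p, rng=random):
    """Draw a single Bernoulli(p) sample using only the standard library."""
    if not 0 <= p <= 1:
        raise ValueError("p must be in [0, 1]")
    # A Uniform(0, 1) draw falls below p with probability exactly p.
    return 1 if rng.random() < p else 0

# Sanity check: the empirical mean should approach p.
rng = random.Random(0)
draws = [bernoulli(0.3, rng) for _ in range(100_000)]
```

The follow-up discussion could extend this to binomial or geometric draws, which compose naturally out of repeated Bernoulli trials.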

I liked a few of the other questions asked as well - for example, the knapsack problem posed by Steve: Given a set of produce items, each with their own value and weight (in kg), and a knapsack that can only carry a maximum weight of produce, find the set of produce that will maximize value at the market.
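The classic 0/1 dynamic-programming answer to that question looks something like the sketch below. It assumes integer weights (a common interview simplification), and the produce items are made up for illustration:

```python
def knapsack(items, capacity):
    """0/1 knapsack via dynamic programming.

    items: list of (name, value, weight) tuples with integer weights.
    capacity: maximum total weight the knapsack can carry.
    Returns (best_value, chosen_names).
    """
    n = len(items)
    # best[i][w] = max value achievable using the first i items within weight w
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i, (_, value, weight) in enumerate(items, start=1):
        for w in range(capacity + 1):
            best[i][w] = best[i - 1][w]  # option 1: skip item i
            if weight <= w:              # option 2: take item i, if it fits
                best[i][w] = max(best[i][w], best[i - 1][w - weight] + value)
    # Trace back which items were taken.
    chosen, w = [], capacity
    for i in range(n, 0, -1):
        if best[i][w] != best[i - 1][w]:
            name, _, weight = items[i - 1]
            chosen.append(name)
            w -= weight
    return best[n][capacity], chosen[::-1]

produce = [("apples", 6, 3), ("bananas", 5, 2), ("cherries", 9, 4)]
value, picked = knapsack(produce, capacity=6)
```

The table-based solution runs in O(n × capacity) time, which is a nice talking point about pseudo-polynomial complexity in an interview setting.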

That afternoon, we slowed things down a bit. Regardless of how much we benefit from them, the interview simulators are nonetheless tiring. But that's the key point - interviews are day-long, exhausting endeavours that test stamina and the ability to switch between contexts (both technical and social). The simulator aims to simulate that.

Looking forward to next week. For me it'll be a short one, because I'll be at SciPy 2017 to lead a Network Analysis tutorial. Also hoping to represent Insight well!

Insight Week 5

written by Eric J. Ma on 2017-07-01

First off, Happy Canada Day!

Week 5 is primarily focused on interview prep as a bunch of us go out for our demos.

We kicked off Monday with an interview prep field day. The main areas of focus for us were CS fundamentals, machine learning, SQL, and behavioral interviewing. I found SQL to be my weakest point, and I'll definitely be focusing a lot of efforts on there. I had a chance to explain gradient descent and regularization using algebra - something I never thought I would do!

On Tuesday, Fellows began going outside for demos. My first demo will be at Boston Health Economics this Thursday, followed by (in no particular order) MGH, Biogen, Merck, OM1, and Immuneering. Definitely looking forward to presenting Flu Forecaster to them!

On the side, we also started thinking through computer science fundamentals problems, and doing data analytics challenges. CS fundamentals are what you'd think they would be, covering data structures and algorithms. I found myself to be particularly fond of recursion, and implemented a recursive algorithm for something that could be solved in linear time without recursion. It was good to see my biases, and to try my hand at implementing the same thing in two fundamentally different styles.

In the evening, Nick (one of the fellows) gave us a run through on SQL. It was very useful to have his perspective, which was basically that most of the problems we will encounter involve some degree of nested searches, and that we have to work backwards from what we want. I also had a good perspective from my alumni mentor on how to approach describing my thesis to interviewers.

On Wednesday, the interview prep continued with more coding challenges, demo trips, and fellow-led workshops. Together with Jeff and Jigar, I led a deep learning fundamentals workshop, in which we went through how deep learning works for feed-forward neural networks and convolutional neural networks.

Thursday came my first demo, which was at Boston Health Economics. Overall, I thought the demo session went well, and that Catherine, our host, stayed engaged with the presentations. I very much appreciate her intellect. Additionally, I took the approach of "free styling it" (conditioned, of course, on having previously rehearsed it enough times), which resulted in a demo presentation that was overall smoother than what I had previously delivered.

Apart from that, we continued our interview prep. This involved more CS fundamentals for me, getting more practice with common algorithms, and finishing the coding exercises that Ivan gave us.

On Friday, we did an interview simulator, in which we practiced interviewing one another. This gave me a better view into the thought process that an interviewer might be going through, particularly when conducting a technical interview. From prior experience interviewing, I remembered that my most pleasant interviews were with individuals who kept the atmosphere positive, encouraging, and provided hints along the way. Thus, I tried to conduct the mock interviews in the same way.

In the afternoon, I gave a very short workshop on how to write Pythonic code, which covered PEP8 (which is now checkable using pycodestyle). It was fun seeing everybody go, "Whoa! Atom can do that?!" and then promptly going ahead to clean up their code according to the flake8 linter's recommendations.

Interspersed throughout the week, I made an effort to summarize my thesis work a bit more. I think I have a few ways/hooks to explain it to a 'recruiter without a technical background', a 'computer scientist without biology background', and a 'biologist without a computing background'. Making it concise with a good "hook" was the hardest part, but I think I have something good now.

Using Bokeh in FluForecaster

written by Eric J. Ma on 2017-06-30

Author: Eric J. Ma, Insight Health Data Science Fellow (Boston 2017b)


In this blog post, I will show how Bokeh featured in my Insight project, FluForecaster.


As Health Data Fellows at Insight, we spend about 3 weeks executing on a data project, which we demo at companies that are hiring. I built FluForecaster, a project aimed at forecasting influenza sequence evolution using deep learning.

My choice of project was strategically aligned with my goals on a few levels. Firstly, I wanted to make sure my project showcased deep learning, as it's currently one of the hottest skills to have. Secondly, I had components of the code base written in separate Jupyter notebooks prior to Insight, meaning I could execute on it quickly within the three weeks we had. Thirdly, I joined Insight primarily with the goal of networking with the Insight community, and that basically meant 'being a blessing' to others on their journey too - if I could execute fast and well on my own project, there'd be time left to be a team player, and to help other Fellows in the session get their projects across the finish line.

Each of us had to demo a "final product". Initially, I was thinking about a "forecasting dashboard", but one of our program directors, Ivan, suggested that I include more background information. As such, I decided to make the dashboard an interactive blog post instead. Thus, with FluForecaster being a web-first project, I finally had a project in which I could use Bokeh as part of the front-end.

Applying Bokeh

Bokeh was used mainly for displaying three data panels in the browser. Firstly, I wanted to show how flu vaccine efficacy rarely crossed the 60% threshold over the years. Secondly, I wanted to show a breakdown of the number of sequences collected per year (as used in my dataset). Thirdly, I wanted to show a visual display of influenza evolution.

For yearly vaccine effectiveness, it was essentially a line and scatter chart, with the Y-axis constrained between 0 and 100%. I added a hover tooltip to enable my readers to see the exact value of vaccine effectiveness as measured by the US CDC.
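For concreteness, here's a minimal sketch of that kind of line-plus-scatter panel with a hover tooltip, using the bokeh.plotting interface. The numbers are placeholders of my own, not the actual CDC effectiveness figures, and the real chart in FluForecaster was styled differently:

```python
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, HoverTool

# Illustrative numbers only -- not the actual CDC figures.
source = ColumnDataSource(data=dict(
    year=[2010, 2011, 2012, 2013, 2014],
    effectiveness=[56, 47, 49, 52, 19],
))

p = figure(title="Flu vaccine effectiveness by season",
           x_axis_label="Year", y_axis_label="Effectiveness (%)",
           y_range=(0, 100))  # constrain the Y-axis to 0-100%
p.line("year", "effectiveness", source=source)
p.scatter("year", "effectiveness", source=source, size=8)

# Hover tooltip showing the exact value at each point.
p.add_tools(HoverTool(tooltips=[("year", "@year"),
                                ("effectiveness", "@effectiveness%")]))
```

Wiring both glyphs to one ColumnDataSource keeps the line and the scatter points in sync if the data is ever updated.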

To show the number of sequences per year in the dataset, the same kind of chart was deployed.

Bokeh magic became really evident later when I wanted to show sequence evolution in 3 dimensions. Because 3D charts are generally a poor choice for a flat screen, I opted to show pairs of dimensions at a time. A nice side-effect of this is that because my ColumnDataSource was shared amongst each of the three pairs of coordinates, panning and selection were automatically linked for free.
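A minimal sketch of that pattern, with hypothetical coordinates standing in for the real sequence-embedding dimensions:

```python
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
from bokeh.layouts import gridplot

# Hypothetical 3-D coordinates; the real project used sequence embeddings.
source = ColumnDataSource(data=dict(
    d1=[0.1, 0.4, 0.35], d2=[1.2, 0.9, 1.1], d3=[0.7, 0.3, 0.5],
))

TOOLS = "pan,box_select,reset"
panels = []
for x, y in [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]:
    p = figure(width=250, height=250, tools=TOOLS,
               x_axis_label=x, y_axis_label=y)
    p.scatter(x, y, source=source)  # the SAME source in every panel
    panels.append(p)

# Because all three panels share one ColumnDataSource, a selection
# made in any panel is mirrored in the other two automatically.
grid = gridplot([panels])
```

No explicit linking code is needed: the shared data source is the whole trick.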

Usage Pros and Cons

Bokeh's API is very powerful: it supplies many plotting primitives (glyphs in particular), which makes it a big plus for users who are experienced with the library and are creating complex interactive charts.

Most of my fellow Fellows at Insight ended up using the bokeh.plotting interface, and I did too. I think the bokeh.plotting interface provides the best balance between ease-of-use and flexibility. If you take a look at the code here, you'll notice that there's often a bit of boilerplate that gets repeated with variation, such as in the configuration of custom hover tools. I think this is the tradeoff we make for configurability... or I might just not be writing the code as efficiently as I could. :)

There were times where I was tempted to just use the bkcharts declarative interface instead. It's a lot easier to use. However, I did have some time on hand, and wanted to get familiar with the bokeh.plotting interface, because there's a possibility that I might want to make wrappers for other visualizations that lend themselves to a declarative API.

Embedding Visualizations

I built my interactive blog post using a combination of Flask, hand-crafted HTML, Bootstrap CSS & JS, and Bokeh - which took care of the bulk of visuals. I drew static figures using Illustrator.

Embedding the necessary Bokeh components wasn't difficult. Very good documentation is available on the Bokeh docs. The key insight I learned was that I could pass the components into my Flask app functions' return statements, and embed them using Jinja2 templating syntax. An example can be found here. Basically, components returns a div and a js object, which are essentially just strings. To embed them in the templates, we use the syntax {{ div|safe }} and {{ js|safe }}. That |safe is very important: it tells the Jinja2 templating engine that it's safe to render those pieces of JavaScript and HTML.
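A stripped-down sketch of that wiring, assuming a hypothetical route and a minimal inline template (the real app's hand-crafted templates were more elaborate):

```python
from flask import Flask, render_template_string
from bokeh.embed import components
from bokeh.plotting import figure
from bokeh.resources import CDN

app = Flask(__name__)

# Minimal stand-in for the hand-crafted HTML templates.
TEMPLATE = """
<html>
  <head>
    {{ resources|safe }}  {# BokehJS <script>/<link> tags from the CDN #}
    {{ js|safe }}         {# the plot's <script> from components() #}
  </head>
  <body>
    {{ div|safe }}        {# the target <div> the script renders into #}
  </body>
</html>
"""

@app.route("/")
def index():
    p = figure(title="demo")  # placeholder figure
    p.line([2010, 2011, 2012], [56, 47, 49])
    js, div = components(p)  # both are plain strings
    return render_template_string(
        TEMPLATE, resources=CDN.render(), js=js, div=div
    )
```

Without the |safe filter, Jinja2 would escape the angle brackets and the page would render the raw markup as text instead of a chart.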


Through the project, I became a lot more familiar with the Bokeh plotting library. Now I feel a bit torn! I've contributed to both the Bokeh and matplotlib projects, and I love them both! I've also come to deeply respect the lead developers of both projects, having interacted with them many times. If I were to make a comment on "where to use what" based on my experience, it'd probably still be the conservative view of "matplotlib for papers, bokeh for the web"... but I'm sure that will be outdated soon. Who knows how the Python plotting landscape will evolve - it's exciting times ahead, and at least for now, I'm happy for the experience driving a dataviz project with Bokeh!

Insight Week 4

written by Eric J. Ma on 2017-06-24

Week 4 has been all about demos: polishing our demos, and picking companies that we want to demo at (and possibly interview at later on). Every morning, we practice our demos, 10 minutes per person, with the goal of keeping each demo under 5 minutes to leave time for Q&A. I've found that the act of rehearsing makes it much easier to pick out where I need improvement. For example, I tended to have trouble explaining the validation portion smoothly, even though I knew what I was doing there. One technique that seems useful, especially for short demos, is writing out exactly what I want to say, and that definitely helped.

On the type of work that I'm interested in, here are some things I've become much clearer on.

Firstly, the factors I'm considering for a company. The ideal combination is: a company that deeply values the hard sciences (in my case, the life sciences), is solving very tough technical problems that require growth in and mastery of deep technical topics, and has a team that encourages experimentation, personal growth, and open source contributions on company time. We'd have to be at the innovation boundary of very powerful techniques. This is important to me because I believe that 5-10 years down the road, I would then have mastery over foundational and broadly applicable tools, plus the experience of applying them to real-world problems, which I could leverage to solve more cool and interesting problems. It's also a good defence against being pigeon-holed into particular domains or tasks - autonomy in problem selection and definition is very important to me, so most of my choices aim to maximize that over money.

Secondly, I've effectively ruled out companies that are dealing with non hard-science data, e.g. insurance claims, marketing & advertising, finance, and business data. Having applied computation to the life sciences over grad school, and being trained in the life sciences for over 10 years, I'm not ready to give up that background knowledge to work on other problems. I also believe that investing in the hard sciences means investing in the next wave of real-world innovation, and I'd like to ride that wave.

Thirdly, within the next 5 years, I see myself growing as a technical person, rather than a management person. People issues, particularly conflict resolution, make it difficult to focus on being a good craftsman, and I much more enjoy craftsmanship than management.

Now, on the companies that have come by...

Most are using open data science tools in their toolkit, and this mostly means Python and R, Spark and a few other big DB tools. Some are still using SAS (.................) and didn't show a trend towards open data science languages, and effectively ruled themselves out of contention. (Using legacy tools signals a lack of forward-thinking and a desire to favour the status quo over pushing boundaries.)

Some have given us words of wisdom. One guy basically said that healthcare has messed up (he used stronger language) incentives. Another said that to solve healthcare we need to first solve human behaviour. All very interesting points that are well-taken on my side. A non-healthcare company told us that if we're not paying for a service, then we're the product.

In our session, it was basically the pharma research arms that piqued my interest the most, aside from one hospital's internal startup team. The gap in interest between #4 and #5 (for me, at least) was really big, and the gap of interest from #5 to the rest was even larger.

Anyways, week 5 begins soon, and we pivot over into interview prep. Looking forward to learning lots, particularly doing deep dives on my weak spots!