Visualize Large Datasets by Sampling

written by Eric J. Ma on 2017-09-14

Just a little tip, putting it here for myself and others in case it helps.

Sometimes, you need to visualize a large dataset, but it takes a ton of time to render it or compute the necessary transforms.

If your data points were sampled independently of one another (i.e. not a time series), and the goal is a statistical visualization, then it's valid to visualize a random subsample of the dataset.

I recently encountered this point at work. After running a clustering analysis, I wanted to see a pair plot of the distribution of features in each cluster. However, with cluster sizes ranging from 200 to 2 million, rendering times were unreasonably long (making things non-interactive) for the large clusters. I thus decided to downsample the large clusters to a maximum of 2,000 data points. Instantly, render times improved, and I could start interacting with my data again.
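Here's a minimal sketch of that kind of per-cluster downsampling, using only the Python standard library. The function name `downsample_by_cluster`, its parameters, and the toy data are all made up for illustration; this is not the code I ran at work, just one way the idea could look.

```python
import random
from collections import defaultdict


def downsample_by_cluster(rows, labels, max_per_cluster=2000, seed=0):
    """Return (rows, labels) with each cluster capped at max_per_cluster points.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so the resulting plot is reproducible
    by_cluster = defaultdict(list)
    for row, label in zip(rows, labels):
        by_cluster[label].append(row)

    kept_rows, kept_labels = [], []
    for label, members in by_cluster.items():
        # Small clusters are kept whole; large ones are sampled without replacement.
        if len(members) > max_per_cluster:
            members = rng.sample(members, max_per_cluster)
        kept_rows.extend(members)
        kept_labels.extend([label] * len(members))
    return kept_rows, kept_labels


# Toy data: one small cluster (200 points) and one large one (10,000 points).
rows = [(i, i * 2) for i in range(10_200)]
labels = ["small"] * 200 + ["large"] * 10_000

sampled_rows, sampled_labels = downsample_by_cluster(rows, labels)
print(sampled_labels.count("small"), sampled_labels.count("large"))  # 200 2000
```

The small cluster passes through untouched, so only the clusters that were slowing down rendering lose points.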

Little things matter!


nano text editor hacks

written by Eric J. Ma on 2017-09-11

Much as I've embraced the Atom text editor, there are times when the GUI isn't accessible to us, and we are forced to use a Terminal-based text editor.

Now, I'm not one of those crazy types who use emacs or vim - those are the real seasoned pros. (I still don't know how to exit vim, btw.) As such, my terminal editor of choice remains the venerable nano. Here are some hacks that I recently figured out, to make text editing much easier in nano.

(1) Syntax highlighting

This is such a big one! Syntax highlighting seriously helps a ton. If you're on a Mac, make sure you install homebrew's version of nano - you can look at my dotfiles or run the command:

$ brew install nano

Then, edit your ~/.nanorc file to look something like this:

include /usr/local/share/nano/python.nanorc  # gives you Python syntax highlighting
include /usr/local/share/nano/sh.nanorc  # gives you bash shell syntax highlighting

Next time you use nano (from your user account), syntax highlighting should be enabled!

You can find a sample .nanorc file on my GitHub dotfiles repository.

(2) Keyboard Shortcuts

Here's a laundry list of keyboard shortcuts I've muscle-memorized:

(3) Persistence

nano, not being as fancy as vim or emacs, doesn't have the concept of sessions. Doesn't matter - use tmux to persist!


All-in-all, the biggest one that aids in writing on a terminal editor is syntax highlighting. I wrote this blog post in nano, and being able to visually see different parts of my text highlighted according to their meaning has made writing much easier.


What would be useful for aspiring data scientists to know?

written by Eric J. Ma on 2017-08-31

I originally titled this post, "What you need to know to become a data scientist", but I backed off from having such an authoritative post title for I wanted to keep things opinionated without being pompous :).

Data Science (DS) is a hot field, and I'm going to be starting my new role doing DS at Novartis in September. As an aside, what makes me most happy about this role is that I'm going to do DS in the context of the life sciences (one of the "hard sciences")!

Now that I have secured a role, some people have come to ask me questions about how I made the transition into DS and into the industry in general. I hope to provide answers to those questions in this blog post, and that you, the reader, find it useful.

I will structure this blog post into two sections:

  1. What do I need to know and how do I go about it?
  2. What do I need to do?

Ready? Here we go :)


First off, let's talk about what I think you, an aspiring data scientist, need to know, and how to go about learning it.

Topic 1: Statistical Learning

Statistical learning methods are going to top the list. From the standpoint of "topics to learn", there's a laundry list one can write - all of the ML methods in scikit-learn, neural networks, statistical inference methods and more. It's also very tempting to go through that laundry list of terms, learn how they work underneath, and call it a day there. I think that's all good, but only if that material is learned while in the service of picking up the meta-skill of statistical thinking. This includes:

  1. Thinking about data as being sampled from a generative model parameterized by probability distributions (my Bayesian fox tail is revealed!),
  2. Identifying biases in the data and figuring out how to use sampling methods to help correct those biases (e.g. bootstrap resampling, downsampling), and
  3. Figuring out when your data are garbage enough that you shouldn't proceed with inference and instead think about experimental design.
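To make one of the sampling methods named above a little more concrete, here's a rough sketch of bootstrap resampling, used to put a percentile confidence interval around a sample mean. The function `bootstrap_mean_ci`, its default parameters, and the simulated data are my own choices for illustration; treat it as a sketch under those assumptions rather than a recipe.

```python
import random
import statistics


def bootstrap_mean_ci(data, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of data.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original data.
        resample = rng.choices(data, k=n)
        means.append(statistics.fmean(resample))
    means.sort()
    # Read off the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Simulated data standing in for real measurements.
data_rng = random.Random(42)
data = [data_rng.gauss(10.0, 2.0) for _ in range(500)]

low, high = bootstrap_mean_ci(data)
print(f"sample mean = {statistics.fmean(data):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

The point is less the interval itself than the habit: treat the data you have as one draw from a process, and ask how much your summary would move under another draw.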

That meta-skill of statistical thinking can only come with practice. Some only need a few months, some need a few years. (I needed about a year's worth of self-directed study during graduate school to pick it up.) Having a project that involves this is going to be key! A good introduction to statistical thinking for data science can be found in a SciPy 2015 talk by Chris Fonnesbeck, and working through the two-part computational statistics tutorial by him and Allen Downey (Part 1, Part 2) helped me a ton.

Recommendation & Personal Story: Nothing beats practice. This means finding ways to apply statistical learning methods to projects that you already work on, or else coming up with new projects to try. I did this in graduate school: my main thesis project was not a machine learning-based project. However, I found a great PLoS Computational Biology paper implementing Random Forests to identify viral hosts from protein sequence, and it was close enough in research topic that I spent two afternoons re-implementing it using scikit-learn, and presenting it during our lab's Journal Club session. I then realized the same logic could be applied to predicting drug resistance from protein sequence, and re-implemented a few other HIV drug resistance papers before finally learning and applying a fancier deep learning-based method that had been developed at Harvard to the same problem.

Topic 2: Software Engineering

Software engineering (SE), to the best of my observation, is about three main things: (a) learning how to abstract and organize ideas in a way that is logical and humanly accessible, (b) writing good code that is well-tested and documented, and (c) being familiar with the ever-evolving ecosystem of packages. SE is important for a data scientist, because models that are making predictions often are put into production systems and used beyond just the DS themselves.

Now, I don't think a data scientist has to be a seasoned software engineer, as most companies have SE teams that a data scientist can interface with. However, having some experience building a software product can be very helpful for lubricating the interaction between DS and SE teams. Having a logical structure to your code, writing basic tests for it, and providing sufficiently detailed documentation, are all things that SE types will very much appreciate, and it'll make life much easier for them when coming to code deployment and helping with maintenance. (Aside: I strongly believe a DS should take primary responsibility for maintenance, and not the SE team, and only rely on the SE team as a fallback, say, when people are sick or on vacation.)

Recommendation & Personal Story: Again, nothing beats practice here. Working on your own projects, whether work-related or not, will help you get a feel for these things. I learned my software engineering concepts from participating in open source contributions. The first was a contribution to matplotlib documentation, where I first got to use Git (a version control system) and Travis CI (a continuous integration system). It was there that I also got my first taste of software testing. The next year, I quickly followed it up with a small contribution to bokeh, and then decided at SciPy 2016 to build nxviz for my Network Analysis Made Simple tutorials. nxviz became my first independent software engineering project, and also my "capstone" project for that year of learning. All-in-all, getting practice was instrumental for my learning process.

Topic 3: Industry-Specific Business Cases

This is something I learned from my time at Insight, and is non-negotiable. Data Science does not exist in a vacuum; it is primarily in the service of solving business problems. At Insight, Fellows get exposure to business case problems from a variety of industries, thanks to the Program Directors' efforts in collecting feedback from Insight alumni who are already Data Scientists in the industry.

I think business cases show up in interviews as a test of a candidate's imaginative capacity and/or experience: can the candidate demonstrate (a) the creativity needed in solving tough business problems, and (b) the passion for solving those problems? Neither of these is easy to fake when confronted with a well-designed business case. In my case, it was tough for me to get excited about data science in an advertisement technology firm, and I was promptly rejected right after an on-site business case.

It's important to note that these business cases are very industry-specific. Retail firms will have needs distinct from those of marketing firms, and both will be very distinct from healthcare and pharmaceutical companies.

Recommendation & Personal Story: For aspiring data scientists, I recommend prioritizing the general industry area that you're most interested in targeting. After that, start going to meet-ups and talking with people about the kinds of problems they're solving - for example, I started going to a Quantitative Systems Pharmacology meet-up to learn more about quantitative problems in the pharma research industry; I also presented a talk & poster at a conference organized by Applied BioMath, where I knew lots of pharma scientists would be present. I also started reading through scientific journals (while I still had access to them through the MIT Libraries), and did a lot of background reading on the kinds of problems being solved in drug discovery.

Topic 4: CS Fundamentals

CS fundamentals really means things like algorithms and data structures. I didn't do much to prepare for this. The industry I was targeting didn't have a strong CS legacy/tradition, unlike most other technology firms doing data science (think the Facebooks, Googles, and Amazons), which do. Thus, I think CS fundamentals are mostly important for cracking interviews, and while problems involving CS fundamentals certainly can show up at work, unless something changes, they probably won't occupy a central focus of data science roles for a long time.

Recommendation & Personal Story: As I don't really like "studying to the test", I didn't bother with this - but that also meant I was rejected from tech firms that I did apply to (e.g. I didn't pass Google Brain's phone interview). Thus, if you're really interested in those firms, you'll probably have to spend a lot of time getting into the core data structures in computer science (not just Python). Insight provided a great environment for us Fellows to learn these topics; that said, it's easy to over-compensate and neglect the other topics. Prioritize accordingly - based on your field/industry of experience.


Now, let's talk about things you can start doing from now on that will snowball your credibility for entering into a role in data science. To be clear, these recommendations are made with a year-long time horizon in mind - these are not so much "crack-the-interview" tips as they are "prepare yourself for the transition" strategies.

Strategy 1: Create novel and useful material, and share it freely

This is very important, as it builds a personal portfolio of projects that showcase your technical skill. A friend of mine, Will Wolf, did a self-directed Open Source Masters, where he not only delved deeply into learning data science topics, but also set about writing blog posts that explained difficult concepts for others to understand, and showcased data projects that he was hacking on while learning his stuff.

Another friend of mine, Jon Charest, wrote a blog post doing a network analysis about metal bands and their shared genre labels - along the way producing a great Jupyter Notebook and network visualizations that yielded contributions to nxviz! Starting with that project, he did a few more, and eventually landed a role as a data scientist at MathWorks.

Apart from blog posts, giving technical talks is another great way to showcase your technical mastery. I had created the Network Analysis Made Simple tutorials, inspired by Allen Downey's X Made Simple series, as a way of solidifying my knowledge on graph theory and complex systems, and a very nice side product was recognition that I had capabilities in computation, resulting in more opportunities - my favourite being becoming a DataCamp instructor on Network Analysis!

A key here is to create materials that are accessible. Academic conferences likely won't cut it for accessibility - they're often not recorded, and not published to the web, meaning people can't find them. On the other hand, blog posts are publicly accessible, as are PyCon/SciPy/JupyterCon/PyData videos. Another key is to produce novel material - simple rehashes aren't enough; they have to bring value to someone else. Your materials only count if people can find them and if they expand someone's knowledge.

A few other data scientists, I think, will concur very strongly with this point; Brandon Rohrer has an excellent blog post on this.

Strategy 2: Talk with people inside and adjacent to industries that you're interested in.

The importance of learning from other people cannot be overstated. If you're releasing novel and accessible material, then you'll find this one to be much easier, as your credibility w.r.t. technical mastery will already be there - you'll have opportunities to bring value to industry insiders, and you can take that opportunity to get inside information on the kinds of problems that are being solved there. That can really help you strategize the kinds of new material that you make, which feeds back into a positive cycle.

Talking with people in adjacent industries and beyond is also very important. I think none put it better than Francois Chollet in his tweet:

The main thing here is to have a breadth of ideas to draw on for inspiration when solving your own problem at hand. I had a first-hand taste of it when trying to solve the drug resistance problem (see above) - which turned out to be my introduction into the deep learning world proper!

Strategy 3: Learn Python

Yes, I put this as a strategy rather than as a topic, mainly because programming languages are kind of arbitrary, and as such are less about whether a language is superior to others and more about whether you can get stuff done with that language.

I suggest Python only because I've tasted for myself the triumphant feeling of being able to do all of the following:

in one language. That's right - one language! (Sprinkling in a bit of HTML/CSS/JS in deployment, and bash in environment setup, of course.)

There are very few languages with the flexibility of Python, and having a team converse in one language simply reduces that little bit of friction that comes from reading another language. There's a ton of productivity gains to be had! It's not the fastest, it's not the most elegant, but over the years, it's adopted the right ideas and built a large community of developers; as such, many people have built on it and used it to solve all manner of problems they're facing - heck, I even found a package that converts between traditional and simplified Chinese!

It takes time to learn the language well enough to write good code with it, and for learning Python, nothing beats actually building a project with it - I hope this idea of "building stuff" is now something ingrained in you after reading this post!

Strategy 4: Find a community of people

When it comes to building a professional network and making friends, nothing beats going through a shared experience of thick & thin together with other people. Data science, being a really new thing, is a growing community of people, and being plugged into the community is going to be important for learning new things.

The Insight Summer 2017 class did this - we formed a closely-knit community of aspiring data scientists, cheered each other on, and coached each other on topics that were of interest. I know that this shared experience with other Insighters will give us a professional network that we can tap into in the future!


Conclusions

Alrighty, to conclude, here are the topics and strategies outlined above.

Topics to learn:

  1. Must-have: Statistical learning & statistical thinking
  2. Good-to-have: Software engineering
  3. Good-to-have: Business case knowledge
  4. Dependency, Optional: CS Fundamentals

Strategies:

  1. Proven: Make novel and useful materials and freely release them - teaching materials & projects!
  2. Very Useful: Learn from industry insiders.
  3. Very Useful: Learn Python.
  4. Don't Forget: Build community.

All-in-all, I think it boils down to the fundamentals of living in a society: it's still about creating real value for others, and receiving commensurate recognition (not always money, by the way) for what you've delivered. Tips and tricks can sometimes get us ahead by a bit, but the fundamentals matter the most.

For aspiring data scientists, some parting words: build useful stuff, learn new things, demonstrate that you can deliver value using data analytics and work with others using the same tools, and good luck on your job hunt!


Reading & Writing Docs: The Overlooked Programming Skill?

written by Eric J. Ma on 2017-08-24

I recently read a blog article by DataCamp's CTO (Dieter) on how to scale their projects and their engineering team - it's a great read! In the article, Dieter states that the only way to scale an engineering team is to have well-written docs. I can see the benefits to doing it this way - we minimize the number of channels that any coder needs to use to find out information; the docs should be the place where the intent and technical detail of the code are simultaneously documented alongside usage examples.

Thus, in the final weeks leading up to starting my new job at Novartis as a Data Scientist, I decided to make the practice of writing, reading and publishing docs as ingrained as muscle memory. I can already envision cases where, while conducting and building analyses, I end up writing a bunch of generally-useful functions that should be documented as well. What I write may eventually need to be used by someone else, including my future self; keeping track of how exactly a function is intended to be used is going to be very useful.

I think reading and writing docs is an overlooked skill in programming. It's probably because this isn't a test of "creative capacity" (i.e. can you build something new), which is the "sexy" thing. It's more a test of "maintenance capacity" - and this is given less value and importance in the coding world. But it's incredibly important - many basic problems can be solved by reading the docs... but also, so many problems can be avoided by writing really good docs! The onus is on both parties - package maintainers and developers - to write and read good docs.

But writing good docs is a tough job! I absolutely agree with this. There are different styles through which developers read docs - some prefer examples, while others just want to see function definitions - and it's very difficult to cater to every style. I personally think starting off with the style one's most comfortable with, and then gradually accepting community contributions, is the right way to go.
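As a small illustration of catering to both reading styles at once, here's a sketch of a docstring that carries a function definition section and runnable examples. The function `clip` is a made-up example (it is not part of nxviz or any package mentioned here), and the NumPy-style section headings are just one common convention.

```python
import doctest


def clip(value, lower, upper):
    """Limit ``value`` to the closed interval [lower, upper].

    Hypothetical function, used only to illustrate docstring layout.

    Parameters
    ----------
    value : float
        The number to clip.
    lower, upper : float
        Bounds of the interval; ``lower`` must not exceed ``upper``.

    Returns
    -------
    float
        ``lower`` if ``value < lower``, ``upper`` if ``value > upper``,
        otherwise ``value`` unchanged.

    Examples
    --------
    >>> clip(5, 0, 10)
    5
    >>> clip(-3, 0, 10)
    0
    >>> clip(42, 0, 10)
    10
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return max(lower, min(value, upper))


# Run the docstring examples as tests, so the docs can't silently go stale.
test = doctest.DocTestParser().get_doctest(
    clip.__doc__, {"clip": clip}, "clip", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
runner.run(test)
results = runner.summarize()
print(results.failed, results.attempted)  # 0 3
```

In a real module you'd more likely just call `doctest.testmod()`; the explicit parser and runner here only keep the sketch self-contained. Readers who want the signature get the Parameters and Returns sections, and readers who want examples get ones that are checked by the machine.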

One package that I maintain, nxviz, used to not have any docs written apart from the single README file. Thanks to my friend Remi Picone, I was able to learn how to configure Sphinx to get my docs working through copying his example repository. Through that, I configured Sphinx to build docs on my nxviz project - and finally got it going! You can find it on RTFD.

Learning this was really fun - looking forward to putting up more docs!


Next Steps

written by Eric J. Ma on 2017-08-10

Signed and done! I will be joining the Novartis Institutes for Biomedical Research (NIBR) in September, as part of the Scientific Data Analysis (SDA) team under Novartis Informatics (NX).

NIBR is the research arm of Novartis, and the SDA team is essentially a "Data Special Ops" team inside NIBR. The nature of the position involves both internal consulting and the development of new initiatives across teams.

The role I'm being hired into is in statistical learning, which is a general direction I've been moving towards during my time in grad school. I picked up and implemented a number of useful and interesting deep learning algorithms back then, and over the past half a year, have finally gotten under the hood of graph & image convolutions, variational autoencoders and Gaussian processes. It's really fun stuff, at its core, and to me, it's even more fun translating biological and chemical data problems into that language, and back.

After a summer learning lots and networking with industry professionals and fellow Fellows at Insight, I'm ready for a bit more structure in my life. Looking forward to starting there!