I recently became addicted to Kaggle competitions. To fuel this addiction, I needed much more processing power for preprocessing and deep learning. Loading a small dataset is not a problem on my 8GB MacBook, but once you start dealing with millions of rows, memory errors become inevitable… Maximizing the ability to experiment with data means having a reliable environment with ample computing power. I see two possible solutions: (1) drop a few grand on a high-performance PC, or (2) spend about $1 per hour for a Google Compute Engine high-memory instance.

Google Compute Engine for Kaggle Step-by-Step

Google Compute Engine (GCE) is directly comparable to Amazon Web Services (AWS). I have used both in my professional work, but am especially impressed by the recently revamped GCE interface. Setting up an environment on GCE provides access to virtually unlimited computing power from any device.

In this post, I will explain how to…

  1. Set up a Compute Engine instance with data science libraries.
  2. Access the instance over HTTP to run a Jupyter Notebook in a web browser.
  3. Save the disk image to replicate this environment on any virtual machine.

Prerequisites

Before starting, you will need a Google Cloud account with a billing-enabled project. You will also need the gcloud command-line tool (part of the Google Cloud SDK) installed on your local system.
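If the SDK is already installed, a quick sanity check confirms it is authenticated and pointed at the right project. This is a sketch — replace `<project-name>` with your own billing-enabled project ID:

```shell
# Confirm the SDK is installed and you are authenticated
gcloud --version
gcloud auth login

# Set the default project so later commands don't need a --project flag
gcloud config set project <project-name>
```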

Set Up a Compute Engine Instance with Python Data Science Tools

The first step is to create a virtual instance with the necessary Python libraries, such as Jupyter, Pandas, scikit-learn, etc. Once this instance is created, the disk image can be saved and reused on any instance of any size.

[Screenshot of Google Compute new instance settings]

  • Check -> Allow HTTP
  • Check -> Allow HTTPS
  • Uncheck -> Disks - Delete boot disk when instance is deleted
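The same instance can also be created from the command line. Here is a sketch using `gcloud`, assuming the names and machine type below are placeholders you adjust to your needs (the `http-server`/`https-server` tags correspond to the Allow HTTP/HTTPS checkboxes, and `--no-boot-disk-auto-delete` matches the unchecked disk option):

```shell
gcloud compute instances create <instance-name> \
    --zone <your-zone> \
    --machine-type n1-highmem-2 \
    --tags http-server,https-server \
    --no-boot-disk-auto-delete
```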

Installing Tools

With the instance created, the gcloud SDK makes it possible to connect via SSH from a local computer.

gcloud compute --project "<project-name>" ssh --zone "<your-zone>" "<instance-name>"

This command should connect to the newly created instance, so now it’s a matter of installing the desired tools. The following steps are based on the default Debian Linux distro on GCE, but will work for Ubuntu as well.

I have found Miniconda to be the most convenient package manager for installing Python libraries in the cloud.

sudo apt-get install bzip2
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh
source ~/.bashrc  # reload PATH so the conda command is available

That should take care of Miniconda. Now the conda command is available to install popular data analysis packages.

conda install scikit-learn
conda install pandas
conda install jupyter

Connecting via HTTP

With all the libraries installed, it’s time to access the Jupyter Notebook over the web. If you try to browse to the instance’s external IP address, you will get a “This site can’t be reached” error in the browser. The root of the problem is that the default ephemeral external IP address needs to be promoted to a static external IP.

In the GCE Console, navigate to Networking –> External IP addresses and promote the ephemeral IP to a static IP.
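One caveat: the “Allow HTTP” checkbox only opens port 80, while Jupyter listens on port 8888 by default. If the notebook is still unreachable after promoting the IP, you may also need a firewall rule for that port. A sketch, assuming the default network and a rule name of my choosing:

```shell
gcloud compute firewall-rules create allow-jupyter \
    --allow tcp:8888 \
    --description "Allow Jupyter Notebook traffic on port 8888"
```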

[Screenshot of the External IP addresses page]

Tip - After you shut down your virtual instance, make sure to go back to this screen and release the static IP. Google will charge a small fee for unused static IPs on your account.

Security Note - Accessing your instance openly over HTTP is not secure. It’s a good idea to password protect your notebook to prevent unauthorized access, or configure SSH tunneling.
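If you prefer the SSH tunneling route, `gcloud compute ssh` can pass flags through to ssh and forward the notebook port to your local machine. A sketch, assuming Jupyter’s default port of 8888 on both ends:

```shell
# Forward local port 8888 to port 8888 on the instance
gcloud compute ssh --zone "<your-zone>" "<instance-name>" -- -L 8888:localhost:8888
```

With the tunnel open, the notebook is available at http://localhost:8888 without exposing it to the public internet.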

Once the IP is promoted, you can launch the Jupyter Notebook from the instance’s SSH command line.

jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser &

Navigate to http://<your-static-ip>:8888

[Screenshot of the Jupyter Notebook running in the browser]

Saving the Disk Image

I only want to pay for an instance when I need it, which is usually just a few hours. Saving a disk image makes it possible to replicate the environment on any newly created instance. For one situation you might need a shared-core machine with 0.6GB of memory, but for another you might need 16 CPU cores and 108GB of memory. The disk image is just a template that avoids the tedious job of installing and configuring the environment on each new instance.

  • Verify the “Delete boot disk when instance is deleted” option is unchecked on the current instance. This will save the disk in its current state after being shut down.
  • Delete the running instance.
  • Navigate to Compute Engine –> Images and create a new image, using the disk that was automatically saved from the deleted instance.
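The image-creation step can also be done from the command line. A sketch, assuming the orphaned boot disk kept the deleted instance’s name, and with `datasci-image` as a hypothetical image name:

```shell
gcloud compute images create datasci-image \
    --source-disk <instance-name> \
    --source-disk-zone <your-zone>
```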

That’s about it. Now you can use this specific disk image when creating new GCE instances. Your entire environment can be scaled to any size with a single click. Pretty awesome.
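For example, spinning up a high-memory machine from the saved image might look like this — a sketch, where `datasci-image` is whatever name you gave the image above and `n1-highmem-16` is just one machine-type option:

```shell
gcloud compute instances create <instance-name> \
    --zone <your-zone> \
    --machine-type n1-highmem-16 \
    --image datasci-image \
    --tags http-server,https-server
```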