Create a PySpark Playground in Google Colab in 3 Simple Steps

1 minute read

PySpark provides several convenient functions to interact with the Spark environment, but setting up PySpark from scratch can be fairly hectic and cumbersome. There are PySpark Jupyter notebooks and Docker images for installing and running PySpark locally, but it is often better to use cloud machines with GPUs or even TPUs, which eventually let you see the real processing power of Spark. One such environment is Google Colab, where you can create as many notebooks as you want for FREE! This is what Google says about Colab:

Colab notebooks execute code on Google’s cloud servers, meaning you can leverage the power of Google hardware, including GPUs and TPUs.

You can check out more details here: Introduction to Google Colab

Now, let’s see how we can easily create a simple PySpark playground on Google’s servers in 3 simple steps.

Step 1: Install Dependencies

We need to install the following components to run PySpark seamlessly:

  • OpenJDK 8
  • Spark Environment
  • FindSpark package

Using the commands below, we will install Spark 2.4.5. You might need to change the version depending on which one you want to install; you can refer to the Spark downloads page to get the appropriate URL for your version: Spark Download Page

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.mirror.amaze.com.au/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark
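
As a quick sanity check before moving on, you can confirm that Java is available and that Spark was extracted where we expect it (the path below assumes Spark 2.4.5 extracted under Colab's default /content directory):

# Check the installed Java version and list the extracted Spark directory
!java -version
!ls /content/spark-2.4.5-bin-hadoop2.7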

Step 2: Add environment variables

After installing the dependencies, we need to add a couple of variables to the environment so that PySpark knows where to find them. We can do that using the following commands:

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

Step 3: Initialize pyspark

Finally, we just need to initialize PySpark, which can be easily done using the third-party package findspark, as shown below:

import findspark
findspark.init()
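
findspark.init() picks up the SPARK_HOME variable we set above and makes the pyspark package importable. If you prefer not to rely on the environment variable, findspark also accepts the Spark installation path directly (shown here with the Spark 2.4.5 path from Step 1):

import findspark
# Pass the Spark installation path explicitly instead of relying on SPARK_HOME
findspark.init("/content/spark-2.4.5-bin-hadoop2.7")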

And, IT’S DONE! Your PySpark playground is now ready, and you can interact with the Spark environment using Python commands.

You can try running the following commands to check whether PySpark is properly installed:

import pyspark
sc = pyspark.SparkContext(appName="yourAppName")

If you are able to get a Spark context, then you are good to go!!
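
To go one step further, you can run a tiny job to confirm that Spark is actually executing work (a minimal sketch using the sc created in the snippet above):

# Distribute a small range of numbers and sum them across the executors
rdd = sc.parallelize(range(100))
print(rdd.sum())  # 0 + 1 + ... + 99 = 4950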

Cheers!

Leave a comment