Data Engineering

Introduction to Spark - Part 3: Installing Jupyter Notebook Kernels

Posted on 19th February 2018 (updated 16th December 2019) by Nigel Meakins

This is the third post in a series on Introduction To Spark.

Introduction

There are a large number of kernels that will run within Jupyter Notebooks, as listed here.

I’ll take you through installing and configuring a few of the more commonly used ones, as listed below:

Python3
PySpark
Scala
Apache Toree (Scala)

Kernel Configuration

Each kernel has its own kernel.json file, containing the required configuration settings. Jupyter will use this when loading the kernels registered in the environment. These are created in a variety of locations, depending on the kernel installation specifics. The file must be named kernel.json, and located within a folder that matches the kernel name.
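As a rough illustration of that convention, the Python sketch below loads a kernel.json file from a kernel folder and takes the kernel name from the folder name. The function name load_kernel_spec and the throwaway "demo" kernel are hypothetical, and the check for argv, display_name and language only mirrors the keys used in this article's examples; it is not Jupyter's own validation.

```python
import json
import tempfile
from pathlib import Path


def load_kernel_spec(kernel_dir):
    """Return (kernel_name, spec) for a folder that holds a kernel.json file."""
    kernel_dir = Path(kernel_dir)
    spec_file = kernel_dir / "kernel.json"  # the file name is fixed
    if not spec_file.is_file():
        raise FileNotFoundError(f"No kernel.json in {kernel_dir}")
    spec = json.loads(spec_file.read_text(encoding="utf-8"))
    # Keys seen in this article's examples; not an exhaustive validation.
    missing = [key for key in ("argv", "display_name", "language") if key not in spec]
    if missing:
        raise ValueError(f"kernel.json is missing keys: {missing}")
    # The folder name is what identifies the kernel.
    return kernel_dir.name, spec


# Demo against a temporary folder (hypothetical kernel called "demo").
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "demo"
    demo_dir.mkdir()
    demo_spec = {
        "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "display_name": "Demo",
        "language": "python",
    }
    (demo_dir / "kernel.json").write_text(json.dumps(demo_spec), encoding="utf-8")
    kernel_name, kernel_spec = load_kernel_spec(demo_dir)
    print(kernel_name, "->", kernel_spec["display_name"])
```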

Kernel Locations

There are various locations for the installed kernels. For those included in this article the locations below have been identified:

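<UserProfile>\AppData\Roaming\jupyter\kernels
<AnacondaRoot>\envs\<EnvironmentName>\share\jupyter\kernels
<ProgramData>\jupyter\kernels

Here <UserProfile> is as per the environment variable %UserProfile%, <ProgramData> is as per %ProgramData%, and <AnacondaRoot> is the installation root directory for Anaconda, assuming you are using this for your Python installation. <EnvironmentName> is the name of the Python environment in question.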
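To make the three locations concrete, here is a small Python sketch that assembles them as Windows paths. It assumes the locations are the Roaming profile folder, the per-environment share folder under the Anaconda root, and the ProgramData folder; the function name candidate_kernel_dirs and all of the sample values are hypothetical.

```python
from pathlib import PureWindowsPath


def candidate_kernel_dirs(user_profile, program_data, anaconda_root, env_name):
    """Build the three kernel folders listed in this section as Windows paths."""
    return [
        PureWindowsPath(user_profile, "AppData", "Roaming", "jupyter", "kernels"),
        PureWindowsPath(anaconda_root, "envs", env_name, "share", "jupyter", "kernels"),
        PureWindowsPath(program_data, "jupyter", "kernels"),
    ]


# Hypothetical values; on a real machine the first two would come from
# os.environ["UserProfile"] and os.environ["ProgramData"].
dirs = candidate_kernel_dirs(
    user_profile=r"C:\Users\me",
    program_data=r"C:\ProgramData",
    anaconda_root=r"C:\Anaconda3",
    env_name="Python36",
)
for d in dirs:
    print(d)
```

PureWindowsPath is used so that the sketch produces backslash-separated paths even when it is run on a non-Windows machine.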

Listing Jupyter Kernels

You can see what kernels are currently installed by issuing the following:

jupyter kernelspec list
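If you would rather inspect the list from Python, the same command accepts a --json flag. The sketch below parses that output into a simple name-to-folder mapping. The helper names are hypothetical, and the sample text is hand-written in the shape the command emits, with made-up paths, so that the parsing can be tried without Jupyter installed.

```python
import json
import subprocess


def parse_kernelspec_json(text):
    """Map each kernel name to its resource directory."""
    data = json.loads(text)
    return {name: info["resource_dir"] for name, info in data["kernelspecs"].items()}


def list_kernels():
    """Run the CLI and parse its output; requires jupyter on the PATH."""
    result = subprocess.run(
        ["jupyter", "kernelspec", "list", "--json"],
        capture_output=True, text=True, check=True,
    )
    return parse_kernelspec_json(result.stdout)


# Hand-written sample (paths are made up); list_kernels() is not called here.
sample = json.dumps({
    "kernelspecs": {
        "python3": {
            "resource_dir": "C:\\demo\\kernels\\python3",
            "spec": {"display_name": "Python 3", "language": "python"},
        },
    }
})
kernels = parse_kernelspec_json(sample)
print(kernels)
```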

Installation

Python3

This comes ‘out of the box’ with the Python 3 environment, so it should require no setup. You’ll find the configuration file at envs\Python36\share\jupyter\kernels\Python3 under your Anaconda installation root. The configuration contains little more than the location of the python.exe file, some flags, and the Jupyter display name and language to use. The kernel is only available within the Python environment in which it is installed, so you will need to switch to that environment before starting Jupyter notebooks, using ‘activate <environment name>’ from a conda prompt.

PySpark

This requires a little more effort than the Python 3 kernel. You will need to create a PySpark directory in the kernels location for your Python environment, i.e. envs\<environment name>\share\jupyter\kernels\PySpark under your Anaconda installation root.

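Within this directory, create a kernel.json file with the following data, replacing each angle-bracket placeholder with the matching value for your machine (your Anaconda installation root, the name of your Python environment, and your Spark installation folder):

{
  "display_name": "PySpark",
  "language": "python",
  "argv": [
    "<AnacondaRoot>\\envs\\<EnvironmentName>\\python.exe",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "<SparkHome>",
    "PYSPARK_PYTHON": "<AnacondaRoot>\\envs\\<EnvironmentName>\\python.exe",
    "PYTHONPATH": "<SparkHome>\\python;<SparkHome>\\python\\pyspark;<SparkHome>\\python\\lib\\py4j-0.10.4-src.zip;<SparkHome>\\python\\lib\\pyspark.zip",
    "PYTHONSTARTUP": "<SparkHome>\\python\\pyspark\\shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master local[*] pyspark-shell"
  }
}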

All Windows paths will of course use backslashes, which must be escaped using a backslash, hence the ‘\\’. You need to include the paths to the zip archives for py4j and pyspark in order to have full kernel functionality. In addition to the basic Python pointers we saw in the Python 3 configuration, we have set a number of Windows environment variables for the lifetime of the kernel. These could of course be set ‘globally’ within the machine settings (see here for details on setting these variables), but this is not necessary and I have avoided it to reduce clutter.
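One way to avoid escaping the backslashes by hand is to build the configuration as a Python dictionary and let the json module write it out. This is only a sketch: the Anaconda and Spark locations below are hypothetical stand-ins for your own folders, and the resulting text would still need to be saved as kernel.json in the PySpark kernel directory.

```python
import json
from pathlib import PureWindowsPath

# Hypothetical locations; substitute your own environment and Spark folders.
python_exe = PureWindowsPath(r"C:\Anaconda3\envs\Python36\python.exe")
spark_home = PureWindowsPath(r"C:\Spark")

spec = {
    "display_name": "PySpark",
    "language": "python",
    "argv": [str(python_exe), "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    "env": {
        "SPARK_HOME": str(spark_home),
        "PYSPARK_PYTHON": str(python_exe),
        # Windows separates PYTHONPATH entries with semicolons.
        "PYTHONPATH": ";".join(str(p) for p in (
            spark_home / "python",
            spark_home / "python" / "pyspark",
            spark_home / "python" / "lib" / "py4j-0.10.4-src.zip",
            spark_home / "python" / "lib" / "pyspark.zip",
        )),
        "PYTHONSTARTUP": str(spark_home / "python" / "pyspark" / "shell.py"),
        "PYSPARK_SUBMIT_ARGS": "--master local[*] pyspark-shell",
    },
}

# json.dumps writes each single backslash in a path as an escaped pair.
kernel_json = json.dumps(spec, indent=2)
print(kernel_json)
```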

The PYSPARK_SUBMIT_ARGS parameter will vary based on how you are using your Spark environment. Above I am using a local install with all cores available (local[*]).
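As a small illustration of how that value changes, the hypothetical helper below builds the string for a given Spark master. Only the local[*] form comes from the configuration in this article; local[2] is shown as an example of another standard Spark master setting.

```python
def pyspark_submit_args(master="local[*]"):
    """Build a PYSPARK_SUBMIT_ARGS value for the given Spark master."""
    return f"--master {master} pyspark-shell"


print(pyspark_submit_args())            # all local cores, as used in this article
print(pyspark_submit_args("local[2]"))  # two local worker threads
```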

In order to use the kernel within Jupyter you must then ‘install’ it into Jupyter, using the following:

jupyter kernelspec install <AnacondaRoot>\envs\<EnvironmentName>\share\jupyter\kernels\PySpark

Jupyter-Scala

This can be downloaded from here. Unzip the archive and run the jupyter-scala.ps1 script on Windows, using elevated permissions, in order to install.

The kernel files will end up in %UserProfile%\AppData\Roaming\jupyter\kernels\scala-develop and the kernel will appear in Jupyter with the default name of ‘Scala (develop)’. You can of course change this in the respective kernel.json file.

Apache Toree

This allows the use of the Scala, Python and R languages (you will only see Scala listed after install, but apparently it can also process Python and R), and is currently at incubator status within the Apache Software Foundation. The package can be downloaded from Apache here. To install it, however, just use pip install with the required tarball archive URL and then jupyter toree install, as below (from an elevated command prompt):

pip install http://apache.mirror.anlx.net/incubator/toree/0.1.0-incubating/toree-pip/apache-toree-0.1.0.tar.gz

jupyter toree install

This will install the kernel to %ProgramData%\jupyter\kernels\apache_toree_scala.

You should now see your kernels listed when running Jupyter from the respective Python environment. Select the ‘New’ dropdown to create a new notebook, and select your kernel of choice.

Coming Soon…

In part 4 of this series we’ll take a quick look at the Azure HDInsight Spark offering.
