Introduction to Spark - Part 3: Installing Jupyter Notebook Kernels

This is the third post in a series on Introduction To Spark.

Introduction

There are a large number of kernels that will run within Jupyter Notebooks, as listed here.

I’ll take you through installing and configuring a few of the more commonly used ones, as listed below:

  • Python3
  • PySpark
  • Scala
  • Apache Toree (Scala)

Kernel Configuration

Each kernel has its own kernel.json file, containing the required configuration settings. Jupyter will use this when loading the kernels registered in the environment. These are created in a variety of locations, depending on the kernel installation specifics. The file must be named kernel.json, and located within a folder that matches the kernel name.

Kernel Locations

There are various locations for the installed kernels. For those included in this article, the locations below have been identified:

  • <UserProfile>\AppData\Roaming\jupyter\kernels
  • <Anaconda root>\envs\<environment name>\share\jupyter\kernels
  • <ProgramData>\jupyter\kernels

Where <UserProfile> will be as per the environment variable %UserProfile%, <ProgramData> will be as per %ProgramData%, and <Anaconda root> is the installation root directory for Anaconda, assuming you are using this for your Python installation.

Listing Jupyter Kernels

You can see what kernels are currently installed by issuing the following:

jupyter kernelspec list
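
On a typical Windows Anaconda setup the output looks something like the following (the kernel names and paths are illustrative and will vary with your installation):

Available kernels:
  python3    C:\ProgramData\Anaconda3\envs\Python36\share\jupyter\kernels\python3
  pyspark    C:\ProgramData\Anaconda3\envs\Python36\share\jupyter\kernels\PySpark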

Installation

Python3

This comes ‘out of the box’ with the Python 3 environment, so should require no actual setup in order to use. You’ll find the configuration file at <Anaconda root>\envs\Python36\share\jupyter\kernels\python3. The configuration contains little else other than the location of the python.exe file, some flags, and the Jupyter display name and language to use. It will only be available within the Python environment in which it is installed, so you will need to change to that Python environment prior to starting Jupyter Notebooks, using ‘activate <environment name>’ from a conda prompt.
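
For reference, the python3 kernel.json looks something like this (the python.exe path is an example and depends on where your environment lives):

{
  "display_name": "Python 3",
  "language": "python",
  "argv": [
    "C:\\ProgramData\\Anaconda3\\envs\\Python36\\python.exe",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ]
}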

PySpark

This requires a little more effort than the Python 3 kernel. You will need to create a PySpark directory in the required location for your Python environment, i.e. <Anaconda root>\envs\<environment name>\share\jupyter\kernels\PySpark.

Within this directory, create a kernel.json file, with the following data:

{
  "display_name": "PySpark",
  "language": "python",
  "argv": [
    "<Anaconda root>\\envs\\<environment name>\\python.exe",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "<Spark home directory>",
    "PYSPARK_PYTHON": "<Anaconda root>\\envs\\<environment name>\\python.exe",
    "PYTHONPATH": "<Spark home directory>\\python;<Spark home directory>\\python\\pyspark;<Spark home directory>\\python\\lib\\py4j-0.10.4-src.zip;<Spark home directory>\\python\\lib\\pyspark.zip",
    "PYTHONSTARTUP": "<Spark home directory>\\python\\pyspark\\shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master local[*] pyspark-shell"
  }
}

All Windows paths will of course use backslashes, which must be escaped with a further backslash, hence the ‘\\’. You need to include paths to the zip archives for py4j and pyspark in order to have full kernel functionality. In addition to the basic Python pointers we saw in the Python 3 configuration, we have set a number of Windows environment variables for the lifetime of the kernel. These could of course be set ‘globally’ within the machine settings (see here for details on setting these variables), but this is not necessary and I have avoided it to reduce clutter.

The PYSPARK_SUBMIT_ARGS parameter will vary based on how you are using your Spark environment. Above I am using a local install with all cores available (local[*]).
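
If you were connecting to a standalone cluster instead, the entry might look something like this (the host and port are placeholders for your cluster master):

"PYSPARK_SUBMIT_ARGS": "--master spark://<host>:7077 pyspark-shell"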

In order to use the kernel within Jupyter you must then register (‘install’) it, using the following:

jupyter kernelspec install <Anaconda root>\envs\<environment name>\share\jupyter\kernels\PySpark
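
Once registered, a quick sanity check is to create a notebook with the PySpark kernel and run a trivial job. Because PYTHONSTARTUP runs shell.py, the SparkContext is already available as sc. A minimal sketch:

# sc is created by pyspark/shell.py when the kernel starts
rdd = sc.parallelize(range(100))
print(rdd.sum())  # should print 4950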

Jupyter-Scala

This can be downloaded from here. Unzip and run the jupyter-scala.ps1 script on Windows, using elevated permissions, in order to install.

The kernel files will end up in <UserProfile>\AppData\Roaming\jupyter\kernels\scala-develop and the kernel will appear in Jupyter with the default name of ‘Scala (develop)’. You can of course change this in the respective kernel.json file.
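
For example, to rename it you would just edit the display_name entry in that kernel.json:

"display_name": "Scala"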

Apache Toree

This allows the use of the Scala, Python and R languages (you will only see Scala listed after install, but apparently it can also process Python and R), and is currently at incubator status within the Apache Software Foundation. The package can be downloaded from Apache here; however, to install, just use pip install with the required tarball archive URL and then jupyter toree install, as below (from an elevated command prompt):

pip install http://apache.mirror.anlx.net/incubator/toree/0.1.0-incubating/toree-pip/apache-toree-0.1.0.tar.gz

jupyter toree install
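
If the installer cannot find your Spark installation automatically, it also accepts a --spark_home option pointing at the Spark root, along the lines of the following (the path is a placeholder):

jupyter toree install --spark_home=<Spark home directory>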

This will install the kernel to <ProgramData>\jupyter\kernels\apache_toree_scala.

You should now see your kernels listed when running Jupyter from the respective Python environment. Select the ‘New’ dropdown to create a new notebook, and select your kernel of choice.


Coming Soon…

In part 4 of this series we’ll take a quick look at the Azure HDInsight Spark offering.