For developers with a Microsoft .NET background who want to get familiar with building Spark applications in the Scala programming language, this blog post series walks through installing the development tools, building a simple Spark application, and submitting it to an HDInsight Spark cluster.
My HDInsight configuration is Spark 2.0 (HDI 3.5) with Azure Data Lake Store as the primary storage.
The articles I used to understand the installation and setup:
A point of confusion for me was that HDInsight Tools for IntelliJ has been deprecated; it has been rolled into the Azure Toolkit for IntelliJ. Online articles referring to HDInsight Tools are still valid, but read them with the Azure Toolkit in mind.
- JDK 8: jdk-8u131-windows-x64.exe
- Azure Toolkit for IntelliJ: easiest to install through IntelliJ as shown in the steps below.
- Spark SDK: https://github.com/MSBigDataTeam/AzureHDInsightIntelliJTools/releases/download/v1.0.0/spark-assembly-2.0.0-hadoop2.7.0-SNAPSHOT.jar
- Scala SDK: can be downloaded through IntelliJ as shown in the steps below.
I’m working on a Windows Server 2012 R2 VM.
Create New Project
Before creating a project, we need to install the Azure Toolkit so that we can use the Spark on HDInsight project template.
Select the Spark on HDInsight (Scala) project template. As a side note, an alternative approach is to use Maven, but I found this approach generally easier.
Enter Project Name
For Project SDK, select Java 1.8 by browsing to C:\Program Files\Java\jdk1.8.0_131
For Scala SDK, download and select the latest version. (In this screenshot I had previously downloaded it.)
For Spark SDK, browse to where you downloaded spark-assembly-2.0.0-hadoop2.7.0-SNAPSHOT.jar
Note: For a Spark 2.0 cluster, you need Java 1.8 and Scala 2.11 or later.
Go to File > Project Structure and set the Project Language Level to 7.
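If you take the Maven route mentioned earlier instead of the project template, the Spark dependency in your pom.xml would look roughly like this. This is a sketch based on the Spark 2.0 / Scala 2.11 requirement above, not taken from the walkthrough:

```xml
<!-- Spark core for Scala 2.11, matching the Spark 2.0 (HDI 3.5) cluster.
     Scope "provided" because the HDInsight cluster supplies Spark at runtime,
     so the jar should not be bundled into your application. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.0.0</version>
  <scope>provided</scope>
</dependency>
```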
To write code and submit it to HDInsight, see my next blog post, Building a Spark Application for HDInsight using IntelliJ, Part 2 of 2.
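To give a flavor of what the next post builds toward, a minimal Spark 2.0 Scala application might look like the sketch below. The object name, the sample logic, and the placeholder Data Lake Store path are all my own illustrations, not from this walkthrough:

```scala
import org.apache.spark.sql.SparkSession

// Minimal Spark 2.0 application: counts lines containing "error" in a log file.
// The input path is a placeholder; on an HDInsight cluster with Azure Data Lake
// Store as primary storage it would be an adl:// URI.
object SimpleApp {
  def main(args: Array[String]): Unit = {
    // On the cluster, master and other settings come from spark-submit,
    // so the builder only needs an application name here.
    val spark = SparkSession.builder()
      .appName("SimpleApp")
      .getOrCreate()

    val lines = spark.read.textFile("adl://<your-store>.azuredatalakestore.net/sample.log")
    val errorCount = lines.filter(_.toLowerCase.contains("error")).count()
    println(s"Lines containing 'error': $errorCount")

    spark.stop()
  }
}
```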