Create HDInsight Spark Cluster with Azure Data Lake Store

The Spark cluster is one of several cluster types offered through the HDInsight platform as a service. Its distinguishing capability is in-memory processing, which gives it an overall performance advantage over the Hadoop cluster type and makes it well suited for building big data analytics applications.

For a further overview, read https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-overview

I will walk through and comment on a series of steps to create a Spark cluster with Azure Data Lake Store as the primary storage, as opposed to Azure Blob Storage. For a comparison, read https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-comparison-with-blob-storage

Create HDInsight Spark Cluster with Azure Data Lake Store-1

  1. In the Azure Portal, create an HDInsight cluster
    Fill in the obvious and standard inputs.
    For the cluster configuration, select the version labeled 'This version supports Data Lake Store as primary data storage.'
    Create HDInsight Spark Cluster with Azure Data Lake Store-2
  2. For primary storage type, select Data Lake Store
    In order for the Data Lake Store to allow the HDInsight cluster access, we must create a service principal and an Azure AD application in Azure Active Directory. This wizard helps do that. Create a certificate along with its password.
    Create HDInsight Spark Cluster with Azure Data Lake Store-3
  3. Select the Azure Data Lake Store that was already created and grant the permissions.
    Create HDInsight Spark Cluster with Azure Data Lake Store-4
  4. Click Run to assign permissions to all the files and folders.
    Create HDInsight Spark Cluster with Azure Data Lake Store-5
  5. Confirm success.
    Create HDInsight Spark Cluster with Azure Data Lake Store-6
  6. Download the certificate and take note of the password so that, in the future, you can re-provision the cluster and point it to the same Data Lake Store.
    Create HDInsight Spark Cluster with Azure Data Lake Store-7
  7. I chose to leverage an existing SQL database for the Hive metastore, so that the definitions created for Hive databases and tables are preserved when the cluster is deleted.
    Create HDInsight Spark Cluster with Azure Data Lake Store-8
    Prerequisite to Step 7:
    Before starting to create the Spark cluster, you must create a blank SQL database for each of the Hive and Oozie metastores, so that the databases appear in the drop-down selection in the previous step.
    createhdinsight1x
  8. Through this blade, you can choose the number of worker nodes and the VM size. To save on costs, I chose 2 worker nodes instead of the default 4. There are always 2 head nodes, which are used for failover and for running some services; their number cannot be changed. Note that costs can add up quickly, as you are charged even for idle time, so I strongly recommend scripting scheduled provisioning and deletion of the cluster when needed, especially for dev environments.
    Create HDInsight Spark Cluster with Azure Data Lake Store-10
  9. Confirm configuration
    createhdinsight2x
    Script actions option
    Allows for customizations to the cluster, such as adding components or making configuration changes. For details, read https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux
    Virtual Network option
    Allows connecting to other VMs and resources made available within the appointed virtual network, and adds a network security layer. For details, read https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-extend-hadoop-virtual-network
  10. Upon completed provisioning, we see the Overview blade.
    Create HDInsight Spark Cluster with Azure Data Lake Store-12
  11. Click on Dashboard along the top to open the Ambari web management site for administrators
    Create HDInsight Spark Cluster with Azure Data Lake Store-13
  12. Go to the Hive View, at a URL like https://rkhdinsight.azurehdinsight.net/#/main/views/HIVE/1.5.0/AUTO_HIVE_INSTANCE, to manage Hive databases and their data. This is analogous to SQL Server Management Studio.
    Create HDInsight Spark Cluster with Azure Data Lake Store-14
  13. For Zeppelin notebooks, used for interactive data analysis and queries, go to https://rkhdinsight.azurehdinsight.net/zeppelin
    Create HDInsight Spark Cluster with Azure Data Lake Store-15
    I feel the Spark cluster in HDInsight has many capabilities and advantages that I am still learning about, and it is a compelling option for building big data applications.


Power BI and Read Only Access to Azure Data Lake Store

Power BI Desktop has a connector to read files in Azure Data Lake Store. Azure Data Lake Store supports role-based access security and granular permissions on files and folders. Business scenarios often require tight control over which files and folders a given user can access, and with what permissions. I will demonstrate granting one specific user account read-only permissions on a subfolder hierarchy, and then verify that Power BI Desktop can still get data with these security controls in place.

Azure Data Lake Store and Access Security

  1. Go to Azure Portal > Azure Data Lake Store > Data Explorer > Click on sub folder to grant access > Click on Access
    Power BI reports with Azure Data Lake Store-1
  2. Click Add to add a user or group
  3. Type in a name of an existing Azure Active Directory user. I had one previously created.
    Power BI reports with Azure Data Lake Store-2
  4. Click Select Permissions > Select Read and This folder and all children
    Power BI reports with Azure Data Lake Store-3
  5. Upon clicking Ok, you see the confirmed user permission setting
    Power BI reports with Azure Data Lake Store-4
  6. To connect to Azure Data Lake Store from Power BI Desktop, you need the account URL. Go to the Overview blade and copy the URL.
    Power BI reports with Azure Data Lake Store-5

Power BI Desktop and Azure Data Lake Store

  1. From your computer, download and/or launch Power BI Desktop.
    https://powerbi.microsoft.com/en-us/desktop/
  2. Upon the launch screen, click on Get data
    Power BI reports with Azure Data Lake Store-6
  3. You will see a list of connectors. Filter by Azure and select Azure Data Lake Store
    Power BI reports with Azure Data Lake Store-7
  4. You will be prompted for the URL. Note that this URL targets the root of the file system hierarchy. The read permission was only set on the jobpostings subfolder, but let's test this URL anyway.
    Power BI reports with Azure Data Lake Store-8
  5. Click Sign in
    Power BI reports with Azure Data Lake Store-9
  6. You will see a browser based login page. Enter the Azure AD account username and password. In my case, it was the John Smith user account.
  7. Upon successful authentication and clicking Connect, you see that the user account is not authorized. This is expected.
    Power BI reports with Azure Data Lake Store-10
  8. Click Back and enter the URL including the subfolder
    Power BI reports with Azure Data Lake Store-11
  9. Click Sign in with the same credentials and click Connect
  10. Connection is successful due to granted read permissions.
    Power BI reports with Azure Data Lake Store-12
  11. I can load and create a query on a .tsv file. This file is the output of a U-SQL script.
  12. Here is a quick use of ESRI map visualization based on the query.
    Power BI reports with Azure Data Lake Store-13
    I have demonstrated the use case of allowing users with Azure AD accounts to build reports with read-only access to a specific subfolder in Azure Data Lake Store. This supports overall security and governance practices.


Azure Data Lake Analytics: Finding Duplicates With U-SQL Window Functions

I will demonstrate a U-SQL script I wrote to identify duplicate rows by a field using window functions. The scenario: when ingesting data from sources, you may get duplicate rows and have to de-duplicate reactively. My technique uses U-SQL window functions.

Window functions are used to do computation within sets of rows called windows. Windows are defined in the OVER clause. Window functions solve some key scenarios in a highly efficient manner.

Read Using U-SQL window functions for Azure Data Lake Analytics jobs for further details and sample code.

The following script identifies sets of rows, or "windows", that share the same jobkey unique identifier. These indicate duplicates introduced during the data ingestion process.

I will also compare and contrast this with a GROUP BY on the jobkey field to count the number of duplicates per key.

// Read from jobpostings data file
@jobPostings =
    EXTRACT jobtitle string,
        company string,
        city string,
        jobkey string
  FROM @"/jobpostings/outputtsv/v3/JobsData.tsv"
    USING Extractors.Tsv();

// Query from jobpostings data
// Set ROW_NUMBER() of each row within the window partitioned by jobkey field
@jobPostingsDuplicates =
    SELECT ROW_NUMBER() OVER(PARTITION BY jobkey) AS RowNum,
           jobkey AS JobKey
    FROM @jobPostings;

// ORDER BY jobkey to see duplicate rows next to one another
@jobPostingsDuplicates2 =
    SELECT *
    FROM @jobPostingsDuplicates
    ORDER BY JobKey
    OFFSET 0 ROWS;

// Write to file
OUTPUT @jobPostingsDuplicates2
TO "/jobpostings/outputtsv/v3/JobsOver-Dups.tsv"
USING Outputters.Tsv();

// Group By and count # of duplicates per jobkey
@groupBy = SELECT jobkey, COUNT(jobkey) AS jobKeyCount
FROM @jobPostings
GROUP BY jobkey
ORDER BY jobkey
OFFSET 0 ROWS;

// Write to file
OUTPUT @groupBy
TO "/jobpostings/outputtsv/JobPostingsGroupBy-Dups.tsv"
USING Outputters.Tsv();

Upon clicking Submit, we see the U-SQL Job execution graph
Azure Data Lake Analytics- Finding Duplicates With U-SQL Windows Functions-1

For the query using the window function, the preview of the file output identifies a record with one duplicate, marked by row number 2. The set of rows, or "window", is the partition defined by the jobkey.
Azure Data Lake Analytics- Finding Duplicates With U-SQL Windows Functions-2

For the query with the GROUP BY clause, we can compare the same duplicate seen in the previous file, with job key '00052e6ad78510f3'. The difference is that GROUP BY produces an aggregation.
Azure Data Lake Analytics- Finding Duplicates With U-SQL Windows Functions-3

To identify and remove the duplicate rows, I would use the window function to assign each row its ROW_NUMBER within the partitioned window, and then remove the duplicate rows with a filter such as:

@jobPostingsDuplicates3 =
    SELECT *
    FROM @jobPostingsDuplicates2
    WHERE RowNum == 1;

OUTPUT @jobPostingsDuplicates3
TO "/jobpostings/outputtsv/v3/JobsOver-deduped.tsv"
USING Outputters.Tsv();

If I only needed to count the occurrences of each jobkey, I would use the GROUP BY.
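
As a side note, if only the keys that actually have duplicates are of interest, a HAVING clause on the grouped rowset narrows the output to just those keys. A minimal sketch along the same lines (the output path is made up for illustration):

// List only the jobkey values that occur more than once
@duplicateKeys =
    SELECT jobkey,
           COUNT(*) AS jobKeyCount
    FROM @jobPostings
    GROUP BY jobkey
    HAVING COUNT(*) > 1;

OUTPUT @duplicateKeys
TO "/jobpostings/outputtsv/JobPostingsGroupBy-DupKeysOnly.tsv"
USING Outputters.Tsv();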

In conclusion, this is a way to manage duplicates using window functions.


Azure Data Lake Analytics: Job Execution Time and Cost

Running an intensive U-SQL job against a large number of files and records, I will show the performance diagnostics and the estimated cost.

I ran a U-SQL job against 4,272 JSON files with about 95,000 rows.
Azure Data Lake Analytics- Job Execution Time and Cost-1

1.2 hours to complete
Azure Data Lake Analytics- Job Execution Time and Cost-2

Parallelism was set to 5 AUs.
Azure Data Lake Analytics- Job Execution Time and Cost-3

An AU is an analytics unit, which is approximately a compute resource similar to 2 CPU cores and 6 GB of RAM as of Oct 2016. For details and background, read https://blogs.msdn.microsoft.com/azuredatalake/2016/10/12/understanding-adl-analytics-unit/

To see diagnostics, first, click Load Job Profile.
Azure Data Lake Analytics- Job Execution Time and Cost-4

Click on Diagnostics tab
Azure Data Lake Analytics- Job Execution Time and Cost-5

Click on Resource usage

In the AU Usage tab, the 5 allocated AUs were fully utilized throughout the 1 hr 7 min of execution.
Azure Data Lake Analytics- Job Execution Time and Cost-6

The AU Usage Modeler provides a rough estimate of the number of AUs needed for the best execution time. Here it estimates 4,272 AUs for 35.22 seconds of execution time.
Azure Data Lake Analytics- Job Execution Time and Cost-7

Interestingly, 4,272 is the same as the number of JSON files being analyzed, so I assume that to get the best time the modeler wants to allocate one AU per file. There is probably more to it than that; I am just commenting on the observation.

Going to the job submission settings, I observed that the maximum number of AUs that can be allocated is 112.
Azure Data Lake Analytics- Job Execution Time and Cost-8

Adjusting the number of AUs to model to 112, we see an estimated 197 seconds of execution time.
Azure Data Lake Analytics- Job Execution Time and Cost-9

Submitting the job again with 112 AUs of parallelism, the outcome is:
Azure Data Lake Analytics- Job Execution Time and Cost-10

AU Usage diagnostics
Azure Data Lake Analytics- Job Execution Time and Cost-11

Comparing estimated and actual time at 112 AUs: the actual run took 6.2 minutes against an estimate of 197 seconds (about 3.3 minutes), so roughly twice as long as estimated.

In the graph above, you can see roughly 65-70% usage at 112 AUs compared to almost 100% usage at 5 AUs. But at 5 AUs, the job took about 10 times longer.

For the price comparison, I couldn't find the exact cost of each job execution, but I can estimate it based on the pricing shown when you open the U-SQL editor in the Azure Portal.

At 5 AUs, the cost is $0.17 USD/minute; 72 minutes of execution costs $12.24.
Azure Data Lake Analytics- Job Execution Time and Cost-12

At 112 AUs, the cost is $3.73 USD/minute; 6.2 minutes of execution costs $23.12.
Azure Data Lake Analytics- Job Execution Time and Cost-13

AUs   Execution time   Cost/min (USD)   Est. total cost (USD)
5     67 mins          $0.17            $11.39
112   3.3 mins         $3.73            $12.30
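
As a rough cross-check, these per-minute rates line up with the pay-as-you-go price of $2 USD per AU-hour (noted in the Creating Azure Data Lake Analytics section later in this document): 5 AUs × $2 / 60 ≈ $0.17 per minute, and 112 AUs × $2 / 60 ≈ $3.73 per minute.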

So, based on my analysis, going for the maximum number of AUs seems like a good deal: it costs about a dollar more yet saves over an hour of time. I assume the cost difference varies with other factors such as data size, number of records, data structure, and the amount of number crunching.

If you have any feedback on my analysis, feel free to drop me a comment below. I hope this performance and cost analysis at least gives a ballpark idea of what to expect.


Azure Data Lake Analytics: U-SQL C# Programmability

One of the neat features of U-SQL for .NET developers is that you can write C# code for further extensibility and for applying custom logic. This is useful in extensive ETL processes where you must take raw data and clean, scrub, and transform it to meet business requirements for further analytical processing and BI reporting. The alternative languages for big data analytics, such as Python, R, Scala, and Pig, are more popular in the big data community, serve their intended use cases very well, and are favoured by various types of data developers. For a .NET developer who is just starting out, U-SQL is a good choice; after some time, one can advance to those other 'big data' languages.

U-SQL programmability guide

My U-SQL with C# code involves counting the frequency of words that appear in a text-based field. The purpose of this code is to test a hypothesis: whether the most frequent words in a text field can help determine what that field is about. For example, if a job description field has top words such as 'developer', 'javascript', and 'json', then perhaps that indicates the job is a technical one. Let's continue and explore.

The @jobpostings variable represents the data from a file extraction. Let's process it further with C# code-behind to count the occurrence of each word in the jobDescription field.

@jobpostings_Parsed =
     PROCESS @jobpostings
     PRODUCE jobtitle string,
        company string,
        city string,
        state string,
        country string,
        date string,
        url string,
        latitude float,
        longitude float,
        jobDescription string,
        salaryRate string,
        jobType string,
        hourlyRate double?,
        salary double?,
        compensationType string,
        jobDescriptionTopWords string,
        jobTitleTopWords string
     USING new JobPostings.USql.JobPostingsParser();

For the C# code-behind, I am using a user-defined processor, which allows custom logic to be applied to each row. For details, read https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide#use-user-defined-processors

Azure Data Lake Analytics- U-SQL C# Programmability - 1

using Microsoft.Analytics.Interfaces;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

namespace JobPostings.USql
{
    public class JobPostingsParser : IProcessor
    {
        // Process each row
        public override IRow Process(IRow input, IUpdatableRow output)
        {
            string jobDescription = input.Get<string>("jobDescription");

            if (!String.IsNullOrEmpty(jobDescription))
            {
                // Dictionary tallying each word and its occurrence
                Dictionary<string, int> jobDescriptionFreqWords = GetMostFrequentWords(jobDescription);
                List<KeyValuePair<string, int>> myList = jobDescriptionFreqWords.ToList();

                // Sort by descending word count
                myList.Sort(
                    delegate (KeyValuePair<string, int> pair1,
                              KeyValuePair<string, int> pair2)
                    {
                        return pair2.Value.CompareTo(pair1.Value);
                    }
                );

                // Output the three most frequent words and their counts as a text string
                // into a new field called jobDescriptionTopWords
                if (myList.Count >= 3)
                    output.Set<string>("jobDescriptionTopWords", myList[0] + ", " + myList[1] + ", " + myList[2]);
            }

            // Need to explicitly output each remaining field (here set by column name),
            // even though there is no processing on them.
            output.Set<string>("jobtitle", input.Get<string>("jobtitle"));
            output.Set<string>("company", input.Get<string>("company"));
            output.Set<string>("city", input.Get<string>("city"));
            output.Set<string>("state", input.Get<string>("state"));
            output.Set<string>("country", input.Get<string>("country"));
            output.Set<string>("date", input.Get<string>("date"));
            output.Set<string>("url", input.Get<string>("url"));
            output.Set<float>("latitude", input.Get<float>("latitude"));
            output.Set<float>("longitude", input.Get<float>("longitude"));
            output.Set<string>("jobDescription", input.Get<string>("jobDescription"));
            output.Set<string>("salaryRate", input.Get<string>("salaryRate"));
            output.Set<string>("jobType", input.Get<string>("jobType"));

            return output.AsReadOnly();
        }

        // Helper method to tally the frequency of words in a given text field
        private Dictionary<string, int> GetMostFrequentWords(string text)
        {
            Dictionary<string, int> wordCount = new Dictionary<string, int>();
            Dictionary<string, int> stopWords = new Dictionary<string, int>() {
                { "a", 0 }, { "able", 0 }, { "about", 0 }, …<omitted for brevity> …
            };

            char[] delimiters = new char[] { ' ', '\r', '\n', ',', '.', ':' };
            string[] words = text.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);

            foreach (string word in words)
            {
                string w = word.ToLower();
                if (stopWords.ContainsKey(w))
                    continue;

                if (wordCount.ContainsKey(w))
                    wordCount[w]++;
                else
                    wordCount.Add(w, 1);
            }

            return wordCount;
        }
    }
}

To summarize the logic:

  1. Read each row
    1. Get the jobDescription field value
    2. Call the custom GetMostFrequentWords method to return a dictionary of words and their frequencies
      1. GetMostFrequentWords simply keeps a tally of each word encountered in that row, ignoring a predefined static list of stop words. The list of stop words is from an arbitrary internet source.
      2. Sort the tally of frequent words in descending order
      3. Construct a text string of the 3 key/value pairs of the most frequent words
      4. Output this text string into a new field called jobDescriptionTopWords

After running the U-SQL script and previewing the output file, the following shows the jobDescriptionTopWords field with the resulting text string of the top 3 words.

Azure Data Lake Analytics- U-SQL C# Programmability - 2     Azure Data Lake Analytics- U-SQL C# Programmability - 3

So, going back to my hypothesis of whether displaying most frequent words gives an indication of the type of job, I think there is still some ambiguity. For example, the first row had most frequent words of visit, flexible, anytime. I wouldn’t be able to guess what type of job this would be. The associated job title is Merchandiser. Looking at the row with words heat, machine and sewing, I would have guessed that this job did indeed involve operating a sewing machine. My conclusion is that top word counting in a text field with many words can help, but not significantly. To be more productive, use other text analysis services to extract more meaning. In this article, I have shown the use of U-SQL C# programmability with using the user-defined processor to count the frequency of words in a row’s text field.


Azure Data Lake Analytics: How To Extract JSON Files

The ability to read files from Azure Data Lake depends on the U-SQL built-in extractors:

  • Extractors.Text() : Provides extraction from delimited text files of different encodings.
  • Extractors.Csv() : Provides extraction from comma-separated value (CSV) files of different encodings.
  • Extractors.Tsv() : Provides extraction from tab-separated value (TSV) files of different encodings.

An example from my previous blog post using the CSV Extractor:

@csvData =
    EXTRACT policyID int,
        statecode string,
        county string,
        eq_site_limit float,
        hu_site_limit float,
        fl_site_limit float,
        fr_site_limit float,
        tiv_2011 float,
        tiv_2012 float,
        eq_site_deductible float,
        hu_site_deductible float,
        fl_site_deductible float,
        fr_site_deductible float,
        point_latitude float,
        point_longitude float,
        line string,
        construction string,
        point_granularity int?
    FROM "/MyData/FL_insurance_sample.csv"
    USING Extractors.Csv(skipFirstNRows:1);

I need to work with JSON files; however, there is no built-in support for JSON, which I found surprising, especially when so much internet data comes from APIs as JSON these days. There should be an XML extractor as well.

To solve this challenge, I found a solution in the GitHub repo https://github.com/Azure/usql. It has extractors for both JSON and XML: https://github.com/Azure/usql/tree/master/Examples/DataFormats/Microsoft.Analytics.Samples.Formats
The readme has instructions on how to register the assemblies, plus sample scripts.

This is how it looks after I registered the 2 assemblies:
Azure Data Lake Analytics- How To Extract JSON Files-1

The U-SQL script:

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats]; 

DECLARE @inputfile string="/jobpostings/postings-{*}.json";

@jobPostingsSchema =
EXTRACT jobtitle string,
        company string,
        city string,
        state string,
        country string,
        date string,
        snippet string,
        url string,
        latitude float,
        longitude float
FROM @inputfile
USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor("Results[*]");

OUTPUT @jobPostingsSchema
TO "/jobpostings/JSONoutput.csv"
USING Outputters.Csv();

The REFERENCE ASSEMBLY statements make use of the assemblies registered in the JobPostingsBigData database. Ensure that the script is using this database.

The extractor reads multiple files that match the file-set expression /jobpostings/postings-{*}.json; my file names end with a date.
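
As a side note, U-SQL file sets can also surface parts of the file name as virtual columns that can be used for filtering. A minimal sketch, assuming hypothetically that the files were named like postings-2017-03-25.json (my actual naming may differ):

// Hypothetical: expose the date portion of the file name as a DateTime virtual column
DECLARE @inputfileByDate string = "/jobpostings/postings-{filedate:yyyy}-{filedate:MM}-{filedate:dd}.json";

@postingsByDate =
    EXTRACT jobtitle string,
            company string,
            filedate DateTime
    FROM @inputfileByDate
    USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor("Results[*]");

// The virtual column can then limit which files are actually read
@recentPostings =
    SELECT jobtitle, company
    FROM @postingsByDate
    WHERE filedate >= DateTime.Parse("2017-03-20");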

The JSON file structure

{
  "TotalResults": 41,
  "PageNumber": 0,
  "Results": [
    {
      "jobtitle": "Junior Web Developer",
      "company": "Company X",
      "city": "Calgary",
      "state": "AB",
      "country": "CA"
    },
    {……}]
}

I am only interested in extracting the Results collection. Therefore, I pass the JSONPath "Results[*]" as the parameter: new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor("Results[*]"). This extracts each record in that collection.
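
Separately, if a property inside each record were itself a nested JSON object, the same samples assembly also ships a JsonFunctions.JsonTuple helper that parses a JSON string into a SqlMap for further querying. A rough, untested sketch (the nested location column is hypothetical and not part of my actual schema):

// Hypothetical: the "location" property is a nested JSON object, e.g. {"city":"Calgary","state":"AB"}
@nested =
    EXTRACT jobtitle string,
            location string
    FROM @inputfile
    USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor("Results[*]");

// Parse the nested object and pull out individual values
@flattened =
    SELECT jobtitle,
           Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple(location)["city"] AS city
    FROM @nested;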

I can then take the extraction in the @jobPostingsSchema variable and write it to a flat file or to a table. In this script, I chose to store it in a CSV file.

Upon submitting the script, the following is the file preview of the output flat file.
Azure Data Lake Analytics- How To Extract JSON Files-2

The JSON extractor is applicable when retrieving data from REST APIs, where JSON has become the modern standard. It would be a good idea for this to become built-in as part of increasing support for a rich set of capabilities.

Azure Data Lake Analytics: Populating & Querying Tables

Upon creating a table with a defined set of columns, let’s go through a basic example of populating it from a CSV file stored in Azure Data Lake.

The .csv file is the sample data to be imported into the table. I am using the Data Lake Explorer in Visual Studio; the relative file path is \MyData\FL_insurance_sample.csv. I got this sample data file from a public website.
Azure Data Lake Analytics- Populating - Querying Tables-1

U-SQL application:

// Use insurance database;
USE Insurance;

// Extract from file
@csvData =
    EXTRACT policyID int,
        statecode string,
        county string,
        eq_site_limit float,
        hu_site_limit float,
        fl_site_limit float,
        fr_site_limit float,
        tiv_2011 float,
        tiv_2012 float,
        eq_site_deductible float,
        hu_site_deductible float,
        fl_site_deductible float,
        fr_site_deductible float,
        point_latitude float,
        point_longitude float,
        line string,
        construction string,
        point_granularity int?
    FROM "/MyData/FL_insurance_sample.csv"
    USING Extractors.Csv(skipFirstNRows:1);

//Insert it into a previously created table
INSERT INTO Policies
SELECT *
FROM @csvData;

Since the data file is in CSV format, I use Extractors.Csv; the other built-in extractors are Tsv and Text. Since the data file has a header row of column names as its first row, I had to skip it by specifying skipFirstNRows: 1. Otherwise, the job submission would result in an error, because the header values wouldn't match the field data types defined. There are other arguments that offer more configurability and flexibility, as sketched below.
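
For example, here is a minimal sketch of a few of those arguments using the Text extractor (the pipe-delimited file and its path are made up for illustration):

// Hypothetical pipe-delimited file: skip the header row and
// silently drop rows that do not match the expected schema
@pipeData =
    EXTRACT policyID int,
            statecode string,
            county string
    FROM "/MyData/FL_insurance_sample_pipe.txt"
    USING Extractors.Text(delimiter : '|', skipFirstNRows : 1, silent : true);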

Submit Job
Azure Data Lake Analytics- Populating - Querying Tables-2
Azure Data Lake Analytics- Populating - Querying Tables-3

To query and see the results, they must be written to an output file in Azure Data Lake Store. Unfortunately, I don't see a way to view them in a console the way you can in SQL Server Management Studio.

// Use insurance database;
USE Insurance;

@queryPolicy = SELECT * FROM Policies
WHERE county == "NASSAU COUNTY";

// output query into a file
OUTPUT @queryPolicy
TO "/MyData/queryresults/policies-by-nassaucounty.csv"
USING Outputters.Csv();

Azure Data Lake Analytics- Populating - Querying Tables-4

You can preview the file in Visual Studio
Azure Data Lake Analytics- Populating - Querying Tables-5

Querying the view works the same way:

@queryView = SELECT * FROM PoliciesByCounty;

// output query into a file
OUTPUT @queryView
TO "/MyData/queryresults/view-by-county.csv"
USING Outputters.Csv();     

Note the View definition is:
CREATE VIEW IF NOT EXISTS Insurance.dbo.PoliciesByCounty AS
    SELECT COUNT(policyID) AS NumOfPolicies, county
    FROM Insurance.dbo.Policies
    GROUP BY county;

The preview of the output file:
Azure Data Lake Analytics- Populating - Querying Tables-6

I just went through some basic scenarios for populating and querying a table and view in ADLA.

References:
Tutorial: Get started with Azure Data Lake Analytics U-SQL language
U-SQL Views


Azure Data Lake Analytics: Database and Tables

Upon creating your Azure Data Lake Analytics service, you are ready to create databases and tables. This defines a schema and structure for your data, which may arrive in a semi-structured format. For development tooling, you need to install Azure Data Lake Tools for Visual Studio as a plug-in to Visual Studio.

This provides the ability to create U-SQL projects and functionality for development.
Azure Data Lake Analytics- Database and Tables-1
Azure Data Lake Analytics- Database and Tables-2

The catalog of Azure Data Lake Analytics:

Visual Studio – Server Explorer Pane
Azure Data Lake Analytics- Database and Tables-3

Azure Portal – Azure Data Lake Analytics
Azure Data Lake Analytics- Database and Tables-4

When creating tables, you have the option of creating them in the master database. The master database has assemblies that can be leveraged in your scripts, such as the Cognitive Services assemblies for key phrase extraction from text.

To create a new database and table, add a new .usql file to your U-SQL project.


//Create Database
CREATE DATABASE IF NOT EXISTS Insurance;

//Create Table
CREATE TABLE IF NOT EXISTS Insurance.dbo.Policies
(
        //Define schema of table
        policyID int,
        statecode string,
        county string,
        eq_site_limit float,
        hu_site_limit float,
        fl_site_limit float,
        fr_site_limit float,
        tiv_2011 float,
        tiv_2012 float,
        eq_site_deductible float,
        hu_site_deductible float,
        fl_site_deductible float,
        fr_site_deductible float,
        point_latitude float,
        point_longitude float,
        line string,
        construction string,
        point_granularity int?,
    INDEX idx1 //Name of index
    CLUSTERED (county ASC) //Column to cluster by
    DISTRIBUTED BY HASH (county) //Column distribute by
);

// CREATE VIEW based on the Policies table
CREATE VIEW IF NOT EXISTS Insurance.dbo.PoliciesByCounty AS
    SELECT COUNT(policyID) AS NumOfPolicies, county
    FROM Insurance.dbo.Policies
    GROUP BY county;

  • Create a database that serves as a container for tables, views, assemblies, schemas, table-valued functions, etc.
  • Create a table with a set of fields and data types. The data types are C#-like, so you can define nullable types just as I did with point_granularity as int?.
  • Create a clustered index for the table, which determines how rows are physically stored relative to one another, and distribute the rows by hash. The purpose is efficient querying and processing.

Click Submit to submit the script as a job
Azure Data Lake Analytics- Database and Tables-5

Azure Data Lake Analytics- Database and Tables-6

In the Server Explorer pane, you will see the resulting outcome.
Azure Data Lake Analytics- Database and Tables-7

In the following article, Azure Data Lake Analytics: Populating & Querying Tables, we will explore how to populate and query the table from a flat file in Azure Data Lake Store.

Creating Azure Data Lake Analytics

Azure Data Lake Analytics is a pay-per-use big data analytics service where you write and submit scripts in U-SQL. It is a platform-as-a-service offering that requires less infrastructure management, so you can focus on building your applications.

Some key capabilities that I favored and found useful:

  • U-SQL
    Simple and powerful while leveraging my existing skills in SQL, C#, and the Visual Studio IDE, which makes me highly productive and able to deliver results.
  • Cost-effective
    I pay for the processing time of my script jobs. Overall I spend very little, since my job executions average only a few minutes and are not very frequent.
  • Role-based access security
    I can grant various levels of permissions and assign roles to various types of users, leveraging Azure Active Directory; for example, enabling read-only users to report on data with tools such as Power BI.

For a more detailed overview and more of the key capabilities, read https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-overview

The following article nicely explains how to create Azure Data Lake Analytics, but this post will walk through how I set things up and weigh in with my own thoughts and considerations.

https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-get-started-portal

Go to the Azure Portal and add a new Azure Data Lake Analytics service.
Creating Azure Data Lake Analytics1

Data Lake Store

I selected an existing data lake store within the same region/location.

Pricing

Pricing Details: https://azure.microsoft.com/en-us/pricing/details/data-lake-analytics/

Pay-as-you-go is $2 USD per hour per analytics unit. For development purposes, working with data on the order of hundreds of thousands of rows as a rough indicator, this pricing was perfectly acceptable to me. I am not running jobs against the data frequently throughout the day either, so I hardly noticed any significant costs. Monthly commitment plans provide at least a 50% discount.
Creating Azure Data Lake Analytics2

An analytics unit is about 2 CPU cores and 6 GB of RAM. For more details on the analytics unit, read Understanding the ADL Analytics Unit.

For my simple U-SQL jobs, I usually run with 5 AUs for about 1 minute, to give a ballpark sizing.
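
At the pay-as-you-go rate above, that works out to roughly 5 AUs × (1/60) hour × $2 ≈ $0.17 per job, as a back-of-the-envelope estimate.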

After deployment, here are some of the key settings and functions
Creating Azure Data Lake Analytics3

I'd like to discuss just a few of these:

Add User Wizard
Allows for role-based access security of data analytics and files

  • Select Azure AD user
  • Role: Owner, Contributor, Reader, Data Lake Analytics Developer
  • Catalog permissions for running and managing jobs
  • Files and folder permissions

New Job
Run U-SQL scripts in the browser. This is nice and convenient for simple scripts; the other option is using Visual Studio with the Data Lake tools.

Creating Azure Data Lake Analytics4

When you submit the job, you can get really nifty execution run-time monitoring
adla-0

Job Management

View the history and status of job submissions
Creating Azure Data Lake Analytics6

Compute Hours

View the compute hours so that you can check them against your costs. Here it shows very little compute usage.
Creating Azure Data Lake Analytics7

Data Explorer

View the folders and files of the Data Lake Store, and the databases and tables in Data Lake Analytics
Creating Azure Data Lake Analytics8

For Microsoft developers who want to explore developing big data solutions and don't aspire to become experts, I would recommend trying Azure Data Lake Analytics. It is easy to pick up with existing knowledge of SQL and C#, easy to set up in Azure, and, most importantly, able to deliver an end-to-end analytics solution without too much grind. This is in contrast to the Hadoop platform: for those who do aspire to be experts and want all the bells and whistles, Hadoop is the way to go.

Using Azure Data Lake Store .NET SDK to Upload Files

The .NET SDK is a versatile option: you can build an application with a graphical user interface, build a console application, or integrate an existing application to transfer files to and from Azure Data Lake Store. For initial guidance, read Get started with Azure Data Lake Store using .NET SDK.

I want to show how I implemented the SDK and highlight some key points.

I wrote a console application that reads data from a data source and uploads each file to a designated folder in Azure Data Lake Store. This ran on a recurring schedule.

  1. Create a new .NET console application
  2. Add NuGet Packages
    a. Microsoft.Azure.Management.DataLake.Store
    b. Microsoft.Azure.Management.DataLake.StoreUploader
  3. Authentication
    I decided on the service-to-service authentication with client secret approach.
    a. Go to Azure AD
    b. Click on App registrations
    Using Azure Data Lake Store .NET SDK to Upload Files 1

    c. Create an Azure AD App
    Using Azure Data Lake Store .NET SDK to Upload Files 2
    The sign-on URL is arbitrary at this point, so I just created a dummy URL.

    d. It is now listed
    Using Azure Data Lake Store .NET SDK to Upload Files 3
    e. Click into to the Azure AD App to display its settings
    Using Azure Data Lake Store .NET SDK to Upload Files 4
    f. Add a key name and expiration policy and click save. This will generate a key value.
    Using Azure Data Lake Store .NET SDK to Upload Files 5
    g. Remember to copy the key value and store it somewhere. It will be referenced in your .NET application.
    h. Obtain the Application ID, which will be referenced in your .NET application
    Using Azure Data Lake Store .NET SDK to Upload Files 6

    i. So what did we just do? We essentially created what I like to call an app identity in Azure Active Directory. This is like a Windows Server service account: something you can create and grant permissions on certain resources to, on behalf of an application.
  4. Grant permissions in Azure Data Lake Store to the rkADLSAADApp Azure AD app
    a. Go to Data Explorer
    Using Azure Data Lake Store .NET SDK to Upload Files 7

    b. Click on the folder to grant permissions on, then click on Access
    Using Azure Data Lake Store .NET SDK to Upload Files 7a

    c. Click Add > Select User or Group > Invite
    Using Azure Data Lake Store .NET SDK to Upload Files 8
    d. Find the rkADLSAADApp and select it
    Using Azure Data Lake Store .NET SDK to Upload Files 9

    e. Select Permissions
    Using Azure Data Lake Store .NET SDK to Upload Files 10
    f. Confirm
    Using Azure Data Lake Store .NET SDK to Upload Files 11
  5. I created a .NET project called AzureDataLakeStorageDataAccess that encapsulates the file operations against Azure Data Lake Store. I adapted much of the code from the samples in https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-get-started-net-sdk. I just want to explain how I applied and designed it for my own purposes.
    Using Azure Data Lake Store .NET SDK to Upload Files 12
  6. Access and Authentication
    In the constructor method, I essentially create a client context by passing in the client credentials of an Azure AD app: the application ID and the client secret.
    I recommend storing these values in the app.config file and encrypting them where necessary.

    static AzureDataLakeStorageDataAccess()
            {
                _adlsAccountName = "rkADLS";
                _resourceGroupName = "rkbigdata";
                _location = "Central US";
                _subId = "<subscription ID>";
    
                // Service principal / application authentication with client secret / key
                // Use the client ID and client secret of an existing AAD "Web App" application.
                System.Threading.SynchronizationContext.SetSynchronizationContext(new SynchronizationContext());
                var domain = "<mydomain>.onmicrosoft.com";
                var webApp_clientId = "e7e85dca-f056-4e08-8d79-91b95f18d203";
                var clientSecret = "<client secret>";
                var clientCredential = new ClientCredential(webApp_clientId, clientSecret);
    
                var creds = ApplicationTokenProvider.LoginSilentAsync(domain, clientCredential).Result;
    
                // Create client objects and set the subscription ID
                _adlsClient = new DataLakeStoreAccountManagementClient(creds);
                _adlsFileSystemClient = new DataLakeStoreFileSystemManagementClient(creds);
                _adlsClient.SubscriptionId = _subId;
            }
  7. Upload a file from the local file path.
    public static void UploadFile(string srcFilePath, string destFilePath, bool force = true)
            {
                var parameters = new UploadParameters(srcFilePath, destFilePath, _adlsAccountName, isOverwrite: force);
                var frontend = new DataLakeStoreFrontEndAdapter(_adlsAccountName, _adlsFileSystemClient);
                var uploader = new DataLakeStoreUploader(parameters, frontend);
                uploader.Execute();
       }
    
  8. Upload a file based on a Stream object. This is for when the file content comes from a file stream, memory stream, or another stream type.
    public static void CreateFile(string destFilePath, Stream content)  // TODO: support an overwrite-existing-file parameter
            {
                _adlsFileSystemClient.FileSystem.Create(_adlsAccountName, destFilePath, content);
    }
    
  9. The main console application references the AzureDataLakeStorageDataAccess project and makes calls as follows. Here I have some JSON-formatted data in memory and create a file based on the stream content.
    jsonData = GetData();
    var serializer = new JsonSerializer();

    byte[] byteArray = Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(jsonData, Formatting.Indented));
    using (MemoryStream stream = new MemoryStream(byteArray))
    {
        // Create a new file each time, named by location and date/time
        AzureDataLakeStorageDataAccess.CreateFile(newFileName, stream);
        Console.WriteLine("\t \t Created file " + newFileName);
    }
    

In summary, I have shown how I implemented a .NET console application that uses the ADLS .NET SDK, wrapped in its own application layer, to store files in Azure Data Lake Store. The authentication mechanism was service-to-service: registering an Azure AD app and granting it permissions in ADLS. A use case would be a recurring batch service that periodically moves data from a source into your ADLS.