Text Analytics of Movie Reviews using Azure Data Lake, Cognitive Services and Power BI (part 1 of 2)

Applicable Business Scenario

Marketing or data analysts who need to review sentiments and key phrases of a very large data set of consumer-based movie reviews.

Applied Technologies

  1. Azure Data Lake Store
  2. Azure Data Lake Analytics
    1. Cognitive Services
  3. Visual Studio with Azure Data Lake Analytics Tools
  4. Power BI Desktop & Power BI Service
  5. SharePoint Online site and preview of Power BI Web Part

Azure Data Lake Store

Upload a .csv file of 2,000 movie reviews to a folder in Azure Data Lake Store.

textanalytics-1

Azure Data Lake Analytics

Execute the following U-SQL script either in the Azure Portal (Azure Data Lake Analytics > Jobs > New Job) or in Visual Studio with Azure Data Lake Analytics Tools.

This script references the Cognitive Services assemblies, which come out of the box in the Azure Data Lake Analytics master database.

TextAnalytics-2.png

U-SQL Script

The following script reads the moviereviews.csv file in Azure Data Lake Store and then analyzes it for sentiment and key phrase extraction. Two .tsv files are produced: one with the sentiment and key phrases for each movie review, and another listing each individual key phrase with a foreign key ID back to the parent movie review.

REFERENCE ASSEMBLY [TextCommon];
REFERENCE ASSEMBLY [TextSentiment];
REFERENCE ASSEMBLY [TextKeyPhrase];

@comments =
    EXTRACT Text string
    FROM @"/TextAnalysis/moviereviews.csv"
    USING Extractors.Csv();

@sentiment =
    PROCESS @comments
    PRODUCE Text,
            Sentiment string,
            Conf double
    READONLY Text
    USING new Cognition.Text.SentimentAnalyzer(true);

@keyPhrases =
    PROCESS @sentiment
    PRODUCE Text,
            Sentiment,
            Conf,
            KeyPhrase string
    READONLY Text,
             Sentiment,
             Conf
    USING new Cognition.Text.KeyPhraseExtractor();

@keyPhrases =
    SELECT *, ROW_NUMBER() OVER () AS RowNumber
    FROM @keyPhrases;

OUTPUT @keyPhrases
TO "/TextAnalysis/out/MovieReviews-keyPhrases.tsv"
USING Outputters.Tsv();

// Split the key phrases.
@kpsplits =
    SELECT RowNumber,
           Sentiment,
           Conf,
           T.KeyPhrase
    FROM @keyPhrases
         CROSS APPLY
             new Cognition.Text.Splitter("KeyPhrase") AS T(KeyPhrase);

OUTPUT @kpsplits
TO "/TextAnalysis/out/MovieReviews-kpsplits.tsv"
USING Outputters.Tsv();

Azure Portal > Azure Data Lake Analytics: U-SQL Execution

Create a new job to execute a U-SQL script.
TextAnalytics-3.png

TextAnalytics-4.png

Visual Studio Option

You need the Azure Data Lake Tools for Visual Studio. Create a U-SQL project and paste in the script, then submit it to Azure Data Lake Analytics for execution. The following shows the successful job summary after the script has been submitted.

TextAnalytics-5.png

Click here for Part 2 of 2 of this blog series.

HiveQL Group By and Views with Visual Studio and HDInsight

This article is for beginners looking to understand the developer experience in Visual Studio and working with hive tables in HDInsight.

I developed the following HiveQL statements. My cluster is an HDInsight Spark 2.0 cluster.

USE USData;

CREATE TABLE IF NOT EXISTS CrimesGroupByType
ROW FORMAT DELIMITED
FIELDS TERMINATED by ',' stored as textfile
AS
SELECT count(id) as CrimeCount, PrimaryType AS PrimaryCrime, year YEAR FROM crimes
WHERE year <= 2017
GROUP BY PrimaryType, Year;

CREATE VIEW IF NOT EXISTS CrimesGroupByType_View
AS
SELECT count(id) as CrimeCount, PrimaryType AS PrimaryCrime, year YEAR FROM crimes
WHERE year <= 2017
GROUP BY PrimaryType, Year;

Before executing these statements, I have the database and tables:
Hive Group By and Views with Visual Studio and HDInsight-1

The crimes table data looks like:
Hive Group By and Views with Visual Studio and HDInsight-2

Let’s query the table with a count aggregation and a GROUP BY clause, and save the results into a new Hive table.

CREATE TABLE IF NOT EXISTS CrimesGroupByType
ROW FORMAT DELIMITED
FIELDS TERMINATED by ',' stored as textfile
AS
SELECT count(id) as CrimeCount, PrimaryType AS PrimaryCrime, year YEAR FROM crimes
WHERE year <= 2017
GROUP BY PrimaryType, Year;

As a result, a new Hive table is created with the resulting data:
Hive Group By and Views with Visual Studio and HDInsight-3Hive Group By and Views with Visual Studio and HDInsight-4

For scenarios where a view encapsulating the same query is more appropriate:

CREATE VIEW IF NOT EXISTS CrimesGroupByType_View
AS
SELECT count(id) as CrimeCount, PrimaryType AS PrimaryCrime, year YEAR FROM crimes
WHERE year <= 2017
GROUP BY PrimaryType, Year;

As a result:
Hive Group By and Views with Visual Studio and HDInsight-5

If we look into the file system in Azure Data Lake Store under \hive\warehouse:
Hive Group By and Views with Visual Studio and HDInsight-6

We see the crimesgroupbytype folder for the Hive table, but there is no folder for the Hive view.
Clicking into the folder, we see the table's own data files:
Hive Group By and Views with Visual Studio and HDInsight-7

Clicking into the first file, we see its data:
Hive Group By and Views with Visual Studio and HDInsight-8

Creating a view is useful when you want to query an aggregation whose underlying Hive table is changing and dynamic: the view re-runs its query each time it is read. By contrast, a new Hive table populated by the same query only holds the resulting data for that point in time. Both serve different purposes. Also note that Hive views are selectable in Power BI.
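For example, after new 2017 crime records are loaded into the underlying crimes table, the view reflects them on the next query, while the table created above still returns its original snapshot. A quick sketch reusing the objects created above (the 'THEFT' filter value is only illustrative):

-- The view re-runs its aggregation, so newly loaded rows in crimes are included.
SELECT * FROM CrimesGroupByType_View WHERE PrimaryCrime = 'THEFT';

-- The table returns the snapshot captured when CREATE TABLE ... AS SELECT ran.
SELECT * FROM CrimesGroupByType WHERE PrimaryCrime = 'THEFT';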

These are just simple examples of building a data warehouse in Hive with HDInsight that data analysts, data scientists, and report builders can work from to conduct analysis and build solutions.


Azure Data Factory: Copy Data from Azure Data Lake Store to Azure SQL

Objective: From Azure Data Lake Store, copy data in a .tsv file to a database table in Azure SQL DB.

My take on Azure Data Factory is that it is ETL with less emphasis on the transformation: it moves data from on-premises and cloud sources to Azure data-oriented services.

Azure Data Factory- Copy Data from Azure Data Lake Store to Azure SQL-1

Read further about Azure Data Factory at https://docs.microsoft.com/en-us/azure/data-factory/data-factory-introduction

In my Azure Data Lake Store, I have a 227 MB .tsv file which will be the source data.
Azure Data Factory- Copy Data from Azure Data Lake Store to Azure SQL-2
In my Azure SQL DB, I have the following table which will be the target.
Azure Data Factory- Copy Data from Azure Data Lake Store to Azure SQL-3

The approach I will take is to use the Copy Data wizard. This is much easier and more convenient than the Author and deploy approach, which is based on editing JSON. Hopefully in the near future there will be more UI-based tools.
Azure Data Factory- Copy Data from Azure Data Lake Store to Azure SQL-4

Click Copy data
Azure Data Factory- Copy Data from Azure Data Lake Store to Azure SQL-5

Click Next.
Select Azure Data Lake Store as the source

Enter connection information. For Authentication Type, I selected Service Principal over OAuth. The advantage is that the authentication does not expire as it does with OAuth, so for recurring schedules you are best off creating a Service Principal. For details on creating a service principal, see https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-create-service-principal-portal#get-application-id-and-authentication-key

Azure Data Factory- Copy Data from Azure Data Lake Store to Azure SQL-6
Choose the file or folder as the source data
Azure Data Factory- Copy Data from Azure Data Lake Store to Azure SQL-7

Configure File Format settings. Most have been auto-detected. I chose to edit the schema to apply some proper field names.
Azure Data Factory- Copy Data from Azure Data Lake Store to Azure SQL-8

Select Destination data store as Azure SQL DB
Azure Data Factory- Copy Data from Azure Data Lake Store to Azure SQL-9

Enter connection and authentication information
Azure Data Factory- Copy Data from Azure Data Lake Store to Azure SQL-10

Select table
Azure Data Factory- Copy Data from Azure Data Lake Store to Azure SQL-11

Configure column mapping. Note: I haven't set up for repeatability, such as defining a date field to help manage this scenario.
Azure Data Factory- Copy Data from Azure Data Lake Store to Azure SQL-12

Performance settings. I chose to leave the defaults.
Azure Data Factory- Copy Data from Azure Data Lake Store to Azure SQL-13

Confirm Summary
Azure Data Factory- Copy Data from Azure Data Lake Store to Azure SQL-14

Validation and deployment. This will create the linked services, datasets, and pipelines the same way as in the Author and deploy editor.
Azure Data Factory- Copy Data from Azure Data Lake Store to Azure SQL-15
Monitor pipeline

Confirm data copy by querying the Azure SQL JobPostings table.
Azure Data Factory- Copy Data from Azure Data Lake Store to Azure SQL-16

In Azure SQL, a quick look at the resource utilization during this ADF pipeline activity shows a substantial load, although the sizing is at the lowest tier of 5 DTUs.
Azure Data Factory- Copy Data from Azure Data Lake Store to Azure SQL-17

I find Azure Data Factory a neat tool with basic capabilities. I don't think it should be compared with SSIS, which has more sophisticated capabilities. For now, the Copy Data wizard is good for general bulk data movement to get your data into the Azure cloud.


Query Hive Tables with Ambari Hive Views in HDInsight

This is an introductory walkthrough of querying Hive tables and visualizing the data in the Ambari Hive View. It is another option for building and debugging HiveQL, other than Visual Studio with the Azure Data Lake Tools plugin. In my blog article Populating Data into Hive Tables in HDInsight, I demonstrated populating internal and external Hive tables, and this article builds on that.

“Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable data ETL, a mechanism to put structures on the data, and the capability to query and analyze large data sets that are stored in Hadoop. The Hive View is designed to help you author, execute, understand, and debug Hive queries.”

https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_ambari_views_guide/content/ch_using_hive_view.html

  1. Go to https://<yourcluster>.azurehdinsight.net
  2. Click on the top right waffle > Hive View
    Query Hive Tables with Ambari Hive Views in HDInsight-1
  3. In my Hive, I click on my database usdata and its two tables crimes and crimes_ext. You can see its schema.
    Query Hive Tables with Ambari Hive Views in HDInsight-2
  4. I query the number of rows in my crimes table to show that it is quite large.
    Query Hive Tables with Ambari Hive Views in HDInsight-3
  5. A query to show the number of each type of crime for years before or at 2017.
    Query Hive Tables with Ambari Hive Views in HDInsight-4
    The execution took about 2-3 mins with 2 worker nodes in the cluster.
  6. Clicking on the Visualization tab on the right, I am able to visualize the data from my last query. The red squares indicate the settings and clicks to arrive at the stacked bar chart.
    Query Hive Tables with Ambari Hive Views in HDInsight-5
  7. Click Data Explorer to see various charts defined automatically based on the columns in your query.
    Query Hive Tables with Ambari Hive Views in HDInsight-6

I have shown a brief introduction to Ambari Hive Views with queries. This is one of the tools data analysts and developers can use to quickly and easily build Hive queries.


 

Azure Search: Pushing Content to an Index with the .NET SDK

Blog Series

  1. Azure Search Overview
  2. Pushing Content To An Index with the .NET SDK

I hold the opinion that for a robust indexing strategy, you would likely end up writing a custom batch application between your desired data sources and your defined Azure Search index. The pull method currently only supports data sources that reside in specific Azure data stores (as of Feb 2017):

  • Azure SQL Database
  • SQL Server relational data on an Azure VM
  • Azure DocumentDB
  • Azure Blob storage, Table storage

I would assume many at this time have desired content in websites, databases, and LOB applications outside of these Azure data stores.

Azure Search .NET SDK

This article Upload data to Azure Search using the .NET SDK gives great guidance and is what I used, but here’s my specific implementation approach.

To get started, first create a .NET project.
Azure Search Pushing Content to an Index with the .NET SDK-1

Install from NuGet
Azure Search Pushing Content to an Index with the .NET SDK-2

My project with the Microsoft.Azure.Search library
Azure Search Pushing Content to an Index with the .NET SDK-3

 

To start coding, define your search index by creating a model class. I created a generic index schema, which I will use to define and create a new search index in the Azure Search service and to hold a list of movie records as my searchable content.

[SerializePropertyNamesAsCamelCase]
    public partial class IndexModel
    {
        [Key]
        [IsRetrievable(true)]
        public string Id { get; set; }

        [IsRetrievable(true), IsSearchable, IsFilterable, IsSortable]
        public string Title { get; set; }

        [IsRetrievable(true), IsSearchable]
        [Analyzer(AnalyzerName.AsString.EnLucene)]
        public string Content { get; set; }

        [IsFilterable, IsFacetable, IsSortable]
        public string ContentType { get; set; }

        [IsRetrievable(true), IsFilterable, IsSortable, IsSearchable]
        public string Url { get; set; }

        [IsRetrievable(true), IsFilterable, IsSortable]
        public DateTimeOffset? LastModifiedDate { get; set; }

        [IsRetrievable(true), IsFilterable, IsSortable]
        public string Author { get; set; }

    }

Next, I do three major steps in the Main method of the console app:

  1. Create mock data, as if this data was retrieved from a data source.
  2. Create an index, if one does not already exist, based on the index model class.
  3. Update the index with new or updated content.
public static void Main(string[] args)
        {

            // Mock Data
            List<IndexModel> movies = new List<IndexModel>
            {
                new IndexModel()
                {
                    Id = "1000",
                    Title = "Star Wars",
                    Content = "Star Wars is an American epic space opera franchise, centered on a film series created by George Lucas. It depicts the adventures of various characters a long time ago in a galaxy far, far away",
                    LastModifiedDate = new DateTimeOffset(new DateTime(1977, 01, 01)),
                    Url = @"http://starwars.com"
                },
                new IndexModel()
                {
                    Id = "1001",
                    Title = "Indiana Jones",
                    Content = @"The Indiana Jones franchise is an American media franchise based on the adventures of Dr. Henry 'Indiana' Jones, a fictional archaeologist. It began in 1981 with the film Raiders of the Lost Ark",
                    LastModifiedDate = new DateTimeOffset(new DateTime(1981, 01, 01)),
                    Url = @"http://indianajones.com"
                },
                new IndexModel()
                {
                    Id = "1002",
                    Title = "Rocky",
                    Content = "Rocky Balboa (Sylvester Stallone), a small-time boxer from working-class Philadelphia, is arbitrarily chosen to take on the reigning world heavyweight champion, Apollo Creed (Carl Weathers), when the undefeated fighter's scheduled opponent is injured.",
                    LastModifiedDate = new DateTimeOffset(new DateTime(1976, 01, 01)),
                    Url = @"http://rocky.com"
                }
            };

            AzureSearch.CreateIndexIfNotExists<IndexModel>("movies");

            AzureSearch.UpdateIndex("movies", movies);

            Console.WriteLine("Enter any key to exist");
            Console.ReadKey();
        }

In the Azure Portal, you will see the outcomes

  • The ‘movies’ index has been created along with 3 documents as expected.
    Azure Search Pushing Content to an Index with the .NET SDK-4
  • I find that the document count value takes several minutes or more to be updated, but the indexing is immediate.
  • The fields have been defined along with their types and attributes based on the index model class.
    Azure Search Pushing Content to an Index with the .NET SDK-5
  • To test the index, use the Search Explorer
    Azure Search Pushing Content to an Index with the .NET SDK-6

Here are further code snippet details of the following method calls. I made the create method generic such that you pass in the type of the index model as T; FieldBuilder.BuildForType<T>() then builds out the index schema.

AzureSearch.CreateIndexIfNotExists<IndexModel>("movies");

public static Boolean CreateIndexIfNotExists<T>(string indexName)
{
    bool isIndexCreated = false;

    List<string> suggesterFieldnames = new List<string>() { "title" };

    // Build the index definition from the index model class.
    var definition = new Index()
    {
        Name = indexName,
        Fields = FieldBuilder.BuildForType<T>(),
        Suggesters = new List<Suggester>() {
            new Suggester() {
                Name = "Suggester",
                SearchMode = SuggesterSearchMode.AnalyzingInfixMatching,
                SourceFields = suggesterFieldnames
            }
        }
    };

    SearchServiceClient serviceClient = CreateSearchServiceClient();

    if (!serviceClient.Indexes.Exists(indexName))
    {
        serviceClient.Indexes.Create(definition);
        isIndexCreated = true;
    }
    else
        isIndexCreated = false;

    return isIndexCreated;
}

AzureSearch.UpdateIndex("movies", movies);

Inner method call:

private static void UploadDocuments(ISearchIndexClient indexClient, List<IndexModel> contentItems)
{
    // Merge or upload the batch of documents into the target index.
    var batch = IndexBatch.MergeOrUpload(contentItems);
    indexClient.Documents.Index(batch);
}

 

In conclusion, I generally recommend the push approach using the Azure Search .NET SDK, as there is more control and flexibility. Just as I created a create-index method, you should create a delete-index method. This helps during the development process as you iterate on defining your index schema. Even in production scenarios, it can be appropriate to delete your index, re-create it with an updated schema, and then re-index your content.
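For example, here is a minimal sketch of such a delete method, assuming the same AzureSearch helper class and the CreateSearchServiceClient() method shown above:

public static bool DeleteIndexIfExists(string indexName)
{
    SearchServiceClient serviceClient = CreateSearchServiceClient();

    if (serviceClient.Indexes.Exists(indexName))
    {
        // Deleting the index also deletes all of the documents it contains.
        serviceClient.Indexes.Delete(indexName);
        return true;
    }

    return false;
}

During development you can then call DeleteIndexIfExists("movies") followed by CreateIndexIfNotExists<IndexModel>("movies") and re-run the upload whenever the index schema changes.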


Azure Search Overview

Blog Series

  1. Azure Search Overview
  2. Pushing Content To An Index with the .NET SDK

Azure Search is a platform-as-a-service offering; it requires code and configuration to set up and use.

Applicable corporate scenarios

  • Enterprise search on many repositories of data or files that are intended to be available for a wide audience. A lightweight one-stop shop for finding any information chosen to be indexed
  • Add search functionality to an existing application that does not have its own search functionality, such as a public internet company website.
  • A custom search against multiple sources of log files.

Search Service

In the Azure Portal, create a new Azure Search service. The free tier should be sufficient for proof of concept purposes.

AzureSearchOverview-1

The Index

The index is the heart of any search engine: content is stored in a way that is searchable for fast and accurate retrieval. The index needs to be configured with a schema that defines custom fields by name, type, and attributes.

Overview blade:
AzureSearchOverview-2

Index Fields:
azuresearchoverview-3

Field Types

Type & Description

Edm.String – Text that can optionally be tokenized for full-text search (word breaking, stemming, etc).

Collection(Edm.String) – A list of strings that can optionally be tokenized for full-text search. There is no theoretical upper limit on the number of items in a collection, but the 16 MB upper limit on payload size applies to collections.

Edm.Boolean – Contains true/false values.

Edm.Int32 – 32-bit integer values.

Edm.Int64 – 64-bit integer values.

Edm.Double – Double-precision numeric data.

Edm.DateTimeOffset – Date time values represented in the OData V4 format (e.g. yyyy-MM-ddTHH:mm:ss.fffZ or yyyy-MM-ddTHH:mm:ss.fff[+/-]HH:mm).

Edm.GeographyPoint – A point representing a geographic location on the globe.


Field attributes

Attribute & Description

Key – A string that provides the unique ID of each document, used for document look up. Every index must have one key. Only one field can be the key, and its type must be set to Edm.String.

Retrievable – Specifies whether a field can be returned in a search result.

Filterable – Allows the field to be used in filter queries.

Sortable – Allows a query to sort search results using this field.

Facetable – Allows a field to be used in a faceted navigation structure for user self-directed filtering. Typically fields containing repetitive values that you can use to group multiple documents together (for example, multiple documents that fall under a single brand or service category) work best as facets.

Searchable – Marks the field as full-text searchable.

You can’t change an existing field; you can only add to the schema. If you need to change existing fields, you have to delete the index, re-create it with the new specifications, and re-index your content. I suggest you automate this process by creating your own management app with the .NET SDK or REST API.

Refer to further guidance and sample code at Create an Azure Search index using the .NET SDK


Indexing

To populate the index from data sources, there are two approaches to indexing:

  1. Pushing to the index programmatically using the REST API or .NET SDK.
  2. Pulling into the index with the Search Service's indexer; no need for custom code.
Push to index
  Pros: More flexible; supports any kind of data source.
  Cons: You manage change tracking; you need to write custom code using the REST API or .NET SDK.

Pull into index
  Pros: Change tracking is mostly handled for you; no need to write custom code.
  Cons: Limited number of supported data sources, which must reside in Azure.

To set up a pull into the index, you can configure it through the Azure Portal using Import Data. The alternative and more comprehensive way to configure it is through the REST API or .NET SDK.

Import Data

azuresearchoverview-4

azuresearchoverview-5

You must have already uploaded supported file types, such as PDF and MS Office documents, into your blob containers.

A custom index schema will be generated for you based on the supported metadata of the specific data source. In this case, it is blob storage.
azuresearchoverview-6

Indexer Configuration
azuresearchoverview-7

You provide the name of the indexer and the schedule of how often the indexer runs. There is automatic change tracking for new and updated documents, but deletion is handled differently. The indexer will not remove a document from the index unless you define a metadata field on your blob that can be marked with a value identifying it as ‘soft deleted’. In this example, we use the metadata_title field: when a document has the value ‘softdelete’ in that field, the indexer deletes it from the index. Only once that has happened can you delete the document itself from the blob store. In my opinion, this process is a bit complex to handle; it may require a custom application that scans the blob store and the index for differences and then deletes the blobs.
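For reference, the same data source and indexer can also be configured in code instead of through the portal. Below is a rough sketch using the Microsoft.Azure.Search .NET SDK from the companion post in this series; the connection string, container name, and index name are placeholders, and exact class names can differ between SDK versions.

// Assumes: using Microsoft.Azure.Search; using Microsoft.Azure.Search.Models;
SearchServiceClient serviceClient =
    new SearchServiceClient("<search service name>", new SearchCredentials("<admin api-key>"));

// Blob data source with a soft-delete policy: documents whose metadata_title
// metadata value is "softdelete" are removed from the index on the next indexer run.
var dataSource = new DataSource()
{
    Name = "blob-datasource",
    Type = DataSourceType.AzureBlob,
    Credentials = new DataSourceCredentials("<blob storage connection string>"),
    Container = new DataContainer("documents"),
    DataDeletionDetectionPolicy =
        new SoftDeleteColumnDeletionDetectionPolicy("metadata_title", "softdelete")
};
serviceClient.DataSources.CreateOrUpdate(dataSource);

// Indexer that pulls from the blob data source into the target index every hour.
var indexer = new Indexer()
{
    Name = "blob-indexer",
    DataSourceName = dataSource.Name,
    TargetIndexName = "<your index name>",
    Schedule = new IndexingSchedule(TimeSpan.FromHours(1))
};
serviceClient.Indexers.CreateOrUpdate(indexer);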

Search Explorer

Test the index with the Search Explorer by inputting your query string of search keys, filters, and other operators.
azuresearchoverview-8

Since I had uploaded some PDF and Word documents on Azure, I’ll search on “Azure”.
azuresearchoverview-9

Also, scrolling further down, you can see the results of the metadata fields.
azuresearchoverview-10

To set up a push approach to the index, you have to use either the REST API or the .NET SDK.

You build a custom application that interfaces with your data source and use the above options to push the content to the index.

For further details, see my blog post Azure Search: Pushing Content to an Index with the .NET SDK.


Working with Hive Tables in Zeppelin Notebook and HDInsight Spark Cluster

Zeppelin notebooks are a web-based editor for data developers, analysts, and scientists to develop their code (Scala, Python, SQL, ...) in an interactive fashion and also visualize the data.

I will demonstrate basic notebook functionality: querying data in Hive tables, aggregating the data, and saving the results to a new Hive table. For more details, read https://hortonworks.com/apache/zeppelin/.

My HDInsight Spark 2.0 cluster is configured with Azure Data Lake Store as the primary data storage. The Zeppelin notebook has to be installed by going to the Add Service wizard: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_zeppelin-component-guide/content/ch_installation.html

To start off, create a new note.

A note starts with one empty paragraph where you can write code; each paragraph has a results section to display outputs and error messages. A note can have many paragraphs that share the same context. For more details, read https://zeppelin.apache.org/docs/0.6.1/quickstart/explorezeppelinui.html#note-layout
Working with Hive Tables in Zeppelin Notebook and HDInsight Spark Cluster-1

Show tables in the usdata database. Hit shift+enter to run the paragraph or hit the play button on the top right.
Working with Hive Tables in Zeppelin Notebook and HDInsight Spark Cluster-2

I create another paragraph with code to query a hive table, display the first 2 rows and the number of records.
Working with Hive Tables in Zeppelin Notebook and HDInsight Spark Cluster-3

%livy.spark declares the interpreter. The programming language is Scala.

crimesDF is a data frame object which contains the data in named columns according to the schema defined in the hive table it is being queried from. For details read http://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes

The number of records counted is 6,312,976. The execution took about 21 seconds, which I consider quite fast.

In this paragraph:

  • A query runs against the crimesDF data frame created in the previous paragraph; crimebytypeDF will contain an aggregation of the number of crimes by primary type and year.
  • The result is saved into a new Hive table using the saveAsTable function.
  • The first two rows are shown for testing.
  • The data frame is converted to an RDD for future scenarios that apply more sophisticated transformations; an RDD loses the named-column support of data frames.
  • The first 5 rows of the RDD are printed for testing.

Working with Hive Tables in Zeppelin Notebook and HDInsight Spark Cluster-4

This execution took 1 minute and 52 seconds, which is again very fast.
The full code:

%livy.spark
spark.sql("Use usdata")
spark.sql("Show tables").show()	

val crimesDF = spark.sql("SELECT * FROM crimes")
crimesDF.show(2, false)
crimesDF.count()

val crimebytypeDF = crimesDF.select($"id", $"year", $"primarytype").where($"year" < 2017).groupBy($"primarytype", $"year").
    agg(countDistinct($"id") as "Count")

// save data frame of results into an existing or non-existing hive table.
crimebytypeDF.write.mode("overwrite").saveAsTable("crimebytype_zep")

// display first 2 rows
crimebytypeDF.show(2, false)

// convert data frame to RDD and print 5 first rows. RDD allows for more control on the data for transformations.
val crimebytypeRDD = crimebytypeDF.rdd
crimebytypeRDD.take(5).foreach(println)

I have shown a brief and basic example of using the Zeppelin notebook to manage data in Hive tables for ad hoc data analysis scenarios. I like the fact that queries and operations are much faster than in the Hive View in Ambari or in Visual Studio with the Azure Data Lake Tools plugin. I like the notebook ‘paragraph’ concept, as I can run small chunks of code rather than working through separate files or commenting and uncommenting lines of code. I would start off writing the core logic of my big data applications in notebooks and then bring them into IntelliJ or Visual Studio, which are more suitable for larger-scale application development.


 

The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS

In my blog post Populating Data into Hive Tables in HDInsight, I have demonstrated populating an internal and an external hive table in HDInsight. The primary storage is configured with Azure Data Lake Store.

To see the differences, I will demonstrate dropping both types of tables and observe the effects. This is for the beginner audience.

To recap the linked post, the internal Hive table crimes has its own copy of the csv file of crimes data, while the external Hive table crimes_ext has a pointer to the folder containing its data.
The Hive tables to drop, as seen through Visual Studio:

The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS-1
In Azure Data Lake Store, the tables are under \clusters\rkhdinsight\hive\warehouse\usdata.db:
The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS-2 copy
Clicking into the crimes folder, we see the data file.
The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS-3
Clicking into the crimes_ext folder, there is no data file.
The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS-4
Since crimes_ext is an external table, its data is configured to be under \datasets\Crimes
The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS-5
Out of curiosity, I look in the external Hive metastore in an Azure SQL Database. There is a table TBLS with a list of the tables and their types. The Hive metastore contains the metadata of the Hive tables, but no data.
The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS-6

Dropping the tables:

USE usdata;
DROP TABLE crimes_ext;
DROP TABLE crimes;

In Server Explorer, after refreshing the usdata database, they are gone.
The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS-7
In ADLS, the crimes folder, which belonged to the internal table, has been removed. However, the crimes_ext folder still exists, for reasons I am not aware of.
The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS-8
Checking the data of the external table, we see its data file has not been removed, which is expected by design.
The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS-9
Going back to the Hive metastore, we see the two tables removed.
The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS-10

Side note: the Hive view crimesgroupbytype_view queries the dropped internal Hive table, and if you run the view, you get an error about the dependency on the missing table.

The concepts of dropping internal and external tables are easy to understand, but I just want to show the underlying effects in Azure Data Lake Store and the hive metastore DB for the beginner.


Populating Data into Hive Tables in HDInsight


Objective:
Populate a csv file into an internal and an external Hive table in HDInsight.

See my blog post on creating Hive tables: Creating Internal and External Hive Tables in HDInsight.

I have obtained a 1.4 GB csv file of US city crimes data from https://catalog.data.gov/dataset/crimes-2001-to-present-398a4

My HDInsight cluster is configured to use Azure Data Lake store as its primary data storage.

  1. Go to Azure Data Lake Store > Data Explorer
  2. Upload csv file twice in 2 separate locations.
    Upload to adl://rkdatalake.azuredatalakestore.net/datasets for the internal Hive table and adl://rkdatalake.azuredatalakestore.net/datasets/crimes for the external Hive table. I will explain why further below.
    Populating Data into Hive Tables in HDInsight-1
    Populating Data into Hive Tables in HDInsight-2
  3. In Visual Studio with Azure Data Lake Tools plugin, open an existing or new Hive project.
  4. Execute script:
    LOAD DATA INPATH 'adl://rkdatalake.azuredatalakestore.net/datasets/Crimes_-_2001_to_present.csv' OVERWRITE INTO TABLE crimes;
    
  5. Query crimes table to test data load
    Populating Data into Hive Tables in HDInsight-3
  6. Go back to adl://rkdatalake.azuredatalakestore.net/datasets and you will notice that the csv file is no longer present. It has been moved to the hive/warehouse.
    Populating Data into Hive Tables in HDInsight-4
  7. Execute Script to set the external data location.
    ALTER TABLE Crimes_ext SET LOCATION 'adl://rkdatalake.azuredatalakestore.net/datasets/Crimes/'
    

    I found that the location must be a folder containing the csv file(s) and not a specific csv file path.

  8. Execute Script to query external table:
    Populating Data into Hive Tables in HDInsight-5
  9. In contrast to the internal crimes table, the crimes_ext folder under hive\warehouse does not contain any data files. This confirms the by-design behavior of an external table.
    Populating Data into Hive Tables in HDInsight-6

I have demonstrated the differences in populating data for an internal and an external Hive table. For a Q&A discussion of the differences, read http://stackoverflow.com/questions/17038414/difference-between-hive-internal-tables-and-external-tables.

In my following article The effects of Dropping Hive Tables, I will show the effects of dropping each type of hive table.


Creating Internal and External Hive Tables in HDInsight

Objective: Create an internal and an external Hive table in HDInsight, based on the schema of a CSV file of US city crime data: https://catalog.data.gov/dataset/crimes-2001-to-present-398a4

Building Hive tables establishes a schema on the flat files that I have stored in Azure Data Lake Store. This allows me to run SQL-like queries with HiveQL on that data. In other words, I have a data warehouse in Hive, which can be the basis for building big data analytical applications.
For an overview of Hive, read https://hortonworks.com/apache/hive/

  1. Open Visual Studio with Data Lake Tools plugin.
  2. Create New Project
    Creating Internal and External Hive Tables in HDInsight-1
  3. In Server Explorer, ensure you are connected to an HDInsight cluster
    CreateTables0
  4. In Solution Explorer, create a new HiveQL script to create tables
    Creating Internal and External Hive Tables in HDInsight-3
  5. Create Database
    CREATE DATABASE IF NOT EXISTS USData;
    
    use USData;
    
  6. Create internal table
    CREATE TABLE IF NOT EXISTS Crimes
    (
            ID INT ,
            CaseNumber STRING,
            CrimeDate TIMESTAMP,
            Block STRING,
            IUCR INT,
            PrimaryType STRING,
            Description STRING,
            LocationDescription STRING,
            Arrest BOOLEAN,
            Domestic BOOLEAN,
            Beat INT,
            District INT,
            Ward INT,
            CommunityArea INT,
            FBICode INT,
            XCoord INT,
            YCoord INT,
            Year INT,
            UpdatedOn TIMESTAMP,
            Latitude FLOAT,
            Longitude FLOAT,
            CrimeLocation STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED by ',' stored as textfile
    tblproperties ("skip.header.line.count"="1");
    
  7. Create external table
    CREATE EXTERNAL TABLE IF NOT EXISTS Crimes_EXT
    (
            ID INT,
            CaseNumber STRING,
            CrimeDate DATE,
            Block STRING,
            IUCR INT,
            PrimaryType STRING,
            Description STRING,
            LocationDescription STRING,
            Arrest BOOLEAN,
            Domestic BOOLEAN,
            Beat INT,
            District INT,
            Ward INT,
            CommunityArea INT,
            FBICode INT,
            XCoord INT,
            YCoord INT,
            Year INT,
            UpdatedOn DATE,
            Latitude FLOAT,
            Longitude FLOAT,
            CrimeLocation STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED by ',' stored as textfile
    tblproperties ("skip.header.line.count"="1");
    
  8. Submitting one of the scripts
    Creating Internal and External Hive Tables in HDInsight-4
  9. As a result, crimes and crimes_ext tables are created.
    Creating Internal and External Hive Tables in HDInsight-5
  10. Expanding the tables shows the fields of the internal/managed and external Hive tables.
    Creating Internal and External Hive Tables in HDInsight-6


For further details on creating tables, read https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ManagedandExternalTables

To populate these Hive tables with data, read my article Populating Data into Hive Tables in HDInsight.