Building a Spark Application for HDInsight using IntelliJ Part 1 of 2

For developers with a Microsoft .NET background who want to get familiar with building Spark applications in the Scala programming language, this blog post series walks through installing the development tools, building a simple Spark application and then submitting it to an HDInsight Spark cluster.

My HDInsight configuration is Spark 2.0 (HDI 3.5) with Azure Data Lake Store as the primary storage.

The article I used to understand the installation and setup:
https://docs.microsoft.com/en-us/azure/azure-toolkit-for-intellij-installation

A point of confusion for me was that HDInsight Tools for IntelliJ is deprecated; it has been rolled into the Azure Toolkit for IntelliJ. So, any online articles referring to HDInsight Tools are still valid, but read them with the Azure Toolkit in mind.
Download

I’m working on a Windows Server 2012 R2 VM.
JDK 8
Building a Spark Application for HDInsight using IntelliJ Part 1 of 2-1

IntelliJ Installation
Building a Spark Application for HDInsight using IntelliJ Part 1 of 2-2
Start IntelliJ
Create New Project
Building a Spark Application for HDInsight using IntelliJ Part 1 of 2-3

Before creating a project, we need to install the Azure Toolkit so that we can use the Spark on HDInsight project template.
Building a Spark Application for HDInsight using IntelliJ Part 1 of 2-4
Building a Spark Application for HDInsight using IntelliJ Part 1 of 2-5
Select the Spark on HDInsight (Scala) project template. As a side note, an alternate approach is to use Maven, but I found this approach generally easier.

Building a Spark Application for HDInsight using IntelliJ Part 1 of 2-6

Enter a project name.
For the Project SDK, select Java 1.8 by locating it at C:\Program Files\Java\jdk1.8.0_131.
For the Scala SDK, download and select the latest version. In this screenshot I had previously downloaded it.
For the Spark SDK, select the location where you downloaded spark-assembly-2.0.0-hadoop2.7.0-SNAPSHOT.jar.

Selected SDKs
Building a Spark Application for HDInsight using IntelliJ Part 1 of 2-8
Note: For a Spark 2.0 cluster, you need Java 1.8 and Scala 2.11 or above.
Click Finish
Go to File > Project Structure and set the Project Language Level to 7.
Building a Spark Application for HDInsight using IntelliJ Part 1 of 2-9

To write code and submit it to HDInsight, see my next blog post, Building a Spark Application for HDInsight using IntelliJ Part 2 of 2.


Build Power BI Reports with HDInsight Spark Connector

With Power BI Desktop you can use the Azure HDInsight Spark BI Connector to get data from the Spark cluster to build reports. I have an HDInsight Spark 2.0 cluster with Azure Data Lake Store as the primary storage.

Open Power BI Desktop

Click Get Data
Build Power BI Report with HDInsight-1

Enter the URL of your HDInsight Spark cluster.
Build Power BI Report with HDInsight-2

Enter the cluster admin credentials. These are the same credentials used for Ambari.
Build Power BI Report with HDInsight-3

We can see a list of tables and views along with a preview of the data. In my case, usdata is the database. crimes and crimesgroupbytype are internal hive tables, crimes_ext is an external table and crimesgroupbytype_view is a view.

I selected crimes, crimesgroupbytype and crimesgroupbytype_view.
Click Load to generate queries for each
Build Power BI Report with HDInsight-4

The Query Editor
Build Power BI Report with HDInsight-5

When I click Close & Apply, the crimes query results in an error. My suspicion is that crimes has over a million rows, whereas the other queries deal with only several hundred rows. Since this connector is in beta, perhaps I have to wait for the final release.
Build Power BI Report with HDInsight-6

To continue, I delete the usdata crimes query.
Build Power BI Report with HDInsight-7

Click Close & Apply

On the Report page, you can see the tables from the two remaining queries in the Fields pane.
Build Power BI Report with HDInsight-8

I build my report showing the number of crimes by crime type, with filters on the right for year and crime type.
Build Power BI Report with HDInsight-9

HDInsight and Power BI are a powerful combination for working with big data, with the ability to transform, analyze and visualize data in Power BI Desktop.


 

Text Analytics of Movie Reviews using Azure Data Lake, Cognitive Services and Power BI (part 2 of 2)

Applicable Business Scenario

Marketing or data analysts who need to review sentiments and key phrases of a very large data set of consumer-based movie reviews.

Applied Technologies

  1. Azure Data Lake Store
  2. Azure Data Lake Analytics
    1. Cognitive Services
  3. Visual Studio with Azure Data Lake Analytics Tools
  4. Power BI Desktop & Power BI Service
  5. SharePoint Online site and preview of Power BI Web Part

Power BI Desktop

Use Power BI Desktop as the report authoring tool.

Data Source

Get Data from Azure Data Lake Store. Retrieve the output of the U-SQL script executed in Azure Data Lake Analytics in part 1 of this blog series.

TextAnalytics-6.png

Data Source to Azure Data Lake Store

textanalytics-7
Point to the folder containing the .tsv (tab-delimited) files, which were the output of the U-SQL script execution.

Provide credentials to an account that has permissions to the Azure Data Lake Store. In this case, it was an Azure AD account.

Queries

Create a query for each .TSV file

textanalytics-8

textanalytics-9

Relationships

Define a one-to-many relationship based on the ID of each movie review.

textanalytics-10

Reports/Visualization

Sentiment confidence value for each of the 2000 movie reviews

textanalytics-11

 

textanalytics-12

Publish

Click on 'Publish' to upload your report to the Power BI service in the Microsoft cloud. You can view it at http://app.powerbi.com with your Office 365 or Microsoft account.

textanalytics-13

SharePoint Online

If you want to publish and share this report with a wide audience via a SharePoint Online site, you can leverage the new Power BI Web Part (in preview as of Feb 2017). I have displayed this report in the latest SPO modern page experience as a publishing page. Each user that views the report must have a Power BI Pro license, which is not free.

textanalytics-14

To configure this, you need to create a modern publishing page displaying the Power BI report via the Power BI Web Part (preview).

textanalytics-15

Web Part in Edit Mode
textanalytics-16

Enter the report link, which you get from the Power BI service at http://app.powerbi.com.

textanalytics-17

Options to further extend this solution

  • For the movie reviews .csv file, one can add date/time, the movie being reviewed, genre, location and any other descriptive metadata, thus supporting more reporting and insights.
  • Overlay this data set against other data sets for correlation, such as related news events, weather, trending movies, other movie review sources, etc. This is to find any cause-and-effect relationships for diagnostic insights ("Why is this happening?").
  • To get data from an internal network into Azure Data Lake or any Azure storage account, one option is the Data Management Gateway. It is installed within the internal network to allow transferring files and data from internal data sources with little to no corporate firewall changes. See Move data between on-premises sources and the cloud with Data Management Gateway.

Closing Remarks

Azure Cognitive Services built into Azure Data Lake Analytics is a suitable option for very high-volume, unstructured and complex processing of data where scalable computing power is needed. In addition, it is priced in a pay-per-use model, making it cost-effective in many scenarios. The agility of Azure services allows you to experiment, iterate quickly and fail fast in finding the right technical solution and applying the right techniques and approach. This article highlighted how data can be ingested, analyzed/processed, modeled, visualized and then published to a business audience.

Text Analytics of Movie Reviews using Azure Data Lake, Cognitive Services and Power BI (part 1 of 2)

Applicable Business Scenario

Marketing or data analysts who need to review sentiments and key phrases of a very large data set of consumer-based movie reviews.

Applied Technologies

  1. Azure Data Lake Store
  2. Azure Data Lake Analytics
    1. Cognitive Services
  3. Visual Studio with Azure Data Lake Analytics Tools
  4. Power BI Desktop & Power BI Service
  5. SharePoint Online site and preview of Power BI Web Part

Azure Data Lake Store

Upload a .csv file of 2000 movie reviews to a folder in Azure Data Lake Store.

textanalytics-1

Azure Data Lake Analytics

Execute the following U-SQL script in either the Azure Portal > Azure Data Lake Analytics > Jobs > New Jobs or Visual Studio with Azure Data Lake Analytics Tools.

This script makes reference to the Cognitive Services assemblies. They come out of the box in the Azure Data Lake master database.

TextAnalytics-2.png

U-SQL Script

The following script reads the moviereviews.csv file in Azure Data Lake Store and then analyzes it for sentiment and key phrases. Two .tsv files are produced: one with the sentiment and key phrases for each movie review, and another with a list of each individual key phrase with a foreign key ID back to the parent movie review.

REFERENCE ASSEMBLY [TextCommon];
REFERENCE ASSEMBLY [TextSentiment];
REFERENCE ASSEMBLY [TextKeyPhrase];

// Read the raw movie review text from the .csv file in Azure Data Lake Store.
@comments =
    EXTRACT Text string
    FROM @"/TextAnalysis/moviereviews.csv"
    USING Extractors.Csv();

// Score each review for sentiment and confidence.
@sentiment =
    PROCESS @comments
    PRODUCE Text,
            Sentiment string,
            Conf double
    READONLY Text
    USING new Cognition.Text.SentimentAnalyzer(true);

// Extract key phrases for each review, keeping the sentiment columns.
@keyPhrases =
    PROCESS @sentiment
    PRODUCE Text,
            Sentiment,
            Conf,
            KeyPhrase string
    READONLY Text,
             Sentiment,
             Conf
    USING new Cognition.Text.KeyPhraseExtractor();

// Add a row number as an ID for each movie review.
@keyPhrases =
    SELECT *, ROW_NUMBER() OVER () AS RowNumber
    FROM @keyPhrases;

OUTPUT @keyPhrases
TO "/TextAnalysis/out/MovieReviews-keyPhrases.tsv"
USING Outputters.Tsv();

// Split the key phrases into one row per phrase, keyed by RowNumber.
@kpsplits =
    SELECT RowNumber,
           Sentiment,
           Conf,
           T.KeyPhrase
    FROM @keyPhrases
    CROSS APPLY new Cognition.Text.Splitter("KeyPhrase") AS T(KeyPhrase);

OUTPUT @kpsplits
TO "/TextAnalysis/out/MovieReviews-kpsplits.tsv"
USING Outputters.Tsv();

Azure Portal > Azure Data Lake Analytics  U-SQL execution

Create a new job to execute a U-SQL script.
TextAnalytics-3.png

TextAnalytics-4.png

Visual Studio Option

You need the Azure Data Lake Tools for Visual Studio. Create a U-SQL project and paste the script. Submit the U-SQL script to the Azure Data Lake Analytics for execution. The following shows the successful job summary after the U-SQL script has been submitted.

TextAnalytics-5.png

Click here for Part 2 of 2 of this blog series.

Azure Search: Pushing Content to an Index with the .NET SDK.

Blog Series

  1. Azure Search Overview
  2. Pushing Content To An Index with the .NET SDK

I hold the opinion that for a robust indexing strategy, you would likely end up writing a custom batch application between your desired data sources and your defined Azure Search index. The pull method currently only supports data sources that reside in specific Azure data stores (as of Feb 2017):

  • Azure SQL Database
  • SQL Server relational data on an Azure VM
  • Azure DocumentDB
  • Azure Blob storage, Table storage

I would assume that many at this time have their desired content in websites, databases and LOB applications outside of these Azure data stores.

Azure Search .NET SDK

This article Upload data to Azure Search using the .NET SDK gives great guidance and is what I used, but here’s my specific implementation approach.

To get started, first create a .NET project.
Azure Search Pushing Content to an Index with the .NET SDK-1

Install from NuGet
Azure Search Pushing Content to an Index with the .NET SDK-2

My project with the Microsoft.Azure.Search library
Azure Search Pushing Content to an Index with the .NET SDK-3

 

To start coding, define your search index by creating a model class. I created a generic index schema, which I will use both to define and create a new search index in the Azure Search service and to hold a list of movie records as my searchable content.

[SerializePropertyNamesAsCamelCase]
    public partial class IndexModel
    {
        [Key]
        [IsRetrievable(true)]
        public string Id { get; set; }

        [IsRetrievable(true), IsSearchable, IsFilterable, IsSortable]
        public string Title { get; set; }

        [IsRetrievable(true), IsSearchable]
        [Analyzer(AnalyzerName.AsString.EnLucene)]
        public string Content { get; set; }

        [IsFilterable, IsFacetable, IsSortable]
        public string ContentType { get; set; }

        [IsRetrievable(true), IsFilterable, IsSortable, IsSearchable]
        public string Url { get; set; }

        [IsRetrievable(true), IsFilterable, IsSortable]
        public DateTimeOffset? LastModifiedDate { get; set; }

        [IsRetrievable(true), IsFilterable, IsSortable]
        public string Author { get; set; }

    }

Next, I do three major steps in the Main method of the console app:

  1. Create mock data, as if the data was retrieved from a data source.
  2. Create an index, if one does not already exist, based on the index model class.
  3. Update the index with new or updated content.
public static void Main(string[] args)
        {

            // Mock Data
            List<IndexModel> movies = new List<IndexModel>
            {
                new IndexModel()
                {
                    Id = "1000",
                    Title = "Star Wars",
                    Content = "Star Wars is an American epic space opera franchise, centered on a film series created by George Lucas. It depicts the adventures of various characters a long time ago in a galaxy far, far away",
                    LastModifiedDate = new DateTimeOffset(new DateTime(1977, 01, 01)),
                    Url = @"http://starwars.com"
                },
                new IndexModel()
                {
                    Id = "1001",
                    Title = "Indiana Jones",
                    Content = @"The Indiana Jones franchise is an American media franchise based on the adventures of Dr. Henry 'Indiana' Jones, a fictional archaeologist. It began in 1981 with the film Raiders of the Lost Ark",
                    LastModifiedDate = new DateTimeOffset(new DateTime(1981, 01, 01)),
                    Url = @"http://indianajones.com"
                },
                new IndexModel()
                {
                    Id = "1002",
                    Title = "Rocky",
                    Content = "Rocky Balboa (Sylvester Stallone), a small-time boxer from working-class Philadelphia, is arbitrarily chosen to take on the reigning world heavyweight champion, Apollo Creed (Carl Weathers), when the undefeated fighter's scheduled opponent is injured.",
                    LastModifiedDate = new DateTimeOffset(new DateTime(1976, 01, 01)),
                    Url = @"http://rocky.com"
                }
            };

            AzureSearch.CreateIndexIfNotExists<IndexModel>("movies");

            AzureSearch.UpdateIndex("movies", movies);

            Console.WriteLine("Press any key to exit");
            Console.ReadKey();
        }

In the Azure Portal, you will see the outcomes

  • The ‘movies’ index has been created along with 3 documents as expected.
    Azure Search Pushing Content to an Index with the .NET SDK-4
  • I find that the document count value takes several minutes or more to update, but the indexing is immediate.
  • The fields have been defined along with their types and attributes based on the index model class
    Azure Search Pushing Content to an Index with the .NET SDK-5
  • To test the index, use the Search Explorer (or query the index from code, as sketched after this list)
    Azure Search Pushing Content to an Index with the .NET SDK-6
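As an alternative to the Search Explorer, you can also run a quick test query from code with the same SDK. The following is only a minimal sketch, reusing the IndexModel class defined above; the search service name and query API key are placeholders, not values from this project.

using System;
using Microsoft.Azure.Search;
using Microsoft.Azure.Search.Models;

public static class AzureSearchQueryExample
{
    public static void Run()
    {
        // A query (read-only) key is enough for searching; the admin key is only needed
        // for creating indexes and pushing documents. Both values below are placeholders.
        var indexClient = new SearchIndexClient(
            "your-search-service-name",
            "movies",
            new SearchCredentials("your-query-api-key"));

        // Search the 'movies' index the same way the Search Explorer does, but with typed results.
        var results = indexClient.Documents.Search<IndexModel>(
            "star wars",
            new SearchParameters { Top = 10 });

        foreach (var result in results.Results)
        {
            Console.WriteLine($"{result.Document.Title} (score: {result.Score})");
        }
    }
}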

Here are further code snippet details of the method calls. I made CreateIndexIfNotExists generic so that you pass in the type of the index model as T; FieldBuilder.BuildForType<T>() then builds out the index schema.

AzureSearch.CreateIndexIfNotExists<IndexModel>("movies");

public static bool CreateIndexIfNotExists<T>(string indexName)
{
    bool isIndexCreated = false;

    List<string> suggesterFieldnames = new List<string>() { "title" };

    var definition = new Index()
    {
        Name = indexName,
        Fields = FieldBuilder.BuildForType<T>(),
        Suggesters = new List<Suggester>() {
            new Suggester() {
                Name = "Suggester",
                SearchMode = SuggesterSearchMode.AnalyzingInfixMatching,
                SourceFields = suggesterFieldnames
            }
        }
    };

    SearchServiceClient serviceClient = CreateSearchServiceClient();

    // Create the index only if it does not already exist in the search service.
    if (!serviceClient.Indexes.Exists(indexName))
    {
        serviceClient.Indexes.Create(definition);
        isIndexCreated = true;
    }

    return isIndexCreated;
}

AzureSearch.UpdateIndex("movies", movies);
Inner method call:

private static void UploadDocuments(ISearchIndexClient indexClient, List<IndexModel> contentItems)
{
    // MergeOrUpload adds new documents and updates existing ones matched on the key (Id).
    var batch = IndexBatch.MergeOrUpload(contentItems);
    indexClient.Documents.Index(batch);
}
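The UpdateIndex wrapper and the CreateSearchServiceClient helper are not shown in the snippets above. Here is a minimal sketch of what they could look like, with placeholder values for the search service name and admin API key:

private static SearchServiceClient CreateSearchServiceClient()
{
    // Placeholders: substitute your own search service name and admin API key,
    // ideally read from configuration rather than hard-coded.
    string searchServiceName = "your-search-service-name";
    string adminApiKey = "your-admin-api-key";

    // The admin key is required for creating indexes and pushing documents.
    return new SearchServiceClient(searchServiceName, new SearchCredentials(adminApiKey));
}

public static void UpdateIndex(string indexName, List<IndexModel> contentItems)
{
    SearchServiceClient serviceClient = CreateSearchServiceClient();

    // Get a client scoped to the target index, then push the batch of documents.
    ISearchIndexClient indexClient = serviceClient.Indexes.GetClient(indexName);
    UploadDocuments(indexClient, contentItems);
}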

 

In conclusion, I generally recommend the push approach using the Azure Search .NET SDK, as it gives more control and flexibility. Just as I created a CreateIndexIfNotExists method, you should create a delete index method (a sketch follows below). This helps during the development process as you iterate on defining your index schema. Even in production scenarios, it can be appropriate to delete your index, re-create it with an updated schema and then re-index your content.
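As a rough illustration of that delete index method, here is a minimal sketch that reuses the CreateSearchServiceClient helper; treat it as a starting point rather than a finished implementation.

public static bool DeleteIndexIfExists(string indexName)
{
    SearchServiceClient serviceClient = CreateSearchServiceClient();

    // Deleting the index removes both its schema and all indexed documents.
    if (serviceClient.Indexes.Exists(indexName))
    {
        serviceClient.Indexes.Delete(indexName);
        return true;
    }

    return false;
}

Calling DeleteIndexIfExists, then CreateIndexIfNotExists and UpdateIndex, gives you the delete, re-create and re-index cycle described above.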


The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS

In my blog post Populating Data into Hive Tables in HDInsight, I have demonstrated populating an internal and an external hive table in HDInsight. The primary storage is configured with Azure Data Lake Store.

To see the differences, I will demonstrate dropping both types of tables and observe the effects. This post is for the beginner audience.

To recap the linked post, the internal hive table crimes has its own copy of the .csv file of crimes data, while the external hive table crimes_ext has a pointer to the folder containing its data.

The hive tables to drop, as seen through Visual Studio:

The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS-1
In Azure Data Lake Store, the tables are under \clusters\rkhdinsight\hive\warehouse\usdata.db
The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS-2 copy
Clicking into the crimes folder, we see the data file.
The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS-3
Clicking into the crimes_ext folder, there is no data file.
The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS-4
Since crimes_ext is an external table, its data is configured to be under \datsets\Crimes
The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS-5
Out of curiosity, I look in the external hive metastore in an Azure SQL Database. There is a table TBLS with a list of tables and their types. The hive metastore contains the metadata of the hive tables, but no data.
The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS-6

Dropping the tables

USE usdata;
DROP TABLE crimes_ext;
DROP TABLE crimes;

In Server Explorer, after refreshing the usdata database, they are gone.
The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS-7
In ADLS, the crimes folder (the internal table) is removed. However, the crimes_ext folder still exists, for reasons I am not aware of.
The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS-8
Checking the data of the external table, we see its data file has not been removed, which is expected by design.
The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS-9
Going back to the hive metastore, we see the two tables removed.
The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS-10

Side note: the hive view crimesgroupbytype_view queries the dropped internal hive table, and if you run the view, you get an error stating that the hive table it depends on is missing.

The concepts of dropping internal and external tables are easy to understand, but I just want to show the underlying effects in Azure Data Lake Store and the hive metastore DB for the beginner.


Azure Data Lake Analytics: Job Execution Time and Cost

In running an intensive U-SQL job against a large number of files and records, I will show the performance diagnostics and its estimated cost.

I ran a U-SQL job against 4,272 JSON files with about 95,000 rows.
Azure Data Lake Analytics- Job Execution Time and Cost-1

1.2 hours to complete
Azure Data Lake Analytics- Job Execution Time and Cost-2

Parallelism was set to 5 AU.
Azure Data Lake Analytics- Job Execution Time and Cost-3

An AU is an analytics unit, which is approximately a compute resource similar to 2 CPU cores and 6 GB of RAM (as of Oct 2016). For details and background, read https://blogs.msdn.microsoft.com/azuredatalake/2016/10/12/understanding-adl-analytics-unit/

To see diagnostics, first, click Load Job Profile.
Azure Data Lake Analytics- Job Execution Time and Cost-4

Click on Diagnostics tab
Azure Data Lake Analytics- Job Execution Time and Cost-5

Click on Resource usage

In the AU Usage tab, the 5 allocated AUs were fully utilized throughout the 1 hr 7 minutes of execution.
Azure Data Lake Analytics- Job Execution Time and Cost-6

The AU Usage Modeler provides a rough estimate of the number of AUs for the best execution time. Here it estimates 4,272 AUs for 35.22 seconds of execution time.
Azure Data Lake Analytics- Job Execution Time and Cost-7

Interestingly, 4,272 is the same as the number of JSON files to be analyzed. So I assume that, to estimate the best time, it would want to allocate an AU for each file. There is probably more to it than that, but I am just commenting on the observation.

When going to job submission settings, I observed the max AUs that can be allocated is 112.
Azure Data Lake Analytics- Job Execution Time and Cost-8

Adjusting the number to model to 112, we see an estimated 197 seconds of execution time.
Azure Data Lake Analytics- Job Execution Time and Cost-9

Submitting the job again at 112 AUs for parallelism, the outcome is:
Azure Data Lake Analytics- Job Execution Time and Cost-10

AU Usage diagnostics
Azure Data Lake Analytics- Job Execution Time and Cost-11

So, comparing the estimated and actual time at 112 AUs: the actual run took 6.2 minutes versus the estimated 197 seconds (about 3.3 minutes), so roughly twice as long as estimated.

In the graph above, you can see roughly 65-70% usage at 112 AUs compared to almost 100% usage at 5 AUs. But at 5 AUs, the job took about 10 times longer.

For the price comparison, I couldn't find the exact cost of each job execution, but I can estimate based on the pricing stated when you open a U-SQL editor in the Azure Portal.

At 5 AUs, the cost is $0.17 USD/minute, so 72 minutes of execution costs about $12.24.
Azure Data Lake Analytics- Job Execution Time and Cost-12

At 112 AUs, the cost is $3.73 USD/minute, so 6.2 minutes of execution costs about $23.12.
Azure Data Lake Analytics- Job Execution Time and Cost-13

AUs | Execution time | Cost/min (USD) | Est. total cost (USD)
5   | 67 mins        | 0.17           | 11.39
112 | 3.3 mins       | 3.73           | 12.30

So based on my analysis, going for the maximum AUs seems like a good deal: it costs about a dollar more yet saves over an hour of time. I would assume the cost difference varies with other factors such as data size, number of records, data structure and the amount of number crunching.

If you have any feedback on my analysis, feel free to drop me a comment below. I hope this performance and cost analysis at least gives a ballpark idea of what to expect.


Creating Azure Data Lake Store

The Azure Data Lake Store is a storage solution to manage files for big data analytical workloads. The definition of a data lake, according to Wikipedia, “is a method of storing data within a system or repository, in its natural format, that facilitates the collocation of data in various schemata and structural forms, usually object blobs or files. The idea of data lake is to have a single store of all data in the enterprise ranging from raw data (which implies exact copy of source system data) to transformed data which is used for various tasks including reporting, visualization, analytics and machine learning.”

To read more from the MS documentation visit Overview of Azure Data Lake Store

ingest-data

In summarizing the documentation’s overview, here are some of the key capabilities for starting out.

  • Hadoop compatible
  • Virtually unlimited storage
  • Performance for analytical processing
  • User and Role based security
  • Data encryption
  • Store any data format

In its simplest form, it is a hierarchical file system of folders and files. You run your analytical processing scripts pointing to a set of folders or files.

The following is how I created the Azure Data Lake Store. To see the MS documentation visit Get started with Azure Data Lake Store using the Azure Portal

  1. New Data Lake Store
    Creating Azure Data Lake 1
    A. Encryption Settings. I decided on the more sophisticated option of creating a master encryption key in an existing Azure Key Vault under my own ownership. For details, read Data protection.
    Creating Azure Data Lake 2
  2. Click the Create button
  3. Confirm provisioning of Azure Data Lake
  4. The Overview section provides details of the data lake service. However, it prompts for further action: grant the data lake store account RN_rkdatalake access to the key vault.
    Creating Azure Data Lake 3
    A. Click on the orange bar to set this up.
    B. Click on the Grant Permissions button to grant the RN-rkdatalake account permissions.
    Creating Azure Data Lake 4
    C. Notification
    Creating Azure Data Lake 5
  5. Let's go back to the rkdatalake blade and take a tour of some of the unique settings.
    Creating Azure Data Lake 6
    A. Encryption settings
    Creating Azure Data Lake 7
    The master encryption key is located and managed in my Key Vault named rkEntKeyVault. The data lake store account RN_rkdatalake only has access to the key vault to encrypt the data it stores.
    B. Firewall. For security best practices, it is recommended to enable the firewall. The firewall is based on a client IP address or IP address range.
    Creating Azure Data Lake 8
    C. Pricing. For a developer scenario, pay-as-you-go should be quite fine. For myself, this option has not been expensive at all, as I usually work with several GBs of data anyway. Currently, it is 0.039 USD per GB, which is still pennies.
    creating-azure-data-lake-9.jpg
    For other monthly plans:
    Creating Azure Data Lake 10
    D. Data Explorer. This is a tool to explore the file system in Data Lake Store. You can create folders, upload files and manage permissions.
    Creating Azure Data Lake 11
    File preview of MvcWeb.log
    Creating Azure Data Lake 12
    Access
    You can assign permissions to a folder or file. Here I am managing permissions on the MyData folder I had created.
    Creating Azure Data Lake 13
    Click Add to add a user or group from Azure Active Directory to have access to this folder.
    Creating Azure Data Lake 14

Creating the data lake store sets the foundation for analytical processing. You may begin to upload large amounts of data into the respective folders. Examples include IoT sensor data, tweets, .csv exports from relational databases, log files, images, videos or documents. It is up to the processing application, such as Azure Data Lake Analytics U-SQL or Hadoop applications, to process the data using a set of libraries and your custom logic. To be specific about which open source applications can work with Azure Data Lake Store, read Open Source Big Data applications that work with Azure Data Lake Store. Essentially, only Azure's HDInsight works with it and not any other cloud or on-premises Hadoop platform, to my current understanding.

Next, we will look at PowerShell and Options to Upload Data to Azure Data Lake.


SharePoint 2016 Preview Large List Automatic Indexing with Deep Dive Analysis

The list view threshold (LVT) has been a pain point in some SharePoint sites that I have seen. The default setting in SharePoint 2016 Preview is still 5,000 as it is in 2013.

In cases where lists contain >5,000 items, users will eventually encounter the following message and the list is not displayed.

autoindex-1

According to the Software boundaries and limits for SharePoint 2013 article, the definition of the list view threshold (LVT) is:

“Specifies the maximum number of list or library items that a database operation, such as a query, can process at the same time outside the daily time window set by the administrator during which queries are unrestricted.”

To manage this constraint in SharePoint 2010 and 2013, read the article Manage lists and libraries with many items

In my opinion, many don't quite understand what this really is and how to manage it properly. Many project stakeholders other than SharePoint SMEs understand this to be a limit on how many items can be queried from the list. Rather, it is about the number of items or rows the SQL database has to 'scan' as a result of the list view's query.

For example, let's say we had a list of 30,000,000 items. Out of these items, 4,999 have a Country column value of 'Canada'. The list view threshold is set at 5,000.
There is a custom list view with a filter condition of Country = 'Canada'.

Although it seems that this list view is doing a query for only 4,999 items, what is really happening at the SQL database table level is that all 30,000,000 items are being scanned.

A recommended solution is to index the column, which is done through the list settings.

autoindex-2

Note that the indexing of columns is not a SQL based index such as a non-clustered index, but rather indexing through the NameValuePair_Latin1_General_CI_AS table in the respective content database.
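To make the scan behavior concrete, here is a hypothetical CSOM query against such a list; it is not from the scenario above, and the site URL, list name and field names are illustrative only. Assuming the Country column is indexed and the filter returns fewer rows than the threshold, the query can be satisfied from the index range instead of scanning all 30,000,000 rows.

using System;
using Microsoft.SharePoint.Client;

class LargeListQueryExample
{
    static void Main()
    {
        // Hypothetical site URL and list name, for illustration only.
        using (var ctx = new ClientContext("http://sp2016/sites/demo"))
        {
            List largeList = ctx.Web.Lists.GetByTitle("Large List");

            // Filter on the indexed Country column and keep the row limit under the
            // list view threshold so the query is not throttled.
            var query = new CamlQuery
            {
                ViewXml =
                    "<View><Query><Where>" +
                    "<Eq><FieldRef Name='Country'/><Value Type='Text'>Canada</Value></Eq>" +
                    "</Where></Query><RowLimit>4999</RowLimit></View>"
            };

            ListItemCollection items = largeList.GetItems(query);
            ctx.Load(items);
            ctx.ExecuteQuery();

            Console.WriteLine($"Returned {items.Count} items");
        }
    }
}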

The new Automatic Index Management setting

Now, in SharePoint 2016 Preview, there is a new list setting to automatically index columns, found in List Settings > Advanced Settings. The default is set to 'Yes'.

autoindex-3

The automatic indexing is supported by the ‘Large list automatic column index management job’.
Go to Central Administration > Monitoring > Review Job Definitions

sp16prev-autoIndexTimerJob

Large List Demo

Configuration

  • Central Administration > Select Web Application > General – Resource Throttling
    • List View Threshold for end users set to 10,000
    • List View Threshold for auditors and administrators set to 20,000
      Note: I doubled the default values just for general testing.

autoindex-5

  • Test User: Added ‘Roy Kim’ user account with only contribute permissions so that I can simulate the list view threshold without the special exceptions that a site collection administrator would have.
  • Custom list
    • Named 'Large List'
    • Added site columns: Status, Gender, City, Province/State, Country
    • Added 25,146 items with values for the custom columns, including Status (via a PowerShell script)
    • Created a view 'By Not Started' where Status is equal to 'Not Started'

autoindex-6

  • Large list automatic column index management job
    • Allow the timer job to run or manually run the job immediately.
  • Indexed Columns
    • Status column has become automatically indexed.
      autoindex-7

SharePoint 2016 Preview Install – First look

SharePoint 2016 Preview was released yesterday on Aug 24.

Download from here: https://www.microsoft.com/en-us/download/details.aspx?id=48712

Announcement: https://blogs.office.com/2015/08/24/announcing-availability-of-sharepoint-server-2016-it-preview-and-cloud-hybrid-search/

After installing, here are my comments as I walk through for noticeable changes:

  1. Similar to Office 365, there is an 'App Launcher' at the top left.

Newsfeed, OneDrive and Sites sit under your personal My Site.
http://<hostname>/my/personal/<username>/…

sp16preview-app launcher

2. Under List Settings, there is a new setting, 'Automatic Index Management'.
This may help with the list view threshold constraint, where the default is 5,000 items for a list.

sp16preview-autoindexing

There is a new ‘Large list automatic column index management job’ timer job that may support this setting on the configured lists.

sp16prev-autoIndexTimerJob

After running this timer job, I went to List Settings > Indexed columns and did not notice any indexed columns. Perhaps the "indexing" can be seen elsewhere. I will have to continue to investigate.

3. Central Administration > Office 365 > Configure hybrid OneDrive and Site Features

SharePoint Hybrid Solutions Center: http://go.microsoft.com/fwlink/?LinkID=613711

sp16preview-O365hybridconfig

4. There are many more new timer jobs, for a total of 228. In this screenshot, you may notice some new ones such as DeleteUnusedAbs, Document Changed Anti-virus Processing, DrainInlineStreams and Dump site information. I am not sure what these do, but get ready to understand them and leverage them accordingly.

sp16prev-newTimerJobs

To compare to a list of SharePoint 2013 timer jobs, read https://technet.microsoft.com/en-us/library/cc678870.aspx

5. “layouts/15” in URL and SharePoint Root Folders

For example, http://<hostname>/_layouts/15/start.aspx#/Shared%20Documents/Forms/AllItems.aspx

The URL contains "_layouts/15" rather than "_layouts/16", which would take after the product's major version number. Maybe that is why we are seeing the 2013 UI, and perhaps in the future there could be a change in the UI look.

Also, in the file system there is still the SharePoint 14 root folder (i.e. SharePoint 2010). Maybe this will go away in the final release.

sp16prev-rootFolder

6. New Video thumbnails and playback.

I uploaded a 475 MB video I grabbed from channel9.msdn.com into the Documents library. I happened to be running a PowerShell script adding 10,000 list items to a custom list, and when playing the video in this thumbnail, the playback was a little choppy and slow. So even though this is a single-server farm setup, one has to consider the scalability and performance of video playback. This video is surely stored in the content SQL database.

sp16prev-videoThumbNail