Azure Data Lake Analytics: U-SQL C# Programmability

One of the neat features of U-SQL for .NET developers are that you can write C# code for further extensibility and be applying custom logic. This is useful for scenarios for extensive ETL processes where you must take raw data, clean, scrub, transform to meet business requirements for further analytical processing and BI reporting. The alternative languages for Big Data analytics are languages such as Python, R, Scala and Pig. They are more popular in the Big Data community and serve their intended use cases very well. And is favoured by various type of data developers. For the .NET developer who wants to generally start out, then U-SQL is a good choice. After some time, one can advance to those other ‘big data’ languages.

U-SQL programmability guide

My U-SQL with C# code involves counting the frequency of words that appear in a text-based field. The purpose of this code is to test a hypothesis whether getting the most frequent words in a text field can help determine what the text field is about. For example, if the job description field has top words such as ‘developer’, ‘javascript’, ‘json’, then perhaps it gives some indication that the job is a technical job. Let’s continue to explore.

@jobpostings variable represents the data from a file extraction. Let’s process it further with C# code-behind to count the occurrence of each word in the jobDescription field.

@jobpostings_Parsed =
     PROCESS @jobpostings
     PRODUCE jobtitle string,
        company string,
        city string,
        state string,
        country string,
        date string,
        url string,
        latitude float,
        longitude float,
        jobDescription string,
        salaryRate string,
        jobType string,
        hourlyRate double?,
        salary double?,
        compensationType string,
        jobDescriptionTopWords string,
        jobTitleTopWords string
     USING new JobPostings.USql.JobPostingsParser();

For the C# code-behind, I am using the User-defined processor which allows custom logic to be applied on each row. For details read https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide#use-user-defined-processors

Azure Data Lake Analytics- U-SQL C# Programmability - 1

using Microsoft.Analytics.Interfaces;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

namespace JobPostings.USql
{
    public class JobPostingsParser : IProcessor
    {
	 // Dictionary to tally each word and its occurence
        Dictionary<string, int> jobDescriptionFreqWords = new Dictionary<string, int>();

	 // Process each row
        public override IRow Process(IRow input, IUpdatableRow output)
        {
            string jobDescription = input.Get<string>("jobDescription");

                if (!String.IsNullOrEmpty(jobDescription))
                {
                    Dictionary<string, int> <strong>jobDescriptionFreqWords</strong> = GetMostFrequentWords(jobDescription);
                    List<KeyValuePair<string, int>> myList = jobDescriptionFreqWords.ToList();

			// Sort by descending
                    myList.Sort(
                        delegate (KeyValuePair<string, int> pair1,
                        KeyValuePair<string, int> pair2)
                        {
                            return pair2.Value.CompareTo(pair1.Value);
                        }
                    );

			// To output the three most frequent words and their word count as a text string into a new field called jobDescriptionTopWords
                    if (myList.Count >= 3)
                        output.Set<string>("jobDescriptionTopWords", myList[0] + ", " + myList[1] + ", " + myList[2]);
                }

// Need to explicitly output each field even though there is no processing on them.
            output.Set<string>(0, input.Get<string>("jobtitle"));
            output.Set<string>(1, input.Get<string>("company"));
            output.Set<string>(2, input.Get<string>("city"));
            output.Set<string>(3, input.Get<string>("state"));
            output.Set<string>(4, input.Get<string>("country"));
            output.Set<string>(6, input.Get<string>("date"));
            output.Set<string>(8, input.Get<string>("url"));
            output.Set<float>(9, input.Get<float>("latitude"));
            output.Set<float>(10, input.Get<float>("longitude"));
            output.Set<string>(16, input.Get<string>("jobDescription"));
            output.Set<string>(17, input.Get<string>("salaryRate"));
            output.Set<string>(18, input.Get<string>("jobType"));

            return output.AsReadOnly();
        }
	// Helper method to tally frequency of words for a given text field
private Dictionary<string, int> GetMostFrequentWords(string text)
        {
            Dictionary<string, int> wordCount = new Dictionary<string, int>();
            Dictionary<string, int> stopWords = new Dictionary<string, int>() {
                { "a", 0 }, { "able", 0 }, { "about", 0 }, …<omitted for brevity> …
            };

            char[] delimiters = new char[] { ' ', '\r', '\n', ',', '.', ':' };
            string[] words = text.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
            string w;

            foreach (string word in words)
            {
                w = word.ToLower();
                if (stopWords.ContainsKey(w))
                    continue;

                if (wordCount.ContainsKey(w))
                    wordCount[w]++;
                else
                    wordCount.Add(w, 1);
            }

            return wordCount;
        }

To summarize the logic:

  1. Read each row
    1. Get the jobDescription field value
    2. Call custom GetMostFrequentWords to return a dictionary of word and its frequency
      1. GetMostFrequentWords simply keeps a tally of each word encountered for that row but ignores a predefined static list of stop words. The list of stop words is from an arbitrary internet source.
      2. Sort by descending the tally of frequent words
      3. Construct a text string of 3 key/value pairs of the top most frequent words.
      4. Output this text string into a new field called jobDescriptionTopWords

After running the u-sql script and doing a file preview, the following shows the jobDescriptionTopWords field and the resulting text string to show the top 3 words.

Azure Data Lake Analytics- U-SQL C# Programmability - 2     Azure Data Lake Analytics- U-SQL C# Programmability - 3

So, going back to my hypothesis of whether displaying most frequent words gives an indication of the type of job, I think there is still some ambiguity. For example, the first row had most frequent words of visit, flexible, anytime. I wouldn’t be able to guess what type of job this would be. The associated job title is Merchandiser. Looking at the row with words heat, machine and sewing, I would have guessed that this job did indeed involve operating a sewing machine. My conclusion is that top word counting in a text field with many words can help, but not significantly. To be more productive, use other text analysis services to extract more meaning. In this article, I have shown the use of U-SQL C# programmability with using the user-defined processor to count the frequency of words in a row’s text field.


Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s