Azure Data Lake Analytics: How To Extract JSON Files

The ability to read files in Azure Data Lake Analytics depends on U-SQL's built-in extractors:

  • Extractors.Text() : Provides extraction from delimited text files of different encodings.
  • Extractors.Csv() : Provides extraction from comma-separated value (CSV) files of different encodings.
  • Extractors.Tsv() : Provides extraction from tab-separated value (TSV) files of different encodings.
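
For files that use a delimiter other than comma or tab, Extractors.Text() accepts a delimiter parameter. A minimal sketch (the file path and column names below are assumptions, not from this post):

```
// Sketch only: /MyData/sample.txt and its columns are hypothetical.
// Extractors.Text() takes a delimiter parameter for custom-delimited files.
@rows =
    EXTRACT id int,
            name string
    FROM "/MyData/sample.txt"
    USING Extractors.Text(delimiter : '|', skipFirstNRows : 1);
```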

An example from my previous blog post using the CSV Extractor:

@csvData =
    EXTRACT policyID int,
        statecode string,
        county string,
        eq_site_limit float,
        hu_site_limit float,
        fl_site_limit float,
        fr_site_limit float,
        tiv_2011 float,
        tiv_2012 float,
        eq_site_deductible float,
        hu_site_deductible float,
        fl_site_deductible float,
        fr_site_deductible float,
        point_latitude float,
        point_longitude float,
        line string,
        construction string,
        point_granularity int?
    FROM "/MyData/FL_insurance_sample.csv"
    USING Extractors.Csv(skipFirstNRows:1);

I need to work with JSON files; however, there is no built-in support for JSON, which I found surprising given how much internet data arrives as JSON from APIs these days. A built-in XML extractor would be welcome as well.

To solve this challenge, I found a solution in the following GitHub repo, which provides extractors for both JSON and XML. Its README includes instructions on how to register the assemblies, along with sample scripts.
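
As a hedged sketch of the registration step (the DLL paths under /Assemblies are assumptions; the database name comes from later in this post), the two assemblies are registered with CREATE ASSEMBLY after uploading the DLLs to the store:

```
USE DATABASE JobPostingsBigData;

// Paths are assumptions: upload the two sample DLLs first, then register them.
// The formats assembly depends on Newtonsoft.Json, so both are needed.
CREATE ASSEMBLY IF NOT EXISTS [Newtonsoft.Json]
    FROM "/Assemblies/Newtonsoft.Json.dll";
CREATE ASSEMBLY IF NOT EXISTS [Microsoft.Analytics.Samples.Formats]
    FROM "/Assemblies/Microsoft.Analytics.Samples.Formats.dll";
```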

This is how it looks after I registered the two assemblies:

[Screenshot 1: the two registered assemblies]

The U-SQL script:

REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats]; 

DECLARE @inputfile string="/jobpostings/postings-{*}.json";

@jobPostingsSchema =
EXTRACT jobtitle string,
        company string,
        city string,
        state string,
        country string,
        date string,
        snippet string,
        url string,
        latitude float,
        longitude float
FROM @inputfile
USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor("Results[*]");

OUTPUT @jobPostingsSchema
TO "/jobpostings/JSONoutput.csv"
USING Outputters.Csv();

The REFERENCE ASSEMBLY statement uses the assemblies registered in the JobPostingsBigData database, so ensure the script runs in that database.

The extractor reads every file that matches the file-set pattern /jobpostings/postings-{*}.json; my file names end with a date.
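
Since the file names end with a date, a hedged variant (the {filedate} token and the exact file naming are assumptions) could surface that date as a U-SQL virtual column:

```
// Sketch only: assumes file names like postings-20170115.json.
DECLARE @inputfile string = "/jobpostings/postings-{filedate}.json";

@jobPostings =
    EXTRACT jobtitle string,
            company string,
            filedate string   // virtual column taken from the file name
    FROM @inputfile
    USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor("Results[*]");
```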

The JSON file structure (abridged):

{
  "TotalResults": 41,
  "PageNumber": 0,
  "Results": [
    {
      "jobtitle": "Junior Web Developer",
      "company": "Company X",
      "city": "Calgary",
      "state": "AB",
      "country": "CA"
    }
  ]
}

I am only interested in extracting the Results collection, so I pass the JSONPath expression "Results[*]" to the extractor: new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor("Results[*]");
This extracts each record in that collection.

I can then take the extracted rowset in the @jobPostingsSchema variable and write it to a flat file or to a table. In this script, I chose to store it in a CSV file.
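
As a hedged alternative (the table name, index, and distribution key below are assumptions), the same rowset could instead be persisted to a managed U-SQL table with CREATE TABLE ... AS SELECT:

```
// Sketch only: table and key choices are hypothetical.
CREATE TABLE IF NOT EXISTS JobPostingsBigData.dbo.JobPostings
(
    INDEX idx_jobtitle CLUSTERED (jobtitle)
    DISTRIBUTED BY HASH (jobtitle)
) AS
SELECT * FROM @jobPostingsSchema;
```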

Upon submitting the script, the following is the file preview of the output flat file:

[Screenshot 2: preview of the output CSV file]

The JSON extractor is especially applicable when retrieving data from REST APIs, where JSON has become the modern standard. It would be a good idea for it to become built in as part of a richer set of capabilities.

