Azure Data Lake Analytics: Finding Duplicates With U-SQL Windows Functions

I will demonstrate a U-SQL script I wrote to identify duplicate rows by a field using the windows functions. The scenario is where by ingesting data from sources, you may get duplicate rows where by you must resort to reactively de-duplicate. My technique will involve U-SQL windows functions.

Window functions are used to do computation within sets of rows called windows. Windows are defined in the OVER clause. Window functions solve some key scenarios in a highly efficient manner.

Read Using U-SQL window functions for Azure Data Lake Analytics jobs for further details and sample code.

The following script will identify a set of rows or a “window” that have the same jobkey unique identifier. This will indicate duplicates that occurred in the data ingestion process.

Also, I will compare and contrast with the Group By clause to group by the jobkey field to identify the number of duplicates.

// Read from jobpostings data file
@jobPostings =
    EXTRACT jobtitle string,
        company string,
        city string,
        jobkey string
  FROM @"/jobpostings/outputtsv/v3/JobsData.tsv"
    USING Extractors.Tsv();

// Query from jobpostings data
// Set ROW_NUMBER() of each row within the window partitioned by jobkey field
@jobPostingsDuplicates =
    SELECT ROW_NUMBER() OVER(PARTITION BY jobkey) AS RowNum,
           jobkey AS JobKey
    FROM @jobPostings;

// ORDER BY jobkey to see duplicate rows next to one another
@jobPostingsDuplicates2 =
    SELECT *
    FROM @jobPostingsDuplicates
    ORDER BY JobKey
    OFFSET 0 ROWS;

// Write to file
OUTPUT @jobPostingsDuplicates2
TO "/jobpostings/outputtsv/v3/JobsOver-Dups.tsv"
USING Outputters.Tsv();

// Group By and count # of duplicates per jobkey
@groupBy = SELECT jobkey, COUNT(jobkey) AS jobKeyCount
FROM @jobPostings
GROUP BY jobkey
ORDER BY jobkey
OFFSET 0 ROWS;

// Write to file
OUTPUT @groupBy
TO "/jobpostings/outputtsv/JobPostingsGroupBy-Dups.tsv"
USING Outputters.Tsv();

Upon clicking Submit, we see the U-SQL Job execution graph
Azure Data Lake Analytics- Finding Duplicates With U-SQL Windows Functions-1

For the query involving windows function, we can see a preview of the file output identifying a record with one duplicate as identified by row number 2. The set of rows or ‘window’ is set a partition of the jobkey.
Azure Data Lake Analytics- Finding Duplicates With U-SQL Windows Functions-2

For the query with the Group By clause, we can compare the same duplicate in the previous file with job key ‘00052e6ad78510f3’. The difference with the Group By is that it is an aggregation.
Azure Data Lake Analytics- Finding Duplicates With U-SQL Windows Functions-3

If to identify and remove the duplicate rows, then I would use the windows functions and set a field by its ROW_NUMBER within the partitioned window. I could then remove duplicates row with a filter as such:

@jobPostingsDuplicates3 =
    SELECT *
    FROM @jobPostingsDuplicates2
    WHERE RowNum == 1;

OUTPUT @jobPostingsDuplicates3
TO "/jobpostings/outputtsv/v3/JobsOver-deduped.tsv"
USING Outputters.Tsv();

If to only count the occurrence of each jobkey, then I would use the Group By.

In conclusion, this is a way to manage duplicates using windows functions.


Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s