I will demonstrate a U-SQL script I wrote to identify duplicate rows by a field using the windows functions. The scenario is where by ingesting data from sources, you may get duplicate rows where by you must resort to reactively de-duplicate. My technique will involve U-SQL windows functions.
Window functions are used to do computation within sets of rows called windows. Windows are defined in the OVER clause. Window functions solve some key scenarios in a highly efficient manner.
Read Using U-SQL window functions for Azure Data Lake Analytics jobs for further details and sample code.
The following script will identify a set of rows or a “window” that have the same jobkey unique identifier. This will indicate duplicates that occurred in the data ingestion process.
Also, I will compare and contrast with the Group By clause to group by the jobkey field to identify the number of duplicates.
// Read from jobpostings data file @jobPostings = EXTRACT jobtitle string, company string, city string, jobkey string FROM @"/jobpostings/outputtsv/v3/JobsData.tsv" USING Extractors.Tsv(); // Query from jobpostings data // Set ROW_NUMBER() of each row within the window partitioned by jobkey field @jobPostingsDuplicates = SELECT ROW_NUMBER() OVER(PARTITION BY jobkey) AS RowNum, jobkey AS JobKey FROM @jobPostings; // ORDER BY jobkey to see duplicate rows next to one another @jobPostingsDuplicates2 = SELECT * FROM @jobPostingsDuplicates ORDER BY JobKey OFFSET 0 ROWS; // Write to file OUTPUT @jobPostingsDuplicates2 TO "/jobpostings/outputtsv/v3/JobsOver-Dups.tsv" USING Outputters.Tsv(); // Group By and count # of duplicates per jobkey @groupBy = SELECT jobkey, COUNT(jobkey) AS jobKeyCount FROM @jobPostings GROUP BY jobkey ORDER BY jobkey OFFSET 0 ROWS; // Write to file OUTPUT @groupBy TO "/jobpostings/outputtsv/JobPostingsGroupBy-Dups.tsv" USING Outputters.Tsv();
Upon clicking Submit, we see the U-SQL Job execution graph
For the query involving windows function, we can see a preview of the file output identifying a record with one duplicate as identified by row number 2. The set of rows or ‘window’ is set a partition of the jobkey.
For the query with the Group By clause, we can compare the same duplicate in the previous file with job key ‘00052e6ad78510f3’. The difference with the Group By is that it is an aggregation.
If to identify and remove the duplicate rows, then I would use the windows functions and set a field by its ROW_NUMBER within the partitioned window. I could then remove duplicates row with a filter as such:
@jobPostingsDuplicates3 = SELECT * FROM @jobPostingsDuplicates2 WHERE RowNum == 1; OUTPUT @jobPostingsDuplicates3 TO "/jobpostings/outputtsv/v3/JobsOver-deduped.tsv" USING Outputters.Tsv();
If to only count the occurrence of each jobkey, then I would use the Group By.
In conclusion, this is a way to manage duplicates using windows functions.