Objective: Load a CSV file into an internal and an external Hive table in HDInsight.
See my blog post on creating Hive tables, Creating Internal and External Hive Tables in HDInsight.
I obtained a 1.4 GB CSV file of US city crime data from https://catalog.data.gov/dataset/crimes-2001-to-present-398a4
My HDInsight cluster is configured to use Azure Data Lake store as its primary data storage.
- Go to Azure Data Lake Store > Data Explorer
- Upload the CSV file to two separate locations: adl://rkdatalake.azuredatalakestore.net/datasets for the internal Hive table, and adl://rkdatalake.azuredatalakestore.net/datasets/crimes for the external Hive table. I will explain why below.
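For context, the two table definitions from the earlier post look roughly like this (a sketch: the column list below is an illustrative subset, not the dataset's full schema, and the delimiter settings assume a plain comma-separated file):

```sql
-- Internal (managed) table: Hive owns the data under hive/warehouse.
CREATE TABLE crimes (
    id INT,
    case_number STRING,
    crime_date STRING,
    primary_type STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- External table: Hive tracks only metadata; the data stays in place.
CREATE EXTERNAL TABLE crimes_ext (
    id INT,
    case_number STRING,
    crime_date STRING,
    primary_type STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'adl://rkdatalake.azuredatalakestore.net/datasets/crimes/';
```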
- In Visual Studio with Azure Data Lake Tools plugin, open an existing or new Hive project.
- Execute the load script:
LOAD DATA INPATH 'adl://rkdatalake.azuredatalakestore.net/datasets/Crimes_-_2001_to_present.csv' OVERWRITE INTO TABLE crimes;
- Query the crimes table to verify the data load.
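A simple count is a quick sanity check (a sketch; the expected row count depends on the version of the file you downloaded):

```sql
-- Verify the load: the managed table should now return rows.
SELECT COUNT(*) FROM crimes;
SELECT * FROM crimes LIMIT 10;
```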
- Go back to adl://rkdatalake.azuredatalakestore.net/datasets and you will notice that the CSV file is no longer there. LOAD DATA INPATH moves (rather than copies) the file into the table's directory under hive/warehouse.
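You can confirm the new location from Hive itself: DESCRIBE FORMATTED prints the table's Location property, which should now point under the cluster's hive/warehouse directory (the exact path varies by cluster configuration):

```sql
-- Inspect the managed table's metadata, including its storage location.
DESCRIBE FORMATTED crimes;
```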
- Execute the script to set the external table's data location:
ALTER TABLE crimes_ext SET LOCATION 'adl://rkdatalake.azuredatalakestore.net/datasets/crimes/';
I found that the location must be a folder containing the CSV file(s), not the path to a specific CSV file; that is why the second copy was uploaded into its own crimes folder.
- Execute a script to query the external table:
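For example (a sketch; any query works here, the point being that Hive reads the CSV in place from the /datasets/crimes folder):

```sql
-- Query the external table; the file is read from its external location.
SELECT COUNT(*) FROM crimes_ext;
SELECT primary_type, COUNT(*) AS n
FROM crimes_ext
GROUP BY primary_type
ORDER BY n DESC
LIMIT 10;
```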
- In contrast to the internal crimes table, the crimes_ext directory under hive/warehouse does not contain any data files. This confirms the by-design behavior of an external table: the data remains at its external location.
I have demonstrated the differences in populating data for an internal and an external Hive table. For a Q&A discussion of the differences, read http://stackoverflow.com/questions/17038414/difference-between-hive-internal-tables-and-external-tables.
In a follow-up article, The Effects of Dropping Hive Tables, I will show what happens when each type of Hive table is dropped.