Creating Internal and External Hive Tables in HDInsight

Objective: Create an internal and an external hive tables in HDInsight. Based on the schema of a CSV file on US city crime. https://catalog.data.gov/dataset/crimes-2001-to-present-398a4

Building Hive tables establishes a schema on the flat files that I have stored in Azure Data Lake Store. This will allow me to do SQL like queries with HiveQL on that data. In other words, I have a data warehouse in Hive. and can be the basis of building big data analytical applications.
For an overview of Hive, read https://hortonworks.com/apache/hive/

  1. Open Visual Studio with Data Lake Tools plugin.
  2. Create New Project
    Creating Internal and External Hive Tables in HDInsight-1
  3. In Server Explorer, ensure you are connected to an HDInsight cluster
    CreateTables0
  4. In Solution Explorer, create a new HiveQL script to create tables
    Creating Internal and External Hive Tables in HDInsight-3
  5. Create Database
    CREATE DATABASE IF NOT EXISTS USData;
    
    use USData;
    
  6. Create internal table
    CREATE TABLE IF NOT EXISTS Crimes
    (
            ID INT ,
            CaseNumber STRING,
            CrimeDate DATETIME,
            Block STRING,
            IUCR INT,
            PrimaryType STRING,
            Description STRING,
            LocationDescription STRING,
            Arrest BOOLEAN,
            Domestic BOOLEAN,
            Beat INT,
            District INT,
            Ward INT,
            CommunityArea INT,
            FBICode INT,
            XCoord INT,
            YCoord INT,
            Year INT,
            UpdatedOn DATETIME,
            Latitude FLOAT,
            Longitude FLOAT,
            CrimeLocation STRING
    )
    FIELDS TERMINATED by ',' stored as textfile
    tblproperties ("skip.header.line.count"="1");
    
  7. Create external table
    CREATE EXTERNAL TABLE IF NOT EXISTS Crimes_EXT
    (
            ID INT,
            CaseNumber STRING,
            CrimeDate DATE,
            Block STRING,
            IUCR INT,
            PrimaryType STRING,
            Description STRING,
            LocationDescription STRING,
            Arrest BOOLEAN,
            Domestic BOOLEAN,
            Beat INT,
            District INT,
            Ward INT,
            CommunityArea INT,
            FBICode INT,
            XCoord INT,
            YCoord INT,
            Year INT,
            UpdatedOn DATE,
            Latitude FLOAT,
            Longitude FLOAT,
            CrimeLocation STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED by ',' stored as textfile
    tblproperties ("skip.header.line.count"="1");
    
  8. Submitting one of the scripts
    Creating Internal and External Hive Tables in HDInsight-4
  9. As a result, crimes and crimes_ext tables are created.
    Creating Internal and External Hive Tables in HDInsight-5
  10. Expanding the tables, shows the fields Internal/managed vs External Hive Tables
    Creating Internal and External Hive Tables in HDInsight-6


For further details to create tables, read https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ManagedandExternalTables

To populate these hive tables with data, read further to my article Populating Data into Hive Tables in HDInsight.

Advertisements

One thought on “Creating Internal and External Hive Tables in HDInsight

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s