Importing Hive Tables

This guide shows you how to import Hive tables stored on S3 into Spark SQL.

Step 1: Show the CREATE TABLE statement

Issue a SHOW CREATE TABLE <your_table> command on your Hive command line. The example below is for a table whose data is stored in S3 and mounted via the Databricks File System (DBFS).

hive> SHOW CREATE TABLE wikicc;
OK
CREATE  TABLE `wikicc`(
  `country` string,
  `count` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  '/user/hive/warehouse/wikicc'
TBLPROPERTIES (
  'totalSize'='2335',
  'numRows'='240',
  'rawDataSize'='2095',
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='1',
  'transient_lastDdlTime'='1418173653')
Time taken: 0.03 seconds, Fetched: 18 row(s)

Step 2: Issue a CREATE EXTERNAL TABLE statement

If the statement returned in Step 1 is a plain CREATE TABLE command, copy it and add the EXTERNAL keyword, then drop the existing table and re-create it with the modified statement, as shown below.

  • You can omit the TBLPROPERTIES clause.
  • EXTERNAL ensures that Spark SQL does not delete the underlying data if you drop the table; only the table metadata is removed.

DROP TABLE wikicc
-- Notice the addition of the EXTERNAL keyword.
CREATE EXTERNAL TABLE `wikicc`(
  `country` string,
  `count` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  '/user/hive/warehouse/wikicc'
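
If you want to confirm that the re-created table is now external, DESCRIBE FORMATTED (supported by both Hive and Spark SQL) reports the table type and storage location; the exact layout of its output varies by version.

-- Optional sanity check: the table type should now be reported as external,
-- and the location should still point at the original data path.
DESCRIBE FORMATTED wikicc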

Step 3: Query the table

Now issue SQL commands on your data as usual.

SELECT * FROM wikicc
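
Any query that works on a regular table works here too. For example, the following query (illustrative only, assuming the country/count schema shown above) returns the ten rows with the largest counts; count is backquoted because it is also a SQL keyword.

-- Example only: the ten rows with the highest count values.
SELECT country, `count`
FROM wikicc
ORDER BY `count` DESC
LIMIT 10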