HIVE

Analysing Tez Jobs

When you run a Hive query using the Tez engine, a few ways to see the DAG are listed below.
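For example, one quick option (a minimal sketch, assuming the Tez execution engine is enabled for the session; the table and column names are placeholders) is to run EXPLAIN on the query – the plan output lists the Tez vertices and edges that make up the DAG:

    -- Make sure the session uses Tez (it may already be the default on your cluster)
    SET hive.execution.engine=tez;

    -- EXPLAIN prints the query plan, including the Tez DAG (vertices and edges)
    EXPLAIN
    SELECT order_date, COUNT(*) AS cnt
    FROM   orders
    GROUP BY order_date;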

 

HDFS, HIVE

HIVE Data-Copy – Import/Export

HIVE Data Copy:

 

With Hive, one good option for copying both data and metadata from cluster A to cluster B is Hive Import/Export.

 

NOTE: In the Hive CLI, exporting data directly to the target cluster works fine because the path can be resolved, but Beeline will not allow copying the data directly to the target cluster. To do it via Beeline, export the table to HDFS on the local (source) cluster, copy the data to the target cluster, then connect to the target HiveServer2 and run IMPORT to load the data.

 

  • EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])]
    TO 'export_target_path' [ FOR replication('eventid') ]

  • IMPORT [[EXTERNAL] TABLE new_or_original_tablename [PARTITION (part_column="value"[, ...])]]
    FROM 'source_path'
    [LOCATION 'import_target_path']
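As a rough end-to-end sketch (the table name, partition value, and paths below are placeholders, and the copy step assumes you have cross-cluster HDFS access):

    -- On the source cluster, via Beeline connected to the source HiveServer2:
    EXPORT TABLE orders PARTITION (order_date='2017-01-01')
    TO '/tmp/export/orders';

    -- Copy /tmp/export/orders to the same path on the target cluster's HDFS,
    -- for example with:
    --   hadoop distcp hdfs://sourceNN/tmp/export/orders hdfs://targetNN/tmp/export/orders

    -- On the target cluster, via Beeline connected to the target HiveServer2:
    IMPORT TABLE orders PARTITION (order_date='2017-01-01')
    FROM '/tmp/export/orders';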

 

 

Reference:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport

HIVE

Query JDBC datasource Directly from HIVE


There are many options for retrieving data from an RDBMS, and when we receive it as a dump we end up validating the data every time. For a use case where the source can give you access to fetch the full data, you can use the Hive JDBC storage handler to query the JDBC RDBMS data source directly, as in the sketch below.
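A minimal sketch, assuming the upstream JdbcStorageHandler that grew out of HIVE-1555 (the forks linked below use different class names and table properties, so check the README of whichever build you deploy); the connection details, credentials, and query are placeholders:

    CREATE EXTERNAL TABLE customers_jdbc (
      id     INT,
      name   STRING
    )
    STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
    TBLPROPERTIES (
      "hive.sql.database.type" = "MYSQL",
      "hive.sql.jdbc.driver"   = "com.mysql.jdbc.Driver",
      "hive.sql.jdbc.url"      = "jdbc:mysql://dbhost:3306/shop",
      "hive.sql.dbcp.username" = "hive_reader",
      "hive.sql.dbcp.password" = "secret",
      "hive.sql.query"         = "SELECT id, name FROM customers"
    );

    -- The table can now be queried (and joined with native Hive tables) directly:
    SELECT * FROM customers_jdbc LIMIT 10;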

Below are some links for further reading.

 

Official:

https://issues.apache.org/jira/browse/HIVE-1555

https://community.hortonworks.com/articles/4671/sparksql-jdbc-federation.html

 

Forks:

 

https://github.com/qubole/Hive-JDBC-Storage-Handler

 

https://github.com/myui/HiveJdbcStorageHandler

 

https://github.com/QubitProducts/hive-jdbc-storage-handler

 

HIVE

Basic HQL CheckList

We can start with a simple one – these are general things to check while writing HQL.

 

General HIVE Query:

  • Avoid inline queries in the WHERE clause.
  • Make all joins explicit in the FROM clause instead of the WHERE clause.
  • FULL OUTER JOINs are very expensive in Hive.
  • If possible, generate statistics for each table before running the query.
  • Order the joins from smaller to larger tables.
  • Filter the partition range of any partitioned table before joining it.
  • When joining small tables (up to about 1 GB), add a MAPJOIN hint for the smaller table.
  • If a filter should be applied as part of a join, put it in the join condition in the FROM clause instead of in the WHERE clause (see the sketch after this list).
  • Join on the key columns.
  • If a query has multiple layers of inner queries and is reused by many other queries to arrive at a result, create a physical table that can be refreshed and queried as and when needed.
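A short sketch pulling a few of these together (table names, columns, and the partition range are placeholders; the MAPJOIN hint assumes the dimension table is small enough to fit in memory):

    -- Gather statistics before running the query
    ANALYZE TABLE sales_fact PARTITION (sale_date) COMPUTE STATISTICS;
    ANALYZE TABLE store_dim COMPUTE STATISTICS;

    -- Explicit join in the FROM clause, join filter in the ON condition,
    -- partition range filtered, and a MAPJOIN hint for the small table
    SELECT /*+ MAPJOIN(d) */
           d.region,
           SUM(f.amount) AS total_amount
    FROM   sales_fact f
    JOIN   store_dim d
      ON   f.store_id = d.store_id
     AND   d.is_active = 'Y'
    WHERE  f.sale_date BETWEEN '2017-01-01' AND '2017-01-31'
    GROUP BY d.region;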