Getting Started With Hive

Akshaj Verma
Towards Data Science
6 min read · Jun 12, 2018

[Image [0] (Image courtesy: https://wall.alphacoders.com/big.php?i=380565)]

The aim of this blog post is to help you get started with Hive using Cloudera Manager. Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

If you are interested in R programming, you can check out my other articles - Exploratory Analysis of FIFA 18 dataset using R and GgPlot ’Em All | Pokemon on R.

We’ll get started with Shakespeare first and understand the workflow by implementing a simple word-count program. Then we’ll move on to the MovieLens data and execute queries using HQL.

Fire up your Cloudera VM and go to your-vm-ip:7180 in a browser.

Cloudera Manager Console [Image [1]]

We will use HDFS, Hive, and YARN, so make sure they are running.

Now, we have to create a directory to store the input file on which we will run the wordcount program.

# to view the contents of the root directory on HDFS
hdfs dfs -ls /
# create a new directory named 'input' in the HDFS
hdfs dfs -mkdir /input

You will see that permission is denied because, on HDFS, the hdfs user (not root) holds the superuser privileges. We’ll sort this out below.

Permission denied [Image [2]]
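One way around this, for now, is to switch to the hdfs user, which acts as the HDFS superuser on the Cloudera VM, and create the directory from there. A minimal sketch:

# become the HDFS superuser
su - hdfs
# the directory can now be created
hdfs dfs -mkdir /input
# back to the root shell
exit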

Now, download the text file on which to run the word count. We’ll use Shakespeare’s works. Download and unpack the archive; there are five files inside it. We’ll use the comedies file today.

Shakespeare [Image [3]]
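If you don’t have the archive yet, it can be fetched and unpacked along these lines. This is only a sketch: the URL is a placeholder for wherever you download the archive from, and it assumes you run it from /root and that the archive unpacks into a shakespeare directory.

# download the archive (substitute the actual URL you are using)
wget -O shakespeare.tar.gz <url-of-the-shakespeare-archive>
# unpack it; assumed to produce a shakespeare/ directory
tar -xzf shakespeare.tar.gz
# list the five files, including the comedies file
ls shakespeare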

We now have the comedies file in our local file system and need to transfer it into HDFS under the /input folder. Before that, we need to give the root user permission to add files to /input, so switch to the hdfs user once more and change the ownership of the directory.

# As the hdfs user, make root the owner of the /input directory
hdfs dfs -chown root:root /input

hdfs dfs -ls /

The root user now owns the /input folder and can write to it.

Go back to being the root user.

exit
Chown [Image [4]]
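With ownership sorted out, the comedies file can be copied from the local file system into /input. A minimal sketch, assuming the unpacked files sit in /root/shakespeare:

# copy the comedies file from the local file system into HDFS
hdfs dfs -put /root/shakespeare/comedies /input
# check that the file is there
hdfs dfs -ls /input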

Write an SQL file to perform the wordcount. Save it as wordcount.sql.

DROP DATABASE IF EXISTS documents CASCADE;
CREATE DATABASE documents;
USE documents;
CREATE TABLE docs (words STRING);
LOAD DATA LOCAL INPATH '/root/shakespeare' INTO TABLE docs;
CREATE TABLE word_count AS
SELECT word, count(*) AS count FROM
(SELECT explode(split(words, '\\W+')) AS word FROM docs) w
GROUP BY word;
SELECT * FROM word_count LIMIT 100;
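The inner query splits each line on non-word characters and explode() turns the resulting array into one row per word; the outer query then groups those rows and counts them. If you would rather see the most frequent words than an arbitrary 100 rows, you could sort the result, for example (a small variation, not part of the script above):

SELECT * FROM word_count ORDER BY `count` DESC LIMIT 20;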

We need read and write permissions on the /shakespeare directory to run the wordcount.sql program. Become the hdfs user and enter the following commands.

su - hdfs     # Change user to hdfs from root
hdfs dfs -chmod -R 777 /shakespeare
exit

Now, as a root user, run the following:

hive -f wordcount.sql
Execute Word Count program [Image [5]]

You should see an output like the following.

Output [Image [6]]

Next up, we’ll try to run Hive queries on the MovieLens dataset.

Download the MovieLens 1M dataset and unpack it. When you look at the first few lines, you’ll notice that the files are delimited by ::. Change the delimiter to , (CSV).

sed -i 's/::/,/g' ml-1m/movies.dat
sed -i 's/::/,/g' ml-1m/users.dat
sed -i 's/::/,/g' ml-1m/ratings.dat
# Rename the files from .dat to .csv. Although, it doesn't really
# matter. You can leave the files as .dat. I have OCD.
mv ml-1m/movies.dat ml-1m/movies.csv
mv ml-1m/ratings.dat ml-1m/ratings.csv
mv ml-1m/users.dat ml-1m/users.csv
MovieLens data [Image [7]]

Create a folder called movielens in HDFS and move the files into it.
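A minimal sketch of that step, assuming the CSV files are still under ml-1m in your current local directory (run it as the hdfs user, or chown the directory as we did for /input, to avoid the permission issue):

# create the target directory in HDFS
hdfs dfs -mkdir -p /movielens/ml-1m
# copy the three CSV files into it
hdfs dfs -put ml-1m/movies.csv ml-1m/ratings.csv ml-1m/users.csv /movielens/ml-1m
# verify
hdfs dfs -ls /movielens/ml-1m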

File in hdfs [Image [8]]

As the hdfs user, create the following files: movies.sql, ratings.sql, and users.sql.

su - hdfs
nano movies.sql
nano ratings.sql
nano users.sql

To view the data in the movies.csv file, copy the following code into the movies.sql file and run the command hive -f movies.sql.

DROP DATABASE IF EXISTS movielens CASCADE;
CREATE DATABASE movielens;
USE movielens;
CREATE EXTERNAL TABLE movies (MovieID INT,
Title VARCHAR(60),
Genres VARCHAR(60))
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY "\n"
STORED AS TEXTFILE
LOCATION '/movielens/ml-1m/mvs.txt';
LOAD DATA INPATH '/movielens/ml-1m/movies.csv' INTO TABLE movies;
SELECT * FROM movies LIMIT 10;
First 10 entries in the movies.csv [Image [9]]

To view the data in the ratings.csv file, copy the following code into the ratings.sql file and run the command hive -f ratings.sql.

DROP DATABASE IF EXISTS movielens CASCADE;
CREATE DATABASE movielens;
USE movielens;
CREATE EXTERNAL TABLE ratings (UserID INT,
MovieID INT,
Rating INT,
Timestamp STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY "\n"
STORED AS TEXTFILE
LOCATION '/movielens/ml-1m/rts.txt';
LOAD DATA INPATH '/movielens/ml-1m/ratings.csv' INTO TABLE ratings;
SELECT * FROM ratings LIMIT 10;
First 10 entries in ratings.csv [Image [10]]

To view the data in the users.csv file, copy the following code into the users.sql file and run the command hive -f users.sql.

DROP DATABASE IF EXISTS movielens CASCADE;
CREATE DATABASE movielens;
USE movielens;
CREATE EXTERNAL TABLE users (UserID INT,
Gender STRING,
Age INT,
Occupation INT,
ZIP INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY "\n"
STORED AS TEXTFILE
LOCATION '/movielens/ml-1m/usr.txt';
LOAD DATA INPATH '/movielens/ml-1m/users.csv' INTO TABLE users;
SELECT * FROM users LIMIT 10;
First 10 entries in users.csv [Image [11]]

Next, the top 10 most-viewed movies.

DROP DATABASE IF EXISTS movielens CASCADE;
CREATE DATABASE movielens;
USE movielens;
CREATE EXTERNAL TABLE movies (MovieID INT,
Title VARCHAR(60),
Genres VARCHAR(60))
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY "\n";
CREATE EXTERNAL TABLE ratings (UserID INT,
MovieID INT,
Rating INT,
Timestamp STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY "\n";
LOAD DATA INPATH '/movielens/ml-1m/movies.csv'
OVERWRITE INTO TABLE movies;
LOAD DATA INPATH '/movielens/ml-1m/ratings.csv'
OVERWRITE INTO TABLE ratings;
SELECT movies.MovieID, movies.Title, COUNT(DISTINCT ratings.UserID) AS views
FROM movies JOIN ratings ON (movies.MovieID = ratings.MovieID)
GROUP BY movies.MovieID, movies.Title
ORDER BY views DESC
LIMIT 10;
Top 10 viewed movies [Image [12]]

Finally, the top 20 highest-rated movies with at least 40 views.

DROP DATABASE IF EXISTS movielens CASCADE;
CREATE DATABASE movielens;
USE movielens;
CREATE EXTERNAL TABLE movies (MovieID INT,
Title VARCHAR(60),
Genres VARCHAR(60))
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY "\n";
CREATE EXTERNAL TABLE ratings (UserID INT,
MovieID INT,
Rating INT,
Timestamp STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY "\n";
LOAD DATA INPATH '/movielens/ml-1m/movies.csv' OVERWRITE INTO TABLE movies;
LOAD DATA INPATH '/movielens/ml-1m/ratings.csv' OVERWRITE INTO TABLE ratings;
SELECT movies.MovieID, movies.Title, AVG(ratings.Rating) AS rtg, COUNT(DISTINCT ratings.UserID) AS views
FROM movies JOIN ratings ON (movies.MovieID = ratings.MovieID)
GROUP BY movies.MovieID, movies.Title
HAVING views >= 40
ORDER BY rtg DESC
LIMIT 20;
Top 20 rated movies. [Image [13]]

Thank you for reading. I hope you found this useful for getting started with Hive. Suggestions and constructive criticism are welcome. :) You can find me on LinkedIn and Twitter.

Check out my other blogs here.
