Getting Started With Hive
The aim of this blog post is to help you get started with Hive using Cloudera Manager. Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
If you are interested in R programming, you can check out my other articles - Exploratory Analysis of FIFA 18 dataset using R and GgPlot ’Em All | Pokemon on R.
We’ll get started with Shakespeare first and understand the work-flow by implementing a simple word-count program. Then we’ll move on to the MovieLens data and execute queries using HQL.
Fire up your Cloudera VM. Go to your ip:7180.
We will use Hdfs, Hive, and Yarn. So, make sure they are running.
Now, we have to create a directory to store the input file on which we will run the wordcount program.
# to view the contents of the root directory on HDFS
hdfs dfs -ls /# create a new directory named 'input' in the HDFS
hdfs dfs -mkdir /input
You will see that permission is denied as the hdfs user holds the super user(root) privileges. We’ll see how to sort this later in this article.
Now, download the the text file on which to run the word count. We’ll use Shakespeare’s work for it. Download and unpack the file. There are five files inside it. We’ll use the comedies file today.
We have the comedies file in our local file system. We need to transfer it into the hdfs inside the /input
folder. Before that, we need to give our root user permission to add a file to the /input
folder.
# Give the su permission to root-user
hdfs dfs -chown root:root /input
hdfs dfs -ls /
The root-user now has super user privileges over the /input
folder.
Go back to being the root user.
exit
Write an SQL file to perform the wordcount. Save it as wordcount.sql
.
DROP DATABASE IF EXISTS documents CASCADE;CREATE DATABASE documents;USE documents;CREATE TABLE docs(words string);LOAD DATA LOCAL INPATH '/root/shakespeare' INTO TABLE docs;CREATE TABLE word_count AS
SELECT word, count(*) AS count FROM
(SELECT explode(split(words, '\\W+')) AS word FROM docs) w
GROUP BY word;SELECT * FROM word_count limit 100;
We need to read write permission to the /shakespeare
directory to run the wordcount.sql
program. Become an hdfs user and enter the following commands.
su - hdfs # Change user to hdfs from roothdfs dfs -chmod -R 777 /shakespeareexit
Now, as a root user, run the following:
hive -f wordcount.sql
You should see an output like the following.
Next up, we’ll try to run Hive queries on the MovieLens dataset.
Download the file. When you see the first few lines, you’ll notice that the file is delimited by ::
. Change the delimiters to ,
(csv).
sed -i 's/::/,/g' ml-1m/movies.dat
sed -i 's/::/,/g' ml-1m/users.dat
sed -i 's/::/,/g' ml-1m/ratings.dat# Rename the files from .dat to .csv. Although, it doesn't really # matter. You can leave the files as .dat. I have OCD.mv ml-1m/movies.dat /ml-1m/movies.csv
mv ml-1m/ratings.dat /ml-1m/ratings.csv
mv ml-1m/users.dat /ml-1m/users.csv
Create a folder called movielens
in the hdfs and move the files into it.
As an hdfs user, create the following files movies.sql
, ratings.sql
, and users.sql
.
su - hdfs
nano movies.sql
nano ratings.sql
nano users.sql
To view the data in the movies.csv
file, copy the following code in the movies.sql
file and run the command hive -f movies.sql
.
DROP DATABASE IF EXISTS movielens CASCADE;CREATE DATABASE movielens;USE movielens;CREATE EXTERNAL TABLE movies (MovieID INT,
Title varchar(60),
Genres varchar(60))ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY "\n"
STORED AS TEXTFILE
LOCATION '/movielens/ml-1m/mvs.txt';LOAD DATA INPATH '/movielens/ml-1m/movies.csv' INTO TABLE movies;SELECT * FROM movies LIMIT 10;
To view the data in the ratings.csv
file, copy the following code in the ratings.sql
file and run the command hive -f ratings.sql
.
DROP DATABASE IF EXISTS movielens CASCADE;CREATE DATABASE movielens;USE movielens;CREATE EXTERNAL TABLE ratings (UserID INT,
MovieID INT,
Rating INT,
Timestamp STRING)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY "\n"
STORED AS TEXTFILE
LOCATION '/movielens/ml-1m/rts.txt';LOAD DATA INPATH '/movielens/ml-1m/ratings.csv' INTO TABLE ratings;SELECT * FROM ratings LIMIT 10;
To view data in users.csv
file, copy the following code in the users.sql
file and run the command hive -f users.sql
.
DROP DATABASE IF EXISTS movielens CASCADE;CREATE DATABASE movielens;USE movielens;CREATE EXTERNAL TABLE users (UserID INT,
Gender STRING,
Age INT,
Occupation INT,
ZIP INT)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY "\n"
STORED AS TEXTFILE
LOCATION '/movielens/ml-1m/usr.txt';LOAD DATA INPATH '/movielens/ml-1m/users.csv' INTO TABLE users;SELECT * FROM users LIMIT 10;
Top 10 viewed movies.
DROP DATABASE IF EXISTS movielens CASCADE;CREATE DATABASE movielens;USE movielens;CREATE EXTERNAL TABLE movies (MovieID INT,
Title varchar(60),
Genres varchar(60))ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY "\n";CREATE EXTERNAL TABLE ratings (UserID INT,
MovieID INT,
Rating INT,
Timestamp STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY "\n";LOAD DATA INPATH '/movielens/ml-1m/movies.csv'
OVERWRITE INTO TABLE movies;LOAD DATA INPATH '/movielens/ml-1m/ratings.csv'
OVERWRITE INTO TABLE ratings;SELECT movies.MovieID,movies.Title,COUNT(DISTINCT ratings.UserID) as views
FROM movies JOIN ratings ON (movies.MovieID = ratings.MovieID)
GROUP BY movies.MovieID, movies.Title
ORDER BY views DESC
LIMIT 10;
Top 20 rated movies having at least 40 views.
DROP DATABASE IF EXISTS movielens CASCADE;CREATE DATABASE movielens;USE movielens;CREATE EXTERNAL TABLE movies (MovieID INT,
Title varchar(60),
Genres varchar(60))ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY "\n";CREATE EXTERNAL TABLE ratings (UserID INT,
MovieID INT,
Rating INT,
Timestamp STRING)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY "\n";LOAD DATA INPATH '/movielens/ml-1m/movies.csv' OVERWRITE INTO TABLE movies;LOAD DATA INPATH '/movielens/ml-1m/ratings.csv' OVERWRITE INTO TABLE ratings;SELECT movies.MovieID,movies.Title,AVG(ratings.Rating) as rtg,COUNT(DISTINCT ratings.UserID) as viewsFROM movies JOIN ratings ON (movies.MovieID = ratings.MovieID)
GROUP BY movies.MovieID,movies.Title
HAVING views >= 40
ORDER BY rtg DESC
LIMIT 20;