How to Create Interactive AWS Elastic Map Reduce (EMR) Clusters using the AWS CLI

Introduction

In this How To article I demonstrate how to use the AWS CLI to create an Amazon Elastic Map Reduce (EMR) cluster along with some common supplementary resources for experimentation and development. Amazon EMR is a big data service from AWS that makes it easy to provision distributed computing clusters preinstalled with user-selected open source data processing and AI/ML tools such as Apache Hive, Apache Spark, Apache Flink and many others.

Steps to Create Interactive AWS EMR Cluster with Hadoop and Hive

In this demonstration I provide the steps necessary to create a modest EMR cluster complete with Apache Hadoop and Apache Hive suitable for experimenting and interactive development. By this I mean that these steps will produce an EMR cluster of three nodes (one master and two core worker nodes) that can be accessed via SSH and will be given fairly open service roles to interact with other common AWS services like Kinesis, S3, Glue and others.

Before going any further I do want to warn that this cluster will cost around $0.75 per hour to run so if you intend to use this "recipe" of steps to create an EMR cluster you will be charged by AWS.

1) Create EC2 Key Pair for SSH Access to Master Node of Cluster

This command creates an EC2 Key Pair (.pem key) and saves the private key to a local file, which can then be used to SSH onto the resulting EMR cluster's Master Node.

aws ec2 create-key-pair --key-name emr-keypair \
    --query 'KeyMaterial' --output text > emr-keypair.pem

Change the key file's permissions so that it is readable by its owner only.

chmod 400 emr-keypair.pem

2) Create an S3 Bucket to Hold Resources as well as Accept Log Files

This step creates an S3 bucket that is useful for holding the data to be processed within the EMR cluster, storing the results it produces, and receiving the log files from its execution.

If you have an S3 bucket you'd like to use you can skip creating one.

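The following creates an S3 bucket named tci-emr-demo in the region configured for the AWS CLI, which is us-east-1 in my case. S3 bucket names are globally unique, so you will need to substitute a name of your own.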

aws s3 mb s3://tci-emr-demo

Next I create folders in S3 for logs, inputs, outputs and scripts.

aws s3api put-object --bucket tci-emr-demo --key logs/
aws s3api put-object --bucket tci-emr-demo --key inputs/
aws s3api put-object --bucket tci-emr-demo --key outputs/
aws s3api put-object --bucket tci-emr-demo --key scripts/
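The four put-object calls above can also be written as a loop. This is only a sketch: make_prefixes is a hypothetical helper name of my own, and it assumes the bucket already exists and that your credentials allow writing to it.

```shell
# Sketch: create several "folder" keys in a bucket, then list the bucket.
# make_prefixes is a hypothetical helper, not an AWS CLI command.
make_prefixes() {
    bucket="$1"
    shift
    for prefix in "$@"; do
        # A zero-byte object whose key ends in "/" shows up as a folder.
        aws s3api put-object --bucket "$bucket" --key "$prefix/" > /dev/null || return 1
    done
    # List the top level of the bucket to confirm the prefixes exist.
    aws s3 ls "s3://$bucket/"
}

# Example usage (makes real AWS calls):
#   make_prefixes tci-emr-demo logs inputs outputs scripts
```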

3) Create an EC2 Security Group to Allow SSH Connections from Your IP

First I find my IP using an HTTP client such as HTTPie (you could also use your browser or curl) and save it to an environment variable named MY_IP.

MY_IP=$(http https://checkip.amazonaws.com -b)
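If HTTPie isn't installed, the same lookup works with curl. The following is a minimal sketch: lookup_my_ip and to_cidr are hypothetical helper names of my own, and the check only confirms that the response is shaped like a dotted IPv4 address before appending /32 (it does not validate each octet's range).

```shell
# Sketch: fetch the caller's public IP with curl.
lookup_my_ip() {
    curl -s https://checkip.amazonaws.com
}

# Sketch: turn an IPv4 address into a single-host CIDR block.
# Shape check only: four dot-separated groups of 1 to 3 digits.
to_cidr() {
    if printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
        printf '%s/32\n' "$1"
    else
        echo "not an IPv4 address: $1" >&2
        return 1
    fi
}

# Example usage (makes a real HTTP request):
#   MY_IP=$(lookup_my_ip)
#   to_cidr "$MY_IP"
```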

Next I create an EC2 Security Group in a VPC that has public subnets. Your AWS account likely has a default VPC with a public subnet, but you may need to create one if this is not the case.

I've previously grabbed my VPC ID and saved it to an environment variable named MY_VPC, and saved the ID of one of its public subnets to a variable named SUBNET_ID.
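If you are relying on the account's default VPC, both IDs can be looked up from the CLI. This is a sketch under assumptions: the function names are my own, and the subnet filter selects subnets that auto-assign public IPs, which holds for default subnets but does not by itself prove that a subnet has a route to an internet gateway.

```shell
# Sketch: print the ID of the default VPC in the current region.
default_vpc_id() {
    aws ec2 describe-vpcs \
        --filters Name=is-default,Values=true \
        --query 'Vpcs[0].VpcId' --output text
}

# Sketch: print the ID of one subnet in the given VPC that
# auto-assigns public IPv4 addresses on launch.
first_public_subnet_id() {
    aws ec2 describe-subnets \
        --filters "Name=vpc-id,Values=$1" Name=map-public-ip-on-launch,Values=true \
        --query 'Subnets[0].SubnetId' --output text
}

# Example usage (makes real AWS calls):
#   MY_VPC=$(default_vpc_id)
#   SUBNET_ID=$(first_public_subnet_id "$MY_VPC")
```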

aws ec2 create-security-group --group-name ssh-my-ip \
    --description "For SSHing from my IP" --vpc-id $MY_VPC

This should output the newly created Security Group ID, which I've saved to an environment variable named MY_SG.
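Rather than copying the ID by hand, the CLI's --query option can return just the GroupId. The sketch below uses hypothetical function names of my own: one variant creates the group and prints its ID, the other looks up the ID of a group that already exists (creating the same group name twice in one VPC fails).

```shell
# Sketch: create the security group and print only its ID.
create_ssh_sg() {
    aws ec2 create-security-group --group-name ssh-my-ip \
        --description "For SSHing from my IP" --vpc-id "$1" \
        --query 'GroupId' --output text
}

# Sketch: print the ID of an existing ssh-my-ip group in the given VPC.
lookup_ssh_sg() {
    aws ec2 describe-security-groups \
        --filters Name=group-name,Values=ssh-my-ip "Name=vpc-id,Values=$1" \
        --query 'SecurityGroups[0].GroupId' --output text
}

# Example usage (makes real AWS calls):
#   MY_SG=$(create_ssh_sg "$MY_VPC")    # first run
#   MY_SG=$(lookup_ssh_sg "$MY_VPC")    # group already exists
```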

Using the Security Group ID along with the result of my IP lookup I can now add an ingress rule to the security group to allow TCP connections to the standard SSH port 22.

aws ec2 authorize-security-group-ingress --group-id $MY_SG \
    --protocol tcp --port 22 --cidr $MY_IP/32

4) Create default IAM Roles for EMR

This is a simple way to create default EMR IAM roles which give an EMR cluster rather liberal access to other commonly used AWS services. Please be sure to consult your organization's AWS and IAM administrator before doing this.

aws emr create-default-roles

5) Create EMR Cluster with Hive

The following command creates an EMR cluster running release emr-5.33.0 on 3 m5.xlarge EC2 instances, with the Hadoop and Hive applications installed and a 12 GiB EBS root volume on each instance. The output of this command gives you your EMR Cluster ID, which is needed for subsequent operations. I've saved my EMR Cluster ID in a variable named CLUSTER_ID.

aws emr create-cluster --name tci-cluster --applications Name=Hadoop Name=Hive \
  --release-label emr-5.33.0 --use-default-roles \
  --instance-count 3 --instance-type m5.xlarge \
  --ebs-root-volume-size 12 \
  --log-uri s3://tci-emr-demo/logs \
  --ec2-attributes KeyName=emr-keypair,AdditionalMasterSecurityGroups=$MY_SG,SubnetId=$SUBNET_ID \
  --no-auto-terminate
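The Cluster ID can also be captured directly instead of being copied from the JSON output. The following is a sketch with hypothetical function names of my own: the first wraps the same create-cluster command and prints only the ClusterId, and the second finds the ID of an active cluster by name, assuming only one active cluster has that name.

```shell
# Sketch: create the cluster and print only its ID.
# $1 is the security group ID, $2 is the subnet ID.
create_tci_cluster() {
    aws emr create-cluster --name tci-cluster \
        --applications Name=Hadoop Name=Hive \
        --release-label emr-5.33.0 --use-default-roles \
        --instance-count 3 --instance-type m5.xlarge \
        --ebs-root-volume-size 12 \
        --log-uri s3://tci-emr-demo/logs \
        --ec2-attributes "KeyName=emr-keypair,AdditionalMasterSecurityGroups=$1,SubnetId=$2" \
        --no-auto-terminate \
        --query 'ClusterId' --output text
}

# Sketch: print the ID of the first active cluster with the given name.
find_cluster_id() {
    aws emr list-clusters --active \
        --query "Clusters[?Name=='$1'].Id | [0]" --output text
}

# Example usage (makes real AWS calls and starts billing):
#   CLUSTER_ID=$(create_tci_cluster "$MY_SG" "$SUBNET_ID")
#   CLUSTER_ID=$(find_cluster_id tci-cluster)
```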

Another common application used with EMR (likely more common than Hive) is Spark. If you are looking to add Spark to the list of installed applications, amend the --applications flag to the following.

--applications Name=Hadoop Name=Hive Name=Spark

You can use the describe-cluster command to find the cluster's MasterPublicDnsName as well as to check whether the cluster is fully provisioned and in the WAITING state.

aws emr describe-cluster --cluster-id $CLUSTER_ID
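Instead of re-running describe-cluster by hand, the CLI's built-in waiter can block until the cluster is up. This is a minimal sketch; wait_for_cluster is a hypothetical name of my own, and the waiter gives up with an error if the cluster terminates or its polling limit is reached.

```shell
# Sketch: block until the cluster is running, then print its state.
wait_for_cluster() {
    # The waiter polls describe-cluster until the cluster is up.
    aws emr wait cluster-running --cluster-id "$1" &&
    aws emr describe-cluster --cluster-id "$1" \
        --query 'Cluster.Status.State' --output text
}

# Example usage (makes real AWS calls):
#   wait_for_cluster "$CLUSTER_ID"
```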

If you have the CLI program jq installed, you can parse out the MasterPublicDnsName using the following.

MASTER_URL=$(aws emr describe-cluster --cluster-id $CLUSTER_ID | jq -r ".Cluster.MasterPublicDnsName")
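If jq isn't available, the AWS CLI's own --query option can extract the same field. A small sketch, with master_dns being a hypothetical helper name of my own:

```shell
# Sketch: print the master node's public DNS name without jq.
master_dns() {
    aws emr describe-cluster --cluster-id "$1" \
        --query 'Cluster.MasterPublicDnsName' --output text
}

# Example usage (makes a real AWS call):
#   MASTER_URL=$(master_dns "$CLUSTER_ID")
```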

Next you'll want to be sure you can SSH onto the Master Node using the MasterPublicDnsName retrieved above and the SSH key pair created earlier.

ssh hadoop@$MASTER_URL -i emr-keypair.pem

Once on the Master Node I should be able to interact with HDFS and verify that Hadoop and Hive were installed by listing the contents of the /user directory like so.

hadoop fs -ls /user

Yielding the following output.

Found 4 items
drwxrwxrwx   - hadoop hdfsadmingroup          0 2021-06-23 03:29 /user/hadoop
drwxr-xr-x   - mapred mapred                  0 2021-06-23 03:14 /user/history
drwxrwxrwx   - hdfs   hdfsadmingroup          0 2021-06-23 03:14 /user/hive
drwxrwxrwx   - root   hdfsadmingroup          0 2021-06-23 03:14 /user/root

Destroying the Cluster

When you are done working with this cluster you will likely want to destroy it to minimize costs.

aws emr terminate-clusters --cluster-ids $CLUSTER_ID

Then use either the same describe command from earlier or list all clusters to be sure that the cluster reaches a TERMINATED state.

aws emr list-clusters
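The CLI also ships a waiter for termination, which saves polling list-clusters by hand. Again this is only a sketch, and wait_for_termination is a hypothetical name of my own:

```shell
# Sketch: block until the cluster is terminated, then print its final state.
wait_for_termination() {
    aws emr wait cluster-terminated --cluster-id "$1" &&
    aws emr describe-cluster --cluster-id "$1" \
        --query 'Cluster.Status.State' --output text
}

# Example usage (makes real AWS calls):
#   aws emr terminate-clusters --cluster-ids "$CLUSTER_ID"
#   wait_for_termination "$CLUSTER_ID"
```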

Conclusion

In this How To article I provided a recipe of steps and resources required to create an interactive AWS EMR cluster with the Apache Hadoop and Apache Hive applications installed, configured to be accessible via SSH from a single, narrowly scoped IP address. I've regularly used this same set of steps to experiment with various cluster-based Big Data technologies useful for developing data-intensive workloads, and I hope readers find it useful.

As always, thanks for reading and don't be shy about commenting or critiquing below.