Parsing HTML Table to CSV Through Shell Script


A few days back, we got a requirement to parse an HTML table through a shell script and generate a report based on the table's contents. After trying many things and much brainstorming, I came up with a shell script that parses HTML tables into a CSV file. I will try to explain each step of this script.

The Script:





#!/bin/bash
### Convert the Spark dispatcher logs, exposed as an HTML table, to CSV
### so that the data can be used for different purposes.

create_csv()
{
    # Paragraph mode (RS='') treats each blank-line-separated block as one
    # record; joining its lines with ", " yields one CSV row per record.
    awk -v RS='' '{gsub("\n", ", "); print}' textfile > rawfile.csv
    rm -f dump.txt textfile
}

html_clean()
{
    curl -s "URL_TO_BE_HIT" > table.html
    # Strip every HTML tag, trim whitespace and drop blank lines, then
    # delete the page header (first 11 lines) and the last two footer lines.
    sed -e 's/<[^>]*>//g' table.html | awk '$1=$1' | sed '1,11d' | sed '$d' | sed '$d' > dump.txt
    rm -f table.html
}

clean_text()
{
    COUNTER=1
    # The trailing newline keeps the header row in its own record.
    echo "COMMA_SEPARATED_HEADERS_OF_TABLES"$'\n' > textfile
    while read -r line
    do
        if [ "$COUNTER" -lt 6 ]
        then
            echo "$line" >> textfile
            let COUNTER+=1
        else
            # Sixth field: end the record with a blank separator line.
            echo "$line"$'\n' >> textfile
            let COUNTER=1
        fi
    done < dump.txt
}

check_status()
{
    # Keep only the rows stamped five days ago, within the current
    # ten-minute window (H = hour, M = tens digit of the minute).
    prev_date=$(date --date="5 days ago" +'%Y/%m/%d')
    H=$(date +"%T" | cut -f 1 -d ':')
    M=$(date +"%T" | cut -c 4)
    grep "$prev_date $H:$M[0-9]:[0-9][0-9]" rawfile.csv > csvfile.csv
}

### Main script starts here
html_clean
clean_text
create_csv
check_status

As we can see, this script is divided into four aptly named functions. Let’s discuss the functions one by one.

First, we come across the html_clean function. It fetches the HTML page and removes all the HTML markup, producing a text file that contains only the data that was originally in the table to be parsed. The values in the sed commands can be changed to produce various desired (and also undesired) effects in the output text file.
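To see what the tag-stripping pipeline does, here is a minimal sketch run on a hypothetical fragment of table markup (the file name and cell values are made up for illustration):

```shell
# Each <td> sits on its own line, as a browser's "view source" would show it.
printf '<tr>\n  <td>job-1</td>\n  <td>FINISHED</td>\n</tr>\n' > sample.html
# sed deletes every tag; awk '$1=$1' trims whitespace and drops blank lines.
sed -e 's/<[^>]*>//g' sample.html | awk '$1=$1'
rm -f sample.html
```

Each table cell ends up on its own line, which is the shape the next function expects.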

Then there is the clean_text function. This function needs some knowledge of the table being parsed: since I knew my table had six fields per record, I wrote a loop that reformats the text file so that the lines belonging to one record are grouped together and each record ends with a blank line. Reading and manipulating this file is much easier than working with the raw output of the previous function.
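The grouping idea can be sketched as follows, assuming for brevity three fields per record instead of six (the input letters are placeholder data):

```shell
printf 'a\nb\nc\nd\ne\nf\n' | {
  COUNTER=1
  while read -r line; do
    if [ "$COUNTER" -lt 3 ]; then
      echo "$line"
      COUNTER=$((COUNTER + 1))
    else
      printf '%s\n\n' "$line"   # last field: end the record with a blank line
      COUNTER=1
    fi
  done
}
```

The blank lines are what later allow awk's paragraph mode to split the stream back into records.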

Now, we come across the create_csv function. As the name suggests, it reads the file produced by the previous function and converts it into a CSV file; it is written specifically around that function's blank-line-separated output.
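The conversion rests on one awk idiom: setting RS='' puts awk in paragraph mode, so each blank-line-separated block becomes a single record whose internal newlines can then be replaced with commas. A minimal sketch with made-up record data:

```shell
# Two records, separated by a blank line, become two CSV rows.
printf 'job-1\nFINISHED\n\njob-2\nRUNNING\n' | awk -v RS='' '{gsub("\n", ", "); print}'
```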

The last function, check_status, creates the report, which in this case is all of the records lying in a particular time range. We can put any logic in here, and in the end we get a CSV file that can be used for further manipulation. In my case, the CSV file fed an automation pipeline that checked the status of finished jobs of an application.
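The filtering itself is a plain grep over the timestamp column. Here is a sketch with a fixed timestamp instead of one computed from the current time, on hypothetical CSV rows: the bracket expressions keep every row stamped within one ten-minute window of a given day.

```shell
printf '2020/01/10 14:32:05, job-1, FINISHED\n2020/01/10 15:01:47, job-2, FINISHED\n' > rawfile.csv
# Matches 14:30:00 through 14:39:59 on 2020/01/10 only.
grep "2020/01/10 14:3[0-9]:[0-9][0-9]" rawfile.csv
rm -f rawfile.csv
```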

Hope this blog helps. Happy reading!



Written by Sudeep James Tirkey

Sudeep James Tirkey is a software consultant with more than six months of experience. He likes to explore new technologies and trends in the IT world. His hobbies include playing football and badminton, reading, and travelling. Sudeep is familiar with programming languages such as Java, Scala, C, and C++, and he is currently working on DevOps and reactive technologies like Jenkins, DC/OS, Ansible, Scala, Java 8, Lagom, and Kafka.
