A few days back, we got a requirement to parse an HTML table through a shell script and generate a report based on the contents of the table. After much experimentation and brainstorming, I created a shell script that parses HTML tables into a CSV file. I will try to explain each step of the script.
The Script:
#!/bin/bash
### Created this shell script to convert the Spark dispatcher logs accessible as HTML to CSV so that we can use the data for different purposes.

create_csv()
{
    awk -v RS='' '{gsub("\n", ", "); print}' textfile > rawfile.csv
    rm -f dump.txt textfile
}

html_clean()
{
    curl -s "URL_TO_BE_HIT" > table.html
    sed -e 's/<[^>]*>//g' table.html | awk '$1=$1' | sed '1,11d' | sed '$d' | sed '$d' > dump.txt
    rm -f table.html
}

clean_text()
{
    COUNTER=1
    echo "COMMA_SEPARATED_HEADERS_OF_TABLES" > textfile
    while read -r line
    do
        if [ $COUNTER -le 6 ]
        then
            echo "$line" >> textfile
            let COUNTER+=1
        else
            echo "$line"$'\n' >> textfile
            let COUNTER=1
        fi
    done < dump.txt
}

check_status()
{
    prev_date=$(date --date="5 days ago" +'%Y/%m/%d')
    time=$(date +"%T")
    H=$(date +"%T" | cut -f 1 -d ':')
    M=$(date +"%T" | cut -c 4)
    grep "$prev_date $H:$M[0-9]:[0-9][0-9]" rawfile.csv > csvfile.csv
}

### Main script starts here
html_clean
clean_text
create_csv
check_status
As we can see, this script is divided into four aptly named functions. Let's discuss the functions one by one.
First, we come across the html_clean function. It strips all the HTML markup from the page and creates a text file containing the data that was originally in the table to be parsed. We can change the values in the sed commands to get various desired (and undesired) effects in the output text file.
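The tag-stripping part of that pipeline can be tried on a small inline sample. The HTML snippet and variable names below are hypothetical stand-ins for the page the real script fetches with curl:

```shell
# Hypothetical sample standing in for the curl-ed page.
html='<table>
<tr>
<td>job-1</td>
<td>RUNNING</td>
</tr>
</table>'
# Same idea as html_clean: strip every <...> tag, then let awk '$1=$1'
# squeeze whitespace and drop the lines that became empty.
# (Caveat: '$1=$1' also drops any line whose first field is literally 0.)
cleaned=$(printf '%s\n' "$html" | sed -e 's/<[^>]*>//g' | awk '$1=$1')
printf '%s\n' "$cleaned"
```

Only the cell contents survive, one per line, which is exactly the shape the next function expects.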
Then there is the clean_text function. This function needs some knowledge of the table being processed. As I knew how many fields make up each row of my table, I created a loop that reformats the text file so that each row's fields are grouped together, separated from the next row by a blank line. Reading and manipulating this text file is much easier than working with the raw output of the previous function.
Next comes the create_csv function. As the name suggests, it reads the file produced by the previous function and converts it into a CSV file; the awk invocation is written specifically for the blank-line-separated output of clean_text.
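The awk one-liner does the heavy lifting here: with `RS=''` awk reads blank-line-separated blocks as single records, so replacing the embedded newlines with `, ` turns each block into one CSV row. A small self-contained demo (the file name and data are illustrative):

```shell
# Two blank-line-separated blocks, as produced by the grouping step.
printf 'job-1\nRUNNING\n\njob-2\nFINISHED\n' > textfile_demo
# RS='' makes each block one record; gsub joins its lines with ", ".
csv=$(awk -v RS='' '{gsub("\n", ", "); print}' textfile_demo)
rm -f textfile_demo
printf '%s\n' "$csv"
```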
The last function, check_status, simply creates a report: in this case, all the records falling within a particular time range. We can put any logic here, and in the end we get a CSV file that can be used for further manipulation. In my case, the CSV file was used in an automation pipeline to check the status of finished jobs of an application.
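The filtering idea can be sketched with fixed timestamps instead of live `date` output, so the behaviour is reproducible. The CSV rows and values below are made up:

```shell
# Hypothetical rawfile.csv contents.
cat > rawfile_demo.csv <<'EOF'
2024/01/05 09:41:07, job-1, FINISHED
2024/01/05 09:48:59, job-2, RUNNING
2024/01/05 10:03:12, job-3, FINISHED
EOF
prev_date='2024/01/05'   # the script derives this via: date --date="5 days ago" +'%Y/%m/%d'
H='09'                   # hour, via: date +"%T" | cut -f 1 -d ':'
M='4'                    # tens digit of the minute, via: date +"%T" | cut -c 4
# Matches HH:Mm:ss for that tens-of-minutes window, i.e. minutes 40-49 here.
report=$(grep "$prev_date $H:$M[0-9]:[0-9][0-9]" rawfile_demo.csv)
rm -f rawfile_demo.csv
printf '%s\n' "$report"
```

Only the two rows inside the 09:40-09:49 window survive the grep, which is the report the real script writes to csvfile.csv.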
Hope this blog helps. Happy reading!