Advanced Histogram Using Python

A histogram to delight the business users and data scientists

Anandakumar Varatharajah
Towards Data Science


What is the need?

Python has excellent support for generating histograms. In Data Science, however, it is often useful to display the count for each bar/bin, show the bin ranges, colour the bars to separate percentiles and generate custom legends, so that the histogram gives business users more meaningful insights. There is no single built-in method that does all of this in Python.

So, for a Data Scientist, the requirements for a useful histogram are:

  1. How to display the data point count for each bar in the histogram?
  2. How to display the bar/bin range in the X axis of the histogram?
  3. How to change the colour of the bar/bins in the histogram based on the percentile?
  4. How to generate custom legends?

The final output that we are trying to produce, which would be very useful to end users, is shown below:

Advanced Histogram

Why is this useful?

Knowing the data ranges and the percentiles, along with the counts and normalised percentages, is very useful in determining how the data should be wrangled/cleansed. It also makes it easier for business users to understand and derive value from the histogram.

For example, if the data is heavily skewed, either positively or negatively, and has extreme outliers, the graph may reveal valuable insights about the data.

Advanced histogram for skewed data

The above histogram shows that, although the data ranges from 1 to 67875, about 99% of it lies within the range 1 to 6788. This helps in deciding what to do with the outliers.

You cannot get this level of detailed insight from a standard histogram, which is shown below.
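A reading like this can also be checked numerically. The following is a minimal sketch, not the article's code: it builds a synthetic skewed data set (an assumption, chosen only so that the overall range matches the 1 to 67875 range mentioned above) and uses numpy.histogram to print the count and cumulative percentage for each of ten equal-width bins.

```python
import numpy as np

# Synthetic, heavily right-skewed data. This is an assumption made for
# illustration; it is NOT the data set behind the figures in the article.
rng = np.random.default_rng(0)
data = np.concatenate([
    rng.integers(1, 6000, size=988),       # bulk of the data (values 1..5999)
    [1, 67875],                            # pin the overall range to 1..67875
    rng.integers(20000, 67876, size=10),   # a handful of extreme outliers
])

# Ten equal-width bins, the same default that ax.hist uses.
counts, bin_edges = np.histogram(data, bins=10)

# Share of all data points at or below the right edge of each bin.
cumulative_pct = 100 * np.cumsum(counts) / counts.sum()

for left, right, count, pct in zip(bin_edges[:-1], bin_edges[1:], counts, cumulative_pct):
    print(f"{left:9.0f} to {right:9.0f}: count={int(count):4d}  cumulative={pct:6.2f}%")
```

By construction, the first bin (1 to roughly 6788) holds 989 of the 1,000 synthetic values, i.e. 98.9%, which mirrors the kind of conclusion drawn from the skewed-data histogram.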

Standard histogram for skewed data

Python Code

You can download the code from my AnalyticsInsightsNinja GitHub site or from Azure Notebook.

The important line of the code is:

counts, bins, patches = ax.hist(data, facecolor=perc_50_colour, edgecolor='gray')

which returns the following:

counts = numpy.ndarray of the number of data points in each bin/column of the histogram

bins = numpy.ndarray of bin edge/range values (one more entry than counts, because every bin has a left and a right edge)

patches = a list-like container of Patch objects, one per bar. Each one is a Rectangle object, e.g. Rectangle(xy=(-2.51953, 0), width=0.501013, height=3, angle=0)

By manipulating these three collections, we can get very useful information about the histogram.
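As an illustration of how the three return values can cover the four requirements listed earlier, here is a minimal sketch. It is not the author's downloadable code: the function name advanced_histogram, the quartile bands, the colours and the rule of colouring a bar by the percentile band of its left edge are all assumptions made for this example.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

# Hypothetical colour scheme: one colour per quartile band.
BAND_COLOURS = ["lightsteelblue", "steelblue", "orange", "firebrick"]
BAND_LABELS = ["Below 25th percentile", "25th to 50th percentile",
               "50th to 75th percentile", "Above 75th percentile"]


def advanced_histogram(data, ax, bins=10):
    """Draw a histogram with counts, bin-edge ticks, percentile colours and a legend."""
    data = np.asarray(data)
    counts, bin_edges, patches = ax.hist(data, bins=bins, edgecolor="gray")

    # Bin ranges on the x axis: place one tick on every bin edge.
    ax.set_xticks(bin_edges)
    ax.set_xticklabels([f"{edge:.2f}" for edge in bin_edges], rotation=45)

    # Percentile thresholds that separate the colour bands.
    thresholds = np.percentile(data, [25, 50, 75])

    for count, left_edge, patch in zip(counts, bin_edges[:-1], patches):
        # Colour the bar by the band its left edge falls into (simplifying assumption).
        band = int(np.searchsorted(thresholds, left_edge, side="right"))
        patch.set_facecolor(BAND_COLOURS[band])

        # Count and normalised percentage, written just above the bar.
        ax.annotate(
            f"{int(count)}\n{100 * count / data.size:.1f}%",
            xy=(patch.get_x() + patch.get_width() / 2, patch.get_height()),
            xytext=(0, 3), textcoords="offset points",
            ha="center", va="bottom",
        )

    # Custom legend built from proxy Patch handles, one per colour band.
    handles = [Patch(facecolor=colour, edgecolor="gray", label=label)
               for colour, label in zip(BAND_COLOURS, BAND_LABELS)]
    ax.legend(handles=handles)
    return counts, bin_edges, patches


# Example usage with synthetic, normally distributed data (an assumption).
rng = np.random.default_rng(42)
data = rng.normal(size=1000)
fig, ax = plt.subplots(figsize=(10, 5))
counts, bin_edges, patches = advanced_histogram(data, ax)
```

Call plt.show() or fig.savefig("advanced_histogram.png") afterwards to view the result. A bar that straddles a percentile boundary is given a single colour here; splitting or shading such bars differently is a design choice left open.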

Credits: The code was inspired by an answer provided by Joe Kington at stackoverflow.com

Conclusion

Though standard histograms are useful, an enhanced version that displays the data ranges and the percentiles, along with the counts and normalised percentages, is very useful for a Data Scientist in determining how the data should be wrangled/cleansed, and it provides more value to business end users.


I help businesses to get value from data, insights, machine learning and analytics, delivering solutions for real world business problems.