Distribution Visualization 101 with Python

Pick your weapon: histogram, density plot, or box plot

Zainul Arifin
Towards Data Science


Data is just like music: presented the wrong way, it becomes awful (Picture by Kaharlytskyi from Unsplash)

When dealing with data, one of the quickest ways to understand it is through visualization. Rather than scanning rows in a table, plotting the data gives a fast and intuitive picture of it. You might even find surprising results by visualizing your data.

“Most of us need to listen to the music to understand how beautiful it is. But often that’s how we present statistics: we just show the notes, we don’t play the music.” — Hans Rosling

Many interesting insights come from knowing how the data is distributed. Visualizing the distribution of numeric data can reveal its skewness, its mode (the highest peak), its overall shape, and more.

An example of distribution visualization with a density plot (Image generated by the author)

Sometimes we also want to compare the distribution of one numeric variable across several groups. For example, we might want to compare the height of people from 5 different Asian countries. To statistically confirm a difference between groups, we can use a t-test (2 groups) or ANOVA (more than 2 groups). But an intuitive way to see differences between groups is to visualize them.

An easy way to visualize the difference is to overlay the distribution plots of the different groups for the same variable. Below are some of the most popular ways to visualize differences in distribution between groups with Python.

Histogram

The histogram is a classic for distribution visualization. It splits the range of values into bins and draws one bar per bin: the taller the bar, the more observations fall into that bin. Let's compare the petal length distribution of three different iris species.

An example of a histogram for distribution comparison (Image generated by the author)

To check for differences between species, we can use ANOVA and check the resulting p-value. However, the ANOVA result alone might not be enough, especially when we are presenting to people with a limited understanding of statistics. A histogram overcomes that knowledge barrier because it is so intuitive. Without any statistical analysis, it is quite clear from the picture above that Iris Setosa has shorter petals than Iris Versicolor and Iris Virginica.
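For comparison, the ANOVA mentioned above could be run with SciPy's `f_oneway`. This is only a minimal sketch and not the code behind the figures: the three arrays are synthetic stand-ins for petal length, and their means and spreads are assumptions chosen to resemble well-separated groups. Swap in the real iris columns in practice.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for petal length (cm) of three species.
# The means and spreads are illustrative assumptions, not the iris data.
rng = np.random.default_rng(seed=0)
groups = {
    "setosa": rng.normal(loc=1.5, scale=0.2, size=50),
    "versicolor": rng.normal(loc=4.3, scale=0.5, size=50),
    "virginica": rng.normal(loc=5.5, scale=0.5, size=50),
}

# One-way ANOVA: the null hypothesis is that all group means are equal.
f_stat, p_value = stats.f_oneway(*groups.values())

print(f"F = {f_stat:.1f}, p = {p_value:.3g}")
if p_value < 0.05:
    print("At least one group mean differs from the others.")
```

The p-value says that a difference exists, but not which group differs or by how much, which is where a plot helps.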

One of my favorite additions to a histogram, or any other distribution plot, is an average line. The line can be drawn at the mean or the median of the data, though I personally prefer the median because it is more resistant to outliers. An average line makes it easy to spot each group's typical value at a glance.
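One way to draw overlaid histograms with a median line per group is sketched below with Matplotlib. The data is synthetic (the means and spreads are assumptions, not the iris measurements), and `plot_histograms_with_medians` is a hypothetical helper name used only for this illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for petal length (cm); illustrative assumptions only.
rng = np.random.default_rng(seed=0)
groups = {
    "setosa": rng.normal(loc=1.5, scale=0.2, size=50),
    "versicolor": rng.normal(loc=4.3, scale=0.5, size=50),
    "virginica": rng.normal(loc=5.5, scale=0.5, size=50),
}


def plot_histograms_with_medians(groups, bins=30):
    """Overlay one histogram per group and mark each group's median."""
    fig, ax = plt.subplots(figsize=(8, 4))

    # Shared bin edges so bars from different groups are directly comparable.
    all_values = np.concatenate(list(groups.values()))
    edges = np.histogram_bin_edges(all_values, bins=bins)

    medians = {}
    for i, (name, values) in enumerate(groups.items()):
        color = f"C{i}"  # next color in Matplotlib's default cycle
        ax.hist(values, bins=edges, alpha=0.5, color=color, label=name)
        medians[name] = float(np.median(values))
        # Dashed vertical line at the group's median.
        ax.axvline(medians[name], color=color, linestyle="--")

    ax.set_xlabel("Petal length (cm)")
    ax.set_ylabel("Count")
    ax.legend()
    return fig, ax, medians


fig, ax, medians = plot_histograms_with_medians(groups)
```

Call `plt.show()` or `fig.savefig(...)` afterwards to view the result. Semi-transparent bars (`alpha=0.5`) keep overlapping groups visible.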

Density Plot

As with the histogram, the picture below clearly shows that Iris Setosa's petal length differs from that of Iris Versicolor and Iris Virginica, even without any statistical analysis. But how does the smooth curve in the density plot come to be?

An example of a density plot for distribution comparison (Image generated by the author)

Like a histogram, a density plot shows how frequently values occur, but with a smooth continuous line instead of rectangular bins. The x-axis again holds the numeric values of the observed data. The y-axis is more peculiar: it is not an absolute count of observations but an estimate of the probability density function (PDF) of the data, and that estimate is what produces the density curve. A common way to turn raw observations into a smooth curve is Kernel Density Estimation (KDE).
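To make the idea of KDE more concrete, here is a minimal hand-rolled sketch with NumPy: every observation contributes a small Gaussian bump, and the bumps are averaged into one curve. The sample is synthetic, `gaussian_kde_curve` is a hypothetical helper, and the bandwidth uses a normal-reference rule of thumb. All of these are assumptions made for illustration, and plotting libraries may choose their bandwidth differently.

```python
import numpy as np


def gaussian_kde_curve(values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on `grid`."""
    values = np.asarray(values, dtype=float)
    # One Gaussian bump per observation, evaluated at every grid point.
    z = (grid[:, None] - values[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    # Averaging the bumps keeps the total area under the curve equal to 1.
    return bumps.mean(axis=1)


# Synthetic sample standing in for one group's petal length (assumption).
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=1.5, scale=0.2, size=50)

# Normal-reference rule of thumb: 1.06 * sigma * n^(-1/5).
bandwidth = 1.06 * sample.std(ddof=1) * len(sample) ** (-1 / 5)

grid = np.linspace(sample.min() - 1, sample.max() + 1, 500)
density = gaussian_kde_curve(sample, grid, bandwidth)

# Trapezoidal rule: the area under a density curve should be close to 1.
area = float(np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid)))
print(f"Area under the curve: {area:.3f}")
```

Plotting `grid` against `density` gives the smooth curve. A smaller bandwidth makes it wigglier and a larger one makes it flatter.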

Interpreting the y-axis can be a bit confusing. For example, why is the peak of Iris Setosa's petal length at around 2.4? Where does this number come from? In a density plot, the total area under the curve is 1. But what about a small section of the curve? The height of the curve over a narrow section is roughly the area of that section divided by its width, so we can recover the 2.4 by hand with a simple formula:

Image generated by the author

The area is the integral of the curve’s function. Because our curve looks like a flipped U, the approximate function is:

The curve’s function visualized in Desmos

We pick base values that surround the peak, in this case 1.4 and 1.5. The integral of the function between them is:

Integral calculator from https://www.integral-calculator.com/. 24% of observed data fall under this curve.

In our very rough calculation, we divide 0.24 by 0.1 (the distance between the two base values), which gives 2.4, the peak of our density plot. If all the math is too complicated, just remember that a density curve ultimately reflects frequencies: the higher the curve, the more observations lie around that value.
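The arithmetic above can be checked in a few lines. The first part uses the figures quoted in the text. The second part rebuilds the same ratio with `np.histogram(..., density=True)` on a hypothetical 50-value sample, constructed so that 24% of the values fall between 1.4 and 1.5. That sample is an assumption for illustration, not the iris data.

```python
import numpy as np

# Figures quoted in the text: about 24% of observations lie between 1.4 and 1.5.
fraction_in_section = 0.24
left, right = 1.4, 1.5

# Average density over a section = area of the section / width of the section.
density = fraction_in_section / (right - left)
print(f"Approximate density: {density:.1f}")

# Same idea on a hypothetical sample: 50 values, 12 of them (24%) in [1.4, 1.5).
sample = np.concatenate([
    np.full(12, 1.45),            # inside the section
    np.linspace(1.0, 1.35, 19),   # below the section
    np.linspace(1.55, 1.9, 19),   # above the section
])
edges = np.arange(10, 20) / 10    # bin edges 1.0, 1.1, ..., 1.9

# density=True rescales counts to count / (n * bin width), so bar areas sum to 1.
hist_density, _ = np.histogram(sample, bins=edges, density=True)
print(f"Density of the [1.4, 1.5) bin: {hist_density[4]:.1f}")
```

A density value above 1 is therefore nothing unusual: it only means the data is packed into an interval narrower than one unit.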

Box Plot

What both the histogram and the density plot struggle with is comparing many groups (more than 3). Take a look at this density plot.

Hours worked per week distribution plot of 5 races in America (Image generated by the author)

The plot above shows the number of hours Americans work per week. From it we can conclude that most Americans work roughly 40 hours per week regardless of race. However, the overlapping curves make the density plot unpleasant to read and analyze. An alternative is to subset the data or to use a box plot instead. Take a look at the example below.

Hours worked per week box plot of 5 races in America (Image generated by the author)

In the box plot, our earlier finding still holds: the median is 40 working hours per week across all races. But the interpretation gets more interesting once the outliers are visible.

A box plot is particularly useful when we are interested in outliers. The box itself contains the middle 50% of the data, from the first to the third quartile. The whiskers extend from the box, and any observation beyond the whiskers is drawn as an outlier. Each black circle outside the whiskers is one such observation. We can see that many Americans actually do not follow the typical 40 hours/week. Some work less than 10 hours/week and some work more than 60 hours/week.
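The pieces of a box plot can also be computed by hand. The sketch below assumes Matplotlib's default whisker convention (whiskers reach at most 1.5 times the interquartile range beyond the box), which may or may not match the settings behind the figure above. The `hours` array is made-up illustrative data and `box_plot_summary` is a hypothetical helper.

```python
import numpy as np
import matplotlib.pyplot as plt


def box_plot_summary(values, whis=1.5):
    """Return the quartiles, whisker fences, and outliers of a sample."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1  # height of the box: the middle 50% of the data
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    # Anything beyond the fences is drawn as an individual outlier point.
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "lower_fence": float(lower_fence),
        "upper_fence": float(upper_fence),
        "outliers": outliers,
    }


# Made-up weekly working hours for one group (illustrative assumption).
hours = np.array([5, 30, 35, 38, 40, 40, 40, 40, 40, 40, 40, 40,
                  42, 44, 45, 45, 48, 50, 55, 75, 90], dtype=float)

summary = box_plot_summary(hours)
print(f"Median: {summary['median']}, box: {summary['q1']} to {summary['q3']}")
print(f"Outliers: {np.sort(summary['outliers'])}")

# Matplotlib draws the same picture; its default is whis=1.5.
fig, ax = plt.subplots(figsize=(4, 5))
artists = ax.boxplot(hours)
ax.set_ylabel("Hours worked per week")
```

To compare several groups, pass a list of arrays to `ax.boxplot` and each group gets its own box on a shared axis.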

Conclusion

Now that you have seen several ways to visualize a distribution, you can pick whichever fits your needs. For a quick and intuitive between-group comparison, try a histogram or a density plot. However, if you are interested in visualizing outliers or if there are too many groups, a box plot might be the better choice.

That's all, folks! The code and data used for this post can be found here: https://github.com/ZainulArifin1/distribution_viz. Let me know if you have other fun and informative ways to visualize distributions.

Picture by Peggy Marco from Pixabay
