Saturday, July 20, 2013

Plotting histograms

In any sort of experimental work, I or my students end up plotting a lot of histograms (of alignment scores, cost functions, segment lengths, …). What I usually want to see is a probability density function (so that the scaling is independent of the number of data points sampled or the bin sizes used). Most of the students end up using some crude built-in histogram plotter (in R or Matplotlib) that ends up with difficult-to-interpret axes and bad bin sizes.
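For the density scaling itself, the key point is that each bin's count gets divided by the total number of samples times the bin width. Here is a minimal sketch of that scaling with NumPy and Matplotlib (the sample data is made up purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up sample data; any 1-D list of numbers would do.
samples = np.random.standard_normal(3000)

# Dividing each count by (number of samples * bin width) turns the raw
# histogram into a probability-density estimate, so the y axis no longer
# depends on how many points were sampled or how wide the bins are.
counts, edges = np.histogram(samples, bins=30)
widths = np.diff(edges)
density = counts / (len(samples) * widths)

plt.step(edges[:-1], density, where='post')
plt.ylabel('probability density')
plt.show()
```

Recent versions of Matplotlib will do the same scaling for you with plt.hist(samples, bins=30, density=True); older versions call the keyword normed.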



I spent a couple of days experimenting with different approaches to making a Python module that can convert a generic list of numbers into a list of points to plot with gnuplot or Matplotlib. I ended up with three different things to control: the number of bins to use, how wide to make each bin, and whether to plot the estimated density function as a step function or linearly interpolated.
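The module itself isn't shown here, but the conversion at its core looks roughly like the sketch below (the function and parameter names are mine, chosen for illustration): given bin boundaries and counts, emit either step-function corners or bin-center points for linear interpolation.

```python
def density_points(edges, counts, n, step=True):
    """Turn bin edges and counts into (x, y) points for gnuplot or Matplotlib.

    edges  -- sorted bin boundaries, one more of them than there are counts
    counts -- number of samples that fell into each bin
    n      -- total number of samples
    step   -- True for a step-function plot, False for points at the bin
              centers, to be joined by linear interpolation
    """
    points = []
    for i, c in enumerate(counts):
        width = edges[i + 1] - edges[i]
        dens = c / float(n * width)              # density estimate for this bin
        if step:
            points.append((edges[i], dens))      # left corner of the step
            points.append((edges[i + 1], dens))  # right corner of the step
        else:
            points.append(((edges[i] + edges[i + 1]) / 2.0, dens))
    return points
```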




If I have n samples, I set the default number of bins to a slowly growing function of n, truncated to an integer, plus 1, which seems to give a good tradeoff between having so few bins that resolution is lost and so many that the shape of the distribution is buried in the sampling noise. The "+1" is just to ensure that truncating the estimate never results in 0 bins.
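A minimal sketch of such a default, assuming a square-root rule for the growing part (that particular function is an assumption for illustration, not necessarily the one the module uses):

```python
import math

def default_num_bins(n):
    """Default bin count for n samples: a slowly growing function of n,
    truncated to an integer, plus 1 so the result can never be 0.
    The square-root rule here is only an illustrative stand-in."""
    return int(math.sqrt(n)) + 1
```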



Most of my experimenting was in adjusting the bin widths. I came up with three main approaches, and one minor variant (code sketches of the main approaches follow the list):

fixed-width: This is the standard histogram technique, where the full range of values is split into numbins equal intervals, and the numbers are counted into the corresponding bins. This approach is very simple to program and very fast, as there is no need to sort the data (if the range is known), and bin selection is a trivial subscript computation. Since I'm projecting the range out a little from the largest and smallest numbers (based on the second largest and second smallest), I ended up sorting anyway. The fixed-width approach is very good for discrete distributions, as it can have lots of bins with 0 counts.

fixed-count: The fixed-count approach tries to make each bin have the same number of samples in it. The intent here is to have finer resolution where there are a lot of samples, and coarser resolution where there are few samples. I implemented this approach by sorting the numbers and setting thresholds in a single sweep across the data. The fixed-count approach gives good resolution at the peaks of the density, but gets very coarse where the density is low. It does not leave any empty bins, so it is not as good for discrete distributions as the fixed-width approach.

tapered: The tapered approach is like the fixed-count approach, except that the desired counts taper off in the first and last few bins of the range. This was a rather clumsy attempt to get better resolution in the tails of a distribution.

fixed-area: This approach is a compromise between the other two, trying to keep the product of the number of counts and the bin width roughly constant. I again implemented this by doing a sweep across the sorted numbers. The fixed-area approach provides a useful compromise, giving reasonable resolution both at the peaks and in the tails of continuous distributions, but (like the fixed-count method) it does not handle discrete distributions very well, since the density estimate can't go to zero inside the range of the data.
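Here are rough sketches of the three main approaches, with hypothetical function names and assuming the data list is already sorted (the tapered variant is omitted); each returns a list of bin boundaries:

```python
def fixed_width_edges(data, num_bins):
    """Split the full range of the (sorted) data into num_bins equal intervals."""
    lo, hi = data[0], data[-1]
    width = (hi - lo) / float(num_bins)
    return [lo + i * width for i in range(num_bins + 1)]

def fixed_count_edges(data, num_bins):
    """Place boundaries so each bin holds roughly the same number of samples."""
    n = len(data)
    edges = [data[0]]
    for i in range(1, num_bins):
        edges.append(data[(i * n) // num_bins])
    edges.append(data[-1])
    return edges

def fixed_area_edges(data, num_bins):
    """Sweep across the sorted data, closing a bin once the product of its
    count and its width reaches a fixed target, so count * width stays
    roughly constant from bin to bin."""
    n = len(data)
    lo, hi = data[0], data[-1]
    # Target product if both counts and widths were split evenly among bins.
    target = n * (hi - lo) / float(num_bins * num_bins)
    edges = [lo]
    count = 0
    for x in data:
        count += 1
        if x > edges[-1] and count * (x - edges[-1]) >= target:
            edges.append(x)
            count = 0
    if edges[-1] < hi:
        edges.append(hi)
    return edges
```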



I made up some test data and tried the different approaches on a number of test cases. Here are a couple of illustrative plots using 3000 points: 1000 each from 2 Gaussian distributions and 1000 from a Weibull (extreme-value) distribution:
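A mixture of that shape can be generated with something like the snippet below; the means, widths, shape parameter, and offsets are placeholders, since the exact parameters aren't given above.

```python
import numpy as np

rng = np.random.default_rng(20130720)

# 1000 points from each of two Gaussians plus 1000 from a Weibull
# (extreme-value) distribution.  All distribution parameters here are
# placeholders, not the ones actually used for the plots.
data = np.concatenate([
    rng.normal(loc=0.0, scale=1.0, size=1000),   # broad Gaussian
    rng.normal(loc=2.0, scale=0.1, size=1000),   # narrow central peak
    rng.weibull(2.0, size=1000) + 4.0,           # right-hand tail
])
data.sort()
```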



Plot of the real density function and the reconstructed ones using each of the approaches to setting bin boundaries. I used a log scale on the y axis so that the tails of the distribution could be analyzed (something I often do when looking at a distribution to estimate p-values). Note that the fixed-width bins get very noisy once the expected number of counts per bin gets below 1, and the fixed-count method has an extremely coarse estimate for the right-hand tail, but the fixed-area estimate is pretty good.



Only the fixed-width estimate drops off as fast as the narrow Gaussian peaks, but it goes all the way to 0. If we used pseudocounts to prevent estimates of 0 probability, then the fixed-width method would have a minimum value, determined by the pseudocount (a pseudocount of 1 would give a minimum density of about 1.6E-3 for this data, about where the single-count bins on the right-hand tail are).
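To make that floor concrete: adding a pseudocount to every bin before scaling gives each bin a density of at least pseudocount / (n * bin_width), which is consistent with the observation above that it lands about where the single-count bins sit. A sketch (ignoring the small renormalization that the extra pseudocounts would strictly require):

```python
def smoothed_density(count, bin_width, n_samples, pseudocount=1.0):
    """Density estimate for one bin with a pseudocount added, so the
    estimate can never be exactly zero.  The smallest value it can take
    is pseudocount / (n_samples * bin_width)."""
    return (count + pseudocount) / float(n_samples * bin_width)
```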



Here is a detailed look at the narrow central Gaussian peak. Because I'm interested in the peak here, rather than the tail, I used a linear scale on the y axis. Note that the fixed-width bins are too wide here, broadening the peak and making it difficult to recognize as a Gaussian. The fixed-count methods have very fine resolution--too fine really, as they are picking up a lot of sampling noise. The fixed-area method seems to do the cleanest job of picking up the shape of the peak without excessive noise.



I would release the code on my web site at work, except that we had yet another power failure on campus last night (Friday night seems to be the popular time for power failures and server failures), and the file server where I plan to keep the file will not be rebooted until Monday.