Thursday, December 20, 2012

More on Theory of Breakage

kw: analysis, lognormal, power law, statistical distributions

I wrote about the statistical distribution of stellar masses nearly four years ago. The Theory of Breakage, as it is classically derived, indicates that random processes which divide an object or area or substance into many parts will tend to produce "pieces" whose "size" parameter, whether length or volume or mass, follows a lognormal distribution. Some scholars, following Benoit Mandelbrot, instead posit that random breakage will produce a distribution of "size" that obeys a power law. Here I apply these ideas to a large collection of computer files.
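
The classical derivation rests on a simple multiplicative argument: each break keeps some random fraction of a piece, so a fragment's final size is a product of many random factors, and the logarithm of that size is a sum of many random terms, which the Central Limit Theorem pushes toward a normal distribution. Here is a minimal simulation of that argument in Python (the break count and sample size are arbitrary choices of mine):

    import math
    import random

    def fragment_size(n_breaks=40):
        # Each break keeps a uniform random fraction of the piece,
        # so the final size is a product of n_breaks random factors.
        size = 1.0
        for _ in range(n_breaks):
            size *= random.random()
        return size

    # log(size) is a sum of 40 independent random terms, so it is
    # nearly normal by the Central Limit Theorem, which means size
    # itself is nearly lognormal.
    logs = [math.log(fragment_size()) for _ in range(10000)]
    mean = sum(logs) / len(logs)
    sd = (sum((x - mean) ** 2 for x in logs) / len(logs)) ** 0.5
    print(f"mean of log-size: {mean:.1f}, std dev: {sd:.1f}")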

As I close out a 44-year career in the computer sciences, I need to determine which computer files to pass on to my colleagues, and which can be allowed to vanish. I have had some kind of personal computer on my desk for 32 years, and I have certain files from my mainframe days going back another 12 years. As a long-time member of the Elephant Club (Motto: Do not trust a computer you can see over), I share the computer geek's vice of never throwing anything away. I long ago learned that magnetic disk technology was producing file space much faster than anyone could fill it. So I am a kind of hoarder. My house is not cluttered, but my disk drive certainly is!

I used a DOS command (few folks these days even know how to do that!) to gather a complete file and folder listing for the Work disk. The entire corpus came to 35,309 files with an aggregate size of 18 Gbytes. Of this, the Project data consists of 8,297 files totaling 8 Gbytes; these are the files directly related to paid work. The rest is support material and other files kept for historical reasons: a great many FORTRAN, Pascal and Perl program source code files that I call my "algorithm collection"; presentation files in PowerPoint and older formats such as Framework (by Ashton-Tate); spreadsheets (Excel, 1-2-3 and Quattro); flow charts and other drawings in Visio and older tools; plus images, videos and sound files.
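
The command in question would be something like DIR /S, redirected to a text file. For the analysis below, what matters is the list of file sizes, which a short script can also gather directly; here is a minimal Python sketch, with a placeholder path standing in for my Work disk:

    import os

    root = r"D:\Work"  # placeholder; point this at the disk to survey

    # Walk the whole folder tree and record each file's size in bytes.
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                sizes.append(os.path.getsize(os.path.join(dirpath, name)))
            except OSError:
                pass  # unreadable file; skip it

    print(f"{len(sizes)} files, {sum(sizes) / 2**30:.2f} Gbytes total")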

With a little fiddling I produced two lists of file sizes and analyzed them in two ways. The first chart is a Power Law analysis:

[Chart: Power Law analysis of the file-size distributions, All Files and Proj. Files]

Keep in mind that the "Proj. Files" are included in "All Files". On log-log axes a true Power Law plots as a straight line, so the strongly curved shape of these distributions is diagnostic: they do not follow a Power Law, and more likely follow a Lognormal distribution. If you project the slope of the upper quarter or third of the curve out to the "1" line, you find that it would take trillions of files to extend the line that far (I estimate 20 trillion for the blue line).
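
One common way to construct such a chart is a rank-size plot: sort the sizes in descending order and plot rank against size, both on logarithmic axes, so that a true Power Law falls on a straight line. A minimal Python sketch of that approach (the sizes in the example are made up):

    import math

    def rank_size_points(sizes):
        # Sort descending; the largest file gets rank 1. Under a
        # Power Law these points fall on a straight line when both
        # axes are logarithmic.
        ordered = sorted(sizes, reverse=True)
        return [(math.log10(rank), math.log10(size))
                for rank, size in enumerate(ordered, start=1)
                if size > 0]

    # Tiny made-up example (sizes in bytes):
    for x, y in rank_size_points([512, 2048, 96, 1500000, 40960, 7]):
        print(f"log10(rank)={x:.2f}  log10(size)={y:.2f}")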

By contrast, here is a Lognormal analysis:

[Chart: Lognormal analysis of the same file-size distributions]

In this presentation the points fall very nearly on a straight line, which is the signature of a lognormal distribution. I have no explanation for the few departures from linearity, and they don't really need "explaining" anyway. I simply find it fascinating that these thousands of items, some written by me, others downloaded or generated by many methods and processes over four decades, should result in such a distribution.
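
Such a chart can be drawn as a normal probability plot of the logarithms of the file sizes: if log(size) is normally distributed, then size is lognormal, and the plotted points fall on a straight line. A minimal sketch using only the Python standard library (again with made-up sizes):

    import math
    from statistics import NormalDist

    def lognormal_probe(sizes):
        # Pair each sorted log-size with the standard normal quantile
        # at the same plotting position. If the sizes are lognormal,
        # the pairs lie near a straight line.
        logs = sorted(math.log10(s) for s in sizes if s > 0)
        n = len(logs)
        return [(NormalDist().inv_cdf((i - 0.5) / n), logs[i - 1])
                for i in range(1, n + 1)]

    # Tiny made-up example (sizes in bytes):
    for q, ls in lognormal_probe([512, 2048, 96, 1500000, 40960, 7]):
        print(f"normal quantile {q:+.2f} -> log10(size) {ls:.2f}")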
