Minutes of the Meeting [Week 33/2025] | Reading material for Week 34/2025


Dammalapati Sai Krishna

Aug 18, 2025, 3:25:12 AM
to modelt...@googlegroups.com, Dammalapati Sai Krishna
August 17th: 3PM - 4PM
Attendees: Rumi and Dammalapati Sai Krishna

We discussed the 5th chapter of Model Thinker: normal and lognormal distributions.

The book approaches these distributions from a modeler's perspective rather than a statistician's. An example to clarify this:

In many statistics books, we take it as a matter of fact that data on people's heights is normally distributed. A few books add that data on incomes is not normally distributed (incomes follow a long-tailed distribution). These books reach such conclusions empirically: statisticians presumably examined many height datasets that were normally distributed and many income datasets that were not.

But in Model Thinker, the author uses the Central Limit Theorem (CLT) to explain why these datasets are distributed the way they are. He states the CLT in a slightly different manner than what I found in other stats books:

"The sum of independent random variables will be normally distributed."

He then explains a person's height as the sum of 180 random variables, where each variable is the height contributed by one gene responsible for height (one gene corresponds to the height of the neck, another to the legs, etc.). Assuming these 180 genetic contributions are independent, the CLT leads us to conclude that heights will be normally distributed.

If we cannot express a dataset as a sum of independent random variables, we cannot assume it is normally distributed. The book has good examples of this with the income and farm-size datasets.
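
A quick simulation sketch of this point (mine, not from the book): summing independent random variables produces the symmetric bell shape the CLT promises, while multiplying the same variables produces a long-tailed, lognormal shape, the chapter's model for things like incomes.

import numpy as np

rng = np.random.default_rng(0)

# 10,000 "people", each built from 180 independent contributions,
# mimicking the book's 180-gene story for height.
contributions = rng.uniform(0.5, 1.5, size=(10_000, 180))

sums = contributions.sum(axis=1)       # additive process -> roughly normal
products = contributions.prod(axis=1)  # multiplicative process -> lognormal

# A normal distribution is symmetric: mean ~ median and skewness ~ 0.
def skewness(x):
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

print(f"sums:     mean={sums.mean():.1f}  median={np.median(sums):.1f}  skew={skewness(sums):+.2f}")
print(f"products: mean={products.mean():.2e}  median={np.median(products):.2e}  skew={skewness(products):+.2f}")
# The sums come out nearly symmetric; the products are heavily
# right-skewed, with the mean pulled far above the median.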

Rumi questioned the N > 20 condition (the book's rule of thumb for how many variables must be summed before the normal approximation kicks in). We had no answer to that yet.
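
One way to probe that rule of thumb empirically (again my sketch, not an answer from the book): start from a clearly skewed variable and watch how fast the skewness of a sum of N of them dies off.

import numpy as np

rng = np.random.default_rng(1)

def skewness(x):
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

# Exponential variables are strongly skewed (skewness = 2); sums of N
# of them should look more and more normal as N grows.
for n in (1, 5, 20, 100):
    sums = rng.exponential(1.0, size=(100_000, n)).sum(axis=1)
    print(f"N = {n:>3}: skewness of the sum = {skewness(sums):+.2f}")
# For iid exponentials the skewness of the sum is 2/sqrt(N), about 0.45
# at N = 20, which may be why the book treats N > 20 as "normal enough".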

Another discussion was about comparing items with different sample/population sizes (schools, cities, districts, departments, etc.). There is a risk of over-interpreting the variation between such items: small samples naturally show more variation, because the standard error of a mean shrinks as σ/√n. Ignorance of this is why Howard Wainer called the standard-error formula (de Moivre's equation) "the most dangerous equation in the world".

Funnel plots are useful for avoiding over-interpretation of this variation and for finding the real outliers. The BBC once over-interpreted variation in bowel cancer mortality (population data) and was corrected by an analyst: https://www.theguardian.com/commentisfree/2011/oct/28/bad-science-diy-data-analysis . This example is also discussed in "The Art of Statistics" by David Spiegelhalter.
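
Here is a minimal funnel-plot sketch (made-up numbers, not the BBC data): give every district the same true mortality rate, vary only the population, and draw control limits at the overall rate ± 2 standard errors. The small districts scatter widely, yet everything stays inside the funnel.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)

# Synthetic districts: identical true mortality rate, different populations.
true_rate = 0.0005
populations = rng.integers(5_000, 500_000, size=100)
deaths = rng.binomial(populations, true_rate)
observed_rates = deaths / populations  # small districts scatter the most

# Funnel: overall rate +/- 2 standard errors, SE = sqrt(p * (1 - p) / n)
overall = deaths.sum() / populations.sum()
n_grid = np.linspace(populations.min(), populations.max(), 200)
se = np.sqrt(overall * (1 - overall) / n_grid)

plt.scatter(populations, observed_rates, s=12, label="districts")
plt.plot(n_grid, overall + 2 * se, "r--", label="overall rate ± 2 SE")
plt.plot(n_grid, overall - 2 * se, "r--")
plt.axhline(overall, color="gray", linewidth=1)
plt.xlabel("population")
plt.ylabel("observed mortality rate")
plt.legend()
plt.show()
# The extreme-looking rates all belong to small districts and sit inside
# the funnel, so none of them is a genuine outlier.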

One key takeaway: we can do statistics even when we have population data, by treating it like sample data. In my recent work, I took Delhi's AQI data, treated it like sample data, and constructed confidence intervals to show that the recent reduction in July's average AQI is not statistically significant.
[Chart: Delhi-clean-july-random-variation.png]
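
For reference, a minimal version of that kind of check (synthetic AQI numbers, not my actual Delhi dataset): treat each July's 31 daily readings as a sample and ask whether the difference in means exceeds random variation.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic stand-ins for daily July AQI readings in two years.
july_prev = rng.normal(95, 20, size=31)
july_curr = rng.normal(88, 20, size=31)  # lower mean, but is the drop real?

diff = july_prev.mean() - july_curr.mean()

# Welch's two-sample t-test: is the drop larger than random variation?
t_stat, p_value = stats.ttest_ind(july_prev, july_curr, equal_var=False)

# 95% confidence interval for the difference in means
se = np.sqrt(july_prev.var(ddof=1) / 31 + july_curr.var(ddof=1) / 31)
margin = stats.t.ppf(0.975, df=60) * se  # df = 31 + 31 - 2, a rough choice
print(f"mean difference = {diff:.1f}, 95% CI = ({diff - margin:.1f}, {diff + margin:.1f})")
print(f"p-value = {p_value:.3f}")
# If the CI contains 0 (equivalently p > 0.05), the drop in the July
# average is indistinguishable from day-to-day random variation.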

For next week, we planned to read up to the 6th chapter (Power-Law Distributions) of Model Thinker.
The call is tentatively on Sunday, August 24th, at 10 AM. Please RSVP to the GMeet invite when you receive it.

Please add anything I missed or misrepresented.

Best,

Dammalapati Sai Krishna

Aug 18, 2025, 11:46:58 PM
to ModelThinker
Found a similar explanation online of why height data is considered normally distributed: Why heights are normally distributed