data on distribution of birthdays by month US '96

ber...@my-dejanews.com

Feb 23, 1999

reference: http://www.wiskit.com/marilyn/birthdays.html

If we omit the assumption that all 365 days are equally likely
[...]

MONTH |  A   |  B   |  A/B  |
-----------------------------
Jan   | 8.08 | 8.47 | 0.954 |
Feb   | 7.75 | 7.92 | 0.979 |
Mar   | 8.29 | 8.47 | 0.979 |
Apr   | 8.03 | 8.20 | 0.979 |
May   | 8.37 | 8.47 | 0.988 |
Jun   | 8.19 | 8.20 | 0.999 |
Jul   | 8.87 | 8.47 | 1.047 |
Aug   | 8.90 | 8.47 | 1.051 |
Sep   | 8.64 | 8.20 | 1.054 |
Oct   | 8.64 | 8.47 | 1.020 |
Nov   | 7.95 | 8.20 | 0.970 |
Dec   | 8.29 | 8.47 | 0.979 |
-----------------------------

In column A, we have the percentage of 1996 US live births
occurring in said month. In column B, we have the percentage
of days of 1996 occurring in said month. The last column is
the ratio of the percentages in columns A and B.
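
As a quick check of the table, columns B and A/B can be recomputed from the 1996 calendar. This is only a sketch: it assumes that B was rounded to two decimals before the ratio was taken, which is what reproduces the posted A/B values. The variable names are illustrative.

```python
import calendar

YEAR = 1996
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# Column A: percentage of 1996 US live births per month, copied from the table.
BIRTH_PCT = [8.08, 7.75, 8.29, 8.03, 8.37, 8.19,
             8.87, 8.90, 8.64, 8.64, 7.95, 8.29]

# Days in each month of 1996 (a leap year, so February has 29 days).
days = [calendar.monthrange(YEAR, m)[1] for m in range(1, 13)]
days_in_year = sum(days)

# Column B: percentage of the year's days in each month, rounded to 2 decimals.
day_pct = [round(100 * d / days_in_year, 2) for d in days]

# Column A/B, computed from the rounded B (assumption, see above).
ratio = [round(a / b, 3) for a, b in zip(BIRTH_PCT, day_pct)]

for name, a, b, r in zip(MONTHS, BIRTH_PCT, day_pct, ratio):
    print(f"{name} | {a:.2f} | {b:.2f} | {r:.3f} |")
```

A ratio above 1 means the month had more births than its share of days would predict under a uniform distribution.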

Right now I'm thinking about how to derive an approximate
distribution curve for each of the 366 days of 1996. What
comes to mind is using a polynomial of degree <= 2 for each of
the 12 months p_1, p_2, ... p_12 (p_1 for Jan, etc) where
\int_{0}^{1} {p_1(x)dx} would give an approximation to births
occurring from 00:00 CST(?) on 1/1/1996 to 00:00 CST(?) on
1/2/1996 (etc) subject to:

(1) The integral of p_1 from 0 to 31 should give the number
of births (modulo least squares?) in January 1996 (etc)

(2) Some kind of "smoothness" condition such as
(p_1)'(31) = (p_2)'(0), and so on all through the year
[for a total of twelve equations].

(3) A condition similar to:
\sum_{n=1}^{12} \int_{0}^{Days_n} [(p_n)'(x)]^2 dx
is minimal where Days_n is the number of days in month
number n.
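
To make conditions (1)-(3) concrete, here is one way they could be written out. The parametrization p_n(x) = a_n + b_n x + c_n x^2, the shorthand D_n for Days_n, and N_n for the births in month n are all assumptions introduced for this sketch. Condition (1) is taken as an exact equality rather than a least-squares one, and month 13 is read as month 1 so that (2) gives twelve equations.

```latex
\begin{align*}
\text{(1)}\quad & \int_0^{D_n} p_n(x)\,dx
   = a_n D_n + \tfrac12 b_n D_n^2 + \tfrac13 c_n D_n^3 = N_n, \\
\text{(2)}\quad & p_n'(D_n) = b_n + 2 c_n D_n = b_{n+1} = p_{n+1}'(0), \\
\text{(3)}\quad & \sum_{n=1}^{12} \int_0^{D_n} \bigl(b_n + 2 c_n x\bigr)^2 dx
   = \sum_{n=1}^{12} \Bigl( b_n^2 D_n + 2 b_n c_n D_n^2 + \tfrac43 c_n^2 D_n^3 \Bigr)
   \to \min .
\end{align*}
```

Written this way, (1) and (2) are linear in the 36 coefficients and (3) is a quadratic form in them, so the whole thing is a quadratic minimization under linear equality constraints. Note that the conditions as stated match slopes at the month boundaries but not values; requiring p_n(D_n) = p_{n+1}(0) as well would be an additional assumption.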

RFC: I'd welcome comments and suggestions as to how to
derive the "expected" number of births for each day of 1996.

data from: http://www.cdc.gov/nchswww/releases/98news/98news/natal96.htm
(see Table 15)


David A Karr

Feb 25, 1999

In article <7au59h$4jo$1...@nnrp1.dejanews.com>, <ber...@my-dejanews.com> wrote:
> In column A, we have the percentage of 1996 US live births
> occurring in said month. In column B, we have the percentage
> of days of 1996 occurring in said month. The last column is
> the ratio of the percentages in columns A and B.
>
> Right now I'm thinking about how to derive an approximate
> distribution curve for each of the 366 days of 1996. What
> comes to mind is using a polynomial of degree <= 2 for each of
> the 12 months p_1, p_2, ... p_12 (p_1 for Jan, etc) where
> \int_{0}^{1} {p_1(x)dx} would give an approximation to births
> occurring from 00:00 CST(?) on 1/1/1996 to 00:00 CST(?) on
> 1/2/1996 (etc) subject to:

It seems that really you should assume that December-January has
the same kind of "smoothness" conditions as the transition between
any other pair of months, and that in fact what you are looking for
is a periodic function approximated not by a polynomial but by
a partial Fourier series.

In fact the whole thing strikes me as a signal-processing problem.
(You assume the probability density of births at any instant during
the year is a function that varies continuously with time, you sample
it by integrating it over various unequal periods, and now you want
to reconstruct the original signal from the samples.) Then things
like Fourier transforms come to mind, but it's been a long time since
I've looked at any of that.

--
David A. Karr "Groups of guitars are on the way out, Mr. Epstein."
ka...@shore.net --Decca executive Dick Rowe, 1962

bu...@pac2.berkeley.edu

Mar 1, 1999

ber...@my-dejanews.com posted some data about the distribution
of births by month in 1996,

> MONTH |  A   |  B   |  A/B  |
> -----------------------------
> Jan   | 8.08 | 8.47 | 0.954 |
> Feb   | 7.75 | 7.92 | 0.979 |
> Mar   | 8.29 | 8.47 | 0.979 |
> Apr   | 8.03 | 8.20 | 0.979 |
> May   | 8.37 | 8.47 | 0.988 |
> Jun   | 8.19 | 8.20 | 0.999 |
> Jul   | 8.87 | 8.47 | 1.047 |
> Aug   | 8.90 | 8.47 | 1.051 |
> Sep   | 8.64 | 8.20 | 1.054 |
> Oct   | 8.64 | 8.47 | 1.020 |
> Nov   | 7.95 | 8.20 | 0.970 |
> Dec   | 8.29 | 8.47 | 0.979 |
> -----------------------------
>

> In column A, we have the percentage of 1996 US live births
> occurring in said month. In column B, we have the percentage
> of days of 1996 occurring in said month. The last column is
> the ratio of the percentages in columns A and B.

and asked about how best to approximate the birth rate over
the course of that year as some sort of smooth function.

David Karr pointed out, correctly, that since the data should
be assumed to be cyclic (assuming that 1997 births will
be more or less like 1996 births -- except in February),
Fourier methods are appropriate.

Let p(j) (j=1,...,366) be the probability of a child being born on
the jth day of the year. Model p as a Fourier series
with, say, a total of five terms:

p(j) = a_2 cos(4 pi j/366) + a_1 cos(2 pi j/366) + a_0 +
b_1 sin(2 pi j/366) + b_2 sin(4 pi j/366),

and do a least-squares fit to the twelve pieces of data to find
the five unknown coefficients. You could choose some number
of terms other than five, of course. With 12, you'll get a perfect
fit, but that's probably ridiculously overfitting the data.

I tried this -- it's a piece of cake with Matlab -- and
found that fitting to a five-term series as above worked pretty
well. Going up to seven terms didn't seem to improve things
too much.

The coefficients are

(a_2,a_1,a_0,b_1,b_2) = (0.00460,-0.00785,0.27323,-0.00847,-0.00217)

(with probabilities given in percent per day). Summed over each month,
this gives relative errors of between 1% and 2% in the totals for
January, November, and December, and under 1% everywhere else.
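
For readers without Matlab, the fit can be sketched in plain Python. The assumptions here are that each month's percentage is modelled as the sum of p(j) over that month's days (j = 1 for Jan 1) and that the least-squares problem is solved through the normal equations. The helper names (`basis`, `design_matrix`, `solve`, `fit`) are made up for this example and this is not the original Matlab code.

```python
import math

DAYS = [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]   # 1996 was a leap year
BIRTH_PCT = [8.08, 7.75, 8.29, 8.03, 8.37, 8.19,
             8.87, 8.90, 8.64, 8.64, 7.95, 8.29]          # column A of the table

def basis(j, n_harmonics=2, year=366):
    """Basis values at day j; for two harmonics: cos 2w, cos w, 1, sin w, sin 2w."""
    w = 2 * math.pi * j / year
    cos_terms = [math.cos(k * w) for k in range(n_harmonics, 0, -1)]
    sin_terms = [math.sin(k * w) for k in range(1, n_harmonics + 1)]
    return cos_terms + [1.0] + sin_terms

def design_matrix(n_harmonics=2):
    """One row per month: each basis function summed over that month's days."""
    rows, day = [], 1
    for n_days in DAYS:
        cols = zip(*(basis(j, n_harmonics) for j in range(day, day + n_days)))
        rows.append([sum(c) for c in cols])
        day += n_days
    return rows

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[pivot] = m[pivot], m[i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c in range(i, n + 1):
                m[r][c] -= f * m[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x

def fit(n_harmonics=2):
    """Least-squares coefficients from the normal equations (X^T X) c = X^T y."""
    x = design_matrix(n_harmonics)
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(x, BIRTH_PCT)) for i in range(k)]
    return solve(xtx, xty)

coeffs = fit()   # ordered a_2, a_1, a_0, b_1, b_2, as in the formula for p(j)
for name, c in zip(["a_2", "a_1", "a_0", "b_1", "b_2"], coeffs):
    print(f"{name} = {c:+.5f}")
```

Calling `fit(3)` gives the seven-term version. When comparing the output with the posted numbers, note that the follow-up post gives the order of the posted tuple as (b_2, b_1, a_0, a_1, a_2).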

That should be good enough for bernier's purpose, which (as I
understand it) is to determine how much of a difference seasonal
variations in birth rate make to the classic
probability-of-matching-birthdays puzzle. I certainly don't feel like
taking on that part of the job -- I don't see how to do it except via
Monte Carlo methods -- so I'll let someone else take over from here.

-Ted

bu...@pac2.berkeley.edu

Mar 1, 1999

I said

>p(j) = a_2 cos(4 pi j/366) + a_1 cos(2 pi j/366) + a_0 +
> b_1 sin(2 pi j/366) + b_2 sin(4 pi j/366),

[...]

>The coefficients are
>
>(a_2,a_1,a_0,b_1,b_2) = (0.00460,-0.00785,0.27323,-0.00847,-0.00217)

But I think I got my notation mixed up. The above numbers
actually have the sine coefficients first, not the cosines.
The order is

(b_2,b_1,a_0,a_1,a_2).
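
The two readings of the posted tuple can be compared by summing p(j) over each month and checking the result against column A. This sketch assumes j = 1 is Jan 1 and that the errors quoted earlier are relative errors in the monthly totals; the function names are illustrative.

```python
import math

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
DAYS = [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]   # 1996 was a leap year
BIRTH_PCT = [8.08, 7.75, 8.29, 8.03, 8.37, 8.19,
             8.87, 8.90, 8.64, 8.64, 7.95, 8.29]          # column A of the table
POSTED = (0.00460, -0.00785, 0.27323, -0.00847, -0.00217)

def monthly_percent(a2, a1, a0, b1, b2, year=366):
    """Sum the five-term series p(j) over the days of each month."""
    totals, day = [], 1
    for n_days in DAYS:
        total = 0.0
        for j in range(day, day + n_days):
            w = 2 * math.pi * j / year
            total += (a2 * math.cos(2 * w) + a1 * math.cos(w) + a0
                      + b1 * math.sin(w) + b2 * math.sin(2 * w))
        totals.append(total)
        day += n_days
    return totals

def relative_errors(totals):
    """Signed relative error of each modelled monthly total against column A."""
    return [(t - a) / a for t, a in zip(totals, BIRTH_PCT)]

# Reading 1: the tuple as first labelled, (a_2, a_1, a_0, b_1, b_2).
as_posted = relative_errors(monthly_percent(*POSTED))

# Reading 2: the order given in the correction, (b_2, b_1, a_0, a_1, a_2).
b2, b1, a0, a1, a2 = POSTED
corrected = relative_errors(monthly_percent(a2, a1, a0, b1, b2))

for name, e1, e2 in zip(MONTHS, as_posted, corrected):
    print(f"{name}  as posted {100 * e1:+.2f}%   corrected {100 * e2:+.2f}%")
```

Under these assumptions the corrected order is the one whose errors match the description in the earlier post: between 1% and 2% for January, November, and December, and under 1% for the other months.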

-Ted