Our DNA is written in Swift

Some Statistics for Starters

As a hobby, I am working on a SwiftUI app on the side. It allows me to keep track of the height and weight of my daughters and plot them on charts that let me see how “normally” my offspring are developing.

I shied away from statistics at university, so it took me some time to research a few things to solve an issue I was having. Let me share how I worked towards a solution to this statistical problem. May you find it as instructive as I did.

Note: If you find any error of thought or fact in this article, please let me know on Twitter, so that I can understand what caused it.


Let me first give you some background as to what I have accomplished before today, so that you understand my statistical question.

Setup

The World Health Organization publishes tables that give the percentiles for length/height from birth to two years, to five years and to 19 years. Until two years of age the measurement is to be performed with the infant on its back, and called “length”. Beyond two years we measure standing up and then it is called “height”. That’s why there is a slight break in the published values at two years.

I also compiled my girls’ heights in a Numbers sheet, which I fed from paediatrician visits initially and later by occasionally marking their height on a poster at the back of their bedroom door.

To get started I hard-coded the heights like this:

import Foundation

struct ChildData
{
   let days: Int
   let height: Double
}

let elise = [ChildData(days: 0, height: 50),
	     ChildData(days: 6, height: 50),
	     ChildData(days: 49, height: 60),
	     ChildData(days: 97, height: 64),
	     ChildData(days: 244, height: 73.5),
	     ChildData(days: 370, height: 78.5),
	     ChildData(days: 779, height: 87.7),
	     ChildData(days: 851, height: 90),
	     ChildData(days: 997, height: 95),
	     ChildData(days: 1178, height: 97.5),
	     ChildData(days: 1339, height: 100),
	     ChildData(days: 1367, height: 101),
	     ChildData(days: 1464, height: 103.0),
	     ChildData(days: 1472, height: 103.4),
	     ChildData(days: 1544, height: 105),
	     ChildData(days: 1562, height: 105.2)
	    ]

let erika = [ChildData(days: 0, height: 47),
	     ChildData(days: 7, height: 48),
	     ChildData(days: 44, height: 54),
	     ChildData(days: 119, height: 60.5),
	     ChildData(days: 256, height: 68.5),
	     ChildData(days: 368, height: 72.5),
	     ChildData(days: 529, height: 80),
	     ChildData(days: 662, height: 82),
	     ChildData(days: 704, height: 84),
	     ChildData(days: 734, height: 85),
	     ChildData(days: 752, height: 86),
	    ]

The WHO defined one month as 30.4375 days and so I was able to have those values be plotted on a SwiftUI chart. The vertical lines you see on the chart are months with bolder lines representing full years. You can also notice the small step at the second year end.

It’s still missing some sort of labelling, but you can already see that my older daughter Elise (blue) was on the taller side during her first two years, whereas the second-born Erika (purple) was quite close to the “middle of the road”.

This chart gives you an at-a-glance overview of where on the road my daughters are, but I wanted to be able to put my finger down anywhere and have a popup tell me the exact percentile value.

The Data Dilemma

A percentile value basically tells you what percentage of children are shorter than your child. So if your kid is on the 75th percentile, then 75% of children are shorter than your kid. The shades of green on the chart represent the steps in the raw data provided by the WHO.

They give you P01, P1, P3, P5, P10, P15, P25, P50, P75, P85, P90, P95, P97, P99, P999. P01 is the 0.1th percentile, P999 is the 99.9th percentile. At the extremes the percentiles are very close together, but in the middle there is a huge jump from 25 to 50 to 75.

I wanted to show percentile values at arbitrary times with at least integer precision, i.e. say “47th percentile” instead of “between 25 and 50”, and probably show this position with a colored line on the distribution curve those percentile values represent.

It turns out that those height values are “normally distributed”, on a curve that looks a bit like a bell, thus the term “bell curve”. As a programmer, I understand that as a form of data compression: you only need to know the mean value and the standard deviation, and from those you can draw the curve, as opposed to interpolating between the individual percentile values.
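To make the “compression” idea concrete, here is a minimal sketch of the bell curve as a function of nothing but a mean and a standard deviation. The function name `normalDensity` is made up for this illustration, and the two numbers are the M and SD values for month 0 from the WHO table quoted further down, under the assumption that they really are mean and standard deviation.

```swift
import Foundation

/// Height of the bell curve at x for a normal distribution
/// with the given mean and standard deviation.
func normalDensity(_ x: Double, mean: Double, sd: Double) -> Double {
    let z = (x - mean) / sd
    return exp(-0.5 * z * z) / (sd * (2 * Double.pi).squareRoot())
}

// Assumed inputs: M and SD for girls at birth (month 0) from the WHO table.
let mean = 49.1477
let sd = 1.8627

// Sample the curve from three standard deviations below to three above the mean.
let samples = stride(from: mean - 3 * sd, through: mean + 3 * sd, by: sd / 2).map {
    (x: $0, y: normalDensity($0, mean: mean, sd: sd))
}

for sample in samples {
    print(String(format: "%.1f cm -> %.4f", sample.x, sample.y))
}
```

The sampled points could be fed into a chart as a line, with the peak sitting at the mean.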

The second – smaller – issue is that WHO provides data for full months only. To determine the normal distribution curve for arbitrary times in between the months we need to interpolate between the month data before and after the arbitrary value.

With these questions I turned to Stack Overflow and Math Stack Exchange, hoping that somebody could help out a statistics noob like me. Here’s what I posted…

The Problem

Given is the length percentile data the WHO has published for girls. That’s length in cm for certain months, e.g. at birth the 50% percentile is 49.1 cm.

Month  L  M        S        SD      P01   P1    P3    P5    P10   P15   P25   P50   P75   P85   P90   P95   P97   P99   P999
0      1  49.1477  0.0379   1.8627  43.4  44.8  45.6  46.1  46.8  47.2  47.9  49.1  50.4  51.1  51.5  52.2  52.7  53.5  54.9
1      1  53.6872  0.0364   1.9542  47.6  49.1  50.0  50.5  51.2  51.7  52.4  53.7  55.0  55.7  56.2  56.9  57.4  58.2  59.7
2      1  57.0673  0.03568  2.0362  50.8  52.3  53.2  53.7  54.5  55.0  55.7  57.1  58.4  59.2  59.7  60.4  60.9  61.8  63.4
3      1  59.8029  0.0352   2.1051  53.3  54.9  55.8  56.3  57.1  57.6  58.4  59.8  61.2  62.0  62.5  63.3  63.8  64.7  66.3

P01 is the 0.1% percentile, P1 the 1% percentile and P50 is the 50% percentile.

Say, I have a certain (potentially fractional) month, say 2.3 months. (a height measurement would be done at a certain number of days after birth and you can divide that by 30.4375 to get a fractional month)

How would I go about approximating the percentile for a specific height at a fractional month? i.e. instead of just seeing it “next to P50”, to say, well that’s about “P62”

One approach I thought of would be to do a linear interpolation, first between month 2 and month 3 between all fixed percentile values. And then do a linear interpolation between P50 and P75 (or those two percentiles for which there is data) values of those time-interpolated values.

What I fear is that because this is a bell curve the linear values near the middle might be too far off to be useful.

So I am thinking, is there some formula, e.g. a quad curve that you could use with the fixed percentile values and then get an exact value on this curve for a given measurement?

This bell curve is a normal distribution, and I suppose there is a formula by which you can get values on the curve. The temporal interpolation can probably still be done linear without causing much distortion. 

My Solution

I did get some responses ranging from useless to a level where they might be correct, but to me as a math outsider they didn’t help me achieve my goal. So I set out to research how to achieve the result myself.

I worked through the question based on two examples, namely my two daughters.

ELISE at 49 days
divide by 30.4375 = 1.61 months
60 cm

So that’s between month 1 and month 2:

Month  P01 P1  P3  P5  P10 P15 P25 P50 P75 P85 P90 P95 P97 P99 P999
1 47.6 49.1 50 50.5 51.2 51.7 52.4 53.7 55 55.7 56.2 56.9 57.4 58.2 59.7
2 50.8 52.3 53.2 53.7 54.5 55 55.7 57.1 58.4 59.2 59.7 60.4 60.9 61.8 63.4

Subtract the lower month: 1.61 – 1 = 0.61. So the value is 61% of the way to month 2. I would get a percentile row for this by linear interpolation. For each percentile I can interpolate values from the month row before and after it.

// e.g. for P01
p1 = 47.6
p2 = 50.8

p1 * (1.0 - 0.61) + p2 * (0.61) = 18.564 + 30.988 = 49.552  

I did that in Numbers to get the values for all percentile columns.

Month P01 P1 P3 P5 P10 P15 P25 P50 P75 P85 P90 P95 P97 P99 P999
1.61 49.552 51.052 51.952 52.452 53.213 53.713 54.413 55.774 57.074 57.835 58.335 59.035 59.535 60.396 61.957
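What I did in Numbers could be sketched in Swift roughly like this. The helper `lerp` and the array names are invented for this example; the numbers are the month 1 and month 2 rows from above.

```swift
import Foundation

/// Linear interpolation between a and b; t = 0 gives a, t = 1 gives b.
func lerp(_ a: Double, _ b: Double, _ t: Double) -> Double {
    a * (1.0 - t) + b * t
}

// WHO length percentiles for girls (P01 ... P999), copied from the two rows above.
let month1 = [47.6, 49.1, 50.0, 50.5, 51.2, 51.7, 52.4, 53.7, 55.0, 55.7, 56.2, 56.9, 57.4, 58.2, 59.7]
let month2 = [50.8, 52.3, 53.2, 53.7, 54.5, 55.0, 55.7, 57.1, 58.4, 59.2, 59.7, 60.4, 60.9, 61.8, 63.4]

// 49 days divided by 30.4375, rounded to two decimals like in the text (1.61).
let ageInMonths = (49.0 / 30.4375 * 100).rounded() / 100

// Fraction of the way from month 1 to month 2.
let t = ageInMonths - ageInMonths.rounded(.down)

// Interpolate every percentile column between the two month rows.
let interpolatedRow = zip(month1, month2).map { lerp($0, $1, t) }

print(interpolatedRow.map { String(format: "%.3f", $0) }.joined(separator: " "))
```

Rounding the age to two decimals is only there to reproduce the hand calculation; in an app you would probably keep the full precision.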

First, I tried the linear interpolation:

60 cm is between 59.535 (P97) and 60.396 (P99).
0.465 away from the lower, 0.396 away from the higher value.
0.465 is 54% of the distance between them (0.861)

(1-0.54) * 97 + 0.54 * 99 = 44.62 + 53.46 = 98.08
// rounded P98

Turns out that this is a bad example.

At the extremes the percentiles are so closely spaced that linear interpolation gives much the same result as the curve would. In the middle, where the gaps between the percentiles are large, linear interpolation would be too inaccurate.

Let’s do a better example. This time with my second daughter.

ERIKA 
at 119 days
divide by 30.4375 = 3.91 months
60.5 cm

We interpolate between month 3 and month 4:

Month P01 P1 P3 P5 P10 P15 P25 P50 P75 P85 P90 P95 P97 P99 P999
3 53.3 54.9 55.8 56.3 57.1 57.6 58.4 59.8 61.2 62.0 62.5 63.3 63.8 64.7 66.3
4 55.4 57.1 58.0 58.5 59.3 59.8 60.6 62.1 63.5 64.3 64.9 65.7 66.2 67.1 68.8
3.91 55.211 56.902 57.802 58.302 59.102 59.602 60.402 61.893 63.293 64.093 64.684 65.484 65.984 66.884 68.575

Again, let’s try with linear interpolation:

60.5 cm is between 60.402 (P25) and 61.893 (P50)
0.098 of the distance 1.491 = 6.6%

P = 25 * (1-0.066) + 50 * 0.066 = 23.35 + 3.3 = 26.65 
// rounds to P27
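The same hand calculation can be written as a small function. This is only a sketch of the linear approach; `ranks` and `linearPercentile` are names I made up, and the row is the interpolated one for month 3.91 from above.

```swift
import Foundation

// Percentile ranks of the WHO columns P01 ... P999.
let ranks: [Double] = [0.1, 1, 3, 5, 10, 15, 25, 50, 75, 85, 90, 95, 97, 99, 99.9]

/// Estimates the percentile of `height` by linear interpolation between the two
/// neighbouring columns of `row` (heights in cm, ascending, one per rank).
/// Returns nil if the height lies outside the range covered by the row.
func linearPercentile(of height: Double, in row: [Double]) -> Double? {
    guard row.count == ranks.count else { return nil }
    for i in 0..<(row.count - 1) where height >= row[i] && height <= row[i + 1] {
        let span = row[i + 1] - row[i]
        // Guard against two identical neighbouring values.
        let t = span > 0 ? (height - row[i]) / span : 0
        return ranks[i] * (1 - t) + ranks[i + 1] * t
    }
    return nil
}

// Interpolated row for month 3.91 from the table above.
let erikaRow = [55.211, 56.902, 57.802, 58.302, 59.102, 59.602, 60.402, 61.893,
                63.293, 64.093, 64.684, 65.484, 65.984, 66.884, 68.575]

if let p = linearPercentile(of: 60.5, in: erikaRow) {
    print(String(format: "P%.2f", p))
}
```

Without the intermediate rounding to 6.6% the result comes out a hair lower than the hand calculation, but it still rounds to P27.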

To compare that to approximating it on a bell curve, I used an online calculator/plotter. This needed a mean and a standard deviation, which I think I found in the left-most columns of the percentile table. But I also need to interpolate these for month 3.91:

Month L M S SD
3 1.0 59.8029 0.0352 2.1051
4 1.0 62.0899 0.03486 2.1645
3.91 1.0 61.88407 0.0348906 2.159154

I have no idea what L and S mean, but M probably means MEAN and SD probably means Standard Deviation.

Plugging those into the online plotter…

μ = 61.88407
σ = 2.159154
x = 60.5

The online plotter gives me a result of P(X < x) = 0.26075, rounded P26

This is far enough from the P27 I arrived at by linear interpolation, warranting a more accurate approach.

Z-Score Tables

Searching around, I found that if you can convert a length value into a z-score you can then look up the percentile in a table. I also found this great explanation of Z-Scores.

A z-score is the number of standard deviations that a certain data point is away from the mean.

So I am trying to achieve the same result as above with the formula:

(x - M) / SD
(60.5 - 61.88407) / 2.159154 = -0.641

Then I was able to convert that into a percentile by consulting a z-score table.

Looking up -0.6 on the left side vertically and then 0.04 horizontally I get to 0.26109. So that also rounds to P26. It is ever so slightly more than the value spewed out by the calculator, because the table only resolves the z-score to two decimal places.
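For comparison, the z-score and the table lookup can also be computed directly, without a printed table. This is a minimal sketch that uses the complementary error function `erfc` available through Foundation; the function names `zScore` and `standardNormalCDF` are my own, and the mean and standard deviation are the interpolated values for month 3.91.

```swift
import Foundation

/// Number of standard deviations that x lies away from the mean.
func zScore(_ x: Double, mean: Double, sd: Double) -> Double {
    (x - mean) / sd
}

/// Area under the standard normal curve to the left of z,
/// computed with the complementary error function.
func standardNormalCDF(_ z: Double) -> Double {
    0.5 * erfc(-z / 2.0.squareRoot())
}

// Erika: 60.5 cm at month 3.91, with the interpolated M and SD from above.
let z = zScore(60.5, mean: 61.88407, sd: 2.159154)
let area = standardNormalCDF(z)

print(String(format: "z = %.3f, area to the left = %.5f", z, area))
```

Because nothing is rounded to two decimals along the way, this should land closer to the online plotter than a table lookup does.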

How to do that in Swift?

Granted, it would be simple enough to implement such a percentile lookup table in Swift, but the prospect of getting a more accurate result with less work pushed me to go searching for a Swift package.

Indeed, Sigma Swift Statistics seems to provide the needed statistics function “normal distribution”, described as:

Returns the normal distribution for the given values of x, μ and σ. The returned value is the area under the normal curve to the left of the value x.

I couldn’t find anything that mentioned percentiles as the result, but I added the Swift package and tried it out for the second example, to see what result I would get for this value between P25 and P50:

let y = Sigma.normalDistribution(x: 60.5, μ: 61.88407, σ: 2.159154)
// result 0.2607534748851712

That is close enough to P26. It differs slightly from the value I read off the z-table, which only resolves the z-score to two decimal places, but it rounds to the same integer percentile value.

For the first example, between P97 and P99, we also get within rounding distance of P98.

let y = Sigma.normalDistribution(x: 60, μ: 55.749061, σ: 2.00422)
// result 0.9830388548349042
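Putting the pieces together, the whole path from “days and centimetres” to a percentile could look roughly like the sketch below. It deliberately avoids the Sigma package so that it runs on its own, and uses `erfc` from Foundation instead. The type `GrowthParameters`, the `table` dictionary and the function `percentile(height:days:)` are hypothetical names, and the table only contains the four months quoted in this article.

```swift
import Foundation

/// Mean (M) and standard deviation (SD) for one full month.
struct GrowthParameters {
    let mean: Double
    let sd: Double
}

// WHO length-for-age values for girls, months 1 to 4 only, copied from the tables above.
let table: [Int: GrowthParameters] = [
    1: GrowthParameters(mean: 53.6872, sd: 1.9542),
    2: GrowthParameters(mean: 57.0673, sd: 2.0362),
    3: GrowthParameters(mean: 59.8029, sd: 2.1051),
    4: GrowthParameters(mean: 62.0899, sd: 2.1645),
]

/// Approximate percentile (0...100) for a height measured `days` after birth.
/// Returns nil when the age falls outside the months contained in `table`.
func percentile(height: Double, days: Int) -> Double? {
    let months = Double(days) / 30.4375
    let lower = Int(months.rounded(.down))
    guard let a = table[lower], let b = table[lower + 1] else { return nil }

    // Linear interpolation over time for both parameters.
    let t = months - Double(lower)
    let mean = a.mean * (1 - t) + b.mean * t
    let sd = a.sd * (1 - t) + b.sd * t

    // Area under the normal curve to the left of the measurement, in percent.
    let z = (height - mean) / sd
    return 50 * erfc(-z / 2.0.squareRoot())
}

if let elise = percentile(height: 60, days: 49),
   let erika = percentile(height: 60.5, days: 119) {
    print(String(format: "Elise: P%.0f, Erika: P%.0f", elise, erika))
}
```

The fractional month is not rounded to two decimals here, so the digits after the decimal point differ a little from the hand calculation, but both examples should still round to the same integer percentiles.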

As a side note, I found it delightful to see the use of Greek letters for the parameters, a feature possible due to Swift’s Unicode support.

Conclusion

Math and statistics were the reason why I aborted my university degree in computer science. I couldn’t see how those would have benefitted me “in real life” as a programmer.

Now – many decades later – I occasionally find that a bit more knowledge in these matters would allow me to understand such rare scenarios more quickly. Thankfully, my internet searching skills can make up for what I lack in academic knowledge.

I seem to have the ingredients assembled to start working on this normal distribution chart giving interpolated percentile values for specific days between the month boundaries. I’ll give an update when I have built that, if you are interested.


Also published on Medium.


Categories: Administrative
