Python is No Good for Mortality Rates

Here we are at the uptick in the Covid 19 pandemic. There are many sources of data which list infections and deaths as a result of the virus. It’s very tempting to want to put your Python skills to use and crunch some numbers on the infection. By and large, go for it, but one thing I’d ask you not to do is to try to calculate a “mortality rate”. This is not because Python can’t do division but, rather, working this number out is conceptually pretty tricky. It’s something that epidemiologists need to get a lot of training in to do correctly.  You can’t just take the deaths column and divided it by the infected column because the two numbers are not properly related. For example:

  • Testing is incomplete. There is a shortage of test kits where I am. So, if you present with symptoms you will not get tested unless you meet the testing protocol – that is: have you been overseas in the last 14 days; or have you been in contact with a known Covid19 case. This testing protocol means that (most) community transmission of the virus hereabouts is not included in the numbers.
  • There is evidence to believe that a large cohort of those infected are asymptomatic. That is, they have no symptoms or very mild symptoms. If that’s the case, then there is a cohort of infected people who don’t feel unwell, so they don’t get tested and are, also, omitted from the infection numbers.

These factors will mean that naive division of the reported numbers will inflate the mortality rate, making it seem worse, possibly much worse, than it really is (the Economist argues [paywall] that places with extensive testing have much lower death rates – by a factor of 5 or so – simply because they are identifying more of those infected).

Ideally, if you’re going to publish these numbers make it clear what their limitations are.

PS: The Diamond Princess is probably the only cohort to have reliable infection numbers, since everyone was tested before leaving the ship. However, their mortality rate shouldn’t be used as they’re not a representative sample (ie mostly older people who are fit enough and wealthy enough to go on a cruise).

PPS: Further virus related stuff (eg on lag) to be posted on my other not actually python blog.