COVID-19 Case Counts and Deaths

Adam Morse
Jul 7, 2020

Many people claim that there is a paradox in increasing COVID-19 case counts throughout much of the United States as daily COVID-19 death tolls continue to drop. Is this indicative of a new, less dangerous variety of COVID-19 outcompeting the earlier more dangerous version? Is this about shifting demographics, where the newer cases are concentrated among younger people who are less likely to get severely ill or die? Is this a statistical artifact about pooling data from multiple states, explainable by Simpson’s paradox?

Unfortunately, I believe that the most likely explanation is simply that deaths lag increases in cases by roughly three to four weeks. We can’t simply compare the number of confirmed cases to the number of deaths, because increased testing volume will result in more confirmed cases even if the actual number of cases is decreasing. However, we can adjust for that by using the test positivity rate (the percentage of tests that come back positive) to control for changes in the number of tests. Once we adjust for testing volume, we can estimate the true number of cases in the population. But deaths are a lagging indicator: a person can get sick on day 0, get a positive test on day 4, get hospitalized on day 10, die on day 17, and not show up in the count of deaths until day 24. If we look at individual states that have been at the leading edge of the second wave of infections, like Arizona, we find that their cases began climbing around May 25th, and their daily death tolls began to climb around June 23rd, 29 days later. Florida saw a surge in cases starting somewhere around June 1st, and its daily deaths started climbing around June 21st, about 20 days later. Some variation from state to state is to be expected; different states will have different testing criteria, different reporting speeds, and so forth. And there is also inevitably noise in real-world data. But this points to a lag of about 20–30 days between increases in cases and increases in deaths.
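As a rough check on those lags, the date arithmetic can be sketched in a few lines of Python. The dates are the approximate turning points quoted above (year 2020); the dictionary layout and the function name are illustrative choices, not anything taken from the underlying data sources.

```python
from datetime import date

# Approximate turning points quoted in the text (all in 2020).
upturns = {
    "Arizona": {"cases": date(2020, 5, 25), "deaths": date(2020, 6, 23)},
    "Florida": {"cases": date(2020, 6, 1), "deaths": date(2020, 6, 21)},
}

def lag_days(cases_upturn, deaths_upturn):
    """Days between the start of the rise in cases and the rise in deaths."""
    return (deaths_upturn - cases_upturn).days

lags = {state: lag_days(d["cases"], d["deaths"]) for state, d in upturns.items()}
print(lags)  # {'Arizona': 29, 'Florida': 20}
```

Two states are far too few to pin down a lag precisely, which is why the text settles for a range rather than a single number.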

Applying that same expectation to the United States as a whole, we would predict deaths to begin climbing somewhere around now. Actual new cases appear to have hit a low around June 17th, after about two weeks of a rough plateau with a slow rate of decline. That would predict deaths to start climbing somewhere between July 7th and July 17th. Today’s data shouldn’t be taken as significant evidence, for the same reason that low numbers over the holiday weekend shouldn’t be taken too seriously: there’s a weekly cycle to death reporting, and holiday weekends accentuate that. But if this is the explanation, deaths will begin climbing very soon nationwide, and they will continue to climb for at least the next 20 days after they begin their climb.
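The window for the national upturn in deaths follows directly from the June 17th low in cases and the 20–30 day lag. A minimal sketch, assuming the lag range estimated from the state data carries over to the country as a whole (the constant and function names are made up for illustration):

```python
from datetime import date, timedelta

NATIONAL_CASE_LOW = date(2020, 6, 17)  # approximate low in new cases, from the text
LAG_RANGE_DAYS = (20, 30)              # lag range estimated from the state data

def death_upturn_window(case_low, lag_range=LAG_RANGE_DAYS):
    """Earliest and latest dates on which deaths would be expected to turn upward."""
    shortest, longest = lag_range
    return case_low + timedelta(days=shortest), case_low + timedelta(days=longest)

earliest, latest = death_upturn_window(NATIONAL_CASE_LOW)
print(earliest, latest)  # 2020-07-07 2020-07-17
```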

Dave Blake has a terrifying graphical representation of this on his Twitter account. He calculates an infection density (equivalent to a number of actual new cases per day) by applying a formula relating confirmed cases, test positivity, and actual new cases. His formula is Actual New Cases = Confirmed Cases × 19.6 × (Test Positivity)^(1/2), that is, confirmed cases times 19.6 times the square root of test positivity. The figures below come out only if positivity is entered as a fraction (0.10 rather than 10). This is an effort to put empirical values on the intuitive understanding that a given number of confirmed cases implies more actual cases if the test positivity rate is higher. If two states both have 1000 new confirmed cases, but one has a test positivity rate of 10% and the other has a test positivity rate of 20%, we’d expect the state with the higher positivity rate to have more actual cases. Blake’s formula estimates that the first state has about 6200 actual new cases, and the second has about 8800. He explains how he derived his formula here:
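Separately from Blake's own derivation, the arithmetic of the formula as quoted above is easy to check. The sketch below is only that, a check: the function and constant names are made up for illustration, and it assumes positivity is passed as a fraction between 0 and 1, since that is what reproduces the 6200 and 8800 figures.

```python
import math

BLAKE_MULTIPLIER = 19.6  # constant quoted in the text

def estimate_actual_cases(confirmed_cases, test_positivity):
    """Estimate actual new cases from confirmed cases and test positivity.

    test_positivity is a fraction (0.10 for 10%), not a percentage.
    """
    if not 0.0 <= test_positivity <= 1.0:
        raise ValueError("test_positivity must be a fraction between 0 and 1")
    return confirmed_cases * BLAKE_MULTIPLIER * math.sqrt(test_positivity)

for positivity in (0.10, 0.20):
    estimate = estimate_actual_cases(1000, positivity)
    print(f"{positivity:.0%} positivity -> about {round(estimate)} actual cases")
# 10% positivity -> about 6198 actual cases
# 20% positivity -> about 8765 actual cases
```

Because the dependence is a square root, doubling the positivity rate raises the estimate by a factor of about 1.41 rather than 2.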

Now we have a clear prediction: if the reason deaths have been declining or flat while cases have been rising is just about lag, sometime in the next week or so, deaths will start climbing and they will keep on climbing for at least 20 days. By the end of that period, they will be at about 1500 deaths per day. If we don’t rapidly bring the rate of transmission nationwide below 1, they may continue to grow well beyond that level. I don’t want to pretend to false precision here — there are multiple effects going on, including detecting cases earlier, different states’ reporting systems, and so forth — and those can affect the lag times significantly. But if the daily death reports don’t start rising within the next week or so, it will suggest that something else is going on.

People have advanced a set of other explanations. I don’t find them convincing for reasons I’ll explain below.

Many people have argued that there has been a shift in the demographics of the people infected by COVID-19. They argue that earlier on, most of the people infected were elderly, whereas now a large number of adults in the 20–40 age range are testing positive. This seems intuitively plausible — perhaps the elderly are taking precautions against COVID-19, while young adults are more likely to be going to bars without masks and spreading the disease freely. Combined with test data showing a higher frequency of infection among younger people, this seems like a satisfying explanation. As a corollary, the case fatality ratio would then be expected to drop, because younger people are less likely to die from COVID-19. Could this explain rising cases but dropping deaths?

It could, but it could also be an artifact of more widely available testing. If at time 0, you only test people who are hospitalized, and then at time 1, you test anyone with symptoms, you’ll find that a much higher fraction of the population that tested positive at time 0 were in vulnerable parts of the population. That’s true even if there’s been no shift in the demographics of the people who were infected. If at time 0, 50% of the population who were hospitalized were 65 or older, then necessarily 50% of the population who tested positive would also be 65 or older. Shift to time 1, and a much lower percentage of the population that tests positive will be 65 or older, even if the infection rates are the same. In order to accurately measure changes in the demographics of the infected, you need to control for the testing standards. For example, if the ratio of people in ages 20–40 hospitalized with COVID-19 to people in ages 65+ hospitalized with COVID-19 changes, that would be evidence for changing demographics of infection. Even then, it could also reflect changing standards for hospitalization, but it would be stronger evidence than the demographics of people who have tested positive. Likewise, changes in the demographics of people dying of COVID-19 would presumably reflect actual changes in the demographics of the people infected with COVID-19. Of course, we need to wait for the appropriate lag to measure that. I haven’t seen any claims of changing demographics based on data that would be robust to these sorts of effects. That’s not to say that there hasn’t been some demographic shift; it’s just to say that I’m not confident that there has been, or that it has been large enough to have a major effect.
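A toy calculation shows how large this artifact can be. Every number below is hypothetical and chosen only for illustration: the same group of infected people is counted under two testing policies, with the hospitalized split 50/50 between the two age groups as in the example above.

```python
# Hypothetical counts within one fixed set of infected people.
# The infections are identical at both times; only the testing policy changes.
hospitalized = {"20-40": 200, "65+": 200}    # time 0: only hospitalized people are tested
symptomatic = {"20-40": 4000, "65+": 1500}   # time 1: anyone with symptoms is tested

def share_65_plus(positives):
    """Fraction of positive tests that belong to the 65+ group."""
    return positives["65+"] / sum(positives.values())

share_time_0 = share_65_plus(hospitalized)
share_time_1 = share_65_plus(symptomatic)
print(f"{share_time_0:.1%} -> {share_time_1:.1%}")  # 50.0% -> 27.3%
```

The share of positives aged 65 or older falls by almost half here even though nothing about who is infected has changed.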

Other people, most notably Miles Beckett in a widely shared Twitter thread, have argued that the difference between deaths and cases can be explained by Simpson’s paradox. Simpson’s paradox is a statistical phenomenon where pooling data in a way that obscures causality can lead to inaccurate conclusions. For example, imagine that we have a very expensive treatment for a disease. Because it’s very expensive, insurance companies will only pay for the treatment for ordinary people who are very sick (and thus likely to die). But because doctors assert that the treatment is valuable, rich people will pay for it for even mild cases. So take a group of 100 ordinary people with the disease. 10 of them are very sick, hospitalized, and receive the treatment; out of those 10, 9 die. Another 10 of them are very sick, hospitalized, and don’t receive the treatment; out of those 10, 8 die. Assuming that this pattern repeats often enough, we correctly conclude that the treatment is counterproductive: you’re more likely to die with the treatment than without. But now add in 100 rich people as well, all of whom are hospitalized and receive the treatment, and 18 of whom die. If we pool those sets of data together and look at the outcomes among the hospitalized patients, we see that 120 were hospitalized; out of the 10 who were not given the treatment, 8 died (80%); out of the 110 who were given the treatment, only 27 died (24.5%). Suddenly it looks like the treatment is hugely beneficial, even though in fact it increases the likelihood that you will die. The erroneous conclusion is driven entirely by the way that we pooled the data together.
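The numbers in that hypothetical can be tabulated directly. A minimal sketch using only the counts from the example above (the data layout and function name are illustrative):

```python
# (patients, deaths) among hospitalized patients in the hypothetical example.
outcomes = {
    ("ordinary", "treated"): (10, 9),
    ("ordinary", "untreated"): (10, 8),
    ("rich", "treated"): (100, 18),
}

def death_rate(treatment, wealth=None):
    """Death rate for one treatment arm, optionally restricted to one wealth group."""
    rows = [
        counts
        for (group, arm), counts in outcomes.items()
        if arm == treatment and (wealth is None or group == wealth)
    ]
    patients = sum(p for p, _ in rows)
    deaths = sum(d for _, d in rows)
    return deaths / patients

# Like-for-like comparison among very sick ordinary patients:
print(death_rate("treated", "ordinary"))    # 0.9
print(death_rate("untreated", "ordinary"))  # 0.8

# Pooled comparison across everyone hospitalized:
print(round(death_rate("treated"), 3))      # 0.245
print(death_rate("untreated"))              # 0.8
```

Within the one group where treated and untreated patients are comparable, the treated do worse; after pooling, the treated appear to do far better, purely because the treated pool is dominated by mild cases.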

Simpson’s paradox has important consequences for many real-world evaluations of medical outcomes. For example, if the effectiveness of a treatment is measured in terms of average survival times, then a study of administering treatment earlier in the progression of the disease will often show a major benefit, when the benefit is in fact driven by observational effects. And that has direct applications to some aspects of COVID-19: for example, if more people are tested earlier in their sickness, the mean time from a positive test result to death (for those people who die) will inevitably go up. In this sense, Simpson’s paradox can explain changing lag times between confirmed cases and deaths. Catch the cases earlier, and the lag time will be longer, even if the actual results are the same.
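The lag effect can be illustrated with the timeline used earlier in this article (positive test on day 4, death on day 17). The earlier test on day 2 is a hypothetical added for comparison; the outcome and its timing are held fixed.

```python
DEATH_DAY = 17  # day of death in the article's example timeline (days since infection)

def lag_from_test_to_death(test_day, death_day=DEATH_DAY):
    """Days between a positive test and death for a patient who dies."""
    return death_day - test_day

late_test_lag = lag_from_test_to_death(test_day=4)   # tested on day 4, as in the article
early_test_lag = lag_from_test_to_death(test_day=2)  # hypothetical: tested two days sooner
print(late_test_lag, early_test_lag)  # 13 15
```

The patient's course is identical in both cases; only the measured interval between the positive test and death grows.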

I don’t see, however, how Simpson’s paradox explains the overall apparent disparity between case counts and deaths. If you aggregate the falling case counts in states like New York, New Jersey, and Massachusetts with the rising case counts in states like Arizona, Florida, and Texas, you find that there was a period when, nationwide, cases were declining even though they were increasing in the Sunbelt. Lag that appropriately, and you see death totals continuing to decline. Eventually the case counts nationwide started climbing, even though in some individual states the counts were still declining. Lag that appropriately, and we predict increasing deaths in the days to come. With appropriate lagging and with adjustments for positivity rates, these are counts that largely can be aggregated by simply summing them all up.

Quality of care could explain some decline in the infection fatality rate. Part of the point of “flattening the curve” was to give time for the state of medical knowledge to advance, and it has improved. From my position as a lay person, the studies of dexamethasone’s use to significantly improve survival rates among severely ill COVID-19 patients look the most significant, but there are many smaller improvements in knowledge about how to care for COVID-19 patients. It wouldn’t surprise me if, taken as a whole, these reduced the infection fatality rate for COVID-19 by 25% or so. That’s hugely important (an improvement of that magnitude would likely represent tens of thousands of lives saved in the US alone), but it’s not enough to explain deaths decreasing while cases increase. That has to be a different effect: either a measurement effect, or just the deaths lagging.
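To see why a 25% improvement is not enough on its own, treat expected deaths as proportional to actual cases times the infection fatality rate. This is a simplification that ignores lag and changes in who is infected, and the doubling of cases below is a hypothetical input, not a measurement.

```python
IFR_REDUCTION = 0.25  # the "25% or so" guess from the text

def relative_deaths(case_growth, ifr_reduction=IFR_REDUCTION):
    """Expected deaths relative to baseline: case growth times the remaining fatality rate."""
    return case_growth * (1.0 - ifr_reduction)

# Case growth at which better care exactly cancels the extra cases.
break_even_growth = 1.0 / (1.0 - IFR_REDUCTION)
print(round(break_even_growth, 3))  # 1.333

# Hypothetical: actual cases double.
print(relative_deaths(2.0))  # 1.5
```

Under these assumptions, once actual cases grow by more than about a third, deaths should rise even with the improved care.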

Likewise, death rates were almost certainly higher during the first peak in areas where the healthcare system was overwhelmed. When doctors need to make hard triage decisions, some people who could have survived if they were given a slot in the ICU will instead die because of a shortage of beds. This also inevitably causes other, non-COVID-19 fatalities as an overwhelming number of COVID-19 cases compromises the ability to provide emergency medical services across the board. We should expect this to cause higher peaks and sharper falls in death numbers than the corresponding case numbers. But while this might explain some of the decline, and could cause an increased rate of climb in deaths as the Arizona hospitals are likely overwhelmed, it would not explain an overall disconnect, with cases rising and deaths falling.

Right now, the only explanation I see for why deaths have not started climbing yet is that we haven’t made it all the way through the lag time. I hope I’m wrong, because if I’m right, mid-to-late July will be awful and August may be as bad or worse. But that’s the most parsimonious explanation for the data I’ve seen. We’ll know for sure within the next two weeks.