In this notebook we look at data on the total number of confirmed cases of covid-19, number of deaths, and/or number of recoveries. The data is broken down by country or by state/province/county (depending on the country). You can view the raw data at https://github.com/CSSEGISandData/COVID-19

The analysis in this notebook is based on ideas in a 3 Blue 1 Brown video on exponential growth in general and this data in particular. See https://www.youtube.com/watch?v=Kas0tIxDvrg

Notebook author - Jeff Kinne, jkinne@cs.indstate.edu.

Load Data

First we load the data from github. Note that the data is updated once per day.

In [1]:
# read the data

# if reading from downloaded files -
#confirmed <- read.csv("time_series_19-covid-Confirmed.txt", stringsAsFactors = FALSE)
#deaths <- read.csv("time_series_19-covid-Deaths.txt", stringsAsFactors = FALSE)
#recovered <- read.csv("time_series_19-covid-Recovered.txt", stringsAsFactors = FALSE)

# if reading directly from github to get the latest files - 
library(RCurl)
confirmed <- read.csv(text=getURL("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv"), stringsAsFactor = FALSE) 
deaths <- read.csv(text=getURL("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv"), stringsAsFactor = FALSE)
recovered <- read.csv(text=getURL("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv"), stringsAsFactor = FALSE)

first_numeric_col <- 5  
# check the data to make sure this is true
#print(confirmed[1, 1:10])

print(paste("Most recent day in data", colnames(confirmed)[ncol(confirmed)]))
Loading required package: bitops

[1] "Most recent day in data X3.20.20"

Convert Data

Convert the data so that we have the number of new cases per day. This is useful in trying to figure out whether the outbreak in a region is slowing down or accelerating (is the number of new cases per day going up or down).

In [2]:
# Convert each of those so we also have the # of new cases per day
new_cases <- function(total_cases, starting_col) {
    new_ <- total_cases[, 1:(ncol(total_cases)-1)]
    new_[,starting_col:ncol(new_)] <- total_cases[,(starting_col+1):ncol(total_cases)] - total_cases[,starting_col:ncol(new_)]
}
confirmed_new <- new_cases(confirmed, first_numeric_col)
deaths_new <- new_cases(deaths, first_numeric_col)
recovered_new <- new_cases(recovered, first_numeric_col)

Which Dataset

We pick which dataset we are looking at - confirmed, deaths, or recovered. Confirmed is more up to date in one way (currently active cases), while deaths may be more accurate in another (if confirmed cases are under-reported due to limited testing or other reasons).

In [3]:
text <- "confirmed"
if (text == "confirmed") {
    data <- confirmed
    data_new <- confirmed_new
} else if (text == "deaths") {
    data <- deaths
    data_new <- deaths_new
} else {
    data <- recovered
    data_new <- recovered_new
    text <- "recovered"
}

Choose a Region

For what we do below we want to look at particular country or region. Here are some of the options.

In [4]:
# names for first 5 columns
#print(colnames(data)[1:5])

# change topN to how many you want to display
topN = 10

# countries
print(paste("Number of new",text,"in each Country.Region in the most recent day"))
new_cases_by_region <- aggregate(data_new[,ncol(data_new)], by = list(Country.Region=data[,2]), FUN = sum)
print(new_cases_by_region[order(new_cases_by_region[,2], decreasing = TRUE)[1:min(topN,nrow(data_new))],])

# provice/state
print(paste("Number of new",text,"in each Province.State in the most recent day"))
new_cases_by_state <- aggregate(data_new[,ncol(data_new)], by = list(Province.State=data[,1]), FUN = sum)
print(new_cases_by_state[order(new_cases_by_state[,2], decreasing = TRUE)[1:min(topN,nrow(data_new))],])
[1] "Number of new confirmed in each Country.Region in the most recent day"
    Country.Region    x
77           Italy 5986
153             US 5423
60         Germany 4528
140          Spain 2447
56          France 1779
156 United Kingdom 1298
73            Iran 1237
145    Switzerland 1219
108    Netherlands  536
17         Belgium  462
[1] "Number of new confirmed in each Province.State in the most recent day"
    Province.State     x
1                  19963
198       New York  2945
81          France  1741
296 United Kingdom  1294
189    Netherlands   534
30      California   225
170       Michigan   218
122       Illinois   163
195     New Jersey   148
305     Washington   148

Select Country.Region

In the next box, you choose whether you want to look at a particular Country.Region or a particular Province.State and which one you want to look at.

In [5]:
col_to_use <- which(colnames(data) == "Country.Region")
#col_to_use <- which(colnames(data) == "Province.State")

region <- "US"
#region <- "New York"

So, a plot of new cases per day. If this goes up more days than not, this is exponential growth.

In [6]:
rows <- which(data[,col_to_use] == region)
data_new_region <- data_new[rows,]

data_region <- colSums(data_new_region)
#print(data_region)
options(repr.plot.width=16, repr.plot.height=8)
plot(1:length(data_region), data_region, 
     main=paste("Number of new", text, "each day in", region), 
     ylab=paste("New", text), xlab="days since Jan 22")

Exponential Growth?

As long as the number of new cases each day is normally more than the previous day, the outbreak is experiencing exponential growth. To see this, we can plot the ratio of the current day divided by the previous day. If this ratio is normally larger than 1, this is exponential growth.

In [7]:
scale_factor <- data_region[-1] / data_region[1:(length(data_region)-1)]
plot(1:length(scale_factor), scale_factor, 
     main=paste("Ratio of new", text, "in a day divided by the previous day in", region),
    ylab="ratio", xlab="days since Jan 22")
points(c(0,length(scale_factor)), c(1,1), type="l")

Another way to see exponential growth is to plot the logarithm of the data. If the data is exponential, then the log of it will be linear.

In [8]:
plot(1:length(data_region), log(data_region), 
    main=paste("Log of number of new", text,"each day in", region),
    ylab=paste("log of number of new", text), xlab="days since Jan 22")

Predictions

If the data seems to be exponential, then we can linear-fit the logarithm of the data and make predictions based off of that. A not unreasonable way to go is to use recent days for the model, and predict some number of days out.

In [9]:
data_region_cleaned <- data_region[(length(data_region)-7):length(data_region)]
data_region_cleaned <- data_region_cleaned[data_region_cleaned >= 1]
data_log_prev <- log(data_region_cleaned)
data_to_fit <- data.frame(logVals = data_log_prev, day = 1:length(data_log_prev))
model <- lm(logVals ~ day, data_to_fit)
predict_days <- data.frame(day=1:(nrow(data_to_fit)+7))
predicted_logs <- predict(model, predict_days)
plot(predict_days$day, exp(predicted_logs), type="l", 
     main=paste("Predicted number of new",text,"each day for next 7 days based on most recent days"),
    ylab=paste("Number of new",text,"each day"),
    xlab="points are data from recent days, line beyond points is the prediction")
points(1:length(data_log_prev), exp(data_log_prev))