In this notebook we look at data on the total number of confirmed cases of covid-19, number of deaths, and/or number of recoveries. The data is broken down by country or by state/province/county (depending on the country). You can view the raw data at https://github.com/CSSEGISandData/COVID-19
The analysis in this notebook is based on ideas in a 3 Blue 1 Brown video on exponential growth in general and this data in particular. See https://www.youtube.com/watch?v=Kas0tIxDvrg
Notebook author - Jeff Kinne, jkinne@cs.indstate.edu.
First we load the data from github. Note that the data is updated once per day.
# read the data
# if reading from downloaded files -
#confirmed <- read.csv("time_series_19-covid-Confirmed.txt", stringsAsFactors = FALSE)
#deaths <- read.csv("time_series_19-covid-Deaths.txt", stringsAsFactors = FALSE)
#recovered <- read.csv("time_series_19-covid-Recovered.txt", stringsAsFactors = FALSE)
# if reading directly from github to get the latest files -
library(RCurl)
confirmed <- read.csv(text=getURL("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv"), stringsAsFactor = FALSE)
deaths <- read.csv(text=getURL("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv"), stringsAsFactor = FALSE)
recovered <- read.csv(text=getURL("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv"), stringsAsFactor = FALSE)
first_numeric_col <- 5
# check the data to make sure this is true
#print(confirmed[1, 1:10])
print(paste("Most recent day in data", colnames(confirmed)[ncol(confirmed)]))
Convert the data so that we have the number of new cases per day. This is useful in trying to figure out whether the outbreak in a region is slowing down or accelerating (is the number of new cases per day going up or down).
# Convert each of those so we also have the # of new cases per day
new_cases <- function(total_cases, starting_col) {
new_ <- total_cases[, 1:(ncol(total_cases)-1)]
new_[,starting_col:ncol(new_)] <- total_cases[,(starting_col+1):ncol(total_cases)] - total_cases[,starting_col:ncol(new_)]
}
confirmed_new <- new_cases(confirmed, first_numeric_col)
deaths_new <- new_cases(deaths, first_numeric_col)
recovered_new <- new_cases(recovered, first_numeric_col)
We pick which dataset we are looking at - confirmed, deaths, or recovered. Confirmed is more up to date in one way (currently active cases), while deaths may be more accurate in another (if confirmed cases are under-reported due to limited testing or other reasons).
text <- "confirmed"
if (text == "confirmed") {
data <- confirmed
data_new <- confirmed_new
} else if (text == "deaths") {
data <- deaths
data_new <- deaths_new
} else {
data <- recovered
data_new <- recovered_new
text <- "recovered"
}
For what we do below we want to look at particular country or region. Here are some of the options.
# names for first 5 columns
#print(colnames(data)[1:5])
# change topN to how many you want to display
topN = 10
# countries
print(paste("Number of new",text,"in each Country.Region in the most recent day"))
new_cases_by_region <- aggregate(data_new[,ncol(data_new)], by = list(Country.Region=data[,2]), FUN = sum)
print(new_cases_by_region[order(new_cases_by_region[,2], decreasing = TRUE)[1:min(topN,nrow(data_new))],])
# provice/state
print(paste("Number of new",text,"in each Province.State in the most recent day"))
new_cases_by_state <- aggregate(data_new[,ncol(data_new)], by = list(Province.State=data[,1]), FUN = sum)
print(new_cases_by_state[order(new_cases_by_state[,2], decreasing = TRUE)[1:min(topN,nrow(data_new))],])
In the next box, you choose whether you want to look at a particular Country.Region or a particular Province.State and which one you want to look at.
col_to_use <- which(colnames(data) == "Country.Region")
#col_to_use <- which(colnames(data) == "Province.State")
region <- "US"
#region <- "New York"
So, a plot of new cases per day. If this goes up more days than not, this is exponential growth.
rows <- which(data[,col_to_use] == region)
data_new_region <- data_new[rows,]
data_region <- colSums(data_new_region)
#print(data_region)
options(repr.plot.width=16, repr.plot.height=8)
plot(1:length(data_region), data_region,
main=paste("Number of new", text, "each day in", region),
ylab=paste("New", text), xlab="days since Jan 22")
As long as the number of new cases each day is normally more than the previous day, the outbreak is experiencing exponential growth. To see this, we can plot the ratio of the current day divided by the previous day. If this ratio is normally larger than 1, this is exponential growth.
scale_factor <- data_region[-1] / data_region[1:(length(data_region)-1)]
plot(1:length(scale_factor), scale_factor,
main=paste("Ratio of new", text, "in a day divided by the previous day in", region),
ylab="ratio", xlab="days since Jan 22")
points(c(0,length(scale_factor)), c(1,1), type="l")
Another way to see exponential growth is to plot the logarithm of the data. If the data is exponential, then the log of it will be linear.
plot(1:length(data_region), log(data_region),
main=paste("Log of number of new", text,"each day in", region),
ylab=paste("log of number of new", text), xlab="days since Jan 22")
If the data seems to be exponential, then we can linear-fit the logarithm of the data and make predictions based off of that. A not unreasonable way to go is to use recent days for the model, and predict some number of days out.
data_region_cleaned <- data_region[(length(data_region)-7):length(data_region)]
data_region_cleaned <- data_region_cleaned[data_region_cleaned >= 1]
data_log_prev <- log(data_region_cleaned)
data_to_fit <- data.frame(logVals = data_log_prev, day = 1:length(data_log_prev))
model <- lm(logVals ~ day, data_to_fit)
predict_days <- data.frame(day=1:(nrow(data_to_fit)+7))
predicted_logs <- predict(model, predict_days)
plot(predict_days$day, exp(predicted_logs), type="l",
main=paste("Predicted number of new",text,"each day for next 7 days based on most recent days"),
ylab=paste("Number of new",text,"each day"),
xlab="points are data from recent days, line beyond points is the prediction")
points(1:length(data_log_prev), exp(data_log_prev))