CNBC claimed in a video on YouTube that Nebraska had a 16.5% increase in Covid-19 cases (the Delta variant, B.1.617.2).
I found that a curious claim because, according to WorldInfo, Nebraska's 7-day moving average of daily new cases was at 200 on May 2nd and is now at 70, a 65% drop in new cases. Working backward from the 70 reported on July 8th, I could find no earlier number that would serve as a base for a 16.5% increase ending at 70. The video also claimed that states which didn't have mask mandates and lockdowns are now seeing higher Covid case counts than states that did. That, too, reeked of bias.
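Just to show the arithmetic (using only the 200, 70 and 16.5% figures already quoted above), here is a quick sanity check:
Code:
# 7-day averages quoted above: 200 on May 2nd, 70 on July 8th
old, new = 200, 70
print((new - old) / old * 100)   # -65.0 -> a 65% drop

# a 16.5% increase that ends at 70 would have to start from roughly 60
print(round(70 / 1.165, 1))      # 60.1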
I decided to do my own analysis. Going to the belly of the beast, I downloaded the NYT Covid data by county for the US from Jan 1, 2020 to July 7, 2021. It is at
https://raw.githubusercontent.com/ny...s-counties.csv
and is 59MB of data in 1,500,000+ rows. Later I plan to get the CDC's county-level data to compare it to the NYT data.
I attempted to load the 59MB file into LibreOffice Calc and immediately ran into its row limit (Calc tops out at 1,048,576 rows). I usually use Calc to clean up csv data before I import it into my Jupyter notebooks. So I decided to use pandas and create a dataframe directly with its pd.read_csv function. It ran out of memory well before it could load the entire file; my machine has 16GB of RAM.
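As an aside, the usual pandas workaround for a file that won't fit in RAM is chunked reading; a minimal sketch (same filename as the Dask code further down), though I didn't end up needing it:
Code:
import pandas as pd

# read the csv 100,000 rows at a time; each chunk is an ordinary DataFrame
total_rows = 0
for chunk in pd.read_csv('us-counties_2021-07-07.csv', chunksize=100_000):
    total_rows += len(chunk)
print(total_rows)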
Casting about for a solution I discovered Dask, a pandas work-alike that can handle gigantic files and, on CPUs with multiple cores, can do parallel processing. Instead of holding everything in memory like Calc or pandas, it works through the data in chunks and is, apparently, limited only by the size of the disk. If the storage is a cluster, the cluster is the limit. There are reports on the web of Dask handling 2TB files.
Here is the code I worked out to read the NYT csv file:
Code:
import dask.dataframe as dd   # read_csv lives in dask.dataframe, not in plain "import dask"

df = dd.read_csv('us-counties_2021-07-07.csv', assume_missing=True)

# build a state|county|date key to sort on
scd = df.state + "|" + df.county + "|" + df.date
df = df.assign(ind=scd)
df = df.set_index('ind', sorted=False)  # sorted=True doesn't trigger sorting
# df.head(475)

df.compute().to_csv("sorted_us-counties_2021-07-07.csv")

In the read_csv call, assume_missing=True keeps the read from aborting on ragged lines in the csv, i.e., missing data: Dask treats the numeric columns as possibly missing (reading them as floats) instead of erroring out. scd is shorthand for state-county-date.
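If you're wondering how the "limited only by the size of the disk" part works in practice, Dask splits the csv into block-sized partitions and schedules them across the cores. Something like this shows the split (the 16MB blocksize is just an illustrative value, not what I used):
Code:
import dask.dataframe as dd

# a smaller blocksize means more partitions, hence more parallel tasks
df = dd.read_csv('us-counties_2021-07-07.csv',
                 assume_missing=True, blocksize='16MB')
print(df.npartitions)   # how many chunks Dask will process in parallel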
The header row is "date, county, state, fips, cases, deaths". The file comes presorted on the date column, which is in the form YYYY-MM-DD. As you'd expect, the rows are all in date order but the states and counties within each date are a jumble.
Dask can sort (set its index) on ONLY one column, unlike Pandas, which can sort on multiple columns. The code:
Code:
scd = df.state + "|" + df.county + "|" + df.date
df = df.assign(ind=scd)
df = df.set_index('ind', sorted=False)

gets around that limitation by building a single sort key, which I called scd, made by concatenating the state, county and date columns in the order I want the sort to run. The df.assign(ind=scd) call adds that key to the dataframe as a new column named "ind", and df.set_index then makes "ind" (which is scd by assignment) the index. Telling df.set_index that ind is not already sorted (sorted=False) is what triggers the sort on ind, which is scd.
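The reason one concatenated key gives the state, then county, then date order is plain string sorting, and it works for the dates because YYYY-MM-DD sorts correctly as text. A tiny illustration with made-up keys:
Code:
# made-up keys in the same state|county|date format
keys = [
    "Nebraska|Douglas|2021-07-07",
    "Nebraska|Douglas|2020-03-15",
    "Iowa|Polk|2021-01-01",
    "Nebraska|Adams|2021-07-07",
]
# plain string sorting groups by state, then county, then date
print(sorted(keys))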
The last line:
df.compute().to_csv("sorted_us-counties_2021-07-07.csv")
shows the Dask "compute" function merging the work of my 8 cores into a single dataframe, which is then written to disk as one sorted file. It is interesting to note that the sorted file ended up being 107MB in size, almost twice as big, most likely because the new ind column repeats the state, county and date text on every row. Executing the entire code, from reading in, re-indexing, and writing out the sorted file, took about 5 or 6 seconds. The documentation said that Dask was fast. I believe them!
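One caveat on that last step: compute() pulls the whole result back into a pandas dataframe in RAM before writing. If memory had still been tight, Dask's own to_csv can, as I understand it, write a single combined file by itself via its single_file flag, roughly (df being the Dask dataframe from the code above):
Code:
# let Dask write one combined csv without building a pandas dataframe first
df.to_csv("sorted_us-counties_2021-07-07.csv", single_file=True)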