grep pattern to find special combinations of fields in a large csv file

58 views
Skip to first unread message

Lewis Downey

unread,
Jul 29, 2020, 9:56:03 AM7/29/20
to BBEdit Talk
Hello!

I do not know where to start to put together a grep pattern that will parse a 375,000+ line csv file in a specific way for a personal project. The file is publicly available from the NY Times and contains data related to tracking Covid-19 cases and deaths. Each line consists of six fields:

* a date
* a county name
* a state name
* FIPS number for the county
* cumulative number of Covid-19 cases reported in that county as of that date
* cumulative number of Covid-19 deaths reported in that county as of that date

I am trying to extract the date of the first reported Covid-19 death for each county in the US. This would be the first time that the last field is greater than 0 for unique county+state combinations. About 1/3 of all counties still have no reported deaths. Those counties should not be found be found by the pattern. In round numbers there are about 3100 counties in the report. Roughly 2000 have reported at least one Covid-19 death. I am trying to find those specific 2000ish counties and including the date of the first reported death.

The pattern needs to find the first instance of unique combinations of county & state where the last comma-delimited field is greater than zero. The pattern will be used by BBedit in an Applescript.  The Applescript execution needs to be as brief as possible so as not to commandeer my laptop for a meaningful chunk of time. The script will be run at least once each week. By iterating through a list of counties I made Applescript, BBedit, & Numbers do this but as-built the execution took too long, is prone to running out of memory, and otherwise makes my laptop close to unusable while it runs. An efficient grep pattern would be incredibly helpful.

Here is a handcrafted sample of the file:
date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061,1,0
2020-01-22,Snohomish,Washington,53061,1,0
2020-01-23,Snohomish,Washington,53061,1,0
2020-01-24,Cook,Illinois,17031,1,0
2020-01-24,Snohomish,Washington,53061,1,0
2020-01-25,Orange,California,06059,1,0
2020-01-25,Cook,Illinois,17031,1,0
2020-01-25,Snohomish,Washington,53061,1,0
2020-01-26,Maricopa,Arizona,04013,1,0
2020-01-26,Los Angeles,California,06037,1,0
2020-01-26,Orange,California,06059,1,0
2020-01-26,Cook,Illinois,17031,1,0
2020-03-03,Wake,North Carolina,37183,1,0
2020-03-04,Wake,North Carolina,37183,1,0
2020-03-09,Union,New Jersey,34039,1,0
2020-03-21,Essex,Massachusetts,25009,41,0
2020-03-22,Essex,Massachusetts,25009,60,0
2020-03-23,Essex,Massachusetts,25009,73,1
2020-03-24,Essex,Massachusetts,25009,118,1
2020-03-24,Union,New Jersey,34039,246,2
2020-03-25,Essex,Massachusetts,25009,177,1
2020-04-09,Union,New Jersey,34039,5203,145
2020-04-12,Essex,Massachusetts,25009,3170,102
2020-04-13,Essex,Massachusetts,25009,3413,114
2020-04-15,Wake,North Carolina,37183,510,1
2020-05-06,Union,New Jersey,34039,13604,800
2020-05-07,Union,New Jersey,34039,13781,829
2020-05-08,Union,New Jersey,34039,13917,844
2020-06-30,Wake,North Carolina,37183,5178,47
2020-07-01,Wake,North Carolina,37183,5379,48
2020-07-02,Wake,North Carolina,37183,5590,49

From that sample the grep pattern needs to find exactly these three lines:
2020-03-23,Essex,Massachusetts,25009,73,1    <- 1st occurrence of Essex, Massachusetts w/ last field great than zero 0
2020-03-24,Union,New Jersey,34039,246,2        <- 1st occurrence of Union, New Jersey w/ last field great than zero 0
2020-04-15,Wake,North Carolina,37183,510,1    <- 1st occurrence of Wake, North Carolina w/ last field great than zero 0

And report them as
2020-03-23,Essex,Massachusetts
2020-03-24,Union,New Jersey
2020-04-15,Wake,North Carolina

There are other counties in the sample but none of the others have a 6th comma-delimited field with a value greater than 0. Those counties should not be found.
That said, an alternative report is perfectly workable if it is easier to generate (as shown below but without the descriptive comments). It is hard for me to see how this alternative report
would be easier to generate, but grep has plenty of other qualities i would not imagine.

2020-01-25,Snohomish,Washington,0        <- most recent date reported for county where last comma-delimited field = 0
2020-01-26,Cook,Illinois,0                       <- most recent date reported for county where last comma-delimited field = 0
2020-01-26,Los Angeles,California,0        <- most recent date reported for county where last comma-delimited field = 0
2020-01-26,Maricopa,Arizona,0            <- most recent date reported for county where last comma-delimited field = 0
2020-01-26,Orange,California,0            <- most recent date reported for county where last comma-delimited field = 0
2020-03-23,Essex,Massachusetts,1       <- number of deaths first reported now included in report (almost always 1)
2020-03-24,Union,New Jersey,2         <- number of deaths first reported now included in report (almost always 1, 2 shown here to cover the case of more than 1)      
2020-04-15,Wake,North Carolina,1       <- number of deaths first reported now included in report (almost always 1)

I can build grep patterns for really easy stuff but do not know how to approach this. Is it even possible?

Any help you provide will be appreciated.  Thank you for taking the time to think about this question.

Lewis

Neil Faiman

unread,
Jul 29, 2020, 10:15:41 AM7/29/20
to BBEdit Talk Mailing List
AppleScript is wonderful as a glue language. As a data processing language, not so much.

BBEdit is happy running any sort of scripts, not just AppleScript.This looks like it should be trivial in Perl.

Regards,

Neil Faiman

--
This is the BBEdit Talk public discussion group. If you have a feature request or need technical support, please email "sup...@barebones.com" rather than posting here. Follow @bbedit on Twitter: <https://twitter.com/bbedit>
---
You received this message because you are subscribed to the Google Groups "BBEdit Talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bbedit+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bbedit/455abe22-f5d7-411c-9615-bdc7968f3256o%40googlegroups.com.

Robin Beech

unread,
Jul 29, 2020, 11:17:12 AM7/29/20
to BBEdit Talk
I often write scripts to process text that I put in the text filters folder so they are accessible from the Text menu, maybe this is what you plan for your grep script. If find awk is very good for processing large volumes of data organized by column.

Here is an awk text filter that seems to do what you are looking for

#!/usr/bin/awk -f
BEGIN {
# fields are separated by a comma
FS = ",";
}
{
# skip the current line if the last field is not a number
if (! ($6 ~ /[0-9]+/)) next;
# save lines based on County:State
key = $2 ":" $3;

# if the last field is greater than zero 
# and we didn't save this County:State before, save it
if (($6 > 0) && ( ! (key in firstoccurrences))) firstoccurrences[key] = $1 "," $2 "," $3 "," $6; 

}
END {
# when we are done processing all the lines print out the first occurrences
for(key in firstoccurrences) {
print firstoccurrences[key];
}
}

face

unread,
Jul 29, 2020, 12:58:41 PM7/29/20
to BBEdit Talk
These things are so much easier in a database...with a simple MySQL table of

  cv19(recDate, county, state, FIPS, cases, deaths)

bulk load the data and execute the query

'select  recDate, county, state , count(deaths), min(deaths) from cv19 where  deaths > 0 group by county, state order by recDate'

yields the date of the first non-zero count of deaths, the number of records where deaths are non-zero, and the number of deaths on the first reported date...

of course this doesn't answer the question of how to do this with grep or trivially with PERL.

Lewis Downey

unread,
Jul 29, 2020, 1:30:49 PM7/29/20
to BBEdit Talk
Robin,

Your text filter works like a charm! The results are in some kind of funky order but accurate. The filter barely took moments to runs - a gigantic improvement. I have not yet found a way to script the application of the filter, but if that never happens spending 60 seconds to run the filter and save the results manually is a piece of cake.

Thank you very much!

Lewis

Lewis Downey

unread,
Jul 29, 2020, 1:41:11 PM7/29/20
to BBEdit Talk
Neil, face,

No doubt you are right. By now i should be perfectly comfortable building and querying MySQL tables. However so far I have punted on learning SQL. Same for Perl. It could still happen, maybe...

Thanks for the replies.

(still curious if grep could find the correct results.)

David G Wagner

unread,
Jul 29, 2020, 2:12:08 PM7/29/20
to BBEdit Talk
You will have in history and you can use the script editor to record your actions, save that output and execute. Personally a Perl individual, but you should be able to follow the shooting... A thought...

Wags ;)
WagsWorld
Hebrews 4:15
Ph(primary) : 408-914-1341
Ph(secondary): 408-761-7391
--
This is the BBEdit Talk public discussion group. If you have a feature request or need technical support, please email "sup...@barebones.com" rather than posting here. Follow @bbedit on Twitter: <https://twitter.com/bbedit>
---
You received this message because you are subscribed to the Google Groups "BBEdit Talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bbedit+un...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages