Colorado eBird: Filters and filter limits


colorad...@aol.com

Jan 21, 2014, 11:15:24 AM
to cob...@googlegroups.com, cl...@cornell.edu, bl...@cornell.edu, mj...@cornell.edu
Cobirders:

Since the question has come up privately a couple of times recently, I thought that I would respond publicly in this venue, as the information may be appreciated by all of Colorado's eBirders.

In the beginning, eBird was a very simple and simplistic world.  From the start, though, the powers-that-were deemed it important to have filters for input data in order to flag entries that were atypical.  These first filters were usually state-based, one-filter-per-state things that provided gross estimates of numbers acceptable for that state in each of the 12 months of the calendar.  Chris Wood and I constructed that first Colorado filter.  At that time, there were no non-species entries.  That is, no spuhs, slashes, hybrids, subspecies.  There were just species.

As eBird has become more refined, with much more capacity and capability, filters have become far more complex.  First was the splitting of the statewide filter into regional filters; for Colorado there were five:  Northeast, Southeast, Mountains, Northwest, Southwest.  That, obviously, required some fine-tuning of each of those five filters to more closely match each subregion's avifauna, such as not including Northern Bobwhite in the three western filters and excluding Gunnison Sage-Grouse from the two eastern filters.  Second was the addition of various non-species-level entries, the spuhs and the slashes (e.g., Semipalmated/Western Sandpiper, peep sp.).  That meant going through each of the five then-extant filters and adding those non-species entries relevant to each filter, which was done on a fairly conservative basis -- only the really common non-specific entries were added, such as Snow/Ross's Goose and Cackling/Canada Goose.  That wasn't too bad; tedious, but not too bad, and at that time, I was the only person working on Colorado's eBird filters.

With the addition of Marshall Iliff as the final member of what is familiarly called the eBird trinity (Chris Wood, Brian Sullivan, and Marshall) that runs the program, eBird's abilities expanded further, with a more in-depth taxonomy that was to cover the entire planet.  Hybrids were added, and many, many more non-species entries were added, even in the ABA area, such that there are probably now more non-species-level entries available in the ABA area than species-level entries, some used exceedingly rarely, some used widely.

Then, eBird tackled the 'April problem.'  Those of us in the filter and record-review aspects of eBird (and I was, and am, doing both) had for years complained that the rigid monthly structure of the filters made for some major problems, with April being the poster child for such problems.  In much of the ABA area, particularly the Lower 48, filter makers/editors had to decide whether to filter all occurrences of a migrant species that arrived in the filter region in the last few days of April, or to allow all occurrences of such species, even in early April when they were unknown.  In Colorado, MacGillivray's Warbler is an excellent case in point, with the vast bulk of migrants arriving in May and a very small number typically noted in the last week of April, but with the species unknown in the state prior to the 22nd or so.

The solution was to throw out the monthly framework, replacing it with, essentially, a weekly framework, but one not tied to any particular idea of 'week.'  While there are limits in all things -- and this new system's overarching limit was a maximum of 13 temporal filter periods per species per filter -- the new system allowed chopping up, in particular, the short, intense spring migration of most migrant species into periods as short as five days, with each period allowed its own filter limit.  Each filter period has a number that is 'permitted,' while any larger number of birds of that species in that time period requires review.  As an example, the in-construction Lincoln County filter has five filter periods covering the spring migration of Clay-colored Sparrow, allowing as many as 1 during 22-30 April, 9 during 1-7 May, 29 during 8-14 May, 15 during 15-21 May, and 9 during 22-31 May.  While we could simply allow any number, doing so would mean that there was no way to catch data-entry errors in numbers, such as 10 entered instead of 1 or 355 entered instead of 35 (and I have seen both of these mistakes, which are easy to make when using the number pad on a computer or laptop).  In essence, a filter limit is the result of a decision about a tenuous balance between what might occur and data-entry errors, and such decisions need to be made for as many as 13 temporal periods in each of as many as 400 species and 175 non-species entries in each filter.
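
For readers who think in code, the period-based limits described above can be sketched roughly as follows. This is only an illustration, not how eBird stores or applies its filters: the table layout and the names `limit_for` and `needs_review` are made up for the example, and treating dates outside the listed periods as having a limit of 0 is an assumption (the real filter would have its own periods for the rest of the year). The counts and dates are the Clay-colored Sparrow values quoted above.

```python
from datetime import date

# Hypothetical layout for one species' spring filter periods:
# (start (month, day), end (month, day), largest count accepted without review).
# Values are the Clay-colored Sparrow numbers quoted for the
# in-construction Lincoln County filter.
CLAY_COLORED_SPARROW_SPRING = [
    ((4, 22), (4, 30), 1),
    ((5, 1), (5, 7), 9),
    ((5, 8), (5, 14), 29),
    ((5, 15), (5, 21), 15),
    ((5, 22), (5, 31), 9),
]

MAX_PERIODS = 13  # maximum temporal periods per species per filter, as stated above


def limit_for(periods, when):
    """Return the permitted count for a date, or 0 if no listed period covers it."""
    key = (when.month, when.day)
    for start, end, limit in periods:
        if start <= key <= end:
            return limit
    return 0  # assumption for this sketch: outside the listed periods, flag everything


def needs_review(periods, when, count):
    """True when the reported count exceeds the permitted number for that date."""
    return count > limit_for(periods, when)
```

Under these assumptions, a report of 2 on 25 April would be flagged, while a report of 29 on 10 May would pass without review.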

While the new filter system allows an excellent amount of flexibility in constructing species- and location-specific filters, it is also much more complex and much more time-consuming to construct.  It takes me something like 12-20 hours of tedious effort to make a new filter from scratch and not much less than that to use existing eBird data to fine-tune existing active filters.  I use the temporal spread and abundance values from existing eBird data to create new filters or to modify existing filters.  Depending upon the region, a filter includes some 300-400 species-level entries and 125-175 other taxa (spuhs, slashes, hybrids, etc.), with multiple temporal periods for nearly every taxon.

There are 28 active filters now covering eBird Colorado.  I also have 16 filters in some stage of construction to enable a better fit to particular counties currently covered by more-general, multi-county filters.  The rationale for smaller-scale filters is generally self-evident, but as an example, I constructed a filter for Phillips County a few years ago.  I did that because Phillips County eBird data were being filtered by the general northeast Colorado filter, which, at that point, included Weld, Morgan, Washington, Kit Carson, Yuma, Phillips, Sedgwick, and Logan counties.  Note that all of those counties but Phillips have at least part of a major water body in them.  Thus, Phillips County data were allowed to include large numbers of waterbird species that were actually fairly rare there.

However, just because a particular filter covers only one county does not mean that there aren't still difficult decisions about filtering to be made.  Common Raven provides an excellent example of this challenge.  The species is regular in small numbers in the far western part of Arapahoe County, but nowhere else in the county.  It is also regular at the Rocky Mountain Arsenal NWR in the southwestern corner of Adams County, but is virtually unknown in the vast majority of the rest of Adams.  Both Arapahoe and Adams are filtered by county-specific filters, so I have to decide whether to allow unfiltered Common Ravens from parts of those two counties where they do not occur or to review every entry of Common Raven, even those from the parts of the counties in which they are known to occur with regularity.

As more data are entered in eBird, the data set for a particular filter region becomes more robust and allows for more-precise filter limits and temporal periods, and I am constantly trying to incorporate fine-tuning of existing filters.  However, I also am endeavoring to construct new filters to get the review of data from counties like Lincoln out of the hands of more-general filters that are not particularly effective for the county (Lincoln is currently in the Southeast filter, which also includes Baca, Prowers, Kiowa, Bent, Otero, and Crowley counties).  Thus, you may encounter remnants of previous filter strategies when entering data into eBird, simply because I have not found the free time to completely revamp older filters.  Please bear with me on these minor problems while I'm still dealing with larger ones, as the hundreds and hundreds of hours that I spend on eBird filter and review tasks are a volunteer effort.  However, feel free to drop me a line about any particularly egregious filter problems.

Current Colorado eBird filters:
Adams
Arapahoe
Archuleta, Dolores, La Plata, Montrose, San Miguel
Boulder
Broomfield, Denver
Chaffee
Clear Creek, Gilpin
Custer
Delta, Mesa
Douglas
El Paso
Elbert
Fremont
Huerfano
Jefferson
Larimer
Las Animas
Montezuma
Northern Mountains (Grand, Jackson, Lake, Park, Summit, Teller)
Northeast (Kit Carson, Logan, Morgan, Sedgwick, Washington, Yuma)
Northwest (Eagle, Garfield, Moffat, Pitkin, Rio Blanco, Routt)
Phillips
Pueblo
San Juan
San Luis Valley (Alamosa, Conejos, Costilla, Rio Grande, Saguache)
Southeast (Baca, Bent, Cheyenne, Crowley, Kiowa, Lincoln, Otero, Prowers)
Southwest montane (Gunnison, Hinsdale, Mineral, Ouray)
Weld

Filters in construction
Archuleta
Baca
Bent
Cheyenne
Crowley, Otero
Dolores, Montrose, San Miguel
Grand, Jackson
Kit Carson, Yuma
La Plata
Lake
Lincoln
Ouray
Park
Prowers
Summit
Teller


Tony



Robert Parsons

Jan 21, 2014, 1:02:01 PM
to colorad...@aol.com, cob...@googlegroups.com, cl...@cornell.edu, bl...@cornell.edu, mj...@cornell.edu

From my perspective, a big thanks to all those who do the detailed behind-the-scenes work on eBird so that the rest of us can enjoy this wonderful tool.

 

Robert Parsons

Washington DC

--
You received this message because you are subscribed to the Google Groups "Colorado Birds" group.

colorad...@aol.com

Jan 21, 2014, 4:17:04 PM
to nc...@cdc.gov, cob...@googlegroups.com, cl...@cornell.edu, bl...@cornell.edu, mj...@cornell.edu
Hi Nick:

The primary problem inherent in this thesis is not directly the establishment of new highs, and thus filter limits pushed ever higher, but the inability to catch important data-entry errors -- the ones measurable in orders of magnitude.

Example:  Red-naped Sapsucker occurs in Adams Co. mostly as single individuals and, for argument's sake, we'll say that there are 107 such entries (I, personally, never saw more than one/day in the county in my 14 years there).  Then, someone scores two of 'em, which the filter flags.  The observer describes well and/or photographs both birds (perhaps they were both banded at the banding station) and the report is validated.  With an automated system for filter limits, the filter would then climb to 2.  The next fall, someone mistakenly hits the '2' key when s/he intended to hit the '1' key and the new automatic filter limit allows it to enter the data set without review, despite the fact that it would be only the second time in at least 109 occurrences in the county recorded by eBird in which more than one was reported.  Any statistician will tell you that that is significantly different from normal, yet such an entry would receive no oversight and an error would be included in the data set.  However, I would have maintained the filter at 1 and would have caught that error, requesting details from the observer who, hopefully, would realize the error and fix it.
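
The sapsucker scenario can be played out in a few lines of code. This is a toy sketch under simplifying assumptions, not eBird's actual logic: the function name `flagged_counts` is made up, and every flagged report is treated as validated on review, which is what allows the automatic limit to climb.

```python
def flagged_counts(counts, limit, automatic):
    """Return, in order, the reported counts that get flagged for review.

    counts:    reported counts, oldest first.
    limit:     starting filter limit.
    automatic: if True, each flagged count becomes the new limit
               (assumes the reviewer validated it).
    """
    flagged = []
    for count in counts:
        if count > limit:
            flagged.append(count)
            if automatic:
                limit = count  # limit ratchets up to the new high count
    return flagged


# 107 single birds, one genuine report of 2, then a '2' typed in place of '1'.
reports = [1] * 107 + [2] + [2]

fixed_flags = flagged_counts(reports, limit=1, automatic=False)
auto_flags = flagged_counts(reports, limit=1, automatic=True)
```

With the limit held at 1, both reports of 2 are flagged, so the typo gets a second look. With the automatic limit, only the first report of 2 is flagged and the typo goes into the data set unreviewed.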

Granted, it's probably unlikely that the filter limit would climb any more in that situation, but let's extend the argument anyway with a different species.

eBird has 39 checklists from Larimer Co. during the seven-day period of 15-21 October that include Double-crested Cormorant (DCCO).  The current filter limit for the county in October is 600, the max from the seven-day period is 2500, and the average abundance is 397, which is well below the current filter limit.  The single highest count accounts for >16% of all of the DCCOs for that time period in Larimer and any statistician will tell you that that is an outlier.  Since it is not possible to determine what the second-highest tally is without downloading all of the Larimer data or hunting for it among the huge number of occurrences on the eBird map for October (one cannot generate maps for time periods shorter than one month and reviewers cannot get an output that will generate this sort of information, at least, not at all easily), I cannot say what that next-highest value is, but by removing the 2500, the average count/checklist drops to 341.  So, for argument's sake, let's use the arbitrary value of 700 for the second-highest tally.  This, then, might have been the filter limit under an automatic-filter system.
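
The arithmetic behind those cormorant figures can be checked with a quick sketch. The only inputs are the numbers quoted above (39 checklists, an average of 397, a high count of 2500); since the quoted average is presumably rounded, the results are approximate, and the variable names are just for illustration.

```python
checklists = 39
mean_count = 397   # average abundance quoted above (presumably rounded)
high_count = 2500  # single highest count for 15-21 October

# Approximate total of all cormorants reported in the period.
total = checklists * mean_count

# Fraction of that total contributed by the one high count.
share_of_total = high_count / total

# Average per checklist once the high count is set aside.
mean_without_high = (total - high_count) / (checklists - 1)
```

This gives a total of about 15,483 birds, of which the single count of 2500 is a little over 16%, and an average of between 341 and 342 per checklist once that count is removed, in line with the figures above.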

With the validation of the 2500 count, the filter limit gets bumped.  Now, the range of possible data-entry errors that are automatically accepted increases dramatically.  Not only do all of the potential mistakes beginning with the digit '1' get accepted (e.g., 1000 for 100, 1100 for 100, 1200 for 120), but many possible errors beginning with the digit '2' get accepted as well (e.g., 2000 for 200, 2200 for 200 or even 20, etc.).  However, with the lower filter limit, not only will observers be more likely to catch their errors ("why is that entry of 20 flagged? oh, I accidentally hit the '2' twice as well as the '0' twice") before checklist submission, but the local reviewer will also have a chance to confirm that one actually intended 2200 rather than some other entry.
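
To make that concrete, here is a small sketch that runs the typos listed above against three limits: the current 600, the arbitrary 700 used for argument's sake, and 2500 after the high count is validated. The name `slips_through` and the list layout are made up for the example, and a count equal to the limit is assumed to be accepted.

```python
# (intended, typed) pairs taken from the examples in the paragraph above.
TYPOS = [
    (100, 1000),
    (100, 1100),
    (120, 1200),
    (200, 2000),
    (200, 2200),
    (20, 2200),
]


def slips_through(limit):
    """Return the typos whose typed value is within the limit, so never flagged."""
    return [(intended, typed) for intended, typed in TYPOS if typed <= limit]


for limit in (600, 700, 2500):
    print(limit, slips_through(limit))
```

At 600 or 700 every one of these typos is flagged; at 2500 all six enter the data set without review.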

The take-home message is that outliers are outliers and should not create the bounds between which data are considered "non-outliers."


Tony

Tony Leukering
Largo, FL



-----Original Message-----
From: Komar, Nick (CDC/OID/NCEZID) <nc...@cdc.gov>
To: coloradodipper <colorad...@aol.com>; cobirds <cob...@googlegroups.com>
Cc: clw37 <cl...@cornell.edu>; bls42 <bl...@cornell.edu>; mji26 <mj...@cornell.edu>
Sent: Tue, Jan 21, 2014 3:13 pm
Subject: RE: [cobirds] Colorado eBird: Filters and filter limits

Tony:
 
Wow. I had no idea so much time and effort goes into the filtering process for Colorado ebird entries. Volunteers who put in as much time and effort (and above all, quality time and quality effort) as you do deserve a big award, or even better, a big reward!
 
Ebird team:
 
Regarding filters in place on the number of individuals reported for a species, I’ve noticed some filters recently that need adjusting. Examples for Larimer County (Colorado) would include Double-crested Cormorant and California Gull. I think the filters are currently 1000 and 400, respectively. However, it is not unusual at certain hotspots to surpass these filters. The problem is that these large congregations are site specific, and it would be too labor intensive to have filters established by human beings (even superhumans like Tony) at scales below county level. So, here is a thought (in case you all had not already thought of it). For hotspots or broader geographic areas (e.g. counties) with a certain threshold number of checklists, have ebird automatically generate filters. This is already in place for birds not on the default list for the location, because adding a species requires the user to confirm the addition. But for the number of individuals for the species already on the default list, an automatic variable filter could be programmed for all species that would be equal to each species’ previous high count for the location (and period). In this way, ebird would ask for confirmation for any reported datum only when a new high count is established for that species at that location and period. In this way, these site-specific filters would automatically increase over time as new high counts are established at a fine geographic scale. For most (common) species that don’t really merit the effort to continuously manage filters even at broad geographic scales, this system could mitigate input errors that would erroneously establish new high counts reported to ebird for that location and period. For rare species that merit human review, a lower fixed threshold still makes sense. 
If this system were put in place, and gulls start piling up this winter at Horseshoe Lake in Loveland, CO, then every time I report more than 400 Cal Gulls, I would not be required to comment (a bird log feature); however, if at any point I report a new high count for Horseshoe Lake, I would be cross-checked by ebird to ensure the input number was not an error.
 
If this idea has already been considered, I apologize for taking up your valuable time, and keep up all the good work.
 
Nick Komar
Fort Collins CO
 
From: cob...@googlegroups.com [mailto:cob...@googlegroups.com] On Behalf Of colorad...@aol.com
Sent: Tuesday, January 21, 2014 9:15 AM
To: cob...@googlegroups.com
Cc: cl...@cornell.edu; bl...@cornell.edu; mj...@cornell.edu
Subject: [cobirds] Colorado eBird: Filters and filter limits
 