Re-scaling options: Percentiles, min-max normalization, etc.


Amanda Dwelley

Oct 12, 2021, 3:42:46 PM
to justice40-open-source
Hi All,

I'd love to hear people's perspective/ideas on re-scaling individual indicators for use in a combined index or score.

It looks like the GitHub code uses "min-max normalization" where the "percentile" is:
(Observed value - minimum of all values) / (Maximum of all values - minimum of all values)

Min-max normalization preserves the shape of the data and "relative distance" between things.
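In code, that's roughly the following (a quick numpy sketch to illustrate, not the actual GitHub code):

```python
import numpy as np

def min_max_normalize(values: np.ndarray) -> np.ndarray:
    """Rescale to [0, 1], preserving relative distances between values."""
    vmin, vmax = values.min(), values.max()
    return (values - vmin) / (vmax - vmin)

# The gap between 10 and 50 stays 4x the gap between 50 and 60:
print(min_max_normalize(np.array([10.0, 50.0, 60.0, 110.0])))
# -> [0.   0.4  0.5  1. ]
```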

I believe California and other tools use more traditional percentiles, where the top 1% of values get a 100, the next 1% get a 99, etc., such that the data is "flattened" or smoothed out (you don't preserve relative distance).
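For contrast, a rank-based percentile flattens those same toy numbers into equal steps (pandas' rank(pct=True) shown as one common implementation; other tools may compute percentiles slightly differently):

```python
import pandas as pd

values = pd.Series([10.0, 50.0, 60.0, 110.0])

# Rank transform: every step between adjacent ranks is the same size,
# so the relative distances in the raw data are smoothed away.
print(values.rank(pct=True).tolist())
# -> [0.25, 0.5, 0.75, 1.0]
```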

I'm curious who else has considered the pros/cons, or whether one approach may be better for *some* indicators while another applies to *others*.

If anyone wants to go deep on this, here are some slides with examples and options, plus an ETL code excerpt.

Would love to hear what you've considered or tested!

Thanks,
Amanda
min max norm.PNG
Re-Scaling Data for Combined Scoring.pdf

Rohit Musti

Oct 12, 2021, 5:44:15 PM
to Amanda Dwelley, justice40-open-source
Hi!

When we were developing the Tree Equity Score at American Forests, we favored min-max normalization within municipalities to preserve local context. We tested other normalizations, and we actually ended up using straight max normalization for tree canopy, because we wanted a way to convey that zeros were different from the local minimum. In general, though, the other normalizations didn't preserve local context as well, and that was important to us.
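To illustrate the distinction (made-up column names and values, not our actual code): straight max normalization keeps a true zero at zero, while min-max sends the local minimum to zero.

```python
import pandas as pd

df = pd.DataFrame({
    "municipality": ["A", "A", "A", "B", "B"],
    "tree_canopy":  [0.0, 10.0, 40.0, 5.0, 25.0],
})
grouped = df.groupby("municipality")["tree_canopy"]

# Straight max normalization: value / local max. A true zero stays 0,
# distinct from the local minimum (5.0 in B, which maps to 0.2).
df["canopy_max_norm"] = grouped.transform(lambda s: s / s.max())

# Min-max, by contrast, sends the local minimum to 0 regardless.
df["canopy_min_max"] = grouped.transform(
    lambda s: (s - s.min()) / (s.max() - s.min())
)
print(df)
```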

Cheers,
Rohit M


Switzer, Shelby C. EOP/OMB

Nov 5, 2021, 1:54:11 PM
to Rohit Musti, Amanda Dwelley, justice40-open-source

Hi y’all –

 

Thanks for getting this convo started, and apologies for my own late response here. On the CEJST side, we've been evaluating percentiles vs. min-max and aren't sure yet which will be the preferred method. My personal take is that min-max is better for preserving local and relative context (similar to Rohit's explanation of the Tree Equity approach), but I've heard from other folks that traditional percentiles have the benefit of being easier to explain. With regard to relative distance, I do wonder about your point, Amanda, about smoothing out the data, and whether it makes sense to use min-max for some data but percentiles for others. That, however, might get even more complicated to explain!

 

I'd love to center the next community chat (Nov 15) on open questions like this regarding data methodology in developing a definition of disadvantaged. Of course, I'll give the disclaimer that I won't be able to talk about policy decisions regarding the CEJST or CEQ's definition of disadvantaged. But I think we should open up the floor for good community discussion of this topic, as well as others, such as the benefits of using point data and what factors should be considered in determining which datasets belong in a screening tool, whether the CEJST or ones other teams may be building.

 

If y’all or anyone in the community have other specific data methodology questions you’d like to discuss, please respond here or message me separately if you prefer.

 

Shelby

 

 

Shelby Switzer
United States Digital Service
(202) 881-6055
Shelby.C...@omb.eop.gov
They/them/theirs


 


David Holstius

Nov 8, 2021, 3:40:44 PM
to justice40-open-source
Hi all,

Just a couple of reflections from experience on (a) transforming and (b) summarising.

The Healthy Places Index (HPI) is a California-centered tool that took a different approach (z-score transform) than others (rank transform, a.k.a. percentiles). Some of the reasoning for preferring that approach, namely preserving variance, might have been captured in the HPI Steering Committee's work. I can recall some of that debate, but not very clearly.
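For anyone unfamiliar, the z-score transform centers each indicator at its mean and scales by its standard deviation, which preserves the shape and relative spread of the data; a quick sketch, not the HPI's actual code:

```python
import numpy as np

def z_score(values: np.ndarray) -> np.ndarray:
    """Standardize to mean 0, standard deviation 1; preserves shape."""
    return (values - values.mean()) / values.std()

print(z_score(np.array([10.0, 50.0, 60.0, 110.0])).round(2))
# -> [-1.33 -0.21  0.07  1.47]  (relative spacing preserved, unlike ranks)
```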

If and when one does elect to transform to percentiles (an open question), there's still a choice of summary function. Most folks default to summing after transforming. For percentile-transformed data, this is the other half of a "rank sum" method; in effect, it's somewhat like a "both-and" in a 2x2 context. Units (tracts or whatever) that score moderately high on every indicator will outscore units that score #1 on one indicator but low on the others. But there are alternatives. If you have reason to value "either-or," there is "rank product," which is a continuous, multi-valued version of that. (It also has a nice statistical interpretation that can be illustrated with shuffled decks of cards.) Rank sum is conventional, but it's still a choice. In a "world's greatest Olympian" analogy, it's kind of like preferring decathletes to athletes who excelled at a single event. You might have reason to want those single-issue units at the table, which would be a reason to value rank product. Or you might not, which would be a vote for rank sum. And of course there are other summary functions too. The most interpretable or conventional one won't necessarily yield the right results, so "ground truthing" a sample or running a simulation might be warranted.
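Here's a toy illustration of that contrast (made-up ranks, using the convention that rank 1 = most burdened, so a lower aggregate means higher priority):

```python
import pandas as pd

# Ranks on two indicators for two tracts (rank 1 = most burdened of ~500).
df = pd.DataFrame({
    "indicator_a": [1, 20],
    "indicator_b": [500, 25],
}, index=["single_issue_tract", "moderate_tract"])

# Rank sum: the moderately-high-on-everything tract wins by a mile.
df["rank_sum"] = df[["indicator_a", "indicator_b"]].sum(axis=1)    # 501 vs 45

# Rank product: the single extreme rank pulls the tracts even (500 vs 500).
df["rank_product"] = df[["indicator_a", "indicator_b"]].prod(axis=1)
print(df)
```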

I've not been part of EJSCREEN development behind the scenes, and I don't know its relation to the Justice40 group, but my understanding is that they very deliberately opted not to summarize. Sometimes, not doing something is the best course of action!

Last remark: maps of single percentile-transformed indicators can elicit different perceptions of where "hotspots" are. If your untransformed data are right-skewed and spatially autocorrelated, as many popular rate-based indicators are (e.g., asthma ED visits per 100k person-years), then the raw map may have some fairly obvious hotspots, but the map of "percentiles" will wash those out. Rates that are (say) half as large will fill up the 80th percentile, which is hard to discern from the 90th percentile on a map. On a grayscale map, the 80th-percentile shade certainly won't be half as dark as the 90th's, which it should be if you really want to compare things proportionally. And maps are pretty popular for that.
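A quick simulation shows the compression, using a lognormal draw as a stand-in for a right-skewed rate (numbers are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
rates = rng.lognormal(mean=0.0, sigma=1.5, size=100_000)  # right-skewed "rates"

p80, p90 = np.percentile(rates, [80, 90])
# The 90th-percentile rate is roughly twice the 80th-percentile rate here,
# but a percentile-shaded map renders them just one shade step apart.
print(f"80th: {p80:.2f}  90th: {p90:.2f}  ratio: {p90 / p80:.2f}")
```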

Just my 2¢ and not representative of my employer's opinion, as I'm on my lunch hour. :-) 

David Holstius