I was thinking about how large language models interact with Indigenous ways of knowing and being (as one does, naturally), and stumbled on the emerging framework of Indigenous Data Sovereignty (IDSov). As a field that looks at the overall relationship between Indigenous peoples and information systems, IDSov draws on ideas that have been developed throughout history, including both intra-tribal norms and the challenges that emerged after colonial contact. Though many of these ideas pre-date the rise of big data (including LLMs and other derived content), the last few decades have sharpened focus on the relevant harms and accelerated the need for Indigenous peoples to formalize these standards in a way that preserves their right to self-determination.
This might all seem very abstract, so let's look at some examples. The United States government and its agents have a long history of manipulating, misrepresenting, and outright falsifying information with respect to tribal nations. As Rebecca Tsosie notes in her comprehensive article on IDSov, a number of treaties were ratified under dubious pretenses, with tribes "alleging that federal officials undercounted the total population and sometimes miscounted by including persons who were not even tribal members." Even today, programs like the census disproportionately undercount Native Americans, buttressing cultural myths (consciously or unconsciously) like the "vanishing Indian".
Beyond quantitative representations, qualitative information can be a site of harm for tribes and their members. Post-contact, the United States waged a persistent war of misinformation regarding tribal nations, including the erasure of tribal written language and agricultural practices in service of a narrative of uncivilized "savages". In addition to misrepresentation, the undesired disclosure of information can harm tribes, such as when traditional ecological knowledge is disclosed in Freedom of Information Act requests during federal agency review processes. In a landmark case, the Havasupai Tribe successfully sued over the use of Indigenous blood samples for reasons beyond the original collection, on the grounds that it violated their cultural property and creation stories.
Drawing on the ideas above, IDSov encompasses a set of frameworks (defined in different ways by different groups, from the Māori to Native American coalitions) for placing control over data in the hands of tribes and Indigenous peoples. A catalyst in the development of this field has been the UN Declaration on the Rights of Indigenous Peoples, which includes the right to "maintain, control, protect and develop their cultural heritage, traditional knowledge and traditional cultural expressions," alongside other cultural protections. IDSov asks that we look at how data can construct group identities and hierarchies of power, pushing us to recognize that data collection (whether it's a census, medical testing, or something else) is an inherently political activity, rather than a neutral one. Maggie Walter's "data of disregard" phrasing captures this perfectly, as she looks at the ways that selective data collection and representation can be used to pathologize communities (e.g. as impoverished, low in life expectancy, or prone to substance abuse), echoing our earlier discussion of deficit framing.
What's particularly encouraging here is the work that tribes are doing to assert their sovereignty in this space. A recent survey indicated that nearly 50 tribes have incorporated data sovereignty principles into their tribal codes, many within the last decade. These questions are being raised in other communities, too, which seek to take control of data gathering and the corresponding narratives. Organizations like the Tubman Center are building out data projects around community health, and the Black Brilliance Project compiled a 1,000-page report in support of participatory budgeting and other community well-being initiatives.
If we return to large language models, we can see the same debates around the political nature of data and representation, but with an even more nebulous operator ("the algorithm"). As someone who works on tools for data processing, this is of particular interest and concern for me. Ensuring that automated systems don't replicate the "data of disregard" is a tall order, and it's not clear how best to establish guardrails that can respect pluralistic notions of data sovereignty in monolithic foundation models (the underlying components that power LLMs like GPT and Bard). Simple solutions like ignoring all potential sources of Indigenous data create a different kind of inequity, perpetuating a kind of digital erasure that mirrors the "vanishing Indian" in a different form. What's clear is that we don't have the answers today, and tribes must be part of finding the way forward.
Here are this week's invitations:
Personal: When you read demographic statistics like the census, what stories are you telling yourself? Where might you find other perspectives?
Communal: How can we create resources for more community-led data projects, and support these in a way that preserves agency (by avoiding coercive grant restrictions, etc.)?
Solidarity: Support the Seattle Indian Health Board and their work to "decoloniz[e] data by identifying the resiliencies and gaps in our communities and using techniques rooted in Indigenous knowledge to address them."
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
FAQ
Can I share this newsletter with non-Googlers? Yes! Feel free to forward this note externally; it does not contain confidential information.
Is this an official Google newsletter? Nope. The views expressed in this newsletter are not the official position of Google, and we are not affiliated with any particular ERG.