Panopticon Images

Laura N Gerard

Aug 4, 2024, 9:57:47 PM

to laitabarve

Wearing his social reformer hat, Bentham proposed the Panopticon as an improvement over the prevailing prison and carceral system of his day, which he believed to have harmful knock-on effects and societal costs, particularly for Great Britain's poorest.

The idea of the panopticon, a prison built around constant surveillance of the incarcerated, has proved to be one of the more controversial ideas in the history of crime, punishment, and incarceration. However, the idea has enjoyed considerable attention in broader culture for almost two centuries.

Camera (in iOS and iPadOS) relies on a wide range of scene-understanding technologies to develop images. In particular, pixel-level understanding of image content, also known as image segmentation, is behind many of the app's front-and-center features. Person segmentation and depth estimation power Portrait Mode, which simulates effects like shallow depth of field and Stage Light. Person and skin segmentation power semantic rendering in group shots of up to four people, optimizing contrast, lighting, and even skin tones for each subject individually. Person, skin, and sky segmentation power Photographic Styles, which creates a personal look for your photos by selectively applying adjustments to the right areas, guided by segmentation masks, while preserving skin tones. Sky segmentation and skin segmentation power denoising and sharpening algorithms for better image quality in low-texture regions. Several other features consume image segmentation as an essential input.
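To make "selectively applying adjustments guided by segmentation masks" concrete, here is a minimal sketch. It is not the Camera pipeline's actual code: the function name `apply_masked_gain`, the grayscale image layout, and the simple brightness gain are all assumptions chosen for illustration.

```python
def apply_masked_gain(image, mask, gain):
    """Blend a brightness gain into `image` only where `mask` is set.

    image: 2D list of floats in [0, 1] (grayscale, for brevity; an assumption)
    mask:  2D list of floats in [0, 1], same shape (1.0 = fully inside the region)
    gain:  multiplicative adjustment applied inside the masked region
    """
    out = []
    for img_row, mask_row in zip(image, mask):
        row = []
        for pixel, m in zip(img_row, mask_row):
            adjusted = min(1.0, pixel * gain)  # clamp to the valid range
            # Soft blend: the mask weight picks between adjusted and original.
            row.append(m * adjusted + (1.0 - m) * pixel)
        out.append(row)
    return out


# Hypothetical usage: brighten only the left pixel, which the mask marks as "sky".
result = apply_masked_gain([[0.2, 0.4]], [[1.0, 0.0]], gain=2.0)
```

A soft (fractional) mask gives a gradual transition at region boundaries, which is one reason a per-pixel mask is more useful here than a bounding box.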

Our approach to panoptic segmentation makes it easy to scale the number of elements we predict, for a fully parsed scene, to hundreds of categories. This year we've reached an initial milestone of predicting both subject-level and scene-level elements with an on-device panoptic segmentation model that predicts the following categories: sky, person, hair, skin, teeth, and glasses.

In this post, we walk through the technical details of how we designed a neural architecture for panoptic segmentation, based on Transformers, that is accurate enough to use in the camera pipeline but compact and efficient enough to execute on-device with negligible impact on battery life.

Besides the pure numerical improvements, employing a single ANE segment allows us to participate in a sophisticated camera pipeline, in which many latency-sensitive workloads run in parallel to maximize the utilization of all available coprocessors.

Second, DETR is highly efficient when evaluating regions of interest (RoIs). Two-stage approaches, such as Mask R-CNN, evaluate thousands of anchor-based RoIs before forwarding hundreds of top-ranked proposals to the second stage. We instead constrain the number of RoIs in the original DETR model by an order of magnitude (from its default configuration of 100), and yet obtain negligible degradation in detection performance for our target distribution of images (

In a forward pass of DETR, each RoI generates a unique segmentation mask. Input to the pass includes a unique set of feature maps from the Transformer module, along with a common set of feature maps from the Convolutional Encoder module.

When the number of RoIs is large and the output resolution is set to a relatively low value, the batched convolutional decoder module is only one of the performance bottlenecks. With higher output resolutions, it becomes the dominant bottleneck. We set our output resolution as high as 384x512 to obtain high-quality segmentation masks. To mitigate the performance bottleneck of DETR at high resolutions when scaling to large numbers of object queries, we propose HyperDETR.

First, the convolutional decoder can be run without batching along the sequence axis, which decouples the complexity of high-resolution mask synthesis from the RoI sequence length (100 in a standard DETR configuration, 4 in ours).
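One way to picture this decoupling is sketched below. The excerpt does not spell out the mechanism, so this is an assumption: the shared decoder features are computed once per image, and each RoI contributes only a small weight vector (hypothetically produced by the Transformer) that is applied as a per-pixel dot product. The name `dynamic_masks` and the flat list layout are illustrative only.

```python
import math


def dynamic_masks(features, roi_weights):
    """Turn shared decoder features into one mask per RoI.

    features:    list of C channels, each a flat list of H*W floats
                 (decoder output, computed once, independent of RoI count)
    roi_weights: list of N weight vectors of length C, one per RoI
                 (assumed here to come from the Transformer)
    returns:     N masks, each a flat list of H*W sigmoid probabilities
    """
    num_pixels = len(features[0])
    masks = []
    for weights in roi_weights:
        # Per-RoI work is just a C-length dot product at every pixel.
        logits = [
            sum(w * channel[p] for w, channel in zip(weights, features))
            for p in range(num_pixels)
        ]
        masks.append([1.0 / (1.0 + math.exp(-z)) for z in logits])
    return masks


# Hypothetical usage: two channels, two pixels, one RoI that prefers channel 0.
masks = dynamic_masks([[1.0, 0.0], [0.0, 1.0]], [[10.0, -10.0]])
```

Under this assumption, the expensive decoder runs once per image whether there are 4 RoIs or 100, and only the cheap final projection scales with the RoI count. For scale, a 384x512 output has 196,608 pixels, so a decoder batched over 100 RoIs would synthesize 19,660,800 mask pixels through every decoder layer.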

Concurrent work explores generating dynamic weights through convolutional networks instead of Transformers. This approach has advantages as well as disadvantages compared to HyperDETR, and an in-depth comparative analysis is the subject of future work.

Before each image was fed into the network during training, we randomly resized, cropped, and resized again. Next, we randomly oriented, randomly rotated, and cropped the valid regions to simulate poorly-oriented captures. Finally, we introduced color jitter by randomly varying brightness, contrast, saturation, and hue.
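The augmentation steps above can be sketched roughly as follows. This is a simplified illustration, not the training code: the helper names, the jitter range, and the restriction to quarter-turn rotations are assumptions, and the resize and contrast steps are left out for brevity.

```python
import colorsys
import random


def jitter_pixel(rgb, rng, max_shift=0.1):
    """Randomly perturb hue, saturation, and brightness of one RGB pixel in [0, 1]."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    h = (h + rng.uniform(-max_shift, max_shift)) % 1.0
    s = min(1.0, max(0.0, s * (1.0 + rng.uniform(-max_shift, max_shift))))
    v = min(1.0, max(0.0, v * (1.0 + rng.uniform(-max_shift, max_shift))))
    return colorsys.hsv_to_rgb(h, s, v)


def rotate90(image, turns):
    """Rotate a 2D list of pixels clockwise by `turns` quarter turns."""
    for _ in range(turns % 4):
        image = [list(row) for row in zip(*image[::-1])]
    return image


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def augment(image, crop_h, crop_w, seed=None):
    """Crop, randomly orient, then color-jitter an image of RGB tuples."""
    rng = random.Random(seed)
    image = random_crop(image, crop_h, crop_w, rng)
    image = rotate90(image, rng.randrange(4))
    return [[jitter_pixel(px, rng) for px in row] for row in image]
```

In a segmentation setting, the crop and rotation would also have to be applied identically to the label masks, while the color jitter would apply to the image only.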

Finally, we leveraged an ANE compiler optimization that splits the computation of layers with large spatial dimensions into small spatial tiles, and makes a trade-off between latency and memory usage. Together, these techniques yielded an extreme reduction in the memory footprint of our model and consequently minimized its impact on battery life and workloads that run in parallel.
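The tiling trade-off can be illustrated with a toy example. This is not the ANE compiler's behavior, only a sketch of the general idea under simple assumptions: a per-pixel operation is applied one tile at a time, so the working set is bounded by the tile size instead of the full spatial extent. The function name `apply_in_tiles` is hypothetical.

```python
def apply_in_tiles(image, op, tile_h, tile_w):
    """Apply a per-pixel `op` to a 2D list one spatial tile at a time."""
    height, width = len(image), len(image[0])
    out = [[None] * width for _ in range(height)]
    for top in range(0, height, tile_h):
        for left in range(0, width, tile_w):
            # Only this tile's pixels need to be resident while it is processed.
            for y in range(top, min(top + tile_h, height)):
                for x in range(left, min(left + tile_w, width)):
                    out[y][x] = op(image[y][x])
    return out


# Hypothetical usage: a 3x3 image processed in 2x2 tiles (edge tiles are smaller).
doubled = apply_in_tiles([[1, 2, 3], [4, 5, 6], [7, 8, 9]], lambda v: v * 2, 2, 2)
```

For layers with a spatial footprint, such as a 3x3 convolution, each tile would also need a border of neighboring pixels, which is part of the latency cost that tiling trades for lower memory use.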

In this post, we introduced HyperDETR, a panoptic segmentation architecture that scales efficiently to large output resolutions and a large number of region proposals. Panoptic segmentation, powered by HyperDETR, provides pixel-level understanding for the Camera and enables a wide range of features in the Camera app, such as Portrait mode and Photographic Styles. We designed the model to ensure it is accurate enough to use in the Camera pipeline, but compact and efficient enough to execute on-device without impacting battery life.

Photos (on iOS, iPadOS, and macOS) is an integral way for people to browse, search, and relive life's moments with their friends and family. Photos uses a number of machine learning algorithms, running privately on-device, to help curate and organize images, Live Photos, and videos. An algorithm foundational to this goal recognizes people from their visual appearance.

Jeremy Bentham, an English philosopher and social theorist of the late 1700s, invented a social control mechanism that would become a comprehensive symbol for modern authority and discipline in the Western world: a prison system called the Panopticon.

The basic principle for the design, which Bentham first completed in 1785, was to monitor the maximum number of prisoners with the fewest possible guards and other security costs. The layout (which is depicted below) consists of a central tower for the guards, surrounded by a ring-shaped building of prison cells.

The building with the prisoners is only one cell thick, and every cell has one open side facing the central tower. This open side has bars over it, but is otherwise entirely exposed to the tower. The guards can thus see the entirety of any cell at any time, and the prisoners are always vulnerable and visible. Conversely, the tower is far enough from the cells, and its windows are small enough, that the prisoners cannot see the guards inside it.

The sociological effect is that the prisoners are aware of the presence of authority at all times, even though they never know exactly when they are being observed. The authority changes from being a limited physical entity to being an internalized omniscience: the prisoners discipline themselves simply because someone might be watching, eliminating the need for more physical power to accomplish the same task. Just a few guards are able to control a very large number of prisoners this way. Arguably, there wouldn't even need to be any guards in the tower at all.

In 1813, Parliament granted Bentham 23,000 pounds to build the first ever panopticon prison. This panopticon in New Delhi was completed in 1817 and is still functioning as a prison to this day (Wikipedia: Panopticon).

Michel Foucault, a French intellectual and critic, expanded the idea of the panopticon into a symbol of social control that extends into everyday life for all citizens, not just those in the prison system (Foucault 1970). He argues that citizens always internalize authority, which is one source of power for prevailing norms and institutions. A driver, for example, might stop at a red light even when there are no other cars or police present. Even though there are not necessarily any repercussions, the police are an internalized authority: people tend to obey laws because those rules become self-imposed.

This is a profound and complicated idea, mainly because the process entails a high degree of social intuition; the subject must be able to situate himself or herself amidst a network of collective expectations. The crucial point is that the subject's specific role within the network is incorporated as a part of the body and mind, which then manifests as self-discipline.

In the course of my project, I will argue that the mirror enables people to very realistically project their body images into a visible and objective space. This results in more comprehensive control over body images, and this control results in self-discipline according to a number of body norms and myths. People use the mirror to help shape their bodies in relation to the social definitions of beauty, hygiene, productivity, diet/consumption, and disgust, and inevitably situate their bodies within a multitude of different identities that interact with these constructions.

A mirror is a powerful tool that socialized people use to monitor their bodies in relation to a network of body images and signs (citizens are to the social network as prisoners are to the panopticon). Rather than legal rules like stopping at red lights, the rules implicated in self-monitoring with mirrors involve social etiquette and aesthetic norms.

This project is divided into two main sections. The first section will explore the history of mirror production and its implications for self-monitoring in the context of a social-psychological framework. The second section is dedicated to several specific forms of social rhetoric that inform our body images and the ways that we construct them.