Transformer Rate


Domenec Reynolds

Aug 3, 2024, 10:38:15 AM
to waychipamen

CRATE is a transformer-like architecture constructed from first principles; it enjoys a rich theoretical framework and achieves competitive performance across diverse training setups.

At the top of the page, we have linked a long-form manuscript explaining the CRATE architecture in full detail. Below, we summarize the numerous sub-projects that have developed the CRATE architecture. These consist of:

\[\max_{f}\;\mathbb{E}_{\boldsymbol{Z}=f(\boldsymbol{X})}\Big[R(\boldsymbol{Z}) - R^{c}(\boldsymbol{Z};\,\boldsymbol{U}_{[K]}) - \lambda\,\|\boldsymbol{Z}\|_0\Big],\]

where \(R\) and \(R^c\) are measures of compression of the final token representations \(\boldsymbol{Z} = f(\boldsymbol{X})\) w.r.t. different codebooks (here a family of subspace bases \(\boldsymbol{U}_{[K]}\)), and the \(\ell^0\) norm promotes the sparsity of \(\boldsymbol{Z}\). Overall, the sparse rate reduction objective promotes compact and sparse representations.
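
To make the two rate terms concrete, here is a rough NumPy sketch of how one could evaluate the objective for a d x n token matrix Z (columns are tokens) against a set of subspace bases U_k. The function names and normalization constants follow the general rate-reduction literature and are only illustrative; they may differ from the exact ones used in the CRATE papers.

import numpy as np

def coding_rate(Z, eps=0.5):
    # R(Z): Gaussian lossy coding rate of the d x n token matrix Z at distortion eps.
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * Z @ Z.T)
    return 0.5 * logdet

def coding_rate_against_subspaces(Z, U_list, eps=0.5):
    # R^c(Z; U_[K]): sum of coding rates of the projections of Z onto each codebook subspace U_k.
    return sum(coding_rate(U.T @ Z, eps) for U in U_list)

def sparse_rate_reduction(Z, U_list, lam=0.1, eps=0.5):
    # Objective to be maximized: R(Z) - R^c(Z; U_[K]) - lam * ||Z||_0.
    return (coding_rate(Z, eps)
            - coding_rate_against_subspaces(Z, U_list, eps)
            - lam * np.count_nonzero(Z))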

\[f = f^{L}\circ f^{L-1}\circ\cdots\circ f^{1}\circ f^{\mathrm{pre}},\]

where \(f^{\mathrm{pre}}\) is the pre-processing mapping, and \(f^{\ell}\) is the \(\ell^{\mathrm{th}}\)-layer forward mapping that transforms the token distribution to optimize the above sparse rate reduction objective incrementally.

After encoding input data \(\boldsymbol{X}\) as a sequence of tokens \(\boldsymbol{Z}^1\), CRATE constructs a deep network that transforms the data to a canonical configuration of low-dimensional subspaces by successive compression against a local model for the distribution, generating \(\boldsymbol{Z}^{\ell+1/2}\), and sparsification against a global dictionary, generating \(\boldsymbol{Z}^{\ell+1}\). Repeatedly stacking these blocks and training the model parameters via backpropagation yields a powerful and interpretable representation of the data.
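
For readers who think in code, the following is a minimal PyTorch-style sketch of one such block: a residual multi-head subspace self-attention step (compression), followed by a single ISTA-like proximal step against a learned dictionary (sparsification). The class name, shapes, and constants are illustrative placeholders rather than the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrateBlockSketch(nn.Module):
    # Illustrative sketch of one CRATE-style block (not the authors' released code):
    # a residual multi-head subspace self-attention step (compression), then a single
    # ISTA-like proximal step against a learned dictionary D (sparsification).
    def __init__(self, dim, heads=8, eta=0.1, lam=0.1):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.U = nn.Linear(dim, dim, bias=False)   # subspace projections, one chunk per head
        self.W = nn.Linear(dim, dim, bias=False)   # recombines the head outputs
        self.D = nn.Linear(dim, dim, bias=False)   # dictionary for the sparsification step
        self.eta, self.lam = eta, lam

    def forward(self, Z):                          # Z: (batch, tokens, dim)
        B, N, d = Z.shape
        h, dh = self.heads, d // self.heads
        # Compression: attention with tied projections (queries = keys = values = U_k Z).
        p = self.U(Z).view(B, N, h, dh).transpose(1, 2)            # (B, h, N, dh)
        attn = torch.softmax(p @ p.transpose(-2, -1) / dh ** 0.5, dim=-1)
        Z_half = Z + self.W((attn @ p).transpose(1, 2).reshape(B, N, d))
        # Sparsification: one proximal-gradient (ISTA-style) step toward a sparse,
        # non-negative code of Z_half with respect to the dictionary D.
        resid = Z_half - self.D(Z_half)
        return F.relu(Z_half + self.eta * (resid @ self.D.weight) - self.eta * self.lam)

For example, CrateBlockSketch(dim=384, heads=6)(torch.randn(2, 196, 384)) returns a tensor of the same shape. The key structural departures from a standard transformer block in this sketch are the tied query/key/value projections in the attention step and the ReLU/soft-thresholding step that takes the place of the usual MLP.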

We use the softmax cross-entropy loss to train on the supervised image classification task. We obtain performance competitive with the usual vision transformer (ViT) trained on classification, with similar scaling behavior, including above 80% top-1 accuracy on ImageNet-1K with 25% of the parameters of ViT.

An interesting phenomenon of CRATE is that even when trained on supervised classification, it learns to segment the input images, with such segmentations being easily recoverable via attention maps, as in the following pipeline (similar to DINO).
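
As a rough illustration of such a pipeline, one can take the last block's attention from the class token to the patch tokens, average it over heads, upsample it to the image resolution, and threshold it to get a coarse mask. This is hypothetical DINO-style visualization code, not necessarily the paper's exact recipe.

import torch
import torch.nn.functional as F

def attention_segmentation_sketch(attn_last, image_size, patch_size, threshold=0.6):
    # attn_last: final block's attention, shaped (heads, 1 + num_patches, 1 + num_patches),
    # with a class token assumed at index 0. Returns a coarse boolean foreground mask.
    cls_to_patches = attn_last[:, 0, 1:]                    # (heads, num_patches)
    grid = image_size // patch_size
    maps = cls_to_patches.reshape(-1, grid, grid)
    heat = maps.mean(dim=0, keepdim=True).unsqueeze(0)      # (1, 1, grid, grid)
    heat = F.interpolate(heat, size=(image_size, image_size),
                         mode="bilinear", align_corners=False)[0, 0]
    return heat > heat.quantile(threshold)                  # boolean mask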

Such segmentations were previously only seen in transformer-like architectures trained with a complex self-supervised mechanism, as in DINO, yet in CRATE segmentation emerges as a byproduct of supervised classification training. In particular, the model receives no a priori segmentation information at any time. Below, we show some example segmentations.

Another remarkable property is that attention heads in CRATE automatically carry semantic meaning, which suggests that CRATE may offer post-hoc interpretability for the classifications it makes. Below, we visualize the outputs of several attention heads across images of different animals, showing that individual attention heads correspond to different parts of the animal, and that this correspondence is consistent across different animals and even different animal classes.

To derive the decoder architecture, we propose a novel framework of structured denoising-diffusion, analogous to the ordinary denoising-diffusion framework widely used for generative modeling of image data. Our framework relies on a quantitative connection between the compression operator and the score function (as used in denoising-diffusion models), as shown below:

The encoder and decoder are derived through discretizations of the structured denoising and diffusion processes, respectively. Importantly, the encoder derived from structured denoising is exactly the previously described CRATE architecture. The full encoder and decoder layers are given below.

In this paper, we contend that the objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a mixture of low-dimensional Gaussian distributions supported on incoherent subspaces. The quality of the final representation can be measured by a unified objective function called sparse rate reduction. From this perspective, popular deep networks such as transformers can be naturally viewed as realizing iterative schemes to optimize this objective incrementally. In particular, we show that the standard transformer block can be derived from alternating optimization on complementary parts of this objective: the multi-head self-attention operator can be viewed as a gradient descent step that compresses the token sets by minimizing their lossy coding rate, and the subsequent multi-layer perceptron can be viewed as attempting to sparsify the representation of the tokens. This leads to a family of white-box transformer-like deep network architectures which are mathematically fully interpretable. Despite their simplicity, experiments show that these networks indeed learn to optimize the designed objective: they compress and sparsify representations of large-scale real-world vision datasets such as ImageNet, and achieve performance very close to thoroughly engineered transformers such as ViT. Code is available at https://github.com/Ma-Lab-Berkeley/CRATE.

No, that won't work. The Sampler requires the sampling rate to be a single value for all features. If you use attributes, they could potentially have different values, which can lead to all kinds of issues.

There is the option to use a User Parameter for the sampling rate, but if you want to use that you'd have to cut the process into two parts: one workspace to perform steps 1-3 from your list, which then calls a second workspace via a WorkspaceRunner to do step 4, using a User Parameter as input. The downside is that you're reading all of your data twice.

In this brochure, a uniform way of collecting, compiling, and presenting transformer failure data is proposed. The results and analysis of an international transformer failure survey are presented in terms of the investigated population, the calculated failure rates, and the classification of failures by location, cause, mode, and effect.

As we can see, the learning rate decreases as the embedding dimension dim_embed increases. As expected, the learning rate peaks when step_num reaches warmup_steps, and the larger warmup_steps is, the lower the peak learning rate.
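
This behaviour follows directly from the warmup schedule of the original Transformer paper ("Attention Is All You Need"). A minimal sketch, using the same dim_embed and warmup_steps names as above:

def transformer_lr(step_num, dim_embed, warmup_steps):
    # lr = dim_embed^-0.5 * min(step_num^-0.5, step_num * warmup_steps^-1.5)
    # Linear warmup up to warmup_steps, then inverse-square-root decay.
    step_num = max(step_num, 1)  # guard against step 0
    return dim_embed ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

At the peak (step_num == warmup_steps) this reduces to dim_embed**-0.5 * warmup_steps**-0.5, which is why both a larger dim_embed and a larger warmup_steps lower the peak; for example, dim_embed = 512 and warmup_steps = 4000 give a peak learning rate of about 7e-4.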

After a transformer is installed and makes it through the burn-in period, some researchers put the expected life of power transformers in the range of 25 to 40 years. Electrical Technology describes the normal life expectancy of transformers as about 20 to 30 years and acknowledges that a transformer might last for more than 50 years. Menzies puts the life expectancy of industrial transformers at 20 to 25 years, but notes that under ideal conditions, transformers can be expected to operate for 30 to 40 years.

As with all components, however, there is some probability that a power transformer will fail before its expected life, just as there is some probability that it will last well beyond it. Understanding these failure rates lets us estimate the probability of one transformer failing and, in the case of two transformers, of both failing simultaneously.

A failure rate of 0.05 failures per year per transformer is not a guarantee that a transformer will fail during the year it is at that failure rate. In fact, there is less than a 5% probability that it will fail that year. Better odds than Russian roulette. But who wants to play Russian roulette with their transformers?
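Concretely, if we model the transformer with a constant (exponential) failure rate of \(\lambda = 0.05\) per year, the probability of at least one failure within a year is \(1 - e^{-\lambda t} = 1 - e^{-0.05 \times 1} \approx 0.049\), i.e. just under 5%.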

In the same work, Tenbohlen and co-authors report data giving the overall failure rate for power transformers at 0.0047 failures/year per transformer, but they include data for failures that occur well beyond the constant-rate period (the normal life of a transformer).

While the literature is full of examples of transformer failures caused by collisions with utility poles and squirrels chewing through cables, a common cause of transformer failure is failure of the dielectric fluid. Mineral oils are the most common dielectric fluid, providing both insulation and cooling within the transformer. However, they can absorb water over time, and that moisture degrades the cellulose paper inside the transformer, eventually leading to failure.

An analysis of this data shows that the share of failures preventable by oil servicing is about 60%. There are other causes, of course, and it is the totality of all these causes that leads to the overall failure rate shown in Figure 1.

When a facility uses two separate feeds, each with its own transformer, it does so because the consequences of a power outage are severe. To avoid common-cause failures (a tractor-trailer crash, for instance), the feeds are placed at separate locations, ideally fed from separate substations. Even when all of these measures are in place, there is still some probability of both transformers being in a failed state at the same time. A 1oo2 architecture is more reliable than a 1oo1 architecture, but it is not perfect.

As transformers age past their normal 25-year life, however, their failure rate increases, along with the PFDavg (average probability of failure on demand). The PFDavg of two transformers in a 1oo2 architecture also increases. At 50 years, when the failure rate is about 0.025 failures/year per transformer, the PFDavg of a 1oo2 architecture for that one year is 25 times higher than it was over the entire normal life of the transformers.
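One way to see where a factor of 25 can come from: using the common 1oo2 approximation \(\mathrm{PFD}_{avg} \approx (\lambda T)^2/3\) over a one-year interval \(T\), a normal-life rate of roughly 0.005 failures/year gives about \(8 \times 10^{-6}\), while 0.025 failures/year gives about \(2 \times 10^{-4}\), i.e. \((0.025/0.005)^2 = 25\) times higher. This is only an illustrative calculation under those assumptions.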

Transformers are so reliable that we sometimes forget that they can fail. They often go their entire useful life without any service. When we install a redundant power transformer, it is more often out of an abundance of caution than on the basis of risk calculations. But power transformers have a finite life expectancy, and many power transformers are reaching or are at the end of their normal life.
