I've started self-studying tensor calculus. My sources are the video lecture series on the YouTube channel "MathTheBeautiful" and the free textbook/notes "Introduction to Tensor Calculus" by Kees Dullemond & Kasper Peeters. Other textbooks go much deeper into advanced math topics. I have been through the first 3 chapters and watched the first 5 videos, but I don't seem to understand the content. I don't know what I should take from these lectures and notes, or which parts of the work to focus on, in order to start practicing as soon as possible.
I want to learn tensor calculus in order to study more advanced mathematics and physics such as general relativity, differential geometry, continuum mechanics, etc. I've also seen many other textbooks on continuum mechanics and tensor analysis for mathematicians/physicists. All of these sources seem quite different and seem to require much more advanced mathematics to understand. How should I approach tensor calculus: through a physics perspective or a mathematics one?
From what I've seen, tensor calculus seems very abstract and more towards the proving side of the spectrum (like a pure mathematics subject). It doesn't look "practicable", as opposed to other calculus courses where I could go to any chapter in the textbook and find many problems to practice and become familiar with the concept.
If you haven't taken an advanced linear algebra class, dealing not just with matrices and row reduction, but with vectors, bases, and linear maps, do that. The key to understanding tensor calculus at a deep level begins with understanding linear and multilinear functions between vector spaces. Once linear maps, multilinear maps, tensor products of spaces, etc., are clear to you, come back to this answer.
Have you studied linear algebra now? Good. The intuition behind tensor calculus is that we can construct tensor fields that vary smoothly from point to point. At every point of a manifold (or Euclidean space, if you prefer) we can conceptualize the vector space of velocities through that point. Once we have a vector space, we have its dual, and from the space and its dual we construct all sorts of tensor spaces. A tensor field is just a choice of one such tensor at every point, varying in a differentiable fashion across the manifold.
Let's look at a concrete example. You're an EE student, so hopefully you'll forgive me if I use a concept from mechanical engineering. Consider a solid body with internal stresses. Fix a point. Given a vector $v$ at that point, the stress tensor $\sigma$ produces the stress vector acting on the plane perpendicular to $v$ through that point. It is a tensor because it does so in a linear fashion, at each point mapping a vector to another vector.
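In components (with respect to an orthonormal basis), this linearity is just the standard Cauchy stress relation: if $n$ is the unit normal to the plane, the traction vector $t$ is

$$t_i = \sigma_{ij}\, n_j,$$

i.e. the tensor acts on the vector exactly like a matrix acting on a column vector, and it does so at every point of the body.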
If you're interested in general relativity and differential geometry, consider also picking up some differential geometry textbooks. I recommend Semi-Riemannian Geometry, with Applications to Relativity by Barrett O'Neill. (As a plus, if by then your linear algebra is rusty, the first chapter is devoted to the basics of multilinear algebra and tensor mechanics.) You might start by working through his undergraduate curves & surfaces book, Elementary Differential Geometry.
First of all, study multivariable differential calculus from Rudin's PMA. Then learn smooth manifolds through Sinha's book and Lee's book. Only then will O'Neill's Semi-Riemannian Geometry be intelligible. That book will teach you real SR and GR.
To learn more about the theoretical speedups achievable by using matrix cores compared to SIMD Vector Units, please refer to the tables below. The tables list the performance of the Vector (i.e. Fused Multiply-Add or FMA) and Matrix core units of the previous generation (MI100) and current generation (MI250X) of CDNA Accelerators.
AMD Matrix Cores can be leveraged in several ways. At a high level, one can use libraries such as rocBLAS or rocWMMA to do matrix operations on the GPU. For instance, rocBLAS may choose to use MFMA instructions if this is advantageous for the computations at hand. For a closer-to-the-metal approach, one can choose to either

- sprinkle HIP kernels with inline assembly (not recommended, since the compiler does not inspect the semantics of the inlined instructions and may not take care of data hazards such as the mandatory number of cycles before the results of an MFMA instruction can be used), or
- use the compiler intrinsics for the MFMA instructions directly in HIP kernels.
The coding examples in this post use some of the available compiler intrinsics for the MFMA instructions and show how to map entries of the input and output matrices to lanes of vector registers of the wavefront. All the examples use a single wavefront to compute a small matrix multiplication. The examples are not intended to show how to achieve high performance out of the MFMA operations.
LLVM provides builtin intrinsic functions to perform the MFMA operations on AMD GPUs. Recall that these intrinsic functions are executed on a wavefront-wide basis, and pieces of the input and output matrices are loaded into registers of each lane in the wavefront. The syntax of an MFMA compiler intrinsic is shown below.
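The general shape of these builtins follows a common pattern (the placeholder names `CDfmt`, `ABfmt`, and `MxNxK` below stand for the output/input data formats and the matrix dimensions; concrete instances such as `__builtin_amdgcn_mfma_f32_16x16x4f32` appear later in this post):

```
d = __builtin_amdgcn_mfma_CDfmt_MxNxKABfmt(a, b, c, cbsz, abid, blgp)
```

where `a`, `b`, and `c` are the per-lane register contents holding pieces of the A, B, and C matrices, `d` receives the corresponding piece of the result, and `cbsz`, `abid`, and `blgp` are the modifiers described next.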
cbsz, the Control Broadcast Size modifier, is used to change which input values are fed to the matrix cores and is supported only by instructions that have multiple input blocks for the A matrix. Setting cbsz informs the instruction to broadcast values of one chosen input block to 2^cbsz other neighboring blocks in A. The input block picked for broadcasting is determined by the abid parameter. The default value of 0 results in no broadcast of values. As an example, for a 16-block A matrix, setting cbsz=1 would result in blocks 0 and 1 receiving the same input values, blocks 2 and 3 receiving the same input values, blocks 4 and 5 receiving the same input values, etc.
abid, the A-matrix Broadcast Identifier, is supported by instructions that have multiple input blocks for the A matrix. It is used with cbsz and indicates which input block is selected for broadcast to other neighboring blocks in the A matrix. As an example, for a 16-block A matrix, setting cbsz=2 and abid=1 will result in the values from block 1 to be broadcast to blocks 0-3, values from block 5 to be broadcast to blocks 4-7, values from block 9 to be broadcast to blocks 8-11, etc.
blgp, the B-matrix Lane Group Pattern modifier, allows a constrained set of swizzling operations on B matrix data between lanes. For the instructions that support this modifier, the following values are supported:
Consider the matrix multiplication operation D = A B where M=N=16 and K=4 and the elements are of type FP32. Assume that the input C matrix contains zeroes for simplicity's sake. We will demonstrate the use of the intrinsic function __builtin_amdgcn_mfma_f32_16x16x4f32, which computes the sum of four outer products in one invocation. This function operates on a single block of matrices.
The input matrices, A and B, have dimensions 16×4 and 4×16 respectively, and the matrices C and D have 16×16 elements. It is convenient to map a 16×4 thread block to the elements of both input matrices. Here, the thread block has one wavefront with 16 threads in the x dimension and 4 threads in the y dimension. We store the matrices in row-major format, so entry (i, j) of a matrix with ncols columns lives at offset j + i*ncols, where i is the row and j is the column. Using this representation, a thread at position x, y in the block loads entries A[x][y] and B[y][x]. The output matrix has 16×16 elements, so each thread has 4 elements to store, as illustrated in the figure and code snippet below.
Consider the case of multiplying matrices of dimensions M=N=16 and K=1 using the compiler intrinsic __builtin_amdgcn_mfma_f32_16x16x1f32. In this case, the input values could be held by just 16 lanes of the wavefront. In fact, this instruction can simultaneously multiply 4 such matrices, with each lane holding values from one of those 4 matrices.
The two figures below show (1) the sizes and shapes of the four components of the input arguments A and B, and (2) how those components map to lanes in the registers owned by the wavefront. The instruction takes A, B, and C as arguments and returns D, so each argument and the output holds 4 matrices.
Gina Sitaraman is a Senior Member of Technical Staff (SMTS) Software System Design Engineer in the Data Center GPU Software Solutions group. She obtained her PhD in Computer Science from the University of Texas at Dallas. She has over a decade of experience in the seismic data processing field developing and optimizing pre-processing, migration and post processing applications using hybrid MPI + OpenMP on CPU clusters and using CUDA or OpenCL on GPUs. She spends her time at AMD solving optimization challenges in scientific applications running on large-scale HPC clusters.
Noel Chalmers is a Senior Member of Technical Staff (SMTS) in the Data Center GPU Software Solutions Group at AMD. Noel is the lead developer of the rocHPL benchmark, AMD's optimized implementation of the well-known LINPACK benchmark which is responsible for achieving over 1 Exaflop of performance on the Frontier supercomputer at ORNL. Noel obtained a PhD in Applied Mathematics from the University of Waterloo, where he studied convergence and stability properties of the discontinuous Galerkin finite element method on hyperbolic systems. Noel's research interests include GPU acceleration of high-order continuous and discontinuous finite element methods, and large-scale geometric and algebraic multigrid methods.
Nicholas Malaya is an AMD Fellow with an emphasis in software development, algorithms, and high performance computing. He is AMD's technical lead for exascale application performance, and is focused on ensuring workloads run efficiently on the world's largest supercomputers. Nick's research interests include HPC, computational fluid dynamics, Bayesian inference, and ML/AI. He received his PhD from the University of Texas. Before that, he double majored in Physics and Mathematics at Georgetown University where he received the Treado medal. In his copious spare time he enjoys motorcycles, long distance running, wine, and spending time with his wife and children.