I'm experiencing performance issues with the "Get items" action to retrieve items from my SharePoint list. Currently, the execution of this action can take more than 30 seconds, which is too long for my use case.
And what specific previous issues do you mean? If you mean applying multiple filters, that is because SharePoint is limited in which filters it can use on different columns when it tries to apply those filters to more than 5,000 rows (the list view threshold).
Thanks for your reply! It works correctly and quickly (less than 5s), I've adapted your flow to my use case and simplified the process. However, it crashes when the filter query is too long, which doesn't happen with Get items. Do you have any idea how to fix this?
I have done some work optimizing a flow using the SharePoint HTTP action to return items to a Power App and I have gotten it to return up to 20,000 items to the app in under 10 seconds. And it will work for any combination of filtering and/or searching the list.
This method of fetching batches of up to 5,000 items at a time in a concurrent Apply to each should also get around the issue you mentioned previously, where you couldn't apply multiple filters at once. It allows more filters to be applied because the initial part of the query, such as a range filter on the ID column, restricts the initial outputs to fewer than the 5,000-item list view threshold.
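As a sketch of the batching idea, the per-batch $filter strings could be generated like this (hypothetical helper in JavaScript; the Status column and its value are made-up examples):

```javascript
// Hypothetical helper: split one query into ID-range batches of at most
// 5,000 items, so each request stays under the list view threshold.
function buildBatchFilters(maxId, batchSize, extraFilter) {
  const filters = [];
  for (let start = 1; start <= maxId; start += batchSize) {
    const end = start + batchSize - 1;
    let filter = `ID ge ${start} and ID le ${end}`;
    if (extraFilter) filter += ` and (${extraFilter})`;
    filters.push(filter);
  }
  return filters;
}

// Each string becomes the $filter of one "Send an HTTP request to
// SharePoint" call inside a concurrent Apply to each:
const batches = buildBatchFilters(12000, 5000, "Status eq 'Open'");
console.log(batches.length); // 3 batches
```

Because each batch is restricted by ID first, the additional filters operate on fewer than 5,000 rows per request.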
Hello, I would like to revive this topic. Do you have any ideas for improving the performance of "Get items" in my context, or for solving my bug when using "Send an HTTP request to SharePoint"? Many thanks in advance!
I am currently working on a space scene built with react-three-fiber and react-three-drei. I use the Stars object multiple times to add more dynamics and different star sizes to the night sky. Furthermore, I have some basic geometries, and I am loading three different models into the scene. The models are not super-high-poly objects; however, I am running into serious performance issues and would like to ask some questions:
About the Stars component:
I already deleted the additional Stars to get better performance. However, I miss the dynamics and depth of the stars component. The idea was to use multiple Stars components in order to show a small number of stars with a bigger size and a large number of stars with a smaller size. When I use just one Stars component, all stars have the same size and the same saturation.
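One way to plan such layers while keeping the total star count bounded is a small helper like the following (hypothetical JavaScript; the prop names mirror drei's Stars component, i.e. count, factor, depth):

```javascript
// Hypothetical helper: split a fixed star budget into layers, so a few
// large stars sit in front of many small ones (restoring depth at a
// controlled cost).
function planStarLayers(totalStars, numLayers) {
  // Geometric split: each deeper layer has twice the stars at half the size.
  const weights = Array.from({ length: numLayers }, (_, i) => 2 ** i);
  const total = weights.reduce((a, b) => a + b, 0);
  return weights.map((w, i) => ({
    count: Math.round((totalStars * w) / total),
    factor: 4 / 2 ** i,     // star size factor: biggest in the nearest layer
    depth: 50 + i * 25,     // push each successive layer further out
  }));
}

// Each entry would back one <Stars count={...} factor={...} depth={...} />:
const layers = planStarLayers(5000, 3);
console.log(layers.map(l => `${l.count} stars @ factor ${l.factor}`));
```

This keeps the multi-layer look while the overall vertex count stays fixed at the budget you choose.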
I just noticed that the file size of one of my 3D models is around 150 MB, and therefore I got an error when I tried to upload my project to GitHub. Can anyone recommend a tool to decrease the scene.bin file size?
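For reference, two commonly used CLI approaches for shrinking glTF/GLB payloads, plus Git LFS for GitHub's 100 MB per-file limit (a sketch; check each tool's docs for current flags):

```shell
# Draco-compress the mesh data with gltf-pipeline:
npx gltf-pipeline -i scene.gltf -o scene-draco.glb -d

# Or repack with meshoptimizer's gltfpack (also quantizes attributes):
gltfpack -i scene.gltf -o scene-packed.glb -cc

# If a file must stay large, Git LFS works around GitHub's 100 MB limit:
git lfs track "*.bin"
```

Draco-compressed models need a matching decoder on the loading side (e.g. drei's useGLTF handles this).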
Performance Optimization is the process of refining the efficiency of systems and applications to improve their output, speed, or scalability. It plays a key role in ensuring that the available resources are put to the best use. For data science professionals, it notably aids in streamlining complex data workflows, improving data processing, and accelerating data analytics.
Performance Optimization utilizes various techniques, such as code optimization, system tuning, and load balancing, among others. It primarily focuses on improving the computational efficiency of a system by reducing resource usage and minimizing latencies. For data-intensive applications, a well-optimized system leads to quicker data processing and more accurate analytics.
Despite its advantages, Performance Optimization can also present several challenges. These include the complexity of optimizing code, potential trade-offs between performance and functionality, and the time and expertise required to optimally tune a system. It's essential to conduct careful analysis and planning to mitigate these challenges.
A Data Lakehouse combines the best elements of data warehouses and data lakes into a unified, easy-to-manage platform. Performance Optimization plays a crucial role in such an environment, helping to ensure efficient, rapid data processing and analytics, regardless of the volume or variety of data. By reducing latency and improving compute efficiency, it can greatly enhance a Data Lakehouse setup.
While not directly related to security, Performance Optimization can indirectly support security measures by reducing system vulnerabilities, improving resilience, and ensuring efficient auditing and logging processes. It's vital to note that optimization efforts should be aligned with security best practices.
Performance Optimization significantly impacts system performance. By refining system efficiency, it allows for faster data processing, improves user experience, and ensures the scalability of systems as data volumes grow.
Why is Performance Optimization important in a Data Lakehouse environment? Performance Optimization is essential in a Data Lakehouse setup to ensure efficient, high-speed data processing and analytics, regardless of data volume or variety.
What are some challenges of Performance Optimization? Challenges include the complexity of code optimization, potential trade-offs between performance and functionality, and the time and specialist know-how required to tune a system optimally.
Internally, React uses several clever techniques to minimize the number of costly DOM operations required to update the UI. For many applications, using React will lead to a fast user interface without doing much work to specifically optimize for performance. Nevertheless, there are several ways you can speed up your React application.
By default, React includes many helpful warnings. These warnings are very useful in development. However, they make React larger and slower so you should make sure to use the production version when you deploy the app.
In most cases, instead of writing shouldComponentUpdate() by hand, you can inherit from React.PureComponent. It is equivalent to implementing shouldComponentUpdate() with a shallow comparison of current and previous props and state.
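Conceptually, that shallow comparison works like the following sketch (a simplified stand-in, not React's actual implementation):

```javascript
// Simplified stand-in for the shallow comparison React.PureComponent
// performs in its shouldComponentUpdate (not React's actual source).
function shallowEqual(objA, objB) {
  if (Object.is(objA, objB)) return true;
  if (typeof objA !== 'object' || objA === null ||
      typeof objB !== 'object' || objB === null) return false;
  const keysA = Object.keys(objA);
  if (keysA.length !== Object.keys(objB).length) return false;
  // Every own property must be reference-equal; changes INSIDE a nested
  // object or array are invisible to this check.
  return keysA.every(key =>
    Object.prototype.hasOwnProperty.call(objB, key) &&
    Object.is(objA[key], objB[key]));
}

// PureComponent re-renders only if props or state fail this check:
// shouldComponentUpdate ≈ !shallowEqual(nextProps, this.props)
//                       || !shallowEqual(nextState, this.state)
```

Note that the check is one level deep only, which is exactly why mutation causes the problem described next.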
The problem is that PureComponent will do a shallow comparison between the old and new values of this.props.words. Since this code mutates the words array in the handleClick method of WordAdder, the old and new values of this.props.words will compare as equal, even though the actual words in the array have changed. The ListOfWords will thus not update even though it has new words that should be rendered.
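The effect can be demonstrated with plain object references (a minimal stand-in for the WordAdder scenario; the array contents are made up):

```javascript
// Mutating the array keeps the same reference, so a shallow props
// comparison sees "no change".
const words = ['foo'];
const prevProps = { words };

words.push('bar');                 // mutation: same array object
const mutatedProps = { words };
console.log(prevProps.words === mutatedProps.words); // true → re-render skipped

// The fix: produce a NEW array instead of mutating the old one.
const updatedProps = { words: [...prevProps.words, 'baz'] };
console.log(prevProps.words === updatedProps.words); // false → re-render happens
```

Spreading into a new array (or using concat) makes the reference change whenever the contents change, which is what the shallow comparison relies on.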
When you deal with deeply nested objects, updating them in an immutable way can feel convoluted. If you run into this problem, check out Immer or immutability-helper. These libraries let you write highly readable code without losing the benefits of immutability.
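To see why such libraries help, here is what a hand-written immutable update of a deeply nested object looks like without them (the state shape is a made-up example):

```javascript
// A deeply nested state object, updated immutably by hand: every object
// on the path to the changed field must be copied.
const state = { user: { profile: { name: 'Ada', theme: 'dark' }, posts: [] } };

const next = {
  ...state,
  user: {
    ...state.user,
    profile: { ...state.user.profile, theme: 'light' },
  },
};

console.log(next.user.profile.theme);              // 'light'
console.log(state.user.profile.theme);             // 'dark' – original untouched
console.log(next.user.posts === state.user.posts); // true – untouched parts are shared
```

Every extra nesting level adds another spread along the path; Immer lets you write the update as a plain mutation of a draft instead, while producing the same structurally shared result.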
In the CUDA programming model, computation is ordered in a three-level hierarchy. Each invocation of a CUDA kernel creates a new grid, which consists of multiple blocks. Each block consists of up to 1024 individual threads. These constants can be looked up in the CUDA Programming Guide. Threads that are in the same block have access to the same shared memory region (SMEM).
The number of threads in a block can be configured using a variable normally called blockDim, which is a vector consisting of three ints. The entries of that vector specify the sizes of blockDim.x, blockDim.y and blockDim.z, as visualized below:
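The resulting flat thread index inside a block follows the usual x-then-y-then-z ordering; a small JavaScript model of that arithmetic (illustrative only, this is not CUDA code):

```javascript
// Illustrative model of CUDA's flattened thread index within one block:
// threadId = threadIdx.x + threadIdx.y * blockDim.x
//                        + threadIdx.z * blockDim.x * blockDim.y
function flatThreadIndex(threadIdx, blockDim) {
  return threadIdx.x +
         threadIdx.y * blockDim.x +
         threadIdx.z * blockDim.x * blockDim.y;
}

// A full 1024-thread block, e.g. blockDim = (32, 8, 4):
const blockDim = { x: 32, y: 8, z: 4 };
console.log(blockDim.x * blockDim.y * blockDim.z);             // 1024 threads
console.log(flatThreadIndex({ x: 31, y: 7, z: 3 }, blockDim)); // 1023, the last thread
```

The x dimension varies fastest, which is why consecutive threadIdx.x values map to consecutive lanes of a warp.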
Here are the relevant hardware stats for my GPU, obtained from the cudaGetDeviceProperties API (Multiprocessors are the SMs we talked about earlier). The amount of shared memory is configurable by using a feature called SharedMemoryCarveout. The so-called unified data cache is partitioned into L1 cache and shared memory, so we can trade off less shared memory for more L1 cache.
So this kernel is limited by the number of threads per block, and the number of registers per thread. We cannot load more than one block per SM, giving us a final occupancy of 32 active warps / 48 max active warps = 66%.
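The occupancy figure can be checked with a few lines of arithmetic (a JavaScript sketch; the per-SM limits are the ones stated in the text):

```javascript
// Occupancy = active warps per SM / max warps per SM.
// Numbers as stated in the text: one resident block of 1024 threads,
// 32 threads per warp, and a hardware limit of 48 warps per SM.
const threadsPerBlock = 1024;
const warpSize = 32;
const maxWarpsPerSM = 48;
const blocksPerSM = 1; // limited by threads and registers, as described above

const activeWarps = (threadsPerBlock / warpSize) * blocksPerSM;
const occupancy = activeWarps / maxWarpsPerSM;
console.log(`${activeWarps} / ${maxWarpsPerSM} = ${(occupancy * 100).toFixed(1)}%`);
// → "32 / 48 = 66.7%"
```

A second resident block would need another 1024 threads and their registers, which exceeds the per-SM limits, hence blocksPerSM = 1.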
Warp was stalled waiting for the MIO (memory input/output) instruction queue to be not full. This stall reason is high in cases of extreme utilization of the MIO pipelines, which include special math instructions, dynamic branches, and shared memory instructions.
The compiler unrolls both loops (it can do so since the loop count is known at compile time) and then eliminates the repeated SMEM loads of the Bs entries, so we end up with the same amount of SMEM accesses as our optimized CUDA code.
Now that the SMEM cache is populated, we have each thread multiply its relevant SMEM entries and accumulate the result into local registers. Below I illustrated the (unchanged) outer loop along the input matrices, and the three inner loops for the dot product and the TN and TM dimensions:
In the inner loop, we can reduce the number of SMEM accesses by making dotIdx the outer loop, and explicitly loading the values we need for the two inner loops into registers. Below is a drawing of the dotIdx loop across time, to visualize which SMEM entries get loaded into thread-local registers at each step. (I had to reduce some dimensions to make it easier to draw; in the kernel, BK=TM=TN=8.)
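The loop structure can be modeled in plain JavaScript to show the register-caching idea (a stand-in for the CUDA kernel, with small fake SMEM tiles; BK=TM=TN=8 as in the post):

```javascript
// Plain-JS stand-in for the kernel's inner loops: each thread owns a
// TM x TN result tile and accumulates over the BK (dotIdx) dimension.
const BK = 8, TM = 8, TN = 8;

// Fake shared-memory tiles (As: TM x BK, Bs: BK x TN) with test data.
const As = Array.from({ length: TM * BK }, (_, i) => (i % 7) + 1);
const Bs = Array.from({ length: BK * TN }, (_, i) => (i % 5) + 1);
const threadResults = new Float64Array(TM * TN);

// dotIdx is the OUTER loop: the values needed by the two inner loops are
// loaded into "registers" once per dotIdx and reused TM*TN times.
const regM = new Float64Array(TM);
const regN = new Float64Array(TN);
for (let dotIdx = 0; dotIdx < BK; ++dotIdx) {
  for (let i = 0; i < TM; ++i) regM[i] = As[i * BK + dotIdx];
  for (let i = 0; i < TN; ++i) regN[i] = Bs[dotIdx * TN + i];
  for (let m = 0; m < TM; ++m)
    for (let n = 0; n < TN; ++n)
      threadResults[m * TN + n] += regM[m] * regN[n];
}
```

Per dotIdx step there are TM + TN loads instead of 2 * TM * TN, which is the whole point of the reordering.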
The first optimization that I already hinted at earlier is to transpose As. This will allow us to load from As using vectorized SMEM loads (LDS.128 in SASS). Below is the same visualization of the three inner loops as for kernel 5, but now with As transposed in memory:
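The transpose itself is just an index swap; a tiny JavaScript sketch of the layout change (toy dimensions, not the kernel's real BM/BK):

```javascript
// As is stored row-major as BM x BK; transposing it to BK x BM makes the
// consecutive entries each thread needs contiguous in memory.
const BM = 4, BK = 3; // tiny sizes for illustration
const As = Array.from({ length: BM * BK }, (_, i) => i);
const AsT = new Array(BK * BM);
for (let row = 0; row < BM; ++row)
  for (let col = 0; col < BK; ++col)
    AsT[col * BM + row] = As[row * BK + col];

// Before: entries of one column of As are strided (stride BK).
// After: the same entries sit next to each other, enabling vectorized loads.
console.log(AsT.slice(0, BM)); // first column of As, now contiguous: [0, 3, 6, 9]
```

With the column contiguous, four consecutive floats can be fetched in one 128-bit load instead of four strided ones.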
I tried my best to visualize all three levels of tiling below, although the structure is getting quite complex. The CUTLASS docs about efficient GEMMs go even more in-depth into warptiling, and their visualizations are illuminating. Each warp will compute a chunk of size (WSUBN * WNITER) x (WSUBM * WMITER). Each thread computes WNITER * WMITER many chunks of size TM*TN.
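The warptiling bookkeeping can be sanity-checked numerically (a JavaScript sketch with illustrative values, not necessarily the post's final configuration):

```javascript
// Sanity-check the warptiling bookkeeping (illustrative numbers).
const TM = 8, TN = 4;          // per-thread chunk size
const WMITER = 1, WNITER = 2;  // chunk iterations per thread
const WSUBM = 32, WSUBN = 32;  // warp sub-tile dimensions

// Within one WSUBM x WSUBN sub-tile, the warp's threads each cover TM x TN:
const threadsPerSubtile = (WSUBM * WSUBN) / (TM * TN);
console.log(threadsPerSubtile); // 32 – exactly one warp

// The full warp tile is (WSUBM * WMITER) x (WSUBN * WNITER):
const warpTileM = WSUBM * WMITER;
const warpTileN = WSUBN * WNITER;
console.log(`warp tile: ${warpTileM} x ${warpTileN}`); // "warp tile: 32 x 64"
```

Any valid configuration must satisfy WSUBM * WSUBN = 32 * TM * TN, so that the 32 threads of a warp exactly cover one sub-tile per iteration.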
Writing this post was a similar experience to my previous post on optimizing SGEMM on CPU: optimizing SGEMM iteratively is one of the best ways to deeply understand the performance characteristics of the hardware. When writing the CUDA programs, I was surprised by how easy it was to implement the code once I had made a good visualization of how I wanted the kernel to work.