We started this blog post series to describe how to write Mojo code for the Mandelbrot set that achieves over a 35,000x speedup over Python. To recap the optimizations so far: in part 1 we ported the code to Mojo for around a 90x speedup, and in part 2 we vectorized and parallelized the code to reach 26,000x. This blog post continues the journey with another performance technique that takes us well beyond our promised 35,000x goal.
Recall that our parallelism strategy in the previous blog post was to give each CPU core an equal number of rows to process. Pictorially, the idea is that we can evaluate each set of rows independently and achieve a speedup proportional to the number of cores.
To demonstrate the imbalance, we plot the total number of iterations performed in each row of the Mandelbrot image. As shown in the figure below, some rows require fewer than 1,000 iterations before their pixels escape, whereas others require over 800,000.
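To make the imbalance concrete, here is a minimal Python sketch (not the blog's Mojo code; the grid size and complex-plane bounds are illustrative) that tallies the per-row iteration total:

```python
# Scalar escape-time iteration count for one point c (illustrative).
def mandelbrot_iters(c: complex, max_iters: int = 200) -> int:
    z = 0j
    for i in range(max_iters):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iters

# Total iterations performed by every pixel in each row of the image.
def row_workload(height: int = 64, width: int = 64, max_iters: int = 200):
    xmin, xmax, ymin, ymax = -2.0, 0.6, -1.5, 1.5
    totals = []
    for r in range(height):
        y = ymin + (ymax - ymin) * r / (height - 1)
        totals.append(sum(
            mandelbrot_iters(complex(xmin + (xmax - xmin) * k / (width - 1), y),
                             max_iters)
            for k in range(width)))
    return totals
```

Plotting `row_workload()` shows the middle rows, which cut through the set, doing orders of magnitude more work than the edge rows.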
If you partition so that each thread gets a contiguous block of rows, all threads end up waiting for the core assigned the middle block to finish. There are multiple ways to alleviate this, but the easiest is to over-partition. Instead of dividing the rows equally among the cores, we maintain a work pool and create a work item for each row; threads then pick items from the work pool in round-robin fashion.
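The over-partitioning scheme can be sketched in Python (the posts use Mojo's parallelization primitives; the helper below is a hypothetical stand-in) with a shared pool that hands out one row at a time:

```python
from concurrent.futures import ThreadPoolExecutor

# One work item per row: instead of pre-assigning each core a fixed block,
# workers pull the next unprocessed row from a shared pool, so a core that
# finishes its cheap rows immediately picks up more work.
def process_rows_balanced(num_rows: int, num_workers: int, process_row):
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(process_row, range(num_rows)))
```

Note that `pool.map` returns results in row order even though rows may complete out of order, so the rendered image assembles correctly.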
To summarize the optimization journey across the first three blog posts: in the first, we started by taking our Python code and running it in Mojo, added type annotations, and performed algebraic simplifications. In the second, we vectorized the code, evolved it to utilize multiple ports, and then parallelized the implementation to exercise all the cores on the system. Finally, in this blog post, we used over-partitioning to resolve the load imbalance introduced by parallelism. Over this journey we achieved speedups over Python ranging from 46x to 68,000x, as shown below.
Throughout these blog posts, we showed that Mojo affords you great performance while being portable and easy to reason about. Performance does not come for free, however, and thus one has to be cognizant of the underlying hardware and algorithms to be able to achieve it.
In the previous blog post, we motivated the Mandelbrot set problem and described basic optimization techniques to get around a 90x speedup over Python. In this blog post (part 2 of our three-part series), we continue our optimization journey and describe how to go from 90x to a 26,000x speedup over Python. We will share insights into the techniques we use and discuss why Mojo is well positioned to address them. In the upcoming and final part 3, we'll wrap this series up by showing you how to get well over a 35,000x speedup over Python.
How do I run this example? Last week we announced that the Mojo SDK will become available for local download to everyone in early September. Sign up to be notified if you haven't already! This example will become available in the Mojo examples repository on GitHub along with the Mojo SDK. And don't forget to join us live on our first ever Modular Community Livestream, where we'll share recent developments in the Mojo programming language and answer your questions about Mojo.
In this blog post, we are going to continue this journey and describe further optimizations that get us closer to the 35,000x speedup we reported at launch. We will still use the 88-core Intel Xeon Platinum 8481C CPU that we used in the first blog post in this series.
In Mojo, the SIMD type is a first-class type. To take advantage of SIMD to operate on multiple pixels at a time, we have to write our code slightly differently from the scalar code shared earlier. This is especially true because each pixel may escape at a different iteration count. The resulting code is still straightforward, though:
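The actual implementation is in Mojo; as a rough NumPy analogue (array lanes standing in for SIMD lanes, with illustrative names), the key trick is a mask that freezes lanes once they escape, so each pixel records its own escape iteration:

```python
import numpy as np

# Vectorized escape-time iteration: all lanes advance together, and a
# boolean `active` mask stops updating lanes whose pixels have escaped.
def mandelbrot_simd(c: np.ndarray, max_iters: int = 100) -> np.ndarray:
    z = np.zeros_like(c)
    iters = np.zeros(c.shape, dtype=np.int64)
    active = np.ones(c.shape, dtype=bool)   # lanes still iterating
    for _ in range(max_iters):
        z[active] = z[active] ** 2 + c[active]
        escaped = active & (np.abs(z) > 2.0)
        active &= ~escaped                  # freeze escaped lanes
        iters[active] += 1                  # count only surviving lanes
        if not active.any():                # early exit: whole vector done
            break
    return iters
```

Points inside the set (e.g. `0+0j`) run to `max_iters`, while a fast-escaping point like `2+0j` stops counting after its first divergent step, matching the scalar loop lane by lane.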
Recall that in the first blog post we only achieved 90x speedup over Python. The theoretical expectation from the vectorization is around 8x speedup (the system has a 512-bit vector register, therefore we have 512/64=8 double-precision elements per SIMD vector). We can plot the speedup against Python and the best results from the first blog post (marked as blog1):
In terms of performance, we can evaluate the speedup as we vary num_ports and select the best value (i.e., autotune). If we plot the results, we can see that num_ports=4 is the sweet spot (with a 1.6x speedup), and that as we increase the number of ports further, performance degrades because contention increases.
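Mojo performs this selection with its built-in autotuning at compile time; a hypothetical runtime equivalent in Python just times each candidate value and keeps the fastest:

```python
import timeit

# Time each candidate parameter value and return the one with the best
# wall-clock time. (Mojo's autotuning does this search at compile time;
# this runtime sketch only illustrates the idea.)
def autotune(workload, candidates, repeats: int = 3):
    timings = {value: timeit.timeit(lambda: workload(value), number=repeats)
               for value in candidates}
    return min(timings, key=timings.get)
```

For example, `autotune(run_mandelbrot_with_ports, [1, 2, 4, 8])` (a hypothetical benchmark function) would return the num_ports value that ran fastest on the current machine.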
So far we have only looked at a single-threaded implementation (with vectorization), but modern hardware has multiple cores. A natural way to parallelize is to subdivide the complex plane into a sequence of rows, where each thread gets a sequence of rows to operate on.
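As a sketch of that static split (a hypothetical Python helper, not the blog's Mojo code), each core receives one contiguous chunk of rows:

```python
# Split `height` rows into one contiguous (start, end) chunk per core.
def partition_rows(height: int, num_cores: int):
    chunk = (height + num_cores - 1) // num_cores  # ceil division
    return [(start, min(start + chunk, height))
            for start in range(0, height, chunk)]
```

Each core then renders its row range independently; for example, `partition_rows(10, 4)` yields `[(0, 3), (3, 6), (6, 9), (9, 10)]`.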
However, there is a problem with the above approach. With 88 cores we should expect around an 88x speedup, but we only got around 30x. In the next blog post we will describe what this problem is and how we can address it to achieve well over a 35,000x speedup.
As you can see above, our implementation achieves over 26,000x speedup over Python. In the next blog post, we are going to discuss the next set of optimizations to get over the 35,000x speedup. Stay tuned!
Don't forget to join us live on our first ever Modular Community Livestream, where we'll share recent developments in the Mojo programming language and answer your questions about Mojo. We will also be releasing the Mojo SDK to everyone in early September. Sign up to download.
There are a lot of choices out there when it comes to finding software to help you with your writing. The most obvious is your word processor, but there are several other types of software that can influence the efficiency of your academic writing. In this post, I review five types of software and outline the ways in which they can be assembled into an academic workflow. These five types are: 1) word processors and writing tools, 2) reference managers, 3) PDF annotators, 4) databases for organising your notes, and 5) drawing and image processing. Fair warning: I am an Apple user, so some of these could be different on Windows or Linux. I also have a social science perspective, which might differ for some disciplines.
There are lots of debates on various websites about which is the "best" word processor. My view is that it depends on what you are writing, and has a lot to do with your process and objective. I will also say that the decision about your writing application has to be made together with your choice of reference manager, as not all combinations play well together.
Lots of people use Word and are perfectly happy with it. I find myself frustrated using it: I don't feel I have full control over the way things are formatted, formatting changes without my consent, and it is unstable, especially for larger documents, say over 50,000 words or with lots of tracked changes and cross-references (at least on Mac, and on Windows 7, the last version I had to use in the office; I don't know about newer versions). Word is effective for collaboration, though, as other users can easily add to and comment on your draft, especially when linked with OneDrive, Dropbox or another shared-folder service. There are some great collaboration features in Word 2016 for Mac and the web version of Word, which I have yet to fully test drive, but I like the new commenting features. Note that Word 2016 for Mac and Windows is currently compatible with Mendeley, but other reference managers are not yet working at the time of writing. Now that Word 2016 for Windows is out, maybe that development will speed up for reference managers. I have not yet tested collaboration between Word 2016 for Mac and Windows, but it seems that it will work.
Google Docs has a lot fewer features than Word, but it is very stable and quite capable for limited formatting. Its real advantage is collaboration. When co-authoring a paper, multiple authors can work on the same document at the same time without the conflicted copies that desktop applications produce on a shared drive like Dropbox or Box. You can even use Google Docs offline. What I don't like about Google Docs is that I basically need to format the finished document in another application, and that automatic numbering for tables, figures and so on is not possible, nor is creating cross-references as you can in a full word processor. That makes it nice for a co-authored journal article, but not ideal for a thesis or book. Google Docs works with two reference managers with which I am familiar: ProQuest Flow and PaperPile. For anything else, you would have to use 'curly references' (coded placeholders in curly brackets) and export to another word processor for finishing and bibliography creation.
Scrivener has been getting a lot of attention in my circles. It is decidedly not a word processor in that its function is not final formatting, although it has the basic features to do some of this. Scrivener is great for organising writing. Basically, you can write in chunks (a sentence, paragraph or section) and not worry too much about the flow of your work. It is really easy to move these chunks of text around, or bin the bits you don't know what to do with but don't want to lose. I have used it on several occasions to deconstruct my writing and put it back together in a way that makes better sense. Like Google Docs, Scrivener is not that great at auto-numbering and cross-referencing, though it is possible using placeholder codes in your text; when you export to a word processor they all number nicely, but they are hard to keep track of and you have to create them all manually. I know some people who use Scrivener to store and organise all their notes as well, but I have never used it this way. Also, if you work with complex tables, Scrivener might not be the best for you. If you are the kind of person who likes a finished format as you go, as I am, you will need to break away from Scrivener at some point. You can use any reference manager that works with a word processor: put in 'curly references', and when you export to the word processor you can scan the document so that your references format and a bibliography is produced.