
Parsing using a Graphics Processing Unit (GPU)?


Roger L Costello

Sep 1, 2020, 12:44:53 AM

Hi Folks,

I am reading a book [1] on machine learning and the book says some pretty
interesting things:

"In the search for more speed, machine learning researchers started taking
advantage of special hardware found in some computers, originally designed to
improve graphics performance. You may have heard these called graphics cards.
... Those graphics cards contain a GPU, or graphics processing unit. Unlike a
general purpose CPU, a GPU is designed to perform specific tasks, and do them
well. One of those tasks is to carry out arithmetic, including matrix
multiplication, in a highly parallel way. ... GPUs have many more [than CPUs]
arithmetic cores, thousands are fairly common today. This means a huge
workload can be split amongst all those cores and the job can be done
quickly."

Neat!

Has the parsing community found a way to take advantage of GPUs?

From the above excerpt, it appears that GPUs are especially good at
arithmetic. When I think of parsing, I don't think of lots of arithmetic.
Perhaps someone has devised a way to recast the parsing problem into an
arithmetic problem?

Any thoughts you might have on:

(a) parsing-using-GPUs, and
(b) recasting-the-parsing-problem-into-an-arithmetic-problem

would be appreciated. /Roger

[1] "Make Your First GAN with Pytorch" by Tariq Rashid

[Parsing is not usually an important factor in compiler performance.
The slow parts are the lexer, because it has to look at every
character of the input, and some optimizations that have to analyze
the entire intermediate form of the program. The first step in lexing
is to identify what class each character is, e.g., identifier, white
space, or operator. Perhaps a GPU could do vector lookups to speed
that up. For optimizations, I can sort of imagine how some analyses
like reachability might be expressible as matrices. -John]
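
To make the character-classification idea in John's note concrete, here is a
minimal sketch in C. The class names and the tiny table are illustrative
only, not from any real compiler; the point is that each byte is classified
independently of its neighbors, so a GPU or a SIMD gather could apply the
same table lookup to many bytes at once.

#include <stdio.h>

/* A minimal sketch of the table-lookup classification step: one class
 * per byte value, filled in for a handful of illustrative classes.
 * Every byte is classified independently, which is what makes this
 * step a candidate for SIMD/GPU vector lookups. */
enum char_class { CC_OTHER, CC_IDENT, CC_DIGIT, CC_SPACE, CC_OP };

static enum char_class classify[256];   /* zero-initialized to CC_OTHER */

static void init_classes(void) {
    for (int c = 'a'; c <= 'z'; c++) classify[c] = CC_IDENT;
    for (int c = 'A'; c <= 'Z'; c++) classify[c] = CC_IDENT;
    classify['_'] = CC_IDENT;
    for (int c = '0'; c <= '9'; c++) classify[c] = CC_DIGIT;
    classify[' '] = classify['\t'] = classify['\n'] = CC_SPACE;
    for (const char *p = "+-*/=<>!&|;"; *p; p++)
        classify[(unsigned char)*p] = CC_OP;
}

int main(void) {
    init_classes();
    const char *src = "x1 = y + 42;";
    for (const char *p = src; *p; p++)          /* the vectorizable loop */
        printf("'%c' -> class %d\n", *p, classify[(unsigned char)*p]);
    return 0;
}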

Christian Gollwitzer

Sep 1, 2020, 12:03:27 PM

On 31.08.20 at 12:35, Roger L Costello wrote:
> Has the parsing community found a way to take advantage of GPUs?

I don't think that you can speed up parsing a lot using GPUs. GPUs are
generally good at parallel processing, but a single GPU thread is usually
slower than a CPU thread, and there is overhead because GPUs are not
self-contained - i.e. you can usually offload only some part of a
program, and the data needs to be uploaded to the GPU and downloaded back.
GPUs also have faster memory, *if* you access it either in big blocks or
as a serial stream, in which case the compiler can transform it into block
accesses. For random accesses, the memory is slower.

I have done some basic GPU programming, and I think that parsing is not
a parallel task in that sense. The parser reads the input as a stream of
tokens; you can't split the C file at some arbitrary point in half and
parse both parts independently.

Christian

A. K.

Sep 1, 2020, 12:04:14 PM

On Tuesday, September 1, 2020 at 06:44:53 UTC+2, Roger L Costello wrote:
> Has the parsing community found a way to take advantage of GPUs?

The old IT classic still holds: Early optimization is the root of all evil.

IOW: first identify your bottlenecks by profiling your parsing routines.
Repeat it.

Most probably the bottlenecks will have to do with I/O speed and the selected
algorithms, perhaps memory access speed; raw CPU/GPU performance comes far behind.

Hans-Peter Diettrich

Sep 1, 2020, 6:25:37 PM

On 01.09.2020 at 09:22, Christian Gollwitzer wrote:
> On 31.08.20 at 12:35, Roger L Costello wrote:
>> Has the parsing community found a way to take advantage of GPUs?


> I have done some basic GPU programming, and I think that parsing is not
> a parallel task in that sense. The parser reads the input as a stream of
> tokens;

and acts *differently* depending on that input, while a GPU performs
essentially the *same* operation to all elements of a stream (merge,
scale...).

> you can't split the C file at some arbitrary point in half and
> parse both parts independently.

Multiple files can be parsed in parallel just with a multi-core CPU.
More parallel modules require more symbol tables etc., which limits the
number of concurrent threads by the available (cached!) memory.
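
For what it's worth, a minimal sketch of that per-file parallelism with
POSIX threads (compile with -pthread); parse_file() here is a hypothetical
placeholder for a real front end that would own its thread-local lexer,
parser, and symbol table.

#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-in for a real front end; each worker would own
 * its own lexer, parser, and symbol table. */
static void *parse_file(void *arg) {
    const char *path = arg;
    printf("parsing %s\n", path);
    return NULL;
}

int main(void) {
    const char *files[] = { "a.c", "b.c", "c.c" };
    enum { N = sizeof files / sizeof files[0] };
    pthread_t tid[N];

    for (int i = 0; i < N; i++)                 /* one thread per file */
        pthread_create(&tid[i], NULL, parse_file, (void *)files[i]);
    for (int i = 0; i < N; i++)
        pthread_join(tid[i], NULL);
    return 0;
}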

DoDi

Christopher F Clark

Sep 1, 2020, 6:27:19 PM

Our esteemed moderator wrote:
[Parsing is not usually an important factor in compiler performance.
The slow parts are the lexer, because it has to look at every
character of the input, and some optimizations that have to analyze
the entire intermediate form of the program. The first step in lexing
is to identify what class each character is, e.g., identifier, white
space, or operator. Perhaps a GPU could do vector lookups to speed
that up. For optimizations, I can sort of imagine how some analyses
like reachability might be expressible as matrices. -John]

As to parallelizing lexers, we did some experiments on speeding up
regular expressions (mostly for Snort) while I was at Intel. Charlie
Lasswell did experiments on starting parallel scans on different parts
of a stream or file that showed some promise. The hyperscan team (aka
Sensory Networks, Geoff Langdale &co) had some interesting results on
when you could skip ahead and guess the correct state when using
NFAs--I believe they called it "and then a miracle happens" (following
the famous cartoon about a mathematical proof with some missing steps
on a whiteboard). Michela Becchi worked on predicting when states
involved in xFA loops would terminate. Geoff had me do a mode
they called Maclellan (the HS folks had some interesting naming
conventions, one of them being the use of US Civil War generals for
major variations/modes), which turned NFAs into parallelized DFAs
for Hyperscan. The regular expression engine we put into the Cave
Creek chip did a different form of parallel DFAs, spawning them when
certain conditions were met. For things like the Aho-Corasick
algorithm you can show that there tends to be a "bushy" part near the
head of the automaton, with "skinny" parts being forked off after a
small number of bytes--that was the key insight that led to our chip.

So, there are definite possibilities for doing SIMD regular expression
matching that would be suitable for GPU implementation. Geoff and I
discussed how to approach it on numerous occasions. It just never
quite got past the to-consider list.

Some of this "could" also be applied to parsing, particularly in a GLR
style parser, where you have forests. But our moderator is correct in
saying the big boost would be in speeding up lexing.

--
******************************************************************************
Chris Clark email: christoph...@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------

Elijah Stone

Sep 2, 2020, 12:00:04 AM

On Mon, 31 Aug 2020, Roger L Costello wrote:

> Any thoughts you might have on:
>
> (a) parsing-using-GPUs, and
> (b) recasting-the-parsing-problem-into-an-arithmetic-problem

Look up Aaron Hsu's Ph.D. thesis, _A data parallel compiler hosted on the
GPU_ (https://scholarworks.iu.edu/dspace/handle/2022/24749). As John (and
others) mentioned, I don't think the GPU is an especially interesting target
to speed up parsing specifically, but it may be a fruitful line of
inquiry. If so, that thesis is, as far as I can tell, the only
research that has been done so far; maybe you can base your own
experiments on it.

--
time flies like an arrow;
fruit flies like a banana

Aharon Robbins

Sep 2, 2020, 10:04:43 AM

In article <20-0...@comp.compilers>, A. K. <minf...@arcor.de> wrote:
>Am Dienstag, 1. September 2020 06:44:53 UTC+2 schrieb Roger L Costello:
>The old IT classic still holds: Early optimization is the root of all evil.

Just to give credit where credit is due, it was Donald Knuth who said
"Premature optimization is the root of all evil."

IIRC it was in his famous "Structured Programming with GOTO Statements"
paper.

Arnold

Jan Ziak

Sep 2, 2020, 10:05:10 AM

On Tuesday, September 1, 2020 at 6:03:27 PM UTC+2, Christian Gollwitzer wrote:
> The parser reads the input as a stream of
> tokens; you can't split the C file at some arbitrary point in half and
> parse both parts independently.

Of course you can split an asm/C/C++/Go/Python/Rust/etc. file at arbitrary
points: wherever the state of the lexical analyzer, started from a random
file position with an arbitrary starting state, collapses to a single
state.
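
A minimal sketch of that convergence test, using a hypothetical three-state
lexer (code / string / line comment) rather than any real language's
automaton: run the DFA from the split point in every possible start state,
and report the offset at which all runs have collapsed to one state.

#include <stdio.h>
#include <string.h>

#define NSTATES 3   /* 0 = code, 1 = in string, 2 = in line comment */

/* Toy transition function; a real lexer DFA has many more states. */
static int step(int s, char c) {
    switch (s) {
    case 0: return c == '"' ? 1 : 0;                 /* quote opens string */
    case 1: return (c == '"' || c == '\n') ? 0 : 1;  /* quote/newline ends */
    case 2: return c == '\n' ? 0 : 2;                /* comment ends at \n */
    }
    return 0;
}

/* Run from every possible start state; return the offset at which all
 * runs agree (so the suffix lexes identically regardless of what came
 * before the split), or -1 if they never converge within the chunk. */
static long converge(const char *chunk, size_t n) {
    int s[NSTATES] = { 0, 1, 2 };
    for (size_t i = 0; i < n; i++) {
        for (int k = 0; k < NSTATES; k++)
            s[k] = step(s[k], chunk[i]);
        if (s[0] == s[1] && s[1] == s[2])
            return (long)(i + 1);
    }
    return -1;
}

int main(void) {
    const char *chunk = "abc\"def\"\nghi";
    printf("collapsed to one state after %ld bytes\n",
           converge(chunk, strlen(chunk)));
    return 0;
}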

-atom

Roger L Costello

Sep 2, 2020, 10:08:19 AM

> Look up Aaron Hsu's Ph.D thesis,

> A data parallel compiler hosted on the GPU

> (https://scholarworks.iu.edu/dspace/handle/2022/24749)

Abstract:
This work describes a general, scalable method for building data-parallel by
construction tree transformations that exhibit simplicity, directness of
expression, and high-performance on both CPU and GPU architectures when
executed on either interpreted or compiled platforms across a wide range of
data sizes, as exemplified and expounded by the exposition of a complete
compiler for a lexically scoped, functionally oriented commercial programming
language. The entire source code to the compiler written in this method
requires only 17 lines of simple code compared to roughly 1000 lines of
equivalent code in the domain-specific compiler construction framework,
Nanopass, and requires no domain specific techniques, libraries, or
infrastructure support. It requires no sophisticated abstraction barriers to
retain its concision and simplicity of form. The execution performance of the
compiler scales along multiple dimensions: it consistently outperforms the
equivalent traditional compiler by orders of magnitude in memory usage and run
time at all data sizes and achieves this performance on both interpreted and
compiled platforms across CPU and GPU hardware using a single source code for
both architectures and no hardware-specific annotations or code. It does not
use any novel domain-specific inventions of technique or process, nor does it
use any sophisticated language or platform support. Indeed, the source does
not utilize branching, conditionals, if statements, pattern matching, ADTs,
recursions, explicit looping, or other non-trivial control or dispatch, nor
any specialized data models.
---------------------------------------------------
Wow!

Thanks for the reference Elijah!

/Roger

Christopher F Clark

Sep 2, 2020, 11:49:49 AM

To follow up on the part of the question about using GPUs to do
arithmetic to implement DFA transitions.

It can "easily" be done. The base ideas can be found in the Wu Mamber
string search algorithm. You treat states like (usually one-hot) bits
in a fixed length (usually one word) bit vector. Your transition
function then simply maps bit vectors to bit vectors. You can do that
as a SIMD instruction, thus it is suitable for a GPU to execute.
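
As a concrete illustration (a sketch, not Hyperscan's actual code), here is
the classic Shift-And matcher at the heart of Wu-Manber's agrep, for
patterns up to one machine word: the entire transition function is a shift,
an OR, and an AND on a bit vector of states, exactly the kind of
branch-free word-wide operation that maps onto SIMD lanes or GPU threads.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *pat  = "abc";       /* pattern, at most 64 characters */
    const char *text = "xxabcyy";
    size_t m = strlen(pat);

    /* mask[c] has bit k set iff pat[k] == c. */
    uint64_t mask[256] = { 0 };
    for (size_t k = 0; k < m; k++)
        mask[(unsigned char)pat[k]] |= (uint64_t)1 << k;

    /* Bit k of `state` means "the first k+1 pattern characters match,
     * ending at the current text position" -- a bit vector of NFA
     * states, updated by pure bit arithmetic. */
    uint64_t state = 0;
    for (size_t i = 0; text[i]; i++) {
        state = ((state << 1) | 1) & mask[(unsigned char)text[i]];
        if (state & ((uint64_t)1 << (m - 1)))
            printf("match ending at offset %zu\n", i);
    }
    return 0;
}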

Hans-Peter Diettrich

Sep 2, 2020, 4:43:11 PM

On 02.09.2020 at 11:13, Jan Ziak wrote:
> On Tuesday, September 1, 2020 at 6:03:27 PM UTC+2, Christian Gollwitzer wrote:
>> The parser reads the input as a stream of
>> tokens; you can't split the C file at some arbitrary point in half and
>> parse both parts independently.
>
> Of course you can split asm/C/C++/Go/Python/Rust/etc file at arbitrary
> points:

Certainly not for C/C++. The preprocessor and included files modify the
original source. That's one of the reasons for slow C/C++ compilation.

> when the state of the lexical analyzer collapses to a single
> state starting from a random file position with an arbitrary starting
> state.

If by "lexical analysis" you mean the lexer, that's only a front end for
the parser. A compiler needs more state and context information, like
symbol tables, which cannot be built from arbitrary parts of a source file.

DoDi
[One can certainly pipeline the C preprocessor and later phases but I'm
guessing that's not what you're talking about here. -John]

Christopher F Clark

Sep 4, 2020, 9:43:34 PM

I hate getting overly involved in this, but not only is GPU lexing
possible, it probably isn't that complicated.

To make this practical, let's assume we are lexing a language like C
and we have split our file into arbitrary chunks (say 1000 bytes) and
are trying to process them in parallel.

First, note that while the preprocessor can do some pretty complicated
things to the output token stream, it doesn't have that impact at the
lexical level, the level of turning things into "raw" tokens. You
cannot write a macro that introduces a quote character or a comment
delimiter. And those are your main issues in terms of lexing. Yes,
you have to do symbol table lookup (and macro expansion) as a
post-step, but you aren't doing that on a character-by-character
basis.

In fact, you basically have only a few different "states" to worry about.

1) you are lexing normally, not in one of the special cases that follows.
2) you are in a string, i.e. you have seen a quote character and until
you see the matching one, you are not tokenizing at all. There are
two quote characters so call them states 2a and 2b.
3) you have seen a C comment start /* and are looking for */. Again,
you are not tokenizing.
4) you have seen a C++ comment start // and are looking for a newline.
Not tokenizing.
5) you have seen a # (preceded only by whitespace) and are looking
for a newline. You are tokenizing, but the line is a preprocessor
directive.
6) you are in code that is in a #if and will be skipped for that reason.
7) you are at the start of your buffer and the previous character might be a \

So, given those states, assume that your chunk starts in state 1 and
lex like that, but save the contents as a string in case you were
actually in state 2. On a quote, keep going but swap the two text
contexts. Similar idea for comment marks, newlines, and hash characters.
You only have to deal with state 7 for the first character, and if you
treat the first character as part of the previous chunk (one you merely
get to inspect), you can probably finesse that issue.

For each chunk, you have roughly two interesting sequences of possible
tokens, whether you were in a string or not. And you can easily patch
those together because your left context always tells you which state
you enter a chunk in.

Now, while I did this hand-waving analysis for C, most languages are
not significantly more complex at the lexical level. Basically, you
are tokenizing, you are collecting a string, or the text will be
entirely thrown away, and you mostly care about the first two cases.
Even building a string is easy (although it can use memory), so you
really only care about the tokenizing case.

So build a SIMD lexer using techniques based upon Wu-Manber (and
Glushkov NFAs), break your text into chunks, and let the GPU have a
field day.
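
A minimal sketch of the stitching side of that scheme, reusing the toy
three-state automaton from earlier in the thread (the transition function
is illustrative and doesn't even detect comment openers): lex each chunk
once per possible entry state to build an entry-to-exit state map -- the
embarrassingly parallel part -- then compose the maps left to right, which
is cheap and could itself be a prefix scan. A real version would also
record the tokens produced under each entry state.

#include <stdio.h>
#include <string.h>

#define NSTATES 3   /* 0 = code, 1 = in string, 2 = in line comment (toy) */

static int step(int s, char c) {
    switch (s) {
    case 0: return c == '"' ? 1 : 0;
    case 1: return (c == '"' || c == '\n') ? 0 : 1;
    case 2: return c == '\n' ? 0 : 2;
    }
    return 0;
}

/* Lex one chunk once per possible entry state, recording where each
 * entry state ends up. Each chunk's map is computed independently --
 * this is the work you would hand to one GPU thread per chunk. */
static void chunk_map(const char *chunk, int exit_state[NSTATES]) {
    for (int s0 = 0; s0 < NSTATES; s0++) {
        int s = s0;
        for (size_t i = 0; chunk[i]; i++)
            s = step(s, chunk[i]);
        exit_state[s0] = s;
    }
}

int main(void) {
    const char *chunks[] = { "int x; \"he", "llo\" y; ", "z = 1;" };
    enum { NCHUNKS = 3 };
    int maps[NCHUNKS][NSTATES];

    for (int c = 0; c < NCHUNKS; c++)       /* the parallel pass */
        chunk_map(chunks[c], maps[c]);

    int s = 0;                              /* the sequential stitch */
    for (int c = 0; c < NCHUNKS; c++) {
        printf("chunk %d: enter state %d, exit state %d\n",
               c, s, maps[c][s]);
        s = maps[c][s];
    }
    return 0;
}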

gah4

Sep 10, 2020, 1:54:37 PM

On Monday, August 31, 2020 at 9:44:53 PM UTC-7, Roger L Costello wrote:
> Hi Folks,
>
> I am reading a book [1] on machine learning and the book says some pretty
> interesting things:
>
> "In the search for more speed, machine learning researchers started taking
> advantage of special hardware found in some computers, originally designed to
> improve graphics performance. You may have heard these called graphics cards.

The usual way to use a GPU is for single-precision floating point. They might also
be able to do fixed-point arithmetic of a reasonable size.

As above, parsing is usually fast enough.

GPUs are often used for dynamic programming, which is sometimes
used for optimization and code generation in compilers. It might be that those
could use a speed boost, perhaps more so for unusual architectures where
optimal code generation is more important.

Stefan Monnier

Sep 22, 2020, 8:24:44 PM

> I hate getting overly involved in this, but not only is GPU lexing
> possible, it probably isn't that complicated.

Indeed. Extra bonus points if you do it generically by having the `lex`
tool do that parallelization for you. In the general case the
parallelization could be costly, but in practice it should
be pretty efficient for most programming languages.


Stefan