R Bytecode


Rosy Demorest

Aug 4, 2024, 5:58:20 PM
to foodssudeemat
Bytecode (also called portable code or p-code) is a form of instruction set designed for efficient execution by a software interpreter. Unlike human-readable[1] source code, bytecodes are compact numeric codes, constants, and references (normally numeric addresses) that encode the result of a compiler's parsing and semantic analysis of things like type, scope, and nesting depth of program objects.
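To make "compact numeric codes" concrete, here is a small sketch using CPython's standard dis module; CPython's bytecode is just one example of such an encoding, and the exact opcode names and numbers vary between interpreter versions:

    import dis

    def add(a, b):
        return a + b

    # Human-readable disassembly: one mnemonic plus an optional argument per instruction.
    dis.dis(add)

    # The underlying encoding is just a compact byte string of opcode/argument pairs.
    print(add.__code__.co_code)
    print([(ins.opname, ins.arg) for ins in dis.get_instructions(add)])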

The name bytecode stems from instruction sets that have one-byte opcodes followed by optional parameters. Intermediate representations such as bytecode may be output by programming language implementations to ease interpretation, or they may be used to reduce hardware and operating-system dependence by allowing the same code to run cross-platform on different devices. Bytecode may be directly executed on a virtual machine (a p-code machine, i.e., an interpreter), or it may be further compiled into machine code for better performance.


Since bytecode instructions are processed by software, they may be arbitrarily complex, but are nonetheless often akin to traditional hardware instructions: virtual stack machines are the most common, but virtual register machines have also been built.[2][3] Different parts may often be stored in separate files, similar to object modules, but dynamically loaded during execution.
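As a sketch of what a (deliberately tiny) virtual stack machine looks like, the following Python snippet defines a few made-up opcodes and an interpreter loop for them; the opcode names and numbers are invented for illustration and do not come from any real VM:

    # Hypothetical opcodes for a toy stack machine (not any real VM's encoding).
    PUSH_CONST, ADD, MUL, PRINT, HALT = range(5)

    def run(code, consts):
        """Interpret a flat list of (opcode, arg) pairs on an operand stack."""
        stack = []
        pc = 0
        while True:
            op, arg = code[pc]
            pc += 1
            if op == PUSH_CONST:
                stack.append(consts[arg])
            elif op == ADD:
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == MUL:
                b, a = stack.pop(), stack.pop()
                stack.append(a * b)
            elif op == PRINT:
                print(stack.pop())
            elif op == HALT:
                return

    # "Bytecode" for: print((2 + 3) * 4)
    program = [(PUSH_CONST, 0), (PUSH_CONST, 1), (ADD, None),
               (PUSH_CONST, 2), (MUL, None), (PRINT, None), (HALT, None)]
    run(program, consts=[2, 3, 4])   # prints 20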


A bytecode program may be executed by parsing and directly executing the instructions, one at a time. This kind of bytecode interpreter is very portable. Some systems, called dynamic translators, or just-in-time (JIT) compilers, translate bytecode into machine code as necessary at runtime. This makes the virtual machine hardware-specific but does not lose the portability of the bytecode. For example, Java and Smalltalk code is typically stored in bytecode format, which is then typically JIT compiled to translate the bytecode to machine code before execution. This introduces a delay before a program is run, when the bytecode is compiled to native machine code, but improves execution speed considerably compared to interpreting source code directly, normally by around an order of magnitude (10x).[4]


Because of its performance advantage, today many language implementations execute a program in two phases, first compiling the source code into bytecode, and then passing the bytecode to the virtual machine. There are bytecode-based virtual machines of this sort for Java, Raku, Python, PHP,[a] Tcl, mawk and Forth (however, Forth is seldom compiled via bytecode in this way, and its virtual machine is more generic instead). The implementations of Perl and Ruby 1.8 instead work by walking an abstract syntax tree representation derived from the source code.
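CPython is one concrete example of this two-phase scheme; a minimal sketch of "compile once, then hand the bytecode to the VM" using only built-ins:

    source = "total = sum(i * i for i in range(10))\nprint(total)"

    # Phase 1: parse and compile the source into a code object (bytecode plus constants).
    code_obj = compile(source, "<example>", "exec")

    # Phase 2: the virtual machine executes the code object; no re-parsing happens here.
    exec(code_obj)             # prints 285
    print(code_obj.co_consts)  # the constants table stored alongside the bytecode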


More recently, the authors of V8[1] and Dart[7] have challenged the notion that intermediate bytecode is needed for fast and efficient VM implementation. Both of these language implementations currently do direct JIT compiling from source code to machine code with no bytecode intermediary.[8]


I used to parse Erlang binaries a lot in the past, and recently started something with them in Elixir as a typed experiment (ran out of time, bleh), which you can check out here if you want to see how to read type information and such (and see the general API): _elixir


You can also specify what type the input should be, a little anyway, so for example the option :from_core means that the input, whether file or forms, is Core Erlang. This is what I use in the LFE compiler, where I generate Core Erlang forms (their AST anyway) which I then compile with :compile.forms(forms, [:from_core | options]). I found this easier than generating Erlang AST.


EDIT: I see from the video that there should be Kernel Erlang before the BEAM Bytecode?

EDIT2: I seem to have found the slides from the other talk you mentioned: Slides about implementing Erlang languages


There are actually 2 passes between Core Erlang and the bytecode: kernel and life. The kernel pass converts the code to Kernel Erlang, where it has been flattened, lambda lifted and the pattern matching has been compiled. The life pass does lifetime analysis of variables.


Bytecode is computer object code that an interpreter converts into binary machine code so it can be read by a computer's hardware processor. The interpreter is typically implemented as a virtual machine (VM) that translates the bytecode for the target platform. The machine code consists of a set of instructions that the processor understands.


Many computer languages, such as C and C++, require a separate compiler for a specific computer platform. That is, a separate compiler is needed for each combination of operating system (OS) and hardware architecture. For example, Microsoft Windows and Intel's microprocessors represent one platform, and macOS and the Apple M-series chips represent another.


One of the most common examples of bytecode in action is the Java programming language. When an application is written in Java, the Java compiler converts the source code to bytecode, outputting the bytecode to a CLASS file.


The CLASS file is then read and processed by a Java virtual machine (JVM) running on a target system. The JVM, which is part of the Java Runtime Environment, interprets the bytecode and converts it to machine language specific to the intended platform.
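CPython's .pyc files play an analogous role to Java's CLASS files; here is a hedged sketch of peeking inside one, assuming Python 3.7+, where the file is a 16-byte header (magic number, flags and source metadata) followed by the marshalled code object:

    import dis, importlib.util, marshal, py_compile

    # Write and byte-compile a throwaway module.
    with open("hello_mod.py", "w") as f:
        f.write("print('hello from bytecode')\n")
    pyc_path = py_compile.compile("hello_mod.py")

    with open(pyc_path, "rb") as f:
        data = f.read()

    print(data[:4] == importlib.util.MAGIC_NUMBER)  # True: version-specific magic number
    code_obj = marshal.loads(data[16:])             # skip the 16-byte header
    dis.dis(code_obj)                               # the module's bytecode
    exec(code_obj)                                  # the VM runs it and prints the message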


The JVM interpreter usually processes the bytecode instructions one instruction at a time, but a JVM can also support a just-in-time compiler. These compilers can process the bytecode more efficiently, which helps improve application performance.


The Lisp programming language, once commonly used for artificial intelligence applications, is an earlier language that uses bytecode as an intermediary step. Other languages mentioned above, such as Python, PHP, Raku and Tcl, use bytecode or a similar approach.


For example, I have a large directory of Python modules. Rather than having it littered with .pyc files, I would like to keep my source code in the directory and have a subdirectory like "bytecode" where all of the .pyc files are stored.


To clarify: Before 3.2, you can compile bytecode and put it elsewhere as per Brian R. Bondy's suggestion, but unless you actually run it from there (and not from the folder you want to keep pristine) Python will still output bytecode where the .py files are.


EDIT: sorry, I misread the question - what I meant is that it's not possible (I think) to configure the Python executable to put its bytecode files in a different directory by default. You could always write a little Python compiler script (or find one, I'm sure they're out there) that would put the bytecode files in a location of your choosing.
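For what it's worth, a couple of options along those lines, sketched with standard-library APIs (py_compile and compileall are in the standard library; redirecting the regular __pycache__ output needs Python 3.8+, and the paths below are made up for the example):

    import compileall, py_compile

    # Explicitly choose the output path for a single file's bytecode.
    py_compile.compile("mypackage/module.py", cfile="bytecode/module.pyc")

    # Byte-compile a whole tree in one go.
    compileall.compile_dir("mypackage", quiet=1)

    # On Python 3.8+ you can also redirect the usual __pycache__ output for imports
    # by setting PYTHONPYCACHEPREFIX=/some/other/dir (or python -X pycache_prefix=...).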


I remember a professor once saying that interpreted code was about 10 times slower than compiled. What's the speed difference between interpreted and bytecode? (assuming that the bytecode isn't JIT compiled)


When you compile things down to bytecode, you have the opportunity to first perform a bunch of expensive high-level optimizations. You design the byte-code to be very easily compiled to machine code and run all the optimizations and flow analysis ahead of time.


The speed increase is thus potentially quite substantial: not only do you skip the whole lexing/parsing stages at runtime, but you also have more opportunity to apply optimizations and generate better machine code.
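A rough way to see just the "skip lexing/parsing" part, sketched with CPython's compile() and timeit (the exact ratio will vary a lot by machine and by snippet):

    import timeit

    source = "sum(i * i for i in range(100))"

    # Re-parse and re-compile the source text on every execution.
    t_source = timeit.timeit(lambda: exec(source), number=20_000)

    # Compile once, then only the bytecode is executed each time.
    code_obj = compile(source, "<bench>", "eval")
    t_bytecode = timeit.timeit(lambda: eval(code_obj), number=20_000)

    print(f"parse+run each time: {t_source:.3f}s, precompiled bytecode: {t_bytecode:.3f}s")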


You could see a pretty good boost. However, there are a lot of factors. You can't just say that compiled code is always about 10 times faster than interpreted code, or that bytecode is n times faster than interpreted code.


Factors include the complexity and verbosity of the language, for example. If a keyword in the language is several characters and the corresponding bytecode is a single byte, it should be quite a bit faster to load the bytecode and jump to the routine that handles it than it is to read the keyword string and then figure out where to go. But if you're interpreting one of the exotic languages that have one-byte keywords, the difference might be less noticeable.


I've seen this performance boost in practice, so it might be worth it for you. Besides, it's fun to write such a thing; it gives you a feel for how language interpreters and compilers work, and that will make you a better coder.


For instance, when you run a Perl program directly from its source code, the first thing it does is compile the source into a syntax tree, which it then optimizes and uses to execute the program. In normal situations the time spent compiling is tiny compared to the time actually running the program.


Sticking to this example, obviously Perl cannot be faster than well-optimized C code, as it is written in C. In practice, for most things you would normally do with Perl (like text processing), it will be as fast as you could reasonably code it in C, and orders of magnitude easier to write. On the other hand, I certainly wouldn't try to write a high performance math routine directly in Perl.


For example, consider executing a Python script. When you do that, you have all the costs associated with converting the program text into the internal interpreter data structures, which are then executed.


But if you consider, say, a classic BASIC interpreter, these typically never store the raw text; rather, they store a tokenized form and recreate the program text when you do "LIST". Here the bytecode is much cruder (you don't really have a virtual machine), but your execution gets to skip some of the text processing. That's all done when you enter the line and hit ENTER.


Don't think that if you convert your interpreted code into bytecode it will run as fast as Java (near C speeds); there have been years of performance work behind those VMs. But you should see a significant speed boost.


I've never come across a Vim script that was slow enough to notice. Assuming a script primarily calls built-in, native-code operations (regexes, block operations, etc.) that are implemented in the editor's core, even a 10x speed-up of the 'glue logic' in scripting would be insignificant.
