JIT: So you want to be faster than an interpreter on modern CPUs

67 points by pinaraf a day ago

stmw 2 hours ago

Good read. But a word of caution: "JIT vs interpreter" comparisons often favor the interpreter when the JIT is implemented as more-or-less simple inlining of the interpreter code (here called "copy-and-patch", but a decades-old approach). I've had fairly senior engineers try to convince me that this is true even for Java VMs. It's not true in general, at least not with the right kind of JIT compiler design.

hoten an hour ago

I just recently upgraded[1] a JIT that essentially compiled each bytecode separately to one that shares registers within the same basic block. Easy 40 percent improvement to runtime, as expected.
But something I hadn't expected was it also improved compilation time by 40 percent too (fewer virtual registers made for much faster register allocation).
[1] https://github.com/ZQuestClassic/ZQuestClassic/commit/68087d...

_cogg an hour ago

Yeah, I expect the real advantage of a JIT is that you can perform proper register allocation and avoid a lot of stack and/or virtual register manipulation.
I wrote a toy copy-patch JIT before and I don't remember being impressed with the performance, even compared to a naive dispatch loop, even on my ~11 year old processor.
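
For readers who have not written one, a "naive dispatch loop" usually looks something like the sketch below. This is not _cogg's code; the opcode names, the instruction encoding and the `run_switch` function are all made up for illustration. It shows the stack traffic the comment refers to: every operand goes through an in-memory stack rather than staying in a register.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for a tiny stack machine (illustration only). */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Naive dispatch loop: one switch per instruction, and every operand
   is written to and read back from the in-memory stack. */
int64_t run_switch(const int64_t *code) {
    int64_t stack[64];
    size_t sp = 0;   /* index of the next free stack slot */
    size_t pc = 0;   /* index of the next instruction word */
    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH: stack[sp++] = code[pc++];          break;
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp];  break;
        case OP_MUL:  sp--; stack[sp - 1] *= stack[sp];  break;
        case OP_HALT: return stack[sp - 1];
        default:      return -1;  /* unknown opcode */
        }
    }
}
```

Calling `run_switch` on the program `PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT` evaluates (2 + 3) * 4. A JIT with real register allocation could keep those intermediate values in registers instead.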

klipklop 2 hours ago

A shame operating systems like iOS/iPadOS do not allow JIT. iPad Pros have such fast CPUs that you can't even use them fully because of decisions like this.

Pulcinella 27 minutes ago

Those operating systems allow it, but Apple does not. Agree that it is a total waste.

gr4vityWall 3 hours ago

That was a pretty interesting read.

My take is that you can get pretty far these days with a simple bytecode interpreter. Food for thought if your side project could benefit from a DSL!

imtringued an hour ago

I'm not really interested in building an interpreter, but the part about scalar out of order execution got me thinking. The opcode sequencing logic of an interpreter is inherently serial and an obvious bottleneck (step++; goto step->label; requires an add, then a fetch and then a jump, pretty ugly).

Why not do the same thing the CPU does and fetch N jump addresses at once?

Now the overhead is gone and you just need to figure out how to let the CPU fetch the chain of instructions that implement the opcodes.

You simply copy the interpreter N times, store N opcode jump addresses in N registers and each interpreter copy is hardcoded to access its own register during the computed goto.
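
One possible reading of this idea, for N = 2, is sketched below. It is only a sketch under several assumptions: it relies on the GCC/Clang "labels as values" extension (`&&label`, `goto *ptr`), the `Insn` encoding and opcode names are invented, and it only handles straight-line bytecode, since a taken bytecode branch would invalidate the preloaded addresses. Whether it is actually faster than ordinary threaded dispatch is not measured here.

```c
#include <stddef.h>
#include <stdint.h>

enum { T_PUSH, T_ADD, T_HALT };                /* hypothetical opcodes  */
typedef struct { int op; int64_t arg; } Insn;  /* hypothetical encoding */

/* Two copies of each handler. Copy A jumps through next_b, copy B jumps
   through next_a, and each copy reloads its own variable two
   instructions ahead. Straight-line code only; `code` must end with
   two T_HALT entries so the look-ahead never reads past the array. */
int64_t run_two_copies(const Insn *code) {
    static void *const copy_a[] = { &&push_a, &&add_a, &&halt };
    static void *const copy_b[] = { &&push_b, &&add_b, &&halt };
    int64_t stack[64];
    size_t sp = 0;
    const Insn *ip = code;

    void *next_a = copy_a[ip[0].op];  /* handler for instruction 0 */
    void *next_b = copy_b[ip[1].op];  /* handler for instruction 1 */
    goto *next_a;

push_a:
    stack[sp++] = ip->arg;
    ip++;
    next_a = copy_a[ip[1].op];        /* prefetch two steps ahead */
    goto *next_b;
add_a:
    sp--; stack[sp - 1] += stack[sp];
    ip++;
    next_a = copy_a[ip[1].op];
    goto *next_b;

push_b:
    stack[sp++] = ip->arg;
    ip++;
    next_b = copy_b[ip[1].op];
    goto *next_a;
add_b:
    sp--; stack[sp - 1] += stack[sp];
    ip++;
    next_b = copy_b[ip[1].op];
    goto *next_a;

halt:
    return sp ? stack[sp - 1] : 0;
}
```

In this version the address used by each `goto` was loaded one handler earlier, so the load and the indirect jump no longer sit back to back in the same handler. The cost is duplicated handler code and the restriction to branch-free bytecode.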