Fishy Magic

luaJIT

by Jameson
Tags: code, lua

I’d been wanting to try euclid on a Raspberry Pi, as I’d heard that although its CPU runs at 700MHz, its performance is roughly that of a Pentium II at 200-300MHz. That should make it at least three times as fast as a machine that could run Duke Nukem 3D, whose manual asks for a 486 DX2/66 (a 66MHz processor, and a much less ‘smart’ design overall). Now, I’ve no doubt that Ken Silverman is a much better programmer than I am, and of course lua slows things down, but this sounded like plenty of headroom for the engine to run.

How wrong I was

Porting the game to work on linux again turned out to be relatively straightforward, although I had to rely on the linux packages for SDL2 and SDL2_image, which is probably for the best anyway. After a few other tweaks I got it running, though it wasn’t yet responding to input. However, the framerate was not so good: about 1-2 frames per second. Worse still, my lua profiler told me that the Render function alone was taking 0.05 seconds, meaning that at best the game would reach 20 frames per second, or so I thought.

At the urging of several acquaintances, I figured I’d give luaJIT a go. It advertises 3-15x performance improvements, so it couldn’t hurt. luaJIT is based on lua 5.1, but the language itself isn’t hugely different from 5.3, and most of the API changes are quite easy to deal with. Once integrated, luaJIT roughly doubled the framerate. Not bad, but not good enough.

At this point I was ready to give up on the idea of running on the Raspberry Pi, but I decided to profile it anyway to see if there were any useful patterns. Just about everything was running almost exactly twice as fast. So much so that I almost didn’t notice an important detail: the Render() function was also twice as fast.

Wait, what?

Nearly everything in the Render() function is native code, so it shouldn’t have been affected by the switch to luaJIT. At first I thought it might be down to my C++/lua integration being less efficient than I’d assumed, but I decided to check the ‘nearly’ part of that statement.

It turns out that inside the camera code there was a line of lua like this:

-- find which sector the eye position (entity position plus view bob) ends up in
local sector, pos = entity.map:LineTrace(
	entity.sector,
	entity.position,
	entity.position + self:getViewOffset()
)

Here getViewOffset() supplies the camera’s view bob. Why a LineTrace? Because moving the position horizontally can put it in a new sector, working out which one is non-trivial, and the renderer depends on getting it right. When I removed this LineTrace, the time for the Render call went from 0.028 seconds (the luaJIT time) to 0.006 seconds, so that one line trace was taking around 22 milliseconds. A trace like this is needed every time any entity moves, so with ten entities moving plus the player, that’s 11 × 22ms, or 0.242 seconds per frame, just for line traces.

Of course (OK, not of course), LineTrace was written in lua when it should have been native C++, so there’s a clear improvement to be had here.

Lessons

A lot of this code was written in lua because live editing made iteration fast, so fixing bugs in lua code is quick compared to C++. However, code that is clearly expensive and needs high performance should be moved to C++, even if it seems ‘fast enough’ on my development PC.

The luaJIT documentation also describes an extension that could speed the code up even further: the FFI library, which ties C-style structs and functions into lua in a way that works nicely with the JIT compiler.

Further Possibilities

If I had a way to hot-reload C++ code, this code might never have ended up in lua, as I could have iterated just as quickly in C++. Getting such a system working isn’t simple, but it is definitely possible using DLLs on Windows or shared libraries on linux.

A change like this could be made superficially, by partitioning the code into the parts that can hot-reload and everything else. This wouldn’t be too difficult, but it would only benefit the code that fell on the right side of that line. To make substantial use of the feature, it would need to be worked into the engine in a more fundamental way.

Raspberry Pi

It’s possible that with all of these changes the engine might work on the Raspberry Pi, with strong restrictions on the number of entities and the complexity of maps. While it should be possible to do a lot more on the platform than that, I think that might have to wait for a later engine. By targeting an engine entirely at the platform, and in all likelihood restricting it to mostly C or C++, more impressive results should be possible.
