OpenGL ES 2 (v1.11) for Warp3D Nova (AmigaOS 4)

For the 11th time now I've picked up the pencil I dropped after finishing the initial version 1.0, so it's a new week with a new update once again.
This time we've got one fix and some really *massive* optimizations for a common-case scenario.
The fix: glViewport didn't always work correctly. A dumb typo made it wrongly depend on the height of the provided target window / bitmap.
Thanks to Frank Menzel for reporting that one. I really wonder how this one could remain unnoticed for so long.
But now to the optimizations.
It's all about non-VBO drawing commands. So what's that anyway, you may ask?
OGLES2 allows you to draw your geometry either from GPU memory (VBO) or directly from your application's RAM (non-VBO). The latter is often used in older programs written when VBOs weren't available, in stuff ported from OGL(ES)1, for simplicity, for vertex-data that constantly changes, etc.
In contrast, Nova only allows you to draw your stuff through VBOs. Therefore the OGLES2 wrapper has to create / update at least one VBO internally if you want to draw something from your application's RAM. So to the lib-user it looks as if they were drawing from RAM directly, but in reality the wrapper turns everything into a VBO behind the scenes. And modifying that VBO means that the data has to be uploaded to the GPU. Furthermore, the lib has to wait until the GPU is no longer using that VBO, which can still be the case if you issued another non-VBO draw-command just before.
OGLES2 has to do all that for every single non-VBO glDraw-command you issue, because it has to assume that your vertex data changed. There is no way to tell OpenGL "hey, don't worry, that data will remain unchanged for the next 1000 draw-calls".
As you may guess, all this is a huge bottleneck. So I spent some time improving that situation.
The basic idea is that it's actually faster to check the whole data for changes, and to skip the upload if nothing changed, than to always upload. And instead of comparing the data itself I hash it and compare only that hash. The hashing function was already heavily optimized in 1.10 (it's used internally for other things too). Anyway, that's the core idea, there's a bit more though.
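To make the core idea a bit more concrete, here is a minimal sketch in C. It is not the library's actual code: the names ShadowBuffer and sync_client_array are made up for illustration, the upload is only simulated by a counter, and FNV-1a is used as a simple stand-in because the post doesn't say which hash the library uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a. A stand-in only, NOT the library's optimized hash. */
uint64_t fnv1a64(const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint64_t h = 0xcbf29ce484222325ULL;      /* FNV offset basis */
    for (size_t i = 0; i < len; ++i) {
        h ^= p[i];
        h *= 0x100000001b3ULL;               /* FNV prime */
    }
    return h;
}

/* Hypothetical bookkeeping for one internal VBO. */
typedef struct {
    uint64_t hash;     /* hash of the data that was uploaded last */
    size_t   size;     /* size of the data that was uploaded last */
    int      valid;    /* 0 until the first upload happened */
    unsigned uploads;  /* counts the simulated uploads */
} ShadowBuffer;

/* Returns 1 if the data had to be (re-)uploaded, 0 if the upload was skipped. */
int sync_client_array(ShadowBuffer *b, const void *data, size_t len)
{
    assert(b != NULL && data != NULL);

    uint64_t h = fnv1a64(data, len);
    if (b->valid && b->size == len && b->hash == h)
        return 0;                            /* unchanged: nothing to do */

    /* A real wrapper would wait for the GPU and upload the data here. */
    b->hash  = h;
    b->size  = len;
    b->valid = 1;
    b->uploads++;
    return 1;
}
```

Calling sync_client_array twice with the same, unmodified array triggers only one simulated upload. Note that comparing hashes instead of the data means a hash collision would let a change go unnoticed; how the real library deals with that is not covered in this post.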
Note: throughout the following lines I'll present the fps of the boing-ball-test when compiled *not* to use VBOs! So it's 1024 identical balls (about 800 triangles each) rendered using client memory vertex- / index-data.
Before, it was very slow in that mode (closer to Warp3D than to W3D Nova level), especially with that big load (although the previous update already contained a nice speedup by around factor 3, as you probably remember). However, this test situation can be considered a best-case scenario.
I also prepared a second version of that test, one that renders two different types of balls which use completely different vertex- / index-data sets in an alternating way (ball type A, then type B, then A, B, ...). Triangle- and ball-count are the same as in test 1. This second test is used to somewhat simulate a worst-case scenario.
So, let's begin. With 1.10 we had the following results for those tests (on a sam460ex with a low-end R7 250, default window size, all values approximate):
Variant 1: 3.3 fps
Variant 2: 3.3 fps
Optimization 1: client-RAM index-arrays are uploaded only if a data change has been detected.
Variant 1: 3.7 fps
Variant 2: 3.7 fps
Optimization 2: client-RAM vertex-arrays are uploaded only if a data change has been detected.
Variant 1: 16.0 fps
Variant 2: 3.7 fps
Wow
Variant 1 becomes insane!
Interesting to note is that variant 2 isn't getting any slower (at least not measurably), despite the fact that it now computes a hash over 25 KB of vertex-data (about 800 vertices, 32 bytes each) - and it does so 1024 times per frame for nothing... Yes, the hashing has been optimized indeed.
Optimization 3: instead of just one VBO for index- and vertex-arrays the library now manages a whole lot of such VBOs internally.
Variant 1: 15.9 fps
Variant 2: 12.5 fps
Not bad, hm?
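Why do several internal VBOs help variant 2 so much? With a single VBO, alternating between ball A and ball B invalidates the buffer on every draw. With more than one, each data set can stay resident. The sketch below illustrates that effect in C. It rests on assumptions: the post doesn't say how the library picks or recycles its VBOs, so the least-recently-used replacement, the pool size and all names (VboPool, pool_acquire, ...) are hypothetical, and uploads are again only counted, not performed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SLOTS 8  /* arbitrary size, chosen for this sketch only */

typedef struct {
    uint64_t      hash;       /* hash of the data held by this slot */
    size_t        size;       /* size of that data in bytes */
    unsigned long last_used;  /* 0 = never used, else tick of last use */
} PoolSlot;

typedef struct {
    PoolSlot      slots[POOL_SLOTS];
    unsigned long tick;       /* increases with every acquire */
    unsigned      uploads;    /* counts the simulated uploads */
} VboPool;

/* 64-bit FNV-1a as a stand-in hash (not the library's own). */
uint64_t pool_fnv1a64(const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; ++i) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Returns the slot index that holds the data; "uploads" only on a miss. */
int pool_acquire(VboPool *pool, const void *data, size_t len)
{
    assert(pool != NULL && data != NULL);

    uint64_t h = pool_fnv1a64(data, len);
    int victim = 0;
    pool->tick++;

    for (int i = 0; i < POOL_SLOTS; ++i) {
        PoolSlot *s = &pool->slots[i];
        if (s->last_used != 0 && s->size == len && s->hash == h) {
            s->last_used = pool->tick;       /* hit: data already resident */
            return i;
        }
        /* Track the unused or least recently used slot as replacement. */
        if (s->last_used < pool->slots[victim].last_used)
            victim = i;
    }

    /* Miss: a real wrapper would upload into the victim's VBO here. */
    pool->slots[victim].hash      = h;
    pool->slots[victim].size      = len;
    pool->slots[victim].last_used = pool->tick;
    pool->uploads++;
    return victim;
}
```

Starting from a zero-initialized VboPool and acquiring two different arrays in an A, B, A, B, ... pattern costs two simulated uploads in total, no matter how often the pattern repeats.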
However there's one situation that doesn't benefit much from all this, namely when you render procedurally generated vertex data that changes all the time.
If you use glDrawElements then it's likely that at least the index-array upload can be optimized away, because in the common procedural use-case the indices remain constant. The always-changing vertex-data, however, will not only fail to benefit from all this but might actually become slower than before, because there's the additional hashing overhead now...
However, as we have seen at "Optimization 2", computing the hash is virtually free - at least in these tests, which, after all, throw around about 50 times (!) more vertices while issuing about 40 times (!) more draw-calls than the current "Wings Remastered" beta does in its most heavily loaded strafing scenes... (and "Wings Remastered" will be the heaviest Warp3D game in existence). Just to give you an idea of the amounts of data this boing-ball test is actually about...
So, if you stay within those limits the hashing overhead is definitely negligible.
So compared to v1.10 the new v1.11 delivers a performance gain of around factor 3.8 to 4.8 for common non-VBO situations!
And because it sounds even cooler:
Compared to v1.9 we have an improvement of an incredible factor 11.4 to 14.4!
Note: the actual performance gain highly depends on the size of the data you are about to draw with one draw-call. So don't expect that your numbers are identical to those above. But the overall order of magnitude will be around that.
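For those who want to check the factors: they follow directly from the fps figures above. A tiny C sketch of the arithmetic (the function names are just for illustration):

```c
#include <assert.h>

/* Speedup factor between two frame rates. */
double speedup(double new_fps, double old_fps)
{
    assert(old_fps > 0.0);
    return new_fps / old_fps;
}

/* Combine two successive speedups, e.g. v1.9 -> v1.10 -> v1.11. */
double chained_speedup(double first, double second)
{
    return first * second;
}
```

speedup(12.5, 3.3) is roughly 3.8 (variant 2) and speedup(15.9, 3.3) is roughly 4.8 (variant 1). Chaining those rounded factors with the factor of around 3 from the previous update gives 3 * 3.8 = 11.4 and 3 * 4.8 = 14.4.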
Cheers,
Daniel