Contact me | Login | Search | Sitemap | Site Notice

3D-Engine Comparison

Dissecting the Algorithm

This section will give a deeper insight in the inner structure of the 3D-engine of the 3D-Demo. The subroutine which does the calculus of the drawing positions on the screen of the points of the 3D-object starts in line #529 as INITPNTS in the assembler listing.

What the routine does is basically the following:

  • Get the point positions from the point data table
  • Calculate the combined rotation matrix for the rotations around the x- and the y-axis according to the current rotation angles
  • Perform scaling, translation and projection of the 3D-coordinates on the computer screen
  • Store the new drawing positions in the drawing data table 

The following graph shows these operations in more detail:

Efficiently Calculating SINE & COSINE

A very basic requirement for calculating the rotation matrix is the evaluation of the sine- and cosine-function. This can be done by calculating e.g. a Taylor series expansion which is rather time consuming and inefficient when it comes to fast animations. Hence I chose the option to read the data out of a lookup table.

The data is stored in a memory page of 256 bytes in two's complement enabling a single byte counter to cycle round and round through the table. Following quirks apply:

  • The table is 256 + 64 bytes long, hence resulting in a total of 320 bytes. This is due to the fact that I need to calculate sine and cosine for a given value of the rotation angle at the same time. I could of course add an offset of 90° to the current sine-angle in order to pick the correct cosine-value. However, this would mean more algorithmic overhead and so I chose to spend 64 more bytes for the cosine. So it is possible to lookup both values SINE and COSINE with the same counter value:
* 
* EVAL SIN/COS-TAB
*
LDX LOOPCNTX ; INITIAL POS
LDA SINTAB,X ; GET SINUS FOR X
STA SAX
LDA COSTAB,X ; GET COSINUS FOR X
STA CAX
.
.
.
  • The sine and cosine values are stored as single bytes in two's complement format meaning that the range from -127 to 127 resembles the sine and cosine-functional range from -1 to 1 giving a resolution of approximately 0.0078125 per unit which is sufficient for the calculus.
  • Mapping 360° on 256 values yields an angular resolution of 1.40625° per step which is also sufficient for the calculus
  • Calculating the rotation matrix requires some combined sine- and cosine-terms as can be seen from the graph above for calculating RX and RZ. Since the sine- and cosine base values are 8 bit the resulting combined values are 16 bit after multiplication. For further calculations the 16 bit values are stripped down to 8 bit again. This is done by just using the high-byte of the combined result.
  • The data in the sine and cosine table was calculated using a simple AppleSoft-routine and stored in a small binary file which is loaded in memory by the calling program before the 3D-engine is started.

Calculate the Combined Rotation Matrix

In order to perform the rotations around the x- and the y-axis a combined rotation matrix was calculated as concatenation from the single rotation matrices. This concatenation yields equations for the X-, Y-, and Z-point position in object space. Some thoughts about the numerics involved:

  • Basically the memory size of any stored value is 8 bit (point data, trigonometric data).
  • Any multiplication of two 8 bit values results in a 16 bit value. This value is always stripped back down to 8 bit before the next calculation is done except the division! In principle the high byte of any 16 bit number is used in the next calculation step
  • For the calculation of RX there are no combined sine- and cosine-terms necessary. However, in order to stay in the chosen numerical setting the values of the simple sine- and cosine-values have to be adjusted to fit to the value range of the combined sine- and cosine-terms. This simply is done by doing a LSR (logical shift right) on the sine- or cosine-value which pretty good approximates the high-byte of the 16 bit representation of the formula SIN16(theta) = SIN8(theta) * #$7F. Hint: #$7F equals 1 and adjusts a single 8 bit sine-value to the corresponding 16 bit representation. Instead of doing a multiplication and using the high-byte of the result the following code is much shorter (however, I must admit that you have to deal with negative numbers in two's complement accordingly to perform this calculus correctly. This yields additional overhead but the code is still pretty quick here!):
    LDA CAX     ; COSXONE = COSX * $7F (= 1) 
BPL CX1
. ; NEGATIVE NUMBERS IN TWO'S COMPLEMENT
.
.
LSR ; SHORT METHOD IMPLEMENTED HERE
CX1 STA COSXONE ; DOES NOT NEED MULT-ROUTINE!
  • At the end the values of RX, RY and RZ are stored with 8 bit resolution.

Scaling, Translation and Projection

The final steps encompass scaling, translation and projection. In order to stay in the 8 bit / 16 bit scheme the calculus is performed in the following order:

  • The 8 bit values of RX and RY are scaled with an 8 bit scaling factor. This factor is set to a fixed value in a way that the 3D-objects fit nicely on the screen. The determination of this value which is currently set to #$90 was done by trial and error. I just tried different values in order to determine which value resulted in a good "zoomed" picture on the screen. The result is 16 bit wide
  • The 16 bit result is then divided by RZ + ZTRANS. The z-coordinate gets an offset prior to the division in order to move the object "away" from the "camera". This simple method yields a good projection of the 3D-object on the 2D-screen. This division is maybe the most crucial part of the algorithm. With the variation of ZTRANS an impression of zooming is created. However, if ZTRANS is chosen too small the object will get distorted so the lowest allowed value has to be limited in such a way that the resulting 2D image is not too distorted. If the object is too small an adjustment of the scaling factor will help. If ZTRANS is increasing the object moves farer away form the camera. However, if ZTRANS is too big the division will yield a zero value which also will lead to funny drawing results. This must also be prevented by limiting the maximum ZTRANS-value!
  • The division will yield 8 bit signed return values which are then combined with the XTRANS and YTRANS values in order to yield the true screen positions. These two values are just the middle of the HIRES screen which is (140,96).

With these final steps the true screen positions are evaluated and then stored to the point data table. After finishing with all points, the point data table will be evaluated by the drawing subroutine which draws the 3D-object on the hidden HIRES screen while the other one is still displayed.

The Need for Speed!

The first implementation of the 3D-engine did not incorporate any numerical optimisations or fast mathematical routines besides the table lookup for sine and cosine. Multiplication and division was performed by subroutines incorporating loops for repeated additions/subtractions.

The need for more speed for this 3D-engine is evident. So I tried to incorporate some improvements in order to make the algorithm work faster. Before I will describe the single measures I want to give an overview of the numerical load of the original algorithm. In order to calculate the new position of a single point the algorithm performs:

  • 16 signed multiplications (8 bit * 8 bit)
  • 2 unsigned multiplications (8 bit * 8 bit)
  • 2 signed divisions (16 bit / 8 bit)
  • numerous uncounted additions & subtractions

From the list it is obvious that most of the algorithm time is spend for multiplications. Hence I investigated in methods in order to speed up multiplications. I finally implemented a method which was presented by Neil Parker in terms of a 2kB lookup table for multiplications to speed up this major part of the calculus.

Furthermore I wanted to speed up the division routine. There are only two divisions per point position needed so this resembles not the major part of the calculus but nevertheless I wanted to try to speed that up as well. I could not find an easy implementation of a lookup table for division and also the reciprocal multiplication in terms of a/b = a * (1/b) was not an easy implementation issue. Structural changes to the division routine yielded only minor speed improvements. So I decided to strip down iterations of the division routine instead.

The original algorithm comprises a loop with 8 iterations in order to divide all 8 bits of the divisor. I decided to strip down these 8 iterations to 6 to save 25% of the iteration cycles. Since I still need an 8 bit result I conditioned the input to the division subroutine from 16 bit / 8 bit down to 14 bit / 6 bit. This simplification of the division yielded still reasonable stable drawing results. When I tried to decrease the bit count further more to 13 bit / 5 bit or even 12 bit / 4bit the resulting point values get unstable yielding a jittering 2D representation of the 3D-object. So 14 bit / 6 bit seems to be the lower acceptable accuracy limit for this engine.

Finally I unrolled the division loop after trying to replace the division loop by a faster table-lookup routine which did not yield a acceptable quality of the drawing. The unrolled division yielded some extra percent of performance.

Further analysis yielded the possibility to simplify four multiplications in a way that I did not need to call the table lookup multiplication routine anymore but doing a simple LSR (logical shift right) yielded the same but faster results.

As another step I found out that it was possible to transfer some code from the main loop which cycles through all points saving some more cycles since I only need to calculate a bunch of data only once at every angular position.

Last but not least I was thinking about the undraw of the object on the hidden screen before the next position of the 3D-object was drawn. Undraw was performed by just drawing the 3D-object with HCOLOR = 0. This is ok for simple objects like the tetraeder but has high cost of cycles when the object get more complex. I tried to use the ROM HCLR-function (HIRES CLEAR) but this turned out to at first clear the whole screen making it necessary to redraw the outer box and the text output. Secondly the routine was slower compared to my own undraw. I decided to investigate if there are faster methods of undrawing.

Finally I came to the conclusion that I only will need to clear the portion of the screen that shows the rotating object and not the whole screen. So I decided to implement a rather simple but fast partial clear-screen routine by unrolling a loop which cycles through Y-base adresses of single HIRES lines and undraw not all the 40 bytes per line but only about the middle 18 bytes. This resulted in even more speed increase of the algorithm for the cost of spending a lot more bytes for the code! Have a look at some lines of that simple but efficient code:

*
********************************
* UNDRAW LINES                 *
********************************
*
UDRWLNS        LDA   #$00
*
               LDY   $E6
               CPY   #$20
               BEQ   UDRWSCR1
               JMP   UDRWSCR2
UDRWSCR1       LDX   #$0B
UDRWLP1        STA   $2200,X
               STA   $2600,X
               STA   $2A00,X
               STA   $2E00,X
               STA   $3200,X
               STA   $3600,X
               STA   $3A00,X
               STA   $3E00,X
*
               STA   $2280,X
               STA   $2680,X
               STA   $2A80,X
               STA   $2E80,X
               STA   $3280,X
               STA   $3680,X
               STA   $3A80,X
               STA   $3E80,X
               .
               .
               .
               STA   $21D0,X
               STA   $25D0,X
               STA   $29D0,X
               STA   $2DD0,X
               STA   $31D0,X
               STA   $35D0,X
               STA   $39D0,X
               STA   $3DD0,X
*
               INX
               CPX   #29
               BEQ   UDRW1RTS
               JMP   UDRWLP1
UDRW1RTS       RTS

Benchmark Results 

I performed benchmark tests with a stop watch using the simple benchmark function that I implemented in my code and got the following improved results. The table states each improvement step and the obtained increase of algorithm speed. I used the objects #1, #4 and #5 for doing the benchmarks. Given is the needed algorithm time at 1 MHz CPU speed for the benchmark which simply represents five complete rotations of the 3D-object:

Object #1Object #4Object #5Time #1Time #4Time #5
Original algorithm20,13 s49,43 s42,23 s---
MULT lookup table18,63 s45,29 s36,91 s-7,41%-8,38%-12,60%
DIV to 6 bit17,35 s42,79 s35,19 s-13,81%-13,43%-16,67%
DIV subroutine structure17,17 s42,20 s34,75 s-14,70%-14,63%-17,71%
MULT number reduction16,73 s41,10 s33,32 s-16,89%-16,85%-21,10%
Code transfer14,44 s39,56 s31,12 s-28,27%-19,97%-26,31%
UNDRAW optimised13,16 s26,59 s23,48 s-34,62%-46,21%-44,40%
UNROLLING the DIV-loop12,19 s25,65 s22,32 s-39,44%-48,11%-47,15%
PROJECTION lookup table11,46 s22,95 s19,93 s-43,07%-53,57%-52,81%
ZERO PAGE usage optimization10,58 s21,70 s18,29 s-47,44%-56,10%-56,69%
FAST LINE DRAWING8,26 s16,28 s15,73 s-58,97%-67,06%-62,75%

This table shows that different optimisation methods had different effects on the object drawing speed. This is related to the fact that the objects consist of different number of points and connecting lines. The most complex object in the 3D-Demo ist #4. Drawing speed could be increased by almost 47% with this one. Drawing overhead for this object is higher compared to #1 and #5, hence the speed increase is not as much. However, the optimisation of the undraw routine shows the biggest effect on this 3D-object.

Fast line drawing is based on the algorithms published by Andy McFadden

This example shows that not only the implementation of wicked math routines yield a noticeable speed increase. Instead a thoughtful analysis of the overall algorithm structure might uncover potential for massive improvements.