Rectangles Redux

There’s a bug in yesterday’s rectangle code. In certain situations it copies too many pixels on the right of the rectangle, so you end up with a pixel-wide strip of extra pixels tagged onto the side of the rectangle. Fixing it was more complicated than I’d expected. We can only work with even values (an even x co-ordinate and an even width), so we have to find the most efficient way to fix any odd values and remember that we’ve fixed them for later filling in with the standard pixel writer function.

Here’s the fixed code:

void drawFilledRect(s16 x, s16 y, u16 width, u16 height, u8 colour) {

    // Draw top line
    for (u16 i = x; i < x + width; i++) {
        PA_Put8bitPixel(0, i, y, colour);

    // Calculate DMA copy values - we need to deal with even values only
    u16 dmaX = x & 1 ? x + 1 : x;
    s32 dmaWidth = width;
    bool drawRight = false;

    // Did we convert to an even x value?
    if (dmaX != x) {
        // Reduce the width to compensate
        dmaWidth -= 1;

        // Remember that we need to draw an extra pixel
        drawRight = true;

    // Pre-emptive bitshift to save calculations
    dmaX = dmaX >> 1;
    dmaWidth = (dmaWidth & 1 ? dmaWidth - 1 : dmaWidth) >> 1;

    // Perform DMA copy
    if (dmaWidth > 0) {
        for (u16 i = y + 1; i < y + height; i++) {
            DMA_Copy((PA_DrawBg[0] + (y << 7) + dmaX), (PA_DrawBg[0] + (i << 7) + dmaX), dmaWidth, DMA_16NOW);

    // Fill in any gaps
    if (x & 1) {
        // Draw left-hand side
        for (u16 i = y; i < y + height; i++) {
            PA_Put8bitPixel(0, x, i, colour);

    if ((width & 1) || (drawRight)) {
        // Draw right-hand side
        for (u16 i = y; i < y + height; i++) {
            PA_Put8bitPixel(0, x + width - 1, i, colour);

It occurred to me that I could duplicate this method in order to draw lightning-fast horizontal lines, which is essential for a rectangle-based GUI system. The basic idea was to draw half of the line normally, then save a significant number of memory writes by using the DMA hardware to copy that half into the memory address of the second half. It would halve the amount of work needed for each line.

After a bit of programming, it became apparent that the sheer amount of code needed to calculate the exact region of memory to copy and the number of pixels that had to be filled in normally later meant that the actual speed increase wouldn’t be significant enough to justify the time it was taking to write. This method would work brilliantly with 16-bit displays, but the complexity of 8-bit data only being copyable in 16-bit chunks makes it far too troublesome for 8-bit displays.

Instead of this, I removed the PALib line routines and wrote simple pixel plotting routines for horizontal/vertical line drawing. Should be a bit faster. I also tidied up the window border drawing code, which was drawing longer lines than it really needed to (the overspill was hidden by the background fill, which gets drawn after the surrounding edges).

Related to line drawing, I’ve fixed the missing corners of the XOR rectangle. This had me puzzled for a while - the lines are all the right length, so why wasn’t the corner being drawn? Simple - the corners were being drawn by both XOR lines that intersect to make the corner, which meant the corner was getting XORed twice. It was getting set back to normal.

The three new features added today are multiple screen support (Amiga-style screens, that is, not the physical DS screens, which wouldn’t make much sense), a screen depth gadget (so that screens can be swapped) and initial work on a glyph set (so that the depth gadget has an image on it). Multiple screen support pretty much wrote itself, so that wasn’t a problem. The depth gadget inherits from the button class, so that more or less wrote itself too. Glyphs are handled simply by using the existing font system (think Wingdings), so no work required there either. Object-orientation is a wonderful thing.

The latest demo is here:

Woopsi V0.05 Demo


Jeff on 2007-09-21 at 04:52 said:

If you are going to do this sort of optimising, its probably best to look at the whole thing. For example, your computation:

u16 dmaX = x & 1 ? x + 1 : x;

with a conditional could just be

u16 dmaX = (x+1) & ~1;

which should optimise down a great deal more since it involves no conditionals any more.

Similiarly, your loops evaluate their conditions with arithmetic every time…

   for (u16 i = y + 1; i &lt; y + height; i++) {

vs lasty = y+height; for (u16 i = y+1; i<lasty; i++)

and you should be able to get the gaps into the same loop.

             for (u16 i = y; i &lt; y + height; i++) {
                    PA_Put8bitPixel(0, x + width - 1, i, colour);

relies on the optimiser realising that x+width-1 is invariant for the loop. Never a good idea when you can code it yourself for clarity

Similiarly in the loop that does:

DMA_Copy((PA_DrawBg[0] + (y << 7) + dmaX), (PA_DrawBg[0] + (i << 7) + dmaX), dmaWidth, DMA_16NOW);

PA_DrawBg[0]+(y<<7)+dMax is invariant, but PA_DrawBg[0] is volatile, isn’t it so the compiler won’t optimise that?

And rather than doing i<<7, why not restructure the loop something like:

line0 = PA_DrawBg[0]+dmaX+(y&lt;&lt;7);
linei  = from + (1&lt;&lt;7);
for (i=1; i&lt;height;i++) {
    linei += 1&lt;&lt;7;

typed directly into the web, not syntax checked or whatever but it looks to me a lot easier to read.

In a previous life, I had someone ask me to write a fourier transform in 6502 assembler because it was way to slow in C - exactly the same sort of loop unrolling took it to the point where it was literally too fast in C - other parts of his code starting being a bottleneck.

As to a comment you’ve given elsewhere about the arguments to DMA_Copy() needing to be in (parenthesis), that screams that its implemented as a #define which means its also a candidate for unrolling into your routine.

Good luck with it, I’ll be watching your progress being a beginner in all this NDS stuff myself.

ant on 2007-09-21 at 13:02 said:

Wonderful stuff. I’m glad I took the subscribers-only comment restriction off, as this comment on its own is worth all the comment spam I get.

I’ve implemented all of the changes you suggested, and it does indeed seem to go even faster than it was before. This line in particular is a stroke of genius:

u16 dmaX = (x+1) & ~1;

I’ll put the improved routine up in the next post.