2007-12-04

Filled Polygons

I became distracted by a question on the PALib forum this evening, so haven’t fixed any of the Woopsi bugs I’d intended to. Interesting problem - is it possible to draw a 100x100 rectangle rotating through 360 degrees, using nothing but PALib’s line routines and a floodfill algorithm, at 60fps?

Short answer: No. Long answer: Well, I couldn’t get it working fast enough.

This was an interesting query for a few reasons. I’m thinking of writing a vector-based game in a Woopsi window when that’s finished. Also, Woopsi may (or may not) get a floodfill routine at some point, so it was good to see how that should work.

I tried two floodfill algorithms, both from this page:

http://student.kuleuven.be/~m0216922/CG/floodfill.html

The first one I tried was the 4-way recursive routine which, as the site suggests, is a bit useless - it overflowed the stack. The second one I tried was the scanline-with-stack routine, which did the job. I optimised it by replacing its single pixel plotting method with a DMA accelerated line routine (which meant switching it from scanning through columns of pixels to rows) and switching to bitshifts where possible. There’s a “QuickFill” method on the internets somewhere that would probably do it a bit faster.
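
To give an idea of its shape, here’s a stripped-down, row-based version of the scanline-with-stack approach (an illustrative sketch only, not the code from the download below - the helper names are made up, and the real version swaps the per-pixel plotting for the DMA line routine):

#include <nds.h>    // or PA9.h if you're using PALib; provides u16, s16, SCREEN_WIDTH, etc.

#define STACK_SIZE 4096

typedef struct { s16 x; s16 y; } Seed;

static Seed seedStack[STACK_SIZE];
static int seedCount = 0;

static int pushSeed(s16 x, s16 y) {
    if (seedCount == STACK_SIZE) return 0;   // stack full - give up on this seed
    seedStack[seedCount].x = x;
    seedStack[seedCount].y = y;
    seedCount++;
    return 1;
}

static int popSeed(s16* x, s16* y) {
    if (seedCount == 0) return 0;
    seedCount--;
    *x = seedStack[seedCount].x;
    *y = seedStack[seedCount].y;
    return 1;
}

// Fill the region of same-coloured pixels containing (x, y) with newColour.
// 'fb' is a 16-bit framebuffer, SCREEN_WIDTH x SCREEN_HEIGHT pixels.
void floodFill(u16* fb, s16 x, s16 y, u16 newColour) {
    u16 oldColour = fb[y * SCREEN_WIDTH + x];
    s16 x1;
    int spanAbove, spanBelow;

    if (oldColour == newColour) return;

    seedCount = 0;
    pushSeed(x, y);

    while (popSeed(&x, &y)) {

        // Walk left to the start of this run of oldColour pixels
        x1 = x;
        while (x1 >= 0 && fb[y * SCREEN_WIDTH + x1] == oldColour) x1--;
        x1++;

        spanAbove = spanBelow = 0;

        // Fill the run, pushing seeds for the rows above and below as we go
        while (x1 < SCREEN_WIDTH && fb[y * SCREEN_WIDTH + x1] == oldColour) {
            fb[y * SCREEN_WIDTH + x1] = newColour;   // the real code batches this into a DMA line draw

            if (!spanAbove && y > 0 && fb[(y - 1) * SCREEN_WIDTH + x1] == oldColour) {
                pushSeed(x1, y - 1);
                spanAbove = 1;
            } else if (spanAbove && y > 0 && fb[(y - 1) * SCREEN_WIDTH + x1] != oldColour) {
                spanAbove = 0;
            }

            if (!spanBelow && y < SCREEN_HEIGHT - 1 && fb[(y + 1) * SCREEN_WIDTH + x1] == oldColour) {
                pushSeed(x1, y + 1);
                spanBelow = 1;
            } else if (spanBelow && y < SCREEN_HEIGHT - 1 && fb[(y + 1) * SCREEN_WIDTH + x1] != oldColour) {
                spanBelow = 0;
            }

            x1++;
        }
    }
}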

The (bog-standard C) code below includes useful functions such as stack management routines, polygon rotation and drawing routines, and the accelerated line drawing routine. This might turn out to be useful for Woopsi - I’ve been meaning to write this for ages, and it does produce a speed increase over the current routine. Instead of plotting each pixel in the line one by one, it plots the first 8th of the pixels required, then copies those to the location of the next 8th. It then copies all of the pixels drawn so far to the location of the next quarter, and so on (you can see the pattern emerging - each copy is double the size of the last). A 16-pixel wide line will thus take 5 draw operations (2 plots and 3 copies) instead of 16. Any pixels left to be drawn after the copies have completed are plotted using the standard method.
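
Roughly, the doubling trick looks like this (again just a sketch rather than the exact routine in the download - libnds’s dmaCopyHalfWords(), or whatever DMA copy you prefer, stands in for the copy step, and it assumes a 16-bit framebuffer):

// Draw a horizontal run of 'length' 16-bit pixels at 'dest' in 'colour',
// using block copies that double in size instead of plotting every pixel.
void fastHorizLine(u16* dest, u16 colour, u16 length) {
    u16 drawn;                  // pixels drawn so far
    u16 seed = length >> 3;     // first eighth, plotted the slow way
    u16 i;

    if (length == 0) return;
    if (seed == 0) seed = 1;

    // Plot the seed run pixel by pixel
    for (i = 0; i < seed; i++) dest[i] = colour;
    drawn = seed;

    // Each copy doubles the number of pixels drawn: 1/8 -> 1/4 -> 1/2 -> all
    while ((drawn << 1) <= length) {
        dmaCopyHalfWords(3, dest, dest + drawn, drawn << 1);   // size is in bytes: drawn pixels * 2
        drawn <<= 1;
    }

    // Mop up any leftover pixels at the end the standard way
    for (i = drawn; i < length; i++) dest[i] = colour;
}

For very short lines the DMA setup overhead probably outweighs the saving, so below some threshold it may be quicker just to plot the pixels directly.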

Rotating Rectangle DS Code

Comments

Jeff on 2007-12-05 at 01:36 said:

One thing I’ve wondered about the DMA copy routines in the NDS is that there seem to be FOUR DMA channels (http://nocash.emubase.de/gbatek.htm#dsdmatransfers), yet no one uses anything but the first.

Whilst it’s a little harder to coordinate them, and perhaps they might step on each other’s feet, you should be able to get at least twice the performance by alternating between them?

Off the cuff, of course, because I’ve never tried using them.

As to the appropriateness of a floodfill when you know you are rotating a square, it seems somewhat foolish to force the use of a general algorithm when you can compute a lot more about the rectangle’s coordinates (ie, you know there is only one start/stop point per scan line, you can use Bresenham to compute the start point, you can derive the end point from the start point if it’s rotating around its centre, etc)

ant on 2007-12-05 at 09:39 said:

I considered trying a more specific polygon filling routine, but decided against it in the end - I need a generic floodfill anyway, so it was a good opportunity to get that working. Also, I didn’t know the ultimate purpose of the program, so making it too specific didn’t really seem worthwhile.

There are a couple of other specific optimisations that would make a difference. First of all, there’s a significant portion of the rectangle in the centre that does not change, and so does not need redrawing. Since we’d no longer be wiping the whole screen on each VBL, we could write a clearing routine that just erased the portions of the screen that we need to refresh.

I didn’t know there was more than one DMA channel. I’ll have to look into that! From what I remember of PALib’s code, using a different channel should be just a matter of changing the memory address that the DMA_Copy routine pokes its DMA command into.

Jeff on 2007-12-05 at 09:44 said:

I knew it was there somewhere. From:


#define DMA_Force(ulVal, dest, count, mode) {          \
   REG_DMA3SRC = (u32)&ulVal;                          \
   REG_DMA3DST = (u32)dest;                            \
   REG_DMA3CNT = (count) | (mode) | DMA_SRC_FIX;       \
}

It’s like DMA_Copy but the source pointer does not move along, so it copies the same 16/32-bit value into the entire destination buffer. Obviously, you can’t use it with a constant, but it’s probably faster for setting your lines than subdividing.

ie,


   color=PA_RGB(...);
   DMA_Force(color, buffer, length, DMA_16NOW);

Jeff on 2007-12-05 at 09:46 said:

The only issue with the DMA channels is that one takes priority over the others, and I’m not sure whether starting the first one stops the CPU - it seems like it might, unless you use the special ‘vbl/sound’ modifiers which may well defeat the purpose of the exercise.

ant on 2007-12-05 at 10:00 said:

DMA_Force! Brilliant. That’ll come in amazingly handy.

I’ve been looking into polygon filling routines, and I think I might have to implement a few of those for Woopsi (filled irregular polygons, filled circles and ellipses, etc). Floodfill is handy for other things, but when you’ve got a polygon and know its edges already, you’re right - floodfill isn’t really appropriate.

Jeff on 2007-12-05 at 11:53 said:

I noticed your circle routine over on palib.info/forum and it seemed to me that it would be trivial to convert to a filled circle - just draw lines between the x values for each value of y, rather than plotting individual pixels for the extremities.
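
Off the top of my head, something like this (untested; drawHLine(x1, x2, y, colour) is a stand-in for whatever horizontal line routine you already have):

// Filled circle via horizontal spans: walk one octant with the midpoint
// algorithm and draw the four mirrored spans for each step.
void drawFilledCircle(s16 x0, s16 y0, s16 radius, u16 colour) {
    s16 x = 0;
    s16 y = radius;
    s16 d = 1 - radius;

    while (x <= y) {
        drawHLine(x0 - x, x0 + x, y0 + y, colour);
        drawHLine(x0 - x, x0 + x, y0 - y, colour);
        drawHLine(x0 - y, x0 + y, y0 + x, colour);
        drawHLine(x0 - y, x0 + y, y0 - x, colour);

        if (d < 0) {
            d += (x << 1) + 3;
        } else {
            d += ((x - y) << 1) + 5;
            y--;
        }
        x++;
    }
}

The spans overlap a little near the diagonals, but for a fill that doesn’t matter.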

ant on 2007-12-05 at 12:07 said:

It should be, yep. That seems to be the usual way to do things, from what I’m reading. Incidentally, it’s not really my circle routine - I borrowed it from elsewhere and added in the PALib-specific stuff.

This filled poly routine looks interesting - I might see if I can adapt it later:

http://www.ruthless.zathras.de/facts/apps/polygonesia/scanline-polygonfill.php

Jeff on 2007-12-05 at 22:12 said:

That looks good for convex polygons but concave ones are a different problem because you can have multiple leading edges on any Y value, which makes the reliance on Bresenham problematic.

Using their picture nomenclature, imagine what happens if there is an additional vertex ABOVE and to the right of B1 before then coming down to B2.

Filling concave polygons is hard - most naive solutions cop out and try to convert them into multiple convex ones, which can cause seams unless you are careful that your Bresenham is consistent.

And, of course, splitting a concave polygon into convex ones is a major problem in itself; that’s why people fall back on ‘flood-fill’, which just searches for edges.

ant on 2007-12-06 at 00:56 said:

Put some more thought into filled polygons today, and I think the floodfill will only work with the SuperBitmap. Problem with the Graphics port is that it’s all clipped, so if I draw an irregular polygon and try to fill it, and half of that polygon is obscured by another window, I’m screwed - I won’t know where to start filling from, or when to stop. Even worse, if I’ve got multiple portions of the polygon visible, but each portion is detached from the others by a higher gadget, I’ll have to try and fill each portion individually. The whole thing would get very complicated very quickly.

With the SuperBitmap, as there’s no clipping, it’s not a problem. One way around it would be to draw everything to a SuperBitmap and then copy the filled polygon back to the display, only copying pixels that match the fill colour (clipping is already written for this). That means creating a RAM buffer for every polygon, though, plus two complete sets of draws (once to the buffer, once to the screen) and a complete read (to identify which pixels should be drawn).
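
The copy-back step itself is simple enough - something like this, ignoring the clipping (which is the bit that’s already written):

// Sketch: copy only the pixels in the off-screen buffer that match the
// fill colour to the screen, leaving everything else untouched.
void copyFilledPixels(const u16* buffer, u16* screen, u16 fillColour) {
    int i;
    for (i = 0; i < SCREEN_WIDTH * SCREEN_HEIGHT; i++) {
        if (buffer[i] == fillColour) screen[i] = fillColour;
    }
}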

Simple shapes should be OK - rects are done, circles shouldn’t be difficult, and ellipses should be similar. Irregular polygons - hmmm, needs more thought!

cearn on 2007-12-20 at 20:37 said:

Just thought you’d like to know that there are a couple of ways to greatly speed up your floodfill algorithm:

1) Use ints for local variables and parameters. The CPU registers of ARM cores are 32 bits wide (corresponding to ints or longs), and that is consequently the data size they work best with. You could almost say that it’s the only size they can work with. For chars and shorts, one or two extra instructions have to be added to keep the numbers 8-bit or 16-bit. For example, something like:

s16 i, sum=0;
for(i=0; i < count; i++)
    sum += array[i];

will be compiled as

s16 i, sum=0;
for(i=0; i < count; )
{
    sum += array[i];
    sum <<= 16;
    sum >>= 16;
    i++;
    i <<= 16;
    i >>= 16;
}

The 4 extra shifts are completely unnecessary and can be avoided by using `int’ as your datatype. This usually saves you 10 to 20% in processing time -- in some cases this goes up to 50%. Use the smaller types only when you really have to.

2) Use temporary variables for globals and pointer derefs. Global variables and data that pointers point to can be changed by every function. When calling a function, the compiler can’t know whether this function will alter anything, and will have to reload any globals and pointer-data just to be safe; even when it’s not strictly necessary. This extra work can be avoided by loading them into local variables beforehand. An added benefit of this is that the local names can generally be shorter and more consistent than globals, which can improve readability and maintainability.
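
For example (the names here are made up, just to show the shape of it):

// 'screen' is a global; because plot_seed() might change it, the
// compiler reloads it on every iteration:
for(x = x1; x <= x2; x++)
{
    screen[y * SCREEN_WIDTH + x] = colour;
    plot_seed(x, y + 1);
}

// Caching the pointer (and the row offset) in locals beforehand lets
// it stay in a register across the loop:
u16 *row = &screen[y * SCREEN_WIDTH];
for(x = x1; x <= x2; x++)
{
    row[x] = colour;
    plot_seed(x, y + 1);
}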

3) Order expressions in compound conditionals for early escaping. When you have something like `a && b && c’, later expressions can be skipped if the result has already been determined by earlier ones. In the case of repeated ANDs, you can speed things up by putting the expression most likely to fail first. For the floodfill this means putting the rectangle tests after the color tests.

These three plus two smaller optimizations give a speed-up of roughly 70%.

Some source code: http://www.coranac.com/files/misc/floodtestDS.zip (based on your rotating square code)

Additional references:
http://www.arm.com/pdfs/DAI0034A_efficient_c.pdf (PDF)
http://www.azillionmonkeys.com/qed/optimize.html
http://www.coranac.com/2007/12/11/floodfill/
http://www.coranac.com/tonc/text/first.htm#ssec-bad-example
http://leto.net/docs/C-optimization.php

ant on 2007-12-20 at 23:53 said:

All useful stuff!

Regarding the first point - nuts, if I’d known this before I’d probably have used more u32s. Back when I wrote a Java raycaster I used 32-bit ints for everything, then decided I’d be clever and work out the data range of each variable so I could swap it for the smallest type that would hold it. I later discovered (in exactly the same way as here) that Java converts all variables smaller than 32 bits into 32-bit values before processing them anyway, and I’d wasted all of that time and actually made the raycaster slower.

In Woopsi’s case, I think I’ll leave everything as it is for the time being. I can see memory being more of an issue than performance with DS coding in general.

Very handy tips, though - I’ll keep these in mind.

Jeff on 2007-12-21 at 00:23 said:

When optimising like this, it’s also worth ensuring that you actually let the compiler have a go. Many of these sorts of improvements, gcc will make if you pass it the -O flag.

The standard PALib makefile (which Woopsi’s is derived from, as I recall) does not enable the optimiser.

cearn on 2007-12-21 at 22:42 said:

Jeff wrote: "Many of these sorts of improvements, gcc will make if you pass it the -O flag." In principle, yes, you should compile with -O (or -O2 or -O3) before you start really optimizing, but there are also many cases in which it won't help because the compiler can't cut corners the way humans can. When you ask for a u16 variable, the compiler has to assume that you want a u16, even if that causes slower code.

That said, more recent versions of GCC do try to eliminate the extra shifts. If in my example 'count' were a small constant, the i++ would be done without the two casting shifts. (I now notice that the comments ate the shift symbols; it should have said `i++; i <<= 16; i >>= 16;' (and now hope these aren't swallowed either :P).)

Likewise, re-ordering expressions in conditions and preloading globals/pointers cannot always be done by the compiler because it doesn't know the project as well as you do. It has to play it safe.

If you want to see what kinds of optimizations (and mis-optimizations; there are a few of those as well) the compiler is capable of, add `-save-temps' to the compiler flags. This will put the generated assembly in the build directory. Of course, you'll need to be able to read ARM assembly to make sense of it :). Some of the optimizations that the compiler will try to do: replacing multiplication and division by shifts and doing loop-invariant stuff outside the loop. Wikipedia also has a full list if you're interested.

Jeff wrote: "The standard PALib makefile (which Woopsi’s is derived from, as I recall) does not enable the optimiser." Are you sure? I have a PALib installation here and it seems to use -O2.

Jeff on 2007-12-22 at 07:00 said:

I’ve definitely seen cases where GCC was smart enough to recognise that ‘you only want a u16’ and it arranged to use versions of op-codes that only operated on the bottom 16 bits of the register - ie, it really didn’t need to worry about what random bits were in the top 16 bits. However, that’s pretty sophisticated and I’m not sure the arm compiler is that smart.

I’ll have to go back and check as to whether -O2 is being passed, but I do not recall seeing it. Having said that, it’s possible that it’s there. And I’m currently away from my sources.

cearn on 2007-12-22 at 10:55 said:

“it arranged to use versions of op-codes that only operated on the bottom 16 bits of the register - ie, it really didn’t need to worry about what random bits were in the top 16 bits”

Aside from the memory load/store instructions, op-codes that use only part of the registers don’t exist on the ARM processors present in the DS. All arithmetic uses the full 32 bits. Well alright, there are the SMULxy instructions, but that’s it.

Jeff on 2007-12-22 at 22:09 said:

Yes, but if you look at the actual maths involved, a 32-bit addition is exactly the same as a 16-bit addition if you are not interested in the overflow bits, etc. ie,

[(foo)<<16 + 27] + [(bar)<<16 + 33] = [(bletch)<<16 + 60]

irrespective of the values of foo, bar and bletch, and the compiler writer can exploit that fact. You can use 16-bit loads and stores on that register quite safely, provided you don’t need to check the status bits - which quite a lot of code doesn’t - and the compiler is free to take advantage of that.

This sort of thing depends on you using the correct data types because unsigned arithmetic and signed arithmetic will affect what operations will work, and mixing and matching types tends to kill it.

Note, I’m not saying that the current arm gcc does this all the time. I’m saying that I’ve seen some variant of gcc doing it in the past, and I don’t know whether that intelligence has been propagated into the arm version.

Jeff on 2007-12-22 at 22:51 said:

I stand corrected on -O2.

It is being set in my Makefiles - I have no idea why I missed seeing this over and over…

ant.simianzombie.com » Blog Archive » Iterators and Bugfixes on 2008-12-02 at 12:29 said:

[...] meaning to do for a while is replace any non-int iterator variables with ints. This was mentioned nearly a year ago by a chap called cearn, whose website I’ve been browsing through lately. In addition to his [...]
