Occlusion culling seems to be the obvious feature every mature engine should have these days. Some use HW occlusion queries but they introduce another sync point with GPU, others prefer using software occlusion. That is one can rasterise depth of nearby occluders ( special geomery or maybe even the one been rendered ) and check bound boxes of objects to be drawn againt this depth buffer. Its not necessary to use full resolution buffer. Memory/performance could be saved by using smaller one. There’s basically two approaches to rasterise triangles - scanlines and halfspaces. In scanline approach we draw horizontal lines between pair of triangles edges. When using halfspaces we build bound box of projected triangle and iterate its pixels checking if current one is inside triangle or not. But since all we care about is determining whether given triangle is visible or not some tricks can be used.

On a local forum (yes, its time to learn russian), infamous comrade IronPeter suggested a smart approach to do the occlusion culling. The idea is very simple - we have a queue of triangles. Triangles can be of two types - occluder-triangles which write occlusion info and occludee-triangles which only test it. Occlusion info is written in tiled bit-buffer. This allows us to write 128 pixels at a time at most using some SIMD extension like SSE. Rasterisation procedure works like in convinient scanline rasterisers except for we only need to know start and end coordinate of each scanline clipped by current tile bounds. With this assumptions in mind all is left is to sort triangle queues front-to-back using minimum z value for occludee triangles and maximum one for occluder triangles and then rasterise those triangles in order.

- Transform triangles to viewport space
- Determine tile bounds
- Push triangles into tile triangle queues
- Sort front-to-back each triangle queue of each tile ( using zmin for test triangles and zmax for occluders )
- Draw sorted triangle queues ( for test triangles check if any pixel of scanline is visible for occluders just OR scanline bits with tile buffer )

The algorithm is pretty simple and straightforward to implement and besides it can be easily threaded. My initial implementation can draw a scene with actual render geometry with more than 300k triangles in 20msec ( 12msec transform + 8msec rasterisation) in 720p. Lower resolution can be used to futher speedup processing. There’s a number improvements to be made. In my expectations SPU and VMX versions should be faster. After porting to other platforms and adding multithreading support i’m planning to do some reseach on automatic occluder and test geometry generation which seems to be pretty interesting topic as well. And i’m gonna share source code when there’ll be enough functionality to be usable in other engines.

P.S. Many thanks to IronPeter for neat idea and some insights on implementation.