On a local forum (yes, it's time to learn Russian), the infamous comrade IronPeter suggested a smart approach to occlusion culling. The idea is very simple - we have a queue of triangles. Triangles can be of two types: occluder triangles, which write occlusion info, and occludee triangles, which only test against it. Occlusion info is written to a tiled bit-buffer. This allows us to write up to 128 pixels at a time using a SIMD extension like SSE. Rasterisation works like in a conventional scanline rasteriser, except that we only need to know the start and end coordinates of each scanline, clipped to the current tile bounds. With these assumptions in mind, all that is left is to sort the triangle queues front-to-back, using the minimum z value for occludee triangles and the maximum one for occluder triangles, and then rasterise the triangles in that order.
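
To make the bit-buffer idea a bit more concrete, here is a minimal sketch of one 128-pixel tile row, assuming one bit per pixel. It is not the actual implementation: `Row128`, `spanMask`, `writeSpan` and `spanVisible` are names made up for illustration, and two 64-bit words stand in for a single SSE register to keep the example portable.

```cpp
#include <algorithm>
#include <cstdint>

// One row of a tile: 128 pixels, one bit per pixel (assumed layout).
// A real SIMD version would hold this in a single 128-bit register.
struct Row128 {
    uint64_t lo = 0;  // pixels 0..63
    uint64_t hi = 0;  // pixels 64..127
};

// Bits [x0, x1) set inside one 64-bit word, with 0 <= x0, x1 <= 64.
static uint64_t spanBits64(int x0, int x1) {
    if (x0 >= x1) return 0;
    uint64_t upper = (x1 >= 64) ? ~uint64_t(0) : ((uint64_t(1) << x1) - 1);
    uint64_t lower = (uint64_t(1) << x0) - 1;  // x0 < 64 here
    return upper & ~lower;
}

// Mask for the scanline span [x0, x1), clipped to the 128-pixel row.
Row128 spanMask(int x0, int x1) {
    x0 = std::max(0, std::min(x0, 128));
    x1 = std::max(0, std::min(x1, 128));
    Row128 m;
    m.lo = spanBits64(std::min(x0, 64), std::min(x1, 64));
    m.hi = spanBits64(std::max(x0, 64) - 64, std::max(x1, 64) - 64);
    return m;
}

// Occluder: mark the covered pixels as occluded.
void writeSpan(Row128& row, int x0, int x1) {
    Row128 m = spanMask(x0, x1);
    row.lo |= m.lo;
    row.hi |= m.hi;
}

// Occludee: true if at least one pixel of the span is not yet occluded.
bool spanVisible(const Row128& row, int x0, int x1) {
    Row128 m = spanMask(x0, x1);
    return ((m.lo & ~row.lo) | (m.hi & ~row.hi)) != 0;
}
```

Since only the start and end of each scanline are needed, both the write and the test boil down to a couple of bitwise operations per row.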

- Transform triangles to viewport space
- Determine tile bounds
- Push triangles into tile triangle queues
- Sort each tile's triangle queue front-to-back ( using zmin for test triangles and zmax for occluders )
- Draw the sorted triangle queues ( for test triangles, check whether any pixel of the scanline is visible; for occluders, just OR the scanline bits into the tile buffer )
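
The last two steps can be sketched roughly like this. It is a simplified illustration under a few assumptions, not the real code: the tile is 64 pixels wide so one `uint64_t` holds a row, each queued triangle already carries its scanline spans clipped to the tile (a real rasteriser would generate them on the fly), and all names (`QueuedTri`, `Span`, `Tile`, `drawTile`) are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Span { int y, x0, x1; };  // one scanline, pixels [x0, x1), tile-local

struct QueuedTri {
    int id;
    bool occluder;            // true: writes occlusion, false: only tests
    float zmin, zmax;         // depth range in viewport space
    std::vector<Span> spans;  // assumed precomputed and clipped to the tile
};

struct Tile {
    static constexpr int kSize = 64;  // assumed 64x64 tile, 1 bit per pixel
    uint64_t rows[kSize] = {};
};

// Occluders are keyed by their farthest depth, test triangles by their
// nearest one, so the result stays conservative.
float sortKey(const QueuedTri& t) { return t.occluder ? t.zmax : t.zmin; }

// Bits [x0, x1) of a 64-bit row, clipped to the tile width.
uint64_t spanBits(int x0, int x1) {
    x0 = std::max(x0, 0);
    x1 = std::min(x1, 64);
    if (x0 >= x1) return 0;
    uint64_t upper = (x1 == 64) ? ~uint64_t(0) : ((uint64_t(1) << x1) - 1);
    return upper & ~((uint64_t(1) << x0) - 1);
}

// Sorts the queue front-to-back, draws it, and returns the ids of the
// test triangles that have at least one visible pixel in this tile.
std::vector<int> drawTile(Tile& tile, std::vector<QueuedTri>& queue) {
    std::stable_sort(queue.begin(), queue.end(),
                     [](const QueuedTri& a, const QueuedTri& b) {
                         return sortKey(a) < sortKey(b);
                     });
    std::vector<int> visible;
    for (const QueuedTri& t : queue) {
        bool seen = false;
        for (const Span& s : t.spans) {
            if (s.y < 0 || s.y >= Tile::kSize) continue;
            uint64_t m = spanBits(s.x0, s.x1);
            if (t.occluder) {
                tile.rows[s.y] |= m;            // write occlusion
            } else if (m & ~tile.rows[s.y]) {   // any unoccluded pixel?
                seen = true;
                break;
            }
        }
        if (seen) visible.push_back(t.id);
    }
    return visible;
}
```

The choice of keys is what makes the ordering safe: when a test triangle is drawn, every occluder already in the buffer has its farthest point no farther than the test triangle's nearest point, so a fully covered test triangle really is hidden.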

The algorithm is pretty simple and straightforward to implement, and besides, it can be easily threaded. My initial implementation can draw a scene of actual render geometry with more than 300k triangles in 20 ms ( 12 ms transform + 8 ms rasterisation ) at 720p. A lower resolution can be used to further speed up processing. There are a number of improvements to be made. I expect the SPU and VMX versions to be faster. After porting to other platforms and adding multithreading support, I'm planning to do some research on automatic occluder and test geometry generation, which seems to be a pretty interesting topic as well. And I'm gonna share the source code when there's enough functionality for it to be usable in other engines.

P.S. Many thanks to IronPeter for the neat idea and some insights on the implementation.