
Pathtracing Coherency
Anders Lindqvist (breakin)

This is not a blog post; it will be improved over time as I learn more about the subject. Feel free to ask questions or suggest additions! So far it has received one major addition, on 2018-08-30.

The problem I am looking at here is how to deal with the inconvenient problem of reduced performance when shooting incoherent rays. This problem presents itself in different ways depending on what you are trying to do. My focus is GPUs, but I will touch on CPU rendering a bit as well.

  

Pathtracing - the shortest overview there ever was

In pathtracing we do the following:

  1. Shoot a ray from the camera into the scene and find the closest intersection.
  2. At the hit point, shoot shadow rays towards light sources to gather direct light.
  3. Pick a random direction (guided by the BRDF) and shoot a secondary ray to continue the path.
  4. Repeat until the path terminates, and average many such paths per pixel.

I am not trying to teach you pathtracing here today, but I want to give you the idea that there are a lot of rays involved, going in all sorts of directions. When shooting secondary rays, or shooting rays towards area light sources, we use random numbers to get an unbiased result. This is part of the Monte Carlo framework used in pathtracing. I've written a tutorial on the 1D case if you want to learn more.
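
To make the “a lot of rays in all sorts of directions” point concrete, here is a minimal sketch of the loop behind a single path. Every type and helper in it (trace, occluded, sampleAreaLight, sampleHemisphereCosine, environment) is a hypothetical stand-in for whatever your renderer provides; it only illustrates the shape of the loop.

```cpp
// Sketch only: all helpers below are assumed to exist in your renderer.
struct Vec3 { float x, y, z; };
struct Ray  { Vec3 origin, dir; };
struct Hit  { Vec3 position, normal, albedo; };
struct LightSample { Vec3 position, radiance; float weight; };

Vec3 operator*(Vec3 a, Vec3 b)  { return {a.x * b.x, a.y * b.y, a.z * b.z}; }
Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
Vec3 operator+(Vec3 a, Vec3 b)  { return {a.x + b.x, a.y + b.y, a.z + b.z}; }

bool trace(const Ray& ray, Hit& hit);                          // closest-hit query (assumed)
bool occluded(Vec3 from, Vec3 to);                             // any hit is enough (assumed)
Vec3 sampleHemisphereCosine(Vec3 normal, unsigned& rngState);  // random direction (assumed)
LightSample sampleAreaLight(Vec3 from, unsigned& rngState);    // random point on a light (assumed)
Vec3 environment(Vec3 dir);                                    // sky lookup (assumed)

Vec3 tracePath(Ray ray, unsigned& rngState, int maxBounces) {
    Vec3 radiance   = {0, 0, 0};
    Vec3 throughput = {1, 1, 1};
    for (int bounce = 0; bounce < maxBounces; ++bounce) {
        Hit hit;
        if (!trace(ray, hit)) {
            // The path escaped the scene: pick up the environment and stop.
            radiance = radiance + throughput * environment(ray.dir);
            break;
        }
        // Shadow ray towards a randomly chosen point on an area light.
        LightSample light = sampleAreaLight(hit.position, rngState);
        if (!occluded(hit.position, light.position))
            radiance = radiance + throughput * hit.albedo * light.radiance * light.weight;

        // Secondary ray: random direction over the hemisphere (Monte Carlo).
        ray.origin = hit.position;
        ray.dir    = sampleHemisphereCosine(hit.normal, rngState);
        throughput = throughput * hit.albedo;  // cosine-weighted sampling cancels cos/pdf for a diffuse BRDF
    }
    return radiance;
}
```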

When shooting rays against triangles we need an acceleration structure. Right now the main choice is the bounding volume hierarchy (BVH), where all triangles are recursively partitioned into two (possibly overlapping) axis-aligned bounding boxes in a pre-process. During triangle intersection the BVH tree is traversed in some order. Once leaves are reached, their triangles are intersection tested. Once all relevant leaves have been visited we know the closest intersection (if any). For shadow rays we might only care about an arbitrary intersection, and BVH traversal can quit early.
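
Here is a sketch of what such a traversal might look like, including the shadow-ray early-out. The node layout is a simplified assumption and the triangle test (intersectTriangles) is left as an assumed helper.

```cpp
#include <algorithm>
#include <utility>

struct AABB    { float lo[3], hi[3]; };
struct BVHNode {
    AABB bounds;
    int  left, right;        // child node indices, or -1 when this is a leaf
    int  firstTri, triCount; // triangle range, only used for leaves
};

// Slab test: does the ray hit the box before tMax?
// (Assumes no exactly-zero direction components, for brevity.)
bool hitAABB(const AABB& b, const float origin[3], const float invDir[3], float tMax) {
    float tNear = 0.0f, tFar = tMax;
    for (int a = 0; a < 3; ++a) {
        float t0 = (b.lo[a] - origin[a]) * invDir[a];
        float t1 = (b.hi[a] - origin[a]) * invDir[a];
        if (t0 > t1) std::swap(t0, t1);
        tNear = std::max(tNear, t0);
        tFar  = std::min(tFar, t1);
    }
    return tNear <= tFar;
}

// Assumed leaf helper: tests the triangles and, for closest-hit queries,
// shrinks tMax and records the hit triangle in hitIndex.
bool intersectTriangles(int firstTri, int triCount, const float origin[3],
                        const float dir[3], float& tMax, int& hitIndex);

// anyHit = true for shadow rays: any intersection will do, so we can quit early.
bool traverseBVH(const BVHNode* nodes, const float origin[3], const float dir[3],
                 float& tMax, int& hitIndex, bool anyHit) {
    const float invDir[3] = { 1.0f / dir[0], 1.0f / dir[1], 1.0f / dir[2] };
    int stack[64], stackSize = 0;
    stack[stackSize++] = 0;                     // start at the root
    bool found = false;
    while (stackSize > 0) {
        const BVHNode& node = nodes[stack[--stackSize]];
        if (!hitAABB(node.bounds, origin, invDir, tMax))
            continue;                           // prune: box is missed or too far away
        if (node.left < 0) {                    // leaf: intersection test the triangles
            if (intersectTriangles(node.firstTri, node.triCount, origin, dir, tMax, hitIndex)) {
                found = true;
                if (anyHit) return true;        // shadow ray: stop at the first hit
            }
        } else {                                // inner node: visit both children
            stack[stackSize++] = node.left;
            stack[stackSize++] = node.right;
        }
    }
    return found;
}
```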

In this text we will use the term pass to refer to taking one sample (one path) for each pixel on the screen. This is sometimes called doing 1 SPP (one sample per pixel). Note that depending on how many bounces we do, a path might have a different number of rays being traced. Paths starting outdoors have a much lower number of bounces than paths starting indoors.

Just as a side note: for now I think focusing on path tracing for realtime is probably the wrong thing to do for most games. Fixing soft/textured shadows and having proper reflections/AO seems good enough. With denoising and further advances it can probably be done, especially once we have even faster GPUs and more knowledge, so feel free to research away on it!

  

Rendering - so many choices

There are many different ways to render. Each comes with a different “budget”. It is not only about how many samples we can afford per pixel and per frame, but also about how many frames we can “average” over. Once the image has had time to become “good looking” we say that it has converged. It is important to remember when reading blog posts and papers that there are different speed/quality goals!

Here are some different alternatives:

  1. Realtime rendering with no history
  2. Realtime rendering with TAA
  3. Interactive rendering (same as 2 but lower framerate tolerated)
  4. Progressive rendering (restart when camera/light/mesh change)
  5. Offline rendering (the user only sees the final image, no in-between images)

For 1 there is little we can do. If we want to do path tracing it will be very noisy and we need to use de-noising and hope for the best. I would go for good (soft?) shadows and reflections and forget about indirect illumination here. The environment integral can maybe be handled correctly, which might replace AO, but no bounces.

In 2-5 we have more than one frame to converge over. In 3-4 we must get something nice quickly and then improve it if the user doesn't change anything. In 1-3 the user expects animated meshes to work.

I expect that 1-2 will often use hybrid rendering, so that the first hit (camera to first surface) is done using rasterization. TAA then handles anti-aliasing instead of the way path tracing usually deals with it. An exception might be foveated rendering and adaptive rendering, which might be better handled using raytracing. Also, depending on budget, it might be tempting to let raytraced primary samples handle anti-aliasing.

For lightmap baking some of these modes change (an example is that camera navigation doesn't invalidate old results for progressive rendering).

  

So what is this coherency then?

In path tracing and raytracing we shoot a lot of rays. If we just shoot rays at random (not even from the camera, just randomly) they will be extremely incoherent. If instead most rays have the same starting position and go in the same direction, they are extremely coherent. But why does it matter? It is mostly about memory, but that is not the full story. Let's break it down.

  

Pathtracing on the CPU

When doing pathtracing on the CPU the frame is generally processed in tiles. The idea is that the rays shot from a tile will go roughly in the same direction and end up at roughly the same place. They will traverse the same parts of the BVH, making similar traversal choices. The same triangles will be intersected. Once at the surface they will look up the same texture with the same mipmap settings. Often textures are so large that they are handled out-of-core, with tiles being read in when needed. Rendering in tiles leads to better re-use of both CPU caches (handled by the CPU) and texture caches (handled manually by the renderer).

When the result has been written back to the framebuffer, the tile might be written to disk since the full framebuffer might not fit in memory. This is partially because there might be multiple layers: not only the finished frame but also different partial results needed for compositing.

A nice aspect of raytracing on the CPU is that the entire scene representation is read-only so that it is trivial to have multiple threads work on it at the same time.
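
Here is a minimal sketch of the tile-based dispatch, assuming some renderTile function that does the per-pixel work. The scene is shared read-only between the worker threads, which is exactly why no synchronization around it is needed.

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

struct Scene;        // read-only while rendering, so threads can share it freely
struct Framebuffer;

// Assumed per-tile worker: renders all pixels of one tile into the framebuffer.
void renderTile(const Scene& scene, Framebuffer& fb, int x0, int y0, int x1, int y1);

void renderFrame(const Scene& scene, Framebuffer& fb, int width, int height) {
    const int tileSize = 32;
    const int tilesX = (width  + tileSize - 1) / tileSize;
    const int tilesY = (height + tileSize - 1) / tileSize;
    std::atomic<int> nextTile{0};

    auto worker = [&]() {
        for (;;) {
            const int t = nextTile.fetch_add(1);     // grab the next unprocessed tile
            if (t >= tilesX * tilesY) return;
            const int x0 = (t % tilesX) * tileSize;
            const int y0 = (t / tilesX) * tileSize;
            renderTile(scene, fb, x0, y0,
                       std::min(x0 + tileSize, width),
                       std::min(y0 + tileSize, height));
        }
    };

    std::vector<std::thread> threads;
    const unsigned threadCount = std::max(1u, std::thread::hardware_concurrency());
    for (unsigned i = 0; i < threadCount; ++i)
        threads.emplace_back(worker);
    for (std::thread& t : threads) t.join();
}
```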

Summary: in the CPU case, coherency gives better cache/memory utilization.

  

Pathtracing on the GPU (without DXR)

The GPU story is similar to the CPU story (less pressure on memory systems, more cache re-usage) but there are some differences.

Let's first assume that each “thread” processes one pixel. Threads are grouped in wavefronts/warps composed of, say, 64 threads (depends on the GPU). Because of the way the GPU is designed, performance is much better if all 64 threads in a wavefront/warp make similar choices. This means taking the same number of steps in for-loops, making the same choices in if-statements, and so on.

If threads in a wavefront/warp make different choices, performance suffers a lot. The term used for this is thread divergence.

Let's look at an example of bad behavior. Each thread in a wavefront/warp starts by shooting a ray from the camera into the scene. 50% of them hit a floor, but 50% of them hit the environment. Now the 50% that hit the environment will sit idle until the other ones are done (unless we do things differently). The same thing happens when paths have a different number of rays in them (some hit the environment sooner than others).

With compute (and also with pixel shaders) there are many ways to structure things to avoid this (or to make this description not even apply), but it is an example of why coherence can be good on the GPU.
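
One common remedy is to compact the surviving paths between bounces so that live paths end up next to each other and whole wavefronts/warps stay busy. Here is a CPU-side sketch of the idea; on the GPU this would typically be a parallel prefix sum plus scatter rather than std::stable_partition.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct PathState {
    int  pixel;   // which pixel this path belongs to
    bool alive;   // false once the path hit the environment or was terminated
    // ... ray origin, direction, throughput, rng state, etc.
};

// Move the surviving paths to the front so that consecutive threads (and thus
// whole wavefronts/warps) only see live work. Returns the number of live paths.
size_t compactPaths(std::vector<PathState>& paths) {
    auto firstDead = std::stable_partition(paths.begin(), paths.end(),
                                           [](const PathState& p) { return p.alive; });
    return static_cast<size_t>(firstDead - paths.begin());
}
```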

Summary: both better cache/memory utilization and avoiding thread divergence.

  

Pathtracing on the GPU - with DXR

First: with DXR it seems that the standard way to do things is to use a BVH (bounding volume hierarchy) to accelerate finding triangles to intersect. That is what the hardware is for. I don't know much more than that. I have no idea how rays are submitted, what order they are executed in, or whether there is something like wavefronts/warps on the RT cores. The only thing I can assume is that since some rays continue and some terminate, the hardware probably handles “compaction” of rays such that paths with many ray segments will not slow down paths with few segments.

It will very likely be different for different graphics card vendors, and from some of them we might not get enough details to actually know how best to program it. Who knows.

What we do know is that paths are started by a ray generation shader. We can assume that pixels that are close to each other (not a very precise term!) will spawn intersection tests that are executed at roughly the same time. Thus spawning coherent rays might help performance, since it puts less pressure on the caches when fetching BVH data.

Beyond this it is hard to say anything. With time we will know more about how it works!

  

Examples

  

Progressive rendering with one bounce

We will start with an example with ONE secondary ray since it is already hard enough.

Let's say we have a fully diffuse scene and we want to render it progressively.

First we fire rays from the camera into the scene. If we are looking at the sky, some rays might be terminated. If we are looking at the ground, all of them will survive. The hit points are mostly coherent. By this I mean that adjacent pixels will fire similar rays that will, on average, end up on surfaces that are close to each other with similar normals. Here it might be necessary to compact the set of paths so that we don't end up with idle threads in wavefronts/warps.

For progressive rendering we want the image to look as good as possible at every sample count: after 5 samples it should look good for 5 samples, and as we approach say 10 samples it should look good for 10 samples. That way it converges quickly in the beginning and then keeps improving as we wait (unless we restart). For this to work we need random numbers that spawn directions over the hemisphere with the progressive property: the samples \([0,..,4]\) give good directions, but so should \([0,..,N]\) for any \(N\). It is quite common to tabulate these for, say, \(N=1024\) and after that just use white noise, since at that point there is little benefit to using, say, a blue-noise distribution.
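
Here is a sketch of what such a sample source could look like. I use the Halton sequence as a stand-in for the tabulated progressive sequence (the text does not specify which one), and fall back to plain white noise after the tabulated part.

```cpp
#include <cstdint>
#include <random>

// Radical inverse in a given base; two of these (bases 2 and 3) give the
// Halton sequence, which has the progressive property described above.
float radicalInverse(uint32_t i, uint32_t base) {
    float invBase = 1.0f / base, result = 0.0f, scale = invBase;
    while (i > 0) {
        result += (i % base) * scale;
        i /= base;
        scale *= invBase;
    }
    return result;
}

// 2D sample used to pick a hemisphere direction. The first `tabulated`
// samples come from a progressive low-discrepancy sequence (Halton here,
// as a stand-in for whatever you would actually tabulate); after that we
// fall back to plain white noise, where a fancier distribution buys little.
void nextSample2D(uint32_t sampleIndex, uint32_t tabulated,
                  std::mt19937& rng, float& u, float& v) {
    if (sampleIndex < tabulated) {               // e.g. tabulated = 1024
        u = radicalInverse(sampleIndex, 2);
        v = radicalInverse(sampleIndex, 3);
    } else {
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);
        u = dist(rng);
        v = dist(rng);
    }
}
```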

Now we are at the surface visible from the camera and we want to choose a secondary direction (guided by the diffuse BRDF). The issue is that if adjacent pixels use the same random directions there will be severe banding artifacts. This is because the errors of the integrals are highly correlated between pixels. As the number of samples in the sequence goes up this correlation will mostly disappear, but at a reasonably low sample count it looks really bad. It usually manifests itself as big constant-colored areas.

Think of it like this. If the camera sees a big plane, all the diffuse sampling will start from that plane. If all pixels use the same random sequence they will all send their first ray in the same direction (but from slightly different points). Now imagine there is an area light shining on the floor. Since all rays go in the same direction, the result of the first frame (1 SPP) will look like a rectangle on the ground. The next frame will add another such rectangle, and so on for every frame. This gives areas of constant color, which is not nice. With an infinite number of samples it would look OK in the end, but we don't have time for that today. Instead we want to trade banding for noise. Noise is something the user can accept (or it can be removed by a de-noiser). Banding is not so easy to dispense with.

An easy way to break up the banding is to make sure adjacent pixels use different random sequences. Let's say we generate 16×16 unique random sequences. We then tile them over the frame so that after 16 pixels we reuse a sequence. Sixteen pixels apart, rays that go in the same direction will probably see different things, so it can still look good. Or we try 32×32; a good value depends on content and resolution. The per-pixel randomization should probably be based on blue noise or something similar, so that pixels are different but not too different.
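
A sketch of that tiling, assuming the 256 sequences are simply differently seeded white noise (a blue-noise-aware assignment would be a refinement):

```cpp
#include <random>
#include <vector>

constexpr int kTile = 16;   // sequences repeat every 16 pixels; try 32 as well

// Pre-generate 16x16 = 256 different random sequences (here simply white
// noise with a different seed per sequence, stored as interleaved 2D samples).
std::vector<std::vector<float>> buildSequences(int samplesPerSequence) {
    std::vector<std::vector<float>> sequences(kTile * kTile);
    for (int s = 0; s < kTile * kTile; ++s) {
        std::mt19937 rng(1234 + s);                        // one seed per sequence
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);
        sequences[s].resize(samplesPerSequence * 2);
        for (float& value : sequences[s]) value = dist(rng);
    }
    return sequences;
}

// Which of the 256 sequences a pixel uses; the pattern repeats every 16 pixels.
int sequenceIndexForPixel(int x, int y) {
    return (y % kTile) * kTile + (x % kTile);
}
```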

Problem solved? Not quite. Now adjacent pixels create rays going in wildly different directions. Incoherence.

A solution to this is introduced in the paper Interleaved Sampling. The general idea I took from it is that while adjacent pixels must have different random sequences, pixels some distance apart can share random sequences. If we reorder the processing of pixels in our frame, we can make sure that we create batches of rays that use the same random sequence but are located some distance apart.

To make this concrete, let's say we are doing our own GPU raytracing. We use one “thread” in our warp/wave per ray. We then want to make sure that all threads in a warp correspond to pixels that are sort of close and use the same random sequence (hence going in roughly the same direction). This also means that once all the rays in our wave/warp have reached their targets and we want to fire even more rays, those rays are at least somewhat close to each other. And if we are outside, hopefully most of them hit the environment so we don't have to process them at all.
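
One possible way to express that reordering is sketched below: each wavefront/warp covers pixels that are 16 apart but share the sequence for one offset inside the 16×16 tile. The 8×8 lane layout and the exact mapping are my own assumptions for illustration.

```cpp
struct PixelCoord { int x, y; };

constexpr int kTile     = 16;  // random sequences repeat every 16 pixels
constexpr int kWarpSide = 8;   // 8x8 = 64 lanes per wavefront/warp (GPU dependent)

// The 64 lanes of one wavefront/warp all use the sequence for offset (ox, oy)
// inside the 16x16 tile, but are spread 16 pixels apart, together covering a
// 128x128 pixel region whose top-left corner (blockX, blockY) is assumed to be
// a multiple of 16.
PixelCoord pixelForLane(int blockX, int blockY, int ox, int oy, int lane) {
    const int lx = lane % kWarpSide;          // lane position in the 8x8 layout
    const int ly = lane / kWarpSide;
    return { blockX + lx * kTile + ox,        // neighbouring lanes are 16 pixels apart,
             blockY + ly * kTile + oy };      // yet (x % 16, y % 16) == (ox, oy) for all of them
}
```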

Note that this would probably also work well for “Realtime rendering with TAA”. The random sequence would be different but the general idea would work out.

  

Offline rendering, single bounce

Here we are not concerned with producing a good image after each frame. This helps us find coherency. Let's say we want to take 256 samples per pixel. If we let all pixels use the same 256 random directions there will be banding instead of noise. If we do total randomization there will be noise but no coherency.

What do we really want? For a given frame we want all random directions to be roughly the same. They don't have to be exactly the same, but we want rays that are fired close to each other, and end up on a surface, to shoot secondary rays in roughly the same direction. Here micro-jitter is a perfect solution. It takes one random direction and perturbs it just enough that pixels don't have correlated values. No banding, but noise. The idea is that we don't rotate things wildly: sample N goes roughly in the same direction, but each pixel moves it a little bit differently.

After frame 0 we will have a very banded image, so this is no good for progressive rendering. But once we've reached 256 samples, each pixel will have used different random directions. The key here is that once we've taken all 256 samples, we don't remember what order we took them in. And taken as a whole, two pixels have different random directions.

The higher the sample count, the smaller the perturbations we need.
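
Here is a sketch of the idea (not the paper's exact construction): all pixels walk through one shared, well-placed sample set in the same order, and each pixel only applies a small, pixel-dependent perturbation to each sample.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Micro-jitter sketch: sample N stays coherent across pixels because everyone
// uses the same base sample in the same order, but a small per-pixel jitter
// decorrelates the integration errors (noise instead of banding).
struct Sample2D { float u, v; };

Sample2D microJitteredSample(const std::vector<Sample2D>& shared, // one set for all pixels
                             int sampleIndex,
                             uint32_t pixelSeed,                  // differs per pixel
                             float jitterRadius) {                // shrinks as sample count grows
    // Cheap per-pixel, per-sample generator; any decent hash/rng will do.
    std::minstd_rand rng(pixelSeed * 9781u + static_cast<uint32_t>(sampleIndex) * 6271u + 1u);
    std::uniform_real_distribution<float> jitter(-jitterRadius, jitterRadius);

    Sample2D s = shared[sampleIndex];
    s.u = std::clamp(s.u + jitter(rng), 0.0f, 1.0f);   // stay roughly within the
    s.v = std::clamp(s.v + jitter(rng), 0.0f, 1.0f);   // sample's own region
    return s;
}
```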

To see why this works, consider a hemisphere with well-placed points. Now form Voronoi regions around each point, and then move the points randomly within their Voronoi regions, differently for different pixels. Then shuffle the order of the points, differently for different pixels. Since the order is forgotten (we don't care what happens at low sample counts), this works out.

Here is the paper: Cache-Friendly Micro-Jittered Sampling.

  

Two bounces (and maybe three)

Now let's say we've shot our camera rays and our first level of coherent diffuse rays (using interleaved sampling or micro-jitter, perhaps). Now we are at the second-bounce positions. Hopefully many paths have been terminated (if we are outside), but some will live on. Maybe we should compact rays again. But can we create coherent rays yet again?

Not easily, in my experience. Let's keep the idea of micro-jitter in mind. After the first bounce we send all rays roughly in the same direction. If we imagine 4×4 pixels all intersecting a floor and all taking the same direction away from it, they might all end up on roughly the same surface. Now, to give those 4×4 pixels uncorrelated values we need them to go in different directions. We need them to be incoherent. Failing to do so usually ends up with the diffuse lobe looking more like a reflection.

Pascal Lecocq, one of the authors of the micro-jitter paper, says that their approach tends to maintain coherent rays for about 3 bounces, given that the scene has reasonable geometry (think typical architectural scenes with somewhat flat surfaces). It is being used in an experimental version of the Cycles renderer (called “scrambling distance”). It can be tried here. So I believe this is a viable approach. I have yet to convince myself that it doesn't suffer from the issue I described above, but the problem might only be in my head, so take my scepticism with a grain of salt.

  

Offline rendering

Here we can maybe afford to do multi-bounce GI, depending on our budget. Since we don't need to present a good image after each frame, we can let the result have banding until we converge. It is important that we compact the rays so that paths that die don't leave “dead” threads in waves/warps. This is probably handled implicitly behind the scenes in DXR. If you are doing your own GPU implementation there doesn't have to be explicit compaction; it can be handled by path restarting, or by persistent threads and some sort of ray queue. The important part is to not leave threads idle during computation.
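
A sketch of the path-restarting flavor of this, with advancePath and startPath as assumed helpers: when a thread's path terminates it immediately grabs new work from a shared counter instead of idling.

```cpp
#include <atomic>
#include <cstdint>

struct PathState { /* ray, throughput, pixel, rng state, ... */ };

bool advancePath(PathState& path);                 // trace one ray segment; false = path terminated (assumed)
void startPath(PathState& path, uint32_t sample);  // build a fresh camera path for this sample (assumed)

// Persistent-thread-style worker: keeps pulling new paths from a shared
// counter so it never sits idle just because its current path ended early.
void persistentWorker(std::atomic<uint32_t>& nextSample, uint32_t totalSamples) {
    PathState path;
    uint32_t sample = nextSample.fetch_add(1);
    if (sample >= totalSamples) return;
    startPath(path, sample);
    for (;;) {
        if (!advancePath(path)) {                  // path finished: restart with new work
            sample = nextSample.fetch_add(1);
            if (sample >= totalSamples) return;    // no work left at all
            startPath(path, sample);
        }
    }
}
```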

An important factor here is that the first pass, from the camera into the scene, is often very coherent, so you don't want to mix too many secondary rays into the waves/warps that are doing such rays. This is an argument against too naive ray restarting.

In this mode we have a massive number of rays that we can process before we need to show anything. Latency is OK. Thus we are in a situation where we can bin/sort rays and defer them until we have fully coherent “buckets”. A sobering read here is this paper, but I am not sure whether it applies in this new GPU world.

The least we can do is bin on direction (positive x, negative x, ...) and maybe on starting position.
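
A sketch of such a bin key: three bits for the direction octant plus a coarse grid cell of the origin. The bit layout and cell count are arbitrary choices for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct BinnedRay { float origin[3], dir[3]; };

// Bin key: 3 bits for the direction octant (sign of x, y, z) plus a coarse
// grid cell of the origin (5 bits per axis). Rays with the same key start
// near each other and head the same general way, so tracing a whole bin at
// a time tends to touch the same parts of the BVH.
uint32_t rayBinKey(const BinnedRay& ray, const float sceneMin[3], float cellSize) {
    uint32_t key = 0;
    for (int a = 0; a < 3; ++a)
        key |= (ray.dir[a] < 0.0f ? 1u : 0u) << a;
    for (int a = 0; a < 3; ++a) {
        const float cell = std::floor((ray.origin[a] - sceneMin[a]) / cellSize);
        const uint32_t q = static_cast<uint32_t>(std::min(std::max(cell, 0.0f), 31.0f));
        key |= q << (3 + 5 * a);
    }
    return key;   // sort or bucket rays by this key before tracing
}
```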

An interesting read from the CPU world is Faster Incoherent Rays: Multi-BVH Ray Stream Tracing, to show you the type of ray sorting that can be done. What works depends very much on the rays and the machine you are running on. In the world of DXR, BVH traversal can probably not be manipulated by us, but the implementation could do something like this on the inside.

Before I leave I want to point out a really nice paper that highlights what can happen if you have really expensive shading: Megakernels Considered Harmful: Wavefront Path Tracing on GPUs. It will be interesting to see if its key takeaways change now that we have callable shaders in DXR.

  

Credits

Most of the insights I got around path tracing on the GPU came from working on the Stingray GPU light baker together with Tobias Persson. We had many discussions while trying to do performance optimizations. Also thanks to Alan Wolfe for proof-reading early drafts; your comments and questions made the text better! Thanks to Pascal Lecocq for comments on micro-jittering.