Monday, January 18, 2016

GPU path tracing tutorial 3: GPU-friendly Acceleration Structures. Now you're cooking with GAS!

(In case you were wondering, my pun-loving girlfriend came up with the title for this post). This tutorial is the longest, but most crucial one so far and deals with the implementation ray tracing acceleration structure that can be traversed on the GPU. The code from the previous tutorial works okay for simple triangle meshes with less then 10,000 triangles, but since render times grow linearly or O(n)with the complexity of the scene (each ray needs to test every primitive in the scene for intersection), anything above that number becomes unfeasible. To address this issue, ray tracing researchers came up with several acceleration structures such as grids, octrees, binary space partitioning trees (BSP trees), kd-trees and BVHs (bounding volume hierarchy), allowing render times to scale logarithmically or O(log n) instead of linearly with scene complexity, a huge improvement in speed and efficiency. Acceleration structures are by far the most important ingredient to building a fast ray tracer and an enormous amount of research has gone into improving and refining the algorithms to build and traverse them, both on the CPU on the GPU (the latter since 2006, around the same time unified shader architecture was introduced on GPUs). 

Scratch-a-Pixel (again) has a great introduction to acceleration structures for ray tracing (grids and bounding volume hierarchies) that includes example code: http://www.scratchapixel.com/lessons/advanced-rendering/introduction-acceleration-structure. Peter Shirley's "Realistic Ray Tracing" book also contains a good description and implementation of a BVH with C++ code.

An overview of the latest state-of-the-art research in acceleration structures for GPUs can be found in this blogpost on Robbin Marcus' blog: http://robbinmarcus.blogspot.co.nz/2015/10/real-time-raytracing-part-2.html

This tutorial focuses on the implementation of a BVH acceleration structure on the GPU, and comes with complete annotated source code for BVH construction (on the CPU) and BVH traversal (on the GPU using CUDA). The reason for choosing a BVH over a grid or kd-tree is because BVHs map better to modern GPU architectures and have also been shown to be the acceleration structure which allows the fastest build and render times (see for example https://anteru.net/research/quantitative-analysis-of-voxel-raytracing-acceleration-structures/). Another reason for choosing BVHs is that they are conceptually simple and easy to implement. The Nvidia research paper "Understanding the efficiency of ray traversal on GPUs" by Aila and Laine comes with open source code that contains a highly optimised BVH for CUDA path tracers which was used in Cycles, Blender's GPU path tracing renderer (http://wiki.blender.org/index.php/Dev:Source/Render/Cycles/BVH).

The code in this tutorial is based on a real-time CUDA ray tracer developed by Thanassis Tsiodras, which can be found on http://users.softlab.ntua.gr/~ttsiod/cudarenderer-BVH.html and which I converted to support path tracing instead. The BVH from this renderer is already quite fast and relatively easy to understand.

For the purpose of clarity and to keep the code concise (as there's quite a lot of code required for BVH construction), I removed quite a few nice features from Thanassis' code which are not essential for this tutorial, such as multithreaded BVH building on the CPU (using SSE intrinsics), various render modes (like point rendering), backface culling, a scheme to divide the image in rendertiles in Morton order (along a space filling Z-curve) and some clever workarounds to deal with CUDA's limitations such as separate templated kernels for shadow rays and ambient occlusion. 

One of the more tricky parts of implementing a BVH for ray tracing on the GPU is how to store the BVH structure and BVH node data in a GPU friendly format. CPU ray tracers store a BVH as a hierarchical structure starting with the root node, which contains pointers to its child nodes (in case of an inner node) or pointers to triangles (in case of a leaf node). Since a BVH is built recursively, the child nodes in turn contain pointers to their own child nodes and this keeps on going until the leaf nodes are reached. This process involves lots of pointers which might point to scattered locations in memory, a scenario which is not ideal for the GPU. GPUs like coherent, memory aligned datastructures such as indexable arrays that avoid the use of too many pointers. In this tutorial, the BVH data (such as nodes, triangle data, triangle indices, precomputed intersection data) are therefore stored in flat one-dimensonal arrays (storing elements in depth first order by recursively traversing the BVH), which can be easily digested by CUDA and are stored on the GPU in either global memory or texture memory in the form of CUDA textures (hardware cached). The BVH in this tutorial is using CUDA texture memory, since global memory on older GPUs is not cached (as opposed to texture memory). Since the introduction of Fermi however, global memory is also cached and the performance difference when using one or the other is hardly noticeable.  

In order to avoid wasting time by rebuilding the BVH every time the program is run, the BVH is built only once and stored in a file. For this to work, the BVH data is converted to a cache-friendly format which takes up as little memory space as possible (but the compactness of the data makes it also harder to read). A clever scheme is used to store BVH leaf nodes and inner nodes using the same data structure: instead of using a separate struct for leaf nodes and inner nodes, both types of nodes occupy the same memory space (using a union), which stores either two child indices to the left and right child when dealing with an inner node or a start index into the list of triangles and a triangle count in case of a leaf node. To distinguish between a leaf node and an inner node, the highest bit of the triangle count variable is set to 1 for a leaf node. The renderer can then determine at runtime if it has intersected an inner node or a leaf node by checking the highest bit (with a bitwise AND operation).  

A lot of the triangle intersection data (such as triangle edges, barycentric coordinates, dot products between vertices and edge planes) is precomputed at the scene initialisation stage and stored. Since modern GPUs have much more raw compute power than memory bandwidth, it would be interesting to know whether fetching the precomputed data from memory is faster or slower compared to computing that data directly on the GPU. 

The following is a high level explanation of the algorithm for top-down BVH construction (on the CPU) and traversal (on the GPU). The BVH in this code is built according to the surface area heuristic and uses binning to find the best splitting plane. The details of the BVH algorithm can be found in the following papers:

"On fast construction of SAH based Bounding Volume Hierarchies" by Ingo Wald, 2007. This paper is a must read in order to understand what the code is doing.

- "Ray tracing deformable scenes using dynamic Bounding Volume Hierarchies" by Wald, Boulos and Shirley, 2007

- "On building fast kd-trees for ray tracing, and on doing that in O(N log N)" by Wald and Havran, 2006


Overview of algorithm for building the BVH on the CPU

- the main() function (in main.cpp) calls prepCUDAscene(), which in turn calls UpdateBoundingVolumeHierarchy()

- UpdateBoundingVolumeHierarchy() checks if there is already a BVH for the scene stored (cached) in a file and loads that one or builds a new BVH by calling CreateBVH()

- CreateBVH():
  1. computes a bbox (bounding box) for every triangle and calculate the bounds (top and bottom)
  2. initialises a "working list" bbox to contain all the triangle bboxes
  3. expands the bounds of the working list bbox so it encompasses all triangles in the scene by looping over all the triangle bboxes
  4. computes each triangle bbox centre and adds the triangle bbox to the working list
  5. passes the working list to Recurse(), which builds the BVH tree structure
  6. returns the BVH root node
Recurse() recursively builds the BVH tree from top (rootnode) to bottom using binning, finding optimal split planes for each depth. It divides the work bounding box into a number of equally sized "bins" along each axis, chooses the axis and splitting plane resulting in the least cost (determined by the surface area heuristic or SAH: the larger the surface area of a bounding box, the costlier it is to raytrace) and finding the bbox with the minimum surface area:
  1. Check if the working list contains less then 4 elements (triangle bboxes) in which case create a leaf node and push each triangle to a triangle list
  2. Create an inner node if the working list contains 4 or more elements
  3. Divide node further into smaller nodes
  4. Start by finding the working list bounds (top and bottom)
  5. Loop over all bboxes in current working list, expanding/growing the working list bbox
  6. find surface area of bounding box by multiplying the dimensions of the working list's bounding box
  7. The current bbox has a cost C of N (number of triangles) * SA (Surface Area) or C = N * SA
  8. Loop over all three axises (X, Y, Z) to find best splitting plane using "binning"
  9. Binning: try splitting the current axis at a uniform distance (equidistantly spaced planes) in "bins" of size "step" that gets smaller the deeper we go: size of "sampling grid": 1024 (depth 0), 512 (depth 1), etc
  10. For each bin (equally spaced bins of size "step"), initialise a left and right bounding box 
  11. For each test split (or bin), allocate all triangles in the current work list based on their bbox centers (this is a fast O(N) pass, no triangle sorting needed): if the center of the triangle bbox is smaller than the test split value, put the triangle in the left bbox, otherwise put the triangle in the right bbox. Count the number of triangles in the left and right bboxes.
  12. Now use the Surface Area Heuristic to see if this split has a better "cost": calculate the surface area of the left and right bbox and calculate the total cost by multiplying the surface area of the left and right bbox by the number of triangles in each. Keep track of cheapest split found so far.
  13. At the end of this loop (which runs for every "bin" or "sample location"), we should have the best splitting plane, best splitting axis and bboxes with minimal traversal cost
  14. If we found no split to improve the cost, create a BVH leaf, otherwise create a BVH inner node with L and R child nodes. Split with the optimal value we found above.
  15. After selection of the best split plane, distribute each of the triangles into the left or right child nodes based on their bbox center
  16. Recursively build the left and right child nodes (repeat steps 1 - 16)
  17. When all recursive function calls have finished, the end result of Recurse() is to return the root node of the BVH
Once the BVH has been created, we can copy its data into a memory saving, cache-friendly format (CacheFriendlyBVHNode occupies exactly 32 bytes, i.e. a cache-line) by calling CreateCFBVH(). which recursively counts the triangles and bounding boxes and stores them in depth first order in one-dimensional arrays by calling PopulateCacheFriendlyBVH()

The data of the cache friendly BVH is copied to the GPU in CUDA global memory by prepCUDAscene() (using the cudaMalloc() and cudaMemcpy() functions). Once the data is in global memory it's ready to be used by the renderer, but the code is taking it one step further and binds the BVH data to CUDA textures for performance reasons (texture memory is cached, although global memory is also cached since Fermi). The texture binding is done by cudarender() (in cuda_pathtracer.cu) which calls cudaBindTexture(). After this stage, all scene data is now ready to be rendered (rays traversing the BVH and intersecting triangles).


Overview of algorithm for traversing the BVH on the GPU

- after cudarenderer() has bound the data to CUDA textures with cudaBindTexture() the first time it's being called, it launches the CoreLoopPathTracingKernel() which runs in parallel over all pixels to render a frame.
- CoreLoopPathTracingKernel() computes a primary ray starting from the interactive camera view (which can differ each frame) and calls path_trace() to calculate the ray bounces 
- path_trace() first tests all spheres in the scene for intersection and then tests if the ray intersects any triangles by calling BVH_IntersectTriangles() which traverses the BVH.
- BVH_IntersectTriangles():
  1. initialise a stack to keep track of all the nodes the ray has traversed
  2. while the stack is not empty, pop a BVH node from the stack and decrement the stack index
  3. fetch the data associated with this node (indices to left and right child nodes for inner nodes or start index in triangle list + triangle count for leaf nodes)
  4. determine if the node is a leaf node or triangle node by examining the highest bit of the count variable
  5. if inner node, test ray for intersection with AABB (axis aligned bounding box) of node --> if intersection, push left and right child node indices on the stack, and go back to step 2 (pop next node from the stack)
  6. if leaf node, loop over all the triangles in the node (determined by the start index in the list of triangle indices and the triangle count), 
  7. for each triangle in the node, fetch the index, center, normal and precomputed intersection data and check for intersection with the ray
  8. if ray intersects triangle, keep track of the closest hit
  9. recursively traverse the left and right child nodes, if any (repeat steps 2 - 9)
  10. after all recursive calls have finished, the end result returned by the function is a bool based on the index of the closest hit triangle (true if index is not -1)
- after the ray has been tested for intersection with the scene, compute the colour of the ray by multiplying with the colour of the intersected object, calculate the direction of the next ray in the path according to the material BRDF and accumulate the colours of the subsequent path segments (see GPU path tracing tutorial 1).

In addition to the BVH, I added an interactive camera based on the interactive CUDA path tracer code from Yining Karl Li and Peter Kutz (https://github.com/peterkutz/GPUPathTracer). The camera's view direction and position can be changed interactively with mouse and keyboard (a new orthornormal basis for the camera is computed each frame). The camera produces an antialiased image by jittering the primary ray directions. By allowing primary rays to start randomly on a simulated disk shaped lens instead of from a point. a camera aperture (the opening in the diaphragm) with focal plane can be simulated, providing a cool, photographic depth-of-field effect. The focal distance can also be adjusted interactively.

The material system for this tutorial allows five basic materials: ideal diffuse, ideal specular, ideal refractive, Phong metal (based on code from Peter Shirley's "Realistic Ray Tracing" book) with a hardcoded exponent and a coat (acrylic) material (based on Karl Li and Peter Kutz' CUDA path tracer).


CUDA/C++ source code

The source code for this tutorial can be found at 

https://github.com/straaljager/GPU-path-tracing-tutorial-3/

As in the previous tutorials, I aimed to keep the code as simple and clear as possible and added plenty of comments throughout (in addition to the original comments). If some steps still aren't clear, let me know. Detailed compilation instructions for Windows and Visual Studio are in the readme file: https://github.com/straaljager/GPU-path-tracing-tutorial-3/blob/master/README.md


Download executable (Windows only)

https://github.com/straaljager/GPU-path-tracing-tutorial-3/releases

All scene elements in this executable are hardcoded. Changing the scene objects, materials and lights is only possible by directly editing and re-compiling the source code. 


Screenshots

Screenshots produced with the code from this tutorial (Stanford Dragon and Happy Buddha .ply models from the Stanford 3D scanning repository)

Glossy Stanford dragon model (871,000 triangles)



Happy Buddha model (1,088,000 triangles) with Phong metal material











The next tutorial will add even more speed: I'll dive deeper into the highly optimised BVH acceleration structure for GPU traversal from Aila and Laine, which uses spatial splitting to build higher quality (and faster) trees. It's also the framework that the GPU part of Blender Cycles is using.

Other features for upcoming tutorials are support for textures, sun and sky lighting, environment lighting, more general and accurate materials using Fresnel, area light support, direct light sampling and multiple importance sampling.

References

- CUDA based sphere path tracer by Peter Kutz and Yining Karl Li
- "Realistic Ray Tracing" by P. Shirley