Ray Tracing on Cell by Using KD-Tree Acceleration Structure
MSc Dissertation Umut Rıza ERTÜRK
12/19/2008
University of Abertay Dundee,
Department of Computing
and Creative Technologies, Dundee, Scotland, UK
Table of Contents A. Introduction to Ray Tracing 5 3. Shooting Rays and Ray-Object Intersection 24 5. Reflection and Refraction Rays 27 C. Acceleration of Ray Tracing 29 1. Acceleration Structures for Faster Ray-Surface Intersection Calculations 31 D. Distribution of Ray Tracing Steps 46 A. Outline of the Project Implementation 48 C. KD-Tree PC Implementation 54 D. KD-Tree Cell Implementation 57 A. Basic PC implementation without any acceleration structure 65 B. KD-Tree implementation on PC 71
I.AbstractThis thesis presents a rendering method called ray tracing that generates a two-dimensional image from a virtual three-dimensional world. The major advantages of ray tracing are creating a realistic image by taking the light behaviour such as reflections and refractions into consideration. This property of ray tracing has established it as an efficient method for image generation in different graphics areas such as animation and movie industry. However, the use of this method has not been widely implemented for computer games because of its high processing cost. But the recent developments in hardware technologies have demonstrated enormous processing potential which may make ray tracing possible for real time applications such as games in the next few decades. One of the dominant hardware developments today is the Cell architecture which is composed of eight synergetic processing units each having 3.2 GHz of calculation spend (SPUs) and one power processing unit (PPU). Due to this high processing speed presented by the Cell architecture makes it the perfect contender for ray tracing implementation. However, the processing power alone cannot answer the problem of real time implementation of ray tracing and it should be sufficed with acceleration data structures that share the processing load thus enhancing the overall performance. For the purpose of this research KD-Tree data structure is used. In brief the KD-Tree acceleration structure prevents time consuming calculations such as square rooting by ordering the objects in the scene in a BSP tree fashion. Finally this thesis discusses the results obtained by using the KD-Tree structure on the Cell as well as on PC and an evaluation of these results is conducted based on the efficiency gained in terms of speed and rendering time.
II.IntroductionHuman beings have always needed to control, and also simulate, nature in order to make life easier. In the Stone Age this desire forced him to invent tools for hunting, and in ancient times made him want to build water ways to irrigate farms and so on. The aspiration to simulate nature has shown itself in different forms and different ways. In 1021 AD Ibn al-Haytam published a book which included references to projecting light rays passing through a small hole, onto a plane, and from this the pinhole camera was invented and the idea of converting three dimensional space on to a two dimensional space was begun. Hundreds of years after this occurrence, with the help of the significantly developing transistor technologies as well as the accumulation of knowledge in the computer sciences and graphic technologies, it is now possible to simulate an imaginary three dimensional space and project it to two dimensional output devices such as monitors. This ability to be able to visually imitate an imaginary three dimensional world has affected most areas of the visual arts, including films and video games. The most important drivers for computer graphics technologies are obviously films and computer games which are always pushing towards increased realism. As the games are getting more developed in terms of visual realism they need more accurate visual attributes, such as soft and realistic volume shadows, multi-pass lightning and complex shaders. To achieve this level of realism they are not only getting more resource intensive but also getting more complicated in terms of the development process with the currently most common rendering method; rasterization. The main problem of this classic rendering approach is that all the objects in a 3D scene are made up of triangles, and all these triangles have to be passed through from the main processing unit to the graphics processing unit one by one. In the rasterization pipeline all these triangles need to be analyzed, coloured, lit, textured, culled and as a result become a picture (CDR-INF, 2007). To overcome the fact that the time taken increases linearly with respect to the number of triangles lots of other methods, such as level of details (LOD), back face culling, become involved in the rendering pipeline (Wald, Slussalek 2001). As a result, this approach either gives rise to a linear slow down with the increasing number of triangles, or increases the complexity of realistic effects implementation, since each effect needs to be implemented by hacking or by approximation methods. One of the most suitable examples to expose the weakness of the rasterization is shadowing; there are many different methods being used for shadowing in rasterization (such as stencil shadowing and shadow mapping) and all of them are actually finding a way to imitate the shadow in very complex ways. Despite the unrealistic results and complex implementation of the effects, currently rasterization is considered as the conventional rendering method for computer games because of the low computation cost (for low poly scenes) and the common use of rasterization dedicated devices (graphics cards). However microprocessor technology, as well as parallel-computing, has significantly improved over the last few decades which has provoked researchers to be focused on other alternative rendering techniques to achieve better levels of realism. Ray tracing is one of the oldest rendering techniques and completely different from rasterization. Ray tracing makes it possible to achieve nearly perfect visual realism as it precisely mathematically simulates real light ray behaviour. This level of realism and accuracy caused ray tracing to be used not only for visual effects but also for whole rendering process in animated films such as Ice Age, Ice Age 2 and Robots (Suffern, 2007, p. xiv). Despite the wide use of offline ray tracing in animated films, current PCs are not fast enough to afford expensive ray tracing calculations for games within current hardware resources. However, because each ray is independently computed, ray tracing is very suitable for parallelization. This enables the time consumed by process to be decreased linearly in proportion to the number of processing units being used. Therefore, with the help of the significant advances in multi-core processor technologies, as well as the increasing research in this area, ray tracing may eventually become the conventional rendering technique, overtaking the place of rasterization (Teller and Alex, 1998), (Wald, 2004).
A.Thesis OutlineResearch into ray tracing can simply be grouped into two parts; 1) focusing on improvements in the current methods or trying to find some new techniques, especially for acceleration structures, 2) implementing already known techniques or structures on different environments. This thesis should be classified in the second group as it mostly focuses on an investigation of current methods, instead of the implementation of new ones. For that reason, the implementation of ray tracing in this thesis is kept as simple as possible. On the other hand, investigation of currently used techniques and comparison of these techniques takes precedence in this thesis. This thesis is made up of three main parts; firstly, a detailed explanation of ray tracing, secondly, CELL architecture and implementation of ray tracing on CELL, finally, the project outcomes and a comparison of techniques implemented on CELL and CPU. Part one begins with a very basic explanation of ray tracing and continues with the two main ray tracing methods; forward and backward ray tracing, however, the rest of the thesis is focused on backward tracing method since it is considered as the standard ray tracing method because of the reasons explained in the backward ray tracing section of this thesis. Next, the key steps of ray tracing are discussed. The following chapter is integral to this thesis which is based around an in depth investigation of various techniques. In this chapter ray tracing acceleration methods are investigated, furthermore their pros and cons are discussed. Part two is the methodology where the KD-Tree acceleration structure benefits of ray tracing by using this acceleration structure on multi-core processors are discussed. After giving the general idea of why parallel computing for ray tracing is so important, the next chapter focuses on an individual multi-core architecture called CELL. Owing to the CELL architecture’s different hardware attributes the conventional ray tracing steps, as well as the data structures used, need to be changed. For that reason, in the following chapters, fitting the data structures and distributing the steps of ray tracing are discussed. The next part after the methodology is the implementation part where the implementation of the project is explained. This part is also divided into four parts; first part gives the outline of the implementation and explains the steps followed for the implementation of the ray tracer. These steps are explained in details by giving detailed explanation of the classes as well as the main ray tracing functions. The last step of this part is the desired ray tracer implementation by using KD-Tree on Cell architecture. Part four is the testing part of the results. Testing of the results are made between the Cell and the PC implementation of the project and also with and without using the acceleration structure. Part five is the conclusion. After giving a summary of the thesis, continues with a critical evaluation of the project and discusses what improvements could have been made. The conclusion will also emphasise the significance of ray tracing for games and films and the importance of its implementation on multi-core architectures. The last part is a brief survey of prospective future work for ray tracing on multi-processor architectures. III.Ray TracingA.Introduction to Ray TracingRay tracing is “a technique for image synthesis: creating a 2D picture of a 3D world” as stated by (Glassner, 1989, p.1 ) in his one of the most referenced books about ray tracing. In essence this explanation overlaps with Ibn al-Haytams’s pinhole camera invention as the main idea is projecting a 3D space on to a 2D space with a difference; in the context of ray tracing neither the spaces nor the light are real but virtual. For the reason that the conceptual idea for ray tracing is coming from the pinhole camera, before going deep into technical explanations involving computer graphics terminology, it is better to briefly put in plain words the fundamentals of the pinhole camera and behaviour of light for a better understanding of the theory. First of all it is very important to get general understanding of the behaviour of light in ray tracing. There are two different models for explaining the behaviour of light; wave and directional models. While wave model can explain diffraction and interference, directional model can explain reflection, refraction and shadows. In the context of ray tracing, only directional model of the light is taken into account which also explains the working mechanism of the pinhole camera (photon mapping). A pinhole camera is composed of three essential elements; objects which are being viewed, a plane with a whole on it and another plane which the result image is being projected (the diameter of the whole changes with respect to the distance between two planes). The light rays coming from light resources reflect, refract or get absorbed according to the attributes of the objects. For instance a blue coloured, non-opaque, reflective surface absorbs the colours except for blue tones of the light ray then reflects the blue channel of the light in a direction depending on the shape of the surface at the intersection point of the light ray and the object. After several reflections of the light rays from the light resources, some of the light rays can pass through the whole before being totally absorbed. The rays passing through the hole intersects with the screen plane forming the image of a scene. The final image take a shape on the image plane is topside down as shown in Figure 1.
Despite the common origin of ray tracing and the pinhole camera, there are also differences between the two in terms of terminology and application of the idea. While the image plane is in front of the focal point in a pinhole camera, it is behind the focal point in terms of ray tracing. From ten thousand feet high, there are three main elements in ray tracing just like in pinhole camera model; a three dimensional scene involving 3D objects (world), rays (light rays) and a 2D plane (a surface on which light from the objects is being projected). The main requirement for having a complete 2D image of the scene is; at least one ray from the scene should correspond per each atomic unit (pixel) of the 2D plane if it doesn’t correspond to an empty space in the scene. That can be done either by sending the rays from the light source or by sending the rays from the image plane however these methods leads to different kinds of problems such as efficiency or level of realism. These issues will be discussed in the following chapters of this thesis. Because of its physically correct implementation, ray tracing has no difficulty in creating realistic effects such as reflection or shadowing. However this method computationally expensive due to two major problems; as a pixel on the screen plane needs to correspond to a ray going through (or coming from) the scene, each ray needs to be created, shot and tested against the objects in the scene to find correct colour value of that pixel. To give an example, a screen image with 1024x768 pixels needs at least 786432 eye rays to be created and according to the organisation method, may require millions of object-ray intersection tests. Moreover these rays will interact with object surfaces and keep reflecting as well as refracting through the scene, and eventually, the colour values for all the pixels on the image plane need to be found and shown on the output device. In the following chapters of this section, the types, steps and important acceleration methods for making ray tracing reasonable fast will be discussed in detail. It should be noted that despite the significant improvements in processor technologies, without the acceleration methods and structures, it would never be possible to see ray tracing in real time. In addition to that reason, as this thesis is mostly focused on investigation of these acceleration methods and implementing some techniques on a multiprocessor architecture, the chapters to come are the most significant in the context of this thesis.
1.Ray Tracing TypesAs mentioned above, tracing the rays through the scene can be done in two ways; the first one being tracing the rays from light sources (as the light rays in the nature) through the scene, and secondly from the image plane through the scene. Although the main ideas behind these two methods are the same, they lead to different levels of realism, efficiency and implementation problems. In the next two sections algorithms for these two methods will be given, discussed and compared. At this point it is useful to distinguish and clarify two terminologies; final image and the image plane. Final image is the result picture of 3D world and consists of pixels. These pixels are discrete (Glassner, 1989, p.161) (there is nothing between 1st and the 2nd pixels of the image) and physically exist to be shown as the final image’s colour values. On the other hand an image plane is a virtual plane in the scene and only two vectors can define the image plane (for instance top-left and bottom-right points in the scene space). The scene is completely abstract to the final image but the image plane virtually exists in the scene. Despite the abstraction between these two terminologies, they can be used in same meaning since each pixel in the final image and corresponds a region on the image plane or vice versa.
a)Forward Ray TracingForward ray tracing is possibly one of the first ideas for imitating the light behaviour for creating a 2D view of a 3D scene as it works literally how the image of the real world gets into shape on our retina. In real world photons (or ‘light rays’ instead, as ray tracing is the implementation of directional behaviour of light) coming out of different kind of light resources collides with objects and according to the object’s surface reflect, refract from the object or get absorbed as mentioned in the previous chapter. However from countless number of those light rays only a few comes into an eye and creates a projected view of the world on the retina. Since forward ray tracing method is a straightforward implementation of the real world, the steps for creating a forward ray tracer are pretty straightforward as well and can be roughly defined as follows;
Table 1 - Steps for Forward Ray Tracing (Sherrod, 2008)
With a more understanding in the context of computer sciences these basic steps can be interpreted into the following pseudo code steps;
Table 2 - Pseudo Code for Forward Ray Tracing (Sherrod, 2008)
At first sight these steps can be seen quite simple to implement however the implementation of the steps is not as simple as its concept is due to some quality and efficiency problems.
The first and the most important problem is the wasted ray problem. According to the forward ray tracing algorithm there is no guarantee for a ray to hit a point in the viewing volume just as in real world (see the figure 3). In addition to that, it cannot be known if the ray will eventually fall into the viewing volume after some reflections or refractions, that make the problem even worse since the rays need to be traced until they get totally absorbed or until a threshold value of zero. Because of that reason only the a few lucky rays falling into the viewing volume can affect the final image. For having a complete picture of the scene the number of wasted rays might be (even usually) much more than the lucky rays. To tackle with this problem there are estimation methods being used for sending the rays in the directions, which has more potential to get into the result image instead of sending them totally random. However these estimations are highly dependent on the number of rays sent into the scene and also the complexity of the scene (Wald, 1999). Even if the rays were sent into high probabilistic areas, the distributions of these rays are still important. If the rays were mostly sent through some areas rather than the other parts of the scene that might result with an unbalanced result such as, the regions in shadow having some noise and other regions may have a clear result. The noise problem is a serious quality problem and the only method for solving this problem is sending more rays into the scene especially to the shadowed regions and far regions. The main reason of the noise problem is, as the shadowed areas are in low probabilistic regions, the numbers of rays corresponding to the pixel in the image view are less compared to the other lighted regions. As a result because of being insufficient number of rays filling into the pixel, the colour value of the pixel cannot be calculated correctly causing the noise effect.
On contrary, forward ray tracing has some natural advantages such as it being easily understandable, implementation also being quite straight forward as well as simpler and generates very high quality realistic pictures as it naturally handles all the illumination features (Wald, 1999) furthermore it correctly satisfies the rendering equation (given in APPENDIX). To conclude, it is possible to create very realistic high quality images with forward ray tracing however due to the following reasons it is not as commonly used as backward ray tracing unless very a realistic result is expected;
Table 3 - Problems of forward ray tracing (Sherrod, 2008) Despite of such complicated problems, backward ray tracing is mostly preferred with higher level of implementation complexity, which is going to be discussed in the rest of this thesis.
b)Backward Ray TracingThe computational cost of forward ray tracing is not acceptable for real time applications such as games. The computational cost is mainly caused by the wasted rays as mentioned in the previous chapter. There is a very simple logical solution to this problem; if it is not possible to accurately estimate the rays being sent from the light resources to the scene for creating the final image, and then why not send the rays from the image plane into the scene so there will be no wasted ray.
The number of rays sent into the scene depends on the resolution and the aimed quality. Sending one ray per pixel is enough to satisfy the minimum requirement of creating a 2D view of the 3D image, however in some cases one ray per pixel may lead to some quality problems such as aliasing (which will be discussed in the following chapters). After sending the ray, just like in forward ray tracing, the ray needs to be tested against the objects in the scene in order to find the closest object intersecting with the ray and according to the material of the object. Then after finding the object and the intersection point, reflected ray and/or refracted ray need to be found which also needs to be traced in the scene until reaching an empty space or a limitation such as maximum number of reflections.
The very basic method for backward ray tracing, mentioned above, was suggested by Turner Whitted in 1980 and it became known as recursive ray tracing or whitted style ray tracing. This method is also most commonly used method in the context of backward ray tracing however this method can only implement perfect reflections, refractions and hard (-edged) shadows due to the limitation of the reflected and refracted rays. Therefore the whitted style ray tracing method does not satisfy the rendering equation due to some physical facts such as behaviour of diffused materials. These issues will be discussed in great details in the Reflection, Refraction and Shadow Rays section of Ray Tracing Steps chapter. A solution suggested by Robert Cook (Cook, 1984) is distributed ray tracing which does not only satisfy the rendering equation but also allows some of the special effects listed in table 4 to be implemented easily. The idea behind distributed ray tracing is about creating and sending multiple rays instead of one ray in different stages of the rendering process such as reflection and refraction step. These steps and effects of applying the distributed ray tracing will be given in to more details in the next chapter. Nevertheless note that distributed ray tracing gives more realistic result therefore it is computationally more costly compared to the whitted-style ray tracing method and is used mostly for offline rendering.
Table 4 - Distribution of the rays in various phases of the rendering causes different effects (Glassner, 1989, p.171-172) To conclude, even though the results of backward ray tracing are not as realistic as forward ray tracing method and its implementation is more complicated, backward ray tracing is the most commonly used ray tracing method since the quality lost is not comparable to the gain in the computation speed. Backward ray tracing is eventually expected to be able to run on ordinary computers and to become the standard rendering method (Wald, 2004) in the next couple of decades however forward ray racing is unlikely to reach such a development. In addition to this efficiency advantage, the flexibility and maintainability nature of backward ray tracing are also some other reasons be known as the standard ray tracing method. In the next chapter, the steps of ray tracing will be given in more details and different methods for increasing the efficiency and quality will be discussed.
B.Ray Tracing StepsThe previous chapters of this dissertation tried to give a general idea about ray tracing without going into details for more understanding of the general concept. In this chapter the steps of ray tracing will be discussed separately, this will involve the associated definitions, suggested methodology increasing the level of realism and efficiency, in order to give a solid idea of its implementation. Before getting into more details in explaining each step, a brief study of basic algorithm for recursive ray tracing would be useful since the algorithm will be referred to in each step.
Table 5– Whitted style ray tracing steps. It should be noted that these steps are simplified for better understanding of the following chapters.
Table 6 - Flow chart for ray tracing 1.Preparing the sceneJust like any kind of the computer application ray tracing also needs some inputs in order to create the output. In the context of ray tracing these inputs are; the objects in the scene, an imaginary plane (view plane) and the lights. It should be spotted that the camera (the focal point) is missing in the list. The main reason is that the focal point is needed only for perspective view of the scene. As previously mentioned (Table 4) the distribution of the focal point for each ray enables different effects such as depth of field. Technically speaking the objects in the scene need to be stored in some kind of data structures such as; arrays, trees just to mention a few. Some of these data structures do not only stores pointers to the objects but also may partition the space and may also create some virtual objects for efficiency reasons. These structures will be discussed under Acceleration Structures for Reducing Ray-Surface Calculations section of this thesis.
2.Ray GenerationRay generation step defines the projection type of the scene since this step the primary sets the primary rays. To give an example, if all the rays are set as they are coming from one constant point (focal point) behind the view plane then the result will be perspective and the distance between that point and the view plane defines the view angle. In addition it is possible to get non-perspective result image by using different mapping techniques such as spherical mapping for getting up to 360 degree view angle (Suffern, 2008). Apart from perspective projection these techniques will not be discussed in this paper please check the referred book (Suffern, 2008, Chapter 5,6,7,8) for more information about these techniques. To satisfy the minimum requirements of ray tracing, at least one ray from each pixel should be sent into the scene and a ray has two fundamental elements; an origin and a direction. As shown in the figure 6, in terms of perspective image generation, the origin of the ray is the focal point (location of the camera in the space) and the direction is the normalised vector between the focal point and the related pixel’s corresponding view plane point.
Equation 1 – Ray direction and origin The pixel’s corresponding point has been referred to a number of times in this thesis and the relation between the image plane and the pixels has been previously mentioned and the basic mathematical relation between them is;
Figure 6 - Generating primary rays; ray origin and direction
Table 7 - Pseudo code for creating one ray per pixel to get a perspective view At first sight creating one ray per pixel may seem enough to get a realistic result however it is not due to the discrete structure of the pixels. Let’s assume that the number of pixels in the final image is N and the surface area being projected on the image plane is K unit square. In that case the surface area covered by one pixel will be N/K however the colour value for that area will be calculated from one ray which identifies only one primitive. If that area covers a region which involves the edge points of a number of primitives, final colour value of the pixel will be either of these primitive’s colour. This problem causes the spatial aliasing problem (Glassner, 1989). Programmatically the easiest way to overcome this problem is supersampling which means sending more than one ray per pixel and finding the final colour value of this pixel by averaging the result colour values of these rays. For instance let’s say the pixel covers an area corresponding to the edge point of a sphere’s black coloured shadow and a white coloured ground. In this case if the number of ray sent from the pixel is one, the final colour value of that pixel will be either black or white (Figure 7). However if the number of rays sent from the pixel is five and three of them intersects with the shadow and two of them intersects with the ground than the final colour value will be gray and the transition will be smooth (Figure 8).
Supersampling by shooting a constant number of rays per pixel reduces the aliasing problem however it doesn’t solve the problem perfectly besides most of the rays sent into the scene do not correspond to edge points in addition to that the render time increases linearly respect to the number of rays sent per pixel. One of the methods for solving this problem is firstly by dividing the pixel into five grid cells (corners and the centre) then shooting rays through the centre points of the grid cells and then looking for the difference between the final colours of them. If the differences are big (usually if bigger than a pre-defined value) then the grid cells are divided into smaller grids until reaching a satisfactory difference. This method is called adaptive supersampling. Although this method works fairly fast and can easily be implemented, it is not the best solution in terms of quality since the subdivision of the pixel is regular (Glassner, 1989).
There are some other methods for the regular subdivision problem of the pixel, such as stochastic ray tracing and statistical supersampling however these methods will not be given in this paper please check the referred book (Glassner, 1989, Chapter 1) for these methods. In summary ray generation step can be considered as the step where the lens type of the virtual camera is set and where the anti-aliasing problem is solved by sending a number of rays from each pixel. Implementation of this step has an important impact on the quality of the image as well as the rendering time.
3.Shooting Rays and Ray-Object IntersectionThe word shooting might be a confusing terminology as it has an abstract definition in terms of implementation and means searching through the objects if any of them intersects with the ray or the rays being processed. To give an example, in table-7 the “Ray” represents a ray preparing to be shot (to be tested against all the objects in the scene) through the scene. After setting the direction and the origin of the ray the shooting ray process starts and depending on the method being used either some of the objects or all of the objects are tested against the shot ray to find out which of the objects intersect with. This step is also known as traversing the scene. Finding the intersecting object with the ray is done by using different mathematical calculations depending on the objects geometry (Appendix – Primitive Ray Intersection). A 3D scene can be compound of very complex geometric shapes but these objects have to be defined by a set of simple primitives such as triangles, spheres, planes etc. Finding the intersecting object with the ray actually means testing the primitives forming the geometric object against the ray. For instance let’s assume there is a scene involving twenty cubes each having twelve triangles, in this case, simply, the ray should be tested against two hundred forty triangles unless an acceleration structure is not used. This step is the most time consuming step of ray tracing and from 75% to 90% of the whole rendering time is used for the ray-primitive intersection calculations (Whitted, 1980). This is the main reason why primitive-ray intersection calculations require complex mathematical calculations such as square root calculation. Besides these calculations need to be very accurate and rough approximations are not acceptable since the intersection point of the ray and the object will be the origin point for the reflective, refractive as well as the shadow rays and miscalculation of this point causes a deformation in the final images. The primitive–ray intersection tests are calculatingly expensive and the best way to accelerate the rendering is avoiding these calculations by using some special data structures which somehow sort the primitives. However by this step instead of testing all the primitives against each ray, testing only some of them may sufficiently accelerate the rendering. Although this step has no direct impact on the quality of the final image, since it is the most time consuming step, vast majority of the researches are being carried out on speeding up this step by constructing complex structures. For that reason this step has the biggest importance and complexity in terms of implementation. The speed up data structures related to this step will be explained in more details in the Acceleration Structures for Reducing Ray-Surface Calculations chapter of this thesis.
4.ShadowingThe next step after finding the intersection point is determining if the hit point is lightened by any of the light sources in the scene or not and this step is the only step where the light sources are into account. The easy way of finding if the hit point is shaded by an individual light source is simply creating a ray (shadow ray) between the hit point and the light resource being tested (Figure 11). If the ray intersects with any of the objects rather than another light resource then this point is considered as in shadow with respect to that individual light resource. Since finding the occluder object also requires intersection tests between the shadow ray and the objects, the performance of this step is also highly dependent to the acceleration structures used for sorting the objects in the scene.
Furthermore sending only one ray from the hit point to the light resource is a solution only for point light resources. If the light resources in the scene are areal lights (such as directional lights) then sending only one shadow ray is not sufficient at all and causes lack of realism in the final image as the hit point is considered as either in shadow or not. However in real life a point can be partly lightened by a light resource which causes penumbra areas to be formed. The solution for having realistic soft shadows is simple; sending a number of rays, depending on the light source’s size and deserved quality, from the hit point towards the light source. However the number of rays sent from the hit point to the light resources increases the render time linearly and in some cases the number of shadow rays might be even more than 50 (Fomella, 1998). To prevent a tremendous slow down in render time one of the solutions which were suggested by Eric Haines in 1986 (Haines, 1986) is; shadow buffers which fastens up the shadowing process for scenes with many light resources and provides a sufficient level or realism without any approximations. However on the contrary it may even end up with some more calculations for scenes involving small number of light resources (Wald, 2004). Another suggested approach by Lukaszewski and Fomella is simply based on finding the penumbra areas by shrinking down the light sources to point lights and enlarging the objects. Although this method also reduces the shadowing time, it needs some additional data structures besides it is highly dependent on the shape of the objects in the scene (Fomella, 1998).
5.Reflection and Refraction RaysAs mentioned earlier in this thesis reflection and refraction of the light rays is because of the directional behaviour of the light. When a light ray collides with an object it gets absorbed depending on its frequency (colour) and the colour of the object. The unabsorbed part continues travelling and either reflects or refracts from the objects surface depending on the normal of the object at the hit point.
Figure 13 – Reflection (a) and refraction (b) of a ray depends on the normal of the surface at the hit point
A very simple approach for solving perfect reflections and refractions is creating two rays; one reflection ray and one refraction ray and tracing these two rays through the scene recursively. However in reality is nothing completely neither reflective nor refractive. In the context of computer graphics a surface has a level of specular behaviour and has a level of diffuse behaviour. While specular surfaces reflects the light in regular basis, diffuse surfaces reflects the light arbitrarily (Figure 14). The problem with backward ray tracing is the ray hitting only a point; however the pixel corresponding to the ray represents an area. As a result of this problem diffusion of a ray should be handled in a distributive manner by creating more than one reflection ray for diffuse surfaces. In doing so, the rendering equation can be satisfied at certain levels. Nevertheless creating more than one ray for diffuse surfaces may decrease the rendering speed dramatically depending on the number of rays created and the data structure used. Apart from the efficiency concerns the other problem is the distribution method of the light rays which has a big impact on quality. Since the rendering equation is an integral function, the rays that can be sent from a point are a finite approximation in order to find their directions. One of the commonly used methods for solving this problem is the Monte Carlo Integration (Ward, 1988). However this method will not be discussed in this paper.
The distribution method of rays spreading from a point (or from a tiny area) is also a problem for rays passing through non-uniform surfaces such as glossy surfaces. These techniques will not be discussed in this paper however for explanations of these methods given into details in (Suffern, 2008, Chapter 5). To sum up, reflection and refraction of the rays from object surfaces has a big impact on the performance as well as on the realism. It should be noted that in the ray tracing process this step is the recursive which means this step actually decides the new direction of the ray(s) to be traced through the scene. Therefore the step is very important as it effects ray tracing time and the level of realism as the material properties of the objects are being implemented in this step.
C.Acceleration of Ray TracingSince ray tracing algorithm is computationally very expensive and unlike the rasterization there are no specialised hardware units in general purpose computers hence it has always been a been a big challenge to find ways for the acceleration of ray tracing. Subsequently the majority of the acceleration methods for ray tracing have been found in the last few decades (Wald, 2004).
Generally speaking acceleration methods for ray tracing are classified into three groups; fast object-ray intersection methods, fewer ray creation methods and the use generalised rays (Humphreys, 2003). The methods for reducing the number of rays, such as adaptive super sampling as well as shadow mapping, have already been discussed in previous chapters of this thesis (B.2, B.3, B.4, and B.5). To avoid duplications these techniques will not be discussed in this chapter anymore instead this chapter will focus on fast ray-object intersection algorithms. Apart from the acceleration algorithms, the hardware where the ray tracing code will run and the programming language used have an important impact on the rendering performance. Most of the commercial ray tracers such as Brazil, Mental Ray, finalRender and Maxwell Render have been written in C++ (Suffern, 2008) because of the efficient floating point calculations and also the object oriented programming logic which gives a high level of flexibility and reusability. Suffern’s writing efficient C++ codes highlights basic but very important points in ray tracing (Suffern, 2008, Chapter 1)
1.Acceleration Structures for Faster Ray-Surface Intersection CalculationsUnlike the reducing the number of rays techniques, faster ray-surface intersection methods has no impact on the final image but on rendering time, memory usage and implementation. As mentioned earlier in this paper 75% to 95% of the rendering time is being used by ray-surface intersection calculations (Whitted, 1980) due to computationally expensive functions such as cross-product or square root.
a)Brute-ForceActually brute force is not an acceleration structure but a most primitive way of storing the objects in a very simple linear data structure such as an array or a vector (a kind of dynamic array). According to the brute force method, all the primary rays and the secondary rays need to be tested against all the objects in the array one by one since the order of the objects in the array are not sorted therefore the closest primitive intersecting with the ray needs to be found. For shadow rays, situation is a bit simpler since when any of the objects are found between the origin of the ray and the light source then the intersection tests for the remaining primitives can be skipped. To give an example a scene involving ten sphere primitives with the screen resolution of 800x600, no reflections and two light resources, the number of intersection tests is;
Table 9 - Brute force is the most primitive method without any additional data structures. As can be seen from the figures given above, the number of intersection tests is highly dependent to the number of objects and the light resources in the scene. The only advantage of the brute force is that the complexity for the organisation of the randomly ordered data structure is zero even if the objects are dynamic. As a result without any arrangement of the objects in the scene it is impossible to reach a speed of real time ray tracing by the use such techniques. The next parts will explain the well known acceleration techniques with their pros and cons. b)Vista Buffering – ZF BufferingVista buffering is also known as ID-rendering. The idea behind these methods is pre-caching the objects in accordance to the corresponding pixels. At the first step the scene needs to be traced just by the primary rays to find out which pixels correspond to which primitives and after each render, the corresponding object is tested, in this manner the probability of finding the closest object increases. The only difference between the ZF buffering and Vista buffering, is that Vista buffering records the bounding volumes in comparison ZF buffering stores a pointer to the object (Raghuvanshi, 2000). However these methods does not provide a significant improvement on rendering time, nevertheless it consumes quite a lot of memory space (extra 4 bytes per pixel) and only improves the primary ray hit test timings (Wald, 2004, p. 31). For these reasons the two methods have not been implemented in this thesis.
c)Bounding VolumesThis idea was firstly introduced by Turner Whitted. He suggested that grouping the objects in some less complex bounding volumes would decrease the number of ray-object intersection tests and fasten ray tracing and he also suggested using spheres as the bounding volumes as computationally the fastest intersection tests can be done with spheres (Glassner, 1989). Basically the idea is, instead of testing all the objects in a scene, testing of the bounding volumes such be done, since their intersection tests are faster, and lastly testing the objects in the bounding volumes one by one. The idea of grouping the objects can really speed up the intersection test due to the distribution of the objects in the scene and the shape of the objects. Sparse scenes or scenes with long objects would not benefit from such a method. In addition, even if the objects fit to bounding volumes perfectly, this method can provide a linear acceleration respect to the number of objects in the scene. Nevertheless, as a result these drawbacks bounding volume hierarchy method is suggested. The method groups objects into bounding volumes and recursively groups the objects in the bounding volumes into other bounding volumes. This method also provides a logarithmic complexity to the solution, instead of using spheres, using bounding boxes also fairly solves the problem of fitting long objects into bounding volumes. However this method still cannot solve the sparse scene problem, besides it has some drawbacks such as colliding bounding volumes which cause extra intersection tests since the inner bounding volumes naturally overlaps with the outer volumes (Havran, 2000). In addition, the efficiency of the intersect tests highly dependent to the distribution of the objects in the scene (Wald, 2004).
Table 10 – a) Bounding Volumes, b) Bounding Volume Hierarchies (taken from www.flipcode.com)
d)Spatial SubdivisionSpatial subdivision is the general expression for dividing the scene into axis-aligned cells (voxel which means volume cell). Unlike bounding volumes these cells does not collapse but full-fills the whole space and each voxel holds the information of the colliding primitives. Thus a ray only traverses the voxels on its path therefore the intersection tests are only done to the objects colliding with the voxels. Moreover it is also possible to predetermine the empty voxels and avoid an unnecessary ray-voxel intersection test which is a very addition to the spatial subdivision algorithms. One of the methods for dividing the scene into voxels, called uniform spatial subdivision, was suggested by Fujimoto (Glassner, 1989). As the name states, the idea is about setting the size of the voxels to a constant value. Defining the voxel sizes constant enables traversing the scene in an incremental manner which saves time for calculations aimed at finding the next voxel to be tested. However this gives rise to the problem of finding the optimum voxel size. If the voxels are too small then the number of voxels increases proportionally with the memory, thus more memory is needed. Moreover a primitive’s ability of being contained by many voxels gives rise unnecessary intersection tests. On the other hand if the voxels are too big in that case obviously more than one object can penetrate into one voxel which causes unnecessary intersection tests. One of the solutions for avoiding duplicate intersection tests of the same objects with a ray is mailboxing. In this method every ray has a unique id and every object has a mailbox which stores the results with the ray ids. Before testing the object against the ray, the mailbox of the object is checked against the ray (Kirk and Avro, 1991), if the id of the ray matches then no intersection test is needed. Though this method reduces the number intersection tests it requires quite a lot of memory since each ray has an id and each object has a mailbox. However more time is needed for searching the mailbox as a result the process can be time consuming.
Figure 15 – Uniform grid (a), and Octree (b) Apart from the regular spatial subdivision there are also non-uniform spatial subdivision methods such as octrees and BSP Trees however these methods will not be discussed in this paper. For detailed explanation of these methods the following authors re recommender; (Glassner, 1989), (Havran, 2000) and (Wald, 2004).
IV.MethodologyAs mentioned in the background chapter acceleration structure of the ray tracer plays a vital role on the performance of ray tracing and should be chosen to suit the hardware specifications being used. In different cases different methods may have various advantages against other methods not only in terms of rendering time but also in terms of usage of other resources such as memory and bus bandwidth. Before giving the reasons why the acceleration structures given in the previous chapter are not included in this project, the chosen acceleration structure and the specification of the hardware will be given. There after the chapter will end with the explanation of the methodology of ray tracing in Cell architecture.
a)KD-TreeKD-tree is a special kind of binary tree data structure for organising points in k-dimensional space (since the graphics applications run on three dimensional spaces the kd-tree in this paper will be representing ‘three dimensional kd-tree’) which provides a multidimensional search. The construction cost for the KD-tree is logarithmic and the average cost of traversing a voxel can be estimated with respect to the number of primitives in the scene (Shevtsov Maxim, 2007). Another important point about KD-trees is its difference from octree which divide the 3d space into constant number of parts. In contrast KD-tree divides the space into unfixed parts with planes perpendicular to one of the coordinate system axes thus differentiating it further from the conventional BSP trees. Since there are different kinds of KD-trees according to the optimisations roughly, the data structure for a node contains two child pointers (just like in binary tree); a Boolean variable showing if it is a leaf node or not, the splitting exists and the splitting position on that axis. The construction algorithm of a KD-tree is;
Table 11 - Steps for constructing a kd-tree, taken from (Havran, 2000) The major problem of constructing a KD-Tree is determination of the division planes since KD-trees are not fixed divided. As shown in figure 16, dividing a scene into two parts can be done in several ways. The problem is finding a more suitable division to increase the efficiency of the KD-tree as well as the ray tracer. The optimisation is based on a simple rule; large areas with few objects or small areas with many objects both work faster (Martin E., 2006).
Figure 16 - Different division methods, taken from lecture slides of Martin Eisemann The cost optimisation formula according to the rule mentioned above is;
Equation 2 - cost equation to split the space with higher efficiency, taken from lecture slides of Martin Eismann Contrary to the computationally expensive construction method of the KD-Tree, traversing is very fast and is simple as the traversing algorithm shows below;
Table 12 - Steps for traversing a kd-tree, taken from (Havran, 2000)
B.Cell Architecture“Since the invention of the integrated circuit in 1958, the number of transistors that can be placed inexpensively on an integrated circuit has increased exponentially, doubling approximately every two years” Quoted from (Moore’s Law – Wikipedia). As shown in the figure below, speed improvement of microprocessors in accordance to the transistor size will end over a certain level as the power density on chips will reach very high levels. Because of this reason, instead of reducing the transistor size and fitting more transistors in the same chip size, the last few decades’ the improvement in the speed of the computer systems is being kept by using multi-thread architectures. Cell is just one the multi-core architecture kinds, which is for example being used in one of the game consoles called PS3 as the main processing architecture. As the multi-processor architectures will be widely used in the near future this project will focus on implementation of ray-tracing on multi-core stream processing architectures.
Table 13 - Energy density table (Taken from IBM Cell Processor Overview Presentation (2006)) As the multi-processor architectures will be widely used in the near future this project will focus on implementation of ray-tracing on multi-core stream processing architectures. Cell is a heterogeneous multiprocessor architecture (Benthin, Wald, 2006 ) compound of eight synergistic processing elements (SPEs also called SPU-synergistic processing unit) and one 64bit PowerPC processing element (PPE also called PPU-Power processing unit), all attached to a high-bandwidth bus called Element Interconnect Bus (EIB) (IBM Cell research, August 2005). PPE is the general purpose main processing unit similar to conventional CPUs having one internal L1 cache and one L2 cache, which is connected to the EIB. PPE also have direct access to the main memory. The power of the cell architecture comes from the SPEs which are specialised cacheless vector processing units with 256KB of local memory for the code as well as for the data, with the clock speed of 3.2 GHz. Instead of caches each SPE has 128bit, 128 registers which also means an SPU can process 128bits of data in one cycle. On the other hand each SPE has its own RICS (Reduced Instruction Set Computer) based instruction set besides there is no branch prediction for the instructions which causes extra fetch time for the code to be run (Benthin, Wald, 2006). In addition to that, the main memory can be used explicitly by DMA (direct memory access) with the speed of 25 GB/s moreover by using the EIB it is also possible to transfer data from one SPE to another with the speed of 300 GB/s.
Figure 17 - Cell Block Diagram (taken from IBM Cell research web page) The specifications given above have a big impact on programming. First of all the limited memory size, fast data transfer speed of the bus and the restricted memory access of the fast synergistic units obviously shows that the architecture is a SIMD (Single instruction Multi data) architecture which fundamentally changes the programming attitude. Secondly, the non-homogeneous structure of the Cell architecture forces the programming side to be divided into heterogeneous processes. On the other hand having the ability of spontaneously transferring data between SPEs gives a big flexibility for distributing the processes between the SPEs. In addition to these, explicit memory access drives the memory management to be done on the SPE side which also has a big impact on the data structures since the physical memory addresses of the data elements are changed after transferring data from the main memory to the SPE local storage. Another impact which hardware has on programming is the vector specific processing structure of the SPEs. Since an SPE can process 128bit of data at a time, theoretically it is possible to reach 25.6 GFlops at 3.2 GHz for single precision (Benthin, Wald, 2006). To achieve this point of efficiency the basic data types should suit the hardware. These kinds of variables called intrinsic variables; intrinsic variables can be processed by intrinsic functions which extends the programming language in a way. For instance instead of using scalar variables, using a set of variables in one vector variable and making calculations on these variables at a time. Therefore the use of these intrinsic functions may increase the processing speed tremendously. However to use these variables efficiently each line of the code needs to be coded according to these functions which results with high level of complexity in the implementation. As a result the Cell architecture provides a significant parallel calculation power. This is exactly what can fasten up ray tracing. On the other hand it changes the programming attitude, data structures and even the simple variable usage. This is because of its vector processing architecture with RISC based instruction set and the restricted memory size. C.Why KD-Tree on CellAs mentioned in the literature review of this thesis the scene data structure plays a crucial role in the three fundamental steps of the ray tracing algorithm as shown in figure-18 on rendering time and also on implementation complexity. However each acceleration structure discussed in the background chapter has different advantages and disadvantages depending on the hardware. The Cell architecture used for this project has two major constraints as mentioned above; 256 KB restricted memory size shared for code as well as for data and un-branch predicted execution order. On the other hand its execution speed, bus speed and fast vector calculation capability are very important attributes of this system.
Figure 18 - Acceleration structure is being traversed in two steps for each ray sent from the camera Because of the memory restriction, Vista buffering method is eliminated since it uses a list of four bytes (for ray pointers or ids) per each object thus consumes lots of space on the memory even though it is just a helper structure nonetheless its improvement on the ray tracing speed is only 10%. In addition to the vista
buffering method, regular grids
also uses quite a lot of memory space depending on the object and the
grid size. For instance, let us assume that the regular grid divides
the space into ten parts for each dimension meaning it divides the
scene into 1000 ( Bounding volume hierarchies method is a less memory extensive method compare to grids however its overlapping volumes does not only mean additional calculations but also means unnecessary branching while traversing the scene creating one of the weak points for execution on the SPUs. In addition to these problems efficiency on BHVs is very dependent to the distribution of the objects. Because of these weak points the BHVs is also eliminated. On the other hand KD-Tree is not a memory demanding acceleration structure as it divides the scene into axis-aligned boxes according to the object distributed in the scene therefore there are no unnecessary boxes is being created. Moreover the number of objects per box can be fixed thus preventing unnecessary object ray calculations per box. In addition to these advantages, as KD-Tree is a special kind of tree, traversing cost of the scene is logarithmic unlike the regular grids besides since none of the bounding boxes overlap with each other therefore the number of branching is less compared to BVHs. The trade off for these improvements is the construction cost of the structure which also effects the traversing efficiency, however since the main purpose of this project is adopting a KDTree implementation to Cell architecture, the problem of ray tracing on dynamic scenes will not be taken into consideration. D.Distribution of Ray Tracing StepsAfter starting the main acceleration structure to be used on the Cell hardware, the first methods were to be considered before the implementation of distributing the tasks between the processors without violating the limited memory (256 KB) areas of the SPUs. Some of tasks for the ray tracing algorithm that will be implemented in this project are;
The planned distributions of the jobs are shown in the figure below.
Figure 19 - Job Distribution As shown in the figure, construction of the KD-Tree structure can only be handled in the host processor. After construction of the tree, the host processor will kick the tree data structure to each stream processor’s local memory. All the threads running on a stream processor will use the same KD Tree data structure and the same camera orientation to create the eye rays (primary rays). Eye rays will be created in the SPU just before traversing the scene thus the computation power of the SPUs on vector operations will be taken into advantage besides no memory space will be used for the primary rays. Each thread will be responsible for a fixed area on the screen. Due to this reason, the range of the pixels which a thread will be responsible for, needs to be assigned for each individual thread. Since the colour values to be displayed on the screen are actually stored in a sequential array, instead of dividing the screen into rectangular regions and assigning the rays corresponding to these regions to the threads, the pixels in the same row are assigned to the threads as shown in the figure 20-1. The main advantage of this method is; since all the result colour values of the pixels are needed to be sent from the SPU memory to the PPU memory, by assigning the pixels in a line to one thread these pixel data can be easily sent as a block of data without any other additional bus usage or memory location calculations (figure 20-2) making this method so advantageous.
Figure 20 - Screen area assignment between threads is done in a very simple manner; each thread traces the rays corresponding to a row of pixels To sum up this chapter, the reasons of choosing the acceleration structure to be implemented on the Cell architecture and the distribution method of the jobs are discussed. In the next chapter implementation of the methods will be given in details. V.ImplementationA.Outline of the Project ImplementationThe main purpose of the project is implementing a ray tracer on one of the recent multi-core architectures called Cell by using KD-Tree acceleration structure. It should be noted that, this project is not aiming to implement a completely new KD-Tree structure, but to adopt a previously implemented KD-Tree structure to Cell, investigate and explore its usage and effects on ray tracing time within the multi-core architecture then comparing the results with the single core PC implementation. Therefore the specifications of the ray tracer are kept as simple as possible. Ideally a complete ray tracer should handle all the rendering steps such as texturing and rendering of the fundamental primitives such as triangles, polygons, triangle meshes even spline surfaces. However due to the time constraints, implementing a whole ray tracer was not possible. Because of that reason instead of creating a complete one ray tracer five different very primitive ray tracers have been implemented handling only spheres and planes without texturing on Cell and also on PC. Besides, the KD-Tree construction and traversing methods in this project are based on Jacco Bikker’s (flipcode, 2004), (bik5, 2008) work with his permission. The implementation of this project is divided into four main steps. In the first step windows version of a very simple ray tracer have been implemented. In this version of the ray tracer none of the acceleration structures are implemented but the objects in the scene are stored in an array without any ordering. There are two main reasons for creating such a premature ray tracer; programmatically understanding of ray tracing and creating a simple framework on which to easily build up the more ray tracers. This early version of the project is kept for testing purposes. The second step is basically adopting the primitive ray tracer to Cell. Implementing this step was one of the most complex steps because of the time needed to understand the programming on Cell architecture. Besides this step can be considered as the main step, provided that design issues cencerning implementation on Cell (such as mapping the main memory to SPU memory issue) have been solved. Third step is relatively the shortest step since the KD-Tree construction methods are based on Jacco Bikker’s implementation as mentioned earlier. However adopting this newly developed implementation to the windows version of the project’s framework was still quite a lot of work. The final and the most important step of this project was the implementation of the KD-Tree structure on Cell. This step didn’t take as long time as expected since the main framework was already ready. However adopting all the data structures to SPUs turned out to be time consuming. These implementation steps and the data structures will be discussed in the rest of this chapter into four parts.
Figure 21 - a) Project Implementation steps, b) Time used for implementation of the steps (approximated)
B.Basic PC ImplementationThis step of the implementation process is mostly based on the creation of the framework and the understanding of the simple ray tracing algorithm for the next steps of the project. Since the ray tracing process has the objects in the scene as the main input data and the pixels as the output data the project code is organised in a similar object oriented structure as shown in the figure 22.
Figure 22 - Organisation of the basic framework (basic data structures such as Colour, Vector and Pixel are excluded) In the structure shown above all the objects are stored in an array inside the Scene class. For that reason the functionality of finding the nearest colliding object in the scene with a ray is also handled by the Scene class. Since there is no acceleration structure for fast ray-object intersections, the implementation of the scene class is very simple and straight forward. Apart from storing the objects in an array and finding the nearest intersection point Scene class does nothing. Planes and spheres are the only primitives implemented in this version of the ray tracer both deriving from the Renderable base class. Renderable is an abstract class having the common attributes of the primitives such as material. The material structure has four attributes these are; colour vector (red, green, blue colour values), diffuse level, specular level and reflection level. All these values must be in the range of zero and one where one is the maximum value. The material class is the main class where the surface properties of the primitives are represented. A Renderable also needs to be identified if it is a sphere or a plane according to the type information for the programming purposes (type cast issue) for that reason Renderable class has the type attribute. The reason why the Renderable class is abstract is because the functions (getDistance(), getNormal(), isIntersect(), getReflectionRay() ) which need to be overridden since the results of these functions varies from one primitive type to another. For instance the plane class getNormal function returns only the normal attribute however the sphere class returns a value after doing some calculations. Since the reflection works according to the normal of a primitive at a certain point, getReflectionRay() function is also needs to be overridden. The most important class of this step is the Raytracer class as the main purpose of this step is simply to have a simple visual result, not accelerating. RayTracer class has two important function these are; generateEyeRays and rayTrace. generateEyeRays function creates all the primary rays before starting the rendering process. This approach consumes quite a lot of space on the memory since for each pixel at least one eye ray is being created and one ray is a compound of two vectors which is a compound of three floating point numbers (coordinates). With the assumption of 4 bytes per floating point number, each primary ray consumes 24 bytes. In that case with the resolution of 600x800, the total memory used only for primary rays is 600x800x24 bytes ≈ 11 MB. On the other hand since the position of camera (also the view plane) does not change every frame, this approach saves from one vector subtraction and one vector normalisation per primary ray per frame. The trade off between the number of calculations and the memory usage is again an issue and in this case the pre-computed primary rays are used since 11 MB of memory space is not a big problem for conventional PCs besides a normalisation function is very time consuming for ordinary CPUs. After giving the purposes of the classes shown in the figure 22, for a better understanding of the working mechanism of the ray tracing algorithm, implementation is going to be explained in execution order. First of all the colour buffer is being created by the VisualOutputProvider (OutputBuffer) according to the resolution. Since the RayTracer class is derived from VisualOutputProvider, the colour buffer automatically exists in RayTracer as a one dimensional array of unsigned integers (colour value for each pixel is 32 bits which corresponds to an unsigned integer size). After creating the empty colour buffer (all the pixels colour values are set to zero which represents the black colour) to be filled by the ray tracer, the next step is creating the scene and filling the scene with the desired objects. In the constructor of the scene class the camera is being created which is composed of the location vector and the view plane. The view plane can be defined by only two vectors; top-left and bottom-right positions in scene space. Up to that point, by creating the colour buffer, setting the camera and the objects in the scene, all the inputs are being prepared for the ray tracing. Consequently the next step is starting the rendering process by calling the render function of the RayTracer. Since each pixel needs to be computed independently, render function calls the raytrace function of the RayTracer for each pixel in the screen with the parameter of the previously prepared primary ray corresponding to the pixel being rendered. This step is simply done in for loop. The first thing that the raytrace function does is finding the nearest intersecting object with the ray and the distance between the starting point of the ray and the object. As explained above the responsible class of finding the ray-object intersection is the scene class, because of this reason raytracer calls the getNearestRenderable function of the Scene class. This function simply tests each object if it intersects with the ray to find out the nearest intersection point. Having the information of the nearest intersection point and the nearest object, raytracer function creates the shadow rays from the intersection point to each light source to find out if any of the objects in the scene occludes the light ray. If there is no object between the light source and the point it means that this point is illuminated by the light source therefore the colour value needs to be calculated according to the material properties of the object and the colour of the light source. If there is an object between the light source and the intersection point, it means that this point is in shadow so there is no need for calculating the colour value. Lastly, in order to trace the ray recursively, the reflection ray gets created according to the normal of the intersection point, which is provided by the object itself (getReflectionRay function), and the ray tracer function calls itself by giving the reflection ray as the parameter if the number of maximum reflection limit is not reached or if the material of the objects is reflective. After reaching the reflection limit, the final colour value gets written to the corresponding pixel point. As a result of that step a very simple ray tracer structure is constructed within about ten days. The main problems encountered during the implementation of this step were mostly very basic but fundamental problems. For instance, at the beginning, instead of using floating point colour values for each channel (red, green, blue) in the ray tracing process, one unsigned integer was used to avoid floating point operations. However this resulted in imprecise result colour values. Finding this problem and fixing it took almost a day. Another problem was about starting point of the light and the reflection rays. Since the ray object intersection algorithm works precisely, the intersection point is always found on the object surface colliding with the object itself however this point should be just a little bit outside of the object otherwise the nearest occluding object of that point is always the object itself. The next steps of this chapter will be explained in an incremental manner to avoid duplications. C.KD-Tree PC ImplementationThis step is an extension of the previous step with an additional KD-Tree structure. Since the Scene class is responsible for the object-ray intersections and the purpose of having a KD-Tree structure is because of the fundamental change made on the Scene class. The main additional classes are KDTree, KDTreeNode and the ObjectList. KDTree class is simply responsible for the construction of the tree structure and storing the root node of the tree. KDTreeNode structure is the main data structure where the axis aligned bounding box is defined. The objects in that bounding box are stored and the two children nodes of the node are stored. In order to store the objects in the bounding box the data structure used is a linked list named ObjectList.
Figure 23 - Class Diagram of KDTree PC Implementation
Moreover, in order to provide the structured input data to the ray tracer. Firstly the data structure needs to be constructed before the rendering process starts. However there is no point of constructing the tree simultaneously as the objects added to the scene. Therefore the construction of the tree is made once, after adding all the objects to the scene. The construction method works as follows; the build function of the KDTree class is called by the scene. In the build function, firstly all the objects are added to the root node of the tree. Thereafter, the subdivision process of the root node starts in order to divide the scene into bounding boxes according to the location and the size of the objects. In subdivide function, firstly the axis is determined by checking the size of the dimensions of the bounding box. In the next step, the objects in the bounding box are split temporarily in different combinations and the split positions are stored. The main purpose of doing that is finding the best splitting position as KDTree cost function is predictable. After finding the best split position the next step is splitting the tree node into two parts and determining which objects go into which child tree node. By splitting the tree node into two parts the previous tree node becomes a non-leaf-node which means the objects list in this node becomes redundant and needs to be cleared, for this reason the clearObjectList is called to clean the memory. After dividing the node into two parts and cleaning the memory the subdivide function calls itself in order to recursively continue the division until a reaching a predefined number of objects per each box. Defining these value small results in deeper trees which means more memory consumption and more tree-traverse time. On the other hand this also results with less ray-object intersection calculations. Traversing the KDTree is done in the getNearestRenderable function of the Scene class. The traversing function starts with clipping the ray by the node’s splitting axis and splitting position. If there are any other child nodes intersecting with the ray segment (as it is clipped) this process is repeated until reaching the leaf node which intersects with the ray segment since only leaf nodes stores the list of objects. However after reaching the leaf node, the objects in the leaf node’s object list are tested against the ray. As a result, instead of testing all the objects in the space against the ray, by finding the corresponding leaf node only the objects in the object list gets tested saving the calculation time. In the KD-Tree PC implementation a Memory class is used for managing the memory and using less memory. However this structure will be discussed in the next section as the purpose of using this Memory class was actually making the adaptation of the PC version to the Cell easier.
D.KD-Tree Cell ImplementationThis step is the step where the PC implementation of the ray tracer is converted into an SPU implementation. As explained in the methodology, the ray tracing code should to be divided into two parts; the first part which runs on the PPU (main processor unit) and prepares the input data structure to be sent to the SPU memory. Then the second part which runs on the SPUs where the output (pixel colour) data is produced. For the data transfer between the memories the DMA library of PS3 is used and for timing purposes as well as for displaying purposes the framework library of the PS3 SDK is used. However since the framework is confidential these parts are extracted from the code and therefore not submitted within this project. Explaining the execution order will provide an easy understanding of the implementation; just like the in the PC implementation, firstly all the objects needs to be added to the scene and the KD-Tree structure needs to be constructed. At this stage a problem is encountered; each tree-node stores its child nodes with a pointer address which is a 64bit number representing the main memory address of the child node. Similarly the object list structure stores the next object list by using a pointer as well. At first sight it didn’t seem to be a problem and was unpredicted before the implementation phase that is why this problem has not been mentioned in none of the previous chapters. The problem of having pointers in the data structures is simply because pointers show points to the main memory addresses. However SPUs can explicitly access the main memory which means that each access SPU has to use DMA calls for transferring the data from the main memory to its local memory and this solution is not really practical since even if the DMA calls were not time consuming calls, there is a possibility that the memory addresses of the objects might change on execution time depending on the compiler’s decision on runtime. However valotile keyword (which is used for guaranteeing that the memory address of a data can never be changed) can be used for all the data structures to avoid this problem but still the slow DMA calls are a problem and valotile keywords may slow down the execution.
Figure 24 - The tree and object list pseudo representation
Figure 25 – An Example memory organisation of the data structures on the PC implementation; black-filled boxes represents object lists, gray filled boxes represents KDTree nodes, and the rounded boxes represents the object data
The implemented solution on this project is instead of having pointer addresses for storing objects’ memory addresses, parsing a chunk of memory area for each data structure and using index values relative to the starting point of the chunk. In this case even if the chunk’s memory address changes after transferring it from the main memory to the SPU memory there will be no problem since the relative addresses will never change. Another pro of using index values instead of pointer addresses is that the size of the index values can be chosen smaller than 64 bits which saves memory space. On the other hand the cost of using small index values is one addition and one bit masking operation (an ‘and’ operation) for each memory access of the data structures.
Figure 26 - An Example memory organisation of the data structures on the SPU implementation; black-filled boxes represents object lists, gray filled boxes represents KDTree nodes, and the rounded boxes represents the object data
Table 14 - Data structure for KDTree and Object list; a) compressed data structure for SPU, b) naive implementation of the data structure As shown in the table-14 by using bit fields the size of the KDTree node is reduced from 18 bytes to 12 bytes (33% compression) and the object list structure reduced from 8 bytes to 4 bytes (%50 compression).
Figure 27 - Execution order of the SPU program The next step of the execution is preparing the argument data structure to be sent to the SPU program main function as a parameter.
Table 15 - Argument data structure to be sent to the SPU program
Apart from the spuID attribute all the other parameters are set with the same values in the initToSpu procedure. After constructing KDTree structure and preparing the argument data structure, execution continues on the SPU side. SPU program firstly, fetches the data structures (objects, object lists and tree nodes) from the effective addresses of these data structures by using DMA calls (Table-16).
Table 16 - An example DMA call for fetching the object lists from the main memory to the SPU memory The fourth parameter of the cellDmaLargeGet function is the channel id field which can be considered as the identity of the transfer. The next line is cellDmaWaitTagStatusAll(1<<(TAG+g_SpuID)); which pauses the execution at this point until the whole data with the given id is transferred. After fetching all the necessary data structures for the ray tracing, the next step is the rendering process which has already been explained in the previous step of the implementation. However there are two differences. First difference is; each primary ray is calculated during execution time since the memory size is restricted and the second difference is; the line on the screen is calculated to be rendered. To give an example, let us assume that there are 600 lines and 6 SPUs are running simultaneously, in that case each SPU has to render 100 lines (lines to be rendered are determined according to the SPU’s ID). The result data (Pixel colour values) is written to a colour buffer and after finishing the render of a line, the colour buffer is sent to the main memory. At this point, if a single buffer is used the execution has to pause until the data is written to the main memory otherwise the colour buffer can be overwritten by the ray tracer before being sent. This problem causes unnecessary delay in the execution as shown in the figure 28.
The solution for this problem is double buffering. By using two buffers, while the first buffer is being sent to the main memory the second buffer can be used to continue the execution. The cone of this method is the memory used for the buffers. Let us assume one SPU needs to render 100 pixels per DMA transfer in this case the memory used for the colour buffers is; 100x2x3bytes (red,green,blue)=600 bytes however the gain from the transfer time is more important compared to the extra 300 bytes of memory space. The code section below shows how the double buffering is applied for waiting the next colour buffer data to be transferred. The first if-else condition is used for identifying the next colour buffer to be waited.
Table 17 - Code part for double buffering, before continuing the ray tracing process waits until the next buffer gets transferred After transferring the colour buffer data, the last step is switching the colour buffer with the next colouring buffer and continuing the execution the same steps until all the lines assigned to the SPU are rendered.
To summarise the implementation process took approximately forty five days of evolutional implementation process. Starting from the most basic implementation of ray tracing and forwarding step by step from the basic ray tracer to the aimed ray tracer enabled better understanding of each step and made sorting out problems easier. On the other hand the implementation process took longer than expected because of the use of the evolutional model for the development process. However at the end the desired implementation was achieved and the data is collected for comparing the Cell and the PC implementation of the ray tracer in order to understand the contributions of parallel ray tracing as well as the KD Tree acceleration structure on the ray tracing performance. Therefore in the next chapter the rendering times of the same scenes between the implemented ray tracers will be discussed.
VI.Project OutcomesIn this chapter of the thesis the rendering time results of the ray tracers will be given and discussed. The hardware specifications of the PC used for PC version of the ray tracers is as follows; Sony Vaio laptop – VGN-FZ18E Intel Core Duo CPU T7100 @ 1.80 GHZ, 32 bits 2 GB Ram Windows Vista Operation SP1, 32 bits However it should be noted that only one CPU is being used for ray tracing purposes since the PC implementations are using single thread.
The specifications of the used Cell architecture are already given in the methodology chapter. A.Basic PC implementation without any acceleration structureIn this part the figures taken from the simple implementation of the ray tracer will be given and discussed to investigate the effects of the factors such as number of objects and number of reflections on ray tracing time.
Chart 1 - Rendering times for the basic ray tracer on PC As shown in the line chart above, the rendering time increases linearly respect to the number of balls in the scene since the number of intersection tests increases linearly respect to the number of balls as well. With this simple implementation without using any acceleration methods after six balls the number of frames per second goes below 20 which shows that for a fast ray tracer number of intersection tests per ray should be under 6 for the used hardware.
Chart 2 - Times used for intersection tests, for the basic ray tracer on PC
Chart 3 - Percentage of the intersection calculation times over the whole ray tracing The two charts above show that the time consumed for intersection tests increases linearly and the percentage of the time consumed by the intersection tests is in the range between 87% and 99% as stated in the background part of this thesis. The main reason why the pattern of the figure above seems like steps might be because of cache misses.
Table 18- testing result for different number of reflections
Chart 4 - Effect of the reflection limit on ray tracing time. Horizontal axis represents the number of reflections Compare to the previous results rendering time takes longer time since the objects used in the scene are not only spheres but also planes which covering the whole screen which means each primary ray reflects from and the secondary rays are being created. In addition to that the chart above shows that after a certain level of reflection, rays do not reflect any more since they hit an empty space eventually and the tracing process ends for most of the rays before these rays get reflected more than 10 times.
Chart 5 - Effect of the reflection limit on the percentage of the intersection test over the rendering time. The chart given above shows the percentage of the intersection time respect to the total rendering time. Surprisingly the figure starts from 41% and goes up to 47% which are lower values compare to the previous test result. The main reason for that might be the time consumed by calculating the normal and the reflection ray at the hit point, since all the primary rays hit an object in the scene more time is being used for the reflection rays. This show that the time consumed by the rendering process also depends on the type of the objects and the distribution of the objects in the scene. B.KD-Tree implementation on PC
Figure 30 - 1001 spheres, 10 reflections, resolution 800x600, 1 light source; rendering takes 35 seconds In this section, in order to test the effects of the KD-Tree structure on ray tracing some figures will be given such as rendering time depending on the number of spheres in the scene as well as depending on the number of reflections.
Chart 6 - Change of the ray tracing time accoding to the number of the objects in the scene for KD-Tree PC implementation
Chart 7 - Ray tracing time obviously increases as the number of objects in the scene increases however because of KDTree acceleration structure the increasing rate decreases since the complexity of traversing the KDTree structure is logarithmic The data above shows the logarithmic increasing pattern of the ray tracing with KD-Tree acceleration structure. As the number of objects in the scene increases linearly; contrary to the results of the ray tracing times without using any kind of acceleration structure, which are given in the previous part of this chapter, increase in time is not linear. Moreover after 60 spheres the increase rate tends to slow down which was mentioned in the methodology chapter of this thesis while giving the reason why KD-Tree structure is chosen for this thesis.
Figure 31 - Ultra reflective scene; completely covered by reflective sphere walls which do not allow early termination of the rays before 10 reflections. 85 spheres all reflective, 10 reflections limit, 800x600. Rendering of this scene took 8 seconds to render on the specified PC hardware. The same scene is rendered in approx. 5 seconds by using 1 SPU. C.KD-Tree implementation on SPUIn this part, first of all a comparison between one SPU render time with the CPU render time will be given. After, to expose the importance of multi-core architecture on ray tracing the render times collected by using different number of SPUs will be given.
As shown in the table above, for an empty scene the rendering time for one SPU takes 133 milliseconds however PC version renders the empty scene in about 62 milliseconds. The main reason of the delay for the PC implementation time is looping through the primary rays to be rendered and sending the pixel information to the graphics card to be displayed on the screen. On the other the main reasons of the delay in the SPU execution is the transfer time of the pixel data from the SPU memory to the PPU memory besides since the SPU code does not use pre-calculated rays, all the primary rays need to be created in real time which means one normalisation operation per ray is done extra compare to the PC run.
Chart 8 - Render time comparison between 1 SPU and PC, both using KD-tree and exactly the same scene
Chart 9 - The difference in rendering times between the PC and the Cell tends to decrease after a certain number of primitives in the scene The rendering time difference between runs peaks at twenty spheres with 256 milliseconds. After this peak point the difference falls rapidly and continues falling. The reason for having such a peak point is; while traversing the tree, for each access of the memory the SPU code does one addition and one bit masking operation in addition to the tree traversing operation because of the reasons explained in the implementation chapter. On the other hand the on the PC code no such kind of operations are being done. One of the reason the of rapid fall down after 20 spheres might be the cache misses on the PC side however this should not be the main reason. One more important reason is, as the number of spheres in the scene increases the tree structures get wider and the memory access operations reduces exponentially which makes the SPU run faster compare to the PC run. Moreover increasing the number of spheres in the scene results with closer spheres and this may cause having more than one sphere per tree leaf node which means more ray-object intersections. Since SPUs are vector processors, ray-object intersections are made much faster compare to regular CPUs. After fifty spheres, the SPU code runs fasters and continues getting faster compare to the CPU.
Table 19 - Showing the change in render time depending on the number of the SPUs used. Time values are in milliseconds
Chart 10 - Since ray tracing is very suitable for parallelisation, increasing the number of the SPUs decreases the rendering time As the chart shows above, increasing the number of SPUs apparently speeds up the process however for understanding the effectiveness of the chosen method for distributing the render areas between the SPUs; the effectiveness table should be examined.
Table 20 - Efficiency of using more than one SPU The efficiency of the rendering tends to decrease as the number of the SPUs increase as shown in the chart below.
Chart 11 - Worst and best efficiency cases for 6 SPUs The main reason for the decreasing efficiency is the method chosen for distributing the screen into SPUs to be rendered. As explained in the methodology part every SPU renders one line of the screen and each SPU renders constant number, pre-assigned lines depending on the height of the desired result image. The problem of this approach is, rendering of some of the lines assigned to one SPUs may take longer than the other SPUs. In other words the total rendering time of one of the SPUs may take longer than the rest of the SPUs. In that case all these SPUs have to wait until all the slowest render part gets completed (Chart-12). As the scene gets more complex, the efficiency increases since the proportion of the time waited by the SPUs to the total rendering time decreases.
Chart 12 - Rendering speed depends on the slowest part of the screen; number in the boxes represents the pixel numbers. D.SummaryIn summary, with the help of the rendering time data collected from ray tracers, firstly it is shown that using KD-Tree acceleration structure increases the rendering speed tremendously. Besides the render time increases in a logarithmic fashion which means the increase rate in rendering time decreases as the number of the objects in the scene increases. Secondly, it is exposed that using multi-SPUs fastens the rendering speed with an acceptable level of efficiency. On the other hand figures proved that the method used for synchronising the SPUs plays a key role on the ray tracing efficiency. Moreover, the lack of efficiency in the method used for the implementation of this project is also revealed since the efficiency of the ray traced dropped down up to 70%.
VII.Conclusion
The research has proliferated the ray tracing terminology augmented by the implementation presented by this project. The initial study has helped gain valuable knowledge about ray tracing and its applications in different domains. It has also helped realise the quintessential requirements posed by ray tracing technology and hence motivated to find alternatives that can satisfactorily fulfil these requirements. In this dimension knowledge of different hardware architectures has been gained. And the implementation for the cell architecture was achieved. Even the basic application offered by this project demonstrates the effectiveness of the KD-Tree data structure for ray tracing on Cell architecture. The implementation samples presented by this research show promising improvements in ray tracing for animation and movie industries and at the same time pave the path for its applications in computer games. Finally we conclude that ray tracing is a promising technology in computer graphics field.
VIII.References
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||