GLSCENEINT: A synthetic image generation tool for testing computer vision systems

Silviu Bota and Sergiu Nedevschi
Technical University of Cluj-Napoca
{Silviu.Bota,Sergiu.Nedevschi}@cs.utcluj.ro

Abstract

This paper presents a synthetic image generation tool that is useful for testing computer vision algorithms. The application interprets scene description files written in a language we developed. Image rendering is done using OpenGL, as opposed to the more usual ray-tracing techniques, in order to gain speed. An arbitrary sub-pixel accuracy can be used. The synthetic image generation application is useful for testing calibration methods, edge extraction, stereo reconstruction methods, lane detectors, grouping etc.

1. Introduction

1.1. Motivation

Developing a computer vision application is a difficult and error-prone job. Testing it proves to be even more difficult. Acquiring large amounts of images, with varying parameters, in controlled environments is a time-consuming and expensive operation. It is also impractical to have large, controlled environments such as long roads with various curvatures. Also, measuring landmarks in such environments is a process in which human error can lead to large measurement errors. Instead, we proposed and implemented a synthetic image generation application, which we named "GLSCENEINT" (openGL SCENE INTerpreter). The name comes from the fact that we use OpenGL as our renderer, we use a custom scene description language in order to describe our artificial environment, and the whole application acts as an interpreter for files written in this scene description language. On top of our scene interpreter we developed some high-level applications, usually written in MATLAB, that can generate high-level structures such as roads, based on a NURBS model [1], or can simulate motion by varying the position of the camera.

1.2. Related Work

The usual choice for a synthetic image generator is to use a ray tracer. Ray tracing gives more freedom in choosing the shapes that can be rendered (any 3D volume, intersections and unions of 3D volumes, etc.). It can also simulate complicated effects such as inter-reflections (up to any limit we choose), lens distortions etc. However, as anyone who has used POV-Ray [2] has observed, ray-tracers are also painfully slow. Therefore, it is impractical to use ray tracing to generate long sequences of images. It is also hard to use applications developed for purposes other than testing, such as 3D Studio Max. The problem here is that the coordinate systems and camera models used by these applications may not match our models. We also found that it is usually hard to convert parameters between our model and the model used by these applications, as they are more concerned with the visual aspect of the generated images than with strict adherence to some explicit model. Another nice feature of our system is that we also obtain dense range images associated with the intensity images. This is very useful in testing stereo reconstruction algorithms, and also systems that rely on a dense stereo machine (e.g. TYZX [3]).

1.3. Contributions

The system we developed is unique, as it permits easy synthetic image generation for testing a large number of computer vision applications. The camera model that GLSCENEINT uses (intrinsic and extrinsic parameters) is very precise. The accuracy of the generated images is programmable, and it can achieve any sub-pixel precision required. We tested our system at 1/256 sub-pixel precision, and even higher precision can be achieved (the actual limit is 1/4096 pixels). Also, on top of GLSCENEINT we developed a road generation framework, based on a NURBS model, which can generate roads of arbitrary curvature and length by specifying just a few control points. We also developed a framework for simulating motion, at arbitrary speed and frame rate. The fact that we also obtain dense range images is very important in testing stereovision algorithms.

1.4. Paper structure

In Section 2 we present our scene description language. Section 3 discusses interpretation and OpenGL rendering of the scene description files. Section 4 describes road and motion generation. Section 5 shows some examples of synthetic images. In Section 6 we draw our conclusions and present possible future work.

2. The scene description language

In order for our system to be useful it has to be easily programmed. Before developing this system our team developed a ray tracer that could render scenes by using only rectangle primitives. It soon became clear that such a system is hard to use, because every single rectangle in the scene had to be manually specified by giving its vertices. The next attempt was a system that could render polygons with an arbitrary number of edges, but this too was very hard to use. The current system also renders polygons, but it is also capable of generating triangle and rectangle strips and triangle fans. It is also capable of rendering textured objects. To make the system really easy to use we developed a specialized programming language that can be used to describe a hierarchical artificial scene. This language mimics the classical programming technique used when programming OpenGL applications using display lists. Each object type of the scene is described in its own file and is actually captured as a display list. The objects are somewhat parametrized, because their appearance is controlled by the current rotation, translation, scaling and texture. It is also desirable to reuse objects once defined. Therefore, we came up with the concept of a scene library. This library contains object descriptions for objects that appear frequently in our scenes, for example road segments, road markings, cars, trees etc. It is useful to have a clear and intuitive organization of this library. We considered that the hierarchical organization into packages such as the one used in Java or Python is excellent for our purposes. So we came up with the following design for the library:

1. Each object is described in its own file, with a .scene extension.

2. Objects are organized hierarchically, this hierarchy being defined by the directory tree that stores the objects.

3. Object naming is defined by the directory tree in which the objects are stored. For example, the object described in the file "road/tex/vegetation/Tree.scene" is named "road.tex.vegetation.Tree".

4. There can be multiple roots of the directory tree. They are specified as parameters to the GLSCENEINT program or as a list of paths in an environment variable.

5. Any object can include other objects, specified by their fully qualified name. The included objects are rendered based on the current rotation, translation, scaling and texture.

6. Any object can save the current state, modify it and subsequently restore it, if desired.

7. Textures are also considered objects. They are stored in .tga image files, having 4 channels (RGB and transparency). Currently, the transparency is only used to select between transparent and opaque values (there are no intermediate transparency levels).

8. There must also be a top-level object that represents the whole scene. GLSCENEINT reads the top-level object, invokes on demand all other objects that are used by the top-level object and renders them.

9. In order to save speed, each object encountered is compiled into a display list and, on subsequent invocations, the compiled form is used. This also applies to textures.

The primitives that can be rendered are:

1. Polygons, specified by the poly{...} instruction

2. Rectangle strips, specified by the qstrip{...} instruction

3. Triangle strips, specified by the tstrip{...} instruction

4. Triangle fans, specified by the tfan{...} instruction

Each primitive can contain an arbitrary number of vertices (in practice there are some limitations that must be observed in terms of the minimum number of vertices accepted and their parity). The vertices are specified by a vert(...); instruction, which can contain 1, 2, 3 or 4 coordinates; coordinates that are not specified get implicit values. A color can be given by col(...); and texture coordinates by tex(...);. There are also instructions that select global behavior, such as the global color, the texture/color mode, the current transform (rotation, translation, scaling or general transforms), the current texture etc. Another group of instructions is used for saving and restoring the current state and for including other objects.
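As an illustration of points 3 and 4 above, the following C sketch shows how a fully qualified object name could be resolved to a .scene file by trying each library root in turn. The function name resolve_scene_path and the environment variable GLSCENE_PATH are assumptions made for this example, not names taken from GLSCENEINT.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: map a qualified name such as
 * "road.tex.vegetation.Tree" to "<root>/road/tex/vegetation/Tree.scene",
 * trying each root taken from a PATH-like environment variable.
 * The variable name GLSCENE_PATH is an assumption for illustration. */
static int resolve_scene_path(const char *qualified, char *out, size_t outlen)
{
    const char *roots = getenv("GLSCENE_PATH");
    if (!roots)
        roots = ".";                          /* fall back to the current directory */

    char copy[4096];
    strncpy(copy, roots, sizeof(copy) - 1);
    copy[sizeof(copy) - 1] = '\0';

    for (char *root = strtok(copy, ":"); root; root = strtok(NULL, ":")) {
        /* Rebuild the relative path: dots become directory separators. */
        char rel[1024];
        strncpy(rel, qualified, sizeof(rel) - 1);
        rel[sizeof(rel) - 1] = '\0';
        for (char *p = rel; *p; ++p)
            if (*p == '.')
                *p = '/';

        snprintf(out, outlen, "%s/%s.scene", root, rel);

        FILE *f = fopen(out, "r");            /* first root that contains the file wins */
        if (f) {
            fclose(f);
            return 1;
        }
    }
    return 0;
}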



3. Interpretation and OpenGL rendering of artificial scenes

3.1. The interpreter

Because of the relatively high complexity of our scene description language, and because we want to be able to further extend it, we created a parser using the well known LEX and YACC utilities. The parsers generated by these utilities are written in C, are fast and are portable across different platforms, which is advantageous. Each compilation unit (scene file) is compiled separately and transformed into a display list. Most commands have a simple one-to-one mapping to OpenGL commands. The display lists are called in the sequence specified by each scene description file. Our system is capable of generating monocular, binocular and trinocular scenes, according to the number of cameras specified. When in binocular or trinocular mode, each image is rendered separately, by changing the camera's parameters. These parameters are stored in separate files for each camera. There is also a file that describes the resolution of the generated images.
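The one-to-one mapping can be pictured with the following hedged C sketch, which compiles a single textured polygon into a display list. The correspondence between col/tex/vert and glColor/glTexCoord/glVertex is our illustrative reading of the language, and compile_object is a hypothetical name.

#include <GL/gl.h>

/* Sketch: how the interpreter might translate one scene-language primitive
 * into OpenGL calls inside a display list.  In GLSCENEINT the equivalent
 * actions are produced by the LEX/YACC-generated parser. */
static GLuint compile_object(void)
{
    GLuint list = glGenLists(1);
    glNewList(list, GL_COMPILE);          /* each .scene file becomes one list */

    /* poly { col(...); tex(...); vert(...); ... } */
    glBegin(GL_POLYGON);
    glColor3f(0.5f, 0.5f, 0.5f);          /* col(0.5, 0.5, 0.5); */
    glTexCoord2f(0.0f, 0.0f);             /* tex(0, 0);          */
    glVertex3f(-1000.0f, 0.0f, 0.0f);     /* vert(-1000, 0, 0);  */
    glTexCoord2f(1.0f, 0.0f);
    glVertex3f(1000.0f, 0.0f, 0.0f);
    glTexCoord2f(1.0f, 1.0f);
    glVertex3f(1000.0f, 0.0f, 5000.0f);
    glTexCoord2f(0.0f, 1.0f);
    glVertex3f(-1000.0f, 0.0f, 5000.0f);
    glEnd();

    glEndList();
    return list;      /* later invocations just call glCallList(list) */
}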

3.2. GLSCENEINT to OpenGL model mapping

In order to have an exact camera model, the intrinsic and extrinsic parameters of each camera must be mapped to OpenGL concepts such as the projection matrix and viewport. The camera model we use is defined by:

$P_{cam} = M(R P_{world} + T)$,  (1)

where $R$ and $T$ are the rotation and translation from the world's system to the camera's system ($R$ is a 3x3 matrix and $T$ is a 3-vector) and $M$ is the projection matrix, defined as:

$M = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}$  (2)

Here $f_x$ and $f_y$ represent the focal length measured in horizontal and vertical pixel units and $c_x$ and $c_y$ are the coordinates of the camera's optical center. This model, although very suitable for computer vision applications, is very different from the model used by OpenGL. For starters, OpenGL uses homogeneous coordinates (4x4 transformation matrices and 4-vectors). Another difference is that only two transforms are used, the modelview and the projection. The equations OpenGL uses are thus [4]:

$P_{eye} = T_{modelview} P_{world}$  (3)

$P_{clip} = T_{projection} P_{eye}$  (4)

$P_{ndc} = \begin{pmatrix} X_{ndc} \\ Y_{ndc} \\ Z_{ndc} \\ 1 \end{pmatrix} = \dfrac{P_{clip}}{P^{w}_{clip}}$  (5)



$\begin{pmatrix} P^{x}_{image} \\ P^{y}_{image} \end{pmatrix} = \begin{pmatrix} \dfrac{(P^{x}_{ndc} + 1)\,width}{2} \\ \dfrac{(P^{y}_{ndc} + 1)\,height}{2} \end{pmatrix} + \begin{pmatrix} Viewport_x \\ Viewport_y \end{pmatrix}$  (6)

where $P_{world}$ are the point's coordinates in the world's system, $P_{eye}$ are the point's eye-space coordinates, $P_{ndc}$ are the normalized device coordinates and $P_{image}$ are the final image coordinates. A few things must be considered here, such as the discrepancy between the OpenGL coordinate system (the negative Z axis extends from the eye toward infinity, the Y axis points up and the X axis points to the left) and our system (Z positive from the eye to infinity, Y down and X left). The viewport's position ($Viewport_x$, $Viewport_y$) and size ($width$, $height$) also need to be considered. The way OpenGL performs clipping is important as well. Although the technical OpenGL manual specifies that clipping is performed in eye space using the 6 planes of the viewing volume [5], we can consider (for the sake of simplicity) that clipping is performed in the normalized device coordinates space, which (when not considering things such as lights) is actually the same thing. In NDC space, the clipping planes have the simple equations $X = \pm 1$, $Y = \pm 1$, $Z = \pm 1$. Therefore, we need to make sure that our near and far clipping planes are mapped to $Z = -1$ and $Z = +1$ respectively. In order to obtain the correct mapping between our model and OpenGL's model we make the following substitutions:

$f^{1}_{x} = \dfrac{2 f_x}{Width}$  (7)

$f^{1}_{y} = \dfrac{2 f_y}{Height}$  (8)

$c^{1}_{x} = \dfrac{2 c_x}{Width} - 1$  (9)

$c^{1}_{y} = \dfrac{2 c_y}{Height} - 1$  (10)

$\alpha = \dfrac{Z_{near} + Z_{far}}{Z_{near} - Z_{far}}$  (11)

$\beta = \dfrac{2 Z_{near} Z_{far}}{Z_{near} - Z_{far}}$  (12)

$T_{projection} = \begin{pmatrix} f^{1}_{x} & 0 & c^{1}_{x} & 0 \\ 0 & f^{1}_{y} & c^{1}_{y} & 0 \\ 0 & 0 & -\alpha & \beta \\ 0 & 0 & 1 & 0 \end{pmatrix}$  (13)

$T_{modelview} = \begin{pmatrix} R & T \\ 0 & 1 \end{pmatrix}$  (14)

$Viewport_x = Viewport_y = 0$  (15)

One can verify that, after performing the above substitutions, the transforms happen according to our model, and clipping is performed correctly for the planes $Z = Z_{near}$ and $Z = Z_{far}$.
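A minimal sketch of how substitutions (7)-(15) might be handed to the fixed-function OpenGL pipeline is given below; apply_camera is a hypothetical helper, and the OpenGL-specific detail worth noting is that glLoadMatrixf expects its 16 elements in column-major order.

#include <GL/gl.h>

/* Sketch: load T_projection (eq. 13) and the viewport (eq. 15) into OpenGL.
 * fx1, fy1, cx1, cy1, alpha and beta are assumed to be precomputed with
 * equations (7)-(12); Width and Height are the image resolution. */
static void apply_camera(float fx1, float fy1, float cx1, float cy1,
                         float alpha, float beta, int Width, int Height)
{
    /* OpenGL stores matrices column by column. */
    const float proj[16] = {
        fx1,  0.0f,  0.0f,   0.0f,     /* column 1 */
        0.0f, fy1,   0.0f,   0.0f,     /* column 2 */
        cx1,  cy1,  -alpha,  1.0f,     /* column 3 */
        0.0f, 0.0f,  beta,   0.0f      /* column 4 */
    };

    glMatrixMode(GL_PROJECTION);
    glLoadMatrixf(proj);

    /* Viewport_x = Viewport_y = 0 (eq. 15). */
    glViewport(0, 0, Width, Height);

    /* The modelview matrix (eq. 14) would be loaded the same way from R and T. */
    glMatrixMode(GL_MODELVIEW);
}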

3.3. Sub-pixel accuracy anti-aliasing

In order to obtain realistic images, we must perform anti-aliasing of the whole generated scene [6]. Most modern video adapters have multi-sampling capabilities. These, however, are limited to a relatively small number of samples, and although they provide a fast anti-aliasing method, they do not provide the precision we need for our system. We used the method described in [6] for full-scene anti-aliasing. This method renders each image multiple times, each time with a small, sub-pixel displacement. The results are averaged using the accumulation buffer. In effect, this method gives results similar to rendering the image at a larger size, filtering it with a box filter and then re-sampling it by decimation. In order to cope with the artifacts introduced by box filters, we also use Gaussian filtering. The results show very small differences, but they may be important when testing sub-pixel accuracy edge detectors. Unfortunately, usual accumulation buffers are limited to 16-bit precision. This means that we cannot use anti-aliasing windows that contain more than $2^{16-8} = 256$ samples, so the maximum window size when using such an accumulation buffer would be 16x16 pixels, which is not large enough for some applications. To cope with this problem, we also use a custom, 32-bit, software accumulation buffer. This increases the number of possible samples per pixel to $2^{32-8} = 16777216$, or a 4096x4096 window, which should be sufficient for all our accuracy needs. There is some speed loss associated with software accumulation, but that is not a big concern because, for such a high number of samples, the rendering itself already takes a long time. For most testing purposes, sampling with 4x4 windows is usually sufficient.

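A hedged sketch of this jittered, accumulation-buffer scheme follows. draw_scene and set_subpixel_offset stand in for GLSCENEINT's actual rendering and sub-pixel camera shift; the uniform 1/(n*n) weight corresponds to the box filter, while a Gaussian variant would simply pass per-sample weights to glAccum.

#include <GL/gl.h>

/* Sketch of n x n sub-pixel anti-aliasing with the accumulation buffer.
 * draw_scene() and set_subpixel_offset() are placeholders for the real
 * rendering and for shifting the projection by a fraction of a pixel. */
extern void draw_scene(void);
extern void set_subpixel_offset(float dx, float dy);   /* offsets in pixels */

static void render_antialiased(int n)
{
    glClear(GL_ACCUM_BUFFER_BIT);

    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            /* Regular sub-pixel grid; each sample contributes 1/(n*n). */
            set_subpixel_offset((i + 0.5f) / n - 0.5f,
                                (j + 0.5f) / n - 0.5f);
            glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
            draw_scene();
            glAccum(GL_ACCUM, 1.0f / (n * n));
        }
    }
    glAccum(GL_RETURN, 1.0f);   /* write the averaged image back */
}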
3.4. Depth recovery

As we said in the introduction, it is important that we can generate ground truth depth for each point in the synthetic images. We need this depth in order to test stereo reconstruction algorithms, and to obtain a "perfect pseudo-reconstruction algorithm" for testing high-level vision algorithms, such as grouping, tracking and lane detection, which use sparse or dense stereo reconstructed points.

Fortunately, OpenGL itself needs the depth when rendering scenes, so it keeps it in the "depth buffer". We only need to read this depth buffer and recover the depth from it. There are, of course, some complications. The depth is mapped from the $[-1, +1]$ NDC interval to the $[0, 1]$ depth buffer range. We need to transform it back into the camera's coordinate system. This transformation is given by:

$Z_{eye} = \dfrac{\beta}{2(Depth - 0.5) + \alpha}$  (16)

We also have an option to transform these coordinates from the camera's coordinate system to the world system, recovering a cloud of points, by using:

$\begin{pmatrix} X_{eye} \\ Y_{eye} \\ Z_{eye} \end{pmatrix} = \begin{pmatrix} \dfrac{(x - c_x) Z_{eye}}{f_x} \\ \dfrac{(y - c_y) Z_{eye}}{f_y} \\ Z_{eye} \end{pmatrix}$  (17)

and

$P_{world} = R^{T} (P_{eye} - T)$  (18)
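A minimal sketch of this depth recovery, assuming the $\alpha$ and $\beta$ of equations (11) and (12) are already known, could look as follows; read_range_image is a hypothetical helper and error handling is kept to a minimum.

#include <GL/gl.h>
#include <stdlib.h>

/* Sketch: recover per-pixel camera-space depth from the OpenGL depth buffer
 * using equation (16).  Note that glReadPixels returns rows bottom-up. */
static float *read_range_image(int width, int height, float alpha, float beta)
{
    float *depth = malloc((size_t)width * height * sizeof *depth);
    float *range = malloc((size_t)width * height * sizeof *range);
    if (!depth || !range) { free(depth); free(range); return NULL; }

    glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT, depth);

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float d = depth[(height - 1 - y) * width + x];  /* flip to top-down rows */
            /* Equation (16): from depth-buffer values back to camera-space Z. */
            range[y * width + x] = beta / (2.0f * (d - 0.5f) + alpha);
        }
    }

    /* Equation (17) then gives the full 3D point for pixel (x, y):
     * Xeye = (x - cx) * Zeye / fx,  Yeye = (y - cy) * Zeye / fy. */
    free(depth);
    return range;
}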

4. Road and motion generation

In order to test more complicated vision algorithms, present in vision-based driving assistance systems, we must generate complicated structures such as roads with varying curvature, and be able to move the virtual camera along such roads. We chose not to integrate more complicated primitives such as curved surfaces into our scene generation language, but rather to build separate applications that generate scene files from a higher level description. These applications generate curved road description files by using a large number of rectangles (a long qstrip{...} construct). We need to have a model of the road itself. Initially, we generated only straight roads and fixed curvature roads. These can easily be built using a small number of equations, but they are very special cases and do not occur in reality very often. We therefore turned our attention to Bézier, B-Spline and finally NURBS curves. The last category is sufficiently general, so we used this model. A clamped NURBS curve is given by a small number of 3D control points and associated weights. The resulting curve passes closer to points having higher weights. On top of this curve, we generate a flat band of constant width, which is the road. Currently we do not simulate the inclination of the road in curves. In order to simulate motion, we move the camera's position along the NURBS curve. The speed is interpolated linearly between a number of control points. The frame rate is programmable.
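Since the road generators are separate, higher-level scripts (typically MATLAB, as noted in the introduction), the following C sketch only illustrates the underlying computation: evaluating a clamped NURBS curve [1] at a parameter value to obtain a road centerline point. The function names, the recursive basis evaluation and the sampling strategy are illustrative assumptions, not GLSCENEINT code.

/* Illustrative Cox-de Boor evaluation of a clamped NURBS curve.
 * ctrl[0..n] are the 3D control points, w[0..n] their weights,
 * U[0..n+p+1] the clamped knot vector, p the curve degree. */
typedef struct { float x, y, z; } vec3;

static float basis(int i, int p, float u, const float *U)
{
    if (p == 0)
        return (u >= U[i] && u < U[i + 1]) ? 1.0f : 0.0f;

    float left = 0.0f, right = 0.0f;
    if (U[i + p] > U[i])
        left = (u - U[i]) / (U[i + p] - U[i]) * basis(i, p - 1, u, U);
    if (U[i + p + 1] > U[i + 1])
        right = (U[i + p + 1] - u) / (U[i + p + 1] - U[i + 1]) * basis(i + 1, p - 1, u, U);
    return left + right;
}

static vec3 nurbs_point(float u, int n, int p,
                        const vec3 *ctrl, const float *w, const float *U)
{
    /* Rational combination: points with larger weights pull the curve closer. */
    vec3 c = { 0.0f, 0.0f, 0.0f };
    float denom = 0.0f;

    for (int i = 0; i <= n; ++i) {
        float b = basis(i, p, u, U) * w[i];
        c.x += b * ctrl[i].x;
        c.y += b * ctrl[i].y;
        c.z += b * ctrl[i].z;
        denom += b;
    }
    if (denom > 0.0f) {
        c.x /= denom;
        c.y /= denom;
        c.z /= denom;
    }
    return c;
}

/* A road generator would sample u over [U[p], U[n+1]) at a fixed step,
 * offset each centerline point sideways by half the road width, and emit
 * the resulting vertex pairs as one long qstrip{...} in a .scene file. */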

Figure 1. Camera calibration test image

Figure 2. Stereo reconstruction test image

5. Results

We used our system to test a large number of vision algorithms. These include:
• Camera calibration (Fig. 1)
• Rectification
• Stereo reconstruction (both for canonic and general geometries, binocular and trinocular [7]) (Fig. 2)
• Dense stereo reconstruction and dense stereo algorithms (Fig. 3)
• Points grouping (Fig. 4)
• Tracking
• Lane detection [8] (Fig. 5)

Figure 3. Dense stereo test image

• Sub-pixel edge detection

We also made sure that the mapping between our projection model and OpenGL's projection is right, by projecting a number of test points and comparing the results with the ones predicted by our projection equations.
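A check of this kind could be sketched as follows; it compares gluProject, which uses whatever matrices are currently loaded in OpenGL, against the pinhole model of equations (1) and (2). The helper name, the 0.5 pixel tolerance and the exact vertical flip convention are assumptions made for illustration.

#include <GL/gl.h>
#include <GL/glu.h>
#include <math.h>

/* Sketch: compare OpenGL's projection of a test point with the prediction of
 * the camera model Pcam = M(R Pworld + T).  'height' is needed because
 * gluProject returns window coordinates with Y pointing up. */
static int check_projection(const double R[3][3], const double T[3],
                            double fx, double fy, double cx, double cy,
                            int height, const double Pworld[3])
{
    /* Prediction from the computer-vision model. */
    double Pc[3];
    for (int r = 0; r < 3; ++r)
        Pc[r] = R[r][0] * Pworld[0] + R[r][1] * Pworld[1]
              + R[r][2] * Pworld[2] + T[r];
    double u = fx * Pc[0] / Pc[2] + cx;
    double v = fy * Pc[1] / Pc[2] + cy;

    /* Projection through the matrices currently loaded in OpenGL. */
    GLdouble model[16], proj[16], winx, winy, winz;
    GLint view[4];
    glGetDoublev(GL_MODELVIEW_MATRIX, model);
    glGetDoublev(GL_PROJECTION_MATRIX, proj);
    glGetIntegerv(GL_VIEWPORT, view);
    gluProject(Pworld[0], Pworld[1], Pworld[2],
               model, proj, view, &winx, &winy, &winz);

    /* Window Y grows upward, image Y downward (convention assumed here). */
    double du = winx - u;
    double dv = (height - 1 - winy) - v;
    return fabs(du) < 0.5 && fabs(dv) < 0.5;   /* tolerance chosen arbitrarily */
}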

6. Conclusions and future work

We managed to develop a very useful test tool for computer vision algorithms. Most algorithms can be tested by simply writing a scene description file, generating the test images with GLSCENEINT and comparing the scene parameters with the ones obtained by the tested algorithm. GLSCENEINT is a robust and easy to use tool. It can produce very high quality images by using large anti-aliasing windows. It helped us detect and correct errors that are inevitably made when programming high complexity computer vision applications.

Figure 4. Object grouping test image

Figure 5. Lane detection test image

In the future we want to extend mostly our road generation capabilities. Top priorities include generating roads that are automatically inclined in curves, and generating sequences in which the pitch of the camera oscillates as it would if the camera were placed in a moving vehicle. To do this we must first build a faithful model of a vehicle. We would also like to have other moving vehicles in our scenes. Another nice feature would be the possibility to generate road junctions, intersections, variable width roads etc. We would also like to be able to easily simulate urban environments, and to extend our scene description language to make object parametrization easier. Support for image formats other than TGA for textures would also be a nice feature, and we would like to include advanced techniques such as lights, environment mapping, shadows etc.

Figure 6. Road marks and signs

trans(-1500, 0, 0);
call("road.tex.roads.Curve400m");
push;
trans(-8000,0,15000);
call("road.tex.vegetation.Birch");
pop;
push;
trans(-9000,0,30000);
call("road.tex.vegetation.Tree1");
pop;
push;
trans(-10000,0,50000);
call("road.tex.vegetation.Tree2");
pop;
push;
trans(-14000,0,75000);
call("road.tex.vegetation.Birch");
pop;

Figure 7. A typical scene description file

References

[1] L. Piegl, "On NURBS: A survey," in IEEE Computer Graphics and Applications, vol. 11, January 1991, pp. 51–71.
[2] The POV-Ray site. [Online]. Available: http://www.povray.org
[3] J. I. Woodfill, G. Gordon, and R. Buck, "Tyzx DeepSea high speed stereo vision system," in Proceedings of the IEEE Computer Society Workshop on Real Time 3-D Sensors and Their Use, Conference on Computer Vision and Pattern Recognition, June 2004. [Online]. Available: http://www.tyzx.com/pubs/CVPR3DWorkshop2004.pdf
[4] T. McReynolds and D. Blythe, Advanced Graphics Programming Using OpenGL. Morgan Kaufmann, 2005.
[5] D. Shreiner, OpenGL(R) Reference Manual: The Official Reference Document to OpenGL, Version 1.4 (4th Edition). Addison-Wesley Professional, 2004.
[6] D. Shreiner, M. Woo, J. Neider, and T. Davis, The Official Guide to Learning OpenGL(R), Version 2. Addison-Wesley Professional, 2005.
[7] S. Nedevschi, S. Bota, T. Marita, F. Oniga, and C. Pocol, "Real-time 3D environment reconstruction using high precision trinocular stereovision," in Proceedings of the International Conference on Automation, Quality and Testing, Robotics, vol. 2, 2006, pp. 333–338.
[8] S. Nedevschi, R. Danescu, T. Marita, F. Oniga, C. Pocol, S. Sobol, T. Graf, and R. Schmidt, "Driving environment perception using stereovision," in Proceedings of the IEEE Intelligent Vehicles Symposium (IV), June 2005, pp. 331–336. [Online]. Available: http://users.utcluj.ro/ vision/index files/Publications/IVS2005.pdf