Content-Based Surgical Workflow Representation Using Probabilistic Motion Modeling

Stamatia Giannarou¹ and Guang-Zhong Yang¹,²

¹ Institute of Biomedical Engineering, ² Department of Computing,
Imperial College London, SW7 2BZ, UK

Abstract. Succinct content-based representation of minimally invasive surgery (MIS) video is important for efficient surgical workflow analysis and modeling of instrument-tissue interaction. Current approaches to video representation are not well suited to MIS as they neither fully capture the underlying tissue deformation nor provide reliable feature tracking. The aim of this paper is to propose a novel framework for content-based surgical scene representation which simultaneously identifies key surgical episodes and encodes the motion of tracked salient features. The proposed method does not require pre-segmentation of the scene and can easily be combined with 3D scene reconstruction techniques to provide further scene representation without the need to go back to the raw data.

Keywords: surgical workflow analysis, motion modeling.

1 Introduction

Assessment of surgical workflow for Minimally Invasive Surgery (MIS) is valuable for evaluating surgical skills and designing context-sensitive surgical assistance systems. In this regard, direct use of MIS video has many practical advantages as it does not complicate the existing operating room settings. However, the challenges associated with this approach are also evident. MIS video data exhibit high temporal redundancy, making off-line analysis of the surgical workflow difficult. Being extremely voluminous, they require significant time to visualize. Furthermore, information of interest can only be accessed by sequential video scanning, and data manipulation and editing is achieved by frame-by-frame video processing. Extensive research has already been conducted on surgical scene segmentation, instrument tracking, tissue deformation recovery and modeling [1]. In practice, the large volume of video data involved raises a significant information management challenge, and a succinct content-based representation is required to facilitate efficient indexing, browsing, retrieval and comparison of relevant information for faster and more comprehensive analysis and understanding of the surgical workflow with minimal information loss.

This requires a high-level representation of visual information which reflects not only the scene structure but also the underlying semantics and context of the in vivo environment. Common approaches to video representation follow the framework of the MPEG-7 standard [2]. The identification of representative frames, the so-called "keyframes", has been used extensively to convey the content of videos [3]. Although useful for broadcasting and general use, such a representation is of limited value for MIS workflow analysis as it does not adequately capture the underlying information and instrument-tissue interaction. One prerequisite of content-based representation of surgical workflow is the identification of surgical episodes that portray different events within the sequence. Existing work on surgical workflow segmentation has focused on the analysis of surgical actions. Probabilistic frameworks have been used to classify surgical actions by detecting instrument-tissue interaction, tracking surgical tools and incorporating multiple visual cues [4]. Special MIS tools equipped with force and torque sensors have also been designed to facilitate the classification of surgical actions based on instrument kinematics [5]. Dynamic Time Warping (DTW) and Hidden Markov Model (HMM) based approaches have been proposed to classify overall surgical procedures [6]. Additional information such as eye-gaze and hand/limb movement has also been included to improve the overall sensitivity and specificity of the analysis framework [7].

The purpose of this paper is to propose a new content-based surgical workflow representation scheme that is suitable for both surgical episode identification and instrument-tissue motion modeling. The aim is to transform the MIS data from an implicit and redundant frame-based representation to an explicit, abstracted, high-level surgical episode description. The proposed approach does not require pre-segmentation, and the motion characteristics of salient features are used to identify tissue deformation in response to instrument interaction. Surgical episodes can be naturally derived from this framework, and further representation of the scene with 3D reconstruction using structure from motion or simultaneous localization and mapping (SLAM) can readily be built on top of this representation without the need of going back to the raw data.

2 Methods

2.1 Episode Segmentation

In this work, surgical episodes describe distinct surgical actions and consecutive episodes represent surgical steps. The visual content information of the MIS data is expressed by the action events of salient features. Episode borders occur when feature tracking fails, which is caused by the markedly different visual appearance of the changing surgical environment. The affine-invariant anisotropic region detector [8] is employed for reliable and persistent feature tracking, where salient features are identified as points that have strong gradients and are anisotropic along several directions. Scale adaptation is based on the strength of the anisotropic pattern, whereas affine adaptation relies on the pattern's intrinsic Fourier properties.

An Extended Kalman Filter (EKF) parameterization scheme is used to adaptively adjust the optimal templates of the detected affine-invariant anisotropic regions, enabling accurate identification and matching of a set of tracked features over a series of frames. The information provided to the EKF is the location of a salient point in each frame and the parameters of the ellipse that represents its region. The state of the EKF consists of the coordinates of the ellipse center (x, y), the velocities along the horizontal and vertical axes (u, v), the coordinates of the tip of the major axis (r^x, r^y), the angle θ between the horizontal and the major axis of the ellipse, the angular velocity ω, and the ratio k between the major and the minor axes of the ellipse. The state of a salient point at time t is defined as:

$$
s_t = f(s_{t-1}, w_t) =
\begin{bmatrix} x_t \\ y_t \\ u_t \\ v_t \\ \theta_t \\ \omega_t \\ r^x_t \\ r^y_t \\ k_t \end{bmatrix} =
\begin{bmatrix}
x_{t-1} + (u_{t-1} + w^u_{t-1}) \\
y_{t-1} + (v_{t-1} + w^v_{t-1}) \\
u_{t-1} + w^u_{t-1} \\
v_{t-1} + w^v_{t-1} \\
\theta_{t-1} + \omega_{t-1} + w^\omega_{t-1} \\
\omega_{t-1} + w^\omega_{t-1} \\
r^x_t \\
r^y_t \\
k_{t-1}
\end{bmatrix}
\qquad (1)
$$

where $w_t = [w^u_t, w^v_t, w^\omega_t]$ is zero-mean additive Gaussian noise and $r^x_t$ and $r^y_t$ are the results of the homography:

$$
\begin{bmatrix} r^x_t \\ r^y_t \end{bmatrix} =
\begin{bmatrix}
\cos(\omega_{t-1} + w^\omega_{t-1}) & -\sin(\omega_{t-1} + w^\omega_{t-1}) \\
\sin(\omega_{t-1} + w^\omega_{t-1}) & \phantom{-}\cos(\omega_{t-1} + w^\omega_{t-1})
\end{bmatrix}
\cdot
\begin{bmatrix} r^x_{t-1} - x_{t-1} \\ r^y_{t-1} - y_{t-1} \end{bmatrix}
+
\begin{bmatrix} u_{t-1} + w^u_{t-1} + x_{t-1} \\ v_{t-1} + w^v_{t-1} + y_{t-1} \end{bmatrix}
\qquad (2)
$$

The aim of tracking is to establish frame correspondence between the region $\hat{s}^-_t$ predicted by the EKF and the regions detected in the search window at the examined frame. In this work, we use the relative amount of overlap O in the image area covered by the compared regions and the dissimilarity C in the anisotropic measure of the compared features as an indication of region correspondence, defined as:

$$
O_{A,B} = 1 - \frac{A \cap B}{A \cup B}, \qquad
C_{A,B} = \frac{|c_A - c_B|}{\max_{n \in S} |c_A - c_n|}
\qquad (3)
$$

where A ∩ B and A ∪ B are the intersection and the union, respectively, of regions A and B, c_A represents the anisotropic measure of feature A, and S denotes the search area. The EKF corrected state $\hat{s}^+_t$ is verified using spatial context and regional similarity. The context of a feature is represented by a set of auxiliary features that exhibit strong motion correlation to the feature and is used to estimate an approximate location $\{\tilde{x}_t, \tilde{y}_t\}$ of the feature. The region similarity is estimated based on the Bhattacharyya distance between the RGB histograms of the compared regions.
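To make the state evolution concrete, the following is a minimal sketch of the state-transition model of Eqs. (1)-(2). The NumPy state layout, the function name and the example values are illustrative assumptions of this sketch and are not taken from the authors' implementation.

```python
# Minimal sketch of the EKF state-transition model of Eqs. (1)-(2).
import numpy as np

def predict_state(s, w):
    """Propagate an ellipse state s = [x, y, u, v, theta, omega, rx, ry, k]
    one frame ahead, given the process noise w = (w_u, w_v, w_omega)."""
    x, y, u, v, theta, omega, rx, ry, k = s
    w_u, w_v, w_omega = w

    # Constant-velocity model for the ellipse centre and orientation, Eq. (1).
    x_new = x + (u + w_u)
    y_new = y + (v + w_v)
    u_new = u + w_u
    v_new = v + w_v
    theta_new = theta + omega + w_omega
    omega_new = omega + w_omega

    # Major-axis tip: rotate about the previous centre, then translate, Eq. (2).
    a = omega + w_omega
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    tip = R @ np.array([rx - x, ry - y]) + np.array([u + w_u + x, v + w_v + y])

    return np.array([x_new, y_new, u_new, v_new,
                     theta_new, omega_new, tip[0], tip[1], k])

# Example: noise-free prediction of a feature drifting right at 2 px/frame.
s0 = np.array([100.0, 80.0, 2.0, 0.0, 0.1, 0.01, 110.0, 80.0, 1.5])
print(predict_state(s0, (0.0, 0.0, 0.0)))
```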

The spatial context information is also used to boost the prediction of the EKF and to recover from potential tracking failure. When $\hat{s}^-_t$ cannot be matched to any of the detected features, or $\hat{s}^+_t$ is not a valid correspondence, the location $\{\tilde{x}_t, \tilde{y}_t\}$ is used to generate a "hypothetical" predicted state, defined as:

$$
\hat{s}^h_t = [\tilde{x}_t,\ \tilde{y}_t,\ \tilde{x}_t - \hat{x}^+_{t-1},\ \tilde{y}_t - \hat{y}^+_{t-1},\ \hat{\theta}_{t-1},\ \hat{\omega}_{t-1},\ \hat{r}^x_{t-1} + (\tilde{x}_t - \hat{x}^+_{t-1}),\ \hat{r}^y_{t-1} + (\tilde{y}_t - \hat{y}^+_{t-1}),\ \hat{k}_{t-1}]^T
\qquad (4)
$$

which is compared to the detected features in the examined frame to estimate the final state of the feature. A feature is declared lost if the verification of $\hat{s}^h_t$ has failed for 5 consecutive frames, in order to eliminate false positives during tracking, as these would provide inaccurate information about the underlying tissue motion. Failure of the tracking process is declared when 35% of the features have been lost. A low threshold for signalling tracking failure might lead to over-segmentation of the video; however, in the proposed framework it is desirable to acquire motion information from the entire scene without neglecting areas where features might have been lost. A new episode is defined by re-initializing tracking and automatically selecting a subset of the detected features to track. Non-maximal suppression is applied to select the most salient features, enabling long tracks over time. Salient points that correspond to specular highlights are excluded. In order to represent every area of the examined environment, it is desirable that the tracked features span the whole field of view. The selected features are compared in terms of the difference in their surrounding image area, and the most dissimilar ones are selected to initialize tissue tracking as they correspond to different areas of the scene.
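As an illustration of the segmentation logic described above, the sketch below implements the correspondence measures of Eq. (3), the hypothetical state of Eq. (4) and the lost-feature/episode-border thresholds. The boolean region masks, the anisotropy values and the data layout are assumptions of this sketch rather than the authors' implementation.

```python
# Sketch of the region-correspondence measures, hypothetical state and
# episode-border rule described in Sect. 2.1.
import numpy as np

def overlap_score(mask_a, mask_b):
    """O_{A,B} = 1 - |A ∩ B| / |A ∪ B| for two boolean region masks, Eq. (3)."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return 1.0 - inter / union if union > 0 else 1.0

def anisotropy_dissimilarity(c_a, c_b, c_search):
    """C_{A,B} = |c_A - c_B| / max_{n in S} |c_A - c_n|, with c_search holding
    the anisotropic measures of all features in the search window S."""
    denom = max(abs(c_a - c_n) for c_n in c_search)
    return abs(c_a - c_b) / denom if denom > 0 else 0.0

def hypothetical_state(s_prev, x_tilde, y_tilde):
    """Eq. (4): rebuild a predicted state from the context-based location
    estimate (x_tilde, y_tilde) when prediction/correction fails."""
    x_p, y_p, _, _, theta_p, omega_p, rx_p, ry_p, k_p = s_prev
    return np.array([x_tilde, y_tilde,
                     x_tilde - x_p, y_tilde - y_p,           # implied velocity
                     theta_p, omega_p,
                     rx_p + (x_tilde - x_p), ry_p + (y_tilde - y_p),
                     k_p])

def episode_border(lost_flags, lost_fraction_threshold=0.35):
    """An episode border is declared once 35% of the features are lost; a
    feature is flagged lost after 5 consecutive failed verifications."""
    return np.mean(lost_flags) >= lost_fraction_threshold
```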

2.2 Episode Representation

In the proposed surgical episode representation, probabilistic motion modeling is used to represent the motion of the tracked features. In the ideal case, the distribution of a feature's velocity within an episode should be represented by a Gaussian-like distribution around some point. In practice, the motion of a feature can change dramatically over a period of a few frames due to periodic tissue motion or tool-tissue interaction causing free-form tissue deformation. In order to account for such multiple movements, the motion of each feature is modeled as a mixture of K bivariate Gaussian distributions. Given the motion vectors $U^k = (u^k, v^k)$, $k = 1 \dots t_e$, of a tracked feature within an episode $t_e$ frames long, the probability of observing the motion vector $U^t$ is given by:

$$
P(U^t) = \sum_{p=1}^{K} \omega_p \cdot g(U^t, \mu_p, \Sigma_p)
\qquad (5)
$$

where ω_p is an estimate of the weight of the p-th Gaussian in the mixture, μ_p and Σ_p are its mean and covariance matrix, and g(·, μ_p, Σ_p) is the density of the p-th component, given by:

$$
g(U, \mu_p, \Sigma_p) = \frac{1}{2\pi |\Sigma_p|^{\frac{1}{2}}} \exp\left\{-\frac{1}{2} (U - \mu_p)^T \Sigma_p^{-1} (U - \mu_p)\right\}
\qquad (6)
$$
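As an illustrative sketch, the per-feature mixture of Eqs. (5)-(6) can be fitted to the velocity samples of an episode with an off-the-shelf EM implementation. The mixture parameters are learned with EM (see the following text); the use of scikit-learn's GaussianMixture below, the synthetic data and all function names are assumptions of this sketch.

```python
# Sketch of fitting the per-feature motion model of Eqs. (5)-(6) with EM.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_motion_model(velocities, K=5, seed=0):
    """velocities: (t_e, 2) array of (u, v) motion vectors within an episode.
    Returns a K-component bivariate Gaussian mixture fitted with EM."""
    gmm = GaussianMixture(n_components=K, covariance_type='full',
                          random_state=seed).fit(velocities)
    return gmm  # gmm.weights_, gmm.means_, gmm.covariances_ ~ (w_p, mu_p, Sigma_p)

def motion_likelihood(gmm, U):
    """P(U) of Eq. (5) for a single motion vector U = (u, v)."""
    return float(np.exp(gmm.score_samples(np.atleast_2d(U)))[0])

# Example: synthetic episode with slow drift plus a burst of deformation.
rng = np.random.default_rng(0)
U = np.vstack([rng.normal([0.2, 0.0], 0.1, (80, 2)),
               rng.normal([2.0, 1.5], 0.4, (20, 2))])
model = fit_motion_model(U, K=5)
print(motion_likelihood(model, [0.2, 0.0]))
```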

Fig. 1. (a) PME sequence with blue vertical lines indicating episode borders and the first frame of each episode shown. (b) Surgical episode content representation maps.

The number of mixture components is determined empirically by the desired computational complexity and usually ranges from 3 to 5 [9]. In this work, five components have been used. The parameters (μ_p, Σ_p) of the distributions as well as the weights ω_p are learned using the Expectation Maximization (EM) algorithm [10]. For a compact motion model representation, the mean and the covariance matrix of the Gaussian mixture are used, estimated as:

$$
M = \sum_{p=1}^{K} \omega_p \mu_p, \qquad
\Sigma = \sum_{p=1}^{K} \omega_p \left(\Sigma_p + \mu_p \mu_p^T\right) - \left(\sum_{p=1}^{K} \omega_p \mu_p\right)\left(\sum_{p=1}^{K} \omega_p \mu_p\right)^T
\qquad (7)
$$

Finally, the motion models of the tracked features are blended to estimate the motion at every point in the scene. Considering a set of N neighboring tracked features, the motion model at a scene point i is estimated as in Eq. (7). The weight of the j-th neighbor in the motion blend is estimated as $\omega_j = d_{i,j}^{-1} / \sum_{j=1}^{N} d_{i,j}^{-1}$, where $d_{i,j}$ is the Euclidean distance between the scene point and its j-th neighbor.
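A short sketch of the compact representation of Eq. (7) and of the inverse-distance blending is given below. The (weights, means, covariances) layout, the reuse of the same moment-matching step for the blending, and the function names are assumptions of this sketch.

```python
# Sketch of the compact motion model of Eq. (7) and of neighbour blending.
import numpy as np

def collapse_mixture(weights, means, covs):
    """Moment-match a K-component mixture to a single (M, Sigma) pair, Eq. (7).
    weights: (K,), means: (K, 2), covs: (K, 2, 2)."""
    M = np.sum(weights[:, None] * means, axis=0)
    second_moment = np.sum(weights[:, None, None] *
                           (covs + np.einsum('ki,kj->kij', means, means)), axis=0)
    Sigma = second_moment - np.outer(M, M)
    return M, Sigma

def blend_models(point, neighbour_pts, neighbour_models):
    """Estimate the motion model at a scene point from its N nearest tracked
    features, weighting each neighbour by its normalised inverse distance."""
    d = np.maximum(np.linalg.norm(neighbour_pts - point, axis=1), 1e-9)
    w = (1.0 / d) / np.sum(1.0 / d)
    means = np.array([m for m, _ in neighbour_models])
    covs = np.array([S for _, S in neighbour_models])
    return collapse_mixture(w, means, covs)
```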

3 Results

In order to assess the practical value of the proposed framework, quantitative evaluation has been performed on four in vivo data sets from robotically assisted MIS procedures involving a variety of representative surgical scenarios such as varying camera motions, occlusion due to the presence of tools, and significant tissue deformation. The images are of resolution 360×288 pixels, in line with the output resolution of the available endoscopic tools used in MIS. Visual content information is provided by the motion patterns generated by tracking 100 affine-invariant anisotropic regions. Due to a lack of accepted benchmarking or ground truth for episode segmentation algorithms, it is not possible to perform an objective evaluation of the proposed episode segmentation approach, particularly for in vivo patient studies. Hence, subjectively defined feature tracking is used for performance evaluation. The Perceived Motion Energy (PME) model [11] has been widely used in video segmentation, and it is used here to demonstrate the motion activity along surgical episodes and to verify that each extracted episode describes a distinct surgical action. The PME model combines the motion intensity and the motion direction of the tracked features.

Fig. 2. Motion vectors superimposed on surgical episode content representation maps for MIS sequences with (a) varying camera motions, (b) instrument-tissue interaction and (c) respiratory motion.

Fig. 3. Motion model similarity assessment results for (a) K = 5, N = 5, (b) K = 5, N = 3, (c) K = 3, N = 5. The horizontal axis corresponds to the motion model similarity and the vertical axis to the percentage of examined features on the ground truth.

The PME at frame t of a surgical episode is mathematically defined as:

$$
PME(t) = \frac{1}{F(2T+1)} \sum_{k=t-T}^{t+T} \sum_{i=1}^{F} \sqrt{(u_i^k)^2 + (v_i^k)^2} \cdot \frac{\max_b \left(H(t, b)\right)}{\sum_{b=1}^{n} H(t, b)}
\qquad (8)
$$

The first term in Eq. (8) corresponds to the average velocity of the tracked features within the frame interval [t − T, t + T] of the episode. The second term expresses the percentage of the dominant motion direction within the episode, where H(t, b) is the n-bin histogram of the motion vector angles of the tracked features for the interval [t, t + T]. The parameter T determines the amount of activity detail captured in the PME and in this work is set to 25% of the episode length.

To validate the proposed episode content representation approach, ground truth data is estimated by manually identifying corresponding features between the first sequence frame and subsequent frames. In order to estimate feature velocities within a surgical episode, the ground truth is obtained at equally spaced pairs of consecutive frames that correspond to 20% of the episode length and to a minimum number of 20 frames. In order to reduce the computational complexity of the performance evaluation, 20 of the initially detected features are manually tracked along time to obtain their ground truth motion model, which is compared to the one estimated by the proposed approach. The similarity between two motion models i, j is estimated using Matusita's measure [12], which expresses the difference between the covariance matrices and the distance between the means of multivariate distributions, defined in the bivariate case as:

$$
S_{i,j} = \frac{2\,|\Sigma_i|^{\frac{1}{4}}\,|\Sigma_j|^{\frac{1}{4}}}{|\Sigma_i + \Sigma_j|^{\frac{1}{2}}} \exp\left\{-\frac{1}{4} (\mu_i - \mu_j)^T (\Sigma_i + \Sigma_j)^{-1} (\mu_i - \mu_j)\right\}
\qquad (9)
$$
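For illustration, the following sketch computes the PME of Eq. (8) and the motion-model similarity of Eq. (9). The array layouts, the histogram binning and the function names are assumptions of this sketch; the similarity is written here in the standard bivariate Matusita/Bhattacharyya form.

```python
# Sketches of the Perceived Motion Energy (Eq. 8) and the Matusita-based
# motion-model similarity (Eq. 9).
import numpy as np

def pme(velocities, t, T, n_bins=8):
    """velocities: (num_frames, F, 2) motion vectors of F tracked features.
    Average speed over [t-T, t+T] times the fraction of the dominant motion
    direction over [t, t+T], following Eq. (8)."""
    window = velocities[t - T:t + T + 1]                      # (2T+1, F, 2)
    F = window.shape[1]
    mean_speed = np.linalg.norm(window, axis=2).sum() / (F * (2 * T + 1))
    fwd = velocities[t:t + T + 1]                             # angles over [t, t+T]
    angles = np.arctan2(fwd[..., 1], fwd[..., 0]).ravel()
    H, _ = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi))
    return mean_speed * H.max() / H.sum()

def matusita_similarity(mu_i, Sigma_i, mu_j, Sigma_j):
    """Similarity between two bivariate Gaussian motion models, Eq. (9)."""
    Sigma_sum = Sigma_i + Sigma_j
    d_mu = np.asarray(mu_i) - np.asarray(mu_j)
    pref = (2.0 * np.linalg.det(Sigma_i) ** 0.25 * np.linalg.det(Sigma_j) ** 0.25
            / np.linalg.det(Sigma_sum) ** 0.5)
    return pref * np.exp(-0.25 * d_mu @ np.linalg.solve(Sigma_sum, d_mu))
```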

Detailed workflow analysis of an in vivo sequence with tissue motion due to respiration and significant instrument-tissue interaction is presented in Fig. 1. The PME pattern in Fig. 1(a) verifies that the extracted surgical episodes describe distinct tool actions within the workflow.

Fig. 4. Statistical significance results for MIS sequences shown in (a) Fig.1 (b) Fig. 2a (c) Fig. 2b (d) Fig. 2c

Also, the PME level is proportional to the degree of deformation, and episodes with no instrument-tissue interaction are characterized by low PME, which is mainly due to respiratory motion. Observing the PME pattern around the borders of the episodes extracted by the proposed method, it becomes clear that video segmentation based on the PME model (with episode borders identified at the peaks of the PME pattern) misses surgical actions occurring in a small part of the scene (for instance, the 3rd and the 6th episode borders). This demonstrates the success of the proposed method and suggests that PME is a useful measure for representing overall scene changes but is not sufficient to capture the fine granularity of the movement patterns. The inherent surgical episodes of the above video sequence are represented in Fig. 1(b). Areas of coherent motion within each episode are graphically classified using a colormap to demonstrate the similarity between the motion of scene points and a reference point (e.g. the upper left corner of the scene), estimated using Eq. (9). The colormap representations clearly illustrate instrument-induced tissue deformation. The content of a sample of surgical episodes extracted from the rest of the examined MIS data sets is presented in Fig. 2.

Arrows have been used to illustrate the motion of discrete scene areas within each episode, demonstrating global camera motion (a), instrument interaction (b) and respiratory motion (c).

Fig. 3 demonstrates the accuracy of the estimated motion model for each surgical episode of the sequence in Fig. 1 when compared to the ground truth. Each curve point (a, b) corresponds to the percentage b of the examined features on the ground truth with similarity between the ground truth and the estimated motion model higher than a. Fig. 3(a) shows the motion model accuracy when using K = 5 mixture components and blending N = 5 features. The similarity curves show that the motion of a significant percentage of the scene conforms to the manually defined ground truth. The effect of the number N of neighbors used in the blending process is investigated in Fig. 3(b) by setting N = 3. The similarity curves in Fig. 3(c) show the accuracy of motion modeling when 3 mixture components are used. The ideal similarity curve would be a straight horizontal line at 100%. However, this is not always achievable, and the slope of the curve is an indication of the robustness of the proposed episode representation.

Since the difference in the performance of the above three parameterization schemes is not distinctive enough, statistical analysis of the similarity scores is performed. The statistical significance of the different parameterizations is evaluated as:

$$
\mathrm{mean}\left(S^{E_i}_{norm}\right) + \mathrm{std}\left(S^{E_i}_{norm}\right)
\qquad (10)
$$

where $S^{E_i}_{norm} = S^{E_i}_t - \frac{1}{M}\sum_{j=1}^{M} S^{E_j}_t$ stands for the normalized similarity scores using parameterization $E_i$, $S^{E_i}_t$ denotes the similarity score at episode t when using parameterization $E_i$, and M is the total number of compared parameterizations. As shown in Fig. 4, the selected parameterization with K = 5 and N = 5 gives the highest statistical significance in the majority of episodes for all of the examined sequences. The number of mixture components does not affect the accuracy of the proposed representation significantly, while the distance of the estimated representation from the ground truth increases when fewer neighbors are blended.
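A minimal sketch of this normalization is given below, assuming the similarity scores are stored as an array with one row per parameterization and one column per episode; this layout, the aggregation axis and the function name are assumptions of this sketch.

```python
# Sketch of the normalisation and significance score of Eq. (10).
import numpy as np

def parameterisation_significance(S):
    """S[i, t]: similarity score of parameterisation E_i at episode t.
    Each score is first normalised by the mean over the M parameterisations;
    mean + std of the normalised scores is then reported per parameterisation."""
    S = np.asarray(S, dtype=float)
    S_norm = S - S.mean(axis=0, keepdims=True)   # subtract cross-parameterisation mean
    return S_norm.mean(axis=1) + S_norm.std(axis=1)
```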

4 Conclusions

In this paper, we have proposed a novel framework for the succinct representation of the content of MIS data. Probabilistic tissue tracking is used to generate motion patterns that represent visual content complexity and guide the identification of surgical episodes. Surgical episodes are represented by modeling the tissue motion with probabilistic models and blending the models together to acquire information about the motion of the entire scene. The proposed compact, content-based data representation will facilitate surgical workflow analysis and understanding. It identifies episodes of distinct surgical actions, providing the required information for their fast and robust reconstruction with techniques such as SLAM or structure from motion. The proposed framework also simplifies the problem of scene reconstruction, as the information for feature extraction and data association is already conveyed in the proposed data representation.

References

1. Varadarajan, B., Reiley, C., Lin, H., Khudanpur, S., Hager, G.: Data-derived models for segmentation with application to surgical assessment and training. In: Yang, G.-Z., Hawkes, D., Rueckert, D., Noble, A., Taylor, C. (eds.) MICCAI 2009. LNCS, vol. 5761, pp. 426–434. Springer, Heidelberg (2009)
2. Sikora, T.: The MPEG-7 visual standard for content description - an overview. IEEE Transactions on Circuits and Systems for Video Technology 11(6), 696–702 (2001)
3. Smoliar, S.W., Zhang, H.: Content based video indexing and retrieval. IEEE Multimedia 1(2), 62–72 (1994)
4. Lo, B., Darzi, A., Yang, G.Z.: Episode classification for the analysis of tissue/instrument interaction with multiple visual cues. In: Ellis, R.E., Peters, T.M. (eds.) MICCAI 2003. LNCS, vol. 2878, pp. 230–237. Springer, Heidelberg (2003)
5. Rosen, J., Solazzo, M., Hannaford, B., Sinanan, M.: Task decomposition of laparoscopic surgery for objective evaluation of surgical residents' learning curve using hidden Markov model. Computer Aided Surgery 7(1), 49–61 (2002)
6. Ahmadi, S.A., Sielhorst, T., Stauder, R., Horn, M., Feussner, H., Navab, N.: Recovery of surgical workflow without explicit models. In: Larsen, R., Nielsen, M., Sporring, J. (eds.) MICCAI 2006. LNCS, vol. 4190, pp. 420–428. Springer, Heidelberg (2006)
7. James, A., Vieira, D., Lo, B., Darzi, A., Yang, G.Z.: Eye-gaze driven surgical workflow segmentation. In: Ayache, N., Ourselin, S., Maeder, A. (eds.) MICCAI 2007, Part II. LNCS, vol. 4792, pp. 110–117. Springer, Heidelberg (2007)
8. Giannarou, S., Visentini-Scarzanella, M., Yang, G.Z.: Affine-invariant anisotropic detector for soft tissue tracking in minimally invasive surgery. In: IEEE International Symposium on Biomedical Imaging, pp. 1059–1062 (2009)
9. Stauffer, C., Grimson, W.: Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 747–757 (2000)
10. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39(1), 1–38 (1977)
11. Liu, T., Zhang, H.J., Qi, F.: A novel video key-frame-extraction algorithm based on perceived motion energy model. IEEE Transactions on Circuits and Systems for Video Technology 13(10), 1006–1013 (2003)
12. Matusita, K.: Decision rules based on distance for problems of fit, two samples and estimation. Annals of Mathematical Statistics 26, 631–641 (1955)