Pipelined scientific workflows for inferring evolutionary relationships

Timothy M. McPhillips*
Natural Diversity Discovery Project
Technical Note NDDP-001
May 2005

* Author’s e-mail: [email protected]
Copyright © 2005 Natural Diversity Discovery Project

Abstract

Large-scale scientific research projects commonly span widely separated computing and experimental resources. The challenge of the future will be to provide small research groups with access to virtual laboratories composed of network-accessible, high-throughput research tools. Unfortunately, even current projects are limited by the lack of comprehensive computing infrastructure for supporting data-intensive, large-scale research. The Natural Diversity Discovery Project (NDDP) is developing a software framework to address key aspects of this problem. NDDP software builds on the Ptolemy II heterogeneous modeling and design system, adding capabilities for managing collections of data in pipelined scientific workflows. Paired tokens delimit nested collections of associated data flowing between workflow components. These collections provide support for customizing workflow component behavior, handling exceptions, controlling data flow, storing metadata, and recording data provenance. The NDDP is applying pipelined scientific workflows to the problem of inferring phylogenetic trees. Objectives include developing a collaboratory for evolutionary biologists and providing a web-based discovery environment for the general public.

1. Introduction

New instrumentation, automation, computers, and networks are catalyzing high-throughput scientific research in a broad range of fields. These approaches promise to deliver data at rates orders of magnitude greater than in the past. Amid high expectations, however, is a growing awareness that existing software infrastructure for supporting data-intensive research will not meet future needs. A group of DOE-funded researchers recently identified common computing challenges in high-energy physics, biology, nanotechnology, climate research, and other disciplines at the 2004 Data-Management Workshops (Office of Science Data-Management Challenge, 2005).

Workshop speakers reported that current technologies for managing large-scale scientific research do not satisfy even current needs and warned of a “coming tsunami of scientific data.” They identified critical computing challenges including (1) integrating large data sets from diverse sources, (2) capturing data provenance and other metadata, and (3) streaming data through geographically dispersed experimental and computing resources in real time. The authors of the meeting report recommended developing integrated computing environments for managing data in large-scale scientific workflows.

2. Structural genomics exemplifies large-scale scientific workflows

The challenges of large-scale, data-intensive research are illustrated by the Joint Center for Structural Genomics (Lesley et al., 2002). The JCSG is one of nine pilot centers funded by the NIH Protein Structure Initiative to systematically determine the structures of large numbers of proteins. JCSG takes a pipelined approach, with each protein target proceeding through a number of stages: (1) target identification; (2) gene cloning and expression; (3) protein purification and crystallization; (4) crystal quality screening; (5) X-ray diffraction data collection and phasing; (6) structure determination and validation; (7) publication and deposition in the Protein Data Bank (Berman et al., 2000). The approach is typical of very large-scale scientific workflows.

The workflow includes both experimental and computational operations. Operations on a single protein target span multiple sites: U.C. San Diego, the San Diego Supercomputer Center, the Genomics Institute of the Novartis Research Foundation (San Diego), The Scripps Research Institute (La Jolla), and the Stanford Synchrotron Radiation Laboratory. Biological samples, data, and project status information are relayed frequently between these sites. Staff at the different facilities must address inter-site communication, resource allocation, target prioritization, and other project management issues.

Gigabytes of diverse data may be generated for each protein target, with project data organized hierarchically. Each of the hundreds of targets is associated with one or more cloning, expression and purification attempts. Tens to hundreds of crystallization trials can be carried out for one purified protein sample. Large numbers of crystals must be screened in order to find the small fraction that diffracts well.

The resulting crystals may be used for experimental phasing or dedicated to high-resolution data collection. The JCSG workflow is pipelined in the sense that multiple samples of numerous targets pass through the workflow at the same time. Attempts are made to keep available resources busy at all times. Numerous feedback loops occur throughout the pipeline. For example, the results of diffraction quality screening are used to improve crystallization conditions iteratively until crystals of sufficient quality are obtained.

Although many of the experimental and computational stages of the JCSG pipeline are largely automated, the overall workflow is not. A single database and web site centralize project information shared between the facilities, but other technologies are used for local information management at each site. Much of the data entry, analysis, and reporting is performed manually. The lack of a comprehensive integrating framework for managing the flow and storage of information in large-scale scientific workflows significantly complicates the JCSG and similar efforts.

3. Large-scale research for the general scientific community

Successful projects like JCSG demonstrate to the scientific community the advantages of high-throughput approaches. As new automation technologies for structural genomics and other disciplines are developed, demonstrated, and made cost-effective, it becomes increasingly feasible to make such large-scale research tools available to medium-sized collaborations, small research groups, and individual investigators. Such researchers could exploit virtual laboratories composed of geographically distributed experimental and computing resources, much as large consortia like JCSG do now. Unfortunately, the lack of software frameworks for integrating these experimental and computing resources poses resource management, data management, and project management complications insurmountable to small groups. Collaboratories (for an example see Chiu et al., 2002), the Grid (Berman et al., 2003), web services, and other remote-access technologies (such as Blu-Ice, see McPhillips et al., 2002) are only partial solutions to this problem. Providing resources over networks is key, but putting science online is not sufficient to facilitate large or complex projects (Johnston, 2004).

Effective information-intensive research by small groups requires an automated-workflow approach where computing infrastructure integrates disparate resources and explicitly manages project data (Office of Science Data-Management Challenge, 2005). To date, making experimental and computational resources for structural biology available online has resulted in only modest improvements in research efficiency. In contrast, enabling scientists to control such resources in integrated fashion from high-level, workflow perspectives will likely have revolutionary effects on the scope, quality, and efficiency of research carried out by the general scientific community.

Supporting large-scale research in this manner will require solving a multitude of daunting, interdependent problems and making commensurate investments of funding, time, and effort. Fortunately, some of the most fundamental problems can be addressed first. Prominent among these is the challenge of supporting pipelined scientific workflows. Automation infrastructure for pipelining must do more than facilitate simultaneous operation of each stage of a workflow. It must allow independent data sets or parts of data sets to be processed at the same time. And it must maintain associations between data, metadata, and results passing through the pipeline. Solving these problems would have immediate benefits.

4. The Natural Diversity Discovery Project

Developing infrastructure to support large-scale, pipelined scientific workflows is an objective of the Natural Diversity Discovery Project (NDDP). The NDDP is a nonprofit organization dedicated to facilitating public understanding of scientific explanations for the diversity of organisms and ecosystems in the natural world. The NDDP plans to develop a virtual laboratory for inferring phylogenetic trees and testing evolutionary hypotheses. This collaboratory will be designed for evolutionary biologists and other researchers. The NDDP further plans to provide an easy-to-use, web-based version of this environment for a general public that is often skeptical of scientific explanations of biological diversity. Both environments will support open-ended, free enquiry into the relationships of living and extinct organisms using current scientific data. Automated scientific workflows will be key to providing, maintaining, and extending the high-level tools in these environments. The NDDP is designing and implementing the workflow automation technology needed to implement these workflows and use them in production. It plans to distribute and support this technology along with workflow components specifically designed for evolutionary biology.


In addition, the NDDP hopes to collaborate with other research groups and software development teams in supporting diverse, large-scale scientific research projects.

5. Inferring phylogenetic trees

Inferring phylogenetic trees (Felsenstein, 2004), i.e., determining the evolutionary relationships between organisms, is a key workflow in evolutionary biology. Many complications arise in these studies. Diverse sources of information may be used when comparing organisms and inferring relationships between them: morphology, DNA and protein sequences, genome-level features, and the fossil record, among others. In some cases, quite different phylogenetic trees relating the same set of organisms may be inferred by using different types or sets of data, disparate approaches or algorithms, distinct algorithm parameter values, and various weightings of data. Even for a particular combination of data, approach, parameters, and weights, many subtly different trees can appear equally plausible. Furthermore, many researchers favor, and sometimes reject outright, particular methods of phylogenetic inference (Schuh, 2000). Consequently, any detailed claim about evolutionary relationships must be considered provisional at best—and of limited usefulness in the absence of complete information about how the relationships were inferred.

Inference of evolutionary relationships—the phylogenetic pattern—is only part of the problem. Making and testing hypotheses about natural processes to explain observed data and inferred patterns requires the use of even more diverse sets of data. Theories invoking particular evolutionary events, environmental conditions, and ecological interactions are complex and often controversial. And it can be argued that the validity of some approaches used to detect phylogenetic patterns depends on assumptions about the processes that produce the patterns (Sober, 1988)! Circularity of argument can go unnoticed if results are not kept closely associated with their provenance (Kemp, 1999).

The research and discovery environments envisioned by the NDDP will address these issues. Researchers will be able to do the following using web-based or desktop workflow environments: (1) Infer, display, and compare phylogenies based on morphology, molecular sequences, and genome features. (2) Correlate phylogenies with events in Earth history using molecular clocks and the fossil record.


(3) Iterate over alternative methods, character weightings, and algorithm parameter values. (4) Maintain associations between phylogenies and the data, methods, parameters, and assumptions used to infer them. (5) Share workflows and results. (6) Repeat studies reported by other workers and note the effects of varying data sets, approaches, and parameters. Future tools will address the problem of formulating and testing process hypotheses.

6. NDDP requirements for pipelined workflows

The workflows automated by the NDDP in the short term will be simpler than the structural genomics pipeline of JCSG. Nevertheless, the general issues hindering large-scale scientific research will apply, and NDDP requirements for workflow automation infrastructure will have much in common with those of other disciplines. The NDDP is taking the general approach of automating scientific workflows composed of (1) workflow components representing the computational and experimental steps in the workflow; (2) connections between the inputs and outputs of the workflow components representing the flow of data between them; and (3) parameters to the workflow components for customizing their behavior. Ludäscher et al. (2005) provide an excellent overview of the general requirements and desiderata for such scientific workflow automation systems. The NDDP is focusing on the following requirements for pipelined workflows in particular:

Workflows must be reusable. Component parameters and connections between components must not require reconfiguration when simply running a new set of data through a previously defined workflow.

Workflow components must operate context dependently. Data from independent data sets or subsets must be partitioned effectively when passing through a single instance of a workflow in pipelined fashion. Components must be able to operate on a single data set multiple times using different sets of parameters. And stateful components must associate their state with particular data sets.

Workflows must be robust. Exceptions thrown for particular data or parameter sets must not disrupt operations on unrelated sets. The effects of external program crashes must be under the control of the workflow author.

Metadata must stay associated with data. In addition, the provenance of input data and important intermediate results must be associated automatically with workflow results.


Workflows must be executable in the absence of user interfaces. This background execution support must allow development of high-level interfaces for workflow execution and monitoring, asynchronous results browsing, and long-term project management.

7. Ptolemy II

The NDDP uses the Ptolemy II heterogeneous modeling and design system (Brooks et al., 2004) as the foundation for its scientific workflow infrastructure. While not designed specifically for scientific workflows, Ptolemy II provides a powerful actor-oriented programming and model execution environment fundamentally more flexible than many commercial scientific workflow products. Ptolemy includes an elegant graphical user interface, Vergil, for composing and executing models. Other advantages of Ptolemy include source code availability, comprehensive documentation, liberal license and redistribution policies, impressive core product software quality, and cross-platform compatibility. Furthermore, the consistent object-oriented design and Java implementation facilitate extension and enable loose coupling between Ptolemy and software built using it. No NDDP extensions require source code changes to the Ptolemy II package. The Kepler Collaboration has chosen Ptolemy II as the basis for its scientific workflow system for similar reasons (Ludäscher et al., 2005).

Ptolemy II supports multiple computing models within a single software framework. It does this by providing distinct directors that manage the execution of Ptolemy models. NDDP scientific workflows are based on the process network computing model and use the Ptolemy II Process Network director (Goel, 1998). In this computing model, workflow components, or actors, execute as separate threads of execution. Actors communicate asynchronously via messages, or tokens, sent across connections between actor ports. Token types supported by Ptolemy II include representations of integers, floating-point numbers, arrays, records, and matrices. Actor implementation requires overriding the fire() method (or related methods) of an actor base class. Within the fire() method an actor may read tokens from input ports, operate on input data, and write results to output ports. The process network model allows an indefinite number of tokens to be received and sent each time the director calls the fire() method. Ptolemy II also provides actor parameters for customizing actor and model behavior. Actor authors, model composers, and users may specify parameter values.
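For readers unfamiliar with Ptolemy II, the following minimal sketch shows what an actor written for the Process Network director looks like; it simply doubles each integer token it receives. The class and port names are illustrative and are not taken from the NDDP actor library, but the base classes and methods used are standard Ptolemy II APIs.

    import ptolemy.actor.TypedAtomicActor;
    import ptolemy.actor.TypedIOPort;
    import ptolemy.data.IntToken;
    import ptolemy.data.type.BaseType;
    import ptolemy.kernel.CompositeEntity;
    import ptolemy.kernel.util.IllegalActionException;
    import ptolemy.kernel.util.NameDuplicationException;

    // Minimal sketch of a Ptolemy II actor intended for the Process Network director.
    public class DoubleInteger extends TypedAtomicActor {

        public TypedIOPort input;
        public TypedIOPort output;

        public DoubleInteger(CompositeEntity container, String name)
                throws IllegalActionException, NameDuplicationException {
            super(container, name);
            input = new TypedIOPort(this, "input", true, false);   // receives IntTokens
            output = new TypedIOPort(this, "output", false, true); // sends IntTokens
            input.setTypeEquals(BaseType.INT);
            output.setTypeEquals(BaseType.INT);
        }

        // Under the Process Network director, fire() runs in its own thread and
        // get() blocks until an upstream actor sends a token.
        public void fire() throws IllegalActionException {
            IntToken token = (IntToken) input.get(0);
            output.send(0, new IntToken(2 * token.intValue()));
        }
    }

The collection actors described below build on this basic actor model but, rather than calling get() directly within fire(), respond to collection events as described in the following sections.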

8. The data association problem

Prototyping phylogenetics workflows using Ptolemy II reveals a problem that must be solved immediately. Supporting complex workflows, large data sets, and pipelined operation requires a mechanism for associating multiple data items into groups to be processed together. In the Ptolemy process network model, an actor that operates on more than one token received on a single port must determine when the correct quantity of data has been received. Simple approaches to solving this problem include fixing the number of tokens expected by each actor, sending token counts or trigger signals on separate ports, bundling data tokens in Ptolemy array tokens, and inserting token counts between sets of related tokens. Some of these approaches can be effective for simple workflows, but they complicate the design of workflows and actors that must operate on complex or deeply nested scientific data sets. Moreover, ad hoc application of these approaches leads to code redundancy, with variations of similar code appearing in different actors.

Passing references to custom Java objects in tokens is a tempting possibility. Indeed, custom objects are essential for relatively small, application-domain-specific data structures (e.g., phylogenetic trees). But domain-specific Java objects sufficiently large and complex to store complete data sets suffer from several problems. They are relatively inflexible due to the need to edit source code when changing associations between substructures, adding new data or metadata fields, or making other changes. They require repeated disassembly and reassembly to allow concurrent, pipelined operation on data set contents by different actors. And complex objects are opaque to simple display and troubleshooting tools. Finally, complex objects are difficult to make immutable. Passing references to mutable objects can lead to concurrency problems as discussed below.

In addition to the drawbacks mentioned above, all of these approaches hamper rapid prototyping of workflows and associated data structures. They make comprehension, reuse, and refactoring of existing workflows difficult. And they tend to tightly couple actor and workflow design. The NDDP requires a generic approach to defining, managing, and processing collections of data that works under most circumstances and facilitates rapid prototyping and reuse of actors and workflows.


9. Defining nested collections using paired delimiter tokens

The NDDP has taken the approach of systematically associating data into nested collections by inserting paired opening and closing delimiter tokens into the data stream. This approach corresponds closely with the control-token approach proposed by Ludäscher & Altintas (2003) for supporting “pipelined-arrays” of data, and with the use of open-bracket and closing-bracket information packets described by Morrison (1994). Delimited collections may contain data tokens, metadata tokens, and other collections (sub-collections). Data tokens include the token types provided by the Ptolemy package as well as immutable instances of domain-specific Java classes. Metadata tokens are used to carry information that applies to a collection as a whole, including data provenance.

The complexity of creating, managing, and processing collections is largely encapsulated in two classes, CollectionActor and CollectionManager. Collection-processing actors typically are derived from CollectionActor. Instances of CollectionManager are actor-local objects for manipulating particular collections. The CollectionActor base class facilitates collection nesting by maintaining a stack of CollectionManager objects corresponding to all collections concurrently processed by an actor.

Collection actors are simple to write, debug, and maintain within this framework. Rather than reading tokens directly from input ports, collection actors operate on collections from within event handlers. The CollectionActor fire() method calls the handleCollectionStart() method when the opening delimiter for a collection is received; the handleData() or handleMetadata() method when a data or metadata token is received; and the handleCollectionEnd() method when the closing delimiter for a collection is received. The fire() method passes the CollectionManager object associated with the incoming collection to these event handlers, and the newly received token to the handleData() and handleMetadata() methods.

Collection actor output is indirect as well. An actor may add data or metadata to a collection it is processing using methods provided by the associated CollectionManager object. An actor may create a new collection within another collection and add data or metadata to it. And it may replace data or metadata or copy the information to other collections.

Collection actors specify the disposition of incoming collections, data, and metadata via the return values of the event handlers. The return value of the handleCollectionStart() method declares whether the actor will further process a collection and whether the collection should be discarded or forwarded to the next actor in the workflow. Similarly, the return values of the handleData() and handleMetadata() methods indicate whether the token in question should be forwarded or discarded. Incoming information not discarded by an actor is streamed to succeeding actors in the workflow as the collection is received and processed. The result is highly pipelined, concurrent actor execution.

Like the data within it, a collection is associated with a type with scientific or data management semantics. For example, a collection of type Nexus contains data or results associated with one or more phylogenetics computations, while a TextFile collection contains strings representing the contents of a text file. Collection types simplify the processing of collections. Each instance of a collection actor has a CollectionPath parameter that specifies what types of collections and data the actor will handle. Collections and data tokens with types matching the CollectionPath value trigger collection-handling events, e.g., calls to the actor’s handleCollectionStart() and handleData() methods. Collections and data not matching the CollectionPath value are streamed silently to the next actor in the workflow. Workflow composers may use the CollectionPath parameter to operate selectively on particular collection and data types, while actor authors may fix the value of these parameters to simplify actor implementation. For example, the NexusFileComposer actor replaces each Nexus collection it receives with a TextFile collection. This actor gives CollectionPath a fixed value of “Nexus//” so that data and collections not nested within collections of type Nexus pass through unnoticed by the actor.

Figure 1 illustrates a pipelined workflow for inferring phylogenetic trees using this approach to defining collections of data. Figure 2 enumerates the tokens flowing between two actors in this workflow.
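To make the event-handler style concrete, the following sketch shows how a collection actor that counts the trees in each Nexus collection might be written. CollectionActor, CollectionManager, and the handler names come from the description above, but the exact method signatures, return-value conventions, and the addMetadata() and TreeToken names are assumptions; the actual NDDP classes may differ.

    import ptolemy.data.IntToken;
    import ptolemy.data.Token;

    // Hypothetical sketch of a collection actor; the NDDP base classes are not shown here.
    public class TreeCounter extends CollectionActor {

        private int treeCount;

        // Such an actor might fix its CollectionPath parameter to "Nexus//" in its
        // constructor so that only Nexus collections trigger these handlers.

        // Called when the opening delimiter of a matching collection arrives.
        // A true return value is assumed to mean "process and forward this collection."
        protected boolean handleCollectionStart(CollectionManager collection) {
            treeCount = 0;
            return true;
        }

        // Called once for each data token in the collection; forward every token unchanged.
        protected boolean handleData(CollectionManager collection, Token data) {
            if (data instanceof TreeToken) {    // TreeToken is a hypothetical domain token
                treeCount++;
            }
            return true;
        }

        // Called when the closing delimiter arrives; record the count as collection metadata.
        protected void handleCollectionEnd(CollectionManager collection) {
            collection.addMetadata("treeCount", new IntToken(treeCount));
        }
    }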

10. Context-dependent operations on collections

Ptolemy II actor parameters are generally held constant while models are executed. Although very convenient for interactive modeling, these effectively fixed parameters do not meet the requirements of pipelined scientific workflows. Independent data sets passing through a workflow may require different actor behavior within a single execution of the Ptolemy model. Actors used in such workflows must be dynamically configurable and able to operate context dependently.

Collection metadata provide this context to actors. An instance of the MetadataToken class comprises a Java String representing the name, or key, of the metadata item, and a Ptolemy token representing the value of the metadata item. Any number of metadata tokens with distinct names may be placed in a single collection.

Metadata tokens added by the actor creating the collection are placed at the beginning of the sequence of tokens in the collection. Metadata tokens added by downstream actors are placed at the beginning of the collection if these tokens are added before the actor receives the first data token in the collection. Otherwise, such metadata tokens are placed at the end of the collection, following the data tokens and any sub-collections.

Actors may use metadata values to tune their own behavior appropriately for the current collection. Actors may observe metadata sequentially via the handleMetadata() method or on demand using the CollectionManager metadataValue() method after the metadata tokens have been received. The latter random-access method traverses the stack of successively enclosing collections to return the first metadata value corresponding to the given key. Thus, the context for actor behavior may be defined at any level within a set of nested collections and may be overridden at successively lower levels.
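For example, an actor might consult a metadata item to adjust how it processes each data token. The fragment below is a hypothetical sketch: the metadataValue() method is named above, but its signature is assumed, and the "outgroupTaxon" key is purely illustrative.

    // Hypothetical fragment from a collection actor's handleData() method. The lookup
    // searches the enclosing collections from the innermost outward, so the value in
    // effect can be set high in the nesting and overridden lower down.
    protected boolean handleData(CollectionManager collection, Token data) {
        Token outgroup = collection.metadataValue("outgroupTaxon");
        if (outgroup != null) {
            // ...process this data token using the metadata value found in context...
        }
        return true;   // assumed convention: forward the token to downstream actors
    }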

Figure 1 A simple workflow for inferring phylogenetic trees, displayed using Vergil. The List actor specifies a list of files containing input data in the Nexus format. TextFileReader reads these Nexus files from disk and outputs a generic TextFile collection for each. NexusFileParser transforms these text collections into corresponding Nexus collections. PhylipPars executes the Phylip PARS program (as a separate system process) on each Nexus collection it receives, adding the phylogenetic trees it infers to the collection. PhylipConsense applies the Phylip CONSENSE program to these trees, adding a consensus tree (reflecting commonalities in the trees inferred by PARS) to each Nexus collection. NexusFileComposer creates a TextFile collection for each Nexus collection it receives, and TextFileWriter writes a file to disk for each. The Project actor provides an enclosing Project collection for the collections created within the workflow, and ExceptionCatcher discards Nexus collections for which PhylipPars or PhylipConsense generated application errors. The green rectangle represents the Process Network director that manages execution of this workflow. The PARS and CONSENSE programs are part of the PHYLIP phylogeny inference package (Felsenstein, 2005).


Ptolemy actor parameters do provide a convenient mechanism for specifying default actor behavior. Instances of VariableToken, a subclass of MetadataToken otherwise treated identically to its parent class, can be used to automatically override the values of actor parameters at run time. Actor authors must enable this run-time overriding of parameters. Previous values of the parameter are successively restored when the ends of collections that override the parameter are reached.

Variable tokens make it easy for workflow users to apply particular parameter values to subsets of their data. For example, a simple actor that re-roots phylogenetic trees might act on each tree token it receives individually, rooting each tree to the node named by the rootAtNode parameter.


The author of this actor could choose to ignore the structures of the collections containing the trees and override only the handleData() method of the CollectionActor base class. Nevertheless, a user of this actor could root a number of trees to different nodes by placing the trees in different collections, adding a variable token named rootAtNode to each collection, and giving these variables different values.
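A hypothetical sketch of such a re-rooting actor appears below. The Parameter class and its getExpression() method are standard Ptolemy II; TreeToken, its reRoot() method, and the CollectionManager replaceData() method are assumptions introduced only for illustration.

    import ptolemy.data.Token;
    import ptolemy.data.expr.Parameter;

    // Hypothetical sketch of the tree re-rooting actor described above.
    public class TreeRooter extends CollectionActor {

        // Default node name; a VariableToken named rootAtNode in an enclosing collection
        // is assumed to override this value while that collection is being processed.
        public Parameter rootAtNode;

        protected boolean handleData(CollectionManager collection, Token data) {
            if (data instanceof TreeToken) {                  // TreeToken is hypothetical
                String node = rootAtNode.getExpression();     // current, possibly overridden, value
                TreeToken rerooted = ((TreeToken) data).reRoot(node);
                collection.replaceData(data, rerooted);       // assumed CollectionManager method
                return false;                                 // assumed: do not forward the original tree
            }
            return true;                                      // forward all other tokens unchanged
        }
    }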

Figure 2 XML representation of tokens passing between the ExceptionCatcher and NexusFileComposer actors in Figure 1. Each token is represented by a single line of text and is numbered according to the order in which NexusFileComposer receives it. Tokens 001 and 031 represent the opening and closing delimiters of the Project collection. The Project collection encloses a Nexus collection, which in turn encloses a Taxa collection. Tokens 003 and 007 represent metadata for the Nexus and Taxa collections, respectively; token 029 represents metadata for the Nexus collection. PhylipPars added the first five Tree objects to the Nexus collection. PhylipConsense added the sixth. Additional Nexus collections would follow token 030 if the List actor (see Figure 1) specified more than one input file name. Note that while this representation is suggestive of an XML file, no actor in the workflow receives or holds all of this information at a single point in time; the thirty-one tokens may be scattered over a number of different actors acting on the data stream concurrently. (Data are from Stone and Telford, 1999).


11. Addressing flow control issues in pipelined workflows

Pipelined workflows entail a number of flow control issues illustrated below by actors that address them. First, actors providing graphical displays of data must behave appropriately when processing collections. Normally it would not make sense for a graph-plotting actor to concatenate all data received in a pipelined workflow and produce one continuous plot. The CollectionGraph actor demonstrates more appropriate behavior. This actor displays a series of distinct plots corresponding to the data contained by incoming collections. The actor can create a different plot trace for each collection of data it receives, superimpose these traces, and hide or delete previous traces at the beginning of collections of specified types. Graph configuration parameters (e.g., title, scales, units, trace colors, and even graph window dimensions) can be overridden by each collection in turn.

A second issue follows from the first. Results displayed graphically by actors are overwritten repeatedly by succeeding collections in the pipeline. This behavior is confusing when workflows are executed interactively. The PauseFlow actor addresses this issue by facilitating interactive control of data flow timing. It also allows an interactive user to discard particular collections. The PauseFlow actor prompts the user at the start of each collection matching the value of its CollectionPath parameter. The user determines if and when each such collection continues through the PauseFlow actor by clicking either the Continue or the Discard button.

Related flow control issues arise when loops are introduced into pipelined workflows. The StartLoop and EndLoop actors are used to implement do-while loops. These actors bound the workflow segment that forms the body of the loop. A LoopCollectionType parameter specifies the types of the collections that should be cycled through the connected loopback ports of the StartLoop and EndLoop actors. Iteration continues until the EndLoop actor detects a specified condition, e.g., until a metadata item achieves a specified value.

The workflow fragment illustrated in Figure 3 demonstrates the utility of this approach to looping. The PARS program, provided by the PHYLIP phylogeny inference package (Felsenstein, 2005), infers parsimonious phylogenetic trees from input data sets. The PhylipPars actor wraps this program and allows it to be run in pipelined workflows.

PARS does not guarantee that all of the most parsimonious trees will be discovered by a single execution of the program. Finding all the most parsimonious phylogenetic trees for a given data set requires that the PhylipPars actor be run multiple times for each collection and that a different random number seed (used by PARS to jumble the input taxa prior to analysis) be provided to the actor on each iteration. This is achieved by placing the actor between StartLoop and EndLoop actors and incrementing the value of the jumbleSeed variable token on each iteration. The EndLoop actor is configured to allow each collection to exit the loop when a sufficient number of trees has been inferred.
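The loop-exit decision EndLoop makes for each collection might look roughly like the following. This is a speculative sketch: the treeCount metadata name matches Figure 3, but the cycle counter, the parameters, and the routing methods are invented here for illustration.

    // Speculative sketch of an EndLoop-style exit test, applied at the end of each Nexus collection.
    protected void handleCollectionEnd(CollectionManager collection) {
        int treeCount = ((IntToken) collection.metadataValue("treeCount")).intValue();
        int cycle = ((IntToken) collection.metadataValue("loopCycle")).intValue();  // assumed counter
        if (treeCount >= minUniqueTrees || cycle >= maxCycles) {
            releaseDownstream(collection);   // assumed: let the collection leave the loop
        } else {
            sendToLoopbackPort(collection);  // assumed: cycle the collection back to StartLoop
        }
    }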

Figure 3 Iterative execution of the PARS application using the StartLoop-EndLoop construct. The actor labeled “Initialize seed” adds a VariableToken named jumbleSeed to each Nexus collection; the actor labeled “Increment seed” updates the value of this variable. The VariableToken overrides the value of the jumbleSeed parameter of the PhylipPars actor, causing the PARS application to jumble the order of analyzed taxa differently on each execution. The UniqueTrees actor removes redundant trees from the Nexus collection. Both PhylipPars and UniqueTrees update the treeCount metadata item. EndLoop inspects the value of treeCount and allows each Nexus collection to exit the loop when a minimum number of unique trees have been inferred or the maximum allowed number of cycles have been performed.


12. Exception handling and long-running workflows

Exception handling is a significant hurdle to supporting complex scientific workflows. Exceptions can occur for many reasons. The paired-delimiter approach to defining collections provides a simple approach to handling one significant class of exceptions. Many applications of automated scientific workflows involve wrapping existing software. But scientific application programs vary widely in software quality and robustness. Many of these programs are prone to crashes even when given valid instructions and data. Workflows that take hours, days, or longer to complete will not be used in production if likely crashes of wrapped applications require workflows to be started from scratch. This problem is exacerbated in pipelined workflows, since an error caused by one data set can terminate the processing of any number of other data sets.

The ExceptionCatcher actor demonstrates a solution to this problem: it limits the effects of external application errors to the collections that trigger them. An actor that catches such an error may add an ExceptionToken to the collection that caused the error. The actor may then proceed to operate on the next collection. An ExceptionCatcher actor downstream of the first actor can filter out collections that contain exception tokens. Workflow composers specify the level in the collection stack at which exceptions should be filtered out using the ExceptionCatcher CollectionPath parameter. As an example, consider an exception caused by one of six collections of phylogenetic trees. A downstream ExceptionCatcher actor could discard just the tree collection that caused the error. Alternatively, it could discard the collection containing the six tree collections. The appropriate response depends on the purpose of the workflow.

Using the ExceptionCatcher actor, workflow composers control the scope over which exceptions impact workflow execution. In contrast to the try-catch constructs of C++ and Java, which limit exception scope in terms of the instruction stream, the ExceptionToken-ExceptionCatcher idiom limits exception scope in terms of the data stream as well. This is a clear example of the shift in primary emphasis—from control-flow to data-flow and effective information management—required when building infrastructure to support large-scale scientific workflows.
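In a wrapper actor such as PhylipPars, converting an external crash into an exception token might look like the following hypothetical fragment; the ExceptionToken constructor, the addData() method, and the helper names are assumptions based on the description above.

    // Hypothetical fragment: confine an external application failure to the current collection.
    protected void handleCollectionEnd(CollectionManager collection) {
        try {
            runExternalProgram(collection);   // assumed helper that invokes, e.g., PARS
        } catch (ExternalApplicationException e) {
            // Mark only this collection as failed; a downstream ExceptionCatcher with a
            // matching CollectionPath can then discard it without disturbing other collections.
            collection.addData(new ExceptionToken(e.getMessage()));
        }
    }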

13. Facilitating parallelism

It is tempting to achieve some of the benefits of collections using database tables, external files, nested directories of files, or other external systems. Tokens could carry references (e.g., file handles) to these external objects, and actors could read and update the external objects using these references. Unfortunately, such approaches often amount to a resurrection of the object-oriented programmer’s bête noire, the global variable. These approaches are particularly hazardous in a multithreaded environment such as Ptolemy II. Reading an external object from within a Ptolemy model is safe under many circumstances; but updating its state from within a workflow is often risky. Passing mutable Java objects in tokens is similarly inadvisable. Consider a workflow comprising multiple actors that update the state of a particular object either internal or external to the Java Virtual Machine. What are the likely results of carelessly introducing parallelism into this workflow? Corrupted objects, irreproducible behavior, and unresponsive systems are all to be expected.

In general, significant concurrency problems arise if more than one actor can operate on a particular mutable object simultaneously. Authors of such actors must carefully write code that obtains exclusive locks on shared objects to avoid corrupting the objects or accessing them in temporarily inconsistent states. Introducing such locks into workflows can lead to deadlock, especially when the workflow is refactored or reused later. And if the modifications made by one actor could affect the activity of an actor operating in parallel, race conditions may result—even if locks are used correctly.

Of course there is a place for databases and other persistence mechanisms. The data streaming into and out of workflows must be stored somewhere; projects must be organized and their states updated repeatedly. There is little danger in writing intermediate results to persistent storage if this information cannot be modified elsewhere in the workflow. And few complex data management systems can, or should, avoid entirely the concurrency issues noted above. Nevertheless, even experienced software engineers find these concurrency issues challenging. It is unrealistic to expect the intended users of the NDDP software tools—evolutionary biologists and other scientists not trained in information technology—to design workflows consistently immune from these problems.

Fortunately, the paired-delimiter approach to defining collections avoids these pitfalls. All tokens passed between collection actors are immutable, including the delimiter tokens that bracket collections. If a collection feeds into two or more actors in parallel, it is the references to immutable tokens, not the token objects themselves, that are duplicated. Consequently, collection actors are free to add data or otherwise modify collections they process without the risk of affecting actors operating in parallel. Parallel modifications to collections could be merged later in the workflow using a generic collection-merging actor.
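The kind of immutable domain token referred to above can be quite simple. The sketch below shows one possible shape for the hypothetical TreeToken used in the earlier fragments; every field is final and "modification" returns a new token, so references can be shared freely between actors running in parallel.

    import ptolemy.data.Token;

    // Illustrative sketch of an immutable domain-specific data token; not the NDDP implementation.
    public final class TreeToken extends Token {

        private final String newick;   // tree topology in Newick notation

        public TreeToken(String newick) {
            this.newick = newick;
        }

        public String newick() {
            return newick;
        }

        // Re-rooting produces a new token rather than mutating this one.
        public TreeToken reRoot(String nodeName) {
            return new TreeToken(reRootNewick(newick, nodeName));
        }

        private static String reRootNewick(String tree, String node) {
            // Placeholder for an actual re-rooting computation on the Newick string.
            return tree;
        }
    }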


14. Conclusions

Defining collections using paired delimiter tokens has significant advantages. Collections enable pipelined operation while maintaining data association. Support for metadata enables actors to customize behavior and provides for recording data provenance. The repercussions of external application errors are limited to the collections that trigger them. And delimiting collections with immutable tokens facilitates parallelism. Consequently, collection actors are simple to implement and maintain, and workflows based on them are easy to prototype and safe to refactor or reuse. Together with the Ptolemy II system, delimited collections provide a foundation for supporting large-scale scientific workflows and achieving the objectives of the Natural Diversity Discovery Project.

Acknowledgments

The author thanks David Collier, Ashley Deacon, and Ruth Ann Bertsch for very helpful comments on this article; III, Neal Caldecott, David Collier, Brian Lawrence, and Melora Svoboda for many useful discussions; and Bertram Ludäscher and Shawn Bowers for excellent technical suggestions.

References

F. Berman, G. Fox, T. Hey, eds. (2003). Grid Computing: Making the Global Infrastructure a Reality, John Wiley & Sons Ltd, West Sussex, England.
H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne (2000). The Protein Data Bank, Nucleic Acids Research 28, 235-242.
C. Brooks, E.A. Lee, X. Liu, S. Neuendorffer, Y. Zhao, H. Zheng (2004). Heterogeneous Concurrent Modeling and Design in Java (Volume 1: Introduction to Ptolemy II), Technical Memorandum UCB/ERL M04/27, University of California, Berkeley, California.
H.-J. Chiu, T. McPhillips, S. McPhillips, K. Sharp, T. Eriksson, N. Sauter, M. Soltis, P. Kuhn (2002). Collaboratory for Macromolecular Crystallography at SSRL. Networked Learning in a Global Environment, Natural & Artificial Intelligence Systems Organization (NAISO).
J. Felsenstein (2004). Inferring Phylogenies, Sinauer Associates, Inc., Sunderland, Massachusetts.
J. Felsenstein (2005). Phylip web site: http://evolution.gs.washington.edu/phylip.html.
M. Goel (1998). Process Networks in Ptolemy II. MS Report, ERL Technical Report UCB/ERL No. M98/69, University of California, Berkeley, California.
W.E. Johnston (2004). Semantic Services for Grid-Based Large-Scale Science, IEEE Intelligent Systems 19, 34-39.
T.S. Kemp (1999). Fossils & Evolution, Oxford University Press Inc., New York.

S.A. Lesley, P. Kuhn, A. Godzik, A.M. Deacon, I. Mathews, A. Kreusch, G. Spraggon, H.E. Klock, D. McMullan, T. Shin, J. Vincent, A. Robb, L.S. Brinen, M.D. Miller, T.M. McPhillips, M.A. Miller, D. Scheibe, J.M. Canaves, C. Guda, L. Jaroszewski, T.L. Selby, M.-A. Elsliger, J. Wooley, S.S. Taylor, K.O. Hodgson, I.A. Wilson, P.G. Schultz and R.C. Stevens (2002). Structural Genomics of the Thermotoga maritima Proteome Implemented in a High-throughput Structure Determination Pipeline, Proceedings of the National Academy of Sciences of the USA 99, 11664-11669.
B. Ludäscher & I. Altintas (2003). On Providing Declarative Design and Programming Constructs for Scientific Workflows based on Process Networks. Technical Note: SciDAC-SPA-TN2003-01.
B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger-Frank, M. Jones, E. Lee, J. Tao, Y. Zhao (2005). Scientific Workflow Management and the Kepler System, Concurrency and Computation: Practice & Experience, Special Issue on Scientific Workflows, in press.
T.M. McPhillips, S.E. McPhillips, H.-J. Chiu, A.E. Cohen, A.M. Deacon, P.J. Ellis, E. Garman, A. Gonzalez, N.K. Sauter, R.P. Phizackerley, S.M. Soltis and P. Kuhn (2002). Blu-Ice and the Distributed Control System: Software for Data Acquisition and Instrument Control at Macromolecular Crystallography Beamlines, Journal of Synchrotron Radiation 9, 401-406.
J.P. Morrison (1994). Flow-Based Programming. Van Nostrand Reinhold, New York.
R. Schuh (2000). Biological Systematics. Principles and Applications, Comstock Publishing Associates, Cornell University Press, Ithaca, New York.
E. Sober (1988). Reconstructing the Past. Parsimony, Evolution and Inference, MIT Press, Cambridge, Massachusetts.
J.R. Stone, M. Telford (1999). Using Critical Path Method to Analyse the Radiation of Rudist Bivalves, Palaeontology 42, 231-242.
The Office of Science Data-Management Challenge (2005). Report from the DOE Office of Science Data-Management Workshops, March-May 2004.
