Paper # 1258

The Power of Declarative Languages: A Comparative Exposition of Scientific Workflow Design using BioFlow and Taverna∗

Hasan Jamil    Aminul Islam

Department of Computer Science, Wayne State University, USA
[email protected], [email protected]

Abstract

Scientific workflow design is usually complex and demands integration of numerous resources. Geographical distribution and semantic heterogeneity of resources add to this complexity. The cost-effectiveness of such workflow design thus depends upon the lifespan of the application and its anticipated use; shorter application lifespans usually entail prohibitive development costs. In this paper, we present an alternative platform for declarative workflow design using BioFlow in distributed and heterogeneous environments. We argue that a declarative workflow design using BioFlow is more efficient and cost effective than traditional approaches using systems such as Taverna. To demonstrate the advantages of BioFlow, we compare a canonical microarray data analysis workflow designed using both Taverna and BioFlow. We show that BioFlow supports ad hoc and modular application design at a throwaway cost, and produces a superior, maintainable application that can adapt to changes in the sources without significant effort. Finally, we discuss a visual application builder, called VizBuilder, with which end users are able to design workflows without any knowledge of BioFlow.

∗ This research was partially supported by National Science Foundation grants CNS 0521454 and IIS 0612203.

1 Introduction

Scientific workflows usually require access to numerous distributed resources, application of analysis tools, subjective interpretation by users, and several iterative refinements of the investigative hypotheses. Oftentimes, these workflows are extremely short-lived but no less complicated and involved than their long-lived and larger counterparts. Access to distributed resources such as databases and analysis

tools is complicated by their semantic heterogeneity, autonomy and constant need to change. New resources that improve upon previous ones become available on a regular basis, offering better quality information and thus incentives for application designers to migrate. In such an environment, tools that are designed to be hardwired to the existing setup face major redesign, adaptation and maintenance overhead. Consequently, many interesting and potentially beneficial questions cannot even be asked, due to the prohibitive underlying costs of application development.

Workflows in general require sophisticated query support, control structures, exception handling, monitoring and user intervention, and control flow implementation in the form of process graphs. Scientific workflows typically use massive amounts of data, long-running data analyses and complicated series of steps. Traditionally, data from multiple sources are copied to a local machine, tools are implemented or installed, and workflows are designed around these resources. This approach makes it difficult to keep up with constantly changing data and to maintain elaborate collaborative protocols with the data sources; it results in rigid coupling and loss of autonomy, and often requires the acquisition of fast local servers.

We are developing a new database engine, called LifeDB, to support applications such as the ones just described, using a new interpretive and declarative query language called BioFlow. Using BioFlow, a user can design workflows, integrate databases and tools, and resolve heterogeneity of databases, all on-the-fly and fully automatically. The declarative nature of BioFlow offers higher levels of abstraction, and users are able to focus more on their analysis needs rather than invest in developing the application with knowledge of all lower-level details. Since applications are compiled and executed at each invocation, changes in the sources are not an issue for BioFlow.
In this paper, we discuss two scientific workflow examples in the Life Sciences that we treat as model representations of real-life scientific workflows and as a basis for comparison of the two systems, Taverna [13] and BioFlow1 [14]. The first example implements a distributed workflow to compute differentially expressed genes, and then to identify common GO terms associated with these genes. The second example is related to human p63 transcription factors and the regulation of mRNAs. The first workflow has already been implemented in Taverna, and the second in BioFlow. In this paper, we show a straightforward simulation of the Taverna example in BioFlow, and argue that it is neither intuitive nor easy to implement the second (BioFlow) example in Taverna, because Taverna would require extensive glue code writing, or web services support, to implement it. We leverage the Taverna workflow implemented in BioFlow to highlight some of BioFlow's features on intuitive grounds.

[Figure 1 appears here: the Taverna workflow graph of 23 numbered tasks, including getMeasurementNames (1), getNamesUsingXPath (2), selectControlData (3), queryMaxd1 (4), mergeOutput1 (5), queryMaxd2 (6), mergeOutput2 (7), Ttest_pvalue (9), performT_test (10), cleanCSV (11), flatten, getAffyIDs, selectTestData, getTranscriptID, getTargetDescription, createFinalCSV, createGeneList, getGeneNames, GOTermFinder_pvalue, analysisGenePDFOutput, and the CSV, spurious_gene and pdf outputs. The tasks are grouped under the stage labels "Query and retrieve data from maxd using maxdBrowser", "Perform statistical analysis in R", "Cleanup results from R using BeanShell script and local Java processor", "Map affymetrix ids to yeast gene names and ORF numbers to create a CSV file containing results", and "Identifying common gene ontology terms common to genes using the GOTermFinder service".]

2 A Bird’s Eye View of BioFlow


BioFlow is a largely declarative database query language designed for scientific applications. It supports ad hoc integration of distributed heterogeneous databases on the web, XML-relational mixed-mode data manipulation, workflow design, and tool integration. It uses state-of-the-art techniques for semantic reconciliation and schema mapping, wrapper generation and mediation, and system support modules. These features, and the application design approach supported in BioFlow, make it possible to develop applications with substantially more ease and clarity than many other contemporary systems. In the discussion to ensue, we introduce the salient features of BioFlow. A complete discussion of BioFlow is outside the scope of this article; interested readers are referred to [14, 6] for an in-depth exposition of its formal syntax and semantics, and of its host system LifeDB.
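The semantic reconciliation step mentioned above can be pictured, very roughly, as matching column names through a table of synonyms. The Python sketch below is an illustration only, not BioFlow's actual algorithm; the synonym table and the column names are invented for the example:

```python
# Toy schema matcher in the spirit of a reconciliation step: match
# columns of two schemas via a synonym table. Real systems use
# ontologies and string similarity; everything here is illustrative.
SYNONYMS = {"mirna": {"mirna", "microrna"},
            "geneid": {"geneid", "genename", "gene"}}

def canon(col):
    """Map a column name to its canonical synonym-class key."""
    c = col.lower()
    for key, names in SYNONYMS.items():
        if c in names:
            return key
    return c

def match_schemas(cols_a, cols_b):
    """Return {column_in_a: column_in_b} for semantically equivalent columns."""
    by_canon_b = {canon(c): c for c in cols_b}
    return {a: by_canon_b[canon(a)] for a in cols_a if canon(a) in by_canon_b}

m = match_schemas(["miRNA", "geneName", "pValue"],
                  ["microRNA", "geneID", "targetSites"])
```

Under this toy matcher, miRNA pairs with microRNA and geneName with geneID, while pValue and targetSites remain unmatched, which mirrors the kind of heterogeneity discussed in section 3.1.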

2.1 BioFlow by Example: A Gene Expression Data Analysis Use Case

In this section, we introduce an example from microarray data analysis that has been argued in [15] to be a fairly complicated workflow, requiring the sophistication and modeling power of a leading workflow system such as Taverna [13]. Taverna is a leading workflow management system developed in the context of the myGrid project, and it is widely favored for Life Sciences workflow application design across the globe. We have reproduced and adapted the workflow in figure 1 from [15] to analyze it in the context of both BioFlow and Taverna.

1 In our opinion, Taverna stands at the same level as Kepler [5], Triana [16] and Pegasus [11] in terms of their architecture and operational semantics. We thus compare our system only with Taverna in this paper, for the sake of brevity and for want of space.

Figure 1. Taverna Workflow

The purpose of this workflow was to discover differentially expressed genes by a t-test analysis between two sets of microarray data, followed by the identification of common terms from the Gene Ontology (GO) [4] associated with these genes [8]. These terms are identified from the biological process, cellular component and molecular function sub-ontologies of the GO database, using a web service wrapping of the GOTermFinder tool [7]. As shown in figure 1, the workflow has four major parts: obtain data from the maxdLoad2 microarray database, perform a statistical t-test to identify differentially expressed genes, clean up disparities, and collect gene ID mappings as annotations from the GO database. For the most part, Taverna is an application-directed service composition system. This essentially means that, for Taverna to compose tools, applications, or data in a meaningful way to create a workflow, an expert user has to write the glue code ahead of time, so that a user can leverage that glue code, written in the form of Java, PHP or Perl scripts, in his application. In this example, for the collection of data from the maxdLoad2 database, Taverna uses the maxdBrowse web service, which supports access from Taverna workflows. New operations can be added by the site maintainer for Taverna applications requiring new services. Available service descriptions are provided by the maxdLoad2 database as WSDL documents, for which tools exist in the Taverna workbench to access and read them. Schema heterogeneity and data disparities are resolved using BeanShell


scripts that are written completely manually and remain the responsibility of the end users. In this application, BeanShell applications were written to transform and merge the data returned by the maxdLoad2 database prior to t-test analysis using R. Another BeanShell script was written to generate, in CSV format, the final list of differentially expressed genes after the t-test. Finally, to enable interaction with R and actually perform the t-test, a new Taverna processor called RShell was developed and integrated into the system to invoke scripts inside the R computing environment. This processor acts as a client to R, which serves as a TCP/IP server via the RServe library, making it possible to relay the script and its inputs to R. We present below a partial implementation of the same workflow in BioFlow, and discuss BioFlow in the context of this example to highlight its unique features in contrast with Taverna.

 1  define function remote getMeasAll URL
 2    from "http://dbkgroup.org/software/maxd/maxdBrowse/sequences/"
 3    submit(choice varchar(100));
 4  define function getMeasRes table
 5    extract measure_name varchar(100), exp_name varchar(200)
 6    using wrapper mWrapper mapping mMapping in ontology mOntology
 7    from maxdURL
 8    submit(meas_name varchar(100), maxd_split varchar(20), maxd_prof varchar(20),
 9           type varchar(20), display_brow logical |
10           maxdURL varchar(200));
11  define function getMeasurement text
12    from maxdURL
13    submit(col_type varchar(10), measure_name varchar(100), maxd_split varchar(20),
            maxd_prof varchar(20), type varchar(20), display_brow logical |
            maxdURL varchar(200));
14  define function local getMerged text
15    from "e:/bioflow/merge.jar"
16    submit(measure_name varchar(500));
17  define function local getR text
18    from "e:/bioflow/r.bat"
19    submit(inFileName varchar(500), pvalue float);
20  define function local createCSV text from "e:/bioflow/rCSV.bat"
      submit(inputFileName1 varchar(500), inputFileName2 varchar(500));
21  define function local getAffyIDs text from "e:/bioflow/affyID.jar"
      submit(inFileName varchar(500));
22  define function local getDetailsForAffyIDs text from "e:/bioflow/RaffyIDdetails.bat"
      submit(inFileName varchar(500));
23  define function local spuriousgenes text from "e:/bioflow/Rspur.bat"
      submit(inFileName varchar(500));
24  define function local goTermPdfOutput from "e:/bioflow/goTermPDF.jar"
25    submit(inFileName varchar(500), gotermP_value float, outputPDFname varchar(500));
26  process selectControlData {
27    drop table if exists controlMeasure;
28    create datatable controlMeasure { measure_name varchar(500), exp_name varchar(200) };
29    insert into controlMeasure
        select trim(replace(measure_name, substring_index(measure_name, 'm751', 1), ' ')),
               exp_name
        from measurements where measure_name like '%m751%';
30    call getMeasurement with (select "*", measure_name, ",", "DEFAULT", "XML", true,
        call getMeasAll("Measurement DataTabTextFASTTransposed") from controlMeasure)
        into outfile "e:/bioflow/control.txt";
31    call getMerged("e:/bioflow/control.txt") into outfile "e:/bioflow/second_tmp.txt";
32  }
33  process selectTestData {
34    drop table if exists testMeasure;
35    create datatable testMeasure { measure_name varchar(500), exp_name varchar(200) };
36    insert into testMeasure
        select trim(replace(measure_name, substring_index(measure_name, 'm145', 1), ' ')),
               exp_name
        from measurements where measure_name like '%m145%';
37    call getMeasurement with (select "*", measure_name, ",", "DEFAULT", "XML", true,
        call getMeasAll("Measurement DataTabTextFASTTransposed") from testMeasure)
        into outfile "e:/bioflow/test.txt";
38    call getMerged("e:/bioflow/test.txt") into outfile "e:/bioflow/first_tmp.txt";
39  }
40  process main {
41    open database diffGeneExpression;
42    drop table if exists measurements;
43    create datatable measurements { measure_name varchar(500), exp_name varchar(200) };
44    insert into measurements
45      call getMeasRes("*", " ", "DEFAULT", "XML", true,
46        call getMeasAll("GetAllMeasurementNames"));
47    perform parallel selectTestData, selectControlData leave;
48    wait on selectTestData, selectControlData;
49    call getR("e:/bioflow/first_tmp.txt", "e:/bioflow/second_tmp.txt", 0.05)
        into outfile "e:/bioflow/t_test_results.txt";
50    call getAffyIDs("e:/bioflow/t_test_results.txt") into outfile "e:/bioflow/affyids.txt";
51    call getDetailsForAffyIDs("e:/bioflow/affyids.txt")
        into outfile "e:/bioflow/affyDetails.txt";
52    call spuriousgenes("e:/bioflow/affyDetails.txt") into outfile "e:/bioflow/spur_genes.txt";
53    call createCSV("e:/bioflow/t_test_results.txt", "e:/bioflow/affyDetails.txt")
        into outfile "e:/bioflow/result.csv";
54    call goTermPdfOutput("e:/bioflow/affyDetails.txt", 0.05, "e:/bioflow/go.pdf");
55    close database diffGeneExpression;
56  }

In the above script, the define function statements numbered (1)-(3), (4)-(10), (11)-(13), (14)-(16), (17)-(19), and (20)-(25) are unique to BioFlow and thus of special interest. The define function abstraction was first introduced in [9] as a user-defined function for SQL to establish declarative connectivity with the worldwide web. It combined the HTTP protocol, CGI and wrapper technology in one single statement to allow form submission and data extraction, in the form of a table, into relational databases. define function essentially declares an interface to the web site at the URL in the from clause, and specifies what actions are taken to interact with that site. In its current form, define function is a context-dependent interface to multiple different types of internet-accessible resources. In this example, the define function statement in line (1) declares an interface for the web site of the maxd microarray database at http://dbkgroup.org/software/maxd/maxdBrowse/sequences/, included in the from clause in line (2). The keywords remote and URL in line (1) signal that the resource is an internet-reachable entity, and that it returns another web address in response to the arguments of the submit clause in line (3). In other words, this function navigates to another web form depending on the arguments submitted to the web form for which the function is defined.

The define function statement in line (4) is more involved. It also defines an interface for the maxd database, but this time the URL is sent to the function as an argument. In general, the extract, using and submit clauses, and the options remote, local, wrapper and mapping, are optional; but if extract is used, using becomes mandatory, and within using, wrapper and mapping are optional. extract specifies what data items are to be collected from the returned page. So, for extract to be applicable, the function must return an XML or HTML document in which a table can be found (the table option in line (4))2. That means extract cannot be used with the URL or text options. Furthermore, the submit clause takes a two-sorted vector of arguments. The first sort, the arguments from the left up to the vertical bar (line (9)), are the arguments for onward submission to the URL in the from clause, and are called the application variables. The remaining arguments, after the vertical bar, called the function variables, are used to instantiate the function itself so that functions can behave differently as needed. In this case, the define function at (4) takes its URL from the function variables. This means the function is fully defined only at run time, after invocation, and must use late binding as in C++. Such functionality is required to capture multi-page web forms such as the one at the maxd database. For example, line (46) calls the function getMeasAll to generate the URL, which is then passed on to the define function of line (4) through the call at line (45). In the call statement at line (45), the call statement of line (46) is an argument, which instantiates the maxdURL variable of line (7) through the submit clause function variable maxdURL in line (10).

Correspondingly, the call statements that invoke the define functions (as shown in lines (30), (31), (37), (45), (46), and (49)) take a role depending on the context of invocation. For example, the call at (30) will return a text file that is saved as "e:/bioflow/control.txt", whereas the call at (45) will return a table that is stored as the table named measurements, and the call at (46) will just return a URL. Differently from these calls, the statements at (31) and (49) return nothing directly; their results are written to files. The functions at lines (14), (17) and (20) define local executable functions that communicate with BioFlow via stored files. For example, the function getMerged is a Java program that merges a set of data in text files such as control.txt, and the function getR is a batch program that invokes R to perform a t-test on the data in the files first_tmp.txt and second_tmp.txt when the call at (49) is initiated.

In this script, the process selectControlData performs tasks 3, 4 and 5 in figure 1, while process selectTestData performs tasks 6, 7 and 8, both through the execution of the perform statement at line (47). Statement (44) implements tasks 1 and 2, and (49) accomplishes tasks 9 and 10. Line (53) implements steps 11 and 12 of figure 1, and the statements (50) through (54) implement tasks 13 through 23. For now, we defer a detailed discussion of the statements perform and wait on at lines (47) and (48) until section 2.2.2. Overall, it is important to note that BioFlow does not require site cooperation, and it can function without any custom code writing. It does, however, require local tools such as merge.jar and r.bat for functionalities that are not part of BioFlow and for which no internet services are available.
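By way of analogy only (this is not BioFlow's implementation), the application-variable half of such a define function can be pictured as a function factory that binds named form parameters into a submission URL. The names below mirror the getMiRNA example of section 3.1 and are otherwise hypothetical:

```python
from urllib.parse import urlencode

def define_function(base_url, params):
    """Model a BioFlow-style web-form interface as a callable: it binds
    named arguments into a form-submission URL (no network I/O here)."""
    def call(**kwargs):
        unknown = set(kwargs) - set(params)
        if unknown:
            raise TypeError(f"unexpected arguments: {sorted(unknown)}")
        # keep the declared parameter order, as a submit clause would
        query = urlencode([(p, kwargs[p]) for p in params if p in kwargs])
        return base_url + "?" + query
    return call

getMiRNA = define_function("http://www.microrna.org/microrna/getTargets.do",
                           ["matureName", "organism"])
url = getMiRNA(matureName="hsa-mir-10a", organism="9606")
```

Only the application variables of the submit clause are modeled here; BioFlow's function variables (after the vertical bar), and the extract/wrapper machinery, have no counterpart in this sketch.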

2.2 Simplicity of BioFlow

It should be clear from the discussion above that designing even a workflow as simple as the differential gene expression example is notably complicated in Taverna. The important issue to observe here is that Taverna is incapable of actually implementing the workflow without application-specific glue code written in Java or BeanShell, and once the underlying resources change, a new set of code and scripts must be written for another application. Furthermore, it requires significant site and server cooperation to avoid any customized code writing. For example, if the maxdLoad2 database did not offer a web service, Taverna would require a new set of glue code to access the resource, which is actually available as a web interface for all to use, notwithstanding the manual reconciliation of schema mismatches that is required in Taverna. The BioFlow implementation of the same workflow, as presented, can be written fairly fast by end users, and can be developed using BioFlow's front end VizBuilder in literally several minutes. It is again important to note that the BioFlow implementation did not require any code writing3, specific low-level knowledge, or site cooperation of any form from the maxdLoad2 database.

2 A define function can return only three types of objects: a table (possibly nested, as in XML), a URL, or a text document. However, it may also be void and return nothing at all.
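For concreteness, the statistical core that this workflow delegates to R is a per-gene two-sample t-test. The Python sketch below is an illustration only: the data layout is hypothetical, and for brevity genes are filtered on a raw |t| cutoff rather than the p-value the actual workflow uses:

```python
import math

def welch_t(xs, ys):
    """Welch's t statistic for two independent samples."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)  # sample variance
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

def differentially_expressed(control, test, t_cutoff=4.3):
    """Gene IDs whose expression differs between conditions beyond the cutoff.

    control and test map a gene id to its list of replicate expression values."""
    return sorted(g for g in control
                  if abs(welch_t(control[g], test[g])) > t_cutoff)
```

In the actual workflow this step runs inside R (invoked through r.bat) and filters on a p-value threshold of 0.05; the sketch only conveys the shape of the computation.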

2.2.1 Analysis Tool Integration

The data model of BioFlow treats external resources, objects that are not part of the BioFlow engine, usually as functions. This simple view makes it possible to integrate and implement a wide range of services as part of the BioFlow system. As discussed, the define function statement treats web forms as external functions. In a similar manner, the define function statement with the local option treats a command line executable as a user defined function with a defined input/output behavior. These functions are executed asynchronously, as separate processes, making it possible to even launch other applications for users to interact with their system of choice. The only requirement such a function must satisfy is that it may only return a single document of type text, XML or HTML as output, if any.

2.2.2 Modular Workflow Design and Process Graphs

BioFlow supports fairly complicated workflows through its modular process definition structure and its perform statement options. Processes in BioFlow are named, can be made persistent as part of a named database, and can be reused by simply referring to them. In the example in section 3.1, compute_mirna is such a named process. A process may contain any combination of valid BioFlow statements, except a call to itself (recursion is not supported in the current version of BioFlow). So, arbitrary processes can be developed, packaged as a module in the form of a named process, and stored or reused.

However, whatever the granularity of the tasks in a workflow, the workflow itself can be fairly complicated. Yet, sophisticated workflows can be developed using BioFlow constructs very easily. To see how, consider the abstract workflow shown in figure 2. This workflow can be implemented in BioFlow as shown below.

     open database workflow;
(1)  perform p;
(2)  out := false;
(3)  repeat
(4)    perform parallel q, r leave;
(5)    perform s;
(6)    wait on q;
(7)    if (c) then
(8)      perform t, u
(9)    else { perform v; out := true; }
(10) until (out);
     close database workflow;

[Figure 2 appears here: a process graph with nodes p, q, r, s, t, u, v, assignments set out = false and set out = true, a branch if (c) with true and false edges, and a repeat until !(out) back edge; the legend distinguishes control flow edges from wait on edges.]

Figure 2. An Abstract Workflow

As shown, besides many declarative constructs, BioFlow also supports programming constructs such as assignment statements (line 2), a repeat loop (line 3), and control statements such as if-then-else (line 7). Inclusion of these constructs helps design workflows more conveniently than with the core BioFlow declarative language alone. The semantics of the constructs can be described as follows. In lines 1, 5, 8, and 9, BioFlow executes the processes listed exactly in the sequence in which they appear in each statement; for example, in line 8, process t is executed before process u. The processes in line 4, however, are executed very differently from all of these perform statements: q and r are executed in parallel, and BioFlow moves on to line 5 immediately after scheduling these two processes, owing to the leave option at the end, whereas in all other perform cases BioFlow waits for the execution of all listed processes to finish before moving on to the next logical statement in the program. Once the leave option has been used, BioFlow can be made to wait on the completion of a set of processes, as shown in line 6. Here, BioFlow will not execute the if statement until process q has completed. Since q and r were scheduled and left to execute, q may already have completed, in which case BioFlow will not wait; in other words, BioFlow checks whether the process has completed and takes the necessary actions. It should now be clear that BioFlow is fairly expressive and can handle and support sophisticated workflow design needs.

3 We contrast code writing, as in Java, Perl, PHP or C, with ad hoc querying in SQL-like declarative languages by arguing that declarative languages are more abstract and conceptual in nature, and not procedural.

2.2.3 Data Manipulation and Mixed Mode Querying

BioFlow supports a nested SQL syntax to include complex structures in the direction of nested relations [10], and XML. In BioFlow, nested relations are essentially XML documents, and all nested SQL queries are translated to XQuery for processing the corresponding XML documents. The distinction between XML and relations is made at the execution and file server level, while the user view at the query level uniformly treats them as relations. We use MonetDB [2] as our local LifeDB data and query server, which


supports both XML and relational data formats, making it easier for us to support mixed mode querying and present a uniform view of data to the users.
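The perform parallel ... leave and wait on semantics of section 2.2.2 can be mimicked with ordinary threads. The following Python sketch is an analogy only; the Workflow class and the process names are invented:

```python
import threading

class Workflow:
    """Rough analogy of BioFlow's perform / leave / wait on (section 2.2.2).

    perform(..., leave=True) schedules processes and returns immediately;
    wait_on(...) blocks until the named processes have finished."""
    def __init__(self):
        self._threads = {}

    def perform(self, *procs, leave=False):
        for p in procs:
            t = threading.Thread(target=p)
            self._threads[p.__name__] = t
            t.start()
        if not leave:  # default: block until all listed processes finish
            self.wait_on(*[p.__name__ for p in procs])

    def wait_on(self, *names):
        for n in names:
            self._threads[n].join()

results = []
def q(): results.append("q")
def r(): results.append("r")

wf = Workflow()
wf.perform(q, r, leave=True)   # schedule q and r in parallel, move on
wf.wait_on("q", "r")           # block until both have completed
```

As in BioFlow, if a scheduled process has already finished when wait_on is reached, the join returns immediately and no waiting occurs.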

2.2.4 Visual Workflow Design using VizBuilder

VizBuilder [12] is a web-based editor for visual programming with BioFlow. With the advent of scientific workflows and business process management systems, people with little or no background in computer programming, biologists included, are forced to develop working programs. VizBuilder can aid non-programmers, as well as seasoned developers, with its intuitive design. VizBuilder is based on a novel Nested Graph Grammar and is developed in Java as an applet that can be deployed with any J2EE server and used with a Java-enabled browser such as Internet Explorer or Mozilla Firefox to draw a visual program. We used the Java Universal Network/Graph Framework (JUNG) [1] as the library for drawing and maintaining graphical structures.

[Figure 3 appears here: the VizBuilder architecture, showing an editor (with a tool bar and drawing pane) and a kernel; the language definition (Nested Graph Grammars, XQuery, BioFlow) feeds a deployment phase that produces the visual alphabet, syntax-directed drawing rules, and a Model2Code translation scheme, which translates a visual program into a BioFlow script.]

Figure 3. System Overview of VizBuilder

In figure 3, we show the major components of the VizBuilder system, which can be separated into an editor and a kernel. The editor has several predefined operations, such as the capability to open, close or save an editing session, and undo/redo operations. The tool bar of the editor is reconstituted from the visual language alphabet of BioFlow. The editor also supports a set of syntax-directed editing rules which are derived from BioFlow. During the deployment phase, a language designer defines the visual icons as instances of the nested graph grammar. The visual language elements are expressed in the Extensible VizBuilder Application Markup Language (XVAML), which was inspired by XAML [3]. The kernel then converts the XVAML definition into a set of VizBuilder-specific Java classes and objects. In this stage, the kernel derives a translation scheme from the visual model to the textual language. The system is then recompiled with these classes and deployed for programming purposes. To program with VizBuilder, an end user opens up the editor in the browser and draws the visual program by dragging the operators and the edges from the tool bar. The rule base, built in the deployment phase, guides the development with on-the-fly syntax checking. When the user chooses to compile his or her visual program, it is checked for syntactic and semantic errors. If the program is error free, it is translated into code of the target textual language, BioFlow, using the Model2Code translation scheme set in the deployment phase.

3 The Power of Abstraction in BioFlow

We view an SQL-like declarative query language such as BioFlow as an acceptable abstraction for workflow modeling. SQL provides two major abstractions: data definition and data manipulation. We expand data definition to include external resources, and call it resource definition; this includes the create datatable, define ontology and define function statements. While define ontology is in the proposed syntax of BioFlow, it is not currently supported; instead, all ontology elements such as schema matches and wrappers are directly accessed via the define function statements. For data manipulation, BioFlow retains all SQL features, but supports nested structures and the power to transparently mix SQL and XML data in the same application and language. BioFlow also extends SQL by supporting workflow abstractions in the form of process definitions, include and perform statements, and wait on statements. Together, these three sets of abstractions lend a powerful combination for writing ad hoc applications for on-the-fly data integration and workflow design using distributed resources, without the need for writing low-level application code.

3.1 The Case of New Applications

To illustrate the capabilities of LifeDB and contrast them with those of Taverna, we adapt another real-life Life Sciences application, discussed in [17], which has been used as a use case for many other systems in which substantial amounts of glue code were written to implement the application by manually reconciling the source schemas to filter and extract information of interest. Our goal in this section is to show how simple and efficient it is to develop this application in LifeDB. The query, or workflow, the user wants to submit is the hypothesis: "the human p63 transcription factor indirectly regulates certain target mRNAs via direct regulation of miRNAs". If positive, the user also wants to know the list of miRNAs that indirectly regulate other target mRNAs with a high enough confidence score (i.e., pValue ≤ 0.0006 and targetSites ≥ 2), and so he proceeds as follows. He

collects 52 genes, along with their chromosomal locations (shown partially in Figure 4(a) as the table genes), from a wet lab experiment; these are the host miRNA genes that map at or near genomic p63 binding sites in the human cervical carcinoma cell line ME180. He also has, as candidates, a set of several thousand direct and indirect protein-coding genes (shown partially in Figure 4(d) as the table proteinCodingGene) that are the targets of p63 in ME180. The rest of the exploration proceeds as follows.

[Figure 4. User tables and data collected from microRNA.org and microrna.sanger.ac.uk: (a) genes, (b) sangerRegulation, (c) micrornaRegulation, (d) proteinCodingGene, (e) regulation, (f) proteinCodingGeneRegulation.]

He first collects a set of gene IDs for each of the miRNAs in the table genes by submitting one gene at a time to the web form at www.microrna.org, which returns, for each such gene, a set of gene names that are known targets of that miRNA. From the site's response, the user collects the targetSites along with the gene names, shown partially as the table micrornaRegulation in Figure 4(c). To be thorough, he also collects the set of gene names for each miRNA in table genes from microrna.sanger.ac.uk in a similar fashion, shown partially as the table sangerRegulation in Figure 4(b). Notice that this time the column targetSites is not available, so he collects the pValue values instead. Also note that the schemas of these tables are syntactically heterogeneous but semantically similar (i.e., miRNA≡microRNA, geneName≡geneID, and so on). He queries both sites because the data in the two databases are not identical, and querying only one of them may not return all possible responses. Once these two tables are collected, he takes the union of the two sets of gene names (in micrornaRegulation and sangerRegulation), and finally selects, as his response, the genes in the intersection of the table proteinCodingGene (those with p63Binding='N') and micrornaRegulation ∪ sangerRegulation. To compute his answers in BioFlow using LifeDB, all he needs to do is execute the following script, which completely implements the application. It is interesting to note that this application uses only seven data manipulation statements; the rest are data definition statements that would be needed in any solution using any other system.

process compute_mirna {
  open database bioflow_mirna;

  drop table if exists genes;
  create datatable genes {
    chromosome varchar(20), start int, end int, miRNA varchar(20) };
  load data local infile '/genes.txt' into table genes
    fields terminated by '\t' lines terminated by '\r\n';

  drop table if exists micrornaRegulation;
  create datatable micrornaRegulation {
    mirna varchar(200), targetsites varchar(200), geneID varchar(300) };

  drop table if exists sangerRegulation;
  create datatable sangerRegulation {
    microRNA varchar(200), geneName varchar(200), pvalue varchar(200) };

  drop table if exists proteinCodingGene;
  create datatable proteinCodingGene {
    Gene varchar(200), p63binding varchar(20) };
  load data local infile '/proteinCodingGene.txt' into table proteinCodingGene
    fields terminated by '\t' lines terminated by '\r\n';

  define function getMiRNA
    extract mirna varchar(100), targetsites varchar(200), geneID varchar(300)
    using wrapper mirnaWrapper in ontology mirnaOntology
    from "http://www.microrna.org/microrna/getTargets.do"
    submit(matureName varchar(100), organism varchar(300));

  define function getMiRNASanger
    extract microRNA varchar(200), geneName varchar(200), pvalue varchar(30)
    using wrapper mirnaWrapper in ontology mirnaOntology
    from "http://microrna.sanger.ac.uk/cgi-bin/targets/v5/hit_list.pl/"
    submit(mirna_id varchar(300), genome_id varchar(100));

  insert into micrornaRegulation
    call getMiRNA with select miRNA, '9606' from genes;

  insert into sangerRegulation
    call getMiRNASanger with select miRNA, '2964' from genes;

  create view regulation as
    combine micrornaRegulation, sangerRegulation
    using matcher OntoMatch identifier gordian;

  create view proteinCodingGeneRegulation as
    link regulation, proteinCodingGene
    using matcher OntoMatch identifier gordian;

  select * from proteinCodingGeneRegulation where p63binding = 'N';

  close database bioflow_mirna;
}
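The two views at the heart of the script, regulation (built with combine) and proteinCodingGeneRegulation (built with link), have straightforward relational readings: a union over matched schemas and a join over matched keys. As a rough hand-written analogue (not BioFlow's implementation), the sketch below mimics them in pandas on a few toy rows, with the column correspondences that the OntoMatch matcher would discover written out by hand:

```python
import pandas as pd

# Toy fragments of the collected tables (values illustrative only).
micrornaRegulation = pd.DataFrame({
    "mirna":  ["hsa-mir-10a", "hsa-mir-205"],
    "geneID": ["FLJ36874", "RUNDC2C"],
})
sangerRegulation = pd.DataFrame({
    "microRNA": ["hsa-mir-10a", "hsa-miR-196b"],
    "geneName": ["FLJ36874", "MYO16"],
})
proteinCodingGene = pd.DataFrame({
    "Gene":       ["FLJ36874", "RUNDC2C", "MYO16"],
    "p63binding": ["Y", "Y", "N"],
})

# 'combine' = vertical integration: union the two tables after mapping
# sangerRegulation's columns onto micrornaRegulation's schema.  In BioFlow
# this mapping is discovered by the schema matcher; here it is hand-coded
# (miRNA ≡ microRNA, geneID ≡ geneName).
mapped = sangerRegulation.rename(
    columns={"microRNA": "mirna", "geneName": "geneID"})
regulation = (pd.concat([micrornaRegulation, mapped], ignore_index=True)
                .drop_duplicates())

# 'link' = horizontal integration: join the combined view with the user's
# protein-coding gene table on the matched key columns (geneID ≡ Gene).
linked = regulation.merge(proteinCodingGene,
                          left_on="geneID", right_on="Gene")

# Final selection, mirroring: select * from ... where p63binding = 'N'
answer = linked[linked["p63binding"] == "N"]
print(answer["geneID"].tolist())  # → ['MYO16']
```

The hand-coded rename and merge keys above are precisely the work that the declarative combine/link statements, together with the matcher and the gordian key identifier, hide from the BioFlow user.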

3.1.1 Taverna Implementation

It is fair to say that, unlike the maxd database example, application design using Taverna for this example would be truly involved. This is particularly so because microRNA.org and microrna.sanger.ac.uk do not offer web services, let alone all of the services Taverna needs to operate. An application designer would therefore have to write all of the communication and data manipulation code needed to implement the workflow, and develop all schema mediation mappings manually. In BioFlow, by contrast, no code writing is required whatsoever; in fact, the script presented in section 2.1 was developed using VizBuilder in minutes.
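To make "all communication and data manipulation code" concrete, the following is a minimal sketch of the kind of glue code a designer would have to hand-write for just one of the two sources. The form parameters and the regex-based table scraper are illustrative assumptions, not the actual microRNA.org interface; a real wrapper must be reverse-engineered per site and silently breaks whenever the page layout changes:

```python
import re
import urllib.request

def parse_target_rows(html):
    """Pull (gene_name, target_sites) pairs out of an HTML results table.

    The two-column <tr>/<td> layout assumed here is a stand-in for
    whatever the real site emits.
    """
    out = []
    for row in re.findall(r"<tr>(.*?)</tr>", html, re.S):
        cells = re.findall(r"<td>(.*?)</td>", row, re.S)
        if len(cells) >= 2:
            out.append((cells[0].strip(), cells[1].strip()))
    return out

def get_mirna_targets(mature_name, organism="9606"):
    """Fetch and scrape the target list for one miRNA (requires network)."""
    url = ("http://www.microrna.org/microrna/getTargets.do?"
           f"matureName={mature_name}&organism={organism}")
    with urllib.request.urlopen(url) as resp:
        return parse_target_rows(resp.read().decode("utf-8", "replace"))

# Offline check of the scraping logic against a canned response:
sample = "<table><tr><td>FLJ36874</td><td>10</td></tr></table>"
print(parse_target_rows(sample))  # → [('FLJ36874', '10')]
```

Multiply this by every source, then add the manual schema mediation between the two result schemas and the union/join logic, and the cost difference with a single define function ... extract statement becomes apparent.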

4 Summary and Future Research

Our goal in this paper was to discuss where BioFlow stands in relation to leading contemporary scientific workflow management systems. We singled out Taverna as the system to compare against because we believe it has been more widely and successfully used in the life sciences than other systems such as Kepler, Pegasus and Triana. As discussed in the preceding sections, BioFlow offers better support and higher abstractions for scientific workflow design and data analysis, because it provides a complete suite of data management, workflow and data integration abstractions for applications involving distributed resources, all in a single language. In our current version, however, we do not support mixing relational and XML data in a single query, although both data types can be used in purely relational or purely XML queries using the same BioFlow syntax. This is because our underlying data management system, MonetDB, does not support mixed-mode queries involving SQL and XML data. We plan to remove this restriction in future versions of BioFlow by developing query translation techniques that allow such intermixing; for now, we wanted to avoid translating entire databases into one form or the other just to support a single back-end query language such as SQL or XQuery. JavaScript continues to pose a significant obstacle to our automated wrapper generation modules. Since we operate in a world where we do not seek site cooperation, web page scraping for table identification and data retrieval is the only choice we have in BioFlow; in many applications, JavaScript causes BioFlow to fail to recognize table boundaries in multi-page responses. We continue to look for a solution to this limitation. Finally, workflows requiring large volumes of data movement across sites raise new challenges, especially when sites or servers close connections if the processor or BioFlow takes too long to complete a task. We plan to handle such exceptions as well. These are some of the research issues we seek to address in future work.

References

[1] JUNG - Java Universal Network/Graph Framework. http://jung.sourceforge.net.
[2] MonetDB System Home Page. http://monetdb.cwi.nl/.
[3] XAML - Extensible Application Markup Language. http://msdn.microsoft.com/en-us/library/ms752059.aspx.
[4] Gene ontology: tool for the unification of biology. Nature Genet., 25:25–29, 2000.
[5] I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludascher, and S. Mock. Kepler: An extensible system for design and execution of scientific workflows. In SSDBM, page 423, 2004.
[6] A. Bhattacharjee, A. Islam, M. S. Amin, S. Hossain, S. Hosain, and H. Jamil. LifeDB: An autonomous semantic data integration system for life sciences. In International Workshop on Semantic Web Applications and Tools for Life Sciences, UK, 2008.
[7] E. I. Boyle, S. Weng, J. Gollub, H. Jin, D. Botstein, J. M. Cherry, and G. Sherlock. GO::TermFinder: open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics, 20(18):3710–3715, December 2004.
[8] J. Castrillo et al. Growth control of the eukaryote cell: a systems biology study in yeast. Journal of Biology, 6:4, April 2007.
[9] L. Chen and H. M. Jamil. On using remote user defined functions as wrappers for biological database interoperability. Int. J. Cooperative Inf. Syst., 12(2):161–195, 2003.
[10] L. S. Colby. A recursive algebra for nested relations. Inf. Syst., 15(5):567–582, 1990.
[11] E. Deelman et al. Pegasus: A framework for mapping complex scientific workflows onto distributed systems. Scientific Programming, 13(3):219–237, 2005.
[12] S. Hossain and H. Jamil. A visual interface for on-the-fly biological database integration and workflow design using VizBuilder. In 6th International Workshop on Data Integration in the Life Sciences, 2009. Under review.
[13] D. Hull, K. Wolstencroft, R. Stevens, C. Goble, M. R. Pocock, P. Li, and T. Oinn. Taverna: a tool for building and running workflows of services. Nucleic Acids Res, 34(Web Server issue), July 2006.
[14] H. Jamil and B. El-Hajj-Diab. BioFlow: A web-based declarative workflow language for Life Sciences. In 2nd IEEE Workshop on Scientific Workflows, pages 453–460. IEEE Computer Society, 2008.
[15] P. Li et al. Performing statistical analyses on quantitative data in Taverna workflows: an example using R and maxdBrowse to identify differentially-expressed genes from microarray data. BMC Bioinformatics, 9(1), 2008.
[16] S. Majithia, M. Shields, I. Taylor, and I. Wang. Triana: A graphical web service composition and execution toolkit. In IEEE ICWS, page 514, 2004.
[17] A. Yang, Z. Zhu, P. Kapranov, F. McKeon, G. M. Church, T. R. Gingeras, and K. Struhl. Relationships between p63 binding, DNA sequence, transcription activity, and biological function in human cells. Molecular Cell, 24(4):593–602, 2006.
