Chapter 25

Data Management for Combinatorial Heterogeneous Catalysis: Methodology and Development of Advanced Tools

D. Farrusseng, L. Baumes, and C. Mirodatos

D. Farrusseng, L. Baumes, and C. Mirodatos • Institut de Recherches sur la Catalyse-CNRS, 2 avenue Albert Einstein, F-69626 Villeurbanne, France
L. Baumes • Equipe de Recherche en Ingénierie des Connaissances, Université Lumière Lyon 2, Bâtiment L, 5 avenue P. Mendès-France, F-69676 Bron, France
Contact author: David Farrusseng, email: [email protected]
High-Throughput Analysis, edited by Potyrailo and Amis. Kluwer Academic Publishers, New York, 2003.

INTRODUCTION

As catalytic chemical processes become increasingly mature, innovation in this field is less and less likely. The industry faces a slow rate of discovery because the empirical development of new active solids by trial and error from a few selected samples remains highly speculative. Besides its low rate of investigation, this research strategy, based on exhaustive studies and complete understanding, is very time consuming. After decades of very intensive efforts, new and major breakthroughs based on experience and common knowledge cannot be expected in most research areas dealing with bulk chemistry. Therefore new research strategies must be developed to produce breakthroughs and a paradigm shift in chemical research to revitalize innovation.

Drug development underwent a drastic and successful change in the 1990s by means of the fast synthesis and screening of large libraries of diverse formulations using fast, fully automated workstations and analytical equipment. This combinatorial approach, which rapidly spread to other research domains such as materials science and catalysis, relies on the systematic screening of a population surface which combines all the parameters of the study [1-4]. Thus the investigation strategy shifts from essentially qualitative studies to highly quantitative studies, with the data workflow increased by several orders of magnitude. Indeed, the unprecedented rate of exploration involved in this screening approach directly increases the probability of success. In addition, instead of generating a single material with improved properties, the combinatorial strategy produces a series of leads with similar overall catalytic properties. As a logical result, the diversity of these leads increases the probability that at least one of the candidates could be marketed successfully.

Rapid automated screening of large libraries of solids and materials is now possible because of the fast-growing technologies for automation and miniaturization. Fully automated robots specially designed for fast synthesis and testing of solids are now commercially available. In contrast with these technological breakthroughs, information tools capable of managing the high-throughput (HT) workflow of data for heterogeneous catalysis are not commercially available [5]. The extremely complex nature of solids and heterogeneous catalysis makes data analysis a challenge [6-8]. In addition, data management is a large and complex process including instrument and software integration, databasing, statistical studies, and data-mining functions. New information tools have to be designed and developed for every single application, which represents a new and highly valuable market for the computer science industry linked to science.

This chapter aims at pointing out the major and often underestimated issues involved in data management for heterogeneous catalysis. All management steps ranging from raw data acquisition to data feedback will be addressed, including data storage, identification, pretreatment, and coding, together with the algorithms related to catalyst discovery and optimization. A methodology for facilitating data management for combinatorial heterogeneous catalysis is also provided. In addition, a database of general interest for heterogeneous catalysis and a powerful algorithm for the iterative discovery and optimization of catalysts are presented.

DATABASING

Motivation

A simple file consisting of arrays of rows and columns can be used as a database (DB) for a single target application in such a way that catalyst performance (conversion, selectivity, yield, etc.) can be linked to the main features of the catalysts (synthesis recipe and characteristics) and the test conditions. However, this two-dimensional file structure cannot accommodate various categories of solids and different catalytic reactions. Thus a completely new file has to be created when the context changes even slightly, making cross-analysis of data difficult. Therefore it is necessary to design robust relational DBs that are sufficiently flexible to accommodate changes.

A database management system (DBMS) provides users with access to their data and helps them to transform the data into information. Such DBMSs include dBase, Paradox, IMS, and Oracle. These systems allow users to create, update, and extract information from their databases. Compared with a manual filing system, the main advantages of a computerized database system are speed, accuracy, reliability, accessibility, and integrity.

In combinatorial heterogeneous catalysis, the information is not concentrated in a single place but spread over various locations and functions, more or less automated, such as synthesis robots for preparation, parallel reactors for fast testing, characterization tools, and analytical equipment. Thus the main role of the database is to store in a single location all the information available on the catalytic systems generated, including both their performances and their history. As a case study, the Symyx Company claims to have stored more than 10 Tbytes of data regarding catalyst development for various processes over the last decade. These data, accumulated in large DBs, are believed to accelerate discoveries in new projects by means of data mining.

What is a Relational Database?

A database is a structured collection of data referring to any type of system such as items (catalysts) and/or events (catalytic performance). Relational DBMSs store data in tables and enable users to define relationships between the tables. Each table, easily visualized as a tabular arrangement of data, consists of fields and rows of records. The field names (column names) are fixed during the construction of the DB, although the field values may differ. A field has little meaning unless it is seen within the context of other fields. For example, the selectivity code field is informative only if associated with other related data (e.g., reaction name). The fields related to a particular system are bundled together to form a single and complete set of data. The link between tables is based on one or more field values common to the related tables. In each column of a table, a specific category of information has to be related to the table subject.

Every table has a field or a combination of fields that uniquely identifies each row in the table. This unique identifier is called the primary key (Pk). The primary key distinguishes each set of records from all the others in a given table. It allows the user and the DB system to identify, locate, and refer to one particular row in the table.

Example

As an example, a small part of the StoCat™ DB scheme we have developed for data management of HT catalyst screening is presented in Figure 25.1. This part is related to the identification of catalyst precursors; it includes the name, formula, CAS number, type of precursor, purity grade, provider, and commercial reference number. Precursors with a given chemical formula, but with different purity grades or purchased from different companies, have the same CAS number. Therefore the CAS number cannot be used as a unique identification (ID) number for a given precursor or, generally speaking, for any chemical. The relationship between precursor formula, CAS number, commercial chemicals, provider, and purity grade can be seen in Figure 25.1 as linked tables where all the information is split according to a rational procedure. Each chemical is assigned an ID number (Id_commercial_product), which is linked to a unique precursor formula and name via the identification number of the precursor (Id_precursor).

Table 25.1, which represents the conventional way of storing data, encompasses all the relevant information one needs about precursors. When the DB scheme is applied to this table, the data are split into several smaller tables (Table 25.2), which are linked by the key Id values. This example clearly shows how relational DBs can optimize data storage: a given n-tuple is written just once, and each time these values are used only the corresponding primary key value has to be given. Thus, barium acetate is written three times in the two-dimensional array but just once in the DB. In addition, a well-designed DB strongly minimizes empty records.
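The splitting of a flat table into linked tables can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the StoCat™ implementation: the rows reproduce values from Table 25.1, the function and variable names are hypothetical, ID values are assigned sequentially (so bismuth oxide receives 2 rather than 30), and the link between precursor and commercial product is stored directly in the product record instead of in a separate link table.

```python
# Flat, spreadsheet-style rows (values taken from Table 25.1):
# (name, formula, CAS number, provider, provider reference, purity)
flat_rows = [
    ("Barium acetate", "H6C4O4Ba", "543-80-6", "Aldrich", "24,367-1", "99"),
    ("Barium acetate", "H6C4O4Ba", "543-80-6", "Aldrich", "25,591-2", "99,999"),
    ("Barium acetate", "H6C4O4Ba", "543-80-6", "ABCR", "11724", "99"),
    ("Bismuth (III) oxide", "Bi2O3", "1304-76-3", "ABCR", "15103", "99"),
    ("Bismuth (III) oxide", "Bi2O3", "1304-76-3", "Aldrich", "33,595-9", "99,99"),
]


def normalize(rows):
    """Split flat rows into precursor, provider, and commercial-product tables."""
    precursors = {}  # (name, formula, cas) -> Id_Precursor
    providers = {}   # provider name -> Id_Provider
    products = []    # one record per commercial product
    for name, formula, cas, provider, ref, purity in rows:
        # setdefault stores a new sequential ID only the first time a key is seen
        id_prec = precursors.setdefault((name, formula, cas), len(precursors) + 1)
        id_prov = providers.setdefault(provider, len(providers) + 1)
        products.append({
            "Id_Commercial_Product": len(products) + 1,
            "Id_Precursor": id_prec,
            "Id_Provider": id_prov,
            "Ref_Provider": ref,
            "Purity": purity,
        })
    return precursors, providers, products


precursors, providers, products = normalize(flat_rows)
```

After the call, the precursor table holds two entries and the provider table two, while the five commercial products refer to them only through their ID values.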

[Figure 25.1: relational scheme linking four tables.
Precursor_tbl: Id_Precursor, Formula_Precursor, Name_Precursor, Family_Precursor, CAS_Number
Precursor-Commercial_Product_tbl: Id_Precursor, Id_Commercial_Product
Commercial_Product_tbl: Id_Commercial_Product, Ref_Provider, Purity, Id_Provider
Provider_tbl: Id_Provider, Name_Provider, Website]

FIGURE 25.1. Example of a relational scheme. All the fields correspond to the name of the column in a common spreadsheet (Excel type). The primary key (underlined) is the unique identifier for the set of all the other fields.

How to Retrieve Data from the Database (Query)

SQL stands for Structured Query Language. According to ANSI (American National Standards Institute), it is the standard language for communicating with relational DBMSs. SQL statements are used to perform tasks such as updating or retrieving data from a DB. When one queries the DB, the rows corresponding to matching values can be retrieved from different tables. Depending on the DB structure, these queries, which join tables, may be rather complex, and formulating them generally requires an advanced knowledge of the whole DB scheme (fields, primary keys, foreign keys). An interface for users' queries has therefore been specially designed for the present StoCat™ DB, which consists of 120 tables and 450 fields.
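A query that joins several tables can be illustrated with the precursor scheme of Figure 25.1. The sketch below uses Python's built-in sqlite3 module rather than Oracle, purely so that it is self-contained; the table and field names follow the figure, except that the link table is written Precursor_Commercial_Product_tbl (an adaptation, since a hyphen is not valid in an unquoted SQL identifier), and the data are those of Table 25.2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE Provider_tbl (
    Id_Provider INTEGER PRIMARY KEY, Name_Provider TEXT, Website TEXT);
CREATE TABLE Precursor_tbl (
    Id_Precursor INTEGER PRIMARY KEY, Formula_Precursor TEXT,
    Name_Precursor TEXT, Family_Precursor TEXT, CAS_Number TEXT);
CREATE TABLE Commercial_Product_tbl (
    Id_Commercial_Product INTEGER PRIMARY KEY,
    Id_Provider INTEGER REFERENCES Provider_tbl(Id_Provider),
    Ref_Provider TEXT, Purity TEXT);
CREATE TABLE Precursor_Commercial_Product_tbl (
    Id_Precursor INTEGER REFERENCES Precursor_tbl(Id_Precursor),
    Id_Commercial_Product INTEGER
        REFERENCES Commercial_Product_tbl(Id_Commercial_Product));
""")
conn.executemany("INSERT INTO Provider_tbl VALUES (?, ?, ?)", [
    (1, "Aldrich", "www.sigma-aldrich.com"), (2, "ABCR", "www.abcr.de")])
conn.executemany("INSERT INTO Precursor_tbl VALUES (?, ?, ?, ?, ?)", [
    (1, "H6C4O4Ba", "Barium acetate", "Acetate", "543-80-6"),
    (30, "Bi2O3", "Bismuth (III) oxide", "Oxide", "1304-76-3")])
conn.executemany("INSERT INTO Commercial_Product_tbl VALUES (?, ?, ?, ?)", [
    (1, 1, "24,367-1", "99"), (2, 1, "25,591-2", "99,999"),
    (3, 2, "11724", "99"), (4, 2, "15103", "99"), (5, 1, "33,595-9", "99,99")])
conn.executemany("INSERT INTO Precursor_Commercial_Product_tbl VALUES (?, ?)",
                 [(1, 1), (1, 2), (1, 3), (30, 4), (30, 5)])

# Retrieve every commercial source of one precursor, identified by CAS number.
# Three joins are needed because the information is spread over four tables.
query = """
SELECT p.Name_Precursor, v.Name_Provider, c.Ref_Provider, c.Purity
FROM Precursor_tbl AS p
JOIN Precursor_Commercial_Product_tbl AS l ON l.Id_Precursor = p.Id_Precursor
JOIN Commercial_Product_tbl AS c
     ON c.Id_Commercial_Product = l.Id_Commercial_Product
JOIN Provider_tbl AS v ON v.Id_Provider = c.Id_Provider
WHERE p.CAS_Number = ?
ORDER BY c.Id_Commercial_Product
"""
rows = conn.execute(query, ("543-80-6",)).fetchall()
```

Even for this small fragment the user has to know which keys link which tables, which suggests why a dedicated query interface is useful for a scheme of 120 tables.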

StoCat™ Database

The challenge was to create a DB that was robust and flexible enough to accommodate most of the reactions using solids as catalysts for both biphase and triphase systems. The StoCat™ relational DB has been designed and developed using Oracle and, to our knowledge, is the first extensive DB for heterogeneous catalysis. The adaptability of the StoCat™ DB relies on the choice of relevant descriptors (field names) and on a complex structure linking these descriptors. As examples of this flexibility, one can create and store in tables new chemicals, precursors, or products by combining elements from the periodic table. It is also possible to store systems as complex as fluid catalytic cracking (FCC) or three-way automotive catalysts. This means that investigations are not restricted to a set of reactions or catalysts, but can be matched to most studies in academic and industrial laboratories.

TABLE 25.1. Data Stored in a Spreadsheet Format

Name_Prec            Formula_Prec  Id_Prec  Family_Prec  Id_Com_Product  Id_Provider  Name_Provider  Web_site_Provider  Ref_Provider  Purity  CAS
Barium acetate       H6C4O4Ba      1        Acetate      1               1            Aldrich        sigma-aldrich.com  24,367-1      99      543-80-6
Barium acetate       H6C4O4Ba      1        Acetate      2               1            Aldrich        sigma-aldrich.com  25,591-2      99,999  543-80-6
Barium acetate       H6C4O4Ba      1        Acetate      3               2            ABCR           abcr.de            11724         99      543-80-6
Bismuth (III) oxide  Bi2O3         30       Oxide        4               2            ABCR           abcr.de            15103         99      1304-76-3
Bismuth (III) oxide  Bi2O3         30       Oxide        5               1            Aldrich        sigma-aldrich.com  33,595-9      99,99   1304-76-3


TABLE 25.2. Data Stored in the Relational DB Shown in Figure 25.1

Precursor_tbl
Id_Precursor  Formula_Precursor  Name_Precursor       Family_Precursor  CAS_Number
1             H6C4O4Ba           Barium acetate       Acetate           543-80-6
...           ...                ...                  ...               ...
30            Bi2O3              Bismuth (III) oxide  Oxide             1304-76-3

Precursor-Commercial_Product_tbl
Id_Precursor  Id_Commercial_Product
1             1
1             2
1             3
30            4
30            5

Commercial_Product_tbl
Id_Commercial_Product  Id_Provider  Ref_Provider  Purity
1                      1            24,367-1      99%
2                      1            25,591-2      99,999
3                      2            11724         99
4                      2            15103         99
5                      1            33,595-9      99,99

Provider_tbl
Id_Provider  Name_Provider  Website
1            Aldrich        www.sigma-aldrich.com
2            ABCR           www.abcr.de

Basically, StoCat™ permits the storage and control of any information linked to (1) catalyst synthesis (precursor name, origin, volume, concentration, synthesis method with parameters, heat treatments, etc.) and (2) operating conditions for testing (carrier gas, temperature, total and partial pressure, gas flow rates, time on stream, etc.). Additional information, such as the name of the chemist or precursor batch numbers, can also be stored. Data on relevant characteristics, such as crystallinity, surface area, porosity of porous materials, etc., can be stored as well. The performance (e.g., rate of conversion, selectivity, yield, etc.) of a single catalyst under given operating conditions is automatically calculated and stored in appropriate tables. All data dealing with catalyst synthesis and testing can be retrieved automatically from synthesis robots (or optimization algorithms) and from analytical equipment currently used in laboratories, respectively, and stored. This is achieved via interfaces developed in Delphi. Their design reflects the flexibility of the DB and has to be convenient for use by chemists. For example, vial location in a synthesis robot is displayed through the interface in a way which corresponds to the actual physical location on the xy surface, thus ensuring data integration by allowing export of files to robots.
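The chapter does not state the formulas used for the automatic calculation of performance. As a hedged illustration only, the sketch below assumes the common textbook definitions for a single reactant and a single product of interest: conversion X = (n_in - n_out)/n_in, selectivity S = nu * n_product/(n_in - n_out), and yield Y = X * S. The function names, the stoichiometric factor nu, and the numerical flows are hypothetical.

```python
def conversion(n_in, n_out):
    """Fraction of the reactant consumed: X = (n_in - n_out) / n_in."""
    if n_in <= 0:
        raise ValueError("inlet amount must be positive")
    return (n_in - n_out) / n_in


def selectivity(n_product, n_in, n_out, nu=1.0):
    """Fraction of the converted reactant found in the product of interest.

    nu is the number of moles of reactant needed per mole of product
    (assumed to be 1 unless stated otherwise).
    """
    converted = n_in - n_out
    if converted <= 0:
        raise ValueError("no reactant converted")
    return nu * n_product / converted


def product_yield(n_product, n_in, n_out, nu=1.0):
    """Yield Y = X * S, which simplifies to nu * n_product / n_in."""
    return conversion(n_in, n_out) * selectivity(n_product, n_in, n_out, nu)


# Hypothetical molar flows (mol/h): 100 ethane fed, 80 unreacted, 15 ethylene formed.
X = conversion(100.0, 80.0)
S = selectivity(15.0, 100.0, 80.0)
Y = product_yield(15.0, 100.0, 80.0)
```

Storing the raw inlet and outlet amounts alongside the derived values keeps the calculation reproducible if a different definition is later agreed upon.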

A Key Tool for Advanced Combinatorial Methodology

On the laboratory scale, a single DB can connect several robots and on-line users, which permits a real networking strategy for data management. Thus StoCat™ is used as a single database by a consortium of 10 European organizations, including academia and industry, working in the framework of the European Union project "Combicat." This DB allows the consortium partners to collect and manage data produced at several locations in Europe. The aim of this strategy is to acquire a sufficiently large amount of data to be processed by means of statistics or artificial intelligence tools in order to discover key features of catalysis, such as structure-activity relationships. An unexpected feature has emerged from this work: compulsory data standardization, which can be seen as a new attempt to establish agreed rules of normalization in the rather confused area of heterogeneous catalysis.

FROM DATA TO ALGORITHMS

What Kind of Data Are We Dealing With?

In combinatorial catalysis, catalysts are usually poorly characterized or uncharacterized, partly because of the lack of fast and inexpensive in situ techniques. As a consequence, HT catalysts under reacting conditions can be regarded essentially as black boxes. As shown schematically in Figure 25.2, the learning process aims at finding relationships between catalyst features, process variables, and catalytic performances. The input data, which describe the catalyst systems (descriptors), are called endogenous, whereas the output data, which describe the results obtained with these descriptors, are called exogenous. Establishing all the links between endogenous and exogenous data requires the discovery of mathematical functions which model the whole surface. However, in most cases HT studies aim to discover and optimize formulations, and hence only the optima of the surface are of interest.

Prior to the data analysis, the type of data to be manipulated must be determined. In general, data can be categorized into five distinct classes: continuous, quantitative ordered, qualitative ordered, binary, and non-ordered discrete (Figure 25.3). As an example, different types of data are described in Figure 25.4 according to synthesis parameters, reaction conditions, catalyst characteristics, and catalytic performances.

One of the most relevant variables for a solid is the overall composition. In many processes catalysts are mixed oxides, prepared by mixing appropriate precursors in a well-defined way. Let us consider V-Nb-Mo catalysts, recently screened for the oxidative dehydrogenation of ethane (ODHE) [9]. These elements were selected a priori because of their individual redox properties. The combinatorial approach systematically combines the elements by predicting promotion, cooperation, or synergy effects. In the corresponding ternary diagram, every single point represents an elemental composition, though different phases may coexist at the atomic level (Figure 25.5). Phase characterization is not carried out systematically, because it would take too much time and the results would still be questionable. Therefore, in order to discover a possible relationship between catalyst composition and ODHE performance, every single point of the ternary is associated with

[Figure 25.2: input (synthesis parameters, reaction conditions) -> catalyst as black box (?) -> output (catalytic performances, other properties)]

FIGURE 25.2. Catalyst as a black box function f: output = f(input).
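For a ternary system such as V-Nb-Mo, the inputs of this black box function are the points of the ternary diagram. The sketch below enumerates such points on a regular grid. It is an illustration only: the chapter does not specify the grid actually screened, so the 10% step and the function name are assumptions.

```python
def ternary_grid(elements, steps):
    """Enumerate all compositions whose fractions are multiples of 1/steps
    and sum to 1. Returns a list of dicts mapping element -> fraction."""
    a, b, c = elements
    points = []
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            k = steps - i - j  # the third fraction is fixed by the first two
            points.append({a: i / steps, b: j / steps, c: k / steps})
    return points


# Assumed step of 10% in each element fraction
library = ternary_grid(("V", "Nb", "Mo"), steps=10)
```

With n steps the grid contains (n + 1)(n + 2)/2 compositions, so halving the step size roughly quadruples the number of catalysts to prepare and test.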

[Figure 25.3: classification of data types and their codes.
Quantitative data: continuous (e.g., -2.01890, 1056.012), coded as Real; discrete (e.g., -1, 0, 1, 2, ...), coded as Integer.
Qualitative data: ordered (e.g., good, average, bad), coded as Ordered Integer; non-ordered (e.g., NaZSM-5, mordenite, KL, ...), coded as Non-Ordered Integer; binary (e.g., amorphous or crystalline), coded as Binary.]

FIGURE 25.3. Encoding of different types of data related to heterogeneous catalysis.
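The encoding of these data classes can be sketched as follows. The example values come from Figure 25.3, but the function name, the kind labels, and the particular integer assigned to each category are hypothetical choices made for illustration.

```python
# Ordered qualitative scale: the integer code preserves the ranking.
QUALITY_ORDER = ["bad", "average", "good"]
# Non-ordered categories: the integer is only a label; its magnitude has no meaning.
ZEOLITES = ["NaZSM-5", "mordenite", "KL"]


def encode(value, kind):
    """Return a numeric code for one descriptor value.

    kind is one of 'continuous', 'discrete', 'ordered', 'non_ordered', 'binary'.
    """
    if kind == "continuous":
        return float(value)
    if kind == "discrete":
        return int(value)
    if kind == "ordered":
        return QUALITY_ORDER.index(value)
    if kind == "non_ordered":
        return ZEOLITES.index(value)
    if kind == "binary":
        return 1 if value == "crystalline" else 0
    raise ValueError(f"unknown data kind: {kind}")


catalyst = [(12.5, "continuous"), (3, "discrete"), ("good", "ordered"),
            ("mordenite", "non_ordered"), ("amorphous", "binary")]
codes = [encode(value, kind) for value, kind in catalyst]
```

Keeping the data class alongside each code matters for later analysis: distances or averages are meaningful for ordered codes but not for non-ordered ones.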

[Figure 25.4: types of data according to synthesis parameters, reaction conditions, catalyst characteristics, and catalytic performances. Recoverable fragment: Synthesis: continuous (wt% for mixed oxides); qualitative non-...]

... 12, and R is of major importance and completely defines the control of the optimization process. For example, if good catalysts have been found early in the same zone, the algorithm will be forced to search in unexplored zones rather than in the same "hot" zone. An additional advantage of GA-KBS hybridization is that the largest-rectangle analysis is independent of the order of the variables, whereas GAs are not. Indeed, for a GA, the greater the separation between two variables in a chromosome, the higher the probability that a crossover will occur between them. However, it is important not to blindly cut groups of variables (called elementary units) whose values are correlated with high performance. Hybridization allows a dynamic reallocation of the positions of the variables, which increases the probability of efficient crossover.
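The dependence of crossover on variable order can be made concrete for the simplest operator. The sketch below assumes one-point crossover with the cut point drawn uniformly among the positions between adjacent genes; the chapter does not state which crossover operator is used, and the function names are hypothetical. Under this assumption, two genes at positions i and j of a chromosome of length L are separated with probability |j - i|/(L - 1).

```python
import random


def separation_probability(i, j, length):
    """Probability that a uniformly chosen single cut point separates genes
    i and j in a chromosome of `length` genes (cuts lie between neighbours)."""
    if length < 2 or not (0 <= i < length and 0 <= j < length):
        raise ValueError("invalid positions or chromosome length")
    return abs(j - i) / (length - 1)


def one_point_crossover(parent_a, parent_b, rng=random):
    """Swap the tails of two equal-length chromosomes at a random cut point."""
    cut = rng.randint(1, len(parent_a) - 1)  # the cut falls just before index `cut`
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]


p_adjacent = separation_probability(0, 1, 10)  # neighbouring variables
p_distant = separation_probability(0, 9, 10)   # variables at opposite ends
```

Correlated variables placed next to each other are therefore rarely split, which is the motivation for reallocating variable positions so that elementary units stay contiguous.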

CONCLUSIONS

The importance of robotics with respect to scientific creativity is probably overestimated in the HT approach. Most breakthroughs speeding up the discovery of new materials are unlikely to come from faster or more highly parallel techniques, but rather from smart concepts allowing synthesis, screening, and further optimization via data mining. In addition, the combinatorial approach to heterogeneous catalysis will soon have to deal with even wider parameter spaces, taking into account both HT characterization data (development of new tools in progress) and process operations and kinetics. Therefore the development of new and adapted data management tools is urgently required. The hybrid algorithm presented here, which combines data-mining and GA techniques, matches that demand quite well. Indeed, it will accelerate and adjust the convergence rate while generating understanding of both continuous and discrete systems. As a prospective methodology, we propose a search mechanism based on iterative optimization directed by data-mining techniques, in which a DB stores all the information and links together high-throughput screening by the hardware and the generation of understanding by the software (Figure 25.26).

[Figure 25.26: recoverable diagram labels: information input, previous screening, additional information, knowledge output, hard output]

FIGURE 25.26. Scheme of advanced data management systems.

ACKNOWLEDGMENT

The EU "Combicat" program is gratefully acknowledged for supporting part of the work reported here.

REFERENCES

1. Senkan, S. Combinatorial heterogeneous catalysis - a new path in an old field. Angew. Chem. Int. Ed. 2001, 40, 312-329.
2. Jandeleit, B., Schaefer, D. J., Powers, T. S., Turner, H. W., Weinberg, W. H. Combinatorial materials science and catalysis. Angew. Chem. Int. Ed. 1999, 38, 2494-2532.
3. Newsam, J. M., Schüth, F. Combinatorial approaches as a component of high-throughput experimentation (HTE) in catalysis research. Biotechnol. Bioeng. 1999, 61, 203-216.
4. Harold, M. P., Mills, P. L., Nicole, J. F. Stud. Surf. Sci. Catal. 2001, 133, 87-98.
5. Cohan, P. Results and commercialization - progress in the practice of combinatorial materials science. Abstracts, 221st American Chemical Society Meeting, 2001, Paper BTEC-056.
6. Yoneda, Y. Prospect, rather than retrospect, on the impact of computers in catalytic research and development. Catal. Today 1995, 23, 305-310.
7. Harmon, L. A., Vayda, A. J., Schlosser, S. G. Informatics challenges in combinatorial materials discovery. Abstracts, 221st American Chemical Society Meeting, 2001, Paper BTEC-067.
8. Dorsett, D. R., Jr. Capturing the combinatorial workflow. Abstracts, 221st American Chemical Society Meeting, 2001, Paper BTEC-064.
9. Cong, P., Dehestani, A., Doolen, R., Giaquinta, D. M., Guan, S., Markov, V., Poojary, D., Self, K., Turner, H., Weinberg, W. H. Combinatorial discovery of oxidative dehydrogenation catalysts within the Mo-V-Nb-O system. Proc. Natl Acad. Sci. USA 1999, 96, 11077-11080.
10. Geisler, S., Vauthey, I., Zanthoff, H., Muhler, M. Presented at European Workshop on Combinatorial Catalysis (EuroCombiCat), Ischia, Italy, 2002.
11. Wolf, D., Buyevskaya, O. V., Baerns, M. An evolutionary approach in the combinatorial selection and optimization of catalytic materials. Appl. Catal. A 2000, 200, 63-77.
12. Banares-Alcantara, R., Westerberg, A. W., Ko, E. I., Rychener, M. D. DECADE - A hybrid expert system for catalyst selection. I. Expert system consideration. Comput. Chem. Eng. 1987, 11, 265-277.
13. Banares-Alcantara, R., Ko, E. I., Westerberg, A. W., Rychener, M. D. DECADE - A hybrid expert system for catalyst selection. II. Final architecture and results. Comput. Chem. Eng. 1988, 12, 923-938.
14. Kito, S., Hattori, T., Murakami, Y. Expert systems approach to computer-aided design of catalysts. Appl. Catal. 1989, 48, 107-121.
15. Koerting, E., Baerns, M. Use of expert systems in catalyst development. Chemie Ingenieur Technik 1990, 62 (5), 365-372.
16. Sun, Y. H., Li, Y. W. Expert system approach to the preparation of supported catalyst. Chem. Eng. Sci. 1992, 47, 2799-2804.
17. Prevoo, H., Derouane, E. G., Vercauteren, D. P. Development of a prototype expert system for catalysis by zeolites. AIP Conf. Proc. 1995, 330, 775-781.
18. Selvam, T., Iyer, D. N., Deka, R. C., Chatterjee, A., Vetrivel, R. A computational "expert system" approach to design synthesis routes for zeolite catalysts. Stud. Surf. Sci. Catal. 1997, 105A, 133-140.
19. Klanner, C., Baumes, L., Farrusseng, D., Mirodatos, C., Schueth, F. QASAR, in press.
20. Buyevskaya, O. V., Wolf, D., Baerns, M. Ethylene and propene by oxidative dehydrogenation of ethane and propane - Performance of rare-earth oxide-based catalysts and development of redox-type catalytic materials by combinatorial methods. Catal. Today 2000, 62, 91-99.
21. Reetz, M. T., Jaeger, K. E. Enantioselective enzymes for organic synthesis created by directed evolution. Chem. Eur. J. 2000, 6, 407-412.