Lessons learned from the introduction of autonomous monitoring to ...

N95-27393

Lessons Learned from the Introduction of Autonomous Monitoring to the EUVE Science Operations Center

M. Lewis, F. Girouard, F. Kronberg, P. Ringrose, A. Abedini, D. Biroscak, T. Morgan, and R. F. Malina
Center for EUV Astrophysics, 2150 Kittredge St., University of California, Berkeley, CA 94720-5030

Abstract

The University of California at Berkeley's (UCB) Center for Extreme Ultraviolet Astrophysics (CEA), in conjunction with NASA's Ames Research Center (ARC), has implemented an autonomous monitoring system in the Extreme Ultraviolet Explorer (EUVE) science operations center (ESOC). The implementation was driven by a need to reduce operations costs and has allowed the ESOC to move from continuous, three-shift, human-tended monitoring of the science payload to a one-shift operation in which the off shifts are monitored by an autonomous anomaly detection system. This system includes Eworks, an artificial intelligence (AI) payload telemetry monitoring package based on RTworks, and Epage, an automatic paging system to notify ESOC personnel of detected anomalies. In this age of shrinking NASA budgets, the lessons learned on the EUVE project are useful to other NASA missions looking for ways to reduce their operations budgets. The process of knowledge capture, from the payload controllers for implementation in an expert system, is directly applicable to any mission considering a transition to autonomous monitoring in their control center. The collaboration with ARC demonstrates how a project with limited programming resources can expand the breadth of its goals without incurring the high cost of hiring additional, dedicated programmers. This dispersal of expertise across NASA centers allows future missions to easily access experts for collaborative efforts of their own. Even the criterion used to choose an expert system has widespread impacts on the implementation, including the completion time and the final cost. In this paper we discuss, from inception to completion, the areas where our experiences in moving from three shifts to one shift may offer insights for other NASA missions.

Introduction

The Extreme Ultraviolet Explorer (EUVE) launched on a Delta II rocket in June of 1992. The Explorer class spacecraft carries a set of science instruments designed and built at the University of California, Berkeley (UCB). The EUVE mission was designed to conduct the first extreme ultraviolet (EUV) multi-band survey of the entire sky, followed by spectroscopic observations of EUV sources. Mission operations are run from Goddard Space Flight Center (GSFC), while science data analysis and health and safety monitoring of the payload are carried out in the EUVE science operations center (ESOC) at the Center for EUV Astrophysics (CEA), UCB.

Shortly after launch, it became clear that NASA's mission operations budget faced drastic cuts. With EUVE's early scientific success, CEA sought ways to dramatically lower the mission operations budget in the hope that cost reductions would allow EUVE to continue operating past the end of


the nominal mission. We looked at many areas of the project, including the possibility of reducing staffing by introducing autonomous monitoring. Because of our lack of experience in this area, we began the process by looking for a partnership with someone possessing relevant experience. We found that NASA Code X and NASA Ames Research Center (ARC) had the knowledge and the desire to help us.

anomaly was not considered a priority as a human controller could notice it during the dayshift,

and

three

areas require The

monitoring

Selecting

shelf CEA

includes

The

.

science

payload.

Thus,

.

off-the-shelf package, would be

a heater

might

and

that

physical

responses essential

are

performance

NASA

->

monitor

are being

of the

(Explorer ESOC).

the

AI

telemetry

received. links

to to

If one

goes

down,

of the

must recognize the situation and to summon someone to restore

communications. The

hardware

and

in the

failure, summon

the system must someone to resolve

In the

an extensive

products that tested products speed,

ESOC.

software

systems

search became

the

data

software be able

we decided

the safety

-> cannot

This

telemetry

links

the communications

scratch,

to ensuring

from

communications

unless

competing mentation

of the problem

payload.

thermal

the continued instruments.

software

ity,

scope

is done

Platform

the time and resources to create an intelligence (AI) system from

we limited

science

Appropriate conditions

ensure science

Lacking artificial

step,

at risk.

electrical,

systems. changing

a Package

of it. As a first

it does

monitoring:

EUVE

We conducted

required

is exceeded

From the point of view of the EUVE science operations center (ESOC), we concluded that

The Explorer Platform is an inherently robust spacecraft designed to last 10 years on orbit. It has both software and hardware safing conditions that can be entered with ground commanding or autonomously by the spacecraft. To date, we have never entered the hardware safehold mode but have autonomously entered the software safepointing mode twice by human error. Like the spacecraft, the science payload is very robust. The payload protects the science instruments with on-board hardware and software safety measures, such as heaters for the mirrors and control of the detector voltage level in the event a detector is being overexposed (detector doors close in the event of a serious threat). The ability of the payload and spacecraft to protect themselves from immediate threats inspired confidence for the development of a system that would detect threats of a less immediate nature, without requiring full-time human monitoring.

packages. To select an off-the-shelf we needed to examine what

a limit

not put the instruments

lo

to evaluate

until

and

ease

ground event

of a

be able to the problem.

search

for off-the-

would meet our needs. in-house for applicabilof use.

The

cost

of the

products was a factor, as was docuand technical support. As the progressed package stability the critical criterion. With

clearly limited

manpower and a short schedule, we could not risk software deficiencies. A stable package

the of

also helps ensure the accuracy and utility of documentation. This consideration was very

be

on for two days without sending anything out of limits. This situation would indicate a problem but not an immediate threat. This kind of

important, software lacked


as we intended to customize the ourselves. Several good products

adequate

documentation.

These

prod-

ucts

would

software

have

required

company

for

Ultimately,

we selected of

to

hire

implementation

system. This would have expensive for our program.

Corporation

us been

the

monitoring

of the

the

prohibitively

RTworks

tasks

deals

are

CA.

local

ground

RTworks displayed solid performance pled with excellent documentation

couand

basic

question;

to know

The ESOC

the

payload.

telemetry,

via

GSFC,

We

receive

with

the satellite

The

ESOC

on a secure

is

real-time

data,

and postpass

during

tists,

line.

CEA

telemetry SOCtools monitoring

software.

During

the dayshift,

developed

at CEA).

tape

(RTie).

If Eworks

detects

rather

monitoring

of

the

Eworks

human computer-interface activated.

(RThci)

oped and capture.

The

edge

from Implementation

We broke

the implementation

data?

supplied.

information

see Abedini

& Malina

(from

science

payload. was

the

representation

of

and

knowledge

an intermediate that

but

After identifying the team devel-

method

to develop

cre-

and compre-

functionality

of the payload. areas to monitor, the

The

blueprints

on expert

of

engihealth

not approached

the hardware based

and ARC).

set of critical to ensure the

system

from

CEA

would

knowl-

serve

as

a

deliverable product from the domain experts to the knowledge base developers. We used informal flowcharts in a series of documents

is

for each we

Lessons

regularly

detailed

a small needed

of the

tuned

We decided

the

module

to the same

receiving

monitoring team consisted of a of controllers, hardware scien-

knowledge

performance the critical

system to For visual

software,

systems,

proceeded

hensive

the

an anomaly,

requests are made to the Epage page an on-call payload controller.

For more

of an expert

by working

data are also fed into the RTworks data acquisition module (RTdaq) and the inference engine

safety

ation

is monitored by a controller using (an interactive, workstation-based system

down

being

and programmers

and

contacts

production

are

The team chose neering monitors

dumps. Immediately upon arrival the data are autonomously archived and decommutated using

if data

The payload small group

receives X.25

boils

is the software

on the ground (1994).

now staffed for only one 8 hour shift per day, 7 days per week. During the off shifts, the customized version of RTworks called Eworks monitors

paper

monitoring.

So, if Eworks does not receive any telemetry for more than 6 hours the on-call controller

The ESOC was formerly staffed 24 hours per day, 7 days per week by a payload controller student.

this

the payload

systems

will be paged.
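The six-hour telemetry time-out described above can be sketched as a small watchdog that runs outside the data-driven inference engine. This is only an illustrative sketch, not the Eworks implementation: the class name `TelemetryWatchdog`, its methods, and the use of plain seconds for time are all assumptions made for the example.

```python
from dataclasses import dataclass

SIX_HOURS_S = 6 * 60 * 60  # time-out quoted in the text, in seconds


@dataclass
class TelemetryWatchdog:
    """Hypothetical external client that trips an alarm on telemetry silence."""
    timeout_s: float = SIX_HOURS_S
    last_rx_s: float = 0.0
    alarm_raised: bool = False

    def on_telemetry(self, now_s: float) -> None:
        # Any received telemetry resets the silence timer and re-arms the alarm.
        self.last_rx_s = now_s
        self.alarm_raised = False

    def check(self, now_s: float) -> bool:
        # Return True exactly once per silence period, when it exceeds the time-out.
        if not self.alarm_raised and now_s - self.last_rx_s > self.timeout_s:
            self.alarm_raised = True
            return True
        return False
```

A periodic job (for example one started by the system scheduler) would call `check` and, on a True result, submit a page request for the on-call controller.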

aide

important,

with

Although

ware does not need to know the state of every link in the communications path, it only needs

Overview

and an engineering

payload.

Or, more accurately, is the science payload being monitored? We determined that the soft-

the generaltools allows

customizing to our needs. Moreover, the open architecture allows us to easily plug in previously existing code.

System

equally

primarily

View,

technical support. Importantly, ized nature of the RTworks

science

The communications/ground systems group did come to a very important realization. Monitoring of communications links and

by Talarian

Mountain

of the

were

of the

major

automating

subsystems the

for which

monitoring.

This

approach proved very useful as it cleanly separates the issues of implementation and knowledge representation from the actual

into two teams:

one to handle the ground systems and communication issues and the other to deal with


knowledge

itself.

We

representing

the

domain

had

some

difficulty

knowledge

in

in flow-

charts

until

perceived

we

need

a sequential found that

freed

ourselves

to represent

the

the knowledge

in

way. On several we were attempting

knowledge representation ceived, causal flow when

load

from occasions to make

we the

fit into a preconit is more naturally

and correctly represented by an event-driven model ("event-driven" in that nothing occurs until new data are received). The data are often

received

in what

appears

to be an asyn-

health

pressing

and

need exists received the need

The

of our

nature

problems

quality, dropout, or other effects of receiving our data after the level-zero processing performed at GSFC, as well as the basic

their

complexity

cessed

The

of our telemetry

data-driven

presents

a

nature

problem

things we want data. Ultimately, should

be

data

of issues

since

of the system one

real-time

spacecraft

the

very

of current when we schedule

a

of

hours.

If

and

in

fact

the

whole

we

do

not

for 6 hours then since the RTie

RTworks

system

the

is

ing

complicates

interface external

to

implement.

Not

that the chosen product it is also important

handle

reason-

format

greatly

implementation

and

original

the

data

advantages data

Before

our

implementation

is it

stored

with

the

activity

was

the

verification

basic

original,

nature

us in that each

we

every

quality, storage

which full data quality

Other the

intact.

has some the data stream is

information,

few years.

of our telemetry

frame

5%). keeping

of data

does

challenges not contain

of all engineering chanwith various people and

various monitoring and control systems we encountered a widespread assumption that each frame of data contains a sample from all

had operations personnel verify that data were received for every real-time contact, but the essential

can eas-

information

from

including

it can be reverified

automated.

of RTworks,

less than

result

life expectancy beyond becomes corrupted. If the

proved flexto note the

be

by

all data

a complete snapshot nels. In our contacts

to

quality

almost

often

needs

our

stream,

The

what

facility sent to the unfortunate

This

recasting of the problem. While it is often essential to have an existing working system, to ensure the success of automation one must recast

in

form

for in-band quality provides it at the end we must

(stripping

compresses

custom clients

only

form The

processing and storage resources it does not make sense to marginally compress the data

For example,

easy

the them.

Processing

data.

the

Fortunately, the RTworks architecture provides a convenient application programming

important ible, but

mis-

avoided by providing the full data In today's world of relatively cheap

significant

proved

us

upon the ease of is level zero pro-

Packet

uncertain

delivered

for interfacing with and the external

existing

examine reaches

data message,

on

driven by the reception of data, we had to create external clients that trip time-out alarms.

(API) clients

While

gives

position of having a real-time data stream that has been stripped of all quality information. Since the data delivery format (PACOR mes-

ily be stream.

number

carefully telemetry

by

of each

certain

are suc-

format

can have profound effects automation. Our telemetry

often changes at the last minute. Instead we determined that it is sufficient to check whether or not data has been received within receive data from the payload an alarm is raised. However,

areas.

sages) does not allow information but, rather,

or production contact

telemetry

(PACOR) at GSFC before being ESOC. Thus, we are left in the

in itself

of

to detect is a lack we cannot predict

the

data

stream.

since

receiving

of

that data

No

on every contact, to predict the contact

in several

should

because

basis.

sions rarely have the capability to change the nature of their telemetry stream, future miswhich

fashion

on a regular

to verify

cessfully alleviating schedule.

sions

chronous

safety

of pay-


available takes

engineering

128 frames

data

to

channels.

(over

sample

every

It

actually

2 minutes)

of EUVE

engineering

channel,

unknown, and thus the integrity recent value model is maintained.
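One way to picture the most recent value model with unknown slots is the sketch below. It is a simplified illustration under stated assumptions, not RTworks code: the `ChannelSlots` class, the channel names, and the example rule are hypothetical, and `None` stands in for the unknown value.

```python
UNKNOWN = None  # stand-in for the "unknown" slot value


class ChannelSlots:
    """Hypothetical store of the most recent value and timestamp per channel."""

    def __init__(self, channels):
        # At start-up every slot is unknown.
        self.values = {ch: UNKNOWN for ch in channels}
        self.stamps = {ch: UNKNOWN for ch in channels}

    def update(self, channel, value, timestamp):
        # A received sample becomes the most recent value for its channel.
        self.values[channel] = value
        self.stamps[channel] = timestamp

    def mark_gap(self, channels):
        # On a gap in the input stream, the expected-but-missed channels
        # revert to unknown so stale values are never reasoned with.
        for ch in channels:
            self.values[ch] = UNKNOWN
            self.stamps[ch] = UNKNOWN

    def ready(self, *channels):
        # A rule may fire only when every slot it reads is known.
        return all(self.values[ch] is not UNKNOWN for ch in channels)


def heater_rule(slots, max_temp=30.0):
    """Example rule: returns None (does not fire) while inputs are unknown."""
    if not slots.ready("heater_on", "mirror_temp"):
        return None
    return slots.values["heater_on"] and slots.values["mirror_temp"] > max_temp
```

Because channels arrive in different frames, a rule like `heater_rule` simply stays silent until each of its inputs has been sampled at least once since start-up or since the last gap.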

although many are updated every one or two frames. We found that this issue, and the sim-

Our

reuse

tant

role

ple

dropouts

implement

not

toring.

fact

(from

that

the

data

may

contain

transmission-problems),

dled gracefully

was

by the RTworks

han-

product.

A basic,

underlying

reason

every

assumption

between

the

engineering

is inadequate

channel. values

frame can only be with the most recent

reasoned expected

not

necessarily

received. uses

the

Our

the decommutation neering segment

uses

segment

from

channel

in conjunction value, which is recent

values.

display The

individual

APIs It

has

enabled

paid

off.

to

mentioned we already

moni-

The

fact

that

is available

extremely rapidly

benefi-

develop

the

customized RTdaq. had extensive limit-

limit

checking,

but

rather

we

package

results

of the external

limit

to decouple

cedure

has the added

benefit

of the engi-

easily

memory

timestamps

to

checking code, we did not attempt to create rules in the inference engine (RTie) to do

value

shared

able

and code

software

proven us

an impor-

were

abstractions

operations

previously Also, since

the current

SOCtools

memory

cial.

played we

most

of autonomous

data

really

of our

through

from

assumption

from

most

interactive

a shared

one

sample

This

because

system

modularization

is that

last

our

code

quickly

Appropriate

much can

of existing in how

of the

dle

on every

take limit

lacks

advantage

checks.

our

information.

in

This

of allowing

of existing

checking

quality

pass

code

real-time This

the prous to

to han-

data

that

feat is accom-

engineering channel (and the timestamps

to deal with this issue also conveniently serve

plished tentative

through the use of what we call limit checking. The first time a value

as a semaphore

multiple,

exceeds

a particular

client

accesses

channel The

at the

quality

able

does

engineering

not maintain

on the

because

the

were

individual

product

timestamps

However, and

asynchronous,

tentatively

level).

RTworks

vidual

and ing

for

most

recent

of the product's of

the

to modify

messages, such

in the

a message,

customized

values

the

slots

Initially

contain

the

internal

reference

is the default

In this way, most

RTie

recent

have

values

expected

system cron.

planned

ing

relies

a

The

combination

current,

on standard

have

Unix

postponed

of telephony. system

page

to

constraints.

We

area

paging

A key intervals,

received. personnel

value

and

will either

would

or


The

system

login

acknowledge support

requires the

a

of our

pagto

the num-

certain has

timebeen

that the on-call computer

page(s).

telephony

like in the

It continues

escalating

to the CEA

simple

efforts

feature

is its persistence.

at regular

very utilities,

all

ber of people being paged after outs until an acknowledgment

unknown

value

we

schedule

receiving

start-up

all slots

a

and telephony system, but we were forced to scale back our efforts because of resource and

slots corresponding to the channels. Rules do not fire

they

(unknown

for all slots).

of the

until

which also is considered

we RTdaq

missed are sent to the in one of these new

it sets the

unknown for the given engineering when

case

it is not

Paging

modules with a new message type. in the input stream, the engineering

channels expected but other RTworks clients

and

as only

values.

flexibility

RTie to handle this issue by supplementthe basic message types between the

RTworks For gaps

of limits,

it is treated

second consecutive update, exceeds the limits, that a value out of limits.
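The tentative limit checking idea can be sketched as a small state machine: a first out-of-limits sample is only tentative, and a second consecutive out-of-limits sample confirms it. This is a minimal sketch based on our reading of the description; the class name, the status strings, and the fixed low/high limits are assumptions for illustration, not the Eworks or SOCtools code.

```python
class TentativeLimitChecker:
    """Hypothetical per-channel checker that needs two consecutive violations."""

    def __init__(self, low, high):
        self.low = low
        self.high = high
        self.tentative = False  # True once one violation has been seen

    def update(self, value):
        # Classify one new sample as "ok", "tentative", or "out_of_limits".
        if self.low <= value <= self.high:
            self.tentative = False  # an in-limits sample clears the flag
            return "ok"
        if self.tentative:
            return "out_of_limits"  # second (or later) consecutive violation
        self.tentative = True
        return "tentative"  # first violation: suspected, not yet confirmed


checker = TentativeLimitChecker(low=0.0, high=10.0)
statuses = [checker.update(v) for v in [5, 12, 5, 12, 13, 14, 3]]
```

In this sketch a single corrupted sample produces only a "tentative" status, which matters when the telemetry stream carries no quality information to flag bad values.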

indi-

documentation,

our

out

limit

system

Ideally system

we that

allowed

the

page

from

any phone

more

button

requests

and

pushes.

services

provided

distance

phone

to

be

There

by carriers,

by one or

are

several

result, have

reviewed

acknowledged

a number

local

and

record become

paged

of

at 2 A.M. will

next order

long-

but to our knowledge

cohesive

that

the delivery

unambiguous.

not

of

common

Aside

from

such wear

is unreliable,

a fact

to

users.

knowledge

the

possible

most

human

We

structures

ference Our

that

can cause

prevents

operations

There

the

center

is also

time.

service

low-cost

provider

solution

persistent pages that continue of acknowledgment is made.
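The persistence and escalation behavior of the paging system can be illustrated with a short simulation. This is a sketch under assumed parameters, not the Epage implementation: the function names, the roster, the paging interval, and the escalation time-out are all hypothetical.

```python
def recipients_due(elapsed_s, roster, escalate_after_s):
    """Return who should be paged after elapsed_s seconds without acknowledgment.

    One more person from the ordered roster is added each time another
    escalate_after_s seconds pass (hypothetical escalation policy).
    """
    level = 1 + int(elapsed_s // escalate_after_s)
    return roster[:min(level, len(roster))]


def page_schedule(roster, interval_s, escalate_after_s, ack_at_s, horizon_s):
    """Simulate paging at regular intervals until an acknowledgment arrives.

    ack_at_s is the time of the acknowledgment, or None if nobody answers;
    horizon_s bounds the simulation. Returns a list of (time, recipients).
    """
    events = []
    t = 0
    while t < horizon_s and (ack_at_s is None or t < ack_at_s):
        events.append((t, recipients_due(t, roster, escalate_after_s)))
        t += interval_s
    return events


roster = ["on_call", "backup", "manager"]
acked = page_schedule(roster, interval_s=300, escalate_after_s=900,
                      ack_at_s=1300, horizon_s=3600)
unanswered = page_schedule(roster, interval_s=300, escalate_after_s=900,
                           ack_at_s=None, horizon_s=2000)
```

Here `acked` stops as soon as the acknowledgment time is reached, while `unanswered` keeps paging and widens the recipient list until the whole roster is being paged.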

is

until

down-

the major ing

ability

clearly page

we

need,

is the

requests.

Our

on the detection person

into

that can

nient

no automated

multiple

alarms

requests,

we

are

settling

we

are

of removing

discovering

humans

and

This

move

flow

of information. but

exchanged

has a

had

great

face

noteworthy controllers

events before

In our current separated

mode by

one-shift

significance

a profound

effect

on the

In the past,

records

could the

of

information

During

be discussed ending and

more

thing

unstaffed

base,

so far, is

and over

half

the human

fact

is

the

com-

particularly

compiles

the rules

at runtime.

The

for

unneces-

a performance

an automated

penalty

batch

process-

removing

set used

history

was

shift

the

to process

of the

route our

of the

rule

engineering

important only

of limits.

go

out

current

base

reacts

of limits

is

But we are

areas

as well. when

It cannot

trends.

to

monitors

for improvement.

system

out

will and

NASA's

by the

raise

Jet alarms

Propulsion based

For

some-

predict

based

This

on

kind

to

a


our

inference

Laboratory

on predictions

tor will go out of limits.

controllers As

broadening

is an ongoing

a past

of pre-

dicting is a normal part of human monitoring. We are currently working with software from

departed.

distance.

of our system

on other goes

monitor

were

shift changes,

of operations, time

The

include

instance,

room.

deal

since

the monitor-

We are considering

process.

working

control

to face.

the

This

The development

requests.

new the

from

In our

The Future

an obvious

in to our

as

the graphi-

during

display rules from the rule our tape dump data.

conve-

(acknowledging page

system

important.

to support

introduce

and

but the

to allow

simultaneous

the

our rule

as RTworks

ing system.

and

out,

developing

paged

is secondary,

(< 500 rules),

engine

the

As such,

systems

simply

rules

when

a single problem system can handle

of page of

sary

a

using

is very

interface.

the inference

with the System

scheme,

are

and then bring

exist

significant,

of

focus

have

large

by

complete,

is on automating

As it turns

puter

but

grouping

is too primitive

multiple,

Living

kept,

take

number

handling

closing)

As

We

for,

systems

of problems

interface

plan

automated

them together into request). The paging

an unlimited user

not

sophisticated

the loop.

diagnostics group (page

did

focus

arrives. In to act as a

Many expert systems operations personnel,

system

payload

not very rules

Another

left

them.

display

of

shifts.

form

at 8 A.M. the

clear,

we would. to assist interface

the

simply

some

records be

than replace

case,

location.

documentation A controller

be asleep

find we are not

cal human

of pages.

such

also

rather

or inter-

reception

is one

paging

The

shielding

the

must

we suspected are designed

as turning the pager off or forgetting to it, and the occurrence of dead batteries,

many

unit,

controller

problems

and issues.

morning when the dayshift for the members of the team

none currently allow a customizable feedback feature (non-email based). We have found of pages

keeping critical

engine

This

that that

does

a moni-

kind of addition

will

significantly

reduce

the remaining

dayshift

human

monitor-

thank

Dr.

Guenter

tions

innovation on the EUVE project. also like to thank all the members

be

detected

early

and

Hughes

avoided

altogether.

staff Our

current

when

an anomaly

be called the

system

monitors is detected,

in to deal

expanding

for

with

capability

anomalies;

a person

must

the

problem.

With

for

on-board

fault

of

who

helped in

selves.

References

an

ideal

situation,

autonomous

taught

to recognize

lies,

then

necessary applies would

certain

it could to

be

deal

types

taught

with

the

action

situation.

primarily to known anomalies, be an important step toward

Critical

is

1994,

This

Conclusion

to have

to find ways

tions

costs.

availability

With of

proven mission tion of labor attainable you can room

and

lower-cost increases a model wise,

our

climate,

we are all going

to reduce the

mission

opera-

development

low-cost

AI

operations intensive

packages

software, activities

a more

operation.

reliable, As

our

efforts

Friedland, their tive

to thank

D. Korsmeyer,

con-

Ames

on

"Low

F.

1994,

grant

Morgan,

Operations

Congress,

Sess.

for

Small

Operations Israel,

R. F.

and 1994

1995,

Operations "Robotic

IAA

Missions,

Approaches

T. & Malina

Soc. Pac.,

Ground

the EUVE Science presented at the 45th Satellite

Cost

on

Technology

Astronautical Small

in Autonomous Astron.

and

Low-Cost

Jerusalem,

Spacecraft,

SPACEOPS Symposium

Innovative

Mission

Analysis," 9-14.

and

Proc.

Operations in press.

at Center,

Satellite

Data

October

Advances

for the

Telescopes,"

EUVE Proc.

in press.

NASA RTworks, Street,

Acknowledgments like

by NASA

NASA

International

Talarian Corporation, Suite 140, Mt. View,

(415) 965-8050.

would

supported

Testbedding Operations

on

centers can only help to increase the expertise available for other missions to call upon.

We

operations

and

and

experience

across

R.

Symp.

and our system matures, we become for other missions to follow. Likecollaborative

operations

Operations,

International

eliminais an

safer,

of

We would of the CEA

science

Approaches

and

goal. At CEA, we are proving that remove humans from the control obtain

championing

one-shift

and

"Third

Malina,

fiscal

Peter

EUVE

Space Mission Data Systems,"

but it greater

autonomy.

In the current

their

Head-

and

Abedini, A. & Malina, R. F. 1995, Designing an Autonomous Environment for Mission

of anoma-

what

Polidan

to make

work has been

detection and reaction, the next logical step is to move the autonomous monitoring software from the control center to the satellites themIn

for

the

tract NAS5-29298 NCC2-838.

monitoring software would have the ability to take corrective action. If the software can be

Ron

GSFC

a reality center. This

Dr.

of NASA

quarters

may

and

Riegler

ing functions, ultimately allowing a move to zero shifts. In some cases, anomalous situa-

M. Montemerlo,

P.

and D. Atkinson

for

support of the development of innovatechnologies for NASA missions. We


444 CA

Castro 94041,