project with limited programming resources can expand the breadth of its goals without incurring the high cost of hiring additional, dedicated programmers. This.
N95- 27393 Lessons
Learned
Monitoring
from the Introduction
to the EUVE
M. Lewis, F. Girouard, F. Kronberg, T. Morgan, and R. F. Malina Center for EUV Astrophysics, Berkeley, CA 94720-5030
Science P. Ringrose,
2150
Kittredge
in conjunction
Research
Center
the Extreme
with
(ARC),
an autonomous
operations
ESOC
to move
costs from
used
(EUVE)
in
(ESOC). The by a need to
and
has
allowed
continuous,
aly
are monitored
the
three-shift,
detection
Eworks,
by an autonomous
system.
an artificial monitoring
RTworks,
and
Epage,
system to notify anomalies.
ESOC
automatic
personnel
to reduce
missions
on
paging
of detected
budgets.
capture,
from
their
collaboration
with
project
limited
can
with
expand
incurring dedicated
the the
In this paper
control ARC
we discuss,
breadth
high
cost
may offer
missions.
insights
for other
NASA
class
spacecraft
instruments
programmers.
This
carries
designed
University
of
The
EUVE
mission
California,
the
first
designed
survey
operations
are
at the (UCB).
to conduct
of
the
entire
sky followed by of EUV sources. run
carried center
out in the EUVE science operations (ESOC) at the Center for EUV Astro-
a
(CEA),
UCB.
Shortly
after
launch,
NASA's
mission
scientific
without
faced
science
it
became
operations
drastic
success,
cuts. CEA
data EUVE's
sought
and
clear
and With
health payload
ways
is
that
analysis early to dra-
matically lower the mission operations budget in the hope that cost reductions would allow
additional, dispersal
physics
budget
resources
of the
while
Goddard
The
The
(GSFC)
from
monitoring
pro-
Center
a set of
built
Berkeley
was
multi-band
and
Flight
how
of its goals of hiring
inception
safety
center.
programming
from
Space
the payload
demonstrates
wide-
our experito one shift
Mission
controllers for implementation in an expert system is directly applicable to any mission considering a transition to autonomous moniin
has
for ways
looking
operations
of knowledge
toring
system
extreme ultraviolet (EUV) spectroscopic observations
NASA their
an expert
future
for collabothe criterion
to completion, the areas where ences in moving from three shifts
science
age of shrinking NASA budgets, the learned on the EUVE project are use-
ful to other cess
an
allows
impacts on the implementation, the completion time and the final
The Explorer
(AI) payload based
centers
The Extreme Ultraviolet Explorer (EUVE) launched on a Delta II rocket in June of 1992.
includes
package
NASA
Introduction
anom-
system
intelligence
telemetry
In this lessons
This
across
to choose
cost.
sci-
human-tended monitoring of the science payload to a one-shift operation in which the off shifts
of California,
spread including
imple-
system
Explorer
ence operations center implementation was driven reduce
63
D. Biroscak,
missions to easily access experts rative efforts of their own. Even
NASA's
has
monitoring
Ultraviolet
Center
of California at Berkeley's for Extreme Ultraviolet Astro-
(CEA),
mented
A. Abedini,
expertise
The University (UCB) Center Ames
Operations
St., University
Abstract
physics
of Autonomous
of
EUVE
to continue
operating
past
the
end
of
the nominal mission. We looked at many areas of the project, including the possibility of reducing staffing by introducing autonomous monitoring. Because of our lack of experience in this area, we began the process by looking for a partnership with someone possessing relevant experience. We found that NASA Code X and NASA Ames Research Center (ARC) had the knowledge and the desire to help us.
anomaly was not considered a priority as a human controller could notice it during the dayshift,
and
three
areas require The
monitoring
Selecting
shelf CEA
includes
The
.
science
payload.
Thus,
.
off-the-shelf package, would be
a heater
might
and
that
physical
responses essential
are
performance
NASA
->
monitor
are being
of the
(Explorer ESOC).
the
AI
telemetry
received. links
to to
If one
goes
down,
of the
must recognize the situation and summon someone to restore
communications. The
hardware
and
in the
failure, the system must summon someone to resolve
In the
an extensive
products that tested products speed,
ESOC.
software
systems
search became
the
data
software be able
we decided
the safety
-> cannot
This
telemetry
links
the communications
scratch,
to ensuring
from
communications
unless
competing mentation
of the problem
payload.
thermal
the continued instruments.
software
ity,
scope
is done
Platform
the time and resources to create an intelligence (AI) system from
we limited
science
Appropriate conditions
ensure science
Lacking artificial
step,
at risk.
electrical,
systems. changing
a Package
of it. As a first
it does
monitoring:
EUVE
We conducted
required
is exceeded
From the point of view of the EUVE science operations center (ESOC), we concluded that
The Explorer Platform is an inherently robust spacecraft designed to last 10 years on orbit. It has both software and hardware safing conditions that can be entered with ground commanding or autonomously by the spacecraft. To date, we have never entered the hardware safehold mode but have autonomously entered the software safepointing mode twice by human error. Like the spacecraft, the science payload is very robust. The payload protects the science instruments with on-board hardware and software safety measures, such as heaters for the mirrors and control of the detector voltage level in the event a detector is being overexposed (detector doors close in the event of a serious threat). The ability of the payload and spacecraft to protect themselves from immediate threats inspired confidence for the development of a system that would detect threats of a less immediate nature, without requiring full-time human monitoring.
packages. To select an off-the-shelf we needed to examine what
a limit
not put the instruments
lo
to evaluate
until
and
ease
ground event
of a
be able to the problem.
search
for off-the-
would meet our needs. in-house for applicabilof use.
The
cost
of the
products was a factor, as was docuand technical support. As the progressed package stability the critical criterion. With
clearly limited
manpower and a short schedule, we could not risk software deficiencies. A stable package
the of
also helps ensure the accuracy and utility of documentation. This consideration was very
be
on for two days without sending anything out of limits. This situation would indicate a problem but not an immediate threat. This kind of
important, software lacked
as we intended to customize the ourselves. Several good products
adequate
documentation.
These
prod-
ucts
would
software
have
required
company
for
Ultimately,
we selected of
to
hire
implementation
system. This would have expensive for our program.
Corporation
us been
the
monitoring
of the
the
prohibitively
RTworks
tasks
deals
are
CA.
local
ground
RTworks displayed solid performance coupled with excellent documentation and
basic
question;
to know
The ESOC
the
payload.
telemetry,
via
GSFC,
We
receive
with
the satellite
The
ESOC
on a secure
is
real-time
data,
and postpass
during
tists,
line.
CEA
telemetry SOCtools monitoring
software.
During
the dayshift,
developed
at CEA).
tape
(RTie).
If Eworks
detects
rather
monitoring
of
the
Eworks
human computer-interface activated.
(RThci)
oped and capture.
The
edge
from Implementation
We broke
the implementation
data?
supplied.
information
see Abedini
& Malina
(from
science
payload. was
the
representation
of
and
knowledge
an intermediate that
but
After identifying the team devel-
method
to develop
cre-
and compre-
functionality
of the payload. areas to monitor, the
The
blueprints
on expert
of
engihealth
not approached
the hardware based
and ARC).
set of critical to ensure the
system
from
CEA
would
knowl-
serve
as
a
deliverable product from the domain experts to the knowledge base developers. We used informal flowcharts in a series of documents
is
for each we
Lessons
regularly
detailed
a small needed
of the
tuned
We decided
the
module
to the same
receiving
monitoring team consisted of a of controllers, hardware scien-
knowledge
performance the critical
system to For visual
software,
systems,
proceeded
hensive
the
an anomaly,
requests are made to the Epage system to page an on-call payload controller.
For more
of an expert
by working
data are also fed into the RTworks data acquisition module (RTdaq) and the inference engine
safety
ation
is monitored by a controller using (an interactive, workstation-based system
down
being
and programmers
and
contacts
production
are
The team chose neering monitors
dumps. Immediately upon arrival the data are autonomously archived and decommutated using
if data
The payload small group
receives X.25
boils
is the software
on the ground (1994).
now staffed for only one 8-hour shift per day, 7 days per week. During the off shifts, the customized version of RTworks called Eworks monitors
paper
monitoring.
So, if Eworks does not receive any telemetry for more than 6 hours, the on-call controller
The ESOC was formerly staffed 24 hours per day, 7 days per week by a payload controller student.
this
the payload
systems
will be paged.
aide
important,
with
Although
ware does not need to know the state of every link in the communications path, it only needs
Overview
and an engineering
payload.
Or, more accurately, is the science payload being monitored? We determined that the soft-
the generaltools allows
customizing to our needs. Moreover, the open architecture allows us to easily plug in previously existing code.
System
equally
primarily
View,
technical support. Importantly, ized nature of the RTworks
science
The communications/ground systems group did come to a very important realization. Monitoring of communications links and
by Talarian
Mountain
of the
were
of the
major
automating
subsystems the
for which
monitoring.
This
approach proved very useful as it cleanly separates the issues of implementation and knowledge representation from the actual
into two teams:
one to handle the ground systems and communication issues and the other to deal with
knowledge
itself.
We
representing
the
domain
had
some
difficulty
knowledge
in
in flow-
charts
until
perceived
we
need
a sequential found that
freed
ourselves
to represent
the
the knowledge
in
way. On several we were attempting
knowledge representation ceived, causal flow when
load
from occasions to make
we the
fit into a preconit is more naturally
and correctly represented by an event-driven model ("event-driven" in that nothing occurs until new data are received). The data are often
received
in what
appears
to be an asyn-
health
pressing
and
need exists received the need
The
of our
nature
problems
quality, dropout, or other effects of receiving our data after the level-zero processing performed at GSFC, as well as the basic
their
complexity
cessed
The
of our telemetry
data-driven
presents
a
nature
problem
things we want data. Ultimately, should
be
data
of issues
since
of the system one
real-time
spacecraft
the
very
of current when we schedule
a
of
hours.
If
and
in
fact
the
whole
we
do
not
for 6 hours then since the RTie
RTworks
system
the
is
ing
complicates
interface external
to
implement.
Not
that the chosen product it is also important
handle
reason-
format
greatly
implementation
and
original
the
data
advantages data
Before
our
implementation
is it
stored
with
the
activity
was
the
verification
basic
original,
nature
us in that each
we
every
quality, storage
which full data quality
Other the
intact.
has some the data stream is
information,
few years.
of our telemetry
frame
5%). keeping
of data
does
challenges not contain
of all engineering chanwith various people and
various monitoring and control systems we encountered a widespread assumption that each frame of data contains a sample from all
had operations personnel verify that data were received for every real-time contact, but the essential
can eas-
information
from
including
it can be reverified
automated.
of RTworks,
less than
result
life expectancy beyond becomes corrupted. If the
proved flexto note the
be
by
all data
a complete snapshot nels. In our contacts
to
quality
almost
often
needs
our
stream,
The
what
facility sent to the unfortunate
This
recasting of the problem. While it is often essential to have an existing working system, to ensure the success of automation one must recast
in
form
for in-band quality provides it at the end we must
(stripping
compresses
custom clients
only
form The
processing and storage resources it does not make sense to marginally compress the data
For example,
easy
the them.
Processing
data.
the
Fortunately, the RTworks architecture provides a convenient application programming
important ible, but
mis-
avoided by providing the full data In today's world of relatively cheap
significant
proved
us
upon the ease of is level zero pro-
Packet
uncertain
delivered
for interfacing with and the external
existing
examine reaches
data message,
on
driven by the reception of data, we had to create external clients that trip time-out alarms.
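Because an engine that reacts only to incoming data cannot notice that data have stopped arriving, the time-out has to live in a separate client that runs on its own timer. The following is a minimal sketch of that idea in Python, not the RTworks API: the class name `TelemetryWatchdog` and its methods are hypothetical, and the only figure taken from the text is the 6-hour threshold.

```python
from dataclasses import dataclass
from typing import Optional

SIX_HOURS = 6 * 60 * 60  # time-out in seconds; the 6-hour figure quoted in the text


@dataclass
class TelemetryWatchdog:
    """Hypothetical external client that trips an alarm when data stop arriving."""

    timeout_s: float = SIX_HOURS
    last_rx: Optional[float] = None  # time of the most recent telemetry, in seconds
    alarmed: bool = False

    def on_telemetry(self, now: float) -> None:
        # Any received data resets the timer and clears a previous alarm.
        self.last_rx = now
        self.alarmed = False

    def poll(self, now: float) -> bool:
        """Called on a timer, independent of the data stream.

        Returns True only at the moment the alarm first trips.
        """
        if self.last_rx is None:
            self.last_rx = now  # start counting from the first poll
            return False
        if not self.alarmed and now - self.last_rx > self.timeout_s:
            self.alarmed = True
            return True
        return False
```

In this sketch the caller supplies the clock value, so the same logic can be driven by a system timer in operation or by synthetic times in a test. No contact schedule is needed: the only question asked is whether any data arrived within the window.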
(API) clients
While
gives
position of having a real-time data stream that has been stripped of all quality information. Since the data delivery format (PACOR mes-
ily be stream.
number
carefully telemetry
by
of each
certain
are suc-
format
can have profound effects automation. Our telemetry
often changes at the last minute. Instead we determined that it is sufficient to check whether or not data has been received within receive data from the payload an alarm is raised. However,
areas.
sages) does not allow information but, rather,
or production contact
telemetry
(PACOR) at GSFC before being ESOC. Thus, we are left in the
in itself
of
to detect is a lack we cannot predict
the
data
stream.
since
receiving
of
that data
No
on every contact, to predict the contact
in several
should
because
basis.
sions rarely have the capability to change the nature of their telemetry stream, future miswhich
fashion
on a regular
to verify
cessfully alleviating schedule.
sions
chronous
safety
of pay-
available takes
engineering
128 frames
data
to
channels.
(over
sample
every
It
actually
2 minutes)
of EUVE
engineering
channel,
unknown, and thus the integrity recent value model is maintained.
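The most recent value model described here can be illustrated with a small sketch. It assumes, as our reading of the text suggests, that each frame carries only some of the engineering channels, that every slot starts as unknown, and that a channel expected in a frame but missing from it is reset to unknown. The names `RecentValueStore`, `on_frame`, and `ready` are hypothetical and are not part of RTworks.

```python
from typing import Dict, Iterable, Optional, Tuple

UNKNOWN = None  # stand-in for the "unknown" slot value

Slot = Tuple[Optional[float], Optional[float]]  # (value, timestamp)


class RecentValueStore:
    """Hypothetical per-channel store of most recent values with timestamps."""

    def __init__(self, channels: Iterable[str]) -> None:
        # On start-up every slot holds the default value: unknown.
        self._slots: Dict[str, Slot] = {ch: (UNKNOWN, None) for ch in channels}

    def on_frame(self, timestamp: float, samples: Dict[str, float],
                 expected: Iterable[str]) -> None:
        # Channels actually sampled in this frame get a new value and timestamp.
        for ch, value in samples.items():
            self._slots[ch] = (value, timestamp)
        # Channels that were expected but missed (a gap) are set to unknown.
        for ch in expected:
            if ch not in samples:
                self._slots[ch] = (UNKNOWN, timestamp)

    def get(self, channel: str) -> Slot:
        return self._slots[channel]

    def ready(self, channels: Iterable[str]) -> bool:
        # A rule should reason only when all the slots it reads are known.
        return all(self._slots[ch][0] is not UNKNOWN for ch in channels)
```

Keeping a timestamp per channel, rather than per frame, is what lets a rule combine values that arrived in different frames without assuming they were sampled together.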
although many are updated every one or two frames. We found that this issue, and the sim-
Our
reuse
tant
role
ple
dropouts
implement
not
toring.
fact
(from
that
the
data
may
contain
transmission-problems),
dled gracefully
was
by the RTworks
han-
product.
A basic,
underlying
reason
every
assumption
between
the
engineering
is inadequate
channel. values
frame can only be with the most recent
reasoned expected
not
necessarily
received. uses
the
Our
the decommutation neering segment
uses
segment
from
channel
in conjunction value, which is recent
values.
display The
individual
APIs It
has
enabled
paid
off.
to
mentioned we already
moni-
The
fact
that
is available
extremely rapidly
benefi-
develop
the
customized RTdaq. had extensive limit-
limit
checking,
but
rather
we
package
results
of the external
limit
to decouple
cedure
has the added
benefit
of the engi-
easily
memory
timestamps
to
checking code, we did not attempt to create rules in the inference engine (RTie) to do
value
shared
able
and code
software
proven us
an impor-
were
abstractions
operations
previously Also, since
the current
SOCtools
memory
cial.
played we
most
of autonomous
data
really
of our
through
from
assumption
from
most
interactive
a shared
one
sample
This
because
system
modularization
is that
last
our
code
quickly
Appropriate
much can
of existing in how
of the
dle
on every
take limit
lacks
advantage
checks.
our
information.
in
This
of allowing
of existing
checking
quality
pass
code
real-time This
the prous to
to han-
data
that
feat is accom-
engineering channel (and the timestamps
to deal with this issue also conveniently serve
plished tentative
through the use of what we call limit checking. The first time a value
as a semaphore
multiple,
exceeds
a particular
client
accesses
channel The
at the
quality
able
does
engineering
not maintain
on the
because
the
were
individual
product
timestamps
However, and
asynchronous,
tentatively
level).
RTworks
vidual
and ing
for
most
recent
of the product's of
the
to modify
messages, such
in the
a message,
customized
values
the
slots
Initially
contain
the
internal
reference
is the default
In this way, most
RTie
recent
have
values
expected
system cron.
planned
ing
relies
a
The
combination
current,
on standard
have
Unix
postponed
of telephony. system
page
to
constraints.
We
area
paging
A key intervals,
received. personnel
value
and
will either
would
or
The
system
login
acknowledge support
requires the
a
of our
pagto
the num-
certain has
timebeen
that the on-call computer
page(s).
telephony
like in the
It continues
escalating
to the CEA
simple
efforts
feature
is its persistence.
at regular
very utilities,
all
ber of people being paged after outs until an acknowledgment
unknown
value
we
schedule
receiving
start-up
all slots
a
and telephony system, but we were forced to scale back our efforts because of resource and
slots corresponding to the channels. Rules do not fire
they
(unknown
for all slots).
of the
until
which also is considered
we RTdaq
missed are sent to the in one of these new
it sets the
unknown for the given engineering when
case
it is not
Paging
modules with a new message type. in the input stream, the engineering
channels expected but other RTworks clients
and
as only
values.
flexibility
RTie to handle this issue by supplementthe basic message types between the
RTworks For gaps
of limits,
it is treated
second consecutive update, exceeds the limits, that a value out of limits.
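Tentative limit checking, as we read the description, can be sketched as a two-step state per channel: the first out-of-limits update is only marked tentative, and a second consecutive out-of-limits update confirms the violation. The Python below is an illustration under that assumption; the class name, the returned status strings, and the example limits are hypothetical.

```python
from typing import Dict, Tuple


class TentativeLimitChecker:
    """Hypothetical checker that confirms a violation only on the second
    consecutive out-of-limits update of a channel."""

    def __init__(self, limits: Dict[str, Tuple[float, float]]) -> None:
        self._limits = limits                 # channel -> (low, high)
        self._tentative: Dict[str, bool] = {}  # channel -> last update was out

    def update(self, channel: str, value: float) -> str:
        low, high = self._limits[channel]
        if low <= value <= high:
            # A good value clears any tentative state.
            self._tentative[channel] = False
            return "ok"
        if self._tentative.get(channel, False):
            # Second (or later) consecutive violation: treat as out of limits.
            return "out_of_limits"
        # First violation: remember it, but do not raise an alarm yet.
        self._tentative[channel] = True
        return "tentative"
```

For example, with limits of 0 to 50 on a channel, the updates 60, 20, 60, 61 would yield tentative, ok, tentative, out_of_limits, so a single bad sample never raises an alarm on its own.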
indi-
documentation,
our
out
limit
system
Ideally system
we that
allowed
the
page
from
any phone
more
button
requests
and
pushes.
services
provided
distance
phone
to
be
There
by carriers,
by one or
are
several
result, have
reviewed
acknowledged
a number
local
and
record become
paged
of
at 2 A.M. will
next order
long-
but to our knowledge
cohesive
that
the delivery
unambiguous.
not
of
common
Aside
from
such wear
is unreliable,
a fact
to
users.
knowledge
the
possible
most
human
We
structures
ference Our
that
can cause
prevents
operations
There
the
center
is also
time.
service
low-cost
provider
solution
persistent pages that continue of acknowledgment is made.
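The persistence and escalation behavior described for the paging system can be sketched as a simple schedule: pages repeat at a regular interval, the list of people paged grows after each time-out, and paging stops once an acknowledgment arrives. This is an illustrative Python sketch only. The function names, the roster, and the interval values are hypothetical and do not describe the Epage implementation.

```python
from typing import Iterator, List, Optional, Sequence, Tuple


def recipients_at(elapsed_s: float, roster: Sequence[str],
                  escalate_s: float) -> List[str]:
    """People to page after elapsed_s seconds without acknowledgment."""
    level = int(elapsed_s // escalate_s) + 1  # one more person per time-out
    return list(roster[:min(level, len(roster))])


def page_schedule(roster: Sequence[str], repeat_s: float, escalate_s: float,
                  ack_at_s: Optional[float],
                  horizon_s: float) -> Iterator[Tuple[float, List[str]]]:
    """Yield (time, recipients) for each page until acknowledged or horizon."""
    t = 0.0
    while t <= horizon_s:
        if ack_at_s is not None and t >= ack_at_s:
            return  # acknowledgment received: stop paging
        yield t, recipients_at(t, roster, escalate_s)
        t += repeat_s
```

With a three-person roster, a page every 600 s, escalation every 1200 s, and an acknowledgment at 2500 s, the sketch sends five pages: two to the on-call controller alone, two that add the second person, and one that reaches all three.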
is
until
down-
the major ing
ability
clearly page
we
need,
is the
requests.
Our
on the detection person
into
that can
nient
no automated
multiple
alarms
requests,
we
are
settling
we
are
of removing
discovering
humans
and
This
move
flow
of information. but
exchanged
has a
had
great
face
noteworthy controllers
events before
In our current separated
mode by
one-shift
significance
a profound
effect
on the
In the past,
records
could the
of
information
During
be discussed ending and
more
thing
unstaffed
base,
so far, is
and over
half
the human
fact
is
the
com-
particularly
compiles
the rules
at runtime.
The
for
unneces-
a performance
an automated
penalty
batch
process-
removing
set used
history
was
shift
the
to process
of the
route our
of the
rule
engineering
important only
of limits.
go
out
current
base
reacts
of limits
is
But we are
areas
as well. when
It cannot
trends.
to
monitors
for improvement.
system
out
will and
NASA's
by the
raise
Jet alarms
Propulsion based
For
some-
predict
based
This
on
kind
to
a
our
inference
Laboratory
on predictions
tor will go out of limits.
controllers As
broadening
is an ongoing
a past
of pre-
dicting is a normal part of human monitoring. We are currently working with software from
departed.
distance.
of our system
on other goes
monitor
were
shift changes,
of operations, time
The
include
instance,
room.
deal
since
the monitor-
We are considering
process.
working
control
to face.
the
This
The development
requests.
new the
from
In our
The Future
an obvious
in to our
as
the graphi-
during
display rules from the rule our tape dump data.
conve-
(acknowledging page
system
important.
to support
introduce
and
but the
to allow
simultaneous
the
our rule
as RTworks
ing system.
and
out,
developing
paged
is secondary,
(< 500 rules),
engine
the
As such,
systems
simply
rules
when
a single problem system can handle
of page of
sary
a
using
is very
interface.
the inference
with the System
scheme,
are
and then bring
exist
significant,
of
focus
have
large
by
complete,
is on automating
As it turns
puter
but
grouping
is too primitive
multiple,
Living
kept,
take
number
handling
closing)
As
We
for,
systems
of problems
interface
plan
automated
them together into request). The paging
an unlimited user
not
sophisticated
the loop.
diagnostics group (page
did
focus
arrives. In to act as a
Many expert systems operations personnel,
system
payload
not very rules
Another
left
them.
display
of
shifts.
form
at 8 A.M. the
clear,
we would. to assist interface
the
simply
some
records be
than replace
case,
location.
documentation A controller
be asleep
find we are not
cal human
of pages.
such
also
rather
or inter-
reception
is one
paging
The
shielding
the
must
we suspected are designed
as turning the pager off or forgetting to it, and the occurrence of dead batteries,
many
unit,
controller
problems
and issues.
morning when the dayshift for the members of the team
none currently allow a customizable feedback feature (non-email based). We have found of pages
keeping critical
engine
This
that that
does
a moni-
kind of addition
will
significantly
reduce
the remaining
dayshift
human
monitor-
thank
Dr.
Guenter
tions
innovation on the EUVE project. also like to thank all the members
be
detected
early
and
Hughes
avoided
altogether.
staff Our
current
when
an anomaly
be called the
system
monitors is detected,
in to deal
expanding
for
with
capability
anomalies;
a person
must
the
problem.
With
for
on-board
fault
of
who
helped in
selves.
References
an
ideal
situation,
autonomous
taught
to recognize
lies,
then
necessary applies would
certain
it could to
be
deal
types
taught
with
the
action
situation.
primarily to known anomalies, be an important step toward
Critical
is
1994,
This
Conclusion
to have
to find ways
tions
costs.
availability
With of
proven mission tion of labor attainable you can room
and
lower-cost increases a model wise,
our
climate,
we are all going
to reduce the
mission
opera-
development
low-cost
AI
operations intensive
packages
software, activities
a more
operation.
reliable, As
our
efforts
Friedland, their tive
to thank
D. Korsmeyer,
con-
Ames
on
"Low
F.
1994,
grant
Morgan,
Operations
Congress,
Sess.
for
Small
Operations Israel,
R. F.
and 1994
1995,
Operations "Robotic
IAA
Missions,
Approaches
T. & Malina
Soc. Pac.,
Ground
the EUVE Science presented at the 45th Satellite
Cost
on
Technology
Astronautical Small
in Autonomous Astron.
and
Low-Cost
Jerusalem,
Spacecraft,
SPACEOPS Symposium
Innovative
Mission
Analysis," 9-14.
and
Proc.
Operations in press.
at Center,
Satellite
Data
October
Advances
for the
Telescopes,"
EUVE Proc.
in press.
NASA RTworks, Street,
Acknowledgments like
by NASA
NASA
International
Talarian Corporation, Suite 140, Mt. View,
(415) 965-8050.
would
supported
Testbedding Operations
on
centers can only help to increase the expertise available for other missions to call upon.
We
operations
and
and
experience
across
R.
Symp.
and our system matures, we become for other missions to follow. Likecollaborative
operations
Operations,
International
eliminais an
safer,
of
We would of the CEA
science
Approaches
and
goal. At CEA, we are proving that remove humans from the control obtain
championing
one-shift
and
"Third
Malina,
fiscal
Peter
EUVE
Space Mission Data Systems,"
but it greater
autonomy.
In the current
their
Head-
and
Abedini, A. & Malina, R. F. 1995, Designing an Autonomous Environment for Mission
of anoma-
what
Polidan
to make
work has been
detection and reaction, the next logical step is to move the autonomous monitoring software from the control center to the satellites themIn
for
the
tract NAS5-29298 NCC2-838.
monitoring software would have the ability to take corrective action. If the software can be
Ron
GSFC
a reality center. This
Dr.
of NASA
quarters
may
and
Riegler
ing functions, ultimately allowing a move to zero shifts. In some cases, anomalous situa-
M. Montemerlo,
P.
and D. Atkinson
for
support of the development of innovatechnologies for NASA missions. We
444 CA
Castro 94041,