Using Hardware-Software Codesign Language to implement CANSCID

Oleg Medvedev
Lanit-Tercom, Saint-Petersburg, Russia
[email protected], [email protected]

Ilya Posov
Center for informatization of Education "CTE", Saint-Petersburg, Russia
Email: [email protected]

Abstract—The paper describes our solution to this year's hardware-software codesign problem, implemented in our codesign language (HaSCoL).

I. INTRODUCTION

The main goal of our participation in the hardware-software codesign contest was to test our language (HaSCoL, see [2], [3]) on a new practical problem. Thus, this paper describes our solution as well as some features of the language that simplify hardware development for this problem.

The only thing the reader needs to know about HaSCoL is that the language is semantically based on a model of blocking and nonblocking message passing. A message is guaranteed to be delivered as soon as a receiver accepts it, which may be in the same cycle in which it is sent. Thus, cycle boundaries need to be expressed explicitly. Some further syntactic notes appear as comments in the code fragments.

We would also like to note that this year's contest seems to be more of a "hardware with instrumental software" codesign.

II. PATTERN MATCHERS

All patterns are matched with DFAs. We used the organizers' JLex-based implementation to generate the DFAs. We generated two flavors of DFA implementation:
• a pure state machine in VHDL for patterns that need to be checked for every incoming byte;
• a binary code for "DFA engines" (described in section II-B), which is stored in a block RAM.

The latter type makes it possible to store several category-specific DFAs in two block RAMs. This relies on the fact that those DFAs belong to different categories, so only one of them is used for each packet. We used a rather small FPGA (xc5vlx50t), so we needed this technique to free more logic resources for DFAs of the first type. Our plan was to use NFAs for the most complex patterns. We believe that this approach would have allowed us to implement all the patterns, but we failed to implement this idea in time.

Apart from matching, every matcher is responsible for storing an array with the current matcher state for each of the 64 streams and for switching between states quickly. Pure VHDL matchers use distributed RAM for the array; DFA engines use the same block RAM that stores the DFA codes; NFA matchers would use block RAMs as well (mostly because we had a lot of spare block RAMs). We don't describe pure VHDL pattern matchers, because their implementation and generation are straightforward.

A. Pattern matcher interface

All the matcher kinds have the same interface. It consists of 4 incoming channels and 1 outgoing channel that reports a match (channels are used to send messages to and to wait for messages from):

    -- next symbol of the current stream (unsigned int, 8 bits wide)
    in newSymbol(symbol : uint(8));

    -- a command to save a state of the current stream
    -- into the internal array at a given 6-bit index
    in saveStreamState(index : uint(6));

    -- load a state of the matcher from a given index for a given category
    in loadStreamState(index : uint(6), categoryID : uint(6));

    -- set the state of the matcher to the start state
    -- of a DFA for a given category
    in resetState(categoryID : uint(6));

    -- a message is sent to this channel each time a match occurs.
    -- A category of the pattern that matches is reported
    out matched0(categoryID : uint(6));
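To make the role of the per-stream state array concrete, the interface can be mimicked in software. The following Python model is only a sketch under assumptions: the class name, the method names and the `step`/`accepting` callbacks are hypothetical, cycle timing is ignored, and the list `matches` merely stands in for the outgoing match channel.

```python
class MatcherModel:
    """Software stand-in for one matcher with 64 per-stream state slots."""

    def __init__(self, start_states, step, accepting):
        # start_states: category ID -> start state of that category's DFA
        # step: function (state, symbol) -> next state
        # accepting: function state -> bool
        self.start_states = start_states
        self.step = step
        self.accepting = accepting
        self.saved = [None] * 64      # one slot per 6-bit stream index
        self.state = None
        self.category = None
        self.matches = []             # stands in for the outgoing match channel

    def reset_state(self, category_id):
        # Counterpart of resetState: start the DFA of the given category.
        self.category = category_id
        self.state = self.start_states[category_id]

    def save_stream_state(self, index):
        # Counterpart of saveStreamState: park the current stream's state.
        self.saved[index] = self.state

    def load_stream_state(self, index, category_id):
        # Counterpart of loadStreamState: resume a previously saved stream.
        self.category = category_id
        self.state = self.saved[index]

    def new_symbol(self, symbol):
        # Counterpart of newSymbol: advance the DFA and report a match.
        self.state = self.step(self.state, symbol)
        if self.accepting(self.state):
            self.matches.append(self.category)
```

Switching between two interleaved streams then amounts to a `save_stream_state` call followed by either `load_stream_state` or `reset_state`, which mirrors the message sequences named in the text.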

Category IDs are only used by DFA engines, because they implement many DFAs for different categories in one unit. Each incoming command is processed in one cycle. Stream symbols may be fed in every cycle or less often, while a stream switch requires up to 3 cycles for a sequence of saveStreamState-loadStreamState or saveStreamState-resetState messages. Matches are reported with a one-cycle delay.

B. DFA engine

The engine operates a DFA as a sequence of 144-bit-wide "instructions" with a one-to-one mapping between instructions and DFA states. The first bit of an instruction indicates whether it represents an accepting state. The other bits define transitions that depend on the interval into which an incoming symbol falls. The intervals must be nonoverlapping. Each transition is described by the 8-bit first character of the interval, by the interval length, and by the address of the destination state. The address is relative to the beginning of the DFA image in memory and is 7 bits wide. An instruction can encapsulate 2 intervals of ≤ 63 symbols, 4 intervals of ≤ 15 symbols and one of ≤ 7 symbols. If an incoming symbol doesn't fall into any of the intervals, a default address is used as the destination. To make relative addresses usable, the absolute address of the current DFA is stored in a special register.

A message handler that accepts an instruction and a new symbol from a stream and sends another message with a destination address looks like this:

    -- acts only when messages come through both listed channels
    command(cmd : uint(144)), newSymbol(s : uint(8)) {
      match cmd with
      -- the code below describes all fields of an instruction, that is,
      -- interval starting symbols, interval lengths, destinations
      { AC : uint(1)
        s0 : uint(8) len0 : uint(6) addr0 : uint(7)
        ............
        s6 : uint(8) len6 : uint(3) addr6 : uint(7)
        defaultAddr : uint(7) } -> {
        -- r0-r6 booleans say whether s falls in one of the intervals
        let r0 = let df = s - s0 in
                 -- df{0:5} means "bits 0-5 inclusive of df"
                 (df{0:5} < len0) and (df{6:7} == 0) in
        ........................
        let r6 = let df = s - s6 in
                 (df{0:2} < len6) and (df{3:7} == 0) in
        -- "&" means bit vector concatenation
        let variants = r0 & r1 & r2 & r3 & r4 & r5 & r6 in
        send newAddressIs(
          if variants eq 0b1000000 then addr0
          elif variants eq 0b0100000 then addr1
          ...................
          elif variants eq 0b0000001 then addr6
          else defaultAddr fi)
      }
    }
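The instruction format of the DFA engine can be checked with a small software model. The Python sketch below is illustrative only: the function names are hypothetical, the MSB-first packing order is an assumption, and the 4-bit length of the four middle intervals is inferred from the "≤ 15 symbols" limit rather than stated in the handler. Under these assumptions the field widths add up to 1 + 2·(8+6+7) + 4·(8+4+7) + (8+3+7) + 7 = 144 bits.

```python
# Length-field widths per interval slot: two 6-bit, four 4-bit (assumed), one 3-bit.
LEN_BITS = [6, 6, 4, 4, 4, 4, 3]
SYM_BITS, ADDR_BITS = 8, 7
TOTAL_BITS = 1 + sum(SYM_BITS + n + ADDR_BITS for n in LEN_BITS) + ADDR_BITS
assert TOTAL_BITS == 144


def pack_instruction(accepting, intervals, default_addr):
    """Pack (start, length, addr) triples into one 144-bit integer.

    The MSB-first field order is an assumption made for this sketch.
    Unused interval slots get length 0, so they never match.
    """
    intervals = list(intervals) + [(0, 0, 0)] * (len(LEN_BITS) - len(intervals))
    word = 1 if accepting else 0
    for (start, length, addr), len_bits in zip(intervals, LEN_BITS):
        assert 0 <= length < (1 << len_bits), "interval too long for its slot"
        word = (word << SYM_BITS) | start
        word = (word << len_bits) | length
        word = (word << ADDR_BITS) | addr
    return (word << ADDR_BITS) | default_addr


def decode_instruction(word):
    """Return (accepting, [(start, length, addr), ...], default_addr)."""
    default_addr = word & ((1 << ADDR_BITS) - 1)
    word >>= ADDR_BITS
    fields = []
    for len_bits in reversed(LEN_BITS):
        addr = word & ((1 << ADDR_BITS) - 1)
        word >>= ADDR_BITS
        length = word & ((1 << len_bits) - 1)
        word >>= len_bits
        start = word & ((1 << SYM_BITS) - 1)
        word >>= SYM_BITS
        fields.append((start, length, addr))
    return bool(word & 1), fields[::-1], default_addr


def next_address(word, symbol):
    """Pick the relative address of the interval that contains `symbol`."""
    _, fields, default_addr = decode_instruction(word)
    for start, length, addr in fields:
        df = (symbol - start) & 0xFF   # 8-bit wrap-around subtraction
        if df < length:                # low bits below len, high bits all zero
            return addr
    return default_addr
```

For example, `pack_instruction(True, [(ord("a"), 26, 5), (ord("0"), 10, 9)], 3)` describes an accepting state that goes to relative address 5 on a lowercase letter, to 9 on a digit, and to 3 otherwise.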

In the HaSCoL handler of the DFA engine, let x = ... in constructs assign names to expressions, which are then used to evaluate the parameter of a message sent to the newAddressIs channel; send means blocking message sending. The particular width and format of the state instructions was chosen so that almost all category-specific patterns fit into this format. This process required several iterations with the DFA-to-code translation software.

C. NFA matcher

NFAs seemed to us a bad choice for the majority of patterns because of the patterns' simplicity. On the other hand, HaSCoL is a rather convenient target for NFA generation, which is why we decided to use NFAs for the most complex patterns (like dns or snmp). We did not have enough time to fully implement them and did not use them in our final submission. Nevertheless, we describe how to implement an NFA in HaSCoL because it looks quite elegant.

The general principle is to implement each state of an NFA as a message handler. A message coming to this handler in a particular cycle means that this state of the NFA is active in this cycle. The current symbol comes to the handler with another message. The handler inspects the symbol and decides which other states must be active on the next step. It sends activation messages to the appropriate handlers in the next cycle. We can use a blocking send operation if we are not sure that a new symbol comes every cycle. Simultaneous activation of a handler from several other handlers counts as one activation.

Figure 1. An NFA for .*a[ab]*bb (states S0, S1, S2 and the accepting state S3)

Figure 2. Toplevel diagram (blocks: frontend, stream ID, GRLIB, categorizer, packet FIFO, malicious pattern detector, match FIFO, UART manager)

Consider the pattern ".*a[ab]*bb". Its NFA is presented in fig. 1. Its implementation looks like:

    default { send s0() }  -- sends to s0 every cycle
    next(s), local s0 { skip; if s = 'a' then send s1() fi }
    next(s), local s1 { skip; if s = 'a' then send s1() fi
                            | if s = 'b' then send s1() fi  -- [ab] keeps s1 active
                            | if s = 'b' then send s2() fi }
    next(s), local s2 { skip; if s = 'b' then send s3() fi }
    next(s), local s3 { send matchHappened() }
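The activation scheme of the NFA for ".*a[ab]*bb" can be cross-checked in software by tracking the set of active states. The Python function below is a minimal sketch, not part of our design: its name is hypothetical, it ignores cycle timing, and it uses a set so that several simultaneous activations of one state count as one.

```python
def match_positions(data):
    """Yield 1-based positions at which .*a[ab]*bb has just matched."""
    active = {0}                   # S0 is active from the start
    for pos, ch in enumerate(data, start=1):
        nxt = {0}                  # S0 is re-activated on every step
        if 0 in active and ch == "a":
            nxt.add(1)             # S0 -> S1 on 'a'
        if 1 in active and ch in "ab":
            nxt.add(1)             # S1 stays active on [ab]
        if 1 in active and ch == "b":
            nxt.add(2)             # S1 -> S2 on 'b'
        if 2 in active and ch == "b":
            nxt.add(3)             # S2 -> S3 on 'b'
        if 3 in nxt:
            yield pos              # S3 is the accepting state
        active = nxt
```

Feeding the string "abbb" reports matches after the third and the fourth symbol, because both "abb" and "abbb" match the pattern.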

In the NFA code, ";" delimits consecutive pipeline stages; "|" glues statements that run in parallel in the same cycle; "skip" means "do nothing"; "local x" is an in-place declaration of a channel x, which is used to pass messages with no parameters. One channel corresponds to each NFA state.

III. TOPLEVEL

The diagram of our solution as a whole is presented in fig. 2. Several notes follow:
• the solution is designed to process one byte per cycle, that is, one flit per 5 cycles regardless of the flit type;
• we use the GRLIB [1] library of GNU GPL-licensed hardware blocks, in particular its DDR2 and UART controllers;
• the frontend reads a stream of flits from off-chip memory. It consults the stream ID block for 6-bit identifiers of TCP streams and translates the stream of flits into a stream of internal commands, which basically encode messages to be sent to the pattern matchers according to the matcher interface (section II-A);
• stream ID accepts the IP addresses and ports of a stream as a sequence of 3 32-bit words. Each word is processed for 4 cycles, so covering all 64 streams takes 16 parallel checkers, each of which compares an incoming word with the respective IPs/ports of 4 existing streams sequentially. Thus we sustain the 1 byte per cycle throughput with minimal resource usage. The checkers use a total of 8 block RAMs to store stream properties, because we had more spare block RAMs than logic;
• the categorizer manages the DFAs of all the category patterns by sending them the commands that come from the frontend. The same commands are passed to the malicious pattern detector, together with a category ID for each packet. The problem statement obliges us not to process a packet with the malicious pattern detector until the whole packet has passed through the categorizer. Thus the incoming commands are stored in a FIFO. A category ID (with a special ID denoting an unknown category) is added to the beginning of a packet. If a category is determined while the packet is being processed, the ID is overwritten in place in the FIFO;
• the malicious pattern detector reads a stream of commands from the FIFO and sends them to the pure VHDL DFAs and DFA engines, which search for malicious activity patterns;
• all the matches found are stored in another FIFO, which sends them to the UART manager. The latter prints them to a terminal as a series of hexadecimal numbers. They are post-processed on a PC into the jury's log format. The UART manager is also responsible for updating and printing the counters.

We use blocking message passing wherever it is needed to guarantee that no information gets lost. For example, if a test produces too many matches then (because the UART is slow) the UART FIFO gets full and the rest of the design freezes block by block. It unfreezes when some free space appears in the UART FIFO. Although such blocking may happen, strict guarantees on message delivery time allow us to predict performance accurately.

We were tempted to add a Z80-like processor to our design in order to implement the UART manager and the counter printing code in C, but we ended up with a pure hardware implementation because HaSCoL offers enough means to write low-speed sequential code. In particular, HaSCoL supports traditional imperative-style programming, so a message handler that prints a bit vector as a hexadecimal string looks no more complex than it would in C:

    data ready : bool = true;

    -- the handler waits for a message with 32+8+6 bits to be printed
    toUART(a : 32, b : 8, c : 6) when ready {
      pos := 5 | ready := false
      | buf[2] := a{0:7}   | buf[3] := a{8:15}
      | buf[4] := a{16:23} | buf[5] := a{24:31}
      | buf[1] := b | buf[0] := 0b00 & c;
      -- pos is 3 bits wide, so decrementing it below 0 wraps around to 7
      while not (pos == 7) do
        send putChar(buf[pos]{4:7});
        send putChar(buf[pos]{0:3}) | pos := pos - 1
      done;
      ready := true  -- accept the next message
    }
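For comparison, the order in which the toUART handler emits nibbles can be reproduced in a few lines of software. The Python sketch below rests on assumptions: bit 0 is taken to be the least significant bit, the left operand of "&" is taken to be the most significant part, putChar is assumed to render a 4-bit value as one hexadecimal digit, and both function names are hypothetical.

```python
def to_uart_nibbles(a, b, c):
    """Return the 12 nibbles that the handler would pass to putChar."""
    assert 0 <= a < 1 << 32 and 0 <= b < 1 << 8 and 0 <= c < 1 << 6
    buf = [0] * 6
    buf[0] = c                                # 0b00 & c: c padded to 8 bits
    buf[1] = b
    for i in range(4):
        buf[2 + i] = (a >> (8 * i)) & 0xFF    # a{0:7} .. a{24:31}
    pos = 5
    out = []
    while pos != 7:
        out.append(buf[pos] >> 4)             # bits 4..7 go out first
        out.append(buf[pos] & 0xF)            # then bits 0..3
        pos = (pos - 1) & 0b111               # 3-bit counter: 0 - 1 wraps to 7
    return out


def to_hex_string(nibbles):
    # Rendering nibbles as hex digits is assumed to be putChar's job.
    return "".join("0123456789ABCDEF"[n] for n in nibbles)
```

With these assumptions, a = 0xDEADBEEF, b = 0x12 and c = 0x3F are printed as the string DEADBEEF123F.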

The toUART handler uses two global variables, which are declared as follows:

    -- global variables declaration
    data buf : [0..5 : uint(3)] uint(8);
    data pos : uint(3) = 7;

Interestingly, the toUART handler seems to be the most complex code in the design, because the other (high-performance) parts look just like a sequence of linear pipelines with FIFOs in between and a fixed-latency message exchange with the pattern matchers.

IV. SOME FACTS

• The problem solution occupies 950 lines of HaSCoL (before preprocessing), 692 lines of direct block RAM instantiation in VHDL, 577 lines of Java code to generate DFAs in VHDL and binaries for the engines, and 92 lines of shell + 62 lines of OCaml to build synthesizable code for a list of patterns;
• we spent approximately 120 man-hours;
• we debugged the design mostly in simulation with Modelsim. We implemented a pattern-to-NFA generator and a program that generates random short tests with given properties, with streams that match given patterns;
• at the moment of final submission it seemed that we still didn't know how to configure the DDR2 controller from Gaisler appropriately, and that led to wrong words being read from memory (this controller needs calibration of read delays and seems to be sensitive to DDR2 frequency). This did not happen very often, but when it did, our solution used to hang in the middle of a test. Otherwise it produced a correct answer. We decided to submit the solution anyway and inform the jury about this fact;
• our device is a Xilinx Virtex5 xc5vlx50tff1136-1; device utilization is 42% registers, 71% LUTs, 25% SLICEMs; we occupied 32 of the 36K block RAMs directly;
• we used ISE Webpack for synthesis and place-and-route.

REFERENCES

[1] http://www.gaisler.com
[2] D. Boulytchev and O. Medvedev, Hardware Description Language Based on Message Passing and Implicit Pipelining, EWDTS09
[3] http://oops.tepkom.ru/projects/coolkit