Modelling and simulation of IP over InfiniBand with OMNeT++

Utkal Sinha*
M.Tech Scholar, Dept. of Computer Science and Engineering, National Institute of Technology, Rourkela, Sector - 2, Rourkela, Odisha 769008, India
Email: [email protected]
* Corresponding author

Ratnakar Dash
Assistant Professor, Dept. of Computer Science and Engineering, National Institute of Technology, Rourkela, Sector - 2, Rourkela, Odisha 769008, India
Email: [email protected]

Abstract: Most data centres run IP-based applications, which are often executed on top of high-speed, low-latency interconnect technologies such as InfiniBand. This is achieved using a protocol called IP over InfiniBand (IPoIB). To develop and test new techniques that depend on the IPoIB protocol, a network simulation model for IPoIB is desired. In this work, an IPoIB simulation model based on the OMNeT++ framework has been designed. The simulation model also supports InfiniBand quality of service (QoS) and InfiniBand network partitions. Simulation results, validated against real systems, show the accuracy of the proposed model.

Keywords: IPoIB model; OMNeT++ model; IPoIB performance evaluation; HPC simulation.

Reference to this paper: This paper is under review in a journal. It is uploaded to ResearchGate as another paper depends on it. Please use the ResearchGate DOI when referencing this article.

Biographical notes: Utkal Sinha is an M.Tech scholar in the field of computer science at the National Institute of Technology, Rourkela, India. His research interests include computer networking, high performance computing, pattern recognition and machine learning. Ratnakar Dash is an Assistant Professor in the Department of Computer Science and Engineering at the National Institute of Technology, Rourkela, India. His research interests include signal processing, image processing and computer networks.

1 Introduction

According to Banks (2005), a simulation is an imitation of the behaviour of a system over time. Simulations are extensively used in academia, for general or specific purposes, in the field of high performance computing (HPC). Many state-of-the-art network simulators are available. For instance, network simulator version 2 (NS2) and version 3 (NS3), as described by Breslau et al. (1999), are discrete-event simulators mostly used for networking research. OMNeT++ (2016) is also a discrete-event simulator, but it differs from other network simulators primarily in its architectural design, as discussed in Section 1.1. OMNeT++ also has a commercial version called OMNEST (2016).

OMNEST has numerous simulation models in the field of HPC. However, it currently does not have a standardized model for InfiniBand and, more specifically, for IPoIB. On the other hand, data centres and HPC systems need high-throughput, low-latency interconnect technologies to connect their computing nodes. According to the TOP500 (2015) list of supercomputers, InfiniBand is the leading interconnect family based on system share, connecting nodes in HPC clusters, as shown in Table 1. Inherently, InfiniBand does not support the internet protocol (IP), since it has its own networking stack as defined by the InfiniBand Trade Association (2015). In order to support IP-based communication in HPC clusters and data centres, the IP over InfiniBand (IPoIB) protocol has been standardized by the Internet Engineering Task Force (IETF) (V. Kashyap, 2006).

Table 1: Interconnect Family System Share

Interconnect Family     System Share (%)
InfiniBand              47.4
10 Gigabit Ethernet     23.8
Custom Interconnect     14.8
Gigabit Ethernet        12.4
Proprietary Network     1.6

InfiniBand is popular for its zero-buffer-copy feature, whereas IP is based on buffer copies. Therefore, it is imperative to have a simulation model to study the impact and performance of IPoIB-based communication before actually implementing it on the InfiniBand physical fabric. In this paper, we propose an IPoIB simulation model based on the OMNeT++ simulation framework. This simulation model can be used to simulate:

• IP differentiated services
• IPoIB secure fabrics

The proposed IPoIB model differs from the existing state of the art in that it introduces the key components of the IPoIB protocol as well as those of InfiniBand:

• IPoIB IP address resolution
• InfiniBand queue pairs (QPs) for the IPoIB network interface
• Complete and dedicated sixteen virtual lanes (VLs) to simulate QoS in InfiniBand and study the impact and performance of all possible VLs

The simulation model is then used to calculate network latencies and throughputs. The results thus obtained are compared with those of existing real-world IPoIB devices, validating the proposed IPoIB model.

1.1 OMNeT++ Overview

OMNeT++ in general is a framework for network simulations. In other words, it provides the necessary infrastructure, or kernel, on top of which any network simulator or model can be built. As mentioned in the OMNeT++ version 4.6 User Manual (2016), OMNeT++-based simulation models are subdivided into compound modules and simple modules. Simple modules are the ultimate elements or building blocks of any networking model. Each simple module is connected to other simple or compound modules via gates, and each module communicates with other modules using message passing. This simplifies the entire process of simulation model design and also leads to module reusability.
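As a brief illustration of this structure, the following minimal sketch shows a hypothetical OMNeT++ simple module. The module and gate names are illustrative only and are not part of the proposed IPoIB model; the sketch assumes the OMNeT++ 4.x C++ API.

```cpp
// Minimal, illustrative OMNeT++ simple module (hypothetical; not from the IPoIB model).
// A simple module schedules self-messages and exchanges messages with other modules
// through its gates, which is the basic mechanism every model in this paper builds on.
#include <omnetpp.h>

class ExampleNode : public cSimpleModule
{
  protected:
    virtual void initialize()
    {
        // Start activity by scheduling a self-message one simulated second from now.
        scheduleAt(simTime() + 1.0, new cMessage("selfTimer"));
    }

    virtual void handleMessage(cMessage *msg)
    {
        if (msg->isSelfMessage()) {
            // On the timer, send a packet out through the "out" gate (assumed to be
            // declared in the corresponding NED file).
            send(new cMessage("examplePacket"), "out");
            delete msg;
        }
        else {
            // Messages arriving from other modules are simply logged and consumed.
            EV << "Received " << msg->getName() << "\n";
            delete msg;
        }
    }
};

Define_Module(ExampleNode);
```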

1.2 InfiniBand Overview

InfiniBand is a high-speed, low-latency, point-to-point interconnect technology developed by the InfiniBand Trade Association (2015). Since InfiniBand keeps most of the required buffers and registers on the channel adapter itself, it has low processing overheads. The InfiniBand architecture is divided into multiple layers: upper-level protocols, transport layer, network layer, link and physical layers. Each of these layers depends only on the services offered by the layer below it. The InfiniBand architecture (IBA), or fabric, consists of network elements such as InfiniBand switches, management components, routers, channel adapters (CAs), and links. The following are some of the major features of the IBA:

• Queue Pairs: A queue pair (QP) is a component of a channel adapter. Queue pairs act as buffers for send and receive packets in an InfiniBand (IB) node. Each queue pair is identified or addressed by another IB node using a number called the Queue Pair Number (QPN).

• InfiniBand Partitions: An InfiniBand partition defines a group of InfiniBand end nodes that are allowed to communicate with one another. Partitions in InfiniBand are implemented using a key known as the Partition Key (P_Key). It has the following properties (a minimal sketch of the resulting membership check is given after this list):
  o A partition key (P_Key) is a unique ID assigned to an InfiniBand partition.
  o The base P_Key of the default partition (0xFFFF) is 0x7FFF.
  o When a P_Key is created, it is a 15-bit number. After the membership type is set, the P_Key value becomes a 16-bit number.
  o The most significant bit (MSB) of the P_Key value denotes the membership type:
    - 0: Limited member (a limited member can only communicate with full members of the partition)
    - 1: Full member (a full member can communicate with both full and limited members of the partition)
  o If 0 represents restricted communication and 1 represents permitted communication, then the communication policies between member types can be represented as shown in Table 2.

Table 2: Membership Possible Communications

Membership        Limited Member    Full Member
Limited Member    0                 1
Full Member       1                 1

  o When assigning a P_Key value to a unique non-default partition, a 15-bit value in the range 0x0001 to 0x7FFF should be selected (for example, 0x1234), so a total of 32767 P_Keys is available. When a packet arrives at a compute node, the partition key (P_Key) of the packet is matched against the Subnet Manager (SM) configuration. This validation prevents a compute node from communicating with another compute node outside its partition. InfiniBand partitions can therefore be used to increase security by imposing network isolation.

• Virtual Lanes: IB virtual lanes (VLs) virtually divide the physical link into several virtual links. The IBA defines 16 VLs (VL0-VL15) per port. These virtual lanes are configured by the subnet manager.

• Quality of Service: IB defines 16 service levels (SLs) to support priority applications on the IB fabric. To support QoS in IB, the SM configures and maps available VLs to SLs. The IBA does not define any specific priority order for VLs; the VL priority order is usually implemented by the device vendor. In general, VL0 has the lowest priority whereas VL15 has the highest priority.

• Addressing: Addressing in InfiniBand is mainly achieved using the following fields:
  o Local Identifiers (LIDs): a 16-bit identifier used by the SM for intra-subnet routing.
  o Global Unique Identifiers (GUIDs): a 64-bit identifier that uniquely identifies each element in a subnet. The GUID never changes, is used as part of the address when creating a GID, and is equivalent to the MAC address in Ethernet.
  o Global Identifiers (GIDs): a 128-bit identifier used by IB routers for inter-subnet routing.
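To make the membership rules concrete, the following hedged sketch (illustrative helper functions only, not taken from the simulation model's source) splits a 16-bit P_Key into its membership bit and 15-bit base value and evaluates the communication policy of Table 2:

```cpp
#include <cstdint>

// Illustrative only. Assumed P_Key layout, as described above:
//   bit 15      = membership type (1 = full member, 0 = limited member)
//   bits 0..14  = 15-bit partition number (base value)
constexpr uint16_t pkeyBase(uint16_t pkey)     { return pkey & 0x7FFF; }
constexpr bool     isFullMember(uint16_t pkey) { return (pkey & 0x8000) != 0; }

// Two P_Keys allow communication when they name the same partition and at
// least one side is a full member (Table 2: limited-limited is blocked).
constexpr bool canCommunicate(uint16_t a, uint16_t b)
{
    return pkeyBase(a) == pkeyBase(b) && (isFullMember(a) || isFullMember(b));
}

static_assert(canCommunicate(0x8012, 0x0012), "full <-> limited is allowed");
static_assert(!canCommunicate(0x0012, 0x0012), "limited <-> limited is blocked");
```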

1.3 IP over InfiniBand (IPoIB)

InfiniBand uses a different addressing mechanism (which is not IP based and does not support sockets) to identify nodes in a network. Therefore, it does not natively support TCP/IP-based applications. To support IP-based applications on top of the InfiniBand fabric, the Internet Engineering Task Force (IETF) (1986) IPoIB working group has specified the IPoIB protocol, published as RFC specifications by J. Chu (2006) and V. Kashyap (2006). It is important to note that, since this protocol is designed to support IP-based applications and protocols that use IP addresses, such as ICMP, IPv4, IPv6, TCP and UDP, it does not support VLANs. The IPoIB implementation resides at layer 2 of the OSI model, as shown in Figure 1.

Figure 1: IPoIB driver is implemented at the data link layer of the OSI model

2 Related Works

A Mellanox-contributed OMNeT++ InfiniBand Flit Level Simulation Model (2013) already exists. However, this model is no longer supported, and it does not reflect the default parameter values of the micro-architectures of current Mellanox products. Moreover, this flit-level model does not include the IPoIB protocol. Another OMNeT++-based InfiniBand simulation model is presented by P. Yebenes et al. (2013). This model simulates real InfiniBand devices quite accurately and proposes models for most of the key components of the InfiniBand architecture (for instance, the InfiniBand network card (HCA), VL arbitration, etc.). However, this model does not include the newly standardized IPoIB protocol. There are significant buffering and processing overheads due to the multiple buffer copies in the IPoIB protocol, which this model does not account for. It also does not consider the buffering latencies at each of the virtual lane buffers in the InfiniBand host channel adapter (HCA).

3 IPoIB Simulation Model

Since IPoIB uses multiple buffer copies and has its own IP-address-to-hardware-address resolution mechanism, new message types and several modules are implemented inside the OMNeT++ framework.

3.1 Message Types

Table 3 lists the message types created to model IPoIB ARP handling (IP address to hardware address resolution) and to carry IPoIB data frames over the InfiniBand fabric.

Table 3: OMNeT++ IPoIB simulation model message types

Message Type            Description
IPv4ARP.msg             Carries IPoIB ARP request and reply packets
IPoIBFrameIB.msg        Carries IPoIB data frames
QPsendCompletion.msg    Notifies the queue pair of work request completion
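For orientation, the sketch below shows how a module might dispatch on these message types. In the real model the classes would be generated by the OMNeT++ message compiler from the .msg files above; here they are stubbed as empty cMessage subclasses, and the module name is an assumption.

```cpp
#include <omnetpp.h>

// Illustration only: stand-ins for the classes generated from the .msg files in
// Table 3; fields such as addresses and payloads are omitted.
class IPv4ARP          : public cMessage { };
class IPoIBFrameIB     : public cMessage { };
class QPsendCompletion : public cMessage { };

class IPoIBLayerSketch : public cSimpleModule
{
  protected:
    virtual void handleMessage(cMessage *msg)
    {
        if (dynamic_cast<IPv4ARP *>(msg))
            EV << "ARP request/reply received\n";        // address resolution path
        else if (dynamic_cast<IPoIBFrameIB *>(msg))
            EV << "IPoIB data frame received\n";         // data path to/from the fabric
        else if (dynamic_cast<QPsendCompletion *>(msg))
            EV << "send work request completed\n";       // completion notification
        delete msg;
    }
};

Define_Module(IPoIBLayerSketch);
```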

3.2 InfiniBand IPoIB Node Model

In real InfiniBand nodes, the IPoIB driver sits on top of the InfiniBand queue pairs (QPs) and registers itself as a network interface with the OS kernel. It accepts IP packets from the IP (network) layer and translates them into InfiniBand frames after attaching an IPoIB header. Once the IPoIB frames are created, work requests are pushed into the queue pairs. Typical InfiniBand work requests are packet send and packet receive requests, which are pushed into the send and receive queues of a queue pair. The work requests are then processed, and entries are pushed to the completion queues to notify the completion of each IPoIB send or receive operation. According to the InfiniBand Trade Association (2015), each InfiniBand HCA port has a maximum of sixteen virtual lanes (VLs), which in turn map to a maximum of sixteen service levels (SLs) to support QoS. Each VL has a dedicated buffer to send or receive network frames, and a VL arbitrator is used to select the next virtual lane for transmission. The InfiniBand IPoIB node model architecture is divided into a stack of four layers, as shown in Figure 2.

Figure 2: IPoIB modelled node layered architecture

Layer 1: It comprises the InfiniBand HCA port model, the virtual lane (VL) model, the VL arbitrator and the arbitration table. The HCA port in this layer connects to another node's HCA port via an InfiniBand switch model to create links. This layer provides services to the queue pair layer.

Layer 2: The queue pair layer consists of send and receive queues. It also has an optional completion queue. It provides services to the IPoIB layer (a hedged sketch of this queue pair behaviour follows the layer descriptions).

Layer 3: The IPoIB layer models the IPoIB protocol. It implements the shadow ARP caches that support IP-address-to-hardware-address resolution. This layer is also responsible for the creation of InfiniBand network frames. It provides services to the upper layers.

Layer 4: This is the top layer of the proposed simulation model. It consists of two sub-models, an IP traffic generator and an IP traffic sink, which model a typical IP traffic load source and an IP traffic consumer.
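The sketch below illustrates the send path through the queue pair layer described above. It is built around OMNeT++ cPacketQueue objects; the gate names ("fromIPoIB", "toIPoIB", "toPort") and the immediate completion notification are illustrative assumptions rather than the model's actual behaviour.

```cpp
#include <omnetpp.h>

// Illustration only: a queue-pair-like module. Frames arriving from the IPoIB
// layer are treated as send work requests; once a frame has been handed to the
// port layer, a completion notification is sent back upward.
class QueuePairSketch : public cSimpleModule
{
  protected:
    cPacketQueue sendQueue;   // send work requests waiting for the HCA port
    cPacketQueue recvQueue;   // frames received from the port, pending delivery

    virtual void handleMessage(cMessage *msg)
    {
        cPacket *pkt = check_and_cast<cPacket *>(msg);   // all traffic assumed to be packets
        if (pkt->arrivedOn("fromIPoIB")) {
            sendQueue.insert(pkt);                       // post a send work request
        }
        else {                                           // frame arriving from the HCA port
            recvQueue.insert(pkt);
            send(recvQueue.pop(), "toIPoIB");            // deliver upward
        }

        // Drain pending send work requests toward the port and report completions.
        while (!sendQueue.isEmpty()) {
            send(sendQueue.pop(), "toPort");
            send(new cMessage("QPsendCompletion"), "toIPoIB");
        }
    }
};

Define_Module(QueuePairSketch);
```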

Figure 3: InfiniBand IPoIB modelled node showing its internal components

Figure 3 shows the modelled InfiniBand IPoIB node. It consists of the following internal components:

• hcaPort: The hcaPort simple module has sixteen virtual lane buffer objects of the OMNeT++ cPacketQueue class to simulate the sixteen VLs present per port in real HCA cards and to support QoS in terms of InfiniBand SLs and traffic priorities; that is, network traffic with a higher service level (SL) should be processed before traffic with lower SLs. The hcaPort simple module also has VL arbitrator and arbitration table objects. The VL arbitrator object is used to select the next VL, out of the 16 VLs, for packet transmission. This VL selection is driven by the VL arbitration table, whose entries are supplied through the omnetpp.ini configuration file. The VL buffers are polled at regular intervals for frames to send or receive (a simplified sketch of this buffering and arbitration is given after this list).

• queuePair: The queuePair simple module has the queue pair objects: a send queue object and a receive queue object. The send queue object holds the IPoIB messages to be sent, whereas the receive queue object holds IPoIB messages received via the hcaPort simple module. A completion queue object has also been implemented in this simple module to receive work request completion events from the hcaPort module. These completion events can then optionally be forwarded to upper-layer modules, such as the iPoIBLayer simple module, to notify the successful completion of work requests.

• iPoIBLayer: The iPoIBLayer simple module accepts IP packets from the layer above, i.e. the network layer or any IP-packet-generating layer, which in this case is the ipTrafficGenerator simple module. The iPoIBLayer module also maintains a shadow cache of the ARP table. Whenever an IP packet is received and the hardware address corresponding to the destination IP address is not found in its shadow ARP table, an IPoIB ARP request (using IPv4ARP.msg) is sent by the iPoIBLayer. On receiving the IPoIB ARP reply (using IPv4ARP.msg) from the destination node, the iPoIBLayer updates its ARP table. It then creates an IPoIB data frame with the necessary InfiniBand headers (using IPoIBFrameIB.msg) and sends it to the layer below, i.e. the queuePair simple module.

• ipTrafficGenerator: The proposed IPoIB simulation model has 16 types of ipTrafficGenerator objects corresponding to the 16 types of SL traffic. Each ipTrafficGenerator simulates the upper layers of the OSI protocol stack, from the network layer up to the application layer. It creates IP packets, which are then pushed to the iPoIBLayer module.

• iPoIBSink: The iPoIBSink module consumes the IP packets forwarded by the iPoIBLayer module.
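The following hedged sketch illustrates the per-port VL buffering and a simple arbitration pass of the kind described for hcaPort. The weights, the gate name and the use of the message kind field to carry the SL are assumptions made for illustration; the actual model reads its VL arbitration table and polling interval from omnetpp.ini.

```cpp
#include <omnetpp.h>

// Illustration only: sixteen per-port VL buffers and a simplified arbitration pass.
class HcaPortSketch : public cSimpleModule
{
  protected:
    enum { NUM_VLS = 16 };
    cPacketQueue vlBuffer[NUM_VLS];   // one dedicated buffer per virtual lane
    int weight[NUM_VLS];              // arbitration weights (placeholder values)

    virtual void initialize()
    {
        for (int vl = 0; vl < NUM_VLS; vl++)
            weight[vl] = vl + 1;      // placeholder: higher VL index gets a larger share
        scheduleAt(simTime(), new cMessage("pollVLs"));
    }

    virtual void handleMessage(cMessage *msg)
    {
        if (msg->isSelfMessage()) {
            arbitrate();
            scheduleAt(simTime() + 4.11e-6, msg);   // poll interval as in Section 5.1
        }
        else {
            cPacket *pkt = check_and_cast<cPacket *>(msg);
            int vl = pkt->getKind() % NUM_VLS;      // SL assumed to be carried in the kind field
            vlBuffer[vl].insert(pkt);
        }
    }

    void arbitrate()
    {
        // Serve higher VLs first, up to their weight, so that traffic on higher
        // service levels leaves its buffer before lower-priority traffic.
        for (int vl = NUM_VLS - 1; vl >= 0; vl--)
            for (int n = 0; n < weight[vl] && !vlBuffer[vl].isEmpty(); n++)
                send(vlBuffer[vl].pop(), "out");
    }
};

Define_Module(HcaPortSketch);
```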

3.3 InfiniBand Switch Model

The proposed InfiniBand switch model is divided into two layers, as represented in Figure 4.

Figure 4: InfiniBand switch model layered architecture

Layer 1: Just like the InfiniBand IPoIB node model architecture described in Section 3.2, this layer consists of the modelled HCA ports, VLs, arbitrator and arbitration table.

Layer 2: This is the switching layer. It consists of the port forwarding tables, which can easily be configured using the omnetpp.ini configuration file before starting the simulation. It is also responsible for isolating InfiniBand partitions using the port forwarding table and the partition keys (P_Keys).

Figure 5: InfiniBand switch model implementation in OMNeT++

Figure 5 illustrates the InfiniBand switch model implementation using the OMNeT++ framework. The hcaPort objects constitute layer 1, and the ibSwitchLogic object constitutes layer 2 of the InfiniBand switch model architecture shown in Figure 4. The InfiniBand subnet manager (SM) is not simulated in the proposed model, since parameter values such as the LIDs, P_Keys and port mapping tables are passed via the omnetpp.ini configuration file during the simulation initialization phase.
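A hedged sketch of the switching layer's forwarding decision is shown below: a destination-LID-to-port map combined with a per-port P_Key check. The table contents, message parameters and gate naming are illustrative assumptions; in the model itself these values come from omnetpp.ini and from the fields of the IPoIBFrameIB.msg frames.

```cpp
#include <omnetpp.h>
#include <map>
#include <set>

// Illustration only: ibSwitchLogic-style forwarding. A frame is forwarded on the
// port associated with its destination LID, but only if its P_Key is admitted on
// that port; otherwise it is dropped, which isolates the InfiniBand partitions.
class IbSwitchLogicSketch : public cSimpleModule
{
  protected:
    std::map<int, int> portForLid;                // destination LID -> output port index
    std::map<int, std::set<int> > pkeysForPort;   // output port -> admitted P_Keys

    virtual void initialize()
    {
        // Placeholder entries; the real model fills these tables from omnetpp.ini.
        portForLid[1] = 0;
        portForLid[2] = 1;
        pkeysForPort[0].insert(0x0012);
        pkeysForPort[1].insert(0x0015);
    }

    virtual void handleMessage(cMessage *msg)
    {
        // Destination LID and P_Key are assumed here to travel as message parameters.
        int dlid = (int)msg->par("dlid").longValue();
        int pkey = (int)msg->par("pkey").longValue();

        std::map<int, int>::iterator it = portForLid.find(dlid);
        if (it != portForLid.end() && pkeysForPort[it->second].count(pkey))
            send(msg, "port$o", it->second);      // forward within the partition
        else
            delete msg;                           // unknown LID or partition violation
    }
};

Define_Module(IbSwitchLogicSketch);
```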

4 IPoIB Simulation Features

4.1 IP differentiated services

Modern IP networks use a 6-bit differentiated services code point (DSCP) field in the IP header for packet classification purposes. To model such IP differentiated services on top of the IPoIB fabric, mapping tables are set in the omnetpp.ini configuration file to distribute the DSCP values, as described by J. Babiarz et al. (2006), uniformly over the InfiniBand service levels (SLs). In other words, higher-priority IP traffic is mapped to IPoIB frames with higher SL values (a small sketch of one such mapping is given at the end of this section). According to the InfiniBand Trade Association (2015), a packet with a higher SL value is processed before a packet with a lower SL value.

4.2 IPoIB secure fabrics

IPoIB secure fabrics can be modelled in the proposed simulation model by mapping IP broadcast domains to InfiniBand partitions using partition keys (P_Keys) in the omnetpp.ini configuration file before beginning a simulation.

4.3 IPoIB QoS performance impact measurements

End-to-end latency is used as the QoS performance metric in the proposed IPoIB simulation model. That is, if p1 and p2 are two traffic flows with service levels s1 and s2 such that s1 > s2, the two flows carry the same load and start simultaneously, and t1 and t2 are their corresponding end-to-end latencies, then t1 < t2.
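As an illustration of the uniform DSCP-to-SL distribution mentioned in Section 4.1, the following hedged sketch maps the 64 DSCP code points evenly onto the 16 service levels. This is one plausible uniform mapping, not necessarily the exact table used in the paper's experiments.

```cpp
#include <cstdint>
#include <cassert>

// Illustration only: map the 6-bit DSCP field (0..63) uniformly onto the sixteen
// InfiniBand service levels (0..15), so that higher-priority DSCP classes land on
// higher SLs (64 code points / 16 SLs = 4 code points per SL).
inline int dscpToServiceLevel(uint8_t dscp)
{
    return (dscp & 0x3F) / 4;
}

int main()
{
    assert(dscpToServiceLevel(0)  == 0);    // best-effort traffic -> lowest SL
    assert(dscpToServiceLevel(46) == 11);   // EF (expedited forwarding) -> a high SL
    assert(dscpToServiceLevel(63) == 15);   // top of the DSCP range -> highest SL
    return 0;
}
```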

5 IPoIB Model Examples of Use

5.1 Scenarios

To show the working of the proposed simulation model, a five-node star topology is constructed with varying traffic. Although the example uses a star topology, the proposed model can be used to simulate any other topology. The links used are InfiniBand FDR with a data rate of 54 Gbps. The number of partitions is 2, with P_Keys 0x0012 and 0x0015, and the number of VLs is 16. The IPoIB MTU size is set to 1500 bytes, and the latency introduced by the InfiniBand switch is set to 200 ns. The HCA nodes have a processor speed of 3.0 GHz, use PCIe v3 buses, and their buffer polling interval is set to 4.11 us. The remaining parameters are likewise passed through the omnetpp.ini configuration file.

5.2 Results and Discussion

The message size is varied, and the corresponding IPoIB model and device latencies are plotted as shown in Figure 6. The latency plots for the model and the device show only negligible differences, and the results obtained are as expected for this configuration.

Figure 6: IPoIB model and device latency vs. message size

To test QoS in the IPoIB simulation model, all 16 priority applications shown in Figure 3 are set to send 256 kB of traffic each, and suitable VL arbitration table entries are made for each hcaPort in the configuration file. The results thus obtained are shown in Figure 7.

Figure 7: End-to-end latencies experienced by applications with different priority levels

Application 0, being of the lowest priority, experienced the maximum latency, whereas application 15, being of the highest priority, had the least latency while transmitting a payload of 16 kB. It is important to note that applications of different priority levels experience different latencies due to the extra buffer wait cycles in the HCA node VLs and the InfiniBand switch VLs. Depending on the priority level, the variance in the latencies can be adjusted by setting appropriate entries in the VL arbitration table.

Also, to understand how well the IPoIB model simulates a real physical IPoIB device, throughput measurements are obtained by varying the message size from 256 bytes to 32 kB. The measurements obtained are shown in Figure 8.

6 Conclusion

The proposed IPoIB simulation model is designed and implemented using OMNeT++ and is used to simulate a star topology network. The results thus obtained are compared with those of real IPoIB InfiniBand nodes, and the negligible differences found verify the accuracy of the proposed IPoIB simulation model. As part of future work, the proposed model could be extended to include InfiniBand subnet manager and subnet management agent models.

Glossaries

HCA : InfiniBand Host Channel Adapter
IPoIB : IP over InfiniBand
VL : HCA port virtual lane
SL : InfiniBand service level
hcaPort : Model of the HCA port
queuePair : Model of the InfiniBand queue pair
iPoIBLayer : Model of the IPoIB protocol
ARP : IP address to InfiniBand hardware address resolution
LID : InfiniBand subnet local identifier
P_Key : InfiniBand partition key
ibSwitchLogic : InfiniBand switching implementation

References

Babiarz, J., Chan, K. and Baker, F. (2006) 'Configuration Guidelines for DiffServ Service Classes', IETF RFC 4594 [Online] https://tools.ietf.org/html/rfc4594

Bajaj, S., Breslau, L., Estrin, D., Fall, K., Floyd, S., Haldar, P., Handley, M., Helmy, A., Heidemann, J., Huang, P., Kumar, S., McCanne, S. and Yu, H. (1999) 'Improving Simulation for Network Research', University of Southern California.

Banks, J. (2005) 'Discrete-event System Simulation', Prentice-Hall International Series in Industrial and Systems Engineering, Pearson Prentice Hall [Online] https://books.google.co.in/books?id=xK21QgAACAAJ

Chu, J. and Kashyap, V. (2006) 'Transmission of IP over InfiniBand (IPoIB)', IETF RFC 4391 [Online] https://tools.ietf.org/html/rfc4391

InfiniBand Trade Association (2015) 'InfiniBand Architecture Volume 1 and Volume 2' [Online] http://www.infinibandta.org/content/pages.php?pg=technology_public_specification

Internet Engineering Task Force (1986) 'Internet Engineering Task Force' [Online] http://ietf.org/

Kashyap, V. (2006) 'IP over InfiniBand (IPoIB) Architecture', IETF RFC 4392 [Online] https://tools.ietf.org/html/rfc4392

OMNeT++ (2016) 'OMNeT++ Discrete Event Simulator - Home' [Online] https://omnetpp.org/

OMNeT++ version 4.6 User Manual (2016) [Online] https://omnetpp.org/doc/omnetpp/manual/usman.html

OMNeT++ InfiniBand Flit Level Simulation Model (2013) [Online] http://jp.mellanox.com/page/omnet

OMNEST (2016) 'OMNEST - High-Performance Simulation for All Kinds of Networks' [Online] https://omnest.com/

TOP500 (2015) 'Interconnect Family System Share' [Online] http://www.top500.org/statistics/list/

Yebenes, P., Escudero-Sahuquillo, J., Garcia, P. J. and Quiles, F. J. (2013) 'Towards Modeling Interconnection Networks of Exascale Systems with OMNeT++', Parallel, Distributed and Network-Based Processing (PDP), 21st Euromicro International Conference on, Belfast, 2013, pp. 203-207. doi: 10.1109/PDP.2013.36