Confinity Low Latency Messaging
When Vint Cerf and Robert Kahn published their paper "A Protocol for Packet Network Intercommunication" in the IEEE Transactions on Communications in 1974, no one envisaged that its central control component, the "Transmission Control Program" (TCP), would become the de facto standard for Ethernet and Internet communication. The monolithic Transmission Control Program was later redesigned into a modular architecture consisting of the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP) at the transport layer and the Internet Protocol (IP) at the internet layer.
The design of the Internet protocol suite adheres to the end-to-end principle: the network infrastructure is considered inherently unreliable at any single network element or transmission medium and is dynamic in terms of the availability of links and nodes. No central monitoring or performance measurement facility exists that tracks or maintains the state of the network. To reduce network complexity, the intelligence in the network is deliberately located in the end nodes.
As a consequence of this design, the Internet Protocol only provides best-effort delivery, and its service is characterized as unreliable. It is a connectionless protocol, in contrast to connection-oriented communication. Various fault conditions may occur, such as data corruption, packet loss, and duplication. Because routing is dynamic, with every packet treated independently and no path state maintained in the network, packets to the same destination may travel different routes and arrive out of order at the receiver.
All fault conditions in the network must be detected and compensated for by the participating end nodes.
TCP is defined for one-to-one (1:1) or unicast communication and provides basic built-in reliability through packet acknowledgements. UDP, on the other hand, is used for one-to-many (1:m), many-to-many (n:m) or multicast communication but doesn’t have any built-in reliability.
This is where Confinity Low Latency Messaging (CLLM) comes into play. CLLM enhances standard TCP and UDP with a Reliable Unicast Messaging (RUM) and a Reliable Multicast Messaging (RMM) layer.
2. Confinity Low Latency Messaging
Confinity Low Latency Messaging (CLLM) is the successor of IBM’s MQ LLM, which
was developed by IBM Research Laboratories to address the needs for a robust, highly reliable, yet low latency and high throughput messaging infrastructure.
It serves several use cases in Financial Markets, especially in trading environments, but also in other industries requiring a daemon-less, peer-to-peer transport for one-to-one, one-to-many, and many-to-many data exchange. It exploits the IP multicast infrastructure to ensure scalable resource conservation and timely information distribution.
CLLM has been architected to build reliability for both unicast and multicast messaging. A modified proprietary packet transport layer resides above the datagram layer, incorporating IP, UDP and TCP. For multicast communication, the packet transport layer conforms to the Pragmatic General Multicast (PGM) protocol standard.
The CLLM packet transport layer ensures reliability through a fully developed acknowledgment mechanism. Negative acknowledgments (NAKs) are supported for all transports, although due to TCP’s inherent reliability, negative acknowledgements are used only for stream failover in unicast messaging over TCP/IP.
When NAKs are used, CLLM incorporates several techniques such as a sliding repair window and duplicate NAK suppression to maximize reliability with minimal protocol overhead. This level of reliability enables each client either to receive all the packets or to detect unrecoverable packet loss. Positive acknowledgement can also be used with non-TCP/IP communications to provide higher levels of reliability.
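The duplicate-NAK suppression technique can be sketched in a few lines. The class name and the repair-window parameter below are illustrative assumptions, not part of the CLLM API; the sketch only shows the general idea of servicing one retransmission per sequence number per window.

```python
import time

class NakSuppressor:
    """Suppress duplicate retransmission requests, a common technique in
    PGM-style protocols. Illustrative sketch, not the CLLM implementation."""

    def __init__(self, window_secs=0.05):
        self.window_secs = window_secs
        self._last_repair = {}  # (stream_id, seq) -> time of last resend

    def should_repair(self, stream_id, seq, now=None):
        """Return True if this NAK should trigger a resend, False if an
        identical NAK was already serviced within the repair window."""
        now = time.monotonic() if now is None else now
        key = (stream_id, seq)
        last = self._last_repair.get(key)
        if last is not None and now - last < self.window_secs:
            return False  # duplicate NAK: this packet was just repaired
        self._last_repair[key] = now
        return True
```

When many receivers miss the same multicast packet, they all send NAKs for the same sequence number; suppressing the duplicates keeps protocol overhead minimal while a single retransmission repairs all of them.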
2.3. Message Acknowledgement
CLLM provides multiple options for specifying how the acknowledgement of the message delivery is performed, to support varying degrees of reliable message delivery.
The default form of feedback is negative acknowledgements (NAKs). This method is optimal for throughput and simplicity, but packet loss is possible if a receiver does not detect the loss and transmit the NAK before the packet is removed from the transmitter's buffer.
A higher degree of reliability can be achieved through the use of positive acknowledgements (ACKs), which ensure that a packet has been received and processed by the receiver.
For one-to-many communications, additional levels of reliability can be specified using the unique “Wait-NACK” feature. This allows a transmitter to specify a configurable number of ACKs that must be received before packets can be removed from the history buffer.
The receiver applications can also control when an ACK should be sent, allowing mixed modes of reliability on a single topic and making sure that a message has been processed before sending an acknowledgement.
Transmitters can further request notification (synchronously or asynchronously) when ACKs are received.
These capabilities give applications fine-grained control over message delivery assurance, which is essential for high-performance architectures that integrate many different applications.
2.4. Packet and Message Management
CLLM can handle both out-of-order and lost packets in the network. To control packet order and allow receivers to detect missing packets and request their retransmission, the transmitter numbers the packets it sends sequentially and treats the data flow as a packet stream. Streams are a fundamental concept of the packet transport layer. Each stream has a number that uniquely identifies the physical packet sequence originating at one source. The stream packets are sent out by a transmitter. Receivers join the multicast group that corresponds to the stream and receive its packets, or listen on a specified port in the case of unicast transmission. If several streams use the same group, the stream ID included in each packet header is used to filter out irrelevant packets.
The product has an “unreliable streaming” transmission mode for real-time data and other information feeds that do not require delivery assurances. This mode uses a “fire-and-forget” approach where there is no retransmission of packets.
Other applications will require reliable, in-order delivery of all stream packets.
The reliable streaming mode uses either an ACK or NAK mechanism to recover the losses. Using ACKs, the receiver acknowledges each packet with the stream ID and the range of received packets. With the NAK mechanism, once a receiver detects a gap in the packet sequence, it can send a datagram with the stream ID and sequence (or range) of missing packets to the transmitter, requesting retransmission. The transmitter maintains a history queue per stream. The repair facility, which is a separate thread in the transmitter process, listens for NAKs and uses their contents to identify and resend packets.
The transmitter cannot keep the packets forever. The streaming data is virtually unlimited in size, so old data (packets) must be discarded at some point. With an ACK mechanism, packets are not discarded until they are acknowledged by the receiver, possibly throttling the transmitter. With a NAK mechanism, old packets are discarded when the transmitter’s history buffer is full and new packets are sent.
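The NAK repair cycle and the bounded history buffer described above can be modeled in a few lines. All names here are illustrative, and the single class stands in for what are really two separate endpoints; the point is the interplay of gap detection and a finite history.

```python
from collections import deque

class ReliableStream:
    """Toy model of the NAK repair cycle: receiver-side gap detection
    plus a bounded transmitter history buffer. Illustrative only, not
    the CLLM implementation."""

    def __init__(self, history_size=4):
        self.next_expected = 0                      # receiver: next seq wanted
        self.history = deque(maxlen=history_size)   # transmitter: (seq, data)

    def send(self, seq, data):
        # The oldest packets fall out automatically when the buffer is full.
        self.history.append((seq, data))

    def receive(self, seq):
        """Return the missing range (inclusive) if a gap is detected,
        i.e. the contents of the NAK the receiver would send, else None."""
        if seq > self.next_expected:
            gap = (self.next_expected, seq - 1)
            self.next_expected = seq + 1
            return gap
        self.next_expected = max(self.next_expected, seq + 1)
        return None

    def repair(self, first, last):
        """Resend packets still held in the history buffer; anything
        older has been discarded and is unrecoverable here."""
        return [d for (s, d) in self.history if first <= s <= last]
```

Note how a repair request can only be served from what is still in the buffer; this is exactly the reliability limit that the Message Store (below) is designed to supplement.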
The message transport layer is built on top of the packet transport layer. This service is responsible for reliable message delivery, and it implements a publish/subscribe messaging model by mapping the message topics onto the packet transport streams. The service allows for peer-to-peer data exchange, with any host being able to both transmit and receive messages in a daemon-less fashion. The layer functionality incorporates a batching (burst suppression) mechanism for bandwidth-optimal delivery of small and medium messages, along with a message fragmentation/assembly mechanism for delivering large messages.
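A fragmentation/assembly mechanism of the kind described can be sketched as follows. The (index, total, chunk) tagging is an illustrative scheme, not the CLLM wire format.

```python
def fragment(message: bytes, mtu: int):
    """Split a large message into MTU-sized fragments, each tagged with
    its index and the total fragment count (illustrative scheme)."""
    chunks = [message[i:i + mtu] for i in range(0, len(message), mtu)] or [b""]
    total = len(chunks)
    return [(idx, total, c) for idx, c in enumerate(chunks)]

def reassemble(fragments):
    """Rebuild the message once all fragments are present, regardless of
    arrival order; return None while fragments are still missing."""
    total = fragments[0][1]
    if len(fragments) != total:
        return None  # incomplete: wait for more fragments
    return b"".join(c for _, _, c in sorted(fragments))
```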
2.5. Message Persistence
A fundamental limit to reliability within CLLM is the size of the history buffer used to resend packets missed by a receiver. To supplement the in-memory history buffer, CLLM provides the Message Store, which increases reliability through wire-speed message and event persistence with retrieval capabilities. The Message Store is highly configurable and allows filtering of the stored messages.
Applications can utilize the CLLM Message Store to work around otherwise unrecoverable packet loss, retrieving messages from the disk store that are no longer available from the transmitter’s history buffer. The Message Store can also be used to initialize a late-joining or restarted application into a given (current) state using a replay of messages from the store, minimizing impact on the actual transmitter originating the messages.
The CLLM Message Store supports requesting previously transmitted messages of a message stream. Using the Message Store Late-join feature, CLLM can provide a transparent transition to the live transmitter message stream. With durable subscriptions, an application can stop and, when restarted, receive from the Message Store the messages that were sent while it was offline, then switch back to the live message stream.
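The Late-join hand-off, a replay from the store followed by a switch to the live stream, can be sketched conceptually. The function and parameter names are illustrative, not the CLLM API; messages are modeled as (sequence, payload) pairs.

```python
def late_join(stored, live, first_live_seq):
    """Conceptual Late-join: replay persisted messages up to the point
    where the live stream takes over, then continue with live messages.
    Illustrative sketch, not the CLLM Message Store API."""
    # Replay from the store only what the live stream will not deliver.
    replayed = [(s, m) for s, m in stored if s < first_live_seq]
    # Live messages at or past the hand-off point; anything earlier
    # would duplicate the replayed portion and is skipped.
    current = [(s, m) for s, m in live if s >= first_live_seq]
    return replayed + current
```

The receiver application sees one continuous, gap-free sequence, which is what makes the transition transparent.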
2.7. Monitoring and Congestion Control
Many existing applications experience difficulties with the volume of events they must consume from today’s volatile markets. CLLM congestion facilities help ensure that the infrastructure continues to perform even when connected applications are overburdened.
Both multicast and unicast transports include methods to monitor traffic (including transmission rate, losses and retransmissions, and latency) to notify the application of network congestion problems, and to manage these detected problems by handling slow receivers or regulating the transmission rate.
CLLM includes a comprehensive monitoring API that provides access to aggregate per-topic statistics, including:
- message rates
- packets and messages that are received, filtered, or lost
- current receivers
- NAKs processed
- transmitter and receiver topic latency information
- other key data
The level of detail is configurable and ranges from basic buffer utilization information to detailed histograms of internal and external latency timings. An extensible module interface can also be used to integrate monitoring data with any external monitoring tool.
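A per-topic statistics aggregate of the kind listed above might look like the following toy collector. The class and field names are assumptions for illustration, not the CLLM monitoring API.

```python
import statistics

class TopicStats:
    """Toy per-topic statistics collector: message counts, losses, NAKs,
    and latency aggregates. Illustrative only, not the CLLM API."""

    def __init__(self):
        self.received = 0
        self.lost = 0
        self.naks = 0
        self.latencies_us = []

    def on_message(self, latency_us):
        self.received += 1
        self.latencies_us.append(latency_us)

    def snapshot(self):
        """Aggregate view of the topic, as a monitoring tool would poll it."""
        lat = self.latencies_us
        return {
            "received": self.received,
            "lost": self.lost,
            "naks": self.naks,
            "mean_latency_us": statistics.fmean(lat) if lat else None,
        }
```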
Additionally, several options are available for network congestion management. By default, CLLM does not regulate data transmission; submitted messages are sent as fast as possible. A static rate-limit policy, based on the token bucket algorithm, can be activated to set the maximum rate at which a transmitter is allowed to send data. A dynamic rate policy is intended for situations where no receiver should be excluded, even temporarily, from the session: when the receiver set experiences difficulties and reports losses exceeding a certain level, the transmission rate is reduced until losses fall below the threshold again.
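The token bucket algorithm behind the static rate-limit policy is standard and can be shown directly; the class below is a generic textbook implementation, not CLLM code. Tokens accrue at `rate` per second up to a `burst` capacity, and a send of `n` units is permitted only when `n` tokens are available.

```python
class TokenBucket:
    """Generic token bucket rate limiter (textbook form, not CLLM code)."""

    def __init__(self, rate, burst):
        self.rate = rate        # tokens added per second
        self.burst = burst      # bucket capacity (max burst size)
        self.tokens = burst     # start full
        self.last = 0.0         # time of the last refill

    def allow(self, n, now):
        """Return True and consume n tokens if the send is permitted."""
        # Refill for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

The `burst` parameter is what distinguishes this from a plain fixed-interval limiter: short bursts up to the bucket size pass immediately, while the long-run rate never exceeds `rate`.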
Per-instance limits can be implemented for the amount of memory that low-latency messaging may consume. When this amount nears exhaustion, configurable event notifications are triggered. Buffer limits can include per-topic limits on the size of transmit and receive buffers, as well as configurable time-based or space-based cleaning parameters.
ACK/NAK limits can also be implemented, with event notification thresholds set for when limits are exceeded. Slow consumer policies can include automatic or manual suspension or expulsion of receivers that have exceeded NAK-generation thresholds.
2.8. Reliable and Consistent Message Streaming (RCMS)
CLLM facilitates the development of highly available transmitters and receivers using the Reliable and Consistent Message Streaming (RCMS) component. RCMS provides a layer of high availability and consistent ordered delivery using the high-performance transport fabric offered by CLLM.
RCMS utilizes Reliable Multicast Messaging (RMM), which provides high performance, reliability, late joiner support, congestion and traffic control.
Through transparent stream failover, RCMS provides high availability for the message stream. It delivers system reliability with negligible performance impact, a business-critical benefit for today's organizations that need both high availability and high performance for their 24×7 applications. Organizations can decide on different options based on their specific technology and business requirements.
RCMS defines the concept of a “tier,” which consists of a group of components (tier members) that are replicas of each other. Each replica executes the application’s logic as if it were the only component. RCMS connects the tier members and ensures availability in case of a failure. The application can define the level of redundancy it wants to use; with X tier members running, up to X-1 members can fail and the application will continue to function. RCMS detects component failure and migrates the data stream to a backup member without message loss.
RCMS provides facilities to perform state synchronization of tier members, as well as handling failure detection and automatic role changes with split-brain prevention policies. RCMS automatically synchronizes the state of the tier member’s incoming and outgoing traffic, and helps synchronize the state of the application itself, allowing a new tier member to start fully functioning in parallel with existing peers. These capabilities permit application developers to focus on the application functionality while RCMS handles most of the complexities associated with high availability.
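The tier concept can be made concrete with a toy model. The member IDs and the lowest-ID election rule below are illustrative assumptions, not RCMS internals; the sketch only shows why X replicas tolerate X-1 failures.

```python
class Tier:
    """Toy model of an RCMS tier: a group of replicas with one primary
    and failover to a surviving member. Illustrative only; the election
    rule (lowest live ID) is an assumption, not the RCMS protocol."""

    def __init__(self, member_ids):
        self.alive = set(member_ids)

    def primary(self):
        # With X members alive, the tier survives up to X-1 failures:
        # as long as one member remains, a primary can be chosen.
        return min(self.alive) if self.alive else None

    def fail(self, member_id):
        """Simulate failure detection; return the new primary after the
        automatic role change."""
        self.alive.discard(member_id)
        return self.primary()
```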
To support different levels of failover, there are two types of transmitter tiers: the "hot-warm" tier and the "hot-hot" tier (also called a duplex tier). An RCMS transmitter tier with members in a "hot-warm" configuration satisfies most of the high availability needs of an application. However, for environments needing extremely fast failover, a duplex tier may be used.
In a typical hot-warm RCMS transmitter tier there are two or more nodes, potentially connected to different networks, with one node designated as primary and the others as backup. On a backup node, CLLM accepts messages submitted by the sending application (or component) and builds history buffers as usual, but it does not send packets out. If the primary node fails, a backup node is activated, and CLLM starts sending packets from that node. It receives retransmission requests for missing data and resends the packets that contain the required messages.
Using this RCMS tier configuration, the CLLM application achieves high availability with rapid and lossless failover. However, message retransmission can occur if a backup node in a transmitter tier begins transmitting messages at some point in the message stream later than the last packet the receiver set has seen. This packet sequence gap results in retransmissions and increased latency of the messages that were missed.
For the quickest failover without message loss or retransmission, the RCMS transmitter tier can be configured in duplex ("hot-hot") mode. Two nodes in the tier are designated as active, and all others remain in backup mode. CLLM sends messages from the active nodes and suppresses them on the backup nodes. If an active node fails, a backup node becomes active and starts sending messages from that node.
In the RCMS receiver tier, CLLM performs the required setup on each node: it connects to all networks, joins all relevant multicast groups, and completes all relevant unicast connections. When receiving messages from an RCMS transmitter tier in "hot-warm" configuration, RCMS receives a failure event as soon as the activation of a new sender in the transmitter tier is detected during a failover. The event is delivered to the application event listener, and RCMS starts receiving the new data stream while detecting whether any messages were lost during the failover. CLLM sends a retransmission request to the activated backup sender for any missed packets. It also delivers messages from the new source to the same application message listener, making the failover transparent to the application's message processing.
Figure 1: The RCMS component of CLLM provides advanced high-availability features such as failure detection, component synchronization, and stream failover.
If the RCMS receiver tier is receiving messages from a RCMS transmitter tier in “hot-hot” configuration, the receiver nodes will receive duplicate messages from the duplex active transmitter nodes. Duplicate messages are identified and discarded at the RCMS layer, so that the application only receives one copy of a message. If one of the duplex active transmitters fails, there will be no message loss as the receivers will continue to receive the other active message stream.
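Duplicate suppression in hot-hot mode amounts to keeping the first arrival of each (stream, sequence) pair; the function below is an illustrative sketch, not the RCMS implementation.

```python
def deduplicate(messages):
    """Drop duplicate (stream_id, seq) pairs arriving from two active
    transmitters, keeping the first arrival, so the application sees
    exactly one copy of each message. Illustrative sketch only."""
    seen = set()
    out = []
    for stream_id, seq, payload in messages:
        if (stream_id, seq) not in seen:
            seen.add((stream_id, seq))
            out.append((stream_id, seq, payload))
    return out
```

Because both active streams carry every message, losing one transmitter simply means the deduplication set stops seeing one of each pair; nothing is missed and no retransmission is needed.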
The total ordering feature of RCMS enforces a consistent order of message delivery from a number of independent data transmitters to multiple receivers. Total ordering assures that all receivers see exactly the same order of incoming messages.
This can be critical for some applications functioning as a tier. If the processing of messages affects the application state, total ordering can be used to guarantee that the applications maintain identical state.
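The effect of total ordering can be demonstrated with a deterministic merge. The (timestamp, transmitter ID) sort key below is a stand-in assumption for the actual RCMS ordering protocol; the essential property is that every receiver applying the same rule derives an identical sequence.

```python
def total_order(per_transmitter_streams):
    """Merge messages from independent transmitters into one global order
    that every receiver can reproduce. Messages are (timestamp, tx_id,
    payload); the sort key is illustrative, not the RCMS protocol."""
    merged = [msg for stream in per_transmitter_streams for msg in stream]
    # Ties on timestamp are broken by transmitter ID, so the ordering is
    # total: no two receivers can disagree on the delivery sequence.
    return sorted(merged, key=lambda m: (m[0], m[1]))
```

If every tier member processes messages in this identical order, state derived from the messages stays identical across replicas, which is exactly the guarantee the text describes.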
3. Latency and throughput
The CLLM functions and features are implemented on top of the standard TCP and UDP protocol stack but are also available for other network fabrics (e.g. InfiniBand) and network protocols such as RDMA and RoCE v1/v2.
In addition to all its resiliency and high availability features, CLLM has been designed for optimized performance with the lowest latency and highest message throughput. At message rates close to 100 million messages per second, average latency is in the range of a few microseconds.
The table below shows the average latency, which depends on the message length and the network clock speed.
While the absolute average latency is important, it is even more important to have very small latency jitter. Here CLLM plays to its strengths: its unique monitoring and congestion control features keep latency outliers very close to the mean latency (a steep Gaussian curve). This allows applications to almost 'predict' the latency.
4. Hardware-Accelerated CLLM
CLLM provides a number of optimization techniques to execute the application in the fastest possible way. It supports thread affinity when multiple CPU cores are present, and it makes efficient use of kernel system calls when the OS platform supports real-time operation, e.g. Red Hat real-time kernels.
However, the pre-emptive nature of a multitasking operating system poses challenges that can slow down CLLM's performance.
A major overhaul of the CLLM software led to a new hardware-accelerated version, which offloads some of its core functions and features onto an FPGA chip on a NIC card. This further reduces the OS platform dependencies and bypasses some of their deficiencies.
CLLM 4.0 has been re-developed and ported into a series of free-running kernels on FPGA boards from Xilinx. Specialized, customized TCP and UDP kernels process CLLM messages at Ethernet line rate.
By applying deep pipelining techniques and utilizing Xilinx's Memory Mapped Slave Bridge shell, memory copies have been further reduced and efficiency on the Ethernet interface has been increased significantly.
In its current version, CLLM 4.0 supports Xilinx Alveo U250 and U50 boards. Both NIC boards come with a QSFP connector for 1 x 40 GbE, 1 x 100 GbE, or 4 x 10 GbE, or an SFP connector for 4 x 25 GbE Ethernet speeds. The FPGA has close to 1 million LUTs on an Alveo U50 and more than 1.2 million LUTs on an Alveo U250, which offers enough capacity to offload some of the application logic in addition to CLLM's program logic.
Built with leading-edge technology, CLLM 4.0 makes reliable low latency multicast messaging even more predictable, deterministic, and platform-agnostic.