Network Working Group                                     David Cheriton
Request for Comments: 1045                           Stanford University
                                                           February 1988

              VMTP: VERSATILE MESSAGE TRANSACTION PROTOCOL
                         Protocol Specification

STATUS OF THIS MEMO

This RFC describes a protocol proposed as a standard for the Internet community. Comments are encouraged. Distribution of this document is unlimited.

OVERVIEW

This memo specifies the Versatile Message Transaction Protocol (VMTP) [Version 0.7 of 19-Feb-88], a transport protocol specifically designed to support the transaction model of communication, as exemplified by remote procedure call (RPC). The full function of VMTP, including support for security, real-time, asynchronous message exchanges, streaming, multicast and idempotency, provides a rich selection to the VMTP user level. Subsettability allows the VMTP module for particular clients and servers to be specialized and simplified to the services actually required. Examples of such simple clients and servers include PROM network bootload programs, network boot servers, data sensors and simple controllers.

RFC 1045                          VMTP                     February 1988

Table of Contents

   1. Introduction
      1.1. Motivation
         1.1.1. Poor RPC Performance
         1.1.2. Weak Naming
         1.1.3. Function Poor
      1.2. Relation to Other Protocols
      1.3. Document Overview
   2. Protocol Overview
      2.1. Entities, Processes and Principals
      2.2. Entity Domains
      2.3. Message Transactions
      2.4. Request and Response Messages
      2.5. Reliability
         2.5.1. Transaction Identifiers
         2.5.2. Checksum
         2.5.3. Request and Response Acknowledgment
         2.5.4. Retransmissions
         2.5.5. Timeouts
         2.5.6. Rate Control
      2.6. Security
      2.7. Multicast
      2.8. Real-time Communication
      2.9. Forwarded Message Transactions
      2.10. VMTP Management
      2.11. Streamed Message Transactions
      2.12. Fault-Tolerant Applications
      2.13. Packet Groups
      2.14. Runs of Packet Groups
      2.15. Byte Order
      2.16. Minimal VMTP Implementation
      2.17. Message vs. Procedural Request Handling
      2.18. Bibliography
   3. VMTP Packet Formats
      3.1. Entity Identifier Format
      3.2. Packet Fields
      3.3. Request Packet
      3.4. Response Packet
   4. Client Protocol Operation
      4.1. Client State Record Fields
      4.2. Client Protocol States
      4.3. State Transition Diagrams
      4.4. User Interface
      4.5. Event Processing
      4.6. Client User-invoked Events
         4.6.1. Send
         4.6.2. GetResponse
      4.7. Packet Arrival
         4.7.1. Response
      4.8. Management Operations
         4.8.1. HandleNoCSR
      4.9. Timeouts
   5. Server Protocol Operation
      5.1. Remote Client State Record Fields
      5.2. Remote Client Protocol States
      5.3. State Transition Diagrams
      5.4. User Interface
      5.5. Event Processing
      5.6. Server User-invoked Events
         5.6.1. Receive
         5.6.2. Respond
         5.6.3. Forward
         5.6.4. Other Functions
      5.7. Request Packet Arrival
      5.8. Management Operations
         5.8.1. HandleRequestNoCSR
      5.9. Timeouts
   6. Concluding Remarks
   I. Standard VMTP Response Codes
   II. VMTP RPC Presentation Protocol
      II.1. Request Code Management
   III. VMTP Management Procedures
      III.1. Entity Group Management
      III.2. VMTP Management Digital Signatures
   IV. VMTP Entity Identifier Domains
      IV.1. Domain 1
      IV.2. Domain 3
      IV.3. Other Domains
      IV.4. Decentralized Entity Identifier Allocation
   V. Authentication Domains
      V.1. Authentication Domain 1
      V.2. Other Authentication Domains
   VI. IP Implementation
   VII. Implementation Notes
      VII.1. Mapping Data Structures
      VII.2. Client Data Structures
      VII.3. Server Data Structures
      VII.4. Packet Group transmission
      VII.5. VMTP Management Module
      VII.6. Timeout Handling
      VII.7. Timeout Values
      VII.8. Packet Reception
      VII.9. Streaming
      VII.10. Implementation Experience
   VIII. UNIX 4.3 BSD Kernel Interface for VMTP
   Index

List of Figures

   Figure 1-1:   Relation to Other Protocols
   Figure 3-1:   Request Packet Format
   Figure 3-2:   Response Packet Format
   Figure 4-1:   Client State Transitions
   Figure 5-1:   Remote Client State Transitions
   Figure III-1: Authenticator Format
   Figure VII-1: Mapping Client Identifier to CSR
   Figure VII-2: Mapping Server Identifiers
   Figure VII-3: Mapping Group Identifiers

1. Introduction

The Versatile Message Transaction Protocol (VMTP) is a transport protocol designed to support remote procedure call (RPC) and general transaction-oriented communication. By transaction-oriented communication, we mean that:

   - Communication is request-response: A client sends a request for a service to a server, the request is processed, and the server responds. For example, a client may ask for the next page of a file as the service. The transaction is terminated by the server responding with the next page.

   - A transaction is initiated as part of sending a request to a server and terminated by the server responding. There are no separate operations for setting up or terminating associations between clients and servers at the transport level.

   - The server is free to discard communication state about a client between transactions without causing incorrect behavior or failures.

The term message transaction (or transaction) is used in the remainder of this document for a request-response exchange in the sense described above.

VMTP handles the error detection, retransmission, duplicate suppression and, optionally, security required for transport-level end-to-end reliability.

The protocol is designed to provide a range of behaviors within the transaction model, including:

   - Minimal two packet exchanges for short, simple transactions.

   - Streaming of multi-packet requests and responses for efficient data transfer.
   - Datagram and multicast communication as an extension of the transaction model.

Example Uses:

   - Page-level file access - VMTP is intended as the transport level for file access, allowing simple, efficient operation on a local network. In particular, VMTP is appropriate for use by diskless workstations accessing shared network file servers.

   - Distributed programming - VMTP is intended to provide an efficient transport level protocol for remote procedure call implementations, distributed object-oriented systems plus message-based systems that conform to the request-response model.

   - Multicast communication with groups of servers to: locate a specific object within the group, update a replicated object, synchronize the commitment of a distributed transaction, etc.

   - Distributed real-time control with prioritized message handling, including datagrams, multicast and asynchronous calls.

The protocol is designed to operate on top of a simple unreliable datagram service, such as is provided by IP.

1.1. Motivation

VMTP was designed to address three categories of deficiencies with existing transport protocols in the Internet architecture. We use TCP as the key current transport protocol for comparison.

1.1.1. Poor RPC Performance

First, current protocols provide poor performance for remote procedure call (RPC) and network file access. This is attributable to three key causes:

   - TCP requires excessive packets for RPC, especially for isolated calls. In particular, connection setup and clearing generate extra packets beyond those needed for VMTP to support RPC.

   - TCP is difficult to implement, judging purely from empirical experience over the last 10 years. VMTP was designed concurrently with its implementation, with focus on making it easy to implement and providing sensible subsets of its functionality.

   - TCP handles packet loss due to overruns poorly. We claim that overruns are the key source of packet loss in a high-performance RPC environment and, with the increasing performance of networks, will continue to be the key source. (Older machines and network interfaces cannot keep up with new machines and network interfaces. Also, low-end network interfaces for high-speed networks have limited receive buffering.)

VMTP is designed for ease of implementation and efficient RPC. In addition, it provides selective retransmission with rate-based flow control, thus addressing all of the above issues.

1.1.2. Weak Naming

Second, current protocols provide inadequate naming of transport-level endpoints because the names are based on IP addresses. For example, a TCP endpoint is named by an Internet address and port identifier. Unfortunately, this ties the endpoint to a particular host interface rather than to the process-level state associated with the transport-level endpoint. In particular, this form of naming causes problems for process migration, mobile hosts and multi-homed hosts. VMTP provides host-address independent names, thereby solving the above mentioned problems.

In addition, TCP provides no security and reliability guarantees on the dynamically allocated names. In particular, other than well-known ports, (host-addr, port-id)-tuples can change meaning on reboot following a crash. VMTP provides large identifiers with a guarantee of stability, meaning that either the identifier never changes in meaning or else it remains invalid for a significant time before becoming valid again.

1.1.3. Function Poor

TCP does not support multicast, real-time datagrams or security. In fact, it only supports pair-wise, long-term, streamed reliable interchanges. Yet, multicast is of growing importance and is being developed for the Internet (see RFC 966 and 988).
Also, a datagram facility with the same naming, transmission and reception facilities as the normal transport level is a powerful asset for real-time and parallel applications. Finally, security is a basic requirement in an increasing number of environments. We note that security is natural to implement at the transport level to provide end-to-end security (as opposed to (inter)network level security). Without security at the transport level, a transport level protocol cannot guarantee the standard transport level service definition in the presence of an intruder. In particular, the intruder can interject packets or modify packets while updating the checksum, making a mockery of the transport-level claim of "reliable delivery".

In contrast, VMTP provides multicast, real-time datagrams and security, addressing precisely these weaknesses.

In general, VMTP is designed with the next generation of communication systems in mind. These communication systems are characterized as follows. RPC, page-level file access and other request-response behavior dominate. In addition, the communication substrate, both local and wide-area, provides high data rates, low error rates and relatively low delay. Finally, intelligent, high-performance network interfaces are common and in fact required to achieve performance that approximates the network capability. However, VMTP is also designed to function acceptably with existing networks and network interfaces.

1.2. Relation to Other Protocols

VMTP is a transport protocol that fits into the layered Internet protocol environment. Figure 1-1 illustrates the place of VMTP in the protocol hierarchy.

   +-----------+ +----+ +-----------------+ +------+
   |File Access| |Time| |Program Execution| |Naming|...   Application
   +-----------+ +----+ +-----------------+ +------+         Layer
         |           |           |             |      |
         +-----------+-----------+-------------+------+
                                 |
                        +------------------+
                        | RPC Presentation |              Presentation
                        +------------------+                 Layer
                                 |
                    +------+ +--------+
                    | TCP  | |  VMTP  |                   Transport
                    +------+ +--------+                     Layer
                        |        |
           +-----------------------------------+
           |     Internet Protocol & ICMP      |          Internetwork
           +-----------------------------------+             Layer

                Figure 1-1: Relation to Other Protocols

The RPC presentation level is not currently defined in the Internet suite of protocols. Appendix II defines a proposed RPC presentation level for use with VMTP and assumed for the definition of the VMTP management procedures. There is also a need for the definition of the Application layer protocols listed above.

If internetwork services are not required, VMTP can be used without the IP layer, layered directly on top of the network or data link layers.

1.3. Document Overview

The next chapter gives an overview of the protocol, covering naming, message structure, reliability, flow control, streaming, real-time, security, byte-ordering and management. Chapter 3 describes the VMTP packet formats. Chapter 4 describes the client VMTP protocol operation in terms of pseudo-code for event handling. Chapter 5 describes the server VMTP protocol operation in terms of pseudo-code for event handling. Chapter 6 summarizes the state of the protocol, some remaining issues and expected directions for the future. Appendix I lists some standard Response codes. Appendix II describes the RPC presentation protocol proposed for VMTP and used with the VMTP management procedures. Appendix III lists the VMTP management procedures. Appendix IV proposes initial approaches for handling entity identification for VMTP. Appendix V proposes initial authentication domains for VMTP. Appendix VI provides some details for implementing VMTP on top of IP.
Appendix VII provides some suggestions on host implementation of VMTP, focusing on data structures and support functions. Appendix VIII describes a proposed program interface for UNIX 4.3 BSD and its descendants and related systems.

2. Protocol Overview

VMTP provides an efficient, reliable, optionally secure transport service in the message transaction or request-response model with the following features:

   - Host address-independent naming with provision for multiple forms of names for endpoints as well as associated (security) principals. (See Sections 2.1, 2.2, 3.1 and Appendix IV.)

   - Multi-packet request and response messages, with a maximum size of 4 megaoctets per message. (Sections 2.3 and 2.14.)

   - Selective retransmission (Section 2.13) and rate-based flow control to reduce overrun and the cost of overruns. (Section 2.5.6.)

   - Secure message transactions with provision for a variety of encryption schemes. (Section 2.6.)

   - Multicast message transactions with multiple response messages per request message. (Section 2.7.)

   - Support for real-time communication with idempotent message transactions with minimal server overhead and state (Section 2.5.3), datagram request message transactions with no response, optional header-only checksum, priority processing of transactions, conditional delivery and preemptive handling of requests. (Section 2.8.)

   - Forwarded message transactions as an optimization for certain forms of nested remote procedure calls or message transactions. (Section 2.9.)

   - Multiple outstanding (asynchronous) message transactions per client. (Section 2.11.)

   - An integrated management module, defined with a remote procedure call interface on top of VMTP, providing a variety of communication services. (Section 2.10.)

   - Simple subset implementation for simple clients and simple servers. (Section 2.16.)

This chapter provides an overview of the protocol as introduction to the basic ideas and as preparation for the subsequent chapters that describe the packet formats and event processing procedures in detail.

In overview, VMTP provides transport communication between network-visible entities via message transactions. A message transaction consists of a request message sent by the client, or requestor, to a group of server entities followed by zero or more response messages to the client, at most one from each server entity. A message is structured as a message control portion and a segment data portion. A message is transmitted as one or more packet groups. A packet group is one or more packets (up to a maximum of 32 packets) grouped by the protocol for acknowledgment, sequencing, selective retransmission and rate control.

Entities and VMTP operations are managed using a VMTP management mechanism that is accessed through a procedural interface (RPC) implemented on top of VMTP. In particular, information about a remote entity is obtained and maintained using the Probe VMTP management operation. Also, acknowledgment information and requests for retransmission are sent as notify requests to the management module. (In the following description, reference to an "acknowledgment" of a request or a response refers to a management-level notify operation that is acknowledging the request or response.)

2.1. Entities, Processes and Principals

VMTP defines and uses three main types of identifiers: entity identifiers, process identifiers and principal identifiers, each 64 bits in length. Communication takes place between network-visible entities, typically mapping to, or representing, a message port or procedure invocation. Thus, entities are the VMTP communication endpoints. The process associated with each entity designates the agent behind the communication activity for purposes of resource allocation and management.
For example, when a lock is requested on a file, the lock is associated with the process, not the requesting entity, allowing a process to use multiple entity identifiers to perform operations without lock conflict between these entities. The principal associated with an entity specifies the permissions, security and accounting designation associated with the entity. The process and principal identifiers are included in VMTP solely to make these values available to VMTP users with the security and efficiency provided by VMTP. Only the entity identifiers are actively used by the protocol.

Entity identifiers are required to have three properties:

   Uniqueness
      Each entity identifier is uniquely defined at any given time. (An entity identifier may be reused over time.)

   Stability
      An entity identifier does not change between valid meanings without suitable provision for removing references to the entity identifier. Certain entity identifiers are strictly stable (i.e. never changing meaning), typically being administratively assigned (although they need not be bound to a valid entity at all times), and are often called well-known identifiers. All other entity identifiers are required to be T-stable, that is, they do not change meaning without having remained invalid for at least a time interval T.

   Host address independent
      An entity identifier is unique independent of the host address of its current host. Moreover, an entity identifier is not tied to a single Internet host address. An entity can migrate between hosts, reside on a mobile host that changes Internet addresses or reside on a multi-homed host. It is up to the VMTP implementation to determine and maintain up to date the host addresses of entities with which it is communicating.

The stability of entity identifiers guarantees that an entity identifier represents the same logical communication entity and principal (in the security sense) over the time that it is valid.
For example, if an entity identifier is authenticated as having the privileges of a given user account, it continues to have those privileges as long as it is continuously valid (unless some explicit notice is provided otherwise). Thus, a file server need not fully authenticate the entity on every file access request. With T-stable identifiers, periodically checking the validity of an entity identifier with period less than T seconds detects a change in entity identifier validity.

A group of entities can form an entity group, which is a set of zero or more entities identified by a single entity identifier. For example, one can have a single entity identifier that identifies the group of name servers. An entity identifier representing an entity group is drawn from the same name space as entity identifiers. However, single entity identifiers are flagged as such by a bit in the entity identifier, indicating that the identifier is known to identify at most one entity.

In addition to the group bit, each entity identifier includes other standard type flags. One flag indicates whether the identifier is an alias for an entity in another domain (see Section 2.2 below). Another flag indicates, for an entity group identifier, whether the identifier is a restricted group or not. A restricted group is one in which an entity can be added only by another entity with group management authorization. With an unrestricted group, an entity is allowed to add itself. If an entity identifier does not represent a group, a type bit indicates whether the entity uses big-endian or little-endian data representation (corresponding to Motorola 680X0 and VAX byte orders, respectively). Further specification of the format of entity identifiers is contained in Section 3.1 and Appendix IV.

An entity identifier identifies a Client, a Server or a group of Servers <1>. A Client is always identified by a T-stable identifier.
A server or group of servers may be identified by a T-stable identifier (group or single entity) or by a strictly stable (statically assigned) entity group identifier. The same T-stable identifier can be used to identify a Client and Server simultaneously as long as both are logically associated with the same entity.

_______________

<1> Terms such as Client, Server, Request, Response, etc. are capitalized in this document when they refer to their specific meaning in VMTP.

The state required for reliable, secure communication between entities is maintained in client state records (CSRs), which include the entity identifier of the Client, its principal, its current or next transaction identifier and so on.

2.2. Entity Domains

An entity domain is an administration or an administration mechanism that guarantees the three required entity identifier properties of uniqueness, stability and host address independence for the entities it administers. That is, entity identifiers are only guaranteed to be unique and stable within one entity domain. For example, the set of all Internet hosts may function as one domain. Independently, the set of hosts local to one autonomous network may function as a separate domain.

Each entity domain is identified by an entity domain identifier, Domain. Only entities within the same domain may communicate directly via VMTP. However, hosts and entities may participate in multiple entity domains simultaneously, possibly with different entity identifiers. For example, a file server may participate in multiple entity domains in order to provide file service to each domain. Each entity domain specifies the algorithms for allocation, interpretation and mapping of entity identifiers.

Domains are necessary because it does not appear feasible to specify one universal VMTP entity identification administration that covers all entities for all time. Domains limit the number of entities that need to be managed to maintain the uniqueness and stability of the entity name space. Domains can also serve to separate entities of different security levels. For instance, allocation of an unclassified entity identifier cannot conflict with secret level entity identifiers because the former is interpreted only in the unclassified domain, which is disjoint from the secret domain.

It is intended that there be a small number of domains. In particular, there should be one (or a few) domains per installation "type", rather than per installation. For example, the Internet is expected to use one domain per security level, resulting in at most 8 different domains. Cluster-based internetwork architectures, those with a local cluster protocol distinct from the wide-area protocol, may use one domain for local use and one for wide-area use.

Additional details on the specification of specific domains are provided in Appendix IV.

2.3. Message Transactions

The message transaction is the unit of interaction between a Client that initiates the transaction and one or more Servers. A message transaction starts with a request message generated by a client. At the service interface, a server becomes involved with a transaction by receiving and accepting the request. A server terminates its involvement with a transaction by sending a response message. In a group message transaction, the server entity designated by the client corresponds to a group of entities. In this case, each server in the group receives a copy of the request. In the client's view, the transaction is terminated when it receives the response message or, in the case of a group message transaction, when it receives the last response message. Because it is normally impractical to determine when the last response message has been received, the current transaction is terminated by VMTP when the next transaction is initiated.

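The client's view of a group message transaction can be illustrated with a minimal sketch. It is not part of the protocol specification: the class GroupTransactionClient and its method names are hypothetical, and the only rules it encodes are those stated in this document (at most one response per server entity, termination of the current transaction when the next one is initiated, and the next-higher Transaction identifier modulo 2**32 of Section 2.3).

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GroupTransactionClient:
    """Hypothetical bookkeeping for one Client entity (illustration only)."""

    transaction: int = 0  # current Transaction identifier
    responses: Dict[int, bytes] = field(default_factory=dict)  # server id -> response

    def send_request(self) -> int:
        # Initiating the next transaction terminates the current one, so any
        # responses collected so far are dropped and the identifier advances.
        self.responses = {}
        self.transaction = (self.transaction + 1) % 2**32
        return self.transaction

    def on_response(self, transaction: int, server: int, payload: bytes) -> bool:
        # A response to an already terminated transaction is of no interest.
        if transaction != self.transaction:
            return False
        # At most one response is accepted from each server entity.
        if server in self.responses:
            return False
        self.responses[server] = payload
        return True
```

In this sketch the caller never has to decide which response is the last one; late responses are simply rejected once send_request has been called again.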
Within an entity domain, a transaction is uniquely identified by the tuple (Client, Transaction, ForwardCount), where Transaction is a 32-bit number and ForwardCount is a 4-bit value. A Client uses monotonically increasing Transaction identifiers for new message transactions. Normally, the next higher transaction number, modulo 2**32, is used for the next message transaction, although there are cases in which it skips a small range of Transaction identifiers. (See the description of the STI control flag.) The ForwardCount is used when a message transaction is forwarded and is zero otherwise.

A Client generates a stream of message transactions with increasing transaction identifiers, directed at a diversity of Servers. We say a Client has a transaction outstanding if it has invoked a message transaction, but has not received the last Response (or possibly any Response). Normally, a Client has only one transaction outstanding at a time. However, VMTP allows a Client to have multiple message transactions outstanding simultaneously, supporting streamed, asynchronous remote procedure call invocations. In addition, VMTP supports nested calls where, for example, procedure A calls procedure B which calls procedure C, each on a separate host with different client entity identifiers for each call but identified with the same process and principal.

2.4. Request and Response Messages

A message transaction consists of a request message and one or more Response messages. A message is structured as a message control block (MCB) and segment data, passed as parameters, as suggested below.

      +-----------------------+
      | Message Control Block |
      +-----------------------+

      +-----------------------------------+
      |           segment data            |
      +-----------------------------------+

In the request message, the MCB specifies control information about the request plus an optional data segment.
The MCB has the following format:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   +                   ServerEntityId (8 octets)                   +
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Flags     |                  RequestCode                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   +                  CoresidentEntity (8 octets)                  +
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   >                     User Data (12 octets)                     <
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          MsgDelivery                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          SegmentSize                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The ServerEntityId is the entity to which the Request MCB is to be sent (or was sent, in the case of reception). The Flags indicate various options in the request and response handling as well as whether the CoresidentEntity, MsgDelivery and SegmentSize fields are in use. The RequestCode field specifies the type of Request. It is analogous to a packet type field of the Ethernet, acting as a switch for higher-level protocols.

The CoresidentEntity field, if used, designates a subgroup of the ServerEntityId group to which the Request should be routed, namely those members that are co-resident with the specified entity (or entity group). The primary intended use is to specify the manager for a particular service that is co-resident with a particular entity, using the well-known entity group identifier for the service manager in the ServerEntityId field and the identifier for the entity in the CoresidentEntity field.

The next 12 octets are user- or application-specified.

The MsgDelivery field is optionally used by the RPC or user level to specify the portions of the segment data to transmit and on reception, the portions received.
It provides the client and server with (optional) access to, and responsibility for, a simple selective transmission and reception facility. For example, a client may request retransmission of just those portions of the segment that it failed to receive as part of the original Response. The primary intended use is to support highly efficient multi-packet reading from a file server. Exploiting user-level selective retransmission using the MsgDelivery field, the file server VMTP module need not save multi-packet Responses for retransmission. Retransmissions, when needed, are instead handled directly from the file server buffers.

The SegmentSize field indicates the size of the data segment, if present.

The CoresidentEntity, MsgDelivery and SegmentSize fields are usable as additional user data if they are not otherwise used.

The Flags field provides a simple mechanism for the user level to communicate its use of VMTP options with the VMTP module as well as for VMTP modules to communicate this use among themselves. The use of these options is generally fixed for each remote procedure so that an RPC mechanism using VMTP can treat the Flags as an integral part of the RequestCode field for the purpose of demultiplexing to the correct stub.

A Response message control block follows the same format except the Response is sent from the Server to the Client and there is no CoresidentEntity field (and thus 20 octets of user data).

2.5. Reliability

VMTP provides reliable, sequenced transfer of request and response messages as well as several variants, such as unreliable datagram requests. The reliability mechanisms include: transaction identifiers, checksums, positive acknowledgment of messages, and timeout and retransmission of lost packets.

2.5.1. Transaction Identifiers

Each message transaction is uniquely identified by the pair (Client, Transaction). (We defer discussion of the ForwardCount field to Section 2.9.)
The 32-bit transaction identifier is initialized to a random value when the Client entity is created or allocated its entity identifier. The transaction identifier is incremented at the end of each message transaction. All Responses with the same specified (Client, Transaction) pair are associated with this Request. The transaction identifier is used for duplicate suppression at the Server. A Server maintains a state record for each Client for which it is processing a Request, identified by (Client, Transaction). A Request with the same (Client, Transaction) pair is discarded as a duplicate. (The ForwardCount field must also be equal.) Normally, this record is retained for some period after the Response is sent, allowing the Server to filter out subsequent duplicates of this Request. When a Request arrives and the Server does not have a state record for the sending Client, the Server takes one of three actions: 1. The Server may send a Probe request, a simple query operation, to the VMTP management module associated with the requesting Client to determine the Client's current Transaction identifier (and other information), initialize a new state record from this information, and then process the Request as above. 2. The Server may reason that the Request must be a new request because it does not have a state record for this Client if it keeps these state records for the maximum packet lifetime of packets in the network (plus the maximum VMTP retransmission time) and it has not been rebooted within this time period. That is, if the Request is not new either the Request would have exceeded the maximum packet lifetime or else the Server would have a state record for the Client. 3. The Server may know that the Request is idempotent or can be safely redone so it need not care whether the Request is a duplicate or not. For example, a request for the current time can be responded to with the current time without being concerned whether the Request is a duplicate. 
The Response is discarded at the Client if it is no longer of interest.

2.5.2. Checksum

Each VMTP packet contains a checksum to allow the receiver to detect corrupted packets independent of lower level checks. The checksum field is 32 bits, providing greater protection than the standard 16-bit IP checksum (in combination with an improved checksum algorithm). The large packets, high packet rates and general network characteristics expected in the future warrant a stronger checksum mechanism.

The checksum normally covers both the VMTP header and the segment data. Optionally (for real-time applications), the checksum may apply only to the packet header, as indicated by the HCO control bit being set in the header.

The checksum field is placed at the end of the packet to allow it to be calculated as part of a software copy or as part of a hardware transmission or reception packet processing pipeline, as expected in the next generation of network interfaces. Note that the number of header and data octets is an integral multiple of 8 because VMTP requires that the segment data be padded to be a multiple of 64 bits. The checksum field is appended after the padding, if any. The actual algorithm is described in Section 3.2.

A zero checksum field indicates that no checksum was transmitted with the packet. VMTP may be used without a checksum only when there is a host-to-host error detection mechanism and the VMTP security facility is not being used. For example, one could rely on the Ethernet CRC if communication is restricted to hosts on the same Ethernet and the network interfaces are considered sufficiently reliable.

2.5.3. Request and Response Acknowledgment

VMTP assumes an unreliable datagram network and internetwork interface. To guarantee delivery of Requests and Responses, VMTP uses positive acknowledgments, retransmissions and timeouts.
A Request is normally acknowledged by receipt of a Response associated with the Request, i.e. with the same (Client, Transaction). With streamed message transactions, it may also be acknowledged by a subsequent Response that acknowledges previous Requests in addition to the transaction it explicitly identifies.

A Response may be explicitly acknowledged by a NotifyVmtpServer operation requested of the manager for the Server. (In the case of streaming, this is a cumulative acknowledgment, acknowledging all Responses with a lower transaction identifier as well.) In addition, with non-streamed communication, a subsequent Request from the same Client acknowledges Responses to all previous message transactions (at least in the sense that either the client received a Response or is no longer interested in Responses to those earlier message transactions). Finally, a client response timeout (at the server) acknowledges a Response at least in the sense that the server need not be prepared to retransmit the Response subsequently. Note that there is no end-to-end guarantee of the Response being received by the client at the application level.

2.5.4. Retransmissions

In general, a Request or Response is retransmitted periodically until acknowledged as above, up to some maximum number of retransmissions. VMTP uses parameters RequestRetries(Server) and ResponseRetries(Client) that indicate the number of retransmissions for the server and client respectively before giving up. We suggest the value 5 be used for both parameters based on our experience with VMTP and Internet packet loss. Smaller values (such as 3) could be used in low loss environments in which fast detection of failed hosts or communication channels is required. Larger values should be used in high loss environments where transport-level persistence is important.
In a low loss environment, a retransmission only includes the MCB and not the segment data of the Request or Response, resulting in a single (short) packet on retransmission. The intended recipient of the retransmission can request selective retransmission of all or part of the segment data as necessary. The selective retransmission mechanism is described in Section 2.13.

If a Response is specified as idempotent, the Response is neither retransmitted nor stored for retransmission. Instead, the Client must retransmit the Request to effectively get the Response retransmitted. The server VMTP module responds to retransmissions of the Request by passing the Request on to the server again to have it regenerate the Response (by redoing the operation), rather than saving a copy of the Response. Only Request packets for the last transaction from this client are passed on in this fashion; older Request packets from this client are discarded as delayed duplicates.

If a Response is not idempotent, the VMTP module must ensure it has a copy of the Response for retransmission either by making a copy of the Response (either physically or copy-on-write) or by preventing the Server from continuing until the Response is acknowledged.

2.5.5. Timeouts

There is one client timer for each Client with an outstanding transaction. Similarly, there is one server timer for each Client transaction that is "active" at the server, i.e. there is a transaction record for a Request from the Client.

When the client transmits a new Request (without streaming), the client timer is set to roughly the time expected for the Response to be returned. On timeout, the Request is retransmitted with the APG (Acknowledge Packet Group) bit set. The timeout is reset to the expected roundtrip time to the Server because an acknowledgment should be returned immediately unless a Response has been sent.
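The interaction between the client timer and the retransmission limit for a non-streamed Request can be sketched as follows. This is an illustration only, not part of the protocol definition: the class ClientTimer, its field names response_wait and round_trip, and the convention of returning None on giving up are assumptions made for the example. The retry limit of 5 is the value suggested in Section 2.5.4.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

REQUEST_RETRIES = 5  # value suggested in Section 2.5.4


@dataclass
class ClientTimer:
    """Timer state for one outstanding, non-streamed Request (hypothetical)."""
    response_wait: float  # time expected for the Response to return, seconds
    round_trip: float     # expected roundtrip time to the Server, seconds
    retries: int = 0

    def on_send(self) -> float:
        """A new Request was sent: wait roughly the expected Response time."""
        self.retries = 0
        return self.response_wait

    def on_timeout(self) -> Optional[Tuple[bool, float]]:
        """Return (apg_bit, next_timeout), or None once retries are exhausted."""
        if self.retries >= REQUEST_RETRIES:
            return None  # give up and report failure to the user level
        self.retries += 1
        # Retransmissions carry the APG bit, so an acknowledgment should
        # come back within about one roundtrip time.
        return True, self.round_trip
```

A caller would arm a timer with the value returned by on_send(), and on each expiry either retransmit with the APG bit and re-arm with the returned timeout, or abandon the transaction when None is returned.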
The Request may also be retransmitted in response to receipt of a VMTP management operation indicating that selected portions of the Request message segment need to be retransmitted.

With streaming, the timeout applies to the oldest outstanding message transaction in the run of outstanding message transactions. Without streaming, there is one message transaction in the run, reducing to the previous situation. After the first packet of a Response is received, the Client resets the timeout to be the time expected before the next packet in the Response packet group is received, assuming it is a multi-packet Response. If not, the timer is stopped. Finally, the client timer is used to timeout waiting for second and subsequent Responses to a multicast Request.

The client timer is set at different times to four different values:

TC1(Server)
   The expected time required to receive a Response from the Server. Set on initial Request transmission plus after its management module receives a NotifyVmtpClient operation, acknowledging the Request.

TC2(Server)
   The estimated round trip delay between the client and the server. Set when retransmitting after receiving no Response for TC1(Server) time and retransmitting the Request with the APG bit set.

TC3(Server)
   The estimated maximum expected interpacket time for multi-packet Responses from the Server. Set when waiting for subsequent Response packets within a packet group before timing out.

TC4
   The time to wait for additional Responses to a group Request after the first Response is received. This is specified by the user level.

These values are selected as follows. TC1 can be set to TC2 plus a constant, reflecting the time within which most servers respond to most requests. For example, various measurements of VMTP usage at Stanford indicate that 90 percent of the servers respond in less than 200 milliseconds.
Setting TC1 to TC2 + 200 milliseconds means that most Requests receive a Response before timing out and also that overhead for retransmission for long running transactions is insignificant. A sophisticated implementation may make the estimation of TC1 further specific to the Server.

TC2 may be estimated by measuring the time from when a Probe request is sent to the Server to when a response is received. TC2 can also be measured as the time between the transmission of a Request with the APG bit set to receipt of a management operation acknowledging receipt of the Request.

When the Server is an entity group, TC1 and TC2 should be the largest of the values for the members of the group that are expected to respond. This information may be determined by probing the group on first use (and using the values for the last responses to arrive). Alternatively, one can resort to default values.

TC3 is set initially to 10 times the transmission time for the maximum transmission unit (MTU) to be used for the Response. A sophisticated implementation may record TC3 per Server and refine the estimate based on measurements of actual interpacket gaps. However, a tighter estimate of TC3 only improves the reaction time when a packet is lost in a packet group, at some cost in unnecessary retransmissions when the estimate becomes overly tight.

The server timer, one per active Client, takes on the following values:

TS1(Client)
   The estimated maximum expected interpacket time. Set when waiting for subsequent Request packets within a packet group before timing out.

TS2(Client)
   The time to wait to hear from a client before terminating the server processing of a Request. This limits the time spent processing orphan calls, as well as limiting how out of date the server's record of the Client state can be. In particular, TS2 should be significantly less than the minimum time within which it is reasonable to reuse a transaction identifier.
TS3(Client)
   Estimated roundtrip time to the Client.

TS4(Client)
   The time to wait after sending a Response (or last hearing from a client) before discarding the state associated with the Request which allows it to filter duplicate Request packets and regenerate the Response.

TS5(Client)
   The time to wait for an acknowledgment after sending a Response before retransmitting the Response, or giving up (after some number of retransmissions).

TS1 is set the same as TC3. The suggested value for TS2 is TC1 + 3*TC2 for this server, giving the Client time to timeout waiting for a Response and retransmit 3 Request packets, asking for acknowledgments. TS3 is estimated the same as TC1 except that refinements to the estimate use measurements of the Response-to-acknowledgment times.

In the general case, TS4 is set large enough so that a Client issuing a series of closely-spaced Requests to the same Server reuses the same state record at the Server end and thus does not incur the overhead of recreating this state. (The Server can recreate the state for a Client by performing a Probe on the Client to get the needed information.) It should also be set low enough so that the transaction identifier cannot wrap around and so that the Server does not run out of CSRs. We suggest a value in the range of 500 milliseconds. However, if the Server accepts non-idempotent Requests from this Client without doing a Probe on the Client, the TS4 value for this CSR is set to at least 4 times the maximum packet lifetime.

TS5 is TS3 plus the expected time for transmission and reception of the Response. We suggest that the latter be calculated as 3 times the transmission time for the Response data, allowing time for reception, processing and transmission of an acknowledgment at the Client end. A sophisticated implementation may refine this estimate further over time by timing acknowledgments to Responses.

2.5.6. Rate Control

VMTP is designed to deal with the present and future problem of packet overruns. We expect overruns to be the major cause of dropped packets in the future. A client is expected to estimate and adjust the interpacket gap times so as to not overrun a server or intermediate nodes. The selective retransmission mechanism allows the server to indicate that it is being overrun (or some intermediate point is being overrun). For example, if the server requests retransmission of every Kth block, the client should assume overrun is taking place and increase the interpacket gap times.

The client passes the server an indication of the interpacket gap desired for a response. The client may have to increase the interval because packets are being dropped by an intermediate gateway or bridge, even though it can handle a higher rate. A conservative policy is to increase the interpacket gap whenever a packet is lost as part of a multi-packet packet group.

The provision of selective retransmission allows the rate of the client and the server to "push up" against the maximum rate (and thus lose packets) without significant penalty. That is, every time that packet transmission exceeds the rate of the channel or receiver, the recovery cost to retransmit the dropped packets is generally far less than retransmitting from the first dropped packet.

The interpacket gap is expressed in 1/32nds of the MTU packet transmission time. The minimum interpacket gap is 0 and the maximum gap that can be described in the protocol is 8 packet times. This places a limit on the slowest receivers that can be efficiently used on a network, at least those handling multi-packet Requests and Responses. This scheme also limits the granularity of adjustment. However, the granularity is relative to the speed of the network, as opposed to an absolute time.
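As an illustration of the gap encoding, the following sketch converts a gap value in 1/32nds of an MTU packet transmission time into seconds and applies the conservative policy of widening the gap whenever a packet is lost. The function names, the increment of 4 units per loss, and the use of 8 * 32 units as the ceiling are assumptions made for the example. The ceiling takes the "8 packet times" figure at face value; the largest value that can actually be encoded depends on the field width given with the packet formats.

```python
GAP_UNITS_PER_PACKET_TIME = 32                  # gap unit is 1/32 of an MTU packet time
MAX_GAP_UNITS = 8 * GAP_UNITS_PER_PACKET_TIME   # "8 packet times", taken at face value


def mtu_transmission_time(mtu_octets: int, bits_per_second: float) -> float:
    """Seconds needed to transmit one MTU-sized packet on the local network."""
    return mtu_octets * 8 / bits_per_second


def gap_seconds(gap_units: int, mtu_octets: int, bits_per_second: float) -> float:
    """Convert an interpacket gap value into an absolute delay in seconds."""
    packet_time = mtu_transmission_time(mtu_octets, bits_per_second)
    return gap_units * packet_time / GAP_UNITS_PER_PACKET_TIME


def widen_gap_on_loss(gap_units: int, step: int = 4) -> int:
    """Conservative policy: widen the gap after a loss (step is hypothetical)."""
    return min(gap_units + step, MAX_GAP_UNITS)
```

For a 1500-octet MTU on a 10 Mb/s network, one packet time is 1.2 milliseconds, so a gap value of 32 asks for a 1.2 millisecond pause between packets. Because the unit scales with the network speed, the same value means a shorter absolute delay on a faster network.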
For entities on different networks of significantly different speed, we assume the interconnecting gateways can buffer packets to compensate <2>. With different network speeds and intermediary nodes subject to packet loss, a node must adjust the interpacket gap based on packet loss. The interpacket gap parameter may be of limited use.

_______________

<2> Gateways must also employ techniques to preserve or intelligently modify (if appropriate) the interpacket gaps. In particular, they must be sure not to arbitrarily remove interpacket gaps as a result of their forwarding of packets.

2.6. Security

VMTP provides an (optional) secure mode that protects against the usual security threats of peeking, impostoring, message tampering and replays. Secure VMTP must be used to guarantee any of the transport-level reliability properties unless it is guaranteed that there are no intruders or agents that can modify packets and update the packet checksums. That is, non-secure VMTP provides no guarantees in the presence of an intelligent intruder. The design closely follows that described by Birrell [1].

Authenticated information about a remote entity, including an encryption/decryption key, is obtained and maintained using a VMTP management operation, the authenticated Probe operation, which is executed as a non-secure VMTP message transaction. If a server receives a secure Request for which the server has no entity state, it sends a Probe request to the VMTP management module of the client, "challenging" it to provide an authenticator that both authenticates the client as being associated with a particular principal as well as providing a key for encryption/decryption. The principal can include a real and effective principal, as used in UNIX <3>.
Namely, the real principal is the principal on whose behalf the Request is being performed whereas the effective principal is the principal of the module invoking the request or remote procedure call. Peeking is prevented by encrypting every Request and Response packet with a working Key that is shared between Client and Server. Impostoring and replays are detected by comparing the Transaction identifier with that stored in the corresponding entity state record (which is created and updated by VMTP as needed). Message tampering is detected by encryption of the packet including the Checksum field. An intruder cannot update the checksum after modifying the packet without knowing the Key. The cost of fully encrypting a packet is close to the cost of generating a cryptographic checksum (and of course, encryption is needed in the general case), so there is no explicit provision for cryptographic checksum without packet encryption. A Client determines the Principal of the Server and acquires an authenticator for this Server and Principal using a higher level protocol. The Server cannot decrypt the authenticator or the Request packets unless it is in fact the Principal expected by the Client. An encrypted VMTP packet is flagged by the EPG bit in the VMTP packet header. Thus, encrypted packets are easily detected and demultiplexed from unencrypted packets. An encrypted VMTP packet is entirely encrypted except for the Client, Version, Domain, Length and Packet Flags fields at the beginning of the packet. Client identifiers can be assigned, changed and used to have no real meaning to an intruder or to only communicate public information (such as the host Internet address). They are otherwise just a random means of identification and demultiplexing and do not therefore divulge any sensitive information. Further secure measures must be taken at the network or data link levels if this information or traffic behavior is considered sensitive. 
VMTP provides multiple authentication domains as well as an encryption qualifier to accommodate different encryption algorithms and their corresponding security/performance trade-offs. (See Appendix V.) A separate key distribution and authentication protocol is required to handle generation and distribution of authenticators and keys. This protocol can be implemented on top of VMTP and can closely follow the Birrell design as well.

_______________

<3> Principal group membership must be obtained, if needed, by a higher level protocol.

Security is optional in the sense that messages may be secure or non-secure, even between consecutive message transactions from the same client. It is also optional in that VMTP clients and servers are not required to implement secure VMTP (although they are required to respond intelligently to attempts to use secure VMTP). At worst, a Client may fail to communicate with a Server if the Server insists on secure communication and the Client does not implement security or vice versa. However, a failure to communicate in this case is necessary from a security standpoint.

2.7. Multicast

The Server entity identifier in a message transaction can identify an entity group, in which case the Request is multicast to every Entity in this group (on a best-efforts basis). The Request is retransmitted until at least one Response is received (or an error timeout occurs) unless it is a datagram Request. The Client can receive multiple Responses to the Request.

The VMTP service interface does not directly provide reliable multicast because it is expensive to provide, rarely needed by applications, and can be implemented by applications using the multiple Response feature. However, the protocol itself is adequate for reliable multicast using positive acknowledgments.
In particular, a sophisticated Client implementation could maintain a list of members for each entity group of interest and retransmit the Request until acknowledged by all members. No modifications are required to the Server implementations.

VMTP supports a simple form of subgroup addressing. If the CRE bit is set in a Request, the Request is delivered to the subgroup of entities in the Server group that are co-resident with one or more entities in the group (or individual entity) identified by the CoresidentEntity field of the Request. This is commonly used to send to the manager entity for a particular entity, where Server specifies the group of such managers. Co-resident means "using the same VMTP module", and logically on the same network host. In particular, a Probe request can be sent to the particular VMTP management module for an entity by specifying the VMTP management group as the Server and the entity in question as the CoresidentEntity.

As an experimental aspect of the protocol, VMTP supports the Server sending a group Response which is sent to the Client as well as members of the destination group of Servers to which the original Request was sent. The MDG bit indicates whether the Client is a member of this group, allowing the Server module to determine whether separately addressed packet groups are required to send the Response to both the Client and the Server group. Normally, a Server accepts a group Response only if it has received the Request and not yet responded to the Client. Also, the Server must explicitly indicate it wants to accept group Responses. Logically, this facility is analogous to responding to a mail message sent to a distribution list by sending a copy of the Response to the distribution list.

2.8. Real-time Communication

VMTP provides three forms of support for real-time communication, in addition to its standard facilities, which make it applicable to a wide range of real-time applications.

First, a priority is transmitted in each Request and Response which governs the priority of its handling. The priority levels are intended to correspond roughly to:

   - urgent/emergency
   - important
   - normal
   - background

with additional gradations for each level. The interpretation and implementation of these priority levels is otherwise host-specific, e.g. the assignment to host processing priorities.

Second, datagram Requests allow the Client to send a datagram to another entity or entity group using the VMTP naming, transmission and delivery mechanism, but without blocking, retransmissions or acknowledgment. (The client can still request acknowledgment using the APG bit although the Server does not expect missing portions of a multi-packet datagram Request to be retransmitted even if some are not received.) A datagram Request in non-streamed mode supersedes all previous Requests from the same Client. A datagram Request in stream mode is queued (if necessary) after previous datagram Requests on the same stream. (See Section 2.11.)

Finally, VMTP provides several control bit flags to modify the handling of Requests and Responses for real-time requirements. First, the conditional message delivery (CMD) flag causes a Request to be discarded if the recipient is not waiting for it when it arrives, similarly for the Response. This option allows a client to send a Request that is contingent on the server being able to process it immediately. The header checksum only (HCO) flag indicates that the checksum has been calculated only on the VMTP header and not on the data segment.
Applications such as voice and video can avoid the overhead of calculating the checksum on data whose utility is insensitive to typical bit errors without losing protection on the header information. Finally, the No Retransmission (NRT) flag indicates that the recipient of a message should not ask for retransmission if part of the message is missing but rather either use what was received or discard it. None of these facilities introduce new protocol states. In fact, the total processing overhead in the normal case is a bit flag test for CMD, HCO or NRT plus assignment of priority on packet transmission and reception. (In fact, CMD and NRT are not tested in the normal case.) The additional code complexity is minimal. We feel that the overhead for providing these real-time facilities is minimal and that these facilities are both important and adequate for a wide class of real-time applications. Several of the normal facilities of VMTP appear useful for real-time applications. First, multicast is useful for distributed, replicated (fault-tolerant) real-time applications, allowing efficient state query and update for (for example) sensors and control state. Second, the DGM or idempotent flag for Responses has some real-time benefits, namely: a Request is redone to get the latest values when the Response is lost, rather than just returning the old values. The desirability of this behavior is illustrated by considering a request for the current time of day. An idempotent handling of this request gives better accuracy in returning the current time in the case that a retransmission is necessary. Finally, the request-response semantics (in the absence of streaming) of each new Request from a Client terminating the previous message transactions from that Client, if any, provides the "most recent is most important" handling of processing that most real-time applications require. 
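The claim that the real-time flags add only a flag test to normal processing can be illustrated with the following sketch of receive-side handling. It is an assumption-laden example rather than the specified procedure: the RealTimeFlag values are symbolic and do not reflect the bit positions in the packet header, and the function names and returned action strings are invented for the illustration.

```python
from enum import Flag, auto


class RealTimeFlag(Flag):
    """Symbolic flags only; these are NOT the wire-format bit positions."""
    CMD = auto()  # conditional message delivery
    HCO = auto()  # header checksum only
    NRT = auto()  # no retransmission


NO_FLAGS = RealTimeFlag(0)


def checksum_coverage(flags: RealTimeFlag) -> str:
    """Which part of the packet the checksum was calculated over."""
    return "header only" if RealTimeFlag.HCO in flags else "header and data"


def on_arrival(flags: RealTimeFlag, recipient_waiting: bool, complete: bool) -> str:
    """Choose an action for an arriving message (hypothetical action names)."""
    if RealTimeFlag.CMD in flags and not recipient_waiting:
        return "discard"  # delivery was conditional on a waiting recipient
    if not complete:
        if RealTimeFlag.NRT in flags:
            return "use partial or discard"  # never ask for retransmission
        return "request retransmission"
    return "accept"
```

Each decision is a single flag test on the arrival path and no additional state is kept, which is consistent with the observation above that these facilities introduce no new protocol states.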
In general, a key design goal of VMTP was to provide an efficient general-purpose transport protocol with the features required for real-time communication. Further experience is required to determine whether this goal has been achieved.

2.9. Forwarded Message Transactions

A Server may invoke another Server to handle a Request. It is fairly common for the invocation of the second Server to be the last action performed by the first Server as part of handling the Request. For example, the original Server may function primarily to select a process to handle the Request. Also, the Server may simply check the authorization on the Request. Describing this situation in the context of RPC, a nested remote procedure call may be the last action in the remote procedure and the return parameters are exactly those of the nested call. (This situation is analogous to tail recursion.)

As an optimization to support this case, VMTP provides a Forward operation that allows the server to send the nested Request to the other server and have this other server respond directly to the Client.

If the message transaction being forwarded was not multicast, was either not secure or involves two Servers that are the same principal, and the ForwardCount of the Request is less than the maximum forward count of 15, the Forward operation is implemented by the Server sending a Request onto the next Server with the forwarded Request identified by the same Client and Transaction as the original Request and a ForwardCount one greater than the Request received from the Client. In this case, the new Server responds directly to the Client. A forwarded Request is illustrated in the following figure.
   +---------+    Request      +----------+
   | Client  +---------------->| Server 1 |
   +---------+                 +----------+
        ^                           |
        |                           | forwarded Request
        |                           V
        |      Response        +----------+
        +----------------------| Server 2 |
                               +----------+

If the message transaction does not meet the above requirements, the Server's VMTP module issues a nested call and simply maps the returned Response to a Response to the original Request without further Server-level processing. In this case, the only optimization over a user-level nested call is one fewer VMTP service operation; the VMTP module handles the return to the invoking call directly. The Server may also use this form of forwarding when the Request is part of a stream of message transactions. Otherwise, it must wait until the forwarded message transaction completes before proceeding with the subsequent message transactions in the stream.

Implementation of the user-level Forward operation is optional, depending on whether the server modules require this facility. Handling an incoming forwarded Request is a minor modification of handling a normal incoming Request. In particular, it is only necessary to examine the ForwardCount field when the Transaction of the Request matches that of the last message transaction received from the Client. Thus, the additional complexity in the VMTP module for the required forwarding support is minimal; the complexity is concentrated in providing a highly optimized user-level Forward primitive, and that is optional.

2.10. VMTP Management

VMTP management includes operations for creating, deleting, modifying and querying VMTP entities and entity groups. VMTP management is logically implemented by a VMTP management server module that is invoked using a message transaction addressed to the Server, VMTP_MANAGER_GROUP, a well-known group entity identifier, in conjunction with the CoresidentEntity mechanism introduced in Section 2.7.
A particular Request may address the local module, the module managing a particular entity, the set of modules managing those entities contained in a specific group or all management modules, as appropriate.

The VMTP management procedures are specified in Appendix III.

2.11. Streamed Message Transactions

Streamed message transactions refer to two or more message transactions initiated by a Client before it receives the response to the first message transaction, with each transaction being processed and responded to in order but asynchronous relative to the initiation of the transactions.

A Client streams message transactions, and thereby has multiple message transactions outstanding, by sending them as part of a single run of message transactions. A run of message transactions is a sequence of message transactions with the same Client and Server and consecutive Transaction identifiers, with all but the first and last Requests and Responses flagged with the NSR (Not Start Run) and NER (Not End Run) control bits. (Conversely, the first Request and Response do not have the NSR bit set and the last Request and Response do not have the NER bit set.) The message transactions in a run use consecutive transaction identifiers (except if the STI bit <4> is used in one, in which case the transaction identifier for the next message transaction is 256 greater, rather than 1).

The Client retains a record for each outstanding transaction until it gets a Response or is timed out in error. The record provides the information required to retransmit the Request. On retransmission timeout, the client retransmits the last Request for which it has not received a Response, the same as is done with non-streamed communication. (I.e. there need be only one timeout for all the outstanding message transactions associated with a single client.)
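The flagging and numbering of a run described above can be sketched as follows. The names RunEntry and build_run are hypothetical, and wrapping the 32-bit transaction identifier modulo 2^32 is an assumption of this example.

```python
from typing import List, NamedTuple, Sequence

TRANSACTION_MODULUS = 2 ** 32  # 32-bit identifier; modular wraparound is assumed


class RunEntry(NamedTuple):
    transaction: int
    nsr: bool  # Not Start Run: set on every transaction except the first
    ner: bool  # Not End Run: set on every transaction except the last
    sti: bool  # set when the next identifier should skip ahead by 256


def build_run(first_transaction: int, sti_bits: Sequence[bool]) -> List[RunEntry]:
    """Assign identifiers and run flags to one run of message transactions.

    sti_bits holds one entry per transaction in the run, saying whether
    that transaction uses the STI bit.
    """
    run = []
    transaction = first_transaction % TRANSACTION_MODULUS
    last_index = len(sti_bits) - 1
    for index, sti in enumerate(sti_bits):
        run.append(RunEntry(transaction, index > 0, index < last_index, sti))
        step = 256 if sti else 1
        transaction = (transaction + step) % TRANSACTION_MODULUS
    return run
```

For example, build_run(100, [False, True, False]) yields transactions 100, 101 and 357. A run of length one has neither NSR nor NER set, which is the non-streamed case.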
The consecutive transaction identifiers within a run of message transactions are used as sequence numbers for error control. The Server handles each message transaction in the sequence specified by its transaction identifier. When it receives a message transaction that is not marked as the beginning of a run, it checks that it previously received a message transaction with the predecessor transaction identifier, either 1 less than the current one or 256 less if the previous one had the STI bit set. If not, the Server sends a NotifyVmtpClient operation to the Client's manager indicating either: (1) the first message transaction was not fully received, or else (2) it has no record of the last one received. If the NRT control flag is set, it does not await nor expect retransmission but proceeds with handling this Request. This flag is used primarily when datagram Requests are used as part of a stream of message transactions. If NRT was not specified, the Client must retransmit from the first message transaction not fully received (either at all or in part) before the Server can proceed with handling this run of Requests or else restart the run of message transactions.

_______________

<4> The STI bit is used by the Client to effectively allocate 255 transaction identifiers for use by the Server in returning a large Response or stream of Responses.

The Client expects to receive the Responses in a consecutive sequence, using the Transaction identifier to detect missing Responses. Thus, the Server must return Responses in sequence except possibly for some gaps, as follows. The Server can specify in the PGcount field in a Response, the number of consecutively previous Responses that this Response corresponds to, up to a maximum of 255 previous Responses <5>. Thus, for example, a Response with Transaction identifier 46 and PGcount 3 represents Responses 43, 44, 45 and 46.
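The two sequencing rules above, the Server's predecessor check and the Client's interpretation of PGcount, can be sketched as follows. The function names are hypothetical and wraparound of the 32-bit transaction identifier modulo 2^32 is an assumption; the sketch also uses the simplified reading in which there is one Response per packet group.

```python
from typing import List

TRANSACTION_MODULUS = 2 ** 32  # 32-bit identifier; modular wraparound is assumed
MAX_PGCOUNT = 255


def is_expected_successor(previous: int, previous_had_sti: bool, current: int) -> bool:
    """Server side: does current directly follow previous within a run?"""
    step = 256 if previous_had_sti else 1
    return current == (previous + step) % TRANSACTION_MODULUS


def responses_covered(transaction: int, pgcount: int) -> List[int]:
    """Client side: identifiers of all Responses represented by one Response."""
    if not 0 <= pgcount <= MAX_PGCOUNT:
        raise ValueError("PGcount must be in the range 0..255")
    return [(transaction - back) % TRANSACTION_MODULUS
            for back in range(pgcount, -1, -1)]
```

With these definitions, responses_covered(46, 3) returns [43, 44, 45, 46], matching the example in the text, and a Client that last saw Response 42 can conclude that nothing is missing.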
This facility allows the Server to eliminate sending Responses to Requests that require no Response, effectively batching the Responses into one. It also allows the Server to effectively maintain strictly consecutive sequencing when the Client has skipped 256 Transaction identifiers using the STI bit and the Server does not have that many Responses to return. If the Client receives a Response that is not consecutive, it retransmits the Request(s) for which the Response(s) is/are missing (unless, of course, the corresponding Requests were sent as datagrams). The Client should wait at the end of a run of message transactions for the last one to complete. When a Server receives a Request with the NSR bit clear and a higher transaction identifier than it currently has for the Client, it terminates all processing and discards Responses associated with the previous Requests. Thus, a stream of message transactions is effectively aborted by starting a new run, even if the Server was in the middle of handling the previous run. Using a mixture of datagram and normal Requests as part of a stream of message transactions, particularly with the use of the NRT bit, can lead to complex behavior under packet loss. It is recommended that a run of message transactions be all of one type to avoid problems, i.e. all normal or all datagrams. Finally, when a Server forwards a Request that is part of a run, it must suspend further processing of the subsequent Requests until the forwarded Request has been handled, to preserve order of processing. The simplest handling of this situation is to use a real nested call when forwarding with streamed message transactions. Flow control of streamed message transactions relies on rate control at the Client plus receipt (or non-receipt) of management notify operations indicating the presence of overrunning. 
A Client must reduce the number of outstanding message transactions at the Server when it receives a NotifyVmtpServer operation with the MSGTRANS_OVERFLOW ResponseCode. The transact parameter indicates the last packet group that was accepted.

<5> PGcount actually corresponds to packet groups which are described in Section 2.13. This (simplified) description is accurate when there is one Request or Response per packet group.

The implementation of multiple outstanding message transactions requires the ability to record, timeout and buffer multiple outstanding message transactions at the Client end as well as the Server end. However, this facility is optional for both the Client and the Server. Client systems with heavy-weight processes and high network access cost are most likely to benefit from this facility. Servers that serve a wide variety of client machines should implement streaming to accommodate these types of clients.

2.12. Fault-Tolerant Applications

One approach to fault-tolerant systems is to maintain a log of all messages sent at each node and replay the messages at a node when the node fails, after restarting it from the last checkpoint <6>. As an experimental facility, VMTP provides a Receive Sequence Number field in the NotifyVmtpClient and NotifyVmtpServer operations as well as the Next Receive Sequence (NRS) flag in the Response packet to allow a sender to log a receive sequence number with each message sent, allowing the packets to be replayed at a recovering node in the same sequence as they were originally received, thereby recovering to the same state as before.

Basically, each sending node maintains a receive sequence number for each receiving node. On sending a Request to a node, it presumes that the receive sequence number is one greater than the one it has recorded for that node.
If not, the receiving node sends a notify operation indicating the receive sequence number assigned to the Request. The NRS in the Response confirms that the Request message was the next receive sequence number, so the sender can detect if it failed to receive the notify operation in the previous case. With Responses, the packets are ordered by the Transaction identifier except for multicast message transactions, in which there may be multiple Responses with the same identification. In this case, NotifyVmtpServer operations are used to provide receive sequence numbers.

This experimental extension of the protocol is focused on support for fault-tolerant real-time distributed systems required in various critical applications. It may be removed or extended, depending on further investigations.

<6> The sender-based logging is being investigated by Willy Zwaenepoel of Rice University.

2.13. Packet Groups

A message (whether Request or Response) is sent as one or more packet groups. A packet group is one or more packets, each containing the same transaction identification and message control block. Each packet is formatted as below with the message control block logically embedded in the VMTP header.

   +------------------------------------++---------------------+
   |             VMTP Header            ||                     |
   +------------+-----------------------||     segment data    |
   |VMTP Control| Message Control Block ||                     |
   +------------+-----------------------++---------------------+

Some fields of the VMTP control portion of the packet and the segment data portion can differ between packets within the same packet group.

The segment data portion of a packet group represents up to 16 kilooctets of the segment specified in the message control block. The portion contained in each packet is indicated by the PacketDelivery field contained in the VMTP header.
The PacketDelivery field as a bit mask has a similar interpretation to the MsgDelivery field in that each bit corresponds to a segment data block of 512 octets. The PacketDelivery field limits a packet group to 16 kilooctets and a maximum of 32 VMTP packets (with a minimum of 1 packet). Data can be sent in fewer packets by sending multiple data blocks per packet. We require that the underlying datagram service support delivery of (at minimum) the basic 580 octet VMTP packet <7>.

To illustrate the use of the PacketDelivery field, consider, for example, the Ethernet, which has an MTU of 1536 octets, so one would send two 512-octet segment data blocks per packet. (In fact, if a third block is last in the segment and less than 512 octets and fits in the packet without making it too big, an Ethernet packet could contain three data blocks.) Thus, an Ethernet packet group for a segment of size 0x1D00 octets (14.5 blocks) and MsgDelivery 0x000074FF consists of 6 packets indicated as follows <8>.

<7> Note that with a 20 octet IP header, a VMTP packet is 600 octets. We propose the convention that any host implementing VMTP implicitly agrees to accept IP/VMTP packets of at least 600 octets.

<8> We use the C notation 0xHHHH to represent a hexadecimal number.

   Packet
   Delivery   1 1 1 1 1 1 1 1 0 0 1 0 1 0 1 0 0 0 0 0 0 . . .
              0000 0400 0800 0C00 1000 1400 1800 1C00
              +----+----+----+----+----+----+----+-+
   Segment    |....|....|....|....|....|....|....|.|
              +----+----+----+----+----+----+----+-+
              :    :    :    :    :    :    :  / / :
              v    v    v    v    v    v    v  /|  v
              +----+----+----+----+   +----+   +---+
   Packets    | 1  | 2  | 3  | 4  |   | 5  |   | 6 |
              +----+----+----+----+   +----+   +---+

Each '.' is 256 octets of data. The PacketDelivery masks for the 6 packets are: 0x00000003, 0x0000000C, 0x00000030, 0x000000C0, 0x00001400 and 0x00006000, indicating the segment blocks contained in each of the packets. (Note that the delivery bits are in little endian order.)
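The six masks in the Ethernet example can be reproduced with a short Python sketch. It assumes that bit i of the mask, counting from the least significant bit, stands for 512-octet block i, which is consistent with the masks listed above. The helper names and the PACKETS table are hypothetical and exist only for illustration.

```python
from typing import Iterable, List

BLOCK_SIZE = 512  # octets per segment data block
MAX_BLOCKS = 32   # number of bits in a PacketDelivery mask


def delivery_mask(blocks: Iterable[int]) -> int:
    """Build a mask with bit i set for each block index i (assumed LSB = 0)."""
    mask = 0
    for block in blocks:
        if not 0 <= block < MAX_BLOCKS:
            raise ValueError("block index out of range")
        mask |= 1 << block
    return mask


def blocks_in_mask(mask: int) -> List[int]:
    """List the block indices whose bits are set in the mask."""
    return [b for b in range(MAX_BLOCKS) if (mask >> b) & 1]


# Block indices per packet, chosen to match the six masks in the text.
PACKETS = [[0, 1], [2, 3], [4, 5], [6, 7], [10, 12], [13, 14]]
MASKS = [delivery_mask(p) for p in PACKETS]
```

OR-ing the six masks together gives 0x000074FF, the MsgDelivery value of the example, and 32 blocks of 512 octets give the 16 kilooctet limit on a packet group.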
A packet group is sent as a single "blast" of packets with no explicit flow control. However, the sender should estimate a suitable rate of packet transmission and transmit at that rate to avoid congesting the network or overwhelming the receiver, as described in Section 2.5.6. Packets in a packet group can be sent in any order with no change in semantics.

When the first packet of a packet group is received (assuming the Server does not decide to discard the packet group), the Server saves a copy of the VMTP packet header, indicates it is currently receiving a packet group, initializes a "current delivery mask" (indicating the data in the segment received so far) to 0, accepts this packet (updating the current delivery mask) and sets the timer for the packet group. Subsequent packets in the packet group update the current delivery mask.

Reception of a packet group is terminated when either the current delivery mask indicates that all the packets in the packet group have been received or the packet group reception timer expires (set to TC3 or TS1). If the packet group reception timer expires and the NRT bit is set in the Control flags, the packet group is discarded if not complete, unless MDM is set. If MDM is set, the MsgDelivery field in the message control block is set to indicate the segment data blocks actually received, and the message control block and segment data received are delivered to the application level.

If NRT is not set and not all data blocks have been received, a NotifyVmtpClient (if a Request) or NotifyVmtpServer (if a Response) is sent back with a PacketDelivery field indicating the blocks received. The source of the packet group is then expected to retransmit the missing blocks. If not all blocks of a Request are received after RequestAckRetries(Client) retransmissions, the Request is discarded and a NotifyVmtpClient operation with an error response code is sent to the client's manager unless MDM is set.
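The reception procedure above can be sketched as a small state object in Python. This is a rough illustration, not the specified implementation: the class name, the string results and the expected_mask parameter are assumptions, since the text does not say here how the receiver derives the set of blocks it expects.

```python
class PacketGroupReception:
    """Tracks the "current delivery mask" for one packet group."""

    def __init__(self, expected_mask: int) -> None:
        # expected_mask is an assumption: the blocks this group should carry.
        self.expected_mask = expected_mask
        self.current_mask = 0  # nothing received yet

    def accept(self, packet_delivery: int) -> bool:
        """Record one packet's blocks; return True once the group is complete."""
        self.current_mask |= packet_delivery
        return self.is_complete()

    def is_complete(self) -> bool:
        return (self.current_mask & self.expected_mask) == self.expected_mask

    def on_timeout(self, nrt: bool, mdm: bool) -> str:
        """Decide what to do when the reception timer expires."""
        if self.is_complete():
            return "deliver"
        if nrt:
            # No retransmission expected: deliver what arrived only if MDM is set.
            return "deliver_partial" if mdm else "discard"
        # Ask the source to retransmit, reporting current_mask as PacketDelivery.
        return "notify"
```

A caller would feed each arriving packet's PacketDelivery mask to accept() and invoke on_timeout() when the packet group timer fires. Retry counting is left out of the sketch.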
With a Response, there are ResponseAckRetries(Server) retransmissions and then, if MDM is not set, the requesting entity is returned the message control block with an indication of the amount of segment data received extending contiguously from the start of the segment. E.g. if the sender sent 6 512-octet blocks and only the first two and the last two arrived, the receiver would be told that 1024 octets were received. The ResponseCode field is set to BAD_REPLY_SEGMENT. (Note that VMTP is only able to indicate the specific segment blocks received if MDM is set.) The parameters RequestAckRetries(Client) and ResponseAckRetries(Server) could be set on a per-client and per-server basis in a sophisticated implementation based on knowledge of packet loss.

If the APG flag is set, a NotifyVmtpClient or NotifyVmtpServer operation is sent back at the end of the packet group reception, depending on whether it is a Request or a Response.

At minimum, a Server should check that each packet in the packet group contains the same Client, Server, Transaction identifier and SegmentSize fields. It is a protocol error for any field other than the Checksum, packet group control flags, Length and PacketDelivery in the VMTP header to differ between any two packets in one packet group. A packet group containing a protocol error of this nature should be discarded.

Notify operations should be sent (or invoked) in the manager whenever there is a problem with a unicast packet, i.e. negative acknowledgments are always sent in this case. In the case of problems with multicast packets, the default is to send nothing in response to an error condition unless there is some clear reason why no other node can respond positively. For example, the packet might be a Probe for an entity that is known to have existed recently on the receiving host but is now invalid and could not have migrated.
In this case, the receiving host responds to the Probe indicating the entity is nonexistent, knowing that no other host can respond to the Probe. For packets and packet groups that are received and processed without problems, a Notify operation is invoked only if the APG bit is set.

2.14. Runs of Packet Groups

A run of packet groups is a sequence of packet groups, all Request packets or all Response packets, with the same Client and consecutive transaction identifiers, all but the first and last packets flagged with the NSR (Not Start Run) and NER (Not End Run) control bits. When each packet group in the run corresponds to a single Request or Response, it is identical to a run of message transactions. (See Section 2.11.) However, a Request message or a Response message may consist of up to 256 packet groups within a run, for a maximum of 4 megaoctets of segment data. A message that is continued in the next packet group in the run is flagged in the current packet group by the CMG flag. Otherwise, the next packet group in the run (if any) is treated as a separate Request or Response.

Normally, each Request and Response message is sent as a single packet group and each run consists of a single packet group. In this case neither NSR nor NER is set. For multi-packet group messages, the PacketDelivery mask in the i-th packet group of a message corresponds to the portion of the segment offset by i-1 times 16 kilooctets, designating the first packet group to have i = 1.

2.15. Byte Order

For purposes of transmission and reception, the MCB is treated as consisting of 8 32-bit fields and the segment is a sequence of bytes. VMTP transmits the MCB in big-endian order, performing byte-swapping, if necessary, before transmission. A little-endian host must byte-swap the MCB on reception. (The data segment is transmitted as a sequence of bytes with no reordering.)
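The MCB byte-order rule can be illustrated with Python's standard struct module, which converts between host integers and a fixed wire order regardless of the host's own byte order. The function names below are hypothetical, and the sketch treats the MCB purely as eight unsigned 32-bit values without interpreting the individual fields.

```python
import struct
from typing import List, Sequence

MCB_FIELDS = 8
MCB_FORMAT = ">8I"  # big-endian, eight unsigned 32-bit fields


def pack_mcb(fields: Sequence[int]) -> bytes:
    """Encode the eight MCB fields in big-endian order for transmission."""
    if len(fields) != MCB_FIELDS:
        raise ValueError("the MCB consists of exactly 8 32-bit fields")
    return struct.pack(MCB_FORMAT, *fields)


def unpack_mcb(data: bytes) -> List[int]:
    """Decode a received 32-octet MCB into host integers."""
    return list(struct.unpack(MCB_FORMAT, data))
```

Because the format string fixes the wire order, the same code serves both big-endian and little-endian hosts. The segment data is not touched by either function.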
The byte order of the sender of a message is indicated by the LEE bit in the entity identifier for the sender, the Client field if a Request and the Server field if a Response. The sender and receiver of a message are required to agree in some higher level protocol (such as an RPC presentation protocol) on who does further swapping of the MCB and data segment if required by the types of the data actually being transmitted. For example, the segment data may contain a record with 8-bit, 16-bit and 32-bit fields, so additional transformation is required to move the segment from a host of one byte order to another.

VMTP to date has used a higher-level presentation protocol in which segment data is sent in the native order of the sending host and byte-swapped as necessary by the receiving host. This approach minimizes the byte-swapping overhead between machines of common byte order (including when the communication is transparently local to one host), avoids a strong bias in the protocol to one byte-order, and allows for the sending entity to be sending to a group of hosts with different byte orders. (Note that the byte-swap overhead for the MCB is minimal.) The presentation-level overhead is minimal because most common operations, such as file access operations, have parameters that fit the MCB and data segment data types exactly.

2.16. Minimal VMTP Implementation

A minimal VMTP client needs to be able to send a Request packet group and receive a Response packet group as well as accept and respond to Requests sent to its management module, including Probe and NotifyClient operations. It may also require the ability to invoke Probe and Notify operations to locate a Server and acknowledge responses. (The latter is required only if it is involved in transactions that are not idempotent or datagram message transactions.
However, a simple sensor, for example, can transmit VMTP datagram Requests indicating its current state with even less mechanism.) The minimal client thus requires very little code and is suitable as a basis for (e.g.) a network boot loader.

A minimal VMTP server implements idempotent, non-encrypted message transactions, possibly with no segment data support. It should use an entity state record for each Request but need only retain it while processing the Request. Without segment data larger than a packet, there is no need for any timers, buffering (outside of immediate request processing) or queuing. In particular, it needs only as many records as message transactions it handles simultaneously (e.g. 1). The entity state record is required to recognize and respond to Request retransmissions during request processing.

The minimal server need only receive Requests and be able to send Response packets. It need have only a minimal management module supporting Probe operations. (Support for the NotifyVmtpClient operation is only required if it does not respond immediately to a Request.) Thus the VMTP support for, say, a time server, sensor, or actuator can be extremely simple. Note that the server need never issue a Probe operation if it uses the host address of the Request for the Response and does not require the Client information returned by the Probe operation. The minimal server should also support reception of forwarded Requests.

2.17. Message vs. Procedural Request Handling

A request-response protocol can be used to implement two forms of semantics on reception. With procedural handling of a Request, a Request is handled by a process associated with the Server that effectively takes on the identity of the calling process, treating the Request message as invoking a procedure, and relinquishing its association to the calling process on return. VMTP supports multiple nested calls spanning multiple machines.
In this case, the distributed call stack that results is associated with a single process from the standpoint of authentication and resource management, using the ProcessId field supported by VMTP. The entity identifiers effectively link these call frames together. That is, the Client field in a Request is effectively the return link to the previous call frame.

With message handling of a Request, a Request message is queued for a server process. The server process dequeues, reads, processes and responds to the Request message, executing as a separate process. Subsequent Requests to the same server are queued until the server asks to receive the next Request.

Procedural semantics have the advantage of allowing each Request (up to the resource limits of the Server) to execute concurrently at the Server, with Request-specific synchronization. Message semantics have the advantage that Requests are serialized at the Server and that the request processing logically executes with the priority, protection and independent execution of a separate process. Note that procedural and message handling of a request appear no different to the client invoking the message transaction, except possibly for differences in performance.

We view the two Request handling approaches as appropriate under different circumstances. VMTP supports both models.

2.18. Bibliography

The basic protocol is similar to that used in the original form of the V kernel [3, 4] as well as the transport protocol of Birrell and Nelson's [2] remote procedure call mechanism. An earlier version of the protocol was described in SIGCOMM'86 [6]. The rate-based flow control is similar to the techniques of Netblt [9]. The support for idempotency draws, in part, on the favorable experience with idempotency in the V distributed system. Its use was originally inspired by the Woodstock File Server [11].
The multicast support draws on the multicast facilities in V [5] and is designed to work with, and is now implemented using, the multicast extensions to the Internet [8] described in RFCs 966 and 988. The secure version of the protocol is similar to that described by Birrell [1] for secure RPC. The use of runs of packet groups is similar to Fletcher and Watson's delta-T protocol [10]. The use of "management" operations implemented using VMTP in place of specialized packet types is viewed as part of a general strategy of using recursion to simplify protocol architectures [7].

Finally, this protocol was designed, in part, to respond to the requirements identified by Braden in RFC 955. We believe that VMTP satisfies the requirements stated in RFC 955.

[1] A.D. Birrell, "Secure Communication using Remote Procedure Calls", ACM Trans. on Computer Systems 3(1), February, 1985.

[2] A. Birrell and B. Nelson, "Implementing Remote Procedure Calls", ACM Trans. on Computer Systems 2(1), February, 1984.

[3] D.R. Cheriton and W. Zwaenepoel, "The Distributed V Kernel and its Performance for Diskless Workstations", In Proceedings of the 9th Symposium on Operating System Principles, ACM, 1983.

[4] D.R. Cheriton, "The V Kernel: A Software Base for Distributed Systems", IEEE Software 1(2), April, 1984.

[5] D.R. Cheriton and W. Zwaenepoel, "Distributed Process Groups in the V Kernel", ACM Trans. on Computer Systems 3(2), May, 1985.

[6] D.R. Cheriton, "VMTP: A Transport Protocol for the Next Generation of Communication Systems", In Proceedings of SIGCOMM'86, ACM, Aug 5-7, 1986.

[7] D.R. Cheriton, "Exploiting Recursion to Simplify an RPC Communication Architecture", in preparation, 1988.

[8] D.R. Cheriton and S.E. Deering, "Host Groups: A Multicast Extension for Datagram Internetworks", In 9th Data Communication Symposium, IEEE Computer Society and ACM SIGCOMM, September, 1985.

[9] D.D. Clark and M. Lambert and L.
Zhang, "NETBLT: A Bulk Data Transfer Protocol", Technical Report RFC 969, Defense Advanced Research Projects Agency, 1985.

[10] J.G. Fletcher and R.W. Watson, "Mechanism for a Reliable Timer-based Protocol", Computer Networks 2:271-290, 1978.

[11] D. Swinehart and G. McDaniel and D. Boggs, "WFS: A Simple File System for a Distributed Environment", In Proc. 7th Symp. Operating Systems Principles, 1979.

3. VMTP Packet Formats

VMTP uses two basic packet formats corresponding to Request packets and Response packets. These packet formats are identical in most of the fields to simplify the implementation.

We first describe the entity identifier format and the packet fields that are used in general, followed by a detailed description of each of the packet formats. These fields are described below in detail. The individual packet formats are described in the following subsections. The reader and VMTP implementor may wish to refer to Chapters 4 and 5 for a description of VMTP event handling and only refer to this detailed description as needed.

3.1. Entity Identifier Format

The 64-bit non-group entity identifiers have the following substructure.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |R| |L|R|                                                       |
   |A|0|E|E|              Domain-specific structure                |
   |E| |E|S|                                                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Domain-specific structure                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The field meanings are as follows:

RAE    Remote Alias Entity - the entity identifier identifies an entity that is acting as an alias for some entity outside this entity domain. This bit is used by higher-level protocols. For instance, servers may take extra security and protection measures with aliases.

GRP    Group - 0, for non-group entity identifiers.
LEE    Little-Endian Entity - the entity transmits data in little-endian (VAX) order.

RES    Reserved - must be 0.

The 64-bit entity group identifiers have the following substructure.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |R| |U|R|                                                       |
   |A|1|G|E|              Domain-specific structure                |
   |E| |P|S|                                                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Domain-specific structure                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The field meanings are as follows:

RAE    Remote Alias Entity - same as for non-group entity identifier.

GRP    Group - 1, for entity group identifiers.

UGP    Unrestricted Group - no restrictions are placed on joining this group. I.e. any entity can join limited only by implementation resources.

RES