# PV Calls Protocol version 1

## Glossary

The following is a list of terms and definitions used in the Xen
community. If you are a Xen contributor you can skip this section.

* PV

  Short for paravirtualized.

* Dom0

  First virtual machine that boots. In most configurations Dom0 is
  privileged and has control over hardware devices, such as network
  cards, graphics cards, etc.

* DomU

  Regular unprivileged Xen virtual machine.

* Domain

  A Xen virtual machine. Dom0 and all DomUs are separate Xen
  domains.

* Guest

  Same as domain: a Xen virtual machine.

* Frontend

  Each DomU has one or more paravirtualized frontend drivers to access
  disks, network, console, graphics, etc. The presence of PV devices is
  advertised on XenStore, a cross domain key-value database. Frontends
  are similar in intent to the virtio drivers in Linux.

* Backend

  A Xen paravirtualized backend typically runs in Dom0 and it is used to
  export disks, network, console, graphics, etc., to DomUs. Backends can
  live both in kernel space and in userspace. For example xen-blkback
  lives under drivers/block in the Linux kernel and xen_disk lives under
  hw/block in QEMU. Paravirtualized backends are similar in intent to
  virtio device emulators.

* VMX and SVM

  On Intel processors, VMX is the CPU flag for VT-x, hardware
  virtualization support. It corresponds to SVM on AMD processors.



## Rationale

PV Calls is a paravirtualized protocol that allows the implementation of
a set of POSIX functions in a different domain. The PV Calls frontend
sends POSIX function calls to the backend, which acts on each call and
returns a value to the frontend.

This version of the document covers networking function calls, such as
connect, accept, bind, release, listen, poll, recvmsg and sendmsg; but
the protocol is meant to be easily extended to cover different sets of
calls. Unimplemented commands return ENOTSUP.

PV Calls provide the following benefits:

* full visibility of the guest behavior on the backend domain, allowing
  for inexpensive filtering and manipulation of any guest calls
* excellent performance

Specifically, PV Calls for networking offer these advantages:

* guest networking works out of the box with VPNs, wireless networks and
  any other complex configurations on the host
* guest services listen on ports bound directly to the backend domain IP
  addresses
* localhost becomes a secure host wide network for inter-VM
  communications


## Design

### Why Xen?

PV Calls are part of an effort to create a secure runtime environment
for containers (Open Containers Initiative images to be precise). PV
Calls are based on Xen, although porting them to other hypervisors is
possible. Xen was chosen because of its security and isolation
properties and because it supports PV guests, a type of virtual machine
that does not require hardware virtualization extensions (VMX on Intel
processors and SVM on AMD processors). This is important because PV
Calls is meant for containers and containers are often run on top of
public cloud instances, which do not support nested VMX (or SVM) as of
today (early 2017). Xen PV guests are lightweight, minimalist, and do
not require machine emulation: all properties that make them a good fit
for this project.

### Xenstore

The frontend and the backend connect via [xenstore] to
exchange information. The toolstack creates front and back nodes with
state of [XenbusStateInitialising]. The protocol node name
is **pvcalls**. There can only be one PV Calls frontend per domain.

#### Frontend XenBus Nodes

version
     Values:         <string>

     Protocol version, chosen among the ones supported by the backend
     (see **versions** under [Backend XenBus Nodes]). Currently the
     value must be "1".

port
     Values:         <uint32_t>

     The identifier of the Xen event channel used to signal activity
     in the command ring.

ring-ref
     Values:         <uint32_t>

     The Xen grant reference granting permission for the backend to map
     the sole page in a single page sized command ring.

#### Backend XenBus Nodes

versions
     Values:         <string>

     List of comma separated protocol versions supported by the backend.
     For example "1,2,3". Currently the value is just "1", as there is
     only one version.

max-page-order
     Values:         <uint32_t>

     The maximum supported size of a memory allocation in units of
     log2(machine pages), e.g. 1 = 2 pages, 2 = 4 pages, etc. It must
     be 1 or more.

function-calls
     Values:         <uint32_t>

     Value "0" means that no calls are supported.
     Value "1" means that socket, connect, release, bind, listen, accept
     and poll are supported.

#### State Machine

Initialization:

    *Front*                               *Back*
    XenbusStateInitialising               XenbusStateInitialising
    - Query virtual device                - Query backend device
      properties.                           identification data.
    - Setup OS device instance.           - Publish backend features
    - Allocate and initialize the           and transport parameters
      request ring.                                      |
    - Publish transport parameters                       |
      that will be in effect during                      V
      this connection.                            XenbusStateInitWait
                 |
                 |
                 V
       XenbusStateInitialised

                                          - Query frontend transport parameters.
                                          - Connect to the request ring and
                                            event channel.
                                                         |
                                                         |
                                                         V
                                                 XenbusStateConnected

     - Query backend device properties.
     - Finalize OS virtual device
       instance.
                 |
                 |
                 V
        XenbusStateConnected

Once frontend and backend are connected, they have a shared page, which
is used to exchange messages over a ring, and an event channel,
which is used to send notifications.

Shutdown:

    *Front*                            *Back*
    XenbusStateConnected               XenbusStateConnected
                |
                |
                V
       XenbusStateClosing

                                       - Unmap grants
                                       - Unbind event channels
                                                 |
                                                 |
                                                 V
                                         XenbusStateClosing

    - Unbind event channels
    - Free rings
    - Free data structures
               |
               |
               V
       XenbusStateClosed

                                       - Free remaining data structures
                                                 |
                                                 |
                                                 V
                                         XenbusStateClosed


### Commands Ring

The shared ring is used by the frontend to forward POSIX function calls
to the backend. We shall refer to this ring as **commands ring** to
distinguish it from other rings which can be created later in the
lifecycle of the protocol (see [Indexes Page and Data ring]). The grant
reference for the shared page of this ring is published on xenstore (see
[Frontend XenBus Nodes]). The ring format is defined using the familiar
`DEFINE_RING_TYPES` macro (`xen/include/public/io/ring.h`). Frontend
requests are allocated on the ring using the `RING_GET_REQUEST` macro.
The list of commands below is in calling order.

The format is defined as follows:

    #define PVCALLS_SOCKET         0
    #define PVCALLS_CONNECT        1
    #define PVCALLS_RELEASE        2
    #define PVCALLS_BIND           3
    #define PVCALLS_LISTEN         4
    #define PVCALLS_ACCEPT         5
    #define PVCALLS_POLL           6

    struct xen_pvcalls_request {
        uint32_t req_id; /* private to guest, echoed in response */
        uint32_t cmd;    /* command to execute */
        union {
            struct xen_pvcalls_socket {
                uint64_t id;
                uint32_t domain;
                uint32_t type;
                uint32_t protocol;
                uint8_t pad[4];
            } socket;
            struct xen_pvcalls_connect {
                uint64_t id;
                uint8_t addr[28];
                uint32_t len;
                uint32_t flags;
                grant_ref_t ref;
                uint32_t evtchn;
                uint8_t pad[4];
            } connect;
            struct xen_pvcalls_release {
                uint64_t id;
                uint8_t reuse;
                uint8_t pad[7];
            } release;
            struct xen_pvcalls_bind {
                uint64_t id;
                uint8_t addr[28];
                uint32_t len;
            } bind;
            struct xen_pvcalls_listen {
                uint64_t id;
                uint32_t backlog;
                uint8_t pad[4];
            } listen;
            struct xen_pvcalls_accept {
                uint64_t id;
                uint64_t id_new;
                grant_ref_t ref;
                uint32_t evtchn;
            } accept;
            struct xen_pvcalls_poll {
                uint64_t id;
            } poll;
            /* dummy member to force sizeof(struct xen_pvcalls_request) to match across archs */
            struct xen_pvcalls_dummy {
                uint8_t dummy[56];
            } dummy;
        } u;
    };

The first two fields are common for every command. Their binary layout
is:

    0       4       8
    +-------+-------+
    |req_id |  cmd  |
    +-------+-------+

- **req_id** is generated by the frontend and is a cookie used to
  identify one specific request/response pair of commands. Not to be
  confused with any command **id**, which is used to identify a socket
  across multiple commands, see [Socket].
- **cmd** is the command requested by the frontend:

    - `PVCALLS_SOCKET`:  0
    - `PVCALLS_CONNECT`: 1
    - `PVCALLS_RELEASE`: 2
    - `PVCALLS_BIND`:    3
    - `PVCALLS_LISTEN`:  4
    - `PVCALLS_ACCEPT`:  5
    - `PVCALLS_POLL`:    6

Both fields are echoed back by the backend. See [Socket families and
address format] for the format of the **addr** field of connect and
bind. The maximum size of command specific arguments is 56 bytes. Any
future command that requires more than that will need a bump of the
**version** of the protocol.

Similarly to other Xen ring based protocols, after writing a request to
the ring, the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and
issues an event channel notification when a notification is required.

Backend responses are allocated on the ring using the `RING_GET_RESPONSE` macro.
The format is the following:

    struct xen_pvcalls_response {
        uint32_t req_id;
        uint32_t cmd;
        int32_t ret;
        uint32_t pad;
        union {
            struct _xen_pvcalls_socket {
                uint64_t id;
            } socket;
            struct _xen_pvcalls_connect {
                uint64_t id;
            } connect;
            struct _xen_pvcalls_release {
                uint64_t id;
            } release;
            struct _xen_pvcalls_bind {
                uint64_t id;
            } bind;
            struct _xen_pvcalls_listen {
                uint64_t id;
            } listen;
            struct _xen_pvcalls_accept {
                uint64_t id;
            } accept;
            struct _xen_pvcalls_poll {
                uint64_t id;
            } poll;
            struct _xen_pvcalls_dummy {
                uint8_t dummy[8];
            } dummy;
        } u;
    };

The first four fields are common for every response. Their binary layout
is:

    0       4       8       12      16
    +-------+-------+-------+-------+
    |req_id |  cmd  |  ret  |  pad  |
    +-------+-------+-------+-------+

- **req_id**: echoed back from request
- **cmd**: echoed back from request
- **ret**: return value, identifies success (0) or failure (see [Error
  numbers] in further sections).
  If the **cmd** is not supported by the
  backend, ret is ENOTSUP.
- **pad**: padding

After calling `RING_PUSH_RESPONSES_AND_CHECK_NOTIFY`, the backend checks whether
it needs to notify the frontend and does so via event channel.

A description of each command and of its additional request and response
fields follows.


#### Socket

The **socket** operation corresponds to the POSIX [socket][socket]
function. It creates a new socket of the specified family, type and
protocol. **id** is freely chosen by the frontend and references this
specific socket from this point forward. See [Socket families and
address format] to see which ones are supported by different versions of
the protocol.

Request fields:

- **cmd** value: 0
- additional fields:
  - **id**: generated by the frontend, it identifies the new socket
  - **domain**: the communication domain
  - **type**: the socket type
  - **protocol**: the particular protocol to be used with the socket, usually 0

Request binary layout:

    8       12      16      20      24      28
    +-------+-------+-------+-------+-------+
    |       id      |domain | type  |protoco|
    +-------+-------+-------+-------+-------+

Response additional fields:

- **id**: echoed back from request

Response binary layout:

    16      20      24
    +-------+-------+
    |       id      |
    +-------+-------+

Return value:

  - 0 on success
  - See the [POSIX socket function][socket] for error names; see
    [Error numbers] in further sections.

#### Connect

The **connect** operation corresponds to the POSIX [connect][connect]
function. It connects a previously created socket (identified by **id**)
to the specified address.

The connect operation creates a new shared ring, which we'll call **data
ring**. The data ring is used to send and receive data from the
socket. The connect operation passes two additional parameters:
**evtchn** and **ref**.
**evtchn** is the port number of a new event
channel which will be used for notifications of activity on the data
ring. **ref** is the grant reference of the **indexes page**: a page
which contains shared indexes that point to the write and read locations
in the **data ring**. The **indexes page** also contains the full array
of grant references for the **data ring**. When the frontend issues a
**connect** command, the backend:

- finds its own internal socket corresponding to **id**
- connects the socket to **addr**
- maps the grant reference **ref**, the indexes page, see `struct
  pvcalls_data_intf`
- maps all the grant references listed in `struct pvcalls_data_intf` and
  uses them as shared memory for the **data ring**
- binds the **evtchn**
- replies to the frontend

The [Indexes Page and Data ring] format will be described in the
following section. The **data ring** is unmapped and freed upon issuing
a **release** command on the active socket identified by **id**. A
frontend state change can also cause data rings to be unmapped.
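
As an illustration of the connect command described above, the following is a
minimal sketch of how a frontend might fill in the connect arguments for an
`AF_INET` address. It is not taken from any implementation: the helper
`pvcalls_fill_connect`, the `struct pvcalls_sockaddr_in` name and the
standalone `grant_ref_t` typedef are assumptions made for this example.
Granting the indexes page and allocating the event channel are out of scope
and are represented by plain integers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t grant_ref_t;   /* assumption: same width as in the Xen headers */

/* Same field layout as the connect member of struct xen_pvcalls_request. */
struct xen_pvcalls_connect {
    uint64_t id;
    uint8_t addr[28];
    uint32_t len;
    uint32_t flags;
    grant_ref_t ref;
    uint32_t evtchn;
    uint8_t pad[4];
};

/* AF_INET address, 28 bytes as in "Socket families and address format". */
struct pvcalls_sockaddr_in {
    uint16_t family;            /* AF_INET == 2 */
    uint16_t port;              /* network byte order */
    uint32_t addr;              /* network byte order */
    char zero[20];
};

/*
 * Hypothetical helper: fill the connect arguments for socket `id`.
 * port_be and addr_be must already be in network byte order; ref is the
 * grant reference of the indexes page and evtchn the data ring event channel.
 */
static void pvcalls_fill_connect(struct xen_pvcalls_connect *c, uint64_t id,
                                 uint16_t port_be, uint32_t addr_be,
                                 grant_ref_t ref, uint32_t evtchn)
{
    struct pvcalls_sockaddr_in sin;

    memset(&sin, 0, sizeof(sin));
    sin.family = 2;             /* AF_INET */
    sin.port = port_be;
    sin.addr = addr_be;

    memset(c, 0, sizeof(*c));   /* clears flags (reserved) and pad */
    c->id = id;
    memcpy(c->addr, &sin, sizeof(sin));
    c->len = sizeof(sin);       /* 28 octets, the maximum allowed */
    c->ref = ref;
    c->evtchn = evtchn;
}
```

The caller would then copy the structure into the `u.connect` member of a
request with **cmd** set to `PVCALLS_CONNECT` and push it on the commands ring.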

Request fields:

- **cmd** value: 1
- additional fields:
  - **id**: identifies the socket
  - **addr**: address to connect to, see [Socket families and address format]
  - **len**: address length up to 28 octets
  - **flags**: flags for the connection, reserved for future usage
  - **ref**: grant reference of the indexes page
  - **evtchn**: port number of the evtchn to signal activity on the **data ring**

Request binary layout:

    8       12      16      20      24      28      32      36      40      44
    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
    |       id      |                            addr                       |
    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
    |  len  | flags |  ref  |evtchn |
    +-------+-------+-------+-------+

Response additional fields:

- **id**: echoed back from request

Response binary layout:

    16      20      24
    +-------+-------+
    |       id      |
    +-------+-------+

Return value:

  - 0 on success
  - See the [POSIX connect function][connect] for error names; see
    [Error numbers] in further sections.

#### Release

The **release** operation closes an existing active or passive socket.

When a release command is issued on a passive socket, the backend
releases it and frees its internal mappings. When a release command is
issued for an active socket, the data ring and indexes page are also
unmapped and freed:

- frontend sends release command for an active socket
- backend releases the socket
- backend unmaps the data ring
- backend unmaps the indexes page
- backend unbinds the event channel
- backend replies to frontend with a **ret** value
- frontend frees data ring, indexes page and unbinds event channel

Request fields:

- **cmd** value: 2
- additional fields:
  - **id**: identifies the socket
  - **reuse**: an optimization hint for the backend. The field is
    ignored for passive sockets.
    When set to 1, the frontend lets the
    backend know that it will reuse exactly the same set of grant pages
    (indexes page and data ring) and event channel when creating one of
    the next active sockets. The backend can take advantage of it by
    delaying unmapping grants and unbinding the event channel. The
    backend is free to ignore the hint. Reused data rings are found by
    **ref**, the grant reference of the page containing the indexes.

Request binary layout:

    8       12      16    17
    +-------+-------+-----+
    |       id      |reuse|
    +-------+-------+-----+

Response additional fields:

- **id**: echoed back from request

Response binary layout:

    16      20      24
    +-------+-------+
    |       id      |
    +-------+-------+

Return value:

  - 0 on success
  - See the [POSIX shutdown function][shutdown] for error names; see
    [Error numbers] in further sections.

#### Bind

The **bind** operation corresponds to the POSIX [bind][bind] function.
It assigns the address passed as parameter to a previously created
socket, identified by **id**. **Bind**, **listen** and **accept** are
the three operations required to have fully working passive sockets and
should be issued in that order.

Request fields:

- **cmd** value: 3
- additional fields:
  - **id**: identifies the socket
  - **addr**: address to bind to, see [Socket families and address
    format]
  - **len**: address length up to 28 octets

Request binary layout:

    8       12      16      20      24      28      32      36      40      44
    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
    |       id      |                            addr                       |
    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
    |  len  |
    +-------+

Response additional fields:

- **id**: echoed back from request

Response binary layout:

    16      20      24
    +-------+-------+
    |       id      |
    +-------+-------+

Return value:

  - 0 on success
  - See the [POSIX bind function][bind] for error names; see
    [Error numbers] in further sections.


#### Listen

The **listen** operation marks the socket as a passive socket. It corresponds to
the [POSIX listen function][listen].

Request fields:

- **cmd** value: 4
- additional fields:
  - **id**: identifies the socket
  - **backlog**: the maximum length to which the queue of pending
    connections may grow in number of elements

Request binary layout:

    8       12      16      20
    +-------+-------+-------+
    |       id      |backlog|
    +-------+-------+-------+

Response additional fields:

- **id**: echoed back from request

Response binary layout:

    16      20      24
    +-------+-------+
    |       id      |
    +-------+-------+

Return value:

  - 0 on success
  - See the [POSIX listen function][listen] for error names; see
    [Error numbers] in further sections.


#### Accept

The **accept** operation extracts the first connection request on the
queue of pending connections for the listening socket identified by
**id** and creates a new connected socket.
The id of the new socket is
also chosen by the frontend and passed as an additional field of the
accept request struct (**id_new**). See the [POSIX accept function][accept]
as reference.

Similarly to the **connect** operation, **accept** creates a new [Indexes
Page and Data ring]. The **data ring** is used to send and receive data from
the socket. The **accept** operation passes two additional parameters:
**evtchn** and **ref**. **evtchn** is the port number of a new event
channel which will be used for notifications of activity on the data
ring. **ref** is the grant reference of the **indexes page**: a page
which contains shared indexes that point to the write and read locations
in the **data ring**. The **indexes page** also contains the full array of
grant references for the **data ring**.

The backend will reply to the request only when a new connection is
successfully accepted, i.e. the backend does not return EAGAIN or
EWOULDBLOCK.

Example workflow:

- frontend issues an **accept** request
- backend waits for a connection to be available on the socket
- a new connection becomes available
- backend accepts the new connection
- backend creates an internal mapping from **id_new** to the new socket
- backend maps the grant reference **ref**, the indexes page, see `struct
  pvcalls_data_intf`
- backend maps all the grant references listed in `struct
  pvcalls_data_intf` and uses them as shared memory for the new data
  ring **in** and **out** arrays
- backend binds to the **evtchn**
- backend replies to the frontend with a **ret** value

Request fields:

- **cmd** value: 5
- additional fields:
  - **id**: id of listening socket
  - **id_new**: id of the new socket
  - **ref**: grant reference of the indexes page
  - **evtchn**: port number of the evtchn to signal activity on the data ring

Request binary layout:

    8       12      16      20      24      28      32
    +-------+-------+-------+-------+-------+-------+
    |       id      |    id_new     |  ref  |evtchn |
    +-------+-------+-------+-------+-------+-------+

Response additional fields:

- **id**: id of the listening socket, echoed back from request

Response binary layout:

    16      20      24
    +-------+-------+
    |       id      |
    +-------+-------+

Return value:

  - 0 on success
  - See the [POSIX accept function][accept] for error names; see
    [Error numbers] in further sections.


#### Poll

In this version of the protocol, the **poll** operation is only valid
for passive sockets. For active sockets, the frontend should look at the
indexes on the **indexes page**. When a new connection is available in
the queue of the passive socket, the backend generates a response and
notifies the frontend.

Request fields:

- **cmd** value: 6
- additional fields:
  - **id**: identifies the listening socket

Request binary layout:

    8       12      16
    +-------+-------+
    |       id      |
    +-------+-------+


Response additional fields:

- **id**: echoed back from request

Response binary layout:

    16      20      24
    +-------+-------+
    |       id      |
    +-------+-------+

Return value:

  - 0 on success
  - See the [POSIX poll function][poll] for error names; see
    [Error numbers] in further sections.

#### Expanding the protocol

It is possible to introduce new commands without changing the protocol
ABI. Naturally, a feature flag among the backend xenstore nodes should
advertise the availability of a new set of commands.

If a new command requires parameters in struct xen_pvcalls_request
larger than 56 bytes, which is the current maximum size of the command
specific arguments, then the protocol version should be increased. One
way to implement the large request structure without disrupting the
current ABI is to introduce a new command, such as
PVCALLS_CONNECT_EXTENDED, and a flag to specify that the request uses
two request slots, for a total of 112 bytes.
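
One way to catch an oversized command early is a compile-time check. The
sketch below is only an illustration under stated assumptions: it needs a C11
compiler, `xen_pvcalls_example_cmd` is a hypothetical new command invented for
this example, `PVCALLS_MAX_ARGS_SIZE` is a made-up name for the 56-byte limit,
and the request structure reproduces only the members needed for the check.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name for the 56-byte limit on command specific arguments. */
#define PVCALLS_MAX_ARGS_SIZE 56

/* Hypothetical arguments of a new command, for illustration only. */
struct xen_pvcalls_example_cmd {
    uint64_t id;
    uint8_t name[40];
    uint32_t flags;
    uint8_t pad[4];
};

/* Reduced copy of struct xen_pvcalls_request: header plus argument union. */
struct xen_pvcalls_request_sketch {
    uint32_t req_id;
    uint32_t cmd;
    union {
        struct xen_pvcalls_example_cmd example;
        struct {
            uint8_t dummy[PVCALLS_MAX_ARGS_SIZE];
        } dummy;
    } u;
};

/* Compilation fails if the new arguments no longer fit in one request slot. */
static_assert(sizeof(struct xen_pvcalls_example_cmd) <= PVCALLS_MAX_ARGS_SIZE,
              "command arguments exceed 56 bytes: bump the protocol version");

/* The header (8 bytes) plus the argument union must keep the slot size fixed. */
static_assert(sizeof(struct xen_pvcalls_request_sketch) ==
              8 + PVCALLS_MAX_ARGS_SIZE,
              "request slot size changed");
```

A command whose arguments trip the first assertion would be a candidate for
the two-slot approach described above.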

#### Error numbers

The numbers corresponding to the error names specified by POSIX are:

    [EPERM]         -1
    [ENOENT]        -2
    [ESRCH]         -3
    [EINTR]         -4
    [EIO]           -5
    [ENXIO]         -6
    [E2BIG]         -7
    [ENOEXEC]       -8
    [EBADF]         -9
    [ECHILD]        -10
    [EAGAIN]        -11
    [EWOULDBLOCK]   -11
    [ENOMEM]        -12
    [EACCES]        -13
    [EFAULT]        -14
    [EBUSY]         -16
    [EEXIST]        -17
    [EXDEV]         -18
    [ENODEV]        -19
    [EISDIR]        -21
    [EINVAL]        -22
    [ENFILE]        -23
    [EMFILE]        -24
    [ENOSPC]        -28
    [EROFS]         -30
    [EMLINK]        -31
    [EDOM]          -33
    [ERANGE]        -34
    [EDEADLK]       -35
    [EDEADLOCK]     -35
    [ENAMETOOLONG]  -36
    [ENOLCK]        -37
    [ENOSYS]        -38
    [ENOTEMPTY]     -39
    [ENODATA]       -61
    [ETIME]         -62
    [EBADMSG]       -74
    [EOVERFLOW]     -75
    [EILSEQ]        -84
    [ERESTART]      -85
    [ENOTSOCK]      -88
    [EOPNOTSUPP]    -95
    [EAFNOSUPPORT]  -97
    [EADDRINUSE]    -98
    [EADDRNOTAVAIL] -99
    [ENOBUFS]       -105
    [EISCONN]       -106
    [ENOTCONN]      -107
    [ETIMEDOUT]     -110
    [ENOTSUP]       -524

#### Socket families and address format

The following definitions and explicit sizes, together with POSIX
[sys/socket.h][address] and [netinet/in.h][in] define socket families and
address format. Please be aware that only the **domain** `AF_INET`, **type**
`SOCK_STREAM` and **protocol** `0` are supported by this version of the
specification; all other values return ENOTSUP.

    #define AF_UNSPEC   0
    #define AF_UNIX     1   /* Unix domain sockets      */
    #define AF_LOCAL    1   /* POSIX name for AF_UNIX   */
    #define AF_INET     2   /* Internet IP Protocol     */
    #define AF_INET6    10  /* IP version 6             */

    #define SOCK_STREAM 1
    #define SOCK_DGRAM  2
    #define SOCK_RAW    3

    /* generic address format */
    struct sockaddr {
        uint16_t sa_family_t;
        char sa_data[26];
    };

    struct in_addr {
        uint32_t s_addr;
    };

    /* AF_INET address format */
    struct sockaddr_in {
        uint16_t         sa_family_t;
        uint16_t         sin_port;
        struct in_addr   sin_addr;
        char             sin_zero[20];
    };


### Indexes Page and Data ring

Data rings are used for sending and receiving data over a connected socket. They
are created upon a successful **accept** or **connect** command.
The **sendmsg** and **recvmsg** calls are implemented by sending data and
receiving data from a data ring, and updating the corresponding indexes
on the **indexes page**.

Firstly, the **indexes page** is shared by a **connect** or **accept**
command, see **ref** parameter in their sections. The content of the
**indexes page** is represented by `struct pvcalls_data_intf`, see
below. The structure contains the list of grant references which
constitute the **in** and **out** buffers of the data ring, see ref[]
below. The backend maps the grant references contiguously. Of the
resulting shared memory, the first half is dedicated to the **in** array
and the second half to the **out** array. They are used as circular
buffers for transferring data, and, together, they are the data ring.
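
The split of the mapped memory into the two arrays can be sketched as follows.
This is an illustration only: `PAGE_SHIFT` is assumed to be 12 (4 KiB pages),
`struct data_ring_view` and `data_ring_split` are names made up for this
example, and `base` stands for the address at which the granted pages have
been mapped contiguously.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12           /* assumption: 4 KiB pages */

/* Hypothetical view of a mapped data ring. */
struct data_ring_view {
    uint8_t *in;                /* first half: backend -> frontend */
    uint8_t *out;               /* second half: frontend -> backend */
    size_t array_size;          /* size in bytes of each array */
};

/*
 * Split (1 << ring_order) contiguously mapped pages starting at base into
 * the in and out arrays. ring_order must be at least 1, so that each array
 * gets at least one full page.
 */
static struct data_ring_view data_ring_split(void *base, uint32_t ring_order)
{
    struct data_ring_view v;
    size_t total = ((size_t)1 << ring_order) << PAGE_SHIFT;

    assert(ring_order >= 1);
    v.array_size = total / 2;
    v.in = (uint8_t *)base;
    v.out = v.in + v.array_size;
    return v;
}
```

With `ring_order` equal to 1, for instance, each array is 4096 bytes long and
**out** starts exactly one page after **in**.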


    +---------------------------+                 Indexes page
    | Command ring:             |                 +----------------------+
    | @0: xen_pvcalls_connect:  |                 |@0 pvcalls_data_intf: |
    | @44: ref  +-------------------------------->+@128: ring_order = 1  |
    |                           |                 |@132: ref[0]+         |
    +---------------------------+                 |@136: ref[1]+         |
                                                  |            |         |
                                                  |            |         |
                                                  +----------------------+
                                                               |
                                                               v (data ring)
                                                       +-------+-----------+
                                                       |  @0->4095: in     |
                                                       |  ref[0]           |
                                                       |-------------------|
                                                       |  @4096->8191: out |
                                                       |  ref[1]           |
                                                       +-------------------+


#### Indexes Page Structure

    typedef uint32_t PVCALLS_RING_IDX;

    struct pvcalls_data_intf {
        PVCALLS_RING_IDX in_cons, in_prod;
        int32_t in_error;

        uint8_t pad1[52];

        PVCALLS_RING_IDX out_cons, out_prod;
        int32_t out_error;

        uint8_t pad2[52];

        uint32_t ring_order;
        grant_ref_t ref[];
    };

    /* not actually C compliant (ring_order changes from socket to socket) */
    struct pvcalls_data {
        char in[((1<<ring_order)<<PAGE_SHIFT)/2];
        char out[((1<<ring_order)<<PAGE_SHIFT)/2];
    };

- **ring_order**
  It represents the order of the data ring. The following list of grant
  references is of `(1 << ring_order)` elements. It cannot be greater than
  **max-page-order**, as specified by the backend on XenBus. It has to
  be one at minimum.
- **ref[]**
  The list of grant references which will contain the actual data. They are
  mapped contiguously in virtual memory. The first half of the pages is the
  **in** array, the second half is the **out** array. The arrays must
  have a power of two size. Together, their size is `(1 << ring_order) *
  PAGE_SIZE`.
- **in** is an array used as circular buffer
  It contains data read from the socket. The producer is the backend, the
  consumer is the frontend.
- **out** is an array used as circular buffer
  It contains data to be written to the socket.
  The producer is the frontend,
  the consumer is the backend.
- **in_cons** and **in_prod**
  Consumer and producer indexes for data read from the socket. They keep track
  of how much data has already been consumed by the frontend from the **in**
  array. **in_prod** is increased by the backend, after writing data to **in**.
  **in_cons** is increased by the frontend, after reading data from **in**.
- **out_cons**, **out_prod**
  Consumer and producer indexes for the data to be written to the socket. They
  keep track of how much data has been written by the frontend to **out** and
  how much data has already been consumed by the backend. **out_prod** is
  increased by the frontend, after writing data to **out**. **out_cons** is
  increased by the backend, after reading data from **out**.
- **in_error** and **out_error** They signal errors when reading from the socket
  (**in_error**) or when writing to the socket (**out_error**). 0 means no
  errors. When an error occurs, no further read or write operations are
  performed on the socket. In the case of an orderly socket shutdown (i.e. read
  returns 0) **in_error** is set to ENOTCONN. **in_error** and **out_error**
  are never set to EAGAIN or EWOULDBLOCK (the data is written to the
  ring as soon as it is available).

The binary layout of `struct pvcalls_data_intf` follows:

    0         4         8         12           64        68        72        76
    +---------+---------+---------+-----//-----+---------+---------+---------+
    | in_cons | in_prod |in_error |  padding   |out_cons |out_prod |out_error|
    +---------+---------+---------+-----//-----+---------+---------+---------+

    76        128       132       136       140      4092      4096
    +---//----+---------+---------+---------+---//---+---------+
    | padding |ring_orde| ref[0]  | ref[1]  |        | ref[N]  |
    +---//----+---------+---------+---------+---//---+---------+

**N.B** For one page, N is maximum 991 ((4096-132)/4), but given that N needs
to be a power of two, actually max N is 512 (ring_order = 9).
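
The arithmetic behind this limit can be checked with a few lines of C. The
sketch assumes 4096-byte pages and takes the 132-byte offset of `ref[]` from
the field sizes of `struct pvcalls_data_intf` (two 64-byte index blocks plus
the 4-byte `ring_order`); the macro and function names are made up for this
example.

```c
#include <assert.h>
#include <stdint.h>

#define INDEXES_PAGE_SIZE 4096u /* assumption: 4 KiB pages */
#define REF_ARRAY_OFFSET  132u  /* offset of ref[] in struct pvcalls_data_intf */

/* Number of 4-byte grant references that fit after the fixed fields. */
static uint32_t indexes_page_max_refs(void)
{
    return (INDEXES_PAGE_SIZE - REF_ARRAY_OFFSET) / sizeof(uint32_t);
}

/* Largest ring_order such that (1 << ring_order) references still fit. */
static uint32_t indexes_page_max_ring_order(void)
{
    uint32_t order = 0;

    /* Keep going while the next power of two still fits in the page. */
    while ((2u << order) <= indexes_page_max_refs())
        order++;
    return order;
}
```

Under these assumptions the first function yields 991 and the second yields 9,
matching the numbers quoted above.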

#### Data Ring Structure

The binary layout of the data ring follows:

    0         ((1<<ring_order)<<PAGE_SHIFT)/2       ((1<<ring_order)<<PAGE_SHIFT)
    +------------//-------------+------------//-------------+
    |            in             |           out             |
    +------------//-------------+------------//-------------+

#### Why ring.h is not needed

Many Xen PV protocols use the macros provided by [ring.h] to manage
their shared ring for communication. PVCalls does not, because the [Data
Ring Structure] actually comes with two rings: the **in** ring and the
**out** ring. Each of them is mono-directional, and there is no static
request size: the producer writes opaque data to the ring. On the other
hand, in [ring.h] they are combined, and the request size is static and
well-known. In PVCalls:

    in -> backend to frontend only
    out-> frontend to backend only

In the case of the **in** ring, the frontend is the consumer, and the
backend is the producer. Everything is the same but mirrored for the
**out** ring.

The producer, the backend in this case, never reads from the **in**
ring. In fact, the producer doesn't need any notifications unless the
ring is full. This version of the protocol doesn't take advantage of it,
leaving room for optimizations.

On the other hand, the consumer always requires notifications, unless it
is already actively reading from the ring. The producer can figure it
out, without any additional fields in the protocol, by comparing the
indexes at the beginning and the end of the function. This is similar to
what [ring.h] does.

#### Workflow

The **in** and **out** arrays are used as circular buffers:

    0                               sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2
    +-----------------------------------+
    |to consume|    free    |to consume |
    +-----------------------------------+
               ^            ^
               prod         cons

    0                               sizeof(array)
    +-----------------------------------+
    |  free    | to consume |   free    |
    +-----------------------------------+
               ^            ^
               cons         prod

The following function is provided to calculate how many bytes are currently
left unconsumed in an array:

    #define _MASK_PVCALLS_IDX(idx, ring_size) ((idx) & ((ring_size) - 1))

    static inline PVCALLS_RING_IDX pvcalls_ring_unconsumed(PVCALLS_RING_IDX prod,
                                                           PVCALLS_RING_IDX cons,
                                                           PVCALLS_RING_IDX ring_size)
    {
        PVCALLS_RING_IDX size;

        /* identical indexes: nothing to consume */
        if (prod == cons)
            return 0;

        prod = _MASK_PVCALLS_IDX(prod, ring_size);
        cons = _MASK_PVCALLS_IDX(cons, ring_size);

        /* different indexes that mask to the same offset: the array is full */
        if (prod == cons)
            return ring_size;

        if (prod > cons)
            size = prod - cons;
        else {
            /* the unconsumed data wraps around the end of the array */
            size = ring_size - cons;
            size += prod;
        }
        return size;
    }

The producer (the backend for **in**, the frontend for **out**) writes to the
array in the following way:

- read *[in|out]_cons*, *[in|out]_prod*, *[in|out]_error* from shared memory
- general memory barrier
- return on *[in|out]_error*
- write to array at position *[in|out]_prod* up to *[in|out]_cons*,
  wrapping around the circular buffer when necessary
- write memory barrier
- increase *[in|out]_prod*
- notify the other end via evtchn

The consumer (the backend for **out**, the frontend for **in**) reads from the
array in the following way:

- read *[in|out]_prod*, *[in|out]_cons*, *[in|out]_error* from shared memory
- read memory barrier
- return on *[in|out]_error*
- read from array at position *[in|out]_cons* up to
  *[in|out]_prod*,
  wrapping around the circular buffer when necessary
- general memory barrier
- increase *[in|out]_cons*
- notify the other end via evtchn

The producer takes care of writing only as many bytes as available in
the buffer up to *[in|out]_cons*. The consumer takes care of reading
only as many bytes as available in the buffer up to *[in|out]_prod*.
*[in|out]_error* is set by the backend when an error occurs writing or
reading from the socket.


[xenstore]: https://xenbits.xen.org/docs/unstable/misc/xenstore.txt
[XenbusStateInitialising]: https://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,io,xenbus.h.html
[address]: http://pubs.opengroup.org/onlinepubs/7908799/xns/syssocket.h.html
[in]: http://pubs.opengroup.org/onlinepubs/000095399/basedefs/netinet/in.h.html
[socket]: http://pubs.opengroup.org/onlinepubs/009695399/functions/socket.html
[connect]: http://pubs.opengroup.org/onlinepubs/7908799/xns/connect.html
[shutdown]: http://pubs.opengroup.org/onlinepubs/7908799/xns/shutdown.html
[bind]: http://pubs.opengroup.org/onlinepubs/7908799/xns/bind.html
[listen]: http://pubs.opengroup.org/onlinepubs/7908799/xns/listen.html
[accept]: http://pubs.opengroup.org/onlinepubs/7908799/xns/accept.html
[poll]: http://pubs.opengroup.org/onlinepubs/7908799/xsh/poll.html
[ring.h]: https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/ring.h;hb=HEAD
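
The producer steps listed under [Workflow] can be sketched in C as follows.
This is a simplified illustration rather than a reference implementation:
`ring_write` and `struct ring_side` are hypothetical names, C11
`atomic_thread_fence` stands in for the Xen memory barriers, and the event
channel notification is left as a comment. The sketch also assumes that the
indexes are free-running and only masked when used as array offsets, so that
`prod - cons` is the number of unconsumed bytes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t PVCALLS_RING_IDX;

/* Hypothetical view of one direction of the data ring (in or out). */
struct ring_side {
    volatile PVCALLS_RING_IDX *cons;    /* [in|out]_cons on the indexes page */
    volatile PVCALLS_RING_IDX *prod;    /* [in|out]_prod on the indexes page */
    volatile int32_t *error;            /* [in|out]_error on the indexes page */
    uint8_t *array;                     /* the in or out array */
    PVCALLS_RING_IDX size;              /* array size, a power of two */
};

/* Returns the number of bytes written, or the pending error if there is one. */
static int32_t ring_write(struct ring_side *r, const uint8_t *buf, uint32_t len)
{
    PVCALLS_RING_IDX cons = *r->cons;
    PVCALLS_RING_IDX prod = *r->prod;
    int32_t error = *r->error;
    PVCALLS_RING_IDX free_bytes, off, first;

    atomic_thread_fence(memory_order_seq_cst);  /* general memory barrier */
    if (error)
        return error;

    /* Write only as many bytes as are free up to cons. */
    free_bytes = r->size - (prod - cons);
    if (len > free_bytes)
        len = free_bytes;

    /* Copy in at most two chunks, wrapping around the end of the array. */
    off = prod & (r->size - 1);
    first = r->size - off;
    if (first > len)
        first = len;
    memcpy(r->array + off, buf, first);
    memcpy(r->array, buf + first, len - first);

    atomic_thread_fence(memory_order_release);  /* write memory barrier */
    *r->prod = prod + len;
    /* A real producer would notify the other end via evtchn here. */
    return (int32_t)len;
}
```

The consumer side would mirror this function: read up to `prod`, issue a
general memory barrier, then increase `cons`.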