1# PV Calls Protocol version 1
2
3## Glossary
4
5The following is a list of terms and definitions used in the Xen
6community. If you are a Xen contributor you can skip this section.
7
8* PV
9
10  Short for paravirtualized.
11
12* Dom0
13
14  First virtual machine that boots. In most configurations Dom0 is
15  privileged and has control over hardware devices, such as network
  cards, graphics cards, etc.
17
18* DomU
19
20  Regular unprivileged Xen virtual machine.
21
22* Domain
23
  A Xen virtual machine. Dom0 and all DomUs are separate Xen
  domains.
26
27* Guest
28
29  Same as domain: a Xen virtual machine.
30
31* Frontend
32
33  Each DomU has one or more paravirtualized frontend drivers to access
34  disks, network, console, graphics, etc. The presence of PV devices is
  advertised on XenStore, a cross domain key-value database. Frontends
36  are similar in intent to the virtio drivers in Linux.
37
38* Backend
39
  A Xen paravirtualized backend typically runs in Dom0 and is used to
  export disks, network, console, graphics, etc., to DomUs. Backends can
42  live both in kernel space and in userspace. For example xen-blkback
43  lives under drivers/block in the Linux kernel and xen_disk lives under
44  hw/block in QEMU. Paravirtualized backends are similar in intent to
45  virtio device emulators.
46
47* VMX and SVM
48
49  On Intel processors, VMX is the CPU flag for VT-x, hardware
50  virtualization support. It corresponds to SVM on AMD processors.
51
52
53
54## Rationale
55
PV Calls is a paravirtualized protocol that allows the implementation of
a set of POSIX functions in a different domain. The PV Calls frontend
sends POSIX function calls to the backend, which implements them and
returns a value to the frontend.
60
61This version of the document covers networking function calls, such as
62connect, accept, bind, release, listen, poll, recvmsg and sendmsg; but
63the protocol is meant to be easily extended to cover different sets of
64calls. Unimplemented commands return ENOTSUP.
65
66PV Calls provide the following benefits:
67* full visibility of the guest behavior on the backend domain, allowing
68  for inexpensive filtering and manipulation of any guest calls
69* excellent performance
70
71Specifically, PV Calls for networking offer these advantages:
72* guest networking works out of the box with VPNs, wireless networks and
73  any other complex configurations on the host
74* guest services listen on ports bound directly to the backend domain IP
75  addresses
* localhost becomes a secure host-wide network for inter-VM
  communication
78
79
80## Design
81
82### Why Xen?
83
84PV Calls are part of an effort to create a secure runtime environment
85for containers (Open Containers Initiative images to be precise). PV
86Calls are based on Xen, although porting them to other hypervisors is
87possible. Xen was chosen because of its security and isolation
properties and because it supports PV guests, a type of virtual machine
that does not require hardware virtualization extensions (VMX on Intel
90processors and SVM on AMD processors). This is important because PV
91Calls is meant for containers and containers are often run on top of
92public cloud instances, which do not support nested VMX (or SVM) as of
93today (early 2017). Xen PV guests are lightweight, minimalist, and do
94not require machine emulation: all properties that make them a good fit
95for this project.
96
97### Xenstore
98
99The frontend and the backend connect via [xenstore] to
100exchange information. The toolstack creates front and back nodes with
101state of [XenbusStateInitialising]. The protocol node name
102is **pvcalls**.  There can only be one PV Calls frontend per domain.
103
104#### Frontend XenBus Nodes
105
106version
107     Values:         <string>
108
109     Protocol version, chosen among the ones supported by the backend
110     (see **versions** under [Backend XenBus Nodes]). Currently the
111     value must be "1".
112
113port
114     Values:         <uint32_t>
115
116     The identifier of the Xen event channel used to signal activity
117     in the command ring.
118
119ring-ref
120     Values:         <uint32_t>
121
122     The Xen grant reference granting permission for the backend to map
123     the sole page in a single page sized command ring.
124
125#### Backend XenBus Nodes
126
127versions
128     Values:         <string>
129
130     List of comma separated protocol versions supported by the backend.
131     For example "1,2,3". Currently the value is just "1", as there is
132     only one version.
133
134max-page-order
135     Values:         <uint32_t>
136
     The maximum supported size of a memory allocation in units of
     log2(machine pages), e.g. 1 = 2 pages, 2 = 4 pages, etc. It must
     be 1 or more.
140
141function-calls
142     Values:         <uint32_t>
143
144     Value "0" means that no calls are supported.
145     Value "1" means that socket, connect, release, bind, listen, accept
146     and poll are supported.
147
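As a rough illustration of how a frontend might consume the backend nodes
above, the sketch below checks whether a protocol version appears in the
comma separated **versions** string and converts a page order into a page
count. The helper names are hypothetical and reading the values from
xenstore is left out.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: true if `wanted` appears in a comma separated
 * list such as "1,2,3" (the format of the backend "versions" node). */
bool pvcalls_version_supported(const char *versions, unsigned long wanted)
{
    const char *p = versions;

    while (*p != '\0') {
        char *end;
        unsigned long v = strtoul(p, &end, 10);

        if (end == p)
            return false;   /* no digits found: malformed list */
        if (v == wanted)
            return true;
        if (*end != ',')
            return false;   /* end of list, or malformed */
        p = end + 1;
    }
    return false;
}

/* A page order of N stands for 2^N pages (1 -> 2 pages, 2 -> 4 pages).
 * The caller must keep `order` below 32. */
uint32_t pvcalls_order_to_pages(uint32_t order)
{
    return UINT32_C(1) << order;
}

/* Hypothetical check a frontend could run before choosing a ring order:
 * at least 1 and no larger than the backend's max-page-order. */
bool pvcalls_order_allowed(uint32_t order, uint32_t max_page_order)
{
    return order >= 1 && order <= max_page_order;
}
```

With the current backend value of "1", `pvcalls_version_supported("1", 1)`
is the only check that succeeds.
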
148#### State Machine
149
150Initialization:
151
152    *Front*                               *Back*
153    XenbusStateInitialising               XenbusStateInitialising
154    - Query virtual device                - Query backend device
155      properties.                           identification data.
156    - Setup OS device instance.           - Publish backend features
157    - Allocate and initialize the           and transport parameters
158      request ring.                                      |
159    - Publish transport parameters                       |
160      that will be in effect during                      V
161      this connection.                            XenbusStateInitWait
162                 |
163                 |
164                 V
165       XenbusStateInitialised
166
167                                          - Query frontend transport parameters.
168                                          - Connect to the request ring and
169                                            event channel.
170                                                         |
171                                                         |
172                                                         V
173                                                 XenbusStateConnected
174
175     - Query backend device properties.
176     - Finalize OS virtual device
177       instance.
178                 |
179                 |
180                 V
181        XenbusStateConnected
182
Once frontend and backend are connected, they have a shared page, which
is used to exchange messages over a ring, and an event channel,
which is used to send notifications.
186
187Shutdown:
188
189    *Front*                            *Back*
190    XenbusStateConnected               XenbusStateConnected
191                |
192                |
193                V
194       XenbusStateClosing
195
196                                       - Unmap grants
197                                       - Unbind event channels
198                                                 |
199                                                 |
200                                                 V
201                                         XenbusStateClosing
202
203    - Unbind event channels
204    - Free rings
205    - Free data structures
206               |
207               |
208               V
209       XenbusStateClosed
210
211                                       - Free remaining data structures
212                                                 |
213                                                 |
214                                                 V
215                                         XenbusStateClosed
216
217
218### Commands Ring
219
220The shared ring is used by the frontend to forward POSIX function calls
221to the backend. We shall refer to this ring as **commands ring** to
222distinguish it from other rings which can be created later in the
lifecycle of the protocol (see [Indexes Page and Data ring]). The grant
reference of the shared page for this ring is published on xenstore (see
225[Frontend XenBus Nodes]). The ring format is defined using the familiar
226`DEFINE_RING_TYPES` macro (`xen/include/public/io/ring.h`).  Frontend
227requests are allocated on the ring using the `RING_GET_REQUEST` macro.
228The list of commands below is in calling order.
229
230The format is defined as follows:
231
232    #define PVCALLS_SOCKET         0
233    #define PVCALLS_CONNECT        1
234    #define PVCALLS_RELEASE        2
235    #define PVCALLS_BIND           3
236    #define PVCALLS_LISTEN         4
237    #define PVCALLS_ACCEPT         5
238    #define PVCALLS_POLL           6
239
240    struct xen_pvcalls_request {
241    	uint32_t req_id; /* private to guest, echoed in response */
242    	uint32_t cmd;    /* command to execute */
243    	union {
244    		struct xen_pvcalls_socket {
245    			uint64_t id;
246    			uint32_t domain;
247    			uint32_t type;
248    			uint32_t protocol;
249    			uint8_t pad[4];
250    		} socket;
251    		struct xen_pvcalls_connect {
252    			uint64_t id;
253    			uint8_t addr[28];
254    			uint32_t len;
255    			uint32_t flags;
256    			grant_ref_t ref;
257    			uint32_t evtchn;
258    			uint8_t pad[4];
259    		} connect;
260    		struct xen_pvcalls_release {
261    			uint64_t id;
262    			uint8_t reuse;
263    			uint8_t pad[7];
264    		} release;
265    		struct xen_pvcalls_bind {
266    			uint64_t id;
267    			uint8_t addr[28];
268    			uint32_t len;
269    		} bind;
270    		struct xen_pvcalls_listen {
271    			uint64_t id;
272    			uint32_t backlog;
273    			uint8_t pad[4];
274    		} listen;
275    		struct xen_pvcalls_accept {
276    			uint64_t id;
277    			uint64_t id_new;
278    			grant_ref_t ref;
279    			uint32_t evtchn;
280    		} accept;
281    		struct xen_pvcalls_poll {
282    			uint64_t id;
283    		} poll;
284    		/* dummy member to force sizeof(struct xen_pvcalls_request) to match across archs */
285    		struct xen_pvcalls_dummy {
286    			uint8_t dummy[56];
287    		} dummy;
288    	} u;
289    };
290
291The first two fields are common for every command. Their binary layout
292is:
293
294    0       4       8
295    +-------+-------+
296    |req_id |  cmd  |
297    +-------+-------+
298
- **req_id** is generated by the frontend and is a cookie used to
  identify one specific request/response pair of commands. It is not to
  be confused with the per-command **id** field, which identifies a
  socket across multiple commands, see [Socket].
303- **cmd** is the command requested by the frontend:
304
305    - `PVCALLS_SOCKET`:  0
306    - `PVCALLS_CONNECT`: 1
307    - `PVCALLS_RELEASE`: 2
308    - `PVCALLS_BIND`:    3
309    - `PVCALLS_LISTEN`:  4
310    - `PVCALLS_ACCEPT`:  5
311    - `PVCALLS_POLL`:    6
312
313Both fields are echoed back by the backend. See [Socket families and
314address format] for the format of the **addr** field of connect and
bind. The maximum size of command specific arguments is 56 bytes. Any
future command that requires more than that will need a bump of the
**version** of the protocol.
318
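The sizes quoted above can be checked at compile time. The following is a
minimal sketch, assuming a cut-down stand-in for the request structure
(the name `pvcalls_request_sketch` is made up here): an 8-byte common
header followed by a 56-byte union gives a 64-byte request.

```c
#include <stddef.h>
#include <stdint.h>

/* Cut-down stand-in for struct xen_pvcalls_request: the common header,
 * one real command and the 56-byte dummy member that fixes the size of
 * the union. */
struct pvcalls_request_sketch {
    uint32_t req_id;
    uint32_t cmd;
    union {
        struct {
            uint64_t id;
        } poll;
        struct {
            uint8_t dummy[56];
        } dummy;
    } u;
};

/* The common header occupies bytes 0 to 7. */
_Static_assert(offsetof(struct pvcalls_request_sketch, req_id) == 0,
               "req_id is at byte 0");
_Static_assert(offsetof(struct pvcalls_request_sketch, cmd) == 4,
               "cmd is at byte 4");

/* Command specific arguments start at byte 8 and are capped at 56 bytes. */
_Static_assert(offsetof(struct pvcalls_request_sketch, u) == 8,
               "arguments start at byte 8");
_Static_assert(sizeof(((struct pvcalls_request_sketch *)0)->u) == 56,
               "arguments are 56 bytes");

/* Header plus arguments: 64 bytes per request. */
_Static_assert(sizeof(struct pvcalls_request_sketch) == 64,
               "a request is 64 bytes");
```
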
319Similarly to other Xen ring based protocols, after writing a request to
320the ring, the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and
321issues an event channel notification when a notification is required.
322
323Backend responses are allocated on the ring using the `RING_GET_RESPONSE` macro.
324The format is the following:
325
326    struct xen_pvcalls_response {
327        uint32_t req_id;
328        uint32_t cmd;
329        int32_t ret;
330        uint32_t pad;
331        union {
332    		struct _xen_pvcalls_socket {
333    			uint64_t id;
334    		} socket;
335    		struct _xen_pvcalls_connect {
336    			uint64_t id;
337    		} connect;
338    		struct _xen_pvcalls_release {
339    			uint64_t id;
340    		} release;
341    		struct _xen_pvcalls_bind {
342    			uint64_t id;
343    		} bind;
344    		struct _xen_pvcalls_listen {
345    			uint64_t id;
346    		} listen;
347    		struct _xen_pvcalls_accept {
348    			uint64_t id;
349    		} accept;
350    		struct _xen_pvcalls_poll {
351    			uint64_t id;
352    		} poll;
353    		struct _xen_pvcalls_dummy {
354    			uint8_t dummy[8];
355    		} dummy;
356    	} u;
357    };
358
359The first four fields are common for every response. Their binary layout
360is:
361
362    0       4       8       12      16
363    +-------+-------+-------+-------+
364    |req_id |  cmd  |  ret  |  pad  |
365    +-------+-------+-------+-------+
366
367- **req_id**: echoed back from request
368- **cmd**: echoed back from request
369- **ret**: return value, identifies success (0) or failure (see [Error
370  numbers] in further sections). If the **cmd** is not supported by the
371  backend, ret is ENOTSUP.
372- **pad**: padding
373
374After calling `RING_PUSH_RESPONSES_AND_CHECK_NOTIFY`, the backend checks whether
375it needs to notify the frontend and does so via event channel.
376
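On the frontend side, handling the common response fields could look
roughly like the sketch below. The struct and function names are
hypothetical; only the field semantics (echoed **req_id** and **cmd**,
**ret** of 0 on success, ENOTSUP for unsupported commands) come from this
document.

```c
#include <stdbool.h>
#include <stdint.h>

#define PVCALLS_ENOTSUP 524    /* value listed under Error numbers */

/* The four fields common to every response. */
struct pvcalls_response_hdr {
    uint32_t req_id;
    uint32_t cmd;
    int32_t ret;
    uint32_t pad;
};

enum pvcalls_outcome {
    PVCALLS_OK,
    PVCALLS_UNSUPPORTED,
    PVCALLS_FAILED
};

/* A response belongs to a request when both echoed fields match. */
bool pvcalls_response_matches(const struct pvcalls_response_hdr *rsp,
                              uint32_t req_id, uint32_t cmd)
{
    return rsp->req_id == req_id && rsp->cmd == cmd;
}

/* Classify ret: 0 is success, -ENOTSUP means the backend does not
 * implement the command, anything else is a command specific failure. */
enum pvcalls_outcome
pvcalls_response_outcome(const struct pvcalls_response_hdr *rsp)
{
    if (rsp->ret == 0)
        return PVCALLS_OK;
    if (rsp->ret == -PVCALLS_ENOTSUP)
        return PVCALLS_UNSUPPORTED;
    return PVCALLS_FAILED;
}
```
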
A description of each command, with its additional request and response
fields, follows.
379
380
381#### Socket
382
383The **socket** operation corresponds to the POSIX [socket][socket]
384function. It creates a new socket of the specified family, type and
385protocol. **id** is freely chosen by the frontend and references this
specific socket from this point forward. See [Socket families and
address format] for the families, types and protocols supported by each
version of the protocol.
389
390Request fields:
391
392- **cmd** value: 0
393- additional fields:
394  - **id**: generated by the frontend, it identifies the new socket
395  - **domain**: the communication domain
396  - **type**: the socket type
397  - **protocol**: the particular protocol to be used with the socket, usually 0
398
399Request binary layout:
400
401    8       12      16      20     24       28
402    +-------+-------+-------+-------+-------+
403    |       id      |domain | type  |protoco|
404    +-------+-------+-------+-------+-------+
405
406Response additional fields:
407
408- **id**: echoed back from request
409
410Response binary layout:
411
412    16       20       24
413    +-------+--------+
414    |       id       |
415    +-------+--------+
416
417Return value:
418
419  - 0 on success
  - See the [POSIX socket function][socket] for error names; see
421    [Error numbers] in further sections.
422
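A minimal sketch of filling in a **socket** request follows. It uses a
flattened, sketch-only struct (`pvcalls_socket_request_sketch`) instead of
the union shown earlier so that the byte offsets are easy to compare with
the layout above; the helper name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PVCALLS_SOCKET 0

/* Flattened view of a socket request: common header plus the socket
 * arguments. The trailing pad brings the total to 64 bytes. */
struct pvcalls_socket_request_sketch {
    uint32_t req_id;
    uint32_t cmd;
    uint64_t id;
    uint32_t domain;
    uint32_t type;
    uint32_t protocol;
    uint8_t pad[36];
};

/* Prepare a request for an AF_INET, SOCK_STREAM socket, the only
 * combination supported by this version of the protocol. */
void pvcalls_fill_socket(struct pvcalls_socket_request_sketch *req,
                         uint32_t req_id, uint64_t id)
{
    memset(req, 0, sizeof(*req));
    req->req_id = req_id;       /* cookie echoed in the response */
    req->cmd = PVCALLS_SOCKET;
    req->id = id;               /* freely chosen by the frontend */
    req->domain = 2;            /* AF_INET */
    req->type = 1;              /* SOCK_STREAM */
    req->protocol = 0;
}
```

The same **id** is then reused in every later command that refers to this
socket.
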
423#### Connect
424
425The **connect** operation corresponds to the POSIX [connect][connect]
426function. It connects a previously created socket (identified by **id**)
427to the specified address.
428
429The connect operation creates a new shared ring, which we'll call **data
430ring**. The data ring is used to send and receive data from the
431socket. The connect operation passes two additional parameters:
432**evtchn** and **ref**. **evtchn** is the port number of a new event
433channel which will be used for notifications of activity on the data
434ring. **ref** is the grant reference of the **indexes page**: a page
435which contains shared indexes that point to the write and read locations
436in the **data ring**. The **indexes page** also contains the full array
437of grant references for the **data ring**. When the frontend issues a
438**connect** command, the backend:
439
440- finds its own internal socket corresponding to **id**
441- connects the socket to **addr**
442- maps the grant reference **ref**, the indexes page, see struct
443  pvcalls_data_intf
444- maps all the grant references listed in `struct pvcalls_data_intf` and
445  uses them as shared memory for the **data ring**
- binds the **evtchn**
447- replies to the frontend
448
449The [Indexes Page and Data ring] format will be described in the
450following section. The **data ring** is unmapped and freed upon issuing
451a **release** command on the active socket identified by **id**. A
452frontend state change can also cause data rings to be unmapped.
453
454Request fields:
455
- **cmd** value: 1
457- additional fields:
458  - **id**: identifies the socket
459  - **addr**: address to connect to, see [Socket families and address format]
460  - **len**: address length up to 28 octets
461  - **flags**: flags for the connection, reserved for future usage
462  - **ref**: grant reference of the indexes page
463  - **evtchn**: port number of the evtchn to signal activity on the **data ring**
464
465Request binary layout:
466
467    8       12      16      20      24      28      32      36      40      44
468    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
469    |       id      |                            addr                       |
470    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
471    | len   | flags |  ref  |evtchn |
472    +-------+-------+-------+-------+
473
474Response additional fields:
475
476- **id**: echoed back from request
477
478Response binary layout:
479
480    16      20      24
481    +-------+-------+
482    |       id      |
483    +-------+-------+
484
485Return value:
486
487  - 0 on success
488  - See the [POSIX connect function][connect] for error names; see
489    [Error numbers] in further sections.
490
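The sketch below shows one way a frontend could fill in a **connect**
request for an IPv4 address. It rests on several assumptions that are not
spelled out in this section: the struct is a flattened, sketch-only view
of the request, the port and address are stored in network byte order as
in POSIX, and **len** is set to the full 28 bytes of the address. The
grant reference and event channel are plain numbers supplied by the
caller.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PVCALLS_CONNECT 1

/* Flattened, sketch-only view of a connect request. grant_ref_t is a
 * 32-bit value in the Xen public headers, so uint32_t stands in for it. */
struct pvcalls_connect_request_sketch {
    uint32_t req_id;
    uint32_t cmd;
    uint64_t id;
    uint8_t addr[28];
    uint32_t len;
    uint32_t flags;
    uint32_t ref;
    uint32_t evtchn;
    uint8_t pad[4];
};

void pvcalls_fill_connect(struct pvcalls_connect_request_sketch *req,
                          uint32_t req_id, uint64_t id,
                          const uint8_t ip[4], uint16_t port,
                          uint32_t ref, uint32_t evtchn)
{
    const uint16_t family = 2;      /* AF_INET */

    memset(req, 0, sizeof(*req));   /* also clears flags (reserved) */
    req->req_id = req_id;
    req->cmd = PVCALLS_CONNECT;
    req->id = id;

    /* struct sockaddr_in: family first, then port and address, both
     * assumed to be in network byte order. */
    memcpy(&req->addr[0], &family, sizeof(family));
    req->addr[2] = (uint8_t)(port >> 8);
    req->addr[3] = (uint8_t)(port & 0xff);
    memcpy(&req->addr[4], ip, 4);
    req->len = 28;                  /* assumed: full size of the address */

    req->ref = ref;                 /* grant reference of the indexes page */
    req->evtchn = evtchn;           /* event channel for the data ring */
}
```

Before issuing the request, the frontend has to allocate the indexes page
and the data ring pages, grant them to the backend and allocate the event
channel; none of that is shown here.
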
491#### Release
492
The **release** operation closes an existing active or passive socket.
494
495When a release command is issued on a passive socket, the backend
496releases it and frees its internal mappings. When a release command is
497issued for an active socket, the data ring and indexes page are also
498unmapped and freed:
499
500- frontend sends release command for an active socket
501- backend releases the socket
502- backend unmaps the data ring
503- backend unmaps the indexes page
504- backend unbinds the event channel
- backend replies to frontend with a **ret** value
506- frontend frees data ring, indexes page and unbinds event channel
507
508Request fields:
509
- **cmd** value: 2
511- additional fields:
512  - **id**: identifies the socket
513  - **reuse**: an optimization hint for the backend. The field is
514    ignored for passive sockets. When set to 1, the frontend lets the
515    backend know that it will reuse exactly the same set of grant pages
516    (indexes page and data ring) and event channel when creating one of
517    the next active sockets. The backend can take advantage of it by
518    delaying unmapping grants and unbinding the event channel. The
519    backend is free to ignore the hint. Reused data rings are found by
520    **ref**, the grant reference of the page containing the indexes.
521
522Request binary layout:
523
524    8       12      16    17
525    +-------+-------+-----+
526    |       id      |reuse|
527    +-------+-------+-----+
528
529Response additional fields:
530
531- **id**: echoed back from request
532
533Response binary layout:
534
535    16      20      24
536    +-------+-------+
537    |       id      |
538    +-------+-------+
539
540Return value:
541
542  - 0 on success
543  - See the [POSIX shutdown function][shutdown] for error names; see
544    [Error numbers] in further sections.
545
546#### Bind
547
548The **bind** operation corresponds to the POSIX [bind][bind] function.
549It assigns the address passed as parameter to a previously created
550socket, identified by **id**. **Bind**, **listen** and **accept** are
551the three operations required to have fully working passive sockets and
552should be issued in that order.
553
554Request fields:
555
- **cmd** value: 3
557- additional fields:
558  - **id**: identifies the socket
  - **addr**: address to bind to, see [Socket families and address
    format]
561  - **len**: address length up to 28 octets
562
563Request binary layout:
564
565    8       12      16      20      24      28      32      36      40      44
566    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
567    |       id      |                            addr                       |
568    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
569    |  len  |
570    +-------+
571
572Response additional fields:
573
574- **id**: echoed back from request
575
576Response binary layout:
577
578    16      20      24
579    +-------+-------+
580    |       id      |
581    +-------+-------+
582
583Return value:
584
585  - 0 on success
586  - See the [POSIX bind function][bind] for error names; see
587    [Error numbers] in further sections.
588
589
590#### Listen
591
592The **listen** operation marks the socket as a passive socket. It corresponds to
593the [POSIX listen function][listen].
594
Request fields:

- **cmd** value: 4
598- additional fields:
599  - **id**: identifies the socket
600  - **backlog**: the maximum length to which the queue of pending
601    connections may grow in number of elements
602
603Request binary layout:
604
605    8       12      16      20
606    +-------+-------+-------+
607    |       id      |backlog|
608    +-------+-------+-------+
609
610Response additional fields:
611
612- **id**: echoed back from request
613
614Response binary layout:
615
616    16      20      24
617    +-------+-------+
618    |       id      |
619    +-------+-------+
620
Return value:

  - 0 on success
623  - See the [POSIX listen function][listen] for error names; see
624    [Error numbers] in further sections.
625
626
627#### Accept
628
629The **accept** operation extracts the first connection request on the
630queue of pending connections for the listening socket identified by
631**id** and creates a new connected socket. The id of the new socket is
632also chosen by the frontend and passed as an additional field of the
633accept request struct (**id_new**). See the [POSIX accept function][accept]
634as reference.
635
636Similarly to the **connect** operation, **accept** creates new [Indexes
637Page and Data ring]. The **data ring** is used to send and receive data from
638the socket. The **accept** operation passes two additional parameters:
639**evtchn** and **ref**. **evtchn** is the port number of a new event
640channel which will be used for notifications of activity on the data
641ring. **ref** is the grant reference of the **indexes page**: a page
642which contains shared indexes that point to the write and read locations
643in the **data ring**. The **indexes page** also contains the full array of
644grant references for the **data ring**.
645
646The backend will reply to the request only when a new connection is
647successfully accepted, i.e. the backend does not return EAGAIN or
648EWOULDBLOCK.
649
650Example workflow:
651
652- frontend issues an **accept** request
653- backend waits for a connection to be available on the socket
654- a new connection becomes available
655- backend accepts the new connection
656- backend creates an internal mapping from **id_new** to the new socket
657- backend maps the grant reference **ref**, the indexes page, see struct
658  pvcalls_data_intf
659- backend maps all the grant references listed in `struct
660  pvcalls_data_intf` and uses them as shared memory for the new data
661  ring **in** and **out** arrays
662- backend binds to the **evtchn**
663- backend replies to the frontend with a **ret** value
664
665Request fields:
666
- **cmd** value: 5
668- additional fields:
669  - **id**: id of listening socket
670  - **id_new**: id of the new socket
671  - **ref**: grant reference of the indexes page
672  - **evtchn**: port number of the evtchn to signal activity on the data ring
673
674Request binary layout:
675
676    8       12      16      20      24      28      32
677    +-------+-------+-------+-------+-------+-------+
678    |       id      |    id_new     |  ref  |evtchn |
679    +-------+-------+-------+-------+-------+-------+
680
681Response additional fields:
682
683- **id**: id of the listening socket, echoed back from request
684
685Response binary layout:
686
687    16      20      24
688    +-------+-------+
689    |       id      |
690    +-------+-------+
691
692Return value:
693
694  - 0 on success
695  - See the [POSIX accept function][accept] for error names; see
696    [Error numbers] in further sections.
697
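Bind, listen and accept are meant to be issued in that order on a passive
socket. A frontend could keep track of this with a small state machine
like the one sketched below. The state names, the function and the choice
of returning -EINVAL for an out-of-order command are assumptions made for
this illustration, not part of the protocol.

```c
#include <stdint.h>

#define PVCALLS_BIND    3
#define PVCALLS_LISTEN  4
#define PVCALLS_ACCEPT  5
#define PVCALLS_EINVAL  22

/* Hypothetical per-socket bookkeeping kept by the frontend. */
enum passive_state {
    PASSIVE_NEW,        /* socket created, nothing else done */
    PASSIVE_BOUND,      /* bind succeeded */
    PASSIVE_LISTENING   /* listen succeeded, accept is allowed */
};

/* Returns 0 and advances the state if `cmd` is the next expected step,
 * or a negative error number if it is issued out of order. */
int passive_socket_step(enum passive_state *st, uint32_t cmd)
{
    switch (cmd) {
    case PVCALLS_BIND:
        if (*st != PASSIVE_NEW)
            return -PVCALLS_EINVAL;
        *st = PASSIVE_BOUND;
        return 0;
    case PVCALLS_LISTEN:
        if (*st != PASSIVE_BOUND)
            return -PVCALLS_EINVAL;
        *st = PASSIVE_LISTENING;
        return 0;
    case PVCALLS_ACCEPT:
        /* accept can be repeated, once per incoming connection */
        return *st == PASSIVE_LISTENING ? 0 : -PVCALLS_EINVAL;
    default:
        return -PVCALLS_EINVAL;
    }
}
```

Each successful accept uses a fresh **id_new**, **ref** and **evtchn**,
since every accepted connection gets its own indexes page and data ring.
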
698
699#### Poll
700
701In this version of the protocol, the **poll** operation is only valid
702for passive sockets. For active sockets, the frontend should look at the
703indexes on the **indexes page**. When a new connection is available in
704the queue of the passive socket, the backend generates a response and
705notifies the frontend.
706
707Request fields:
708
- **cmd** value: 6
710- additional fields:
711  - **id**: identifies the listening socket
712
713Request binary layout:
714
715    8       12      16
716    +-------+-------+
717    |       id      |
718    +-------+-------+
719
720
721Response additional fields:
722
723- **id**: echoed back from request
724
725Response binary layout:
726
727    16       20       24
728    +--------+--------+
729    |        id       |
730    +--------+--------+
731
732Return value:
733
734  - 0 on success
735  - See the [POSIX poll function][poll] for error names; see
736    [Error numbers] in further sections.
737
738#### Expanding the protocol
739
740It is possible to introduce new commands without changing the protocol
741ABI. Naturally, a feature flag among the backend xenstore nodes should
742advertise the availability of a new set of commands.
743
744If a new command requires parameters in struct xen_pvcalls_request
larger than 56 bytes, which is the current maximum size of the command
specific arguments, then the protocol version should be increased. One
way to implement the large
747request structure without disrupting the current ABI is to introduce a
748new command, such as PVCALLS_CONNECT_EXTENDED, and a flag to specify
749that the request uses two request slots, for a total of 112 bytes.
750
751#### Error numbers
752
753The numbers corresponding to the error names specified by POSIX are:
754
755    [EPERM]         -1
756    [ENOENT]        -2
757    [ESRCH]         -3
758    [EINTR]         -4
759    [EIO]           -5
760    [ENXIO]         -6
761    [E2BIG]         -7
762    [ENOEXEC]       -8
763    [EBADF]         -9
764    [ECHILD]        -10
765    [EAGAIN]        -11
766    [EWOULDBLOCK]   -11
767    [ENOMEM]        -12
768    [EACCES]        -13
769    [EFAULT]        -14
770    [EBUSY]         -16
771    [EEXIST]        -17
772    [EXDEV]         -18
773    [ENODEV]        -19
774    [EISDIR]        -21
775    [EINVAL]        -22
776    [ENFILE]        -23
777    [EMFILE]        -24
778    [ENOSPC]        -28
779    [EROFS]         -30
780    [EMLINK]        -31
781    [EDOM]          -33
782    [ERANGE]        -34
783    [EDEADLK]       -35
784    [EDEADLOCK]     -35
785    [ENAMETOOLONG]  -36
786    [ENOLCK]        -37
    [ENOSYS]        -38
    [ENOTEMPTY]     -39
789    [ENODATA]       -61
790    [ETIME]         -62
791    [EBADMSG]       -74
792    [EOVERFLOW]     -75
793    [EILSEQ]        -84
794    [ERESTART]      -85
795    [ENOTSOCK]      -88
796    [EOPNOTSUPP]    -95
797    [EAFNOSUPPORT]  -97
798    [EADDRINUSE]    -98
799    [EADDRNOTAVAIL] -99
800    [ENOBUFS]       -105
801    [EISCONN]       -106
802    [ENOTCONN]      -107
803    [ETIMEDOUT]     -110
    [ENOTSUP]       -524
805
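For logging or debugging, a frontend may want to translate a **ret** value
back into a name. The sketch below covers only a handful of rows from the
table above; the table and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct pvcalls_errno_entry {
    const char *name;
    int32_t value;
};

/* A few rows of the error number table; a complete version would list
 * every entry. EAGAIN and EWOULDBLOCK share the same number. */
static const struct pvcalls_errno_entry pvcalls_errnos[] = {
    { "EAGAIN",       -11 },
    { "EWOULDBLOCK",  -11 },
    { "EINVAL",       -22 },
    { "EAFNOSUPPORT", -97 },
    { "EADDRINUSE",   -98 },
    { "ENOTCONN",     -107 },
    { "ETIMEDOUT",    -110 },
    { "ENOTSUP",      -524 },
};

#define PVCALLS_NR_ERRNOS (sizeof(pvcalls_errnos) / sizeof(pvcalls_errnos[0]))

/* Returns the first name listed for `ret`, or NULL if it is unknown. */
const char *pvcalls_errno_name(int32_t ret)
{
    size_t i;

    for (i = 0; i < PVCALLS_NR_ERRNOS; i++)
        if (pvcalls_errnos[i].value == ret)
            return pvcalls_errnos[i].name;
    return NULL;
}

/* Returns the negative number for `name`, or 0 if it is unknown. */
int32_t pvcalls_errno_value(const char *name)
{
    size_t i;

    for (i = 0; i < PVCALLS_NR_ERRNOS; i++)
        if (strcmp(pvcalls_errnos[i].name, name) == 0)
            return pvcalls_errnos[i].value;
    return 0;
}
```

Using the numbers from this table, instead of the host's own `errno.h`,
keeps frontends and backends consistent across operating systems.
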
806#### Socket families and address format
807
808The following definitions and explicit sizes, together with POSIX
809[sys/socket.h][address] and [netinet/in.h][in] define socket families and
address format. Please be aware that only the **domain** `AF_INET`, **type**
`SOCK_STREAM` and **protocol** `0` are supported by this version of the
specification; all other combinations return ENOTSUP.
813
814    #define AF_UNSPEC   0
815    #define AF_UNIX     1   /* Unix domain sockets      */
816    #define AF_LOCAL    1   /* POSIX name for AF_UNIX   */
817    #define AF_INET     2   /* Internet IP Protocol     */
818    #define AF_INET6    10  /* IP version 6         */
819
820    #define SOCK_STREAM 1
821    #define SOCK_DGRAM  2
822    #define SOCK_RAW    3
823
824    /* generic address format */
825    struct sockaddr {
826        uint16_t sa_family_t;
827        char sa_data[26];
828    };
829
830    struct in_addr {
831        uint32_t s_addr;
832    };
833
834    /* AF_INET address format */
835    struct sockaddr_in {
836        uint16_t         sa_family_t;
837        uint16_t         sin_port;
838        struct in_addr   sin_addr;
839        char             sin_zero[20];
840    };
841
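The restriction to `AF_INET`, `SOCK_STREAM` and protocol `0` can be
expressed as a small check, sketched below with hypothetical names. The
constants are prefixed so that they do not clash with system headers, and
the struct mirrors the `AF_INET` address format above under a sketch-only
name with shortened field names.

```c
#include <stdint.h>

/* Values copied from the definitions above. */
#define PVCALLS_AF_INET     2
#define PVCALLS_SOCK_STREAM 1
#define PVCALLS_ENOTSUP     524

/* Same layout as the AF_INET address format above. */
struct pvcalls_sockaddr_in_sketch {
    uint16_t family;
    uint16_t port;
    uint32_t addr;
    char zero[20];
};

/* The address has to fit the 28-byte addr field of connect and bind. */
_Static_assert(sizeof(struct pvcalls_sockaddr_in_sketch) == 28,
               "address is 28 bytes");

/* Returns 0 for the one supported combination and -ENOTSUP otherwise. */
int32_t pvcalls_check_socket_args(uint32_t domain, uint32_t type,
                                  uint32_t protocol)
{
    if (domain == PVCALLS_AF_INET &&
        type == PVCALLS_SOCK_STREAM &&
        protocol == 0)
        return 0;
    return -PVCALLS_ENOTSUP;
}
```

A backend could run such a check before creating its internal socket; a
frontend could run it to fail early without a round trip.
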
842
843### Indexes Page and Data ring
844
845Data rings are used for sending and receiving data over a connected socket. They
846are created upon a successful **accept** or **connect** command.
847The **sendmsg** and **recvmsg** calls are implemented by sending data and
848receiving data from a data ring, and updating the corresponding indexes
849on the **indexes page**.
850
851Firstly, the **indexes page** is shared by a **connect** or **accept**
852command, see **ref** parameter in their sections. The content of the
**indexes page** is represented by `struct pvcalls_data_intf`, see
854below. The structure contains the list of grant references which
855constitute the **in** and **out** buffers of the data ring, see ref[]
856below. The backend maps the grant references contiguously. Of the
857resulting shared memory, the first half is dedicated to the **in** array
858and the second half to the **out** array. They are used as circular
859buffers for transferring data, and, together, they are the data ring.
860
861
862        +---------------------------+                 Indexes page
863        | Command ring:             |                 +----------------------+
864        | @0: xen_pvcalls_connect:  |                 |@0 pvcalls_data_intf: |
865        | @44: ref  +-------------------------------->+@76: ring_order = 1   |
866        |                           |                 |@80: ref[0]+          |
867        +---------------------------+                 |@84: ref[1]+          |
868                                                      |           |          |
869                                                      |           |          |
870                                                      +----------------------+
871                                                                  |
872                                                                  v (data ring)
873                                                          +-------+-----------+
                                                          |  @0->4095: in     |
                                                          |  ref[0]           |
                                                          |-------------------|
                                                          |  @4096->8191: out |
878                                                          |  ref[1]           |
879                                                          +-------------------+
880
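The split of the mapped pages into the **in** and **out** arrays can be
sketched as follows, assuming 4 KiB pages. With a ring order of 1 the two
pages give 8192 bytes: **in** covers bytes 0 to 4095 and **out** covers
bytes 4096 to 8191. The struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PVCALLS_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

/* Hypothetical view of a mapped data ring. */
struct pvcalls_data_view {
    uint8_t *in;        /* first half: backend produces, frontend consumes */
    uint8_t *out;       /* second half: frontend produces, backend consumes */
    size_t array_size;  /* size of each array in bytes */
};

/* `base` points to the (1 << ring_order) contiguously mapped pages. */
struct pvcalls_data_view pvcalls_split_data_ring(uint8_t *base,
                                                 uint32_t ring_order)
{
    struct pvcalls_data_view v;
    size_t total = ((size_t)1 << ring_order) << PVCALLS_PAGE_SHIFT;

    v.array_size = total / 2;
    v.in = base;
    v.out = base + v.array_size;
    return v;
}
```

Because the total size is a power of two and the ring order is at least 1,
each half is itself a power of two and at least one page long.
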
881
882#### Indexes Page Structure
883
884    typedef uint32_t PVCALLS_RING_IDX;
885
886    struct pvcalls_data_intf {
887    	PVCALLS_RING_IDX in_cons, in_prod;
888    	int32_t in_error;
889
    	uint8_t pad1[52];

    	PVCALLS_RING_IDX out_cons, out_prod;
    	int32_t out_error;

    	uint8_t pad2[52];
896
897    	uint32_t ring_order;
898    	grant_ref_t ref[];
899    };
900
901    /* not actually C compliant (ring_order changes from socket to socket) */
902    struct pvcalls_data {
903        char in[((1<<ring_order)<<PAGE_SHIFT)/2];
904        char out[((1<<ring_order)<<PAGE_SHIFT)/2];
905    };

- **ring_order**
  The order of the data ring. The list of grant references that follows has
  `(1 << ring_order)` elements. It cannot be greater than
  **max-page-order**, as specified by the backend on XenBus. It must be at
  least one.
- **ref[]**
  The list of grant references of the pages that contain the actual data.
  They are mapped contiguously in virtual memory. The first half of the pages
  is the **in** array, the second half is the **out** array. Each array must
  have a power-of-two size. Together, their size is
  `(1 << ring_order) * PAGE_SIZE`.
- **in** is an array used as a circular buffer
  It contains data read from the socket. The producer is the backend, the
  consumer is the frontend.
- **out** is an array used as a circular buffer
  It contains data to be written to the socket. The producer is the frontend,
  the consumer is the backend.
- **in_cons** and **in_prod**
  Consumer and producer indexes for data read from the socket. They keep track
  of how much data has been written by the backend to **in** and how much
  data has already been consumed by the frontend. **in_prod** is increased by
  the backend, after writing data to **in**. **in_cons** is increased by the
  frontend, after reading data from **in**.
- **out_cons** and **out_prod**
  Consumer and producer indexes for the data to be written to the socket. They
  keep track of how much data has been written by the frontend to **out** and
  how much data has already been consumed by the backend. **out_prod** is
  increased by the frontend, after writing data to **out**. **out_cons** is
  increased by the backend, after reading data from **out**.
- **in_error** and **out_error**
  They signal errors when reading from the socket (**in_error**) or when
  writing to the socket (**out_error**). 0 means no errors. When an error
  occurs, no further read or write operations are performed on the socket. In
  the case of an orderly socket shutdown (i.e. read returns 0), **in_error**
  is set to ENOTCONN. **in_error** and **out_error** are never set to EAGAIN
  or EWOULDBLOCK (the data is written to the ring as soon as it is
  available).

The binary layout of `struct pvcalls_data_intf` follows:

    0         4         8         12           64        68        72        76
    +---------+---------+---------+-----//-----+---------+---------+---------+
    | in_cons | in_prod |in_error |  padding   |out_cons |out_prod |out_error|
    +---------+---------+---------+-----//-----+---------+---------+---------+

    76           128       132       136       140       4092      4096
    +-----//-----+---------+---------+---------+----//---+---------+
    |  padding   |ring_orde|  ref[0] |  ref[1] |         |  ref[N] |
    +-----//-----+---------+---------+---------+----//---+---------+

**N.B.** One page has room for at most 991 grant references ((4096-132)/4).
Because the number of references must be a power of two, at most 512 can
actually be used (ring_order = 9), so N is at most 511.
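
The offsets and limits above can be checked mechanically. The following is a minimal sketch, not part of the protocol headers: it assumes 4 KiB pages and a 32-bit `grant_ref_t`, gives the two padding arrays distinct names so that the structure compiles, and uses hypothetical `sketch_` helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumptions of this sketch: 4 KiB pages and a 32-bit grant_ref_t. */
#define SKETCH_PAGE_SIZE 4096u
typedef uint32_t grant_ref_t;
typedef uint32_t PVCALLS_RING_IDX;

struct pvcalls_data_intf {
    PVCALLS_RING_IDX in_cons, in_prod;
    int32_t in_error;
    uint8_t pad1[52];
    PVCALLS_RING_IDX out_cons, out_prod;
    int32_t out_error;
    uint8_t pad2[52];
    uint32_t ring_order;
    grant_ref_t ref[];
};

/* Offsets implied by the structure definition. */
_Static_assert(offsetof(struct pvcalls_data_intf, in_cons) == 0, "in_cons");
_Static_assert(offsetof(struct pvcalls_data_intf, out_cons) == 64, "out_cons");
_Static_assert(offsetof(struct pvcalls_data_intf, ring_order) == 128,
               "ring_order");
_Static_assert(offsetof(struct pvcalls_data_intf, ref) == 132, "ref");

/* Number of grant references that fit in the rest of the indexes page. */
static unsigned int sketch_max_refs(void)
{
    return (unsigned int)((SKETCH_PAGE_SIZE -
                           offsetof(struct pvcalls_data_intf, ref)) /
                          sizeof(grant_ref_t));
}

/* Largest ring_order for which (1 << ring_order) references still fit. */
static unsigned int sketch_max_ring_order(void)
{
    unsigned int order = 0;

    while ((2u << order) <= sketch_max_refs())
        order++;
    return order;
}
```

Under these assumptions `sketch_max_refs()` evaluates to 991 and `sketch_max_ring_order()` to 9. The backend may still advertise a smaller **max-page-order**.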

#### Data Ring Structure

The binary layout of the data ring follows:

    0         ((1<<ring_order)<<PAGE_SHIFT)/2       ((1<<ring_order)<<PAGE_SHIFT)
    +------------//-------------+------------//-------------+
    |            in             |           out             |
    +------------//-------------+------------//-------------+
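
As an illustration of this layout, the sketch below splits a contiguously mapped data ring into its **in** and **out** halves. It assumes 4 KiB pages (`SKETCH_PAGE_SHIFT` is 12); `struct sketch_data_ring` and the `sketch_` functions are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Hypothetical handle on the two halves of a mapped data ring. */
struct sketch_data_ring {
    char *in;           /* first half: backend to frontend */
    char *out;          /* second half: frontend to backend */
    size_t array_size;  /* size of each half in bytes */
};

/* Size in bytes of each of the two arrays for a given ring_order. */
static size_t sketch_array_size(uint32_t ring_order)
{
    return (((size_t)1 << ring_order) << SKETCH_PAGE_SHIFT) / 2;
}

/* base points to the (1 << ring_order) pages mapped contiguously. */
static struct sketch_data_ring sketch_split(void *base, uint32_t ring_order)
{
    struct sketch_data_ring r;

    r.array_size = sketch_array_size(ring_order);
    r.in = (char *)base;
    r.out = r.in + r.array_size;
    return r;
}
```

With `ring_order` equal to 1, each array is exactly one page long and **out** starts at byte 4096 of the mapping.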

#### Why ring.h is not needed

Many Xen PV protocols use the macros provided by [ring.h] to manage
their shared ring for communication. PVCalls does not, because the [Data
Ring Structure] actually comes with two rings: the **in** ring and the
**out** ring. Each of them is mono-directional, and there is no static
request size: the producer writes opaque data to the ring. In [ring.h],
on the other hand, the two directions are combined in a single ring, and
the request size is static and well-known. In PVCalls:

- **in**: backend to frontend only
- **out**: frontend to backend only

In the case of the **in** ring, the frontend is the consumer, and the
backend is the producer. Everything is the same but mirrored for the
**out** ring.

The producer, the backend in this case, never reads from the **in**
ring. In fact, the producer doesn't need any notifications unless the
ring is full. This version of the protocol doesn't take advantage of
this, leaving room for optimizations.

On the other hand, the consumer always requires notifications, unless it
is already actively reading from the ring. The producer can figure this
out, without any additional fields in the protocol, by comparing the
indexes at the beginning and the end of the function. This is similar to
what [ring.h] does.

#### Workflow

The **in** and **out** arrays are used as circular buffers:

    0                               sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2
    +-----------------------------------+
    |to consume|    free    |to consume |
    +-----------------------------------+
               ^            ^
               prod         cons

    0                               sizeof(array)
    +-----------------------------------+
    |  free    | to consume |   free    |
    +-----------------------------------+
               ^            ^
               cons         prod

The following function is provided to calculate how many bytes are currently
left unconsumed in an array. The indexes are free-running counters: they are
masked with the array size, which must be a power of two, to obtain an offset
into the array.

    #define _MASK_PVCALLS_IDX(idx, ring_size) ((idx) & ((ring_size) - 1))

    static inline PVCALLS_RING_IDX pvcalls_ring_unconsumed(PVCALLS_RING_IDX prod,
            PVCALLS_RING_IDX cons,
            PVCALLS_RING_IDX ring_size)
    {
        PVCALLS_RING_IDX size;

        /* equal unmasked indexes: the array is empty */
        if (prod == cons)
            return 0;

        prod = _MASK_PVCALLS_IDX(prod, ring_size);
        cons = _MASK_PVCALLS_IDX(cons, ring_size);

        /* different indexes at the same offset: the array is full */
        if (prod == cons)
            return ring_size;

        if (prod > cons)
            size = prod - cons;
        else {
            /* the unconsumed data wraps around the end of the array */
            size = ring_size - cons;
            size += prod;
        }
        return size;
    }
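
A few worked values may help. The sketch below repeats the function so that it compiles on its own and adds `sketch_ring_free`, a hypothetical helper (not part of the protocol) that returns how many bytes the producer may still write.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t PVCALLS_RING_IDX;

#define _MASK_PVCALLS_IDX(idx, ring_size) ((idx) & ((ring_size) - 1))

/* Copy of the function above, so that this sketch is self-contained. */
static inline PVCALLS_RING_IDX pvcalls_ring_unconsumed(PVCALLS_RING_IDX prod,
        PVCALLS_RING_IDX cons,
        PVCALLS_RING_IDX ring_size)
{
    PVCALLS_RING_IDX size;

    if (prod == cons)
        return 0;

    prod = _MASK_PVCALLS_IDX(prod, ring_size);
    cons = _MASK_PVCALLS_IDX(cons, ring_size);

    if (prod == cons)
        return ring_size;

    if (prod > cons)
        size = prod - cons;
    else {
        size = ring_size - cons;
        size += prod;
    }
    return size;
}

/* Hypothetical helper: bytes that can still be written by the producer. */
static inline PVCALLS_RING_IDX sketch_ring_free(PVCALLS_RING_IDX prod,
        PVCALLS_RING_IDX cons,
        PVCALLS_RING_IDX ring_size)
{
    return ring_size - pvcalls_ring_unconsumed(prod, cons, ring_size);
}
```

For a 4096-byte array, `prod = 4096` and `cons = 0` describe a full array (4096 bytes unconsumed), while `prod = 4100` and `cons = 4090` leave 10 bytes unconsumed, wrapped around the end of the array, and 4086 bytes free.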

The producer (the backend for **in**, the frontend for **out**) writes to the
array in the following way:

- read *[in|out]_cons*, *[in|out]_prod*, *[in|out]_error* from shared memory
- general memory barrier
- return if *[in|out]_error* is set
- write to the array starting at position *[in|out]_prod*, up to
  *[in|out]_cons*, wrapping around the circular buffer when necessary
- write memory barrier
- increase *[in|out]_prod*
- notify the other end via evtchn
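
The producer steps can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: `struct sketch_ring` and `sketch_ring_write` are hypothetical, C11 fences stand in for the platform memory barriers, the indexes are treated as free-running counters, and the event channel notification is left as a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t PVCALLS_RING_IDX;

/* Hypothetical view of one direction (in or out) of a data ring. */
struct sketch_ring {
    volatile PVCALLS_RING_IDX *cons;  /* [in|out]_cons in the indexes page */
    volatile PVCALLS_RING_IDX *prod;  /* [in|out]_prod in the indexes page */
    volatile int32_t *error;          /* [in|out]_error in the indexes page */
    char *array;                      /* the in or out array */
    PVCALLS_RING_IDX size;            /* array size in bytes, a power of two */
};

/*
 * Copy up to len bytes from buf into the array. On success returns 0 and
 * stores the number of bytes written in *written. If an error is set in
 * shared memory, returns it without writing anything.
 */
static int32_t sketch_ring_write(struct sketch_ring *r, const char *buf,
                                 PVCALLS_RING_IDX len,
                                 PVCALLS_RING_IDX *written)
{
    PVCALLS_RING_IDX cons = *r->cons;
    PVCALLS_RING_IDX prod = *r->prod;
    int32_t error = *r->error;
    PVCALLS_RING_IDX space, off, chunk;

    atomic_thread_fence(memory_order_seq_cst);  /* general memory barrier */

    *written = 0;
    if (error)
        return error;

    /* Never write past cons: clamp len to the free space. */
    space = r->size - (prod - cons);
    if (len > space)
        len = space;

    /* Copy in at most two pieces, wrapping at the end of the array. */
    off = prod & (r->size - 1);
    chunk = r->size - off;
    if (chunk > len)
        chunk = len;
    memcpy(r->array + off, buf, chunk);
    memcpy(r->array, buf + chunk, len - chunk);

    atomic_thread_fence(memory_order_release);  /* write memory barrier */
    *r->prod = prod + len;

    /* A real implementation would now notify the other end via evtchn. */
    *written = len;
    return 0;
}
```

Returning the byte count separately from the error value keeps the sketch independent of the sign convention used for *[in|out]_error*.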

The consumer (the backend for **out**, the frontend for **in**) reads from the
array in the following way:

- read *[in|out]_prod*, *[in|out]_cons*, *[in|out]_error* from shared memory
- read memory barrier
- return if *[in|out]_error* is set
- read from the array starting at position *[in|out]_cons*, up to
  *[in|out]_prod*, wrapping around the circular buffer when necessary
- general memory barrier
- increase *[in|out]_cons*
- notify the other end via evtchn
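
The consumer side mirrors the producer. The sketch below follows the steps listed above under the same kind of assumptions: `struct sketch_ring` and `sketch_ring_read` are hypothetical names, C11 fences stand in for the platform memory barriers, the indexes are treated as free-running counters, and the notification is left as a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t PVCALLS_RING_IDX;

/* Hypothetical view of one direction (in or out) of a data ring. */
struct sketch_ring {
    volatile PVCALLS_RING_IDX *cons;  /* [in|out]_cons in the indexes page */
    volatile PVCALLS_RING_IDX *prod;  /* [in|out]_prod in the indexes page */
    volatile int32_t *error;          /* [in|out]_error in the indexes page */
    char *array;                      /* the in or out array */
    PVCALLS_RING_IDX size;            /* array size in bytes, a power of two */
};

/*
 * Copy up to len unconsumed bytes from the array into buf. On success
 * returns 0 and stores the number of bytes read in *nread. If an error is
 * set in shared memory, returns it without reading anything.
 */
static int32_t sketch_ring_read(struct sketch_ring *r, char *buf,
                                PVCALLS_RING_IDX len,
                                PVCALLS_RING_IDX *nread)
{
    PVCALLS_RING_IDX prod = *r->prod;
    PVCALLS_RING_IDX cons = *r->cons;
    int32_t error = *r->error;
    PVCALLS_RING_IDX avail, off, chunk;

    atomic_thread_fence(memory_order_acquire);  /* read memory barrier */

    *nread = 0;
    if (error)
        return error;

    /* Never read past prod: clamp len to the unconsumed bytes. */
    avail = prod - cons;
    if (len > avail)
        len = avail;

    /* Copy out in at most two pieces, wrapping at the end of the array. */
    off = cons & (r->size - 1);
    chunk = r->size - off;
    if (chunk > len)
        chunk = len;
    memcpy(buf, r->array + off, chunk);
    memcpy(buf + chunk, r->array, len - chunk);

    atomic_thread_fence(memory_order_seq_cst);  /* general memory barrier */
    *r->cons = cons + len;

    /* A real implementation would now notify the other end via evtchn. */
    *nread = len;
    return 0;
}
```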

The producer takes care of writing only as many bytes as are available in
the buffer up to *[in|out]_cons*. The consumer takes care of reading
only as many bytes as are available in the buffer up to *[in|out]_prod*.
*[in|out]_error* is set by the backend when an error occurs writing to or
reading from the socket.


[xenstore]: https://xenbits.xen.org/docs/unstable/misc/xenstore.txt
[XenbusStateInitialising]: https://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,io,xenbus.h.html
[address]: http://pubs.opengroup.org/onlinepubs/7908799/xns/syssocket.h.html
[in]: http://pubs.opengroup.org/onlinepubs/000095399/basedefs/netinet/in.h.html
[socket]: http://pubs.opengroup.org/onlinepubs/009695399/functions/socket.html
[connect]: http://pubs.opengroup.org/onlinepubs/7908799/xns/connect.html
[shutdown]: http://pubs.opengroup.org/onlinepubs/7908799/xns/shutdown.html
[bind]: http://pubs.opengroup.org/onlinepubs/7908799/xns/bind.html
[listen]: http://pubs.opengroup.org/onlinepubs/7908799/xns/listen.html
[accept]: http://pubs.opengroup.org/onlinepubs/7908799/xns/accept.html
[poll]: http://pubs.opengroup.org/onlinepubs/7908799/xsh/poll.html
[ring.h]: https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/ring.h;hb=HEAD