Benchmarking message passing and shared memory on macOS
March 2017
While briefly going over the different UNIX interprocess communication mechanisms in class, we saw how
powerful shared memory is: the OS hands a chunk of logical memory (backed by the same physical pages, pointed to by
both processes' page tables) to the process that requested it, then another process maps the segment into its own
virtual address space, so that both can read and write the same memory region. It's
cool, especially once you consider that the virtual addresses don't have to be the same in the two processes. It is
also said to be by far the fastest form of IPC available. Now I was wondering how much of this is actually true. Surely,
touching plain userland memory without any kernel mediation on each exchange promises high performance, but that pays
off mostly when sharing fairly large blocks of memory, and on top of that access between the processes has to be
coordinated with some synchronization mechanism. Message passing, on the other hand, can be quite fast for small
amounts of data and doesn't commit excessive resources either, even if its implementation may look more involved.
So let's see how the two differ on OS X, both in speed and in implementation (in reality they almost always
complement one another and are closely related). The POSIX (XSI) API offers msgsnd()
to send a message and msgrcv()
to receive one, roughly as in the sketch below.
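Just to have a picture of that API, here is a minimal sketch (the queue key, message type and buffer size are made up, error handling is omitted, and this is not what gets benchmarked below):

/* System V message queue sketch */
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <string.h>

struct my_msg { long mtype; char mtext[64]; };   /* hypothetical fixed-size payload */

void sketch(void)
{
    key_t key = 0x1234;                          /* any key both sides agree on */
    int qid = msgget(key, 0666 | IPC_CREAT);     /* create/obtain the queue */

    struct my_msg out = { .mtype = 1 };
    strcpy(out.mtext, "hello");
    msgsnd(qid, &out, sizeof(out.mtext), 0);     /* sender side */

    struct my_msg in;
    msgrcv(qid, &in, sizeof(in.mtext), 1, 0);    /* receiver side: blocks for a type-1 message */
}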
The OS X kernel, on the other hand, ultimately relies on message-passing principles and has a heavily object-oriented
design. It provides a handful of low-level abstractions: tasks (roughly the Mach counterpart of BSD processes, containers
for the threads executing in them), ports (kernel-maintained message queues, the channel through which tasks/threads
communicate with each other), port rights (capabilities that define whether a task may send messages to a port or
receive messages on it), the raw Mach messages themselves (the actual means of communication, always referenced via
ports), and some other objects. Almost all other forms of IPC (e.g. CFMessagePort and CFMachPort, XPC) end up using
Mach messages (XPC is much more fun, though). So say we want to exchange something between two processes (client/server
model). We allocate a new port in our current task (mach_task_self()
), which gives us the receive right, and then insert a send right for it as well, so that
messages can be sent to it:
kern_return_t kr;
mach_port_t srv_port;

kr = mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &srv_port);
if (kr != KERN_SUCCESS)
    __error("could not allocate port");

kr = mach_port_insert_right(mach_task_self(), srv_port, srv_port, MACH_MSG_TYPE_MAKE_SEND);
if (kr != KERN_SUCCESS)
    __error("could not insert right");
Port names are task-local though, so we need to make the server port reachable by the client. One way to let the client
know which port to talk to is to register the service globally with the bootstrap server (provided by launchd,
which manages all the other system services) by calling bootstrap_register2(mach_port_t bootstrap_port, const name_t service_name,
mach_port_t service_port, uint64_t flags)
(since bootstrap_register()
is marked as deprecated). Registration hands the bootstrap server a send right for service_port
and – once the client looks the service up – the client is given a send right as well
(which makes our earlier mach_port_insert_right()
somewhat redundant). After getting the bootstrap port with
task_get_bootstrap_port(mach_task_self(), &bootstrap_port)
, we will have something like this:
kr = bootstrap_register2(bootstrap_port, SERVICE_NAME, srv_port, 0);
if (kr != BOOTSTRAP_SUCCESS)
    __bootstrap_error("could not register port");
On the other side, the client – which allocates a receive port of its own as well (shown right after the lookup) – has to look up the server:
kr = bootstrap_look_up(bootstrap_port, SERVICE_NAME, &srv_port);
if (kr != KERN_SUCCESS)
    __bootstrap_error("could not find server port");
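The client port mentioned above is allocated just like the server's was; a one-liner sketch (clt_port is the name used by the client code further down):

mach_port_t clt_port;
kr = mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &clt_port);
if (kr != KERN_SUCCESS)
    __error("could not allocate client port");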
Note that we could also have retrieved the client's task port from its process ID (task_for_pid()
), or
made a port well-known (task_set_special_port()
); both, however, require elevated privileges. Messages are
sent to and received from other userland processes through the mach_msg() trap, mach_msg(mach_msg_header_t *msg, mach_msg_option_t option,
mach_msg_size_t send_size, mach_msg_size_t rcv_size, mach_port_name_t rcv_name, mach_msg_timeout_t timeout, mach_port_name_t notify)
,
which takes a pointer to the message (starting with a fixed-length mach_msg_header_t
structure) plus a few scalars:
whether the call is a send or a receive, the number of bytes to send or the maximum number to receive, the port to
receive on or to be notified through, and a timeout (none meaning wait indefinitely). So, on the server side, we first post a receive
that waits for the client's message (which carries the client's reply port), and then answer the client with some appropriate data:
mach_msg_header_t *snd_header, *rcv_header;
msg_snd_t snd_msg;
msg_rcv_t rcv_msg;

do {
    rcv_header = &(rcv_msg.header);
    rcv_header->msgh_local_port = srv_port;                        /* port on which the message should be received */
    rcv_header->msgh_size = sizeof(msg_rcv_t);
    /* this receive hands us the client's reply port */
    mach_msg(rcv_header, MACH_RCV_MSG, 0, rcv_header->msgh_size, srv_port, 0, 0);

    snd_header = &(snd_msg.header);
    /* reuse the disposition of the reply port carried by the received header */
    snd_header->msgh_bits = MACH_MSGH_BITS(MACH_MSGH_BITS_REMOTE(rcv_header->msgh_bits), 0);
    snd_header->msgh_remote_port = rcv_header->msgh_remote_port;   /* destination: the reply port just received */
    snd_header->msgh_local_port = MACH_PORT_NULL;
    snd_header->msgh_size = sizeof(msg_snd_t);
    strcpy(snd_msg.buffer, DATA);                                  /* a char array */

    kr = mach_msg((mach_msg_header_t*)&snd_msg, MACH_SEND_MSG, snd_header->msgh_size, 0, 0, 0, 0);
} while (kr != MACH_MSG_SUCCESS);
Here msg_snd_t
and msg_rcv_t
describe the message to send and the message to receive, respectively. Both consist
of a header followed by inline data, but sizeof(msg_rcv_t)
must be larger than sizeof(msg_snd_t)
: the kernel appends a few bytes on delivery, so the receive structure gets an extra trailer field to make room for them
and keep the exchange from failing.
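A plausible definition of the two structures for this inline-string case (the buffer size is an assumption; the extra trailer field is what matters):

#define BUFFER_SIZE 64                    /* assumption: large enough for DATA */

typedef struct {
    mach_msg_header_t header;
    char buffer[BUFFER_SIZE];             /* inline payload */
} msg_snd_t;

typedef struct {
    mach_msg_header_t header;
    char buffer[BUFFER_SIZE];
    mach_msg_trailer_t trailer;           /* room for the trailer the kernel appends on receive */
} msg_rcv_t;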
As for the client side, we first let the server know the local port it should reply to, and then wait for the data coming back from the server:
do {
    snd_header = &(snd_msg.header);
    /* a send right to local_port (which holds the receive right) is made for the server
       to reply to; the existing send right to remote_port is copied for the destination */
    snd_header->msgh_bits = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, MACH_MSG_TYPE_MAKE_SEND);
    snd_header->msgh_local_port = clt_port;
    snd_header->msgh_remote_port = srv_port;
    snd_header->msgh_size = sizeof(msg_snd_t);
    mach_msg(snd_header, MACH_SEND_MSG, snd_header->msgh_size, 0, 0, 0, 0);

    rcv_header = &(rcv_msg.header);
    rcv_header->msgh_bits = MACH_MSGH_BITS_LOCAL(snd_header->msgh_bits);
    rcv_header->msgh_local_port = clt_port;
    rcv_header->msgh_remote_port = srv_port;
    rcv_header->msgh_size = sizeof(msg_rcv_t);
    kr = mach_msg((mach_msg_header_t*)&rcv_msg, MACH_RCV_MSG, 0, rcv_header->msgh_size, clt_port, 0, 0);
} while (kr != MACH_MSG_SUCCESS);
The client, which by now should have received the message on its port, can then read the buffer (DATA
), which in this case was
an array of characters. Now say we want to exchange something bigger than a short string. It turns out that on OS X, when the payload is
not simple inline data, a message can carry out-of-line (OOL) data: essentially the address of a
region of the sender's virtual address space. Clearly, the bigger the chunk gets, the more wasteful physically copying it from the sender to
the receiver would be, so the pages are transferred by virtual copy (copy-on-write) and most of the time no actual copying
is performed at all. As far as I could figure out, depending on whether we want to hand over a plain void*
(e.g. a file or device
mapped into memory via the mmap()
syscall) or to share a mach_port_t
, we need to add an extra descriptor to the message: either
mach_msg_ool_descriptor_t
to pass out-of-line data or mach_msg_port_descriptor_t
to pass a port right around. The latter offers
another way to implement shared memory: reserve a region of memory (vm_allocate()
), obtain a named handle (mach_make_memory_entry_64()
)
to that memory object, and send the handle to other tasks, which can then map it into their own address spaces (vm_map()
).
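That second route is not the one taken below, but a rough sketch of it would look something like this (variable names are mine, error handling omitted):

/* Sender: reserve a region and wrap it in a named memory-entry port */
vm_address_t region = 0;
vm_size_t region_size = 4 << 20;
memory_object_size_t entry_size = region_size;
mach_port_t mem_entry = MACH_PORT_NULL;

vm_allocate(mach_task_self(), &region, region_size, VM_FLAGS_ANYWHERE);
mach_make_memory_entry_64(mach_task_self(), &entry_size, (memory_object_offset_t)region,
                          VM_PROT_READ | VM_PROT_WRITE, &mem_entry, MACH_PORT_NULL);

/* mem_entry then travels in a complex message as a port descriptor:
 *   descriptor.name        = mem_entry;
 *   descriptor.disposition = MACH_MSG_TYPE_COPY_SEND;
 *   descriptor.type        = MACH_MSG_PORT_DESCRIPTOR;
 */

/* Receiver: map the entry it got into its own address space */
vm_address_t mapped = 0;
vm_map(mach_task_self(), &mapped, region_size, 0, VM_FLAGS_ANYWHERE,
       mem_entry, 0, FALSE, VM_PROT_READ | VM_PROT_WRITE,
       VM_PROT_READ | VM_PROT_WRITE, VM_INHERIT_NONE);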
In both cases the message has to be marked as complex (MACH_MSGH_BITS_COMPLEX
). We go with the first route, so the sending and receiving message structures
become:
typedef struct {
    mach_msg_header_t header;
    mach_msg_body_t body;
    mach_msg_ool_descriptor64_t data;
} msg_snd_t;

typedef struct {
    mach_msg_header_t header;
    mach_msg_body_t body;
    mach_msg_ool_descriptor64_t data;
    mach_msg_trailer_t trailer;
} msg_rcv_t;
Here the header is unchanged and mach_msg_body_t
marks the beginning of the kernel-processed (descriptor) part of the message. We fill a 4 MB region of memory with data from, e.g., /dev/urandom
:
vm_address_t address = 0;
vm_size_t size = 4 << 20;

kr = vm_allocate(mach_task_self(), &address, size, VM_FLAGS_ANYWHERE);
if (kr != KERN_SUCCESS)
    __error("could not allocate memory");
...
memcpy((char*)address, buf, size);
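The elided part (the "...") is just filling buf from /dev/urandom; one way it could plausibly look, with error handling kept to a minimum:

char *buf = malloc(size);
int fd = open("/dev/urandom", O_RDONLY);
size_t off = 0;
while (off < size) {
    ssize_t n = read(fd, buf + off, size - off);
    if (n <= 0)
        break;                  /* good enough for a benchmark toy */
    off += (size_t)n;
}
close(fd);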
Maybe not the greatest idea, but at least we can be fairly sure the memory has actually been granted (comparing the process with vmmap
at the start and at the end,
MALLOC_SMALL
has grown). Next, on the sending side, we fill in the fields of the structures we just modified:
snd_header = &(snd_msg.header);
/* reuse the disposition of the reply port carried by the received header */
snd_header->msgh_bits = MACH_MSGH_BITS(MACH_MSGH_BITS_REMOTE(rcv_header->msgh_bits), 0);
snd_header->msgh_bits |= MACH_MSGH_BITS_COMPLEX;        /* complex message */
snd_header->msgh_remote_port = rcv_header->msgh_remote_port;
snd_header->msgh_size = sizeof(msg_snd_t);

snd_msg.body.msgh_descriptor_count = 1;                 /* one descriptor follows */
snd_msg.data.address = address;                         /* vm_allocate()'d address (uint64_t) */
snd_msg.data.size = (mach_msg_size_t)size;
snd_msg.data.deallocate = FALSE;                        /* deallocate manually */
snd_msg.data.copy = MACH_MSG_VIRTUAL_COPY;              /* ask the kernel to skip an actual copy of the data */
snd_msg.data.type = MACH_MSG_OOL_DESCRIPTOR;

mach_msg((mach_msg_header_t*)&snd_msg, MACH_SEND_MSG, snd_header->msgh_size, 0, 0, 0, 0);
And with that, the client can happily read the chunk of memory on its side.
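For completeness, the receive on the client side might look roughly like this (a sketch: the descriptor's address in the received message points at memory the kernel has mapped into the client's address space, and it's up to the client to vm_deallocate() it when done):

rcv_header = &(rcv_msg.header);
rcv_header->msgh_local_port = clt_port;
rcv_header->msgh_size = sizeof(msg_rcv_t);

kr = mach_msg(rcv_header, MACH_RCV_MSG, 0, sizeof(msg_rcv_t), clt_port, 0, 0);
if (kr == MACH_MSG_SUCCESS) {
    char *chunk = (char*)rcv_msg.data.address;          /* mapped into our address space by the kernel */
    /* ... read the 4 MB of random bytes ... */
    vm_deallocate(mach_task_self(), (vm_address_t)rcv_msg.data.address, rcv_msg.data.size);
}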
Concerning shared memory: on OS X it too is ultimately built on message passing (the kernel, built around the Mach
microkernel, is thoroughly message-oriented), just as messages make extensive use of shared memory. To leave Mach
messages out of the picture we'll go through the BSD layer: shmget(key, size, 0666 | IPC_CREAT)
(key
is just a number both sides agree on) returns an identifier for the shared memory segment, and shmat(shmid, NULL, 0)
(where shmid
identifies
the given segment) attaches it and gives back a pointer to it, roughly as in the sketch below. The only issue left is setting up a proper way of synchronizing the two processes.
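A sketch of that setup, run by both sides (SHM_SIZE and the key value are placeholders; __error() is the same helper used earlier, and the client could equally look the segment up without IPC_CREAT):

key_t key = 0x1234;                               /* any number both sides agree on */
int shmid = shmget(key, SHM_SIZE, 0666 | IPC_CREAT);
if (shmid == -1)
    __error("could not get shared memory segment");

void *shm_address = shmat(shmid, NULL, 0);        /* NULL: let the kernel pick the address */
if (shm_address == (void*)-1)
    __error("could not attach shared memory segment");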
File locking would be the easier thing to implement; System V semaphores are more involved, but they allow more control and flexibility.
Either way, keep in mind that we can atomically perform wait- and signal-style operations on the semaphore
through semop(int semid, struct sembuf *sops, size_t nsops)
, where each operation is described by a struct sembuf
(holding the semaphore number, the operation itself and some flags), after having initialized the semaphore's value to 0:
/* Server */
semid = semget(key, 1, 0666 | IPC_CREAT);
strcpy((char*)shm_address, DATA);

/* Signal the semaphore by incrementing it by 1 */
sem_operation.sem_num = 0;
sem_operation.sem_op = 1;
sem_operation.sem_flg = 0;
semop(semid, &sem_operation, 1);

/* If sem_val is 0 we return, otherwise wait for the client to read the data */
sem_operation.sem_num = 0;
sem_operation.sem_op = 0;
sem_operation.sem_flg = 0;
semop(semid, &sem_operation, 1);

/* Detach and release the shared memory */

/* Client */
while ((semid = semget(key, 1, 0666)) == -1);

/* Decrement sem_val so that we wait until sem_val gets back to 0 */
sem_operation.sem_num = 0;
sem_operation.sem_op = -1;
sem_operation.sem_flg = 0;
semop(semid, &sem_operation, 1);

printf("%s\n", (char*)shm_address);

/* Detach the shared memory */
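The detach/release comments at the end would translate into something like this (sketch; the server created the segment and the semaphore, so it also removes them):

/* both sides */
shmdt(shm_address);                  /* detach the segment from our address space */

/* server only */
shmctl(shmid, IPC_RMID, NULL);       /* mark the segment for removal */
semctl(semid, 0, IPC_RMID);          /* remove the semaphore set */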
Conclusions
For the sake of laziness, we simply time(1)
a process that forks and waits for its child: the child forks again, so that it and its own child execute the client and the server instance respectively.
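A sketch of that harness, timed as time ./harness (the client and server binary paths are placeholders):

#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* child: fork again so that client and server end up running side by side */
        if (fork() == 0)
            execl("./server", "server", (char*)NULL);    /* placeholder path */
        else
            execl("./client", "client", (char*)NULL);    /* placeholder path */
    } else {
        waitpid(pid, NULL, 0);                           /* parent just waits for its child */
    }
    return 0;
}

The averaged timings came out as follows: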
strlen("Hello World") + 1 |
4 MB of /dev/urandom |
|
---|---|---|
Message passing | 9ms | 14ms |
Shared memory | 11ms | 13ms |
I ran the tests several times and averaged the results; I'm not sure how reliable they really are (to get more plausible statistics we should probably have moved a lot more memory). The two look roughly equivalent: message passing is slightly faster for small amounts of data, while shared memory stays mostly constant in time. Times notwithstanding, it has been interesting to see how one method actually uses the other to achieve the exchange of data (and vice versa).
References
- Mac OS X and iOS Internals - To The Apple's Core by Jonathan Levin, 2012.