Benchmarking message passing and shared memory on macOS

March 2017

While briefly going over the different UNIX interprocess communication mechanisms in class, we saw how powerful shared memory is: the OS hands a chunk of memory to the process that requested it, another process then maps the same segment into its own virtual address space, and both can read and write the same region (the page tables of both processes end up pointing at the same physical pages). It's cool, especially when you consider that the virtual addresses need not be the same in the two processes. One selling point is that it is also said to be by far the fastest form of IPC available.

Now I was wondering how much of this is actually true. Surely, accessing userland memory directly, without kernel mediation for each exchange, makes for high performance, although this pays off mostly when sharing quite large blocks of memory, and access still has to be coordinated between the processes with some synchronization mechanism. Likewise, message passing can be pretty fast when dealing with small amounts of data and doesn't tie up excessive resources either, even if the implementation might seem more complicated.

So let's see how the two differ in terms of both speed and implementation on OS X (in reality they almost always complement one another and are closely related). The POSIX API includes msgsnd() to send a message and msgrcv() to receive one. The OS X kernel, on the other hand, ultimately relies on message-passing principles and has a rather object-oriented design. It provides low-level abstractions that are mostly transparent to the user: tasks (roughly BSD processes, containers for the threads executing in them), ports (kernel-maintained message queues, which are what allow tasks and threads to communicate with each other), rights (capabilities that define whether a task may send messages to a port or receive messages on it), raw Mach messages (the actual means of communication, addressed via ports) and some other objects. Almost all other forms of IPC (e.g. CFMessagePort, CFMachPort, XPC) end up using Mach messages (XPC is much more fun though).

So say we want to exchange something between two processes (client/server model). We allocate a new port in our own task (mach_task_self()) with a receive right, and then give the task a send right to the new port as well, so that messages can be sent to it:

kern_return_t kr;
mach_port_t srv_port;

kr = mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &srv_port);
if (kr != KERN_SUCCESS)
    __error("could not allocate port");

kr = mach_port_insert_right(mach_task_self(), srv_port, srv_port, MACH_MSG_TYPE_MAKE_SEND);
if (kr != KERN_SUCCESS)
    __error("could not insert right");

Port names are specific to a task though, so we need to make the port reachable by the client. One way to let the client know which port to talk to is to register the service globally with the bootstrap server (provided by launchd, which manages all other system services) by calling bootstrap_register2(mach_port_t bootstrap_port, const name_t service_name, mach_port_t service_port, uint64_t flags) (bootstrap_register() is marked as deprecated). Registration provides a send right for service_port, and a client that looks up the service is given a send right as well (so mach_port_insert_right() is kind of redundant). After getting the bootstrap port with task_get_bootstrap_port(mach_task_self(), &bootstrap_port), we will have something like this:

kr = bootstrap_register2(bootstrap_port, SERVICE_NAME, srv_port, 0);
if (kr != BOOTSTRAP_SUCCESS)
    __bootstrap_error("could not register port");

On the other side, the client – after having allocated a new client port too – will have to search for the server:

kr = bootstrap_look_up(bootstrap_port, SERVICE_NAME, &srv_port);
if (kr != KERN_SUCCESS)
    __bootstrap_error("could not find server port");

Note that we could also have retrieved the task port of the client given its process ID (task_for_pid()), or made a port well known (task_set_special_port()); both, however, require elevated privileges. Messages are sent to and received from other userland processes through the trap mach_msg(mach_msg_header_t *msg, mach_msg_option_t option, mach_msg_size_t send_size, mach_msg_size_t rcv_size, mach_port_name_t rcv_name, mach_msg_timeout_t timeout, mach_port_name_t notify), which takes a pointer to a fixed-length message header (a mach_msg_header_t structure) plus some scalars: whether the call should send or receive, the size in bytes of the message to send and of the receive buffer, the port to receive on, a timeout (it only applies when a timeout option is set; otherwise the call waits indefinitely) and a port for notifications. On the server side, we first prepare to receive the message that carries the client's port, and then reply to the client with some appropriate data:

mach_msg_header_t *snd_header, *rcv_header;
msg_snd_t snd_msg;
msg_rcv_t rcv_msg;
do {
    rcv_header = &(rcv_msg.header);
    rcv_header->msgh_local_port = srv_port; /* port on which the message should be received */
    rcv_header->msgh_size = sizeof(msg_rcv_t);
    /* receive the message that carries a send right to the client's port */
    mach_msg(rcv_header, MACH_RCV_MSG, 0, rcv_header->msgh_size, srv_port, MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);

    snd_header = &(snd_msg.header);
    /* the local-port right type of the received header becomes the right type used for the destination */
    snd_header->msgh_bits = MACH_MSGH_BITS_LOCAL(rcv_header->msgh_bits);
    snd_header->msgh_remote_port = rcv_header->msgh_remote_port; /* destination: the client port received above */
    snd_header->msgh_local_port = MACH_PORT_NULL; /* no reply expected */
    snd_header->msgh_size = sizeof(msg_snd_t);
    strcpy(snd_msg.buffer, DATA); /* a char array */
    kr = mach_msg((mach_msg_header_t*)&snd_msg, MACH_SEND_MSG, snd_header->msgh_size, 0, MACH_PORT_NULL, MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
} while (kr != MACH_MSG_SUCCESS);

Here the msg_snd_t and msg_rcv_t structures describe the message to send and the message to receive respectively. Both consist of a header and inline data, but sizeof(msg_rcv_t) must be greater than sizeof(msg_snd_t): the kernel appends some bytes (a trailer) to every received message, so the receiving structure ends with a trailer field to make sure the receive does not fail for lack of room. On the client side, we first let the server know which local port it should reply to, and then we wait for the data coming from the server:

do {
    snd_header = &(snd_msg.header);
    /* remote_port: copy the send right we hold for the server; local_port: make a send right from our receive right, so the server can reply */
    snd_header->msgh_bits = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, MACH_MSG_TYPE_MAKE_SEND);
    snd_header->msgh_local_port = clt_port;
    snd_header->msgh_remote_port = srv_port;
    snd_header->msgh_size = sizeof(msg_snd_t);
    mach_msg(snd_header, MACH_SEND_MSG, snd_header->msgh_size, 0, 0, 0, 0);

    rcv_header = &(rcv_msg.header);
    rcv_header->msgh_bits = MACH_MSGH_BITS_LOCAL(snd_header->msgh_bits);
    rcv_header->msgh_local_port = clt_port;
    rcv_header->msgh_remote_port = srv_port;
    rcv_header->msgh_size = sizeof(msg_rcv_t);
    kr = mach_msg((mach_msg_header_t*)&rcv_msg, MACH_RCV_MSG, 0, rcv_header->msgh_size, clt_port, 0, 0);
} while (kr != MACH_MSG_SUCCESS);

The client, which should have successfully received the message on its port, can then read the buffer (DATA), which in this case was an array of characters.

Now say we want to exchange something more than just a string. It turns out that on OS X, when passing something other than simple inline data (a memory object, or just a pointer), a message can carry a pointer to out-of-line (OOL) data, that is, the address of a region in the sender's virtual address space. Clearly, the bigger the chunk gets, the more inefficient it is to copy all the data from the sender to the receiver, so memory pages are transferred with virtual copy (copy-on-write) techniques and most of the time no actual copying is performed.

The way I figured it out, depending on whether we want to pass a void* (e.g. a file or device mapped into memory via the mmap() syscall) or share mach_port_t values, we need to include an additional structure: either mach_msg_ool_descriptor_t to pass out-of-line data, or mach_msg_port_descriptor_t to pass around a port right. The latter offers a way to implement shared memory: reserve a region of memory (vm_allocate()), get a named reference to the memory object (mach_make_memory_entry_64()) and send it to other tasks; after receiving the handle, a task can map the region into its own address space (vm_map()). In both cases the message needs to be marked as complex (MACH_MSGH_BITS_COMPLEX). We go with the first option, so the sending and receiving message structures change as follows:

typedef struct {                            typedef struct {
    mach_msg_header_t header;                   mach_msg_header_t header;
    mach_msg_body_t body;                       mach_msg_body_t body;
    mach_msg_ool_descriptor64_t data;           mach_msg_ool_descriptor64_t data;
} msg_snd_t;                                    mach_msg_trailer_t trailer;
                                            } msg_rcv_t;

The header is unchanged, and mach_msg_body_t marks the beginning of the kernel-processed data (it holds the descriptor count). We fill a 4 MB region of memory with data from, e.g., /dev/urandom:

vm_address_t address = 0;
vm_size_t size = 4 << 20;
kr = vm_allocate(mach_task_self(), &address, size, VM_FLAGS_ANYWHERE);
if (kr != KERN_SUCCESS)
    __error("could not allocate memory");
...
memcpy((char*)address, buf, size);
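The step elided above, which gets the random bytes into buf, is not shown in the original. What follows is only a sketch of one way it might be done with plain open()/read(); the helper name fill_random is made up for illustration and is not part of the original program.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: fill buf with len bytes read from /dev/urandom.
 * Returns 0 on success, -1 on failure. */
int fill_random(void *buf, size_t len)
{
    assert(buf != NULL);

    int fd = open("/dev/urandom", O_RDONLY);
    if (fd == -1)
        return -1;

    unsigned char *p = buf;
    size_t done = 0;
    while (done < len) {
        /* read() may return fewer bytes than requested, so loop */
        ssize_t n = read(fd, p + done, len - done);
        if (n > 0) {
            done += (size_t)n;
        } else if (n == -1 && errno == EINTR) {
            continue;            /* interrupted by a signal, retry */
        } else {
            close(fd);           /* read error or unexpected end of file */
            return -1;
        }
    }
    close(fd);
    return 0;
}
```

With such a helper, something like fill_random((void*)address, size) could presumably fill the vm_allocate()'d region directly, skipping the intermediate buffer and the memcpy().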

Filling the region with random bytes is maybe not exactly a great idea, but at least we can be pretty sure the memory has really been granted (comparing the state of the process at the beginning and at the end with vmmap, MALLOC_SMALL has grown). Next we set up the fields of the modified structures on the sending side:

snd_header = &(snd_msg.header);
snd_header->msgh_bits = MACH_MSGH_BITS_LOCAL(rcv_header->msgh_bits);
snd_header->msgh_bits |= MACH_MSGH_BITS_COMPLEX; /* complex message */
snd_header->msgh_remote_port = rcv_header->msgh_remote_port;
snd_header->msgh_size = sizeof(msg_snd_t);
snd_msg.body.msgh_descriptor_count = 1; /* follows one descriptor */
snd_msg.data.address = address; /* vm_allocate()'d address (uint64_t) */
snd_msg.data.size = (mach_msg_size_t)size;
snd_msg.data.deallocate = FALSE; /* deallocate manually */ 
snd_msg.data.copy = MACH_MSG_VIRTUAL_COPY; /* ask the kernel to skip actual copy of data */
snd_msg.data.type = MACH_MSG_OOL_DESCRIPTOR;
mach_msg((mach_msg_header_t*)&snd_msg, MACH_SEND_MSG, snd_header->msgh_size, 0, 0, 0, 0);

With that, the client will be happily reading the chunk of memory.
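For comparison with the Mach calls used so far, the msgsnd()/msgrcv() pair mentioned at the beginning can be sketched in a few lines. This is not the code that was benchmarked: the struct text_msg layout, the TEXT_MAX limit and the function name msgq_exchange are assumptions made for the example, and a forked child stands in for the separate client process. Error handling is kept to a minimum.

```c
#include <assert.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define TEXT_MAX 64

/* Hypothetical message layout: the queue calls expect a leading long type
 * field (> 0) followed by the payload. */
struct text_msg {
    long mtype;
    char mtext[TEXT_MAX];
};

/* A forked child sends `text` on a private queue; the parent receives it
 * into `out`. Returns 0 on success, -1 on failure. */
int msgq_exchange(const char *text, char *out, size_t outlen)
{
    assert(text != NULL && out != NULL && outlen > 0);

    int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    if (qid == -1)
        return -1;

    pid_t pid = fork();
    if (pid == 0) {                                /* child: sender */
        struct text_msg snd = { .mtype = 1 };      /* mtext starts zeroed */
        strncpy(snd.mtext, text, TEXT_MAX - 1);
        _exit(msgsnd(qid, &snd, strlen(snd.mtext) + 1, 0) == 0 ? 0 : 1);
    }

    int rc = -1;
    if (pid > 0) {                                 /* parent: receiver */
        struct text_msg rcv;
        /* blocks until a message of type 1 is available */
        if (msgrcv(qid, &rcv, TEXT_MAX, 1, 0) > 0) {
            strncpy(out, rcv.mtext, outlen - 1);
            out[outlen - 1] = '\0';
            rc = 0;
        }
        waitpid(pid, NULL, 0);
    }
    msgctl(qid, IPC_RMID, NULL);                   /* remove the queue */
    return rc;
}
```

Unlike the Mach version, the queue is identified by a plain integer ID rather than by a port right, and the kernel copies the payload on both send and receive.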

As for shared memory: on OS X it is mostly accomplished through message passing (Mach being, as a microkernel, message-oriented), just as messages make extensive use of shared memory. To get rid of messages we will therefore take advantage of the BSD layer, using shmget(key, size, 0666 | IPC_CREAT) (key is just a number both sides agree on) to obtain the identifier of a shared memory segment, and shmat(shmid, NULL, 0) (where shmid identifies that segment) to get a pointer to it.

The only issue is setting up proper synchronization between the two processes. File locking would be easier to implement; System V semaphores are more complicated, although they allow for more control and flexibility. Anyway, just keep in mind that wait and signal operations can be performed atomically on a semaphore through semop(int semid, struct sembuf *sops, size_t nsops), where each operation is described by a struct sembuf (the semaphore number, the operation and some flags), after the semaphore's value has been initialized to 0:

/* Server */                                /* Client */
semid = semget(key, 1, 0666|IPC_CREAT);     while((semid = semget(key, 1, 0666)) == -1);

strcpy((char*)shm_address, DATA);

/* Signal: increment sem_val by 1 */        /* Wait until sem_val is at least 1, */
                                            /* then decrement it */
sem_operation.sem_num = 0;                  sem_operation.sem_num = 0;
sem_operation.sem_op = 1;                   sem_operation.sem_op = -1;
sem_operation.sem_flg = 0;                  sem_operation.sem_flg = 0;
semop(semid, &sem_operation, 1);            semop(semid, &sem_operation, 1);
                                            printf("%s\n", (char*)shm_address);

/* Block until sem_val is back to 0, */
/* i.e. until the client has taken it */
sem_operation.sem_num = 0;
sem_operation.sem_op = 0;
sem_operation.sem_flg = 0;
semop(semid, &sem_operation, 1);

/* Detach and release the shared memory */  /* Detach the shared memory */
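The two columns above can be folded into a single compilable sketch. It makes a few assumptions that differ from the original: a forked child plays the client, IPC_PRIVATE replaces the agreed-upon key, the semaphore is explicitly set to 0 with semctl(), and the names shm_handoff, sem_change, cleanup and union sem_arg are invented for the example. Instead of printing, the client reports through its exit status whether it read the expected string.

```c
#include <assert.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SHM_SIZE 4096

/* Same layout as union semun, under another name (an assumption of this
 * sketch): some systems define semun in <sys/sem.h>, others do not. */
union sem_arg { int val; struct semid_ds *buf; unsigned short *array; };

/* Apply one operation (+1 signal, -1 wait, 0 wait-for-zero) to semaphore 0. */
static int sem_change(int semid, short op)
{
    struct sembuf sb = { .sem_num = 0, .sem_op = op, .sem_flg = 0 };
    return semop(semid, &sb, 1);
}

static void cleanup(int shmid, int semid)
{
    if (shmid != -1) shmctl(shmid, IPC_RMID, NULL);
    if (semid != -1) semctl(semid, 0, IPC_RMID);
}

/* Server (parent) writes text into the segment, client (child) reads it.
 * Returns 0 if the client saw exactly `text`, -1 otherwise. */
int shm_handoff(const char *text)
{
    assert(text != NULL && strlen(text) < SHM_SIZE);

    int shmid = shmget(IPC_PRIVATE, SHM_SIZE, IPC_CREAT | 0600);
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    union sem_arg zero = { .val = 0 };
    pid_t pid = -1;

    if (shmid != -1 && semid != -1 && semctl(semid, 0, SETVAL, zero) != -1)
        pid = fork();
    if (pid == -1) {
        cleanup(shmid, semid);
        return -1;
    }

    if (pid == 0) {                              /* client */
        char *shm = shmat(shmid, NULL, 0);
        if (shm == (void *)-1 || sem_change(semid, -1) == -1)
            _exit(2);
        int same = strcmp(shm, text) == 0;       /* read what the server wrote */
        shmdt(shm);
        _exit(same ? 0 : 1);
    }

    int ok = 0, status = 0;                      /* server */
    char *shm = shmat(shmid, NULL, 0);
    if (shm != (void *)-1) {
        strcpy(shm, text);
        ok = sem_change(semid, 1) == 0           /* signal: data is ready */
          && sem_change(semid, 0) == 0;          /* wait until the client took it */
        shmdt(shm);
    }
    /* The segment lives on until the client detaches; removing the semaphore
     * wakes a client that is still blocked, with an error. */
    cleanup(shmid, semid);
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return (ok && WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Note that the wait-for-zero only tells the server that the client has decremented the semaphore, not that it has finished reading; in this sketch it is the waitpid() call that actually guarantees the client is done.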

Conclusions

For the sake of laziness, we just time(1) a process that forks and waits for its child; the child forks again, so that it and the new child each run one of the client and server instances.

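|                 | `strlen("Hello World") + 1` | 4 MB of `/dev/urandom` |
|-----------------|-----------------------------|------------------------|
| Message passing | 9 ms                        | 14 ms                  |
| Shared memory   | 11 ms                       | 13 ms                  |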

I ran multiple tests and took an average; I don't know how reliable these results are (to get more plausible statistics we should perhaps have allocated much more memory). The two look roughly equivalent, but yes, message passing is slightly faster when dealing with small amounts of data, while shared memory stays mostly constant in time. Times notwithstanding, it's been interesting to see how each method actually uses the other to achieve the exchange of data.

References

Mac OS X and iOS Internals - To The Apple's Core by Jonathan Levin, 2012.