Using Memcache at Scale (Facebook)

My last post summarized why I chose memcached over Redis for our system's caching layer. Memcached fit our desired feature set and was also faster, probably because our workload is relatively simple: set and get whole objects, with no partial updates.

In this post I’ll be summarizing some things I’ve learned about memcached from the Facebook tech talk and tech post.

TCP vs UDP

One of the tradeoffs that Facebook seems to have made is using UDP instead of TCP as the transport protocol. My understanding of the protocols is fairly shallow, but I do understand that one tradeoff is the lack of acknowledgements with UDP, which means we give up the guarantee that packets arrive at all, or in the order they were sent. TCP is connection-oriented and stream-based, while UDP is datagram-oriented: each packet is an independent message with no persistent connection behind it. Packets can be lost, but there is no per-connection state to maintain, which means less memory spent on the transport.
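To make that concrete, here's a rough sketch of what a memcached get over UDP could look like. It assumes a local memcached started with UDP enabled (the -U option) and follows my reading of the protocol docs: each UDP frame starts with an 8-byte header (request ID, sequence number, datagram count, and a reserved field, 16 bits each) followed by the usual ASCII command. Treat the details as illustrative rather than authoritative.

```python
import socket
import struct

HOST, PORT = "127.0.0.1", 11211   # assumes a local memcached listening for UDP


def udp_get(key, request_id=0):
    # 8-byte UDP frame header: request id, sequence number, datagram count, reserved.
    frame_header = struct.pack("!HHHH", request_id, 0, 1, 0)
    command = f"get {key}\r\n".encode()

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(1.0)          # no connection, no ACKs: we just wait and hope
    try:
        sock.sendto(frame_header + command, (HOST, PORT))
        data, _addr = sock.recvfrom(1400)
    except socket.timeout:
        return None               # the request or the reply may simply be lost
    finally:
        sock.close()
    return data[8:]               # strip the frame header from the reply


if __name__ == "__main__":
    print(udp_get("some_key"))
```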

Server optimizations

This section interested me greatly. I especially liked how they tweaked the memory slabs memcached allocates for stored objects. By default the slab classes grew in powers of 2 (256, 512, and so on), but after analyzing their cached objects they found that a growth factor of roughly 1.3 resulted in much more efficient memory usage. I think that's brilliant! In our case the stored objects are almost always the same size, so if we size the server-side allocations correctly we could minimize memory fragmentation. Note to self: find out the size distribution of the objects we intend to store.
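To get a feel for the difference, here's a quick sketch of how slab size classes grow under different growth factors. The idea of a base chunk size and a configurable factor mirrors memcached's -n and -f options as I understand them, but the exact numbers below are just illustrative.

```python
def slab_classes(base=48, factor=1.25, max_size=1024 * 1024):
    """Generate slab chunk sizes: each class is the previous size times the factor."""
    sizes = []
    size = base
    while size <= max_size:
        sizes.append(int(size))
        size *= factor
    return sizes


doubling = slab_classes(factor=2.0)   # power-of-two style classes
tuned = slab_classes(factor=1.3)      # the ~1.3 factor mentioned in the talk

print(len(doubling), "classes with factor 2.0:", doubling[:8], "...")
print(len(tuned), "classes with factor 1.3:", tuned[:8], "...")
```

More, finer-grained classes mean the chosen chunk is closer to the object's actual size, so less memory is wasted per stored object.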

Another really cool optimization was using a different system call for write operations (writev instead of write). This is also known as scatter/gather IO or vectored IO: multiple write buffers are handed to the kernel in a single call and written out as one continuous stream. It's unclear to me right now whether we'll need such low-level optimizations, but it got me thinking about memcached memory management and the linear nature of both our writes and our evictions.

Basically, we want a TTL for each entry in the cache, and if every object has the same TTL then we should be able to minimize memory fragmentation by evicting objects in the same order they were stored. In terms of storage and eviction, the objects could then be treated as a contiguous array: each new object is appended at the end, and eviction only happens in order from the beginning. Essentially this is a circular buffer, where each new object overwrites the oldest object still in the cache.
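As a reminder to myself, here's roughly what scatter/gather IO looks like from userspace. This is just a toy demonstration using Python's os.writev (Unix-only); the file path and buffers are made up, but the point is that several buffers go out in one system call.

```python
import os

header = b"VALUE mykey 0 5\r\n"   # hypothetical memcached-style response header
body = b"hello"
trailer = b"\r\nEND\r\n"

fd = os.open("/tmp/writev_demo.out", os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
try:
    # One syscall, three buffers, written back-to-back as a single stream.
    written = os.writev(fd, [header, body, trailer])
    print(f"wrote {written} bytes in a single writev() call")
finally:
    os.close(fd)
```

And here's a tiny sketch of the circular-buffer idea, under the assumption that every entry shares the same TTL so insertion order equals expiry order. The class and method names are mine, and this is nowhere near what a real cache needs; it's just to convince myself the bookkeeping stays simple.

```python
import time


class RingCache:
    """Toy FIFO cache: a fixed ring of slots, overwritten strictly in insertion order."""

    def __init__(self, capacity, ttl_seconds):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.slots = [None] * capacity   # each slot: (key, value, stored_at)
        self.index = {}                  # key -> slot position
        self.head = 0                    # next slot to overwrite

    def set(self, key, value):
        old = self.slots[self.head]
        # Evict whatever occupies the oldest slot, unless that key was re-stored
        # in a newer slot since then.
        if old is not None and self.index.get(old[0]) == self.head:
            del self.index[old[0]]
        self.slots[self.head] = (key, value, time.time())
        self.index[key] = self.head
        self.head = (self.head + 1) % self.capacity

    def get(self, key):
        pos = self.index.get(key)
        if pos is None:
            return None
        stored_key, value, stored_at = self.slots[pos]
        if stored_key != key or time.time() - stored_at > self.ttl:
            return None                  # slot was reused, or the entry expired
        return value
```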

Conclusion

It's actually pretty amazing, but the scale Facebook is publishing seems to match the scale we would be operating at, and their numbers are in line with the performance I expect given my previous benchmarking. Specifically:

Since we’ve made all these changes, we have been able to scale memcached to handle 200,000 UDP requests per second with an average latency of 173 microseconds. The total throughput achieved is 300,000 UDP requests/s, but the latency at that request rate is too high to be useful in our system. This is an amazing increase from 50,000 UDP requests/s using the stock version of Linux and memcached.

My back-of-the-envelope calculations estimate that our workload will be around the scale of 200K stores/s per datacenter.