An evaluation of Erlang global process registries: meet Syn

Due to my personal interests and history, I often find myself building applications in the field of the Internet of Things. Most of the time I end up using Erlang: it is based on the Actor Model and is a perfect ideological (and practical) match for managing IoT interactions.

I recently built an application where devices can connect and interact with each other. Every device is identified by a unique ID (its serial number), and based on this ID devices can send and receive messages. Nothing new here: it's a standard messaging platform that supports a custom protocol.

Due to the large number of devices I needed to support, this application runs on a cluster of Erlang nodes. Once a device connects to one of these nodes, the related TCP socket events are handled by a process running on that node. To send a message to a specific device, you send a message to the process that handles that device's TCP socket.

While building this application, I was faced early on with a very common problem: I needed a global process registry that would allow me to register a process globally based on its serial number, so that messages could be sent to it from anywhere in the cluster. This registry needed to have the following main characteristics:

  • Distributed.
  • Fast write speeds (>10,000 / sec).
  • Handle naming conflict resolution.
  • Allow for the addition/removal of nodes.

Therefore I started to search for possible solutions (which included posting to the Erlang Questions mailing list), and these came out as my options: Erlang's native global module, Erlang's native pg2 module, Gproc, and CloudI Process Groups (cpg). Each of these is discussed in turn below.

The Stress Test

I decided to evaluate each of these solutions based on a variety of considerations. However, I also wanted to see how they would perform when submitted to some kind of stress test, so I defined and wrote a simple one that:

  1. Launches a certain number of processes per node (for example, 25,000 processes per node).
  2. Registers these processes (25,000 processes per node), each with a globally unique Key.
  3. Waits for those Keys to be propagated to all the nodes.
  4. Unregisters all of these processes.
  5. Waits for those Keys to be removed from all the nodes.
  6. Re-registers all of the processes, to check for unwanted effects of subsequent add/remove operations.
  7. Again, waits for those Keys to be propagated to all the nodes.
  8. Kills all the processes (this time, without previously unregistering them).
  9. Waits for those Keys to be removed from all the nodes (to check for process monitoring).

The test measures how long each one of these steps takes.

The following is the code for this stress test. You can see that it defines a behaviour: this makes it possible to implement callback modules matching the different syntaxes used by the different libraries.
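(A minimal sketch of such a behaviour follows; the registry_bench module name and the callback signatures are illustrative assumptions on my part, not necessarily the original benchmark code.)

    %% registry_bench: a minimal sketch of the benchmark behaviour.
    %% Module name and callback signatures are illustrative assumptions.
    -module(registry_bench).

    %% Called once per node before the test runs (e.g. to boot the library).
    -callback start() -> ok.

    %% Register Pid under the globally unique Key.
    -callback register(Key :: any(), Pid :: pid()) -> ok.

    %% Unregister the Key / Pid pair.
    -callback unregister(Key :: any(), Pid :: pid()) -> ok.

    %% Resolve a Key to its Pid, or undefined when not registered.
    -callback whereis_key(Key :: any()) -> pid() | undefined.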

To run this stress test:

For instance, to launch it with the callback module global_bench for 100,000 processes running on a cluster of 4 nodes ['1@127.0.0.1', '2@127.0.0.1', '3@127.0.0.1', '4@127.0.0.1']:
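(A hypothetical invocation from the first node's Erlang shell; the bench:start/3 entry point is an assumption on my part, not the benchmark's actual API.)

    %% Hypothetical entry point: callback module, total process count,
    %% and the nodes of the cluster. bench:start/3 is assumed, not real.
    bench:start(global_bench, 100000, ['1@127.0.0.1', '2@127.0.0.1',
                                       '3@127.0.0.1', '4@127.0.0.1']).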

Running this test returns an output with the timings of each of the steps described above; these are the numbers reported in the results tables below.

The Process Registry Libraries

The following are my considerations for each solution.

1. Erlang’s native global module

Considerations

The Erlang global module has native functionality to support a global process registry. I was not particularly attracted to it, because:

  • I have always thought that this module should be used to register an application's long-running services.
  • I didn't know whether millions of entries could be supported. This module wasn't built with my use case in mind: as per my previous point, it is generally used to register long-running processes.
  • It has a locking mechanism to ensure that the registration is atomic. I felt this could become a serious bottleneck to the registration of processes.

However, this is a native Erlang module, which also allows you to define a resolve function to be used for conflict resolution (i.e. in case of race conditions, or during net splits, when a Key gets registered simultaneously on two different nodes). It satisfies the distributed requirements out of the box, with no need for additional libraries.

Stress Test

I gave my stress test a go, with the following callback module:
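(A hedged sketch rather than the verbatim module: it assumes the registry_bench behaviour sketched above, while the global calls themselves, register_name/2, unregister_name/1 and whereis_name/1, are the module's documented API.)

    -module(global_bench).
    -behaviour(registry_bench).
    %% register/2 clashes with the auto-imported BIF, so disable the import.
    -compile({no_auto_import, [register/2]}).
    -export([start/0, register/2, unregister/2, whereis_key/1, process_loop/0]).

    start() -> ok.

    register(Key, Pid) ->
        %% A custom resolve fun could be passed as a third argument,
        %% e.g. global:register_name(Key, Pid, fun global:notify_all_name/3).
        yes = global:register_name(Key, Pid),
        ok.

    unregister(Key, _Pid) ->
        global:unregister_name(Key),
        ok.

    whereis_key(Key) ->
        global:whereis_name(Key).

    %% The loop run by every test process: it does nothing except
    %% keep the process alive.
    process_loop() ->
        receive
            _ -> process_loop()
        end.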

Note that process_loop (the loop running in the test processes) does nothing except keep the process alive.

The results of the stress test are:

                                 1 Node    2 Nodes   3 Nodes   4 Nodes
Reg / second                     27,233    2,673     1,997     1,579
Retrieve registered Key (ms)     0         0         0         0
Unreg / second                   29,491    2,908     2,206     1,596
Retrieve unregistered Key (ms)   0         0         0         0
Re-Reg / second                  27,149    2,993     2,131     2,542
Retrieve re-registered Key (ms)  0         0         0         0
Retrieve Key of killed Pid (ms)  0         timeout   timeout   timeout

Conclusions

  • The locking mechanism heavily influences the drop in performance that can be seen when adding nodes. With a cluster of 2+ nodes we are already under the spec of 10,000 registrations / second.
  • Process monitoring is slow. After killing all the processes, in a cluster of 2+ nodes it takes more than 60 seconds for global:whereis_name/1 to return undefined (this is what timeout means in the table above). I had to decrease the number of processes to around 80,000 to have the stress test pass on a cluster of 4 nodes, and even then it took around 55 seconds for a killed process's Key to be removed from the registry.

For these reasons, it didn’t look like I could use this module.

 

2. Erlang’s native pg2 module

Considerations

Erlang's pg2 module has native functionality to support a global process registry. I was not particularly attracted to it either, because:

  • This module handles Process Groups, which is very different from handling unique Registered Names. We can still use it for our purpose, though, by creating Groups with a single entry: the groups are named after our Keys, and every Group has a single member, the Pid that we are registering. This is kind of a trick, but it's not a showstopper.
  • Having Process Groups basically means that conflict resolution isn't covered. If two processes are registered on different nodes with the same Key (because of race conditions, or during a net split), the result is a Process Group with two members instead of one. Sometimes this is fine; however, I wanted to ensure that there would be a clearly identified single Pid per device in the whole system. Not a showstopper either, but a turn-off.
  • I didn't know whether millions of entries could be supported. This module wasn't built with my use case in mind.
  • Here too, there is a locking mechanism to ensure that registration is atomic, which could become a bottleneck to the registration of processes.

Stress Test

Here’s the callback module:
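(A hedged sketch, assuming the registry_bench behaviour above: each Key maps to a pg2 group with a single member, and the test processes run the same kind of process_loop shown in the global sketch.)

    -module(pg2_bench).
    -behaviour(registry_bench).
    -compile({no_auto_import, [register/2]}).
    -export([start/0, register/2, unregister/2, whereis_key/1]).

    start() -> ok.

    register(Key, Pid) ->
        ok = pg2:create(Key),      %% one group per Key
        ok = pg2:join(Key, Pid).   %% with a single member: the Pid

    unregister(Key, Pid) ->
        ok = pg2:leave(Key, Pid),
        ok = pg2:delete(Key).

    whereis_key(Key) ->
        case pg2:get_members(Key) of
            [Pid | _] -> Pid;
            _ -> undefined    %% empty group or {error, {no_such_group, Key}}
        end.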

The results of the stress test are:

                                 1 Node    2 Nodes   3 Nodes   4 Nodes
Reg / second                     25,062    3,823     2,914     1,862
Retrieve registered Key (ms)     0         0         0         0
Unreg / second                   39,522    6,903     5,191     3,425
Retrieve unregistered Key (ms)   0         0         0         0
Re-Reg / second                  25,701    3,794     2,783     1,817
Retrieve re-registered Key (ms)  0         0         0         0
Retrieve Key of killed Pid (ms)  timeout   timeout   timeout   timeout

Conclusions

  • The locking mechanism heavily influences the drop in performance that can be seen when adding nodes. With a cluster of 2+ nodes we are already under the spec of 10,000 registrations / second.
  • Process monitoring is slow. After killing all the processes, even on a single node it takes more than 60 seconds for pg2:get_members/1 to report that a group no longer exists. I had to decrease the number of processes to around 45,000 to have the stress test pass on a cluster of 4 nodes, and even then it took a little less than 60 seconds for a killed process's Key to be removed from the registry.

For these reasons, it didn’t look like I could use this module.

 

3. Gproc

Considerations

gproc is a well-known process registry, normally used for the additional features it provides on top of Erlang's native process dictionary (for instance, pub/sub patterns). It is a solid and well-supported library, and you can often see Ulf Wiger (one of the library's authors) generously providing support for it.

However, there were some concerns I had:

  • For the distributed part it relies on gen_leader, about which I've heard too many horror stories (maybe that's not a thing anymore). Ulf pointed me to a gproc branch that uses locks_leader instead, where he is now concentrating his efforts on gproc's support for distributed operations.
  • I felt that the main purpose of this library is not so much to provide a distributed process registry as to extend the existing Erlang registration mechanisms with additional features. The README on gproc's GitHub page clearly describes it as an "Extended process dictionary"; it just felt that the distributed part hasn't been the primary focus of this library's development.
  • I could not understand how conflict resolution is managed in a distributed environment.

Stress Test

Here’s the callback module:
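(A hedged sketch, assuming the registry_bench behaviour above. The gproc calls, gproc:reg/1, gproc:unreg/1 and gproc:lookup_pid/1, are its documented API, used here with globally scoped names {n, g, Key}.)

    -module(gproc_bench).
    -behaviour(registry_bench).
    -compile({no_auto_import, [register/2]}).
    -export([start/0, register/2, unregister/2, whereis_key/1, process_loop/0]).

    start() ->
        %% Assumes gproc has been configured for distributed operation
        %% (e.g. via the gproc_dist application environment variable).
        {ok, _} = application:ensure_all_started(gproc),
        ok.

    %% gproc only lets a process register itself, so we ask the target
    %% process to do it, and block on an ack to keep the call synchronous.
    register(Key, Pid) ->
        Pid ! {register, self(), Key},
        receive registered -> ok end.

    unregister(Key, Pid) ->
        Pid ! {unregister, self(), Key},
        receive unregistered -> ok end.

    whereis_key(Key) ->
        try
            gproc:lookup_pid({n, g, Key})
        catch
            _:_ -> undefined
        end.

    process_loop() ->
        receive
            {register, From, Key} ->
                true = gproc:reg({n, g, Key}),
                From ! registered,
                process_loop();
            {unregister, From, Key} ->
                true = gproc:unreg({n, g, Key}),
                From ! unregistered,
                process_loop()
        end.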

Note: in gproc, to ensure thread safety, a process can only set its own values. That's why the register/2 and unregister/2 callbacks above send messages to the processes, which then register or unregister themselves (see process_loop). As you can see, I've made these calls blocking (by using a receive block), to emulate the blocking calls used with the other libraries.

The results of the stress test are:

                                 1 Node    2 Nodes   3 Nodes   4 Nodes
Reg / second                     67,011    19,111    22,048    15,659
Retrieve registered Key (ms)     0         0         0         0
Unreg / second                   118,228   22,845    24,282    22,312
Retrieve unregistered Key (ms)   0         0         0         0
Re-Reg / second                  127,200   22,115    25,884    20,228
Retrieve re-registered Key (ms)  0         0         0         0
Retrieve Key of killed Pid (ms)  178       1,890     7,584     10,600

Conclusions

  • These are overall very good results.
  • I didn't need to reduce the process count to make all of the tests pass.
  • Process monitoring could be faster. After killing all the processes, on a cluster of 4 nodes it takes more than 10 seconds for gproc:lookup_pid/1 to stop finding the Pid of an exited process.
  • Unfortunately, I had some inconsistent results running this test on a cluster of 2+ nodes. Often, the test could not retrieve the registered Key (after the first registration round) in less than 60 seconds, and timed out.

I was a little skeptical, though, about the inconsistency I saw in the test results, which might be related to the gen_leader issues that I've occasionally heard about. The author's choice to move towards locks_leader might be a sign of this. Despite these thoughts, this looked like a good potential candidate.

 

4. CloudI Process Groups

Considerations

cpg is an actively maintained library, and its main author Michael Truog is always very available to discuss his choices and provide support. cpg deals with Process Groups and not unique Registered Names, so my concerns were similar to the ones I had with pg2:

  • Handling Process Groups is very different from handling unique Registered Names. We can use the same trick used with pg2, i.e. creating Process Groups named after each Key, each with a single entry (the Pid).
  • Here too, having Process Groups basically means that conflict resolution isn't covered. This made me a little uncomfortable, because I wanted to ensure that there would be a clearly identified single Pid per device in the whole system.

Stress Test

Here’s the callback module:
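(A hedged sketch, assuming the registry_bench behaviour above, and using cpg's via-style name registration in the default scope. The exact calls here, cpg:register_name/2, cpg:unregister_name/1 and cpg:whereis_name/1, are my assumption of the API, so check the cpg documentation before relying on them.)

    -module(cpg_bench).
    -behaviour(registry_bench).
    -compile({no_auto_import, [register/2]}).
    -export([start/0, register/2, unregister/2, whereis_key/1]).

    start() ->
        {ok, _} = application:ensure_all_started(cpg),
        ok.

    register(Key, Pid) ->
        %% One single-member group per Key, via the name-registration calls.
        yes = cpg:register_name(Key, Pid),
        ok.

    unregister(Key, _Pid) ->
        cpg:unregister_name(Key),
        ok.

    whereis_key(Key) ->
        cpg:whereis_name(Key).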

The results of the stress test are:

                                 1 Node    2 Nodes   3 Nodes   4 Nodes
Reg / second                     110,198   42,680    20,703    8,488
Retrieve registered Key (ms)     0         0         0         0
Unreg / second                   109,374   32,264    25,599    15,128
Retrieve unregistered Key (ms)   0         1         0         0
Re-Reg / second                  126,791   30,862    32,138    20,791
Retrieve re-registered Key (ms)  0         0         0         0
Retrieve Key of killed Pid (ms)  error     error     error     error

Conclusions

  • These are overall very good results.
  • I was surprised by the major drop on a cluster of 4 nodes. I ran this test multiple times, and it always returned similar results.
  • The monitoring of processes didn't work appropriately. Even on a single node, the test experienced an internal timeout (the error entries in the table above).

I had to decrease the number of processes to around 25,000 to have the stress test pass on a cluster of 4 nodes. The monitoring issue didn't make me feel particularly at ease; however, this library did look like a potential candidate.

 

5. Custom Solution: Syn

Considerations

Since it became clear that I could not use Erlang's native global or pg2 modules, and that the two other libraries I looked into were candidates but each with its own little quirks, I decided to try a custom solution, which I called syn (short for synonym).

In any distributed system you are faced with a consistency challenge, which is often resolved by having one master arbiter performing all write operations (chosen through a leader election mechanism), or through atomic transactions. As said above, I needed a global process registry for an application in the IoT field. In this context, the Keys used to identify a process are often the physical object's unique identifier (for instance, a serial number or MAC address), and are therefore already defined and unique before they ever hit the system. The consistency challenge is less of a problem in this case, since the likelihood of concurrent incoming requests that would register processes with the same Key is extremely low and, in most cases, acceptable.

Therefore, Availability has been chosen over Consistency, and Syn is eventually consistent.

Under the hood, Syn performs dirty reads and writes into a distributed in-memory Mnesia table, replicated across all the nodes of the cluster. This made me comfortable that I wouldn't need to reinvent the replication mechanisms of Erlang's native DB; however, I still needed a way to handle conflict resolution and net splits. For this reason, Syn automatically manages conflict resolution by implementing a specialized and simplified version of the mechanisms used in Ulf Wiger's unsplit framework.
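To illustrate the kind of access involved (this is a hypothetical table of the same shape, not Syn's actual schema), a dirty write and a dirty read against a replicated Mnesia table look like this:

    %% Hypothetical table of {registry, Key, Pid} records; Syn's actual
    %% schema may differ. Dirty operations skip transactions and locks.
    ok = mnesia:dirty_write({registry, Key, Pid}),
    case mnesia:dirty_read(registry, Key) of
        [{registry, Key, FoundPid}] -> FoundPid;
        [] -> undefined
    end.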

You can read more about Syn in its GitHub repo.

Stress Test

Here’s the callback module:
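(A hedged sketch, assuming the registry_bench behaviour above. The syn calls reflect the API as it was around the time of writing; newer versions of Syn have since reorganized the API, so check the repo for the current one.)

    -module(syn_bench).
    -behaviour(registry_bench).
    -compile({no_auto_import, [register/2]}).
    -export([start/0, register/2, unregister/2, whereis_key/1]).

    start() ->
        {ok, _} = application:ensure_all_started(syn),
        ok.

    register(Key, Pid) ->
        ok = syn:register(Key, Pid).

    unregister(Key, _Pid) ->
        ok = syn:unregister(Key).

    whereis_key(Key) ->
        %% Returns the Pid, or undefined when the Key is not registered.
        syn:find_by_key(Key).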

The results of the stress test are:

                                 1 Node    2 Nodes   3 Nodes   4 Nodes
Reg / second                     106,324   52,792    60,958    40,929
Retrieve registered Key (ms)     0         0         0         56
Unreg / second                   105,506   50,591    67,042    42,896
Retrieve unregistered Key (ms)   0         0         0         0
Re-Reg / second                  106,424   51,322    77,258    47,125
Retrieve re-registered Key (ms)  0         0         0         0
Retrieve Key of killed Pid (ms)  719       995       1,577     1,825

Conclusions

  • These are overall very good results. I'm not sure why Syn performs better with 3 nodes than with 2 (and I've repeated this test more than once).
  • I didn't need to reduce the process count to make all of the tests pass.
  • The monitoring of processes worked appropriately.

 

Final notes

I want to stress how difficult comparisons and tests like these are to perform. Every library behaves differently, and it is hard (if not impossible) to define a common stress test that allows for a proper understanding of their performance levels. I gave it a go, but looking at the definition of my stress test above, for instance, I ask myself: "Why did I set the process count to 100,000? I can see that most libraries behave fine with lower numbers". Also: "What would happen if, instead of registering processes sequentially from a single process per node, we had them register themselves simultaneously, thereby increasing the load on the registry?". More importantly: "Does this test represent some kind of real-life scenario?".

The aim of this article is to share my thoughts and how I ended up writing Syn. Sure, Syn performs well in the defined use case and stress test, but this in no way means that the other libraries here won't perform far better in other stress tests and scenarios. I'd actually be glad to know that someone else is willing to take the time to evaluate these, and other, global process registries. They are a kind of holy grail; and let's remember that anything distributed is never easy, nor a given.

As a final note, I’d enjoy reading comments from the library authors or other Erlang enthusiasts. This is such a delicate matter that I’d love to have a healthy exchange of opinions, hopefully contributing to improving all of our experiences.

 

9 Comments

  1. Adam Lindberg

    Would be interesting to see Locker (https://github.com/wooga/locker) compared in these benchmarks. It's a distributed lock service that can also be used as a process registry.

    • Hi Adam, Knut did point to it after my thread on the Erlang Questions mailing list.

      As I told him, this is an expiring key/value store, which I read can be used as a mechanism for leader election. However, I assume all the rest would still need to be developed.
      For instance, does it do process monitoring? How does it handle conflict resolution?

  2. Michael Truog

    Some issues with the test above:
    1) No hardware or instance type was mentioned, making the test irreproducible and limiting the impact of the results.
    2) No common timeout value was enforced for the operations among the separate process registries. It is important to determine the performance given some amount of time for a synchronous request.
    3) Syn is likely to lose process registry data during a netsplit, due to process registry modifications that occur during the netsplit that are unable to be resolved after the partitions merge. This is a common problem with mnesia, so it is probably a good thing to explain here, since other process registry solutions do not have this problem.
    4) The usage of CloudI Process Groups (https://github.com/okeuday/cpg) was erroneously mentioned as having monitoring processes, but the error clearly shows a timeout occurred, due to the default timeout value being used (5000 ms). For cpg, it does appear like you are reaching a bottleneck on a single scope process, which is why it is possible to create more than 1 scope (to avoid always using the single default scope, i.e., cpg_default_scope). The other thing to keep in mind is that cached cpg data can be used to avoid putting extra load on a scope process by utilizing the cpg_data module. These details may not match your use case, but it is at least important that you are using the same timeout values for all the process registry tests, to avoid premature timeout exceptions.

    • Michael,
      Sometimes I have a hard time understanding what you refer to. :) You say that Syn:

      is likely to lose process registry data during a netsplit, due to process registry modifications that occur during the netsplit that are unable to be resolved after the partitions merge

      …No. And that's actually the whole point. Syn is able to resolve registry modifications that have happened during a net split. You might want to check Syn's test suite, which covers net splits. So, why do you think it is unable to do it?

      As far as cpg usage with multiple scopes goes, I understand them, but I don't care about using them. My use case is extremely simple, and I don't see why I'd need multiple scopes to avoid a cpg bottleneck. You also refer to cpg being "erroneously mentioned as having monitoring processes". No, I said that it monitors processes and removes them when a process dies. Was I mistaken in this?

      Finally, since you're interested in hardware specs, here they are: all of these tests were run on a local box, a 2014 MacBook Pro 15-inch Retina running Erlang 17.5 on Yosemite 10.10.4 (i7 2.2 GHz, 16 GB RAM).

      • The comment on Syn, "is likely to lose process registry data during a netsplit, due to process registry modifications that occur during the netsplit that are unable to be resolved after the partitions merge", refers to the case where a netsplit occurs so that partition A and partition B exist separately. Groups are modified in both partition A and partition B. Generally, a mnesia merge operation has to choose whether partition A or partition B holds the accurate data for the modified groups when merging the data after partition A and partition B reconnect. If Syn is able to not discard process registry data from both partitions, and to merge in a way where all data survives the merge, it would be important to mention that and describe how it works. If it does work, it likely sacrifices availability to make it possible.

        I didn't enter the comment on cpg properly. To quote the comment on cpg in the test results above: "The monitoring of processes didn't work appropriately." However, the monitoring of processes worked fine. Your usage of the cpg module didn't specify a timeout long enough to allow a response, and since no common timeout was used among all process registries, it is unclear what conclusions you can draw, having no data when a timeout occurred without waiting for the same period with all process registries.

  3. Sean Cribbs

    I’d be interested to see a benchmark with Christopher Meiklejohn’s riak_pg. https://github.com/cmeiklejohn/riak_pg

  4. At Altenwald we worked for a long time on our own solution for the global registration of processes in a cluster (only one process with a specific name in the whole cluster), and we developed:

    https://github.com/altenwald/forseti

    I'll check the benchmarks to see whether our solution measures up… good post! :-)
