Fault tolerance in Go

because Erlang seems to have more tools out of the box for building distributed applications.

Oh, well, you're not going to like my answer then, which is precisely that I've come to loathe the tools it ships for building distributed applications. (No sarcasm.)

I like the ideas themselves, but the actual implementation is maddeningly opaque. When it works, it works. When it doesn't work... it's extremely difficult to work out why. We just fire net_adm:ping commands back and forth between the nodes and flail randomly until it starts working again, but it breaks spontaneously all the damned time. I've still never been able to find any logs with useful information in them, or much of anything else to help diagnose the problems. I do not believe it is a fundamental problem with Erlang's ideas per se; I believe it is a matter of how opaque the implementation is.

For context, I've got a cluster of about 8 nodes that holds several thousand encrypted TCP connections open to external devices continuously. I can tell I'm torturing Erlang a bit by sending around multi-kilobyte or megabyte-sized messages intermittently rather than single-digit-kilobyte messages frequently, but still, I have a lot of problems I just really shouldn't have.

I've progressively reduced the amount of mnesia my system uses and cut the number of messages flowing between the nodes further and further, but the clustering still just breaks all the time. Why does it break? Heck if I know. And I've been fighting with it for years. I'm down to sending single-digit numbers of messages per second between the nodes, and the clustering still breaks every couple of weeks. (In fact it's broken right now.)

When I was using Erlang in a single-node context it was great, but I keep squeezing ever more "clustering" out of the system. Before much longer I'll hardly be using any of the Erlang clustering support and it'll just be nodes talking to each other that happen to be written in Erlang. The plan is then to start folding in Go nodes, because once I'm no longer using the special stuff there will be nothing left to stop me from switching to Go.

We've re-implemented this daemon entirely in Go for single-machine cases. (We have both a "cloud" use case and a single-machine use case.) I've got a guy starting in one month whose first task will be to help me finish reign. Then with just a bit of tweaking I can drop the current system into a cluster.

Now, I do not suffer from the delusion that my very first implementation of Erlang-style clustering is going to be instantly better than Erlang's clustering. But my (IMHO reasonable) hope is that I can create something that is a great deal less opaque and that I'll be able to far, far more rapidly respond to any issues we discover with the cluster messaging technology than I can in Erlang. If nothing else, reign will at least TELL me why it is having problems! (Because I wrote it to!)

See also suture.

I like Erlang's ideas, and there are many things I've learned from it and even like enough to reimplement. But at the same time, I've run myself out of options. If I "had" to stick to Erlang at work, I honestly don't know what I'd do next to fix my system.
