Erlang: searching for a convincing argument

Todd Greenwood
Hello all,

OVERVIEW

I've been a long-time Erlang/OTP fan...but I'm caught in a catch-22. For
years, I've wanted to write a substantial system in Erlang/OTP...but
I've been stymied b/c none of my colleagues or managers wanted to risk
investing in this unknown-to-them platform. W/O any significant personal
experience, I have yet to convince anyone that this would be a great
path... So I've watched, time and time again, as various portions of the
Erlang platform were poorly reimplemented in Java, Python, etc. etc...only
to wind up wading through the inevitable profusion of bugs and
scalability issues.

CHALLENGE

So, I challenged a long-time colleague to come up with a problem that
would convince him that he should have implemented the solution in
Erlang. He came up with this problem from a previous company...
Periodically, say once a month, his prev company had to send out a mass
email (templated) to their customer base. This grew from thousands to
millions over the course of a few years. As the number of emails
increased, their simple script started to run from minutes to hours to
days... Furthermore, the email providers impose throughput constraints
such that you can only send X number of emails per hour in the first
hour, Y in the second, etc. The ramp-ups were explicitly documented and
not adhering to them could get you throttled or black-listed.
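
A minimal sketch of how such a ramp-up could be expressed, assuming the
provider's limits are written as a list of {FromHour, EmailsPerHour}
pairs. The module name, function names, and all numbers are hypothetical;
the thread never gives the real values of X and Y.

```erlang
-module(ramp_limit).
-export([allowed_per_hour/2, delay_ms/1]).

%% Schedule is a list of {FromHour, EmailsPerHour} sorted by FromHour,
%% and it must contain an entry for hour 0.
%% Example (made-up numbers): [{0, 1000}, {1, 5000}, {2, 20000}].

%% Return the limit that applies ElapsedHours after the send started.
allowed_per_hour(ElapsedHours, Schedule) ->
    Applicable = [Limit || {From, Limit} <- Schedule, From =< ElapsedHours],
    lists:last(Applicable).

%% Minimum pause between two sends that stays within Limit emails/hour.
%% Rounded up so that integer division never lets one extra email through.
delay_ms(Limit) when Limit > 0 ->
    (3600000 + Limit - 1) div Limit.
```

A single sender process could sleep for delay_ms(Limit) between sends;
with several senders the limit would have to be divided among them or
enforced in one place.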

TEST CASE

To showcase why Erlang is so great, I suggested that we could model
external and internal failures and show that the only end result was a
change in throughput.

Erlang Nodes      |Fake SMTP Relay          |Throughput
------------------|-------------------------|-----------------------
[1...N]           |[1]                      |
FailureRate1      |FailureRate2             |emails/sec

* Inputs: 1 million email addresses, read from a file.
* The Erlang nodes have code/app that processes the email addresses.
* The Fake SMTP Relay just receives the emails and writes them to a file
or /dev/null, whatever.
* FailureRate1 is the percentage of Nodes that are dead (killed, etc.),
simulating hardware faults etc.
* FailureRate2 is the percentage of errors that the Fake SMTP relay
reports, simulating 3rd-party endpoint failures.
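
The FailureRate2 knob could be simulated along these lines. This is only
a sketch: the module fake_relay and both functions are hypothetical, and
the failure rate is given as a fraction between 0.0 and 1.0 rather than a
percentage.

```erlang
-module(fake_relay).
-export([send/2, send_with_retry/3]).

%% Pretend to deliver Email. Fails with probability FailureRate,
%% which plays the role of FailureRate2 in the test case.
%% rand:uniform/0 returns a float X with 0.0 =< X < 1.0.
send(_Email, FailureRate) ->
    case rand:uniform() < FailureRate of
        true  -> {error, relay_failure};
        false -> ok
    end.

%% Retry up to MaxAttempts times, so that a relay failure costs
%% time (throughput) rather than losing the email outright.
send_with_retry(_Email, _FailureRate, 0) ->
    {error, gave_up};
send_with_retry(Email, FailureRate, MaxAttempts) when MaxAttempts > 0 ->
    case send(Email, FailureRate) of
        ok ->
            ok;
        {error, relay_failure} ->
            send_with_retry(Email, FailureRate, MaxAttempts - 1)
    end.
```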

I also suggested that originally, he give me an incorrect address for
his SMTP relay, and I'd perform a hot code update to correct this.
Pretty sick (cool) right?

DESIGN

At this point, I have some design questions...

# Design 1 : Use a database
I could dump the 1 million email addresses into a database (ETS/mnesia,
etc.) and have processes reading/writing state to the db as they process
each email. But he was unimpressed, as this is so much like just using
any old language that uses the db as a work queue (so long as the db is
replicated).

# Design 2 : Use erlang processes, all in memory at the same time
I could create an Erlang process for each email address... but scaling
is memory bound, so this doesn't seem right at all.

# Design 3: Use erlang processes, but only read in a subset of the email
addresses at a time
I could read from the input file and create only M erlang processes at a
time and then write to ETS to signify completion. But if I'm writing to
ETS, I may as well read all the data into ETS/mnesia at the start, and
use it as a work queue. Back to Design #1.
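
For reference, the "only M processes at a time" part of Design 3 can be
sketched with spawn_monitor alone. The module bounded_pool is
hypothetical, and it simply counts finished workers instead of writing
completion records to ETS, so it says nothing about surviving a node
crash.

```erlang
-module(bounded_pool).
-export([run/3]).

%% Apply Fun to every item, with at most Max worker processes alive at once.
%% Returns {Ok, Failed}: how many workers exited normally and abnormally.
%% Assumes the caller has no other monitors active, since any 'DOWN'
%% message in the mailbox is counted as one of ours.
run(Fun, Items, Max) when Max > 0 ->
    loop(Fun, Items, Max, 0, {0, 0}).

%% Nothing left to start and nothing running: done.
loop(_Fun, [], _Max, 0, Counts) ->
    Counts;
%% Room for another worker: start one and monitor it.
loop(Fun, [Item | Rest], Max, Running, Counts) when Running < Max ->
    {_Pid, _Ref} = spawn_monitor(fun() -> Fun(Item) end),
    loop(Fun, Rest, Max, Running + 1, Counts);
%% Pool is full, or the input is exhausted: wait for one worker to exit.
loop(Fun, Items, Max, Running, {Ok, Failed}) ->
    receive
        {'DOWN', _Ref, process, _Pid, normal} ->
            loop(Fun, Items, Max, Running - 1, {Ok + 1, Failed});
        {'DOWN', _Ref, process, _Pid, _Reason} ->
            loop(Fun, Items, Max, Running - 1, {Ok, Failed + 1})
    end.
```

Memory use is bounded by Max rather than by the size of the input, which
is the point of Design 3 over Design 2.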

Ok, putting that question aside for a second...

DISTRIBUTION

# Distribution 3 : How to distribute the app for resiliency?
I'd like to run this on N nodes and have a random reaper (chaos monkey,
whatever) kill the Erlang nodes (or the underlying VM) randomly to
simulate hardware errors.  Again, my thinking feels constrained. I keep
coming back to: stuff the state in a db, spin up a supervisor and a
bunch of worker processes on a separate node. If the node with the
worker processes dies, the supervisor creates worker processes on a
different node, and so forth.

Despite having read all the books I can find on Erlang and reading the
list for years now... I still don't really know the best way to have
supervisors living on separate nodes, reacting to node failures such
that the application picks up where it left off on a new node. It might
be simpler to just fan the workers out across all the nodes since the
state is maintained in the db/queue. I'd still have to maintain state in
the db to ensure that all the processes are adhering to the rate limits.

But the central question remains, if I'm randomly killing my nodes, and
if the node with the supervisor dies, what then? How do I replicate
supervisors? What's the pattern that I'm missing here?
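
One OTP mechanism that may be relevant to this question is the
"distributed applications" support in kernel, where an application
(and with it its top supervisor) is restarted on another node when the
node running it goes down. A sketch of the configuration, assuming an
application called mailer and three nodes with made-up names:

```erlang
%% sys.config for node a@host1 (sketch; app and node names are hypothetical).
%% mailer runs on a@host1 while it is up. If that node dies, the
%% application is restarted, 5000 ms later, on b@host2 or c@host3.
[{kernel,
  [{distributed, [{mailer, 5000, [a@host1, {b@host2, c@host3}]}]},
   %% Wait up to 30 s at boot for the other nodes, then carry on anyway.
   {sync_nodes_optional, [b@host2, c@host3]},
   {sync_nodes_timeout, 30000}]}].
```

The application is started from scratch on the new node, so any progress
that must survive the failover still has to live outside the failed node
(for example in the db/queue discussed above).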

FINAL

So, when my friend first described his problem, I thought, "I can do
this in like, 4 lines of Erlang!" Then he started adding constraints
like rate limitations, etc... and I thought, "Ok, 20-30 lines...". Now,
looking at the problem, I've spent a couple hours just writing it down
and trying to consider how to solve it... I have more questions than
when I started.

I'd love to hear thoughts about solving this (or similar) problem(s).
I've come to the conclusion that I cannot evangelize Erlang if I don't
know how to solve even simple problems with it.

-Todd

BTW - over the years I've read:
https://joearms.github.io/index.html
Programming Erlang (ed 2 is on order)
Learn You Some Erlang For Great Good
Erlang and OTP In Action
Erlang Programming
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

Re: Erlang: searching for a convincing argument

Fred Hebert-2
On 12/10, Todd Greenwood-Geer wrote:
>Periodically, say once a month, his prev company had to send out a mass
>email (templated) to their customer base. This grew from thousands to
>millions over the course of a few years. As the number of emails
>increased, their simple script started to run from minutes to hours to
>days... Furthermore, the email providers impose throughput constraints
>such that you can only send X number of emails per hour in the first
>hour, Y in the second, etc. The ramp-ups were explicitly documented and
>not adhering to them could get you throttled or black-listed.
>

This sounds like a perfect example for the techniques I recently
mentioned in http://ferd.ca/handling-overload.html

Specifically, you could use libraries such as safetyvalve[1] or jobs[2]
to schedule all your tasks according to the required limits of the
system, or use circuit breakers of all kinds[3][4][5] to regulate that
load as a reaction from the systems you communicate with.

Some of the strengths of Erlang don't come from its semantics, but from
the strong operational focus of its community, and its dedication to
solving problems in a space where such challenges are common.

I'm a bit short on time tonight to respond regarding distribution
(it's a complex topic!); if I remember I'll try to come
back to it later.

[1] https://github.com/jlouis/safetyvalve
[2] https://github.com/uwiger/jobs
[3] https://github.com/klarna/circuit_breaker
[4] https://github.com/jlouis/fuse
[5] https://github.com/mmzeeman/breaky

Re: Erlang: searching for a convincing argument

Michael Truog
In reply to this post by Todd Greenwood
On 12/10/2016 05:29 PM, Todd Greenwood-Geer wrote:
> Hello all,
>
> OVERVIEW
>
> I've been a long-time Erlang/OTP fan...but I'm caught in a catch-22. For years, I've wanted to write a substantial system in Erlang/OTP...but I've been stymied b/c none of my colleagues or managers wanted to risk investing in this unknown-to-them platform. W/O any significant personal experience, I have yet to convince anyone that this would be a great path... So I've watched, time and time again, as various portions of the Erlang platform were poorly reimplemented in Java, Python, etc. etc...only to wind up wading through the inevitable profusion of bugs and scalability issues.
Generally, the main argument for using Erlang is its ability to provide fault tolerance for source code.  The scalability advantage can be had in various programming languages with an actor model library.  It is important to note that Erlang is a functional programming language that attempts to avoid side effects (errors in managing state lead to system instability).  Avoiding instability on the server side is important because many clients depend on the server.

>
> CHALLENGE
>
> So, I challenged a long-time colleague to come up with a problem that would convince him that he should have implemented the solution in Erlang. He came up with this problem from a previous company... Periodically, say once a month, his prev company had to send out a mass email (templated) to their customer base. This grew from thousands to millions over the course of a few years. As the number of emails increased, their simple script started to run from minutes to hours to days... Furthermore, the email providers impose throughput constraints such that you can only send X number of emails per hour in the first hour, Y in the second, etc. The ramp-ups were explicitly documented and not adhering to them could get you throttled or black-listed.
>
> TEST CASE
>
> To showcase why Erlang is so great, I suggested that we could model external and internal failures and show that the only end result was a change in throughput.
>
> Erlang Nodes      |Fake SMTP Relay          |Throughput
> ------------------|-------------------------|-----------------------
> [1...N]           |[1]                      |
> FailureRate1      |FailureRate2             |emails/sec
>
> * Inputs: 1 million email addresses, read from a file.
> * The Erlang nodes have code/app that processes the email addresses.
> * The Fake SMTP Relay just receives the emails and writes them to a file or /dev/null, whatever.
> * FailureRate1 is the percentage of Nodes that are dead (killed, etc.), simulating hardware faults etc.
> * FailureRate2 is the percentage of errors that the Fake SMTP relay reports, simulating 3rd-party endpoint failures.
>
> I also suggested that originally, he give me an incorrect address for his SMTP relay, and I'd perform a hot code update to correct this. Pretty sick (cool) right?
>
> DESIGN
>
> At this point, I have some design questions...
>
> # Design 1 : Use a database
> I could dump the 1 million email addresses into a database (ETS/mnesia, etc.) and have processes reading/writing state to the db as they process each email. But he was unimpressed, as this is so much like just using any old language that uses the db as a work queue (so long as the db is replicated).
>
> # Design 2 : Use erlang processes, all in memory at the same time
> I could create an Erlang process for each email address... but scaling is memory bound, so this doesn't seem right at all.
>
> # Design 3: Use erlang processes, but only read in a subset of the email addresses at a time
> I could read from the input file and create only M erlang processes at a time and then write to ETS to signify completion. But if I'm writing to ETS, I may as well read all the data into ETS/mnesia at the start, and use it as a work queue. Back to Design #1.
>
> Ok, putting that question aside for a second...
Design #1 without ETS or mnesia would be a good approach.  A SQL or NoSQL database would be picked based on the usage patterns and operational concerns.  The impressive part is having a system that can survive failure scenarios independent of the database, i.e., runtime problems in the source code that the developers did not anticipate.

I would choose to use CloudI (http://cloudi.org/) because it saves me development time.  I would probably have 2 CloudI services: one that periodically reads from the database (ServiceA) and one that sends an email based on the contents of a received service request (ServiceB), with ServiceA sending to ServiceB.  That turns the concurrency and throughput concerns into service configuration settings, thanks to the various features in CloudI.
>
> DISTRIBUTION
>
> # Distribution 3 : How to distribute the app for resiliency?
> I'd like to run this on N nodes and have a random reaper (chaos monkey, whatever) kill the Erlang nodes (or the underlying VM) randomly to simulate hardware errors.  Again, my thinking feels constrained. I keep coming back to: stuff the state in a db, spin up a supervisor and a bunch of worker processes on a separate node. If the node with the worker processes dies, the supervisor creates worker processes on a different node, and so forth.
>
> Despite having read all the books I can find on Erlang and reading the list for years now... I still don't really know the best way to have supervisors living on separate nodes, reacting to node failures such that the application picks up where it left off on a new node. It might be simpler to just fan the workers out across all the nodes since the state is maintained in the db/queue. I'd still have to maintain state in the db to ensure that all the processes are adhering to the rate limits.
>
> But the central question remains, if I'm randomly killing my nodes, and if the node with the supervisor dies, what then? How do I replicate supervisors? What's the pattern that I'm missing here?
CloudI provides node auto-discovery with LAN multicast and AWS EC2 API usage, so that can help simplify managing a group of nodes, with the same services on each so that failover can occur based on the routing of CloudI service requests.  For system testing, the service configuration options monkey_chaos and monkey_latency exist, so that means Chaos Monkey testing and/or Latency Monkey testing can occur with a tweak to the configuration of existing services, for a separate environment, automated tests, or whatever is required.
>
> FINAL
>
> So, when my friend first described his problem, I thought, "I can do this in like, 4 lines of Erlang!" Then he started adding constraints like rate limitations, etc... and I thought, "Ok, 20-30 lines...". Now, looking at the problem, I've spent a couple hours just writing it down and trying to consider how to solve it... I have more questions than when I started.
>
> I'd love to hear thoughts about solving this (or similar) problem(s). I've come to the conclusion that I cannot evangelize Erlang if I don't know how to solve even simple problems with it.
Not sure about the line count, since it depends on the details, like what templating needs to be supported for creating the email data. I am also assuming that the DB is modified elsewhere, and that the Erlang side updates the DB as it processes the DB entries that require emails to be sent.  CloudI allows services in any supported programming language (C/C++, Erlang/Elixir, Java, JavaScript/node.js, Perl, PHP, Python and/or Ruby), which helps avoid the risks people perceive in using Erlang (difficulty finding people with Erlang knowledge, expensive to train people, easy to find Java developers, etc.).  CloudI allows you to utilize Erlang's advantages even if you are unable to develop with Erlang (for example, the Java CloudI tutorial is implemented only in Java).

I created CloudI because I was in situations similar to yours in the past, and I know that CloudI saves me development time, so I would naturally use it to solve a problem like this.  However, everyone approaches a problem differently, so there are many ways of approaching this with Erlang.

Best Regards,
Michael

