Newbie help on OTP architecture/scaling


dlinn...@economicmodeling.com

Jun 2, 2015, 4:41:35 PM
to elixir-l...@googlegroups.com
I'm in the process of evaluating different concurrency-based languages (mainly Node, Go, and Erlang/Elixir) for an API gateway service, in which we'll be translating a single request into multiple requests to backend microservices. For evaluation, my team is building a tiny app in a few languages and testing it out in terms of code readability, scalability, tooling, etc. I've got a version working in Elixir, but I immediately ran into a scaling problem because of what I think is a misuse of OTP, so I'd love to get some design pattern pointers to see how difficult it is to resolve.

The only requirement for the app currently is to take in an MSA code and respond with JSON like this:  { "code": "10100", "name": "Aberdeen, SD", "centroid": {"long": -98.696, "lat": 45.521} }. To try and replicate our actual environment, the name for the code comes from a mysql database, while the centroid comes from a postgres database. I set up the app as an umbrella app, first getting the core data-gathering functionality working - msas - then I added a web server using plug - msas_server. The working code is here: https://github.com/dlinnemeyer/elixir-msas-app

The main problem I run into is that as we scale up concurrency, we very quickly start running into timeouts when calling Msas.Meta.get(). This sends a call to the Msas.Meta gen server. From what I can tell, this seems to be happening because there's only a single Msas.Meta worker process running (see apps/msas/lib/supervisor.ex), and it processes each query one at a time. Is this likely where the problem is? How would I go about solving this? My gut feeling would be that there'd be two approaches:

1) Have Msas.Meta.handle_call start an async call with Ecto (not sure how to do this: start an ad hoc process, or use Ecto itself?), then return :noreply, adding the query to a queue in the server state somehow. I believe I read that you can use this approach and then have handle_info receive any finished queries and reply to the original calling process. This gives the calling process a sync feel but avoids blocking up the meta gen server? (A rough sketch of what I mean is below this list.)
2) Have a pool of Msas.Meta gen servers running, either by having a set number of them in the supervisor, or by somehow dynamically allocating them as requests come in?
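
Something like this very rough sketch is what I have in mind for option 1 (the module name and the stubbed-out query are just placeholders, and I may well be misusing the callbacks):

defmodule Msas.MetaAsync do
  use GenServer

  # Client API
  def start_link, do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  def get(code), do: GenServer.call(__MODULE__, {:get, code})

  # Server callbacks
  def init(:ok), do: {:ok, %{}}

  def handle_call({:get, code}, from, pending) do
    task = Task.async(fn -> run_query(code) end)
    # Don't reply yet; remember who asked, keyed by the task's monitor ref.
    {:noreply, Map.put(pending, task.ref, from)}
  end

  # Task.async sends {ref, result} back to this server when the query finishes.
  def handle_info({ref, result}, pending) do
    case Map.get(pending, ref) do
      nil  -> :ok
      from -> GenServer.reply(from, result)
    end
    {:noreply, Map.delete(pending, ref)}
  end

  # ...followed by a :DOWN message from the task's monitor, which we ignore here.
  def handle_info({:DOWN, _ref, :process, _pid, _reason}, pending), do: {:noreply, pending}

  defp run_query(_code), do: :stub  # placeholder for the real Ecto query
end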

Which one is more standard, or is there a standard approach? Are there OTP or other library-based ways of doing this quickly, or do I mainly have to hand-code a solution?

Also, this sort of bottleneck seems like it would be a pretty common occurrence. In your experience, is it typical that scaling an Elixir/Erlang app is mainly a matter of discovering worker servers that are getting bogged down and switching them to a lighter-weight, non-blocking approach? Or is it typical/advisable to architect that way from the beginning?

Some other questions:
  • Have I architected this application best for my purposes? Currently I have a gen server for msa meta info (mysql) and msa gis info (postgres). The controller for the web request simply farms out the work to those servers and sends back the responses. So as far as I can understand it, the main scalability issue would be to make sure each data provider doesn't get blocked up. Or is there something I'm missing? Are there other architectures that may work better for this kind of use case?
  • How would I get the web controller to call for the meta and gis info asynchronously? In this example the queries are quick so it's less of a direct concern in this case, but in the API gateway we're spec'ing out, some data pulls can be pretty sluggish, so async would be necessary. In an Erlang example, I just created a couple of processes by hand and had them send back the results to the parent process when finished. As far as I can tell, though, the recommendation is to avoid this sort of thing and opt for patterns in OTP? But as far as I could tell, casts aren't supposed to return or send back data. Any advice along these lines? Any standard patterns, aside from just firing up a couple of ad hoc processes?
  • I ended up in a situation where I have to call Msas.Meta.get(Msas.Meta, code) in the web controller (app/msas_server/lib/router.ex). As I understand it, the first Msas.Meta is the actual module, the second is the name I gave for the supervised process in apps/msas/lib/supervisor.ex. Why the redundancy? Is that necessary, or did I overcomplicate things?
  • Any other pointers? I'm mainly focusing on code architecture, but other little points and standard practices would be great. Anything from "you're screwing up pattern matching" to "there's a better way to use plug" would be awesome. 



Thanks for any time you have to dig through. Overall I've really enjoyed the architecture that Elixir seems to be pushing toward, and it's proven to be a lot more accessible than Erlang itself (I tried this same exercise there). The main concern I have at this point is getting a little lost in how heavy-weight OTP feels, at least at first.

-Donny

José Valim

Jun 2, 2015, 5:16:05 PM
to elixir-l...@googlegroups.com
Absolutely fantastic work on splitting the applications apart in an umbrella project. 
 
The main problem I run into is that as we scale up concurrency, we very quickly start running into timeouts when calling Msas.Meta.get(). This sends a call to the Msas.Meta gen server. From what I can tell, this seems to be happening because there's only a single Msas.Meta worker process running (see apps/msas/lib/supervisor.ex), and it processes each query one at a time. Is this likely where the problem is? How would I go about solving this? My gut feeling would be that there'd be two approaches:

This is exactly the problem.

My answer would be: simply remove the GenServer.

Ecto already provides a pool of workers. Accessing it from a GenServer means you are serializing the access to the whole pool. Your get function should directly access Ecto:

defp get_meta(msa) do
  query = """
  SELECT code, name
  FROM def_us_area_2014_4
  WHERE level = 'MSA' AND code = ?
  """
  %{rows: [result | _]} = Ecto.Adapters.SQL.query(Msas.MySQL, query, [msa])
  {areaid, name} = result
  %{areaid: areaid, name: name}
end

In other words, both Postgres and MySQL adapters already provide their GenServers that are properly pooled by Ecto. You don't need to do (or even redo) this work by putting it in a GenServer.
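
Just to illustrate the pooling point, the repo configuration is where the pool is sized. This is only a sketch, not your actual config, and the exact option names have shifted between Ecto versions (size vs. pool_size), so check the docs for the version you are on:

# config/config.exs (illustrative values only)
config :msas, Msas.MySQL,
  adapter: Ecto.Adapters.MySQL,
  database: "msas",
  username: "user",
  password: "pass",
  hostname: "localhost",
  pool_size: 10   # Ecto checks connections in and out of this pool for you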
 
Also, this sort of bottleneck seems like it would be a pretty common occurrence. In your experience, is it typical that scaling an Elixir/Erlang app is mainly a matter of discovering worker servers that are getting bogged down and switching them to a lighter-weight, non-blocking approach? Or is it typical/advisable to architect that way from the beginning?

I answer this in more detail down below, but there is no such thing as a "non-blocking approach" in Elixir in the sense you see in most other languages. Because everything happens in tiny processes, if one tiny process is waiting on IO, the VM is simply going to choose another tiny process to run. There is no way one process can directly block another process due to CPU usage or because it is waiting on IO. So non-blocking is the default.

In any case, your observation is right that part of scaling is a matter of discovering worker servers that are getting bogged down. Luckily the VM and tooling make such bottlenecks extremely easy to find. The solutions will vary though. If the bottleneck is because of data access, then ETS may be a solution. If it is because of serial work that could be parallelized, then it is another solution, and so on.
  • Have I architected this application best for my purposes? Currently I have a gen server for msa meta info (mysql) and msa gis info (postgres). The controller for the web request simply farms out the work to those servers and sends back the responses. So as far as I can understand it, the main scalability issue would be to make sure each data provider doesn't get blocked up. Or is there something I'm missing? Are there other architectures that may work better for this kind of use case?
See above. Don't put them in a GenServer because then you are serializing access to your whole pool.
  • How would I get the web controller to call for the meta and gis info asynchronously? In this example the queries are quick so it's less of a direct concern in this case, but in the api gateway we're specing out, some data pulls can be pretty sluggish, so async would be necessary.
You don't need to call anything asynchronously. Let's suppose you have a long request that is waiting on the database. While that particular Elixir process waits on the database I/O, the VM will be able to schedule other Elixir processes to do the work. You absolutely don't need to worry about this. Elixir will be able to efficiently distribute requests, regardless if they are IO or CPU bound. Every request is already running on its own Elixir process.
  • I ended up in a situation where I have to call Msas.Meta.get(Msas.Meta, code) in the web controller (app/msas_server/lib/router.ex). As I understand it, the first Msas.Meta is the actual module, the second is the name I gave for the supervised process in apps/msas/lib/supervisor.ex. Why the redundancy? Is that necessary, or did I overcomplicate things?
Although you don't need a GenServer in this particular case, that is indeed necessary in a GenServer. The module is about code. The name is one of the ways to identify the process that is running the code so you can send commands to it. They are orthogonal abstractions.

I actually mentioned a couple of days ago that most languages give you ways to think about how you organize and structure your code (modules, packages, classes). Very few languages give you an abstraction for how you interact with the code and the data in the runtime system, and processes offer exactly that.
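
To make the distinction concrete (argument shapes here are illustrative, not from your app):

# The module is the code; the :name option is the identity of one process running it.
GenServer.start_link(Msas.Meta, :ok, name: Msas.Meta)     # name just happens to equal the module
GenServer.start_link(Msas.Meta, :ok, name: :meta_backup)  # same code, a second named process
Msas.Meta.get(Msas.Meta, "10100")                          # the first argument picks which process to call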

Thanks for any time you have to dig through. Overall I've really enjoyed the architecture that Elixir seems to be pushing toward, and it's proven to be a lot more accessible than Erlang itself (I tried this same exercise there). The main concern I have at this point is getting a little lost in how heavy-weight OTP feels, at least at first.

Welcome! Nice to hear you are enjoying it.

One final note: we have uncovered some bottlenecks with the mariaex driver for MySQL on some occasions. Fixes are coming this weekend. Just a heads up in case the benchmark is below your expectations.





José Valim
Skype: jv.ptec
Founder and Lead Developer


Alex Shneyderman

Jun 2, 2015, 5:19:42 PM
to elixir-l...@googlegroups.com


On Tuesday, June 2, 2015 at 4:41:35 PM UTC-4, dlinn...@economicmodeling.com wrote:
I'm in the process of evaluating different concurrency-based languages (mainly Node, Go, and Erlang/Elixir) for an API gateway service, in which we'll be translating a single request into multiple requests to backend microservices. For evaluation, my team is building a tiny app in a few languages and testing it out in terms of code readability, scalability, tooling, etc. I've got a version working in Elixir, but I immediately ran into a scaling problem because of what I think is a misuse of OTP, so I'd love to get some design pattern pointers to see how difficult it is to resolve.

The only requirement for the app currently is to take in an MSA code and respond with JSON like this:  { "code": "10100", "name": "Aberdeen, SD", "centroid": {"long": -98.696, "lat": 45.521} }. To try and replicate our actual environment, the name for the code comes from a mysql database, while the centroid comes from a postgres database. I set up the app as an umbrella app, first getting the core data-gathering functionality working - msas - then I added a web server using plug - msas_server. The working code is here: https://github.com/dlinnemeyer/elixir-msas-app

The main problem I run into is that as we scale up concurrency, we very quickly start running into timeouts when calling Msas.Meta.get(). This sends a call to the Msas.Meta gen server. From what I can tell, this seems to be happening because there's only a single Msas.Meta worker process running (see apps/msas/lib/supervisor.ex), and it processes each query one at a time. Is this likely where the problem is? How would I go about solving this? My gut feeling would be that there'd be two approaches:

A typical Erlang approach would be to create a process per request. And do not worry, you already have a process once you hit the plug (or rather cowboy, which plug uses underneath). So you have N requests (processes) trying to run a query via a single gen_server (or rather a few gen_servers). My suggestion is to unhide the code from behind the gen_server:

defp get_meta(msa) do
  query = """
  SELECT code, name
  FROM def_us_area_2014_4
  WHERE level = 'MSA' AND code = ?
  """
  %{rows: [result | _]} = Ecto.Adapters.SQL.query(Msas.MySQL, query, [msa])
  {areaid, name} = result
  %{areaid: areaid, name: name}
end

If you get rid of the gen_server, you can simply put this code into something like:

defmodule Msas.Meta do
  ## Client API

  def get(msa) do
    query = """
    SELECT code, name
    FROM def_us_area_2014_4
    WHERE level = 'MSA' AND code = ?
    """
    %{rows: [result | _]} = Ecto.Adapters.SQL.query(Msas.MySQL, query, [msa])
    {areaid, name} = result
    %{areaid: areaid, name: name}
  end
end

You will now make your DB the bottleneck. Ecto configures pooling by default. Each DB has its recommended pool size. If memory serves me right, Postgres will recommend ((core_count * 2) + effective_spindle_count). MySQL will recommend something else, perhaps.
 
For the most part the tools you are using already have the concurrent behavior Erlang is famous for. You do not need to pile on more, unless you really have to. In your particular case, since you are querying two DBs, something like this might work in router.ex (with the changes I outlined above).

get "/:code" do
    t0 = Task.async(fn() -> Msas.Meta.get(code).name end)
    t1 = Task.async(fn() -> Msas.Gis.get(code).centroid end)
    meta = Task.await(t0)
    gis  = Task.await(t1)
    send_resp(conn, 200, Poison.encode!(%{
        code: code,
        name: meta,
        centroid: gis
    }))
end

In this scenario you are spawning requests to get both pieces of the data in parallel (although you are waiting for the answer to the first and then waiting for the second, which is acceptable since your result depends on both results).

With this solution, if you increase the arrival rate you will be processing up to the limit of your DB connection pools. In a real-world solution you will want to apply some form of polite back-pressure (see for example the jobs framework, but there are plenty of others).
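
As a rough idea of what that looks like with jobs (this is from memory, so treat it as an assumption and check the jobs docs before relying on it):

:jobs.add_queue(:db_queries, [standard_rate: 100])        # allow roughly 100 jobs per second
:jobs.run(:db_queries, fn -> Msas.Meta.get("10100") end)  # blocks until the regulator lets it through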

Cheers,
Alex.

Donny Linnemeyer

Jun 3, 2015, 2:22:01 AM
to elixir-l...@googlegroups.com
Thanks for the quick replies. Really appreciate the help.
 
This is exactly the problem.

My answer would be: simply remove the GenServer.

Ecto already provides a pool of workers. Accessing it from a GenServer means you are serializing the access to the whole pool. Your get function should directly access Ecto:

defp get_meta(msa) do
  query = """
  SELECT code, name
  FROM def_us_area_2014_4
  WHERE level = 'MSA' AND code = ?
  """
  %{rows: [result | _]} = Ecto.Adapters.SQL.query(Msas.MySQL, query, [msa])
  {areaid, name} = result
  %{areaid: areaid, name: name}
end

In other words, both Postgres and MySQL adapters already provide their GenServers that are properly pooled by Ecto. You don't need to do (or even redo) this work by putting it in a GenServer.

Okay, that makes sense. But are there general pointers on when to use a gen server and when to just build a module with functions to call? Is the main distinction whether you have any server state to keep track of?

So in this case, let's say that when the app initializes, we need some meta information about which table(s) to pull data from for these queries (say def_us_area_2014_4 is dynamic, which it would be if we built out the app a bit). Would that sort of thing start warranting some sort of GenServer structure? And if so, how would you go about addressing the bottleneck? Or would you farm the meta info itself off to a GenServer, and just have the querying functions pull from that? Retrieving the meta info would hypothetically just be from memory, so I'd imagine we wouldn't run into the same bottleneck, at least not as early.

Not so much concerned with the details themselves; just trying to get a handle on the principles. So if a better example comes to mind, you can run with that one.
 
I answer this with more detail down below but there is no such thing as "non-blocking approach" in Elixir as you see in most other languages. Because everything happens in tiny processes, if one tiny process is waiting on IO, the VM is simply going to choose another tiny process to run. There is no way one process can directly block another process due to CPU usage or because it is waiting on IO. So non-blocking is the default.

In any case, your observation is right that part of scaling is a matter of discovering worker servers that are getting bogged down. Luckily the VM and tooling make such bottlenecks extremely easy to find. The solutions will vary though. If the bottleneck is because of data access, then ETS may be a solution. If it is because of serial work that could be parallelized, then it is another solution, and so on.

What sort of tooling should I check into for this? One thing we're trying to do is get a handle on what debugging looks like in Elixir/Erlang vs. Go, since that seems to be one of the most important comparisons. We won't be producing tons of concurrent apps, so most of our time will probably be in maintenance and optimization. 

You don't need to call anything asynchronously. Let's suppose you have a long request that is waiting on the database. While that particular Elixir process waits on the database I/O, the VM will be able to schedule other Elixir processes to do the work. You absolutely don't need to worry about this. Elixir will be able to efficiently distribute requests, regardless if they are IO or CPU bound. Every request is already running on its own Elixir process.

Yeah, I'm not so much worried about that aspect of it, since I assumed plug/cowboy was handling each request on a separate process. I want to dig into async to get a speed-up for the client. In our actual target API, we'll be firing off potentially 5-6 calls, several of which could be slow, and there's no reason we couldn't do them in parallel; it'd save the client a good bit of time.

This is actually the main use case for which I'm comparing concurrent languages. It's not so much for scalability - though that's a nice side benefit - it's mainly about finding solid code architecture for handling concurrency within an individual request.

So if that's the case, what's the typical pattern? Something like what Alex recommends, with Task.async? Or are there other patterns to be aware of?

Also, on this level, what do you feel is Elixir/Erlang's main benefit over, say, Go? There I could pretty easily spawn up a couple processes, wait on channels, and respond. On the code architecture level, if the benefit is mainly in the area of something like OTP, I'm trying to understand when and where that impacts my code. Or do you see advantages on the more primitive level of basic async operations, too?

Although you don't need a GenServer in this particular case, that is indeed necessary in a GenServer. The module is about code. The name is one of the ways to identify the process that is running the code so you can send commands to it. They are orthogonal abstractions.

I actually mentioned a couple of days ago that most languages give you ways to think about how you organize and structure your code (modules, packages, classes). Very few languages give you an abstraction for how you interact with the code and the data in the runtime system, and processes offer exactly that.

This part is hard to adjust to. I think I like it, but it creates some strange parallels. In OO code, you care about a class, and you instantiate it into an object that you can carry around. Gen servers feel like this; you initialize them and get a pid back that you can pass in as the first parameter to further function calls. So even though the underlying code is very different, I'm running into strange similarities. Can't tell if that's just me over-translating from an OO background.

Maybe the comparison would be singletons here then? So the question would be how I'd call Msas.Meta.get without passing in the reference to its instance. But you already answered that, moving away from the gen server structure. 

Welcome! Nice to hear you are enjoying it.

One final note: we have uncovered some bottlenecks with the mariaex driver for MySQL in some occasions. Fixes are coming during this weekend. Just a heads up in case the benchmark is below your expectations.

Thanks for the heads up. I'll try re-running the non-genserver version soon to see how that scales.

-Donny

Donny Linnemeyer

Jun 3, 2015, 2:24:04 AM
to elixir-l...@googlegroups.com
Thanks, Alex. Can you elaborate on what you mean by "polite back-pressure"? What is that, and what problem is it supposed to solve?

-Donny


José Valim

Jun 3, 2015, 3:00:48 AM
to elixir-l...@googlegroups.com
Okay, that makes sense. But are there general pointers on when to use a gen server and when to just build a module with functions to call? Is the main distinction whether you have any server state to keep track of?

Needing state is a good starting point. For example, let's suppose some information from the database never changes. You could load it and store it in a GenServer so you don't need to do a database roundtrip every time.
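
For illustration, something along these lines (the module, the data and the stubbed loader are made up):

defmodule Msas.TableInfo do
  use GenServer

  def start_link, do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  def current_table, do: GenServer.call(__MODULE__, :current_table)

  def init(:ok) do
    # One roundtrip at boot; afterwards every read is served from memory.
    {:ok, %{current_table: load_table_name()}}
  end

  def handle_call(:current_table, _from, state) do
    {:reply, state.current_table, state}
  end

  defp load_table_name do
    "def_us_area_2014_4"  # placeholder for the real database query
  end
end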

For your questions regarding running it in production and what Elixir brings to the table, I recommend two things:


* Read Erlang in Anger, a fantastic free book about running Erlang in Production: http://www.erlang-in-anger.com/ 
 
So if that's the case, what's the typical pattern? Something like what Alex recommends, with Task.async? Or are there other patterns to be aware of?

Yes, tasks are excellent if you need to perform multiple computations at the same time.
 
Maybe the comparison would be singletons here then? So the question would be how I'd call Msas.Meta.get without passing in the reference to its instance. But you already answered that, moving away from the gen server structure. 

If they are named processes, a comparison with a singleton is correct from the code organization perspective. But the difference, really, is the runtime abstraction of a process with all the amazing monitoring, introspection and resilience built into it. That's the main difference over the mainstream OO languages and Go imo. :)

Saša Jurić

Jun 3, 2015, 4:31:24 AM
to elixir-l...@googlegroups.com

Okay, that makes sense. But are there general pointers on when to use a gen server and when to just build a module with functions to call? Is the main distinction whether you have any server state to keep track of?

It's of course hard to generalize, but IME a gen_server is needed when you need to serialize competing actions and you want to maintain some state. If neither is the case, then you should consider whether you need a gen_server at all. A gen_server (just like any other process) is internally sequential - it can do only one thing at a time. Hence, if you're channelling multiple requests through a single gen_server, it may become a bottleneck. This is beneficial if those requests are trying to manipulate/query the same data, because you serialize them and perform them one by one. However, if those requests are completely independent, then there's really no need to channel them through a single process, because they end up blocking each other needlessly.

In some cases a gen_server can be replaced with an ETS table, and this is what I usually do if I have many concurrent reads and less frequent writes. In many ways an ETS table seems like a process, but you can actually have concurrent reads/writes to it, so it may boost your scalability.
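
For illustration, a read-mostly ETS table could look like this (the table name and data shape are made up):

# Created once by an owning process:
:ets.new(:msa_meta, [:named_table, :set, :public, read_concurrency: true])

# Occasional writes:
:ets.insert(:msa_meta, {"10100", %{name: "Aberdeen, SD"}})

# Many concurrent reads, from any process, with no gen_server in the way:
case :ets.lookup(:msa_meta, "10100") do
  [{_code, meta}] -> meta
  [] -> nil
end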

 
In any case, your observation is right that part of scaling is a matter of discovering worker servers that are getting bogged down. Luckily the VM and tooling make such bottlenecks extremely easy to find. The solutions will vary though. If the bottleneck is because of data access, then ETS may be a solution. If it is because of serial work that could be parallelized, then it is another solution, and so on.

What sort of tooling should I check into for this? One thing we're trying to do is get a handle on what debugging looks like in Elixir/Erlang vs. Go, since that seems to be one of the most important comparisons. We won't be producing tons of concurrent apps, so most of our time will probably be in maintenance and optimization. 

As José mentioned, Erlang in Anger has a lot of good tips for analyzing the system. In my experience, a lot of logging is crucial, so you can understand what was happening. Another great thing is the ability to introspect the running system. Try running :observer.start from the shell, and you'll see all sorts of information about the system. For example, you can get a top-like view of processes and see which ones have the largest message queue. This can serve as an indication of a possible bottleneck. It is possible to get the same information from the running system in various ways, either through :observer or through a text-based equivalent (I used this one with a lot of success: https://github.com/mazenharake/entop). Finally, you can open an Elixir shell to a live system and interact with it in all sorts of ways, turning on dynamic traces of function calls or messages being passed, and even interacting with individual processes. This is a fantastic tool which allows you to probe production and discover what is going wrong. I used this approach a few times to discover production problems.
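
For example, from an attached IEx shell you can do things like this (illustrative; :observer needs a graphical environment):

:observer.start()   # GUI with a top-like process view, memory usage, ETS tables, etc.

# Or, by hand: the processes with the longest message queues, a quick bottleneck indicator.
Process.list()
|> Enum.map(fn pid -> {pid, Process.info(pid, :message_queue_len)} end)
|> Enum.filter(fn {_pid, info} -> info != nil end)
|> Enum.sort_by(fn {_pid, {:message_queue_len, len}} -> -len end)
|> Enum.take(5)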
 

You don't need to call anything asynchronously. Let's suppose you have a long request that is waiting on the database. While that particular Elixir process waits on the database I/O, the VM will be able to schedule other Elixir processes to do the work. You absolutely don't need to worry about this. Elixir will be able to efficiently distribute requests, regardless if they are IO or CPU bound. Every request is already running on its own Elixir process.

Yeah, I'm not so much worried about that aspect of it, since I assumed plug/cowboy was handling each request on a separate process. I want to dig into async to get a speed-up for the client. In our actual target API, we'll be firing off potentially 5-6 calls, several of which could be slow, and there's no reason we couldn't do them in parallel; it'd save the client a good bit of time.

This is actually the main use case I'm comparing concurrent languages. It's not so much for scalability - though that's a nice side benefit - it's mainly for finding solid code architecture for handling concurrency within an individual request.

So if that's the case, what's the typical pattern? Something like what Alex recommends, with Task.async? Or are there other patterns to be aware of?

Broadly speaking, the idea is to use processes a lot. I call this "running different things separately". So if you need to do a couple of independent tasks in a single request, you could start a couple of processes and then wait for them to finish. Most probably a Task will be appropriate for this, though there are cases where a GenServer will suit better. The main idea still remains the same: consider running different activities in separate processes. This may boost your performance and fault tolerance. I explained some of these ideas in my ElixirConf EU talk (https://www.youtube.com/watch?v=wYttHG3S76Y&feature=youtu.be). Sadly the sound sucks for the first 10 minutes :/
 

Also, on this level, what do you feel is Elixir/Erlang's main benefit over, say, Go? There I could pretty easily spawn up a couple processes, wait on channels, and respond. On the code architecture level, if the benefit is mainly in the area of something like OTP, I'm trying to understand when and where that impacts my code. Or do you see advantages on the more primitive level of basic async operations, too?

This is a really broad question, but I'll provide a couple of points on why I personally regard Go as inferior to Erlang when it comes to building complex systems that must run continuously:

1. A crash (unhandled panic) of a goroutine crashes the entire system. The whole system may halt due to an individual bug.
2. There is shared memory in Go. Hence, if one goroutine crashes, even if you catch the panic, it might leave some permanent in-memory junk, i.e. corrupt data that will compromise other parts of the system.
3. It's impossible to unconditionally terminate a goroutine. You can do something manually, but this amounts to sending a message "please kill yourself" and hoping this will happen. If, due to some bug, you have a goroutine that ends up in a tight CPU-bound loop, you can't possibly stop it, and it will waste your resources forever.

There are some other issues, but these three are showstoppers for me. All of those things can be worked around with careful programming, testing, and debugging, but then it seems to me that I have to do more work to get around some fundamental limitations of the platform. And no matter how hard I try, I'll never be able to completely eliminate those issues. In contrast, I feel that Erlang has a more structured approach, and I tried to explain this in my talk.

I personally regard Go as an interesting option for a heavy processing tool that needs to take some input and produce some output in finite time. In such scenarios, raw speed will usually be more relevant, while high availability is probably not needed.

However, in my experience a system is a completely different beast. Many activities are happening there: requests, background jobs, communication with other systems, and it's beneficial that individual failures have as little impact as possible on the entire system. In addition, most of those things are I/O bound, with some amount of CPU processing involved. This is where Erlang's approach feels more appropriate to me. Erlang is not as fast as Go, but it deliberately sacrifices raw speed to get other benefits, such as predictable latency with less variation, and better fault tolerance, both of which are possible due to total process isolation.
 

Although you don't need a GenServer in this particular case, that is indeed necessary in a GenServer. The module is about code. The name is one of the ways to identify the process that is running the code so you can send commands to it. They are orthogonal abstractions.

I actually mentioned a couple of days ago that most languages give you ways to think about how you organize and structure your code (modules, packages, classes). Very few languages give you an abstraction for how you interact with the code and the data in the runtime system, and processes offer exactly that.

This part is hard to adjust to. I think I like it, but it creates some strange parallels. In OO code, you care about a class, and you instantiate it into an object that you can carry around. Gen servers feel like this; you initialize them and get a pid back that you can pass in as the first parameter to further function calls. So even though the underlying code is very different, I'm running into strange similarities. Can't tell if that's just me over-translating from an OO background.

Yes, this is how I think about gen_servers. They are kind of like concurrent objects. They have identity, they have an interface, and they encapsulate state. Unlike typical OO style, most of the time you'll actually want to have some registration mechanism and refer to processes via aliases. This is mostly needed for fault-tolerance reasons: a process may fail and be restarted, in which case the new process will get a different pid. Thus, rather than keeping and passing pids around, it's frequently better to use a registration/discovery mechanism. A server process registers itself under some name, and clients discover it through this name. If a server process is restarted, the new one will register itself under the same name, and the system will continue to work normally. There are various approaches to registration/discovery. Local registration (e.g. via the :name option) and gproc (https://github.com/uwiger/gproc) are the ones I use the most.
 

Ed W

Jun 3, 2015, 5:31:42 AM
to elixir-l...@googlegroups.com
Hi

> Okay, that makes sense. But are there general pointers on when to use
> a gen server and when to just build a module with functions to call?
> Is the main distinction whether you have any server state to keep
> track of?

You have had replies from some experts who are either building the
language or writing books on it (I have just bought Sasa's book, and I
would also recommend "Designing for Scalability with Erlang/OTP" from
O'Reilly), but I wonder if they are answering in more depth than you
were asking for?

As a beginner myself I would answer your question:

Erlang/Elixir is curious in that it is arguably very OO if you consider
"processes" to be "objects". This is justified in that you send them
messages and they respond to those messages - this is literally the
definition of OO. However, the idea is of course that each process is
single threaded, but also that they are extremely cheap to spawn, which
is why it's almost acceptable to treat them as "objects".

I guess the above is well discussed, but the special feature of
Erlang/Elixir seems to be the ethos about code construction. You are
supposed to think about failure differently and the technique is to
"fail early and fail fast", but most importantly you get tools to help
you compartmentalise the failure and *restart* the system in a sane state.

Personally this feels very much like advocating "exceptions" in a
traditional language to me, but there is a key difference, which is
about handling the exception if it happens. In Erlang/Elixir you build a
supervisor to start your processes, and you have ways such that if an
error occurs you can terminate a tree of processes (and then restart
them correctly) in order to deal with the condition and ensure that your
state is returned to something sane.


You mentioned Go. On the surface both offer similar features, i.e. in
Elixir your *processes* are named and you send a message to a process,
but in Go the channel is named and you send a message to the channel. I
think Jose and certainly others have written that you can convert one
style to the other without too much trouble. Arguably also it's simpler
to spin up a bunch of Go processes to listen to a channel than it is in
Erlang (some would argue the reverse in Elixir).

The key difference to an Erlanger is that when something goes wrong you
can terminate the process and restart it sanely (which others might
argue is why you used an architecture with a channels design in the
first place, to have an arm's-length relationship between the code...).
With Go all bets are off: you can have a pool of processes watching a
channel in case one gets stuck, but even then all processes could get
stuck, or one can die and take down all the others (shared state...);
there is no reliable way to deal with failure on the other end of the
channel. Writing perfect code and thoroughly testing can reduce this to
an arbitrarily small risk, but THIS is the difference in philosophy
between Erlang/Elixir and Go.


Back to OTP. Erlang processes are quite cheap to spawn, and we said it
would be nice if we could monitor them if they die, restart them, etc,
etc. The question is how to do all that? So OTP is a debugged library on
top of processes to give you a leg up. There is no requirement to use
OTP; it's a wrapper, it's a philosophy, but it's not mandatory and it's
just hiding underlying "processes".

Gen_server gives you a process, just like you would spawn normally, BUT
it's got a bunch of hooks built in to help abstract message handling,
hook into supervisors which can start and restart your process
consistently, handle code upgrading and restarts, handle failure in a
related module in a predefined manner, handle clean shutdown, etc. I
strongly recommend you use it where appropriate, BUT it's just a normal
process with a library on top.



So my understanding of Erlang/Elixir is that you want to think about
being very concurrent with your design, rather than being event based.
There is a post on writing Conway's Game of Life where each cell is an
Erlang process...! In your case each incoming connection looks like a
good fit for a process (and your web server will make this so).

However, you need "back pressure", i.e. if 1,000 clients visit
simultaneously and ask for some work, do you want to hit your DB with
1,000 queries? If you want to max out at, say, 100 simultaneous queries,
then you need to think about making some of the subsequent process
pools that you interact with a bit smaller; this in turn will cause some
clients not to be serviced immediately, which will cause a backlog, and
so on back up the chain. There are a number of libraries which can help
you avoid ballooning the number of spawned processes too quickly and
cascade the queue back up the pipeline. I.e. the logic is to avoid
having large buffers further down the pipeline: if your database can't
keep up, then push that limit back up towards the client (obviously you
can also make the database faster to keep up, etc). This is a popular
metaphor, and for example ZeroMQ advocates avoiding large queues and
pushing failure and retry back towards the client.


So putting that all together: you correctly have one process per
incoming connection, and you have a configurable-sized pool of database
clients, BUT you accidentally ended up with a pool size of one process
in the middle doing some dispatching (i.e. you created one OTP
gen_server in the middle). This is applying back pressure back to the
incoming connections; nothing wrong with that, but it doesn't sound like
what you wanted.


Apologies if I'm teaching you to suck eggs. I'm learning also, and the
above is me trying to clarify my own thoughts on the philosophy of
Erlang/Elixir. To my eye: you would pick Elixir because you want the
final end product to be absolutely rock solid and have predictable
failure modes. We can argue about syntax and ease of code construction,
speed, etc, etc, but at the pointy end every so often code will have a
bug... With Go/C/Ruby/whatever you test and test and try to eliminate
bugs, but eventually someone will feed you input you don't know how to
handle, and at that point it's anyone's guess what will happen; with
Elixir you log the event so someone can go fix it and then restart
cleanly and carry on handling clients! *That* is quite unusual and quite
a distinction from other tools.


Good luck

Ed W

Ed W

Jun 3, 2015, 5:40:38 AM
to elixir-l...@googlegroups.com
On 03/06/2015 07:24, Donny Linnemeyer wrote:
> Thanks, Alex. Can you elaborate on what you mean by "polite
> back-pressure"? What is that, and what problem is it supposed to
> solve?

I replied separately, but I think it's worth looking at the ZeroMQ
project for philosophy. Languages like Erlang/Go are arguably about
loosely coupled processes interacting with each other through messages.
Especially with Erlang it's terribly easy to spin up (literally) a
million processes to handle an incoming wave of client connections. *IF*
you have some process further down, say a database, which cannot handle
that number of requests, then you have a bottleneck: either fix this
(but then there will be another bottleneck somewhere else), or consider
what you want to happen.

If you consider a factory making stuff and one part of the pipeline is
going faster than the rest then:
- for short periods of time you can "buffer" just before the bottleneck
- but eventually any buffer fills up and you need to decide what to do...

Back pressure is about artificially moving the bottleneck back up the
chain. So now perhaps the sales guy accepting orders for widgets in the
factory is forced to make some decisions... There is no hard and fast
rule on what to do here, but the point is that once you hit a bottleneck
in your architecture you NEED to do something: you are failing, so how
do you want to fail? One philosophy is to move the failure up the chain
back towards the client; perhaps we can make better decisions at that
end... So "back pressure" is about artificially flow-restricting
whatever feeds our process so that it becomes an issue for the previous
process step; they in turn might flow-control upwards until eventually
some process higher up the chain can make a decision on what to do...

This isn't an Elixir thing per se, but there are some good ways one
could handle this in Elixir (perhaps better than in some other systems).
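
To make the idea concrete, here is a toy sketch (not a production
library; look at jobs, poolboy, etc. for real implementations) of a tiny
permit-counting process: callers must acquire a slot before touching the
DB, and when the limit is reached they queue up, so the pressure moves
back towards the client:

defmodule Throttle do
  use GenServer

  def start_link(limit) do
    GenServer.start_link(__MODULE__, limit, name: __MODULE__)
  end

  # Callers block here until a slot is free, so pressure propagates upstream.
  def run(fun) do
    :ok = GenServer.call(__MODULE__, :acquire, :infinity)

    try do
      fun.()
    after
      GenServer.cast(__MODULE__, :release)
    end
  end

  ## Callbacks

  def init(limit), do: {:ok, {limit, 0, :queue.new()}}

  # At the limit: don't reply yet, park the caller in a queue.
  def handle_call(:acquire, from, {limit, in_flight, waiting}) when in_flight >= limit do
    {:noreply, {limit, in_flight, :queue.in(from, waiting)}}
  end

  def handle_call(:acquire, _from, {limit, in_flight, waiting}) do
    {:reply, :ok, {limit, in_flight + 1, waiting}}
  end

  # A slot was freed: hand it straight to a waiting caller if there is one.
  def handle_cast(:release, {limit, in_flight, waiting}) do
    case :queue.out(waiting) do
      {{:value, from}, rest} ->
        GenServer.reply(from, :ok)
        {:noreply, {limit, in_flight, rest}}

      {:empty, rest} ->
        {:noreply, {limit, in_flight - 1, rest}}
    end
  end
end

# Usage, e.g. at most 100 concurrent queries:
#   Throttle.start_link(100)
#   Throttle.run(fn -> Msas.Meta.get("10100") end)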

All the best

Ed W

Donny Linnemeyer

Jun 3, 2015, 2:03:36 PM
to elixir-l...@googlegroups.com
Needing state is a good starting point. For example, let's suppose some information from the database never changes. You could load it and store it in a GenServer so you don't need to do a database roundtrip every time.

So in that case, is there a standard method to avoid bottlenecking on that gen server? A pool of gen servers? A non-blocking (in itself) gen server (with :noreply and handle_info)? Or does it just depend too much on the details? In your experience, is this a common problem to run into? We're trying to play with the language to identify anticipated pain points, but I don't want to fixate on something that really just isn't a huge deal.
 
For your questions regarding running it in production and what Elixir brings to the table, I recommend two things:


* Read Erlang in Anger, a fantastic free book about running Erlang in Production: http://www.erlang-in-anger.com/ 

Thanks; I'll definitely get to those.
 
If they are named processes, a comparison with a singleton is correct from the code organization perspective. But the difference, really, is the runtime abstraction of a process with all the amazing monitoring, introspection and resilience built into it. That's the main difference over the mainstream OO languages and Go imo. :)

 Got it. I'll dig more into some of the debugging/monitoring tools and see how those go. Thanks again.

José Valim

Jun 3, 2015, 2:37:00 PM
to elixir-l...@googlegroups.com
So in that case, is there a standard method to avoid bottlenecking on that gen server? A pool of genservers? A non-blocking (in itself) genserver (with :noreply and handle_info)? Or does it just depend too much on the details? In your experience, is this a common problem to run into? We're trying to play with the language to try to identify anticipated pain points, but I don't want to fixate on something that really just isn't a huge deal.

It depends. It may be a pool, it may be ETS. I typically don't worry about this because I will only know my bottlenecks once the code runs in production. Bottlenecks will happen in any software, in any language; it just happens that they are typically easier to find here (as all communication is explicit).

dlinn...@economicmodeling.com

Jun 3, 2015, 5:32:49 PM
to elixir-l...@googlegroups.com
Thanks Sasa, that helps.

Do you have any pointers to materials that might help with learning where and how to use OTP vs. more plain approaches? I think one of my problems is that for whatever reason I came away from tutorials and Joe Armstrong's Erlang book thinking OTP was the hammer and everything was a nail. Not sure if it was the tutorials or me, but if you've found any particular materials helpful on learning when to use or not use OTP, that'd be awesome.

For Go, can you outline any particular problems you've run into where you think Go would have made things more difficult? I think I understand the points, but in trying to weigh costs and benefits I'm trying to get a gauge on how big of a practical benefit those are for our situation.

Thanks,
Donny

dlinn...@economicmodeling.com

Jun 3, 2015, 5:46:15 PM
to elixir-l...@googlegroups.com
Thanks, Ed. The back pressure explanation especially was helpful. For things like supervisors gracefully restarting some workers, I think I'm starting to grasp the principles, but I'm also trying to get a handle on more specifics. What classes of bugs does this handle better, and what classes of bugs end up about the same?

For example, in Go, you can catch panics from your whole web server, meaning a failure in a controller won't crash the server. Of course it's pretty far back, so I'd imagine it's caught later, but it's still not catastrophic? Then you have other portions of your app, maybe a process that updates some metadata in an ETS table every few minutes, not part of the web server. In Erlang, a crash there is easily caught and the process restarted, which is nice, but it obviously doesn't prevent a bug that corrupts data in ETS and maybe crashes after the corruption. Those, I'd imagine, are roughly equivalent?

Part of this is my own lack of experience with systems like this, so I don't have prior knowledge of the types of failures to anticipate, how common or difficult to debug they are, etc., so it's hard to weigh the value of things like supervisors if we don't necessarily need crazy high availability. Does that make any sense? Any help along those lines?

Thanks,
Donny

Saša Jurić

Jun 3, 2015, 6:19:05 PM
to elixir-l...@googlegroups.com

On 03 Jun 2015, at 23:32, dlinn...@economicmodeling.com wrote:

> Thanks Sasa, that helps.
>
> Do you have any pointers to materials that might help with learning where and how to use otp vs more plain approaches. I think one of my problems is that for whatever reason I came away from tutorials and joe armstrong's erlang book thinking otp was the hammer and everything was a nail. Not sure if it was the tutorials or me, but if you've found any particular materials helpful on learning when to use or not to use otp, that'd be awesome.

When it comes to production, I'd say your Elixir/Erlang systems should always be structured around OTP principles. It took me a while to figure this out, and in the meantime I was still able to produce some stable production software, despite using a lot of antipatterns along the way. So in my experience, many things can be done without OTP. After all, OTP is written in Erlang. But you'll end up reinventing the wheel, so it's better to understand OTP principles, which are not all that hard.

Of course, as long as you're experimenting and evaluating, this is probably not so important. But for production, I'd really recommend using OTP. In particular, build your system as an OTP application (or applications), use a supervision tree, even if it's very flat, and run all processes as OTP-compliant, somewhere under your supervision tree. If you use a mix project, and abstractions such as GenServer, GenEvent, Supervisor, Task, and Agent, you'll be very far down that road.
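
As an illustration, even a very flat tree along these lines gives you the OTP-compliant structure I mean (module names are hypothetical, and this uses the Supervisor.Spec helpers that are current at the time of writing):

defmodule Msas.Supervisor do
  use Supervisor

  def start_link, do: Supervisor.start_link(__MODULE__, :ok)

  def init(:ok) do
    children = [
      # The Ecto repos supervise their own connection pools.
      supervisor(Msas.MySQL, []),
      supervisor(Msas.Postgres, []),
      # Your own stateful workers sit alongside them.
      worker(Msas.TableInfo, [])
    ]

    supervise(children, strategy: :one_for_one)
  end
end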

> For go, can you outline any particular problems you've run into where you think go would have made things more difficult? I think I understand the points, but in trying to weigh costs and benefits I'm trying to get a gauge on how big of a practical benefit those are for our situation.

I evaluated Go after I already had some production experience with Erlang. I was looking for an Erlang alternative, as there were some parts of Erlang I didn't like. I really hoped Go would be it, but it simply wasn't. For me, the issues I listed are showstoppers, because I'm mostly a backend developer, which means I develop server-side systems that need to run continuously and be resilient. My experience with Erlang was that, despite many of its quirks, the system was stable and fault-tolerant, even if I didn't pay special attention to it. In contrast, looking at Go, I saw many more ways I could accidentally shoot myself in the foot. I should mention that I have a lot of C/C++/C# background, so Go would in fact have felt more natural to me at the time I looked at it (I believe it was about 2012). And yet, it felt like fundamentally the wrong tool for the job I was doing.

Note though that I didn't really write anything relevant in Go. I read a book, played a bit, and concluded that it doesn't really stand up to what Erlang gave me. A bit later I discovered Elixir, and never looked back :-)
YMMV of course.

I'm not saying Elixir or Erlang is magical. Things can still go wrong and systems may fail in many ways. However, my experience is that it's less of an uphill battle, and the supporting building blocks, while very simple, are super powerful, allowing me to systematically tackle the challenge of building a complex system that is scalable, reliable and has predictable latency.

Chris McGrath

Jun 3, 2015, 6:20:12 PM
to elixir-l...@googlegroups.com, Christopher McGrath

On 3 Jun 2015, at 22:46, dlinn...@economicmodeling.com wrote:

> Thanks, Ed. The back pressure explanation especially was helpful. For things like supervisors gracefully restarting some workers, I think I'm starting to grasp the principles, but I'm also trying to get a handle on more specifics. What class of bugs does this handle better and what classes of bugs end up about the same?
>
> For example, in go, you can catch panics from your whole web server, meaning a failure in a controller won't crash the server. Of course it's pretty far back, so I'd imagine it's caught later, but still not catastrophic? Then you have other portions of your app, maybe a process that updates some meta data in an ets table every few minutes, not part of the web server. In erlang, a crash there is easily caught and the process restarted, which is nice, but it obviously doesn't prevent a bug that corrupts data in the ets and crashes maybe after the corruption. Those, I'd imagine, are roughly equivalent?
>

One of the lovely advantages of structuring your app according to OTP principles is that you write a lot less error handling code, and so it's a lot easier to see the working logic of your app versus having to worry about errors every other statement. This makes it easier to write more correct code to start with, in my experience so far.

For avoiding corrupt data there is the concept of an "error kernel". The idea here is that the process that updates your ETS table or GenServer state does not have any logic other than writing to the table / updating a data structure. You put the "dangerous" work that might result in corruption in another process. If the work results in a crash, then that process dies, the rest of the system proceeds, and the vital state isn't touched. If the work completes successfully, you ask the process managing the writes / state to update, which vastly reduces the risk of something corrupt getting in there.

http://jlouisramblings.blogspot.co.uk/2010/11/on-erlang-state-and-crashes.html explains this much better than I just have.
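
A minimal sketch of the split (module names are hypothetical): the kernel process only stores validated results, and the risky work runs in a throwaway task:

defmodule Msas.MetaCache do
  use GenServer

  def start_link, do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  def put(code, meta), do: GenServer.cast(__MODULE__, {:put, code, meta})
  def get(code), do: GenServer.call(__MODULE__, {:get, code})

  def init(:ok), do: {:ok, %{}}
  def handle_cast({:put, code, meta}, state), do: {:noreply, Map.put(state, code, meta)}
  def handle_call({:get, code}, _from, state), do: {:reply, Map.get(state, code), state}
end

# Elsewhere, the "dangerous" work happens in its own process:
Task.start(fn ->
  meta = Msas.Meta.get("10100")       # may crash; the cache is untouched if it does
  Msas.MetaCache.put("10100", meta)   # only reached on success, so the kernel stays clean
end)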

Compare this with OO where the application state is distributed around all your objects and so you have to have error handling code around vital state in every object. (TBH when I saw how error handling was done in Go I immediately closed the browser tab!)


> Part of this is my own lack of experience with systems like this, so I don't have prior knowledge of the types of failures to anticipate, how common or difficult to debug they are, etc., so it's hard to weigh the value of things like supervisors if we don't necessarily need crazy high availability. Does that make any sense? Any help along those lines?
>

The Erlang in Anger book was mentioned further up the thread. It's written by Fred Hebert, who works on the Heroku routing team and has seen a lot of failure modes. It's free, but could perhaps be a bit too in-depth if you're still learning about OTP. Definitely worth skimming though, especially chapter 3.

> Thanks,
> Donny

Cheers,

Chris

Saša Jurić

Jun 3, 2015, 6:45:07 PM
to elixir-l...@googlegroups.com
On 03 Jun 2015, at 23:46, dlinn...@economicmodeling.com wrote:

For example, in go, you can catch panics from your whole web server, meaning a failure in a controller won't crash the server. Of course it's pretty far back, so I'd imagine it's caught later, but still not catastrophic? Then you have other portions of your app,

A few months ago, when I was playing with gin, I started a goroutine from a request handler, and in that goroutine I induced a division by zero. That caused a crash of the entire web server. For me, this is a deal breaker, because this means that a single error can crash the entire system. Yes, you can use upstart/monit, or something of the kind, to bring the server up again. But still, a single bug will make all current activities fail, including all requests being served, as well as any background activities. Furthermore, if the error is recurring (maybe some bug while parsing some data source), bringing the server back will not help, since the bug will appear again, and everything will crash again.

This is diametrically opposite to Erlang, where a crash is by default isolated, and then as developers we work to propagate the crash information through the system, so that someone else can take corrective measures.

Of course, I could obsessively try to recover from a panic in every goroutine I start. But besides being very tedious and error prone, this doesn’t solve all problems. What if my dependency starts a goroutine that crashes? That will take down my system, and I have no way of preventing it.

maybe a process that updates some meta data in an ets table every few minutes, not part of the web server. In erlang, a crash there is easily caught and the process restarted, which is nice, but it obviously doesn't prevent a bug that corrupts data in the ets and crashes maybe after the corruption. Those, I'd imagine, are roughly equivalent?

This is in fact not related to ETS, it could be a database, or file system as well. The general pattern to mitigate this is: 

a) Consider carefully whether a state needs to be preserved and minimize the persistence of the state of processes as much as possible.
b) When you need to persist the state, try to do it at the very end of the job, after all critical actions have been taken. It’s most likely that the state at this point is stable.

The gist is that it's a conscious and explicit action. In the Erlang world, state is by default not preserved. Hence, when you want to do it, you should be aware of the risks (possible corruption), and do it sparingly and with more care.

It's also worth mentioning that an ETS table is tied to an owner process, so you can make sure that if a process crashes, the table is taken down with it (but you can also make the table outlive the process).
Finally, Chris already mentioned the great article on the topic, but I'll repeat it for good measure: http://jlouisramblings.blogspot.co.uk/2010/11/on-erlang-state-and-crashes.html

Donny Linnemeyer

Jun 3, 2015, 7:39:37 PM
to elixir-l...@googlegroups.com
Thanks everyone; the responses have been really helpful, and I think that's enough info to get me going through the next step in terms of debugging tools and OTP best practices. You all mentioned Erlang in Anger, Designing for Scalability with Erlang/OTP, Elixir in Action (Sasa's book), and a handful of articles/videos. Anything else you'd recommend, especially along the lines of getting newcomers up to speed with OTP patterns/anti-patterns?

A couple of other misc. questions. Sasa, you mentioned working with some sample Go code and killing a web server with a division by zero error. It's my understanding that you can wrap your web server and recover from any panics caused by goroutines called within any controllers. I think that's how it panned out in our testing, but we only handled I think one or two examples. Am I right about that, or are you talking about non-recoverable errors?

Also, does anyone have any insights into training/hiring for Elixir/Erlang vs. Go? The Elixir talent pool seems very small on its own, but I'd guess you could potentially hire Erlang devs for an Elixir project? Or train people with a background in Ruby, Python, PHP, etc.? My guess is that concurrency skills in general are in high demand, so I'm expecting hiring to be relatively difficult regardless of language. I'd also expect training to be comparatively difficult regardless of language, given how complicated concurrent systems seem to be. But I'd definitely like to take it into account if Go has a leg up here.

Thanks,
Donny


Saša Jurić

unread,
Jun 4, 2015, 3:51:32 AM6/4/15
to elixir-l...@googlegroups.com
On 04 Jun 2015, at 01:39, Donny Linnemeyer <dlinn...@economicmodeling.com> wrote:

A couple of other misc. questions. Sasa, you mentioned working with some sample Go code and killing a web server with a division by zero error. It's my understanding that you can wrap your web server and recover from any panics caused by goroutines called within any controllers. I think that's how it panned out in our testing, but I think we only handled one or two examples. Am I right about that, or are you talking about non-recoverable errors?

My knowledge of Go is very basic, but if I understand correctly, if no function in the call stack recovers from a panic, the goroutine will die and bring the entire system down with it. A web server framework may mitigate this, because it wraps your callback functions and thus can recover from a panic. However, this holds only for the request goroutine. If you create another goroutine while e.g. handling a request, the framework can’t possibly recover from a panic in that goroutine.

This article (http://launchdarkly.com/blog/golang-pearl-goroutine-panic-gosafely/) gives a nice explanation and also provides a workaround, which will address the problem as long as your entire team uses this approach absolutely everywhere, and the same thing holds for all your dependencies. This is hard, if not impossible, to satisfy in a moderately complex system, so the risk of the situation where a single error can take down the whole system never disappears.

Erlang has in my opinion a much more structured approach to failure isolation and recovery:

1. A process crash is completely isolated, disturbing no one else, leaving no garbage behind.
2. Anyone can learn about a process crash and do something about it.
3. You can even link multiple processes, making sure that a crash of one process takes down all linked processes, leaving all other processes untouched.
4. You can unconditionally stop a process.

None of these things are possible in Go. I’ve seen some hacks that simulate those features, but they are a far cry from Erlang.
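To make points 1–4 concrete, here is a minimal Elixir sketch (plain standard-library calls, nothing specific to any app):

# 1. An isolated crash: the spawned process dies, the caller is unaffected.
spawn(fn -> 1 / 0 end)

# 2. Anyone can learn about a crash: monitor the process, get a :DOWN message.
pid = spawn(fn -> raise "boom" end)
ref = Process.monitor(pid)
receive do
  {:DOWN, ^ref, :process, ^pid, reason} -> IO.inspect(reason, label: "crashed with")
end

# 3. Linking: a crash in a linked process takes us down too, unless we trap
#    exits and turn it into an :EXIT message we can act on.
Process.flag(:trap_exit, true)
linked = spawn_link(fn -> exit(:oops) end)
receive do
  {:EXIT, ^linked, reason} -> IO.inspect(reason, label: "linked process exited")
end

# 4. Unconditionally stopping a process from the outside.
victim = spawn(fn -> Process.sleep(:infinity) end)
Process.exit(victim, :kill)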

Ed W

unread,
Jun 4, 2015, 6:59:21 AM6/4/15
to elixir-l...@googlegroups.com
Hi

> For example, in go, you can catch panics from your whole web server, meaning a failure in a controller won't crash the server. Of course it's pretty far back, so I'd imagine it's caught later, but still not catastrophic? Then you have other portions of your app, maybe a process that updates some meta data in an ets table every few minutes, not part of the web server. In erlang, a crash there is easily caught and the process restarted, which is nice, but it obviously doesn't prevent a bug that corrupts data in the ets and crashes maybe after the corruption.

See, now turn your thought around and you are arguing *FOR* Erlang's
design methodology! What you just described is "oh cr*p, shared state
is really hard to get right!". This is the Erlang methodology: minimise
your shared state (in an absolutely pure functional language you carry
all your state in the parameters and pass it into each function). ETS
is one way to get "mutability" back, and if you can avoid it, it is
preferable to have as little mutability as possible in the design.

So, another school of thought is something equivalent to having a really
good monit setup, watching your go service. Whenever you hit an
unexpected state you just let the whole go process die, let monit bring
it back up again and off you go servicing the next request. Erlang kinda
advocates something like this... Seems a bit scary really... However,
there are very strong tools (called OTP) which allow you to point at
specific processes and draw out a dependency tree and basically say, "if
something happens to this process, kill all those processes as well,
restart them and notify this thing over here".

So it's not quite the same as a monit restart. In fact you can say: look,
this thing over here crashed, that makes our datastore iffy, so let's
kill any data relating to this node, then let's kill the processes
handling the TCP connection (so the connection drops), and also kill
these resources over here which the process had checked out. OK, now
optionally you can bring all these processes back up. Neat!
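As a rough sketch of that kind of dependency tree (all the module names
below are hypothetical), a supervisor with the :rest_for_one strategy
expresses exactly "if this crashes, also kill and restart everything
that depends on it":

defmodule MyApp.ConnectionSupervisor do
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    children = [
      # Hypothetical modules. With :rest_for_one, if ConnectionHandler
      # crashes, the children listed after it (the ones that depend on the
      # connection) are terminated and restarted too, while DataStore,
      # which comes before it, is left alone.
      MyApp.DataStore,
      MyApp.ConnectionHandler,
      MyApp.RequestWorker
    ]

    Supervisor.init(children, strategy: :rest_for_one)
  end
end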

So you can get "notified" when any process crashes. You can also
trigger other processes to crash if you crash. Joining those together
you can create chains of dependencies and be quite precise about what
happens. My understanding is that whilst it's always an option to trap
errors and try to reverse out the situation, Erlang argues in favour of
just blowing away *all* state relating to the situation, then rebuilding
a fresh set of services.

So nothing stops you catching a database timeout and retrying it, or
retrying some microservice call, but what it gives you, once the
microservice/database/etc looks really dead, is the option to just
give up and let the exception climb the stack: something near the top
of the stack can take higher level actions (as with backpressure), or
you can even allow it to bubble right up and cause this process to be
restarted. Lots of options. Note, I'm personally at the point of
struggling a lot with letting go of control...
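To make the "retry a bit, then give up and let it crash" option above
concrete, a tiny helper might look like this (the helper and the service
call are made up for illustration):

defmodule Retry do
  # Retry a fallible call a few times with a small pause; if it still fails,
  # re-raise and let the exception climb the stack (or crash the process and
  # let its supervisor decide what to do next).
  def with_retries(fun, attempts \\ 3, pause_ms \\ 200) do
    fun.()
  rescue
    error ->
      if attempts > 1 do
        Process.sleep(pause_ms)
        with_retries(fun, attempts - 1, pause_ms)
      else
        reraise error, __STACKTRACE__
      end
  end
end

# e.g. Retry.with_retries(fn -> SomeMicroservice.call!(:foo) end)
#      (SomeMicroservice is imaginary; substitute your own call)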

Oh, also be aware that when stuff crashes, you also get to run some
code... OTP crashes are actually more of an "I'm not feeling well, here
is my state right now, do something sensible with it, I'm going for a
lie down".

> Part of this is my own lack of experience with systems like this, so I don't have prior knowledge of the types of failures to anticipate, how common or difficult to debug they are, etc. so it's hard to weigh the value of things like supervisors if we don't necessarily need crazy high availability. That make any sense? Any help along those lines?

You are thinking in the wrong terms. Testing gets rid of all those
failures you are thinking about right now. However, consider another
scenario: say your physical hardware runs some other stuff that causes
the database to get slow, which in turn causes some other services to
respond slowly or intermittently (perhaps someone is also restarting
things to try and fix the issue), and somewhere way up the chain this
impacts your app. Can we plan for this? Hmm, unlikely, at best it's
tricky. What will happen to your Go/Ruby app at this point? Dunno,
probably down to how much effort you put into thinking about these very
remote corner cases. How will your Elixir app respond? Dunno, BUT, you
can build some sane worst cases into your app, eg "if the timeout is
too long, return some error message to the client instead. If too many
clients are queuing on requests, flip to some new behaviour, eg serving
"too busy" messages".

So I'm eyeing Elixir less for the case of programming bugs and more for
the case of being able to *design* the worst-case behaviour (ok, stuff
can still die, but I hope you get what I mean). With Ruby I try and
avoid hitting the worst case; with Elixir I can design for "gone wrong"
and "normal" fairly easily. This seems quite attractive!


Good luck!

Ed W

Chris McGrath

unread,
Jun 4, 2015, 7:21:32 AM6/4/15
to elixir-l...@googlegroups.com, Christopher McGrath

On 4 Jun 2015, at 00:39, Donny Linnemeyer <dlinn...@economicmodeling.com> wrote:

Also, does anyone have any insights into training/hiring for Elixir/Erlang vs. Go? The Elixir talent pool seems very small on its own, but I'd guess you could potentially hire Erlang devs for an Elixir project? Or train people with a background in Ruby, Python, PHP, etc.? My guess is that concurrency skills in general are in high demand, so I'm expecting hiring to be relatively difficult regardless of language. I'd also expect training to be comparatively difficult regardless of language, given how complicated concurrent systems seem to be. But I'd definitely like to take it into account if Go has a leg up here.

Michael Schaefermeyer’s talk at ElixirConf.eu covers how they introduced Elixir at Bleacher Report, including training/hiring, running in production, etc., so I’d recommend watching that too.


Chris

dlinn...@economicmodeling.com

unread,
Jun 7, 2015, 11:05:47 PM6/7/15
to elixir-l...@googlegroups.com
Whoa, I completely misunderstood how that worked in Go. That scares me a ton. Thanks for pointing that out.

Donald Linnemeyer

unread,
Jun 7, 2015, 11:43:08 PM6/7/15
to elixir-l...@googlegroups.com
Thanks, I'll check that out.

Sent from my iPhone