Golang: Discrete Event Simulation 101
Hey friend 👋
I've been doing a lot of simulation work lately trying to figure out how a system with various moving components behaves in different scenarios and I thought it would be great to share how I approached it.
Writing a simulation environment has been fun and I had a chance to dust off my university notes on modeling stochastic processes. 🤓
What and Why
In this post, I'll show how to leverage the two data structures I mentioned earlier, so you have a chance to read through those posts to recap on how those containers work internally:
Building a simulation model is a convenient way (and sometimes the only way) to stress test your system by taking it through scenarios that don't often occur in real life.
This could be super useful when your system is complex and have various probabilistic processes and cross-component dependencies which are hard to reason about and you cannot afford to risk reliability and test your scenarios in production.
Hope you feel excited and ready to build some simulations 😉
Simulation Scenario: Feeding Pets
This is super exciting!
We'll be building an automated feeding system for our animal companions! 🐕 🐈
In reality, all systems are similar in some ways and you can apply the same approach to reasoning about backend microservices or any other applied domain.
Here's the assumptions:
- There are N pets we need to feed, all of them hungry at the start.
- We've got K automatic feeders (K <= N)
- Once a feeder is available, a random hungry pet would start using it.
- When a pet is at a feeder, it eats for time T and leaves satisfied releasing the feeder.
- Then the cycle repeats until there's no more hungry pets.
- Once in a while (with probability P) a feeder breaks and someone needs to go and fix it, which takes time R.
What we'd like to know:
- Time duration for 50%, 95%, 99% feeding completion.
- Can we reasonably feed all the pets twice a day and how many feeders we would need for N pets.
Now, this might look artificial, but bear in mind that the goal in designing a simulation system is a balance act of conflicting priorities:
- Trying to make the simulation as simple as possible so that it's performant and easy to build and evolve.
- Getting to the level of detail that reflects the properties you're trying to model.
If we can get away with not modeling everything ideally and still get a reasonable approximation of reality, that's good enough 😉
Application of Queue
The key element in a simulation system is an event queue.
Unless you're simulation something without durations, you need to track when certain things are supposed to happen.
What we want from an event queue:
- Push future events onto the queue
- Retrieve the earliest event from the queue
Now, this should look familiar 😉
This is exactly what a Heap container does. And in a few special cases you can use a Ring queue as well for more performance (constant instead of logarithmic time).
How to build a simulation
If you want to follow along, open your favourite editor/IDE and make a new go binary package. Alternatively, you can use Go Playground.
There are two key ways to simulate time:
- Incremental Time Progression – you simulate X time steps in a loop moving forward in fixed time intervals.
- Event-time Progression – jump to the earliest future event time at every simulation step.
In general, if you need to simulate a year of time and only 100 events happen in this year in your system, you can see how jumping to event-time (2) is significantly more efficient than uselessly looping over hourly or minute time increments (1).
We'll look at both appoaches.
Here's the start:
package main
// our event queue interface, this is anything that we can
// add elements to, remove minimal, peek minimal and check its size.
type EventQueue[T any] interface {
Push(element T) error
Pop() (T, error)
Peek() (T, error)
Size() int
}
func main() {
// ...
}
A structure that would hold our simulation parameters (configuration that doesn't change from run to run in case we want to repeat this simulation).
type Simulation struct {
// min-queue for tracking repair times
eventQueue EventQueue[int]
// total number of pets we need to process i.e. feed
taskCount int
// number of parallel processing gates i.e. feeders
gateCount int
// probability of a gate breaking
breakChance float32
// units of time required for repairs
repairTime int
// processing time
processingTime int
}
// run the simulation and return 3 duration percentiles:
// time to 50%, 90% and 99% done
func (s *Simulation) run(seed int64) (percentileTime [3]int) {
// this isn't doing anything useful now, we'll code this later
return
}
And here's the main function stub, this is how we'll run our simulation program:
func main() {
var r EventQueue[int] = nil // nil for now
// these are example settings, feel free to play around with the simulation parameters!
stepSim := Simulation{
eventQueue: r,
taskCount: 1_000_000, // feed a millon pets!
gateCount: 10, // with 10 feeders
breakChance: 0.05, // each has a 5% chance of failure
repairTime: 120, // requiring 120 time units of repair time
processingTime: 15, // and with 15 time units of eating time
}
// give a seed for random generated probabilistic events
res := stepSim.run(100)
// print out results
fmt.Printf("Time to X%%: 50%% = %d, 95%% = %d, 99%% = %d\n", res[0], res[1], res[2])
fmt.Printf("Relative Time to X%%: 50%% = 1x, 95%% = %.2fx, 99%% = %.2fx\n", float32(res[1])/float32(res[0]), float32(res[2])/float32(res[0]))
}
At this point, although the program isn't doing anything useful yet, we have the main structure in place and can compile and run it 🙌
See this step on Go Playground
Fixed Time Increments
The idea here is that your events happen on (or approximately on) a time grid i.e. every hour, every day etc.
In this case, this is your natural time step and the granularity / precisions of your simulation.
The smaller the time step, the more precision you have, but at the cost of compute efficiency since there will be more simulation cycles.
Let's work on our run
function:
func (s *Simulation) run(seed int64) (percentileTime [3]int) {
// set the random generator seed
rgen := rand.New(rand.NewSource(seed))
var time int // current simulation time
var done int // done count
var broken int // broken gate count
// run simulation until all done
for done < s.taskCount {
// "fix" previously broken gates that
// have finished their repair time
for s.eventQueue.Size() > 0 {
ts, _ := s.eventQueue.Peek()
// if the minimal even of the event queue is in
// future, stop
if ts > time {
break
}
s.eventQueue.Pop() // remove the past event
broken-- // reduce broken
}
// process all healthy gates
time += s.processingTime // fixed time step
done += s.gateCount - broken // complete all healthy gate tasks
// roll a chance to of breakage for all healthy gates
// if it's broken already, it cannot break again
for idx := 0; idx < s.gateCount-broken; idx++ {
// test breakage by generating a uniform random number [0,1)
// [0-----[5%]------------------------1]
// ^ ^
// | brakes | survives
if rgen.Float32() < s.breakChance {
broken++
// schedule a repair at current + repair time
s.eventQueue.Push(time + s.repairTime)
}
}
// Metrics: percentiles of done
percentDone := 100 * done / s.taskCount
percentiles := []int{50, 95, 99}
for idx, p := range percentiles {
if percentDone >= p && percentileTime[idx] == 0 {
percentileTime[idx] = time
}
}
}
return
}
This should look pretty straight-forward.
Now, in order to run this, we need to provide an actual event queue container (remember, we have a nil
in the main
function now).
I'll chose a RingQueue
in this case because fortunately, we track only breakage events in this scenario and they have a fixed repair time.
This means that out future event time will only go up with every single step and will always be larger than the oldest event on the queue.
So there is no need to re-order events, we just need to carefully add them to the end of the queue and pull from the start.
This is a perfect use case for a RingQueue
(github repo).
import (
"fmt"
"math/rand"
// import RingQueue from the github repo
roundrobin "github.com/sombr/go-container-roundrobin"
)
// ...
func main() {
// here's our RingQueue container
r := roundrobin.NewRingQueue[int](10)
stepSim := Simulation{
eventQueue: r,
taskCount: 1_000_000,
gateCount: 10,
breakChance: 0.05,
repairTime: 120,
processingTime: 15,
}
res := stepSim.run(100)
fmt.Printf("Time to X%%: 50%% = %d, 95%% = %d, 99%% = %d\n", res[0], res[1], res[2])
fmt.Printf("Relative Time to X%%: 50%% = 1x, 95%% = %.2fx, 99%% = %.2fx\n", float32(res[1])/float32(res[0]), float32(res[2])/float32(res[0]))
}
See this code on Go Playground
Although we're using a fast event-queue container here, iteration itself can be slow on large timelines.
So, let's see how we can speed this up by jumping through time.
Event-Time Jump
Time jumping is super easy! We already have core simulation code that steps through the increments and we have an event-queue that gives us the next earliest event.
Knowing that no other events happen while we're waiting for the next event (repairs), we can jump straight to it! 🚀⏳
func (s *Simulation) run(seed int64) (percentileTime [3]int) {
rgen := rand.New(rand.NewSource(seed))
var time int
var done int
var broken int
for done < s.taskCount {
// "fix" broken
for s.eventQueue.Size() > 0 {
ts, _ := s.eventQueue.Peek()
if ts > time {
break
}
s.eventQueue.Pop()
broken--
}
// --------------- THIS CODE HAS CHANGED ---------------
// default next time step to the next processing increment
nextTime := time + s.processingTime
// if we have events on the queue, get the earliest time
if s.eventQueue.Size() > 0 {
nextTime, _ = s.eventQueue.Peek()
}
// add however many we'll be able to do in the meantime
done += (s.gateCount - broken) * (nextTime - time) / s.processingTime
time = nextTime // jump!
// ------------------- END OF CHANGE -------------------
// roll a chance to of breakage for all healthy gates
for idx := 0; idx < s.gateCount-broken; idx++ {
if rgen.Float32() < s.breakChance {
broken++
s.eventQueue.Push(time + s.repairTime)
}
}
// percentiles of done
percentDone := 100 * done / s.taskCount
percentiles := []int{50, 95, 99}
for idx, p := range percentiles {
if percentDone >= p && percentileTime[idx] == 0 {
percentileTime[idx] = time
}
}
}
return
}
See this code on Go Playground
Now, this is much more performant.
However, you might be able to spot a logical regression. Because we're jumping through the repair events and tracking completion along the way, our percentiles might be skewed. Additionally, breakages also happen only at repair processing times, which degrades simulation quality.
This is a tradeoff and might be tolerable depending on your goals.
However, I will show a more general approach to handling simulated events that is still performant and capable of handling different types of events and probabilistic durations.
Event Priority Queue
What if we need to track multiple different event types? Or if the event durations aren't fixed, but generated randomly.
We cannot continue using the RingQueue
in this case since we won't have ascending order of timestamps anymore.
Imagine that on step 1 we schedule an event for time 10 and on step 2 we need to schedule for time 5.
The best solution for this that holds the requirements of always giving us the earliest timestamp (minimal element) and still allowing to add arbitrary timestamps into the mix is a HeapQueue
(also known as Priority Queue).
Take a look at this code and see if you can trace how this works:
// both repairs and processing are events now
type Event struct {
time int // event timestamp as before
kind byte // event type - repairs or processing
}
type Simulation struct {
// note that this container holds items of Event type now
eventQueue EventQueue[Event]
taskCount int
gateCount int
breakChance float32
repairTime int
processingTime int
}
func (s *Simulation) run(seed int64) (percentileTime [3]int) {
rgen := rand.New(rand.NewSource(seed))
var time int
var done int
// track how make tasks are in flight (i.e. pets eating)
var inflight int
var broken int
for done < s.taskCount {
// since everything is events, we can start with
// rolling a breakage chance and scheduling repairs
for idx := 0; idx < s.gateCount && broken < s.gateCount; idx++ {
if rgen.Float32() < s.breakChance {
broken++
// probabilistic repair time
// mean = repairTime, stdev = repairTime / 4
repairTime := s.repairTime + int(rgen.NormFloat64()*float64(s.repairTime)/4)
// truncate a marginal chance that it goes negative which would not make sense
if repairTime < 0 {
repairTime = 0
}
repairedAt := time + repairTime
s.eventQueue.Push(Event{
kind: 'R', // Repair
time: repairedAt,
})
}
}
// top up inflight processing
// tasks (pets) fill the available healthy gates
for inflight < s.gateCount-broken {
inflight++
// probabilistic processing time
// mean = processingTime, stdev = processingTime / 4
processingTime := s.processingTime + int(rgen.NormFloat64()*float64(s.processingTime)/4)
if processingTime < 0 {
processingTime = 0
}
doneAt := time + processingTime
s.eventQueue.Push(Event{
kind: 'D',
time: doneAt,
})
}
// jump to the next earliest event
nextEvent, _ := s.eventQueue.Pop()
if nextEvent.kind == 'R' { // a gate got repaired
broken--
} else { // a pet finished eating :)
done++
inflight--
}
// time travel
time = nextEvent.time
// percentiles of done
percentDone := 100 * done / s.taskCount
percentiles := []int{50, 95, 99}
for idx, p := range percentiles {
if percentDone >= p && percentileTime[idx] == 0 {
percentileTime[idx] = time
}
}
}
return
}
See this code on Go Playground
This one was a little lengthy, but we got there 😉
You've been awesome :)
Thank you for reading and let me know if this helped you solve your engineering problems! I'm super interested to know what use cases you have!
As always, please subscribe - it makes my day brighter ☀️
Member discussion