Hi guys,
While implementing the real-time pipeline to handle a large volume of data, I ended up with this architecture:
AWS:
[Step 1]
-> Load balancer ->
[Step 2]
-> 3 collector instances ->
[Step 3]
-> Kinesis [6 shards for collector output] ->
[Step 4]
-> 3 enrichment instances ->
[Step 5]
-> Kinesis [6 shards for enrichment output] ->
[Step 6]
-> sink instance, 3 processes sinking into -> Elasticsearch [single node]
But during a debug session to identify where I was "losing data", I realized that I could send the stdout of the collector at [Step 2] directly to an enrichment process running on the same instance. That would cut the pipeline short right after [Step 2], removing the Kinesis stream with 6 shards at [Step 3] and eliminating the 3 separate instances at [Step 4].
The output of the enrichment process is still sent to Kinesis, only because I can't send data directly to Elasticsearch when my input comes from stdin.
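
To make the hand-off concrete, here is a minimal Python sketch of piping one process's stdout into another's stdin on the same machine. It is only an illustration of the mechanism: `run_piped`, `FAKE_COLLECTOR` and `FAKE_ENRICHER` are hypothetical names, and the two commands are small Python one-liners standing in for the real collector and enrichment processes, whose actual command lines are not shown here.

```python
import subprocess
import sys


def run_piped(producer_cmd, consumer_cmd):
    """Run producer_cmd and feed its stdout into consumer_cmd's stdin.

    Returns (producer_exit_code, consumer_exit_code, consumer_stdout_bytes).
    """
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(
        consumer_cmd, stdin=producer.stdout, stdout=subprocess.PIPE
    )
    # Close the parent's copy of the pipe so that only the consumer holds
    # the read end; the producer then sees a broken pipe if the consumer exits.
    producer.stdout.close()
    out, _ = consumer.communicate()
    producer.wait()
    return producer.returncode, consumer.returncode, out


# Hypothetical stand-ins (assumptions, not the real collector/enrichment commands).
FAKE_COLLECTOR = [sys.executable, "-c", "print('event-1'); print('event-2')"]
FAKE_ENRICHER = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    sys.stdout.write('enriched:' + line)",
]

if __name__ == "__main__":
    print(run_piped(FAKE_COLLECTOR, FAKE_ENRICHER))
```

In this arrangement the two processes are tied together by an in-memory pipe on one instance, so nothing sits between them that stores events if the consuming side stops.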
Does this make sense? What are the downsides of this decision?
Thanks in advance,
André