Hi,
as usual, first a few questions to understand the problem before solving it :)
1. Which Spark version are you using, and in what environment?
2. If you have to join multiple data sources, how do you manage latency? For example, what if you want to join CSV files from folder A with those from folder B and they arrive with a 5-minute lag?
3. Does the data landing in the different areas all have the same schema, and does that schema change over time? What is the format and volume of these files?
Regards,
Gourav Sengupta