delta table streaming from multiple sources


Guru V

Feb 5, 2021, 3:01:39 AM2/5/21
to Delta Lake Users and Developers
Hi,

my job
  • has to read data from different sources (CSV files from different folders).
  • Each source publishes its data in a scheduled batch every 6-10 hours (so every 6-10 hours multiple files are written to the folder).
  • my job has to join data from all sources and then operate on the joined DataFrame.
  • the join has to happen on the full published data (files). For example, File1 in folder1 might have data to be joined with File2 in folder2.
Is there any recommendation on how I can
  • read data as soon as it is published (written to a folder) and write it to something like a staging table?
  • have my job somehow know that all data has been written into the staging table before it starts processing?
  • make multiple passes over the staging table, so that if some data can't be joined in the first pass (the data to join with was not yet present), I can process such records in a second pass?

Thanks

suresh kumar pathak

Feb 5, 2021, 4:09:09 AM2/5/21
to Guru V, Delta Lake Users and Developers
Hi Guru,

First of all, you need to make sure the schema is the same for each folder; if it is not, you will get corrupt records. I am not sure how your source system will apply schema matching across the folders. If you use Azure Data Factory (ADF), it will be easier to automate the process and land the data into the staging area as ADLS Gen2 files, and then land the data into Delta Lake through Azure Databricks.
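To guard against the schema-mismatch problem described above, one option is to declare the schema explicitly when reading each folder, so rows that don't parse are captured in a corrupt-record column instead of poisoning the load. The sketch below is an assumption on my part (paths, column names, and the `id`/`payload` schema are all hypothetical), not something from the thread:

```python
# Sketch: enforce a schema on CSV input and quarantine unparseable rows.
# Folder path and columns are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("csv-staging").getOrCreate()

schema = StructType([
    StructField("id", LongType()),
    StructField("payload", StringType()),
    StructField("_corrupt_record", StringType()),  # receives rows that fail parsing
])

df = (spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("/landing/folder1/"))

# Cache before filtering on the corrupt-record column (Spark disallows
# queries that reference only that internal column on an uncached CSV read).
df = df.cache()
good = df.where(df["_corrupt_record"].isNull()).drop("_corrupt_record")
bad = df.where(df["_corrupt_record"].isNotNull())   # route aside for inspection
```

The `good` frame can then be appended to the staging Delta table while `bad` is written to a quarantine location.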

Regards,
Suresh


Chinny

Feb 5, 2021, 8:32:55 PM2/5/21
to Delta Lake Users and Developers
Hello Guru,
You can use the Databricks Auto Loader feature to incrementally process your source files in place as they arrive from the different source folders, and then merge your data and update the schema as needed. Also, if you are writing output to Delta tables, you can use the mergeSchema option to update target table columns automatically. I am attaching a link that talks about Auto Loader; I hope this helps: Incrementally Process Data Lake Files Using Azure Databricks Autoloader and Spark Structured Streaming API. | Chinny Chukwudozie, Cloud Solutions.
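A minimal sketch of the Auto Loader pattern described above, assuming a Databricks runtime (which provides the `cloudFiles` source); the folder, checkpoint, and staging paths are hypothetical, and you would run one such stream per source folder:

```python
# Sketch: incrementally ingest newly landed CSV files from one source folder
# into a staging Delta table (Databricks Auto Loader; paths are placeholders).
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "/checkpoints/source1/schema")
      .load("/landing/folder1/"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/checkpoints/source1/commits")
   .option("mergeSchema", "true")       # allow new source columns to evolve the target
   .trigger(availableNow=True)          # process everything that has landed, then stop
   .start("/staging/source1"))
```

The `availableNow` trigger (Spark 3.3+) fits the 6-10 hour batch cadence from the original question: the job can be scheduled after each publish window, drain whatever files arrived, and shut down.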

Guru V

Feb 8, 2021, 5:48:49 PM2/8/21
to Delta Lake Users and Developers
Thanks for all the replies.
Can you kindly elaborate on how to handle the second part (i.e., knowing when the data has landed, then joining it and moving it to another table)?

Let's say I have 2 sources/folders, with different file schemas. I ingest the files into 2 Delta tables (each table having the same schema as its files). Each folder might receive data at a different rate. My job then has to join both tables and write to a 3rd Delta table. If a record can't be joined across the 2 tables (one table is my primary table, and it carries a reference ID into the second table), the record stays in table 1. When a late-arriving transaction (from folder2 into Delta table 2) comes in, the earlier record should be joined and moved to table 3.
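One way to sketch this late-arriving-join pattern (my own assumption, not something confirmed in the thread): treat table 1 as the holding area and retry the join on every pass; whatever matches is merged into table 3 and deleted from table 1, and unmatched rows simply stay behind for the next pass. Table paths and the `ref_id`/`id` join columns below are hypothetical, and the sketch assumes the two tables have otherwise distinct column names:

```python
# Sketch: retry-join of late-arriving reference data using Delta MERGE.
# All paths and column names are placeholders.
from delta.tables import DeltaTable

t1 = spark.read.format("delta").load("/staging/table1")   # primary rows awaiting a match
t2 = spark.read.format("delta").load("/staging/table2")   # reference data, may arrive late

# Rows that can be joined on this pass.
matched = t1.join(t2, t1["ref_id"] == t2["id"], "inner")

# Upsert the successfully joined rows into table 3.
target = DeltaTable.forPath(spark, "/gold/table3")
(target.alias("t")
 .merge(matched.alias("s"), "t.ref_id = s.ref_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Delete the now-joined rows from table 1; rows still lacking a match in
# table 2 remain and are retried automatically on the next pass.
DeltaTable.forPath(spark, "/staging/table1").alias("t").merge(
    matched.select("ref_id").distinct().alias("s"),
    "t.ref_id = s.ref_id"
).whenMatchedDelete().execute()
```

Because both writes are Delta MERGEs, the pass is idempotent with respect to re-runs of the same matched set, which helps if the scheduled job is retried.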

Thanks

Gourav Sengupta

Feb 11, 2021, 7:51:57 AM2/11/21
to Guru V, Delta Lake Users and Developers
Hi,
As usual, first a few questions to understand the problem before solving it :)
1. What Spark version are you using, and what is the environment?
2. Secondly, if you have to join multiple data sources, how do you manage latency? For example, what if you want to join CSV files from folder A to folder B and they arrive with a 5-minute lag?
3. Thirdly, does all your data landing in the different areas have the same schema? Does the schema ever change? What are the format and volume of these files?


Regards,
Gourav Sengupta
