I have two different proposed approaches:
Option 1:
1. Scan all ARC/WARC files and fetch all .JS files.
2. Parse the .JS files.
Option 2:
1. Collect the URLs of all .JS files along with the related info, including ARC/WARC file name, offset, length, etc. (This can be done offline at no cost.)
2. Partially fetch .JS files from ARC/WARC files.
3. Parse the .JS files.
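A rough sketch of Option 2 in Python, under two assumptions: each record is stored as its own gzip member inside the ARC/WARC file, and the index gives the byte offset and compressed length of that member. An in-memory blob stands in for the remote file, and build_archive, range_header, and fetch_record are hypothetical names used only for illustration.

```python
import gzip

def build_archive(records):
    """Concatenate records as independent gzip members; return (blob, index)."""
    blob = b""
    index = {}
    for url, body in records:
        member = gzip.compress(body)
        index[url] = (len(blob), len(member))  # (offset, compressed length)
        blob += member
    return blob, index

def range_header(offset, length):
    """HTTP Range value covering one record; the end byte is inclusive."""
    return f"bytes={offset}-{offset + length - 1}"

def fetch_record(blob, offset, length):
    """Stand-in for a ranged GET: slice out one member and decompress it."""
    return gzip.decompress(blob[offset:offset + length])

# Toy data standing in for one ARC/WARC file (assumption, not real crawl data).
records = [
    ("http://example.com/index.html", b"<html></html>"),
    ("http://example.com/app.js", b"console.log('hi');"),
    ("http://example.com/logo.png", b"\x89PNG"),
]
blob, index = build_archive(records)

# Step 1: pick the .JS URLs from the index. Step 2: fetch only those records.
js_urls = [u for u in index if u.lower().endswith(".js")]
for url in js_urls:
    offset, length = index[url]
    print(url, range_header(offset, length), fetch_record(blob, offset, length))
```

With a real file, the slice in fetch_record would be replaced by an HTTP request carrying the Range header, so only the bytes of the .JS records are transferred.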
Option 1 needs to read the whole ARC/WARC file, while Option 2 reads only a small part of it:
    Option 1                         Option 2
1.  Loading ARC/WARC file            Loading .JS files
2.  Scan all records one by one.     Parse .JS files.
3.  Parse .JS files.
I am not sure what percentage of the time in Option 1 is spent loading the whole ARC/WARC file and scanning the records one by one (although 97% of the records are just skipped).