You can use a tool like SiteSucker to spider a site and download all the files linked from the main access points. For example, you could find all of the pages that are linked from the home page. Keep in mind that some pages might not be linked but are still accessed, such as local API endpoints called from JavaScript, the robots.txt file, the favicon, or the site map.
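If you'd rather not use a dedicated spidering tool, the link-gathering part is easy to sketch yourself. Here's a minimal Python example (the `example.com` URL and sample HTML are made up for illustration) that collects every `href` and `src` target from a page, which you could then compare against the files on disk:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects href/src targets from one page's HTML."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page's own URL
                self.links.add(urljoin(self.base_url, value))

# Hypothetical page content; in practice you'd fetch each page
# and feed its HTML in, following new links as you find them.
html = '<a href="/about.html">About</a> <img src="logo.png">'
collector = LinkCollector("https://example.com/")
collector.feed(html)
# collector.links now holds the absolute URLs reachable from this page
```

To crawl a whole site you'd repeat this for each discovered page, keeping a set of already-visited URLs so you don't loop.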
If the site is live, you can also look at analytics or web logs to find out which pages have actually been accessed, though this is less useful before a site launches.
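Tallying accessed paths from a web log is a few lines of Python. This sketch assumes your server writes Common Log Format entries (the sample lines below are invented); adjust the regex for your server's actual log format:

```python
import re
from collections import Counter

# Captures the request path from a Common Log Format entry
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP')

# Made-up sample entries; in practice, read these from your access log
log_lines = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '203.0.113.5 - - [10/Oct/2023:13:55:40 +0000] "GET /css/site.css HTTP/1.1" 200 481',
    '198.51.100.7 - - [10/Oct/2023:14:01:02 +0000] "GET /index.html HTTP/1.1" 200 2326',
]

hits = Counter()
for line in log_lines:
    match = REQUEST_RE.search(line)
    if match:
        hits[match.group(1)] += 1
# hits now maps each requested path to how often it was served;
# paths on disk that never appear here are candidates for removal.
```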
Whatever strategy you use, version control is a good idea, since it makes it easy to restore any deleted files you later find you need.
[fletcher]