Re: Missing Files?


Owen Ambur

Jan 19, 2023, 9:54:38 AM
to Naval Sarda,
Yes, Naval, I figured I could just FTP all of the files to the query service again.  However, it seems like that might be a case of the tail wagging the dog, and I don't think I want to do that to capture the relatively few documents that seem to be missing, at least not yet.

How is the capability to import files from their URLs coming along?  If it includes the capability to apply sitemap listings to the import process, I might want to try that.  However, since Joe's cataloguer hasn't recently identified any files in the listing that are not at the URLs cited, I'm wondering if there might be a problem with it too.  While I'm pretty careful about checking them when I update the sitemap listing, mistakes do happen. 
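
For illustration only, here is a minimal Python sketch of how sitemap listings might feed a URL import step: it reads the `<loc>` entries from a sitemap in the standard sitemaps.org format. The function name `sitemap_urls` and the example.org URLs are hypothetical, and this is not a description of how the query service's import feature actually works.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol for <urlset> documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the <loc> URLs listed in a sitemap document, in document order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text  # skip empty <loc/> elements
    ]


# Hypothetical sample; a real sitemap would list the actual document URLs.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/docs/A.xml</loc></url>
  <url><loc>https://example.org/docs/B.xml</loc></url>
</urlset>"""

for url in sitemap_urls(sample):
    print(url)
```

Each URL returned could then be handed to whatever import routine the service provides.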

On Thursday, January 19, 2023 at 08:55:10 AM EST, Naval Sarda <> wrote:

Hi Owen,

We did not find a manual way to compare with Joe Carmel's StratML sitemap catalog.

You can upload all the files you have via FTP and it will process all of them. If a file is already there, it will be replaced.


On 19/01/23 3:56 am, Owen Ambur wrote:
Thanks, Naval, but I don't know what I might reasonably do with such a long listing of files that were imported into the query service.

What I need is a reasonably simple way to see which files were NOT indexed in the query service but DO appear in the catalogue derived from the sitemap listing by Joe Carmel's StratML cataloguer Perl script.

I don't necessarily want to spend a lot of time and effort on it, but I'm thinking part of the problem might have been associated with files with the same names that were in different subdirectories before being consolidated into the new docs directory.
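
A quick way to test the same-name hypothesis, sketched in Python under the assumption that a listing of the old paths (before consolidation) is still available: group the paths by file name and report any name that occurs in more than one subdirectory. The function name and the sample paths are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def basename_collisions(paths):
    """Map each file name to its paths, keeping only names seen more than once."""
    groups = defaultdict(list)
    for p in paths:
        groups[PurePosixPath(p).name].append(p)
    return {name: ps for name, ps in groups.items() if len(ps) > 1}


# Hypothetical pre-consolidation paths, for illustration only.
old_paths = [
    "part1/PLAN.xml",
    "part2/PLAN.xml",
    "other/REPORT.xml",
]

for name, ps in basename_collisions(old_paths).items():
    print(name, "appears in:", ", ".join(ps))
```

Any name reported here would have had only one surviving copy once everything was moved into a single docs directory.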

On Wednesday, January 18, 2023 at 03:39:10 PM EST, Naval Sarda <> wrote:

Please see below

-------- Forwarded Message --------
Subject: Re: Fwd: Missing Files?
Date: Wed, 18 Jan 2023 21:50:16 +0530
From: Sudarshana <>
To: Naval Sarda <>, Jitendra Shende <>,, Balasaheb Pandarkar <>


Herewith we are sending you the list of all files that are indexed. See attached.

On 1/18/2023 9:50 AM, Naval Sarda wrote:

See below

-------- Forwarded Message --------
Subject: Re: Missing Files?
Date: Tue, 17 Jan 2023 17:10:23 +0000 (UTC)
From: Owen Ambur <>
Reply-To: Owen Ambur <>
To: Naval Sarda <>

Thanks, Naval, but I'm having a hard time understanding why Joe Carmel's cataloguer, which runs off the sitemap listing: 

a) finds 36 more files than have been imported into the query service, and 
b) did not identify these 12 files as missing, if they were not at the URLs listed in the sitemap, as it has done for missing files in the past (even if they were not really missing but were missed due to network issues).

I'm also having a hard time finding these 12 files in my local archives.  However, these two entries in a previous version of the sitemap may provide a clue regarding CDIR_2:


There were both Part 1 & Part 2 versions of that plan.  The Part 2 version has been indexed in the query service but the Part 1 version apparently has not.

I don't think that would be the case for as many as 36 files and I haven't yet found the other 11 in my local archives.  However, I also did a bit of sleuthing in the Internet Archive and was able to discover these typos in the URLs:

M4GA stands for Mayors for Guaranteed Income and should be M4GI:  

I'm also led to believe that OSBP is DODOSBP and I was able to find USCC and LOC 2019

I FTP'ed those five files for indexing in the query service.  The other six are still a bit of a mystery.

I'm not going to worry too much about this and I don't know if I'll be able to make sense of a complete listing of >5.5K files in the query service in comparison to either my sitemap listing, Joe Carmel's catalog, or my hyperlinked listing.  However, I'll look forward to learning if there might be a way to reconcile the discrepancy without taking too much time or effort.

On Tuesday, January 17, 2023 at 08:05:38 AM EST, Naval Sarda <> wrote:

Hi Owen,

The following files were listed in sitemap.xml but were not at the locations pointed to by the sitemap when we scraped them in December.


We will share the entire list of files on the query server soon so that you can compare what is missing.


On 16/01/23 9:52 pm, Owen Ambur wrote:
Naval, that enabled me to identify three files saved on 12/8 that appear in my sitemap listing above the CCA.xml file and were apparently not included in the batch import into the query service:

However, that still leaves 36 unaccounted for.

If you can tell me the date that the files were downloaded from the site for transformation to conform to the latest version of the schema, I can determine which ones may have been created after that but before I started FTP'ing others into the query service.

On Saturday, January 14, 2023 at 10:11:38 PM EST, Naval Sarda <> wrote:

Hi Owen,

This was the topmost file in the sitemap that we downloaded last


On 15/01/23 8:00 am, Owen Ambur wrote:
Naval, I realized I could probably tell where the cut-off was for files that were imported in batch into the query service, based on the date they were all converted and copied to my stratml/docs folder, i.e., December 9.

I just re-ran Joe Carmel's cataloguer and see that there are now 31 more files in the collection (5,609) than there were when I last ran the cataloguer on December 10 (5,578). 

It appears that 5,570 files have now been indexed in the query service, meaning that 39 of them may be missing, but I'm not sure how to determine which ones they might be.

Any suggestions?

Thanks & Regards

Naval Sarda

Jan 19, 2023, 10:23:19 AM
to Owen Ambur,

Hi Owen,

The developer working on your project is occupied with a time-sensitive project. Once she is free, she will complete the URL import feature; most likely she will work on your project next week. We did not review Joe's entire cataloguer, though. It is hard to figure out the difference manually. It can be verified programmatically, but that involves engaging a programmer to do so.
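
As a rough idea of what such a programmatic check could look like, here is a minimal Python sketch. It assumes the catalogue can be exported as a list of URLs and the query service listing as a list of file names or paths, and that the two are matched on file name alone; the function names and sample data are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def file_name(url_or_path):
    """Return the last path segment of a URL or a plain path."""
    return PurePosixPath(urlparse(url_or_path).path).name


def missing_from_index(catalogue_urls, indexed_names):
    """Return catalogue URLs whose file name is absent from the indexed listing."""
    indexed = {file_name(n) for n in indexed_names}
    return sorted(u for u in catalogue_urls if file_name(u) not in indexed)


# Hypothetical data, for illustration only.
catalogue = [
    "https://example.org/docs/A.xml",
    "https://example.org/docs/B.xml",
    "https://example.org/docs/C.xml",
]
indexed = ["A.xml", "docs/C.xml"]

for url in missing_from_index(catalogue, indexed):
    print("not indexed:", url)
```

Matching on file name alone would hide the case where two documents shared a name in different subdirectories, so that case needs a separate check.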


Owen Ambur

Jan 19, 2023, 10:41:16 AM
to Naval Sarda,
Naval, I'll be away for the next couple weeks.  So I may not have much time to deal further with this until the second or third week in February anyway.

Moreover, since the number of documents in question is relatively small, I don't want to spend more time and attention on it than it warrants.  At this point, it is just a matter of curiosity.

However, it does highlight the fact that the URLs have always been a key issue for the project.  Since they could be inferred for all of the files FTP'ed from the folder, that was not an issue in the initial bulk upload but it will be for files indexed elsewhere on the Web -- which is the actual use case in the long run (because the entire existing collection is prototypical in nature).
