Re: Next Step(s)? - Stratml

Owen Ambur

Jul 10, 2023, 9:47:09 PM

to Naval Sarda, aboutthe...@googlegroups.com

Naval, with respect to the current search results listings, placing the hit listing count at the bottom of the page is fine. No need to change that.

However, if we enable saving of search results as static HTML pages, it might be better to place a count at the top of the page, similar to what Joe Carmel did in his StratML catalog.

On the other hand, if each of the entries in the listing is numbered, it would not be necessary to provide a separate count because viewers could scroll to the bottom to see the total. The difficulty with that option, of course, is that it requires sequential numbers, which don't appear in the current search results and would change each time a new query is run.

Regarding the need to assign a UID to each document in order to directly reference it in a static hypertext listing, the UID would need to remain the same each time a new listing is generated. Otherwise, direct references cited in the past would no longer work when a new HTML listing is posted at the same URL. Presumably, that could be done by associating a UID with each document when it is indexed but, again, the UID would need to be maintained even if the document is re-indexed from the same URL. In my current listing, I use acronyms as identifiers for each document, like this. Regardless of how many other documents I add to the listing, the link (direct reference) to that one remains the same.

(If the standardized StratML schema were to be updated, it might make sense to include an identifier for the documents as a whole, in addition to most of the elements within them. However, the lack of such an element shouldn't preclude including one in the search service database and assigning a persistent identifier to each document based upon its URL.)
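One way to picture a persistent identifier based on a document's URL is to hash a normalized form of the URL, so that re-indexing the same URL always yields the same UID. The following is only a minimal sketch under that assumption; the function names `normalize_url` and `document_uid` and the 12-character length are hypothetical and not part of the existing search service.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase the scheme and host and drop any fragment,
    so trivial variants of the same URL map to the same string."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


def document_uid(url: str, length: int = 12) -> str:
    """Return a short identifier derived only from the normalized URL.

    Hypothetical scheme: the same URL always produces the same UID,
    regardless of when or how often the document is indexed.
    """
    digest = hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()
    return digest[:length]


if __name__ == "__main__":
    # example.gov is a placeholder host used only for illustration.
    print(document_uid("https://example.gov/plans/strategic-plan.xml"))
```

A UID derived this way would change if a document moved to a different URL, so a stored lookup table (URL to UID) might still be needed for that case.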

I'm glad to see that you believe search results listings can readily be saved as static HTML documents by clicking a button on the site. I'll look forward to learning how much that feature might cost and seeing if the results make me feel comfortable no longer maintaining my listing manually. Since Joe Carmel's Perl script no longer seems to work, his catalog doesn't seem to be an alternative anymore.

Your response does not appear to address prospects for adding another query field for the <Source> element, to enable selective discovery of files based upon their domains (e.g., .gov) and subdomains (e.g., fda.gov). Such capability would improve upon Joe Carmel's catalog and enable me to save selective HTML listings for .gov agencies. What would it cost to add such a query field? Might it significantly degrade query performance?

Presumably, a plain textual query of that element would be sufficient to parse out domains and subdomains, and there may be other patterns in the URLs that might be useful to query as well (although I don't know what they might be). On the other hand, since domains and subdomains follow regular patterns within source URLs, I wonder if there might be a more efficient way to query them than indexing the full URLs.
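To illustrate why matching on the host part of the source URL could be more precise than a full-text query for ".gov", here is a minimal sketch using Python's standard `urllib.parse`. The function names `host_of` and `in_domain` are hypothetical, and the sketch assumes the `<Source>` element holds a complete URL; it is not a description of how the search service works today.

```python
from urllib.parse import urlsplit


def host_of(source_url: str) -> str:
    """Return the lowercased host name of a source URL, or '' if there is none."""
    return (urlsplit(source_url).hostname or "").lower()


def in_domain(source_url: str, domain: str) -> bool:
    """True if the URL's host is the given domain or one of its subdomains.

    Matching is done on whole host labels, so 'fda.gov' matches
    'www.fda.gov' but not 'notfda.gov', and '.gov' appearing only
    in the path of a URL does not count.
    """
    host = host_of(source_url)
    domain = domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


if __name__ == "__main__":
    # Placeholder URLs for illustration only.
    print(in_domain("https://www.fda.gov/media/plan.xml", ".gov"))
    print(in_domain("https://example.com/docs/about.gov.html", ".gov"))
```

If the host were extracted once at indexing time and stored in its own field, a domain query would not need to scan the full URLs.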

Owen Ambur

https://www.linkedin.com/in/owenambur/

On Monday, July 10, 2023 at 11:49:46 AM EDT, Naval Sarda <nsa...@epicommonline.com> wrote:

Please see below

On 7/6/2023 9:43 AM, Naval Sarda wrote:

See below

-------- Forwarded Message --------

Subject: Next Step(s)?

Date: Thu, 6 Jul 2023 02:56:46 +0000 (UTC)

From: Owen Ambur <owen....@verizon.net>

Reply-To: Owen Ambur <owen....@verizon.net>

To: Naval Sarda <nsa...@epicomm.net>

Naval, I wired the payment.

As for what we might do next, a full-text query for periods (.) now turns up 5,714 hits, which presumably is the entire collection. Now that we've added the plan/report name to the search results list, I'm thinking that enabling the entire listing to be saved as an HTML document might be sufficient to relieve me from maintaining my listing at https://stratml.us/drybridge/index.htm

EPI: Please check the attached screenshot. Do you want to move the total count to the top?

In order for that to occur, I think I'd want each of the entries to be numbered (currently from 1 to 5,714) or, at least, I'd want the count to be provided, perhaps at the top of the page. Be that as it may, I definitely would want to be able to refer others directly to each entry in the listing, as I frequently do with my current listing. Over time the sequential numbers in the listing would change as documents are added, but the direct referencing URL extensions for each entry should remain the same so that they continue to work. Perhaps that might require creating a UID for each document/entry. (In my listing, I currently use acronyms but often have to add numbers to them or come up with other acronyms when they are repeated.)

EPI: Currently we are not using any UID; instead, we are using the URL for indexing. We can assign a UID to each entry in the XML file by adding a tag like <UID>aaddffghgh</UID>. This will be unique for each file. We will need some clarification on how to handle the same file being uploaded again to overwrite it, and on duplicate entries.

Please let me know what you think about that, and if it makes sense to you, what it might cost.

Thinking out loud, it might be good if any query could be saved as an HTML document by the user. Presumably, if the entire listing could be saved so could any subset produced by any query.

For example, I'd like to be able to generate selective listings of documents in the .gov domain, including subdomains. However, I believe that would require adding another search field for the <Source> element. Currently, a full-text query for ".gov" turns up 2,308 hits but not all of them are actually .gov documents. (A full-text query for subdomains, e.g., fda.gov, works better.)

Please let me know what adding that feature would cost.

EPI: We can provide an export button that generates an HTML file listing the search results. That HTML file will contain the same data that is shown on screen.

It would also be good to be able to limit search results to non-profit, tax-favored orgs. However, that isn't possible since they are not consistently identified as such in the documents. For example, a full-text query for "501(c)(3)" turns up only 169 hits and there are many more that are not explicitly identified as such.

If we do decide to enable saving of HTML listings, it might be good to:

a) include alphabetical links at the top of the page to enable users to jump to the section they'd like to see without having to scroll to get there, and/or

b) enable the listings to be sorted by date by clicking on that column heading.

The latter feature would also be nice to have in the dynamic search results listings but I don't know how feasible that might be.

EPI: We can provide this in the web application: clicking a column name will sort the list. The sorted list can then be exported as an HTML file.
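To make the export idea discussed in this thread concrete, here is a minimal sketch of a static HTML listing with the count at the top, numbered entries, and a stable anchor for each entry so that direct references keep working as documents are added. The function name `render_listing` and the entry keys `uid`, `name`, `url`, and `date` are assumptions for illustration, not the actual data model of the search service. The alphabetical jump links are left out for brevity.

```python
from html import escape


def render_listing(entries, sort_key="name"):
    """Render search results as a static HTML page.

    entries: list of dicts with the hypothetical keys 'uid', 'name',
    'url' and 'date' (ISO format, e.g. '2023-07-10', so that plain
    string sorting also sorts chronologically).
    """
    rows = sorted(entries, key=lambda e: e[sort_key])
    lines = [
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8"><title>StratML listing</title></head><body>',
        f"<p>Total: {len(rows)} documents</p>",  # count at the top of the page
        "<ol>",  # an ordered list numbers the entries automatically
    ]
    for e in rows:
        uid = escape(e["uid"])
        url = escape(e["url"])
        name = escape(e["name"])
        date = escape(e["date"])
        # The id attribute lets others link to listing.html#<uid> directly.
        lines.append(f'<li id="{uid}"><a href="{url}">{name}</a> ({date})</li>')
    lines += ["</ol>", "</body></html>"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder entries for illustration only.
    sample = [
        {"uid": "b2", "name": "Beta Plan", "url": "https://example.org/b.xml", "date": "2021-05-01"},
        {"uid": "a1", "name": "Alpha Plan", "url": "https://example.org/a.xml", "date": "2023-01-15"},
    ]
    print(render_listing(sample, sort_key="date"))
```

Because the anchors come from the UIDs rather than from the sequence numbers, the numbering can change from one export to the next without breaking previously shared links.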