On May 3, 7:19 pm, Jayesh Salvi <
jayeshsa...@gmail.com> wrote:
(I switched the order of these blocks so the reply flows better)
> The rest of the files in cache are actual resources (html or image
> files), their size is not under RS's control. They could be bunched
> together into one single file, but then accessing and deleting
> individual resources from that single file would be expensive in terms
> of cpu and i/o time (imagine it to be a file system inside a
> file). Using the existing file system as a cache is advantageous in
> that case.
The existing behavior of the content cache for news item text/html
makes perfect sense. It is the perfect storage medium for that type of
data: html FILES, in the FILEsystem :) I would also guess that the
"size discrepancy" discussed above would be minimized when storing this
type of data (arbitrary-length text). Attempting to store it in a
secondary file system would certainly lead to increased I/O time.
> The .link files are meant to have temporary existence. They contain the
> URLs of the actual content. To speed up downloading of the news feed,
> RS doesn't fetch the embedded images immediately but stores a pointer to
> them in .link files. A background thread then fetches these resources
> and deletes the .link files OR if you happen to access that news item
> before the background thread reaches it, it will be cached then. In both
> cases the .link file will be deleted.
(Bear in mind that I am making some assumptions here about how your
software is set up)
The .link files on the other hand, do not seem to be the natural
storage choice for this type of information. To paraphrase, these
files contain a collection of (string) urls or 'references' to other
resources (pictures, css probably, etc) associated with a particular
news item. These target files/items may be found in the cache
directory, or may have to be fetched from the web.
This information is clearly 'structured data', strongly associated
with entries in your item database; as such, it should live IN the
database. Just as you store a reference to the html cache in the db
(you must, right?), you should also store references to the _other_
items required to display that news item. This could be implemented as
a one-to-many relationship from your "item" table to a "resources"
table, where each row holds a link to the parent news item, the
original location (url), and the cached location.
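
To make that concrete, here is a rough sketch in Python with sqlite3. I
don't know what language or database RS actually uses, so every table,
column, and function name below is made up for illustration; treat it
as a picture of the one-to-many idea, not as your schema.

```python
import sqlite3

# Hypothetical schema: all table and column names are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS item (
    id        INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    html_path TEXT              -- cached html file, NULL if not cached
);
CREATE TABLE IF NOT EXISTS resource (
    id          INTEGER PRIMARY KEY,
    item_id     INTEGER NOT NULL REFERENCES item(id) ON DELETE CASCADE,
    url         TEXT NOT NULL,  -- original location
    cached_path TEXT            -- NULL until the resource is fetched
);
CREATE INDEX IF NOT EXISTS resource_item ON resource(item_id);
"""

def open_db(path=":memory:"):
    """Open the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    # SQLite does not enforce foreign keys unless asked to, per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn

def add_item(conn, title, urls):
    """Insert one news item and its resource URLs in a single transaction."""
    with conn:
        cur = conn.execute("INSERT INTO item (title) VALUES (?)", (title,))
        item_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO resource (item_id, url) VALUES (?, ?)",
            [(item_id, u) for u in urls],
        )
    return item_id
```

A row with a NULL cached_path plays the role a .link file plays today:
it says "this url still needs to be fetched".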
Structuring this data in the database should lead to LESS I/O and
faster response all around, since you would make far fewer trips to
disk. You should have cleaner and speedier code in three places:
filling the cache (the background thread just SELECTs items to
process); purging items and/or the cache (links would always stay
synchronized with items through cascading or transactional DELETEs);
and viewing items (no need to parse .link files to load on demand).
You would also mitigate the size discrepancy problem and get a smaller
and more efficient cache!
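
Roughly what I have in mind for the fill and purge paths, again as a
Python/sqlite3 sketch. The "item" and "resource" tables and all
function names are hypothetical, and the actual downloading and file
deletion are left out.

```python
import sqlite3

def make_db():
    """Build a throwaway in-memory database with the assumed two tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # needed for ON DELETE CASCADE
    conn.executescript("""
        CREATE TABLE item (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE resource (
            id          INTEGER PRIMARY KEY,
            item_id     INTEGER NOT NULL REFERENCES item(id) ON DELETE CASCADE,
            url         TEXT NOT NULL,
            cached_path TEXT
        );
    """)
    return conn

def pending_resources(conn, limit=10):
    """Filling the cache: the background thread asks for unfetched rows."""
    return conn.execute(
        "SELECT id, url FROM resource WHERE cached_path IS NULL "
        "ORDER BY id LIMIT ?", (limit,)
    ).fetchall()

def mark_cached(conn, resource_id, path):
    """Record where a fetched resource was stored (replaces deleting a .link file)."""
    with conn:
        conn.execute(
            "UPDATE resource SET cached_path = ? WHERE id = ?",
            (path, resource_id),
        )

def purge_item(conn, item_id):
    """Purging: delete the item; its resource rows go with it via the cascade.

    Returns the cached file paths so the caller can remove them from disk.
    """
    with conn:
        paths = [p for (p,) in conn.execute(
            "SELECT cached_path FROM resource "
            "WHERE item_id = ? AND cached_path IS NOT NULL", (item_id,)
        )]
        conn.execute("DELETE FROM item WHERE id = ?", (item_id,))
    return paths
```

The point is that no step has to open, parse, or delete a small file
just to find out what work is left; each is a single query.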
> Therefore, if due to storage issues the news item cache file gets lost
> the only way to automatically reload the content is accessing the entire
> feed again (which is min 20 items at a time). So instead of doing this
> automatically for a remote case of storage accidents, it is left to
> manual choice to refresh the feed and fetch the contents again.
Wow, that's a bummer. Too bad there is no API to retrieve a single
item. If there is no simple way to retrieve specific items (or small
sets of items that will definitely contain the target), there's not
much you can do except to display an error message (or visual cue)
indicating that the item could not be found. How about "greying out"
the content area, or adding a "cache item not found" image/message?
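
Something along these lines is what I mean by a fallback, sketched in
Python. The function name, the placeholder markup, and the idea that
the viewer loads the item from a file path are all my assumptions.

```python
from pathlib import Path

# Hypothetical placeholder shown when the cached html file is gone.
PLACEHOLDER_HTML = (
    '<div style="color: #888">Cached content for this item could not be '
    "found. Refresh the feed to fetch it again.</div>"
)

def load_item_html(cache_path):
    """Return (html, found); fall back to the placeholder if the file is missing."""
    try:
        return Path(cache_path).read_text(encoding="utf-8"), True
    except FileNotFoundError:
        return PLACEHOLDER_HTML, False
```

The found flag lets the UI grey out the content area without having to
inspect the html it was handed.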