precache URL

Léo Poitout

Dec 30, 2020, 10:38:08 AM
to mod-pagespeed-discuss
Hello,

Is it possible to add a list of URLs to precache?

When a page is loaded for the first time, it is not yet cached. These are SEO-relevant pages, and I would like them to be fast on the first request.

Thank you so much,
Leo

Otto van der Schaaf

Dec 30, 2020, 10:47:50 AM
to mod-pagesp...@googlegroups.com
There's nothing off the shelf, but a quick Google search on "headless chrome crawler" shows https://github.com/yujiosaka/headless-chrome-crawler. Maybe try running that periodically and see what happens? There's one gotcha: depending on configuration, this might have to be done for different user agents, but using Chrome might cover a decent percentage of traffic.

Joshua Marantz

Dec 30, 2020, 11:14:35 AM
to mod-pagespeed-discuss
Agreed. Periodic crawling is the best way to keep the cache fresh. It would be ideal to crawl using the user-agent and Accept headers for the browser/bot you want to optimize for. 

Longinos

Jan 2, 2021, 6:13:50 AM
to mod-pagespeed-discuss
Yes, using a headless browser is the way. With headless Chrome you can set the UA string with the --user-agent parameter.
For RedHat/CentOS distros there is a chromium-headless rpm package in the EPEL repo. I don't know whether there is a corresponding deb package for Debian/Ubuntu.
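For example, a single warm-up request could look like this (the binary name varies by distro: chromium-browser, chromium, or the headless_shell shipped by the EPEL package; the UA string and URL here are placeholders):

    chromium-browser --headless --disable-gpu --dump-dom \
      --user-agent="Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) ..." \
      "https://your.domain.com/some-page" > /dev/null

--dump-dom makes the browser fully load and render the page, JavaScript included, before exiting, which is what makes the request useful as a cache warmer.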

WebDesires

Jan 2, 2021, 10:21:13 PM
to mod-pagespeed-discuss
This would not work, since from my understanding mod_pagespeed does more than just detect the User-Agent: it also does feature detection with JavaScript, and detects screen resolution and dimensions, in order to perform all its various optimizations.

So you would have to do more than just set the User-Agent string, and there is no guarantee that something else it does around detecting device characteristics wouldn't throw you off anyway.

Léo Poitout

Jan 3, 2021, 3:05:35 AM
to mod-pagespeed-discuss
Thank you for your different ideas. I don't really want to add another application to my stack... I want to keep things simple and easy.

The traffic on my website is enough to generate the cache within a few hours.
I think I just need to change the cache-cleaning time.
I have increased FileCacheSizeKb and FileCacheInodeLimit, but the cache in the var folder now holds only about 3h of content instead of 1h, and the "Cache misses" graph stays around 50%... if I understand correctly, 50% of requests are not served from cache...?

I have disabled cache cleaning entirely (ModPagespeedFileCacheCleanIntervalMs -1) and now I'm watching and waiting. I think the images folder is big, and that's why the cache size limit fills up so quickly...

Which parameters are you using?

Thank you all!

Joshua Marantz

Jan 3, 2021, 11:06:00 AM
to mod-pagespeed-discuss
RE "not work due to JS": the suggestion was not just to curl the requests but to run a JS-enabled headless browser, so JS-triggered features will work. FWIW MPS mostly does *not* do feature detection with JS; instead it uses the Accept header and, in some cases, User-Agent. However it *does* use JS to compute rendered properties of the page, such as which images are above the fold and which CSS is critical, so crawling with a JS-enabled client is needed.

RE "adding application to the stack": you don't need to run the crawler on your servers. You can (and probably should) run it on different machines. It is up to you whether it's worth the trouble though, because it's clearly more work than *not* running a crawler :)

RE "50% cache hit rate": this is entirely reasonable and it depends entirely on the entropy of your URLs and which filters you have enabled. Some filters cache properties of your HTML pages and if you have URL-params with user IDs or timestamps other entropy in them, your cache hit rate will be low. That's fine; you just may not get the best optimization (e.g. inlining above-the-fold images and lazy-loading below-the-fold ones) in those scenarios, for first-time visitors.

RE "tweaks to file-cache": I am very skeptical of the file-cache configuration you propose. If you don't allow MPS to clean the file cache, it will run until it fills your disk and ultimately that will not make your systems very happy. And it won't help what you are trying to do, because cache-cleaning and cache-expiration are two different things. If you want cache expiration to be longer, that's something you can control with a combination of origin resource cache-control settings and, in some cases, configurations you can do in mod_pagespeed like https://www.modpagespeed.com/doc/system#implicit_cache_ttlhttps://www.modpagespeed.com/doc/system#load_from_file_cache_ttl , There may be a few more of those type of settings lurking around the docs.

WebDesires

Jan 3, 2021, 8:04:20 PM
to mod-pagespeed-discuss
"jmarantz
RE "not work due to JS": the suggestion was not just to curl the requests but to run a JS-enabled headless browser, so JS-triggered features will work. FWIW MPS mostly does *not* do feature detection with JS; instead it uses the Accept header and, in some cases, User-Agent. However it *does* use JS to compute rendered properties of the page, such as which images are above the fold and which CSS is critical, so crawling with a JS-enabled client is needed."

This is exactly what I was on about... and it SHOULD be using some form of feature detection for things such as .webp.
But the rendered-properties detection is what I was mostly referring to: it means getting an automated cache creator right would be either difficult or a waste of time. There is no guarantee, and no way to know, that you are creating perfectly optimized caches for what you need.

Joshua Marantz

Jan 3, 2021, 8:43:10 PM
to mod-pagespeed-discuss
Feature detection is cool and correct, but then you can't deliver any image data (even inlined) until after the browser parses and executes JS, which would not be a good user experience. The Accept header works well, though, is reasonably friendly to caches, and enables you to deliver inlined optimized images in response to the initial HTML request. I don't think that's possible with JS feature detection.
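You can observe the Accept-based negotiation from the command line (placeholder URL; whether the webp variant is served in place depends on which filters are enabled):

    # Chrome-style Accept header: eligible for a webp response
    curl -sI -H "Accept: image/webp,image/apng,image/*,*/*;q=0.8" https://example.com/photo.jpg
    # No image/webp in Accept: the original format is kept
    curl -sI https://example.com/photo.jpg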

I'm not sure what you mean by "automatic cache creator". MPS would still manage its own cache. I think in the crawler you'd have to guess the window size. E.g. is an image positioned 1000 pixels from the top above-the-fold or not? Depends on the size of the window. But I think getting that wrong is not catastrophic and you'd improve the experience for most users.



Longinos

Jan 4, 2021, 7:24:22 AM
to mod-pagespeed-discuss
Here is what we do:

We use headless Chromium with JavaScript enabled. With a tool like Google Analytics, choose the 2-3 UAs that cover ~75% of requests and the 2 most-used screen sizes (headless Chromium has parameters for width and height). Then, for each URL, we make 4-6 requests covering the UA + screen-size combinations, using the sitemap to walk the site, and we run the script twice a day. A sketch of such a loop is below.

This way, by the time most users come to the site, the PageSpeed module has already done its work for them; for some it has not.
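A rough sketch of that loop, assuming a flat sitemap of <loc> entries and GNU grep (this is an illustration, not the actual warm.sh attached later in this thread; the UA strings and sizes are placeholders to be filled in from your analytics):

    #!/bin/bash
    # Hypothetical cache warmer: one JS-enabled render per URL per
    # UA + window-size combination, output discarded.
    domain="$1"
    # Pull the URLs out of the sitemap.
    urls=$(curl -s "https://${domain}/sitemap.xml" | grep -oP '(?<=<loc>)[^<]+')
    uas=("Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."                  # placeholder desktop UA
         "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) ...")    # placeholder mobile UA
    sizes=("1366,768" "375,812")
    for url in $urls; do
      for ua in "${uas[@]}"; do
        for size in "${sizes[@]}"; do
          chromium-browser --headless --disable-gpu --dump-dom \
            --user-agent="$ua" --window-size="$size" "$url" >/dev/null 2>&1
        done
      done
    done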

Alexander Gran

Jan 4, 2021, 8:30:55 AM
to mod-pagespeed-discuss, Longinos
Hey

Sounds like exactly what I'm after as well.
Did you already start a script? I'd like to join forces!

Regards
Alex

Longinos

Jan 4, 2021, 9:01:38 AM
to mod-pagespeed-discuss
Try something like the attached file.
Make a cron job, something like:

* * * * *  user-to-run-script /path-to-file/warm.sh your.domain.com >/dev/null 2>&1

Change the * fields to the schedule you want the script to run on. (An /etc/cron.d entry takes five time fields, then the user, then the command.)
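For example, to run it twice a day, at 05:00 and 17:00, from /etc/cron.d (user, times, and paths are illustrative):

    0 5,17 * * *  root  /path-to-file/warm.sh your.domain.com >/dev/null 2>&1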

My site uses a "main" sitemap_index.xml, in this url, we have others xml sitemap.
and also is a Centos, so we have chromium-headless rpm package in the epel repo.
In the script we use https, change it to whatever you use.

We have added Googlebot also Ligthhouse UA, so we don´t need to do some hits in a url before test it in PageSpeed Insigth.
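For reference, the well-known UA markers look roughly like this; verify the current values before matching on them, since they change over time:

    # Googlebot (desktop)
    Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    # Lighthouse uses a normal Chrome UA with an extra token appended
    ... Chrome/<version> ... Chrome-Lighthouse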
(Attachment: warm.sh)

Léo Poitout

Jan 8, 2021, 9:47:50 AM
to mod-pagespeed-discuss
@jmarantz
I have set the cache size so that it cannot fill the disk; that is safer.
ModPagespeedFileCacheSizeKb 100000000 (100 GB)

I still don't understand how long an image (webp) is kept after its creation. When I look in the cache, a new one is sometimes created with no changes on the site.

I would like the webp images to be created in the cache once and kept forever (if nothing changes).

Do I have to change the duration of the page cache?
By default I see this in the files:
Date: Fri, 08 Jan 2021 14:33:24 GMT
Expires: Fri, 08 Jan 2021 14:38:24 GMT
Cache-Control: max-age=300

I'm using CoreFilters, minus convert_png_to_jpeg, plus lazyload_images and collapse_whitespace.
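In Apache-config terms that presumably corresponds to something like this (a sketch using the standard directives, not Léo's actual config):

    ModPagespeedRewriteLevel CoreFilters
    ModPagespeedDisableFilters convert_png_to_jpeg
    ModPagespeedEnableFilters lazyload_images,collapse_whitespace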

@Longinos
I have tested the script and it's perfect! Thanks.
I used chromium-browser on Ubuntu.

Thank you all,
Leo

Joshua Marantz

Jan 8, 2021, 11:16:35 AM
to mod-pagespeed-discuss
max-age=300 means that it drops out of cache every 5 minutes, potentially. PageSpeed tries to keep things fresh, by proactively refreshing almost-expired resources if it gets frequent requests. For example, if a request is made after 4 minutes, PageSpeed will see the source image is almost expired and will refresh it. If the image contents haven't changed, it can refresh the metadata cache so the optimized image doesn't need to be recomputed.


The problem is that if you don't get frequent enough requests for the image and it becomes stale, mod_pagespeed will drop its cached data and re-fetch and re-optimize the image.

To solve this problem, you could try changing your origin cache TTL from 5 minutes to something longer, say an hour (max-age=3600) or a day.
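One way to do that in Apache, assuming mod_headers is enabled (the match pattern and TTL are illustrative):

    <FilesMatch "\.(jpe?g|png|gif|webp)$">
      Header set Cache-Control "max-age=86400"
    </FilesMatch>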

