"noindex" pages are still being found in CrUX APIs

Ari F

Apr 10, 2023, 12:40:35 PM
to Chrome UX Report (Discussions)
Hi all,
The following two pages are "noindex"ed and have been for many months. For some reason, though, they still sporadically surface when querying the CrUX APIs. We want to make sure these pages aren't contributing to Web Vitals in Google Search Console. According to the CrUX Methodology, these pages should not be surfacing in the CrUX Report. If you query the CrUX History API, you can see that sometimes there's no data and sometimes there is.


Example request:
Screenshot 2023-04-10 at 12.37.20 PM.png
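
The screenshot itself is not reproduced here. As a rough sketch only, a request to the CrUX History API for one of these pages might be built as below. The helper names and the "YOUR_API_KEY" placeholder are illustrative assumptions, not taken from the original request.

```python
import json
import urllib.request

# Public CrUX History API endpoint.
HISTORY_ENDPOINT = (
    "https://chromeuxreport.googleapis.com/v1/records:queryHistoryRecord"
)


def build_history_request(page_url, api_key, metrics=("cumulative_layout_shift",)):
    """Build (but do not send) a POST request for URL-level history data."""
    body = json.dumps({"url": page_url, "metrics": list(metrics)}).encode("utf-8")
    return urllib.request.Request(
        f"{HISTORY_ENDPOINT}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query_history(page_url, api_key):
    """Send the request and decode the JSON response.

    urlopen raises urllib.error.HTTPError on a non-2xx status, so callers
    should be prepared to handle that case.
    """
    with urllib.request.urlopen(build_history_request(page_url, api_key)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # "YOUR_API_KEY" is a placeholder; a real key is needed to run this.
    print(query_history("https://hidrb.com/start", "YOUR_API_KEY"))
```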

Let me know if you have an idea of what could be going on.
Thanks!

Johannes Henkel

Apr 10, 2023, 1:08:02 PM
to Ari F, Chrome UX Report (Discussions)
https://developers.google.com/search/docs/crawling-indexing/block-indexing
Please take a look at the red box starting with "Important!" on that page.
It looks like https://hidrb.com/robots.txt blocks crawlers from https://hidrb.com/start, so even if that page is marked noindex, the crawler just has no way to notice the noindex.

It could be that https://hidrb.com/covid/start was blocked in robots.txt earlier as well; I don't have a good way to check this quickly. PSI says that it's blocked from indexing (https://pagespeed.web.dev/analysis/https-hidrb-com-covid-start/0ea0kc22ol?form_factor=mobile), so if it was recently unblocked from crawling in robots.txt, it should drop from the regular CrUX once the most recent 28 days are processed and serving.

If it's not working after that please feel free to say hi again and we'll do some more debugging on our side. Good wishes!

--
You received this message because you are subscribed to the Google Groups "Chrome UX Report (Discussions)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to chrome-ux-repo...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/chrome-ux-report/42080ca4-cda3-4802-bce2-e90e52731913n%40chromium.org.


--
johannes :-)

Ari F

Apr 10, 2023, 2:16:51 PM
to Chrome UX Report (Discussions), joha...@google.com, Chrome UX Report (Discussions), Ari F
Ah I see. Though this is surprising and to some extent contradicts what we're seeing. 

At least for https://hidrb.com/start, the week we added it to robots.txt (deployed on 3/6) is the same week CrUX stopped reporting on that page.

According to the CrUX History API, this is the "lastDate" of the last collection period that reported a non-null p75 CLS for this page:

{
  "firstDate": {
    "year": 2023,
    "month": 2,
    "day": 5
  },
  "lastDate": {
    "year": 2023,
    "month": 3,
    "day": 4
  }
},

That week lines up exactly with when we added it to robots.txt. Now I'm concerned that removing this page from robots.txt will cause CrUX to start reporting on it again. 
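
For reference, here is a minimal sketch of how the last collection period with a non-null p75 could be picked out of a History API response. It assumes the response carries parallel "collectionPeriods" and "percentilesTimeseries.p75s" arrays, with null for periods that have no data. The function name and the sample values are made up for illustration; only the 2023-02-05 to 2023-03-04 period comes from the excerpt above.

```python
def last_period_with_data(response, metric="cumulative_layout_shift"):
    """Return the most recent collection period whose p75 is not null."""
    record = response["record"]
    p75s = record["metrics"][metric]["percentilesTimeseries"]["p75s"]
    periods = record["collectionPeriods"]
    # Walk both arrays from newest to oldest and stop at the first real value.
    for period, p75 in zip(reversed(periods), reversed(p75s)):
        if p75 is not None:
            return period
    return None


# Illustrative response: the p75 values and the second period are made up.
sample = {
    "record": {
        "metrics": {
            "cumulative_layout_shift": {
                "percentilesTimeseries": {"p75s": ["0.10", None]}
            }
        },
        "collectionPeriods": [
            {
                "firstDate": {"year": 2023, "month": 2, "day": 5},
                "lastDate": {"year": 2023, "month": 3, "day": 4},
            },
            {
                "firstDate": {"year": 2023, "month": 2, "day": 12},
                "lastDate": {"year": 2023, "month": 3, "day": 11},
            },
        ],
    }
}

if __name__ == "__main__":
    print(last_period_with_data(sample)["lastDate"])
```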

Johannes Henkel

Apr 11, 2023, 5:19:34 PM
to Ari F, Chrome UX Report (Discussions)
Yes, unfortunately I think this concern makes sense.

The difficulty with getting it right / working as intended is that there is state in the system. The aggregations published in the CrUX report include 28 days of data, and at URL granularity what matters is whether or not the page was marked 'noindex' when a particular measurement was made.
So it's conceivable that even though the robots.txt edit happened in some week, the last data point that was published also just happened then, depending on which noindex info was accessible to the crawler before, whether there was enough data, etc.
I would think that at the moment, since it's excluded from the crawl, that URL is no longer treated as noindex, so it may just take a while until there are enough metric samples that aren't marked noindex, and then the URL may show up in CrUX again. :-( Sorry!

The correct way is to make sure that the page is identified as noindex with one of the usual mechanisms (meta tag, header, etc.) *and* that it can be crawled, that is, that robots.txt allows it (https://developers.google.com/search/docs/crawling-indexing/block-indexing).
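
As a rough illustration of that two-part condition, the sketch below checks both halves offline: that robots.txt allows the URL, and that the page carries a noindex signal in either the X-Robots-Tag header or a robots meta tag. The function names, the sample robots.txt, and the simplified meta-tag regex are all assumptions for this example; this is not how Google's crawler actually evaluates pages.

```python
import re
from urllib.robotparser import RobotFileParser

# Simplified pattern: expects the name attribute before content.
META_NOINDEX = re.compile(
    r'<meta[^>]+name=["\'](?:robots|googlebot)["\'][^>]*'
    r'content=["\'][^"\']*noindex',
    re.IGNORECASE,
)


def is_crawlable(robots_txt, url, user_agent="Googlebot"):
    """True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def has_noindex(headers, html):
    """True if a noindex signal appears in the header or a robots meta tag."""
    header_value = headers.get("X-Robots-Tag", "")
    return "noindex" in header_value.lower() or bool(META_NOINDEX.search(html))


def noindex_is_visible(robots_txt, url, headers, html):
    """A crawler can only notice noindex on a page it is allowed to fetch."""
    return is_crawlable(robots_txt, url) and has_noindex(headers, html)


if __name__ == "__main__":
    # Hypothetical robots.txt; the real file's contents are not shown here.
    robots = "User-agent: *\nDisallow: /start\n"
    page = '<meta name="robots" content="noindex">'
    print(noindex_is_visible(robots, "https://hidrb.com/start", {}, page))
    print(noindex_is_visible("", "https://hidrb.com/start", {}, page))
```

With the Disallow rule in place the first check reports that the noindex cannot be seen, even though the page declares it; with an empty robots.txt the same page passes.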

--
johannes :-)

Ari F

Apr 11, 2023, 5:33:10 PM
to Chrome UX Report (Discussions), joha...@google.com, Chrome UX Report (Discussions), Ari F
Got it. We just deployed the change today to remove it from robots.txt. I'll monitor over the next 28 days. 
Thanks for all of your feedback!
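
For the monitoring window, the 28-day arithmetic can be sketched as below. It assumes each collection period covers 28 days inclusive (as in the 2023-02-05 to 2023-03-04 period quoted earlier) and, optimistically, that the noindex becomes visible to the crawler on the day of the change; in practice that depends on when the page is recrawled and on processing lag. The function names are illustrative.

```python
from datetime import date, timedelta

WINDOW_DAYS = 28  # length of one aggregation window, inclusive of both ends


def window_start(last_date):
    """First day of the 28-day window that ends on last_date."""
    return last_date - timedelta(days=WINDOW_DAYS - 1)


def first_clean_last_date(change_date):
    """Earliest window end whose whole window falls on or after change_date."""
    return change_date + timedelta(days=WINDOW_DAYS - 1)


if __name__ == "__main__":
    print(window_start(date(2023, 3, 4)))
    print(first_clean_last_date(date(2023, 4, 11)))
```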
