Some of our subpages are not being indexed; Google Search Console reports that they are excluded by a noindex tag.
However, we only had this tag in the <head> of the associated CMS collection pages initially, and we removed it a few months ago.
Resubmitting the pages in Search Console does not help. When I try to validate the fix, the process fails every time and only a few irrelevant pages are checked.
The sitemap status in the Search Console is “success”.
Attached are two read-only links for some of the pages that supposedly have a noindex tag.
Why are the pages not being crawled despite the canonical URL being set, the noindex tag having been removed, and the pages being submitted repeatedly in Search Console? And why does the validation in Search Console fail every time?
I looked at the first one and it looks fine (no noindex on that published page).
If you’re looking at the GSC reports, you’ll see the validation option to tell Google you’ve fixed a specific issue, and it will recheck. Also make certain you’re looking at the far right side of the graph, which represents the current state; the report shows the history of each issue, and that confuses a lot of people.
It sounds like the pages are still being treated as noindex despite the removal of the tag. I’d suggest checking for any lingering noindex directives, either in other meta tags or sent as an X-Robots-Tag HTTP header, and making sure robots.txt isn’t blocking the URLs (if Googlebot can’t crawl a page, it can’t see that the noindex has been removed). Also, try fetching the pages directly via the URL Inspection tool in Google Search Console to see whether any issues are highlighted. If the Search Console validation keeps failing, it could indicate a deeper issue with the indexing or the CMS setup.
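If you want to check this outside of GSC, here is a minimal sketch (Python 3, standard library only; the URL is a placeholder, not one of your actual pages) that prints the X-Robots-Tag response header and any robots meta tags in the HTML the server actually returns:

import re
import urllib.request

url = "https://example.com/affected-page"  # placeholder: swap in one of the affected pages

req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req) as resp:
    # A noindex can also arrive as an HTTP response header, not just a meta tag
    print("X-Robots-Tag header:", resp.headers.get("X-Robots-Tag"))
    html = resp.read().decode("utf-8", errors="replace")

# Print every robots meta tag present in the served HTML
for tag in re.findall(r'<meta[^>]*name=["\']robots["\'][^>]*>', html, flags=re.IGNORECASE):
    print(tag)

Note this only looks at the raw HTML; anything injected by JavaScript won’t show up here, which is where the rendered-HTML view in the URL Inspection tool helps.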
Regarding GSC @memetican:
The validation of the noindex pages fails after about 50 pages and then stops. A large proportion of the pages, including the affected CMS pages, are then never checked. This is still the case after several attempts, so something is most likely interrupting the check each time.
Regarding your suggestions @markwilly:
Our robots.txt disallows a hidefrombots directory, but since we don’t actually have such a directory, there shouldn’t be any negative impact.
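For anyone reading along, a rule of that kind looks roughly like this (generic example, not a copy of our exact file); it only blocks URLs under that path, so it wouldn’t touch the CMS collection pages:

User-agent: *
Disallow: /hidefrombots/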
The URL Inspection tool in GSC shows that these pages were last crawled months ago, even though I have submitted them several times. In addition, a few pages are shown as indexed, so there is no obvious pattern.
I think that there is an underlying problem, as you suggested, e.g. some third-party script or settings on other CMS collection pages that send the wrong signal here.
I was now thinking of adding an explicit robots meta tag at the CMS collection page level:
<meta name="robots" content="index, follow">
Are there any further suggestions from the community to solve this issue?