Why does Google not index pages inside folders?

All of my pages are being indexed by Google except the ones inside folders.

How can I get Google to index the pages inside folders?

Howdy @Izzywashere :wave: welcome to the forum.

Can you share a link to your published website, and a read-only link?

Also, in your project settings under the SEO tab, what is set in the robots.txt field?

Cheers!

Hey Drew!

Here is the link:
https://preview.webflow.com/preview/btr-596c1a?utm_medium=preview_link&utm_source=designer&utm_content=btr-596c1a&preview=dac7b6b2d1c9d3d94c68a354bb5431d4&workflow=preview

and the robots.txt field under SEO is empty


Google is quite arbitrary in deciding which pages it will index and which it won’t. In general, my experience is that Google will index the main areas of a site, between 20% and 60% of its pages, and then expand that if it’s seeing interest.

Why? Because of Zipf’s Law. Most of the Internet never gets searched, so why index it?

On your site, everything generally looks good, with the possible exception of your robots.txt.
I’ll explain why below.

Here’s what I’ve checked-

Your “foldered” pages are only your Communities pages; all other /pathed/ pages are Collection-generated pages.


SITEMAP.XML

Your community pages are appearing fine in your sitemap.xml, so Google is aware of them.

https://www.buildtorentlist.com/sitemap.xml

   <url>
      <loc>https://www.buildtorentlist.com/communities/tx/build-to-rent-homes-dallas</loc>
   </url>

They’re also not password-protected, and they don’t contain any obvious “no robots” instructions in the <head>. I haven’t checked links for noindex instructions, but that’s unlikely to be the issue.

ROBOTS.TXT

You also have no robots.txt configured, which normally would not be an issue. However, on Webflow’s 404 page you have a redirect script back to your home page.

I.e. If you try to access your robots.txt-

https://www.buildtorentlist.com/robots.txt

The server correctly responds with a 404, but then you are redirected. Again, that should not cause an issue, but you’d probably be better off setting your robots.txt content explicitly to allow-all;

User-agent: *
Allow: /

Other than that, my best advice is to work on making those pages more valuable. Add good content, update them regularly, keep adding content to your site over time, etc.

If you’re using Google Search Console, you might get better information on what’s being indexed, what’s not, and why.


Thank you this is helpful.

I am using Google Search Console, and that is where I found that it was specifically not indexing the pages inside folders.

OK, so as you suggested, I will add this to the robots.txt field:
User-agent: *
Allow: /

Would I need to delete the script that redirects the 404 page to the homepage?

Also, since the sitemap.xml includes the pages inside folders, should I submit the sitemap to Google directly?

No, I think that’s fine, and kinda cool. If it were me, I might change it to show a nice 404 page, and then redirect automatically after 4 seconds. Or alternatively provide a search, etc., to help people find what they were looking for.
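
If you go that route, the delayed redirect only needs a small tweak to the script you already have. A minimal sketch (the 4-second delay and home-page target are just examples):

<script>
  // Sketch: let the 404 page render, then send the visitor home after 4 seconds.
  setTimeout(function () {
    window.location.href = "/";
  }, 4000);
</script>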

Basically, “it can’t hurt.” On client sites I always set up Google Search Console and submit their sitemap.xml. It gives me a faster way to track indexing.

I’ve ended up inputting a custom sitemap in Webflow (after I cleaned it up) and submitted it to Google as well. Webflow was generating links I didn’t want indexed (e.g., eCommerce/checkout pages that came with the site, with no way to delete or draft them).
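
In case it helps anyone else, a custom sitemap only needs the standard wrapper plus the URLs you actually want crawled; mine looks roughly like this (trimmed to a single entry):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>https://www.buildtorentlist.com/communities/tx/build-to-rent-homes-dallas</loc>
   </url>
</urlset>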

Thank you for the insight


As I understand it, removing pages from sitemap.xml won’t tell Google “don’t index this.” It’s simply an assist to help Google find what should be indexed, and to determine whether it’s changed since the last visit.

To prevent indexing of a page, you’ll probably need to add a META tag to the HEAD of the page;

<meta name="robots" content="noindex">

It would be ideal if you could mark pages as “no search engine indexing”, and then have Webflow automatically update the HTTP response headers, METAs, robots.txt, and sitemap.xml appropriately.

https://wishlist.webflow.com/ideas/WEBFLOW-I-211
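
For reference, the HTTP-header equivalent of that META tag looks like this; as far as I know, Webflow doesn’t give you a way to set response headers directly today, so it’s informational only:

X-Robots-Tag: noindex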


Ah, OK, I see. I will add this to exclude pages and switch back to Webflow’s auto sitemap.


Hi, I added your code <meta name="robots" content="noindex">, but Webflow’s auto sitemap is still including these pages.

Changes should be instant, correct?

Yes, that’s correct; you don’t need to worry about the sitemap.xml.
It’s the META noindex that tells Google not to index your page.

Search engine indexing evolved over the years through a number of different initiatives, and I’m sure it will keep evolving, but today you can think of it as 3 primary parts…

/sitemap.xml
Tells search engines where your pages are, and when they’ve last changed.
It doesn’t tell engines what they can look at, or what they should index; it’s just a convenience mechanism that makes crawling and change detection more efficient.

/robots.txt
Tells search engines what they are allowed to look at.
That’s different from telling engines what they’re allowed to index.
In general, it’s not useful to exclude pages here because if Google has already indexed your page, and then you add it to your robots.txt, Google will never look at it again, which means it will never update or un-index it.
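
For completeness, the exclusion syntax, in case you ever do need it, looks like this (the /checkout/ path is just a hypothetical example):

User-agent: *
Disallow: /checkout/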

<meta name="robots" content="noindex">
When you have this META in the HEAD of a page, it tells search engines not to index the page, and to remove the page if they’ve already indexed it.

For 99% of Webflow websites, the best approach is an auto-generated sitemap.xml that shows everything, and META NOINDEX on pages you do not want indexed.
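
If you ever want to verify a page’s robots signals yourself, here’s a quick sketch you can run; it assumes Node 18+ for the built-in fetch, and the checkIndexability function name is mine:

// Sketch: fetch a page and report the robots signals that matter for (de)indexing.
async function checkIndexability(url) {
  const res = await fetch(url, { redirect: "follow" });
  const headerTag = res.headers.get("x-robots-tag"); // HTTP-level robots directive
  const html = await res.text();
  // Naive scan for a <meta name="robots" ...> tag in the returned HTML
  const metaTag = html.match(/<meta[^>]+name=["']robots["'][^>]*>/i);
  console.log(url);
  console.log("  HTTP status:     " + res.status);
  console.log("  X-Robots-Tag:    " + (headerTag || "(none)"));
  console.log("  META robots tag: " + (metaTag ? metaTag[0] : "(none)"));
}

checkIndexability("https://www.buildtorentlist.com/communities/tx/build-to-rent-homes-dallas");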

Got it, thanks for the clarification. Sounds like the noindex tag will achieve what I’m wanting.