Why does Google not index pages inside folders?

All of my pages are being indexed by Google except the ones inside folders.

How can I get Google to index the pages inside folders?

Howdy @Izzywashere :wave: welcome to the forum.

Can you share a link to your published website, and a read-only link?

Also, in your project settings under the SEO tab, what is set in the robots.txt field?

Cheers!

Hey Drew!

Here is the link:
https://preview.webflow.com/preview/btr-596c1a?utm_medium=preview_link&utm_source=designer&utm_content=btr-596c1a&preview=dac7b6b2d1c9d3d94c68a354bb5431d4&workflow=preview

and the robots.txt field under SEO is empty


Google is quite arbitrary in deciding which pages it will index and which it won’t. In general, my experience is that Google will index the main areas of a site, between 20% and 60% of its pages, and then expand that if it’s seeing interest.

Why? Because of Zipf’s Law. Most of the Internet never gets searched, so why index it?

On your site, everything generally looks good, with the possible exception of your robots.txt.
I’ll explain why below.

Here’s what I’ve checked-

Your “foldered” pages are only your Communities pages; all other /pathed/ pages are Collection-generated pages.


SITEMAP.XML

Your community pages are appearing fine in your sitemap.xml, so Google is aware of them.

https://www.buildtorentlist.com/sitemap.xml

   <url>
      <loc>https://www.buildtorentlist.com/communities/tx/build-to-rent-homes-dallas</loc>
   </url>

They’re also not password-protected, and they don’t contain any obvious “no robots” instructions in the <head>. I haven’t checked links for noindex instructions, but that’s unlikely to be the issue.

ROBOTS.TXT

You also have no robots.txt configured, which normally would not be an issue. However, on Webflow’s 404 page you have a redirect script back to your home page.

I.e. If you try to access your robots.txt-

https://www.buildtorentlist.com/robots.txt

The server correctly responds with a 404, but then you are redirected. Again, that should not cause an issue, but you’d probably be better off setting your robots.txt content explicitly to allow-all;

User-agent: *
Allow: /

Other than that, my best advice is to work on making those pages more valuable. Add good content, update them regularly, keep adding content to your site over time, etc.

If you’re using Google Search Console, you might get better information on what’s being indexed, what’s not, and why.


Thank you this is helpful.

I am using Google Search Console, and that is where I found that it was specifically not indexing the pages inside folders.

OK, so as you suggested, I will add this to the robots.txt field:
User-agent: *
Allow: /

Would I need to delete the script that redirects the 404 page to the homepage?

Also, since the sitemap.xml includes the pages inside folders, should I submit the sitemap to Google directly?

No, I think that’s fine, and kinda cool. If it were me, I might change it to show a nice 404 page, and then redirect automatically after 4 seconds. Or alternatively provide a search, etc., to help people find what they were looking for.
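
If you go that route, the delayed redirect only needs a small tweak to the script you already have. A minimal sketch (the 4-second delay and home-page target are just examples):

<script>
  // Sketch: let the 404 page render, then send the visitor home after 4 seconds.
  setTimeout(function () {
    window.location.href = "/";
  }, 4000);
</script>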

Basically, “it can’t hurt.” On client sites I always set up Google Search Console and submit their sitemap.xml. It gives me a faster way to track indexing.

I’ve ended up inputting a custom sitemap in Webflow (after I cleaned it up) and submitted it to Google as well. Webflow was generating links I didn’t want indexed (e.g., eCommerce/checkout pages that came with the site, with no way to delete or draft them).
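
In case it helps anyone else, a custom sitemap only needs the standard wrapper plus the URLs you actually want crawled; mine looks roughly like this (trimmed to a single entry):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>https://www.buildtorentlist.com/communities/tx/build-to-rent-homes-dallas</loc>
   </url>
</urlset>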

Thank you for the insight


As I understand it, removing pages from sitemap.xml won’t tell Google “don’t index this.” It’s simply an assist to help Google find what should be indexed, and to determine whether it’s changed since the last visit.

To prevent indexing of a page, you’ll probably need to add a META tag to the HEAD of the page;

<meta name="robots" content="noindex">

It would be ideal if you could mark pages as “no search engine indexing”, and then have Webflow automatically update the HTTP response headers, METAs, robots.txt, and sitemap.xml appropriately.

https://wishlist.webflow.com/ideas/WEBFLOW-I-211
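
For reference, the HTTP-header equivalent of that META tag looks like this; as far as I know, Webflow doesn’t give you a way to set response headers directly today, so it’s informational only:

X-Robots-Tag: noindex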


Ah, OK, I see. I will add this to exclude pages and switch back to Webflow’s auto sitemap.


Hi, I added your code <meta name="robots" content="noindex">, but Webflow’s auto sitemap is still including these pages.

Changes should be instant, correct?

Yes, that’s correct; you don’t need to worry about the sitemap.xml.
It’s the META noindex that tells Google not to index your page.

Search engine indexing evolved over the years through a number of different initiatives, and I’m sure it will keep evolving, but today you can think of it as 3 primary parts…

/sitemap.xml
Tells search engines where your pages are, and when they’ve last changed.
It doesn’t tell engines what they can look at, or what they should index; it’s just a convenience mechanism that makes crawling and change detection more efficient.

/robots.txt
Tells search engines what they are allowed to look at.
That’s different from telling engines what they’re allowed to index.
In general, it’s not useful to exclude pages here because if Google has already indexed your page, and then you add it to your robots.txt, Google will never look at it again, which means it will never update or un-index it.
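
For completeness, the exclusion syntax, in case you ever do need it, looks like this (the /checkout/ path is just a hypothetical example):

User-agent: *
Disallow: /checkout/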

<meta name="robots" content="noindex">
When you have this META in the HEAD of a page, it tells search engines not to index the page, and to remove the page if they’ve already indexed it.

For 99% of Webflow websites, the best approach is an auto-generated sitemap.xml that shows everything, and META NOINDEX on pages you do not want indexed.
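
If you ever want to verify a page’s robots signals yourself, here’s a quick sketch you can run; it assumes Node 18+ for the built-in fetch, and the checkIndexability function name is mine:

// Sketch: fetch a page and report the robots signals that matter for (de)indexing.
async function checkIndexability(url) {
  const res = await fetch(url, { redirect: "follow" });
  const headerTag = res.headers.get("x-robots-tag"); // HTTP-level robots directive
  const html = await res.text();
  // Naive scan for a <meta name="robots" ...> tag in the returned HTML
  const metaTag = html.match(/<meta[^>]+name=["']robots["'][^>]*>/i);
  console.log(url);
  console.log("  HTTP status:     " + res.status);
  console.log("  X-Robots-Tag:    " + (headerTag || "(none)"));
  console.log("  META robots tag: " + (metaTag ? metaTag[0] : "(none)"));
}

checkIndexability("https://www.buildtorentlist.com/communities/tx/build-to-rent-homes-dallas");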

Got it, thanks for the clarification. Sounds like the noindex tag will achieve what I’m wanting.