I am trying to tell crawlers not to index any pages inside the modules folder, but unfortunately Webflow's auto-generated sitemap still includes every URL inside the /modules/ folder.
Using robots.txt isn’t the best approach for keeping specific pages out of search results: it only asks search engines not to crawl certain pages and doesn’t guarantee they won’t be indexed. Here are two more effective methods:
The simplest solution is to use Webflow’s Sitemap indexing toggle. You can find this in your page settings by clicking the settings icon (gear) next to each page in the Pages panel. Toggle “Include in sitemap” to off for any pages you want to exclude from search engine indexing.
For more control, you can disable the auto-generate sitemap feature in your site settings and create your own custom sitemap.xml file. This lets you define exactly which URLs are included in your sitemap. You can then add this custom sitemap in your site settings under the SEO tab.
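If you go the custom sitemap route, a small script can build the file for you. This is only a minimal sketch in Python, assuming you already have a list of your published URLs; the function name build_sitemap, the example.com URLs, and the excluded_prefix parameter are made up for illustration and are not part of Webflow.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls, excluded_prefix="/modules/"):
    """Return sitemap XML for every URL whose path is outside excluded_prefix.

    Hypothetical helper: `urls` is a list of absolute URLs you supply yourself.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        # Skip anything that lives under the excluded folder.
        if urlparse(url).path.startswith(excluded_prefix):
            continue
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        loc.text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Placeholder URLs; replace with your own site's pages.
    pages = [
        "https://example.com/",
        "https://example.com/about",
        "https://example.com/modules/intro",
    ]
    print(build_sitemap(pages))
```

The output contains only the URLs outside /modules/, and you would need to regenerate it whenever pages are added or removed, since a custom sitemap no longer updates itself.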
Hopefully this helps! If you still need assistance, please reply here so somebody from the community can help.
A few things are not currently supported here:
Webflow does not have a collection page setting to exclude collection pages from the sitemap, the way it does for static pages.
Robots.txt is just a text file for bots. Webflow does not use or analyze it in any way, including for sitemap generation.
So, here are your options, depending on what exactly you want.
Static sitemap.xml. Generate exactly what you want. This is the only native way to publish the collection pages and exclude them from sitemap.xml at the same time.
Robots.txt can tell bots to avoid downloading and indexing those pages. Some bots, especially AI bots, ignore this. Googlebot respects it; however, if your page is already indexed, it may stay indexed. Robots.txt tells robots where not to go, not what should be in the index.
Meta robots on the page. You can add a meta tag (for example, <meta name="robots" content="noindex">) to the collection page head to indicate that it should not be indexed. The page will still be in the sitemap, which is probably fine.
If you’re trying to make it invisible to the world, password protection is your best bet.
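For the robots.txt option above, it can help to check that a rule actually blocks the paths you expect before you publish it. The sketch below uses Python's standard urllib.robotparser. The rule text, the example.com URLs, and the helper name is_crawl_allowed are assumptions for illustration; your own robots.txt may need different rules.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: ask all bots to stay out of /modules/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /modules/
"""


def is_crawl_allowed(robots_txt, url, user_agent="*"):
    """Return True if a bot that obeys robots_txt may fetch url.

    Hypothetical helper; it only models well-behaved crawlers.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URLs; replace with pages from your own site.
    for page in ("https://example.com/modules/intro", "https://example.com/about"):
        print(page, is_crawl_allowed(ROBOTS_TXT, page))
```

This only shows what a compliant bot would do with the rule. It says nothing about bots that ignore robots.txt or about pages that are already in an index.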