First - Webflow is incredible. Amazing work and kudos to the entire team.
So I’ve come across two problems with how Webflow generates sitemaps when there are multiple custom domains (without setting the primary domain):
First, we publish sites to a staging.* URL for client reviews/approvals, then to www for production. However, custom domains appear to be sorted alphabetically, and the one highest in the list appears to be used in the sitemap.xml file. So our sitemap.xml on www.* is full of staging.* URLs.
According to the sitemap spec, this doesn’t seem to be valid without some additional setup (which Webflow doesn’t do, and which really doesn’t seem necessary): the URLs listed in a sitemap are expected to be on the same host as the sitemap file itself. I’d suggest that the sitemap.xml file use the published host URL when generated, as this seems the more expected behavior. Alternatively, maybe some sort of configuration option so you can specify which custom domain should be used for the sitemap - but again, this seems like an edge case considering best practices. I realize the SEO implications of duplicate content, but the sitemap as Webflow handles it now won’t fix that issue, and users who truly are publishing identical public content to two separate URLs - well, can we really help them?
Second, due to the issue above, I deleted the staging custom domain and republished www (to correct the sitemap.xml issue). However, the sitemap.xml file was not updated (it still contained staging URLs). It was only after I made some material change to content/design and republished that the sitemap was correctly regenerated. This issue would also be resolved by addressing the one above, though!
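For anyone who wants to check whether their own sitemap has the first problem, here is a minimal sketch in Python. It parses a sitemap and lists any `<loc>` entries whose host differs from the host the sitemap was published on. The function name `off_host_urls` and the example.com domains are made up for illustration; in practice you would fetch the real sitemap.xml and pass its text in.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace used by the sitemaps.org 0.9 protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def off_host_urls(sitemap_xml: str, expected_host: str) -> list[str]:
    """Return every <loc> URL whose host is not expected_host."""
    root = ET.fromstring(sitemap_xml)
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    urls = [loc.text.strip() for loc in locs if loc.text]
    return [u for u in urls if urlparse(u).hostname != expected_host]


# Hypothetical sitemap as it might be served from www.example.com.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://staging.example.com/</loc></url>
  <url><loc>https://www.example.com/about</loc></url>
</urlset>"""

for url in off_host_urls(sample, "www.example.com"):
    print("Off-host URL in sitemap:", url)
```

An empty result means every URL in the sitemap matches the host it was published on.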
Thanks for the kind words and for taking the time to write out this detailed post—I really appreciate your feedback!
Currently this is expected behavior for the sitemap to use all custom domains added to a project. The ability to manually choose which domains are added to the sitemap would be an awesome feature to have in Webflow. It would give you a lot more control for sure—this is something you could post about in our Webflow Wishlist.
So just to clarify, it sounds like you’re saying it is expected behavior for the sitemap on one custom domain to contain URLs from a different custom domain in the project. Can you tell me how that determination is made when the site is published? Is it in fact alphabetical, with the first custom domain used for all sitemaps?
That would at least help us when we’re setting up domains for clients.
I spoke with our team about this further and confirmed that the domain used is based on which URL is set as the default domain. If no domain was set as the default, there’s a lot of additional logic to prioritize specific domains over others.
With this in mind, I’d suggest setting a default domain prior to publishing your projects going forward.
Thanks @Brando - unfortunately setting a default domain isn’t a viable solution, as we then lose access to the staging.* URL (since all other domains redirect to the default). Essentially, if we could use the default domain setting, we wouldn’t encounter this issue.
It’s a bummer that there’s not more clarity around which URL is selected for the sitemap - that would make this issue something we could work around. But I do appreciate the responses! If you ever decide to document the specifics, it would be much appreciated!
Hi @nealwhite, that is a good point, at least until the feature changes. I will add an update to the Help Center doc on the sitemap, pointing out that if no default domain is selected, all custom domains will be shown in the sitemap without prioritization.
One other thing to try is to set up a robots.txt entry in the SEO section of site settings and disable crawling of the subdomain via robots.txt.
The sitemap.xml provides search engines with a direct path to each page of your site, which allows quick indexing of all the pages, whereas robots.txt is used to tell search engines which pages to crawl and which not to crawl.
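To check what a given robots.txt actually tells crawlers, something like the following sketch could help. It uses Python's standard `urllib.robotparser` to test whether a URL is crawlable under a given set of rules. The helper name `is_crawlable` and the example.com URL are hypothetical, and the rules shown are generic examples rather than anything Webflow generates; whether the staging subdomain serves its own robots.txt is something to verify on your own project.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules: one blocks all crawling, the other allows everything.
block_all = "User-agent: *\nDisallow: /\n"
allow_all = "User-agent: *\nDisallow:\n"

# Hypothetical staging URL used only for illustration.
staging_url = "https://staging.example.com/about"

print("Blocked rules allow crawl:", is_crawlable(block_all, staging_url))
print("Open rules allow crawl:", is_crawlable(allow_all, staging_url))
```

Note that robots.txt only asks crawlers not to fetch pages; it does not change which URLs appear in the sitemap.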