Protect/exclude PDFs from search engines

Hi, we need to fairly securely exclude PDFs uploaded via Webflow CMS. There are links to the PDF on Collection Templates (unprotected pages).

Two questions:

  1. What can we implement to make sure that these PDFs are not crawled and indexed by Google and Co?

  2. Is it correct that, should Google get hold of the PDFs’ addresses and their content, there is no way for us to request removal because the PDFs are hosted under Webflow’s domain?

Many thanks!

Either don’t link to them, or qualify the outbound links. See https://support.google.com/webmasters/answer/96569?hl=en

This does not mean that other search engines will respect your intentions.

Yep, and Webflow did not provide a way for you to deal with this issue the last time I checked. Deletes can take time to propagate, or may not propagate at all. There are some threads about this in the forum.

Recommendation: Don’t use Webflow to host assets where you need access control / restrictions. This also applies to any of the “Member” third-party solutions. If someone has the asset URI, they can access the file at will (bots included). The only time the resource requires authentication is if it was an asset created by the file field on a Webflow form. Personally, I would move them to a cloud service I can control, such as AWS S3 buckets or something similar.
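
On the indexing question, a host you control can also send a response header asking crawlers not to index the files. A minimal sketch, assuming the PDFs end up on an Apache server with mod_headers enabled and .htaccess overrides allowed (neither is confirmed in this thread):

```apache
# Assumption: Apache with mod_headers available; this is not possible on Webflow's asset hosting.
<IfModule mod_headers.c>
  # Ask crawlers not to index any PDF served from this directory tree.
  <FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex, nofollow"
  </FilesMatch>
</IfModule>
```

This is only a request to well-behaved crawlers, not access control. Anyone who has the URL can still download the file.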

Thank you for your advice on this, Jeff @webdev

I think the best option for our circumstances right now would be to host the files outside of Webflow, as you suggest, provided it is possible to simply allow links from the Webflow site and block everything else. I’m thinking of a bog-standard Apache account, managed with .htaccess instructions. I know that this isn’t 100% ideal and that it can break, e.g. if a browser is configured not to send cross-site referrer information.

For some reason I can’t get it working, and I wanted to ask whether anyone knows what the reason might be. This is the .htaccess code I’ve got in place:

RewriteEngine On
RewriteBase /

# let traffic from this referrer through
RewriteCond %{HTTP_REFERER} ^https://www.webflowsite.com/
RewriteRule ^ - [L]

# redirect other traffic to page with error message
RewriteRule ^ https://www.webflowsite.com/errormessage [R,L]

The redirect to the page with the error message works, but links from the Webflow site are not allowed through either. I’ve tried linking to https://www.whatismyreferer.com and that seems to confirm that clicking the link submits the correct referrer domain.

Why won’t linking from the Webflow site go through?

Thank you.

I have no way to test this for you. I would suggest disabling the rule and sampling the Apache access log, paying attention to the referrer, and isolating the problem from there.
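
Once the log shows what Referer value actually arrives, a more tolerant variant of the rule may be worth trying. This is an untested sketch, not a diagnosis. It assumes the Referer might come with or without "www" and might contain only the origin, and it limits the redirect to PDF requests, which the original rule did not do:

```apache
RewriteEngine On

# Assumption: the Referer may be "https://webflowsite.com/..." or
# "https://www.webflowsite.com/...", possibly reduced to just the origin.
# Dots are escaped so they match literally; NC makes the match case-insensitive.
RewriteCond %{HTTP_REFERER} !^https://(www\.)?webflowsite\.com(/|$) [NC]

# Redirect PDF requests that did not come from the site to the error page.
RewriteRule \.pdf$ https://www.webflowsite.com/errormessage [R=302,L,NC]
```

If the log shows an empty Referer for clicks from the site, no pattern will help. One thing to check in that case is whether the Apache host is served over plain HTTP, since browsers commonly omit the Referer when going from an HTTPS page to an HTTP URL.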

Thank you :ok_hand: :+1: