Google Search Console Reports Blocked Resources for AWS CloudFront Content

After configuring AWS CloudFront to serve my Magento images, JS and CSS, I noticed Google Search Console started to report a lot of blocked resources. This can and did affect SEO and traffic, so I obviously wanted to fix it ASAP, but it wasn’t as simple as I expected.

I found a few forum posts referring to the same problem, such as this one on webmasters.stackexchange and this one on the AWS forums, and whilst they offered some clues they did not provide a solution.

Troubleshooting was quite difficult and slow: I could of course do fetch and render tests on the main domain, but it is not possible to test how Googlebot fetches the CDN resources via calls to the main domain.

The crux of the problem seemed to be with robots.txt and specifically how CloudFront handles it. The CloudFront distribution simply copied the robots.txt from the origin, so the file served from the CDN was identical. The robots.txt I was using shouldn’t have caused any problems, and Google crawled the origin server fine, but to test I changed it to a much simpler one that explicitly allowed Googlebot (something like the example below), invalidated the cache for the file and waited to see what happened. The errors continued though, so it seemed like it wasn’t robots.txt after all.
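For reference, an allow-all robots.txt is just the two lines below, and the cached copy of that single file can be purged through the AWS CLI (the distribution ID here is a placeholder):

User-agent: *
Disallow:

# purge only robots.txt from the edge caches
aws cloudfront create-invalidation --distribution-id EXXXXXXXXXXXX --paths "/robots.txt"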

My next theory to test was whether HTTP/2 was the culprit. I had HTTP/2 enabled on the server and in CloudFront, but Googlebot does not crawl over HTTP/2, so I wondered if it somehow got confused: it arrives at the page on the server with an HTTP/1.1 request (which the server handles with no problem), but perhaps something happens when the requests for the static assets are routed to the CDN and they are presented to CloudFront as HTTP/2, resulting in responses Googlebot cannot understand. I tested by turning HTTP/2 off on both the server and CloudFront. It was a nice theory, but it was not the case – the blocked resources reported continued to rise.
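For anyone wanting to run the same kind of check, curl can force the protocol and report what was actually negotiated with CloudFront – the CDN URL below is just a placeholder:

# fetch a static asset over HTTP/1.1, the way Googlebot would, and inspect the headers
curl -sI --http1.1 https://cdn.example.com/js/example.js

# ask curl which protocol version the server actually negotiated
curl -s -o /dev/null -w '%{http_version}\n' --http2 https://cdn.example.com/js/example.js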

At this point I put my attention back on robots.txt, as the forum posts above suggested that CloudFront might process robots.txt in a non-standard way – although some people say it does not process it at all. My idea was to make sure Googlebot received a correct robots.txt with a 200 status response by using a Lambda function, triggered by a CloudFront event, to return the response instead of leaving it to CloudFront to return it. So I implemented the following:

'use strict';

// The robots.txt body to return from the edge – an allow-all policy
const content = `User-agent: *
Disallow:`;

exports.handler = (event, context, callback) => {
    // Build a complete HTTP response at the edge so CloudFront never has to
    // go to the origin for robots.txt on this path
    const response = {
        status: '200',
        statusDescription: 'OK',
        headers: {
            'cache-control': [{ key: 'Cache-Control', value: 'max-age=100' }],
            // the charset belongs in Content-Type; Content-Encoding is for
            // compression (gzip/br), not character sets, so it is omitted here
            'content-type': [{ key: 'Content-Type', value: 'text/plain; charset=utf-8' }],
        },
        body: content,
    };
    callback(null, response);
};

Then I added a behavior to the CloudFront distribution with the path pattern /robots.txt and, under Lambda Function Associations, configured a CloudFront event of type “Viewer Request” to trigger the Lambda function.
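The relevant part of the distribution config (as returned by aws cloudfront get-distribution-config) ends up looking roughly like this – trimmed down, with a placeholder origin ID and a placeholder function ARN that includes a published version number:

{
  "PathPattern": "/robots.txt",
  "TargetOriginId": "my-magento-origin",
  "ViewerProtocolPolicy": "redirect-to-https",
  "LambdaFunctionAssociations": {
    "Quantity": 1,
    "Items": [
      {
        "LambdaFunctionARN": "arn:aws:lambda:us-east-1:123456789012:function:serve-robots-txt:1",
        "EventType": "viewer-request"
      }
    ]
  }
}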

N.B. – I had to create the Lambda function in the US East (N. Virginia) region (us-east-1) in order for it to be accessible for use with a CloudFront event – it ONLY works if your function is in this region.
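CloudFront also won’t associate the $LATEST version, so the function has to be published as a numbered version first – from the CLI that is something like the following, with a placeholder function name:

aws lambda publish-version --function-name serve-robots-txt --region us-east-1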

I saved everything and waited to see the results. The very next day I saw a reduction in the blocked resources reported in GSC, and the number continued to fall over the following days, fixing the problem.
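If you don’t want to wait for GSC, a quick curl against the CDN hostname (a placeholder below) confirms the Lambda is now answering for robots.txt – it should come back with a 200 and the allow-all rules from the function rather than whatever is on the origin:

curl -i https://cdn.example.com/robots.txt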

So it seems certain that CloudFront handles robots.txt in a non-standard way, or at least not the same way a standard web server does, and if Googlebot is crawling your server and finds resources served from a cdn subdomain on CloudFront it can cause problems. There doesn’t seem to be any official confirmation of this in the AWS documentation though, and using Lambda was a bit hacky, but luckily it got the job done.
