CloudFront CDN showing as blocked by robots.txt in Search Console

Recently I switched a Magento site to use CloudFront as a CDN for the media, skin and js directories, and shortly afterwards I started seeing warnings in the Blocked Resources report of Google Search Console saying that the resources served by the CDN were blocked to Google.

I ran the usual tests through GSC: checking robots.txt access, verifying that everything was accessible and so on, and it showed no problems. The robots.txt of the main domain was being synchronised to the CDN, so exactly the same file was served whether it was accessed via www.domain.com or cdn.domain.com, and it was not restricting access to those resources. As a further check I verified cdn.domain.com as its own property in GSC and ran the same checks on its robots.txt from there, but everything seemed fine.

I tried some tests such as adding specific Allow directives to robots.txt to try to force Google to see that things were accessible, along the lines of the example below. To allow for the time GSC takes to refresh I was of course waiting several days to a week to see if anything changed (one of the most frustrating parts of troubleshooting this kind of thing), but the errors continued to increase.
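For reference, the kind of thing I was adding looked roughly like this; the paths are just the Magento directories being served from the CDN, not my exact file:

    User-agent: *
    Allow: /media/
    Allow: /skin/
    Allow: /js/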

Googling the issue there was not much information, but I found one Stack Exchange post reporting the same problem, and whilst the poster seemed to resolve it, the answers didn't seem completely conclusive; they did at least give me something to explore. In short, the suggestion was that CloudFront reads the robots.txt itself, processes it differently to how a standard crawler would, and serves its own response.

I focused my testing on CloudFront rather than the site, as that seemed to be the cause of the issue. I inspected the response headers using curl and saw that the response itself looked fine, but that it was being served over HTTP/2, which I had enabled on both the server and CloudFront. Googlebot doesn't use HTTP/2, so I wondered if that was somehow confusing things. I wasn't sure exactly how, but perhaps something like this: Googlebot crawls a URL on the server, which correctly responds over HTTP/1.1, but the CDN resources are then somehow fetched in a way that responds over HTTP/2, which the bot doesn't understand. Something along those lines, and easy enough to test by disabling HTTP/2 on CloudFront (and later on the server). Neither change worked and the errors continued.
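I did the checking with curl, but if you prefer scripting it, a minimal sketch in Python with the httpx library (installed with its http2 extra) does the same job; the hostname and asset path below are just placeholders standing in for the real CDN URL:

    import httpx  # pip install "httpx[http2]"

    # Fetch a CDN-hosted asset and report the negotiated protocol plus a few
    # of the response headers CloudFront adds. The URL is a placeholder.
    url = "https://cdn.domain.com/js/prototype/prototype.js"

    with httpx.Client(http2=True) as client:
        resp = client.get(url)
        print(resp.http_version)   # "HTTP/2" or "HTTP/1.1"
        print(resp.status_code)
        for name in ("server", "via", "x-cache"):
            print(name, "=", resp.headers.get(name))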

I also tried a standard free-for-all robots.txt of “User-agent: * Disallow:” on the server and forced an expiry on the CDN so it would pick the new file up, but that didn't fix it either.
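Forcing the expiry is just a CloudFront invalidation of the robots.txt path. A rough equivalent with boto3, assuming your AWS credentials are configured and with a placeholder distribution ID, would be:

    import time
    import boto3

    # Invalidate the cached robots.txt so CloudFront re-fetches it from the
    # origin. "EXXXXXXXXXXXXX" is a placeholder distribution ID.
    cloudfront = boto3.client("cloudfront")
    cloudfront.create_invalidation(
        DistributionId="EXXXXXXXXXXXXX",
        InvalidationBatch={
            "Paths": {"Quantity": 1, "Items": ["/robots.txt"]},
            "CallerReference": str(time.time()),
        },
    )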

I tried posting in the Google Webmaster forums and the Amazon forums, and no one seemed to know the answer, but I stumbled across something I could at least try. I found an article on Medium from someone who adds a small Lambda@Edge function to disallow robots on his CloudFront distributions, so I simply used it to do the opposite and force CloudFront to serve a robots.txt of “User-agent: * Disallow:” to Googlebot each time it visited.
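I won't reproduce the article's function here, but the flipped-around idea is roughly the following, written here as a sketch for the Python Lambda@Edge runtime rather than copied from the article:

    # Lambda@Edge viewer-request handler: answer any request for /robots.txt
    # straight from the edge with a fully permissive file, so whatever
    # CloudFront would otherwise do with robots.txt never comes into play.
    def lambda_handler(event, context):
        request = event["Records"][0]["cf"]["request"]

        if request["uri"] == "/robots.txt":
            return {
                "status": "200",
                "statusText": "OK",
                "headers": {
                    "content-type": [
                        {"key": "Content-Type", "value": "text/plain"}
                    ]
                },
                "body": "User-agent: *\nDisallow:\n",
            }

        # Anything else passes through to CloudFront and the origin untouched.
        return request

This serves the permissive file to every client, Googlebot included, which was all I needed. Note that Lambda@Edge functions have to be created in us-east-1 and attached to the distribution as a viewer-request trigger.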

About three or four days after implementing the Lambda function the errors dropped for the first time, and they continued to fall over the following weeks until the problem was fixed. So it does seem that CloudFront somehow processes your robots.txt and presents a non-standard response to requests. I'm still not sure of the exact reason why, or whether it is a bug, but at least I found a solution in the end.

 
