RT Cunningham

404 Errors and Nonexistent Web Pages – Errors that Won’t Go Away

The Bing and Google search engines don’t seem to know how to let go of nonexistent web page URLs. Their crawlers keep requesting pages that have returned 404 errors for more than a year.

I understand Google’s policy (and I know nothing about Microsoft’s policy). Google doesn’t automatically remove web pages from its index. The pages have to return 404 errors continuously for some amount of time, which I thought was 90 days. I guess I was wrong.

404 Errors Forever

I have a script that parses my access log. I have it ignore the lines with 404 errors. After all, those 404 errors should eventually disappear, right? Wrong!

A few days ago, I commented out the line in the script that ignores 404 errors. When I viewed the results on the next run, I saw dozens of URLs for pages I removed more than a year ago. Some of them go back to 2011 and 2012 (pages I removed in 2013).
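For anyone who wants to run the same kind of check, here is a minimal sketch in Python. It isn't my actual script. It assumes the access log uses Nginx's default "combined" format, and the function name count_404s and the user agent substrings are only illustrative.

```python
import re
from collections import Counter

# Pulls the request line, status code and user agent out of a log line.
# Assumes nginx's default "combined" log format.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<uri>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_404s(lines, agents=("bingbot", "Googlebot")):
    """Count 404 responses per URI for requests made by the given crawlers."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match or match.group("status") != "404":
            continue
        agent = match.group("agent").lower()
        # Keep only requests whose user agent names one of the crawlers.
        if any(name.lower() in agent for name in agents):
            counts[match.group("uri")] += 1
    return counts
```

Feeding it an open log file, as in count_404s(open("access.log")), returns a Counter keyed by URI. The file name is an assumption, and user agent strings can be faked, so treat the result as a rough list.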

I’m confused. I guess Bing and Google both treat a URL as permanent, even when it returns a 404 error. Who knows, maybe that page will suddenly reappear? Yeah, right.


410 Errors instead of 404 Errors

The list of HTTP status codes is long, but I’m only concerned about the ones that tell the search engines the pages don’t exist anymore.

404 means “not found”. The spec doesn’t say whether that condition is temporary or permanent, so the client can ask for that page again later. 410 means “gone” and tells the client the page was removed on purpose and isn’t expected to reappear. I use the Nginx web server and I can make every 404 error turn into a 410 error, but that isn’t a good idea. I need to do it only for the pages I know don’t exist.

After I decided to do this, shortly after finding all those resurrected web page URLs, I created a special file that’s included at the HTTP level of the web server. The content looks like this (I removed all the entries, leaving only an example for display purposes):

map $request_uri $gone_uri {
    # 0 means the URL is not on the "gone" list
    default 0;
    # ~* marks a case-insensitive regular expression
    ~*^/all\-tags\-list\.html 1;
}

I created another file that’s included at the SERVER level of the web server:

if ($gone_uri) { 
  return 410;
}

I tested a couple of URLs I’d placed in the HTTP level file and they displayed “410 Gone” in the web browser. This is exactly what I want. I don’t want them bringing up a specialized 404 error page because no human being should ever ask for them.
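Checking in a browser works for a couple of URLs. For a longer list, a small script can do the same thing. This is only a sketch using Python's standard library. The function name status_of is made up for the example, and the URL in the usage note is a placeholder, not my real address.

```python
import urllib.error
import urllib.request


def status_of(url, timeout=10):
    """Return the HTTP status code a HEAD request for url ends up with."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx and 5xx responses; the code is on the exception.
        return err.code
```

Calling status_of("https://example.com/all-tags-list.html") should return 410 for any URL the map marks as gone. Note that urllib follows redirects, so a redirected URL reports the status of its final destination.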

Will this Solve My Error Page Problem?

Only time will tell. I’m listing the nonexistent pages by month, starting with November. In February, if this website still exists, I’ll turn November’s list into comments and see what happens.
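Keeping the entries grouped by month is easier if the file is generated instead of typed by hand. Here is a hedged sketch of one way to do that in Python (3.7 or later). The function name map_block and the dictionary layout are my own inventions for this example. It assumes plain paths with no spaces, braces or semicolons, which Nginx would need quoted.

```python
import re


def map_block(gone_by_month):
    """Build the text of an nginx map block from {month label: [paths]}."""
    lines = ["map $request_uri $gone_uri {", "    default 0;"]
    for month, paths in gone_by_month.items():
        # One comment per month makes it easy to comment a whole group out later.
        lines.append(f"    # {month}")
        for path in paths:
            # re.escape backslash-escapes the regex metacharacters in the path.
            lines.append(f"    ~*^{re.escape(path)} 1;")
    lines.append("}")
    return "\n".join(lines)
```

Printing map_block({"November 2017": ["/all-tags-list.html"]}) gives a block with the same kind of entry as the one I showed earlier, under a "# November 2017" comment.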

If this works the way it should, those pages should disappear from the indexes for both Bing and Google. Having to do this at all is silly. 404 errors should disappear in a reasonable amount of time and I consider 90 days reasonable. Unfortunately, Bing and Google don’t seem to agree with me.

I didn’t mention Baidu, Yandex, Seznam or any of the other large search engines because I block them. Most of my web traffic comes from the United States, followed by the Philippines (less than a third). I get random visits from Canada and the United Kingdom.

The only visits I ever used to get from China, Russia, Ukraine and the Czech Republic were bots. Those bots included search engine crawlers and spam bots.

I block a lot more countries, but I won’t mention them all. Some of those were causing 404 errors for URLs that will never exist – mostly exploit attempts.


November 10, 2017
Web Development
