In August of 2016, I wrote about an Nginx application firewall. I focused on the server software side of things. Today, I’m going to focus on the bots themselves.
Your website obviously needs to be visited by search engines. Does it need to be visited by every search engine in the world? I doubt it. I don’t have a comprehensive list of search engine bots, but I think I have a lot of them.
I’ll list the bots and you can decide if you would like to let them in or block them:
I know I don’t have all of them. I’m obviously missing the ones I’ve never received visits from.
Again, I can only list the ones I’ve received visits from:
If I have any information about these bots, I’ll give it to you. Otherwise, I won’t display any notes at all:
If you have a dynamic website, especially one that uses a database, you need to block the aggressive bots. At the very least, try to control them using your robots.txt file.
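As a starting point, here is a minimal robots.txt sketch. The bot names below (other than MJ12bot, which comes up later in this post) are just placeholders — swap in the user agents you actually see in your logs. Note that Crawl-delay is only honored by some bots; well-behaved crawlers respect Disallow, and the aggressive ones often ignore robots.txt entirely, which is when firewall rules come in.

```text
# Shut out an aggressive crawler entirely
User-agent: MJ12bot
Disallow: /

# Slow down a bot that honors Crawl-delay (many don't)
User-agent: ExampleBot
Crawl-delay: 10

# Everyone else may crawl normally
User-agent: *
Disallow:
```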
The most dangerous bots are those that disguise themselves with regular web-browser user-agent strings. When you spot one, block either the single IP address or the entire CIDR range, depending on the country of origin.
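In Nginx (the firewall I covered back in 2016), blocking an IP or a CIDR range is a couple of deny directives. The addresses below are illustrative placeholders from the reserved documentation ranges, not real offenders:

```nginx
# Block a single cloaked-UA offender, or its entire CIDR range.
# Replace these documentation addresses with the ones from your logs.
location / {
    deny 203.0.113.45;       # a single IP address
    deny 198.51.100.0/24;    # a whole CIDR range
    allow all;
}
```

The deny directives can also live at the server or http level if you want them to apply to every location.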
If your target audience is English-speaking Americans, it may make sense to block the search engines from other countries. Keep in mind that some Americans still use them, especially those for whom English is a second language. Consider carefully.
My website is static. It only invokes PHP on the contact page. The rest of the files are plain text files. Some of the bots used to bother me. These days, I just don’t care.
If you get a lot of human traffic, you can afford to block a lot of bots. When you don’t, blocking a bunch of them isn’t a good idea. Concentrate on the ones that could be eating up all your bandwidth or connecting dozens of times a second (like MJ12bot).
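If bandwidth hogs like MJ12bot ignore your robots.txt, you can refuse them by user-agent string in Nginx. MJ12bot is the one named above; the other names in the pattern are examples only — edit the regex to match your own logs:

```nginx
# Return 403 to aggressive crawlers by User-Agent (case-insensitive match).
# MJ12bot is real; the others are illustrative — tune this to your traffic.
if ($http_user_agent ~* (MJ12bot|ExampleBot|OtherBot)) {
    return 403;
}
```

This goes in the server block. It only stops bots that identify themselves honestly; the cloaked ones still need IP or CIDR blocks.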
I’ll try to keep my lists updated as much as possible. I can’t guarantee anything because, again, I don’t care. I have better things to do with my time. No one should have to bother with this nonsense.