Bots vs. Humans – Protect your Website from all Kinds of Bots
Unless you have an extremely popular website, you will always get more visits from bots than humans. It seems like everyone wants a piece of your pie. They don’t even care if your pie is old and stale.
In August of 2016, I wrote about an Nginx application firewall. I focused on the server software side of things. Today, I’m going to focus on the bots themselves.
Search Engine Bots
Your website obviously needs to be visited by search engines. Does it need to be visited by every search engine in the world? I doubt it. I don’t have a comprehensive list of search engine bots, but I think I have a lot of them.
I’ll list the bots and you can decide if you would like to let them in or block them:
- Baiduspider – Chinese search engine
- bingbot – Microsoft’s search engine
- coccocbot – Vietnamese search engine
- daum – South Korean search engine
- Findxbot – European search engine
- googlebot – Google’s search engine
- MegaIndex – Russian search engine
- OrangeBot – French search engine
- SeznamBot – Czech search engine
- Yahoo! Slurp – The search engine for Yahoo!
- YandexBot – Russian search engine
- yoozBot – Iranian search engine
I know I don’t have all of them; I’m obviously missing the ones I’ve never received visits from.
The same caveat applies below – I can only list the bots that have actually visited me:
- James Bot
If I have any information about these bots, I’ll give it to you. Otherwise, I won’t display any notes at all:
- Buzzbot – buzzstream.com
- Dataprovider – dataprovider.com
- finbot – financialbot.com
- GarlikCrawler – garlik.com
- Scrapy – user crawler – scrapy.org
- Voltron – user crawler – 80legs.com
If you have a dynamic website, especially one that uses a database, you need to block the aggressive bots. At the very least, try to control them using your robots.txt file.
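For the polite bots that actually honor robots.txt, a few directives go a long way. This is just a sketch; the bot names come from the lists above, and the delay value is an arbitrary example:

```
# Slow down an aggressive crawler (the 10-second delay is an arbitrary example)
User-agent: MJ12bot
Crawl-delay: 10

# Shut one bot out entirely
User-agent: Baiduspider
Disallow: /

# Everyone else may crawl everything
User-agent: *
Disallow:
```

Keep in mind that robots.txt is purely advisory; a bot that wants to misbehave won’t read it at all.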
The most dangerous bots are those that cloak themselves as regular web browser user agents. When you see one, you should block either the single IP address or the complete CIDR range, depending on the country of origin.
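In Nginx, for example, blocking an address or a whole range is one line per entry. The addresses below are placeholder documentation ranges, not real offenders:

```nginx
# Inside an http, server, or location block
deny 203.0.113.42;       # a single cloaked-bot IP address
deny 198.51.100.0/24;    # the complete CIDR range it came from
```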
If your target audience is English-speaking Americans, it may make sense to block the search engines from other countries. Some Americans still use them, especially those for whom English is a second language. Consider carefully.
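If you do decide to block some of the foreign search engines, an Nginx map on the User-Agent header is one way to do it. The bot names are taken from the list above; the rest is a sketch you would adapt to your own config:

```nginx
# In the http block: flag unwanted search engine bots by user agent
map $http_user_agent $blocked_bot {
    default       0;
    ~*Baiduspider 1;
    ~*coccocbot   1;
    ~*yoozBot     1;
}

# In the server block (abbreviated): turn them away
server {
    if ($blocked_bot) {
        return 403;
    }
}
```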
My website is static. It only invokes PHP on the contact page. The rest of the files are plain text files. Some of the bots used to bother me. These days, I just don’t care.
If you get a lot of human traffic, you can afford to block a lot of bots. When you don’t, blocking a bunch of them isn’t a good idea. Concentrate on the ones that could be eating up all your bandwidth or connecting dozens of times a second (like MJ12bot).
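Rather than banning a fast-connecting bot outright, you can also throttle every client with Nginx’s limit_req module. The zone name and rate here are arbitrary examples:

```nginx
# In the http block: allow each client IP 5 requests per second
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

# In a server or location block: queue short bursts, reject the rest (503 by default)
location / {
    limit_req zone=perip burst=10;
}
```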
I’ll try to keep my lists updated as much as possible. I can’t guarantee anything because, again, I don’t care. I have better things to do with my time. No one should have to bother with this nonsense.