Over the past few years, I’ve done a pretty good job of blocking robots on this website. I shouldn’t have to block anything, robots or humans.
I’ve removed all IP address, referrer and user agent blocks. The robots.txt file disallows nothing. Yes, I know my web server is going to get inundated with all kinds of robot traffic. Most of it will be good, but a lot of it will be bad.
I’m probably opening a can of worms, but that’s okay. If I’ve done everything else correctly, it shouldn’t have any physical effects on the web server. The effects on my search engine rankings are what I have to worry about.
I’m still going to maintain my bot list and I’m still going to keep the information up-to-date on my Nginx application firewall, regardless of this experiment. I’m not going to change the wording at all. The only things I’ve done are disable the application firewall and replace the robots.txt file.
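For anyone wondering what a robots.txt file that disallows nothing looks like, here is a minimal sketch. The file contents below are an assumption (one common way to write a permissive file), not a copy of the file on this site. The `is_allowed` helper and the "ExampleBot" user agent are hypothetical names used only for illustration. The check uses `urllib.robotparser` from Python's standard library.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents of a permissive robots.txt: an empty Disallow
# value means no path is off limits for any user agent.
PERMISSIVE_ROBOTS_TXT = """\
User-agent: *
Disallow:
"""

# For contrast, a file that blocks every path for every user agent.
RESTRICTIVE_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str) -> bool:
    """Hypothetical helper: report whether robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/any/page"
    print(is_allowed("ExampleBot", url, PERMISSIVE_ROBOTS_TXT))
    print(is_allowed("ExampleBot", url, RESTRICTIVE_ROBOTS_TXT))
```

Keep in mind that robots.txt is only a request. Well-behaved crawlers honor it, while bad ones ignore it either way.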
As I mentioned when I wrote about blocking data centers, it’s not a good idea to do that kind of thing anymore. More and more people are using virtual private networks (VPNs) for various reasons, mostly privacy, and those VPNs rely on data center IP addresses. On the other hand, some VPNs are now using residential IP addresses because streaming services are blocking the VPN addresses.
I’ve been blocking a lot of user agents, many of them specialized clients. It’s not a good idea to block the user agents for search engines and certain types of crawlers. Unfortunately, it can be difficult to weed out the bad ones. Some of the bad ones are obvious and some aren’t. It sometimes depends on the type of website you run and sometimes it’s a matter of perspective.
I’ve gotten much better at ignoring vulnerability scans. Once I switched from WordPress to a custom headless content management system, I removed a lot of possible vulnerabilities. These days, I get annoyed more by contact form spam than anything else.
For a lot of vulnerability scans, I no longer log the requests and simply feed the miscreants 410 error codes. Nginx can handle dozens of requests per second and PHP 7+ isn’t far behind. Because I don’t use a database, that bottleneck doesn’t exist. I’m not worried about getting overrun by requests.
Many self-proclaimed experts believe it’s a bad idea to let robots run rampant on a website. They may be right in most circumstances. Bad actors will scrape entire websites and try to pretend their copies are the original versions. The idea is that Google and other search engines will give the scraped versions better rankings than the original versions. I’ve seen that happen in the past and it may still happen today.
Fortunately, reporting a scraped website that comes up in search results can get it removed from Google’s search index. Unfortunately, that doesn’t happen quickly enough. I regularly do a vanity search for my name and certain keywords to see what appears. Since my website isn’t extremely popular, I haven’t had any issues with it yet.
I don’t have to worry about certain exploits. You can’t brute force through a login form when a login form doesn’t exist. You can’t make changes when the interface to do so doesn’t exist. Things like that.
No matter what happens, good or bad, I’m sure I’ll have something to write about sooner or later.