The original version of Sphider had very erratic support for indexing HTTPS pages, and wouldn’t even look at the robots.txt file on an HTTPS site. That failing has never been addressed; even the latest version, 1.5.2, has the same shortcomings when it comes to HTTPS. This has never really been an issue for me before, and even now it is more an annoyance than a problem, as I can work around it.
Still, the “problem” does seem intriguing, and after a bit of experimenting, it looks like a fix may not be all that difficult. (Famous last words, right?)
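At its core, the robots.txt half of the problem is just building the robots.txt URL from the page’s own scheme and fetching it over TLS instead of assuming plain http. Sphider itself is PHP, but a minimal Python sketch conveys the idea (the function name is mine, purely for illustration):

```python
from urllib.error import URLError
from urllib.parse import urlparse
from urllib.request import urlopen

def fetch_robots_txt(page_url: str) -> str:
    """Fetch robots.txt from the same scheme and host as page_url.

    The key point: keep the original scheme (http OR https) rather
    than hard-coding http, which is roughly where the old behavior
    fell over.
    """
    parts = urlparse(page_url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    try:
        with urlopen(robots_url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except URLError:
        return ""  # unreachable robots.txt: treat as "no rules"
```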
I am now debating whether to keep investigating alternatives and make further code changes to improve HTTPS support in Sphider, not only to ensure more reliable connectivity but also to make actual use of robots.txt. I don’t know that the need is all that great; we’ve never received any complaints or comments on the issue…
Anyway, at this point it is a POSSIBILITY, but there are no definite plans one way or the other.
*******************************
UPDATE (Apr 6): I was able to get the robots.txt file read from an https site. First problem: regardless of http or https, the parsing of allowed or disallowed user agents and disallowed files/directories was iffy. If the robots.txt file had lines like “user-agent” or “disallow”, they were parsed, but “User-agent” or “Disallow” were not. The comparison was case-sensitive, even though directive names are not supposed to be. That is now fixed (on my side, not published yet). Second problem: now that I know the file IS being read and parsed, Sphider will STILL index some files in disallowed directories!
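Back to that first problem for a moment. The fix is simply to normalize the directive name before comparing it, so “Disallow”, “disallow”, and “DISALLOW” all match. Again in Python rather than Sphider’s actual PHP, a hedged sketch of that normalization (the function name and structure are mine):

```python
def parse_robots(text: str) -> dict[str, list[str]]:
    """Map each user-agent token to its Disallow paths.

    Directive names are matched case-insensitively, which is the
    behavior the Apr 6 fix restores. Allow lines are ignored here
    to keep the sketch short.
    """
    rules: dict[str, list[str]] = {}
    agents: list[str] = []   # user-agents in the current group
    seen_rule = False        # a Disallow line ends a group header
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments/whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if seen_rule:                    # a new group begins
                agents, seen_rule = [], False
            agents.append(value.lower())
            rules.setdefault(value.lower(), [])
        elif field == "disallow":
            seen_rule = True
            if value:                        # empty Disallow = allow all
                for agent in agents:
                    rules[agent].append(value)
    return rules
```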
If you have any files or directories listed under “url_not_inc” in your settings, that exclusion works, but the robots.txt disallows do not, even though they SHOULD. Well, this situation has certainly gotten my interest!
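And that is the second problem in a nutshell: parsing the file is only half the job; the crawler has to actually consult the parsed rules before it queues a URL for indexing. A sketch of the missing check, using plain prefix matching per the original robots.txt convention (the names here are mine, not Sphider’s):

```python
from urllib.parse import urlparse

def is_allowed(url: str, disallows: list[str]) -> bool:
    """Return False when the URL path falls under any Disallow prefix."""
    path = urlparse(url).path or "/"
    return not any(path.startswith(rule) for rule in disallows)

# Quick check of the behavior Sphider was effectively skipping:
assert is_allowed("https://example.com/public/page.html", ["/private/"])
assert not is_allowed("https://example.com/private/secret.html", ["/private/"])
```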
*******************************
UPDATE (Apr 7): I have begun troubleshooting the code to see what is going awry and where. Working alone, with other things to do in life, this can be both time-consuming and frustrating. So far, I do know the robots.txt is read and parsed properly; just where and why the instructions are not acted upon is another matter. At least the question of whether I will be attempting another modification has been answered!
*******************************
UPDATE (Apr 8): GOT IT! Preliminary tests show robots.txt is now being followed over both http and https. More testing to follow (I found and fixed a couple of other miscellaneous issues along the way). Once everything is validated, there will be a 1.5.3. Stay tuned.
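For anyone curious how those pieces fit together, this is roughly the flow the fix amounts to, chaining the three sketches above (the “sphider” agent token is a guess, not necessarily the token the crawler actually sends):

```python
# Assumes fetch_robots_txt, parse_robots, and is_allowed from the
# earlier sketches are defined in the same file.
robots = parse_robots(fetch_robots_txt("https://example.com/index.html"))
disallows = robots.get("sphider", robots.get("*", []))
for url in ("https://example.com/docs/a.html",
            "https://example.com/private/b.html"):
    print(url, "->", "index" if is_allowed(url, disallows) else "skip")
```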