Sphider and Sphiderlite — and 301’s!

Sphider 4.2.0 and Sphiderlite 2.2.0 have recently been released. These editions corrected a few issues which had slowly crept in. Stray white space was interfering with phrase searches. Some MySQL installations (or was it PHP?) were causing mysqli errors which resulted in dropped connections. We discovered some new code deprecations in PHP 8.1. Filters had started to corrupt certain Unicode characters.

Well, these recent releases corrected those issues. And even though these releases are stable, we have more improvements on the way! Sphider 4.2.1 and Sphiderlite 2.2.1 get ahead of some code deprecations coming in the not-yet-released PHP 8.2. We also improved identification of web page encoding. On rare occasions, a web page would throw an error during indexing because its encoding was misinterpreted. The odds of that happening have been greatly reduced. (NEVER say it can’t happen!) Also, the size of each spidering log is now displayed in the spidering log list. Look for these releases very soon!

One “issue” that remains is that SOME websites, typically WordPress sites, just refuse to be indexed! MOST WordPress sites do fine … some don’t. The very first page comes back with a “301” (Moved Permanently) status, no other pages are found, and the indexing run halts with nothing being indexed. Upon investigation, the 301 is bogus: there is no actual redirection. We thought maybe it was something in WordPress, but now doubt that is the case. We really don’t have a clue as to the cause. Our latest thought is that MAYBE it is done intentionally to ward off indexing by small potatoes like Sphider?
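
For anyone who wants to check one of these responses by hand, here is a rough sketch (in Python, not Sphider's own code) of one way to tell a real redirect from a suspicious one. It assumes that a "bogus" 301 shows up either with no Location header at all or with a Location that points back to the page that was requested. That is only our guess at what these responses look like, and the function names are made up for illustration.

```python
from urllib.parse import urljoin, urlsplit


def normalize(url):
    """Reduce a URL to the parts that matter when comparing two addresses."""
    parts = urlsplit(url)
    path = parts.path or "/"  # treat "https://host" and "https://host/" alike
    return (parts.scheme.lower(), parts.netloc.lower(), path, parts.query)


def classify_301(requested_url, location_header):
    """Classify a 301 response (hypothetical helper, not part of Sphider).

    Returns "no-target" when there is no Location header,
    "self-redirect" when Location resolves to the requested page,
    and "real-redirect" when it points somewhere else.
    """
    if not location_header:
        return "no-target"
    # Location may be relative, so resolve it against the requested URL.
    target = urljoin(requested_url, location_header.strip())
    if normalize(target) == normalize(requested_url):
        return "self-redirect"
    return "real-redirect"
```

A change from http to https, or an added trailing slash, counts as a real redirect here, since the spider would need to follow it to reach the page.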

If anyone out there knows the cause of these phony 301 errors being given to Sphider, let us know!

At any rate, those stubborn pages CAN be indexed by Sphider/Sphiderlite, using a hack. And a hack is exactly what it is … not something you would want as a normal part of Sphider. The hack can be found on the Sphider forum.

(There are other reasons for web sites that won’t index or won’t totally index, but that is for another post.)

EDIT: 7/15/2022
Found another possible cause of “fake” 301 errors! It may be that some websites do not like or recognize the User Agent string and block the crawl with a 301 error. Changing the User Agent string (in Settings) may help!
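
To test whether the User Agent string is the culprit for a particular site, something like the following sketch may help. It is Python using only the standard library, not part of Sphider, and it requests the same page with several User Agent strings without following redirects, so the raw status code can be compared. The function names are our own, and the agent strings in the usage note are placeholders, not Sphider's actual default.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow, so a 3xx surfaces as HTTPError."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def status_for_agent(url, user_agent, timeout=10):
    """Return (status_code, Location header or None) for one User Agent string."""
    opener = urllib.request.build_opener(NoRedirect)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status, response.headers.get("Location")
    except urllib.error.HTTPError as err:
        # Unfollowed redirects and other non-2xx responses arrive here.
        return err.code, err.headers.get("Location")


def compare_agents(url, agents, fetch=status_for_agent):
    """Map each User Agent string to the status code the site returns for it."""
    return {agent: fetch(url, agent)[0] for agent in agents}
```

Calling `compare_agents("https://example.com/", ["MyCrawler/1.0", "Mozilla/5.0"])` (placeholder URL and agents) returns a dictionary of status codes. If one agent gets a 301 and another gets a 200 for the same page, the User Agent string is a likely suspect.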