Sphider 2.3.0 principally addressed security concerns, but it was also intended to bring Sphider into PHP 7.2 compliance by removing every use of the deprecated each() function. The function was used extensively, and most of the replacement work was straightforward. In four places, however, the usage was atypical. Substitute code was put in place and tested. All seemed to work well: many sites were indexed and searches performed as expected.
Well! It seems indexing and searching were being done properly, but only for words composed of Western characters. Words using non-Western characters were not being indexed! Searches for those words not only came back as “not found” (expected, since they weren’t indexed), they also produced complaints about gibberish characters or words being either too short or too common.
Investigation traced the issue to three of the four code segments that had replaced the non-standard usages of the deprecated each() function. Those replacements have themselves been replaced in 2.3.1. Testing on the problem sites now shows that all words are being indexed, those containing Western characters as well as those containing non-Western characters. The search anomalies are gone, and searches in non-Western languages are yielding expected results. If a search word really IS too short or too common, it is reported as such, and not as gibberish. Sphider is now truly PHP 7.2 compliant.
Both the legacy and PDO versions of Sphider 2.3.1 are available for download on this blog’s download page, or from the Sphider Home page.
Hi!
I’m testing Sphider 2.3.1-PDO on my own server.
The server runs Ubuntu 18.04, Apache 2.4.24, mysql Ver 14.14 Distrib 5.7.25 and php-7.2.15.
I’ve found two issues:
– the db doesn’t capture the fulltext from the webpages
– if I use the option Re-index all (web or CLI), it only visits 8 of the 21 sites.
I’ll do more testing and send you more messages.
Thanks for your great work!
Appreciate the feedback!
Not to argue, but what leads you to believe that fulltext is not being indexed? The database can accept a page with up to 16,777,215 bytes. Using utf8mb4, which needs at most 4 bytes per character, that is still something over 4 million characters even in the worst case!
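For anyone who wants to check that arithmetic, here is a minimal sketch in Python. It is not part of Sphider, and the function names are made up for illustration. It divides the byte limit by the 4-byte worst case of utf8mb4, and shows how many bytes a few sample words actually take when encoded as UTF-8.

```python
# Illustrative only: these names are hypothetical and not part of Sphider.
MAX_BYTES = 16_777_215   # 2**24 - 1, the page size limit quoted above
MAX_BYTES_PER_CHAR = 4   # utf8mb4 needs at most 4 bytes per character


def worst_case_chars(max_bytes: int = MAX_BYTES,
                     bytes_per_char: int = MAX_BYTES_PER_CHAR) -> int:
    """Whole characters that fit if every character takes the maximum size."""
    return max_bytes // bytes_per_char


def encoded_size(text: str) -> int:
    """Bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    print(worst_case_chars())  # 4194303
    # Most text is smaller than the worst case: ASCII letters take 1 byte,
    # Cyrillic 2, CJK 3, and only characters such as emoji take all 4.
    for sample in ("sphider", "поиск", "検索", "🔍"):
        print(sample, len(sample), "chars,", encoded_size(sample), "bytes")
```

So a page would have to consist entirely of 4-byte characters and still exceed roughly 4.19 million of them before hitting the limit; ordinary pages in any language fit far more characters than that.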
As to the re-index, is it always the same 8? What is the log file indicating? Any mention in the log of the other 13? Will any of the 13 individually re-index without a problem?
While this may not have anything to do with your re-index issue, I have found that PDO can choke when dealing with massive amounts of data. PDO tries to read everything at once, exhausts its memory resources, and shuts down. I’ve toyed with forcing Sphider to buffer the data but to no avail. The non-PDO version can take the same data and just keep chugging. I’m sure it has its limits too, but I haven’t found them yet.