Sphider 1.6.0 and Sphider 1.6.0 PDO version have been released.
Also released is the Sphider Image Indexer, a companion add-on to Sphider allowing the user to index and search images from a website.
And finally, there is also a conversion kit which will allow the PDO version of Sphider to work with SQLite databases in place of MySQL.
Hello,
I’ve been using Sphider 1.6 for some time, and I’m quite happy with it.
Still, I’d like to report a bug, which can easily be fixed.
My content is UTF-8 encoded, but special characters (accents, umlauts, etc.) were not displaying properly in the database and the search results. To correct this, I had to delete line 346 in admin/sphider.php.
346. $fulltxt = utf8_encode($fulltxt);
The utf8_encode() function converts an ISO-8859-1 (Latin-1) string to UTF-8, but when it is used on a string that is already UTF-8, it returns garbled characters.
After deleting line 346, and re-indexing my content, all foreign characters show up as expected.
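For anyone who wants to see the effect described above, here is a minimal sketch. The helper latin1_to_utf8() is a hypothetical pure-PHP stand-in for utf8_encode() (which is deprecated as of PHP 8.2); it is not Sphider code, just an illustration of what happens when UTF-8 input is encoded a second time.

```php
<?php
// Hypothetical stand-in for utf8_encode(): every input byte is treated as
// one ISO-8859-1 character and rewritten as UTF-8.
function latin1_to_utf8(string $bytes): string
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            $out .= chr($b); // ASCII bytes are copied unchanged
        } else {
            // Bytes 0x80-0xFF become a two-byte UTF-8 sequence
            $out .= chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
        }
    }
    return $out;
}

$latin1 = "caf\xE9";     // "café" in ISO-8859-1
$utf8   = "caf\xC3\xA9"; // "café" in UTF-8

$once  = latin1_to_utf8($latin1); // correct conversion
$twice = latin1_to_utf8($utf8);   // UTF-8 input encoded a second time

echo bin2hex($once), "\n";  // 636166c3a9
echo bin2hex($twice), "\n"; // 636166c383c2a9, which displays as "cafÃ©"
```

The second result is the familiar "Ã©" garbling: each byte of the two-byte "é" was treated as a separate Latin-1 character.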
Simply deleting line 346 may fix your problem, but it creates problems for a lot of other people. You DO, however, have a legitimate complaint! Applying UTF-8 encoding to a string that is already UTF-8 is going to cause some problems, as you have already seen.
The solution is obvious. Strings need to be checked for their current status and encoded only if necessary. While it is impossible to determine with certainty what character set a string is encoded in, it is possible to determine what it ISN’T encoded in. The PHP function mb_check_encoding could be a solution, but not every PHP installation has the mbstring extension installed. That is a scenario which can be worked around.
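One possible workaround, as a rough sketch only: use mb_check_encoding when mbstring is loaded, and otherwise fall back to PCRE, since preg_match() with the /u modifier fails on malformed UTF-8. The function name is_valid_utf8() is made up for this example. Note that it can only tell you a string is NOT UTF-8; a "true" result does not prove the author meant UTF-8.

```php
<?php
// Returns true when $bytes is well-formed UTF-8.
// Uses mbstring when it is loaded; otherwise falls back to PCRE, whose /u
// modifier makes preg_match() return false on malformed UTF-8 subjects.
function is_valid_utf8(string $bytes): bool
{
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($bytes, 'UTF-8');
    }
    return preg_match('//u', $bytes) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // UTF-8 "café"      -> bool(true)
var_dump(is_valid_utf8("caf\xE9"));     // ISO-8859-1 "café" -> bool(false)
var_dump(is_valid_utf8("cafe"));        // pure ASCII        -> bool(true)
```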
The next release of Sphider will contain a check and will only encode a string to UTF-8 IF it isn’t already so encoded. Not sure just when the next release will be, but it will be a major one, probably 2.0. I am testing a preliminary version now. With so many changes (images, RSS feeds added) and both “vanilla” and PDO versions, there is a LOT of testing required. A few months?
For now, thank you for bringing the problem to my attention. It is a valid concern that may be affecting others and WILL be addressed!
My php skills are limited, but maybe you could check out this: https://github.com/neitanod/forceutf8.
Thanks for the input. I’ve been looking at a lot of sources, GitHub included. I’ve already tested a more limited fix which will prevent encoding of a string that is already UTF-8. Actually, the problem was affecting the searches on worldspaceflight.com as well and I hadn’t noticed it! Now those kinds of searches are returning properly.
Now to find and test an even more inclusive solution. In the interim, IF all your inputs are definitely UTF-8, your solution works. IF, on the other hand, SOME of your inputs are ISO-8859-1 (latin1), I can instruct you how to use my current solution to encode those while leaving the UTF-8 strings alone.
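For readers who want to experiment in the meantime, a limited fix of this kind might look roughly like the sketch below. This is NOT the code being tested for Sphider; ensure_utf8() is a hypothetical name, and it assumes that anything which is not valid UTF-8 is ISO-8859-1.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, otherwise assume ISO-8859-1.
function ensure_utf8(string $bytes): string
{
    // preg_match() with the /u modifier returns 1 only for well-formed UTF-8,
    // which also covers pure ASCII.
    if (preg_match('//u', $bytes) === 1) {
        return $bytes;
    }
    // Assumption: the input is ISO-8859-1, one byte per character.
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $b = ord($bytes[$i]);
        $out .= $b < 0x80
            ? chr($b)
            : chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
    }
    return $out;
}

echo bin2hex(ensure_utf8("caf\xE9")), "\n";     // 636166c3a9 (converted)
echo bin2hex(ensure_utf8("caf\xC3\xA9")), "\n"; // 636166c3a9 (left alone)
```

With something like this in place, the line quoted earlier could call the helper instead of utf8_encode(), so mixed UTF-8 and ISO-8859-1 inputs both end up as UTF-8.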
UPDATE: Delving into the world of character encoding, I haven’t really learned anything new, but I certainly have managed to draw many of the individual pieces of knowledge into something that is at least SOMEWHAT coherent.
The MOST COMMON encodings are ASCII, ISO-8859-1 (aka latin1), and UTF-8. There are MANY more, over a dozen more just in the ISO-8859 family.
ASCII and ISO-8859-1 are single byte encodings. ASCII actually only needs 7 bits, but with a leading 0 it makes a full byte.
Every ASCII string is a valid ISO-8859-1 string and also a valid UTF-8 string. The reverse does not hold: an ISO-8859-1 string containing bytes above 0x7F (accents, umlauts, etc.) is generally NOT valid UTF-8. UTF-8 uses 1 to 4 bytes to encode a character. If you have a UTF-8 string that is pure ASCII, you can utf8_encode that string again without causing any damage whatsoever because nothing gets changed! Do that with a UTF-8 string that contains multi-byte characters and you get garbled results.
A problem can arise in that two or more consecutive ISO-8859-1 characters can represent a single (different) UTF-8 character and a multi-byte UTF-8 character can be interpreted as two or more ISO-8859-1 characters!
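A tiny illustration of that ambiguity, using the two bytes 0xC3 0xA9 as an example of my own choosing: counted as ISO-8859-1 they are two characters, counted as UTF-8 they are one.

```php
<?php
$bytes = "\xC3\xA9";

// Read as ISO-8859-1: one byte per character, so strlen() is the character count.
$asLatin1 = strlen($bytes);                // 2 ("Ã" and "©")

// Read as UTF-8: with the /u modifier "." matches a whole code point.
$asUtf8 = preg_match_all('/./su', $bytes); // 1 ("é")

echo "ISO-8859-1 characters: $asLatin1\n";
echo "UTF-8 characters: $asUtf8\n";
```

Both readings are valid, which is why the bytes alone cannot settle the question.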
Unless you KNOW what the source encoding scheme is, you can only give it your best guess. Add in that not all PHP installations include mbstring, and the already iffy task of guessing the encoding becomes nearly impossible. The method I came up with will work for the most common schemes (ASCII, ISO-8859-1, and UTF-8) to ensure the database is populated properly with a high degree (but NOT 100%) of accuracy. It’s just a matter of probabilities for character sequences. (The two bytes 0xC3 0xA9 are “é” in UTF-8, but they are also the perfectly valid ISO-8859-1 string “Ã©”! So which is it? ISO-8859-1 or UTF-8? Without knowing the context, it’s a guess. The longer the strings, the better the guess. Pure ASCII strings are the easy case, since they read the same in all three. If a byte above 0x7F pops up that is not part of a valid UTF-8 sequence, then NOT UTF-8 is a pretty sure thing.)
Now, what would make Sphider more encoding-portable and reliable would be an option, possibly in database.php, to designate what the input encoding WILL be, and if so designated, to convert the inputs from that encoding to UTF-8. If nothing is designated, then we go with the top three and make our best guess.
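As a rough sketch of how such an option could behave (nothing here exists in Sphider today; the setting name $source_charset and the function to_utf8() are assumptions for illustration): a designated encoding is converted with mbstring or iconv when either is available, and with nothing designated the code falls back to a guess among ASCII, UTF-8, and ISO-8859-1.

```php
<?php
// Hypothetical setting, e.g. in database.php. null means "not designated".
$source_charset = 'ISO-8859-15';

function to_utf8(string $bytes, ?string $declared = null): string
{
    if ($declared !== null) {
        if (strcasecmp($declared, 'UTF-8') === 0) {
            return $bytes; // already in the target encoding
        }
        // Both converters are optional extensions, so check before calling.
        if (function_exists('mb_convert_encoding')) {
            return mb_convert_encoding($bytes, 'UTF-8', $declared);
        }
        if (function_exists('iconv')) {
            $converted = iconv($declared, 'UTF-8', $bytes);
            if ($converted !== false) {
                return $converted;
            }
        }
    }
    // Nothing designated (or no converter available): best guess.
    if (preg_match('//u', $bytes) === 1) {
        return $bytes; // ASCII or well-formed UTF-8
    }
    // Assumption: what is left is ISO-8859-1, one byte per character.
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $b = ord($bytes[$i]);
        $out .= $b < 0x80
            ? chr($b)
            : chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
    }
    return $out;
}

// Designated: byte 0xA4 is the euro sign in ISO-8859-15
// (needs mbstring or iconv to convert correctly).
echo bin2hex(to_utf8("\xA4", $source_charset)), "\n";

// Not designated: guessed as ISO-8859-1 because it is not valid UTF-8.
echo bin2hex(to_utf8("caf\xE9")), "\n"; // 636166c3a9
```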
Sometimes life is a crap shoot.
UPDATE #2: Well, DUH! Since we are indexing web pages, I just read the HTTP headers! Bang! I have the charset the vast majority of the time. What is stated in the headers takes precedence over anything which may or may not occur in the meta tags. From what I can determine, in the few cases where the headers don’t contain the character set, ISO-8859-1 (aka latin1) is assumed. Now to test it.
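A minimal sketch of the header approach, assuming the response headers are available as an array of raw lines (the shape PHP’s get_headers() returns). The function name charset_from_headers() is hypothetical, and the ISO-8859-1 default simply mirrors the assumption stated above.

```php
<?php
// Extracts the charset from raw HTTP response header lines.
// Falls back to $default when no Content-Type header carries a charset.
function charset_from_headers(array $headers, string $default = 'ISO-8859-1'): string
{
    $charset = $default;
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:.*?charset\s*=\s*"?([\w.:-]+)"?/i', $line, $m)) {
            // With redirects there may be several responses; the last one wins.
            $charset = strtoupper($m[1]);
        }
    }
    return $charset;
}

$headers = [
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=utf-8',
];

echo charset_from_headers($headers), "\n";                    // UTF-8
echo charset_from_headers(['Content-Type: text/html']), "\n"; // ISO-8859-1
```

The returned name could then be used as the source encoding when converting the page text to UTF-8 before indexing.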