Sphider 2.0.0 has been released and may be obtained from the Downloads tab.
9 Replies to “Sphider 2.0 Released”
Comments are closed.
I’ve been searching for soooooo long for a good spider script. Thank you for your good work.
Thank you for updating the sphider script. That saved me a ton of work.
Due to various reasons, I ended up using the sphider-2.0.0-PDO download. I found a few errors that were not in the sphider-2.0.0 files. Specifically:
In spider.php line 1602 is:
$stmt->execute()
or die("Execution failed: ".$stmt2->errorInfo()[2]);
I believe it should be:
$stmt2->execute()
or die("Execution failed: ".$stmt2->errorInfo()[2]);
In spiderfuncs.php line 736 starts with:
preg_match(
"/<meta +name *=[\"']?keywords[\"']? *content=[\"']?([^'\"]+)[\"']?/i",
$headdata, $res
);
if (isset($res)) {
$base = $res[1];
}
I believe it should be (as copied from sphider2.0.0):
preg_match(
"/<meta +name *=[\"']?keywords[\"']? *content=[\"']?([^'\"]+)[\"']?/i",
$headdata, $res
);
if (isset($res)) {
$keywords = $res[1];
}
// e.g.
preg_match("/<base +href *= *[\"']?([^'\"]+)[\"']?/i", $headdata, $res);
if (isset($res)) {
$base = $res[1];
}
You may want to add a check for a canonical link after the above base tag:
preg_match(
"/<link +rel *=[\"']?canonical[\"']? *href=[\"']?([^'\"]+)[\"']?/i",
$headdata, $res
);
// Use the canonical URL only when no base was found and the pattern matched.
if (empty($base) && isset($res[1])) {
$base = $res[1];
}
Again, thanks for the good work.
Regards,
Ed Parrish
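The patterns quoted above can be tried outside Sphider. What follows is only a minimal sketch: first_match() and the sample $headdata are hypothetical and not part of Sphider. It tests the return value of preg_match() rather than isset($res), because $res is always set after the call, even when nothing matched.

```php
<?php
// Hypothetical helper, not Sphider's actual code: returns the first
// capture group of $pattern in $headdata, or null when nothing matches.
function first_match(string $pattern, string $headdata): ?string
{
    // preg_match() returns 1 only on a match; $res is set either way,
    // so isset($res) alone cannot tell a hit from a miss.
    return preg_match($pattern, $headdata, $res) === 1 ? $res[1] : null;
}

// Sample head section (an assumption for illustration only).
$headdata = '<meta name="keywords" content="search, spider">'
          . '<link rel="canonical" href="https://example.com/page">';

$keywords  = first_match("/<meta +name *=[\"']?keywords[\"']? *content=[\"']?([^'\"]+)[\"']?/i", $headdata);
$base      = first_match("/<base +href *= *[\"']?([^'\"]+)[\"']?/i", $headdata);
$canonical = first_match("/<link +rel *=[\"']?canonical[\"']? *href=[\"']?([^'\"]+)[\"']?/i", $headdata);

// Fall back to the canonical URL only when no <base> tag was found.
$effectiveBase = $base ?? $canonical;
```

These patterns expect the attributes in the order shown (name before content, rel before href), so a tag written the other way round will not match.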
Regarding spider.php, line 1602, you are correct. It should read:
$stmt2->execute()
This also affects the PostgreSQL and SQLite versions.
In spiderfuncs.php, the code on lines 736-742 is in fact correct. That doesn’t mean there isn’t a problem! Shame on me. Between lines 739 and 740 should be:
'\"]+)[\"']?/i", $headdata, $res);
if (isset($res)) {
$keywords = $res[1];
}
// e.g.
preg_match("/
Yes, five lines of code were omitted! This is ONLY in the PDO version.
Corrections to both files will be published later today.
Regarding the addition of a canonical link check… Checking canonical links is a great idea for crawlers such as Google, Bing, Yahoo, etc. In fact, these large search engines DO precisely that kind of checking. A major purpose of including “rel=canonical” in web pages is to improve SEO. Many people believe Sphider is like a mini-Google for personal use. It is not. Sphider’s intended purpose is to simply index the contents of a particular site. If there is duplicate content, then it will (and, IMHO, should) report ALL such instances. Further, I doubt that the simple addition of 8 lines of code is going to do much in achieving the purpose of becoming a canonical search engine. Please don’t take this the wrong way. The suggestion does show excellent thinking and ingenuity. If you can really make it work in a useful fashion, feel free to fork the code. After all, that’s what I did with the original Sphider, and I can assure you I am very far removed from being a guru, expert, or anything else besides an old guy who was able to hack and improve (hopefully) some script which had been out of development for years.
Thanks Captain Quirk for looking into the problems. Again, I appreciate the work you put into updating the Sphider script. It has helped me immensely in my quest to set up a secure site search since Google has discontinued their site search.
And I thank YOU for bringing my typos to my attention!
You mentioned Google discontinuing their site search. That is exactly why I started using the original Sphider, too. Then when my version of PHP was upgraded, it started to give me errors. Not being able to find a suitable replacement, I decided to try to fix it myself. Sphider 2.0.0 is the latest culmination of that effort. I’m glad my efforts can benefit others.
Hi,
I’ve just installed this new version and noticed that it was able to find keywords that the previous version didn’t.
But I have an issue with searching on words with accents (website is French). This is how they’re added in the DB:
(1968, 'états-uni'),
(612, 'étiquett'),
(1969, 'évident'),
(606, 'grâce'),
(755, 'âge'),
I’ve created the database with:
CREATE DATABASE `sphider_db` CHARACTER SET utf8 COLLATE utf8_general_ci;
and the HTML header is UTF-8.
Any idea what the reason could be?
Thanks
Several ideas, not all of which may apply to your case.
1. The last FULL indexing may have been BEFORE you upgraded to 2.0. A simple reindex will NOT correct existing database anomalies. IF this is the case, the best solution is to purge (empty) the database and start over.
2. The web page may have source code that uses an HTML character reference (for example “&Agrave;”) instead of the literal character “À”. (If the former is used but the charset defined in the header is ISO, this isn’t a problem as Sphider will make the conversion.)
3. The web page header may be incorrect. “Content-Type: text/html; charset=utf-8” would be proper. Do not confuse the header with a meta tag like “<meta http-equiv="Content-Type" content="text/html; charset=utf-8">”. If the header is missing from the source code, your web server is most likely adding it, and it MIGHT not be utf8! (Headers appear BEFORE even the DOCTYPE declaration and do NOT show when you display the source in a browser window. The W3C Markup Validator Service can tell you the charset used.)
4. The declared charset (from the header) and the actual charset used may be mismatched.
From personal experience, the problem you are experiencing is a real annoyance! In my case, I had to double check all meta tags with “Content-Type”, include headers on all the pages, and check all pages to be sure I was actually using utf8 characters and not ASCII character codes. Much of my page content is built from another database (not Sphider), so I also had to ensure that database was utf8 and that its content contained ONLY utf8 and not ASCII codes for those characters. That, in turn, led to problems in the processes which populated that database in the first place.
You may find https://www.w3.org/International/articles/definitions-characters/#charsets helpful.
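Points 3 and 4 above can be checked with a few lines of PHP. This is a rough sketch, assuming the mbstring extension is installed; declared_charset() and body_matches_charset() are hypothetical helper names, not Sphider functions, and the sample strings are made up for illustration.

```php
<?php
// Hypothetical helper: pull the charset out of a Content-Type header
// value such as "text/html; charset=utf-8". Returns null if none is given.
function declared_charset(string $contentType): ?string
{
    if (preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentType, $m) === 1) {
        return strtoupper($m[1]);
    }
    return null;
}

// Hypothetical helper: true when $body is a valid byte sequence in $charset.
// $charset must be a name mbstring recognises, otherwise PHP 8 throws a ValueError.
function body_matches_charset(string $body, string $charset): bool
{
    return mb_check_encoding($body, $charset);
}

$header   = 'text/html; charset=utf-8';
$utf8Body = "\xC3\xA9tats-uni";   // "états-uni" stored as UTF-8
$isoBody  = "\xE9tats-uni";       // the same word stored as ISO-8859-1

$charset  = declared_charset($header);
$utf8Ok   = body_matches_charset($utf8Body, $charset);
$isoOk    = body_matches_charset($isoBody, $charset);
```

The check is only meaningful when the declared charset is UTF-8 or another multibyte encoding. Every byte sequence is valid ISO-8859-1, so a page declared as ISO will always pass.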
Hi,
I tried all your recommendations and it still wasn’t retrieving the results, so I ended up adding this:
$stmt = $db->prepare(
"SET NAMES utf8"
);
$stmt->execute();
and
$new2old = array(
'á' => 'Ã¡',
'À' => 'Ã€',
'ä' => 'Ã¤',
'Ä' => 'Ã„',
'ã' => 'Ã£',
'å' => 'Ã¥',
'Å' => 'Ã…',
'æ' => 'Ã¦',
'Æ' => 'Ã†',
'ç' => 'Ã§',
'Ç' => 'Ã‡',
'é' => 'Ã©',
'É' => 'Ã‰',
'è' => 'Ã¨',
'È' => 'Ãˆ',
'ê' => 'Ãª',
'Ê' => 'ÃŠ',
'ë' => 'Ã«',
'Ë' => 'Ã‹',
'í' => 'Ã­', // Ã followed by a soft hyphen (U+00AD)
'Í' => 'Ã', // second byte (0x8D) has no Windows-1252 character
'ì' => 'Ã¬',
'Ì' => 'ÃŒ',
'î' => 'Ã®',
'Î' => 'ÃŽ',
'ï' => 'Ã¯',
'Ï' => 'Ã', // second byte (0x8F) has no Windows-1252 character
'ñ' => 'Ã±',
'Ñ' => 'Ã‘',
'ó' => 'Ã³',
'Ó' => 'Ã“',
'ò' => 'Ã²',
'Ò' => 'Ã’',
'ô' => 'Ã´',
'Ô' => 'Ã”',
'ö' => 'Ã¶',
'Ö' => 'Ã–',
'õ' => 'Ãµ',
'Õ' => 'Ã•',
'ø' => 'Ã¸',
'Ø' => 'Ã˜',
'œ' => 'Å“',
'Œ' => 'Å’',
'ß' => 'ÃŸ',
'ú' => 'Ãº',
'Ú' => 'Ãš',
'ù' => 'Ã¹',
'Ù' => 'Ã™',
'û' => 'Ã»',
'Û' => 'Ã›',
'ü' => 'Ã¼',
'Ü' => 'Ãœ',
'€' => 'â‚¬',
'’' => 'â€™',
'‚' => 'â€š',
'ƒ' => 'Æ’',
'„' => 'â€ž',
'…' => 'â€¦',
'‡' => 'â€¡',
'ˆ' => 'Ë†',
'‰' => 'â€°',
'Š' => 'Å ', // Å followed by a no-break space (U+00A0)
'‹' => 'â€¹',
'Ž' => 'Å½',
'‘' => 'â€˜',
'“' => 'â€œ',
'•' => 'â€¢',
'–' => 'â€“',
'—' => 'â€”',
'˜' => 'Ëœ',
'™' => 'â„¢',
'š' => 'Å¡',
'›' => 'â€º',
'ž' => 'Å¾',
'Ÿ' => 'Å¸',
'¡' => 'Â¡',
'¢' => 'Â¢',
'£' => 'Â£',
'¤' => 'Â¤',
'¥' => 'Â¥',
'¦' => 'Â¦',
'§' => 'Â§',
'¨' => 'Â¨',
'©' => 'Â©',
'ª' => 'Âª',
'«' => 'Â«',
'¬' => 'Â¬',
'®' => 'Â®',
'¯' => 'Â¯',
'°' => 'Â°',
'±' => 'Â±',
'²' => 'Â²',
'³' => 'Â³',
'´' => 'Â´',
'µ' => 'Âµ',
'¶' => 'Â¶',
'·' => 'Â·',
'¸' => 'Â¸',
'¹' => 'Â¹',
'º' => 'Âº',
'»' => 'Â»',
'¼' => 'Â¼',
'½' => 'Â½',
'¾' => 'Â¾',
'¿' => 'Â¿',
'à' => 'Ã ', // Ã followed by a no-break space (U+00A0)
'†' => 'â€ ', // â€ followed by a no-break space (U+00A0)
'”' => 'â€', // third byte (0x9D) has no Windows-1252 character
'Á' => 'Ã', // second byte (0x81) has no Windows-1252 character
'â' => 'Ã¢',
'Â' => 'Ã‚',
'Ã' => 'Ãƒ',
);
foreach( $new2old as $key => $value ) {
$new[] = $key;
$old[] = $value;
}
$strTemp = str_replace( $old, $new, $word );
$word = utf8_encode($strTemp);
in different places.
It’s more of a hack than a solid solution but it seems to work now.
Thank you for your help, and your update is really cool.
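A lookup table like the one above reverses text that was UTF-8, was read as Windows-1252, and was then encoded as UTF-8 a second time. Assuming that is what happened here (the thread does not confirm it) and that the mbstring extension is installed, a single conversion may do the same job. This is only a sketch; fix_double_utf8() is a hypothetical name, not a Sphider function.

```php
<?php
// Hypothetical helper: undo one round of "UTF-8 read as Windows-1252
// and re-encoded as UTF-8". Strings that do not look double-encoded
// are returned unchanged.
function fix_double_utf8(string $s): string
{
    // Only touch valid UTF-8 that contains one of the typical lead
    // characters of double-encoded text: Ã, Â, â, Å, Æ, Ë.
    $looksDoubled = preg_match('/[\x{C3}\x{C2}\x{E2}\x{C5}\x{C6}\x{CB}]/u', $s) === 1;
    if (!mb_check_encoding($s, 'UTF-8') || !$looksDoubled) {
        return $s;
    }

    // Map each character back to its single Windows-1252 byte.
    $candidate = mb_convert_encoding($s, 'Windows-1252', 'UTF-8');

    // Keep the result only if those bytes form valid UTF-8 again.
    return mb_check_encoding($candidate, 'UTF-8') ? $candidate : $s;
}

$repaired  = fix_double_utf8("\xC3\x83\xC2\xA9tats-uni"); // double-encoded "états-uni"
$untouched = fix_double_utf8("gr\xC3\xA2ce");             // correctly encoded "grâce"
```

The guard is a heuristic, so text that legitimately contains a sequence such as "Ã©" would be altered, and characters whose bytes have no Windows-1252 equivalent will not be restored. For MySQL, the PDO DSN also accepts a charset parameter (for example charset=utf8mb4), which sets the connection encoding without a separate SET NAMES query.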
Thanks for the wonderful manual