Sphider 1.5.0 was a major departure from older versions in that it incorporated prepared statements, adding significantly to Sphider’s security. It performed very nicely.
But we did not like the database backup and restore procedures. Backup was quick enough, but restore was S-L-O-W! The larger the database, the worse it got. There had to be a better way. There was, and we found it. We grew our database to include:
- 10 sites
- 10 categories (5 top level, 5 sub-categories)
- 10,641 links (pages)
- 70,317 keywords
- 40,006 kb of cached text
- 171,495 kb total size
A backup, producing a gzip file of 14,079 kb, was accomplished in 16 seconds.
A total restoration took 32 seconds. This was a definite improvement over the 6 1/2 HOURS it had previously taken for a smaller database.
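Why was the old restore so slow? In general, the killer with large restores is per-row overhead, that is, preparing and committing every INSERT separately. Purely as an illustration of the principle (this is a simplified sketch, NOT the actual Sphider backup code; the table, columns, and credentials are made up):

<?php
// Simplified sketch: restore rows with ONE prepared statement inside
// ONE transaction, instead of auto-committing each INSERT separately.
$pdo = new PDO('mysql:host=localhost;dbname=sphider', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$rows = [['url' => 'http://www.example.com/', 'title' => 'Example']]; // stand-in for the decoded backup data
$pdo->beginTransaction();
$stmt = $pdo->prepare("INSERT INTO links (url, title) VALUES (?, ?)");
foreach ($rows as $row) {
    $stmt->execute([$row['url'], $row['title']]);
}
$pdo->commit();   // a single commit for the whole batch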
Also, as we were no longer looking for coding errors, we began concentrating on the results (the outcomes of admin actions), looking for anything that was not exactly what we expected to see. We found several bugs, which were repaired and tested. Nothing earth-shattering, but bugs nonetheless. Sphider 1.5.1 is the result.
Since Sphider 1.5.1 seems to achieve what we originally set out to do, namely dispensing with deprecated code, improving security, fixing a few bugs in the original releases, etc., this will probably be the last release for a while. In the event of some operational problem of immediate concern, a simple patch should be sufficient instead of a whole new release.
Now despite the hours of testing and line-by-line code reviews and results analysis, Murphy’s Law still reigns. We’ll leave it at that.
Hello, I tried to use sphider-1.5.1.1, but the index didn’t work, although PDO is enabled.
It works with sphider-1.3.6 and sphider-1.4.2.
Sorry for the difficulties.
Do the admin functions work, for example, can you view the database tab, add/edit sites, view statistics, add/edit categories, etc.?
Are you receiving an error message when attempting to index?
If the index function runs but produces zero results, please post the url to the site and I will try it with my 1.5.1.1 installation.
Thanks
Thanks for your reply.
Yes, everything works fine; it’s just that index and reindex don’t work.
It doesn’t work at all: I just get a blank page.
No loading, no zeros, nothing.
One coding error sneaked in undetected.
In the file spiderfuncs.php, located in the admin directory, line 528 should read:
$stmt->execute() or die("Execution failed: ".$stmt->errorInfo()[2]);
There was an extra “)” near the end.
The online source has been corrected.
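For absolute clarity, the broken line (as inferred from the extra bracket) versus the corrected one:

$stmt->execute() or die("Execution failed: ".$stmt->errorInfo()[2]));  // broken: stray ")"
$stmt->execute() or die("Execution failed: ".$stmt->errorInfo()[2]);   // corrected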
Now, my tests indicated this error would ONLY affect a reindex, AND it should have given you an error if it popped up… but at my age I’ve seen stranger things.
Make that change and retry. If you STILL get no indexing response, give an example of a url you are trying to index. It isn’t likely you would just happen to hit multiple sites that won’t index, but as I said, I’ve seen stranger things happen.
It doesn’t work with any site; sphider-1.3.6 and sphider-1.4.2 work.
What you mean to say is “it doesn’t work YET” – LOL
Patience. We will MAKE it work!
LOL, you got it all wrong, buddy. I am not trying to rush you.
You thought it didn’t work with a specific site; I just tried to tell you it didn’t work with any site. Anyway, it works now: index, reindex, and all.
Thanks, and may the Gods always stand between you and harm in all the dark places you must walk.
peace out
LOL
Well, YOU did not rush me, I DID! When I present a piece of software that is supposed to work and it doesn’t, I take it personally. Me vs. system, and I can’t rest until I come out on top. LOL
Yes, you DID tell me it didn’t work with ANY site. Decades of experience tell me it didn’t work with any site YOU TRIED! (In retrospect, you could have tried every url to the end of the internet and it wouldn’t have indexed.) I have personally come across a couple sites that just won’t index. What are the chances you encountered several such sites in a row? EXTREMELY LOW, nearing impossibility. I was just looking for ONE url that DEFINITELY didn’t work. Fix that one and there is a good chance all get fixed. Just a troubleshooting method I use. I have learned that the only sure thing is that there isn’t any sure thing. LOL
Regardless, I am pleased the problem is solved. In THEORY, that error should have only affected reindexing and not the initial index. One possible explanation is the server you are on is set to handle errors differently than mine, thus it produced — NOTHING!
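(For anyone else who gets a blank page: if a host has display_errors turned off, a fatal error produces exactly that, nothing. Temporarily adding

ini_set('display_errors', 1);
error_reporting(E_ALL);

near the top of the script will usually reveal the actual message.)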
If you have any other problems, let me know. For now, HAPPY INDEXING!
Your enthusiasm is quite remarkable, and you are doing a great job here.
I actually used sphider-1.4.2 with the same db to index; I didn’t think you would fix it so fast. I don’t have any other problems. Thanks.
Glad you like it. Version 1.4.2 was good for a basic start at updating very old code. Version 1.5.1 and 1.5.1.1 (for PDO) have better security and have also fixed a lot of functionality problems from 1.4.2. Many people are fine with 1.4.2 because they are not using the deeper functionalities of Sphider.
Even though I am long since retired (and have lost some of my touch) I still feel that if a job is worth doing, it is worth doing right.
Much appreciate the feedback.
Hi, I was trying your 1.5.2 Sphider version and it has awesome performance compared to the original Sphider. But I hit an issue: when I try to index, I get stuck every time at about page 309. I get no error message or anything like that.
Also, the problem should not be a server timeout; I am setting this:
ini_set('max_execution_time', 600);
ini_set('display_errors', 1);
ini_set('mysql.connect_timeout', 600);
ini_set('default_socket_timeout', 600);
And when I try to use the old Sphider to index exactly the same content, it works.
What is the url of the problem page? I can have a look to see if there is an issue.
Sorry, but I can’t show you the url. I’m using Sphider to index files on a network drive (via an index file, which reads the files and displays them as html).
Unfortunately, that makes it very difficult, if not impossible, to determine what the problem may be. I would use the url to replicate the issue and follow the code to see what was going on.
You said there are no error codes. This can mean either the process is terminating abnormally but failing to report that, or the process is hung in a loop. To know which, watch the browser for an activity indicator (often a spinning icon in one of the corners). If activity stops, the process is terminating abnormally. If activity continues but nothing actually progresses, the process may be in a loop.
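If you want to catch a fatal error that dies silently, a temporary snippet like this near the top of spider.php should write the last error to the php error log (a debugging sketch of mine, not part of Sphider):

// Debugging aid: log fatal errors that would otherwise terminate silently.
register_shutdown_function(function () {
    $err = error_get_last();
    if ($err !== null && $err['type'] === E_ERROR) {
        error_log("Spider terminated: {$err['message']} in {$err['file']} on line {$err['line']}");
    }
});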
Also, can you share the log file?
I know it’s not easy to help me with this, but I know the process is terminating abnormally, because the browser indicator stops. Then when I go back into the administration, I can continue indexing…
By log, do you mean my php server error log or the Sphider log?
Thanks! It is good to know the process is terminating and not looping or hanging.
I’d like to see the spider log, at least the tail end of it. It might also be useful to know what the http code is from the server log for that particular page which fails. I strongly suspect you know how to read the server log, but if you don’t I can help you with that.
Hello again. It seems that the spider never gets through a full website; it always stops.
Apologies in advance for the long reply, but bear with me.
This has happened to me. Looking at log files, both sphider and access logs, the problem SEEMED to be pages timing out. The page code and server settings were such that this should have been precluded. And yet, sometimes it would stall at some random spot. It didn’t always happen, but it was VERY frustrating when it did.
I then found that an indexing operation would run and successfully complete when I ran Sphider from the command prompt! Another tactic is a bit more cumbersome, but also works well: I set up an empty database on my development machine and run Sphider on that machine, but crawl my real website (not the development one) residing on the remote server. Doing so remotely does slow the process down a bit, but it completes. I then use a tool like MySQL Workbench to back up the development machine database and restore it to the actual server.
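Workbench is just what I happen to use; the same transfer can be done from a shell with mysqldump (the database names and host here are made up):

mysqldump -u devuser -p sphider_dev > sphider_dump.sql
mysql -u produser -p -h www.example.com sphider_prod < sphider_dump.sql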
The only thing I can think of, and maybe I am wrong, is that the Sphider mod has become TOO streamlined. Older, slower versions did not seem to have the issue. My mod has NOT changed any of the indexing logic, but it does use faster and more efficient query methods. Having gone through all the code MANY TIMES to ensure that the substance of the queries has NOT changed, that only leaves the procedures used: the migration from the mysql extension to the mysqli extension, and then to prepared statements using mysqli and mysqlnd. Each successive change made things a bit faster. Some hosts, I have found, do not support mysqlnd, so I developed a version (1.5.1.1) which uses PDO (PHP Data Objects). PDO is more widely supported, especially on shared hosting, but since it is not MySQL specific it is a bit slower. I do not have it on my real website, so I don’t know how it would respond to the unexpected termination issue. Maybe that is a possible solution as well.
Running Sphider from the command prompt is not difficult and is described in the User Guide. If I were you, this would be my preferred alternative.
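For reference, the invocation looks like this (assuming the default layout, where spider.php lives in the admin directory):

cd /path/to/sphider/admin
php spider.php -all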
Question: do I have to keep the command line open until the index completes?
Yes. Closing the command line before completion will terminate the process. When the cursor returns, you will know it is complete. Depending on the size of the site, it could take a while.
It doesn’t work; it is the same.
Would it help to set the timeout to 0?
Set timeout to zero and retry. If that doesn’t work, I would like to know if it is stalling on the same page each time, or does it vary?
It would be useful to see the Sphider log and also the portion of the raw access (html) log for the site for the time period the Sphider was running.
If that is not possible or practical, give me the url of the site being indexed and I will try to index it remotely from my test machine.
(Any information concerning log files or urls will be for my use only to track the problem and will NOT appear in the comments for others to see.)
I got your information. Might take me a bit but I’ll get back to you. If memory serves me correctly, you are using the PDO version.
Apologies for taking so long to reply. I will address http://www.movie4k.to directly, but this applies to your other problem sites as well.
Using 1.3.6, I was able to complete the indexing. There were constant deprecation errors (expected) and MANY non-fatal sql errors.
Using 1.4.2, the indexing completed, but as with 1.3.6, there were a great many non-fatal sql errors.
Using 1.5.1.1 (PDO version), I was not able to complete indexing. The sql errors were fatal (due to my stricter code). By entering each page that produced an sql error into the “must not include” list, I was able to advance further on each iteration, but I gave up after seeing too many pages producing the errors.
The sql errors are a result of the pages being indexed claiming to be utf-8, but containing non-compliant utf-8 encoding.
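One possible guard, sketched here (this is NOT in the current code, just an idea; $page_text stands in for whatever variable holds the fetched page), would be to scrub the text before it reaches the prepared INSERT:

// Hypothetical sketch: drop byte sequences that are not valid UTF-8
// so the INSERT does not choke on a page that lies about its encoding.
if (!mb_check_encoding($page_text, 'UTF-8')) {
    // iconv with //IGNORE discards the invalid sequences (behavior can
    // vary by platform, so test on your own server first)
    $page_text = iconv('UTF-8', 'UTF-8//IGNORE', $page_text);
}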
In spite of all that, looking at your log files shows that the spider is terminating early when it encounters an unusually large page. I have to suspect that your web provider has limited you in space and/or memory, and thus the script is unable to process the larger pages.
About the only thing I can do is to be more tolerant of sql errors during the sphidering process (as in 1.5.0 and earlier) in later versions of Sphider. Unfortunately, that probably won’t help you as the problem appears to be a space or memory allocation issue.
Thanks, but could it be my connection while spidering?
That, too, is a real possibility. The larger pages could be timing out. Obviously, I didn’t look at every line of every log, but what I did look at shows that for movie4k.to, pages are typically under 100k in size, a few as big as 350k. Then you hit a page nearly 8000k and it shuts down.
Unfortunately, unless you manage the site being indexed, you can’t check the html access logs to confirm that is the case.
I did not see a correlation between page size and the sql errors. What you COULD possibly do is make an index run until it errors out, make a “must not include” entry for that page, and reindex which will bypass the problem page. Do this for each overly large page encountered and you should be able to index the rest of the site using 1.4.2. (1.5.1 and 1.5.1.1 will still exit with fatal sql errors). Not exactly an ideal tactic, in my opinion, but you would get MOST of the site.
Every time, the indexer starts from the beginning.
How could I start indexing from the point where it stopped?
I mean continue indexing, not reindex.
Good question. I don’t know if that can be done or not. I’ll have to snoop around the code to answer that one.
nohup php spider.php -all &

Adding nohup allows execution to continue even if you lose the session, and the trailing ampersand (&) runs the command in the background. Then, if you want to watch the log:

tail -f nohup.out
I just wanted to say a very big thank you to Captain Quirk for taking the time and effort to update Sphider.
I’ve used Sphider for a number of years and after recent server updates I had been getting the Warning message about Deprecated: mysql_pconnect but I didn’t want to just suppress the warning.
Your update and future proofing work is much appreciated and offered in the original spirit of the web.
Live long and Prosper.
Hey Captain Quirk, fantastic job with the updates! Your work solved a lot of the prevalent issues, and it works great as a simple web spider for smaller projects.
One thing I noticed when I tested it out is that site descriptions weren’t showing, either from the meta description field or from the body copy. Any thoughts on how to fix that?
Check your search settings on the Settings tab. Be sure “Show meta descriptions” is checked. Also, look at your “Maximum length of page summaries”. I believe the default is 250. Setting this number too low may cause the summaries not to show.
Hi,
I downloaded 1.5.1 a while ago but had been sitting on the older 1.4.x for a while. I just followed the install.txt and everything went fine until I came to index a site. I had the error with the extra bracket on line 528; after fixing that and running an index I now get:
“Parse error: syntax error, unexpected ‘Content’ (T_STRING) in /home/gohobo96/public_html/search/admin/spiderfuncs.php on line 542”
Any suggestions?
Thanks
Sphider was intended to be aggressive with errors. I have found that it is sometimes TOO aggressive during a crawl, and while reporting an error is fine, having that error be fatal was a bit too much. Next release I will be toning that down a bit.
For now, in spider.php and spiderfuncs.php, look for occurrences of or die("Execution failed: ".$stmt->errorInfo()[2]) or of or die("Execution failed: ".$stmt->error) (depending on whether you have the PDO or non-PDO code) and simply delete them. Leave the trailing ";" in place. While the reporting of sql errors during a crawl will be entirely eliminated, at least execution will not halt. The next version will take a middle ground, that is, report the error but continue execution.
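As a sketch of the three stages (PDO version shown; the non-PDO version uses $stmt->error instead):

$stmt->execute() or die("Execution failed: ".$stmt->errorInfo()[2]);    // current: any sql error is fatal
$stmt->execute();                                                       // after the edit above: errors ignored
$stmt->execute() or print("Execution failed: ".$stmt->errorInfo()[2]);  // roughly the planned middle ground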
Also, I tried indexing a single page at the lowest level; same error. This is line 542:
preg_match(“@]*>(.*?)@si”,$file, $regs);
Line 542 should read:
preg_match("@<head[^>]*>(.*?)<\/head>@si", $file, $regs);
This particular line (while the precise line number varies) is EXACTLY as it has been since the old 1.3.6 from the sphider.eu site I began working with. If what you posted is correct, some file corruption has occurred somehow. Missing the “head” in the preg_match may account for the looping.
Going back to a previous comment concerning line 528 and an extra bracket, as I look at that line I see no unmatched () or []. There is no { or } on that line either.
Perhaps you should re-download the PDO code and install it again. If you decide to revert to 1.4.2, be aware that you will need to destroy and recreate the database as the database layout for the 1.5.1 versions is different than for 1.4.2.
It just keeps messing up the post, don’t know why. I will download 1.5.1.1 and try again. One last attempt:
preg_match(“@]*>(.*?)@si”,$file, $regs);
I see what you mean! Not that it addresses your Sphider problem, but I DID figure out how to make the line appear correctly.
For the two parts of the line containing the head tags, I had to escape the “<” characters so the comment system would stop stripping them.
As to your sphider problem — If the issue persists, can you give me the url for the site you are trying to index and I will give it a try with my archived PDO version (which is identical to the zip file downloaded from here).
OK, progress. Downloading again and starting fresh now runs. I have the same problem with indexing looping on the header/footer links. I guess this is an issue with how my site is built, but I have no idea how to resolve it. I have added the <!--sphider_noindex--> tags to the footer. If I do this for the header also, virtually nothing gets indexed. For my site it doesn’t seem to work well at all; a lot of pages are not indexed. What do you use as a separator for “URL must not include:”?
In the must/must not include sections in the admin, use one url per line.
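For example, a “must not include” box with two entries would look like this (the paths are made up):

/header.php
/footer.php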
Unless your site is on your localhost only, send me the url to your website. I won’t publish the comment with the url, so it would just be you and me. I’ll try an index to my own local test database and pdo code. Maybe between the two of us we can figure out what the devil is going on.
LATEST UPDATE: It seems Stu has resolved most issues and is now happily indexing. For all readers to know, there is a minor issue in that recent versions (1.5.1, 1.5.1.1) are too strict and will abort when an improperly encoded page is encountered. This is not common, as most sites seem to be acceptably coded. At some point, a newer version will correct this, but so far there hasn’t been much other motivation to produce a next version!
Hi Captain Quirk,
I was looking for a search engine that could index PDF files to replace my previous installation, using Sphider and PHP 7. I would like to thank you very much for updating the code, which runs perfectly. I did the update yesterday without any big problems so far.
Again, thanks a lot !
You’re welcome!