What’s next for Sphider?

Work is proceeding with Sphider 1.6!

What will be new in 1.6?

  • The ability to truncate selected tables from the database tab
  • The ability to clear all site data without deleting the site
  • The ability to crawl a site using a sitemap.xml, provided one exists
  • The option to preview pages from the results listing
  • An issue with resuming suspended indexing has finally been resolved
  • Support for an optional Sphider Image Indexer
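
For readers curious what sitemap-driven crawling involves, here is a minimal Python sketch (not Sphider's own PHP code). It assumes a standard sitemaps.org `urlset` document; the function name and the sample URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(xml_text):
    """Return the page URLs listed in a sitemap document, in file order."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls

# Hypothetical sitemap content used only for this example.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about.html</loc></url>
</urlset>"""

for url in urls_from_sitemap(SAMPLE):
    print(url)
```

A crawler working this way visits the listed URLs directly instead of discovering them by following hyperlinks.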

At this point, the changes have been made in both the vanilla and PDO versions of 1.6 and testing is ongoing.

And what? An optional Sphider Image Indexer?  This is an add-on that will work with Sphider 1.6. You will be able to build a catalog of images from sites where you have previously indexed the pages. Currently, the indexer itself is being tested, with excellent results. Work has begun on an image search function, but that is still in the VERY early stages and nowhere near being a viable tool. While the indexer required some modification of the core Sphider, the search function will not.

What this means is that once testing of the vanilla and PDO versions of 1.6 is complete, it can be released. The Image Indexer add-on still has to have the search function completed, then both the indexer and search function ported to PDO, and finally fully tested. At that time it will be released as version 0.99.

Since the search function of the add-on is in the very early stages of development, input as to how you would like to see it operate would be considered.

Just what IS this Sphider, anyway?

Sphider is a program designed to visit a web site in an ordered fashion to find the information necessary to create an index for a search engine. This, in turn, allows the site to be searched for pages containing certain keywords or phrases. Spidering programs are also called web crawlers or bots. They operate by following the hyperlinks on each page.

The crawlers which build major internet search sites (Google, Yahoo, Bing, etc.) are quite sophisticated and can find not only keywords and phrases, but images and other content as well. The ranking system of these crawlers is equally sophisticated. Not only are keywords considered, but so are keyword location and density, relevancy, traffic patterns, TLD names, page design, and domain registration length. In fact, Google has a list of over 200 page ranking factors.

Sphider is much simpler. Pages are ranked solely on keyword weighting. Keyword weighting is calculated by word position and frequency, and the user has a level of control over the weighting process. Images are not indexed and relevancy is not a factor (although better word position and greater frequency DO indicate higher relevance). While Sphider can index practically any website, the main purpose of the application is for the user to index his or her own website so that an internal search can be made available to site visitors.
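
To give a feel for weighting by position and frequency, here is a small Python sketch. It is not Sphider's actual formula: the function name, the title/body split, and the weight values (20 and 1) are assumptions chosen for illustration.

```python
import re
from collections import Counter

def keyword_weights(title, body, title_weight=20, body_weight=1):
    """Score each word by where it appears and how often (illustrative weights)."""
    weights = Counter()
    # Position: a word in the title earns more than a word in the body.
    for word in re.findall(r"\w+", title.lower()):
        weights[word] += title_weight
    # Frequency: every occurrence in the body adds a little more.
    for word in re.findall(r"\w+", body.lower()):
        weights[word] += body_weight
    return weights

scores = keyword_weights("Sphider search", "search the site with search")
print(scores.most_common(3))
```

Making the weights parameters mirrors the idea that the user has some control over the weighting process.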

There are a number of Sphider flavors. The original Sphider (version 1.3.6) can be found at http://www.sphider.eu. It is free, but has the disadvantages of being insecure and badly outdated. It is no longer maintained and will start throwing errors on any system running PHP 5.5 or greater. It will not function at all on PHP 7.

Sphider-Plus (http://www.sphider-plus.eu) and Sphider-Pro (http://www.sphiderpro.eu) are both paid versions of the original and do have added features. I cannot speak as to security or support. Sphider-Pro is at version 3.3, which has a date of 2013, so that may not speak well as to its status. For a small website, many of the enhancements provided by these variations may be overkill.

Then there is the Sphider located here on our Downloads page. It, too, is based upon the original, but has been updated. It functions without error with PHP 5.5 or greater, even with PHP 7. It is much more secure. All SQL queries are made using prepared statements to avoid the risk of SQL injection. Other security measures have also been taken. We even have a variation (PDO) which can not only operate in environments lacking MySQLnd support, but can be used with databases other than MySQL (with some tweaking). It can work with SQLite, PostgreSQL (port kits available for both), ODBC, Microsoft SQL Server, and others. Both the normal and PDO variations are supported. And best of all, they are still free!
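
For anyone unfamiliar with prepared statements, the idea is sketched below in Python. It uses the standard sqlite3 module purely so the example is self-contained; Sphider itself is PHP, and the table and column names here are hypothetical.

```python
import sqlite3

def find_links(conn, keyword):
    """Look up URLs whose title contains the keyword, using a bound parameter."""
    # The ? placeholder keeps user input out of the SQL text itself,
    # so quotes in the keyword cannot change the query's structure.
    cur = conn.execute(
        "SELECT url FROM links WHERE title LIKE ? ORDER BY url",
        (f"%{keyword}%",),
    )
    return [row[0] for row in cur.fetchall()]

# Hypothetical in-memory data for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (url TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO links VALUES (?, ?)",
    [("https://example.com/a", "Site search help"),
     ("https://example.com/b", "Contact us")],
)

print(find_links(conn, "search"))
print(find_links(conn, "' OR '1'='1"))  # treated as literal text, not as SQL
```

The second call shows the point: an injection attempt is searched for as an ordinary string instead of being executed.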

Sphider 1.5.4 and Sphider 1.5.4 PDO may not have installed properly

If you did an upgrade, the regular and PDO versions of Sphider 1.5.4 may not have installed properly. You can check whether or not you are affected by checking the Settings tab on the Sphider admin page. If a version other than 1.5.4 (or 1.5.4 PDO) is reported, there is a problem. The settings table in your database is missing a column. Any downloads from this point on will not be affected.

The issue can be easily fixed and is addressed in this Sphider forum post.

Sphider 1.5.3 has a similar defect and can be repaired the same way, by editing update_rollup.php and re-running it. However, the 1.5.3 case is less critical, as no changes to the settings table take place except for the version number update.

Sphider 1.5.4 and Sphider 1.5.4-PDO to be released on 29 May

On 29 May 2017, Sphider versions 1.5.4 and 1.5.4-PDO will be released and posted on our Downloads page.

Although addressed in the 1.5.3 series, table prefixes containing a hyphen continued to be a problem. Hopefully this time we have tracked down ALL the sources of this problem and corrected them.
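
Some background on why a hyphen causes trouble: in MySQL, an unquoted identifier cannot contain a hyphen, so a name like my-site_links is read as a subtraction unless it is wrapped in backticks. The Python sketch below shows the quoting rule only; it is not the actual Sphider fix, and the helper name and prefix are hypothetical.

```python
def quote_identifier(name):
    """Wrap a MySQL table or column name in backticks, doubling any inside it."""
    return "`" + name.replace("`", "``") + "`"

prefix = "my-site_"  # hypothetical table prefix containing a hyphen
table = quote_identifier(prefix + "links")
print(f"SELECT COUNT(*) FROM {table}")
```

Every place a query is assembled has to apply the quoting, which is probably why a stray unquoted name is easy to miss.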

Another problem was that the presence of an emoji on a web page (generally uncommon except on blog or forum pages) would cause an error and that page would not be indexed. Emojis are now purged before indexing.
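
One plausible way to purge emojis is sketched below in Python. It rests on an assumption: that removing every character outside the Basic Multilingual Plane (code points above U+FFFF) is acceptable. Most emojis live there, but a few older symbols do not, and this is not necessarily how Sphider does it.

```python
import re

# Matches any character in the supplementary planes (U+10000 and above).
ASTRAL = re.compile("[\U00010000-\U0010FFFF]")

def purge_emoji(text):
    """Remove supplementary-plane characters, which covers most emojis."""
    return ASTRAL.sub("", text)

print(purge_emoji("Great post \U0001F600!"))
```

Accented and Cyrillic letters are untouched by this filter because they sit well below U+FFFF.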

The ability to index decimal numbers has been added. In earlier versions, numbers could be indexed but decimal numbers would not be. For example, ‘12345.56789’ would be indexed as ‘12345’ and ‘56789’. If the setting for indexing decimals (on the settings page) is checked, ‘12345.56789’ will now be correctly indexed. A side benefit is that ANY numerical string with a period will be recognized. For example, ‘123.456.789’ would be indexed. This could be useful for pages containing part numbers. The mixing of numeric and alpha characters will still omit the period: ‘12345.abcde’ will still be indexed separately as ‘12345’ and ‘abcde’.
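
The behavior described above can be sketched with two regular expressions. This Python snippet is an illustration only, not Sphider's tokenizer; the function and parameter names are made up.

```python
import re

# Digits joined by periods stay together; anything else splits on non-word characters.
DECIMAL_AWARE = re.compile(r"\d+(?:\.\d+)+|\w+")
PLAIN = re.compile(r"\w+")

def tokenize(text, index_decimals=True):
    """Split text into index terms, optionally keeping dotted numbers whole."""
    pattern = DECIMAL_AWARE if index_decimals else PLAIN
    return pattern.findall(text)

for sample in ("12345.56789", "123.456.789", "12345.abcde"):
    print(sample, tokenize(sample), tokenize(sample, index_decimals=False))
```

Because the first alternative requires digits on both sides of every period, a mixed string such as ‘12345.abcde’ falls through to the plain word rule and is split.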

Also changed in these versions are the language files. Since the search page is UTF-8 compliant, “special characters” like è or ç would fail to display properly. The Cyrillic alphabet with characters such as Ц or й will also now display correctly. This does NOT mean the text displayed will be the proper translation, as I am no linguist and am either relying on the work of others where possible, or winging it with the use of Google Translate. Simply put, these characters are now coded in the language files as Unicode entities.
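
As an illustration of what coding characters as Unicode entities means, the Python sketch below converts every non-ASCII character to a decimal numeric character reference. The function name is hypothetical, and the real language files were not necessarily produced this way.

```python
def to_entities(text):
    """Replace each non-ASCII character with a decimal reference such as &#232;."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

for phrase in ("Résultats", "Цй"):
    print(phrase, "->", to_entities(phrase))
```

A file written this way contains only ASCII, so it displays the same regardless of the encoding it is saved or served in.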

Tax Freedom Day

This being April 15th, and being in the United States, I got to thinking about taxes. And then I started thinking about Tax Freedom Day, the day by which, on average, a person has earned enough in the current year to pay all of his/her federal, state, and local taxes for the year (and starts working to provide for his/her own needs).

First of all, April 15th isn’t the day taxes are due in 2017. April 15th being a Saturday, and Monday, April 17th being Emancipation Day (I never even knew there was such a holiday), taxes aren’t due (in the USA) until Tuesday, April 18th, 2017.

Anyway, back to Tax Freedom Day… it turns out that this year Tax Freedom Day in the USA falls on April 24th. I was thinking “Gee, that kinda s**ks!” Then I found out what it is like elsewhere. In the United Kingdom, Tax Freedom Day doesn’t arrive until 13 May. But it could STILL be worse. In Finland, it isn’t until 15 June, and in Sweden it is 30 June. The end of June, which means you work half the year just to pay your taxes. Turns out, in Germany the day doesn’t arrive until 19 July, and in France it is 26 July. And worst of all is Belgium with a date of 3 August! I didn’t check to see if there were any countries even worse off. It would have been too depressing.

I guess 24 April isn’t that bad after all.

Sphider 1.5.3 and Sphider 1.5.3.PDO have been released

Updates to the Sphider search engine have been made. The latest version is 1.5.3. Sphider 1.5.3 is for use when both MySQLi and MySQLnd modules are available in PHP. For individuals whose host does NOT provide MySQLnd support, but DOES provide PDO support, Sphider 1.5.3.PDO is also available. You may find both on the Downloads page (click the Downloads tab at the top of this page).

To avoid confusion concerning versions, the PDO version no longer contains a “.1” at the end of the version number, but a simple “.PDO” to distinguish it from the non-PDO version. (Some people thought 1.5.2.1 was a minor update from 1.5.2 when it actually was identical but coded for PDO instead of MySQLnd.)

Changes in 1.5.3 from 1.5.2 are:

  • Better support for https sites
  • Ability to better recognize and follow the directives in a robots.txt file
  • Correction of a potential problem when using the CleanDomains function in the event there was only a single domain to clean
  • Fixed a number of errors which could appear when a database table prefix contains a hyphen
  • Fixed a potential error when running under PHP 7

Sphider Help Forum is now available

The new Sphider Help Forum for help concerning Sphider 1.4.2 or later is now open, at least on a trial basis. Out of necessity, ALL posts will be moderated. This is because of the tremendous amount of blog, forum, and guestbook spam present on the internet. Apologies for the inconvenience, but that’s life.

Hopefully, this forum can be used by the slowly growing community of users of the updated Sphider. The original Sphider Forum (located at sphider.eu) has become steadily less help and more sales pitch for Sphider-Plus. We have no gripe about Sphider-Plus, per se, but the original Sphider was free and just because the original developer moved on to other interests several years ago, we don’t see why the original can’t live on and evolve with the rest of technology.

The original (1.3.6 and before) has problems with anything later than PHP 5.4, and here we are, most platforms on 5.5 or 5.6 and the trend well underway towards PHP 7.  Any internet technology which simply stands still for 4 to 7 years is going to become lost in the cloud of dust.

Anyway, hopefully the forum will be a better place to air problems and find solutions than blog comments.

Considering another Sphider improvement

The original version of Sphider had very erratic support for indexing HTTPS pages, and wouldn’t even look at the robots.txt file on an HTTPS site. That failing has never been addressed, and even the latest version, 1.5.2, has the same failings when it comes to HTTPS. This has never really been an issue for me before, and even now it is more annoyance than issue as I can work around it.

Still, the “problem” does seem intriguing. After a bit of experimenting, a fix may not be all that difficult. (Famous last words, right?)

I am debating now whether or not to continue investigating alternatives and make more code changes which would improve HTTPS support in Sphider, not only to ensure more reliable connectivity but to enable the robots.txt to be utilized as well. I don’t know that there is that big of a need. We’ve never received any complaints or comments on the issue…

Anyway, at this point there is a POSSIBILITY, but no definite plans one way or the other.

*******************************

UPDATE (Apr 6): I was able to get the robots.txt file read from a https site. First problem, regardless of http or https, the parsing of allowed or disallowed user agents and disallowed files/directories was iffy. If the robots.txt file had lines like “user-agent” or “disallow”, it was parsed, but “User-agent” or “Disallow” was not. It was a case issue. That is now fixed (on my side, not published yet). Second problem, now that I know the file IS being read and parsed, Sphider will STILL index some files in disallowed directories!

If you have any files or directories listed as “url_not_inc” in your settings, that will work, but not the robots.txt disallows, even though that SHOULD be the case. Well, this situation certainly has gotten my interest!
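
The case issue from this update can be illustrated with a small Python sketch. It is not the Sphider code (which is PHP); the function names and the agent name "sphider" are assumptions, and only User-agent and Disallow lines are handled.

```python
def parse_disallows(robots_txt, agent="sphider"):
    """Collect Disallow paths for `agent` or '*', ignoring the case of field names."""
    disallowed = []
    applies = False         # does the current group apply to our agent?
    in_agent_block = False  # True while reading consecutive User-agent lines
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field = field.strip().lower()  # "User-agent" and "user-agent" are treated alike
        value = value.strip()
        if field == "user-agent":
            matches = value == "*" or value.lower() == agent.lower()
            applies = matches or (in_agent_block and applies)
            in_agent_block = True
        else:
            in_agent_block = False
            if field == "disallow" and applies and value:
                disallowed.append(value)
    return disallowed

def is_allowed(path, disallowed):
    """A path is allowed unless it starts with one of the disallowed prefixes."""
    return not any(path.startswith(prefix) for prefix in disallowed)

# Hypothetical robots.txt content mixing upper and lower case.
SAMPLE = "User-agent: *\nDisallow: /private/\nuser-agent: otherbot\ndisallow: /tmp/\n"
rules = parse_disallows(SAMPLE)
print(rules, is_allowed("/private/notes.html", rules), is_allowed("/index.html", rules))
```

Parsing is only half the job, as the update notes: the crawler also has to consult the rules (here, is_allowed) before fetching each URL. Python also ships a ready-made parser in urllib.robotparser.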

*******************************

UPDATE (Apr 7): I have begun the process of troubleshooting the code to see what is going awry and where. Working alone and having other things to do in life, this can be both time consuming and frustrating. So far, I do know the robots.txt is read and parsed properly. Just where and why the instructions are not acted upon is another matter. At least the question of whether or not I will be attempting another modification has been answered!

*******************************

UPDATE (Apr 8): GOT IT! Preliminary tests show robots.txt is now being followed in both http and https. More testing to follow (found a couple other misc issues and fixed them). Once everything is validated, there will be a 1.5.3. Stay tuned.

Point to ponder

When told the reason for daylight saving time the old Indian said…
‘Only a white man would believe that you could cut a foot off the top of a blanket and sew it to the bottom of a blanket and have a longer blanket.’

Daylight Saving Time is NOT followed in Arizona, with the exception of the Navajo Nation in the northeast corner of the state, which does. Meanwhile, the Hopi Reservation in Arizona, which is COMPLETELY surrounded by the Navajo Nation, does not. Does this make sense?

The reason given for this is that the Navajo Nation covers 27,245 square miles in parts of three states, Arizona, Utah, and New Mexico, and that Utah and New Mexico DO follow DST. Rather than having two different times in just one nation, the Indian leaders have opted to follow DST on the Arizona part of the Nation.

You can see from the map that the vast majority of the Navajo Nation is in Arizona. While it does make sense for the entire Navajo Nation to be observing a single time, wouldn’t it make more sense for the Navajo leaders to follow Arizona’s lead and declare that the parts of the Nation in Utah and New Mexico NOT follow DST?

Just wondering…

Current state of rocket landings

As of this time, Blue Origin has nailed 5 successful landings in a row. The last landing was actually unexpected as the launch was to test (successfully) the launch escape system. The push back was expected to damage the launcher and make it unable to land. In a big plus for Blue Origin, not only did the escape system perform well, the booster was able to make a successful landing as well. Blue Origin may start launching tourists for suborbital flights this year. At least, that’s the plan.

Meanwhile, SpaceX just nailed a landing in Florida after a successful launch of a Dragon cargo vessel to ISS. This was the third success of bringing Falcon 9 first stage back to LZ1. There have also been 5 successful barge landings (4 in the Atlantic, 1 in the Pacific). So what is SpaceX’s record at this juncture? They have 8 successful landings in 18 tests. Consider that on the first 5 tests, all at sea, there was no barge involved. These were strictly systems tests and all the stages were intentionally lost at sea. Now we are talking 8 of 13. There have been 4 successes in a row, 7 successes in the last 8 attempts, 8 successes in the last 11 attempts. Overall, considering the complexity of the systems involved, not a bad record at all!

SpaceX is currently constructing a second landing pad in Florida, LZ2, and LZ3 is in the works. This will come into play when the Falcon Heavy comes on line later this year. Three cores coming down at once! It is anticipated that two will return to LZ1 and LZ2, and the third to a barge in the Atlantic. If SpaceX can pull this one off, it will be a sight to see. A Falcon Heavy launch and three core landings in a single act!

Between Blue Origin and SpaceX, 2017 could be quite a year.