Content-Type meta tags and HTTP response headers

How many of us have used a meta tag to define content type and default character sets? The tag may appear something like this:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

But do we REALLY understand what is going on? This tag is important when a webpage is being opened locally. It instructs the browser as to what character encoding to use to display the page. This may override the platform default.

But what about when a page is being viewed over HTTP? Well, the tag is important if the HTTP response header(s) being sent fail to designate a character encoding. What if the response header(s) DO include a character set? AHHH! Then the meta tag is (are you ready for this?)… IGNORED!
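The precedence just described can be sketched in a few lines of Python. This is a simplified illustration, not how any particular browser is implemented: the function name, its parameters, and the fallback value are all hypothetical, and other signals a real browser looks at (such as a byte order mark) are ignored.

```python
def effective_charset(header_charset, meta_charset, fallback="windows-1252"):
    """Pick the encoding a browser would use, per the precedence described above.

    header_charset: charset from the HTTP Content-Type header, or None
    meta_charset:   charset from the page's meta tag, or None
    fallback:       hypothetical platform default, used when neither is given
    """
    if header_charset:          # the response header wins whenever it is present
        return header_charset.lower()
    if meta_charset:            # the meta tag only matters when the header is silent
        return meta_charset.lower()
    return fallback


header_wins = effective_charset("Windows-1252", "UTF-8")   # 'windows-1252'
meta_used = effective_charset(None, "UTF-8")               # 'utf-8'
```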

Let’s say you designate a page, via meta tag, to have a character set of UTF-8, but your web server is sending a response header setting the default as Windows-1252. Your page is going to display in Windows-1252!

And guess what? Your page, viewed over HTTP, may still appear to display correctly, giving you the impression that the meta tag is working! Then you force your browser to actually display in UTF-8 and that beautiful page suddenly becomes what is referred to as “mojibake!”
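Here is a small Python sketch of how mojibake comes about. The helper name misread is made up for this illustration; it simply encodes text the way the file was saved and decodes it the way the browser was told to read it.

```python
def misread(text, saved_as, displayed_as):
    """Encode text as it was saved, then decode it with a different charset."""
    return text.encode(saved_as).decode(displayed_as, errors="replace")


# A UTF-8 file displayed as Windows-1252: the two bytes of "é" become two characters.
utf8_shown_as_1252 = misread("café", "utf-8", "cp1252")    # 'cafÃ©'

# A Windows-1252 file forced to UTF-8: the lone byte for "é" is not valid UTF-8.
cp1252_forced_to_utf8 = misread("café", "cp1252", "utf-8")  # 'caf\ufffd'

# Plain ASCII text has the same bytes in both encodings, so it survives either way.
ascii_survives = misread("cafe", "cp1252", "utf-8")         # 'cafe'
```

The last case is why a mismatch can go unnoticed for so long: a page with nothing but plain ASCII characters looks fine no matter which of the two encodings is used.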

There are at least a couple ways to get this all sorted out. If you are coding in PHP, one way would be to set the response header in the code for each page. Here is an example PHP header:
header('Content-Type: text/html; charset=utf-8');
This call must come BEFORE any output is sent to the browser, meaning before a single bit of HTML (even a blank line) appears.

Another way, if you have access to your server settings, is to specify a default character set there.

Still another way, with Apache servers, is to specify a default character set in your .htaccess file.
AddDefaultCharset UTF-8

So… knowing all this, just HOW do you go about confirming that the character set you want is the character set actually being set? With Firefox/Waterfox/SeaMonkey, bring up the page in question. Up in the url display area, to the left of the url, click on the little circle with the upside-down “!”. There will be information on whether or not the connection is secure, then a “>”. Click on that. Click on “More information”, then the “General” tab. This will display the text encoding AND the meta tags. If they don’t agree, the response header being sent isn’t what you want it to be. This applies to Waterfox in Linux, also.
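The same comparison can also be scripted. Below is a rough Python sketch using only the standard library; the class and function names (MetaCharsetFinder, charset_from_content_type, compare_charsets) are invented for this example, and the header value and HTML are passed in as plain strings rather than fetched from a live site.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


class MetaCharsetFinder(HTMLParser):
    """Remember the first charset declared by a meta tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):                      # <meta charset="...">
            self.charset = attrs["charset"].lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            self.charset = charset_from_content_type(attrs.get("content"))


def compare_charsets(content_type_header, html):
    """Return (header charset, meta charset, whether they agree)."""
    header_charset = charset_from_content_type(content_type_header)
    finder = MetaCharsetFinder()
    finder.feed(html)
    return header_charset, finder.charset, header_charset == finder.charset


page = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head></html>'
result = compare_charsets("text/html; charset=Windows-1252", page)
# result is ('windows-1252', 'utf-8', False): the header and the meta tag disagree
```

To check a live page, the header charset should be obtainable with urllib.request.urlopen(url).headers.get_content_charset() and the response body can be fed to the same parser.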

Google Chrome USED to allow the option to see what the character set REALLY is, but they removed it. Fortunately, there is an extension that does it for you. The extension is simply named “Charset”, and it allows you not only to see what the actual character set is for a given page, but also to change it. The results may be an eye opener. BTW, this applies to Linux Chromium as well.

What about IE/Edge? You’re on your own! I won’t touch those monstrosities! LOL!!

The future of the PDO edition of Sphider…

Sphider comes in two editions, the legacy version and a PDO version. The legacy version is definitely the more stable, faster, easier to maintain version. The PDO version exists primarily for those who are restricted by their shared hosting providers.

Shared hosting has its advantages in that it is very cost effective (cheap) and very simple to use. It is great for personal use or for small businesses or organizations just getting started on the web.

But shared hosting has its downsides, too. It isn’t nearly as efficient, isn’t as secure, suffers from limited resources, and has limited functionality. One of the features commonly lacking in shared hosting is MySQLnd. Thus the need for PDO.

There are quite a few users of the PDO edition, and to simply drop PDO would be a great disservice. On the other hand, keeping the PDO edition in sync with the legacy edition is getting harder and requires much time and effort.

The PDO version, as it stands, is quite usable. It is PHP 7.3 compliant, so it should be reasonably set for a while, as the majority of shared hosting plans are still at least a few versions behind 7.3!

The thought is that it is time for legacy and PDO to part paths, with most future effort going into the legacy edition. Because of the user base, PDO version 2.4.0 would remain and receive hot fixes as needed.

No decision has been made and feedback will be given consideration.

Emojis and Sphider

Quite some time back, Sphider had an indexing issue when emojis were encountered on a web page. The SQL errors would fly! The solution at that time was to filter out emojis before storing in the database. This solution was working just fine, but admittedly the filter has not been updated and there are ALWAYS new emojis making their appearance.

While even the new emojis themselves have not been an issue, there was a very curious case of an emoji-free site in which the filter was clearing the entire full text of pages and storing — NOTHING! Well, that isn’t good. The workaround for that site was to disable the emoji removal function. Not an ideal fix, but very doable. As to WHY the function has this effect on that particular site is still a mystery.

But now may be the time to revisit the need for the filter in the first place. At the time the filter was installed, Sphider used the default MySQL utf8 scheme, which is 3-byte. Some emojis are 3-byte, but the vast majority are 4-byte, with even a few 8-byte emojis. You see the problem, don’t you? MySQL is not going to be happy when you try to stick a 4-byte character into 3 bytes!
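To make the 3-byte limit concrete, here is a rough Python sketch of what a filter built around it could look like. This is NOT Sphider's actual filter (that code isn't shown here); the function name is hypothetical, and the only rule applied is that a character needs 4 bytes in UTF-8 exactly when its code point is above U+FFFF.

```python
def strip_four_byte_chars(text):
    """Drop every character whose UTF-8 encoding needs 4 bytes (code point above U+FFFF)."""
    return "".join(ch for ch in text if ord(ch) <= 0xFFFF)


# U+1F600 (grinning face) needs 4 bytes and is removed;
# U+263A (smiling face) needs only 3 bytes and is kept.
cleaned = strip_four_byte_chars("ok \U0001F600 \u263a")   # 'ok  \u263a'
```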

Since that time, however, Sphider has moved to utf8mb4, which IS 4-byte. This means that the troublesome 4-byte characters WILL fit into the database. As to those 8-byte emojis, well, they are commonly composed of TWO 4-byte characters, which means: NO PROBLEM!
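The byte counts are easy to verify. The short Python sketch below measures a few sample emojis; the helper name utf8_len and the choice of samples are just for illustration.

```python
def utf8_len(text):
    """Number of bytes needed to store text as UTF-8."""
    return len(text.encode("utf-8"))


smiling_face = "\u263a"              # one code point in the 3-byte range
grinning_face = "\U0001F600"         # one code point above U+FFFF
us_flag = "\U0001F1FA\U0001F1F8"     # two regional indicator code points

sizes = [utf8_len(smiling_face), utf8_len(grinning_face), utf8_len(us_flag)]
# sizes is [3, 4, 8]

# The "8-byte emoji" is really two separate 4-byte characters side by side.
flag_parts = [utf8_len(ch) for ch in us_flag]
# flag_parts is [4, 4]
```

Since the limit applies per character, an 8-byte sequence like the flag is stored as two ordinary 4-byte characters.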

The next version of Sphider, 2.4, is VERY near release. The emoji filter remains in place. But after serious thought and consideration, and some testing, this filter may be removed in the following release. It is logical, but how will it test out?

What to expect in Sphider 2.4.0

Sphider 2.4.0 is on track for an April 10th release. For the user, the changes are focused on cosmetics. Up until this point, search results ALWAYS had a result number and, after the description, a text url to the page containing the search result. In 2.4.0, you will have the option to either display or not to display those items. Also, the option to display the page’s indexing date has been added.

As to search templates, what were probably seven of the crappiest, lamest templates to have ever seen the light of day have been scrapped. Seven NEW templates are being introduced. Depending on your tastes, you might consider some of them crappy, too, but at least they have a bit of style to them. The “newspaper” template was introduced in an earlier post. Here are the other six:

“black” template
“green” template
“grey” template
“simple” template
“terminal” template
“yellow” template

The “green” style is, well, VERY GREEN! The purpose isn’t so much for actual use as to demonstrate the ability and flexibility of CSS in creating your own templates, even using an image as a border.

The “yellow” template features a bit of simple artwork in the upper left corner. This artwork is “logo.png”, located in the templates/yellow directory. The size is 150×150 and has a transparent background. By creating your own similarly sized logo/picture/artwork, and replacing “logo.png”, this template can be customized for your website.

Since everyone has different tastes and different needs, and every website is somewhat unique, these templates can serve as guides in customizing your own templates. With all the above, the ONLY thing different is the CSS. Start with a copy of the “standard” template and start tweaking away! The basic Sphider modules remain the same.

Additionally in Sphider 2.4.0, the ‘settings’ table has been completely reworked. While this change is transparent to the user, it will make life much easier on the developer as Sphider moves forward.

Besides some minor fixes and tweaks, the only other big change is in the word stemming process. While the majority of Sphider users probably never use word stemming, those who do will be pleased to learn that the algorithm (for English) has been updated to Porter2. Completely new is the ability to use stemming for ten other languages!

The next Sphider is in the pipeline

Sphider 2.3.1 is brand new, but work has already begun on 2.4.0.

Among the features already being implemented is the ability to hide the result number when displaying search results. Also, for the regular text search, the option to display the index date is being added. (This will not be available for the image or RSS searches.) The RSS and image searches will have the option to turn off the advanced search features.

A new template is being added. Unlike nearly all the current templates, this one has some class. Here is a screen shot:

The Newspaper template

In the sample above, in “settings” the result number is turned off, the index date is turned on, and the description length has been increased to 1000.

Probably the biggest change will be transparent to the user. The “settings” table is being reworked. As Sphider has changed, so has the table, with new columns being appended on a regular basis. Now, while the position of columns within a table is totally immaterial to functionality, after a while it can be really confusing for the developer having to bounce all over the place to gather data. This change will organize the data in a regular flow which will be much easier to maintain going forward.

Other improvements are also being considered, but whether or not they are implemented at this time is yet to be determined. No release date has been set.

When 2.4.0 is released, whenever that may be, the downloads for the SQLite and PostgreSQL versions will likely be removed due to lack of demand.

Also, earlier thoughts of adding audio (mp3, wav, ogg) indexing support to Sphider have been dropped, also due to lack of demand. The actual indexing algorithm has been proven and sketched out, but there is no rationale for implementing it other than “Gee, that’s a neat feature.”

Sphider 2.3.1 Released

Sphider 2.3.0 principally addressed security concerns, but it also was intended to bring Sphider into PHP 7.2 compliance by removing any use of the deprecated each() function. The function was used extensively, and the majority of the code replacement was run-of-the-mill and straightforward. In four places, however, the usage was atypical. Substitute code was put in place and tested. All seemed to work well, as many sites were indexed and searches performed as expected.

Well! It seems indexing and searching were being done properly, but only for words composed of Western characters. Words utilizing non-Western characters were not being indexed! And any searches for those words not only returned as “not found” (expected since they weren’t indexed), those searches also complained of gibberish characters/words being either too short or too common.

Investigation of the issue led to three of the four code segments replacing the non-standard usage of the deprecated each() function. The code replacements themselves have been replaced in 2.3.1. Testing on the problem sites now shows that all words are being indexed, those containing Western characters as well as those containing non-Western characters. The search anomalies are gone and searches in non-Western languages are yielding expected results. If a search word really IS too short or too common, it is reported as such, and not as gibberish. Sphider is now truly PHP 7.2 compliant.

Both editions of Sphider 2.3.1, legacy and PDO, are available for download on this blog’s download page, or from the Sphider Home page.

Sphider – PDO vs MySQLi

There are TWO editions of Sphider… the classic edition using MySQLi and the PDO edition.

Why are there two versions? The classic edition uses MySQLi and prepared statements. While MySQLi, by itself, does support prepared statements, there are a couple functions used in Sphider that require MySQLnd (the “nd” stands for “native driver”). These functions are used because they are the most efficient way of doing things.

MySQLnd has been the default driver since PHP 5.4. If you install a modern version of PHP and want MySQLi, you are going to get MySQLnd. Yet SOME hosting companies DISABLE MySQLnd for those using shared hosting. (I suppose they want people to shell out a few more bucks to get VPS or Dedicated hosting.) In those situations, the classic edition just ain’t gonna work! So, there is the PDO edition.

There are those who will tell you that PDO is what you should be using anyway. They will tout how versatile PDO is, how it can do anything MySQLi can do, only better. It is true that PDO IS versatile. It can work with many different databases, not just MySQL. But there ARE some things PDO just can’t do, at least not efficiently. And there is overhead. And memory requirements.

With PDO:              PHP <==> PDO <==> Your data
With MySQLi:       PHP <==> Your data

The classic version of Sphider is the better, more capable edition! The PDO edition is capable enough PROVIDED you aren’t trying to build your personal version of an internet search engine. It IS possible to tax the PDO edition to the point it chokes. (It is probably possible to choke the classic edition as well, but it takes more effort.)

Remember, the intent of Sphider was/is to index a web site for the benefit of that site’s visitors. It can be used to index a number of related sites for the same purpose. An individual may stretch Sphider for personal use to index MANY sites… but it is STILL just a small indexing tool and not a Google replacement!

NOW… the final point. If you REALLY need Sphider to stretch its capabilities to the absolute limit, maybe you should be using the classic edition and not PDO. If that is the case, shell out a couple extra bucks to your host so you can get access to MySQLnd. Don’t try to pull a 20′ travel trailer with a Honda Civic.

WorldSpaceFlight code changes

The WorldSpaceFlight pages dealing with the various flights (US, Russia, China) have undergone a behind-the-scenes update. A new class was introduced which eliminates a lot of code duplication and makes maintenance easier. Of course, this means a number of modules had to be changed to accommodate the new class structure. I might have missed something. If you notice anything “funky” about a particular page, please let me know so I can fix it. This can be something like weird characters, misplaced text, missing data, inappropriate data like a number where text is expected, or some kind of error message.