The Urbano A. Company’s Blog
Did you mean…? in PHP
In a new website I am developing for a client I had to add the usual "Did you mean...?" to her search results, so I started thinking about the easiest way to do this.
There are actually quite a few PHP functions for finding similar text. The most obvious one?
similar_text()
It takes two parameters plus an optional third. The first two are the strings to compare; the optional one is a variable, passed by reference, that receives the similarity of the two strings as a percentage. It is quite useful, although it is too slow to run against huge database searches, so I wouldn't recommend it there.
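As a minimal sketch of how similar_text() hands back that percentage (the helper name similarityPercent is made up for illustration, it is not part of PHP):

```php
<?php
// Hypothetical helper: similarity of two strings as a percentage (0 to 100).
function similarityPercent(string $a, string $b): float
{
    $percent = 0.0;
    // similar_text() returns the number of matching characters and
    // writes the percentage into the third, by-reference argument.
    similar_text($a, $b, $percent);
    return (float) $percent;
}

echo round(similarityPercent('carrrot', 'carrot'), 1), "%\n";
```

You would then compare the result against whatever cut-off suits your data; the right threshold is something to tune, not a fixed rule.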
There are two other methods that might be good for some cases, and another function that is just the best. I'll show you the best one first:
It is the Levenshtein algorithm, which basically finds the minimum number of characters you must insert, replace, or delete in a string to make it match another one. At first it doesn't sound too useful, but take a look at this example:
<?php
// input misspelled word
$input = 'carrrot';

// array of words to check against
$words = array('apple','pineapple','banana','orange',
               'radish','carrot','pea','bean','potato');

// no shortest distance found, yet
$shortest = -1;

// loop through words to find the closest
foreach ($words as $word) {

    // calculate the distance between the input word,
    // and the current word
    $lev = levenshtein($input, $word);

    // check for an exact match
    if ($lev == 0) {

        // closest word is this one (exact match)
        $closest = $word;
        $shortest = 0;

        // break out of the loop; we've found an exact match
        break;
    }

    // if this distance is less than the next found shortest
    // distance, OR if a next shortest word has not yet been found
    if ($lev <= $shortest || $shortest < 0) {
        // set the closest match, and shortest distance
        $closest  = $word;
        $shortest = $lev;
    }
}

echo "Input word: $input\n";
if ($shortest == 0) {
    echo "Exact match found: $closest\n";
} else {
    echo "Did you mean: $closest?\n";
}
?>
In this example even a misspelled word is matched: the code uses the Levenshtein distance to find the most similar word in the list and suggests it.
This is the output of the code above:
Input word: carrrot
Did you mean: carrot?
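One thing that example does not do is reject bad matches: it always suggests something, however far away the closest word is. A minimal sketch of a cut-off, assuming a hypothetical suggest() helper and an arbitrary maximum distance of 2:

```php
<?php
// Hypothetical helper (not part of PHP): returns the closest word,
// or null when nothing is within $maxDistance edits of the input.
function suggest(string $input, array $words, int $maxDistance = 2): ?string
{
    $closest  = null;
    $shortest = PHP_INT_MAX;

    foreach ($words as $word) {
        $lev = levenshtein($input, $word);
        // keep the first word with the smallest distance seen so far
        if ($lev < $shortest) {
            $closest  = $word;
            $shortest = $lev;
        }
    }

    // only suggest when the best match is close enough
    return $shortest <= $maxDistance ? $closest : null;
}

$words = array('apple', 'carrot', 'pea');
var_dump(suggest('carrrot', $words));   // the closest word in the list
var_dump(suggest('xylophone', $words)); // nothing close enough, so null
```

The value 2 is only a guess; a limit relative to the length of the input may work better for long words.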
Using this function is quite simple, although it takes three optional parameters (the costs of an insertion, a replacement and a deletion) for more precise use. See the php.net reference for this function.
The other options I mentioned are soundex and metaphone, although they are a bit more complicated to use for this particular kind of suggestion.
Soundex creates a key that is the same for words that sound alike in English.
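To give an idea of what the optional parameters of levenshtein() do, here is a small sketch. The weights are picked only for illustration: making replacements expensive pushes the algorithm to use a deletion plus an insertion instead.

```php
<?php
// Default costs: every insertion, replacement and deletion counts as 1.
$default = levenshtein('kitten', 'sitting');

// Assumed weights for illustration only: insertion 1, replacement 5,
// deletion 1, so a replacement is dearer than delete + insert.
$weighted = levenshtein('kitten', 'sitting', 1, 5, 1);

echo "default: $default, weighted: $weighted\n";
```

With the default costs the two words are three edits apart; with the weighted costs the distance grows, because each replacement is now paid for as two cheaper operations.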
For example, the following code:
<?php
echo soundex('beard') . "\n";
echo soundex('bird') . "\n";
echo soundex('bear');
?>
Will produce this output:
B630
B630
B600
Here beard and bird share the same key. This could make suggestions fast if you have already created a column in your MySQL tables with the soundex key of the tags, for example, so that you can search not only for the string, but also for its soundex key...
UPDATE: You can use MySQL's built-in SOUNDEX() function to search both for the string as-is and for its soundex key, so that misspelled words are matched too.
And finally, the metaphone function is a variation on soundex: it also produces a key that is the same for words pronounced alike, but more accurately than soundex, since metaphone actually knows the rules of English pronunciation.
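The precomputed-key idea can be sketched in plain PHP, with an array standing in for the database column. The function names buildSoundexIndex() and soundsLike() are hypothetical, and a real site would keep the keys in a table rather than in memory:

```php
<?php
// Hypothetical stand-in for a table with a precomputed soundex column:
// maps each soundex key to the list of words that produce it.
function buildSoundexIndex(array $words): array
{
    $index = array();
    foreach ($words as $word) {
        $index[soundex($word)][] = $word;
    }
    return $index;
}

// Looks up every stored word that shares the soundex key of the input.
function soundsLike(string $input, array $index): array
{
    $key = soundex($input);
    return isset($index[$key]) ? $index[$key] : array();
}

$index = buildSoundexIndex(array('beard', 'bird', 'bear', 'carrot'));
print_r(soundsLike('berd', $index));
```

The lookup is a single key comparison per search, which is why this approach scales better than computing a distance against every word.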
The use would be exactly the same as soundex, and if you are going to use something of the sort I would recommend metaphone over soundex for its improved accuracy.
But bear in mind that both soundex and metaphone probably won't work well in most other languages, or at least in languages with phonemes that don't exist in English.
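For completeness, a rough sketch of the same kind of lookup with metaphone. The helper metaphoneMatches() is a made-up name, and the word list is just the one from the Levenshtein example:

```php
<?php
// Hypothetical helper: keeps the dictionary words whose metaphone key
// is identical to the metaphone key of the input.
function metaphoneMatches(string $input, array $words): array
{
    $key = metaphone($input);
    return array_values(array_filter(
        $words,
        function ($word) use ($key) {
            return metaphone($word) === $key;
        }
    ));
}

// 'karot' is an assumed misspelling, used only for illustration.
print_r(metaphoneMatches('karot', array('apple', 'carrot', 'pea')));
```

As with soundex, you would normally store the metaphone key next to each word instead of recomputing it on every search.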
Hope you found this useful,
Alex
This entry was posted by alex on May 30, 2008 at 4:12 pm, and is filed under PHP, Programming, Reviews.
about 1 year ago
“(…) already created a column in the mysql tables with the soundex key (…)”
Or use SOUNDEX() function provided by MySQL.
about 1 year ago
Wow, didn’t even know these functions existed. Thanks for the lowdown.
about 1 year ago
Cool and sweet tip... I like your syntax highlighting plugin in WordPress :))
about 1 year ago
How big is the dictionary that you use? Levenshtein is O(m×n). You can't get much better for this sort of operation, but assuming the input is 7 chars and the target string has an average length of 6, we need 42 operations to determine each distance. Assuming a rather puny dictionary of 1,000 words, every page view/search will require 42,000 operations? That sort of math doesn't scale.
You may want to look at storing the results of each search in a DBMS as a sort of long-term cache.
about 1 year ago
Well of course I wouldn’t use Levenshtein for large volumes of data without some sort of index or cache as you say…
In case you want to do it with large dictionaries I would probably create a column with the soundex or metaphone keys and then search both for the string as-is and then for the soundex key.
Using it that way you could even display the percentage of closeness…
I would say that Levenshtein is great for small databases or array searches, but for large-scale search soundex is a much better fit.
Thanks for the correction,
Alex
about 1 year ago
Nice, thanks for the advice, I didn’t know MySQL had that function. Then yes, use it directly, unless you want to use the metaphone function, which apparently isn’t included yet in MySQL (There is a request to have it included in following versions already)
Thanks for that, I’ll update the post now to reflect that,
Alex
about 1 year ago
I had no idea there were such functions in PHP. Cheers for the article, I’ll be using this in the near future!
about 1 year ago
This is an example in Python, but very useful. http://norvig.com/spell-correct.html
about 1 year ago
@Sergei:
Nice!
Although I’m not sure that they are using the best approach, since it is language-dependent, and I doubt Google actually performs a real spell check.
I am guessing it searches for similar words using one of these functions (or altered versions of them) to get slightly varied words, and then performs searches with them too…
But you never know
about 1 year ago
a better solution, in our work, was to make a custom dictionary. see:
http://www.indirecthit.com/2007/08/24/google-did-you-mean-on-search-pages-using-php-4/
about 1 year ago
Well, that might indeed be a better solution, although I would like to test both with several different terms and cases and note down the results and processing time to find out which approach is best.
Yours might be better, although for normal sites that don’t really need that good a search engine I guess it would be easier to run the MySQL query using SOUNDEX() and the soundex key of the input word…
Thanks for that