"Did You Mean?" for HIP 2 or 3

[update: we’re now using Aspell and the Text::Aspell module]

HIP 4 contains a spellchecking “did you mean?” facility which, although not as powerful as Google’s, is certainly a step in the right direction.  One of the basic rules of designing any web-based system that supports searching or browsing is to always give the user choices, even if they have gone down a virtual one-way street and hit a dead end.

Unfortunately it’s going to be another few months before SirsiDynix release the UK enhanced version of HIP 4 for beta testing, so I thought I’d have a stab at adding the facility to our existing HIP 3.04 server.

Fortunately Perl provides a number of modules for this kind of thing, including String::Approx, Text::Metaphone, and Text::Soundex.

String::Approx is good at catching simple typos (e.g. Hudersfield or embarassement) whereas the latter two modules attempt to find “sounds like” matches — for example, when given batched, Text::Metaphone suggests scratched, thatched and matched.
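To make the typo-catching idea concrete, here is a minimal sketch of edit-distance matching. It is written in Python rather than Perl and is not how String::Approx is implemented; the function names, the sample word list, and the `max_distance` threshold are all assumptions for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def close_matches(term, words, max_distance=2):
    """Return the words within max_distance edits of term, closest first."""
    term = term.lower()
    scored = sorted((edit_distance(term, word), word) for word in words)
    return [word for distance, word in scored if distance <= max_distance]


# Hypothetical keyword list; a real one would come from the catalogue.
WORDS = ["huddersfield", "embarrassment", "sheffield", "matched"]

if __name__ == "__main__":
    print(close_matches("Hudersfield", WORDS))
    print(close_matches("embarassement", WORDS))
```

A small threshold keeps the suggestions tight; raising it trades precision for more (and noisier) candidates.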

To set something like this up, you need to have a word list.  You could download one (e.g. a list of dictionary words), but it makes more sense to generate your own — in my case I’ve parsed Horizon’s title table to create a list of keywords and frequency.  That’s given me a list of nearly 67,000 keywords that all bring up matches in either a general or title keyword search.
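As a rough illustration of building such a list, the sketch below counts keywords across a handful of titles. It is Python rather than Perl, the titles are made up, and the tokenising rule (runs of ASCII letters) is an assumption; the real list was parsed from Horizon’s title table, whose layout is not shown here.

```python
import re
from collections import Counter


def build_keyword_counts(titles):
    """Count how often each lowercased keyword appears across the titles."""
    counts = Counter()
    for title in titles:
        # Assumed tokenising rule: a keyword is a run of ASCII letters.
        counts.update(re.findall(r"[a-z]+", title.lower()))
    return counts


# Hypothetical titles standing in for rows from the title table.
SAMPLE_TITLES = [
    "Nursing care of the diabetic patient",
    "Research methods in nursing",
    "The management of diabetes",
]

if __name__ == "__main__":
    for word, count in build_keyword_counts(SAMPLE_TITLES).most_common(5):
        print(word, count)
```

Because every word in the list came from a real title, any suggestion drawn from it should bring back at least one hit when searched.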

Once I’d got the keyword list, I ran it through Text::Metaphone and Text::Soundex to generate the relevant phonetic values — doing that in advance means that your spellchecking code can run faster as it doesn’t need to generate the values again for each incoming request.
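The precomputation step might look something like this. The sketch is in Python, not Perl, and implements the classic American Soundex rules by hand, so it may differ from Text::Soundex in edge cases; the word list and the names `soundex` and `build_index` are assumptions.

```python
from collections import defaultdict

# Letter-to-digit table for classic Soundex; vowels, h, w and y have no code.
CODES = {}
for digit, letters in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                       ("4", "l"), ("5", "mn"), ("6", "r")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex key for word ('' if it has no letters)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    key = letters[0].upper()
    last = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != last:
            key += code
        if c not in "hw":  # h and w do not separate letters with the same code
            last = code
    return (key + "000")[:4]


def build_index(words):
    """Group words by Soundex key once, so each lookup is a dictionary access."""
    index = defaultdict(list)
    for word in words:
        index[soundex(word)].append(word)
    return index


# Hypothetical keyword list.
INDEX = build_index(["cunningham", "diabetes", "robert", "rupert"])

if __name__ == "__main__":
    print(INDEX[soundex("Cuningham")])
    print(INDEX[soundex("diabetees")])
```

Storing the keys alongside the words means an incoming search term only needs one key computed for it, followed by a single lookup.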

Next up, I wrote an Apache mod_perl handler to create the suggestions from a given search term.  As String::Approx can often give the best results, the term is run against that first.  If no suggestions are found, the term is run against Text::Metaphone and then Text::Soundex in turn to find broader “sounds like” suggestions.
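The fallback order can be sketched as a simple chain of matchers, tried in turn until one returns something. This is Python, not the mod_perl handler itself, and the three matchers are stubs with canned answers rather than real module output; only the control flow is the point.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Matcher = Callable[[str], List[str]]


# Stub matchers with canned answers; real ones would consult the keyword list.
def approx_stub(term: str) -> List[str]:
    return {"hudersfield": ["huddersfield"]}.get(term, [])


def metaphone_stub(term: str) -> List[str]:
    return {"batched": ["scratched", "thatched", "matched"]}.get(term, [])


def soundex_stub(term: str) -> List[str]:
    return []  # placeholder for the broadest "sounds like" stage


def suggest(term: str,
            stages: Sequence[Tuple[str, Matcher]]) -> Tuple[Optional[str], List[str]]:
    """Return (stage name, suggestions) from the first stage that finds anything."""
    for name, matcher in stages:
        found = matcher(term.lower())
        if found:
            return name, found
    return None, []


# Strictest matcher first, broadest last.
STAGES = [("approx", approx_stub),
          ("metaphone", metaphone_stub),
          ("soundex", soundex_stub)]

if __name__ == "__main__":
    print(suggest("Hudersfield", STAGES))
    print(suggest("batched", STAGES))
```

Stopping at the first stage that succeeds keeps the phonetic matchers, which cast a wider net, from drowning out a good typo correction.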

Assuming that one of the modules comes up with at least one suggestion, that suggestion gets displayed in HIP.

There’s still more work to do, as the suggestions only appear for a failed single keyword.  Handling two misspelled words (or more) is technically challenging — what’s the best method of presenting all the possible options to a user?  You could just give them a list of possibilities, but I’d prefer to give them something they can click on to initiate a new search.

12 comments
  1. Davey P said:

    It’s been interesting watching the first 24 hours of suggestions — whilst some are the result of simply not having anything on the catalogue that matches the search term, many are simple typos.

    We have a large number of Human & Health Science students and many of the typos are related to that topic and (touch wood) the correct spellings are being suggested.

    If I was to walk up to the Information Desk and ask where the books on “diabetees” are, I wouldn’t expect the member of staff to reply “Sorry – we don’t have anything on that”. Why should the OPAC be any different?

    Getting back to the coding, I’ve now indexed our author table. That means that any failed author keyword searches now get a more relevant selection of names (e.g. Cuningham).

  2. Davey P said:

    Another day, another tweak!

    All three modules have their strengths and weaknesses, so I decided to rejig the code and run the search term through each. If all three suggest the same word, then it’s usually spot on.
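One way to sketch that "agreement" idea is to count how many matchers proposed each word and rank by the count. The Python below is an illustration only; the suggestion lists are invented, and the function name `rank_by_agreement` is an assumption.

```python
from collections import Counter


def rank_by_agreement(suggestion_lists):
    """Rank words by how many matchers suggested them (ties broken alphabetically)."""
    votes = Counter()
    for suggestions in suggestion_lists:
        votes.update(set(suggestions))  # each matcher gets one vote per word
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical output from three different matchers for one search term.
FROM_APPROX = ["pneumonia", "ammonia"]
FROM_METAPHONE = ["pneumonia"]
FROM_SOUNDEX = ["pneumonia", "pneumonic"]

if __name__ == "__main__":
    for word, count in rank_by_agreement([FROM_APPROX, FROM_METAPHONE, FROM_SOUNDEX]):
        print(word, count)
```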

    Heck, it even manages to get my horrendous misspelling of “pnuemonia” spot on!

    newmonia

    I’ve also replaced the Text::Metaphone with the newer Text::DoubleMetaphone module.

    I still need to find a way of making suggestions for multiple search terms.

  3. casey said:

    For ours I used the contents of the word table, which contains all words indexed by Horizon, as the word list. Are you using the separate word lists based upon the index? (so if you misspell while doing an author KW search it uses the author list, but with a title KW it uses the title list)?

    The problem with multiple words is that if you provide a list of all possible suggestions, that sorta sets up the expectation that all those possible combinations return hits. Since that would require performing every possible search and seeing if you get any hits, ours just returns the most likely suggestion for each word. So if you search on “Mork Twian”, it only gives you the one suggestion of “Mark Twain”. That’s far from ideal, but it’s what Google does (and it was the easiest to code…)
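A minimal sketch of the "one best suggestion per word" approach follows. It uses Python’s `difflib.get_close_matches` as a stand-in matcher, not whatever the Horizon-based code used; the known-word list and the 0.6 cutoff are assumptions.

```python
import difflib

# Hypothetical list of words known to the catalogue.
KNOWN_WORDS = ["mark", "twain", "huckleberry", "finn"]


def best_word(word, known=KNOWN_WORDS):
    """Return word itself if known, else its single closest match, else word unchanged."""
    word = word.lower()
    if word in known:
        return word
    matches = difflib.get_close_matches(word, known, n=1, cutoff=0.6)
    return matches[0] if matches else word


def suggest_phrase(query):
    """Correct each word independently and rejoin them into one suggestion."""
    return " ".join(best_word(word) for word in query.split())


if __name__ == "__main__":
    print(suggest_phrase("Mork Twian"))
```

Correcting each word independently avoids the combinatorial blow-up, at the cost of never checking that the corrected phrase as a whole returns hits.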

  4. Davey P said:

    Hi Casey!

    Yep – I knocked up a quick and dirty Perl script that runs through the title, author, and subject tables indexing every word. All the words were then placed in a MySQL table, along with the soundex and metaphone representations of the word, and also the number of times the word appears in each of the Horizon tables.

    By checking the number of occurrences, the suggestions can then be based on a single index (or not in the case of a general keyword search). For example, searching for tubercolosis will give different suggestions based on which keyword index you’ve chosen to search. If you search the author keyword, then the best suggestion is Tiberghien.
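The per-index filtering can be sketched as below. This is Python with an in-memory dictionary standing in for the MySQL table; the counts are invented, and the names `COUNTS` and `filter_by_index` are assumptions.

```python
# Hypothetical per-index occurrence counts for each word.
COUNTS = {
    "tuberculosis": {"title": 42, "subject": 57, "author": 0},
    "tiberghien": {"title": 0, "subject": 0, "author": 3},
}


def filter_by_index(candidates, index=None):
    """Keep candidates that occur in the chosen index, most frequent first.

    index=None models a general keyword search and sums across all indexes.
    """
    def weight(word):
        per_index = COUNTS.get(word, {})
        if index is None:
            return sum(per_index.values())
        return per_index.get(index, 0)

    usable = [word for word in candidates if weight(word) > 0]
    return sorted(usable, key=weight, reverse=True)


if __name__ == "__main__":
    candidates = ["tuberculosis", "tiberghien"]
    print(filter_by_index(candidates, "author"))
    print(filter_by_index(candidates))
```

Dropping candidates with a zero count in the chosen index is what keeps a suggestion from leading to another empty result.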

    When I get a spare half hour, I’m going to index our journal titles and add those to the code.

    I’ve been thinking along the lines of your suggestion for multiple words – especially if all but one of the words appear in the list of known words.

    What’s impressed me the most is the speed at which Perl can rattle through the 90,000+ words that now appear in the word index, doing a String::Approx comparison of each with the search term: most suggestions appear in about a second.

    …and so my love affair with Perl continues 😀

  5. Davey P said:

    I’ve finally figured out how to get Text::Aspell working, so I’m also testing that.

    My initial reaction is that it’s giving even better results than I’ve been able to generate so far 🙂

  6. Davey P said:

    A bit more tweaking and it’s starting to handle multiple keywords:

    newmonia tweetment care

    I’m still not sure it’s the best way of handling multiple keywords, but at least it gives the user several options and (hopefully) a link to what they really meant.

    Unfortunately it does highlight the number of bibliographic typos we have in our catalogue – I’m not sure anyone would want to study mangement or busines 😀

  7. Davey P said:

    Hi Camden

    I think my only reservation about using Google is that it doesn’t guarantee that the correctly spelled term will find results on the catalog.
