The "Harry Potter Effect"

If you look at the overall keyword cloud for HotStuff 2.0, you can see librar* bloggers like to talk about libraries, books, reading, books and libraries.

When some things are more popular than others, this gives rise to Tim Spalding’s “Harry Potter Effect” — everyone’s got the HP books on their shelves, so, if you’re not careful, they end up becoming the top recommendations/suggestions for almost any type of book.

In our case, in many of the keyword clouds, “library” and “book” keep on coming out as the largest words. Whilst this is an accurate reflection of what the blogs are talking about, it does hide some of the more interesting and relevant keywords.

In honour of Mr Spalding, and at the risk of getting sued silly by Mrs Rowling, I’ve added a bit of JavaScript to toggle between a full version of the cloud (“incrementum!”) and one that can sometimes bring out more interesting/relevant keywords (“redactum!”).*

As an example, the full keyword cloud for “presentation” has “library” as the largest word…

…click on “redactum!” and you get a cloud with some more interesting words such as “audiences” and “interaction”…

* apologies for the cod Latin!

5 comments
  1. That’s fascinating, Dave. I’m curious to know how you gauged relevancy, as this shift is clearly quite semantically meaningful (or at least I think it is!)

  2. I did mean to blog a bit more about how it’s done, but it was getting a bit late!

    The full cloud is generated by finding all the blog posts that contain the word “presentation” and then pulling together all of the words from those blog posts (ignoring common stop words). The more blog posts there are, the more likely it is that the most popular words (library, book, etc) will start to dominate the cloud at the expense of potentially more relevant/interesting words.

    The second cloud adds an extra step — after finding all of the words used in those posts and totalling up the number of times they occur (which is used to determine the font size in the first cloud), that total is then divided by the number of times that word appears in all of the blog posts (i.e. not just those with “presentation”).

    The result, for each word, is a value between 0 and 1. For argument’s sake, let’s say that the word “audiences” appears in 5 blog posts and all of those 5 also contain “presentation”: the result would be 1 (5 divided by 5).

    That means the second cloud favours words which are rarely used outside of blog posts containing “presentation”.
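A rough sketch of that ratio in Python may help. The toy posts, the `relevance_ratios` name and the choice to count word occurrences (rather than posts) are all assumptions made for illustration, not HotStuff's actual code.

```python
from collections import Counter

# Hypothetical toy corpus: each post is already tokenised, stop words removed.
posts = [
    ["presentation", "audiences", "library", "slides"],
    ["presentation", "library", "audiences", "book"],
    ["library", "book", "catalogue"],
    ["book", "library", "reading"],
]


def relevance_ratios(posts, keyword):
    """For each word in posts containing `keyword`, divide its count in
    those posts by its count across all posts."""
    overall = Counter(word for post in posts for word in post)
    matching = Counter(
        word for post in posts if keyword in post for word in post
    )
    return {word: matching[word] / overall[word] for word in matching}


ratios = relevance_ratios(posts, "presentation")
for word, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{word}: {ratio:.2f}")
```

In this toy data “audiences” only ever appears alongside “presentation”, so it scores 1, while “library” appears everywhere and scores lower.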

    The disadvantage of this approach is that the final value doesn’t reflect the number of times the word actually occurs — anything divided by itself is always 1. To try and compensate for that, the final value is then boosted a little by multiplying it with something.

    It took a little while to figure out what that “little something” should be, but using the square root of the square root of the square root of the square root(!) of the total number of times the word appears in all blog posts (in other words, its sixteenth root) seemed to work well :-D

    So, let’s say both “audiences” and “comfortable” only occur in those “presentation” posts, and they appear 25 times and 5 times respectively. In the first full cloud, “audiences” would be bigger than “comfortable” simply because it appears more times. In the second cloud, they’d both be the same size (25/25=1 and 5/5=1).

    So, for the second cloud, both values are then multiplied by that little booster:

    audiences = 1 x 1.2228 = 1.2228
    comfortable = 1 x 1.1058 = 1.1058
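The arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation as described in the comment; the function names `booster` and `boosted_score` are made up for the example.

```python
import math


def booster(total_count):
    """Take the square root four times over (i.e. the sixteenth root)."""
    value = float(total_count)
    for _ in range(4):
        value = math.sqrt(value)
    return value


def boosted_score(count_in_matching_posts, count_in_all_posts):
    """Ratio of occurrences, multiplied by the booster of the overall count."""
    ratio = count_in_matching_posts / count_in_all_posts
    return ratio * booster(count_in_all_posts)


audiences = boosted_score(25, 25)
comfortable = boosted_score(5, 5)
print(round(audiences, 4), round(comfortable, 4))
```

Because the sixteenth root grows so slowly, the booster nudges frequent words ahead without letting raw counts take over again.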

    Out of interest, we’re using a similar calculation to generate the book recommendations in the OPAC. Doing it that way stops heavily borrowed books from dominating the suggestions.

  3. Matthew Phillips said:

    This is an interesting problem. What you’re really wanting is a sort of automatic stop list. The best test of the method is to turn off the stop-list and see if words like “the” are eliminated from the tag-cloud.

    I like the ideas, but I think that the solution you’ve adopted for the boosting is theoretically unjustified. I think what you should be looking at is the probabilities that words appear significantly in the same blog postings. (Strictly speaking, it’s not probabilities as you *have* the full population of blog posts that you’re processing, so you know exactly the frequency of occurrences.)

    However, I’m not sure how exactly you need to change the method, but let me illustrate it with a similar problem from our OPAC spell checker at Dundee (see http://library.dundee.ac.uk/ ). I wanted to provide spell-check even if the user entered multiple words. The problem then is how can you make suggestions which are likely to exist in combination in your catalogue records? Especially when the aspell software can suggest two words as a replacement for a single one, e.g. if you type twowords it might suggest “towards” and “two words”.

    What I do is take the number of catalogue records the word appears in, divided by the total number of catalogue records. These can then be multiplied to give estimates for how likely two words are to appear together. (In fact, what I do is precalculate the logarithms and use addition as it’s faster.)
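As a sketch of that idea in Python: the record counts, the catalogue size and the way candidates are compared are all invented for illustration, and multiplying the frequencies treats the words as if they occurred independently, which is itself an assumption.

```python
import math

# Hypothetical figures: how many catalogue records contain each word.
TOTAL_RECORDS = 100_000
record_counts = {"towards": 1200, "two": 5000, "words": 800, "library": 9000}

# Precalculate log(frequency) once, so combining words is just addition.
log_freq = {
    word: math.log(count / TOTAL_RECORDS)
    for word, count in record_counts.items()
}


def log_score(words):
    """Sum of log frequencies, i.e. the log of the product of frequencies."""
    return sum(log_freq[word] for word in words)


# Two candidate corrections for a hypothetical query "twowords library".
candidates = [["towards", "library"], ["two", "words", "library"]]
best = max(candidates, key=log_score)
print(best)
```

Adding logarithms gives the same ranking as multiplying the raw frequencies, and avoids very small products when many words are combined.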

    So I think to be theoretically justified, you probably ought to be looking at the number of times a word appears in a blog posting, divided by the number of words in the posting and using that as a value which can then be combined in some way with the figures from other postings, probably using correlation or a measure of spread.

    For example, the word “the” would get a high score in all postings, but the words “OPAC” and “searching” would vary a lot more from posting to posting. It must be possible to use some measure to see that there is a tendency for the scores of “OPAC” and “searching” to go up and down together, and hence they are related, whereas the score for “the” is stuck at the top all the time, and is of no interest.
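One way this might be explored, purely as a sketch: compute each word's share of every posting, then compare spread and correlation. The toy postings, the use of Pearson correlation and the coefficient of variation are assumptions chosen for illustration, not a method the comment actually specifies.

```python
import math
from statistics import mean, pstdev

# Hypothetical tokenised postings, with stop words deliberately left in.
posts = [
    "the opac searching the opac the results".split(),
    "the book the reading the shelf the loan".split(),
    "the opac the searching opac searching the index".split(),
    "the library the cat the blog the post".split(),
]


def relative_freqs(word):
    """Occurrences of `word` in each posting, divided by posting length."""
    return [post.count(word) / len(post) for post in posts]


def variation(word):
    """Spread of a word's score relative to its average score."""
    freqs = relative_freqs(word)
    return pstdev(freqs) / mean(freqs)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


related = pearson(relative_freqs("opac"), relative_freqs("searching"))
print(f"opac/searching correlation: {related:.2f}")
print(f"variation of 'the': {variation('the'):.2f}")
print(f"variation of 'opac': {variation('opac'):.2f}")
```

In this toy data “the” scores highly everywhere and barely varies, whereas “opac” and “searching” rise and fall together.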

    Trouble is, I’ve forgotten so much maths I can’t follow it through to the conclusion. I just instinctively feel there is a better approach.

    There may still be a problem with the sizing of the words at the end of the day. It depends on how the correlation is measured: as I say, I can’t remember enough maths to suggest a suitable approach!

    On the other hand, a lot of this tag cloud stuff is about serendipity, so ad hoc methods are probably OK!

  4. Hi Matthew — many thanks for the comments!

    I think you are right. Both of the clouds have advantages and disadvantages, and there’s an ideal cloud lurking somewhere in between!

    I think as HotStuff collects more and more data, it’ll be possible to do more interesting things. I’ve just picked up “Programming Collective Intelligence” (Toby Segaran, O’Reilly) and that seems to contain some useful ideas.

    By the way, if anyone is interested in having a dump of the data so that they can play with it, please let me know!
