[Book Comment] - Programming Collective Intelligence, by Toby Segaran

I am currently reading Programming Collective Intelligence by Toby Segaran (O'Reilly) for one of my courses, and I really recommend it. It is a great starting point for anyone trying to understand topics like recommendation systems, clustering, indexing, search engines, and so on. It even provides the reader with the Python code for each topic, along with a detailed explanation of how the code works.

It is really suitable for students and doesn't require more than a grasp of basic programming concepts.

However, while reading Chapter 3 of the book, which mainly discusses discovering clusters/groups/structures in a dataset, specifically the part on counting the words in a feed, I noticed something in the code of generatefeedvector.py that causes it to generate wrong data.

What the code should do is parse a number of blog feeds (the URLs are given in feedlist.txt) and generate the word counts for each blog; then, for each word in the set of all words found across all blogs, it should calculate the number of blogs in which that word appeared (apcount).
Since a word like "the" will appear in almost all of the blogs, and a word like "flim-flam" might appear only once in a single blog, we do not really want to include either in the set of words that we will be using for clustering!
This is because words like "the" do not distinguish one blog from another, and words like "flim-flam" are so rare that they can't be considered topic-related words (the main goal of generating this word vector is to use it later for grouping blogs according to their topics).
For that, the author suggests computing, for each word, the percentage of blogs in which it appeared, and selecting only the words whose percentages fall between a minimum and a maximum value.
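
In case you don't have the book at hand, here is a minimal sketch of that selection step (apcount and feedlist are the variables described above; the 10%/50% cutoffs are from memory, so treat them as illustrative):

wordlist = []
for w, bc in apcount.items():
    # fraction of all listed feeds in which the word w appeared
    frac = float(bc) / len(feedlist)
    # keep only words that are neither too common nor too rare
    if frac > 0.1 and frac < 0.5:
        wordlist.append(w)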

So far so good! Here comes the part I'm concerned about: the line that computes this percentage/ratio:
frac=float(bc)/len(feedlist)
where bc is the number of blogs in which a word appeared and feedlist is the list of all URLs. Looks fine, huh? But actually it is not, because while parsing the feeds some blog URLs turned out to be broken, and the number of these blogs was quite significant. For the feedlist I was using, the total number of URLs was 100, and the number of URLs that the code failed to parse was 20.
This means that all the percentages are calculated incorrectly, and as a consequence some significant words are not included.
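
To make the skew concrete, here is a toy calculation with the numbers from my run (100 URLs, 20 failed to parse); the per-word count bc = 9 is made up for illustration:

bc = 9                              # suppose a word appears in 9 of the 80 parsed blogs
total, failed = 100, 20             # numbers from my feedlist run
print float(bc) / total             # 0.09   -> rejected by a 10% lower cutoff
print float(bc) / (total - failed)  # 0.1125 -> should have been kept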

What I suggest is a tiny little modification: we count the number of blogs that were not successfully parsed and change how we compute the fraction to be like this:
frac = float(bc) / (len(feedlist) - failed)
where failed is the number of "unparsed" blogs.
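
Applied to the selection sketch above, only the denominator changes:

wordlist = []
for w, bc in apcount.items():
    # divide by the number of feeds that were actually parsed
    frac = float(bc) / (len(feedlist) - failed)
    if frac > 0.1 and frac < 0.5:
        wordlist.append(w)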

To calculate failed, we just add failed += 1 to the except block of the parsing part and initialize failed to zero before the loop. Something like this:

failed = 0  # number of feeds that could not be parsed
for feedurl in feedlist:
    try:
        # getwordcounts returns the blog title and a word -> count dictionary
        (title, wc) = getwordcounts(feedurl)
        wordcounts[title] = wc
        for (word, count) in wc.items():
            apcount.setdefault(word, 0)
            # only count this blog if the word appears in it more than once
            if count > 1:
                apcount[word] += 1
    except:
        print 'Failed to parse feed %s' % feedurl
        failed += 1

Again, I highly recommend reading the book! Let me know what you think :D

All the best,
Gihad


Posted on Tuesday, November 11, 2014.
