[Book Comment] - Programming Collective Intelligence, by Toby Segaran


I am currently reading Programming Collective Intelligence, by Toby Segaran, from O'Reilly for one of my courses, and I really recommend reading it. It is a really good start for people who are trying to understand topics like recommendation systems, clustering, indexing, search engines, and so on. It even provides the reader with the Python code for each topic, with a detailed illustration of how the code works.

It is really suitable for students and doesn't require more than a knowledge of basic programming concepts.

However, while reading Chapter 3 of the book, which mainly discusses discovering clusters/groups/structures in a dataset, specifically in the "Counting Words in a Feed" part, I noticed something in the code generatefeedvector.py that will cause it to generate wrong data.

What the code should do is parse a number of blogs (the URLs are given in feedlist.txt) and generate the word counts for each blog; then, for each word (in the set of all words found in all blogs), it should calculate the number of blogs where that word appeared (apcount).
Since a word like "the" will appear in almost all of the blogs, and a word like "flim-flam" might only appear once in one blog, we do not really want to include them in the set of words that we will be using for clustering!
This is because words like "the" do not distinguish one blog from another, and words like "flim-flam" are so rare that they can't be considered topic-related words! (The main goal of generating this word vector is to use it later in grouping blogs according to their topics.)
For that, the author suggests computing, for each word, the percentage of blogs where the word appeared, and selecting only the words whose percentages fall between a minimum and a maximum value.
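As a rough sketch of that selection step (the function name and the 0.1/0.5 thresholds here are my own illustrative assumptions, not necessarily the book's exact code):

```python
# Sketch: keep only words whose appearance fraction falls strictly
# between a lower and an upper bound. Thresholds are illustrative.
def select_words(apcount, total_blogs, lower=0.1, upper=0.5):
    wordlist = []
    for word, bc in apcount.items():
        frac = float(bc) / total_blogs
        if lower < frac < upper:
            wordlist.append(word)
    return wordlist

# Example: 'the' appears almost everywhere, 'flim-flam' once out of 100.
apcount = {'the': 98, 'flim-flam': 1, 'python': 30}
print(select_words(apcount, 100))  # only 'python' survives the filter
```

With these thresholds, the overly common and the overly rare words are both dropped, which is exactly the behavior the chapter is after.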

So far so good! Here comes the part I'm concerned about. The following line of code is written to compute this percentage/ratio:
frac = float(bc) / len(feedlist)
where bc is the number of blogs in which a word appeared and feedlist is the list of all URLs. Looks fine, huh? But actually it is not! While parsing the feeds, some blog URLs were broken, and the number of these blogs was quite significant. For the feedlist I was using, the total number of URLs was 100, and the number of URLs that the code failed to parse was 20.
This means that all the percentages are calculated incorrectly, and as a consequence some significant words were not included.

What I suggest is a tiny little modification: we count the number of blogs that were not successfully parsed and change how we compute the fraction to be like this:
frac = float(bc) / (len(feedlist) - failed)
where failed is the number of "unparsed" blogs.
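A quick back-of-the-envelope check with the numbers from my run (100 URLs, 20 failed) shows why this matters. The bc value of 9 and the 0.1 minimum cutoff below are illustrative choices of mine:

```python
# Numbers from my run: 100 URLs in feedlist, 20 failed to parse.
total_urls = 100
failed = 20
bc = 9  # say the word appeared in 9 of the 80 successfully parsed blogs

old_frac = float(bc) / total_urls            # 0.09
new_frac = float(bc) / (total_urls - failed) # 0.1125

# With a minimum cutoff of 0.1, old_frac wrongly drops the word,
# while new_frac correctly keeps it.
print(old_frac, new_frac)
```

Because the failed feeds inflate the denominator, every fraction comes out too small, so words that should pass the minimum cutoff get silently thrown away.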

To calculate failed, we will just add failed += 1 to the except block of the parsing part. Something like this:

failed = 0
for feedurl in feedlist:
    try:
        (title, wc) = getwordcounts(feedurl)
        wordcounts[title] = wc
        for (word, count) in wc.items():
            apcount.setdefault(word, 0)
            if count > 1:
                apcount[word] += 1
    except:
        print 'Failed to parse feed %s' % feedurl
        failed += 1
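To make the idea concrete, here is a minimal, self-contained sketch of the same loop, written for Python 3 (the book's examples are Python 2). The stubbed-out getwordcounts is my stand-in for the book's feedparser-based function, and the URLs are made up:

```python
def getwordcounts(feedurl):
    # Stand-in for the book's feedparser-based function: pretend any URL
    # containing 'broken' raises, the way a dead feed would.
    if 'broken' in feedurl:
        raise ValueError('could not parse %s' % feedurl)
    return (feedurl, {'the': 10, 'python': 2})

feedlist = ['http://a.example/feed',
            'http://broken.example/feed',
            'http://b.example/feed']

apcount = {}
wordcounts = {}
failed = 0
for feedurl in feedlist:
    try:
        (title, wc) = getwordcounts(feedurl)
        wordcounts[title] = wc
        for (word, count) in wc.items():
            apcount.setdefault(word, 0)
            if count > 1:
                apcount[word] += 1
    except Exception:
        print('Failed to parse feed %s' % feedurl)
        failed += 1

# The fraction now uses only the feeds that actually parsed.
for word, bc in apcount.items():
    frac = float(bc) / (len(feedlist) - failed)
    print(word, frac)
```

Running this, one of the three feeds fails, failed ends up as 1, and the fractions are computed over the two feeds that parsed rather than all three.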

Again, I highly recommend reading the book! And let me know what you think :D

All the best,

The first post


"Passionate!", The ultimate answer I always give ,without thinking twice, whenever I am asked to describe myself in one word, anything else about me comes second....

I am Gihad, a sister, daughter, friend, young self-published Arab writer, and a computer science grad student taking her first steps as a researcher. I now live in Istanbul, the mystical city where East meets West, but I am originally from Egypt.

So yeah! This is me, and here is my blog. Since this is not my only blog, and those who know me as an Arabic literature blogger might get confused, here I will be writing the thoughts of the other side of me, "the computer science geeky girl". Here I will post my commentaries on books and papers I read, open topics for discussion, or write notes about seminars and sessions I attend.

Here I will be sharing a "bit" of me, so stay tuned!