B659: Web Mining

Introduction

It is commonly assumed that weblogs (news or diary pages with entries presented in reverse-chronological order [example]) have a "neighborhood", that is, a community of other blogs that discuss related matters. We question this assumption. To test it, we collected a large corpus of weblog data and compared the performance of link-based and content-based clustering methods on the same data. Our working hypothesis was that these methods would let us infer the presence of both topical and link-based communities.

Data Collection

Sampled 5000 weblog URLs using http://blo.gs.

Collected their RDF/XML feeds, as well as the 'blogrolls'.

Both a feed and a blogroll were found for only 810 of these blogs.

Collected 15,000 more blogs in order to increase the sample size. As a result, 2740 blogs were identified that have both a blogroll and a feed file.

Content-based Analysis

Extracted content from the collected feed files [example].

Created a blog-term matrix by removing stopwords, stemming, removing short content (less than 50 bytes), and removing low DF (document frequency) terms. The size of the resulting matrix is 2583 x 19398 (number of blogs by number of terms).
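
The preprocessing steps above can be sketched as follows. This is only a minimal illustration under stated assumptions: the stopword list is a tiny hand-picked stand-in, crude_stem is a toy replacement for a real Porter stemmer, and the min_df threshold is hypothetical (the cutoff actually used is not stated here).

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not the one used in the project.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on"}

def crude_stem(word):
    """Toy suffix stripper standing in for a real Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOPWORDS]

def build_blog_term_matrix(blogs, min_bytes=50, min_df=2):
    """blogs: dict mapping blog URL -> extracted feed text.

    Returns (blog_ids, vocabulary, matrix), where matrix[i][j] is the
    raw count of vocabulary[j] in blog_ids[i].
    """
    # Drop blogs whose content is shorter than min_bytes.
    kept = {b: t for b, t in blogs.items() if len(t.encode("utf-8")) >= min_bytes}
    counts = {b: Counter(tokenize(t)) for b, t in kept.items()}
    # Document frequency: number of blogs containing each term.
    df = Counter(term for c in counts.values() for term in c)
    # Hypothetical threshold: keep terms that occur in at least min_df blogs.
    vocabulary = sorted(t for t, n in df.items() if n >= min_df)
    blog_ids = sorted(counts)
    matrix = [[counts[b][t] for t in vocabulary] for b in blog_ids]
    return blog_ids, vocabulary, matrix
```

At the scale reported above (2583 x 19398), a sparse representation would be preferable to the nested lists used here for readability.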

Applied entropy-based TFIDF (term frequency inverse document frequency; below) and SVD (singular value decomposition).
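
The weighting formula referred to above is not reproduced in this text. One widely used entropy-based scheme is log-entropy weighting from the LSA literature; the sketch below assumes that variant, which may differ from the one actually applied. The function name and the list-of-lists matrix layout are illustrative choices.

```python
import math

def log_entropy_weight(matrix):
    """Apply log-entropy weighting to a raw count matrix.

    matrix[i][j] is the count of term j in document i.
    Local weight:  log(1 + tf_ij)
    Global weight: 1 + sum_i(p_ij * log p_ij) / log(n_docs),
                   where p_ij = tf_ij / (total count of term j).
    """
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    global_w = []
    for j in range(n_terms):
        gf = sum(row[j] for row in matrix)
        entropy = 0.0
        for row in matrix:
            if row[j] > 0:
                p = row[j] / gf
                entropy += p * math.log(p)
        # 1.0 for a term concentrated in one document,
        # 0.0 for a term spread evenly over all documents.
        g = 1.0 + entropy / math.log(n_docs) if n_docs > 1 else 1.0
        global_w.append(g)
    return [
        [math.log(1 + row[j]) * global_w[j] for j in range(n_terms)]
        for row in matrix
    ]
```

Under this scheme a term that appears equally often in every blog receives weight zero, which is the same intuition as inverse document frequency.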

Applied the K-means clustering algorithm (where number of clusters was empirically set to 20). See the 10 most discriminative words for each cluster (based on chi-square statistics).
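
As an illustration of how discriminative words can be ranked once cluster labels are available, the sketch below scores each term with the 2x2 chi-square statistic described by Yang and Pedersen (1997). The function names and the list-based inputs are assumptions made for the example, not the code used in the project.

```python
def chi_square(a, b, c, d):
    """Chi-square for a 2x2 term/cluster contingency table.

    a: docs in the cluster containing the term
    b: docs outside the cluster containing the term
    c: docs in the cluster without the term
    d: docs outside the cluster without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def top_terms(matrix, vocabulary, labels, cluster, k=10):
    """Return the k terms with the highest chi-square score for one cluster."""
    in_cluster = [lab == cluster for lab in labels]
    n_in = sum(in_cluster)
    scores = []
    for j, term in enumerate(vocabulary):
        a = sum(1 for row, inc in zip(matrix, in_cluster) if inc and row[j] > 0)
        b = sum(1 for row, inc in zip(matrix, in_cluster) if not inc and row[j] > 0)
        c = n_in - a
        d = len(matrix) - n_in - b
        scores.append((chi_square(a, b, c, d), term))
    # Highest score first; ties broken alphabetically for determinism.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [term for _, term in scores[:k]]
```

Note that chi-square is symmetric: a term that is absent from a cluster but common everywhere else scores as highly as one that is concentrated in the cluster, so a practical ranking may also need to check the direction of the association.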

Link Analysis using blogrolls

URL normalization:

Discarded those URLs whose protocols are not specified, which seem to be extraction errors.

Removed trailing slash(es) (e.g., http://cnn.com/ → http://cnn.com)

Removed port numbers (e.g., http://emrooz.ws:81 → http://emrooz.ws)

Converted all characters to lowercase.

Decoded percent-encoded characters (e.g., '%7e' → '~')

Preserved only the host name and the first directory, which is often a user name. However, when the first directory is an English word, it is removed as well, since it is unlikely to be a user name (e.g., http://abcnews.go.com/sections/politics → http://abcnews.go.com). English words are identified by comparing them with WordNet entries.
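
The normalization rules above can be sketched with the Python standard library. This is a hedged illustration rather than the original implementation: ENGLISH_WORDS is a tiny hypothetical stand-in for the WordNet lookup, and normalize_url is a name chosen for the example.

```python
from urllib.parse import urlsplit, unquote

# Assumption: a small stand-in for the WordNet lookup described above.
ENGLISH_WORDS = {"sections", "news", "blog", "archives", "politics"}

def normalize_url(url, english_words=ENGLISH_WORDS):
    """Return the normalized URL, or None if no protocol is specified."""
    # Decode percent-encoded characters and lowercase everything.
    url = unquote(url.strip()).lower()
    parts = urlsplit(url)
    if not parts.scheme or not parts.hostname:
        return None  # treated as an extraction error
    # parts.hostname excludes the port number.
    segments = [s for s in parts.path.split("/") if s]
    first = segments[0] if segments else ""
    # Keep the first directory only if it does not look like an English word.
    if first and first not in english_words:
        return f"{parts.scheme}://{parts.hostname}/{first}"
    return f"{parts.scheme}://{parts.hostname}"
```

For instance, normalize_url("http://emrooz.ws:81") yields "http://emrooz.ws", and a URL without a protocol such as "www.example.com/foo" is discarded (None).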

Created an adjacency matrix, which turned out to be extremely sparse. (Around 2400 of the 2740 blogs were isolated!)

To examine the relationship between blogrolls and content, compared the geodesic distance between every pair of blogs with the content-based similarity between them. See the scatter plot of geodesic distance against content similarity between blogs.
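
Geodesic distance here means the length of the shortest path between two blogs in the blogroll graph. A minimal sketch using breadth-first search follows; it assumes links are treated as undirected and that only links between sampled blogs count, neither of which is stated explicitly above. The function names are illustrative.

```python
from collections import deque

def build_adjacency(blogrolls):
    """blogrolls: dict mapping blog -> iterable of normalized URLs it links to.

    Only links whose target is also a sampled blog are kept.
    Assumption: edges are treated as undirected.
    """
    adj = {b: set() for b in blogrolls}
    for src, targets in blogrolls.items():
        for dst in targets:
            if dst in adj and dst != src:
                adj[src].add(dst)
                adj[dst].add(src)
    return adj

def geodesic_distances(adj, source):
    """Breadth-first search: shortest path length from source to each reachable blog."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist
```

Blogs missing from the returned dictionary are unreachable from the source; isolated blogs are those whose adjacency set is empty.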

Outlink Structure (blogrolls)

Using the blogroll data, created a co-reference matrix R, where R(x,y)=n means blog x has n links to blog y. (x must be in our sample set of blogs, whereas y need not be.) The size of R is 2,740 x 32,221 (number of blogs by number of blogrolls).
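
A matrix like R can be assembled directly from the normalized blogroll lists. The sketch below is an assumption-laden illustration: it stores each row as a sparse dictionary, and build_coreference is a hypothetical helper name.

```python
from collections import Counter

def build_coreference(blogrolls):
    """blogrolls: dict mapping sampled blog x -> list of normalized target URLs.

    Returns (blogs, targets, rows), where rows[i] maps a column index j
    to n = R(blogs[i], targets[j]), the number of links from x to y.
    Targets need not be in the sample.
    """
    blogs = sorted(blogrolls)
    targets = sorted({url for links in blogrolls.values() for url in links})
    col = {url: j for j, url in enumerate(targets)}
    rows = []
    for b in blogs:
        counts = Counter(blogrolls[b])
        rows.append({col[u]: n for u, n in counts.items()})
    return blogs, targets, rows
```

Storing only the nonzero entries matters at this scale, since a dense 2,740 x 32,221 matrix would hold roughly 88 million cells, most of them zero.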

Applied TFIDF and SVD to the matrix. We consistently treated the co-reference matrix as if it were a term-document matrix, with blogs and blogrolls corresponding to documents and terms, respectively.

Applied K-means. See the 10 most discriminative words for each cluster.

How are outlink-based clusters related to content-based clusters?

MDS map colored by content-based clusters.

Correlation between outlink- and content-based similarities between blogs.
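
One way to compute such a correlation is to take the similarity of every pair of blogs in each representation and correlate the two lists. The sketch below assumes cosine similarity and Pearson correlation; the measures actually used are not specified above, so treat both as illustrative choices.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pairwise_similarities(vectors):
    """Similarity for every unordered pair of rows, in a fixed pair order."""
    return [
        cosine(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    ]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0
```

Because both similarity lists are generated in the same pair order, pearson(pairwise_similarities(content_vectors), pairwise_similarities(link_vectors)) compares like with like, provided the rows of the two matrices refer to the same blogs in the same order.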

Inlink Structure

Similarly, examined the use of inlinks that point to at least one of the 2,740 blogs. These inlinks were collected with Google backlink search.

After normalizing the inlink URLs, created a co-citation matrix C, where C(x,y)=1 means that web page (or blog) x has a link to blog y. The size of the resulting matrix is 4,518 x 1,209 (number of inlinks by number of blogs).

Applied TFIDF and SVD to the matrix, then K-means. See the 10 most discriminative words for each cluster.

MDS map colored by content-based clusters.

Correlation between inlink- and content-based similarities between blogs.

Conclusions

Overall, although geodesic distances and content similarities between blogs were found to be moderately correlated, LSA on content appeared to be more effective on our data set than inlink- or outlink-based clustering. The results might have been significantly different with data gathered in a different manner. In particular, the matrices used to calculate co-citation and co-reference would probably have been much less sparse with a link-based sample than with a random one. Applying these techniques to other samples of data may help validate the decisions we made in the course of this research.

References

Filippo Menczer (To appear). Lexical and Semantic Clustering by Web Links. Journal of the American Society for Information Science and Technology.

Gary Flake, Steve Lawrence, C. Lee Giles, and Frans Coetzee (2002). Self-Organization of the Web and Identification of Communities. IEEE Computer, vol. 35, no. 3, pp. 66–71.

Scott Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391–407.

Martin Porter (1980). An algorithm for suffix stripping. Program, vol. 14, no. 3, pp. 130–137.

Yiming Yang and Jan O. Pedersen (1997). A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97), pp. 412–420.

Justin Zobel and Alistair Moffat (1998). Exploring the Similarity Space. ACM SIGIR Forum, vol. 32, no. 1, pp. 18–34.

Elijah and Kaz
Last updated: Mon, 03 May 2004 19:13:47 GMT