B659: WEB MINING
- It is commonly assumed that weblogs (news or diary pages with
entries presented in reverse-chronological order [example]) have a
"neighborhood", or community, of other blogs which discuss related
matters. We question this assumption: we have collected a large
corpus of weblog data and compared the performance of link-based and
content-based clustering methods on the same data. Our starting
assumption is that these methods will let us infer the presence of
both topical and link-based communities.
- Sampled 5000 weblog URLs using http://blo.gs.
- Collected their RDF/XML feeds, as well as the 'blogrolls'.
- Found both a feed and a blogroll for only 810 of these blogs.
- Collected 15,000 more blogs in order to increase the sample
size. As a result, 2740 blogs were identified that have both
blogrolls and feed files.
- Extracted content from the collected feed files
[example].
- Created a blog-term matrix by removing stopwords, stemming,
removing short content (less than 50 bytes), and removing low DF
(document frequency) terms. The size of the resulting matrix is
2583 x 19398 (number of blogs by number of terms).
- Applied entropy-based TFIDF (term frequency-inverse document
frequency) weighting and SVD (singular value decomposition).
- Applied the K-means clustering algorithm (the number of clusters
was empirically set to 20). See the 10 most discriminative words
for each cluster (based on chi-square statistics).
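The exact weighting formula is not reproduced on this page, so the
following is only a minimal sketch of one common entropy-based scheme
(log-entropy weighting, often used before SVD in LSA). The function
name and the toy documents are hypothetical; the SVD and K-means steps
are not shown.

```python
import math
from collections import Counter, defaultdict


def log_entropy_weights(docs):
    """Weight term counts with a log-entropy scheme (an assumed formulation).

    docs: list of token lists, one per blog (already stopped and stemmed).
    Returns one {term: weight} dict per blog.
    """
    n_docs = len(docs)
    if n_docs < 2:
        raise ValueError("need at least two documents")

    counts = [Counter(doc) for doc in docs]

    # Global frequency of each term across the whole collection.
    global_freq = Counter()
    for c in counts:
        global_freq.update(c)

    # Sum of p * log(p) per term, where p = tf / global frequency.
    ent = defaultdict(float)
    for c in counts:
        for term, tf in c.items():
            p = tf / global_freq[term]
            ent[term] += p * math.log(p)

    # Global weight is 1 for a term concentrated in one blog and
    # 0 for a term spread evenly over all blogs.
    global_weight = {
        term: 1.0 + ent[term] / math.log(n_docs) for term in global_freq
    }

    # Local weight is log(1 + tf).
    return [
        {term: math.log(1 + tf) * global_weight[term] for term, tf in c.items()}
        for c in counts
    ]


toy_docs = [["blog", "news"], ["blog", "cats"], ["blog", "news"]]
weights = log_entropy_weights(toy_docs)
```

In this toy input the term that occurs equally in every blog gets a
weight of (numerically) zero, which is the intended effect: such terms
carry no information for separating clusters.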
Link Analysis using blogrolls
- URL normalization:
- Discarded those URLs whose protocols are not specified, which
seem to be extraction errors.
- Removed trailing slash(es) (e.g., http://cnn.com/ →
http://cnn.com)
- Removed port numbers (e.g., http://emrooz.ws:81 →
http://emrooz.ws)
- Converted all characters to lowercase.
- Decoded percent-encoded characters (e.g., '%7e' → '~')
- Kept only the host name and the first directory, which is often a
user name. However, when the first directory is an English word, it
is removed as well, since such words are unlikely to be user names
(e.g., http://abcnews.go.com/sections/politics →
http://abcnews.go.com). English words are identified by comparing
them with WordNet entries.
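The normalization rules above can be sketched with Python's standard
urllib.parse module. This is an illustration under stated assumptions,
not the code used in the project: the function name is hypothetical,
and the small ENGLISH_WORDS set is only a stand-in for the WordNet
lookup.

```python
from urllib.parse import urlsplit, unquote

# Stand-in for a WordNet lookup (assumption: a real run would query WordNet).
ENGLISH_WORDS = {"sections", "news", "politics", "archives"}


def normalize_blog_url(url, english_words=ENGLISH_WORDS):
    """Return the normalized URL, or None if it should be discarded."""
    parts = urlsplit(url.strip())

    # Discard URLs with no protocol (or no host): likely extraction errors.
    if not parts.scheme or not parts.hostname:
        return None

    # urlsplit lowercases the scheme; .hostname is lowercased and has no port.
    host = parts.hostname

    # Decode percent-escapes, lowercase, and split the path into directories.
    path = unquote(parts.path).lower()
    segments = [s for s in path.split("/") if s]

    # Keep only the first directory, unless it looks like an English word.
    first = segments[0] if segments else ""
    if first and first not in english_words:
        return f"{parts.scheme}://{host}/{first}"
    return f"{parts.scheme}://{host}"


for raw in ["http://cnn.com/", "http://emrooz.ws:81",
            "http://abcnews.go.com/sections/politics"]:
    print(raw, "->", normalize_blog_url(raw))
```

Dropping the trailing slash falls out of the path splitting, so it
needs no separate step here.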
- Created an adjacency matrix, which turned out to be extremely
sparse. (Around 2400 of the 2740 blogs were isolated!)
- To examine the relationship between blogrolls and content,
compared the geodesic distance between every pair of blogs with the
content-based similarity between them. See the scatter plot for
geodesic distance and content similarity between blogs.
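As a rough sketch of the comparison described above, geodesic
distances can be computed by breadth-first search over the blogroll
graph and paired with a cosine similarity on the content vectors. The
function names are hypothetical, and treating blogroll links as
undirected is an assumption; the page does not say whether direction
was kept.

```python
import math
from collections import deque


def undirected(blogrolls):
    """Turn {blog: [linked blogs]} into a symmetric adjacency dict."""
    adj = {}
    for src, targets in blogrolls.items():
        for dst in targets:
            adj.setdefault(src, set()).add(dst)
            adj.setdefault(dst, set()).add(src)
    return adj


def geodesic_distances(adj, source):
    """Breadth-first search: hop count from source to each reachable blog."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def distance_similarity_pairs(blogrolls, vectors):
    """(geodesic distance, content similarity) for every connected pair."""
    adj = undirected(blogrolls)
    blogs = sorted(vectors)
    points = []
    for i, a in enumerate(blogs):
        dist = geodesic_distances(adj, a)
        for b in blogs[i + 1:]:
            if b in dist:  # isolated or unreachable pairs are skipped
                points.append((dist[b], cosine(vectors[a], vectors[b])))
    return points
```

Pairs with no connecting path have no finite geodesic distance, so
this sketch simply leaves them out of the scatter plot data.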
Outlink Structure (blogrolls)
- How are outlink-based clusters related to content-based
clusters?
- MDS map colored by
content-based clusters.
- Correlation between outlink- and content-based similarities
between blogs.
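The page does not state which outlink similarity measure or
correlation statistic was used. As one possible reading, the sketch
below uses Jaccard overlap between two blogrolls as the outlink-based
similarity and a Pearson coefficient to compare two lists of pairwise
similarities. Both choices and both function names are assumptions.

```python
import math


def jaccard(a, b):
    """Share of outlinks two blogrolls have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    # Undefined when either list is constant; return 0.0 in that case.
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0
```

Given the outlink similarity and the content similarity for the same
ordered list of blog pairs, `pearson(outlink_sims, content_sims)`
gives a single number summarizing how closely the two agree.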
- Similarly, examined the use of inlinks which point to at least
one of the 2,740 blogs. Those inlinks were collected using
Google backlink search.
- After normalizing the inlink URLs, created a co-citation matrix
C, where C(x,y)=1 means that web page (or blog) x has a link to
blog y. The size of the resulting matrix is 4,518 x 1,209 (number
of inlinks by number of blogs).
- Applied TFIDF and SVD to the matrix, then K-means. See the 10
most discriminative words for each cluster.
- MDS map colored by
content-based clusters.
- Correlation between outlink- and content-based similarities
between blogs.
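The matrix C described above can be sketched as follows, together with
the co-citation counts it implies (the number of pages that link to
both of two blogs). The function names and the toy link list are
hypothetical, and a dense list-of-lists is used only for readability;
a matrix of the size reported here would normally be stored sparsely.

```python
def build_incidence(links):
    """Build C from (inlink_page, blog) pairs; C[x][y] = 1 if x links to y."""
    links = list(links)
    pages = sorted({x for x, _ in links})
    blogs = sorted({y for _, y in links})
    row = {p: i for i, p in enumerate(pages)}
    col = {b: j for j, b in enumerate(blogs)}
    C = [[0] * len(blogs) for _ in pages]
    for x, y in links:
        C[row[x]][col[y]] = 1
    return pages, blogs, C


def cocitation_counts(C):
    """Entry [i][j] = number of pages linking to both blog i and blog j."""
    n = len(C[0]) if C else 0
    counts = [[0] * n for _ in range(n)]
    for page_row in C:
        cited = [j for j, v in enumerate(page_row) if v]
        for i in cited:
            for j in cited:
                counts[i][j] += 1
    return counts


toy_links = [("p1", "a"), ("p1", "b"), ("p2", "a"), ("p2", "b"), ("p3", "c")]
pages, blogs, C = build_incidence(toy_links)
counts = cocitation_counts(C)
```

The counts matrix is the product of C transposed with C, so its
diagonal holds the number of inlinks each blog received.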
- Overall, although geodesic distances and content similarities
between blogs were found to be moderately correlated, it seems to us
that LSA (on content) was more effective on our data set than
inlink- or outlink-based clustering. The results might have been
significantly different with data gathered in a different manner. In
particular, the matrix used to calculate co-citation and co-reference
would probably have been much less sparse with a link-based rather
than a random sample. Further research and application of these
techniques to other samples of data may provide validation and
support for decisions we made in the course of this research.
Elijah and Kaz
Last updated: Mon, 03 May 2004 19:13:47 GMT