CSCI B656 Web Mining (3 CR)


Computing, storage and network resources

For your class project, you may need significant CPU, storage, and/or network bandwidth resources. If you are not a SOIC student, see the instructor to discuss your options. If you are a SOIC student, the following FAQs describe the computer systems for CPU-intensive processing and the storage facilities available to students:

Please follow the usage guidelines for these facilities given in those documents. In addition, observe the following rules when running jobs that are likely to generate heavy disk or network traffic:

Processing that will generate sustained periods of high disk activity should be limited to a single process. If you need multiple computers and/or processes doing simultaneous, high-bandwidth I/O to central storage facilities, please get approval first.
Processing that will generate high volumes of network traffic to non-IU systems should be limited to no more than 200 Kbps of sustained traffic. Please get approval before running any job that will exceed this level for more than 1 hour.
Running any process that systematically scans ranges of IP addresses or TCP port numbers is prohibited. For example, using a utility like nmap to scan a remote system for open ports is prohibited, as is scanning ranges of IP addresses for accessible systems.
Read and follow these guidelines on crawling/scraping online social network and other data. In the past, students who have not complied with crawling etiquette have gotten themselves, the instructor, and IU in trouble. Don't be the next one!

This list is not intended to include all possible activities that are prohibited or likely to cause system disruptions. If you are unsure whether your intended activities fall within these acceptable use policies, please ask before you proceed. Also, if you feel that your project requires resources beyond those available via the above facilities and policies, please see the instructor so that we can discuss a suitable course of action.
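
As a rough illustration of the 200 Kbps guideline above, the following Python sketch shows one way a crawler could pace its own traffic. It is not an official IU tool or policy implementation. It assumes "Kbps" means kilobits per second (1 kilobit = 1000 bits), which works out to 25,000 bytes per second, or about 90 MB over one hour. The class name Throttle and its method record are hypothetical names chosen for this example.

```python
import time

# Assumption: "200 Kbps" is read as 200 kilobits per second (1 kilobit = 1000 bits).
LIMIT_BITS_PER_SEC = 200_000
LIMIT_BYTES_PER_SEC = LIMIT_BITS_PER_SEC // 8   # 25,000 bytes per second
BYTES_PER_HOUR = LIMIT_BYTES_PER_SEC * 3600     # 90,000,000 bytes, about 90 MB


class Throttle:
    """Hypothetical helper: paces transfers to stay at or below a byte rate."""

    def __init__(self, bytes_per_sec=LIMIT_BYTES_PER_SEC,
                 clock=time.monotonic, sleep=time.sleep):
        self.bytes_per_sec = bytes_per_sec
        self.clock = clock        # injectable so pacing can be tested without waiting
        self.sleep = sleep
        self.next_free = clock()  # earliest time the next transfer may start

    def record(self, nbytes):
        """Call after transferring nbytes; sleeps long enough to pay for them.

        Returns the number of seconds slept.
        """
        now = self.clock()
        # Idle time is not banked as credit, so a long pause cannot
        # be followed by a burst above the limit.
        self.next_free = max(self.next_free, now) + nbytes / self.bytes_per_sec
        delay = self.next_free - now
        self.sleep(delay)
        return delay
```

In a download loop, one would create a single Throttle and call record(len(body)) after each response. The sketch is deliberately conservative: it does not credit the time already spent downloading, so the realized rate stays somewhat under the cap. It covers only one process, so several processes sharing a link would each need a proportionally smaller budget.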

Software and data resources

Note: some links may be out of date; please flag those to the instructor.

GiveALink: donate your bookmarks to science -- could be a great source of project ideas and data
Truthy: could be a great source of social media data for projects, or you could contribute to the project
Scholarometer provides an API and linked data about authors and their citation impact
Microsoft Academic Search provides an API to access rich data about authors, publications, and citations
We have a huge Web click database collected at IU, available upon request
Jure Leskovec's SNAP provides a C++ network analysis and graph mining library and a collection of about 50 large datasets including social networks, web graphs, internet networks, citation networks, communication networks, etc.
Common Crawl provides open Web crawl data
ICWSM 2011 provides links to various datasets including two from Spinn3r.com (a newer dataset of 386 million elements and an older one with 44 million blog posts), a Sentiment Corpus, and a Wikipedia User Contribution Dataset
infochimps provides lots of APIs and datasets, many for free
The UMBC Splog Blog Dataset can be used for research on blog spam
Wikimedia provides downloads of all Wikipedia projects
Data Science Toolkit: Various APIs for geolocation, text/HTML analysis and extraction
JavaCrawlers: A Java library for topical crawlers
Nutch: an open-source web search engine
Jakarta Lucene: a high-performance, full-featured text search engine written in Java
Lemur: a Toolkit for Language Modeling and Information Retrieval
Clair Library: intended to simplify a number of generic tasks in Natural Language Processing (NLP) and Information Retrieval (IR)
WIRE: Web IR Environment including a simple format for storing a collection of web documents, a crawler, and tools for generating stats and reports
Terrier: modular software platform for the rapid development of large-scale Web IR applications, providing indexing and retrieval functionalities; Labrador is a distributed web crawler designed to be integrated with Terrier
Network analysis tools: consider Gephi, iGraph, NWB, and many more
Google Code offers a large number of APIs and tools
Yahoo! Developer Network offers many APIs and tools
Bing Developer offers Search and Maps APIs
LETOR: Benchmark Datasets for Learning to Rank from Microsoft Research Asia
The Boost Graph Library (BGL): a generic C++ library of graph algorithms developed at the Open Systems Lab in the IU CS department. It handles large graphs nicely and integrates (fairly) easily with existing code.
WebGraph: a Java framework to study the web graph; WebGraph++ is a C++ port that bypasses some limitations imposed by the JVM
Weka: Data Mining Software in Java
WebBase: The Stanford WebBase project investigates various issues in crawling, storage, indexing, and querying of large collections of Web pages
LWP: The World-Wide Web library for Perl
Libwww: the W3C Protocol Library
Bow: a toolkit for statistical language modeling, text retrieval, classification and clustering
MG: an open-source indexing and retrieval system for text, images, and textual images
WebGlimpse: search engine software including a web administration interface, remote link spider, and the powerful Glimpse file indexing and query system
ht://Dig: a complete world wide web indexing and searching system for a domain or intranet
SWISH-E: a fast, powerful, flexible, free, and easy to use system for indexing collections of Web pages or other files
Internet Archive: a digital library of Internet sites and other cultural artifacts in digital form, providing free access to researchers and scholars (see also Heritrix, the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project)
Classifier Code (download): a collection of example classifier code written in Matlab, donated by Mark Meiss
Software and utilities from Soumen Chakrabarti
Kevin Chai's Homepage links to lots of datasets
... and of course many other data APIs are available from hundreds of services such as Twitter, Last.fm, Flickr, NYTimes, YouTube, etc.