Data Repository for NaN Group


NaN is a research group exploring complex systems, adaptive agents, modeling, simulation, artificial life, and complex (information, biological, and social) networks. We especially focus on the Web as a complex information network in which we leave abundant traces of our social and semantic activities: what we do, what we are interested in, whom we talk to, what knowledge we acquire and contribute.

Here we list our public datasets and data-processing tools for researchers and anyone else who is interested. Most datasets were collected and prepared for research projects by NaN members. Please acknowledge our effort by citing the corresponding papers if you use our datasets. Thank you and enjoy!

Datasets: Web traffic (WebSci14) |  Twitter (WebSci14) |  Social bookmarking (WebSci14) |  Publications (WebSci14) |  Topic diversity |  Virality prediction |  Legitimate classification |  Political polarization |  Web Click Data |  Last.fm

Data tools: Klatsch |  Fast Visualization of network |  WebGraph++ |  Java Crawler |  Topical Crawler Evaluation |  Latent Energy Environments |  OAMulator |  Recruit


→ Web Traffic Dataset (WebSci14) [Download][README]

A collection of Web (HTTP) requests for the month of November 2009. This is a small sample of the larger click dataset.

  • Source: Generated by applying a Berkeley Packet Filter to a mirror of the traffic passing through the border router of Indiana University.
  • Date range: November 1, 2009 to November 22, 2009.
  • File size: 235M requests; 2.7GB uncompressed
  • Please cite:
    Mark R. Meiss, Filippo Menczer, Santo Fortunato, Alessandro Flammini, and Alessandro Vespignani. Ranking web sites with real user traffic. In Proc. 2008 Intl. Conf. on Web Search and Data Mining, pp. 65-76. ACM, 2008.
    Mark R. Meiss, Bruno Gonçalves, José J. Ramasco, Alessandro Flammini, and Filippo Menczer. Modeling traffic on the web graph. In Proc. 7th Workshop on Algorithms and Models for the Web Graph (WAW), pp. 50-61. Springer Berlin Heidelberg, 2010.

→ Twitter Dataset (WebSci14) [Download][README]

A collection of records extracted from tweets for the month of November 2012 containing both #hashtags and URLs as part of the tweet.

  • Source: Sampled public tweets from Twitter streaming API.
  • Date range: November 2012.
  • File size: 27.8M tweets; 3.6GB uncompressed.
  • Please cite:
    Karissa McKelvey and Filippo Menczer. Truthy: Enabling the Study of Online Social Networks. In Proc. 16th ACM Conference on Computer Supported Cooperative Work and Social Computing Companion (CSCW), 2013.

→ Social Bookmarking Dataset (WebSci14) [Download][README]

A collection of bookmarks from GiveALink.org for the month of November 2009.


→ Publications Dataset (WebSci14) [Download][README]

Metadata for the complete set of all PubMed records through 2012 (with part of 2013 available as well), including title, authors, and year of publication. All data provided originates from NLM’s PubMed database (as downloaded April 24, 2013 from the NLM FTP site) and was retrieved via the Scholarly Database.


→ Topical diversity of user interests and content [Download][README]

  • Source: Sampled public tweets from Twitter streaming API.
  • Date range: January 1, 2013 to March 31, 2013.
  • Data size: 6.4 GB; about 490 million tweets.
  • Contains:
    1. Tweets sampled over 3 months.
    2. Each tweet is associated with a timestamp, anonymized user ID, and a list of hashtags.
  • Please cite:
    Lilian Weng and Filippo Menczer. Topicality and Social Impact: Diverse Messages but Focused Messengers. Under review. 2014.
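
As an illustration of how records with these three fields might be used, the following minimal sketch computes the Shannon entropy of each user's hashtag usage, one common way to quantify diversity. The in-memory tuple layout and the function name are assumptions made for this example; consult the README for the actual file format and the cited paper for the measures it uses.

```python
import math
from collections import Counter, defaultdict


def hashtag_entropy_by_user(records):
    """Return {user_id: Shannon entropy (bits) of that user's hashtag usage}.

    `records` is an iterable of (timestamp, user_id, hashtags) tuples.
    This layout is hypothetical; the real files may be structured differently.
    """
    counts = defaultdict(Counter)
    for _timestamp, user_id, hashtags in records:
        counts[user_id].update(hashtags)

    entropy = {}
    for user_id, tag_counts in counts.items():
        total = sum(tag_counts.values())
        # Sum of -p * log2(p) over the user's hashtag frequencies.
        entropy[user_id] = -sum(
            (n / total) * math.log2(n / total) for n in tag_counts.values()
        )
    return entropy
```

Under this measure, a user who uses two hashtags equally often scores 1 bit, and a user who always uses the same hashtag scores 0.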

→ Prediction of Viral Memes on Twitter [Download][README]


→ Astroturf/Legitimate Classification [Download][README]

This is the training data used to produce the results shown in the paper listed below.

  • Source: Sampled public tweets from Twitter streaming API.
  • Date range: September 14 to October 27, 2010.
  • Contains:
    1. data.arff: holds the un-resampled training data.
    2. data_balanced.arff: holds the resampled training data.
    3. data.instance_to_id.pickle: holds a Python pickle relating instance IDs in the data.arff file with Meme IDs in the Truthy database. To view the page for a particular meme ID, go to http://truthy.indiana.edu/m?id=
  • Please cite:
    Jacob Ratkiewicz, Michael Conover, Mark Meiss, Bruno Gonçalves, Alessandro Flammini, and Filippo Menczer. Detecting and Tracking Political Abuse in Social Media. Proc. 5th International AAAI Conference on Weblogs and Social Media (ICWSM), 2011.
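
The instance-to-meme lookup described above could be used roughly as follows. This is a minimal sketch that assumes the pickle holds a dict-like mapping from instance ID to meme ID, and that the meme ID is simply appended to the URL given above; the listing confirms neither, and the helper names are hypothetical. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.

```python
import pickle

# Base URL taken from the listing above; appending the meme ID is an assumption.
MEME_URL_BASE = "http://truthy.indiana.edu/m?id="


def load_instance_to_meme(path):
    """Load the mapping from ARFF instance IDs to Truthy meme IDs.

    Assumes the pickle contains a dict-like object (not confirmed by the source).
    """
    with open(path, "rb") as fh:
        # A pickle written by Python 2 may need pickle.load(fh, encoding="latin1").
        return pickle.load(fh)


def meme_url(mapping, instance_id):
    """Build the meme page URL for one instance ID."""
    return MEME_URL_BASE + str(mapping[instance_id])
```

A call such as `meme_url(load_instance_to_meme("data.instance_to_id.pickle"), instance_id)` would then give the page address for one training instance, provided the assumptions above hold.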

→ Political Polarization on Twitter [Download][README]

This is the training data used to produce the results shown in the paper listed below.

  • Source: Sampled public tweets from Twitter streaming API.
  • Date range: 6 weeks prior to the 2010 Congressional midterm elections.
  • Contains:
    1. Three networks of political communication between Twitter users
  • Please cite:
    Michael Conover, Jacob Ratkiewicz, Matthew Francisco, Bruno Gonçalves, Alessandro Flammini, and Filippo Menczer. Political Polarization on Twitter. Proc. 5th International AAAI Conference on Weblogs and Social Media (ICWSM), 2011.

→ Web Click Dataset [Download]

This is the data used to produce the results shown in the paper listed below.

  • Source: Generated by applying a Berkeley Packet Filter to a mirror of the traffic passing through the border router of Indiana University.
  • Date range: September 2006 to May 2010.
  • Contains 2 collections:
    1. raw: About 25 billion requests, where only the host name of the referrer is retained. Collected between 26 Sep 2006 and 3 Mar 2008; missing 98 days of data, including the entire month of Jun 2007. Approximately 0.85 TB, compressed.
    2. raw-url: About 28.6 billion requests, where the full referrer URL is retained. Collected between 3 Mar 2008 and 31 May 2010; missing 179 days of data, including the entire months of Dec 2008, Jan 2009, and Feb 2009. Approximately 1.5 TB, compressed.
  • Please cite:
    Mark R. Meiss, Filippo Menczer, Santo Fortunato, Alessandro Flammini, and Alessandro Vespignani. Ranking web sites with real user traffic. In Proc. 2008 Intl. Conf. on Web Search and Data Mining, pp. 65-76. ACM, 2008.
    Mark R. Meiss, Bruno Gonçalves, José J. Ramasco, Alessandro Flammini, and Filippo Menczer. Modeling traffic on the web graph. In Proc. 7th Workshop on Algorithms and Models for the Web Graph (WAW), pp. 50-61. Springer Berlin Heidelberg, 2010.

→ Last.fm Dataset [Download][README]

This is the data used to produce the results shown in the paper below.

  • Source: A crawl of Last.fm users, their annotations, friends and neighborhood relations, and group membership.
  • Date range: First half of 2009.
  • Please cite:
    Schifanella, R., Barrat, A., Cattuto, C., Markines, B., and Menczer, F. (2010). Folks in Folksonomies: Social Link Prediction from Shared Metadata. Proc. 3rd ACM International Conference on Web Search and Data Mining (WSDM). arXiv Preprint

→ Klatsch [Download][README]

Klatsch is a framework and language for exploring and analyzing feeds of social media data.

  • The purpose of the Klatsch framework is to provide an easy-to-program, flexible interface for exploring and analyzing feeds of social media data. It's meant to be easy to interface to existing algorithms and graph representations and to produce pretty pictures in a variety of formats. The language itself is somewhere between Python and Scheme: dynamic types, procedures as first-class data, call-by-value semantics, and a nod toward object orientation.
  • Klatsch is built around a scripting language implemented by an interpreter written in Java. You don’t need to know how to program Java in order to develop Klatsch scripts; that’s only necessary if you want to extend or modify the interpreter itself.
  • Please cite:
    Jacob Ratkiewicz, Michael Conover, Mark Meiss, Bruno Gonçalves, Snehal Patil, Alessandro Flammini, and Filippo Menczer. Truthy: Mapping the Spread of Astroturf in Microblog Streams. Proc. 20th Intl. World Wide Web Conf. Companion (WWW), 2011.

→ Fast visualization of large dynamic networks [Download][README]

  • This is a collection of two tools for visualizing large dynamic networks. They perform the following functions, respectively:
    1. From a chronological sequence of graph links in the form of sdnet files, produce differential updates, formatted as JSON events, to the subgraph of the network designated for visualization (src/visualize_tweets_finitefile.cpp).
    2. Produce movies of evolving graphs from a feed of the JSON events (scripts/DynamicGraph_wici.py).
  • Please cite:
    Grabowicz, Przemyslaw A., Luca Maria Aiello, and Filippo Menczer. Fast filtering and animation of large dynamic networks. EPJ Data Science 3(1), 2014.
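
To illustrate the idea behind the first tool, the following minimal sketch turns a chronological stream of links into differential add/remove events for a bounded subgraph, serialized as JSON. It is not the tool's implementation: the sliding-window selection rule, the event field names, and the (timestamp, source, target) input tuples are all assumptions made for this example, and the sdnet format itself is not parsed here.

```python
import json
from collections import deque


def differential_events(links, window):
    """Yield JSON strings describing changes to the visualized subgraph.

    `links` is a chronological iterable of (timestamp, source, target).
    A link stays in the subgraph for `window` time units (hypothetical rule).
    """
    active = deque()  # links currently shown, oldest first
    for timestamp, source, target in links:
        # Expire links that have fallen out of the time window.
        while active and active[0][0] <= timestamp - window:
            _, old_source, old_target = active.popleft()
            yield json.dumps({"t": timestamp, "op": "remove",
                              "source": old_source, "target": old_target})
        active.append((timestamp, source, target))
        yield json.dumps({"t": timestamp, "op": "add",
                          "source": source, "target": target})
```

A consumer only needs to apply each event in order to keep its picture of the subgraph current, which is what makes a differential feed cheaper than re-sending the whole graph at every step. For simplicity, repeated links are treated here as independent events.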

→ WebGraph++ [Github][README]

This software is a translation into C++ of the excellent Webgraph library by P. Boldi and S. Vigna. The original library, written in Java, is easy to use but hampered by some requirements of the Java virtual machine. This C++ translation attempts to preserve much of the ease of use (through integration with the Boost Graph Library) while bypassing the requirements imposed by a virtual machine.
Like the original Webgraph library, this work is available under the GNU General Public License.


→ Multi-threaded crawlers in Java [Download][README]

The code implements a multi-threaded Web crawler. Please read more about the tool here.

  • Please cite:
    G. Pant, P. Srinivasan, F. Menczer. Crawling the Web. In M. Levene and A. Poulovassilis, eds.: Web Dynamics, Springer, 2004.

→ A General Evaluation Framework for Topical Crawlers [Download][README]

The script and data files are released in association with, and implement/illustrate algorithms described in, the following paper. Please refer to the paper for a detailed illustration of the procedures implemented in the script, and of the data files.


→ Latent Energy Environments [Download][README]

An artificial life model and simulator of controlled complexity, using endogenous fitness. Software and documentation available for Unix and Macintosh.
Please read more about the tool here.

  • Please cite:
    F Menczer and RK Belew. Latent Energy Environments. In: R. Belew and M. Mitchell, editors, Adaptive Individuals in Evolving Populations: Models and Algorithms. Addison Wesley, Reading, MA, 1996.

→ OAMulator [Download][README]

The OAMulator is a Web-based resource to support the teaching of instruction set architecture, assembly languages, memory, addressing, high-level programming, and compilation. The tool is based on a simple, virtual CPU architecture called the One Address Machine. A compiler takes programs written in a special programming language, called OAMPL, and transforms them into OAM assembly. An OAM assembler/emulator interprets and executes OAM assembly code (produced by the compiler or written directly). The OAMulator is targeted at students who take introductory courses in information technology or information systems. It is designed to take the mystery out of the CPU architecture and let students gain confidence with the concepts of compilers and binary execution.
Please read more about the tool here.


→ Recruit [Download][README]

Recruit is a free and open source Web-based software system to support a faculty search committee in its academic recruiting/hiring tasks. Recruit makes it easy for a department, division, or school faculty search committee to accept, manage, review and annotate job applications on the Web.
Please read more about the tool here.