Following our paper on social phishing, I have received several queries from researchers interested in studying online social networks about the legality and ethics of crawling data from such networks and using it for research purposes, as we did. So it may be useful to post my standard response here. (Actually this is not just about crawling online social network sites; it applies to crawling the Web for research purposes in general.)
Legal aspects are of course country-dependent; my view is based on the US. Also, please keep in mind that I am not a lawyer, so this is not intended as legal advice! You should seek legal counsel if you really contemplate doing something and are not sure whether it is legal in your country.
Posting something on the Web makes it public, and accessing public information is perfectly legal. Reading a Web site is like reading a street sign or a billboard: the information is broadcast to the entire world, and there can be no legal restrictions on accessing it. (There are exceptions, of course, such as laws restricting minors' access to pornographic material; but then it is the responsibility of the provider to check the age of the user and enforce the law; the viewer is not doing anything illegal by merely viewing a public Web page.)
What you do with the information you freely access is up to you --- it is nobody else's business if you just view it with your eyes, print it or save it on your disk for later use or analysis, or share it with others. There is no difference if you access the information interactively with a browser, or download it automatically with crawler/screen-scraping software.
Another ethical aspect is compliance with the Robots Exclusion Protocol. This is a simple technical standard by which a provider (Web site maintainer) communicates to crawlers (a.k.a. robots) what may be crawled and what is off-limits. The protocol is voluntary but widely adopted; for example, all established search engines are more or less compliant. There are some subtle aspects here, such as what distinguishes a robot or crawler from a human. A human reading pages while interactively clicking links in a browser is clearly not a robot, while a large-scale crawler supporting a commercial search engine clearly is. However, there are many intermediate cases that are not so clear, such as browser extensions that do pre-fetching, ad blockers, Web proxies, RSS feed readers, and so on. Browsers are becoming increasingly sophisticated and incorporate more and more automation. But let us leave these nuances aside.

The point is that if a site uses the Robots Exclusion Protocol to tell you that you should not crawl it, then it is unethical to crawl it anyway. Clearly there are shades of gray here. You could try to contact the Web site administrator to ask permission to crawl for research purposes, notwithstanding the robots guideline. Or you might determine that the protocol is in place to keep out commercial abuse and therefore does not apply to you. It is your call, since this is a voluntary standard.

If you do crawl a site, irrespective of the Robots Exclusion Protocol, be sure to follow crawler politeness guidelines, for example by not requesting too many pages in a short interval of time. It is a good idea to wait at least a second between consecutive page requests. An impolite crawler may put such a strain on a Web server's network bandwidthth that the server becomes unable to respond to normal requests. This amounts to an unintended denial-of-service attack, and it is not well tolerated: your IP address may be blocked and posted on blacklists of abusers.
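The two politeness rules above are easy to implement. Here is a minimal sketch (not code from our study) using only the Python standard library: `urllib.robotparser` checks a site's robots.txt rules, and a fixed minimum delay is enforced between requests. The robots.txt content and user-agent name are hypothetical, for illustration.

```python
import time
import urllib.robotparser

# A hypothetical robots.txt, as a site might serve it. A real crawler would
# download this from http://<host>/robots.txt before fetching anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_crawl(url, agent="research-crawler"):
    """Return True only if the robots.txt rules permit this agent to fetch url."""
    return parser.can_fetch(agent, url)

def polite_delay(agent="research-crawler", minimum=1.0):
    """Sleep for the site's requested crawl delay, or at least `minimum` seconds."""
    requested = parser.crawl_delay(agent)  # None if the site specifies no delay
    time.sleep(max(minimum, requested or 0.0))
```

A crawl loop would then call `may_crawl` before each request, skip any URL for which it returns False, and call `polite_delay` between consecutive fetches.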
Finally, social network data is of course about humans. So if you plan to collect social network data and use it for research, you are conducting research on human subjects. Such research has important ethical implications concerning the protection of the subjects; there is a long history of use and abuse of human subjects in research. It is therefore important that your research be reviewed by an independent review board to make sure that the human subjects are protected from abuse, discrimination, risk, privacy violations, and other potentially adverse factors. In the US, all publicly funded research institutions (including any university whose employees receive support from federal grant agencies such as NIH, NSF, etc.) are required to have an Institutional Review Board (IRB) to protect human subjects. The Indiana University Human Subjects Office is an example of an IRB. Any research involving human subjects must be reviewed, approved, and monitored by the IRB, and researchers must be trained in the protection of human subjects.

In the case of a study of online social networks, the IRB would make sure that only aggregate statistical analyses are published, and no personally identifiable information. The IRB would also ensure that the participants have a way to give informed consent, or, if this is not feasible, that they have a way to opt out after the data is collected. If this too is infeasible, the IRB would weigh the potential risks to the subjects (who may well be unaware that their information is being used in the study) against the potential benefits for the same subjects. Risks could include, for example, violation of privacy and unintended consequences. Benefits might include the positive effects of any discovery on the population from which the subjects are drawn (i.e., users of social network sites).