Following our paper on social phishing, I have received several queries from researchers interested in studying online social networks, about the legality and/or ethics of crawling data from online social networks and using this data for research purposes, as we did. So it may be useful to post here my standard response. (Actually this is not just about crawling online social network sites, but about crawling the Web for research purposes in general.)
Legal aspects are of course country dependent. My view is based on the US. Also please keep in mind that I am not a lawyer, so this is not intended as legal advice! You should seek legal counsel if you really contemplate doing something and you are not sure if it is legal in your country!
Posting something on the Web makes it public, and accessing public information is perfectly legal. Reading a Web site is like reading a street sign or a billboard. The information is broadcast to the entire world and there can be no legal restrictions on accessing this information. (There are exceptions of course, such as laws restricting minors' access to pornographic material; but then it is the responsibility of the provider to check the age of the user and enforce the law; the viewer is not doing anything illegal by merely viewing a public Web page.)
What you do with the information you freely access is up to you --- it is nobody else's business if you just view it with your eyes, print it or save it on your disk for later use or analysis, or share it with others. There is no difference if you access the information interactively with a browser, or download it automatically with crawler/screen-scraping software.
A different situation is if the information is not freely and publicly posted, but rather hidden behind some protection, for example if you need to create a password-protected account to access/use the information on a Web site. In this case, the provider may impose conditions and restrictions on the use, for example payment of a fee, non-disclosure to third parties, or other Terms of Use or Terms of Service (ToS). This is like a contract between the provider and the consumer of the service or information. When the ToS for a site states that you are not allowed to crawl the site, then of course you are in breach of the ToS if you do. The ToS typically establishes the consequences for infringement, i.e., what happens if you fail to abide by the rules in the ToS. Typically (and if no other consequence is contemplated) your account is canceled and you lose the privilege of access, but the ToS may contemplate more serious consequences; you might even be sued for breach of contract and damages. It is important to note that the ToS is not a law; just because the ToS forbids some action it does not mean that such an action is illegal. However, a ToS may be a legally binding document, and therefore if you fail to abide by the ToS you may be violating the law about contractual agreements.
Even if it were perfectly legal to crawl a social network site to collect data for research purposes, there may be ethical considerations that make it inappropriate. One issue that may have both ethical and legal ramifications is information privacy. There may be information privacy laws that regulate the use of personal information. In the US my understanding is that these laws are pretty lax; mostly there is self-regulation through privacy policies that a Web site must post about what personal information is collected from users, how it is stored, how it may be used, how to opt out, etc. A user who accesses personal information protected by a privacy policy is not limited in the use of such information, except as specified by the ToS (see above). However, depending on what you consider ethical behavior, you may have different views about the gravity of breaching a ToS agreement, and the gravity of violating a third party's right to privacy. I would typically consider a violation of privacy unethical even if it is not illegal.
Another ethical aspect is the compliance with the Robots Exclusion Protocol. This is a simple technical standard by which a provider (Web site maintainer) communicates to crawlers (a.k.a. robots) what is okay to be crawled and what is off-limits. The protocol is voluntary but widely adopted; for example, all established search engines are more or less compliant. There are some subtle aspects here, such as what distinguishes a robot or crawler from a human. A human reading pages while interactively clicking links in a browser is clearly not a robot, while a large-scale crawler supporting a commercial search engine is clearly a robot. However, there are many intermediate cases that are not so clear, such as browser extensions that do pre-fetching, ad blockers, Web proxies, RSS feed readers, and so on. Browsers are becoming increasingly sophisticated and incorporating more and more automation as a result. But let us leave these nuances aside. The point is that if a site uses the Robots Exclusion Protocol to tell you that you should not crawl it, then it is unethical to crawl it anyway. Clearly there are shades of gray here. You could try to contact the Web site administrator to ask permission to crawl notwithstanding the robots guideline, for research purposes. Or you might determine that the protocol is in place to keep out commercial abuse and therefore does not apply to you. It is your call, since this is a voluntary standard. If you do crawl a site, irrespective of the Robots Exclusion Protocol, be sure to implement crawler politeness guidelines, for example to avoid requesting too many pages in a short interval of time. It is a good idea to delay at least a second between consecutive page requests. An impolite crawler may put such a strain on the Web server's network bandwidth that the server becomes unable to respond to normal requests. This is an unintended denial-of-service attack and it is not well tolerated. Your IP address may be blocked and posted on blacklists of abusers.
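To make the two technical points above a bit more concrete (honoring the Robots Exclusion Protocol and pausing between requests), here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not the crawler we used: the user-agent string, the example.org URLs, and the helper names `load_rules`, `polite_crawl`, and `http_fetch` are all hypothetical, and the one-second delay is simply the rule of thumb mentioned above.

```python
import time
import urllib.request
from urllib.robotparser import RobotFileParser

USER_AGENT = "research-crawler-example"  # hypothetical name, pick your own
MIN_DELAY_SECONDS = 1.0                  # "at least a second" between requests


def load_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def http_fetch(url):
    """Download one page, identifying the crawler by its user agent."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()


def polite_crawl(urls, rules, fetch=http_fetch,
                 delay=MIN_DELAY_SECONDS, sleep=time.sleep):
    """Fetch only the URLs that the rules allow, pausing between requests."""
    pages = {}
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue  # the site asked robots to stay out of this path
        if pages:
            sleep(delay)  # wait before every request except the first
        pages[url] = fetch(url)
    return pages


if __name__ == "__main__":
    # Offline demonstration: a made-up robots.txt and a stand-in fetch function.
    rules = load_rules("User-agent: *\nDisallow: /private/\n")
    urls = ["https://example.org/public/a", "https://example.org/private/b"]
    print(list(polite_crawl(urls, rules, fetch=lambda u: b"")))
```

In a real crawl you would first download the site's own robots.txt (`RobotFileParser` can do this with `set_url` and `read`) and leave `fetch` at its default. The `fetch` and `sleep` parameters exist only so the logic can be tried out without touching the network.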
Finally, social network data is of course about humans. So if you plan to collect social network data and use it for research purposes, you are conducting research on human subjects. Such research has important ethical implications concerning the protection of the subjects. There is a long history of use and abuse of human subjects in research. Therefore it is important that your research is reviewed by an independent review board to make sure that the human subjects are protected from abuse, discrimination, risk, privacy violations, and other potentially adverse factors. In the US, all publicly funded research institutions (including any university whose employees receive support from federal grant agencies such as NIH, NSF, etc.) are required to have an Institutional Review Board (IRB) to protect human subjects. The Indiana University Human Subject Office is an example of an IRB. Any research employing human subjects must be reviewed, approved, and monitored by the IRB. Researchers must be trained in the protection of human subjects. In the case of a study of online social networks, the IRB would make sure that only aggregate statistical analysis is published, and no personally identifiable information. The IRB would also ensure that the participants have a way to give informed consent, or if this is not feasible, that they have a way to opt out after the data is collected. If this is also infeasible, then the IRB would weigh the potential risks of participation by the subjects (who may well be unaware that their information is being used in the study) against the potential benefits for the same subjects. Risks could include, for example, violation of privacy and unintended consequences. Benefits might include the potential positive effects of any discovery on the population from which the subjects are drawn (i.e., users of social network sites).