# Data sources

There are many different types of data sources that can be used to enhance information controls research. Five particularly relevant data sources are discussed below: OONI, RIPE, Wehe, Rapid7 Labs, and Censored Planet. A list of additional sources that can be used to help detect censorship events is also provided.

# OONI

OONI, the Open Observatory of Network Interference, is a global observation network and free software project that detects censorship, surveillance, and traffic manipulation on the Internet. OONI uses Free/Libre and Open Source Software (FL/OSS) to share observations and data about network interference. Since 2012, OONI has collected millions of network measurements from more than 200 countries around the world. It serves as a powerful resource for researchers, journalists, lawyers, activists, and advocates interested in exploring network anomalies. Interested researchers can obtain and analyze OONI data through OONI Explorer and the OONI API.

# OONI Explorer

OONI Explorer is an easy way to access and review data that has been gathered by other OONI users. It provides a graphical data repository per country, allowing anyone to explore and interact with the network measurements that have been collected through OONI probes. With it, users can:

  • Run quick queries against OONI data.
  • View which websites were most recently blocked in each country.
  • Review measurement coverage by test class and tested URL.
  • Search within all OONI data with different criteria (blocked URLs, anomalies, ASN).
  • Limitation: Querying and analyzing a large range of reports can be time-consuming.

# OONI measurements API

The OONI API offers a programmatic way to access, download, and search OONI data. With it, users can:

  • Access complete data (raw network measurements in JSON file format).
  • Perform fast data analysis.
  • Limitation: Data must be downloaded before analysis, which requires sufficient storage and network bandwidth.
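As a rough illustration of programmatic access, the sketch below builds a query URL for a measurement listing and counts anomalous results per tested input. The endpoint URL, the query parameter names (`probe_cc`, `test_name`, `since`, `until`, `limit`), and the response fields (`results`, `input`, `anomaly`) are assumptions about the public API and should be checked against the current OONI API documentation. The helper names are hypothetical.

```python
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base URL of the public measurement listing endpoint.
OONI_MEASUREMENTS_URL = "https://api.ooni.io/api/v1/measurements"


def build_query_url(country_code, test_name="web_connectivity",
                    since=None, until=None, limit=100):
    """Build a listing URL for one country and one test (parameter names assumed)."""
    params = {"probe_cc": country_code, "test_name": test_name, "limit": limit}
    if since is not None:
        params["since"] = since  # e.g. "2020-01-01"
    if until is not None:
        params["until"] = until
    return OONI_MEASUREMENTS_URL + "?" + urlencode(params)


def fetch_results(url, timeout=30):
    """Download one page of results; assumes a JSON body with a "results" list."""
    with urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return payload.get("results", [])


def anomalies_per_input(results):
    """Count how many listed measurements are flagged as anomalous per input."""
    counts = Counter()
    for entry in results:
        if entry.get("anomaly"):
            counts[entry.get("input")] += 1
    return counts
```

A typical use would be `anomalies_per_input(fetch_results(build_query_url("IR", since="2020-01-01")))`. An anomaly flag is only a signal that a measurement deserves a closer look, not confirmation of blocking.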

Additional examples of how to use OONI data can be found in the Data analysis section of the magma guide: How to use OONI data.

# RIPE

The RIPE NCC (RIPE Network Coordination Centre) is the regional Internet registry for Europe, the Middle East, and parts of Central Asia. As such, it allocates and registers blocks of Internet number resources to Internet service providers (ISPs) and other organizations. The not-for-profit organization works to support the RIPE (Réseaux IP Européens) community and the wider Internet community. Its membership consists primarily of Internet service providers, telecommunication organizations, and large corporations.

RIPE provides a variety of data sources including public measurements and BGP announcement data. Relevant tools and data include:

# Wehe

Wehe is a research project based out of Northeastern University, the University of Massachusetts – Amherst, and Stony Brook University that collects data on ISP traffic differentiation (typically bandwidth throttling). The project performs network measurements for popular applications such as YouTube, Netflix, Amazon Prime Video, Spotify, Skype, and NBC Sports.

Wehe's data collected after November 2018 can be found here. The code and scripts used to analyze Wehe data (including a sample dataset) can be found here.

# Rapid7 Labs

Rapid7 Labs is the research arm of Rapid7. Its website offers “researchers and community members open access to data from Project Sonar, which conducts internet-wide surveys to gain insights into global exposure to common vulnerabilities.” Key datasets are detailed below:

# 'FDNS' dataset

The Forward DNS (FDNS) dataset contains the responses to DNS requests for all forward DNS names known by Rapid7's Project Sonar. Until early November 2017, all of these requests were for the 'ANY' record, with fallback A and AAAA requests if necessary. After that time, the ANY study represents only the responses to ANY requests, and dedicated studies were created for the A, AAAA, CNAME, and TXT record lookups, with appropriately named files. Each file is GZIP-compressed and contains, in JSON format, the name, type, value, and timestamp of any records returned for a given name.
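Because these files can be very large, it is usually more practical to stream them than to decompress them fully. The following is a minimal sketch that reads a GZIP-compressed file and keeps only the records for one domain and its subdomains. It assumes one JSON object per line and the lowercase field names `name`, `type`, `value`, and `timestamp`. The function names and the file path passed in are hypothetical.

```python
import gzip
import json


def iter_fdns_records(path):
    """Yield one record (dict) per non-empty line of a gzip-compressed file.

    Assumption: the file holds one JSON object per line.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def records_for_domain(path, domain, record_type=None):
    """Yield records whose name is `domain` or one of its subdomains.

    If `record_type` is given (e.g. "a" or "cname"), only records with that
    exact "type" value are kept.
    """
    domain = domain.lower()
    suffix = "." + domain
    for record in iter_fdns_records(path):
        name = record.get("name", "").lower()
        if name != domain and not name.endswith(suffix):
            continue
        if record_type is not None and record.get("type") != record_type:
            continue
        yield record
```

Matching on the `"." + domain` suffix rather than on the bare string avoids false hits such as `notexample.com` when searching for `example.com`.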

# 'RDNS' dataset

The Reverse DNS (RDNS) dataset includes the responses to IPv4 PTR lookups for all IPv4 addresses that are neither blacklisted nor private.

# 'HTTP' dataset

The HTTP GET Responses dataset contains the responses to HTTP/1.1 GET requests performed against a variety of public IPv4 HTTP endpoints.

# 'HTTPS' dataset

The HTTPS GET Responses dataset contains the responses to HTTP/1.1 GET requests performed against various HTTPS ports.

# 'SSL' datasets

# Common port (443 port) SSL dataset

The SSL Certificates dataset contains X.509 certificate metadata observed when communicating with HTTPS endpoints.

# Non 443 port SSL dataset

The SSL Certificates (non-443) dataset includes the X.509 certificate metadata observed when communicating with miscellaneous non-HTTPS endpoints, such as IMAPS, POP3S, or other services.

# 'UDP Scans' dataset

The UDP Scans dataset contains regular snapshots of the responses to ZMap probes against common UDP services.

# 'TCP Scans' dataset

The TCP Scans dataset contains regular snapshots of the responses to ZMap probes against common TCP services.

# Censored Planet

Censored Planet is a project from the University of Michigan that collects data on privacy and security violations on the Internet. Key datasets are detailed below:

# 'Satellite' DNS dataset

Satellite contains regular snapshots of the DNS resolutions of top websites, as returned by a large number of open DNS resolvers located in a wide range of networks.

# 'Quack' HTTP dataset

Quack contains regularly collected responses observed when connecting to infrastructural web servers (e.g. those operated by ISPs and governments) and asking each web server to serve content from a range of sensitive domains.

# Other sources

The following is a list of available data sources that can help detect a censorship event that is ongoing or has already taken place.