# Data sources

There are many different types of data sources that can be used to enhance information controls research. Five particularly relevant data sources are discussed below: OONI, RIPE, Wehe, Rapid7 Labs, and Censored Planet. A list of additional sources that can be used to help detect censorship events is also provided.

# OONI

OONI, the Open Observatory of Network Interference, is a global observation network and free software used to detect censorship, surveillance, and traffic manipulation on the Internet. OONI uses Free and Open Source Software (FL/OSS) to share observations and data about network interference. Since 2012, OONI has collected millions of network measurements from more than 200 countries around the world. It serves as a powerful resource for researchers, journalists, lawyers, activists, and advocates interested in exploring network anomalies. Interested researchers can obtain and analyze OONI data through OONI explorer and OONI API.

# OONI explorer

OONI Explorer is an easy way to access and review data that has been gathered by other OONI users. It provides a graphical data repository per country, allowing anyone to explore and interact with the network measurements that have been collected through OONI probes. With it, users can:

  • Quickly perform fast queries of OONI data.
  • View which websites were most recently blocked in each country.
  • Review measurement coverage by test class and tested URL.
  • Search within all OONI data with different criteria (blocked URLs, anomalies, ASN).
  • Limitation: Can be time consuming while performing queries and analyzing reports in a wide data range of reports.

# OONI measurements API

OONI API offers a programmatic way to access, download, and search OONI data. With it, users can:

  • Access complete data (raw network measurements in JSON file format).
  • Perform fast data analysis.
  • Limitation: Data needs to be downloaded (sufficient storage, network bandwidth required).

Additional examples of how to use OONI data can be found in the Data analysis section of the magma guide: How to use OONI data.

# RIPE

RIPE is the regional Internet registry for Europe, the Middle East, and parts of Central Asia. As such, it allocates and registers blocks of Internet number resources to Internet service providers (ISPs) and other organizations. The not-for-profit organization works to support the RIPE (Réseaux IP Européens) community and the wider Internet community. The RIPE NCC membership consists primarily of Internet service providers, telecommunication organizations, and large corporations.

RIPE provides a variety of data sources including public measurements and BGP announcement data. Relevant tools and data include:

# Wehe

Wehe is a research project based out of Northeastern University, the University of Massachusetts – Amherst, and Stony Brook University that collects data on ISP traffic differentiation (typically bandwidth throttling). The project performs network measurements for popular applications such as YouTube, Netflix, Amazon Prime Video, Spotify, Skype, and NBC Sports.

Wehe's data collected after November 2018 can be found here. The code and scripts used to analyze Wehe data (including a sample dataset) can be found here.

# Rapid7 Labs

Rapid7 Labs is the research arm of Rapid7. Its website offers “researchers and community members open access to data from Project Sonar, which conducts internet-wide surveys to gain insights into global exposure to common vulnerabilities.” Key datasets are detailed below:

# 'FDNS' dataset

Forward DNS (FDNS) dataset contains the responses to DNS requests for all forward DNS names known by Rapid7's Project Sonar. Until early November 2017, all of these were for the 'ANY' record with a fallback A and AAAA request if necessary. After that time, the ANY study represents only the responses to ANY requests, and dedicated studies were created for the A, AAAA, CNAME, and TXT record lookups with appropriately named files. The file is a GZIP compressed file containing the name, type, value, and timestamp of any returned records for a given name in JSON format.

# 'RDNS' dataset

Reverse DNS (RDNS) dataset includes the responses to the IPv4 PTR lookups for all non-blacklisted/private IPv4 addresses.

# 'HTTP' dataset

HTTP GET Responses dataset contains the responses to HTTP/1.1 GET requests performed against a variety of IPv4 public HTTP endpoints.

# 'HTTPS' dataset

HTTPS GET Responses dataset contains the responses to HTTP/1.1 GET requests against various HTTPS ports.

# 'SSL' datasets

# Common port (443 port) SSL dataset

SSL Certificates dataset contains X.509 certificate metadata observed when communicating with HTTPS endpoints.

# Non 443 port SSL dataset

SSL Certificates (non-443) dataset includes the X.509 certificate metadata observed when communicating with miscellaneous non-HTTPS endpoints, such as IMAPS, POP3S, or other services.

# 'UDP Scans' dataset

UDP Scans dataset contains regular snapshots of the responses to zmap probes against common UDP services.

# 'TCP Scans' dataset

TCP Scans dataset contains regular snapshots of the responses to zmap probes against common TCP services.

# The Censored Planet Observatory

Censored Planet is a longitudinal censorship measurement platform that collects remote measurement measurements in more than 200 countries. Censored Planet was launched in August 2018, and has since then collected more than 45 billion measurement data points. Censored Planet measures network interference on the TCP/IP, DNS, and HTTP(S) protocols, using remote measurement techniques Augur, Satellite, and Hyperquack respectively.

Every week, Censored Planet collects reachability data about 2000 popular and sensitive websites from more than 95,000 vantage points around the world. Apart from longitudinal scans, Censored Planet also performs rapid focus measurements of select lists of websites at large scale during censorship events. An academic paper about Censored Planet can be found here.

Censored Planet’s measurement data has been crucial in identifying and monitoring several important censorship and network interference events. In 2019, Censored Planet data was used to study the large-scale HTTPS interception that occurred in Kazakhstan, and was instrumental in driving changes in major web browsers that blocked the interception attack. Censored Planet data has been used to study Russia’s decentralized censorship mechanism, and the throttling attack they performed on Twitter. Censored Planet has also been used to identify the deployment of network censorship devices, and track the blocking of COVID-19 related websites around the world.

Censored Planet data is available to the public through the Censored Planet website. The Censored Planet raw data website contains archived compressed data files corresponding to one scan using each measurement technique. The data formats and tips for analyzing the data for each of the published data files and versions are available in the Censored Planet documentation.

For more information about using the data, please refer to the Censored Planet GitHub, or email Censored Planet at censoredplanet@umich.edu.

# Censored Planet Dashboard

The Censored Planet Dashboard, built in collaboration with Jigsaw Inc. is an exploratory data dashboard that uses data analyzed using the Censored Planet data analysis pipeline, and contains visualizations that allow easy exploration of Censored Planet measurements. The dashboard is currently in beta stage, please reach out to censoredplanet-analysis@umich.edu if you are interested in using the Censored Planet dashboard for your research.

# Documentation of Censored Planet Dashboard

https://docs.censoredplanet.org/analysis.html#censored-planet-dashboard

# Documentation of DNS (Satellite) data

https://docs.censoredplanet.org/dns.html

# Documentation of HTTP(S) (Hyperquack) data

https://docs.censoredplanet.org/http.html

# Documentation about Censored Planet data analysis

https://docs.censoredplanet.org/analysis.html

# Other sources

The following is a list of available data sources that can be used to help detect a censorship event that is currently on-going, or has taken place.