RESOURCES

OFFICE HOURS

Prof. L. Shereen Sakr
SSMS 2020

Wednesdays - 5pm-6pm

lailashereensakr at ucsb dot edu

ADDRESS

Film and Media Studies Department

2243 Social Sciences & Media Studies Bldg.
UC Santa Barbara

Santa Barbara, CA 93106-4010

Free Data Sources

1. World Bank Open Data

World Bank Open Data is a repository of comprehensive data on countries around the world, which makes it a vital source of open data. It also provides access to other datasets, which are listed in its data catalog.

World Bank Open Data is massive: it holds 3,000 datasets and 14,000 indicators, encompassing microdata, time series statistics, and geospatial data.

2. WHO (World Health Organization) — Open data repository

WHO’s Open Data repository is how WHO keeps track of health-specific statistics for its 194 Member States. The repository keeps the data systematically organized so that it can be accessed according to different needs. For instance, whether the topic is mortality or the burden of disease, one can access data classified under 100 or more categories, such as the Millennium Development Goals (child nutrition, child health, maternal and reproductive health, immunization, HIV/AIDS, tuberculosis, malaria, neglected diseases, water and sanitation), noncommunicable diseases and risk factors, epidemic-prone diseases, health systems, environmental health, violence and injuries, and equity.

For your specific needs, you can browse the datasets by theme, category, indicator, and country.

3. Google Public Data Explorer

Launched in 2010, Google Public Data Explorer helps you explore vast amounts of public-interest datasets, and visualize and communicate the data for your own purposes. It makes data from different agencies and sources available. For instance, you can access data from the World Bank, U.S. Bureau of Labor Statistics and U.S. Bureau, OECD, IMF, and others. Whether you are a student, a journalist, a policy maker, or an academic, you can use this tool to create visualizations of public data.

4. Registry of Open Data on AWS (RODA)

This is a repository of public datasets that are available through AWS resources.

With RODA, you can discover and share publicly available data.

In RODA, you can search for the data you need using keywords and tags for common types of data, such as genomic, satellite imagery, and transportation. All of this is possible through a simple web interface.

Every dataset has a detail page with usage examples, license information, and tutorials or applications that use the data.

By making use of a broad range of compute and data analytics products, you can analyze the open data and build whatever services you want.

While the data you access is available through AWS resources, you need to bear in mind that it is not provided by AWS. This data belongs to different agencies, government organizations, researchers, businesses and individuals.

5. European Union Open Data Portal

The European Union Open Data Portal is a single platform where you can access the open data that EU institutions, agencies, and other organizations publish.

The EU Open Data Portal is home to vital open data pertaining to EU policy domains. These policy domains include economy, employment, science, environment, and education.

Around 70 EU institutions, organizations, or departments, such as Eurostat, the European Environment Agency, the Joint Research Centre, and other European Commission Directorates General and EU Agencies, have made their datasets public and allowed access. More than 11,700 datasets have been published to date.

6. FiveThirtyEight

It is a great site for data-driven journalism and story-telling.

It shares the data behind its stories across a variety of sectors, such as politics, sports, science, and economics. You can download the data as well.

Each dataset comes with a brief explanation of its source, what it represents, and how to use it.

To keep this data user-friendly, it provides datasets in formats that are as simple and non-proprietary as possible, such as CSV files. These formats can be easily accessed and processed by humans as well as machines.
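Because CSV is plain text, a few lines of standard-library Python are enough to read such a file. The following is only a minimal sketch: the column names and values are invented for illustration and do not come from any actual FiveThirtyEight dataset.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded CSV file;
# the columns and values are invented for illustration only.
SAMPLE_CSV = """team,year,wins
Alpha,2019,10
Beta,2019,7
Alpha,2020,12
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def total_wins(rows, team):
    """Sum the 'wins' column for one team (CSV values arrive as strings)."""
    return sum(int(row["wins"]) for row in rows if row["team"] == team)


rows = load_rows(SAMPLE_CSV)
print(total_wins(rows, "Alpha"))  # prints 22 (10 + 12)
```

For a real download, the same `csv.DictReader` call could be given an opened file instead of the in-memory string.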

7. U.S. Census Bureau

The U.S. Census Bureau is the largest statistical agency of the federal government. It stores and provides reliable facts and data about the people, places, and economy of America.

The Census Bureau considers it its mission to serve as the most reliable provider of quality data.

Federal, state, local, and tribal governments all use census data for a variety of purposes. They use it to determine the location of new housing and public facilities, and to examine the demographic characteristics of communities, states, and the USA.

8. Data.gov

Data.gov is the treasure-house of the US government’s open data. It was only recently that the decision was made to make all government data available for free.

When it was launched, there were only 47 datasets. There are now 180,000.

9. DBpedia

As you know, Wikipedia is a great source of information. DBpedia aims to extract structured content from the valuable information created in Wikipedia.

With DBpedia, you can semantically search and explore the relationships and properties of Wikipedia resources. This includes links to other related datasets as well.

There are around 4.58 million entities in the DBpedia dataset. Of these, 4.22 million are classified in an ontology, including 1,445,000 persons, 735,000 places, 123,000 music albums, 87,000 films, 19,000 video games, 241,000 organizations, 251,000 species, and 6,000 diseases.

There are labels and abstracts for these entities in around 125 languages. There are 25.2 million links to images. There are 29.8 million links to external web pages.

To use DBpedia, you can either write SPARQL queries against its endpoint or download its data dumps.
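Querying the endpoint might look like the sketch below, which only builds the request URL and parses a results document, so it runs without a network connection. The endpoint address, the `format` parameter, and the example query are assumptions made for illustration, and the sample response is hand-written rather than real DBpedia output.

```python
import json
from urllib.parse import urlencode

# Assumption: DBpedia's public SPARQL endpoint; check the DBpedia site
# for the current address before relying on it.
ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query: up to five resources typed as films, with English labels.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?film ?label WHERE {
  ?film a dbo:Film ;
        rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
LIMIT 5
"""


def build_query_url(endpoint, query):
    """Encode a SPARQL SELECT query as an HTTP GET URL."""
    # 'query' is the parameter name defined by the SPARQL protocol;
    # 'format' is an assumption about what this endpoint accepts.
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


def extract_column(results_json, var):
    """Pull one variable's values out of a SPARQL JSON results document."""
    doc = json.loads(results_json)
    return [b[var]["value"] for b in doc["results"]["bindings"] if var in b]


url = build_query_url(ENDPOINT, QUERY)

# Hand-written stand-in for a response, so the sketch runs offline.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["film", "label"]},
    "results": {"bindings": [
        {"film": {"type": "uri", "value": "http://example.org/film/1"},
         "label": {"type": "literal", "xml:lang": "en", "value": "Example Film"}},
    ]},
})
print(extract_column(SAMPLE_RESPONSE, "label"))  # prints ['Example Film']
```

To run the query for real, the URL could be fetched with `urllib.request.urlopen` and the response body passed to `extract_column`.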

10. freeCodeCamp Open Data

It is an open source community. It matters because it enables you to learn to code, build pro bono projects for nonprofits, and get a job as a developer.

To make this happen, the freeCodeCamp.org community makes enormous amounts of data available every month and has turned it into open data.

You will find a variety of things in this repository: datasets, analyses of those datasets, and even demos of projects based on the freeCodeCamp data. You can also find links to external projects involving the freeCodeCamp data.

11. Yelp Open Datasets

The Yelp dataset is a subset of Yelp’s own businesses, reviews, and user data, made available for use in personal, educational, and academic pursuits.

There are 5,996,996 reviews, 188,593 businesses, 280,991 pictures and 10 metropolitan areas included in Yelp Open Datasets.

12. UNICEF Dataset

Since UNICEF concerns itself with a wide variety of critical issues, it has compiled relevant data on education, child labor, child disability, child mortality, maternal mortality, water and sanitation, low birth-weight, antenatal care, pneumonia, malaria, iodine deficiency disorder, female genital mutilation/cutting, and adolescents.

13. Kaggle

Kaggle is great because it promotes the use of different dataset publication formats. Better still, it strongly recommends that dataset publishers share their data in an accessible, non-proprietary format.

14. LODUM

It is the Open Data initiative of the University of Münster. Under this initiative, anyone can access any public information about the university in machine-readable formats, and can easily reuse it as needed. You can use the SPARQL editor or the SPARQL package for R to analyze the data. The SPARQL package lets you connect to a SPARQL endpoint over HTTP and pose a SELECT query or an update query (LOAD, INSERT, DELETE).

15. UCI Machine Learning Repository

It serves as a comprehensive repository of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.

At present, the repository maintains 463 datasets as a service to the machine learning community.

The Center for Machine Learning and Intelligent Systems at the University of California, Irvine hosts and maintains it. David Aha originally created it as a graduate student at UC Irvine.

Since then, students, educators, and researchers all over the world have used it as a reliable source of machine learning datasets.

Open Data Portals and Search Engines:

While there are plenty of datasets published by numerous agencies every year, very few datasets become recognized and established.

Few datasets endure as useful resources because it is a challenge to develop, manage, and provide data in a way that people and organizations find useful and easy to use.

Below is a list of a few other important open data portals and platforms that let users access open data easily, study its impact, and glean valuable insights.

  1. Google Dataset Search

  2. Dataverse

  3. Open Data Kit

  4. CKAN

  5. Open Data Monitor

  6. Plenar.io

  7. Open Data Impact Map

Glitch