Skip to Main Content

OpenRefine

An introduction to this free, open source tool for working with "messy" data.

What is "messy" data?

Inconsistent formats, unnecessary white space, extra characters, and typos are all examples of messy data. Working with the data in the table below would be difficult due to inconsistent formatting. Cleaning this data would ultimately improve the data's quality and enable more accurate analysis.

2019-11-23 $2,000 CT
11/23/2019 2000 C.T.
11/23/19 2,000 US-CT
Nov 11, 2019 2000 dollars connecticut
Thu, Nov 11th US$2000 Connecticut,
39952 $2k Connecticut

 

What is cleaning data and why is it important?

Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted. This data is usually not necessary or helpful when it comes to analyzing data because it may hinder the process or provide inaccurate results. the goal of data cleaning is to create data sets that are standardized and uniform to allow business intelligence and data analytics tools to easily access and find the right data for each query. Source

OpenRefine can help!

Formerly known as Google Refine, OpenRefine is a free, open source application for working with "messy" data. It  looks like a spreadsheet, but operates like a database, allowing for increased discovery capabilities beyond programs like Microsoft Excel. With a simple interface, OpenRefine is a powerful but user-friendly program for exploring and cleaning messy data. With its ability to incorporate textual cleaning techniques (such as clustering and faceting), OpenRefine provides an advanced alternative to Excel without needing to understand computer programming.

Use cases for OpenRefine include:

Exploring - navigate and evaluate quality with visualizations and filters that help dig deeply into the data so you can get to know it better…

Cleaning - efficiently discover and fix inconsistency with faceting, clustering, cell transforms, GREL expressions…

Transforming - easily change formats, subset, or reshape with split/join multi valued cells, split columns, transpose columns/rows…

Extending - enrich data by combining files, merging projects, fetching URLs, reconciliation with online databases…