Daily Archives: February 25, 2016

Opening Doors To Massive Amounts of Metadata with OpenRefine

One of this week's tutorials taught us how to analyze metadata using a tool called OpenRefine. OpenRefine is a program built to clean data that has accumulated errors and unneeded information. There are four major things it can help with: eliminating duplicate records, separating multiple values in the same field, analysing the distribution of values throughout a data set and, finally, grouping together different representations of the same reality.

From the description it sounds incredibly useful, and I expect it will be once I continue to work with it and get used to its ins and outs. However, when first beginning with OpenRefine I had great difficulty just opening it, so my first impression was not great. I managed to work out that kink with Dr Graham, who thankfully helped me figure out that I was not going crazy and that it really was not working the way it should have been.

Once I had successfully got into the program, I followed the steps but made a mistake somewhere, so my numbers did not match those in the tutorial. Despite this, the program seemed to be changing the numbers correctly. Everything else went well, and it has convinced me that OpenRefine can be useful in future projects.

Below you will find all of my notes kept while completing the tutorial…



- Using OpenRefine can help us with 4 things relating to cleaning data

1. Remove duplicate records

2. Separate multiple values contained in the same field

3. Analyze the distribution of values throughout a data set

4. Group together different representations of the same reality

- As data gets reused more and more online, we need to be sure that it maintains its quality. OpenRefine helps us with issues like this.

- OpenRefine not only allows you to quickly diagnose the accuracy of your data, but also to act upon certain errors in an automated manner.
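To make the fourth item (grouping different representations of the same reality) more concrete for myself, here is a rough Python sketch of the idea. It is not OpenRefine's actual code, only a simplified imitation of keying each value by a normalised "fingerprint"; the function names and the sample values are made up for illustration.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Simplified key: lowercase, strip punctuation, sort the unique words."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values sharing a fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical category spellings, not taken from the tutorial's data set.
sample = [
    "Agricultural Equipment",
    "agricultural equipment.",
    "Equipment, Agricultural",
    "Clocks",
]
print(cluster(sample))
```

The three spellings of the agricultural category end up in one group because they reduce to the same key, while "Clocks" stays on its own. A person would then decide which spelling to keep for the whole group.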

How OpenRefine Works

- Average spreadsheet programs are really designed to work on one cell, row or column at a time, whereas interactive data transformation tools (IDTs) like OpenRefine are designed to work on much larger amounts of data all at once

- It also allows users to identify concepts in unstructured text (this is called Named-Entity Recognition [NER])


- I had a major problem getting OpenRefine to work: it downloaded successfully on my computer, but Google Chrome was blocking it from running

- Once I figured this out, I managed to successfully select the document that the tutorial instructed me to work with

- I then followed the instructions and unselected the checkbox marked ‘Quotation marks are used to enclose cells containing column separators’, as the Programming Historian (PH) tutorial states that the quotes inside the file have no meaning

- Taking the next step I clicked “Create Project” and it created over 75,800 rows of data

- The tutorial then suggests that I can open the persistent link to see the object on the museum website, but I can’t see the link…

- I do not quite understand what difference it makes to look at the data through different facets, as it doesn’t seem to change the data itself

- I was a little confused, so I restarted, and it came up with fewer rows than the original document. Maybe it still remembered the facets that I was working with earlier.

- Solved this problem the next day by clicking on the redo button, which gave me the option of starting from scratch. When I did this, all 75,814 rows came back

- moving on to detecting and removing duplicates

- successfully reordered the file numbers from biggest to smallest by clicking sort > numbers > largest/smallest

- I also sorted smallest to biggest, but for some reason I could not see the button to make the change permanent…

- I then successfully blanked out the duplicates by clicking ‘Edit cells’ > ‘Blank down’, which empties every cell that repeats the value directly above it

- then isolated those blank cells by clicking ‘Facet’ > ‘Customized facets’ > ‘Facet by blank’

- then successfully selected the rows that matched that category by clicking ‘true’, and removed them using the ‘All’ triangle (‘Edit rows’ > ‘Remove all matching rows’)

- I did, though, get different numbers than the tutorial has, due to a mistake I made somewhere earlier

- next I successfully split the multi-valued cells by clicking the ‘Edit cells’ > ‘Split multi-valued cells’ option

- again this was successful but I believe my numbers are skewed from theirs as I did not maintain the steps taken yesterday

- can switch between ‘rows’ and ‘records’ view by clicking on the so-labelled links just above the column headers. In the ‘rows view’, each row pairs one Record ID with a single Category, enabling manipulation of each one individually. The ‘records view’ has an entry for each Record ID, which can have different categories on different rows (grouped together in grey or white), but each record is manipulated as a whole

- narrowing down your metadata into different facets allows you to visualize how your information is broken down, and it also lets you see which categories apply to multiple pieces of data in ways you may not have imagined

- successfully exported the metadata into an HTML file using the export tab on the upper right of the screen
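To check my understanding of the steps in these notes, here is a minimal Python sketch of the same sequence: sort by ID, drop repeated IDs (the effect of blanking down and then removing the blank rows), split the multi-valued cells, and count values the way a text facet does. It is only an imitation of what the clicks do, not how OpenRefine works internally. The sample rows, the function names, the column names ‘Record ID’ and ‘Categories’, and the ‘|’ separator are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical sample rows; the real file has over 75,000 of them.
rows = [
    {"Record ID": "3", "Categories": "Clocks|Timekeeping"},
    {"Record ID": "1", "Categories": "Pottery"},
    {"Record ID": "3", "Categories": "Clocks|Timekeeping"},
    {"Record ID": "2", "Categories": "Pottery|Ceramics"},
]


def remove_duplicates(rows, key):
    """Sort numerically by key, then skip rows whose key repeats the row above."""
    ordered = sorted(rows, key=lambda row: int(row[key]))
    kept, previous = [], None
    for row in ordered:
        if row[key] == previous:
            continue  # the cell that 'Blank down' would have emptied
        kept.append(row)
        previous = row[key]
    return kept


def split_multi_valued(rows, column, separator="|"):
    """Produce one row per value found in a multi-valued cell."""
    result = []
    for row in rows:
        for value in row[column].split(separator):
            result.append({**row, column: value})
    return result


def text_facet(rows, column):
    """Count how often each value occurs, like a text facet's tallies."""
    return Counter(row[column] for row in rows)


unique = remove_duplicates(rows, "Record ID")
split = split_multi_valued(unique, "Categories")
facet = text_facet(split, "Categories")
print(len(unique), len(split), facet.most_common())
```

Writing it out this way also answered my earlier question about facets: counting the values does not change the rows at all, it only summarises them, which is why the data looked the same to me.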