Monthly Archives: February 2016

Opening Doors To Massive Amounts of Metadata with OpenRefine

One of this week's tutorials taught us how to analyze metadata using a tool called OpenRefine. OpenRefine is a program built to clean data that has accumulated unneeded or inconsistent information. There are four major things it can help with: eliminating duplicate records, separating multiple values in the same field, analysing the distribution of values throughout a data set and, finally, grouping together different representations of the same reality.

From the description it sounds incredibly useful and it will be once I continue to work with it and get used to its ins and outs. However, when first beginning with OpenRefine I had great difficulty just opening it so my first impression was not great. I managed to work out that kink though with Dr Graham as he thankfully helped me figure out that I was not going crazy and that it was not working the way it should have been.

Once I had successfully got into the program and began working with it, I followed the steps but made a mistake somewhere, so my numbers did not match those in the tutorial. Despite this, the program seemed to be altering the data correctly. Everything else went well, and it has convinced me that OpenRefine can be useful in future projects.

Below you will find all of my notes kept while completing the tutorial…

 

DON’T TAKE DATA AT FACE VALUE

- Using OpenRefine can help us with 4 things relating to cleaning data

1. Remove duplicate records

2. Separate multiple values contained in the same field

3.  Analyze the distribution of values throughout a data set

4. Group together different representations of the same reality

- as data gets reused more and more online we need to be sure that it maintains its quality; OpenRefine helps us with issues like this

- OpenRefine not only allows you to quickly diagnose the accuracy of your data, but also to act upon certain errors in an automated manner.

How OpenRefine Works

- Average spreadsheet programs are really designed to work on one cell, row or column at a time, whereas interactive data transformation tools (IDTs) like OpenRefine are designed to work on much larger amounts of data all at once

- it also allows users to identify concepts in unstructured text (this is called Named-Entity Recognition [NER])

Tutorial

- Major problem getting OpenRefine to work: it downloaded successfully on my computer, but Google Chrome was blocking it from running

- once I figured this out I managed to select the document that the tutorial instructed me to work with

- I then followed the instructions and unselected the checkbox marked ‘Quotation marks are used to enclose cells containing column separators’ as PH states that the quotes inside the file have no meaning to the file

- Taking the next step I clicked “Create Project” and it created over 75,800 rows of data

- It then suggests that I can open the persistent link to see the object on the museum website but I can’t see the link…

- Not quite understanding what difference it makes by looking at the data through different facets as it doesn’t seem to make any difference to the data

- was a little confused so I restarted, and it came up with fewer rows than the original document; maybe it still remembered the facets that I was working with earlier

- Solved this problem the next day by clicking on the redo button (the ‘Undo / Redo’ tab), which gave me the option of starting from scratch. When I did this all 75,814 rows came back

- moving on to detecting and removing duplicates

- successfully reordered the file numbers from biggest to smallest by clicking sort > numbers > largest/smallest

- also did smallest to biggest, but for some reason I could not see the button that makes the change permanent…

- I then successfully flagged the duplicates by clicking ‘Edit cells’ > ‘Blank down’, which turns every repeated value blank

- then isolated those blank cells by clicking ‘Facet’ > ‘Customized facets’ > ‘Facet by blank’

- then chose the rows that matched that category by clicking ‘true’, and eliminated them all using the ‘All’ triangle (‘Edit rows’ > ‘Remove all matching rows’)

- I did though get different numbers than they have in the tutorial due to a previous mistake I made somewhere

- next I successfully split the multi-valued cells by clicking the ‘Edit cells’ > ‘Split multi-valued cells’ option

- again this was successful but I believe my numbers are skewed from theirs as I did not maintain the steps taken yesterday

- can switch between ‘rows’ and ‘records’ view by clicking on the so-labelled links just above the column headers. In the ‘rows view’, each row represents a pairing of one Record ID and a single Category, enabling manipulation of each one individually. The ‘records view’ has an entry for each Record ID, which can have different categories on different rows (grouped together in grey or white), but each record is manipulated as a whole

- narrowing down your metadata into different facets allows you to visualize how your information is broken down, and it also lets you see which categories match multiple pieces of data in ways that you may not have imagined

- successfully exported the metadata into an HTML file using the ‘Export’ menu on the upper right of the screen
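
For anyone more at home on the command line, the deduplicate, split and facet steps in the notes above can be roughly imitated with standard Unix tools. This is only a sketch under stated assumptions: the file names, record IDs, categories and the | separator are all made up for illustration, and it is not how OpenRefine works internally (for one thing, it removes exact duplicate rows rather than blanking down a single column).

```shell
# Hypothetical sample data: Record ID <TAB> Categories (several values joined by |)
workdir=$(mktemp -d)
printf '1\tCoins|Medals\n2\tCoins\n2\tCoins\n3\tMedals|Tokens|Coins\n' > "$workdir/sample.tsv"

# 1. Remove exact duplicate rows (similar in spirit to blank down + remove blank rows)
sort -u "$workdir/sample.tsv" > "$workdir/deduped.tsv"

# 2. Split the multi-valued cells: one category per line
cut -f2 "$workdir/deduped.tsv" | tr '|' '\n' > "$workdir/categories.txt"

# 3. A poor man's text facet: count each category, most frequent first
facet=$(sort "$workdir/categories.txt" | uniq -c | sort -rn | awk '{print $2 "=" $1}')
echo "$facet"
```

The last step is the useful one for getting a feel for what a facet does: it changes nothing in the data, it only summarises how the values are distributed.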

Mining and Counting Files

In the most recent tutorial lesson from the Programming Historian (http://programminghistorian.org/) we learned all about how to mine and count through files using the Bash Command Line. In a dramatic turn of events over the past two weeks I have been gaining more confidence when going through these tutorials. The main reason for this is that I have been moving cautiously in order to ensure that I do not skip over crucial steps (and I have learned that when using the command line EVERY step is crucial).

In addition, I have stuck with it, so that certain basic steps, such as navigating through the computer on the command line, have come to feel more and more natural. More importantly though, I have taken to writing down every step that I take. This has helped immensely, as I have found that it forces me to step back from what I am doing: I feel as though I am talking to myself, which allows me to more easily see where I am making mistakes.

Taking down notes helped me greatly when it came to getting through my last tutorial. This tutorial was all about learning how to instruct the computer to go through specified files and count certain things, such as number of words, or mine through it and tell you how many times certain words or numbers came up. The lesson also included instructions on how to create a subdirectory and how to move your results into that subdirectory. To get a better sense of what is meant by this please take a gander at my notes and let me know if there is anything that I can be doing more efficiently.

Digital History – Research Data with Unix

- the unix shell gives you access to a range of different commands that help you mine and count through research data

- the options for counting and mining data do depend, though, on the amount of metadata or file names given to you

- in order to get the most out of the unix shell it is important to remember to take the time to structure your filing system.

- downloaded the files to proghist-text successfully and am now about to open them in the command line

- Note: CSV files are those in which the units of data (or cells) are separated by commas (comma-separated-values) and TSV files are those in which they are separated by tabs. Both can be read in simple text editors or in spreadsheet programs such as Libre Office Calc or Microsoft Excel.

- to count the contents of a file enter the command: wc -w “name of file” (worked correctly)

- if you want to know the number of lines instead of a word count, type wc -l “name of file”

- in addition if you want to know a character count enter command: wc -m “name of file”

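- NOTE: IN GENERAL THESE COMMANDS ARE CASE SENSITIVE. The flags always are (-l and -L are different flags), and so are file names on most Unix systems, although the default Mac file system is forgiving about case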

- the most frequent and useful use of the wc command for digital historians is to compare and contrast sizes of a source in digital format.

- wc can also be used with wildcards such as *, which is an even easier way to compare multiple sources of research data.

- for instance wc -l 2014-01-31_JA_a*.tsv or wc -l “file name”_”file name”*.tsv

- REMEMBER THAT IT IS A SMALL “L” NOT A “1”

- if you wish to put the results in a new file rather than just having them appear in the terminal screen, use the > redirect operator

- for instance wc -l “file name”_”file name”*.tsv > results/”file name”_”file name”_wc.txt

- this will send the results to a newly created file in a subdirectory called results (the results subdirectory itself has to exist first, e.g. created with mkdir results)

- As well as counting files, the unix shell can mine through files using the grep command

- For instance you can enter grep “string, or character clusters” (in this case 1999) *.tsv so: grep 1999 *.tsv

- If you add -c to the command it prints how many lines contain the given character cluster or string in each file (a line with two matches still counts once). In this case grep -c 1999 *.tsv

- Just like earlier you can export this to a brand new file in the results subdirectory in this case though it would look like grep -c 1999 2014-01-31_JA_*.tsv > results/2014-01-31_JA_1999.txt

- It does not need to mine for numbers alone as it can also mine for words

- To do this you simply need to put the word that you are mining for after the flag -c

- So if you were looking for the word “revolution” it would look like this: grep -c revolution 2014-01-31_JA_america.tsv 2014-02-02_JA_britain.tsv

- I tried this and did not succeed, BUT i realized that it didn’t work because I didn’t get the file name correct! THIS IS CLEARLY IMPORTANT

- I kept getting the no such file or directory even with the correct file name so i am trying to go back a file, perhaps i am not in the correct directory

- IT WORKED!!! I was just in the wrong directory

- You can add the “i” flag after the “c” flag to run the query again and this time get results that are case insensitive, so for example grep -ci revolution counts lines containing either “revolution” or “Revolution”. THIS WORKED!

- You can also move these numbers into another file, as in the earlier example.

- grep can also create subsets of tabulated data.

- for instance grep -i revolution 2014-01-31_JA_america.tsv 2014-02-02_JA_britain.tsv > 2016-02-12_JA_america_britain_i_revolution.tsv (this worked just fine once I actually included all the information)

- Am going to skip the rm step because i am nervous as to what i will erase…

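- continuing on though, I am adding the -v flag to the command, which inverts the search and so excludes the lines that match

- you can also save the output under a different file name or extension using > (a redirect operator rather than a flag); changing the extension on its own does not convert the contents to a different format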
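
The counting steps in the notes above can be tried safely on throwaway files. The sketch below is a minimal example under assumptions: the two .tsv files and their contents are invented stand-ins for the tutorial's data, and everything happens in a temporary directory so nothing real is touched.

```shell
# Throwaway files standing in for the tutorial's data (names and contents are made up)
workdir=$(mktemp -d)
cd "$workdir"
printf 'one two three\nfour five\n' > 2014-01-31_JA_africa.tsv
printf 'six\nseven eight\nnine\n' > 2014-01-31_JA_america.tsv

# Count words and lines; -l is a lowercase L, not the digit 1
# (tr strips the padding that some versions of wc put before the number)
words=$(wc -w < 2014-01-31_JA_africa.tsv | tr -d ' ')
lines=$(wc -l < 2014-01-31_JA_america.tsv | tr -d ' ')
echo "words: $words, lines: $lines"

# The results directory has to exist before anything is redirected into it
mkdir -p results

# The * wildcard picks up both files; > writes the counts to a file instead of the screen
wc -l 2014-01-31_JA_a*.tsv > results/2014-01-31_JA_a_wc.txt
cat results/2014-01-31_JA_a_wc.txt
```

When wc is given more than one file it also prints a total line, which is handy for comparing the sizes of several sources at a glance.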

Summary

Within the Unix shell you can now:

- use the wc command with the flags -w and -l to count the words and lines in a file or a series of files.

- use the redirector and structure > subdirectory/filename to save results into a subdirectory.

- use the grep command to search for instances of a string.

- use with grep the -c flag to count instances of a string, the -i flag to return a case insensitive search for a string, and the -v flag to exclude a string from the results.

- combine these commands and flags to build complex queries in a way that suggests the potential for using the Unix shell to count and mine your research data and research projects.
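
As a quick way of checking what the three grep flags in this summary actually do, here is a small sketch. The file names and their contents are assumptions invented for the example (they are not the tutorial's data), and the work is done in a temporary directory.

```shell
# Made-up tab-separated data: title <TAB> year
workdir=$(mktemp -d)
cd "$workdir"
printf 'The American Revolution\t1999\nA revolution in print\t2001\nTrade routes\t1999\n' > america.tsv
printf 'Glorious Revolution\t1988\nIndustrial revolution\t1999\n' > britain.tsv

# -c counts the matching lines in each file
grep -c 1999 *.tsv

# Without -i only the lowercase spelling matches; with -i both spellings do
exact=$(grep -c revolution america.tsv)
anycase=$(grep -ci revolution america.tsv)
echo "exact: $exact, any case: $anycase"

# -v inverts the search: keep only the lines that do NOT mention the word
mkdir -p results
grep -iv revolution america.tsv britain.tsv > results/not_revolution.txt
cat results/not_revolution.txt
```

When grep searches more than one file it prefixes each result with the file name, which is why the saved subset shows which file every line came from.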

Outline of Final Project

For my final project for Digital History (#hist5702w) I am building a game on Twine that takes themes from my honours thesis and puts them into a real world situation. The real world situation that I am choosing to work with is my own. So far it is a very basic concept in my head but I am hoping that it will develop into a game that will allow the player to gain a different perspective on how an individual with a physical disability gets through a post-secondary career.

To accomplish this goal I will create different options for the player/student to choose that will affect both the outcomes of their post-secondary career and their health. These options are based on my own experiences here in my career at Carleton University. The options will range from what format of textbook the student uses to whether they take notes in class themselves or have a note taker.

Again, this game will be created on Twine. At present I have little experience with the program: I know how to do basic things like create new links, but I am hoping to work through the story first and then perhaps work on backgrounds and other more advanced features later. For now though I will just continue on until the actual story is complete.

Recognizing & Solving Problems

Over this semester I have been learning quite a few new things. First, I have learned how to use new digital tools that will, although it may not seem like it now, help me complete future research more efficiently. Second, I have learned how to put my incomplete work online (as discussed in my previous post). Finally though, I have learned that I don’t need to suffer on my own. My mind was blown by this idea. “What?!?! Dr Graham you are saying that we can share our issues with other people and ask for help?” No word of a lie that was a hard concept to wrap my head around.

Therefore, following these lessons, I am moving forward and using them to help solve the problems I am having. For example, I am writing down every single thing I do for each tutorial in Notational Velocity, one of the many new tools that I have been turned on to and whose usefulness I am slowly learning more about. Also, I am posting all my notes on this blog. Below you can see the notes that I took yesterday (February 9, 2016) when I attempted the Command Line Tutorial for a second time.

This time I was able to complete the tutorial successfully, despite a couple of moments when I thought I had done something wrong; I was later assured that I had not. Check the notes out and let me know if you have any suggestions.

Digital History – Command Line Tutorial Notes 2

- typed in pwd command to orient myself then hit the ls command to get a listing of the files and directories within my current location which is /users/hollispeirce1. These are:

Applications   Downloads   Movies   Public   projects   Desktop   Dropbox   Music   Sites   python   Documents   Library   Pictures   mallet-2.0.7

- Flags: these are additions to a command that provide the computer with a bit more guidance with what sort of output or manipulation that you want

- Playing around with changing files and directories but for some reason I still don’t understand how to move into files that have two words or more in the title. Tried _ , – , and . .

- Figured out how to do it! to get into files with two or more word titles i just have to use “quotation marks”

- Succeeded in using the -l flag to get more information on the main files and directories

- Adding an h to the -l flag (-lh) tells the computer to display the file sizes in a human-readable format (e.g. K or M rather than a long count of bytes)

- successfully moved straight to mallet by typing: cd /users/hollispeirce1/mallet-2.0.7

- i also was able to open mallet by typing: open . once i was in mallet and the window opened up

- created a new directory on the desktop called ProgHist-Text by entering the command: mkdir ProgHist-Text

- can now move in and out of it as desired, and successfully moved into it using auto-complete with the tab button; remember though that auto-complete is case sensitive

- figured out how to read a file on the command line by typing the command: cat name-of-text.txt

- When I hit the up arrow it cycles through the most recent commands and the down arrow goes through the commands in the other direction

- successfully duplicated the file by using the command: cp name-of-text.txt name-of-text2.txt

- and moved it into a smaller one with: cat name-of-text

- to open vim and edit a txt file in terminal enter the command: vim name-of-text.txt

- to make an edit press the a key (a keystroke inside vim rather than a flag), which lets you edit the text, and press escape to go back to reading

- to save anything in vim press escape, type :w and press enter

- to leave vim press escape, type :q and press enter

- you can also combine these two like other commands, BUT WATCH OUT AS YOU CAN QUIT WITHOUT SAVING, SO TO SAVE AND QUIT ENTER :wq

- create a back up before moving a file by entering cp file-name.txt file-name-backup.txt
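
The navigation and backup steps in these notes can be rehearsed without risk in a scratch directory. This is a minimal sketch under assumptions: the directory name “Digital History” and the file contents are hypothetical examples, and mktemp is used so that nothing on the Desktop is affected.

```shell
# Work in a scratch directory so nothing real is touched
workdir=$(mktemp -d)
cd "$workdir"
mkdir "Digital History" ProgHist-Text   # hypothetical directory names

# Quotation marks keep a two-word name together; cd Digital History would fail
cd "Digital History"
pwd

# .. means "one level up", so this hops across to the sibling directory
cd ../ProgHist-Text

# Create a small file, then back it up before doing anything else to it
printf 'line one\nline two\n' > name-of-text.txt
cp name-of-text.txt name-of-text-backup.txt

# -l gives the long listing, h makes the sizes human-readable
ls -lh

# cmp -s is silent and succeeds only if the two files are identical
cmp -s name-of-text.txt name-of-text-backup.txt && echo "backup matches the original"
```

A backslash before the space (cd Digital\ History) is another common way of writing a two-word name, and it is what tab auto-complete usually inserts.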

Posting Notes Online

As mentioned in the post “Learning To Share Your Work”, I have learned from Dr Graham that there is a great benefit to posting your finished work online. However, it is also being pointed out in his class now that there is a great benefit to posting your notes online as well. This way you can put your frustrations out there and allow your fellow academics to have the opportunity to help you. Perhaps someone in the community has had similar issues as you and has figured out a solution to that very same problem. I must admit I was a little nervous about posting my finished work online let alone my gibberish notes that I take when working. This is even more nerve-racking as it would seem to me that my notes would only make sense to me.

I must admit though that the concept does seem to make sense. So here goes nothing! These notes are what I wrote down thus far when completing the Programming Historian’s tutorial on how to use the Bash Command Line.

Digital History – Command Line Tutorial

– First command worked fine, pwd brought up: /Users/hollispeirce
– Had trouble with the next 1s command
– for some reason it said “command not found”
– No matter what I do with the cd desktop command it says permission denied
– So apparently I was hitting the wrong thing it was “ls” not “1s”
– Learning the hard way that command line only handles one step at a time
– Entered “ls” again and it for some reason gave me a huge list of where i was
– now for some reason my print working directory (pwd) has changed
– Figured out to do absolutely everything separately and it is fairly easy to navigate around
– Following instruction implicitly and being sure to check where I am by typing “ls” to keep track of where I am is important
– Creating new files worked just fine now to try copying files
– A little confused by copying files as it has told me no such file or directory
– seem to have figured it out but not sure how i did it… ASK ABOUT THIS
– if you delete an item in the command line it is gone for good unlike simply moving an item to the trash bin
– deleted a file successfully but now need to look at deleting directories
– for some reason it keeps telling me “-rm: command not found” after typing “-rm -rf anotherdir/”
– am so confused by how to download a developer package.
– installed brew but forgot to press Return to complete the download
– so i tried re entering the download but it keeps saying “400 bad request” so i am reverting back a few steps and figuring out python
– saved python snippet to digital history folder for now as i cannot locate python
– managed to download the most up to date python from https://www.python.org/
– have successfully created a new directory to save python files into
– and i have successfully saved the snippet inside of it
– Learning thanks to the Bash tutorial what a “flag” is when dealing with the command line
– * is a wildcard rather than a flag: it stands in for any part of a file name, so for example ls *.txt displays only the text files in the directory
– Tried to list multiple files and exclude others with the command: ls *-Scan1.jpeg , Scan2.jpeg , Scan3.jpeg but it failed.
– Am going to try to solve problem by moving commas around
– Didn’t work. THIS IS SOMETHING TO ASK ABOUT WHEN THIS TUTORIAL COMES UP IN CLASS
– Not understanding the difference between the basic ls command and adding -1 or also h along with it as for me it just displays the directories in a list without the additional information it is supposed to
– IMPORTANT NOTE!!!!! THE COMMAND cd -- WILL BRING YOU RIGHT BACK TO THE STARTING POINT (YOUR HOME DIRECTORY), ALMOST LIKE A RESET BUTTON
– Having trouble moving up and down through my file systems with cd whatever so i will have to investigate this further
– I tried jumping directly to my desired directory or file instead by typing in /users/hollispeirce1$/whatever but Terminal told me the same thing as it did with the other strategy: no such file or directory
– tried using the command: open whatever file . but it stated that that was not a line
– so i also tried another file with a one word title and it worked just fine so it may be that i am not writing the other titles correctly
– tried another file with a one word title and it worked just fine so that is definitely the problem
– pressing the tab button after writing half of a file name prompts the command line to attempt to auto-complete it, using the subdirectories or files in the current directory
– THIS IS CASE SENSITIVE
– Managed to download and save War and Peace from the Project Gutenberg website but for some reason it would not open up when I followed the instructions in the tutorial. INVESTIGATE THIS FURTHER
– because of the above problems I did not get anywhere with editing a file in the command line
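
Several of the sticking points in these notes (listing several files at once, the missing extra information from ls, and the “-rm: command not found” message) look like small syntax slips. The sketch below is one possible reading of them rather than a definitive diagnosis; the file names are hypothetical and everything runs in a temporary directory.

```shell
# Scratch directory with made-up files to practise on
workdir=$(mktemp -d)
cd "$workdir"
touch 2016-Scan1.jpeg Scan2.jpeg Scan3.jpeg notes.txt
mkdir anotherdir

# File names are separated by spaces, never commas
ls *-Scan1.jpeg Scan2.jpeg Scan3.jpeg

# A single wildcard pattern can pick out all the scans and leave notes.txt out
scans=$(ls *Scan*.jpeg | wc -l | tr -d ' ')
echo "scans found: $scans"

# ls -1 (digit one) only prints one name per line;
# ls -l (lowercase L) is the flag that adds sizes, dates and permissions
ls -1
ls -lh

# The command is rm, with no hyphen in front; hyphens belong to the flags.
# There is no trash bin here, so this is only safe inside a scratch directory.
rm -rf anotherdir/
```

Reading the error messages literally helps: “-rm: command not found” means the shell looked for a program called -rm, just as “1s” was looked up as a program name instead of ls.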

Let me know if any of that makes sense, or if you know how to solve any of my problems!