Mining and Counting Files

In the most recent tutorial lesson from the Programming Historian ( we learned all about how to mine and count through files using the Bash Command Line. In a dramatic turn of events over the past two weeks I have been gaining more confidence when going through these tutorials. The main reason for this is that I have been moving cautiously in order to ensure that I do not skip over crucial steps (and I have learned that when using the command line EVERY step is crucial).

In addition, I have stuck with it so that certain basic steps, such as navigating through the computer on the command line, has come to feel more and more natural. More importantly though I have taken to writing down every step that I take. This has helped immensely as I have found that it forces me to think outside the box. What is meant by this is that I feel as though I am talking to myself when doing so, which allows me to more easily see where I am making mistakes.

Taking down notes helped me greatly when it came to getting through my last tutorial. This tutorial was all about learning how to instruct the computer to go through specified files and count certain things, such as number of words, or mine through it and tell you how many times certain words or numbers came up. The lesson also included instructions on how to create a subdirectory and how to move your results into that subdirectory. To get a better sense of what is meant by this please take a gander at my notes and let me know if there is anything that I can be doing more efficiently.

Digital History – Research Data with Unix

- the unix shell gives you access to a range of different commands that help you mine and count through research data

- options for counting and mining data though does depend on the amount of metadata or file names given to you

- in order to get the most out of the unix shell it is important to remember to take the time to structure your filing system.

- dowloaded the files to proghist-text successfully and now am about to open in the command line

- Note: CSV files are those in which the units of data (or cells) are separated by commas (comma-separated-values) and TSV files are those in which they are separated by tabs. Both can be read in simple text editors or in spreadsheet programs such as Libre Office Calc or Microsoft Excel.

- to count the contents of a file enter the command: wc -w “name of file” (worked correctly)

- if you want to know the number of lines instead of an actual word count, type wc -1 “name of file”

- in addition if you want to know a character count enter command: wc -m “name of file”


- the most frequent and useful use of the wc command for digital historians is to compare and contrast sizes of a source in digital format.

- wc can also be utilized with other wildcards like * which is an even easier way to compare multiple sources of research data.

- for instance wc -l 2014-01-31_JA_a*.tsv or wc -l “file name”_”file name”*.tsv


- if you wish to get the data put in a new file rather than just appearing in the terminal screen use the >

- for instance wc -l “file name”_”file name”*.tsv > results/”file name”_”file name”_wc.txt

- this will send the results to a newly created file in a subdirectory called results

- As well as counting files, the unix shell can mine through files using the grep command

- For instance you can enter grep “string, or character clusters” (in this case 1999) *.tsv so: grep 1999 *.tsv

- If you add -c to the command it prints how many times the given character cluster or string appears in a given file. In this case grep -c 1999 *.tsv

- Just like earlier you can export this to a brand new file in the results subdirectory in this case though it would look like grep -c 1999 2014-01-31_JA_*.tsv > results/2014-01-31_JA_1999.txt

- It does not need to mine for numbers alone as it can also mine for words

- To do this you simply need to put the word that you are mining for after the flag -c

- So if you were looking for the word “revolution” it would look like this: grep -c revolution 214-01-31_JA_america.tsv 2014-02-02_JA_britain.tsv

- I tried this and did not succeed, BUT i realized that it didn’t work because I didn’t get the file name correct! THIS IS CLEARLY IMPORTANT

- I kept getting the no such file or directory even with the correct file name so i am trying to go back a file, perhaps i am not in the correct directory

- IT WORKED!!! I was just in the wrong directory

- You can “i” flag after the “c” flag to go through the query again and this time have results prints results that are case insensitive, so for example -ci revolution will also pull out results for both “revolution” and “Revolution”. THIS WORKED!

- You can also move these numbers into another file like the other previous example from earlier.

- grep can also create subsets of tabulated data.

- for instance grep -i revolution 2014-01-31_JA_america.tsv 2014-02-02_JA_britain.tsv > 2016-02-12_JA_america_britain_i_revolution.tsv (this worked just fine once i actually included all the information

- Am going to skip the rm step because i am nervous as to what i will erase…

- continuing on though i am adding on the -v on to the command to exclude certain data elements

- you can also transform different files into different platforms using the > flag


Within the Unix shell you can now:

- use the wc command with the flags -w and -l to count the words and lines in a file or a series of files.

- use the redirector and structure > subdirectory/filename to save results into a subdirectory.

- use the grep command to search for instances of a string.

- use with grep the -c flag to count instances of a string, the -i flag to return a case insensitive search for a string, and the -v flag to exclude a string from the results.

- combine these commands and flags to build complex queries in a way that suggests the potential for using the Unix shell to count and mine your research data and research projects.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>