This week I had the opportunity to lead the class in a discussion about topic modelling. What is that, you ask? It is essentially a family of computer programs, Mallet being the most common (and the one we were working with), that take a text and try to extract topics from it. Computers obviously do not understand the meanings of the words extracted from the text, but they are able to find relationships between them by judging the frequency with which they appear together. From these relationships, the computer places words into baskets, and each basket is given a topic as a title.
Working with Mallet manually through the command line, though, is a very strenuous and meticulous process. For this reason Dr Graham suggested that I use a GUI (Graphical User Interface) program that runs on top of Mallet. This program is much more user friendly for me, as it allows me to build a topic model by clicking rather than by typing commands on the command line. The only downside is that the options are limited to what the program's designers thought necessary, not what I might consider necessary; if I were working with Mallet directly on the command line I could tailor the instructions more closely to my needs.
When completing this tutorial, I hit a few hiccups along the way. The first was compiling a solid list of stopwords (words for the computer to ignore during its analysis) that would help me find more detailed topics. Dr Graham guided me to a list of stopwords published by the historian Ted Underwood. This list consists of over 6,000 words, including Roman numerals. Once I began using it, the topics the computer produced were highly accurate when it came to analyzing my thesis. Another hiccup was figuring out how to analyze multiple files at one time in the GUI. This was quickly resolved when I was shown that instead of selecting the individual files, I simply select the folder they are all in. So if I wanted to analyze multiple article sources from my research, for instance, I would put them all in one folder and then select that folder. Finally, I had great trouble figuring out how to visualize these models with Excel; for instance, graphs can be made that show how different words relate to one another. This in particular is something I will have to keep working at, as I have yet to completely grasp it.
Below are the notes that I took during the process of completing this tutorial…
What is Topic Modelling?
– topic modelling tools like Mallet look at patterns in the use of words in an attempt to inject semantic meaning into vocabulary
– tools like this are transforming the practice of reading into what Franco Moretti calls ‘distant reading’
– What is meant by this is that computers and services such as Google Books are making it possible to scan millions of books for themes and patterns at the same time
– just because you can use these programs though doesn’t mean you should
– If you are trying to look over only one document, for instance, tools like Voyant Tools, which count word frequencies, may get the job done just fine
– topic modelling is all about finding topics in large amounts of text
– texts can be anything from a blog post to an email to a book
– Topic modelling programs do not know anything about the meaning of the words in a text.
– Instead, they assume that any piece of text is composed (by an author) by selecting words from possible baskets of words, where each basket corresponds to a topic. If that is true, then it becomes possible to mathematically decompose a text into the probable baskets from whence the words first came.
– The tool goes through this process over and over again until it settles on the most likely distribution of words into baskets, which we call topics.
– Topic modelling is often referred to interchangeably as LDA (Latent Dirichlet Allocation), after the most common algorithm used to do it
– when working with Mallet in the Terminal, always remember to add ./ before entering a command
– There are 9 Mallet commands that we can learn, and sometimes we can even combine these instructions
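For future reference, the command-line version of this workflow looks roughly like the sketch below. The folder name ~/thesis-txt and the output file names are my own placeholders, and the commands assume you are inside Mallet's own directory:

```shell
# Turn a folder of .txt files into a single .mallet file,
# removing the default English stop words along the way.
./bin/mallet import-dir --input ~/thesis-txt --output thesis.mallet \
    --keep-sequence --remove-stopwords

# Train a topic model on that file and write the topic word lists
# (and each document's topic proportions) out to plain-text files.
./bin/mallet train-topics --input thesis.mallet \
    --num-topics 20 --num-iterations 1000 \
    --output-topic-keys thesis_keys.txt \
    --output-doc-topics thesis_composition.txt
```

The GUI is essentially filling in these same options through menus; the number of topics and iterations here are the same knobs I fiddle with below.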
– after attempting to work with Mallet in the Terminal, I am now trying a different program that is essentially the same concept, however it is a GUI. Unlike the Terminal, I will not have to meticulously enter each command manually but rather have a much less tiring way of completing the tutorial by clicking on different options
– seeing as I don’t have the sample .txt files provided by Mallet I will use some of the sources from my thesis to complete the tutorial
– First I imported the files that I wanted to analyze
– next I made sure that there were stop words in place (it had already automatically set itself up to remove the default Mallet stop words such as and, the, of, but, if, etc.)
– however, I wanted to make sure that the stop words were more comprehensive so that it would find more meaningful themes in the file
– so I googled “stop word dictionary” and it led me to http://www.ranks.nl/stopwords (Dr Graham suggested that I use Ted Underwood’s list of stopwords, which contains over 6,000 words including Roman numerals, found here: https://www.ideals.illinois.edu/bitstream/handle/2142/45709/stoplist_final.txt?sequence=2)
– I put all the stop words into a new text file in TextWrangler but cannot figure out how to activate it instead of the default Mallet stop words, as whenever I try to open a stop word file it does not let me open any files inside of the user hollispeirce
– makes me think that perhaps I could solve this problem by saving it outside of these folders…
– the only folder I am able to get into is the Dropbox folder
– I managed to save it in Dropbox, but when I went to open it, it did not appear, so I will continue on with simply using the default Mallet list
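To see what a stop word list actually does, here is a tiny self-contained sketch; the three-word list and five-token text are made up for illustration, since Mallet applies its own list during import:

```shell
# A three-word stand-in for a real stop word list, one word per line.
printf 'the\nof\nand\n' > stoplist.txt

# A toy text, one token per line, the way Mallet sees it after tokenizing.
printf 'the\nhistory\nof\nthe\nbook\n' > tokens.txt

# Drop every token that appears in the stop list
# (-v invert the match, -w whole words only, -F fixed strings, -f read the list from a file).
grep -vwFf stoplist.txt tokens.txt
# prints:
# history
# book
```

On the command line, Mallet handles this step itself: import-dir accepts a --stoplist-file flag pointing at a file like Underwood's, which replaces the default list entirely, so the file-permission troubles above are specific to the GUI.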
– the first time, I asked it to learn 200 topics over 200 iterations and print 10 topic words with a 0.05 proportion threshold
– it spat out 200 lines of random words, the majority of which did not relate to the overall theme of the thesis paper but included things like “gz, uv, ku, rr, autumn,” etc.
– HOWEVER, when I reduced the number of iterations dramatically, I got a much more reasonable list that is much more accurate and includes words such as “incunabula, book, digital, information, ebook, scroll, history, disabled”, but it does include a few two-letter words like the other, so I will try reducing the number of iterations again
– that didn’t appear to do anything
– I attempted to import all the files of my thesis in one go but could not figure out how to open the result with TextWrangler after converting it from a .docx file to a .txt file, even though it opened just fine in TextEdit (I will try opening the .docx file in TextWrangler and see if that solves the issue)
– solved this issue when Dr Graham showed me that I only need to highlight the folder that the .txt files needing to be analyzed are in
– the only problem with that was that the folder not only contained my thesis .txt files but also the settings for Notational Velocity which, as he explained, was why I was getting a bunch of random two-letter words in my list of topics
– to solve this problem all I need to do is create a new folder that has only my thesis inside of it
– so I then did this, but it did not print a list of topic words in the topic modelling window like it normally did; instead it just reported where the topics were saved
– when I looked up the file, though, it had worked as expected and excluded all the two-letter words, so now I just need to narrow the topics to be even more specific
– having trouble making a chart from my thesis’s topic model in Excel, so I am going to try using the Jesuit Relations files
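One low-tech way into Excel, for when I come back to the charts: the composition file Mallet saves is just tab-separated text, so it can be converted to a CSV that Excel opens directly. The two-document, two-topic file below is made up for the sketch, and the exact column layout varies between Mallet versions:

```shell
# A hypothetical composition file: document number, source file,
# then one proportion per topic, all separated by tabs.
printf '0\tfile:/thesis/ch1.txt\t0.61\t0.39\n1\tfile:/thesis/ch2.txt\t0.25\t0.75\n' > composition.txt

# Excel opens CSV directly, so swap the tabs for commas.
tr '\t' ',' < composition.txt > composition.csv

cat composition.csv
# prints:
# 0,file:/thesis/ch1.txt,0.61,0.39
# 1,file:/thesis/ch2.txt,0.25,0.75
```

From there each topic is a column, so a stacked bar chart of topic proportions per document is only a few clicks away.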