Task to Complete

Create a program that uses a neural network and stylometric techniques to determine the authorship of a disputed text.  (This project will be focusing specifically on William Shakespeare’s disputed works.)

May 11th, 2017

  • I spent this class writing a Text class for my program.  This class would represent a specific work and have a title and author.  It would also have a filename corresponding to the .txt file the actual text would be stored in.
  • In this class, I wrote an analyzeText method to analyze the text from the file.  This would first create a hash map whose keys are a list of “function words” (read from another file), and then fill in the values with the frequency of each corresponding word in that particular text.
  • With this design, I can easily get the hash map from the Text, convert it into an array, and use it as the inputs for a neural network.
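
A minimal sketch of the analysis step described above, written as a standalone helper and not as the actual Text class.  The class name FunctionWordFrequencies, the method names, and the choice to divide each count by the total word count are all assumptions for illustration; the log does not say whether raw counts or relative frequencies were used.  A LinkedHashMap is used here so that the values come out in the same order for every text, which matters once the map is flattened into an input array.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; names and normalization are assumptions, not the project's code.
final class FunctionWordFrequencies {

    // Returns, for each function word, its share of all words in the text.
    static Map<String, Double> analyze(String text, List<String> functionWords) {
        // Start every function word at zero so absent words still get an entry.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : functionWords) {
            counts.put(word.toLowerCase(Locale.ROOT), 0);
        }

        // Split on anything that is not a letter or apostrophe.
        int totalWords = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (token.isEmpty()) {
                continue;
            }
            totalWords++;
            Integer current = counts.get(token);
            if (current != null) {
                counts.put(token, current + 1);
            }
        }

        // Convert raw counts to relative frequencies.
        Map<String, Double> frequencies = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            double value = totalWords == 0 ? 0.0 : (double) entry.getValue() / totalWords;
            frequencies.put(entry.getKey(), value);
        }
        return frequencies;
    }

    // Flattens the map into an array, in function-word order, for use as network input.
    static double[] toInputArray(Map<String, Double> frequencies) {
        double[] inputs = new double[frequencies.size()];
        int i = 0;
        for (double value : frequencies.values()) {
            inputs[i++] = value;
        }
        return inputs;
    }

    public static void main(String[] args) {
        List<String> functionWords = Arrays.asList("the", "and", "of");
        Map<String, Double> frequencies = analyze("The cat and the dog", functionWords);
        System.out.println(Arrays.toString(toInputArray(frequencies)));
    }
}
```

Dividing by the total word count keeps long and short texts comparable; if raw counts are preferred, the division can simply be dropped.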

May 15th, 2017

  • I spent this class creating the neural network for my program.  Because I already had experience using Neuroph in the previous project, this process was extremely fast for me.
  • I decided to evaluate the likelihood of authorship for four different authors.  The input to the neural network would be the aforementioned function word frequencies.  The output would be an array of four doubles, each corresponding to the likelihood that a given author wrote the text.
  • I created a new singleton Stylometry class to carry out the neural network training.  It would first create a four-layer network (I decided to use two hidden layers because this made the training much faster, and the determination much more accurate).  Then, it would go through the Texts it had created and create a DataSet.  Finally, it would train the neural network with the data.
  • I had some remaining time at the end of class, so I created the .txt files of the works.  I collected approximately six to seven texts each from Shakespeare and Marlowe, and fewer (approximately two or three) from the less prolific writers, Rowley and Peele.  I then collected five disputed texts that are part of the “Shakespeare Apocrypha”.
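
A small sketch of how the desired output for each training text could be encoded, assuming a “one-hot” scheme where the slot for the known author is 1.0 and the other three are 0.0.  The class name AuthorTargets and the order of the authors are assumptions for illustration; the log only says that the output is an array of four doubles.  Each function word array paired with one of these target arrays would then make up one row of the DataSet.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the author order below is an assumption.
final class AuthorTargets {

    // One output slot per candidate author.
    static final List<String> AUTHORS =
            Arrays.asList("Shakespeare", "Marlowe", "Rowley", "Peele");

    // Builds the desired output: 1.0 for the known author, 0.0 elsewhere.
    static double[] oneHot(String author) {
        int index = AUTHORS.indexOf(author);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown author: " + author);
        }
        double[] target = new double[AUTHORS.size()];
        target[index] = 1.0;
        return target;
    }

    public static void main(String[] args) {
        for (String author : AUTHORS) {
            System.out.println(author + " -> " + Arrays.toString(oneHot(author)));
        }
    }
}
```

Keeping the author list in one place means the same order can be reused later to read the network's output back as a name.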

May 17th, 2017

  • Today, I first finished up the neural network training by tweaking the learning rate and momentum values.  I ran the trained neural network on some texts whose authorship I knew but hadn’t used for training, and it correctly identified most of them, which meant that although there is a possibility of error, the stylometric techniques used had a certain level of accuracy.
  • I spent the rest of class designing a GUI for the user.  I found that Eclipse had a wonderful tool called WindowBuilder, which allowed me to use drag-and-drop methods to design the GUI, and the code would be written automatically (though I still had to go back and organize/rewrite the code).  I created JLists for the user to select the texts to train with, and the disputed text to analyze.  Thanks to WindowBuilder, I was able to finish the GUI before class ended.
  • There are definitely more ways the program could be improved, whether by simply getting more texts to train with to increase accuracy, or by increasing the size of each data row (there are types of “writer invariant” besides function words).  However, at this point, the program is functional and thus, given the timeframe, pretty much complete.
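
The check on held-out texts described in the first bullet of this entry can be sketched as follows, assuming the predicted author is simply the one with the largest of the four output values.  The class name AuthorPrediction and both method names are hypothetical, and the numbers in main are made up purely to show the calls; they are not results from the project.

```java
// Hypothetical helper for reading the network's four outputs; not the project's code.
final class AuthorPrediction {

    // Returns the index of the largest output, i.e. the most likely author.
    static int argMax(double[] outputs) {
        if (outputs.length == 0) {
            throw new IllegalArgumentException("outputs must not be empty");
        }
        int best = 0;
        for (int i = 1; i < outputs.length; i++) {
            if (outputs[i] > outputs[best]) {
                best = i;
            }
        }
        return best;
    }

    // Fraction of held-out texts whose predicted author matches the known one.
    static double accuracy(double[][] outputsPerText, int[] knownAuthorIndex) {
        if (outputsPerText.length != knownAuthorIndex.length) {
            throw new IllegalArgumentException("one known author is required per text");
        }
        if (outputsPerText.length == 0) {
            return 0.0;
        }
        int correct = 0;
        for (int i = 0; i < outputsPerText.length; i++) {
            if (argMax(outputsPerText[i]) == knownAuthorIndex[i]) {
                correct++;
            }
        }
        return (double) correct / outputsPerText.length;
    }

    public static void main(String[] args) {
        // Made-up outputs for two imaginary held-out texts (illustration only).
        double[][] outputs = {
            {0.8, 0.1, 0.05, 0.05},
            {0.2, 0.6, 0.1, 0.1}
        };
        int[] known = {0, 1};
        System.out.println("Accuracy: " + accuracy(outputs, known));
    }
}
```

Tracking this fraction while adding more training texts or more features per row would give a concrete way to tell whether those improvements actually help.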