Task to Complete

Develop a program to analyze the letter frequencies of a given passage, compare them with the letter frequencies of different languages, and make the best guess about the language the passage is written in.  This program must:

  • be written using Python
  • use strings
  • read files (a .txt file for the passage and a .csv file for the frequencies)
  • use dictionaries

January 3rd, 2017

  • This is the first time I am using Python to write a program, so I spent this day familiarizing myself with the syntax.
  • I learned the basic syntax on codecademy.com, including dynamic variables, methods, conditionals, loops, lists, and dictionaries.
  • With the remainder of the time, I was able to read up on letter frequency analysis and create a .csv file with the letter frequencies of some common languages from Wikipedia (https://en.wikipedia.org/wiki/Letter_frequency).
  • One thing to note: because this is my first time trying out Python, the program output doesn’t look very nice.  I didn’t have enough time to figure out graphics display, so it is just simple printing to the console.

January 5th, 2017

  • Today, I officially began work on the program.
  • I created a method load_text that would take in a file name and read the text file.
  • I also created a method load_frequencies that would take in another file name and read in the letter frequencies.  It would return a dictionary of dictionaries: in the outer dictionary, each key is a language, and in the sub-dictionary linked with each language, each key is a letter linked with its respective frequency.
  • To make reading the frequencies from a .csv file easier, I used the transpose method in Excel to change the vertical and horizontal axes of the file.  This way, every horizontal line I read will be a language, rather than a letter, which fits better with the way I am storing the data.
  • I also created a method count_letters that would count the number of occurrences of each letter in the text read by the load_text method, storing the information in a dictionary.  Then, it would divide each count by the total number of letters to calculate the actual frequency in the text, and return the result.
  • I created a method guess_language that would be responsible for taking the letter frequencies of the text and comparing them to those of each language, but I did not have time to implement this method.
  • Reflection on the first day of using Python:
    • I found Python to be very convenient and flexible to work with.  It is also very easy to use, given the number of pre-written methods that can be called.
    • However, there are still some syntax points that I need to get used to, having spent most of my previous time working with Java (e.g. the lack of braces and semicolons, and the use of colons).
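
The loading and counting steps described above can be sketched roughly as follows. This is only an illustration, not the code actually written for the project. It assumes Python 3, UTF-8 files, and a transposed .csv whose first row is a header of letters and whose first column holds the language name, with plain numeric frequencies. The helper parse_frequency_rows and the sample language names are hypothetical.

```python
import csv
import io
from collections import Counter


def load_text(file_name):
    # Assumption: the passage is saved as UTF-8.
    with open(file_name, encoding="utf-8") as f:
        return f.read()


def parse_frequency_rows(lines):
    # Hypothetical helper: turns csv lines into {language: {letter: frequency}}.
    reader = csv.reader(lines)
    letters = next(reader)[1:]  # header row, skipping the "language" column
    frequencies = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        language, values = row[0], row[1:]
        frequencies[language] = {
            letter: float(value) for letter, value in zip(letters, values)
        }
    return frequencies


def load_frequencies(file_name):
    # Each row of the (transposed) file describes one language.
    with open(file_name, encoding="utf-8", newline="") as f:
        return parse_frequency_rows(f)


def count_letters(text):
    # Count alphabetic characters only, ignoring case, then normalize
    # the counts so that the frequencies sum to 1.
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {letter: count / total for letter, count in counts.items()}


# In-memory demo with made-up numbers (not real language data).
sample_csv = io.StringIO("Language,a,b\nLangA,0.6,0.4\nLangB,0.7,0.3\n")
sample_frequencies = parse_frequency_rows(sample_csv)
sample_text_frequencies = count_letters("Ab, a!")
```

Note that count_letters returns fractions between 0 and 1, so the values in the .csv file would need to be on the same scale (fractions rather than percentages) for a later comparison to be meaningful.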

January 9th, 2017

  • I finished the guess_language method.
  • This method first finds the difference in letter frequency between the passage and each language.
    • It creates a new dictionary to store the frequency differences.
    • It then loops through each language, comparing the frequencies of each letter from the language in general to that from the input text, by finding the difference.
    • Next, it sums up the absolute values of all these differences for the letters in a language, and uses the sum as the overall frequency difference for that language.  (This is the Manhattan distance approach.)
  • This method then finds the language with the smallest difference by looping through the dictionary and replacing a “minimum” variable with the current difference whenever that difference is smaller than “minimum”.  It returns the language corresponding to the final “minimum” value.
  • However, at this point, I found that many of the languages I had in my list had non-ASCII letters that couldn’t be directly read or analyzed by the program.  I did a lot of research into this, and found out how to deal with the problem.
    • First, using a convoluted process, I had to convert the .csv file into a .txt file, save it as a new file using UTF-8 encoding, replace all the tabs with commas, and save it again back to a .csv file.
    • Next, instead of using the regular methods to read the .csv and .txt files, I had to import the codecs library to help read UTF-8 encoded files.
  • I had a bit of time left over, so I also coded different ways (other than the Manhattan Distance approach) of finding the differences in language frequencies, including using Euclidean Distances and Cosine Similarities.
    • Both the Euclidean Distance and the Manhattan Distance worked very well.  However, in some cases, the Cosine Similarity approach guessed incorrectly.
    • I didn’t have enough time to look into why this is the case: whether it was a problem with the code, or whether the method itself isn’t ideal.  This is an area where further improvement could be made.
  • Other than that, this program is pretty much complete.
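
The three comparison approaches mentioned above might look something like the sketch below. It is a minimal illustration under assumptions, not the project's actual code: frequencies are held in plain dictionaries, letters missing from one side count as 0, and the function and parameter names (other than guess_language) are made up for this example. The demo numbers are also invented.

```python
import math


def manhattan_distance(text_freq, lang_freq):
    # Sum of absolute differences over every letter seen on either side.
    letters = set(text_freq) | set(lang_freq)
    return sum(abs(text_freq.get(c, 0.0) - lang_freq.get(c, 0.0)) for c in letters)


def euclidean_distance(text_freq, lang_freq):
    # Square root of the sum of squared differences.
    letters = set(text_freq) | set(lang_freq)
    return math.sqrt(
        sum((text_freq.get(c, 0.0) - lang_freq.get(c, 0.0)) ** 2 for c in letters)
    )


def cosine_distance(text_freq, lang_freq):
    # Cosine similarity grows as two profiles become MORE alike, so it is
    # converted to a distance (1 - similarity) to keep "smaller is better".
    dot = sum(value * lang_freq.get(c, 0.0) for c, value in text_freq.items())
    norm_text = math.sqrt(sum(v * v for v in text_freq.values()))
    norm_lang = math.sqrt(sum(v * v for v in lang_freq.values()))
    if norm_text == 0 or norm_lang == 0:
        return 1.0  # no overlap can be measured for an empty profile
    return 1.0 - dot / (norm_text * norm_lang)


def guess_language(text_freq, frequencies, distance=manhattan_distance):
    # Store one overall difference per language, then pick the smallest.
    differences = {
        language: distance(text_freq, lang_freq)
        for language, lang_freq in frequencies.items()
    }
    return min(differences, key=differences.get)


# Demo with invented frequencies for two hypothetical languages.
demo_frequencies = {"LangX": {"a": 0.5, "b": 0.5}, "LangY": {"a": 0.9, "b": 0.1}}
demo_text = {"a": 0.8, "b": 0.2}
demo_guess = guess_language(demo_text, demo_frequencies, distance=cosine_distance)
```

One design point worth checking when the cosine approach guesses incorrectly: unlike the two distances, raw cosine similarity should be maximized rather than minimized, which is why the sketch converts it to a distance before reusing the same "smallest value wins" logic.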