Task to Complete
Develop a program to analyze the letter frequencies of a given passage, compare them with the typical letter frequencies of several languages, and make the best guess about which language the passage is written in. This program must:
- be written using Python
- use strings
- read files (a .txt file for the passage and a .csv file for the frequencies)
- use dictionaries
January 3rd, 2017
- This is the first time I am using Python to write a program, so I spent this day familiarizing myself with the syntax.
- I learned the basic syntax on codecademy.com, including dynamic variables, methods, conditionals, loops, lists, and dictionaries.
- With the remainder of the time, I was able to read up on letter frequency analysis and create a .csv file with the letter frequencies of some common languages from Wikipedia (https://en.wikipedia.org/wiki/Letter_frequency).
- *One thing to note is that, because this is the first time I am trying out Python, the program output doesn’t look very nice. I didn’t have enough time to figure out graphics display, so it is just simple printing to the console.
January 5th, 2017
- Today, I officially began work on the program.
- I created a method load_text that would take in a file name and read the text file.
- I also created a method load_frequencies that would take in another file name and read in the letter frequencies. It would return a dictionary of dictionaries: in the outer dictionary, each key is a language, and in each language's sub-dictionary, each key is a letter mapped to its frequency in that language.
- To make reading the frequencies from a .csv file easier, I used the transpose method in Excel to change the vertical and horizontal axes of the file. This way, every horizontal line I read will be a language, rather than a letter, which fits better with the way I am storing the data.
- I also created a method count_letters that would count the number of occurrences of each letter in the text read by the load_text method, storing the counts in a dictionary. Then, it would divide each count by the total number of letters to calculate the actual frequency in the text, and return the result.
- I created a method guess_language that would be responsible for taking the letter frequencies of the text and comparing them to those of each language, but I did not have time to implement this method.
- Reflection on the first day of using Python:
- I found Python to be very convenient and flexible to work with. It is also very easy to use, given the number of pre-written methods that can be called.
- However, there are still some syntax points that I need to get used to, having spent most of my previous time working with Java (e.g. the lack of braces and semicolons, and the use of colons).
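For reference, here is a minimal sketch of what load_frequencies and count_letters, as described in this entry, might look like. It is not the code from this log. It assumes Python 3, a transposed .csv with a header row ("language,a,b,c,...") followed by one row per language, and that anything Python's str.isalpha accepts counts as a letter. Those details are my assumptions.

```python
import csv
from collections import Counter


def load_frequencies(file_name):
    """Read a transposed frequency table: one language per row.

    Assumed layout (not confirmed by the log): a header row
    "language,a,b,c,..." followed by rows like "English,8.167,1.492,...".
    Returns {language: {letter: frequency}}.
    """
    frequencies = {}
    # encoding="utf-8" is an assumption so that accented letters load correctly
    with open(file_name, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        letters = next(reader)[1:]  # header row minus the language column
        for row in reader:
            if not row:
                continue  # skip blank lines
            language, values = row[0], row[1:]
            frequencies[language] = {
                letter: float(value) for letter, value in zip(letters, values)
            }
    return frequencies


def count_letters(text):
    """Return each letter's share of all letters in the text (0.0 to 1.0)."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return {}  # no letters at all, so there is nothing to divide
    return {letter: n / total for letter, n in counts.items()}
```

Note that count_letters returns fractions here. If the .csv stores percentages (as the Wikipedia table does), one side would need to be rescaled before the two are compared.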
January 9th, 2017
- I finished the guess_language method.
- This method first finds the difference in letter frequency between the passage and each language.
- It creates a new dictionary to store the frequency differences.
- It then loops through each language and, for each letter, finds the difference between that letter's frequency in the language and its frequency in the input text.
- Next, it sums up the absolute values of all these differences for the letters in a language, and uses the sum as the overall frequency difference for that language. (This is the Manhattan Distance approach.)
- This method then finds the language with the smallest difference by looping through the dictionary and replacing a “minimum” variable whenever the next difference is smaller than the current “minimum”. It returns the language corresponding to the final “minimum” value.
- However, at this point, I found that many of the languages I had in my list had non-ASCII letters that couldn’t be directly read or analyzed by the program. I did a lot of research into this, and found out how to deal with the problem.
- First, using a convoluted process, I had to convert the .csv file into a .txt file, save it as a new file using UTF-8 encoding, replace all the tabs with commas, and save it again as a .csv file.
- Next, instead of using the regular methods to read the .csv and .txt files, I had to import the codecs library to help read UTF-8 encoded files.
- I had a bit of time left over, so I also coded different ways (other than the Manhattan Distance approach) of finding the differences in language frequencies, including using Euclidean Distances and Cosine Similarities.
- Both the Euclidean Distance and the Manhattan Distance worked very well. However, in some cases, the Cosine Similarity approach guessed incorrectly.
- I didn’t have enough time to look into why this is the case: whether it was a problem with my code, or whether the method itself isn’t ideal. This is an area where further improvement could be made.
- Other than that, this program is pretty much complete.
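For reference, here is a minimal sketch of the comparison step described in this entry, with the three measures written as interchangeable functions. It is not the code from this log. The function names other than guess_language, the choice to compare over the union of letters from both dictionaries, and the toy frequency tables (made-up values for two placeholder languages) are all my assumptions.

```python
import math


def manhattan(p, q, letters):
    """Sum of absolute frequency differences."""
    return sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in letters)


def euclidean(p, q, letters):
    """Square root of the sum of squared frequency differences."""
    return math.sqrt(sum((p.get(c, 0.0) - q.get(c, 0.0)) ** 2 for c in letters))


def cosine_distance(p, q, letters):
    """1 - cosine similarity, so that smaller still means 'closer'."""
    dot = sum(p.get(c, 0.0) * q.get(c, 0.0) for c in letters)
    norm_p = math.sqrt(sum(p.get(c, 0.0) ** 2 for c in letters))
    norm_q = math.sqrt(sum(q.get(c, 0.0) ** 2 for c in letters))
    if norm_p == 0 or norm_q == 0:
        return 1.0  # treat an empty vector as maximally dissimilar
    return 1.0 - dot / (norm_p * norm_q)


def guess_language(text_freq, language_freqs, distance=manhattan):
    """Return the language whose frequencies are closest to the text's."""
    best_language, minimum = None, float("inf")
    for language, freq in language_freqs.items():
        # Letters missing from either side count as frequency 0.0
        letters = set(text_freq) | set(freq)
        d = distance(text_freq, freq, letters)
        if d < minimum:
            best_language, minimum = language, d
    return best_language


# Toy values for illustration only; these are not real language statistics.
toy_tables = {
    "LangA": {"a": 0.5, "b": 0.3, "c": 0.2},
    "LangB": {"a": 0.1, "b": 0.2, "c": 0.7},
}
sample = {"a": 0.45, "b": 0.35, "c": 0.2}

for metric in (manhattan, euclidean, cosine_distance):
    print(metric.__name__, guess_language(sample, toy_tables, metric))
```

One thing that may be worth checking, though I can't say whether it applies to my program: cosine similarity is larger for closer matches, so a loop that keeps the smallest value needs either to pick the maximum instead or to use 1 - similarity, as the sketch does.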