CSC 161 Program #4

"Etaoin Shrdlu"

That's pronounced "eh-tay-oh-in shird-loo". ETAOINSHRDLU... these are the 12 most commonly used letters in the English language. 'E' is used most, 'T' second, and so on.

In Program #4, we'll explore a simple brand of encryption (letter substitution) and decryption (letter frequency analysis). If that weren't bad enough, this is your group project.

 


Logistics

This program is worth 12 points, or 12% of your final grade. The pertinent time coordinates:

  • Assigned: Thurs May 5, 2005
  • Program due: Thurs May 19, 2005
  • Decryption due: Tues May 24, 2005

Some of the fun, new C++ stuff we'll cover includes:

  • Programming in a team... the agony, the ecstasy
  • Exceptions
  • Using your TextMenu class from Program #3

  



Description

What does "fokpz... zpv, cjmm" mean? First some terms...

Encryption or encoding involves taking some text and converting it into some code, so that the original text is "difficult" to understand or interpret.

Decryption or decoding is the opposite process of decryption. It takes an encrypted text and, by some means, deduces the original message.

Character substitution is a form of encryption that encodes text by uniformly substituting one letter for another. For example, I could encrypt "enjoy... yow, bill" into "fokpz... zpv, cjmm" by shifting each character one place... B=>C, i=>j, l=>m, and so on. This subset of character substitution is called shifting or a Caesar code. We will use more general character substitution, however, not shifting.

Letter frequency decryption is a method of decoding text encrypted by character substitution. The frequency of letters in the English language has been studied extensively. An easy example: the letter 'e' occurs in English text far more often than 'z'. A table of letter frequencies can be used to decode text encrypted using character substitution. In short, the frequencies of letters in an encrypted message (even with the letters jumbled) should approximate those in regular English text. We will research this topic more in class.

Here's a nice wikipedia link on letter frequency decryption:

http://en.wikipedia.org/wiki/Letter_frequencies

and another on using frequencies for decryption:

http://en.wikipedia.org/wiki/Frequency_analysis

  


Three jobs

We will work in teams of three. Oddly, there are three jobs to be done.

  1. The project leader will have many hats: coordinate the team communication, research and design efforts, build main() using the software of other team members, and report progress back to the (completely out of touch technically) VP in charge... me. I'll need those TPS reports every Monday morning.
  2. The CharMap guy will design, code and test the CharMap class.
  3. The FreqList guy will design, code and test the FreqList class.

FYI: I dislike these class names... old age, writer's block, whatever. Please feel free to do better.

All team members will participate in researching decryption techniques, gathering text sources to encrypt/decrypt, designing these CharMap and FreqList classes, and designing the program in general. Classes and code may be designed as a group, but please keep the actually coding boundaries intact... CharMap guy does the CharMap class, etc.

ITS is setting up shared k: drive folders for each team. I will provide class/lab time for team meetings, but project leaders should immediately harvest team emails and such to further communication inside each group.

The CharMap Guy/Class

The CharMap class is used to do character substitution. This includes:

  • map a char... 'y' => 'g', for example
  • create a random mapping of each char
  • encode a char, using a mapping
  • decode a char, reversing the mapping
  • encrypt a file by encoding each char in a file
  • decrypt a file by decoding each char in a file
  • read/write a CharMap object from/to a file

My CharMap files look like this:

%CharMap%
<letter1> <substitute_letter2>
<letter2> <substitute_letter2>
...

I use the %CharMap% as a magic number at the top of the file to identify it as a CharMap file. It's an old Unix thing. Also, here's an example of a nice CharMap file: start_map.txt

The FreqList Guy/Class

The FreqList class is used to track the frequency of letters in text, including:

  • get/set the count and/or frequency of a char
  • increment the count of a char
  • count all the occurrences of a char in a file
  • sort the FreqList data by char count
  • get a char from this sorted order... the first char being the most common, the second being second most common and so on
  • read/write FreqList files from/to a file

FreqList files look like this:

%FreqList%
<letter1> <count> <frequency>
<letter2> <count> <frequency>
...

Here's an example of a FreqList file: start_freqs.txt

The Project Leader: Encryption/Decryption

The job of the project leader will be to use the CharMap and FreqList classes to implement actual encryption and decryption algorithms. Please see Figure 1 attached which describes the encryption and decryption processes.

The encryption process:

  1. Read in a text file
  2. Either read a CharMap file or generate a new, random one
  3. Use the CharMap to encode each char in the input file
  4. Write these new encoded chars to an output encrypted file
  5. You probably want to save the CharMap as well, if you generated it.

The decryption process:

  1. Read in an encrypted file
  2. Create a new FreqList and count the occurrences of chars in the encrypted file
  3. Read a FreqList file (or two) of representative character frequencies for your decryption... more on this in a second.
  4. Create a CharMap using the two FreqLists: the one for the encrypted file, and the one with the representative sample.
  5. Apply the CharMap to decode each letter in the encrypted file
  6. Write these decoded chars to an output file. Again, you'll probably want to save that CharMap and maybe the FreqList too.

Decryption will probably not be a 100% automatic process. Your character frequencies, more often than not, will not exactly match whatever CharMap standard you are using.

  


Testing and the final stage

At the end of the programming phase, you must provide:

  • Test cases exercising your CharMap class
  • Test cases exercising your FreqList class
  • At least 3 simple decryption examples where you use the exact FreqList created for the file you encrypt. This should help show that your decryption/encryption cycle is working.
  • At least 3 "real" decryption examples... where you use a FreqList file for the English language or some such thing.
  • I'll provide a couple of files to decrypt.

You'll notice that there are 5 days scheduled for "decryption" after the actual program is due. I will provide a couple of "difficult" decryption examples, and you can work on them as a team. Recall, this letter frequency stuff is not completely automatic, and you may have to manually massage or, even better, add some capabilities to your program to decrypt these last examples. We will discuss this more in class.

  

Design Requirements

You should:

  • Work in teams of 3: a project leader, a CharMap guy and a FreqList guy as described above.
  • Use exceptions at least once in both the CharMap and FreqList classes. This is a requirement.
  • Use your TextMenu classes for your menu(s)

  


Grading

Short program this time:
  • Quality, 50%
  • Function, 50%
Each of these areas is discussed in the handout: CSC 161 Program Design, Style and Testing. Please leave a README file in your folder describing your testing and the state of your program... problems, questions, etc.

thanks... yow, bill

 


Links

Some links that I enjoyed:

Just FYI, some of the google searches that I did included:

  • English "most common letters"
  • English "most common words"
  • "Etaoin Shrdlu"

  

Notes

May 5, 2005

You may feel the need to rewind an input file, or re-read it from the beginning. Check out page 850 of our text and the seekg() member function of fstream. Note: the example on page 850 should read "dataIn.seekg...", not "dataIn.seek". Get it?

Also, did you guys cover sorting already in 160? If not, let me know. These are very small sorts (26 letter), so old, slow, simple bubble sort is fine. It's on page 583.


May 10, 2005

A blurb that I wrote on the research phase of this project: prog04_freq_research.html

yow, bill


May 12, 2005

I have added a function to the k: drive...

common_area/prog04_freqs/comment.cpp

If you install this function in your CharMap and/or FreqList classes, then you will be able to read files that have comment lines beginning with a '#'. I have comments in some of my example FreqList files.

thanks... yow, bill


May 17, 2005

We'll talk about all this at lecture tonight:

  1. I will lecture tonight about "scary characters" and how to handle them in Program #4. Don't miss it... it's scary!
  2. Also, the 4 killer codes are now available on the k: drive common_area. They are the file in folders go1, go2, go3, and go4... file code.txt. Your goal is to decrypt these 4 examples by the deadline. Good luck.
  3. Each team must find and contribute 3 text examples of size >=10K bytes (or chars) for research purposes. I recommend chapters of fiction from the literature or Guttenberg sites listed above. Place these in your team folder by Thursday, and I'll move them to the common_area for all to use.
  4. I want each team to implement 2-3 ideas/schemes for breaking these codes. Let me know soon what research ideas you want to explore. Please see the research page above for some ideas. Here's a couple more:
    • I'm liking the idea of counting digraphs and trigraphs as a way to help break the codes.
    • Or, you can find frequently-occurring words. For example, if you can decode incredibly common 3-letter words "the", "and",  and "for", then that gives you a nice 9 letter base to start from.
    • You can automate the process above by using some kind of word scoring.
    • Create a FreqList for any of the "go" examples or any of your own. Perhaps you can start to detect patterns in the frequencies. For example, have you found any text that doesn't have 'e' as the most frequent letter? There are more of these, I believe.
    • For a large example (like the "go" examples), it seems unlikely that a character will be too far from its expected frequency. However, if 3-4 characters all have a similar probability, then decoding these 3-4 chars may work by trial and error.

good luck... yow, bill