# Learning R through a mini game part 2

Late last year I discovered proton, an educational game in R about processing data frames, via R-bloggers and had a go at it. I thought it was fun and educational; it was also the first time I tried to use the dplyr package. I recently learned that the developer of proton has produced two more games. This post is on the frequon game.

The frequon game (along with proton and regression) is part of the BetaBit package.

install.packages('BetaBit')
library(BetaBit)

_____     _                    _    _____ _ _      _____
| __  |___| |_ ___    ___ ___ _| |  | __  |_| |_   |   __|___ _____ ___ ___
| __ -| -_|  _| .'|  | .'|   | . |  | __ -| |  _|  |  |  | .'|     | -_|_ -|
|_____|___|_| |__,|  |__,|_|_|___|  |_____|_|_|    |_____|__,|_|_|_|___|___|

Choose your game. Just type the name of the selected game in the console.
It's a function so do not forget about parentheses!

1. proton()
2. frequon()
3. regression()


To start the frequon game, just type frequon(). If you want to have a go at the game yourself, stop reading now; below are my solutions.

# start the game
frequon()
_____ _          _____                            _____
|_   _| |_ ___   |   __|___ ___ ___ _ _ ___ ___   |   __|___ _____ ___
| | |   | -_|  |   __|  _| -_| . | | | . |   |  |  |  | .'|     | -_|
|_| |_|_|___|  |__|  |_| |___|_  |___|___|_|_|  |_____|__,|_|_|_|___|
|_|
You've Got Mail

From: 154eb7278fc44650bdd2bb39bb2b5c69@mail.tor
To: c81632dce28ca740f2f2503656f3d62a@mail.tor
Subject: Interested?

Hi,
We are looking for a smart guy with extraordinary hacking skills.
Our mutual friend assured us that you are our man.

TL;DR: We are observing a group of terrorists that are planning something.
We have intercepted some data, but do not know how to read it (attached).

There is a password somewhere. We have to find it to stop terrorists.

It's not clear how to start. Our informer told us that the key is somehow related with three key phrases: guns, and, roses.
Probably these are the names of some messages / datasets.
Would you like to check if you have access to any of them?

If you want to help, please type:
frequon(subject = "Re: Interested?", content = "Text of the message that you have access to")
so as we could read the message too.

Remember: any time you want, you can get some piece of advice, just
type: frequon(hint=TRUE).

frequon(subject = "Re: Interested?", content = "Text of the message that you have access to")
We were looking so long for somebody like you who could help us solve this mystery.
If you dont have access to the messages, than we will have to ask someone else :(


To check which data sets are part of the BetaBit package, use data(). Calling data(package = "BetaBit") restricts the listing to this package.

data()
Data sets in package ‘BetaBit’:

EnglishLetterFrequency              The vector of letter frequencies in English.
FSW (daneEdu)                       The data from the study of Polish upper-secondary schools students.
bash_history                        The history of recently executed commands.
dataFSW (daneEdu)                   The data from the study of Polish upper-secondary schools students.
employees (proton_data)             The database with employees of Faculty of Electronics and Information Technology of Warsaw University of Technology.
logs                                The history of logs into the Proton server
lyo (messages)                      The three messages to be decoded.
pcs (messages)                      The three messages to be decoded.
pistoale (messages)                 The three messages to be decoded.
roses (messages)                    The three messages to be decoded.
top100commonWords                   The vector of 100 most common words in English.
varLabels (daneEdu)                 The data frame containng labels of the variables from 'dataDNiP' and 'DNiP' datasets.
wikiquotes                          List with quotes in 18 languages.

frequon(subject = "Re: Interested?", content = roses)

You've Got Mail

From: 154eb7278fc44650bdd2bb39bb2b5c69@mail.tor
To: c81632dce28ca740f2f2503656f3d62a@mail.tor
Subject: Frequencies

We are so glad you want to help us!

Thank you for the message, it looks interesting...
However, this text is too long to be the password that we are seeking for.
This must be some coded message. If only we could know the key...

But lets take a look.
The p letter appears very often.
And i doesnt. In English language letter e occurs most often.
If we knew how often each letter is used in the message, we could compare them with well-known English letter frequencies! So lets do this!

Take the message that you have found, remove everything that is not a letter and calculate frequencies of letters.
The result should be a named vector with names corresponding to letters and values corresponding to number of occurrences.

Send us a reply: frequon(subject = "Re: Frequencies", content = freq) as soon as you finish. freq is the vector of frequencies for each letter.
Please, remember to name this vector with appropriate letters!


First, we can use the gsub() function to strip non-word characters (\W+). Note that \W still keeps digits and underscores, so the factor levels in the table() call below are what finally restrict the tally to the 26 letters. The tolower() function converts all letters to lower case. We then use strsplit() to split the roses string into single characters and table() to tally them. The answer needs to be a named vector, so we convert the table with as.vector() and name the result with names().
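As a quick illustration of what each step produces, here is a minimal sketch on a short made-up string. The string `msg` and the name `freq_toy` are assumptions for the example and are not part of the game; the sketch also swaps in the pattern `[^a-z]` after lower-casing, which drops digits and punctuation in one go.

```r
# Hypothetical input; the real game data is the roses string
msg <- "Abba, 2 cab!"

# Lower-case first, then drop everything that is not a letter a-z
clean <- gsub("[^a-z]", "", tolower(msg))

# Split into single characters and count them; fixing the factor levels
# makes sure letters that never occur still show up with a count of 0
chars  <- strsplit(clean, "")[[1]]
counts <- table(factor(chars, levels = letters))

# Convert the table into a plain named vector
freq_toy <- setNames(as.vector(counts), names(counts))
freq_toy[c("a", "b", "c", "z")]
```

The same steps applied to the roses data set: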

roses_letter <- gsub(pattern = "\\W+", replacement ='', x=roses, perl=TRUE)
roses_letter <- tolower(roses_letter)
roses_table  <- table(factor(strsplit(roses_letter, '')[[1]], levels=letters[1:26]))
freq <- as.vector(roses_table)
names(freq) <- names(roses_table)
frequon(subject = "Re: Frequencies", content = freq)

You've Got Mail

From: 154eb7278fc44650bdd2bb39bb2b5c69@mail.tor
To: c81632dce28ca740f2f2503656f3d62a@mail.tor
Subject: Transcription

Great job! Now, when we do know which letters are used the most often and which are the least common, we can combine them with well-known English frequencies of letters.
We have prepared and attached the EnglishLetterFrequency dataset.
It contains frequencies of letters in the English language.

Now you can substitute old ciphered letters with the new English letters according to their frequencies in the attached corpus.
Such operation is called character translation / transliteration.
Can you pass the transliterated message to us?

Send us reply: frequon(subject = "Re: Transcription", content = "text_you_will_get").
We wish you luck!

Best regards!


Use the chartr() function to perform the transliteration; I had to look this function up because R has no tr() function, which is what I use in Unix. If you examine the EnglishLetterFrequency vector, you will notice that it is sorted by frequency; therefore, we need to sort our freq vector as well before we use the chartr() function, so that letters of matching rank line up. (I had to look up the hints for this part: my solutions weren't working because it wasn't clear to me that the roses data set was supposed to be converted to lower case.)
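To make the rank-matching idea concrete, here is a small sketch on invented data. The ciphertext `cipher` and the three-letter reference vector `ref_freq` are hypothetical stand-ins (the real reference is EnglishLetterFrequency), and both are sorted in ascending order here purely for the illustration.

```r
# Hypothetical ciphertext in which x, y and z occur 1, 2 and 3 times
cipher <- "xyyzzz"
cipher_freq <- table(strsplit(cipher, "")[[1]])

# Hypothetical reference frequencies for a three-letter alphabet
ref_freq <- c(a = 1, b = 2, c = 3)

# Sort both by frequency so that the i-th rarest cipher letter
# is paired with the i-th rarest reference letter
old <- paste(names(sort(cipher_freq)), collapse = "")
new <- paste(names(sort(ref_freq)), collapse = "")

# chartr(old, new, x) replaces each character of old with the
# character at the same position in new
decoded <- chartr(old, new, cipher)
decoded
```

With real text the ranks rarely line up perfectly, so a frequency-based key is only a first guess. The version I sent to the game: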

# chartr(old, new, x)
my_old <- paste(names(sort(freq)), collapse = '')
my_new <- paste(names(EnglishLetterFrequency), collapse = '')
text_you_will_get <- chartr(my_old, my_new, tolower(roses))

frequon(subject = "Re: Transcription", content = text_you_will_get)

You've Got Mail

From: 154eb7278fc44650bdd2bb39bb2b5c69@mail.tor
To: c81632dce28ca740f2f2503656f3d62a@mail.tor
Subject: Key

Well done! We are so close now! Our message looks a little bit familiar.
There are even some words that we can recognize.
But there are still some words that are looking strange.
It means that our key is not completely correct and we need to correct these letters that were mistranslated.

Let's use the word frequency to correct the transliteration.

Let's count all the words.
Those, which appear the most often, are for us easy to amend.
The ones which appear the least often are perhaps easily recognisable nouns?

Take advantage of top100commonWords.
Find the right transliteration in order to decode it into the proper English.
Work with lowercased text.

Type:
frequon( subject="Re: Key", content=c(old = "abcdef....z", new = "newlettersrespectively")),
where old are the letters in the message roses while new are fitted real letters.

Good luck!


For this part I worked out the rest of the key by hand, looking at words that were obviously misspelt by one letter.
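A less manual starting point, which I did not use, would be to list the frequent decoded words that do not appear among the common English words. The sketch below assumes a made-up partly decoded sentence (`decoded_toy`) and a tiny stand-in for top100commonWords (`common_toy`).

```r
# Hypothetical, partly decoded text with one mistranslated word ("tha")
decoded_toy <- "the cat and tha dog and the bird"

# Tiny stand-in for the top100commonWords vector
common_toy <- c("the", "and", "of")

# Tally the words, most frequent first
words  <- strsplit(decoded_toy, " ", fixed = TRUE)[[1]]
counts <- sort(table(words), decreasing = TRUE)

# Words that are not recognised as common are candidates for inspection
suspects <- counts[!names(counts) %in% common_toy]
suspects
```

A suspect such as "tha" sitting one letter away from a common word points at the pair of letters to fix in the key. What I actually ran: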

# I tallied up commonly used words but didn't really use it
sort(table(strsplit(text_you_will_get, split = ' ')))

my_manual_old <- 'kubivgmjnrahxfqowtcdylzsep'
my_manual_new <- 'zqqxkvbycgpwmufdlirsnaohte'
frequon( subject="Re: Key", content=c(old = my_manual_old, new = my_manual_new))

You've Got Mail

From: 154eb7278fc44650bdd2bb39bb2b5c69@mail.tor
To: c81632dce28ca740f2f2503656f3d62a@mail.tor
Subject: Next text

Excellent work, you have cracked the code!

However, there is no password in here.
There must be some clue in this message...

Perhaps our friend used the key to cipher the names of the two remaining messages?

Transliterate these names: 'guns', 'and', 'roses' and check if there are datasets with these new names.

frequon(subject = "Re: Next text",
content = "Content of the unlocked message").

We would be grateful.


We can use the chartr() function on "guns and roses" and check the data sets again to look for a data set that matches the translated string.

chartr(my_manual_old, my_manual_new, 'guns and roses')
[1] "vqch pcs gdhth"
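Instead of scanning the data() listing by eye, the check can be scripted. This is a sketch under assumptions: the helper `dataset_exists()` is my own name, and the example runs against the `datasets` package because it ships with R; for the game one would pass `package = "BetaBit"`. Entries such as "pcs (messages)" carry a suffix in parentheses, so the helper strips it before matching.

```r
# Hypothetical helper: does a data set with this name exist in a package?
dataset_exists <- function(candidates, package) {
  # data() returns an object whose results matrix has an "Item" column
  items <- data(package = package)$results[, "Item"]
  # Drop a trailing " (something)" so that "beaver1 (beavers)" becomes "beaver1"
  items <- sub(" \\(.*\\)$", "", items)
  setNames(candidates %in% items, candidates)
}

# Demonstration on the datasets package that ships with R
found <- dataset_exists(c("iris", "beaver1", "vqch"), package = "datasets")
found
```

Applied to the three translated words with package = "BetaBit", this should flag only the one that names a data set.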

# there is a pcs data set
frequon(subject = "Re: Next text", content = pcs)

You've Got Mail

From: 154eb7278fc44650bdd2bb39bb2b5c69@mail.tor
To: c81632dce28ca740f2f2503656f3d62a@mail.tor
Subject: Lengths in the text

This message is written in a language that we cannot dont know.
Can you recognize the language?

We know a simple idea how to recognize the language of any message.
All you need to do is to measure the length of each word. If we knew how many words had the lengths of 1, 2, 3, and so on, we could compare them with the lengths of the words in languages that we know!

Naturally, we need a huge amount of words in many languages.
Fortunately we have a sample from wikipedia resources, so we can share it with you.
Please, find the wikiquotes attached.

There is a list of quotes in many languages.
We hope this will be enough for our needs.

Note that different languages are using different letters, thus to find words use the space   as a separator.

This is a time-consuming job to measure the length of each word for all of the languages, but we believe that you know some fast way to cope with this problem.

frequon(subject="Re: Lengths in the text", content = lengths, attachment = wiki_lengths).
lengths is the vector of counts of words of given length (named vector with names - lengths and values - counts).
wiki_lengths is the list with vectors of counts for each language.
Please, remember to name this list with appropriate languages!


I wrote a function to tally the word lengths in a text and applied it to the message and to each language.
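Here is the idea on a toy input first. The list `quotes_toy` and the function name `word_lengths()` are made up for the example. The sketch uses lapply(), which always returns a list, whereas sapply() may simplify the result to a matrix when every tally happens to have the same length.

```r
# Hypothetical stand-in for wikiquotes: a named list of character vectors
quotes_toy <- list(
  English = c("to be or not to be", "so it goes"),
  Latin   = "veni vidi vici"
)

# Split on spaces, drop empty strings left by repeated spaces,
# then count how many words have each length
word_lengths <- function(x) {
  words <- unlist(strsplit(x, " ", fixed = TRUE))
  words <- words[nzchar(words)]
  table(nchar(words))
}

# One tally per language, kept as a named list
by_language <- lapply(quotes_toy, word_lengths)
by_language
```

The version I used in the game follows.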

# there are 18 languages
length(wikiquotes)
[1] 18

names(wikiquotes)
[1] "Estonian"   "Swedish"    "Hungarian"  "Romanian"   "Finnish"    "Lithuanian"
[7] "Turkish"    "Croatian"   "French"     "German"     "Indonesian" "Italian"
[13] "Norwegian"  "Polish"     "Portuguese" "Slovenian"  "Spanish"    "Czech"

# function to tally words
tally_word <- function(x){
  x_split <- unlist(strsplit(x, split=' '))
  table(sapply(X = x_split, FUN = nchar))
}

lengths <- tally_word(pcs)
wiki_lengths <- sapply(X = wikiquotes, FUN = tally_word)
frequon(subject="Re: Lengths in the text", content = lengths, attachment = wiki_lengths)

You've Got Mail

From: 154eb7278fc44650bdd2bb39bb2b5c69@mail.tor
To: c81632dce28ca740f2f2503656f3d62a@mail.tor
Subject: Language in the message

Well done!

Now we need to compare frequencies for our message and frequencies for other languages.
Try to plot barplots for each language and then a barplot for out message.
It will be easier to compare these frequencies graphically.

Now we can investigate what language was used to prepare the second message.
What do you think?
Which distribution is the most similar to the distribution of the lengths in our message?

frequon(subject = "Re: Language in and message", content = "Language")
where "Language" is the name of the correct language.


Create bar plots to compare the word-length distribution of the message with that of each language.

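# 19 plots (the message plus 18 languages) fit on a 4 x 5 grid
par(mfrow = c(4, 5))
barplot(lengths, main = "pcs")
# label each panel with its language so the plots can be told apart
for (lang in names(wiki_lengths)) barplot(wiki_lengths[[lang]], main = lang)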
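I compared the plots by eye. A numeric comparison is also possible; the sketch below is one assumed approach (not something the game asks for) that converts counts to proportions and sums the absolute differences. The helper names and the toy vectors are invented for the example.

```r
# Align a named count vector to a common set of keys and convert to proportions
to_props <- function(counts, keys) {
  v <- setNames(numeric(length(keys)), keys)
  v[names(counts)] <- as.numeric(counts)
  v / sum(v)
}

# Sum of absolute differences between two length distributions
length_distance <- function(a, b) {
  keys <- union(names(a), names(b))
  sum(abs(to_props(a, keys) - to_props(b, keys)))
}

# Hypothetical word-length counts for a message and two candidate languages
msg_toy <- c("1" = 2, "2" = 5, "3" = 3)
candidates_toy <- list(
  A = c("1" = 20, "2" = 50, "3" = 30),
  B = c("2" = 1, "3" = 1, "4" = 8)
)

# Smaller distance means a more similar distribution
d <- sapply(candidates_toy, length_distance, a = msg_toy)
best <- names(which.min(d))
best
```

On the real data, the same call with lengths and wiki_lengths would rank the 18 languages by similarity to the message.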

frequon(subject = "Re: Language in and message", content = "Romanian")

You've Got Mail

From: 154eb7278fc44650bdd2bb39bb2b5c69@mail.tor
To: c81632dce28ca740f2f2503656f3d62a@mail.tor

We think so too. Great job! Now we have the key, we know the language...
Perhaps before we start translating this message, we will try to get an access to the phrase 'guns'.
We know that our friend likes to use the key from one puzzle in order to encode the name of the other one. So maybe also this time...
Do you know what is the Romanian counterpart of the word 'guns'?

We are responsible for translating the message. If you find something interesting in the third of our messages, let us know, please!

Type:
frequon(subject = "Re: Password", content = "thePasswordYouWillFind").


I guessed Romanian from the plots.

Finally, the Romanian word for guns is pistoale, which matches the name of one of the data sets listed earlier.

frequon(subject = "Re: Password", content = pistoale)

You've Got Mail

From: 154eb7278fc44650bdd2bb39bb2b5c69@mail.tor
To: c81632dce28ca740f2f2503656f3d62a@mail.tor
Subject: You are the best!

## Summary

I enjoyed this game as well and learned about the chartr() function in the process. The third part will be on the regression game. (As a note, the emails in the game have a .tor address; a couple of months ago I finished reading The Dark Net, which describes the dark net and what goes on in that part of the web. Definitely a fascinating read.)