Our speaker for this week was Eric Peterson, a Research Manager for McAfee. He said he went to school hoping to become a pilot, but things changed and he ended up working in security research.
The first thing we did was a “Phishing Quiz” that was designed by the McAfee team to see how good we were at detecting whether an email was trying to phish us. We had to use clues and our experience to determine this. The quiz is linked at http://phishingquiz.mcafee.com/. I don’t think the link works anymore, but it’s in the video lecture. I think the class got around 80% of the emails correctly marked as spam or not spam.
There were a ton of new terms that he wanted us to know before moving on. Here are a few of the important ones:
- Spam is exactly what you think: garbage email sent by spammers to spread their crap. Ham is the opposite: good email, meaning anything that’s not spam.
- A spam trap is an email address used to collect spammers’ emails and identify who they are.
- Phishing is trying to get people to enter their real information by sending fake webpages/emails.
- Real-time Blackhole List, or RBL, is a list of sources that are known to send spam and should be avoided.
We learned about a few tools used to research spam and other related emails. A few that are available on Linux are DIG, WHOIS, GREP, SED, and AWK. DIG is the domain information groper, which queries DNS and prints information about a specific domain. WHOIS serves a similar purpose, but it gives IP/domain registration information. GREP lets us search for specific strings within files. SED is a stream editor used for modifying text, usually text files. AWK is a programming language used for text processing and extracting data.
He also listed two SQL database systems, PostgreSQL and MySQL. PostgreSQL is an object-relational database, and it is often described as the most advanced open-source database management system available. MySQL is the most popular database management system. I have only used MySQL in my first databases class, so PostgreSQL is new to me. We ended up doing a short demo using PostgreSQL in our VMs.
There is also a tool named Regex Coach, which helps users learn how to create and formulate regular expressions to search and match specific strings. We ended up using this one in the lab. I had made regular expressions in my CS 321 class, Theory of Computation, but I had never had any real-life application for them. It turns out making regular expressions is pretty complicated and needs a lot of careful planning.
The other two tools listed were just websites that hosted information that researchers used. They were trustedsource.org and spamhaus.org.
We first accessed the PostgreSQL server on the Linux VMs. We didn’t do much other than that. The video cut off, and when it came back we were already moving on to the next section.
The next part was to use Regex Coach to create regular expressions that match the strings
“v | a g r a”, “\/iaGra”, and “V|4agra”, but not “Viagra”.
I used an online Regex learning tool that functioned just as well as Regex Coach.
The way Eric showed to do it in class was to work through one letter at a time so you could catch more strange variations with a shorter regular expression. The way I did it was to just catch the individual strings.
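Here is a rough sketch of what the letter-by-letter approach could look like in Python. This is my own reconstruction, not the exact expression from class: the look-alike characters in each position (such as 1, ! and @) and the function name looks_obfuscated are assumptions, and I chose to treat the plain spelling in any letter case as "not obfuscated".

```python
import re

# One small piece per letter, with optional whitespace between letters.
# The look-alike sets below are guesses, not the list used in class.
OBFUSCATED = re.compile(
    r"""
    (?!viagra\b)     # skip the plain spelling (any letter case)
    (?:v|\\/)        # v, or a backslash plus slash drawn to look like a V
    \s*[i|1!]        # i, or a look-alike: pipe, one, exclamation mark
    \s*[a4@]+        # a, or a look-alike: 4 or @ (one or more)
    \s*g
    \s*r
    \s*[a4@]+
    """,
    re.IGNORECASE | re.VERBOSE,
)


def looks_obfuscated(text: str) -> bool:
    """Return True if text contains a disguised spelling of the word."""
    return OBFUSCATED.search(text) is not None


for sample in ["v | a g r a", "\\/iaGra", "V|4agra", "Viagra"]:
    print(sample, looks_obfuscated(sample))
```

The three disguised strings match and the plain "Viagra" does not. Because each letter has its own character set, new variations can often be caught by adding one character to one set instead of writing a whole new pattern.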
In the next lecture we started learning about emails. The first thing we looked at was the email headers and SMTP conversation for a ham email. The main difference between a ham and a spam SMTP conversation is that the spam gets caught and blocked, so the server returns a 500-level reply code, which indicates a failure in sending. 200-level codes mean OK. We also learned to read email headers from bottom to top, because each mail server adds its Received line above the ones already there.
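To make the reply codes and the bottom-to-top reading concrete, here is a small Python sketch using only the standard library. The sample message, the host names, and the function names reply_class and hops_oldest_first are made up for illustration. The 300 and 400 ranges were not covered in my notes; I added them from the SMTP standard.

```python
from email import message_from_string


def reply_class(code: int) -> str:
    """Map an SMTP reply code to its general meaning by first digit."""
    meanings = {
        2: "ok",
        3: "intermediate",       # server is waiting for more input
        4: "temporary failure",  # sender may retry later
        5: "permanent failure",  # e.g. the message was blocked
    }
    return meanings.get(code // 100, "unknown")


# Hypothetical message: the newest Received line is on top.
RAW = """\
Received: from mx.example.net by inbox.example.org; Tue, 1 Jan 2013 10:00:02 +0000
Received: from sender.example.com by mx.example.net; Tue, 1 Jan 2013 10:00:01 +0000
From: alice@example.com
To: bob@example.org
Subject: Monthly newsletter

Hello.
"""


def hops_oldest_first(raw: str) -> list:
    """Return the Received headers in the order the hops happened."""
    msg = message_from_string(raw)
    received = msg.get_all("Received", [])
    return list(reversed(received))  # bottom of the header block first


print(reply_class(250), "/", reply_class(550))
for hop in hops_oldest_first(RAW):
    print(hop)
```

Reversing the Received lines gives the path the message took, starting from the sending machine, which is the same thing we did by eye when reading from the bottom up.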
The next section was learning to identify what spam or ham is based off observations.
The things that stood out to me in the spam email were the php extension in the link, the European name and domain, and the fact that it sounded like clickbait and was not very personal. The class identified that Oprah is used a lot to get her fans to click, that it was HTML based, and that there were no periods. They did make more observations, but these were the main ones.
The ham email stood out to me as clean because there were no links/URLs, it’s not a random subject (it’s a Newsletter), and it was formatted, so you know it would take time to make. The class pointed out that it was “hippie targeted” and there were no greetings/salutations. Then we made a shape that would represent the email’s features. The more similar an email is to that shape, the more likely it will be the same type, spam or ham.
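One way I can picture the "shape" idea in code is as a set of yes/no features that a new email gets compared against. This is only a sketch of my understanding, not what was shown in class: the feature names, the two reference shapes, and the simple "fraction of matching features" score are all assumptions I picked from the observations above.

```python
# Hypothetical reference shapes built from the class observations.
SPAM_SHAPE = {"has_link": 1, "php_in_url": 1, "clickbait_subject": 1, "missing_periods": 1}
HAM_SHAPE = {"has_link": 0, "php_in_url": 0, "clickbait_subject": 0, "missing_periods": 0}


def similarity(email: dict, shape: dict) -> float:
    """Fraction of features on which the email agrees with the shape."""
    matches = sum(1 for name in shape if email.get(name, 0) == shape[name])
    return matches / len(shape)


def classify(email: dict) -> str:
    """Label the email by whichever reference shape it resembles more."""
    spam_score = similarity(email, SPAM_SHAPE)
    ham_score = similarity(email, HAM_SHAPE)
    if spam_score == ham_score:
        return "unsure"
    return "spam" if spam_score > ham_score else "ham"


suspicious = {"has_link": 1, "php_in_url": 1, "clickbait_subject": 1, "missing_periods": 0}
print(similarity(suspicious, SPAM_SHAPE), classify(suspicious))
```

A real classifier would use many more features and weight them, but the idea is the same: the closer an email sits to one shape, the more likely it belongs to that type.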
Our lab this week is extremely similar to the classification we did for the malicious URLs, but this time it uses PostgreSQL. This time, he did not tell us the percentage of spam vs. ham in the 100,000 rows of real-world message metadata.
My method was to group by the subject line and mark any subject that was seen over 1000 times. I viewed the highest counted subjects and found that most of the subjects over 1000 or so were definitely spam related (they were mostly related to increasing stock value). The most common one was a single spam subject that appeared about 61k times. That made up the majority of the 70k or so rows I marked. I also wanted to do similar queries with source IPs and other similar fields, but my SQL skills were lacking.
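The grouping idea looks roughly like the sketch below. To keep it runnable anywhere I used Python's built-in sqlite3 module instead of the PostgreSQL server from the lab, and the table name messages, the columns subject and source_ip, and the sample rows are all made up. The GROUP BY ... HAVING query itself is standard SQL, so I would expect the same shape to work in PostgreSQL.

```python
import sqlite3


def frequent_subjects(conn, threshold):
    """Return (subject, count) pairs seen more than threshold times."""
    query = """
        SELECT subject, COUNT(*) AS seen
        FROM messages
        GROUP BY subject
        HAVING COUNT(*) > ?
        ORDER BY seen DESC
    """
    return conn.execute(query, (threshold,)).fetchall()


# Tiny in-memory stand-in for the lab data (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (subject TEXT, source_ip TEXT)")
rows = (
    [("Stock alert", "198.51.100.7")] * 5
    + [("Team newsletter", "192.0.2.10")] * 2
    + [("Lunch?", "192.0.2.11")]
)
conn.executemany("INSERT INTO messages VALUES (?, ?)", rows)

print(frequent_subjects(conn, 3))  # [('Stock alert', 5)]
```

In the lab the threshold was 1000 instead of 3. Swapping subject for source_ip in the SELECT and GROUP BY clauses would give the per-IP counts I wanted to try next.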