POSTGRADUATE
SCHOOL
MONTEREY,CALIFORNIA
THESIS
AN APPROACH FOR DETECTING MALICIOUS EMAILS USING RUNTIME MONITORING WITH HIDDEN DATA
Kristin R. Sellers
June 2016
Thesis Advisor: Doron Drusinsky
Second Reader: Man-Tak Shing
Approved for public releasedistribution is unlimited
THIS PAGEINTENTIONALLY LEFT BLANK
REPORT DOCUMENTATION PAGE | Form Approved OMB No. 0704-0188 | |||||
Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instruction, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188) Washington DC 20503. | ||||||
1. AGENCY USE ONLY (Leave blank) | 2. REPORT DATE June 2016 | 3. REPORT TYPE AND DATES COVERED Master’s thesis | ||||
4. TITLE AND SUBTITLE AN APPROACH FOR DETECTING MALICIOUS EMAILS USING RUNTIME MONITORING WITH HIDDEN DATA | 5. FUNDING NUMBERS HDTRA 139119 | |||||
6. AUTHOR(S) Kristin R. Sellers | ||||||
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) Naval Postgraduate School Monterey, CA 93943-5000 | 8. PERFORMING ORGANIZATION REPORT NUMBER | |||||
9. SPONSORING /MONITORING AGENCY NAME(S) AND ADDRESS(ES) Defense Threat Reduction Agency (DTRA) 8725 John J. Kingman Rd. Fort Belvoir, VA 22060 | 10. SPONSORING / MONITORING AGENCY REPORT NUMBER | |||||
11. SUPPLEMENTARY NOTES The views expressed in this thesis are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government. IRB Protocol number N/A. | ||||||
12a. DISTRIBUTION / AVAILABILITY STATEMENT Approved for public release distribution is unlimited | 12b. DISTRIBUTION CODE | |||||
13. ABSTRACT (maximum 200 words) Computer systems continue to be at risk of attack by malicious software that are attached to email. Email has been determined to be the cause of spread of 80% of the computer virus infections. Millions of dollars are lost yearly due to the damage brought by the malicious emails. Popular approaches towards the defense against the malicious emails are the use of antivirus scanners and server based filters. Further, state of the art methods is being employed to enhance security against the malicious programs. However, despite efforts being subjected towards the protection of personal information in emails, the malicious programs continue to pose significant threat. This thesis presented elaborates the application of a hybrid of Runtime Monitoring and Machine Learning for monitoring patterns of malicious emails. The system is designed in way that it gathers malicious emails to determine whether they are suspicious, unknown or benign. The application of Runtime monitoring helps in the reducing chances of spread of emails that are determined to be suspicious and with the likelihood to cause threat to users. Patterns were developed in Rules4business.com to facilitate the detection of threats and applying rules to the identified rules validation while at the same time tracking them. The runtime monitoring application system entails the detection of the malicious emails by the assessment of the pattern in which they are sent and qualifying them into different states identified as benign, suspicious or unknown. Through the application of the system, it would be possible to eliminate threats posed to private individuals and corporations emanating from the malicious emails. We performed deterministic runtime monitoring, built a Hidden Markov Model (HMM), and performed runtime monitoring with hidden data. The reasoning about the patterns of malicious emails with hidden artifacts provides the potential of providing improved classification. | ||||||
14. SUBJECT TERMS Malicious emails, runtime monitoring, statechart assertions, formal specifications, hidden markov model | 15. NUMBER OF PAGES 45 | |||||
16. PRICE CODE | ||||||
17. SECURITY CLASSIFICATION OF REPORT Unclassified | 18. SECURITY CLASSIFICATION OF THIS PAGE Unclassified | 19. SECURITY CLASSIFICATION OF ABSTRACT Unclassified | 20. LIMITATION OF ABSTRACT UU |
NSN7540-01-280-5500 Standard Form 298 (Rev. 2-89)
Prescribedby ANSI Std. 239-18
THIS PAGEINTENTIONALLY LEFT BLANK
Approvedfor public release distribution is unlimited
ANAPPROACH FOR DETECTING MALICIOUS EMAILS USING RUNTIME MONITORING WITHHIDDEN DATA
KristinR. Sellers
Lieutenant,United States Navy
B.S.,Langston University, 2008
Submittedin partial fulfillment of the
requirementsfor the degree of
MASTEROF SCIENCE IN COMPUTER SCIENCE
fromthe
POSTGRADUATE SCHOOL
June2016
Approvedby: Dr. Doron Drusinsky
ThesisAdvisor
Dr.Man-Tak Shing
SecondReader
Dr.Peter Denning
Chair,Department of Computer Science
THIS PAGEINTENTIONALLY LEFT BLANK
ABSTRACT
Computer systems continue to beat risk of attack by malicious software that are attached to email.Email has been determined to be the cause of spread of 80% of thecomputer virus infections. Millions of dollars are lost yearly due tothe damage brought by the malicious emails. Popular approachestowards the defense against the malicious emails are the use ofantivirus scanners and server based filters. Further, state of theart methods is employed to enhance security against the maliciousprograms. However, despite efforts being subjected towards theprotection of personal information in emails, the malicious programscontinue to pose significant threat.
This thesis presented elaboratesthe application of a hybrid of Runtime Monitoring and MachineLearning for monitoring patterns of malicious emails. The system isdesigned in way that it gathers malicious emails to determine whetherthey are suspicious, unknown, or benign. The application of Runtimemonitoring helps in the reducing chances of spread of emails that aredetermined to be suspicious and with the likelihood to cause threatto users. Patterns were developed in Rules4business.com to facilitatethe detection of threats and applying rules to the identified rulesvalidation while at the same time tracking them. The runtimemonitoring application system entails the detection of the maliciousemails by the assessment of the pattern in which they are sent andqualifying them into different states identified as suspicious,unknown, or benign. Through the application of the system, it wouldbe possible to eliminate threats posed to private individuals andcorporations emanating from the malicious emails.
We performed deterministicruntime monitoring, built a Hidden Markov Model (HMM), and performedruntime monitoring with hidden data. The reasoning about the patternsof malicious emails with hidden artifacts provides the potential ofproviding improved classification.
THIS PAGEINTENTIONALLY LEFT BLANK
TABLE OFCONTENTS
I. INTRODUCTION 2
A. THENEED FOR RUNTIME MONITORING OF MALICIOUS EMAILS 3
B. MOTIVATIONFOR USING RUNTIME MONITORING OF HIDDEN DATA 4
C. ORGANIZATIONOF THESIS 5
II. MALICIOUSEMAILS 8
A. DETECTINGMALICIOUS EMAILS BY COLLECTING DATA THROUGH BULK EMAIL OR PHISHING 8
B. DODTARGETED MALICIOUS EMAILS 10
III. background 12
A. Formalspecification Tradeoff Cuboid 12
B. translatingnatural language to Formal Specifcation 13
C. Rules4Business 14
D. StateRoverToolset 16
E. DeterministicRuntime Monitoring and Verification using Statechart Assertions 17
IV. usingHIDDEN MARKOV MODELS in rm 21
A. Runtimemonitoring and verification system with hmm 21
B. PROBABILISTICRUNTIME MONITORING USING STATECHART ASSERTIONS COMBINED WITH HMM 22
C. learningphase with HMM 23
V. RESULTS:PROOF OF CONCEPT 28
A. COLLECTIONOF data 28
B. DeterministicRule development 28
C. StateRoverRule Creation and Code Generation 29
D. Generatingthe HMM from the Learning Phase CSV File 32
E. GenerateSpecial Java Code for Probabilistic RM 34
F. RUNTIMECSV’S 36
1. TheAlpha Method 37
2. RuntimeMonitoring 38
VI. conclusionand future research 40
List of References 43
initial distribution list 46
THIS PAGEINTENTIONALLY LEFT BLANK
LISTOF FIGURES
Figure 1. Fraudulent Email Example. Source: [3]. 9
Figure 2. Cost Space (Source: [6]). 13
Figure 3. Coverage Space (Source: [6]). 13
Figure 4. Name of the columns in R4B (Source: [10]). 15
Figure 5. Rule 9 UML-statechart (Source: [10]). 16
Figure 6. Rule 11 UML-statechart (Source: [10]). 16
Figure 7. A statechart-assertion for requirement Rule 9 (Adapted: [10]). 18
Figure 8. Workflow for developing pattern matching with hidden information (Source: [12]). 23
Figure 9. Snippet of Learning-Phase CSV 25
Figure 10. Capture of Rule 9 Flag Timeline (Source: [10]). 29
Figure 11. Capture of Rule 11 Timeline (Source: [10]). 29
Figure 12. Rule 9 Statechart-Assertion (adapted from: [10]). 30
Figure 13. Rule 9 Events.properties file (adapted from: [10]). 31
Figure 14. JUnit Sanity Test 32
Figure 15. Python Code Quantization 33
Figure 16. Quantization of HMM.JSON File 34
Figure 17. DTRA_RULES 35
Figure 18. Runtime Table CSV File 37
Figure 19. Alpha.json File 38
THIS PAGEINTENTIONALLY LEFT BLANK
LISTOF TABLES
Table 1. Instances of Rule 9 and Rule 11 (Adopted: [10]). 15
Table 2. Matrix A of HMM state transition probabilities 26
Table 3. A part of Matrix B, of probability of observation O in HMM states 27
THIS PAGEINTENTIONALLY LEFT BLANK
LISTOF ACRONYMS AND ABBREVIATIONS
CSV Comma Separated Values
DoD Department of Defense
HMM Hidden Markov Model
IP Internet Protocol
IRS Internal Revenue Service
NL Natural Language
R4B RulesforBusiness
REM Runtime Execution Monitoring
RM RuntimeMonitoring
RV Runtime Verification
UML Unified Modeling Language
THIS PAGEINTENTIONALLY LEFT BLANK
EXECUTIVESUMMARY
Maliciousemails continue to pose significant threat to individuals andinstitutions. The danger they pose is eminent it often difficult todetermine the amount of danger to a particular email recipient yet isknown to be great overall. Hence, it becomes crucial to develop anefficient way to detect email threats. Rules4business.com has come upwith an appropriate channel through which such threats can bedetected and dealt with before causing any imminent danger. However,it can only cater for deterministic information. This thesisdescribes a hybrid of a deterministic monitoring technique with amachine learning technique based on Hidden Markov models.
THIS PAGEINTENTIONALLY LEFT BLANK
ACKNOWLEDGMENTS
Ipass my gratitude to my academic advisors, Dr. Doron Drusinsky, myThesis Advisor, Dr. Man-Tak Shing, Second Reader, Dr. Peter Denning,Chair at the Department of Computer Science, and my writing coach,Michelle Pagnani. I am grateful for their constant support during theentire period of study.
Iextend my gratitude further goes to my family members and friends forthe endless support during the whole period.
Thisresearch was funded by a grant from the U.S. Defense Threat ReductionAgency (DTRA).
THIS PAGEINTENTIONALLY LEFT BLANK
- INTRODUCTION
Email has some time for now been an internet executioner application used by people, organizations, and governments for imparting, sharing and dispersing data. However, a range of illegitimate emails is among the emails sent out. Certain fraudulent actors, for example, those connected with spam use email to send spontaneous mass ads to influence people to buy items that will create income. Other actors, for instance, those behind phishing use email as a means to obtain an individual’s bio data and to profile people who are susceptible to these types of activities. The analysis and monitoring of various types of malicious emails are focused on in the thesis.
The thesis concentrates on analyzing temporal and sequencing patterns of malicious emails using both visible email data as well as learned hidden state information it then used a hybrid run-time monitoring technique to qualify suspicious email sequences.
Based off information in the emails, we developed three categories for the hidden states: suspicious, unknown, and benign. For example, if an individual were constantly receiving an email from a fraudulent actor, we would identify the pattern and classify the hidden state as suspicious. We will use these three states as inputs to the runtime monitoring algorithm described in the sequel.
An assertion, or rule, is a mathematical rule used to predict behavior. In software engineering, “assertion is a statement that a predicate (Boolean-valued function, a true-false expression) is expected to always be true” [16]. The formal specification assertion can monitor the sequencing and the temporal patterns of the malicious emails. By categorizing the emails using assertions, we are also able to compare the behavioral patterns to the correct behavior as specified by a formal specification [14].
The approach taken in this thesis is as follows. First, we developed deterministic rules to detect threats based on temporal and sequencing patterns by deterministic it is meant that the rule assumes all its inputs are visible (have a 0 or 1 probability of occurrence). We then validated those rules by applying them to the known threats. Next, we generated the Hidden Markov Model using a machine learning technique. Finally, in runtime, we used the validated rules to input data that contains both visible and hidden artifacts, for detection and tracking of incoming threats.
Our input email-data is packaged as Microsoft Excel Worksheets. Variations of these spreadsheets were used to (i) perform deterministic runtime monitoring for rule validation, (ii) helped build deterministic rules for monitoring hidden and visible data, (iii) build and generate a Hidden Markov Model (HMM) in the learning phase, and finally (iv) to perform runtime monitoring with hidden data.
- THE NEED FOR RUNTIME MONITORING OF MALICIOUS EMAILS
Often computer security threats encompass execution of unauthorized foreign code on the victim machine [1]. Malicious emails received with links or attachments serves as security threats are one example of unauthorized code. Fiskiran and Lee [1] identified runtime execution monitoring (REM) as a method to detect program flow anomalies associated with such malicious code.
In Fiskiran and Lee’s paper, ‘Runtime Execution Monitoring (REM) to Detect and Prevent Malicious Code Execution’, REM is used to detect program flow anomalies that occur during execution such as buffer overrun attacks commonly used by network and malicious emails. They conclude by asserting the need for formal methods to effectively categorize malicious emails.
This thesis uses a runtime monitoring program to present formal specifications as a way to detect these malicious emails. Runtime monitoring provides real-time situational awareness of conditions, a quality mentioned in the Fiskiran and Lee’s paper. In addition, by using temporal assertions, we demonstrate the detection of sequential patterns of emails. Temporal assertions detect patterns of emails that users may not evident from a single email. Therefore, sequencing and temporal patterns of emails is potentially more informative than monitoring individual emails one by one, independently of each other. This topic is further addressed again in Chapter III.
- MOTIVATION FOR USING RUNTIME MONITORING OF HIDDEN DATA
Every day users are receiving massive amount of emails. With intruders seeking information or hiding their intent by mimicking well-known websites, the user may ask themselves, “can I trust this email?”. A straightforward answer is “analyze the content of the email”, i.e., analyze each email independently of others. This answer however fails to exploit sequencing and temporal information associated with a plurality of emails. Hence, an improved approach, demonstrated in this thesis, is to monitor sequences and temporal patterns of emails. Monitoring sequences of emails is potentially more informative than monitoring individual emails because it helps distinguish a hidden intent of the email sequence, an intent that is not evident from individual emails.
For example, suppose we receive an email from an agent that works for the IRS and uses the same format as the IRS. The agent states that the organization has identified cases of fake agents sending out emails and asking for personal information, but in the content of this email, the agent also asks for contact information. Within the next two days, we receive an email from a different agent, but this individual is also using the same domain. This time, the agent requests date of birth. Receiving both emails within a week, the sequence if more suspicious than each individual email alone.
An additional contribution of this thesis is that it demonstrates monitoring sequences of emails where some email properties are not contained in the email text, i.e., they are hidden properties These properties are probabilistically learned and modeled as a Hidden Markov Model (HMM). Runtime monitoring of temporal and sequencing patterns of emails based on both visible and hidden artifacts has the potential to provide even better discovery of malicious email patterns.
- ORGANIZATION OF THESIS
Chapter II addresses the background information on malicious emails, phishing, the DoD, formal speculations or assertions, and rules to identify parameters to use during runtime monitoring. Chapter III takes a deeper look at runtime monitoring techniques. Chapter IV provides a proof of concept for using the DTRA HMM RV toolset and the Hidden Markov Model to identify hidden data that can be used in behavioral and temporal pattern detection. Section V describes the Microsoft Excel Spreadsheets on the validation and learning data used to validate the formal specifications provided in Chapter IV and the results of running the formal specifications through these tables. Section VI identifies shortcomings and recommendations of this thesis and a conclusion.
THIS PAGE INTENTIONALLY LEFT BLANK
- MALICIOUS EMAILS
- DETECTING MALICIOUS EMAILS BY COLLECTING DATA THROUGH BULK EMAIL OR PHISHING
Bulk email provides a means to disseminate official information to an entire organization with ease [2]. Mass emails are relatively common and are mostly automatically generated [2]. Since bulk emails are standard, most users do not think about the type of information being sought by the fraudulent actor or scammer. Fraudulent actors or scammers can easily exploit both the selected distribution of bulk email or phishing and its perceived integrity. Collecting data from bulk email or phishing can help us to categorize the data. With the use of formal validation and verification techniques, we can further capture and target malicious emails through this collected data. As a result, we can see who is targeted by the malicious email.
Unwanted email, such as spam, is sent in bulk to a large number of people on the Internet, while malicious emails are sent to specific individuals. The techniques that malicious actors use to craft and send these targeted emails are different from the techniques used by scammers. The fraudulent emails appear as if they are an official communication from a particular target bank or a given company. They would contain a malicious attachment or a particular link to the website that tries to collect information from the victim (see Figure 1) [3].
Fraudulent Email Example. Source: [3].
Receiving several of these emails within a week, we will likely perceive these emails to be suspicious. By categorizing them, an organization can more easily decide whether to accept or reject the email coming into their network environment. This especially true when some properties of incoming emails are not deterministically available in the email text rather, they are probabilistically learned or hidden properties. In this case, reasoning about patterns of emails with hidden artifacts has the potential of providing improved or probabilistic classification. Using the Runtime Monitoring and Verification System, we can provide a way to track activity and meet the requirements to keep our systems safe from malicious emails.
- DOD TARGETED MALICIOUS EMAILS
Malicious emails not only target Internet Service Provider users and banks but also to governmental organizations like the Department of Defense. These more sophisticated attacks deploy emails that look identical to official mail and are therefore threatening to the security of Government networks and DoD members [4].
Spear phishing, in particular, is a significant and widespread effort that the DoD is battling. In 2006, the JTF-GNO released an article saying that its members have “observed tens of thousands of malicious emails targeting soldiers, sailors, airmen and Marines U.S. government civilian workers and DoD contractors, with the potential compromise of a significant number of computers across the DoD” [5]. Therefore, fraudulent actors are targeting government employees to gain more than just account or personal information they are focused on collecting intelligence. From the accounts that have been compromised, further exploitation of the DoD network may be possible. However, the true scope is unknown, leading the government to believe that some actors already have extensive knowledge of their targets and know precisely what further information they want. DoD users are required to digitally sign their emails, but the DoD has not been able to protect personal emails. This thesis seeks to define a means of identifying email threats in a Naval and DoD environment.
THIS PAGE INTENTIONALLY LEFT BLANK
- background
- Formal specification Tradeoff Cuboid
Traditionally, formal specifications are used for Validation and Verification (V&V). Verification means to ensure a product is built correctly. Validation is a process aimed at demonstrating that a system meets the user’s true requirements– often called “building the right system” [11]. To select a validation and verification technique that is appropriate for detecting temporal patterns of malicious emails, we used the visual tradeoff space from Drusinsky, Michael, and Shing’s paper in [6], which compares three predominant formal validation and verification techniques. Noted in Drusinsky, Michael, and Shing’s paper, the three techniques include theorem proving, model checking, and runtime monitoring.
The “cube” is a three dimensional comparison known as the formal validation and verification tradeoff cube it is illustrated in Figures 2 and 3 [6]. The tradeoff cubes depict the coverage and cost of each of the three techniques. The three dimensions of the coverage and cost cubes are: (i) specification dimension – the techniques capacity to specify complex properties, (ii) the efficiency of verification dimension, and (ii) the complexity of programs that can be verified.
Ultimately, we chose RM as the best method of monitoring malicious emails because we are not concerned with the verification and program dimensions of the cube. This is because when monitoring for patterns of malicious emails there is no underlying program to verify.
Cost Space (Source: [6]).
Coverage Space (Source: [6]).
- translating natural language to Formal Specifcation
In software engineering, formal specifications are scientifically based procedures that help with the implementation of systems and software. They are used to portray a system, to examine its conduct, and to help in its configuration by confirming key properties of interest. These specifications are formal in the sense that they help improve clarity and precision of specifications requirements. So, the question is asked, “why convert natural language to formal specification?”
Natural language (NL) is inherently ambiguous, rendering accurate specification problematic [7]. However, formal specifications allow us to convey the exact intent of the natural language requirement. Essentially, a formal specification is meant to pinpoint particular information that the user seeks to extract from the natural language. Drusinsky’s paper [8] presents patterns for ensuring that formal specifications catch the intent of underlying natural language requirements [9].
- Rules4Business
Rules4Business (R4B) is a website that allows users to create rules based on events and timing patterns. The rules are a way of analyzing, validating, and verifying the behavior of the models in the data provided. Patterns define temporal and sequencing behavior. Therefore, the user can use R4B to choose and customize statechart assertions. In R4B, users have two interfaces that allow them to customize instances and validate their assertions. First, users are able to select a rule according to the NL assertions. Secondly, users upload a spreadsheet with the required columns to be able to validate assertions. Figure 4 shows an example of how to specify the column indexes before uploading the spreadsheet.
Name of the columns in R4B (Source: [10]).
In this thesis, we choose to create two rules. The rules we chose are rule 9 and rule 11 their instances are shown in Table 1. Figures 5 and 6 show the corresponsding UML-statecharts for each rule.
Rules4Business | ||
Rule 9 | Pattern | Flag whenever some pair of consecutive E events is less than time T apart |
Events and Limits | E=HiddenState==="U", Time bounds: T=30, Time units: minutes | |
Description | Flag whenever some pair of consecutive unknown SendingIP are less than 30 minutes apart. | |
Rule 11 | Pattern | Flag whenever event P with eventual event Q within time T after P. |
Events and Limits | P= Sendinghost.indexOf("3ff7b9e2.cst.lightpath.net")>=0, Q=HiddenState==="S", Time bounds: T=1, Time units: hours | |
Description | Flag when there is a suspicious email within one hour of an email from 3ff7b9e2.cst.lightpath.net |
Instances of Rule 9 and Rule 11 (Adopted: [10]).
Rule 9 UML-statechart (Source: [10]).
Rule 11 UML-statechart (Source: [10]).
- StateRover Toolset
The StateRover used in this research is used as part of the code generation
process, whose algorithm is given in Chapter V . The code generation process is implemented by the dtracg tool (see Chapter V ), which relies on code generated from the StateRover. There is no other reason for using the StateRover in this research, other than this purely technical reason therefore, uninterested readers can jump to Section .
The StateRover used extends the statechart diagrammatic notation with Java as an action language, resulting in a Turing-equivalent notation [13]. Before we can use the StateRover toolset we perform validation testing to assure that the formal specification assertion is trusted to represent the requirements of the rules in the subsequent automated verification phase.
- Deterministic Runtime Monitoring and Verification using Statechart Assertions
Runtime Monitoring is a light-weight formal verification technique in which the runtime execution of a system is checked and contrasted with an executable variant of the system’s formal specification. As such, runtime verification (RV) carries on as a computerized eyewitness of the program’s conduct and contrasts that conduct with the normal conduct per the formal specification [12].
Consider the following generic natural language (NL) pattern, being generic rule 9 of the rules4business website:
Rule 9: Flag whenever some pair of consecutive E events is less than time T apart.
Figure 7 depicts a statechart-assertion for rule 9 as designed using the StateRover tool.
A statechart-assertion is a machine augmented with a particular flowcharting capabilities, hierarchy, a Java action language, and a built-in Boolean flag named bFlag whose default value is false, with a valid value indicating that the pattern has been flagged [12]. The statechart-pattern of Figure 7 combines the flowchart and state-machine elements. The compound states and flowchart action boxes are flowchart elements the statechart flows through the boxes while executing their actions and conditions.
A statechart-assertion for requirement Rule 9 (Adapted: [10]).
In the statechart-assertion of Figure 7, the statechart flows through the Initial flowchart box, executes its actions, and then checks whether the SendingIP transaction is unknown or not. Therefore, if rule 9 has been violated, and the statechart-assertion transitions to the Error state where it sets the bSuccess flag to false. The flag indicates that the assertion has failed [13].
NL1 is an instance of generic rule 9. It uses visible data in an email, i.e., all monitored events are assumed to be visible:
NL1. Flag whenever some pair of consecutive unknown SendingIP are less than 30 minutes apart.
THIS PAGE INTENTIONALLY LEFT BLANK
- using HIDDEN MARKOV MODELS in rm
- Runtime monitoring and verification system with hmm
Runtime monitoring (RM) is a technique that allows the user to observe the behavior of the system while it is running. Also, it analyzes the system’s current behavior to determine if it satisfies or violates formal specifications.
We use HMM’s to detect behavioral patterns of hidden artifacts. Using an HMM, we monitor hidden HMM states as well as visible email data. As stated in Drusinsky’s paper [12], HMM components are: (i) a set of states, (ii) observations made in those states, (iii) state transition probabilities, and (iv) initial state distribution. In our specific HMM, the set of states consists of three states previously mentioned: suspicious (S), unknown (U), and benign (B). The observable tuple, O, consists of combinations of data taken from cells of the input email spreadsheet specifically, it is a pair of Sending host and SendingIP values, both represented as integers.
As suggested in Drusinsky’s paper [12], after determining the transition and observable probabilities, we can assume that the initial state will always be suspicious. This email-data is fed into the HMM Drusinsky states that it will execute the probability estimation algorithm that runs on a current iteration. The HMM outputs the vector of symbols and associated probabilities then becomes the input into the pattern’s implementation code with malicious emails. Pattern implementation code outputs a weighted version of a state-machine state change which can be used to conduct RV with existing formal specifications [12].
- PROBABILISTIC RUNTIME MONITORING USING STATECHART ASSERTIONS COMBINED WITH HMM
In Chapter III, we discussed how we perform deterministic runtime monitoring by using a code generator that produces a deterministic pattern implementation. Moreover, by using RM with HMM, we were able to perform a probabilistic pattern detection using a special pattern code generator that produces a probabilistic, weighted implementation [12]. Figure 8 shows all the steps from determining the natural language to performing probabilistic pattern matching.
Workflow for developing pattern matching with hidden information (Source: [12]).
The NL pattern is employed when it comes to detection of data that has not been presented. When it comes to the detection of malicious emails, it is essential to employ the same technique to help gather information that would help categorize a particular email as being benign, unknown or suspect. To facilitate the detection of the pattern and the corresponding state-chart patter, it could be feasible to use pattern detection architecture. The architecture has been determined to have crucial components including
A Hidden Markov Model that is employed to decode probability of the occurrence of sequences in hidden states for the malicious emails. The states defined include benign, unknown and suspicious. The HMM offers plurality of weighted type of threat to the statechart-pattern. Further, HMM is of critical significance since it offers a way of capturing patterns that are essential in the acquisition of accurate decisions. In this thesis, the hidden part is the categorization of the different types of emails to help in ascertaining whether their state and see whether they are malicious and do not pose a significant threat to the organization.
A special code generator generating a probabilistic implementation for a statechart-pattern, that operates on weighted inputs derived from HMM.
The pattern is evaluated through the use of a success score in a defined range.
- learning phase with HMM
In the learning phase, we created a learning phase spreadsheet of collected data from bulk or phishing emails. We added the HiddenState column to the spreadsheet. Using the information from the emails, we classified the hidden states into three categories. The emails are categorized as: suspicious (S), unknown (U), and benign (B). To populate the spreadsheet, we acted as the expert pulling information from the emails like the date, time, IP address, links, attachments, etc. Figure 9 shows a snippet of our learning phase spreadsheet. Now, we take a look into what role HMM plays in the learning phase. We will explain the HMM creation and learning algorithm.
The learning phase has two columns: k- a visible output column that consist of (sending host and sending IP), and s, the hidden state column (HiddenState). We let k, s, and N represent the visible output column, hidden state column, and the total number of rows. Also, ki and si are the values of the visible output and hidden state columns in row i. In Drusinsky’s paper [14], our HMM is derived from the following requirements as shown below:
The HMM transition’s probability between states is calculated by dividing the number of specific transition to N-1 (total number of transition in the spreadsheet). Lets say we have 15 transitions from suspicious (S) to unknown (U) in s and N is 31. Therefore, the HMM transition’s probability for S->U is #(S->U)/(N-1) that equals 0.5
Lets say the value in ki is O and si is U. The observable’s probability, O, emitted in the hidden state U is calculated by diving the number of times when a row satisfies k and U.
The initial distribution probability is equal to the number time a row i satisfies with the hidden state value divided by N. For example, we have 12 unknown emails in the spreadsheet with a total of 35 rows. Therefore, the initial distribution probability is 0.34.
The learning phase csv file with HMM gives us the opportunity to actually see what are the probabilities of the hidden states being detected. Using the data from our learning phase csv, we were able to generate Matrix A and Matrix B. Matrix A, represented in Table 2, provides the state transition properties for the HMM. Matrix B, represented in Table 3, gives the probability of the given observable tuple, O, being observed in the three states. In Chapter V, we will present the results of how effective the HMM is.
Snippet of Learning-Phase CSV
Transition
SourceTarget
Suspicious
Unknown
Benign
Suspicious
.206
.058
.058
Unknown
.058
.176
.058
Benign
.058
.058
.147
Matrix A of HMM state transition probabilities
Ostate
Suspicious
Unknown
Benign
<0,1,0>
.853
.066
.080
<0,2,1>
.611
.130
.388
<1,0,0>
.644
.172
.206
A part of Matrix B, of probability of observation O in HMM states
THIS PAGE INTENTIONALLY LEFT BLANK
- RESULTS: PROOF OF CONCEPT
- COLLECTION OF data
We used data from Naval Postgraduate School ITACS and bulk and phishing emails from a personal account. We sorted the emails by details within the emails or attachments for various characteristics based on the three definitive categories. We tracked the emails to see if a user would receive mails from a certain IP address, host address, etc. The emails contained specific details that were recorded in a spreadsheet to be used as a guide in the categorization based on the different levels of threat. We organized the data regarding details of the emails initially recorded in the spreadsheet that helped in their categorization into different threat levels by date, time, sending host, sending IP, subject and attachments, and defined the hidden state as suspicious, unknown, and benign.
- Deterministic Rule development
In Chapter III, Section A1, we discussed already how the user is about to create rules in the R4B website. As a part of developing our rules, we had to validate our rules as well. To validate our assertion, we uploaded our validation csv, which we expect our rules to flag. Figure 11 depicts a sample of the validation csv. By using the running example set, each rule was utilizing Validation.csv, that is by uploading it to rules4business.com as shown in the figure below.
In our validation phase, we validated rules 9 and 11. Rule 9 evaluates whether the emails are less than 30 minutes apart from the Sending IP is an unknown (U) threat. Recall in Chapter III, Section A1 that rule11 evaluates whether a suspicious email within one hour is from a specific Sending host address. In Figures 13 and 14, we show if the results are what we expected or not.
Capture of Rule 9 Flag Timeline (Source: [10]).
Capture of Rule 11 Timeline (Source: [10]).
In rule 9, we see that it flagged an unknown email within 30 minutes, and rule 11 flagged a suspicious email within one hour from the specific Sending host. Therefore, we have validated both rules and checked whether they both flagged what we expected them too or not.
- StateRover Rule Creation and Code Generation
In Chapter III, Section A2, we already discussed the purpose of the StateRover. The StateRover in our process is used just to save development time and money when developing DTRA toolset.
In this section, we show the conversion of R4B diagrams to StateRover diagrams. In Figure 15 is a look of the statechart-assertion of rule 9. The diagram starts with the initial state and the events are transition between states. The final state is known as the flag state which lets us know if the assertion succeeds or fails. Whenever, the StateRover reaches the final state, it gives a false to bSuccess because the assertion sees a flagged event.
Rule 9 Statechart-Assertion (adapted from: [10]).
StateRover needs two arguments for the automatic code generate that we created. The file for each rule we created is the rules_events.properties files. The files are simple text file that were implemented in the R4B website when we created our rules as shown in Figure 16.
Rule 9 Events.properties file (adapted from: [10]).
A two-step process using StateRover is applied in verification of the rules. First, StateRover generates a java code by our statechart diagrams. Second, we run a JUnit test to verify that the StateRover has the same behavior patterns for each statechart assertions as in R4B. Figure 17 shows a successfully ran JUnit sanity test.
JUnit Sanity Test
- Generating the HMM from the Learning Phase CSV File
In Chapter IV, Section C, we already discussed how we created our learning phase spreadsheet, populated our hidden state column, and the HMM creation and learning algorithm.
We know that the HMM cannot handle an open-ended range because it has a large amount of outcomes that should be processed. Therefore, we use R4B to help us go beyond the open-ended range. So, we quantize our values with corresponding values such as E==SendingIP==Type 1. There are four types as follows:
Type 1 represents the beginning of the IP address starting with 63.
Type 2 represents the beginning of the IP address starting with 157.
Type 3 represents the beginning of the IP address starting with 45.
Type 4 represents any IP address that is not specific within Types 1-3.
In Figure 18 shows an example of our python code for quantization. Quantization enables us to map generic names or numbers.
Python Code Quantization
The last step is to run the command for generating hmm.json file, which includes the quantized visible data for the hidden states as shown in Figure 19.
Quantization of HMM.JSON File
- Generate Special Java Code for Probabilistic RM
In this step, we created a new java project called DTRA_Rules including each rules’ java codes and sanity tests as shown in Figure 20. The DTRA HMM-RVtoolset is a tool applied to statechart assertions customized by using the service R4B.
DTRA_RULES
The special Java code for probabilistic RM implements an algorithm. Suggested in Drusinsky’s paper [14], this algorithm forms input sequence that is of two tuple list such as Input={K1,P1},{K2,P2}, {K3,P3}…{KN,PN}. Ki is an event that is visible (i.e., Sendinghost and SendingIP) or hidden (i.e., HiddenState column), Pi is the probability of distribution (POD) of Ki. Ki = eventi [conditioni], eventi and conditioni which is an optional argument that is hidden or visible.
As suggested in [14], each assertion has a collection of instances called configurations. Collection is labeled as Col and the configuration as Conf. Each Conf has a present state PS(Conf) and probability value called P(Conf) that shows the behavior of how the assertion acts as proposed by Conf. The startup shows how each Conf has the default value that represents the present-state and probability of Conf that is 1. Therefore, if Pi=1, the Conf acts like a traditional state machine like St_1 -> timeoutFire[count>N] Flag. On the off chance that Pi !=1, it implies that eventi or conditioni is hidden. In this way, the Conf react the sets, {Si, Pi}, with two configurations called Conf1 and Conf2. Probabilities and states of Conf1 and Conf2 are computed relying on whether event or condition is hidden. The calculation is below as follows:
If eventi is hidden,
P(Conf1)=P(Conf)*Pi and P(Conf2)=1-P(Conf1)
PS(Conf1) is the next state decided by transition, if event fired. If not then, PS(Conf) assigned to PS(Conf2).
If conditioni is hidden,
P(conditioni) is calculated according to the constitutive components. For instance, if conditioni is HiddenState=M || HiddenState=S, P(conditioni)=P(HiddenState=M) + P(HiddenState=S). And then P(Conf1)=P(Conf)*P(conditioni) and P(Conf2)=1-P(Conf1)
PS(Conf1) and PS(Conf2) are calculated as conditioni is true and false.
Keeping in mind the end goal to give effortlessness, both eventi and conditioni are not permitted to be hidden. Configurations that have same present state are joined in a one configuration as Confcombined by summing all P’(Conf).
The statechart assertions proclaims the probability of violation of its corresponding requirements also known as probability of failure (POF) [14]. It is entirety of all P(Conf). The present state of the configuration is an error state which is likewise the flag state in R4B. Error states are represented in statechart assertions as sink states that are deadlock with no transition. For these assertions, the probability of POF bit by bit increments after some time.
- RUNTIME CSV’S
In the final phrase of this thesis, we created a runtime table spreadsheet. The runtime table csv file has four columns: date, sending host, NA0, and sending IP. The hidden state column is not presented in the runtimetable csv file as shown in Figure 21. We used the runtime table csv file to run the alpha method and runtime monitoring discussed in sections 8 and 9.
Runtime Table CSV File
- The Alpha Method
The alpha method has been used for the estimation of probability of occurrence at a given time. However, it suffers from the scaling problem because of the calculation_′ t (i ), _ (i ), and _t ”(i ) generates values that potentially scale down geometrically with t5. However, various measures have been enacted to help in addressing the challenges that are likely to arise. In the alpha phrase, we generated the alpha.json file before we can conduct runtime monitoring. Figure 22 shows a snippet of the alpha.json file. We need the alpha.json in the next step, section 9, to use as a command argument.
Alpha.json File
Noted in Drusinsky’s paper [12], is a proposed three techniques that estimates POD at a specific time t. Therefore, out of the three techniques, we used the alpha method which is where the alpha.json file comes from. The alpha method is applied and RV tool set monitors each and every row in the runtime table csv file. Every state has an alpha value that is a probability of the state given input.
The alpha method does not need observable sequences. It further allows for the detection of POD. It uses N values of `t(i)=P(qt=si|O1O2…Ot, ), one
per symbol si, 1iN. Note that Σ1iN `t(i) = 1. For further explanations, Drusinsky explains the alpha algorithm in [12]. It is employed in process mining with the aim of reconstructing a causality from a defined set of sequence of events. It focuses on the examination of causal relationships that have been observed between different tasks.
- Runtime Monitoring
Sections 7 through 9 helped us to conduct our runtime monitoring. Figure 23 shows a list of probabilities of Rule 9. This represents the probability of the monitor reaching the flag state. Each row of each rule has probability value from 0 to 1 range. This proves that runtime monitoring is an effective tool to detect malicious emails. The same is achieved through the ability to detect different types of emails followed by their definition into different categories. They include the benign, suspicious and unknown.
THIS PAGE INTENTIONALLY LEFT BLANK
- conclusion and future research
Malicious emails continue to cause a significant challenge because of the threat that they present. Measures that have been imposed to help in dealing with the malicious have not been successful. Potential threat imposed by the malicious emails adjust to the inventions that are introduced to help in dealing with the menace adjusts to the measures that are imposed. Even though the complete eradication of programs that are malicious appears to be a difficult task, the information possessed regarding the availability of the malicious programs is crucial in limiting the threat that exists. The Runtime monitoring system was developed to facilitate the detection and the classification of the emails into identified states such as benign, unknown and suspicious. The ability to detect such emails is essential as the first step toward ensuring that private individuals and organizations are protected from the potential dangers that arise from the availability of the email. As cases of malicious attempts by email continue to evolve and overcome the security steps that have been imposed, the Runtime monitoring system would be endowed with the ability to track the threats and deal with them. It would be possible to address the threats that are imposed by the malicious emails. Further, through the application of the runtime monitoring system, private individuals and organizations would be well placed to deal with the dangers and minimize the costs to be incurred due to the execution of the threats that are known. The development and validation of the rules that help detect and classify threats serve as the first step toward management of the potential threats presented by the malicious emails. Further, it was possible to perform deterministic runtime monitoring rules for the validation. The activity was executed through the building of the Hidden Markov Model at the learning phase followed by the performance of runtime monitoring with the data that is hidden.
In future, the project would be migrated to a given direction such that the program can take complete control over spam email to fish out emails that are suspect. Further, with the development of technology, it is essential to have an email that is endowed with the capability of dealing with emails that escape firewalls. The runtime monitoring system would be helpful when it comes to dealing with the threats that are imminent for identification, validation, and execution of the necessary measures to deal with the issue. However, it is crucial to acknowledge the technological advances that are witnessed. The advancements further increase the ability of malicious emails to deal with security steps that have been imposed. Such is reason enough to explore measures that seek to adapt to the threats stipulated in the malicious emails. It opens ground for future research in a bid to deal with the continued threats. The framework for the Runtime monitoring system allows for ease when it comes to categorization of emails that could be containing the potential threat. Different angles exist for future research where rather than conducting the detection and validation, another system can facilitate the execution of curtailing the threat. It would help reduce the imminent dangers posed by the availability of the emails and the components that it could be attached. Such a measure would facilitate curbing the threat while enhancing flexibility in dealing with the malicious emails.
THIS PAGE INTENTIONALLY LEFT BLANK
List of References
[1] A.M. Fiskiran and R. B. Lee, “Runtime Execution Monitoring (REM) to Detect and Prevent Malicious Code Execution,” Princeton University, 2004. ICCD 2004, IEEE International Symposium on, pp. 452-457, October 2004, URL: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1347961
[2] A. A. Slack, “Digital Authentication for Official Bulk Email,” Naval Postgraduate School, 2009. NPS 2009, pp. 5-10, March 2009, URL: http://cisr.nps.edu/downloads/theses/09thesis_slack.pdf
[3] E. Sharf, “Fake malware notifications from “Websense Labs”,” Websense Security Labs Blog, 2011.
URL:http://community.websense.com/blogs/securitylabs/archive/2011/09/22/fake-malware-notifications-from-websense-labs.aspx, last accessed September 2015.
[4] J. W. Ragucci, S. A. Robila, "Societal Aspects of Phishing," Technology and Society, 2006. ISTAS 2006, IEEE International Symposium on, pp. 1-5, 8-10 June 2006, URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=43 75893&isnumber=4375875, last accessed September 2015.
[5] B. Brewin, “DOD battles spear phishing,” The Business of Federal Technology, 2006. URL: https://fcw.com/articles/2006/12/26/dod-battles-spear-phishing.aspxfrom-websense-labs.aspx, last accessed September 2015.
[6] D. Drusinsky, J. B. Michael, and M.-T. Shing, “A visual tradeoff space for formalverification and validation techniques,” Systems Journal, IEEE, vol. 2, no. 4, pp. 513–519, Dec. 2008.
[7] K. Shimizu, D. L. Dill, and A. J. Hu, “Monitor-based formal specification of PCI,” Formal Methods in Computer-Aided Design, vol. 1954, pp. 372–390, Jun. 2000.
[8] D. Drusinsky, J. B. Michael, T. W. Otani, and M.-T. Shing, “Validating UML statechart-based assertions libraries for improved reliability and assurance,” in SSIRI’08. Second International Conference on, Yokohama, Japan, 2008, pp. 47– 51.
[9] J. J. Galinski, “Formal Specifications for an Electrical Power Grid System Stability and Reliability,” Naval Postgraduate School, 2015. NPS 2015, pp. 1-11, September 2015, URL:http://cisr.nps.edu/downloads/theses/15thesis_galinski.pdf.
[10] D. Drusinsky. “Rules for Business.” Rules4Business. http://www.rules4business.com/acmeBank/index.html
[11] A. D. Preece, “Verification and Validation of Knowledge-Based Systems with Formal Specifications,” University of Aberdeen, 1990, pp. 1-4. URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.7692&rep=rep1&type=pdf.
[12] D. Drusinsky, “Behavioral and Temporal Pattern Detection within Financial Data with Hidden Information,” J. UCS, vol. 18, no. 14, pp. 1950–1966, Jul. 2012.
[13] D. Drusinsky, “UML-based Specification, Validation, and Log-file based Verification of the Orion Pad Abort Software,”
[14] D. Drusinsky, “Runtime Monitoring and Verification of Systems with Hidden Information,” Innovations in Systems and Software Engineering: Volume 10, Issue 2 (2014), Page 123-136. http:// www.time-rover.com/articles.html
[15] D.Drusinsky, “A Hidden Markov Model based Runtime Monitoring Tool,”
[16] R. Sedgewick, “Just The facts101Textbook Key Facts, Algorithms,” 4th Edition,
pp.
[17] R. Jhala and R. Majumdar, “Software model checking,” ACM CSUR, vol. 41, no. 4, pp. 21–22, Oct. 2009.
THIS PAGE INTENTIONALLY LEFT BLANK
initial distribution list
1. Defense Technical Information Center
Ft. Belvoir, Virginia
Attn: Paul Tandy
2. Dudley Knox Library
Naval Postgraduate School
Monterey, California