NAVAL POSTGRADUATE SCHOOL
MONTEREY, CALIFORNIA
THESIS
[Title]
Kristin R. Sellers
Thesis Advisor: Doron Drusinsky
Second Reader: Man-Tak Shing
[Category]
REPORT DOCUMENTATION PAGE | Form Approved OMB No. 0704-0188
Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instruction, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188) Washington DC 20503.
1. AGENCY USE ONLY (Leave blank) | 2. REPORT DATE | 3. REPORT TYPE AND DATES COVERED Master's thesis
4. TITLE AND SUBTITLE [Title] | 5. FUNDING NUMBERS
6. AUTHOR(S) Kristin R. Sellers
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) Naval Postgraduate School Monterey, CA 93943-5000 | 8. PERFORMING ORGANIZATION REPORT NUMBER
9. SPONSORING / MONITORING AGENCY NAME(S) AND ADDRESS(ES) N/A | 10. SPONSORING / MONITORING AGENCY REPORT NUMBER
11. SUPPLEMENTARY NOTES The views expressed in this thesis are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government. IRB Protocol number N/A.
12a. DISTRIBUTION / AVAILABILITY STATEMENT [Category] | 12b. DISTRIBUTION CODE
13. ABSTRACT (maximum 200 words) This thesis provides a technique for using runtime monitoring to track the activity of malicious emails. Malicious emails are gathered and examined to establish whether each one represents a threat that is suspicious, unknown, or neutral. We developed rules (patterns) in Rules4business.com to detect the threats, applied the rules to known threats for rule validation, and then applied the validated rules to input data for detection and tracking of threats. We performed deterministic runtime monitoring, built a Hidden Markov Model (HMM), and performed runtime monitoring with hidden data. Reasoning about the patterns of malicious emails with hidden artifacts has the potential to provide improved (probabilistic) classification.
14. SUBJECT TERMS malicious emails, runtime monitoring, statechart assertions, formal specifications, hidden Markov model | 15. NUMBER OF PAGES 44
16. PRICE CODE
17. SECURITY CLASSIFICATION OF REPORT Unclassified | 18. SECURITY CLASSIFICATION OF THIS PAGE Unclassified | 19. SECURITY CLASSIFICATION OF ABSTRACT Unclassified | 20. LIMITATION OF ABSTRACT UU
NSN 7540-01-280-5500 Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. 239-18
[Category]
[Title]
Kristin R. Sellers
Lieutenant, United States Navy
B.S., Langston University, 2008
Submitted in partial fulfillment of the
requirements for the degree of
MASTER OF SCIENCE IN COMPUTER SCIENCE
from the
NAVAL POSTGRADUATE SCHOOL
Approved by: Dr. Doron Drusinsky
Thesis Advisor
Dr. Man-Tak Shing
Second Reader
Dr. Peter Denning
Chair, Department of Computer Science
ABSTRACT
This thesis provides a technique for using runtime monitoring to track the activity of malicious emails. Malicious emails are gathered and examined to establish whether each one represents a threat that is suspicious, unknown, or neutral.
We developed rules (patterns) in Rules4business.com to detect the threats, applied the rules to known threats for rule validation, and then applied the validated rules to input data for detection and tracking of threats. We performed deterministic runtime monitoring, built a Hidden Markov Model (HMM), and performed runtime monitoring with hidden data. Reasoning about the patterns of malicious emails with hidden artifacts has the potential to provide improved (probabilistic) classification.
TABLE OF CONTENTS
I. INTRODUCTION
A. THE NEED FOR RUNTIME MONITORING OF MALICIOUS EMAILS
B. MOTIVATION FOR USING RUNTIME MONITORING OF HIDDEN DATA
C. ORGANIZATION OF THESIS
II. MALICIOUS EMAILS
A. DETECTING MALICIOUS EMAILS BY COLLECTING DATA THROUGH BULK EMAIL OR PHISHING
B. DOD-TARGETED MALICIOUS EMAILS
III. OVERVIEW OF RUNTIME MONITORING TECHNIQUES
A. FORMAL SPECIFICATION TRADEOFF CUBOID
1. Rules4Business
2. StateRover Toolset
3. Deterministic Runtime Monitoring and Verification Using Statechart Assertions
B. FORMAL SPECIFICATIONS PATTERNS
IV. HIDDEN MARKOV MODEL
A. PROBABILISTIC RUNTIME MONITORING USING STATECHART ASSERTIONS COMBINED WITH HIDDEN MARKOV MODELS
B. DTRA HMM-RV TOOLSET
V. RESULTS: PROOF OF CONCEPT
A. LEARNING PHASE CSVS
1. Rules4Business Rule 9
2. Rules4Business Rule 11
3. Generating the HMM from the Learning Phase CSV File
B. RUNTIME CSVS
VI. CONCLUSION AND FUTURE RESEARCH
LIST OF REFERENCES
INITIAL DISTRIBUTION LIST
LIST OF FIGURES
Figure 1. Fraudulent Email Example. Source: [3].
Figure 2. Cost Space (from [6]).
Figure 3. Coverage Space (from [6]).
Figure 4. Rule 11 UML-statechart (from [10]).
Figure 5. A statechart-assertion for requirement Rule 9 (adapted from [10]).
Figure 6. Event timeline for evaluating Rule 9 (from [10]).
Figure 7. Event timeline for evaluating Rule 11 (from [10]).
Figure 8. Workflow for developing pattern matching with hidden information (from [12]).
Figure 9. Sample of Validation CSV.
Figure 10. Multiple time intervals flag for Rule 9.
Figure 11. Multiple time interval flag for Rule 11.
Figure 12. Learning-Phase CSV.
LIST OF ACRONYMS AND ABBREVIATIONS
CSV Comma Separated Values
DoD Department of Defense
HMM Hidden Markov Model
IP Internet Protocol
IRS Internal Revenue Service
NL Natural Language
REM Runtime Execution Monitoring
RV Runtime Verification
UML Unified Modeling Language
EXECUTIVE SUMMARY
Malicious emails continue to pose a significant threat to individuals and institutions. The danger is difficult to assess because a recipient often cannot tell how much harm a given email may cause, so an efficient means of detecting these threats is crucial. Rules4business.com offers an appropriate channel through which such threats can be detected and dealt with before they cause imminent danger. The site provides a monitoring capability that facilitates the detection of threats while also offering options for how to deal with them.
ACKNOWLEDGMENTS
I extend my gratitude to my academic advisors: Dr. Doron Drusinsky, my thesis advisor; Dr. Man-Tak Shing, my second reader; and Dr. Peter Denning, Chair of the Department of Computer Science. I am grateful for their constant support during the entire period of study.
My gratitude also goes to my family members and friends for their endless support throughout this period.
I. INTRODUCTION
Email has for some time been a killer application of the Internet, used by people, organizations, and governments for communicating, sharing, and disseminating data. However, a range of illegitimate emails is sent alongside legitimate ones. Certain fraudulent actors, for example those behind spam, use email to send unsolicited mass advertisements intended to persuade people to buy items and thereby generate income. Other actors, such as those behind phishing, use email to obtain individuals' personal data and to profile people who are susceptible to these types of activities. This thesis focuses on the analysis and monitoring of various types of malicious emails.
This thesis concentrates on analyzing the temporal and sequencing patterns of malicious emails to determine their hidden states and then classify suspicious email sequences. Based on information in the emails, we developed three categories for the hidden states: suspicious, unknown, and benign. For example, if an individual repeatedly receives email from a fraudulent actor, we would identify the pattern and classify the hidden state as suspicious. An assertion is a mathematical rule used to predict behavior; in software engineering, an "assertion is a statement that a predicate (Boolean-valued function, a true-false expression) is always expected to be true" [16]. A formal specification statement can monitor the sequencing and temporal patterns of the malicious emails. By categorizing the emails using assertions, we are also able to compare the observed behavioral patterns to the correct behavior as specified by a formal specification [14].
The approach taken in this thesis is as follows. First, we developed rules to detect threats based on temporal and sequencing patterns. We then validated those rules by applying them to known threats and generated the Hidden Markov Model. Finally, at runtime, we applied the validated rules to input data for detection and tracking of incoming threats. We coded the email data into Microsoft Excel worksheets, which were used to (i) perform deterministic runtime monitoring for rule validation, (ii) help build deterministic rules for monitoring hidden and visible data, (iii) create and generate a Hidden Markov Model (HMM) in the learning phase, and (iv) perform runtime monitoring with hidden data.
A. THE NEED FOR RUNTIME MONITORING OF MALICIOUS EMAILS
Computer security threats often involve the execution of unauthorized foreign code on the victim machine [1]. Malicious emails used for denial-of-service attacks are one example. Fiskiran and Lee [1] identified runtime execution monitoring (REM) as a method to detect program-flow anomalies associated with such malicious code. Nevertheless, formal methods of validation and verification, such as execution-based model checking, could further extract information from malicious emails.
In Fiskiran and Lee's paper, "Runtime Execution Monitoring (REM) to Detect and Prevent Malicious Code Execution," REM detects program-flow anomalies that occur during execution, such as the buffer-overrun attacks commonly delivered over networks and through malicious emails. In their conclusions and recommendations, they note the need for formal methods for categorizing malicious emails. This thesis uses a runtime monitoring program to express formal specifications for execution-based model checking. Runtime monitoring provides stability and reliability by giving adequate real-time situational awareness of conditions, a quality mentioned in Fiskiran and Lee's paper. In addition, by using temporal assertions, we can detect patterns in sequences of emails. Temporal assertions extract information in an email that users may not be able to see directly; over time, they allow users to determine, for instance, that messages are coming from the same IP address. Therefore, sequencing and identifying temporal patterns in emails has the potential to be more informative than monitoring emails one by one, independently of each other. Temporal assertions gather not only information the user cannot see but also the visible information in the email, allowing deterministic assertions to be made as well. This topic is addressed again in Chapter III.
B. MOTIVATION FOR USING RUNTIME MONITORING OF HIDDEN DATA
This section discusses the hidden data states, a formal specification about one of these states, and how these factors fit together to enhance our ability to conduct runtime monitoring of malicious emails. For example, suppose we receive an email from an agent who claims to work for the IRS and uses the same format as the IRS. The official states that the organization has identified cases of fake agents sending out emails asking for personal information, yet in the content of this email the agent also asks for contact information. Within the next two days, we receive an email from a different agent who is using the same domain; this time, the agent requests a date of birth. Receiving both emails within a week, we become suspicious, particularly because some properties of the incoming emails are not deterministically available in the email text; rather, they are probabilistically learned, or hidden, features. In this case, reasoning about temporal and sequencing patterns of emails with hidden artifacts has the potential to provide improved (probabilistic) classification.
C. ORGANIZATION OF THESIS
Chapter II provides background information on malicious emails, phishing, the DoD, formal specifications (assertions), and rules that identify parameters to use during runtime monitoring. Chapter III takes an in-depth look at runtime monitoring techniques. Chapter IV provides a proof of concept for using the DTRA HMM-RV toolset and the Hidden Markov Model to identify hidden data that can be used in behavioral and temporal pattern detection. Chapter V describes the Microsoft Excel spreadsheets containing the validation and learning data used to validate the formal specifications and presents the results of running those specifications against these tables. Chapter VI identifies shortcomings, offers recommendations, and concludes the thesis.
II. MALICIOUS EMAILS
A. DETECTING MALICIOUS EMAILS BY COLLECTING DATA THROUGH BULK EMAIL OR PHISHING
Bulk email provides a means to disseminate official information to an entire organization with ease [2]. Mass emails are relatively common and are mostly generated automatically [2]. Because bulk emails are so routine, most users do not think about the type of information being sought by a fraudulent actor or scammer. Fraudulent actors can easily exploit both the selective distribution of bulk email or phishing and its perceived integrity. Collecting data from bulk email or phishing helps us categorize the data, and with formal validation and verification techniques we can further capture and target malicious emails through this collected data. As a result, we can see who is targeted by the malicious email.
Unwanted email, such as spam, is sent in bulk to a large number of people on the Internet, whereas malicious emails are sent to specific individuals. The techniques that malicious actors use to craft and send these targeted emails differ from the techniques used by ordinary scammers. Fraudulent emails appear to be official communication from a particular bank or company and contain a malicious attachment or a link to a website that tries to collect information from the victim (see Figure 1) [3].
Figure 1. Fraudulent Email Example. Source: [3].
Receiving several of these emails within a week, we will likely perceive them to be suspicious. By categorizing them, an organization can more easily decide whether to accept or reject email coming into its network environment. This is especially true when some properties of incoming emails are not deterministically available in the email text; rather, they are probabilistically learned, or hidden, properties. In this case, reasoning about patterns of emails with hidden artifacts has the potential to provide improved (probabilistic) classification. Using runtime monitoring and verification, we can track this activity and help keep our systems safe from malicious emails.
B. DOD-TARGETED MALICIOUS EMAILS
Malicious emails target not only Internet service provider users and banks but also governmental organizations like the Department of Defense. These more sophisticated attacks deploy emails that look identical to official mail and therefore threaten the security of government networks and DoD members [4].
Spear phishing, in particular, is a significant and widespread threat that the DoD is battling. In 2006, the JTF-GNO released an article saying that its members had "observed tens of thousands of malicious emails targeting soldiers, sailors, airmen and Marines U.S. government civilian workers and DoD contractors, with the potential compromise of a significant number of computers across the DoD" [5]. Fraudulent actors are therefore targeting government employees to gain more than just account or personal information; they are focused on collecting intelligence. From the accounts that have been compromised, further exploitation of the DoD network may be possible. However, the exact scope is unknown, leading the government to believe that some actors already have extensive knowledge of their targets and know precisely what further information they want. DoD users are required to sign their emails digitally, but the DoD has not been able to protect personal emails. This thesis seeks to define a means of identifying email threats in a Naval and DoD environment.
III. OVERVIEW OF RUNTIME MONITORING TECHNIQUES
Using the information gathered in Chapter II, formal specifications can be generated. Each specification is defined first as a natural language requirement and then converted into a UML-statechart assertion using the generic assertions provided by [10]. The specifications are also used to determine which of the three threat categories an email falls into: suspicious, unknown, or benign, since they cover the key aspects of deciding whether an email is malicious. To categorize the threats, we must address several questions about monitoring the behavioral patterns of malicious emails. The first is, "How can we observe the behavior of malicious emails?" The second is, "What software do we have to observe this behavior?" The third is, "How can we validate and verify the behavioral patterns being monitored?" The answer to these questions is that we need validation and verification techniques to ensure the behavior of the models is correct.
A. FORMAL SPECIFICATION TRADEOFF CUBOID
In this section, we introduce validation and verification techniques. Verification ensures that a product is built correctly. Validation aims to establish that a system meets the user's actual requirements, often called "building the right system" [11]. To select the appropriate validation and verification technique for detecting malicious emails, we identified the strategy most suitable for the task. We used the visual tradeoff space from Drusinsky, Michael, and Shing's paper [6], which compares three predominant formal validation and verification techniques: theorem proving, model checking, and runtime monitoring. The formal validation and verification tradeoff cube is illustrated in Figures 2 and 3 [6]. The tradeoff cube depicts the relative expense and scope of each formal validation and verification technique. Elements contributing to the cost and extent of each technique include the capacity to specify complex properties, the effort required to specify complex properties, and the effort needed for software implementation. The tradeoff also reflects the budgetary cost necessary to create and validate the specifications [6].
In Drusinsky, Michael, and Shing's paper [6], theorem proving is described as a verification technique that makes a convincing argument, using mathematical proofs, that a program meets its requirements. According to Jhala and Majumdar [17], model checking algorithmically analyzes a program to prove whether particular properties hold. Both theorem proving and model checking suffer from language limitations; they are text based and make system visualization difficult for designers. Ultimately, we chose runtime monitoring as the best method for monitoring malicious emails because it can observe the behavior of the emails, including information that we are not otherwise able to access and see, such as the hidden states. The other two techniques are mainly used for verification of the underlying systems.
The sections that follow introduce the Rules4Business website, the StateRover toolset, and deterministic runtime monitoring and verification using statechart assertions. Rules4Business is a site that we use to create rules that identify temporal patterns and sequencing. StateRover is a toolset that we use to verify and validate the rules created in Rules4Business and then build our statechart assertions. Runtime monitoring can then determine whether the assertions we made hold.
Figure 2. Cost Space (from [6]).
Figure 3. Coverage Space (from [6]).
1. Rules4Business
Rules4Business is a website that allows users to create rules based on events and timing patterns. The rules provide a way of analyzing, validating, and verifying the behavior of the models in the data provided. Patterns define the behavior; for example, the website can flag a suspicious email that has been received from a particular sending host. The following is a sample natural language requirement about our hidden state suspicious (S); if a pattern within the email stream conforms to the specification, it is flagged: Flag when there is a suspicious email (HiddenState) within one hour of an email from 3ff7b9e2.cst.lightpath.net (Sendinghost). The Rules4Business website was used to convert the NL requirement into a UML statechart assertion. We created an Excel spreadsheet analyzing months' worth of personal emails. Generic Rule 11 on the site matched the NL requirement and was used to generate our assertion.
Rule 11: Flag whenever event P with eventual event Q within time T after P. Figure 4 depicts the generic UML statechart for this rule. The rule was customized to our purposes by making the following assignments:
P=Sendinghost.indexOf("3ff7b9e2.cst.lightpath.net")>=0, Q=HiddenState==="S", T=1 hours
Figure 4. Rule 11 UML-statechart (from [10]).
Once the rules of our system were identified, we could conduct RV against the specification.
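To make this customization concrete, the following plain-Java sketch expresses the intent of the Rule 11 instantiation over a simplified email event that carries only a timestamp, sending host, and hidden-state label. The class, record, and field names are illustrative assumptions; this is not the code generated by the toolset.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.List;
    import java.util.function.Predicate;

    final class Rule11Sketch {

        // Simplified email event; field names are assumptions for this sketch.
        record EmailEvent(Instant time, String sendingHost, String hiddenState) {}

        // P: the email comes from the watched sending host.
        static final Predicate<EmailEvent> P =
                e -> e.sendingHost().indexOf("3ff7b9e2.cst.lightpath.net") >= 0;

        // Q: the email's hidden state is suspicious ("S").
        static final Predicate<EmailEvent> Q = e -> "S".equals(e.hiddenState());

        // T: one hour.
        static final Duration T = Duration.ofHours(1);

        /** Rule 11: flag whenever event P is eventually followed by event Q within T.
         *  Events are assumed to be in chronological order. */
        static boolean flags(List<EmailEvent> events) {
            for (int i = 0; i < events.size(); i++) {
                if (!P.test(events.get(i))) continue;
                for (int j = i + 1; j < events.size(); j++) {
                    Duration gap = Duration.between(events.get(i).time(), events.get(j).time());
                    if (gap.compareTo(T) > 0) break;        // outside the time window
                    if (Q.test(events.get(j))) return true; // P followed by Q within T
                }
            }
            return false;
        }
    }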
2. StateRover Toolset
The StateRover is used in this research as part of the code generation process described in Chapter V, Section B. That process is implemented by the dtracg tool (see Chapter V, Section B), which relies on code generated from the StateRover. There is no other reason for using the StateRover in this research beyond this purely technical one; uninterested readers can therefore skip ahead to Section 3.
The StateRover extends the statechart diagrammatic notation with Java as an action language, resulting in a Turing-equivalent notation [13]. Before the StateRover toolset can be used, two steps must be performed: validation testing and verification. With validation testing, the formal-specification assertion is trusted to represent the requirements of the rules in the subsequent automated verification phase. Verification is done by comparing a trace of the system to the behavior of the assertion set [14]. The StateRover V&V tool uses a two-step process. In the first step, a transaction log is converted into an equivalent JUnit test, and the pattern is code-generated into an equivalent Java class. Second, in an RV step, the JUnit test is executed, thereby checking that the transaction log conforms to the pattern [14]. We therefore generate code for the RM algorithm in dtrarm and, to save development time, use the output of the StateRover toolset.
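The short JUnit sketch below illustrates this second, runtime-verification step in miniature: a transaction log is replayed event by event against an assertion object, and the resulting flag is checked. Rule11Assertion here is a small hand-written stand-in with assumed method names; it is not the Java class that the StateRover code generator actually emits.

    import static org.junit.Assert.assertFalse;
    import static org.junit.Assert.assertTrue;
    import org.junit.Test;

    public class Rule11ReplayTest {

        /** Hand-written stand-in for a generated assertion: flag a suspicious
         *  email within 60 minutes of an email from the watched host. */
        static final class Rule11Assertion {
            private Long watchedHostMinute = null; // time of the most recent P event
            private boolean flagged = false;

            void event(long minute, String sendingHost, String hiddenState) {
                if (sendingHost.indexOf("3ff7b9e2.cst.lightpath.net") >= 0) {
                    watchedHostMinute = minute;
                }
                if ("S".equals(hiddenState)
                        && watchedHostMinute != null
                        && minute - watchedHostMinute <= 60) {
                    flagged = true;
                }
            }

            boolean isFlagged() { return flagged; }
        }

        @Test
        public void replayedLogFlagsAsExpected() {
            Rule11Assertion assertion = new Rule11Assertion();
            assertion.event(0, "3ff7b9e2.cst.lightpath.net", "U");
            assertion.event(30, "mail.example.org", "B");
            assertFalse(assertion.isFlagged());
            assertion.event(45, "mail.example.org", "S"); // suspicious within one hour of P
            assertTrue(assertion.isFlagged());
        }
    }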
3. Deterministic Runtime Monitoring and Verification Using Statechart Assertions
Runtime monitoring is a lightweight formal verification technique in which a runtime execution of the system is checked and compared against an executable version of the system's formal specification. As such, RV acts as an automated observer of the program's behavior and compares that behavior with the expected behavior per the formal specification [12].
Consider the following natural language (NL) pattern, written in two ways for malicious emails as described below:
NL1. Flag whenever some pair of consecutive unknown (HiddenState) SendingIP are less than 30 minutes apart.
NL1 merely represents the visible data in an email; we can specify exactly which behavior patterns we want to monitor. Generic Rule 9 from the Rules4Business website, on the other hand, identifies behavior patterns in the data that the user may not be able to see and flags when these events occur.
Rule 9: Flag whenever some pair of consecutive E events is less than time T apart.
Figure 5 depicts a statechart-assertion for Rule 9 as designed using the StateRover tool. A statechart-assertion is a state machine augmented with flowcharting capabilities, hierarchy, a Java action language, and a built-in Boolean flag named bFlag, whose default value is false and whose true value indicates that the pattern has been flagged [12]. The statechart-assertion of Figure 5 combines flowchart and state-machine elements. The compound states and diagram action boxes are flowchart elements; the statechart flows through the boxes while executing their actions and conditions.
Figure 5. A statechart-assertion for requirement Rule 9 (adapted from [10]).
In the statechart assertion of Figure 5, the statechart flows through the Initial flowchart box, executes its actions, and then checks whether the SendingIP transaction is unknown. If Rule 9 has been violated, the statechart assertion transitions to the Error state, where it sets the bSuccess flag to false; the flag indicates that the assertion has failed [13].
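The following sketch renders the Rule 9 behavior just described in plain Java, with the statechart's timer represented by the timestamp of the previous unknown email and the bSuccess flag mirroring the success flag discussed above. It is an illustration of the assertion's logic under those assumptions, not the generated statechart code.

    import java.time.Duration;
    import java.time.Instant;

    /** Rule 9 sketch: flag whenever two consecutive unknown ("U") emails
     *  arrive less than 30 minutes apart. */
    final class Rule9Sketch {
        private Instant previousUnknown = null; // time of the last unknown email
        private boolean bSuccess = true;        // false once the pattern is flagged

        void email(Instant time, String hiddenState) {
            if (!"U".equals(hiddenState)) {
                return; // only unknown emails participate in Rule 9
            }
            if (previousUnknown != null
                    && Duration.between(previousUnknown, time).toMinutes() < 30) {
                bSuccess = false; // consecutive unknown emails less than 30 minutes apart
            }
            previousUnknown = time; // restart the interval from this unknown email
        }

        boolean isSuccess() { return bSuccess; }
    }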
B. FORMAL SPECIFICATIONS PATTERNS
In software engineering, formal specifications are mathematically based techniques that help with the implementation of systems and software. They are used to describe a system, to examine its behavior, and to aid its design by verifying key properties of interest through rigorous and effective tools. The specifications are formal in that they improve the clarity and precision of requirements and state exactly what the system is supposed to do.
Runtime monitoring refers to methods used to monitor a system or application and to compare its current behavior to formal specifications representing correct system behavior [6]. A high volume of collected data used in conjunction with runtime verification of formal specifications can be used to categorize the malicious emails as suspicious, unknown, or benign.
This thesis focuses on targeted malicious emails that have been received by a large organization during a given period. These malicious emails are categorized with hidden data states to help with detection, as opposed to conventional email filtering techniques.
Both detecting malicious emails and converting natural language into a formal specification are difficult. Natural language is inherently ambiguous, rendering accurate specification problematic [7]. Formal specifications, however, allow us to convey the exact intent of the natural language requirement. Drusinsky's paper presents baseline patterns to be considered when testing formal specifications [8]. The patterns serve to validate a formal specification, ensuring it does precisely what it is intended to do. A second advantage of these test scenarios is that they can ensure the formal specification captures the intent of the natural language requirement [9].
The purpose of a formal specification is to further clarify a natural language requirement and to pinpoint the particular information that the user seeks to extract from it. As examples, we convert two of our natural language requirements into two different formal specifications.
NL1. Flag whenever some pair of consecutive unknown SendingIP are less than 30 minutes apart.
NL2. Flag when there is a suspicious email within one hour of an email from 3ff7b9e2.cst.lightpath.net (Sendinghost)
NL1 and NL2 are requirements, and Rule 9 and Rule 11 are rules or formal specifications used to match NL1 and NL2 in rules4business.com.
Rule 9: Flag whenever some pair of consecutive E events is less than time T apart.
NL1 is an instance of Rule 9, obtained by using E=HiddenState==="U", T=30 minutes.
Rule 11: Flag whenever event P with eventual event Q within time T after P.
NL2 is an instance of Rule 11, obtained by using P=Sendinghost.indexOf("3ff7b9e2.cst.lightpath.net")>=0, Q=HiddenState==="S", T=1 hours.
Figures 6 and 7 provide example event timelines for evaluating Rule 9 and Rule 11. Observing Rule 9's sequence, it is clear that from the event E at ten minutes through the event E at seventy-nine minutes there is at least one instance of an event E within an unknown email. While Rule 9 and Rule 11 are not direct interpretations of NL1 and NL2, they can flag any suspicious or unknown email specified in Rules4Business. The counter in Rule 9 resets after each thirty-minute interval, preventing it from identifying every instance of some pair of consecutive E's, as NL1 otherwise would. Based on Figures 6 and 7, Rule 9 is the preferred rule because its counter resets any time an event is flagged as unknown. The intent of NL1 and NL2, however, is to flag unknown and suspicious emails from the given information in the validation spreadsheet, as demonstrated through an analysis of Figures 6 and 7.
Figure 6. Event timeline for evaluating Rule 9 (from [10]).
Figure 7. Event timeline for evaluating Rule 11 (from [10]).
IV. HIDDEN MARKOV MODEL
Runtime monitoring, as discussed so far, is about monitoring and verifying the pattern behavior of the emails and comparing it to the correct behavior specified by the formal specification pattern. In this chapter, a Hidden Markov Model (HMM) is used to ascertain hidden events and structure in the malicious emails analyzed.
The Hidden Markov Model is well suited to detecting a behavioral pattern, learning, and identifying hidden artifacts. The model can be viewed as a state machine in which the state transitions and the observations, or outputs, are probabilistic [14]. The technique is used in combination with probabilistic runtime monitoring of formal specification assertions. An analyzer is required to execute a learning phase based on the system's deterministic models; the resulting model is then used to recognize concealed artifacts in the patterns analyzed. In the runtime monitoring phase, data is monitored by using the model to identify hidden data, which is then used for probabilistic pattern detection and run against the existing formal specifications.
An HMM is a Markov model in which the system being modeled is assumed to be a Markov process with hidden, or unobserved, states. In a regular Markov model the state is directly visible to the observer; in an HMM the state is not directly visible, but the output, which depends on the state, is [14]. By learning the system states and their transitions, we generated an HMM [12]. Given the known data, the HMM uses this information to identify the hidden states and their sequences, yielding the probability of a state sequence given an observable sequence.
Runtime monitoring refers to techniques employed to monitor a system or application by comparing its current behavior to the correct behavior determined by formal specifications [12]. The use of formal specifications guarantees that, during runtime, a system is operating within its intended limits, while providing a flag to identify deviation from normal behavior and an indication of where the disturbance occurred, enabling corrective action.
We use runtime monitoring with hidden states [14]. The HMM was used to decode the probability of the occurrence of a sequence of hidden states in the runtime monitoring spreadsheet. Chapter V gives the rationale behind the learning phase executed to help generate the HMM, the resulting HMM matrices and their implementation, and a description of the runtime analysis of the system state as produced by the HMM with respect to the presented specification.
A. PROBABILISTIC RUNTIME MONITORING USING STATECHART ASSERTIONS COMBINED WITH HIDDEN MARKOV MODELS
In Chapter III, we discussed deterministic runtime monitoring, which is performed by generating a deterministic pattern implementation with a code generator. By combining runtime monitoring with an HMM, we were able to perform probabilistic pattern detection using a special pattern code generator that produces a probabilistic, weighted implementation [12]. Figure 8 shows the steps from stating the natural language requirement to performing probabilistic pattern matching.
Figure 8. Workflow for developing pattern matching with hidden information (from [12]).
The model parameters consist of a set of states, an observable tuple that describes potential data combinations, Matrix A, Matrix B, and the initial state distribution [12]. The set of states is the three previously mentioned: Suspicious, Unknown, and Benign. An observable tuple, O, is the conjunction of Sendinghost and SendingIP, both represented as integers; Sendinghost is the first value in the tuple and SendingIP the second. Using the data from our learning-phase table, we generated Matrix A and Matrix B. Matrix A, represented in Table 1, gives the state-transition probabilities for the HMM, and Matrix B gives the probability of a given observable tuple O being observed in each of the three states: Suspicious, Unknown, and Benign.
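As a concrete illustration of these parameters, the sketch below lays out an HMM for the three hidden states with small made-up numbers; the real matrices come from the learning phase, so every value shown here is a placeholder used only to show the shapes of A, B, and the initial distribution.

    /** Illustrative HMM parameters for the three hidden states; all numbers are
     *  placeholders, not values learned from the thesis data. */
    final class EmailHmm {
        // Hidden states, indexed 0..2.
        static final String[] STATES = { "Suspicious", "Unknown", "Benign" };

        // Initial state distribution (pi).
        static final double[] PI = { 0.2, 0.3, 0.5 };

        // Matrix A: a[i][j] = P(next state j | current state i); each row sums to 1.
        static final double[][] A = {
                { 0.6, 0.3, 0.1 },
                { 0.2, 0.5, 0.3 },
                { 0.1, 0.2, 0.7 }
        };

        // Matrix B: b[i][o] = P(observable tuple o | state i), where each observable
        // tuple O = (Sendinghost, SendingIP) has been quantized to an integer 0..2.
        static final double[][] B = {
                { 0.7, 0.2, 0.1 },
                { 0.3, 0.4, 0.3 },
                { 0.1, 0.2, 0.7 }
        };
    }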
B. DTRA HMM-RV TOOLSET
The DTRA HMM-RV toolset is applied to the statechart assertions customized through the rules4business.com website, with the deterministic implementation code generated by the StateRover's code generator [15].
The DTRA HMM-RV toolset is used to verify the statechart assertions, created as Java projects, for Rule 9 (flag whenever a pair of consecutive events is less than time T apart) and Rule 11 (flag whenever event P with eventual event Q within time T after P). With the help of the statechart_diagram file and a properties file, and using TimeRover, diagrams were created for Rule 9 and Rule 11 to satisfy the rules4business.com website, and the code was generated by the StateRover's code generator.
The StateRover code generator produced Java code from the Statechart_Diagram file implementing the Boolean functions along with the timer, and the result was bundled as a jar file. JUnit test cases were executed to test the Boolean functions and the timer for Rule 9 (an event occurring less than 30 minutes after the previous one triggers a flag) and Rule 11 (flag whenever event one is followed by the eventual event two within 30 minutes after event one).
The alpha method is applied, and for each CSV row the RV toolset reports the probability of being in the suspicious, unknown, and benign hidden states, in descending order. It processes one row at a time from the CSV file.
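A per-row report of this kind can be produced with the standard forward (alpha) recursion. The sketch below illustrates the idea using the placeholder parameters from the previous sketch, normalizing the state probabilities after each observation and printing them in descending order; it is offered as an illustration of the alpha method, not as the toolset's actual implementation.

    import java.util.Arrays;
    import java.util.Comparator;

    final class AlphaSketch {
        static final String[] STATES = { "Suspicious", "Unknown", "Benign" };
        static final double[] PI = { 0.2, 0.3, 0.5 };
        static final double[][] A = { { 0.6, 0.3, 0.1 }, { 0.2, 0.5, 0.3 }, { 0.1, 0.2, 0.7 } };
        static final double[][] B = { { 0.7, 0.2, 0.1 }, { 0.3, 0.4, 0.3 }, { 0.1, 0.2, 0.7 } };

        public static void main(String[] args) {
            int[] observations = { 0, 0, 2, 1 }; // quantized (Sendinghost, SendingIP) tuples
            int n = STATES.length;
            double[] alpha = new double[n];

            for (int t = 0; t < observations.length; t++) {
                int o = observations[t];
                double[] next = new double[n];
                for (int j = 0; j < n; j++) {
                    double sum = 0.0;
                    if (t == 0) {
                        sum = PI[j];                   // initialization step
                    } else {
                        for (int i = 0; i < n; i++) {
                            sum += alpha[i] * A[i][j]; // induction step
                        }
                    }
                    next[j] = sum * B[j][o];
                }
                alpha = normalize(next);

                // Report the hidden-state probabilities for this row, descending.
                Integer[] order = { 0, 1, 2 };
                final double[] row = alpha;
                Arrays.sort(order, Comparator.comparingDouble((Integer s) -> row[s]).reversed());
                System.out.printf("row %d:", t);
                for (int s : order) {
                    System.out.printf(" %s=%.3f", STATES[s], row[s]);
                }
                System.out.println();
            }
        }

        private static double[] normalize(double[] v) {
            double total = Arrays.stream(v).sum();
            double[] out = new double[v.length];
            for (int i = 0; i < v.length; i++) {
                out[i] = total > 0 ? v[i] / total : 0.0;
            }
            return out;
        }
    }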
V. RESULTS: PROOF OF CONCEPT
The next step is validation of the established rules through testing. The HMM-based Runtime Monitoring Tool documentation is used for guidance on the test scenarios needed to accomplish this goal. Rules 9 and 11 are validated using the rules4business website, and the HMM is then generated.
A. LEARNING PHASE CSVS
In the learning phase, we created a spreadsheet of data collected from bulk or phishing emails. We organized the data from the emails by date, time, sending host, sending IP, subject, and attachments. From the information in the emails, we determined whether each email was suspicious, unknown, or benign in order to create the hidden state column. To validate our assertions, we uploaded a validation CSV file that we expected our rules to flag. Once we confirmed that the rules flagged as expected, we used the validation CSV file to create our learning-phase CSV and generate our HMM. Figure 9 depicts a sample of the validation CSV.
Figure 9. Sample of Validation CSV.
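Each row of the validation and learning CSVs described above can be read into a simple record. The sketch below shows one way to do so, assuming the column order given above (date, time, sending host, sending IP, subject, attachments, hidden state) and a plain comma-separated layout with no quoted commas; the actual CSV files used in this work may differ.

    import java.time.LocalDate;
    import java.time.LocalTime;

    final class ValidationCsvSketch {

        /** One email row from the validation/learning CSV; column order is assumed. */
        record EmailRow(LocalDate date, LocalTime time, String sendingHost,
                        String sendingIp, String subject, String attachments,
                        String hiddenState) {}

        /** Parse a single CSV line with the assumed layout, for example:
         *  2015-09-01,09:15,3ff7b9e2.cst.lightpath.net,10.1.2.3,Account notice,none,S */
        static EmailRow parse(String line) {
            String[] f = line.split(",", -1); // keep empty trailing fields
            return new EmailRow(
                    LocalDate.parse(f[0]),
                    LocalTime.parse(f[1]),
                    f[2], f[3], f[4], f[5], f[6]);
        }
    }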
1. Rules4Business Rule 9
Rule 9 evaluates whether emails whose SendingIP is an unknown threat arrive less than 30 minutes apart. To validate Rule 9, we uploaded the validation CSV into rules4business. The validation of Rule 9 tells us whether each case is bSuccess or bFailure: bSuccess marks the cases where we expect Rule 9 to flag, and bFailure the cases where we expect Rule 9 not to flag. To validate this rule, four tests were conducted on the prescribed specification: bSuccess, bFailure, and two instances of multiple time intervals, the first expected to flag once and the second to flag four times.
Figure 10 displays the runtime execution of the validation CSV; at the positions where an unknown SendingIP was expected to flag, it flagged once. The bFailure case did not flag, and the remaining instances less than 30 minutes apart flagged as expected. The assertion was therefore validated.
Figure 10. Multiple time intervals flag for Rule 9.
2. Rules4Business Rule 11
Recall that Rule 11 identifies a suspicious email within one hour of an email from a particular Sendinghost address in the validation CSV. To validate the assertion for Rule 11, five separate tests were conducted: bSuccess, bFailure, two sets of multiple time intervals, and event repetitions. One multiple-time-interval case flags after the first time interval, while the other flags after the second time interval. A flag from the success test and a single flag from each of the multiple-time-interval tests constitute validation of Rule 11 and remove the vagueness of its natural language requirement.
Figure 11 shows that the multiple-time-interval cases flagged as expected, as did the bSuccess and event repetition tests.
Figure 11. Multiple time interval flag for Rule 11.
3. Generating the HMM from the Learning Phase CSV File
After successfully validating Rules 9 and 11, we moved on to generating our HMM parameters, producing the hmm.json file. Generating this file required three inputs: the learning-phase CSV file, the quantization properties file, and the command java -jar dtrahmm.jar. The learning-phase CSV file is a single file synthesized from information found in the validation table. The first column in the learning-phase file is identified as the Initial State; the letter Y in that column marks the indicated row as the start of a distinct part of the file. As mentioned earlier, each email is classified as suspicious, unknown, or benign; this information is collectively called the hidden state and is used to generate the HMM parameters. For the HMM generator to analyze the data from the learning phase, the CSV file must contain a Sendinghost column whose entries consist of the possible values that are quantized. For example, Rule 9 includes a transition labeled as event E; we mapped E==Sendinghost==Type1, so the Rule9events.properties file contains E==Sendinghost==Type1. Figure 12 depicts the learning-phase CSV.
Figure 12. Learning-Phase CSV.
Once the learning phase was complete, the command java -jar dtrahmm.jar was executed. The execution of the learning-phase CSV prepared us to conduct runtime monitoring of our data.
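Because every row of the learning-phase CSV carries its hidden-state label, the HMM parameters can in principle be estimated by simple counting: tally how often each state follows each state (Matrix A) and how often each quantized observable appears in each state (Matrix B), then normalize each row. The sketch below illustrates that idea with made-up data; it is an assumption about how such a learning step could work, not a description of the dtrahmm.jar internals.

    final class HmmLearningSketch {

        /** Row-normalize a count matrix into probabilities. */
        private static double[][] normalizeRows(double[][] counts) {
            double[][] p = new double[counts.length][];
            for (int i = 0; i < counts.length; i++) {
                double total = 0.0;
                for (double c : counts[i]) total += c;
                p[i] = new double[counts[i].length];
                for (int j = 0; j < counts[i].length; j++) {
                    p[i][j] = total > 0 ? counts[i][j] / total : 0.0;
                }
            }
            return p;
        }

        public static void main(String[] args) {
            // Labeled learning sequence: hidden state and quantized observable per row.
            // States: 0 = Suspicious, 1 = Unknown, 2 = Benign; observables quantized to 0..2.
            int[] states       = { 2, 2, 1, 1, 0, 2, 1, 0 };
            int[] observations = { 2, 1, 1, 0, 0, 2, 1, 0 };
            int numStates = 3, numObservables = 3;

            double[][] aCounts = new double[numStates][numStates];
            double[][] bCounts = new double[numStates][numObservables];

            for (int t = 0; t < states.length; t++) {
                bCounts[states[t]][observations[t]] += 1.0;   // emission count
                if (t > 0) {
                    aCounts[states[t - 1]][states[t]] += 1.0; // transition count
                }
            }

            double[][] A = normalizeRows(aCounts); // state-transition probabilities
            double[][] B = normalizeRows(bCounts); // observation probabilities per state

            System.out.println("A = " + java.util.Arrays.deepToString(A));
            System.out.println("B = " + java.util.Arrays.deepToString(B));
        }
    }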
B. RUNTIME CSVS
Special Java code was created for the probabilistic runtime monitor. To create the Java code, the command java -jar dtracg.jar was executed with an input file and a properties file.
The output was a new sibling file, inFile_DTRA.java, on which a sanity test was conducted to test Rule 9 (flag whenever some pair of consecutive E events is less than time T apart) and Rule 11 (flag whenever event P with eventual event Q within time T after P).
The Eclipse export command was executed to export the final jar files containing the Java packages for Rule 9 and Rule 11 separately. The JUnit test files for Rule 9 and Rule 11 are kept separate from the source code of the package com.timerover.stateover.ifacesrc.
The alpha method is used mainly because it is fast, owing to its logarithmic response. To run the alpha method, the command java -jar dtraalpha.jar was executed with the CSV, JSON, and quantization files, along with a properties file specifying the column headers in descending order. The output was stored in the alpha.json file.
Runtime monitoring provides assurance that the current software computation conforms to the specified properties, whereas testing verifies only the set of execution paths explored by the test set. The command java -jar dtrarm.jar was executed to check system functionality. The command takes the runtime CSV file, the alpha.json file, the rule jar files for Rule 9 and Rule 11, and the properties file with descending order, along with the number of cycles to monitor.
VI. CONCLUSION AND FUTURE RESEARCH
Rules to detect threats have been developed and validated. We performed deterministic runtime monitoring for rule validation, built a Hidden Markov Model (HMM) in the learning phase, and conducted runtime monitoring with hidden data.
In the future, this project might be extended to take control of spam and phishing emails and to eradicate them even when they escape through firewalls.
LIST OF REFERENCES
[1] A. M. Fiskiran and R. B. Lee, "Runtime Execution Monitoring (REM) to Detect and Prevent Malicious Code Execution," ICCD 2004, IEEE International Conference on, pp. 452–457, October 2004. URL: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1347961
[2] A. A. Slack, "Digital Authentication for Official Bulk Email," master's thesis, Naval Postgraduate School, Monterey, CA, March 2009, pp. 5–10. URL: http://cisr.nps.edu/downloads/theses/09thesis_slack.pdf
[3] E. Sharf, "Fake malware notifications from 'Websense Labs'," Websense Security Labs Blog, 2011. URL: http://community.websense.com/blogs/securitylabs/archive/2011/09/22/fake-malware-notifications-from-websense-labs.aspx, last accessed September 2015.
[4] J. W. Ragucci and S. A. Robila, "Societal Aspects of Phishing," ISTAS 2006, IEEE International Symposium on Technology and Society, pp. 1–5, 8–10 June 2006. URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4375893&isnumber=4375875, last accessed September 2015.
[5] B. Brewin, "DOD battles spear phishing," FCW: The Business of Federal Technology, December 2006. URL: https://fcw.com/articles/2006/12/26/dod-battles-spear-phishing.aspx, last accessed September 2015.
[6] D. Drusinsky, J. B. Michael, and M.-T. Shing, "A visual tradeoff space for formal verification and validation techniques," IEEE Systems Journal, vol. 2, no. 4, pp. 513–519, December 2008.
[7] K. Shimizu, D. L. Dill, and A. J. Hu, "Monitor-based formal specification of PCI," Formal Methods in Computer-Aided Design, vol. 1954, pp. 372–390, June 2000.
[8] D. Drusinsky, J. B. Michael, T. W. Otani, and M.-T. Shing, "Validating UML statechart-based assertions libraries for improved reliability and assurance," in SSIRI 2008, Second International Conference on, Yokohama, Japan, 2008, pp. 47–51.
[9] J. J. Galinski, "Formal Specifications for an Electrical Power Grid System Stability and Reliability," master's thesis, Naval Postgraduate School, Monterey, CA, September 2015, pp. 1–11. URL: http://cisr.nps.edu/downloads/theses/15thesis_galinski.pdf
[10] D. Drusinsky, "Rules for Business," Rules4Business. URL: http://www.rules4business.com/acmeBank/index.html
[11] A. D. Preece, "Verification and Validation of Knowledge-Based Systems with Formal Specifications," University of Aberdeen, 1990, pp. 1–4. URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.7692&rep=rep1&type=pdf
[12] D. Drusinsky, "Behavioral and Temporal Pattern Detection within Financial Data with Hidden Information," Journal of Universal Computer Science, vol. 18, no. 14, pp. 1950–1966, July 2012.
[13] D. Drusinsky, "UML-based Specification, Validation, and Log-file based Verification of the Orion Pad Abort Software."
[14] D. Drusinsky, "Runtime Monitoring and Verification of Systems with Hidden Information," Innovations in Systems and Software Engineering, vol. 10, no. 2, pp. 123–136, 2014. URL: http://www.time-rover.com/articles.html
[15] D. Drusinsky, "A Hidden Markov Model based Runtime Monitoring Tool."
[16] R. Sedgewick, Just the Facts101 Textbook Key Facts: Algorithms, 4th ed.
[17] R. Jhala and R. Majumdar, "Software model checking," ACM Computing Surveys, vol. 41, no. 4, pp. 21–22, October 2009.
INITIAL DISTRIBUTION LIST
1. Defense Technical Information Center
Ft. Belvoir, Virginia
2. Dudley Knox Library
Naval Postgraduate School
Monterey, California