The proposal must clearly explain what you are doing, why you are doing it, what is new about your project, and what is the significance of your project. The proposal should include a preliminary critical review of previous related work, specific aims, how you plan to accomplish these, and a bibliography. It should also include a realistic schedule, a budget, a list of deliverables, and a discussion of any foreseeable difficulties and how you plan to overcome them. The proposal should follow the generally-accepted guidelines for computer science research proposals—for example, as described by the National Science Foundation on their web pages. We recommend that you evolve your proposal into your final report, reusing as much text as possible.
Organize the proposal as outlined below.
redacted A, Purdue University, firstname.lastname@example.org;
redacted B, Purdue University, email@example.com;
redacted C, Purdue University, firstname.lastname@example.org
Security in Hadoop clusters
Keywords: Data Leakage, Data Spillage, Hadoop, Hadoop Cluster, Cloud, Cyberforensics
Military and other government entities perceive data spillage from Hadoop clusters as a security threat when sensitive information is introduced onto a platform considered non-sensitive. This project focuses on tracking sensitive information in HDFS after user introduction for the purpose of removing it. Using forensic analysis of the non-sensitive cluster following the introduction of a document designated as sensitive by the project team, the goal of this project is to create a procedure to remove all material designated as sensitive from the cluster after the initial load of the data.
redacted A: Hadoop cluster configuration and system evaluation
redacted C: Cyber forensics analysis and evaluation
ALL: Experimental design and data analysis
redacted B: Data Management and recommendation of policy related controls
Total Budget: $22,892.00. Please see detail in “Budget” section below.
Data Spillage in Hadoop Clouds Project Proposal, October 1, 2014
redacted A, Purdue University, email@example.com;
redacted B, Purdue University, firstname.lastname@example.org;
redacted C, Purdue University, email@example.com
Data spillage is defined in this project as a security threat when sensitive information is introduced onto a non-sensitive platform. This project will focus on user introduction of sensitive information into an non-sensitive Hadoop cluster. Using forensic analysis of the non-sensitive cluster following the introduction of a document designated as sensitive for testing purposes by the project team, the goal of this project is to create a procedure to remove all sensitive material from the cluster after the initial load of data.
Keywords: Data Leakage, Data Spillage, Hadoop, Hadoop Cluster/Cloud, Cyberforensics
Hadoop is a popular tool that supports data storage through its HDFS file system and processing in a parallel computing environments using MapReduce (in versions prior to 2.0) and YARN (in versions at or above 2.0). Hadoop clusters are a relatively inexpensive method of storing and processing large and diverse data sets; however, organizations using Hadoop should take precautions to securely handle data processed or stored by these systems. Data spillage from Hadoop clusters is an information security vulnerability that is defined in this project as the accidental or intentional introduction of tagged information into a non-sensitive Hadoop cluster. In this project we configure a new Hadoop cluster instantiation including several virtual data nodes, load a tagged document into the cluster, then find and document, through forensic analysis, the location of every piece of the tagged document throughout the cluster. Once each instance of the pieces of the sensitive document are located and documented, the pieces of the sensitive document will be erased according to Department of Justice specifications of secure erasure of data. Maximizing the availability of cluster nodes and the cluster itself while ensuring complete erasure of the sensitive data is the overall goal of this project. Nodes subjected to the optimized process will again be analyzed forensically to determine if any sensitive data can be recovered. The creation of a process to mitigate user-initiated data spillage in Hadoop clusters will have wide ranging impact not only within the United States federal government, but for any large data processor needing to quarantine sensitive information.
Big data sets are classified by the following attributes:
Among the numerous threats to information security, data leakage in Hadoop clusters could be overlooked, yet examples of the use of data of mixed sensitivity exist in the paring of credit card information with items purchased at a retail outlet and in the introduction of mental health information on medical charts just to name two. The ability to find and correct the spillage of sensitive data into non-sensitive environments such as Hadoop clusters will only grow with the growth of electronic data. We seek to provide a means to mitigate the spread of sensitive information caused by an inability to properly locate and remove such data from Hadoop clusters in order to add to contribute to privacy protections in “big data” environments.
According to Intel, growth in the Hadoop market is between 50 and 60 percent annually (Mello, 2014). As an increasingly large tool in the storage and distributed processing of large data sets, the security of Hadoop and other tools based on it, is increasingly important. Data leakage, which is defined as defined as the accidental or intentional distribution of classified or private information to an unauthorized entity, is an example of a security vulnerability in Hadoop that can impact the security of information within an organization (Anjali, M. N. B., Geetanjali, M. P. R., Shivlila, M. P., Swati, M. R. S., & Kadu, 2013). The United States National Security Agency (NSA) has defined a specific instance of data leakage, called data spillage, as the transfer of classified or sensitive information into an unauthorized/undesignated compute node or memory media (e.g. disk)(Mitigations NSA Group, 2012). Requirements of federal agencies to control the distribution of sensitive information, if combined with the use Hadoop clusters for data storage and/or processing make data spillage in Hadoop clusters a serious potential information security vulnerability within the United States government.
Xiao and Xiao (2014) discuss a process to prevent malicious actor from adding a node to a Hadoop cluster which could cause the exfiltration of information from the cluster, the failure of the cluster, or other negative impacts on cluster functionality. While the addition of a malicious node to a Hadoop cluster may be an important information security vulnerability to a subset of organizations using Hadoop, the vulnerability created through user introduction of sensitive information into a non-sensitive Hadoop cluster seems to be a more likely scenario for data loss. Assuming that introductions of sensitive data into non-sensitive Hadoop clusters are inevitable due to mis-classification of data or human error, a process for mitigation of such incidents is important. Cho, Chin, and Chung (2012) have established guidelines for locating and analyzing forensically data in an HDFS file system, but their work does not address the deletion of data from the cluster.
Because Cho, Chin, and Chung focus on Hadoop forensics, deletion of data from Hadoop clusters may be considered out of scope since it is an incident management function; however, information that exists regarding incident management in Hadoop clusters is sparse. Current articles in the trade press that discuss incident management and data security within Hadoop clusters focuses on making difficult the process of accessing the cluster and exfiltrating data from it (Garcia, 2014; Hortonworks, 2014). Therefore, our work will use established digital forensics processes to locate data within the HDFS file system and understand how data is processed within the system. Following that process, we will examine ways to completely remove data from Hadoop clusters in a way that maximizes the availability of the cluster.
NSA believes that it must find each occurrence of sensitive data that has been introduced into a non-sensitive cluster, remove from the cluster all nodes on which the sensitive data resides, erase all data from the node, rebuild that node, and re-insert it into the cluster. This project will create a process for finding all occurrences of sensitive data on the cluster through forensic analysis of the nodes, and determine the optimal process for removing the sensitive data and recovering the impacted nodes. These processes will be evaluated for effectiveness and efficiency on the same Hadoop cluster on which they were created. Because the cluster used in this research is small, evaluating the effectiveness and efficiency of these processes on larger clusters is an area of future research.
The goals of the project are to:
Goal: Detection of the introduction of sensitive data into a non-sensitive Hadoop cluster prior to processing by YARN.
Managing data throughout the research lifecycle is essential to this project in order for it to be replicated for further testing, as guidance for various federal agencies, and for re-testing as new versions of Hadoop are released. The ability to publish and re-create this work will involve the thorough documentation of processes and configuration settings. Initial data management will document the initial hardware and software specifications of the hardware and software used, the configuration processes used in configuring both the Hadoop cluster and the physical and virtual nodes, and a detailed description of the tagged data and the process of loading it into the cluster. The setup of the environment will be documented as a set of instructions that will be uploaded to PURR. These setup instructions will be verified by having another team member duplicate the environmental configuration upon completion of the final report. Software or data integral to the reproduction of the experimental aspects of this study will be backed up to PURR when such backups are technically feasible and will not violate applicable laws. Processes and results of forensic analysis will also be thoroughly documented, verified by a second team member on additional nodes, and uploaded to PURR.
PURR will serve the data management requirements for this project well because it provides work flows and tools for uploading of project data, communication of that data, and other services such as: data security, fidelity, backup, and mirroring. Purdue Libraries also offers at no cost to project researchers consulting services in order to facilitate selection and uploading of data, inclusive of the generating application and necessary metadata that will ensure proper long-term data stewardship. Documentation of results and processes will also be uploaded to the team’s shared space on Google Docs as a backup to PURR. The contact information of the Purdue Libraries data manager assigned to this project, will also be uploaded to PURR to facilitate long term access to, and preservation of project data.
Understanding data processing in the Hadoop environment well enough to mitigate the introduction of sensitive information onto a non-sensitive cluster presents the following challenges:
Redacted A is a doctoral research assistant for Dr. Dark. Her research background is in system modeling and analysis that allows her to approach this project from a statistical-based modeling approach. Over the summer, she worked on developing a data motion power consumption model for multicore systems for the Department of Energy (DOE). In addition, she is working towards growing expertise and experience in big data analytics under her adviser Dr. Springer, the head of the Discovery Advancements Through Analytics (D.A.T.A.) Lab.
Redacted B is a Ph.D. student specializing in secure data structures for national health data infrastructure. His most recent experience includes security and privacy analysis of laws, policies, and information technologies for the Office of the National Coordinator for Health Information Technology within the Department of Health and Human Services.
Redacted C is currently pursuing his second masters at Purdue University in Computer Science and Information Technology. Redacted C is a qualified professional with diverse interests and experiences. Prior to his graduate studies at Warwick University, Redacted C spent four years studying computer science engineering at MVGR College of Engineering. Continuing his education, he graduated with a master in Cyber-security and management degree from University of Warwick, UK. While at Warwick, Redacted C spent his days building a prototype tool for HP cloud compliance automation tool semantic analysis phase at Hewlett Packard Cloud and Security labs. Further, Redacted C worked at Purdue University as a Research Scholar researching in the domain of cyber security and digital forensics. He is an active member of IEEE.
|A.1. Team Hours
|3 members, 12 weeks
|C. Fringe Benefits
|E. Conference Travel
|I. Purdue Indirect
|$15,677.86 * .54
A.1. Senior Personnel:
All three senior personnel will work for 3 months (12 weeks) on the project at an hourly rate of $16.534/hr (based on SFS monthly stipend).
3 months * 40hrs/month * $16.534/hr = $1984
C. Fringe Benefits:
A total of $750 is requested to cover fringe benefits, corresponding to 20% of pay rate to cover medical, vision and dental insurance.
All three senior personnel will attend the Storage Network Industry Association Data Storage Innovation Conference. April 7–9, 2015, Santa Clara, CA, USA.
Estimated full conference academic pass = $695/attendee * 3 attendees = $2084.
Estimated travel expenses (hotel, transportation, meals, others) = $1500/attendee * 3 attendees = $4500.
I. Indirect Costs:
The indirect rate for this project is 54% according to requirements of Purdue University
$15,677.86 * 0.54 = $8466
As usage of large data sets continues to grow, opportunities and needs to process and use sensitive and public data together will also grow. Examples of the use of data of mixed sensitivity exist in the paring of credit card information with items purchased at a retail outlet and in the introduction of mental health information on medical charts Ð just to name two. The ability to find and correct the spillage of sensitive data into non-sensitive environments such as Hadoop clusters will only grow with the growth of electronic data. This project will produce a process that can not only be used to mitigate the introduction of sensitive material into non-sensitive Hadoop clusters, but could be used as the basis for such processes in other distributed data storage and processing systems. In either the Hadoop context, or more broadly, the procedure detailed by this project could potentially add to the electronic protections of privacy for billions of people worldwide.
Conference: Third USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 2011)
Paper: Ko, S. Y. and Jeon, K. (2011). The HybrEx Model for Confidentiality and Privacy in Cloud Computing. In Proceedings of the Third USENIX Workshop on Hot Topics in Cloud Computing. URL: https://www.usenix.org/legacy/events/hotcloud11/tech/final_files/Ko.pdf
Last modified: January 10, 2018
Winter Quarter 2018
You can get a PDF version of this