DAPT2021
Table of Contents
_____________________________________________________________________________________
1. Overview of DAPT2021
2. Dataset Organization
3. Prerequisites
4. Downloading raw data
5. Associated Links
6. Roadmap
7. Contact
8. Acknowledgements
_____________________________________________________________________________________
1. Overview of DAPT2021
_____________________________________________________________________________________
DAPT2021 is a semi-synthetic dataset capturing an Advanced Persistent Threat (APT) alongside 2 less skilled attacks. The dataset was generated by emulating the normal behavior of the employees in a set target organization, to fill the widening gap between real-world datasets and synthetically generated datasets. DAPT2021 is the first of its kind to provide an APT dataset with 4 different types of labels:
- Activity: indicates which activity the flow or event maps to.
- Stage: indicates the attacker's progress towards the target.
- DefenderResponse: indicates whether the defender detected or mitigated the malicious activity, leading to a change in the attacker's tactic, technique and/or procedure (MITRE's framework: https://attack.mitre.org/).
- Signature: indicates which attacker group the event belongs to.
_____________________________________________________________________________________
2. Dataset Organization
_____________________________________________________________________________________
Because the dataset is very large, and to let people download only the parts they want to work on rather than cloning the entire dataset, we have uploaded it to this S3 bucket. Scripts for downloading the dataset can be found in our git repo at https://gitlab.thothlab.org/dapt2021/apt-dataset/-/tree/master. The dataset in this bucket is organized into 'raw' and 'processed' folders (prefixes).
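As a rough illustration of how these prefixes translate into S3 paths, the sketch below composes the URI for one part of the dataset. The helper name `dapt_uri` is hypothetical and is not part of the DAPT2021 scripts. The exact path layout (top-level prefix followed by a sub-prefix) is an assumption inferred from the prefix names given in this section.

```sh
#!/bin/sh
# Sketch only: dapt_uri is a hypothetical helper, not part of the DAPT2021 scripts.
# It prints the S3 URI for one dataset prefix and rejects combinations that
# this README does not list. The path layout is an assumption.
dapt_uri() {
    top="$1"   # 'raw' or 'processed'
    sub="$2"   # e.g. 'nids'
    case "$top/$sub" in
        raw/host-logs | raw/network-captures | raw/nids) ;;
        processed/host-logs | processed/network-flows | processed/nids) ;;
        *)
            echo "unknown prefix: $top/$sub" >&2
            return 1
            ;;
    esac
    printf 's3://dapt2021/%s/%s/\n' "$top" "$sub"
}

dapt_uri raw nids   # prints s3://dapt2021/raw/nids/
```

The printed URI can then be passed to the AWS CLI, for example `aws --region us-east-1 s3 ls "$(dapt_uri raw nids)"`, so that only the chosen part of the dataset is listed or downloaded.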
The raw prefix contains our dataset in its raw form as collected, placed under the prefixes 'host-logs', 'network-captures', and 'nids'. Similarly, the processed prefix contains the processed data under the prefixes 'host-logs', 'network-flows', and 'nids'.
- Host Logs: The 'host-logs' folder comprises auth logs, audit logs, and sys logs for each host, under the prefix of the host IP.
- Network Traffic: The 'network-captures' folder comprises the network traffic collected in the form of packet captures. As their prefixes indicate, some captures span multiple days.
- NIDS: The 'nids' folder comprises multiple Snort logs collected throughout the dataset collection period.
Depending on your needs, you may download only the raw data, only the processed data, or both. The primary difference between the 2 is that the processed data is labeled while the raw data is not.
_____________________________________________________________________________________
3. Prerequisites
_____________________________________________________________________________________
To download the dataset, you need:
- Software: AWS CLI installed on your system.
- Hardware: 1TB of storage.
_____________________________________________________________________________________
4. Downloading raw data
_____________________________________________________________________________________
There are 2 ways you can access the dataset:
- Using AWS CLI:
To download the contents of the S3 bucket, please make sure the following requirements are met:
- python: You can download the latest version of python from https://www.python.org/downloads/
- pip: You can install pip from https://pip.pypa.io/en/stable/installation/
Once pip is installed, please run the commands below to set up the AWS CLI on your system:
```sh
pip install boto3 --upgrade
pip install awscli --upgrade
```
Once installed, please run the command below to make sure you are able to view the bucket.
```sh
aws --region us-east-1 s3 ls s3://dapt2021
```
Once you can list the contents of the bucket, you may run the command below to download the processed network flows into the current directory on your local system. To download only a specific day's flows, append that day's prefix to the source path. If you have not configured AWS credentials, adding `--no-sign-request` to these commands should allow anonymous access, since the bucket is public.
```sh
aws --region us-east-1 s3 cp s3://dapt2021/processed/network-flows/ . --recursive
```
You may also refer to the AWS documentation at https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html for further instructions on installing AWS CLI.
- Using S3 Viewer:
Alternatively, you can download S3 Viewer from the Git repo [here](https://github.com/SharonBrizinov/s3viewer). This tool lets you view and download the folders and their contents through a GUI.
No specific configuration is required, as this bucket is public.
_____________________________________________________________________________________
5. Associated Links
_____________________________________________________________________________________
The source code and tools used for generating the dataset can be found at https://gitlab.thothlab.org/dapt2021/apt-dataset/-/tree/master, along with the scripts used for running machine learning based evaluation on the dataset.
_____________________________________________________________________________________
6. Roadmap
_____________________________________________________________________________________
The dataset is intended for research purposes, and any updates to this dataset since its official publication will be documented as a change log in this bucket.
_____________________________________________________________________________________
7. Contact
_____________________________________________________________________________________
Sowmya Myneni (smyneni2@asu.edu)
Kritshekhar Jha (kjha9@asu.edu)
_____________________________________________________________________________________
8. Acknowledgements
_____________________________________________________________________________________
We would like to thank the DevilSec team (https://asu.campuslabs.com/engage/organization/devilsec), specifically Nathan Smith and Austin Ballard, for carrying out the planned APT attack and sharing their professional expertise to make DAPT2021 more realistic.