SDS-2.2, Scalable Data Science
Archived YouTube video of this live unedited lab-lecture:
Network anomaly detection
Student Project
by Victor Ingman and Kasper Ramström
This project set out to build an automatic anomaly detection system for networks. Network threats are a major and growing concern for enterprises and private consumers all over the world. On average it takes 191 days for a company to detect a threat and another 66 days to contain it (Enhancing Threat Detection with Big Data and AI). Besides taking a long time to detect and contain, threats also involve a great deal of manual labour that requires security experts. Thus, it should be a high priority for businesses to find solutions that not only prevent malicious intrusions but also detect malicious activity in a fast and automated way, so that it can be dealt with swiftly.
An example of the threats we face today is the WannaCry ransomware, which spread rapidly throughout the world during 2017 and caused major havoc for companies and private consumers, including Akademiska Sjukhuset here in Uppsala.
With better security systems and automated ways of detecting malicious behaviour, many of these attacks could be prevented.
To gain inspiration for our project and find out how others have developed similar systems, we used the book Advanced Analytics with Spark, which uses k-means clustering.
In the book, the authors cluster different kinds of network events in the hope that abnormal behaviour ends up in clusters separate from other events. The data used in the book is the publicly available KDD Cup 1999 Data, which is both quite dated and different from the data we've used, but it works well as a proof of concept for our project. The code accompanying the book can be found at https://github.com/sryza/aas; for our project we've used a similar approach, clustering the data with k-means.
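To make the idea concrete, here is a minimal sketch in plain Python (3.8+). It is not the book's Spark code and does not use the features from this project: it runs k-means on made-up two-dimensional points and, as an assumed scoring rule, treats the distance from each point to its nearest centroid as the anomaly score. The functions `nearest` and `kmeans` and the data are illustrative only.

```python
import math


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm, seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)[0]].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(axis) / len(members) for axis in zip(*members)]
    return centroids


# Made-up 2-D feature vectors: two dense groups and one stray point.
points = [(0, 0), (10, 10), (1, 0), (0, 1), (1, 1),
          (10, 11), (11, 10), (11, 11), (5, 30)]

centroids = kmeans(points, k=2)
# Assumed anomaly score: distance to the nearest centroid.
scores = [nearest(p, centroids)[1] for p in points]
most_anomalous = points[max(range(len(points)), key=scores.__getitem__)]
print(most_anomalous, round(max(scores), 2))
```

Seeding with the first k points keeps the sketch deterministic; a real run would normally use random or k-means++ initialization and scaled features.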
Below, we present the code for our project along with explanations of what we've done and how we've done it. This includes data collection, data visualization, clustering of the data, and possible improvements and future work.
frameIt: (u: String, h: Int)String
displayHTML(frameIt("https://en.wikipedia.org/wiki/Anomaly_detection",500))
Data Collection
To get data for our network security project, we decided to generate it ourselves on our own networks, including performing the malicious activity ourselves.
Our basic idea for the data collection involved one victim device, which would perform normal internet activity, including streaming to different media devices, transferring files and web surfing. Meanwhile, another device (the attacker) would perform malicious activity such as port scans and fingerprinting of the victim. Our hope was that the malicious activity would stand out from the other traffic and be detectable by our anomaly detection models.
From the book Network Security Through Data Analysis we learned about the tools Wireshark and Nmap. For our project, we used Wireshark to collect network data on the victim's computer and Nmap to perform malicious activity.
Data anonymization
Since we collected the data on our own private network and publish this notebook publicly along with the data, we decided to anonymize our network data for privacy reasons. To do this, we followed the Databricks guide: https://databricks.com/blog/2017/02/13/anonymizing-datasets-at-scale-leveraging-databricks-interoperability.html
Using the package Faker, we generated fake source IPs and destination IPs for our network traffic data and used this data for the remainder of the project. Since we didn't parse the packet details of our network traffic, and since they can potentially include sensitive information about our connections, we decided to remove that data from the public dataset.
displayHTML(frameIt("https://en.wikipedia.org/wiki/Data_anonymization",500))
pip install unicodecsv Faker
Collecting unicodecsv
  Downloading unicodecsv-0.14.1.tar.gz
Collecting Faker
  Downloading Faker-0.8.8-py2.py3-none-any.whl (707kB)
Collecting text-unidecode (from Faker)
  Downloading text_unidecode-1.1-py2.py3-none-any.whl (77kB)
Collecting python-dateutil>=2.4 (from Faker)
  Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194kB)
Requirement already satisfied (use --upgrade to upgrade): six in /usr/lib/python2.7/dist-packages (from Faker)
Collecting ipaddress; python_version == "2.7" (from Faker)
  Downloading ipaddress-1.0.19.tar.gz
Building wheels for collected packages: unicodecsv, ipaddress
  Running setup.py bdist_wheel for unicodecsv: started
  Running setup.py bdist_wheel for unicodecsv: finished with status 'done'
  Stored in directory: /root/.cache/pip/wheels/97/e2/16/219fa93b83edaff912b6805cfa19d0597e21f8d353f3e2d22f
  Running setup.py bdist_wheel for ipaddress: started
  Running setup.py bdist_wheel for ipaddress: finished with status 'done'
  Stored in directory: /root/.cache/pip/wheels/d7/6b/69/666188e8101897abb2e115d408d139a372bdf6bfa7abb5aef5
Successfully built unicodecsv ipaddress
Installing collected packages: unicodecsv, text-unidecode, python-dateutil, ipaddress, Faker
Successfully installed Faker-0.8.8 ipaddress-1.0.19 python-dateutil-2.6.1 text-unidecode-1.1 unicodecsv-0.14.1
You are using pip version 8.1.1, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
import unicodecsv as csv
from collections import defaultdict
from faker import Factory

def anonymize_rows(rows):
    """
    Rows is an iterable of dictionaries that contain Source and
    Destination fields (IP addresses) that need to be anonymized.
    """
    # Load faker
    faker = Factory.create()
    # Create mappings from real source and destination IPs to fake IPv4 addresses.
    # defaultdict calls faker.ipv4 only the first time a key is seen,
    # so a given real address always maps to the same fake address.
    sources = defaultdict(faker.ipv4)
    destinations = defaultdict(faker.ipv4)
    # Iterate over the rows from the file and yield anonymized rows.
    for row in rows:
        # Replace the Source and Destination fields with faked addresses.
        row["Source"] = sources[row["Source"]]
        row["Destination"] = destinations[row["Destination"]]
        # Yield the row back to the caller
        yield row

def anonymize(source, target):
    """
    The source argument is a path to a CSV file containing data to anonymize,
    while target is a path to write the anonymized CSV data to.
    """
    with open(source, 'rU') as f:
        with open(target, 'w') as o:
            # Use the DictReader to easily extract fields
            reader = csv.DictReader(f)
            writer = csv.DictWriter(o, reader.fieldnames)
            # Read and anonymize data, writing to target file.
            for row in anonymize_rows(reader):
                writer.writerow(row)

# anonymize("path-to-dataset-to-be-anonymized", "path-to-output-file")
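The same real address always maps to the same fake one because of `defaultdict`: the factory is called only the first time a key is looked up. The following is a minimal sketch of that behaviour for Python 3, using a hypothetical stand-in generator `fake_ipv4` built on the standard library instead of Faker, so that it runs without extra packages.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # fixed seed so runs are repeatable


def fake_ipv4():
    """Hypothetical stand-in for faker.ipv4(): a random dotted-quad string."""
    return ".".join(str(rng.randint(1, 254)) for _ in range(4))


# The factory runs only on the first lookup of a key, so every later
# lookup of the same real address returns the same fake address.
sources = defaultdict(fake_ipv4)

real = ["10.0.0.66", "10.0.0.1", "10.0.0.66"]
anonymized = [sources[ip] for ip in real]
print(anonymized)
```

Note that the notebook code keeps separate `sources` and `destinations` mappings, so a host that appears in both columns receives two different fake addresses. Sharing one mapping would keep them identical, if that matters for later analysis.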
Wireshark and Nmap
What is it? https://www.wireshark.org/
Wireshark is a free and open source packet analyzer. It is used for network troubleshooting, analysis, software and communications protocol development, and education.
Our setup consisted of two computers, one as victim and one as attacker.
Step by step
- Opened Wireshark on the victim's computer and started logging activity on the network
  - For a guide on how to log network info with Wireshark, see the following: https://www.wireshark.org/docs/wsug_html_chunked/ChCapCapturingSection.html
- Started a lot of transfers and streams on the victim's computer
  - Started a Chromecast stream of a workout video on YouTube to a TV on the network
  - Streamed music to speakers on the network via Spotify Connect
  - Sent large files via Apple AirDrop
- The attacker started Nmap and ran a port scan against the victim
- The attacker did a thorough fingerprint of the victim, such as OS detection and software detection at the open ports, also with Nmap
- We exported the victim's Wireshark log as CSV by doing the following:
The following image visualizes the network environment
- Dotted lines show network communication
- Filled lines show local execution or communication between nodes
- Lines with arrows show directed communication
About 30 minutes later, once that was done, we exported the data to CSV format. The CSV was formatted as follows:
No | Time | Source | Destination | Protocol | Length | Info
--- | --- | --- | --- | --- | --- | ---
1 | 0.001237 | 10.0.0.66 | 10.0.0.1 | DNS | 54 | [Redacted]
⫶ | ⫶ | ⫶ | ⫶ | ⫶ | ⫶ | ⫶
Description of collected data
- No = The id of the packet captured, starts from 0.
- Time = Number of seconds elapsed since the capture started
- Source = The IP address of the sender of the packet
- Destination = The IP address of the receiver of the packet
- Protocol = The protocol of the packet
- Length = Length of the packet
- Info = Data that is sent with the packet, redacted for privacy and anonymity
This lets us visualize the collected data as a directed graph and use the number of packets sent for each unique (source, destination, protocol) combination.
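As a minimal sketch of that counting for Python 3, assuming the export uses the column names shown in the table above, packets can be tallied per (source, destination, protocol) with the standard library. The sample rows and the function name `count_edges` are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_edges(csv_file):
    """Count packets per (Source, Destination, Protocol) triple."""
    reader = csv.DictReader(csv_file)
    return Counter(
        (row["Source"], row["Destination"], row["Protocol"]) for row in reader
    )


# Hypothetical sample in the same layout as the Wireshark export.
sample = io.StringIO(
    "No,Time,Source,Destination,Protocol,Length,Info\n"
    "1,0.001237,10.0.0.66,10.0.0.1,DNS,54,x\n"
    "2,0.002100,10.0.0.66,10.0.0.1,DNS,54,x\n"
    "3,0.003500,10.0.0.1,10.0.0.66,DNS,70,x\n"
    "4,0.004000,10.0.0.66,10.0.0.7,TCP,66,x\n"
)

edge_counts = count_edges(sample)
for (src, dst, proto), n in edge_counts.most_common():
    print(src, "->", dst, proto, n)
```

Each key of `edge_counts` corresponds to one directed edge of the graph, and its count is the number of packets observed on that edge.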
Download the network data
The data dump we collected is available for download at the following URL:
http://sunlabs.se/assets/sds/anon_data.csv
wget "http://sunlabs.se/assets/sds/anon_data.csv"
--2018-01-04 20:19:16-- http://sunlabs.se/assets/sds/anon_data.csv
Resolving sunlabs.se (sunlabs.se)... 52.233.164.195
Connecting to sunlabs.se (sunlabs.se)|52.233.164.195|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20156753 (19M) [application/octet-stream]
Saving to: ‘anon_data.csv’
94% 47.9M 0s 18550K .......... .......... .......... .......... .......... 94% 49.4M 0s 18600K .......... .......... .......... .......... .......... 94% 6.28M 0s 18650K .......... .......... .......... .......... .......... 94% 5.63M 0s 18700K .......... .......... .......... .......... .......... 95% 2.55M 0s 18750K .......... .......... .......... .......... .......... 95% 48.2M 0s 18800K .......... .......... .......... .......... .......... 95% 60.5M 0s 18850K .......... .......... .......... .......... .......... 96% 49.5M 0s 18900K .......... .......... .......... .......... .......... 96% 3.63M 0s 18950K .......... .......... .......... .......... .......... 96% 25.8M 0s 19000K .......... .......... .......... .......... .......... 96% 9.78M 0s 19050K .......... .......... .......... .......... .......... 97% 925K 0s 19100K .......... .......... .......... .......... .......... 97% 17.2M 0s 19150K .......... .......... .......... .......... .......... 97% 3.98M 0s 19200K .......... .......... .......... .......... .......... 97% 7.60M 0s 19250K .......... .......... .......... .......... .......... 98% 2.15M 0s 19300K .......... .......... .......... .......... .......... 98% 9.55M 0s 19350K .......... .......... .......... .......... .......... 98% 34.0M 0s 19400K .......... .......... .......... .......... .......... 98% 42.9M 0s 19450K .......... .......... .......... .......... .......... 99% 13.5M 0s 19500K .......... .......... .......... .......... .......... 99% 47.4M 0s 19550K .......... .......... .......... .......... .......... 99% 41.9M 0s 19600K .......... .......... .......... .......... .......... 99% 28.0M 0s 19650K .......... .......... .......... .... 100% 4.95M=5.0s 2018-01-04 20:19:23 (3.81 MB/s) - ‘anon_data.csv’ saved [20156753/20156753]
pwd
ls
/databricks/driver anon_data.csv anon_data.csv.1 conf derby.log eventlogs ganglia logs
val dataPath = "file:/databricks/driver/anon_data.csv"
spark.read.format("csv")
.option("header","true")
.option("inferSchema", "true")
.load(dataPath)
.createOrReplaceTempView("anonymized_data_raw")
dataPath: String = file:/databricks/driver/anon_data.csv
Data visualization
To better understand our network data, analyze it and verify its correctness, we decided to represent the data as a graph. A graph is made up of vertices and edges and can be either directed or undirected. A visualization of an example graph can be seen in the picture below:
More information about graph theory can be found at https://en.wikipedia.org/wiki/Graph\_theory.
In our context of network traffic, each connected device can be seen as a vertex in the graph and each packet sent between two devices is an edge. In our data a packet is always sent from one source node (vertex) to one destination node (vertex). Thus each edge is directed from source to destination, and the whole graph is directed.
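As a minimal, dependency-free sketch of this representation (not the GraphFrames code used in the project), the snippet below builds the vertex set and the in- and out-degrees from a handful of illustrative (source, destination) pairs. The variable names and the sample packets are assumptions made for illustration only.

```python
from collections import Counter

# Illustrative packets as (source, destination) pairs; one tuple per packet.
packets = [
    ("177.174.162.63", "131.157.50.23"),
    ("177.174.162.63", "131.157.50.23"),
    ("177.174.162.63", "113.26.139.31"),
    ("108.5.57.212", "3.189.19.124"),
    ("18.28.228.158", "3.189.19.124"),
]

# Vertices: every device that appears as a source or as a destination.
vertices = {address for edge in packets for address in edge}

# Each packet is a directed edge, so degrees are counted per direction.
out_degree = Counter(src for src, _ in packets)
in_degree = Counter(dst for _, dst in packets)

print(len(vertices))
print(out_degree["177.174.162.63"], in_degree["3.189.19.124"])
```

Counting every packet as its own edge means that repeated traffic between the same two devices raises the degree, which matches the idea of treating each packet as an edge.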
To use this graph representation for our network data we used the Spark package GraphFrames.
GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs. It provides high-level APIs in Scala, Java, and Python. It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries.
The GraphFrames package is available from Spark Packages.
This notebook demonstrates examples from the GraphFrames User Guide.
(Above GraphFrames explanation taken from Raazesh Sainudiin's course Scalable Data Science)
Using GraphFrames we can also explore the relationships between vertices using motifs, filter graphs and find the in- and out-degrees of vertices.
To visualize our graph we decided to use the JavaScript visualization package D3, which allows for complex visualizations of graph networks and many other applications.
displayHTML(frameIt("https://d3js.org",500))
displayHTML(frameIt("http://graphframes.github.io/user-guide.html",500))
val sqlDF = spark.sql("SELECT * FROM anonymized_data_raw")
sqlDF: org.apache.spark.sql.DataFrame = [n: int, Time: double ... 4 more fields]
display(sqlDF)
n | Time | Source | Destination | Protocol | Length |
---|---|---|---|---|---|
1.0 | 0.0 | 174.226.241.183 | 95.155.84.47 | STP | 52.0 |
2.0 | 0.140331 | 177.174.162.63 | 131.157.50.23 | TLSv1.2 | 129.0 |
3.0 | 0.141313 | 177.174.162.63 | 131.157.50.23 | TLSv1.2 | 129.0 |
4.0 | 0.142322 | 177.174.162.63 | 113.26.139.31 | DNS | 69.0 |
5.0 | 0.146544 | 108.5.57.212 | 3.189.19.124 | DNS | 85.0 |
6.0 | 0.147182 | 177.174.162.63 | 234.164.133.186 | TCP | 78.0 |
7.0 | 0.151439 | 18.28.228.158 | 3.189.19.124 | TCP | 74.0 |
8.0 | 0.151544 | 177.174.162.63 | 234.164.133.186 | TCP | 66.0 |
9.0 | 0.151839 | 177.174.162.63 | 234.164.133.186 | TLSv1.2 | 583.0 |
10.0 | 0.155831 | 18.28.228.158 | 3.189.19.124 | TCP | 66.0 |
11.0 | 0.156337 | 18.28.228.158 | 3.189.19.124 | TLSv1.2 | 216.0 |
12.0 | 0.156426 | 177.174.162.63 | 234.164.133.186 | TCP | 66.0 |
13.0 | 0.156624 | 177.174.162.63 | 234.164.133.186 | TLSv1.2 | 117.0 |
14.0 | 0.160726 | 18.28.228.158 | 3.189.19.124 | TLSv1.2 | 135.0 |
15.0 | 0.160769 | 177.174.162.63 | 234.164.133.186 | TCP | 66.0 |
16.0 | 0.167779 | 177.174.162.63 | 234.164.133.186 | TLSv1.2 | 119.0 |
17.0 | 0.16778 | 177.174.162.63 | 234.164.133.186 | TLSv1.2 | 122.0 |
18.0 | 0.16778 | 177.174.162.63 | 234.164.133.186 | TLSv1.2 | 108.0 |
19.0 | 0.168021 | 177.174.162.63 | 234.164.133.186 | TLSv1.2 | 104.0 |
20.0 | 0.168136 | 177.174.162.63 | 234.164.133.186 | TLSv1.2 | 908.0 |
21.0 | 0.171819 | 18.28.228.158 | 3.189.19.124 | TCP | 66.0 |
22.0 | 0.171826 | 18.28.228.158 | 3.189.19.124 | TLSv1.2 | 104.0 |
23.0 | 0.171922 | 177.174.162.63 | 234.164.133.186 | TCP | 66.0 |
24.0 | 0.172035 | 18.28.228.158 | 3.189.19.124 | TCP | 66.0 |
25.0 | 0.184559 | 101.96.108.245 | 3.189.19.124 | TCP | 66.0 |
26.0 | 0.185073 | 101.96.108.245 | 3.189.19.124 | TCP | 66.0 |
27.0 | 0.192846 | 101.96.108.245 | 3.189.19.124 | TLSv1.2 | 805.0 |
28.0 | 0.19291 | 177.174.162.63 | 131.157.50.23 | TCP | 66.0 |
29.0 | 0.211526 | 18.28.228.158 | 3.189.19.124 | TCP | 66.0 |
30.0 | 0.262157 | 101.96.108.245 | 3.189.19.124 | TLSv1.2 | 129.0 |
Truncated to 30 rows
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
// Round timestamps to two decimals (10 ms buckets if Time is in seconds) and aggregate the packets in each bucket
val truncData = sqlDF
.select($"n", $"Source", $"Destination", round($"Time", 2).as("ts"), $"Protocol", $"Length")
.groupBy($"ts", $"Source", $"Destination", $"Protocol")
.agg(avg($"Length").as("len"), (avg("Length") / max($"Length")).as("local_anomalies"), count("*").as("count"))
.sort($"ts")
truncData.show(5)
truncData.createOrReplaceTempView("anonymized_data")
+----+---------------+---------------+--------+-----+------------------+-----+
|  ts|         Source|    Destination|Protocol|  len|   local_anomalies|count|
+----+---------------+---------------+--------+-----+------------------+-----+
| 0.0|174.226.241.183|   95.155.84.47|     STP| 52.0|               1.0|    1|
|0.14| 177.174.162.63|  113.26.139.31|     DNS| 69.0|               1.0|    1|
|0.14| 177.174.162.63|  131.157.50.23| TLSv1.2|129.0|               1.0|    2|
|0.15| 177.174.162.63|234.164.133.186|     TCP| 72.0|0.9230769230769231|    2|
|0.15|   108.5.57.212|   3.189.19.124|     DNS| 85.0|               1.0|    1|
+----+---------------+---------------+--------+-----+------------------+-----+
only showing top 5 rows
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
truncData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ts: double, Source: string ... 5 more fields]
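To make the aggregation easier to follow, here is a small plain-Python sketch of the same idea: round each timestamp to two decimals, group by (timestamp, source, destination, protocol), and compute the mean length, the mean-to-max length ratio and the packet count per group. The function name `aggregate` is hypothetical, and the three input rows are copied from the raw table displayed earlier. Tie-breaking in Python's `round` may differ from Spark's, so this is an illustration rather than a drop-in replacement.

```python
from collections import defaultdict

# (time, source, destination, protocol, length) rows from the raw capture.
rows = [
    (0.147182, "177.174.162.63", "234.164.133.186", "TCP", 78),
    (0.151544, "177.174.162.63", "234.164.133.186", "TCP", 66),
    (0.146544, "108.5.57.212", "3.189.19.124", "DNS", 85),
]

def aggregate(rows, decimals=2):
    """Group packets by (rounded time, source, destination, protocol)."""
    groups = defaultdict(list)
    for time, src, dst, proto, length in rows:
        groups[(round(time, decimals), src, dst, proto)].append(length)
    result = {}
    for key, lengths in groups.items():
        mean_len = sum(lengths) / len(lengths)
        result[key] = {
            "len": mean_len,                             # average packet length
            "local_anomalies": mean_len / max(lengths),  # 1.0 when all lengths are equal
            "count": len(lengths),                       # packets in this bucket
        }
    return result

agg = aggregate(rows)
tcp = agg[(0.15, "177.174.162.63", "234.164.133.186", "TCP")]
print(tcp)
```

The two TCP packets fall into the same bucket, giving a mean length of 72 and a ratio of 72/78, which agrees with the corresponding row in the Spark output.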
import org.graphframes._
val v = truncData.select($"Source".as("id"), $"Source".as("src")).where("count > 10")
v.show()
val e = truncData.select($"Source".as("src"), $"Destination".as("dst"), $"Protocol", $"count").where("count > 10")
e.show()
val g = GraphFrame(v, e)
val gE= g.edges.select($"src", $"dst".as("dest"), $"count")
display(gE)
src | dest | count |
---|---|---|
177.174.162.63 | 223.1.230.140 | 12.0 |
108.5.57.212 | 246.42.223.127 | 14.0 |
177.174.162.63 | 196.141.158.131 | 20.0 |
9.9.252.254 | 3.189.19.124 | 20.0 |
177.174.162.63 | 196.141.158.131 | 19.0 |
9.9.252.254 | 3.189.19.124 | 19.0 |
9.9.252.254 | 3.189.19.124 | 19.0 |
177.174.162.63 | 196.141.158.131 | 19.0 |
9.9.252.254 | 3.189.19.124 | 19.0 |
177.174.162.63 | 196.141.158.131 | 19.0 |
9.9.252.254 | 3.189.19.124 | 19.0 |
177.174.162.63 | 196.141.158.131 | 19.0 |
9.9.252.254 | 3.189.19.124 | 19.0 |
177.174.162.63 | 196.141.158.131 | 19.0 |
177.174.162.63 | 196.141.158.131 | 19.0 |
9.9.252.254 | 3.189.19.124 | 19.0 |
9.9.252.254 | 3.189.19.124 | 19.0 |
177.174.162.63 | 196.141.158.131 | 19.0 |
108.5.57.212 | 246.42.223.127 | 11.0 |
177.174.162.63 | 196.141.158.131 | 19.0 |
9.9.252.254 | 3.189.19.124 | 19.0 |
9.9.252.254 | 3.189.19.124 | 18.0 |
177.174.162.63 | 196.141.158.131 | 18.0 |
9.9.252.254 | 3.189.19.124 | 19.0 |
177.174.162.63 | 196.141.158.131 | 19.0 |
177.174.162.63 | 196.141.158.131 | 20.0 |
9.9.252.254 | 3.189.19.124 | 20.0 |
9.9.252.254 | 3.189.19.124 | 20.0 |
177.174.162.63 | 196.141.158.131 | 20.0 |
177.174.162.63 | 196.141.158.131 | 19.0 |
Truncated to 30 rows
Warning: classes defined within packages cannot be redefined without a cluster restart. Compilation successful.
d3.graphs.force(
height = 1680,
width = 1280,
clicks = gE.as[d3.Edge])
display(g.inDegrees.orderBy($"inDegree".desc))
id | inDegree |
---|---|
3.189.19.124 | 2938.0 |
101.107.251.253 | 408.0 |
196.141.158.131 | 235.0 |
113.26.139.31 | 224.0 |
255.171.74.61 | 170.0 |
77.186.49.77 | 124.0 |
68.103.236.48 | 113.0 |
20.202.77.120 | 96.0 |
245.230.44.106 | 91.0 |
36.27.131.116 | 66.0 |
128.85.28.242 | 62.0 |
65.106.24.202 | 50.0 |
160.164.22.168 | 48.0 |
221.230.195.197 | 43.0 |
186.26.246.188 | 41.0 |
88.254.222.208 | 32.0 |
191.78.151.216 | 31.0 |
4.56.83.115 | 31.0 |
210.149.168.241 | 30.0 |
164.194.30.88 | 24.0 |
185.231.205.227 | 21.0 |
237.163.203.190 | 20.0 |
225.106.215.13 | 20.0 |
48.95.6.121 | 16.0 |
190.0.74.244 | 15.0 |
34.189.230.39 | 14.0 |
122.90.226.185 | 13.0 |
160.50.25.154 | 13.0 |
227.152.51.49 | 13.0 |
134.136.168.165 | 12.0 |
Truncated to 30 rows
display(g.outDegrees.orderBy($"outDegree".desc))
id | outDegree |
---|---|
177.174.162.63 | 2153.0 |
155.96.68.95 | 415.0 |
6.131.29.230 | 303.0 |
9.9.252.254 | 281.0 |
108.5.57.212 | 258.0 |
252.104.57.94 | 208.0 |
199.20.94.26 | 162.0 |
167.24.21.68 | 133.0 |
254.6.133.244 | 95.0 |
152.176.154.244 | 94.0 |
173.224.30.161 | 92.0 |
174.158.93.19 | 80.0 |
40.3.131.78 | 78.0 |
137.163.106.242 | 76.0 |
166.180.146.227 | 73.0 |
233.28.220.75 | 72.0 |
238.84.79.228 | 53.0 |
123.68.159.165 | 48.0 |
100.151.1.32 | 40.0 |
102.87.130.73 | 35.0 |
138.109.33.123 | 33.0 |
89.229.179.92 | 27.0 |
96.61.208.226 | 23.0 |
43.145.231.203 | 21.0 |
246.125.196.252 | 21.0 |
168.155.119.20 | 20.0 |
17.13.167.242 | 20.0 |
242.173.91.12 | 18.0 |
127.206.74.216 | 16.0 |
132.77.224.252 | 15.0 |
Truncated to 30 rows
Clustering
Pre-processing of data
We preprocessed the data logged from Wireshark as follows:
- Rounded timestamps to two decimal places, which gives 10 ms buckets assuming the capture's Time column is in seconds.
- Grouped the data by (timestamp, source, destination, protocol), with a count of how many packets of that kind were sent/received within each bucket.
- One-hot encoded the protocol values
- If you don't know what that means, check out this article: https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f
- Standardized the count and length features
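The last two steps can be sketched without any libraries. The snippet below is an illustration under stated assumptions, not the project's pandas/scikit-learn code: `one_hot` and `standardize` are hypothetical helper names, the sample values are made up, and the standardization uses the population standard deviation, which is also what scikit-learn's `StandardScaler` uses.

```python
from statistics import mean, pstdev

protocols = ["TCP", "DNS", "TLSv1.2", "TCP"]
lengths = [66.0, 85.0, 129.0, 78.0]

def one_hot(values):
    """Return the sorted categories and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    rows = [[1.0 if value == cat else 0.0 for cat in categories] for value in values]
    return categories, rows

def standardize(values):
    """Rescale to zero mean and unit variance (assumes the values are not all equal)."""
    mu, sigma = mean(values), pstdev(values)
    return [(value - mu) / sigma for value in values]

categories, encoded = one_hot(protocols)
scaled = standardize(lengths)
print(categories)
print(encoded[0])
```

Each protocol becomes its own column, so the number of features grows with the number of distinct protocols in the capture.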
Setting up k-means clustering
- 32 numeric features (29 one-hot protocol columns plus len, ts and count), clustered with k = 23
- Filtered out features that are not numeric, for example destination and source
displayHTML(frameIt("https://en.wikipedia.org/wiki/K-means_clustering",500))
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Pull the aggregated table into pandas (assumes it fits in driver memory)
sampled = sqlContext.sql("SELECT * FROM anonymized_data").toPandas()

# Standardize 'len' and 'count' to zero mean and unit variance.
# StandardScaler expects a 2-D array, so each column is reshaped to a single
# feature via .values (Series.reshape is not available in current pandas).
for col in ['len', 'count']:
    values = sampled[col].values.reshape(-1, 1)  # one feature
    sampled[col] = StandardScaler().fit_transform(values).ravel()
/databricks/python/local/lib/python2.7/site-packages/sklearn/utils/validation.py:429: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler. warnings.warn(msg, _DataConversionWarning)
df_count = sampled['count']
df_length = sampled['len']
df_proto = pd.get_dummies(sampled['Protocol'])
df_source = sampled['Source']
df_dest = sampled['Destination']
df_ts = sampled['ts']
onehot = pd.concat([df_proto, df_source, df_length, df_dest, df_ts, df_count], axis=1)
onehotDF = sqlContext.createDataFrame(onehot)
sqlContext.sql("DROP TABLE IF EXISTS anonymized_data_onehot")
onehotDF.write.saveAsTable('anonymized_data_onehot')
// One row of the one-hot encoded table. count is a Double because it was
// standardized in the Python cell and is therefore no longer an integer;
// parsing it as a Long would fail and silently fall back to 0.
case class Packet(AJP13: Double, ALLJOYN_NS: Double, ARP: Double, DHCP: Double, DNS: Double, HTTP: Double, HTTP_XML: Double, ICMP: Double, ICMPv6: Double, IGMPv1: Double, IGMPv2: Double, IGMPv3: Double, MDNS: Double, NBNS: Double, NTP: Double, OCSP: Double, QUIC: Double, RTCP: Double, SIP: Double, SNMP: Double, SSDP: Double, STP: Double, STUN: Double, TCP: Double, TFTP: Double, TLSv1: Double, TLSv1_2: Double, UDP: Double, XMPP_XML: Double, Source: String, len: Double, Destination: String, ts: Double, count: Double)
def parseRow(row: org.apache.spark.sql.Row): Packet = {
  // Parse any value as a Double, falling back to 0.0 if it cannot be parsed
  def toDouble(value: Any): Double =
    scala.util.Try(value.toString.toDouble).getOrElse(0.0)
  Packet(toDouble(row(0)), toDouble(row(1)), toDouble(row(2)), toDouble(row(3)), toDouble(row(4)), toDouble(row(5)), toDouble(row(6)), toDouble(row(7)), toDouble(row(8)), toDouble(row(9)), toDouble(row(10)), toDouble(row(11)), toDouble(row(12)), toDouble(row(13)), toDouble(row(14)), toDouble(row(15)), toDouble(row(16)), toDouble(row(17)), toDouble(row(18)), toDouble(row(19)), toDouble(row(20)), toDouble(row(21)), toDouble(row(22)), toDouble(row(23)), toDouble(row(24)), toDouble(row(25)), toDouble(row(26)), toDouble(row(27)), toDouble(row(28)), row(29).toString, toDouble(row(30)), row(31).toString, toDouble(row(32)), toDouble(row(33)))
}
val df = table("anonymized_data_onehot").map(parseRow).toDF
df.createOrReplaceTempView("packetsView")
defined class Packet
parseRow: (row: org.apache.spark.sql.Row)Packet
df: org.apache.spark.sql.DataFrame = [AJP13: double, ALLJOYN_NS: double ... 32 more fields]
import org.apache.spark.ml.feature.VectorAssembler
// Columns left out of the feature vector because they are not numeric.
// A Seq is used so that contains() compares whole column names rather than substrings.
val list = Seq("Source", "Destination")
val cols = df.columns
val filtered = cols.filter { el =>
  !list.contains(el)
}
val trainingData = new VectorAssembler()
  .setInputCols(filtered)
  .setOutputCol("features")
  .transform(table("packetsView"))
import org.apache.spark.ml.feature.VectorAssembler
list: Seq[String] = List(Source, Destination)
cols: Array[String] = Array(AJP13, ALLJOYN_NS, ARP, DHCP, DNS, HTTP, HTTP_XML, ICMP, ICMPv6, IGMPv1, IGMPv2, IGMPv3, MDNS, NBNS, NTP, OCSP, QUIC, RTCP, SIP, SNMP, SSDP, STP, STUN, TCP, TFTP, TLSv1, TLSv1_2, UDP, XMPP_XML, Source, len, Destination, ts, count)
filtered: Array[String] = Array(AJP13, ALLJOYN_NS, ARP, DHCP, DNS, HTTP, HTTP_XML, ICMP, ICMPv6, IGMPv1, IGMPv2, IGMPv3, MDNS, NBNS, NTP, OCSP, QUIC, RTCP, SIP, SNMP, SSDP, STP, STUN, TCP, TFTP, TLSv1, TLSv1_2, UDP, XMPP_XML, len, ts, count)
trainingData: org.apache.spark.sql.DataFrame = [AJP13: double, ALLJOYN_NS: double ... 33 more fields]
import org.apache.spark.ml.clustering.KMeans
val model = new KMeans().setK(23).fit(trainingData)
val modelTransformed = model.transform(trainingData)
import org.apache.spark.ml.clustering.KMeans
model: org.apache.spark.ml.clustering.KMeansModel = kmeans_81ed7de82229
modelTransformed: org.apache.spark.sql.DataFrame = [AJP13: double, ALLJOYN_NS: double ... 34 more fields]
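The notebook stops at fitting the model. One common way to turn a k-means clustering into an anomaly detector, and the approach taken in Advanced Analytics with Spark, is to score each point by its distance to the nearest cluster centre and flag points whose distance exceeds a threshold. The sketch below illustrates that idea in plain Python; the centroids, points, threshold and function names are all hypothetical, and in Spark the centres would come from the fitted model's `clusterCenters`.

```python
import math

def nearest_centroid_distance(point, centroids):
    """Euclidean distance from a point to its closest cluster centre."""
    return min(math.dist(point, centroid) for centroid in centroids)

def flag_anomalies(points, centroids, threshold):
    """Return the points that lie further than threshold from every centre."""
    return [p for p in points if nearest_centroid_distance(p, centroids) > threshold]

# Hypothetical two-dimensional cluster centres and observations.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, -0.5), (9.0, 11.0), (5.0, 5.0), (30.0, 0.0)]

anomalies = flag_anomalies(points, centroids, threshold=3.0)
print(anomalies)
```

Choosing the threshold is a judgment call; a typical starting point is a high percentile of the distances observed on traffic believed to be normal.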
Improvements and future work
In this section we present improvements that could have been made to our project, and future work that would build on it to increase its usability and value.
Dimensionality improvements
We used k-means to cluster our network data. k-means relies on Euclidean distance, and models based on Euclidean distance are susceptible to the curse of dimensionality. After one-hot encoding the protocol column of the original dataset we ended up with 32 numeric features, so we are likely suffering from this high dimensionality. To improve the clustering one could use an algorithm that doesn't rely on Euclidean distance (or on other distance measures that work poorly in high dimensions). Another possible solution would be to use dimensionality reduction and try to retain as much information as possible with fewer features. This could be done using techniques such as PCA or LDA.
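The effect can be illustrated with a small experiment that is not part of the original project: for uniformly random points, the relative gap between the nearest and the farthest neighbour of a query point shrinks as the number of dimensions grows, which makes distance-based clustering less discriminative. The function name, point count and seed below are arbitrary choices for this sketch.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)

low = relative_contrast(2)    # two features
high = relative_contrast(32)  # as many features as the clustering input
print(low, high)
```

A much smaller contrast in 32 dimensions than in 2 suggests that nearest and farthest points become harder to tell apart, which is the motivation for reducing the number of features first.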
Parse packet contents
We didn't parse the packet information other than IP addresses, packet lengths and protocol. To gain further insights one could parse the additional packet contents and look for sensitive items, including usernames, passwords etc.
Graph Analysis
One could continue to analyze the graph representation of the data. Examples of this could include looking for complex relationships in the graph using GraphFrames motifs.
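As an illustration of what a motif query looks for, the sketch below finds pairs of devices that send traffic to each other in both directions. In GraphFrames this pattern would be written as the motif `(a)-[e]->(b); (b)-[e2]->(a)` and passed to `find`; the plain-Python version here uses made-up device names and a hypothetical helper, `mutual_pairs`.

```python
# Directed edges between hypothetical devices.
edges = {("A", "B"), ("B", "A"), ("A", "C"), ("C", "D")}

def mutual_pairs(edges):
    """Return each pair (a, b), with a < b, that has edges in both directions."""
    return sorted((a, b) for a, b in edges if a < b and (b, a) in edges)

print(mutual_pairs(edges))
```

Requiring `a < b` reports every two-way relationship once instead of once per direction.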
Real time network analysis using Spark streaming
To make the project even more useful in a real environment, one could use Spark Streaming k-means to cluster network traffic in real time and then perform anomaly detection in real time as well. An example approach of this can be seen in the following video: https://www.youtube.com/watch?v=i8\_\_\_3GdxlQ
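The core idea behind clustering a stream is that cluster centres are updated incrementally as new data arrives instead of being refitted from scratch. The sketch below shows a simplified per-point variant of that update in plain Python. It is an assumption-laden illustration, not Spark's implementation: Spark's streaming k-means updates on mini-batches and adds a decay factor for forgetting old data, and the function name and sample values here are hypothetical.

```python
import math

def online_kmeans_update(centroids, counts, point):
    """Assign point to its nearest centre and move that centre towards it."""
    idx = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
    counts[idx] += 1
    rate = 1.0 / counts[idx]  # running-mean step size
    centroids[idx] = [c + rate * (x - c) for c, x in zip(centroids[idx], point)]
    return idx

# Hypothetical starting centres, each treated as if it had seen one point.
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]

stream = [(1.0, 1.0), (9.0, 9.0), (2.0, 0.0)]
assigned = [online_kmeans_update(centroids, counts, p) for p in stream]
print(assigned, centroids)
```

Because each centre is a running mean of the points assigned to it, the distance of an incoming packet to its nearest centre can be checked for anomalies before the centre is updated.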
Additional continuations of this could include suggesting actions to perform when malicious activity is detected.