SDS-2.2, Scalable Data Science

Archived YouTube video of this live unedited lab-lecture:

Network anomaly detection

Student Project

by Victor Ingman and Kasper Ramström

This project set out to build an automatic anomaly detection system for networks. Network threats are a major and growing concern for enterprises and private consumers all over the world. On average it takes 191 days for a company to detect a threat and another 66 days to contain it (Enhancing Threat Detection with Big Data and AI). Besides taking a long time to detect and contain, threats also demand a great deal of manual labour from security experts. Businesses should therefore prioritise solutions that not only prevent malicious intrusions but also detect malicious activity quickly and automatically, so that it can be dealt with swiftly.

An example of the threats we face today is the WannaCry ransomware, which spread rapidly around the world during 2017 and caused major havoc for companies and private consumers, including Akademiska Sjukhuset here in Uppsala.

Super cool WannaCry screenshot

With better security systems and automated ways of detecting malicious behaviour, many of these attacks could be prevented.

To gain inspiration for our project and find out how others have developed similar systems, we used the book Advanced Analytics with Spark, which uses k-means clustering.

Advanced Analytics with Spark book

In the book, the authors cluster different kinds of network events in the hope that abnormal behaviour ends up in clusters separate from normal events. The data used in the book is the publicly available KDD Cup 1999 Data, which is quite dated and differs from the data we used, but it works well as a proof of concept for our project. The code accompanying the book can be found at https://github.com/sryza/aas, and for our project we used a similar approach to cluster the data with k-means.
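The clustering idea can be illustrated with a minimal, self-contained sketch. It is not the code from the book or from this project (both use Spark); it is plain Python on made-up two-dimensional points, and the helper names `nearest` and `kmeans` are hypothetical. One common way to flag anomalies, assumed here, is to score each point by its distance to the nearest centroid.

```python
import math


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm starting from the given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            index, _ = nearest(p, centroids)
            clusters[index].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids


# Toy data (illustrative only): two dense groups and one stray point.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
          (9.0, 0.5)]
centroids = kmeans(points, centroids=[(0.0, 0.0), (6.0, 6.0)])

# Score each point by its distance to the nearest centroid;
# the largest score marks the most unusual point.
scores = {p: nearest(p, centroids)[1] for p in points}
outlier = max(scores, key=scores.get)
print(outlier)  # (9.0, 0.5)
```

In this toy run the stray point is absorbed into the second cluster but still lies far from its centroid, which is why a distance-based score picks it out.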

Below, we present the code for our project along with explanations of what we did and how we did it. This includes data collection, data visualization, clustering of the data, and possible improvements and future work.

frameIt: (u: String, h: Int)String
displayHTML(frameIt("https://en.wikipedia.org/wiki/Anomaly_detection",500))

Data Collection

To get data for our network security project, we decided to generate it ourselves on our own networks and to perform the malicious activity ourselves as well.

Our basic idea for the data collection was to have one victim device perform normal internet activity, including streaming to different media devices, transferring files and web surfing. Meanwhile, another device (the attacker) would perform malicious activity such as port scans and fingerprinting of the victim. We hoped that the malicious activity would stand out from the other traffic and be detectable by our anomaly detection models.

From the book Network Security Through Analysis we read about the tools Wireshark and Nmap. For our project, we used Wireshark for collecting network data on the victim's computer and Nmap for performing malicious activity.

Data anonymization

As we collected the data on our own private network and are publishing this notebook along with the data, we decided to anonymize our network data for privacy reasons. To do this, we followed the Databricks guide: https://databricks.com/blog/2017/02/13/anonymizing-datasets-at-scale-leveraging-databricks-interoperability.html

Using the package Faker, we generated fake source and destination IP addresses for our network traffic data and used this data for the remainder of the project. Since we didn't parse the packet details of our network traffic, and since they can potentially include sensitive information about our connections, we decided to remove that data from the public dataset.

displayHTML(frameIt("https://en.wikipedia.org/wiki/Data_anonymization",500))
pip install unicodecsv Faker
Collecting unicodecsv
  Downloading unicodecsv-0.14.1.tar.gz
Collecting Faker
  Downloading Faker-0.8.8-py2.py3-none-any.whl (707kB)
Collecting text-unidecode (from Faker)
  Downloading text_unidecode-1.1-py2.py3-none-any.whl (77kB)
Collecting python-dateutil>=2.4 (from Faker)
  Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194kB)
Requirement already satisfied (use --upgrade to upgrade): six in /usr/lib/python2.7/dist-packages (from Faker)
Collecting ipaddress; python_version == "2.7" (from Faker)
  Downloading ipaddress-1.0.19.tar.gz
Building wheels for collected packages: unicodecsv, ipaddress
  Running setup.py bdist_wheel for unicodecsv: started
  Running setup.py bdist_wheel for unicodecsv: finished with status 'done'
  Stored in directory: /root/.cache/pip/wheels/97/e2/16/219fa93b83edaff912b6805cfa19d0597e21f8d353f3e2d22f
  Running setup.py bdist_wheel for ipaddress: started
  Running setup.py bdist_wheel for ipaddress: finished with status 'done'
  Stored in directory: /root/.cache/pip/wheels/d7/6b/69/666188e8101897abb2e115d408d139a372bdf6bfa7abb5aef5
Successfully built unicodecsv ipaddress
Installing collected packages: unicodecsv, text-unidecode, python-dateutil, ipaddress, Faker
Successfully installed Faker-0.8.8 ipaddress-1.0.19 python-dateutil-2.6.1 text-unidecode-1.1 unicodecsv-0.14.1
You are using pip version 8.1.1, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
import unicodecsv as csv
from collections import defaultdict
from faker import Factory

def anonymize_rows(rows):
    """
    Rows is an iterable of dictionaries that contain Source and
    Destination fields (IP addresses) that need to be anonymized.
    """
    # Load faker
    faker  = Factory.create()

    # Map each real source/destination IP address to a fake IPv4 address.
    # defaultdict calls faker.ipv4 the first time a key is seen and reuses
    # that value afterwards, so a given address is always replaced consistently.
    # The two mappings are separate, so the same real address may get
    # different fake addresses in the Source and Destination columns.
    sources  = defaultdict(faker.ipv4)
    destinations = defaultdict(faker.ipv4)

    # Iterate over the rows from the file and yield anonymized rows.
    for row in rows:
        # Replace the Source and Destination fields with faked addresses.
        row["Source"]  = sources[row["Source"]]
        row["Destination"] = destinations[row["Destination"]]

        # Yield the row back to the caller
        yield row

def anonymize(source, target):
    """
    The source argument is a path to a CSV file containing data to anonymize,
    while target is a path to write the anonymized CSV data to.
    """
    with open(source, 'rU') as f:
        with open(target, 'w') as o:
            # Use the DictReader to easily extract fields
            reader = csv.DictReader(f)
            writer = csv.DictWriter(o, reader.fieldnames)

            # Read and anonymize data, writing to target file.
            for row in anonymize_rows(reader):
                writer.writerow(row)

# anonymize("path-to-dataset-to-be-anonymized", "path-to-output-file")
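To illustrate why `defaultdict` gives a consistent replacement, here is a small stand-alone sketch that needs no third-party packages. It swaps Faker for a hypothetical `fake_ipv4` helper built on the standard library, and it shares one mapping between both columns so that a host keeps the same fake address whether it appears as source or destination. Both choices are assumptions made for illustration and differ from the code above.

```python
import ipaddress
import random
from collections import defaultdict

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def fake_ipv4():
    """Hypothetical stand-in for faker.ipv4: a random IPv4 address string."""
    return str(ipaddress.IPv4Address(rng.getrandbits(32)))


# The factory is called once per unseen key; later lookups reuse the value.
mapping = defaultdict(fake_ipv4)

# Invented rows: a request and its reply between the same two hosts.
rows = [
    {"Source": "10.0.0.66", "Destination": "10.0.0.1"},
    {"Source": "10.0.0.1", "Destination": "10.0.0.66"},
]
anonymized = [
    {"Source": mapping[r["Source"]], "Destination": mapping[r["Destination"]]}
    for r in rows
]

# The reply row mirrors the request row, so the traffic graph is preserved.
print(anonymized[0]["Source"] == anonymized[1]["Destination"])  # True
```

Whether a shared mapping is preferable depends on the analysis: it keeps the graph structure intact, at the cost of revealing that a source and a destination are the same host.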

Wireshark and Nmap

What is it? https://www.wireshark.org/

Wireshark is a free and open source packet analyzer. It is used for network troubleshooting, analysis, software and communications protocol development, and education.

Our setup consisted of two computers, one as victim and one as attacker.

Step by step

  • Opened Wireshark on the victim's computer and started logging activity on the network
  • Started a lot of transfers and streams on the victim's computer
    • Started a Chromecast stream of a workout video on YouTube to a TV on the network
    • Streamed music to speakers on the network via Spotify Connect
    • Sent large files via Apple AirDrop
  • The attacker started Nmap and ran a port scan against the victim
  • The attacker did a thorough fingerprint of the victim, such as OS detection and software detection at the open ports, also with Nmap
  • We exported the victim's Wireshark log as CSV

The following image visualizes the network environment

In the image, dotted lines show network communication, filled lines show local execution or communication between nodes, and lines with arrows show directed communication.

When that was done, about 30 minutes later, we exported the data in CSV format. The CSV was formatted as follows:

No | Time | Source | Destination | Protocol | Length | Info
--- | --- | --- | --- | --- | --- | ---
1 | 0.001237 | 10.0.0.66 | 10.0.0.1 | DNS | 54 | [Redacted]
⫶ | ⫶ | ⫶ | ⫶ | ⫶ | ⫶ | ⫶

Description of collected data

  • No = The id of the packet captured, starts from 0.
  • Time = Number of seconds elapsed since the capture started
  • Source = The IP address of the sender of the packet
  • Destination = The IP address of the receiver of the packet
  • Protocol = The protocol of the packet
  • Length = Length of the packet in bytes
  • Info = Data that is sent with the packet, redacted for privacy and anonymity

With these fields we can visualize the collected data as a directed graph and count how many packets were sent for each unique (source, destination, protocol) combination.
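The counting step can be sketched with the standard library alone. The sample rows below are invented for illustration and only follow the column layout described above (Info is omitted for brevity). `count_edges` is a hypothetical helper name, not part of the project code, and the sketch assumes the CSV starts with a header row; if a file has none, the column names can be passed to `csv.DictReader` through its `fieldnames` argument.

```python
import csv
import io
from collections import Counter

# Invented sample in the exported column layout.
SAMPLE = """No,Time,Source,Destination,Protocol,Length
1,0.001237,10.0.0.66,10.0.0.1,DNS,54
2,0.002100,10.0.0.1,10.0.0.66,DNS,70
3,0.010000,10.0.0.66,10.0.0.1,DNS,54
4,0.020000,10.0.0.66,10.0.0.7,TCP,60
"""


def count_edges(csv_file):
    """Count packets per directed (source, destination, protocol) triple."""
    reader = csv.DictReader(csv_file)
    return Counter(
        (row["Source"], row["Destination"], row["Protocol"]) for row in reader
    )


edges = count_edges(io.StringIO(SAMPLE))

# Each key is one directed edge of the graph; the count is its weight.
for (src, dst, proto), n in edges.most_common():
    print(f"{src} -> {dst} [{proto}]: {n} packets")
```

Because the triple is ordered, traffic from A to B and from B to A is counted as two different edges, which matches the directed-graph view.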

Download the network data

The data dump we collected is available for download at the following URL:

http://sunlabs.se/assets/sds/anon_data.csv

wget "http://sunlabs.se/assets/sds/anon_data.csv"
--2018-01-04 20:19:16--  http://sunlabs.se/assets/sds/anon_data.csv
Resolving sunlabs.se (sunlabs.se)... 52.233.164.195
Connecting to sunlabs.se (sunlabs.se)|52.233.164.195|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20156753 (19M) [application/octet-stream]
Saving to: ‘anon_data.csv’

 16050K .......... .......... .......... .......... .......... 81% 4.08M 1s
 16100K .......... .......... .......... .......... .......... 82% 7.54M 1s
 16150K .......... .......... .......... .......... .......... 82% 6.61M 1s
 16200K .......... .......... .......... .......... .......... 82% 6.01M 1s
 16250K .......... .......... .......... .......... .......... 82% 16.4M 1s
 16300K .......... .......... .......... .......... .......... 83% 8.25M 1s
 16350K .......... .......... .......... .......... .......... 83% 8.41M 1s
 16400K .......... .......... .......... .......... .......... 83% 8.53M 1s
 16450K .......... .......... .......... .......... .......... 83% 48.8M 1s
 16500K .......... .......... .......... .......... .......... 84% 39.0M 1s
 16550K .......... .......... .......... .......... .......... 84% 6.14M 1s
 16600K .......... .......... .......... .......... .......... 84% 6.06M 1s
 16650K .......... .......... .......... .......... .......... 84% 12.7M 1s
 16700K .......... .......... .......... .......... .......... 85% 33.1M 1s
 16750K .......... .......... .......... .......... .......... 85% 11.2M 1s
 16800K .......... .......... .......... .......... .......... 85% 6.58M 1s
 16850K .......... .......... .......... .......... .......... 85% 3.20M 1s
 16900K .......... .......... .......... .......... .......... 86% 19.5M 1s
 16950K .......... .......... .......... .......... .......... 86% 6.29M 1s
 17000K .......... .......... .......... .......... .......... 86%  908K 1s
 17050K .......... .......... .......... .......... .......... 86% 15.1M 1s
 17100K .......... .......... .......... .......... .......... 87% 3.87M 1s
 17150K .......... .......... .......... .......... .......... 87% 7.34M 1s
 17200K .......... .......... .......... .......... .......... 87% 5.27M 1s
 17250K .......... .......... .......... .......... .......... 87% 12.0M 1s
 17300K .......... .......... .......... .......... .......... 88% 8.83M 1s
 17350K .......... .......... .......... .......... .......... 88% 10.2M 1s
 17400K .......... .......... .......... .......... .......... 88% 12.4M 1s
 17450K .......... .......... .......... .......... .......... 88% 8.18M 1s
 17500K .......... .......... .......... .......... .......... 89% 57.6M 1s
 17550K .......... .......... .......... .......... .......... 89% 13.4M 1s
 17600K .......... .......... .......... .......... .......... 89% 5.26M 1s
 17650K .......... .......... .......... .......... .......... 89% 5.81M 1s
 17700K .......... .......... .......... .......... .......... 90% 4.93M 1s
 17750K .......... .......... .......... .......... .......... 90% 33.9M 0s
 17800K .......... .......... .......... .......... .......... 90% 41.4M 0s
 17850K .......... .......... .......... .......... .......... 90% 4.12M 0s
 17900K .......... .......... .......... .......... .......... 91% 10.8M 0s
 17950K .......... .......... .......... .......... .......... 91% 5.97M 0s
 18000K .......... .......... .......... .......... .......... 91% 1.02M 0s
 18050K .......... .......... .......... .......... .......... 91% 5.34M 0s
 18100K .......... .......... .......... .......... .......... 92% 3.95M 0s
 18150K .......... .......... .......... .......... .......... 92% 9.24M 0s
 18200K .......... .......... .......... .......... .......... 92% 7.17M 0s
 18250K .......... .......... .......... .......... .......... 92% 2.33M 0s
 18300K .......... .......... .......... .......... .......... 93% 11.2M 0s
 18350K .......... .......... .......... .......... .......... 93% 43.4M 0s
 18400K .......... .......... .......... .......... .......... 93% 15.2M 0s
 18450K .......... .......... .......... .......... .......... 93% 32.0M 0s
 18500K .......... .......... .......... .......... .......... 94% 47.9M 0s
 18550K .......... .......... .......... .......... .......... 94% 49.4M 0s
 18600K .......... .......... .......... .......... .......... 94% 6.28M 0s
 18650K .......... .......... .......... .......... .......... 94% 5.63M 0s
 18700K .......... .......... .......... .......... .......... 95% 2.55M 0s
 18750K .......... .......... .......... .......... .......... 95% 48.2M 0s
 18800K .......... .......... .......... .......... .......... 95% 60.5M 0s
 18850K .......... .......... .......... .......... .......... 96% 49.5M 0s
 18900K .......... .......... .......... .......... .......... 96% 3.63M 0s
 18950K .......... .......... .......... .......... .......... 96% 25.8M 0s
 19000K .......... .......... .......... .......... .......... 96% 9.78M 0s
 19050K .......... .......... .......... .......... .......... 97%  925K 0s
 19100K .......... .......... .......... .......... .......... 97% 17.2M 0s
 19150K .......... .......... .......... .......... .......... 97% 3.98M 0s
 19200K .......... .......... .......... .......... .......... 97% 7.60M 0s
 19250K .......... .......... .......... .......... .......... 98% 2.15M 0s
 19300K .......... .......... .......... .......... .......... 98% 9.55M 0s
 19350K .......... .......... .......... .......... .......... 98% 34.0M 0s
 19400K .......... .......... .......... .......... .......... 98% 42.9M 0s
 19450K .......... .......... .......... .......... .......... 99% 13.5M 0s
 19500K .......... .......... .......... .......... .......... 99% 47.4M 0s
 19550K .......... .......... .......... .......... .......... 99% 41.9M 0s
 19600K .......... .......... .......... .......... .......... 99% 28.0M 0s
 19650K .......... .......... .......... ....                 100% 4.95M=5.0s

2018-01-04 20:19:23 (3.81 MB/s) - ‘anon_data.csv’ saved [20156753/20156753]
pwd
ls
/databricks/driver
anon_data.csv
anon_data.csv.1
conf
derby.log
eventlogs
ganglia
logs
val dataPath = "file:/databricks/driver/anon_data.csv"
spark.read.format("csv")
  .option("header","true")
  .option("inferSchema", "true")
  .load(dataPath)
  .createOrReplaceTempView("anonymized_data_raw")
dataPath: String = file:/databricks/driver/anon_data.csv

Data visualization

To better understand our network data, analyze it and verify its correctness, we decided to represent the data as a graph. A graph is made up of vertices and edges and can be either directed or undirected. A visualization of an example graph can be seen in the picture below:

Example Graph

More information about graph theory can be found at https://en.wikipedia.org/wiki/Graph_theory.

In our context of network traffic, each connected device can be seen as a vertex in the graph and each packet sent between two devices is an edge. In our data a packet is always sent from one source node (vertex) to one destination node (vertex). Thus each edge is directed from source to destination, and the whole graph is directed.
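
As a minimal sketch of this representation (plain Python rather than GraphFrames, using the Source and Destination columns of the first ten packets displayed further down), each packet becomes one directed edge. The in-degree and out-degree of a device are then the number of packets it receives and sends. The variable names are only illustrative.

```python
from collections import Counter

# (source, destination) of the first ten packets in the data set
packets = [
    ("174.226.241.183", "95.155.84.47"),
    ("177.174.162.63", "131.157.50.23"),
    ("177.174.162.63", "131.157.50.23"),
    ("177.174.162.63", "113.26.139.31"),
    ("108.5.57.212", "3.189.19.124"),
    ("177.174.162.63", "234.164.133.186"),
    ("18.28.228.158", "3.189.19.124"),
    ("177.174.162.63", "234.164.133.186"),
    ("177.174.162.63", "234.164.133.186"),
    ("18.28.228.158", "3.189.19.124"),
]

# Every address that appears as a source or a destination is a vertex
vertices = sorted({address for edge in packets for address in edge})

# Each packet is one directed edge, so repeated packets count repeatedly
out_degree = Counter(src for src, _ in packets)
in_degree = Counter(dst for _, dst in packets)

for address in vertices:
    print(address, "in:", in_degree[address], "out:", out_degree[address])
```

Because every packet contributes exactly one outgoing and one incoming edge, the in-degrees and out-degrees always sum to the same total.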

To use this graph representation for our network data we used the Spark package GraphFrames.

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs. It provides high-level APIs in Scala, Java, and Python. It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries.

The GraphFrames package is available from Spark Packages.

This notebook demonstrates examples from the GraphFrames User Guide.

(Above GraphFrames explanation taken from Raazesh Sainudiin's course Scalable Data Science)

Using GraphFrames we can also explore the relationships between vertices using motifs, filter graphs and find the in- and out-degrees of vertices.

To visualize our graph network we decided to use the JavaScript visualization package D3, which allows for complex visualizations of graph networks and many other applications.

displayHTML(frameIt("https://d3js.org",500))
displayHTML(frameIt("http://graphframes.github.io/user-guide.html",500))
val sqlDF = spark.sql("SELECT * FROM anonymized_data_raw")
sqlDF: org.apache.spark.sql.DataFrame = [n: int, Time: double ... 4 more fields]
display(sqlDF)
n Time Source Destination Protocol Length
1.0 0.0 174.226.241.183 95.155.84.47 STP 52.0
2.0 0.140331 177.174.162.63 131.157.50.23 TLSv1.2 129.0
3.0 0.141313 177.174.162.63 131.157.50.23 TLSv1.2 129.0
4.0 0.142322 177.174.162.63 113.26.139.31 DNS 69.0
5.0 0.146544 108.5.57.212 3.189.19.124 DNS 85.0
6.0 0.147182 177.174.162.63 234.164.133.186 TCP 78.0
7.0 0.151439 18.28.228.158 3.189.19.124 TCP 74.0
8.0 0.151544 177.174.162.63 234.164.133.186 TCP 66.0
9.0 0.151839 177.174.162.63 234.164.133.186 TLSv1.2 583.0
10.0 0.155831 18.28.228.158 3.189.19.124 TCP 66.0
11.0 0.156337 18.28.228.158 3.189.19.124 TLSv1.2 216.0
12.0 0.156426 177.174.162.63 234.164.133.186 TCP 66.0
13.0 0.156624 177.174.162.63 234.164.133.186 TLSv1.2 117.0
14.0 0.160726 18.28.228.158 3.189.19.124 TLSv1.2 135.0
15.0 0.160769 177.174.162.63 234.164.133.186 TCP 66.0
16.0 0.167779 177.174.162.63 234.164.133.186 TLSv1.2 119.0
17.0 0.16778 177.174.162.63 234.164.133.186 TLSv1.2 122.0
18.0 0.16778 177.174.162.63 234.164.133.186 TLSv1.2 108.0
19.0 0.168021 177.174.162.63 234.164.133.186 TLSv1.2 104.0
20.0 0.168136 177.174.162.63 234.164.133.186 TLSv1.2 908.0
21.0 0.171819 18.28.228.158 3.189.19.124 TCP 66.0
22.0 0.171826 18.28.228.158 3.189.19.124 TLSv1.2 104.0
23.0 0.171922 177.174.162.63 234.164.133.186 TCP 66.0
24.0 0.172035 18.28.228.158 3.189.19.124 TCP 66.0
25.0 0.184559 101.96.108.245 3.189.19.124 TCP 66.0
26.0 0.185073 101.96.108.245 3.189.19.124 TCP 66.0
27.0 0.192846 101.96.108.245 3.189.19.124 TLSv1.2 805.0
28.0 0.19291 177.174.162.63 131.157.50.23 TCP 66.0
29.0 0.211526 18.28.228.158 3.189.19.124 TCP 66.0
30.0 0.262157 101.96.108.245 3.189.19.124 TLSv1.2 129.0

Truncated to 30 rows

import org.apache.spark.sql._
import org.apache.spark.sql.functions._

// Round timestamps to two decimals (10 ms buckets) and aggregate the packets
// that share a bucket, source, destination and protocol
val truncData = sqlDF
  .select($"n", $"Source", $"Destination", round($"Time", 2).as("ts"), $"Protocol", $"Length")
  .groupBy($"ts", $"Source", $"Destination", $"Protocol")
  .agg(avg($"Length").as("len"), (avg("Length") / max($"Length")).as("local_anomalies"), count("*").as("count"))
  .sort($"ts")

truncData.show(5)

truncData.createOrReplaceTempView("anonymized_data")
+----+---------------+---------------+--------+-----+------------------+-----+
|  ts|         Source|    Destination|Protocol|  len|   local_anomalies|count|
+----+---------------+---------------+--------+-----+------------------+-----+
| 0.0|174.226.241.183|   95.155.84.47|     STP| 52.0|               1.0|    1|
|0.14| 177.174.162.63|  113.26.139.31|     DNS| 69.0|               1.0|    1|
|0.14| 177.174.162.63|  131.157.50.23| TLSv1.2|129.0|               1.0|    2|
|0.15| 177.174.162.63|234.164.133.186|     TCP| 72.0|0.9230769230769231|    2|
|0.15|   108.5.57.212|   3.189.19.124|     DNS| 85.0|               1.0|    1|
+----+---------------+---------------+--------+-----+------------------+-----+
only showing top 5 rows

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
truncData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ts: double, Source: string ... 5 more fields]
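
To make the aggregation above concrete, here is a small plain-Python sketch (not the Spark code itself) that applies the same steps to packets 6, 8 and 12 of the raw data: round the timestamp to two decimals, group by timestamp, source, destination and protocol, then compute the average length, the ratio of average to maximum length, and the packet count. The dictionary layout is an assumption made for illustration.

```python
from collections import defaultdict

# (Time, Source, Destination, Protocol, Length) for packets 6, 8 and 12 of the raw data
rows = [
    (0.147182, "177.174.162.63", "234.164.133.186", "TCP", 78.0),
    (0.151544, "177.174.162.63", "234.164.133.186", "TCP", 66.0),
    (0.156426, "177.174.162.63", "234.164.133.186", "TCP", 66.0),
]

# Group packet lengths by (rounded timestamp, source, destination, protocol)
groups = defaultdict(list)
for time, src, dst, proto, length in rows:
    groups[(round(time, 2), src, dst, proto)].append(length)

# Per group: average length, average / maximum length, and packet count
summary = {}
for key, lengths in groups.items():
    avg_len = sum(lengths) / len(lengths)
    summary[key] = {
        "len": avg_len,
        "local_anomalies": avg_len / max(lengths),
        "count": len(lengths),
    }

for key, stats in sorted(summary.items()):
    print(key, stats)
```

Packets 6 and 8 fall into the 0.15 bucket, which reproduces the TCP row with len 72.0 and count 2 in the output above. Packet 12 rounds to 0.16 and forms its own group in this toy input.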
import org.graphframes._

val v = truncData.select($"Source".as("id"), $"Source".as("src")).where("count > 10")
v.show()

val e = truncData.select($"Source".as("src"), $"Destination".as("dst"), $"Protocol", $"count").where("count > 10")
e.show()

val g = GraphFrame(v, e)

val gE= g.edges.select($"src", $"dst".as("dest"), $"count")
display(gE)
src dest count
177.174.162.63 223.1.230.140 12.0
108.5.57.212 246.42.223.127 14.0
177.174.162.63 196.141.158.131 20.0
9.9.252.254 3.189.19.124 20.0
177.174.162.63 196.141.158.131 19.0
9.9.252.254 3.189.19.124 19.0
9.9.252.254 3.189.19.124 19.0
177.174.162.63 196.141.158.131 19.0
9.9.252.254 3.189.19.124 19.0
177.174.162.63 196.141.158.131 19.0
9.9.252.254 3.189.19.124 19.0
177.174.162.63 196.141.158.131 19.0
9.9.252.254 3.189.19.124 19.0
177.174.162.63 196.141.158.131 19.0
177.174.162.63 196.141.158.131 19.0
9.9.252.254 3.189.19.124 19.0
9.9.252.254 3.189.19.124 19.0
177.174.162.63 196.141.158.131 19.0
108.5.57.212 246.42.223.127 11.0
177.174.162.63 196.141.158.131 19.0
9.9.252.254 3.189.19.124 19.0
9.9.252.254 3.189.19.124 18.0
177.174.162.63 196.141.158.131 18.0
9.9.252.254 3.189.19.124 19.0
177.174.162.63 196.141.158.131 19.0
177.174.162.63 196.141.158.131 20.0
9.9.252.254 3.189.19.124 20.0
9.9.252.254 3.189.19.124 20.0
177.174.162.63 196.141.158.131 20.0
177.174.162.63 196.141.158.131 19.0

Truncated to 30 rows

Warning: classes defined within packages cannot be redefined without a cluster restart.
Compilation successful.
d3.graphs.force(
  height = 1680,
  width = 1280,
  clicks = gE.as[d3.Edge])
display(g.inDegrees.orderBy($"inDegree".desc))
id inDegree
3.189.19.124 2938.0
101.107.251.253 408.0
196.141.158.131 235.0
113.26.139.31 224.0
255.171.74.61 170.0
77.186.49.77 124.0
68.103.236.48 113.0
20.202.77.120 96.0
245.230.44.106 91.0
36.27.131.116 66.0
128.85.28.242 62.0
65.106.24.202 50.0
160.164.22.168 48.0
221.230.195.197 43.0
186.26.246.188 41.0
88.254.222.208 32.0
191.78.151.216 31.0
4.56.83.115 31.0
210.149.168.241 30.0
164.194.30.88 24.0
185.231.205.227 21.0
237.163.203.190 20.0
225.106.215.13 20.0
48.95.6.121 16.0
190.0.74.244 15.0
34.189.230.39 14.0
122.90.226.185 13.0
160.50.25.154 13.0
227.152.51.49 13.0
134.136.168.165 12.0

Truncated to 30 rows

display(g.outDegrees.orderBy($"outDegree".desc))
id outDegree
177.174.162.63 2153.0
155.96.68.95 415.0
6.131.29.230 303.0
9.9.252.254 281.0
108.5.57.212 258.0
252.104.57.94 208.0
199.20.94.26 162.0
167.24.21.68 133.0
254.6.133.244 95.0
152.176.154.244 94.0
173.224.30.161 92.0
174.158.93.19 80.0
40.3.131.78 78.0
137.163.106.242 76.0
166.180.146.227 73.0
233.28.220.75 72.0
238.84.79.228 53.0
123.68.159.165 48.0
100.151.1.32 40.0
102.87.130.73 35.0
138.109.33.123 33.0
89.229.179.92 27.0
96.61.208.226 23.0
43.145.231.203 21.0
246.125.196.252 21.0
168.155.119.20 20.0
17.13.167.242 20.0
242.173.91.12 18.0
127.206.74.216 16.0
132.77.224.252 15.0

Truncated to 30 rows

Clustering

Pre-processing of data

We preprocessed the data logged from Wireshark doing the following:

Setting up k-means clustering

  • 32 features (29 one-hot protocol columns plus len, ts and count)
  • Filtering out features that are not numeric, for example the source and destination addresses
displayHTML(frameIt("https://en.wikipedia.org/wiki/K-means_clustering",500))

import pandas as pd
from sklearn.preprocessing import StandardScaler

sampled = sqlContext.sql("SELECT * FROM anonymized_data").toPandas()

# Standardize each numeric feature to zero mean and unit variance.
# .values.reshape(-1, 1) gives the 2-D, single-feature array that scikit-learn expects
# (calling reshape directly on a pandas Series is not supported in current pandas).
for col in ['len', 'count']:
    column = sampled[col].values.reshape(-1, 1)  # one feature
    sampled[col] = StandardScaler().fit_transform(column).ravel()
/databricks/python/local/lib/python2.7/site-packages/sklearn/utils/validation.py:429: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, _DataConversionWarning)
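
What the scaler does to a single column can be sketched in plain Python: subtract the column mean and divide by the population standard deviation, which is the convention StandardScaler uses. The helper name standardize is hypothetical, and the sample values are the len column of the five rows printed by truncData.show(5).

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return (x - mean) / std for every x, using the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:  # constant column: centre it and leave the scale alone
        return [x - mean for x in values]
    return [(x - mean) / std for x in values]

lengths = [52.0, 69.0, 129.0, 72.0, 85.0]  # len column of the five rows shown earlier
scaled = standardize(lengths)

print(scaled)
print("mean:", fmean(scaled), "std:", pstdev(scaled))
```

After scaling, the column has mean 0 and standard deviation 1 up to floating-point error, so len and count contribute on a comparable scale to the Euclidean distances used by k-means.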
df_count = sampled['count']
df_length = sampled['len']
df_proto = pd.get_dummies(sampled['Protocol'])
df_source = sampled['Source']
df_dest = sampled['Destination']
df_ts = sampled['ts']

onehot = pd.concat([df_proto, df_source, df_length, df_dest, df_ts, df_count], axis=1)
onehotDF = sqlContext.createDataFrame(onehot)

sqlContext.sql("DROP TABLE IF EXISTS anonymized_data_onehot")
onehotDF.write.saveAsTable('anonymized_data_onehot')
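
The one-hot step performed by pd.get_dummies can be illustrated without pandas. The sketch below is a simplified stand-in, not pandas' implementation: it creates one 0/1 indicator per distinct protocol, using the Protocol values of the five rows printed by truncData.show(5). The helper name one_hot is hypothetical.

```python
def one_hot(values):
    """Map each categorical value to a dict of 0/1 indicator columns."""
    categories = sorted(set(values))
    return [{cat: int(value == cat) for cat in categories} for value in values]

protocols = ["STP", "DNS", "TLSv1.2", "TCP", "DNS"]  # Protocol column of the five rows shown earlier
encoded = one_hot(protocols)

for protocol, row in zip(protocols, encoded):
    print(protocol, row)
```

Each row has exactly one indicator set to 1, so every distinct protocol in the data adds one dimension to the feature vector.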
// count is a Double: it was standardized in the Python cell above, so its values are no
// longer integers (parsing them as Long would fail and silently fall back to 0)
case class Packet(AJP13: Double, ALLJOYN_NS: Double, ARP: Double, DHCP: Double, DNS: Double, HTTP: Double, HTTP_XML: Double, ICMP: Double, ICMPv6: Double, IGMPv1: Double, IGMPv2: Double, IGMPv3: Double, MDNS: Double, NBNS: Double, NTP: Double, OCSP: Double, QUIC: Double, RTCP: Double, SIP: Double, SNMP: Double, SSDP: Double, STP: Double, STUN: Double, TCP: Double, TFTP: Double, TLSv1: Double, TLSv1_2: Double, UDP: Double, XMPP_XML: Double, Source: String, len: Double, Destination: String, ts: Double,
 count: Double)

def parseRow(row: org.apache.spark.sql.Row): Packet = {

  // Parse any value as a Double, falling back to 0.0 when it cannot be parsed
  def toDouble(value: Any): Double = {
    try {
       value.toString.toDouble
    } catch {
      case e: Exception => 0.0
    }
  }

  // Columns 0-28 are the one-hot protocol indicators; 29 and 31 are the address strings
  Packet(toDouble(row(0)), toDouble(row(1)), toDouble(row(2)), toDouble(row(3)), toDouble(row(4)), toDouble(row(5)), toDouble(row(6)), toDouble(row(7)), toDouble(row(8)), toDouble(row(9)), toDouble(row(10)), toDouble(row(11)), toDouble(row(12)), toDouble(row(13)), toDouble(row(14)), toDouble(row(15)), toDouble(row(16)), toDouble(row(17)), toDouble(row(18)), toDouble(row(19)), toDouble(row(20)), toDouble(row(21)), toDouble(row(22)), toDouble(row(23)), toDouble(row(24)), toDouble(row(25)), toDouble(row(26)), toDouble(row(27)), toDouble(row(28)), row(29).toString, toDouble(row(30)), row(31).toString, toDouble(row(32)), toDouble(row(33)))
}

val df = table("anonymized_data_onehot").map(parseRow).toDF
df.createOrReplaceTempView("packetsView")
defined class Packet
parseRow: (row: org.apache.spark.sql.Row)Packet
df: org.apache.spark.sql.DataFrame = [AJP13: double, ALLJOYN_NS: double ... 32 more fields]
import org.apache.spark.ml.feature.VectorAssembler

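// Columns to exclude from the feature vector (non-numeric addresses).
// Note: list is a single String, so contains below is a substring check;
// with these column names it excludes exactly Source and Destination.
val list = ("Source, Destination")
val cols = df.columns

val filtered = cols.filter { el =>
  !list.contains(el)
}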

val trainingData = new VectorAssembler()
                      .setInputCols(filtered)
                      .setOutputCol("features")
                      .transform(table("packetsView"))
import org.apache.spark.ml.feature.VectorAssembler
list: String = Source, Destination
cols: Array[String] = Array(AJP13, ALLJOYN_NS, ARP, DHCP, DNS, HTTP, HTTP_XML, ICMP, ICMPv6, IGMPv1, IGMPv2, IGMPv3, MDNS, NBNS, NTP, OCSP, QUIC, RTCP, SIP, SNMP, SSDP, STP, STUN, TCP, TFTP, TLSv1, TLSv1_2, UDP, XMPP_XML, Source, len, Destination, ts, count)
filtered: Array[String] = Array(AJP13, ALLJOYN_NS, ARP, DHCP, DNS, HTTP, HTTP_XML, ICMP, ICMPv6, IGMPv1, IGMPv2, IGMPv3, MDNS, NBNS, NTP, OCSP, QUIC, RTCP, SIP, SNMP, SSDP, STP, STUN, TCP, TFTP, TLSv1, TLSv1_2, UDP, XMPP_XML, len, ts, count)
trainingData: org.apache.spark.sql.DataFrame = [AJP13: double, ALLJOYN_NS: double ... 33 more fields]
import org.apache.spark.ml.clustering.KMeans

val model = new KMeans().setK(23).fit(trainingData)
val modelTransformed = model.transform(trainingData)
import org.apache.spark.ml.clustering.KMeans
model: org.apache.spark.ml.clustering.KMeansModel = kmeans_81ed7de82229
modelTransformed: org.apache.spark.sql.DataFrame = [AJP13: double, ALLJOYN_NS: double ... 34 more fields]
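
For readers unfamiliar with what the KMeans fit does, here is a minimal plain-Python sketch of Lloyd's algorithm on made-up 2-D points. It is not Spark's implementation: the initial centroids are fixed by hand here for reproducibility, whereas Spark ML chooses them itself. The last step, scoring each point by its distance to the nearest centroid, is one common way to turn clusters into anomaly scores and is an illustration, not something the cells above compute.

```python
import math

def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate point assignment and centroid update."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)[0]].append(p)
        # New centroid = mean of its members; an empty cluster keeps its old centroid
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids

# Made-up toy data: two tight groups of points
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])

# Distance to the nearest centroid as a simple anomaly score
scores = {p: nearest(p, centroids)[1] for p in points + [(5.0, 5.0)]}
for point, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(point, round(score, 3))
```

The point (5.0, 5.0) lies far from both centroids and therefore gets the highest score, which is the kind of event one would flag for closer inspection.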

Improvements and future work

In this section we present possible improvements to our project and future work that could build on it to increase its usability and value.

Dimensionality improvements

We clustered our network data with k-means, which uses Euclidean distance. Models based on Euclidean distance are susceptible to the curse of dimensionality. With the 32 feature columns we got after one-hot encoding the protocol column in the original dataset (29 protocol indicators plus len, ts and count), we are likely suffering from this high dimensionality. To improve the clustering one could use an algorithm that doesn't rely on Euclidean distance (or on other distance measures that work poorly in high dimensions). Another possible solution is to use dimensionality reduction and try to retain as much information as possible with fewer features. This could be done using techniques such as PCA or LDA.
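
The effect can be illustrated with a small experiment on synthetic data. The sketch below draws uniformly random points (an assumption; the real packet features are not uniform) and measures how much farther the farthest point is than the nearest one, relative to the nearest. The function name relative_contrast and the choice of 200 points are hypothetical; 32 is the number of input columns in the filtered array printed earlier.

```python
import math
import random

def relative_contrast(dimensions, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dimensions)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)

low = relative_contrast(2)    # two features
high = relative_contrast(32)  # as many features as the clustering input above

print("contrast in 2 dimensions: ", low)
print("contrast in 32 dimensions:", high)
```

In higher dimensions the nearest and farthest points end up at increasingly similar distances, so a distance-based cluster assignment carries less information.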

Parse packet contents

We didn't parse the packet information other than IP addresses, packet lengths and protocol. To gain further insights one could parse the additional packet contents and look for sensitive items, including usernames, passwords etc.

Graph Analysis

One could continue to analyze the graph representation of the data. Examples of this could include looking for complex relationships in the graph using GraphFrames motifs.
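
As a plain-Python illustration of what a motif query looks for, the sketch below finds pairs of devices that send packets to each other, the pattern that GraphFrames expresses with the motif string "(a)-[e]->(b); (b)-[e2]->(a)". The edge list uses made-up documentation-range addresses, not values from our data.

```python
# Directed edges (src, dst); the addresses are made-up examples
edges = {
    ("192.0.2.1", "192.0.2.2"),
    ("192.0.2.2", "192.0.2.1"),
    ("192.0.2.1", "192.0.2.3"),
    ("192.0.2.4", "192.0.2.3"),
}

# Pairs of devices that send to each other: a -> b and b -> a.
# The a < b condition reports each pair once instead of twice.
reciprocal = sorted((a, b) for a, b in edges if a < b and (b, a) in edges)

print(reciprocal)
```

On the real graph, similar patterns (for example one source contacting many destinations that never answer) could be used to narrow down suspicious traffic.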

Real time network analysis using Spark streaming

To make the project even more useful in a real environment, one could use Spark Streaming k-means to cluster network traffic in real time and then perform anomaly detection in real time as well. An example approach of this can be seen in the following video: https://www.youtube.com/watch?v=i8___3GdxlQ
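
The core idea of a streaming variant can be sketched as a decayed centroid update: each new batch of points pulls its centroid towards the batch mean, and a decay factor controls how quickly old traffic is forgotten. The function below is a simplified illustration under that assumption, not Spark's StreamingKMeans implementation; the name update_centroid and its bookkeeping are hypothetical.

```python
def update_centroid(centroid, weight, batch, decay=1.0):
    """Blend an existing centroid with the mean of a new batch of points.

    decay=1.0 weights all past points equally; decay=0.0 forgets the past.
    Returns the updated centroid and its updated weight.
    """
    if not batch:
        return centroid, weight * decay
    batch_mean = [sum(axis) / len(batch) for axis in zip(*batch)]
    old_weight = weight * decay
    new_weight = old_weight + len(batch)
    blended = [
        (c * old_weight + m * len(batch)) / new_weight
        for c, m in zip(centroid, batch_mean)
    ]
    return blended, new_weight

# Two made-up batches assigned to the same cluster
centroid, weight = [0.0, 0.0], 0.0
for batch in ([(1.0, 1.0), (3.0, 3.0)], [(5.0, 5.0)]):
    centroid, weight = update_centroid(centroid, weight, batch)

print(centroid, weight)
```

With decay=1.0 the centroid ends up at the mean of all points seen so far. A smaller decay lets the clusters follow changes in normal traffic, at the cost of adapting to slow attacks as well.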

Additional continuations of this could include giving suggestions for actions to perform when detecting malicious activity.