Pcap Database

Recently I've been playing around with what I call "hobbyist cyber security". Beyond the usual precautions one might take against malware and network intrusion, such as antivirus software, password managers, and keeping the firmware on routers and IoT devices up to date, I've also been training models to detect beacons (periodic communications between malware and command and control servers) and DGAs (domain generation algorithms).

Unfortunately, datasets for building these kinds of models are scarce. There are a few interesting sets like the DARPA intrusion detection dataset, but they are few and far between. This data is rare for good reason! Getting access to someone's network traffic gives you a lot of information about how they work and what they do. If the data isn't properly cleaned you can leak sensitive data or private communications.

To fight this I've started collecting my own data to work on. How can I do this? Well, if I were to take the most naive approach, I could infect different computers with a variety of malware and collect the internet traffic. This would give me labeled data to do machine learning on!

I'm actually doing something more sophisticated than that, but it's not worth going into here. In this post I'm going to walk through one simple way to store the network traffic on a single device for later inspection.

We'll move through this process step by step:

  • What is a pcap file?
  • Using Wireshark/tshark
  • Streaming data in python
  • A simple python database
  • Putting it all together

What is a pcap file?

Network intrusion detection and infosec usually start and end with sniffing network traffic for unusual activity. This is often done using heuristic/rules based methods and trained human analysts. When combing through network traffic, the goal is to find 1) communications between malware and command and control servers, 2) malware jumping between devices on the network, 3) someone else spying on network traffic, and 4) unsecured communications.

At a basic level, this can be done by looking at packet capture (pcap) files. These are file records of network traffic. All of the little handshakes and file exchanges that happen behind the scenes when we use computers on a network are recorded here. To be clear, pcap files don't capture what is happening on a particular device, only the communications to and from that device.

To get a picture of the entire network we would want to look at pcap files collected from the network router(s). For this tutorial we will just look at the traffic to and from the device we are working on.

Using Wireshark/tshark

Actually capturing the traffic is easy. We'll use a standard tool in the industry: Wireshark. Wireshark has a nice graphical user interface (GUI), but we will only interact with Wireshark through the terminal for now.

You can download Wireshark from their website (linked above) or using a package manager like homebrew: brew cask install wireshark.

Tshark is the terminal "version" of Wireshark. It should have been installed along with Wireshark. Let's first go to the terminal/shell/command line.

Run tshark -D to see the available network interfaces to monitor. You should see an option for monitoring Wifi. A screenshot of my terminal is shown below:

Looks like en0 is the designation for the wifi interface. This seems like a useful interface to monitor for suspicious traffic! Let's see what's going on with my wifi traffic.

Run tshark -i en0 in the terminal. Wow! A lot of text probably started streaming down your screen. Press ctrl-c to break the stream. That stream probably looked something like the screen capture below:

Each row in this stream is a piece of communication (either to or from our device). We haven't filtered this information down at all, so it's just a mess.

You can read tshark's manual to see all of the options tshark provides. For now I'll give a quick breakdown of useful commands. I'd also consider reading through these other posts: A quick tutorial on Tshark, Network Sniffing with Tshark, Protocol Numbers, Wireshark display filters.

We'll start very simple. Let's write out all of our data to a pcap file.

tshark -i en0 -w network_traffic.pcap

Now you probably aren't used to dealing with pcap files! Most data scientists deal with json, csv, and parquet file formats. Let's convert the file we saved to csv. When we do that we'll also filter down the content to a few fields we really care about.

tshark -r network_traffic.pcap \
    -T fields -e _ws.col.Protocol -e frame.time_epoch -e ip.src -e ip.dst \
    -e http.request.full_uri -e dns.qry.name -e dns.resp.name \
    -E header=y -E separator=, -E quote=d -E occurrence=f > network_traffic.csv

Let's break down this long command.

The first part -r network_traffic.pcap is really simple; we're just reading from the pcap file we saved a moment ago. -T fields lets tshark know that we're going to filter down to a select number of fields to write to our file. Every argument beginning with -e is a field that we will include in the output file. Every argument beginning with -E tells tshark how to format that output file. For example, -E separator=, indicates that our output file should use commas as its delimiter between fields.

We can also filter down by protocol, or even to a specific IP address, using the argument -Y. For example, we could filter down to only domain name service traffic using the argument -Y dns. We can also tell tshark to stop capturing after a set amount of time using the argument -a duration:10, where 10 is the number of seconds to capture packets for.
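Combining these options, a sketch of a capture that runs for 10 seconds and keeps only DNS traffic might look like the command below (the output filename is just an example; adjust the interface and fields to your setup).

tshark -i en0 -a duration:10 -Y dns \
    -T fields -e frame.time_epoch -e dns.qry.name \
    -E header=y -E separator=, -E quote=d -E occurrence=f > dns_traffic.csv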

Now you have a csv that you can use to train a model! Are we done? Nope. I think we can do a little better than just a csv.
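As a quick sanity check before we move on, here's a minimal sketch of loading that csv with pandas (assuming you have pandas installed). The column names come straight from the -e fields above, since we wrote a header row with -E header=y.

import pandas as pd

# read the csv produced by the tshark command above
df = pd.read_csv("network_traffic.csv")

# peek at the protocols we captured
print(df["_ws.col.Protocol"].value_counts())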

Streaming data in python

Ok, up until now we've simply captured traffic over a certain period of time. What happens if we want packet capture to be always on? We could set up a cron job to launch tshark every 10 minutes with a 10 minute run timer. We would then get a new file every 10 minutes... that sounds ugly.

Why not set up a streaming service to log network traffic as it comes in? Yes, let's do that.

We'll start with an understanding of pipes. A pipe is a redirection to send the output of one command as the input to another command. Let's see an example!

Let's say we want to know the version of numpy that we are using in the current python environment. We could run the command pip freeze to list all packages managed by pip in the current environment and read through it until we see numpy. That seems tedious. Of course, if we already had that list we could leverage regular expressions, using grep, to quickly search for the numpy entry! Let's pipe the output of pip freeze to grep numpy to find the version of numpy we are using.

We can run terminal commands in Jupyter notebooks using the magic !.

In [1]:
! pip freeze | grep numpy
numpy==1.15.4

See how easy life is with pipes? Let's try piping the output of tshark to grep and see what happens. We'll stream the protocol and source IP of 20 packets of network traffic and then use grep to keep only the traffic using the TCP protocol.

In [2]:
! tshark -i en0 -c 20 -T fields -e _ws.col.Protocol -e ip.src | grep TCP
Capturing on 'Wi-Fi'
20 
6 packets dropped
TCP	192.168.1.3
TCP	192.168.1.3
TCP	192.168.1.3
TCP	52.200.116.91
TCP	192.168.1.3
TCP	192.168.1.3
TCP	192.168.1.3
TCP	52.200.116.91
TCP	52.200.116.91
TCP	192.168.1.3
TCP	52.200.116.91
TCP	192.168.1.3
TCP	192.168.1.3
TCP	52.200.116.91
TCP	52.200.116.91

That took a while! You probably had to wait to see the output of the grep command (I waited a couple of seconds before anything displayed). Why?

It turns out that when a command writes to a pipe instead of a terminal, its output is typically block-buffered: it gets held back and flushed in large chunks rather than line by line. We don't want this 'buffering' behavior since we are interested in setting up a streaming service using pipes.

There are a few ways around this, but many are dependent on the operating system you are working on. Since I need a solution that works on a variety of systems like Ubuntu, Red Hat, and macOS, I like to use expect (it should also work on Windows).

Install expect on Mac using homebrew: brew install expect. Mac has some issues with the install path for expect, so export the correct path to the executable: export TCLLIBPATH="/usr/local/lib".

Let's try rerunning the pipe from before, but this time we'll add the expect command unbuffer.

In [3]:
! unbuffer tshark -i en0 -c 20 -T fields -e _ws.col.Protocol -e ip.src | grep TCP
TCP	192.168.1.3
TCP	34.232.28.44
TCP	192.168.1.3
TCP	34.232.28.44
TCP	192.168.1.3
TCP	45.57.62.158
TCP	192.168.1.3
TCP	45.57.62.158
TCP	192.168.1.3
TCP	192.168.1.3

Awesome! Now we're streaming using pipes. Before we move on to the next section, let's connect our stream to a python script. Create a file named stream.py with the following code:

import sys

for line in sys.stdin:
    print(line)

This will take in a stream of data from standard input and print it as it comes in. Python makes this really easy to do, and the syntax is almost identical to how files are read in and handled.
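One small note: each line arriving on standard input still ends in a newline, and print adds another, which is why you'll see blank lines between rows in the output below. If that bothers you, a tiny variant (same idea) strips the trailing newline first; we'll stick with the simple version for the run below.

import sys

# same as stream.py, but strip the trailing newline before printing
for line in sys.stdin:
    print(line.rstrip("\n"))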

Let's try out our script.

In [4]:
! unbuffer tshark -i en0 -c 20 -T fields -e _ws.col.Protocol -e ip.src | python stream.py
Capturing on 'Wi-Fi'

UDP	192.168.1.3

ARP	

UDP	192.168.1.3

ARP	

TCP	192.168.1.3

TCP	159.69.211.233

TLSv1.2	192.168.1.3

TLSv1.2	45.57.62.158

TCP	192.168.1.3

TCP	45.57.62.158

TCP	45.57.62.158

TCP	192.168.1.3

TCP	45.57.62.158

TCP	45.57.62.158

TCP	45.57.62.158

TCP	192.168.1.3

TCP	192.168.1.3

TCP	45.57.62.158

TCP	45.57.62.158

TCP	45.57.62.158

84 packets dropped

20 packets captured

And that's it! We're now streaming data.

A simple python database

Ok, we're finally approaching third base. The last piece of our puzzle is where we are going to store our stream of network traffic logs. For a large company this is a HUGE engineering task. I have friends that build out and maintain these large scale cyber security databases, and they are a nightmare.

Lucky for us, we only need a toy database to store a tiny amount of data for analysis and model building. For this use case we'll use sqlite.

We'll create a python module to handle all of the database management for us. Copy the code below and put it into a file called database.py.

import sqlite3
from sqlite3 import Error

def create_connection(db_file):
    """ create a database connection to a SQLite database """
    conn = sqlite3.connect(db_file)
    # conn = sqlite3.connect(':memory:')
    print(sqlite3.version)
    return conn

def create_table(conn, create_table_sql):
    """ create a table from the create_table_sql statement
    :param conn: Connection object
    :param create_table_sql: a CREATE TABLE statement
    :return:
    """
    try:
        c = conn.cursor()
        c.execute(create_table_sql)
    except Error as e:
        print(e)

def create_entry(conn, table_name, fields, entry):
    """
    Insert a new row into the given table
    :param conn: Connection object
    :param table_name: name of the table to insert into
    :param fields: list of column names
    :param entry: tuple of values, one per field
    :return: id of the new row
    """
    column_names = ", ".join(fields)
    values = ", ".join(["?"]*len(fields))
    sql = """ INSERT INTO {}({})
              VALUES({}) """.format(table_name, column_names, values)
    cur = conn.cursor()
    cur.execute(sql, entry)
    conn.commit()  # commit so the new row is actually written to the database file
    return cur.lastrowid

def select_all(conn, table_name):
    """
    Query and print all rows in the given table
    :param conn: the Connection object
    :param table_name: name of the table to read from
    :return:
    """
    cur = conn.cursor()
    cur.execute("SELECT * FROM {}".format(table_name))

    rows = cur.fetchall()

    for row in rows:
        print(row)

Let's walk through this code and understand what it's doing. I'm not going to spend time explaining the SQL syntax. If you are unfamiliar with SQL database queries I would recommend looking up a short tutorial. SQL is very simple and easy to pick up!

First we have a function to connect to our database. This function takes in a path to the database file and sets up a connection to it. This connection is our way to query the database and/or add to it. If no file exists at that path then a new database file is created.

def create_connection(db_file):
    """ create a database connection to a SQLite database """
    conn = sqlite3.connect(db_file)
    # conn = sqlite3.connect(':memory:')
    print(sqlite3.version)
    return conn

Next we have a function to add tables to the database. We will only need one table for our use case. However, if we were creating a database for a small web app we might have multiple tables to 1) manage user credentials, 2) store cookies/sessions, and 3) store data that users upload.

def create_table(conn, create_table_sql):
    """ create a table from the create_table_sql statement
    :param conn: Connection object
    :param create_table_sql: a CREATE TABLE statement
    :return:
    """
    try:
        c = conn.cursor()
        c.execute(create_table_sql)
    except Error as e:
        print(e)
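As a quick illustration (assuming these functions are imported from database.py), creating a table might look like the sketch below. The schema mirrors the one we'll use in the final script at the end of this post.

conn = create_connection("./network_data.db")

create_table(conn, """
CREATE TABLE IF NOT EXISTS pcap_data (
    id integer PRIMARY KEY,
    protocol text NOT NULL,
    epoch_time text NOT NULL
);
""")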

Now for the good stuff. Here we add entries to the table we created. Note that we commit after each insert so the new rows actually get persisted to the database file.

def create_entry(conn, table_name, fields, entry):
    """
    Insert a new row into the given table
    :param conn: Connection object
    :param table_name: name of the table to insert into
    :param fields: list of column names
    :param entry: tuple of values, one per field
    :return: id of the new row
    """
    column_names = ", ".join(fields)
    values = ", ".join(["?"]*len(fields))
    sql = """ INSERT INTO {}({})
              VALUES({}) """.format(table_name, column_names, values)
    cur = conn.cursor()
    cur.execute(sql, entry)
    conn.commit()  # commit so the new row is actually written to the database file
    return cur.lastrowid
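Continuing the sketch from earlier, inserting a single row (the values here are just illustrative) would look something like this:

fields = ["protocol", "epoch_time"]
entry = ("TCP", "1549157912.653421000")  # illustrative values
row_id = create_entry(conn, "pcap_data", fields, entry)  # returns the id of the new row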

Finally, we can pour out all of the entries in our database to look at them. You can modify the SQL query here to select only the columns or rows you care about. For example, if you find that a particular IP address has been submitting rejected password attempts to your computer over ssh (a warning sign that someone is trying to brute force their way onto your computer), you can query your database to see whether that IP address has a history of communicating with your computer (hackers often do reconnaissance like port scanning before an attack).

def select_all(conn, table_name):
    """
    Query and print all rows in the given table
    :param conn: the Connection object
    :param table_name: name of the table to read from
    :return:
    """
    cur = conn.cursor()
    cur.execute("SELECT * FROM {}".format(table_name))

    rows = cur.fetchall()

    for row in rows:
        print(row)
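As a hypothetical sketch of that kind of filtered query, assuming your table also stored an ip_src column (the table in the final script below only keeps protocol and epoch_time), you could add something like:

def select_by_src_ip(conn, table_name, ip_address):
    """ print every row whose source ip matches ip_address
        (assumes the table has an ip_src column) """
    cur = conn.cursor()
    cur.execute("SELECT * FROM {} WHERE ip_src = ?".format(table_name), (ip_address,))
    for row in cur.fetchall():
        print(row)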

Putting it all together

Ok, let's create our final master script to start logging network traffic. Copy the following code into a script called network_capture.py.

import sys
import database

conn = database.create_connection("./network_data.db")

table_name = "pcap_data"

sql_str = """
CREATE TABLE IF NOT EXISTS {} (
    id integer PRIMARY KEY,
    protocol text NOT NULL,
    epoch_time text NOT NULL
);
""".format(table_name)

database.create_table(conn, sql_str)

for row in sys.stdin:

    if row != "Capturing on 'Wi-Fi'\n" and "packets" not in row:
        entry = row.replace("\n","").split(",")
        fields = ['protocol', 'epoch_time']
        entry_id = database.create_entry(conn, table_name, fields, entry)

database.select_all(conn, table_name)

This is a pretty simple piece of code. We create a database file (or establish a connection to an existing one). We then create a table in that database. We write each line that tshark sends us to the database. When the stream ends (meaning an end of file is passed through the pipe), we print out the entire contents of our database.

Below we run that command. We tell tshark to stop after capturing 20 packets (otherwise it will go on forever and no EOF will be passed through the pipe).

In [5]:
! unbuffer tshark -i en0 -c 20 -T fields -e _ws.col.Protocol -e frame.time_epoch -E separator=, | python network_capture.py
2.6.0
(1, 'ARP', '1549157912.653421000')
(2, 'TCP', '1549157913.300935000')
(3, 'TCP', '1549157913.424068000')
(4, 'TLSv1.2', '1549157913.818130000')
(5, 'TCP', '1549157913.818228000')
(6, 'TLSv1.2', '1549157913.818384000')
(7, 'TCP', '1549157913.836271000')
(8, 'TLSv1.2', '1549157915.370844000')
(9, 'TLSv1.2', '1549157915.390536000')
(10, 'TCP', '1549157915.390612000')
(11, 'TCP', '1549157915.392205000')
(12, 'TCP', '1549157915.395092000')
(13, 'TCP', '1549157915.395126000')
(14, 'TCP', '1549157915.395159000')
(15, 'TCP', '1549157915.395200000')
(16, 'TCP', '1549157915.395752000')
(17, 'TCP', '1549157915.401887000')
(18, 'TCP', '1549157915.401892000')
(19, 'TCP', '1549157915.401893000')
(20, 'TCP', '1549157915.401894000')

If you want to see the data as it streams in, simply add a print statement below the line for row in sys.stdin: to display each row.

Parting Words

We've gone over a super simple version of how you can stream network traffic logs into a database for analytics and model building. The code was bare bones and leaves room for a ton of improvement. For example, we might want to stream the data to a cloud database like DynamoDB (cheap and easy to use, which makes it great for pet projects). We might also want traffic from the entire network, which would mean forwarding logs from your router to a device running this capture script.

I'll leave these possible improvements to you! I hope you enjoyed this post :)