This article is for Linux Administrators who are trying to demystify eBPF.
Pre-requisites
This article assumes that the reader has a good understanding of Linux networking and is familiar with packet tracing using tcpdump.
Some of the internals were intentionally excluded to simplify the topic.
The Berkeley Packet Filter (BPF)
The tcpdump utility is a special tool in Linux, and first, we will discuss the basics of how tcpdump works.
Let’s take the scenario where you want to observe all ARP packets coming to the network interface of a Linux system. The packet first lands in the network device hardware and is later placed in an RX (receive) queue in the kernel.
For a user to examine the contents of a packet, that packet needs to be copied from kernel space to user space. Each of the packets then needs to be filtered based on its type; here, the ARP type.
Switching the CPU between kernel space and user space to copy every packet is inefficient and will affect system performance.
So how can we filter packets on the way, within kernel space, and copy only the matching packets to user space?
Here comes the BPF, or Berkeley Packet Filter.
The BPF virtual machine is a pseudo VM inside the Linux kernel. For the sake of simplicity, you can consider this as a custom module loaded to the kernel.
To make the concept easier to understand, you can think of the BPF VM as something like the JVM (Java Virtual Machine).
The BPF VM supports a limited set of instructions and there are many restrictions to the usage as well.
Below are the registers in the BPF VM (or pseudo-machine):
- An accumulator [A] where the contents of the packet get loaded.
- An index register [X].
- A scratch memory area.
- An implicit Program Counter.
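As a rough mental model only, the register set above can be sketched as a small Python data class. The class name `BpfState`, its field names, and the `load_a` helper are hypothetical and exist only for illustration; the 16-slot scratch size matches what classic BPF provides.

```python
from dataclasses import dataclass, field

SCRATCH_WORDS = 16  # classic BPF provides 16 scratch slots of 32 bits each


@dataclass
class BpfState:
    """Hypothetical model of the classic BPF pseudo-machine state."""
    a: int = 0    # accumulator [A]: packet data is loaded here
    x: int = 0    # index register [X]
    pc: int = 0   # implicit program counter: index of the next instruction
    mem: list = field(default_factory=lambda: [0] * SCRATCH_WORDS)  # scratch memory

    def load_a(self, value: int) -> None:
        # Registers are 32 bits wide, so keep only the low 32 bits.
        self.a = value & 0xFFFFFFFF


state = BpfState()
state.load_a(0x0806)
print(hex(state.a), state.x, state.pc, len(state.mem))
```

The real VM lives inside the kernel and is not written this way; the sketch only shows how little state a filter program has to work with.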
The filters we pass to the tcpdump command are converted into “byte code” and then injected directly into the kernel. (More about byte code comes later in this article.)
The load instructions load the packet data into the accumulator, and then we can examine the packets in the BPF VM.
Let’s examine the code generated by the tcpdump command that filters the ARP packets coming to interface ens33.
[root@localhost ~]# tcpdump -i ens33 arp -d
(000) ldh [12]
(001) jeq #0x806 jt 2 jf 3
(002) ret #262144
(003) ret #0
[root@localhost ~]#
Explanation
(000) ldh - Load a half word (16 bits) from offset 12 of the packet into the accumulator
(001) jeq - If the accumulator value is 0x806 (i.e. an ARP packet), jump to 2; otherwise jump to 3
(002) ret - Accept the packet and return up to 262144 bytes of it, i.e. the entire packet or [max snapshot length](https://github.com/the-tcpdump-group/tcpdump/blob/tcpdump-4.9/netdissect.h#L263)
(003) ret - Discard the packet (return 0 bytes)
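To make the control flow concrete, here is a minimal sketch of an interpreter for just these three instructions, written in Python. It is not how the kernel implements BPF; the function name `run_filter`, the tuple layout, and the sample frames are assumptions made for illustration.

```python
def run_filter(program, packet: bytes) -> int:
    """Run a tiny subset of classic BPF (ldh, jeq, ret) against one packet.

    Returns the number of bytes to keep; 0 means the packet is dropped.
    """
    a = 0   # accumulator [A]
    pc = 0  # program counter
    while True:
        op, *args = program[pc]
        pc += 1
        if op == "ldh":            # load a 16-bit half word at an absolute offset
            (offset,) = args
            if offset + 2 > len(packet):
                return 0           # out-of-bounds load: reject the packet
            a = int.from_bytes(packet[offset:offset + 2], "big")
        elif op == "jeq":          # compare A with a constant and branch
            value, jt, jf = args
            pc = jt if a == value else jf
        elif op == "ret":          # return the number of bytes to accept
            (length,) = args
            return length
        else:
            raise ValueError(f"unsupported instruction: {op}")


# The four instructions printed by `tcpdump -i ens33 arp -d`,
# with jump targets written as absolute instruction numbers.
ARP_FILTER = [
    ("ldh", 12),
    ("jeq", 0x806, 2, 3),
    ("ret", 262144),
    ("ret", 0),
]

# Made-up frames: broadcast destination, arbitrary source, then the EtherType.
arp_frame = bytes.fromhex("ffffffffffff" "000c29aabbcc" "0806") + bytes(28)
ipv4_frame = bytes.fromhex("ffffffffffff" "000c29aabbcc" "0800") + bytes(28)

print(run_filter(ARP_FILTER, arp_frame))   # 262144
print(run_filter(ARP_FILTER, ipv4_frame))  # 0
```

The ARP frame takes the "jt" branch and is accepted; the IPv4 frame takes the "jf" branch and is discarded.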
You can find more details of the inner workings of BPF in this Usenix paper.
So the above filter skips the destination and source MAC fields (6 bytes each) and then loads 16 bits from offset 12, which is the packet type field.
If those 16 bits at offset 12 are 0x806 (00001000 00000110), the packet is an ARP packet!
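The offset and the bit pattern can be checked with a few lines of Python. The header bytes below are made up for illustration; only the last two bytes (the type field) matter here.

```python
DEST_MAC_LEN = 6
SRC_MAC_LEN = 6
TYPE_OFFSET = DEST_MAC_LEN + SRC_MAC_LEN  # the type field starts right after both MACs

# Made-up 14-byte header: destination MAC, source MAC, EtherType.
frame = bytes.fromhex("ffffffffffff" "000c29aabbcc" "0806")

# Read the 16-bit half word at the type offset, most significant byte first.
half_word = int.from_bytes(frame[TYPE_OFFSET:TYPE_OFFSET + 2], "big")

print(TYPE_OFFSET)                 # 12
print(hex(half_word))              # 0x806
print(half_word)                   # 2054
print(format(half_word, "016b"))   # 0000100000000110
```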
A few points to note:
- The Ethernet type II frame has the format below:

+--------------------+--------------------+-------------+----------------+-----+
| 6 Byte Dest. Mac   | 6 Byte Source Mac  | 2 Byte Type | 46 - 1500 data | FCS |
+--------------------+--------------------+-------------+----------------+-----+

- Ethernet packets are big-endian (the most significant byte comes first).
- In a 32-bit system, a full word is 32 bits and a half word is 16 bits.
- 1 byte = 8 bits, 2 bytes = 16 bits.
- You can find the hex representation of the Ethernet type for each packet type in the IANA registry:

----------------------------------------------------------------------------------------------------------------------------------
Ethertype (decimal)  Ethertype (hex)  Exp. Ethernet (decimal)  Exp. Ethernet (octal)  Description                        Reference
----------------------------------------------------------------------------------------------------------------------------------
2054                 0806             -                        -                      Address Resolution Protocol (ARP)  [RFC7042]
----------------------------------------------------------------------------------------------------------------------------------
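The frame layout and the byte-order point can be illustrated with Python's `struct` module. This is a sketch under stated assumptions: the helper name `parse_ethernet_header`, the small `ETHERTYPES` lookup table, and the sample frame are all invented for this example.

```python
import struct


def parse_ethernet_header(frame: bytes):
    """Split the first 14 bytes of an Ethernet II frame into its three fields."""
    # "!" selects network byte order (big-endian);
    # 6s = 6 raw bytes, H = 16-bit unsigned integer.
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    return dst.hex(":"), src.hex(":"), ethertype


# Two well-known values from the IANA registry, enough for this example.
ETHERTYPES = {0x0806: "ARP", 0x0800: "IPv4"}

# Made-up frame: broadcast destination, arbitrary source, ARP type, zero payload.
frame = bytes.fromhex("ffffffffffff" "000c29aabbcc" "0806") + bytes(46)

dst, src, ethertype = parse_ethernet_header(frame)
print(dst, src, hex(ethertype), ETHERTYPES.get(ethertype, "unknown"))

# Reading the same two bytes as little-endian gives the wrong value.
(wrong,) = struct.unpack("<H", frame[12:14])
print(hex(wrong))  # 0x608
```

Swapping the byte order turns 0x0806 into 0x0608, which is why the type field has to be read as big-endian.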
The Byte Code
The BPF program we discussed above can be converted to byte code.
What is byte code?
Byte code is a compact set of instructions that is executed by a Virtual Machine (VM) rather than directly by the hardware.
In this case the VM is the BPF pseudo VM sitting inside the kernel.
User space can inject this bytecode into the BPF pseudo VM, and the kernel either interprets it or, when the BPF JIT compiler is enabled, converts it to architecture-dependent machine code which can be executed directly on the hardware.
We can generate the bytecode of the BPF instructions with tcpdump itself.
[root@localhost ~]# tcpdump -i ens33 arp -ddd
4
40 0 0 12
21 0 1 2054
6 0 0 262144
6 0 0 0
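Each line after the instruction count holds four numbers: the opcode, the jump-if-true offset, the jump-if-false offset, and a constant `k`. The sketch below decodes them back into the readable listing shown earlier. The `OPCODES` table covers only the three opcodes used by this filter, and the function name `decode` is hypothetical.

```python
DDD_OUTPUT = """\
4
40 0 0 12
21 0 1 2054
6 0 0 262144
6 0 0 0
"""

# Classic BPF opcode values, limited to the three used by this filter.
OPCODES = {
    0x28: "ldh",  # BPF_LD | BPF_H | BPF_ABS
    0x15: "jeq",  # BPF_JMP | BPF_JEQ | BPF_K
    0x06: "ret",  # BPF_RET | BPF_K
}


def decode(ddd_text: str):
    """Turn `tcpdump -ddd` output into (index, text) pairs."""
    lines = ddd_text.strip().splitlines()
    count = int(lines[0])  # first line: number of instructions
    rows = [tuple(map(int, line.split())) for line in lines[1:]]
    assert len(rows) == count
    listing = []
    for index, (code, jt, jf, k) in enumerate(rows):
        name = OPCODES[code]
        if name == "ldh":
            text = f"ldh [{k}]"
        elif name == "jeq":
            # jt and jf are offsets relative to the next instruction.
            text = f"jeq #{k:#x} jt {index + 1 + jt} jf {index + 1 + jf}"
        else:
            text = f"ret #{k}"
        listing.append((index, text))
    return listing


for index, text in decode(DDD_OUTPUT):
    print(f"({index:03d}) {text}")
```

Note that the bytecode stores jump targets as relative offsets (0 and 1), while the `-d` listing prints them as absolute instruction numbers (2 and 3).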
The bytecode can be injected into the system in different ways. The tcpdump utility has its own logic to do this operation.
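One common way to attach such a filter on Linux is the `SO_ATTACH_FILTER` socket option on a raw packet socket. The sketch below is an illustration under stated assumptions, not tcpdump's actual logic: the numeric value of `SO_ATTACH_FILTER` is hard-coded because Python's `socket` module does not export it, and the function names are hypothetical. Only the packing step runs here; opening the socket needs root privileges.

```python
import ctypes
import socket
import struct

SO_ATTACH_FILTER = 26  # assumed value (x86 and most other Linux architectures)
ETH_P_ALL = 0x0003     # receive every protocol on the packet socket

ARP_PROGRAM = [        # (code, jt, jf, k) rows taken from `tcpdump -ddd`
    (40, 0, 0, 12),
    (21, 0, 1, 2054),
    (6, 0, 0, 262144),
    (6, 0, 0, 0),
]


def pack_program(program) -> bytes:
    """Pack each row as a struct sock_filter: u16 code, u8 jt, u8 jf, u32 k."""
    return b"".join(struct.pack("HBBI", *row) for row in program)


def open_filtered_socket(interface: str, program) -> socket.socket:
    """Linux-only sketch; requires root (CAP_NET_RAW). Not called in this example."""
    blob = pack_program(program)
    buf = ctypes.create_string_buffer(blob)
    # struct sock_fprog { unsigned short len; struct sock_filter *filter; }
    fprog = struct.pack("HP", len(program), ctypes.addressof(buf))
    sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
    sock.setsockopt(socket.SOL_SOCKET, SO_ATTACH_FILTER, fprog)
    sock.bind((interface, 0))
    return sock


# Four instructions of 8 bytes each.
print(len(pack_program(ARP_PROGRAM)))  # 32
```

With root privileges, `open_filtered_socket("ens33", ARP_PROGRAM)` would return a socket whose `recv()` calls only see frames accepted by the filter.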
With that, we conclude Part 1 of eBPF for Linux Admins.
In the next part, we will discuss eBPF or extended BPF.
About the Author
Ansil Hameed Kunju
Ansil has more than a decade of experience in different IT domains. He is an expert in DevOps. His skill set includes Linux, GCP, AWS, VMware, Nutanix, Rancher, Docker, Git, Python, Golang, Kubernetes, Istio, Prometheus, Grafana, ArgoCD, Jenkins, StackStorm and other CNCF projects.