bpf —
Berkeley
Packet Filter
pseudo-device bpfilter
The Berkeley Packet Filter provides a raw interface to data link layers in a
protocol-independent fashion. All packets on the network, even those destined
for other hosts, are accessible through this mechanism.
The packet filter appears as a character special device,
/dev/bpf. After opening the device, the file
descriptor must be bound to a specific network interface with the
BIOCSETIF
ioctl(2). A given interface can
be shared between multiple listeners, and the filter underlying each
descriptor will see an identical packet stream.
Associated with each open instance of a
bpf file is
a user-settable packet filter. Whenever a packet is received by an interface,
all file descriptors listening on that interface apply their filter. Each
descriptor that accepts the packet receives its own copy.
Reads from these files return the next group of packets that have matched the
filter. To improve performance, the buffer passed to read must be the same
size as the buffers used internally by
bpf. This
size is returned by the
BIOCGBLEN
ioctl(2) and can be set with
BIOCSBLEN
. Note that an individual packet
larger than this size is necessarily truncated.
A packet can be sent out on the network by writing to a
bpf file descriptor. Each descriptor can also
have a user-settable filter for controlling the writes. Only packets matching
the filter are sent out of the interface. The writes are unbuffered, meaning
only one packet can be processed per write.
Once a descriptor is configured, further changes to the configuration can be
prevented using the
BIOCLOCK
ioctl(2).
The
ioctl(2) command codes below
are defined in
<net/bpf.h>.
All commands require these includes:
#include
<sys/types.h>
#include
<sys/time.h>
#include
<sys/ioctl.h>
#include
<net/bpf.h>
Additionally,
BIOCGETIF
and
BIOCSETIF
require
<sys/socket.h>
and
<net/if.h>.
The (third) argument to the
ioctl(2) call should be a pointer
to the type indicated.
BIOCGBLEN
u_int *
- Returns the required buffer length for reads on
bpf files.
BIOCSBLEN
u_int *
- Sets the buffer length for reads on
bpf files. The buffer must be set before the
file is attached to an interface with
BIOCSETIF
. If the requested buffer size
cannot be accommodated, the closest allowable size will be set and
returned in the argument. A read call will result in
EINVAL
if it is passed a buffer that is
not this size.
BIOCGDLT
u_int *
- Returns the type of the data link layer underlying the
attached interface.
EINVAL
is returned
if no interface has been specified. The device types, prefixed with
“DLT_”, are defined in
<net/bpf.h>.
BIOCGDLTLIST
struct bpf_dltlist *
- Returns an array of the available types of the data link
layer underlying the attached interface:
struct bpf_dltlist {
u_int bfl_len;
u_int *bfl_list;
};
The available types are returned in the array pointed to by the
bfl_list field while their length in
u_int is supplied to the
bfl_len field.
ENOMEM
is returned if there is not
enough buffer space and EFAULT
is
returned if a bad address is encountered. The
bfl_len field is modified on return to
indicate the actual length in u_int of
the array returned. If bfl_list is
NULL
, the
bfl_len field is set to indicate the
required length of the array in u_int.
BIOCSDLT
u_int *
- Changes the type of the data link layer underlying the
attached interface.
EINVAL
is returned
if no interface has been specified or the specified type is not available
for the interface.
BIOCPROMISC
- Forces the interface into promiscuous mode. All packets,
not just those destined for the local host, are processed. Since more than
one file can be listening on a given interface, a listener that opened its
interface non-promiscuously may receive packets promiscuously. This
problem can be remedied with an appropriate filter.
The interface remains in promiscuous mode until all files listening
promiscuously are closed.
BIOCFLUSH
- Flushes the buffer of incoming packets and resets the
statistics that are returned by
BIOCGSTATS
.
BIOCLOCK
- This ioctl is designed to prevent the security issues
associated with an open bpf descriptor in
unprivileged programs. Even with dropped privileges, an open
bpf descriptor can be abused by a rogue
program to listen on any interface on the system, send packets on these
interfaces if the descriptor was opened read-write and send signals to
arbitrary processes using the signaling mechanism of
bpf. By allowing only “known
safe” ioctls, the
BIOCLOCK
ioctl
prevents this abuse. The allowable ioctls are
BIOCFLUSH
,
BIOCGBLEN
,
BIOCGDIRFILT
,
BIOCGDLT
,
BIOCGDIRFILT
,
BIOCGDLTLIST
,
BIOCGETIF
,
BIOCGHDRCMPLT
,
BIOCGRSIG
,
BIOCGRTIMEOUT
,
BIOCGSTATS
,
BIOCIMMEDIATE
,
BIOCLOCK
,
BIOCSRTIMEOUT
,
BIOCVERSION
,
TIOCGPGRP
, and
FIONREAD
. Use of any other ioctl is
denied with error EPERM
. Once a
descriptor is locked, it is not possible to unlock it. A process with root
privileges is not affected by the lock.
A privileged program can open a bpf device,
drop privileges, set the interface, filters and modes on the descriptor,
and lock it. Once the descriptor is locked, the system is safe from
further abuse through the descriptor. Locking a descriptor does not
prevent writes. If the application does not need to send packets through
bpf, it can open the device read-only to
prevent writing. If sending packets is necessary, a write-filter can be
set before locking the descriptor to prevent arbitrary packets from being
sent out.
BIOCGETIF
struct ifreq *
- Returns the name of the hardware interface that the file is
listening on. The name is returned in the
ifr_name field of the
struct ifreq
. All other fields are undefined.
BIOCSETIF
struct ifreq *
- Sets the hardware interface associated with the file. This
command must be performed before any packets can be read. The device is
indicated by name using the ifr_name
field of the
struct ifreq
. Additionally, performs
the actions of BIOCFLUSH
.
BIOCSRTIMEOUT
struct timeval *
-
BIOCGRTIMEOUT
struct timeval *
- Sets or gets the read timeout parameter. The
timeval specifies the length of time to
wait before timing out on a read request. This parameter is initialized to
zero by open(2), indicating no
timeout.
BIOCGSTATS
struct bpf_stat *
- Returns the following structure of packet statistics:
struct bpf_stat {
u_int bs_recv;
u_int bs_drop;
};
The fields are:
-
-
- bs_recv
- Number of packets received by the descriptor since
opened or reset (including any buffered since the last read
call).
-
-
- bs_drop
- Number of packets which were accepted by the filter but
dropped by the kernel because of buffer overflows (i.e., the
application's reads aren't keeping up with the packet traffic).
BIOCIMMEDIATE
u_int *
- Enables or disables “immediate mode”, based
on the truth value of the argument. When immediate mode is enabled, reads
return immediately upon packet reception. Otherwise, a read will block
until either the kernel buffer becomes full or a timeout occurs. This is
useful for programs like
rarpd(8), which must respond
to messages in real time. The default for a new file is off.
BIOCSETF
struct bpf_program *
- Sets the filter program used by the kernel to discard
uninteresting packets. An array of instructions and its length are passed
in using the following structure:
struct bpf_program {
u_int bf_len;
struct bpf_insn *bf_insns;
};
The filter program is pointed to by the
bf_insns field, while its length in units
of struct bpf_insn
is given by the
bf_len field. Also, the actions of
BIOCFLUSH
are performed.
See section FILTER
MACHINE for an explanation of the filter language.
BIOCSETWF
struct bpf_program *
- Sets the filter program used by the kernel to filter the
packets written to the descriptor before the packets are sent out on the
network. See
BIOCSETF
for a description
of the filter program. This ioctl also acts as
BIOCFLUSH
.
Note that the filter operates on the packet data written to the descriptor.
If the “header complete” flag is not set, the kernel sets
the link-layer source address of the packet after filtering.
BIOCVERSION
struct bpf_version *
- Returns the major and minor version numbers of the filter
language currently recognized by the kernel. Before installing a filter,
applications must check that the current version is compatible with the
running kernel. Version numbers are compatible if the major numbers match
and the application minor is less than or equal to the kernel minor. The
kernel version number is returned in the following structure:
struct bpf_version {
u_short bv_major;
u_short bv_minor;
};
The current version numbers are given by
BPF_MAJOR_VERSION
and
BPF_MINOR_VERSION
from
<net/bpf.h>.
An incompatible filter may result in undefined behavior (most likely, an
error returned by ioctl(2) or
haphazard packet matching).
BIOCSRSIG
u_int *
-
BIOCGRSIG
u_int *
- Sets or gets the receive signal. This signal will be sent
to the process or process group specified by
FIOSETOWN
. It defaults to
SIGIO
.
BIOCSHDRCMPLT
u_int *
-
BIOCGHDRCMPLT
u_int *
- Sets or gets the status of the “header
complete” flag. Set to zero if the link level source address should
be filled in automatically by the interface output routine. Set to one if
the link level source address will be written, as provided, to the wire.
This flag is initialized to zero by default.
BIOCSFILDROP
u_int *
-
BIOCGFILDROP
u_int *
- Sets or gets the status of the “filter drop”
flag. If non-zero, packets matching any filters will be reported to the
associated interface so that they can be dropped.
BIOCSDIRFILT
u_int *
-
BIOCGDIRFILT
u_int *
- Sets or gets the status of the “direction
filter” flag. If non-zero, packets matching the specified direction
(either
BPF_DIRECTION_IN
or
BPF_DIRECTION_OUT
) will be
ignored.
bpf now supports several standard ioctls which
allow the user to do asynchronous and/or non-blocking I/O to an open
bpf file descriptor.
FIONREAD
int *
- Returns the number of bytes that are immediately available
for reading.
FIONBIO
int *
- Sets or clears non-blocking I/O. If the argument is
non-zero, enable non-blocking I/O. If the argument is zero, disable
non-blocking I/O. If non-blocking I/O is enabled, the return value of a
read while no data is available will be 0. The non-blocking read behavior
is different from performing non-blocking reads on other file descriptors,
which will return -1 and set errno to
EAGAIN
if no data is available. Note:
setting this overrides the timeout set by
BIOCSRTIMEOUT
.
FIOASYNC
int *
- Enables or disables asynchronous I/O. When enabled
(argument is non-zero), the process or process group specified by
FIOSETOWN
will start receiving
SIGIO
signals when packets arrive. Note
that you must perform an FIOSETOWN
command in order for this to take effect, as the system will not do it by
default. The signal may be changed via
BIOCSRSIG
.
FIOSETOWN
int *
-
FIOGETOWN
int *
- Sets or gets the process or process group (if negative)
that should receive
SIGIO
when packets
are available. The signal may be changed using
BIOCSRSIG
(see above).
The following structure is prepended to each packet returned by
read(2):
struct bpf_hdr {
struct bpf_timeval bh_tstamp;
u_int32_t bh_caplen;
u_int32_t bh_datalen;
u_int16_t bh_hdrlen;
};
The fields, stored in host order, are as follows:
-
-
- bh_tstamp
- Time at which the packet was processed by the packet
filter.
-
-
- bh_caplen
- Length of the captured portion of the packet. This is the
minimum of the truncation amount specified by the filter and the length of
the packet.
-
-
- bh_datalen
- Length of the packet off the wire. This value is
independent of the truncation amount specified by the filter.
-
-
- bh_hdrlen
- Length of the BPF header, which may not be equal to
sizeof(struct bpf_hdr)
.
The
bh_hdrlen field exists to account for
padding between the header and the link level protocol. The purpose here is to
guarantee proper alignment of the packet data structures, which is required on
alignment-sensitive architectures and improves performance on many other
architectures. The packet filter ensures that the
bpf_hdr and the network layer header will be
word aligned. Suitable precautions must be taken when accessing the link layer
protocol fields on alignment restricted machines. (This isn't a problem on an
Ethernet, since the type field is a
short
falling on
an even offset, and the addresses are probably accessed in a bytewise
fashion).
Additionally, individual packets are padded so that each starts on a word
boundary. This requires that an application has some knowledge of how to get
from packet to packet. The macro
BPF_WORDALIGN
is defined in
<net/bpf.h> to
facilitate this process. It rounds up its argument to the nearest word aligned
value (where a word is
BPF_ALIGNMENT
bytes
wide). For example, if
p points to the start
of a packet, this expression will advance it to the next packet:
p = (char *)p + BPF_WORDALIGN(p->bh_hdrlen +
p->bh_caplen);
For the alignment mechanisms to work properly, the buffer passed to
read(2) must itself be word
aligned.
malloc(3) will always
return an aligned buffer.
A filter program is an array of instructions with all branches forwardly
directed, terminated by a “return” instruction. Each instruction
performs some action on the pseudo-machine state, which consists of an
accumulator, index register, scratch memory store, and implicit program
counter.
The following structure defines the instruction format:
struct bpf_insn {
u_int16_t code;
u_char jt;
u_char jf;
u_int32_t k;
};
The
k field is used in different ways by
different instructions, and the
jt and
jf fields are used as offsets by the branch
instructions. The opcodes are encoded in a semi-hierarchical fashion. There
are eight classes of instructions:
BPF_LD
,
BPF_LDX
,
BPF_ST
,
BPF_STX
,
BPF_ALU
,
BPF_JMP
,
BPF_RET
, and
BPF_MISC
. Various other mode and operator
bits are logically OR'd into the class to give the actual instructions. The
classes and modes are defined in
<net/bpf.h>.
Below are the semantics for each defined
bpf
instruction. We use the convention that A is the accumulator, X is the index
register, P[] packet data, and M[] scratch memory store. P[i:n] gives the data
at byte offset “i” in the packet, interpreted as a word (n=4),
unsigned halfword (n=2), or unsigned byte (n=1). M[i] gives the i'th word in
the scratch memory store, which is only addressed in word units. The memory
store is indexed from 0 to
BPF_MEMWORDS
-1.
k,
jt, and
jf are the corresponding fields in the
instruction definition. “len” refers to the length of the
packet.
-
-
BPF_LD
- These instructions copy a value into the accumulator. The
type of the source operand is specified by an “addressing
mode” and can be a constant
(
BPF_IMM
), packet data at a fixed
offset (BPF_ABS
), packet data at a
variable offset (BPF_IND
), the packet
length (BPF_LEN
), or a word in the
scratch memory store (BPF_MEM
). For
BPF_IND
and
BPF_ABS
, the data size must be
specified as a word (BPF_W
), halfword
(BPF_H
), or byte
(BPF_B
). The semantics of all
recognized BPF_LD
instructions follow.
BPF_LD
+BPF_W
+BPF_ABS
- A <- P[k:4]
BPF_LD
+BPF_H
+BPF_ABS
- A <- P[k:2]
BPF_LD
+BPF_B
+BPF_ABS
- A <- P[k:1]
BPF_LD
+BPF_W
+BPF_IND
- A <- P[X+k:4]
BPF_LD
+BPF_H
+BPF_IND
- A <- P[X+k:2]
BPF_LD
+BPF_B
+BPF_IND
- A <- P[X+k:1]
BPF_LD
+BPF_W
+BPF_LEN
- A <- len
BPF_LD
+BPF_IMM
- A <- k
BPF_LD
+BPF_MEM
- A <- M[k]
-
-
BPF_LDX
- These instructions load a value into the index register.
Note that the addressing modes are more restricted than those of the
accumulator loads, but they include
BPF_MSH
, a hack for efficiently loading
the IP header length.
BPF_LDX
+BPF_W
+BPF_IMM
- X <- k
BPF_LDX
+BPF_W
+BPF_MEM
- X <- M[k]
BPF_LDX
+BPF_W
+BPF_LEN
- X <- len
BPF_LDX
+BPF_B
+BPF_MSH
- X <- 4*(P[k:1]&0xf)
-
-
BPF_ST
- This instruction stores the accumulator into the scratch
memory. We do not need an addressing mode since there is only one
possibility for the destination.
BPF_ST
- M[k] <- A
-
-
BPF_STX
- This instruction stores the index register in the scratch
memory store.
BPF_STX
- M[k] <- X
-
-
BPF_ALU
- The ALU instructions perform operations between the
accumulator and index register or constant, and store the result back in
the accumulator. For binary operations, a source mode is required
(
BPF_K
or
BPF_X
).
BPF_ALU
+BPF_ADD+BPF_K
- A <- A + k
BPF_ALU
+BPF_SUB+BPF_K
- A <- A - k
BPF_ALU
+BPF_MUL+BPF_K
- A <- A * k
BPF_ALU
+BPF_DIV+BPF_K
- A <- A / k
BPF_ALU
+BPF_AND+BPF_K
- A <- A & k
BPF_ALU
+BPF_OR+BPF_K
- A <- A | k
BPF_ALU
+BPF_LSH+BPF_K
- A <- A << k
BPF_ALU
+BPF_RSH+BPF_K
- A <- A >> k
BPF_ALU
+BPF_ADD+BPF_X
- A <- A + X
BPF_ALU
+BPF_SUB+BPF_X
- A <- A - X
BPF_ALU
+BPF_MUL+BPF_X
- A <- A * X
BPF_ALU
+BPF_DIV+BPF_X
- A <- A / X
BPF_ALU
+BPF_AND+BPF_X
- A <- A & X
BPF_ALU
+BPF_OR+BPF_X
- A <- A | X
BPF_ALU
+BPF_LSH+BPF_X
- A <- A << X
BPF_ALU
+BPF_RSH+BPF_X
- A <- A >> X
BPF_ALU
+BPF_NEG
- A <- -A
-
-
BPF_JMP
- The jump instructions alter flow of control. Conditional
jumps compare the accumulator against a constant
(
BPF_K
) or the index register
(BPF_X
). If the result is true (or
non-zero), the true branch is taken, otherwise the false branch is taken.
Jump offsets are encoded in 8 bits so the longest jump is 256
instructions. However, the jump always
(BPF_JA
) opcode uses the 32-bit
k field as the offset, allowing
arbitrarily distant destinations. All conditionals use unsigned comparison
conventions.
BPF_JMP
+BPF_JA
- pc += k
BPF_JMP
+BPF_JGT+BPF_K
- pc += (A > k) ? jt : jf
BPF_JMP
+BPF_JGE+BPF_K
- pc += (A >= k) ? jt : jf
BPF_JMP
+BPF_JEQ+BPF_K
- pc += (A == k) ? jt : jf
BPF_JMP
+BPF_JSET+BPF_K
- pc += (A & k) ? jt : jf
BPF_JMP
+BPF_JGT+BPF_X
- pc += (A > X) ? jt : jf
BPF_JMP
+BPF_JGE+BPF_X
- pc += (A >= X) ? jt : jf
BPF_JMP
+BPF_JEQ+BPF_X
- pc += (A == X) ? jt : jf
BPF_JMP
+BPF_JSET+BPF_X
- pc += (A & X) ? jt : jf
-
-
BPF_RET
- The return instructions terminate the filter program and
specify the amount of packet to accept (i.e., they return the truncation
amount) or, for the write filter, the maximum acceptable size for the
packet (i.e., the packet is dropped if it is larger than the returned
amount). A return value of zero indicates that the packet should be
ignored/dropped. The return value is either a constant
(
BPF_K
) or the accumulator
(BPF_A
).
BPF_RET
+ BPF_A
- Accept A bytes.
BPF_RET
+ BPF_K
- Accept k bytes.
-
-
BPF_MISC
- The miscellaneous category was created for anything that
doesn't fit into the above classes, and for any new instructions that
might need to be added. Currently, these are the register transfer
instructions that copy the index register to the accumulator or vice
versa.
BPF_MISC
+BPF_TAX
- X <- A
BPF_MISC
+BPF_TXA
- A <- X
The
bpf interface provides the following macros to
facilitate array initializers:
BPF_STMT
(
opcode,
operand)
BPF_JUMP
(
opcode,
operand,
true_offset,
false_offset)
- /dev/bpf
- bpf device
The following filter is taken from the Reverse ARP daemon. It accepts only
Reverse ARP requests.
struct bpf_insn insns[] = {
BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_REVARP, 0, 3),
BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, REVARP_REQUEST, 0, 1),
BPF_STMT(BPF_RET+BPF_K, sizeof(struct ether_arp) +
sizeof(struct ether_header)),
BPF_STMT(BPF_RET+BPF_K, 0),
};
This filter accepts only IP packets between host 128.3.112.15 and 128.3.112.35.
struct bpf_insn insns[] = {
BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 8),
BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 26),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 2),
BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 3, 4),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 0, 3),
BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 1),
BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
BPF_STMT(BPF_RET+BPF_K, 0),
};
Finally, this filter returns only TCP finger packets. We must parse the IP
header to reach the TCP header. The
BPF_JSET
instruction checks that the IP
fragment offset is 0 so we are sure that we have a TCP header.
struct bpf_insn insns[] = {
BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 10),
BPF_STMT(BPF_LD+BPF_B+BPF_ABS, 23),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, IPPROTO_TCP, 0, 8),
BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, 0x1fff, 6, 0),
BPF_STMT(BPF_LDX+BPF_B+BPF_MSH, 14),
BPF_STMT(BPF_LD+BPF_H+BPF_IND, 14),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 2, 0),
BPF_STMT(BPF_LD+BPF_H+BPF_IND, 16),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 0, 1),
BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
BPF_STMT(BPF_RET+BPF_K, 0),
};
ioctl(2),
read(2),
select(2),
signal(3),
MAKEDEV(8),
tcpdump(8)
McCanne, S. and
Jacobson, V., The BSD Packet
Filter: A New Architecture for User-level Packet Capture,
1993 Winter USENIX Conference, January
1993.
The Enet packet filter was created in 1980 by Mike Accetta and Rick Rashid at
Carnegie-Mellon University. Jeffrey Mogul, at Stanford, ported the code to
BSD and continued its development from 1983 on. Since
then, it has evolved into the Ultrix Packet Filter at DEC, a STREAMS NIT
module under SunOS 4.1, and BPF.
Steve McCanne of Lawrence Berkeley Laboratory
implemented BPF in Summer 1990. Much of the design is due to
Van Jacobson.
The read buffer must be of a fixed size (returned by the
BIOCGBLEN
ioctl).
A file that does not request promiscuous mode may receive promiscuously received
packets as a side effect of another file requesting this mode on the same
hardware interface. This could be fixed in the kernel with additional
processing overhead. However, we favor the model where all files must assume
that the interface is promiscuous, and if so desired, must utilize a filter to
reject foreign packets.