DaveM:

If you agree with it I will send two small patches to modify
kernel's configure help.

        Ulisses

--------------------------------------------------------------------------------
+ ABSTRACT
--------------------------------------------------------------------------------

This file documents the CONFIG_PACKET_MMAP option available with the PACKET
socket interface on 2.4 and 2.6 kernels. This type of socket is used to
capture network traffic with utilities like tcpdump or any other that uses
the libpcap library.

You can find the latest version of this document at

    http://pusa.uv.es/~ulisses/packet_mmap/

Please send your comments to

    Ulisses Alonso Camaró

--------------------------------------------------------------------------------
+ Why use PACKET_MMAP
--------------------------------------------------------------------------------

In Linux 2.4/2.6, if PACKET_MMAP is not enabled, the capture process is very
inefficient. It uses very limited buffers and requires one system call to
capture each packet, or two if you want to get the packet's timestamp (as
libpcap always does).

On the other hand, PACKET_MMAP is very efficient. It provides a circular
buffer of configurable size mapped in user space. This way, reading packets
just requires waiting for them; most of the time there is no need to issue a
single system call. Sharing the buffer between the kernel and the user also
has the benefit of minimizing packet copies.

It's fine to use PACKET_MMAP to improve the performance of the capture process,
but it isn't everything. At the very least, if you are capturing at high speeds
(this is relative to the CPU speed), you should check whether the device driver
of your network interface card supports some sort of interrupt load mitigation
or (even better) NAPI, and make sure it is enabled.

--------------------------------------------------------------------------------
+ How to use CONFIG_PACKET_MMAP
--------------------------------------------------------------------------------

From the user standpoint, you should use the higher level libpcap library, which
is a de facto standard, portable across nearly all operating systems
including Win32.

That said, at the time of this writing, official libpcap 0.8.1 is out and
doesn't include support for PACKET_MMAP, and the libpcap included in your
distribution probably doesn't either.

I'm aware of two implementations of PACKET_MMAP in libpcap:

    http://pusa.uv.es/~ulisses/packet_mmap/  (by Simon Patarin, based on libpcap 0.6.2)
    http://public.lanl.gov/cpw/              (by Phil Wood, based on latest libpcap)

The rest of this document is intended for people who want to understand
the low level details or want to improve libpcap by including PACKET_MMAP
support.

--------------------------------------------------------------------------------
+ How to use CONFIG_PACKET_MMAP directly
--------------------------------------------------------------------------------

From the system calls standpoint, the use of PACKET_MMAP involves
the following process:


[setup]     socket() -------> creation of the capture socket
            setsockopt() ---> allocation of the circular buffer (ring)
            mmap() ---------> mapping of the allocated buffer to the
                              user process

[capture]   poll() ---------> to wait for incoming packets

[shutdown]  close() --------> destruction of the capture socket and
                              deallocation of all associated
                              resources.


Socket creation and destruction is straightforward, and is done
the same way with or without PACKET_MMAP:

int fd;

fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));

where mode is SOCK_RAW for the raw interface, where link level
information can be captured, or SOCK_DGRAM for the cooked
interface, where link level information capture is not
supported and a link level pseudo-header is provided
by the kernel.

The destruction of the socket and all associated resources
is done by a simple call to close(fd).

Next I will describe the PACKET_MMAP settings and their constraints,
as well as the mapping of the circular buffer in the user process and
the use of this buffer.
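
As a minimal sketch of the socket creation step, the call above could be
wrapped as follows. The helper name open_capture_socket() is made up for this
example and is not part of any library. Note that opening a packet socket
normally requires the CAP_NET_RAW capability (typically root), so the call
may fail for an unprivileged user.

```c
#include <arpa/inet.h>        /* htons */
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <sys/socket.h>       /* socket, PF_PACKET, SOCK_RAW, SOCK_DGRAM */
#include <unistd.h>           /* close, used by the caller at shutdown */

/*
 * open_capture_socket() is a hypothetical helper for this sketch.
 * cooked == 0 selects SOCK_RAW (link level information is captured),
 * cooked != 0 selects SOCK_DGRAM (the kernel provides a pseudo-header).
 * Returns the socket descriptor, or -1 with errno set.
 */
int open_capture_socket(int cooked)
{
    int mode = cooked ? SOCK_DGRAM : SOCK_RAW;

    return socket(PF_PACKET, mode, htons(ETH_P_ALL));
}
```

When the capture is finished, close(fd) on the returned descriptor releases
the socket and everything attached to it.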

--------------------------------------------------------------------------------
+ PACKET_MMAP settings
--------------------------------------------------------------------------------


PACKET_MMAP is set up from user level code with a call like

     setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))

The most significant argument in the previous call is the req parameter,
which must have the following structure:

    struct tpacket_req
    {
        unsigned int    tp_block_size;  /* Minimal size of contiguous block */
        unsigned int    tp_block_nr;    /* Number of blocks */
        unsigned int    tp_frame_size;  /* Size of frame */
        unsigned int    tp_frame_nr;    /* Total number of frames */
    };

This structure is defined in /usr/include/linux/if_packet.h and establishes a
circular buffer (ring) of unswappable memory mapped in the capture process.
Being mapped in the capture process allows reading the captured frames and
related meta-information like timestamps without requiring a system call.

Captured frames are grouped in blocks. Each block is a physically contiguous
region of memory and holds tp_block_size/tp_frame_size frames. The total number
of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because

    frames_per_block = tp_block_size/tp_frame_size

indeed, packet_set_ring checks that the following condition is true

    frames_per_block * tp_block_nr == tp_frame_nr


Let's see an example, with the following values:

     tp_block_size= 4096
     tp_frame_size= 2048
     tp_block_nr  = 4
     tp_frame_nr  = 8

we will get the following buffer structure:

        block #1                 block #2
+---------+---------+    +---------+---------+
| frame 1 | frame 2 |    | frame 3 | frame 4 |
+---------+---------+    +---------+---------+

        block #3                 block #4
+---------+---------+    +---------+---------+
| frame 5 | frame 6 |    | frame 7 | frame 8 |
+---------+---------+    +---------+---------+
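
Since tp_frame_nr is redundant, one way to avoid mismatches is to derive it
from the other three values. The following is only a sketch of that idea;
make_ring_request() is a name invented for this example.

```c
#include <linux/if_packet.h>  /* struct tpacket_req */

/*
 * make_ring_request() is a hypothetical helper: it fills a struct tpacket_req
 * from the three independent parameters and derives tp_frame_nr.
 * Returns 0 on success, -1 if not even one frame fits in a block.
 */
int make_ring_request(struct tpacket_req *req, unsigned int block_size,
                      unsigned int block_nr, unsigned int frame_size)
{
    unsigned int frames_per_block;

    if (frame_size == 0 || frame_size > block_size)
        return -1;

    frames_per_block = block_size / frame_size;

    req->tp_block_size = block_size;
    req->tp_block_nr   = block_nr;
    req->tp_frame_size = frame_size;
    req->tp_frame_nr   = frames_per_block * block_nr;
    return 0;
}
```

With the values of the example above (4096, 4, 2048) the helper sets
tp_frame_nr to 8.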

A frame can be of any size, with the only condition that it fits in a block.
A block can only hold an integer number of frames; in other words, a frame
cannot span two blocks, so there are some details you have to take into
account when choosing the frame_size. See "Mapping and use of the circular
buffer (ring)".


--------------------------------------------------------------------------------
+ PACKET_MMAP setting constraints
--------------------------------------------------------------------------------

In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
16384 in a 64 bit architecture. For information on these kernel versions
see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt

 Block size limit
------------------

As stated earlier, each block is a contiguous physical region of memory. These
memory regions are allocated with calls to the __get_free_pages() function. As
the name indicates, this function allocates pages of memory, and the second
argument is "order", which selects a power-of-two number of pages, that is
(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
order=2 ==> 16384 bytes, etc. The maximum size of a
region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
precisely, the limit can be calculated as:

   PAGE_SIZE << MAX_ORDER

   In an i386 architecture PAGE_SIZE is 4096 bytes
   In a 2.4/i386 kernel MAX_ORDER is 10
   In a 2.6/i386 kernel MAX_ORDER is 11

So __get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
respectively, with an i386 architecture.

User space programs can include /usr/include/sys/user.h and
/usr/include/linux/mmzone.h to get the PAGE_SIZE and MAX_ORDER declarations.

The page size can also be determined dynamically with the getpagesize (2)
system call.
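
To illustrate the relation between a block size and the "order" of the
allocation, here is a small sketch. It assumes a block is served by the
smallest power-of-two run of pages that can hold it, and it takes the limit
PAGE_SIZE << MAX_ORDER exactly as stated above. block_order() and
block_waste() are names invented for this example; page size and maximum
order are passed in by the caller rather than read from the kernel.

```c
#include <stddef.h>

/*
 * block_order(): hypothetical helper. Returns the smallest order such that
 * (page_size << order) >= block_size, or -1 if that would need an order
 * larger than max_order.
 */
int block_order(size_t block_size, size_t page_size, int max_order)
{
    int order = 0;
    size_t span = page_size;

    while (span < block_size) {
        if (order == max_order)
            return -1;
        span <<= 1;
        order++;
    }
    return order;
}

/* Bytes allocated for the block but not usable as block space. */
size_t block_waste(size_t block_size, size_t page_size, int order)
{
    return (page_size << order) - block_size;
}
```

For instance, with 4096 byte pages a 12288 byte block needs order 2
(16384 bytes), so under this assumption 4096 bytes would be wasted.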


 Block number limit
--------------------

To understand the constraints of PACKET_MMAP, we have to see the structure
used to hold the pointers to each block.

Currently, this structure is a vector called pg_vec, dynamically allocated
with kmalloc; its size limits the number of blocks that can be allocated.

    +---+---+---+---+
    | x | x | x | x |
    +---+---+---+---+
      |   |   |   |
      |   |   |   v
      |   |   v  block #4
      |   v  block #3
      v  block #2
     block #1


kmalloc allocates any number of bytes of physically contiguous memory from
a pool of pre-determined sizes. This pool of memory is maintained by the slab
allocator, which is ultimately responsible for doing the allocation and
hence imposes the maximum memory that kmalloc can allocate.

In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The
predetermined sizes that kmalloc uses can be checked in the "size-<bytes>"
entries of /proc/slabinfo

In a 32 bit architecture, pointers are 4 bytes long, so the total number of
pointers to blocks is

     131072/4 = 32768 blocks


 PACKET_MMAP buffer size calculator
------------------------------------

Definitions:

<size-max>    : is the maximum size allocable with kmalloc (see /proc/slabinfo)
<pointer size>: depends on the architecture -- sizeof(void *)
<page size>   : depends on the architecture -- PAGE_SIZE or getpagesize (2)
<max-order>   : is the value defined with MAX_ORDER
<frame size>  : it's an upper bound of the frame's capture size (more on this later)

from these definitions we will derive

        <block number> = <size-max>/<pointer size>
        <block size> = <page size> << <max-order>

so, the max buffer size is

        <block number> * <block size>

and the number of frames is

        <block number> * <block size> / <frame size>

Suppose the following parameters, which apply for a 2.6 kernel and an
i386 architecture:

        <size-max> = 131072 bytes
        <pointer size> = 4 bytes
        <page size> = 4096 bytes
        <max-order> = 11

and a value for <frame size> of 2048 bytes. These parameters will yield

        <block number> = 131072/4 = 32768 blocks
        <block size> = 4096 << 11 = 8 MiB.

and hence the buffer will have a 262144 MiB size. So it can hold
262144 MiB / 2048 bytes = 134217728 frames


Actually, this buffer size is not possible with an i386 architecture.
Remember that the memory is allocated in kernel space; in the case of
an i386 kernel, the memory size is limited to 1GiB.

None of the memory allocations are freed until the socket is closed. The
memory allocations are done with GFP_KERNEL priority, which basically means
that the allocation can wait and swap other processes' memory in order to
allocate the necessary memory, so normally the limits can be reached.
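
The calculation above can be written down as a few small functions. This is
just a sketch of the arithmetic: struct ring_limits and the function names
are invented for this example, and every input is supplied by the caller,
nothing is queried from the running kernel. 64 bit integers are used because
the theoretical buffer size does not fit in 32 bits.

```c
#include <stdint.h>

/* Hypothetical container for the inputs of the calculation. */
struct ring_limits {
    uint64_t size_max;      /* largest kmalloc allocation, in bytes */
    uint64_t pointer_size;  /* sizeof(void *) on the target architecture */
    uint64_t page_size;     /* PAGE_SIZE, in bytes */
    unsigned max_order;     /* MAX_ORDER */
};

/* Maximum number of blocks: one pointer per block in the kmalloc'ed vector. */
uint64_t max_block_number(const struct ring_limits *l)
{
    return l->size_max / l->pointer_size;
}

/* Maximum size of one block, in bytes. */
uint64_t max_block_size(const struct ring_limits *l)
{
    return l->page_size << l->max_order;
}

/* Theoretical maximum size of the whole ring, in bytes. */
uint64_t max_buffer_size(const struct ring_limits *l)
{
    return max_block_number(l) * max_block_size(l);
}

/* Theoretical maximum number of frames for a given frame size. */
uint64_t max_frames(const struct ring_limits *l, uint64_t frame_size)
{
    return max_buffer_size(l) / frame_size;
}
```

Filled with the 2.6/i386 values used above (131072, 4, 4096, 11) and a frame
size of 2048 bytes, these functions give 32768 blocks of 8 MiB each and
134217728 frames.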

 Other constraints
-------------------

If you check the source code you will see that what I draw here as a frame
is not only the link level frame. At the beginning of each frame there is a
header called struct tpacket_hdr, used in PACKET_MMAP to hold the link level
frame's meta information, such as the timestamp. So what we draw here as a
frame is really the following (from include/linux/if_packet.h):

/*
   Frame structure:

   - Start. Frame must be aligned to TPACKET_ALIGNMENT=16
   - struct tpacket_hdr
   - pad to TPACKET_ALIGNMENT=16
   - struct sockaddr_ll
   - Gap, chosen so that packet data (Start+tp_net) aligns to
     TPACKET_ALIGNMENT=16
   - Start+tp_mac: [ Optional MAC header ]
   - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
   - Pad to align to TPACKET_ALIGNMENT=16
 */


 The following are conditions that are checked in packet_set_ring

   tp_block_size must be a multiple of PAGE_SIZE (1)
   tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
   tp_frame_size must be a multiple of TPACKET_ALIGNMENT
   tp_frame_nr   must be exactly frames_per_block*tp_block_nr

Note that tp_block_size should be chosen to be a power of two or there will
be a waste of memory.
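
The four conditions listed above can be checked in user space before calling
setsockopt(). The sketch below only mirrors that list, it is not the kernel
code; ring_request_is_valid() is a hypothetical name, and the page size is
passed in by the caller (for example from getpagesize()).

```c
#include <linux/if_packet.h>  /* struct tpacket_req, TPACKET_HDRLEN,
                                 TPACKET_ALIGNMENT */

/*
 * ring_request_is_valid(): hypothetical helper.
 * Returns 1 if req satisfies the conditions listed in the text, 0 otherwise.
 */
int ring_request_is_valid(const struct tpacket_req *req, unsigned int page_size)
{
    unsigned int frames_per_block;

    /* tp_block_size must be a multiple of the page size */
    if (page_size == 0 || req->tp_block_size == 0 ||
        req->tp_block_size % page_size != 0)
        return 0;

    /* tp_frame_size must be greater than TPACKET_HDRLEN */
    if (req->tp_frame_size <= TPACKET_HDRLEN)
        return 0;

    /* tp_frame_size must be a multiple of TPACKET_ALIGNMENT */
    if (req->tp_frame_size % TPACKET_ALIGNMENT != 0)
        return 0;

    /* at least one frame must fit in a block */
    frames_per_block = req->tp_block_size / req->tp_frame_size;
    if (frames_per_block == 0)
        return 0;

    /* tp_frame_nr must be exactly frames_per_block * tp_block_nr */
    return req->tp_frame_nr == frames_per_block * req->tp_block_nr;
}
```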

--------------------------------------------------------------------------------
+ Mapping and use of the circular buffer (ring)
--------------------------------------------------------------------------------

The mapping of the buffer in the user process is done with the conventional
mmap function. Even though the circular buffer is composed of several
physically discontiguous blocks of memory, they are contiguous in user space,
hence just one call to mmap is needed:

    mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
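
Putting the setup steps together, a minimal sketch of the whole sequence
(socket, setsockopt, mmap, and the matching cleanup) could look as follows.
struct capture_ring, ring_open() and ring_close() are names invented for this
example. The mapping size is assumed to be tp_block_size * tp_block_nr, and a
raw socket is assumed; the calls will fail without the needed privileges.

```c
#include <arpa/inet.h>        /* htons */
#include <errno.h>
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/if_packet.h>  /* struct tpacket_req, PACKET_RX_RING */
#include <stddef.h>
#include <sys/mman.h>         /* mmap, munmap */
#include <sys/socket.h>       /* socket, setsockopt, SOL_PACKET */
#include <unistd.h>           /* close */

/* Hypothetical bookkeeping for this sketch. */
struct capture_ring {
    int    fd;    /* capture socket, -1 when closed */
    void  *map;   /* start of the mapped ring, MAP_FAILED when unmapped */
    size_t size;  /* length of the mapping in bytes */
};

/* Returns 0 on success, -1 on failure with errno set by the failing call. */
int ring_open(struct capture_ring *r, const struct tpacket_req *req)
{
    int saved_errno;

    r->map = MAP_FAILED;
    /* Assumption: the ring occupies tp_block_size * tp_block_nr bytes. */
    r->size = (size_t)req->tp_block_size * req->tp_block_nr;

    r->fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (r->fd < 0)
        return -1;

    if (setsockopt(r->fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req)) < 0)
        goto fail;

    r->map = mmap(0, r->size, PROT_READ | PROT_WRITE, MAP_SHARED, r->fd, 0);
    if (r->map == MAP_FAILED)
        goto fail;
    return 0;

fail:
    saved_errno = errno;   /* close() may overwrite errno */
    close(r->fd);
    r->fd = -1;
    errno = saved_errno;
    return -1;
}

/* Unmaps the ring and closes the socket, which releases the ring itself. */
void ring_close(struct capture_ring *r)
{
    if (r->map != MAP_FAILED)
        munmap(r->map, r->size);
    if (r->fd >= 0)
        close(r->fd);
    r->map = MAP_FAILED;
    r->fd = -1;
}
```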

If tp_frame_size is a divisor of tp_block_size, frames will be
contiguously spaced by tp_frame_size bytes. If not, every
tp_block_size/tp_frame_size frames there will be a gap between
the frames. This is because a frame cannot span two
blocks.
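
The position of a frame inside the mapping follows directly from this rule.
Here is a sketch of the computation; frame_offset() is a hypothetical name,
frames are counted from 0 here (the drawings above count from 1), and the
caller is expected to pass 0 < frame_size <= block_size.

```c
#include <stddef.h>

/*
 * frame_offset(): hypothetical helper. Returns the byte offset of frame
 * number `index` (0-based) from the start of the mapped ring, skipping the
 * unused tail of each block when frame_size does not divide block_size.
 */
size_t frame_offset(size_t index, size_t block_size, size_t frame_size)
{
    size_t frames_per_block = block_size / frame_size;
    size_t block = index / frames_per_block;  /* which block holds the frame */
    size_t slot  = index % frames_per_block;  /* position inside that block */

    return block * block_size + slot * frame_size;
}
```

With 4096 byte blocks and 1536 byte frames, two frames fit in a block, so
frame 2 starts at offset 4096 and the last 1024 bytes of each block are the
gap mentioned above.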

At the beginning of each frame there is a status field (see
struct tpacket_hdr). If this field is 0, the frame is ready
to be used by the kernel. If not, there is a frame the user can read
and the following flags apply:

     from include/linux/if_packet.h

     #define TP_STATUS_COPY          2
     #define TP_STATUS_LOSING        4
     #define TP_STATUS_CSUMNOTREADY  8


TP_STATUS_COPY        : This flag indicates that the frame (and associated
                        meta information) has been truncated because it's
                        larger than tp_frame_size. This packet can be
                        read entirely with recvfrom().

                        In order to make this work it must be
                        enabled previously with setsockopt() and
                        the PACKET_COPY_THRESH option.

                        The number of frames that can be buffered to
                        be read with recvfrom is limited like a normal socket.
                        See the SO_RCVBUF option in the socket (7) man page.

TP_STATUS_LOSING      : indicates there were packet drops since the last time
                        statistics were checked with getsockopt() and
                        the PACKET_STATISTICS option.

TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets whose
                        checksum will be done in hardware. So while
                        reading the packet we should not try to check the
                        checksum.
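
As a small illustration of how these flags combine in the status field, the
following sketch turns a status value into a readable string.
describe_status() is a name invented for this example; the flag constants
come from linux/if_packet.h.

```c
#include <linux/if_packet.h>  /* TP_STATUS_* */
#include <stdio.h>            /* snprintf */

/*
 * describe_status(): hypothetical helper. Writes a short description of a
 * tp_status value into buf (at most len bytes) and returns buf.
 */
const char *describe_status(unsigned long status, char *buf, size_t len)
{
    if (status == TP_STATUS_KERNEL) {
        snprintf(buf, len, "kernel");   /* frame is free for the kernel */
        return buf;
    }

    /* Non-zero status: the frame can be read, list the extra flags. */
    snprintf(buf, len, "user%s%s%s",
             (status & TP_STATUS_COPY)         ? " copy"         : "",
             (status & TP_STATUS_LOSING)       ? " losing"       : "",
             (status & TP_STATUS_CSUMNOTREADY) ? " csumnotready" : "");
    return buf;
}
```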

For convenience there are also the following defines:

     #define TP_STATUS_KERNEL        0
     #define TP_STATUS_USER          1

The kernel initializes all frames to TP_STATUS_KERNEL. When the kernel
receives a packet, it puts it in the buffer and updates the status with
at least the TP_STATUS_USER flag. Then the user can read the packet;
once the packet is read, the user must zero the status field so the kernel
can use that frame buffer again.
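
The hand-over protocol just described could be sketched like this. The
function drain_ring() and the frame_handler callback type are invented for
this example. The sketch assumes tp_frame_size divides tp_block_size, so that
frames are evenly spaced, and it passes the data starting at tp_mac with
length tp_snaplen to the callback. It ignores memory ordering issues that a
production reader would probably have to consider.

```c
#include <linux/if_packet.h>  /* struct tpacket_hdr, TP_STATUS_* */
#include <stddef.h>

/* Hypothetical callback: receives the captured bytes and their length. */
typedef void (*frame_handler)(const unsigned char *data, unsigned int len);

/*
 * drain_ring(): hypothetical helper. Starting at frame *next, hands every
 * frame owned by the user to `handle`, gives it back to the kernel and
 * advances *next, wrapping around at frame_nr. Stops at the first frame
 * still owned by the kernel. Returns the number of frames consumed.
 */
unsigned int drain_ring(void *ring, unsigned int frame_nr,
                        unsigned int frame_size, unsigned int *next,
                        frame_handler handle)
{
    unsigned int done = 0;

    while (done < frame_nr) {
        unsigned char *frame = (unsigned char *)ring +
                               (size_t)*next * frame_size;
        volatile struct tpacket_hdr *hdr =
            (volatile struct tpacket_hdr *)frame;

        if (!(hdr->tp_status & TP_STATUS_USER))
            break;                          /* still owned by the kernel */

        if (handle)
            handle(frame + hdr->tp_mac, hdr->tp_snaplen);

        hdr->tp_status = TP_STATUS_KERNEL;  /* zero it: kernel may reuse it */
        *next = (*next + 1) % frame_nr;
        done++;
    }
    return done;
}
```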

The user can use poll (any other variant should apply too) to check if new
packets are in the ring:

    struct pollfd pfd;

    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLIN|POLLRDNORM|POLLERR;

    if (status == TP_STATUS_KERNEL)
        retval = poll(&pfd, 1, timeout);

Checking the status value first and then polling for frames does not
incur a race condition.

--------------------------------------------------------------------------------
+ THANKS
--------------------------------------------------------------------------------

   Jesse Brandeburg, for fixing my grammatical/spelling errors

