HISTORY:
February 16/2002 -- revision 0.2.1:
COR typo corrected
February 10/2002 -- revision 0.2:
some spell checking ;->
January 12/2002 -- revision 0.1
This is still work in progress so may change.
To keep up to date please watch this space.

Introduction to NAPI
====================

NAPI is a proven (www.cyberus.ca/~hadi/usenix-paper.tgz) technique
to improve network performance on Linux. For more details please
read that paper.
NAPI provides an "inherent mitigation" which is bound by system capacity,
as can be seen from the following data collected by Robert on Gigabit
Ethernet (e1000):

 Psize    Ipps       Tput     Rxint     Txint    Done     Ndone
 ---------------------------------------------------------------
   60    890000     409362        17    27622        7     6823
  128    758150     464364        21     9301       10     7738
  256    445632     774646        42    15507       21    12906
  512    232666     994445    241292    19147   241192     1062
 1024    119061    1000003    872519    19258   872511        0
 1440     85193    1000003    946576    19505   946569        0


Legend:
"Ipps" stands for input packets per second.
"Tput" == packets out of total 1M that made it out.
"Rxint" == receive interrupts seen
"Txint" == transmit completion interrupts seen
"Done" == the number of times that poll() managed to pull all
packets out of the rx ring. Note from this that the lower the
load the more often we could clean up the rx ring.
"Ndone" == the converse of "Done". Note again that the higher
the load the more times we couldn't clean up the rx ring.

Observe that:
when the NIC receives 890Kpackets/sec only 17 rx interrupts are generated.
The system can't handle the processing at 1 interrupt/packet at that load level.
At lower rates, on the other hand, rx interrupts go up and therefore the
interrupt/packet ratio goes up (as observable from that table). So there is
a possibility that under low enough input, you get one poll call for each
input packet, caused by a single interrupt each time. And if the system
can't handle an interrupt per packet ratio of 1, then it will just have to
chug along ....


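The trend in the table can be made concrete by dividing the Rxint column by
the number of packets offered. The following is only a sketch: it assumes each
run offered 1M packets (the legend describes Tput as "out of total 1M"), and
the names struct run, runs and rx_ints_per_packet() are made up for this
illustration.

```c
#include <stdio.h>

/* One row of the table above; only the columns needed here. */
struct run {
    int  psize;   /* packet size */
    long rxint;   /* rx interrupts seen during the run */
};

/* Assumption: every run offered 1,000,000 packets to the NIC. */
#define PACKETS_OFFERED 1000000L

/* Average number of rx interrupts taken per offered packet. */
static double rx_ints_per_packet(const struct run *r)
{
    return (double)r->rxint / (double)PACKETS_OFFERED;
}

/* Psize and Rxint copied from the table. */
static const struct run runs[] = {
    {   60,     17 }, {  128,     21 }, {  256,     42 },
    {  512, 241292 }, { 1024, 872519 }, { 1440, 946576 },
};

/* Print the ratio for every row of the table. */
static void print_ratios(void)
{
    size_t i;

    for (i = 0; i < sizeof(runs) / sizeof(runs[0]); i++)
        printf("%4d: %.6f rx interrupts/packet\n",
               runs[i].psize, rx_ints_per_packet(&runs[i]));
}
```

Under that assumption the ratio is far below one interrupt per packet for the
small-packet (high packet rate) runs and approaches one interrupt per packet
for the 1024 and 1440 runs, which is the behaviour described above.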
0) Prerequisites:
=================
A driver MAY continue using the old 2.4 technique for interfacing
to the network stack and not benefit from the NAPI changes.
NAPI additions to the kernel do not break backward compatibility.
NAPI, however, requires the following features to be available:

A) DMA ring or enough RAM to store packets in software devices.

B) Ability to turn off interrupts or maybe events that send packets up
the stack.

NAPI processes packet events in what is known as the dev->poll() method.
Typically, only packet receive events are processed in dev->poll().
The rest of the events MAY be processed by the regular interrupt handler
to reduce processing latency (justified also because there are not that
many of them).
Note, however, NAPI does not enforce that dev->poll() only processes
receive events.
Tests with the tulip driver indicated slightly increased latency if
all of the interrupt handler is moved to dev->poll(). Also MII handling
gets a little trickier.
The example used in this document is to move the receive processing only
to dev->poll(); this is shown with the patch for the tulip driver.
For an example of code that moves all of the interrupt handling to
dev->poll() look at the ported e1000 code.

There are caveats that might force you to go with moving everything to
dev->poll(). Different NICs work differently depending on their status/event
acknowledgement setup.
There are two types of event register ACK mechanisms.
        I) what is known as Clear-on-read (COR).
        When you read the status/event register, it clears everything!
        The natsemi and sunbmac NICs are known to do this.
        In this case your only choice is to move all to dev->poll()

        II) Clear-on-write (COW)
         i) you clear the status by writing a 1 in the bit-location you want.
                These are the majority of the NICs and work the best with NAPI.
                Put only receive events in dev->poll(); leave the rest in
                the old interrupt handler.
         ii) whatever you write in the status register clears everything ;->
                Can't seem to find any supported by Linux which do this. If
                someone knows such a chip email us please.
                Move all to dev->poll()

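The difference between the two ACK mechanisms can be modelled in a few lines
of plain C. This is a toy model, not driver code: struct status_reg, the EV_*
bits and the cor_/cow_ helpers are all hypothetical names and do not
correspond to any real NIC.

```c
#include <stdint.h>

/* Hypothetical event bits, not taken from any real NIC. */
#define EV_RX   0x1u
#define EV_TX   0x2u
#define EV_LINK 0x4u

struct status_reg {
    uint32_t bits;   /* pending events latched by the "hardware" */
};

/* Clear-on-read: the read returns all pending events and clears them all. */
static uint32_t cor_read(struct status_reg *r)
{
    uint32_t seen = r->bits;

    r->bits = 0;
    return seen;
}

/* Clear-on-write, case i): reading has no side effect ... */
static uint32_t cow_read(const struct status_reg *r)
{
    return r->bits;
}

/* ... and writing a 1 clears only the bit positions written. */
static void cow_ack(struct status_reg *r, uint32_t mask)
{
    r->bits &= ~mask;
}
```

In this model a COR read in the interrupt handler has already consumed the
receive event, so nothing is left in the register for dev->poll() to see and
acknowledge later; with COW case i) the handler can acknowledge EV_TX and
EV_LINK and leave EV_RX set until the poll routine clears it.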
C) Ability to detect new work correctly.
NAPI works by shutting down event interrupts when there's work and
turning them on when there's none.
New packets might show up in the small window while interrupts are being
re-enabled (refer to appendix 2). We only get to know about such a packet
when the next new packet arrives and generates an interrupt.
Essentially, there is a small window of opportunity for a race condition
which for clarity we'll refer to as the "rotting packet".

This is a very important topic and appendix 2 is dedicated to further
discussion.

Locking rules and environmental guarantees
==========================================

- Guarantee: Only one CPU at any time can call dev->poll(); this is because
only one CPU can pick the initial interrupt and hence the initial
netif_rx_schedule(dev);
- The core layer invokes devices to send packets in a round robin format.
This implies receive is totally lockless because of the guarantee that only
one CPU is executing it.
- Contention can only be the result of some other CPU accessing the rx
ring. This happens only in close() and suspend() (when these methods
try to clean the rx ring);
****guarantee: driver authors need not worry about this; synchronization
is taken care of for them by the top net layer.
- Local interrupts are enabled (if you don't move all to dev->poll()). For
example link/MII and txcomplete continue functioning in just the same old way.
This improves the latency of processing these events. It is also assumed that
the receive interrupt is the largest cause of noise. Note this might not
always be true.
[according to Manfred Spraul, the winbond insists on sending one
txmitcomplete interrupt for each packet (although this can be mitigated)].
For these broken drivers, move all to dev->poll().

For the rest of this text, we'll assume that dev->poll() only
processes receive events.

New methods introduced by NAPI
==============================

a) netif_rx_schedule(dev)
        Called by an IRQ handler to schedule a poll for the device.

b) netif_rx_schedule_prep(dev)
        Puts the device in a state which allows for it to be added to the
        CPU polling list if it is up and running. You can look at this as
        the first half of netif_rx_schedule(dev) above; the second half
        being c) below.

c) __netif_rx_schedule(dev)
        Add the device to the poll list for this CPU; assuming that _prep above
        has already been called and returned 1.

d) netif_rx_reschedule(dev, undo)
        Called to reschedule polling for the device, specifically for some
        deficient hardware. Read Appendix 2 for more details.

e) netif_rx_complete(dev)
        Remove the interface from the CPU poll list: it must be in the poll list
        on the current CPU. This primitive is called by dev->poll(), when
        it completes its work. The device cannot be out of the poll list at this
        call; if it is then clearly it is a BUG(). You'll know ;->

All of the above methods are used below. So keep reading for clarity.

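How a), b), c) and e) fit together can be shown with a small userland model.
This is a sketch for illustration only, not the kernel implementation:
struct toy_dev and the toy_* functions are hypothetical stand-ins, and the
per-CPU poll list is reduced to a single flag.

```c
#include <stdbool.h>

/* Toy stand-in for struct net_device; every field here is hypothetical. */
struct toy_dev {
    bool running;    /* interface is up */
    bool rx_sched;   /* already claimed for polling */
    bool on_list;    /* currently linked into the poll list */
};

/* Models b): claim the device for polling; 1 on success, 0 otherwise. */
static int toy_schedule_prep(struct toy_dev *d)
{
    if (!d->running || d->rx_sched)
        return 0;
    d->rx_sched = true;
    return 1;
}

/* Models c): link the device in; only valid after prep returned 1. */
static void toy_do_schedule(struct toy_dev *d)
{
    d->on_list = true;
}

/* Models a): both halves together, as an IRQ handler would call it. */
static void toy_schedule(struct toy_dev *d)
{
    if (toy_schedule_prep(d))
        toy_do_schedule(d);
}

/* Models e): called from poll when done; leave the list, allow a new prep. */
static void toy_complete(struct toy_dev *d)
{
    d->on_list = false;
    d->rx_sched = false;
}
```

In this model a second toy_schedule_prep() on a device that is already
scheduled returns 0, which is the situation the interrupt handler further
below treats as "interrupt while in poll".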
Device driver changes to be made when porting NAPI
==================================================

Below we describe what kind of changes are required for NAPI to work.

1) introduction of dev->poll() method
=====================================

This is the method that is invoked by the network core when it requests
new packets from the driver. A driver is allowed to send up to
dev->quota packets by the current CPU before yielding to the network
subsystem (so other devices can also get the opportunity to send to the stack).

dev->poll() prototype looks as follows:
int my_poll(struct net_device *dev, int *budget)

budget is the remaining number of packets the network subsystem on the
current CPU can send up the stack before yielding to other system tasks.
*Each driver is responsible for decrementing budget by the total number of
packets sent.
        Total number of packets cannot exceed dev->quota.

dev->poll() method is invoked by the top layer; the driver just sends, if it
can, the requested quantity of packets to the stack.

More on dev->poll() below, after the interrupt changes are explained.

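The interplay of budget and dev->quota can be illustrated with a userland
model. Everything in it is an assumption made for this sketch: struct toy_dev,
toy_poll() and toy_rx_action() are hypothetical, the quota is simply refilled
from the weight after each unfinished poll, and weight is assumed to be
greater than zero. The real network core does more bookkeeping than this.

```c
#include <stddef.h>

/* Toy device; all fields are hypothetical. */
struct toy_dev {
    int pending;   /* packets waiting in the rx ring */
    int quota;     /* packets this device may still send on this visit */
    int weight;    /* value used to refill quota; assumed > 0 */
    int on_list;   /* 1 while the device wants to be polled */
    int sent;      /* total packets delivered to the "stack" */
};

/* Poll method in the shape described above: returns 0 when done, 1 when not. */
static int toy_poll(struct toy_dev *dev, int *budget)
{
    int limit = dev->quota < *budget ? dev->quota : *budget;
    int received = 0;

    while (dev->pending > 0 && received < limit) {
        dev->pending--;          /* "send one packet up the stack" */
        received++;
    }
    dev->sent  += received;
    dev->quota -= received;
    *budget    -= received;      /* the driver decrements the shared budget */

    if (dev->pending > 0)
        return 1;                /* not done: stay on the poll list */
    dev->on_list = 0;            /* done: models leaving the poll list */
    return 0;
}

/* Simplified core: visit devices round robin until budget or work runs out. */
static void toy_rx_action(struct toy_dev *devs, size_t n, int budget)
{
    int work_left = 1;

    while (budget > 0 && work_left) {
        size_t i;

        work_left = 0;
        for (i = 0; i < n && budget > 0; i++) {
            if (!devs[i].on_list)
                continue;
            if (toy_poll(&devs[i], &budget)) {
                devs[i].quota = devs[i].weight;   /* refill for next visit */
                work_left = 1;
            }
        }
    }
}
```

With two devices of weight 16 holding 40 and 10 packets and a budget of 20,
the first device sends 16 packets and the second only 4 before the budget is
used up; neither can starve the other or run past the shared budget.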
2) registering dev->poll() method
=================================

dev->poll should be set in the dev->probe() method.
e.g:
dev->open = my_open;
.
.
/* two new additions */
/* first register my poll method */
dev->poll = my_poll;
/* next register my weight/quanta; can be overridden in /proc */
dev->weight = 16;
.
.
dev->stop = my_close;



3) scheduling dev->poll()
=========================
This involves modifying the interrupt handler and the code
path which takes the packets off the NIC and sends them to the
stack.

It's important at this point to introduce the classical D Becker
interrupt processor:

------------------
static void
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{

        struct net_device *dev = (struct net_device *)dev_id;
        struct my_private *tp = (struct my_private *)dev->priv;

        int work_count = my_work_count;
        status = read_interrupt_status_reg();
        if (status == 0)
                return;         /* Shared IRQ: not us */
        if (status == 0xffff)
                return;         /* Hot unplug */
        if (status & error)
                do_some_error_handling();

        do {
                acknowledge_ints_ASAP();

                if (status & link_interrupt) {
                        spin_lock(&tp->link_lock);
                        do_some_link_stat_stuff();
                        spin_unlock(&tp->link_lock);
                }

                if (status & rx_interrupt) {
                        receive_packets(dev);
                }

                if (status & rx_nobufs) {
                        make_rx_buffs_avail();
                }

                if (status & tx_related) {
                        spin_lock(&tp->lock);
                        tx_ring_free(dev);
                        if (tx_died)
                                restart_tx();
                        spin_unlock(&tp->lock);
                }

                status = read_interrupt_status_reg();

        } while (!(status & error) || more_work_to_be_done);

}

----------------------------------------------------------------------

We now change this to what is shown below to NAPI-enable it:

----------------------------------------------------------------------
static void
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        struct net_device *dev = (struct net_device *)dev_id;
        struct my_private *tp = (struct my_private *)dev->priv;

        status = read_interrupt_status_reg();
        if (status == 0)
                return;         /* Shared IRQ: not us */
        if (status == 0xffff)
                return;         /* Hot unplug */
        if (status & error)
                do_some_error_handling();

        do {
/************************ start note *********************************/
                acknowledge_ints_ASAP();  /* don't ack rx and rxnobuff here */
/************************ end note *********************************/

                if (status & link_interrupt) {
                        spin_lock(&tp->link_lock);
                        do_some_link_stat_stuff();
                        spin_unlock(&tp->link_lock);
                }
/************************ start note *********************************/
                if ((status & rx_interrupt) || (status & rx_nobufs)) {
                        if (netif_rx_schedule_prep(dev)) {

                                /* disable interrupts caused
                                 * by arriving packets */
                                disable_rx_and_rxnobuff_ints();
                                /* tell system we have work to be done. */
                                __netif_rx_schedule(dev);
                        } else {
                                printk("driver bug! interrupt while in poll\n");
                                /* FIX by disabling interrupts */
                                disable_rx_and_rxnobuff_ints();
                        }
                }
/************************ end note *********************************/

                if (status & tx_related) {
                        spin_lock(&tp->lock);
                        tx_ring_free(dev);

                        if (tx_died)
                                restart_tx();
                        spin_unlock(&tp->lock);
                }

                status = read_interrupt_status_reg();

/************************ start note *********************************/
        } while (!(status & error) || more_work_to_be_done(status));
/************************ end note *********************************/

}

---------------------------------------------------------------------


We note several things from above:

I) Any interrupt source which is caused by arriving packets is now
turned off when it occurs. Depending on the hardware, there could be
several reasons that arriving packets would cause interrupts; these are the
interrupt sources we wish to avoid. The two common ones are a) a packet
arriving (rxint) b) a packet arriving and finding no DMA buffers available
(rxnobuff).
This also means acknowledge_ints_ASAP() will not clear the status
register for those two items above; clearing is done in the place where
proper work is done within NAPI: in the poll() and refill_rx_ring()
discussed further below.
netif_rx_schedule_prep() returns 1 if the device is in running state and
can be successfully added to the core poll list. If we get a zero value
we can _almost_ assume we are already added to the list (rather than not
running; the logic being that you shouldn't get an interrupt if not running).
We rectify this by disabling rx and rxnobuf interrupts.

II) receive_packets(dev) and make_rx_buffs_avail() appear to have disappeared.
These functionalities are actually still around ...

In fact, receive_packets(dev) is very close to my_poll() and
make_rx_buffs_avail() is invoked from my_poll().

4) converting receive_packets() to dev->poll()
===============================================

We need to convert the classical D Becker receive_packets(dev) to my_poll()

First the typical receive_packets() below:
-------------------------------------------------------------------

/* this is called by interrupt handler */
static void receive_packets (struct net_device *dev)
{

        struct my_private *tp = (struct my_private *)dev->priv;
        rx_ring = tp->rx_ring;
        cur_rx = tp->cur_rx;
        int entry = cur_rx % RX_RING_SIZE;
        int received = 0;
        int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx;

        while (rx_ring_not_empty) {
                u32 rx_status;
                unsigned int rx_size;
                unsigned int pkt_size;
                struct sk_buff *skb;
                /* read size+status of next frame from DMA ring buffer */
                /* the number 16 and 4 are just examples */
                rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
                rx_size = rx_status >> 16;
                pkt_size = rx_size - 4;

                /* process errors */
                if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
                    (!(rx_status & RxStatusOK))) {
                        netdrv_rx_err (rx_status, dev, tp, ioaddr);
                        return;
                }

                if (--rx_work_limit < 0)
                        break;

                /* grab a skb */
                skb = dev_alloc_skb (pkt_size + 2);
                if (skb) {
                        .
                        .
                        netif_rx (skb);
                        .
                        .
                } else {  /* OOM */
                        /* seems very driver specific ... some just pass
                           whatever is on the ring already. */
                }

                /* move to the next skb on the ring */
                entry = (++cur_rx) % RX_RING_SIZE;
                received++;

        }

        /* store current ring pointer state */
        tp->cur_rx = cur_rx;

        /* Refill the Rx ring buffers if they are needed */
        refill_rx_ring();
        .
        .

}
-------------------------------------------------------------------
We change it to a new one below; note the additional parameter in
the call.

-------------------------------------------------------------------

/* this is called by the network core */
static int my_poll (struct net_device *dev, int *budget)
{

        struct my_private *tp = (struct my_private *)dev->priv;
        rx_ring = tp->rx_ring;
        cur_rx = tp->cur_rx;
        int entry = cur_rx % RX_RING_SIZE;
        int received = 0;
        /* maximum packets to send to the stack */
/************************ note note *********************************/
        int rx_work_limit = dev->quota;

/************************ end note note *********************************/
        do {  /* outer loop begins here */

                clear_rx_status_register_bit();

                while (rx_ring_not_empty) {
                        u32 rx_status;
                        unsigned int rx_size;
                        unsigned int pkt_size;
                        struct sk_buff *skb;
                        /* read size+status of next frame from DMA ring buffer */
                        /* the number 16 and 4 are just examples */
                        rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
                        rx_size = rx_status >> 16;
                        pkt_size = rx_size - 4;

                        /* process errors */
                        if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
                            (!(rx_status & RxStatusOK))) {
                                netdrv_rx_err (rx_status, dev, tp, ioaddr);
                                return;
                        }

/************************ note note *********************************/
                        if (--rx_work_limit < 0) { /* we got packets, but no quota */
                                /* store current ring pointer state */
                                tp->cur_rx = cur_rx;

                                /* Refill the Rx ring buffers if they are needed */
                                refill_rx_ring(dev);
                                goto not_done;
                        }
/********************** end note **********************************/

                        /* grab a skb */
                        skb = dev_alloc_skb (pkt_size + 2);
                        if (skb) {
                                .
                                .
/************************ note note *********************************/
                                netif_receive_skb (skb);
/********************** end note **********************************/
                                .
                                .
                        } else {  /* OOM */
                                /* seems very driver specific ... common is just pass
                                   whatever is on the ring already. */
                        }

                        /* move to the next skb on the ring */
                        entry = (++cur_rx) % RX_RING_SIZE;
                        received++;

                }

                /* store current ring pointer state */
                tp->cur_rx = cur_rx;

                /* Refill the Rx ring buffers if they are needed */
                refill_rx_ring(dev);

                /* no packets on ring; but new ones can arrive since we last
                   checked */
                status = read_interrupt_status_reg();
                if (rx status is not set) {
                        /* If something arrives in this narrow window,
                           an interrupt will be generated */
                        goto done;
                }
                /* done! at least that's what it looks like ;->
                   if new packets came in after our last check on status bits
                   they'll be caught by the while check and we go back and clear them
                   since we haven't exceeded our quota */
        } while (rx_status_is_set);

done:

/************************ note note *********************************/
        dev->quota -= received;
        *budget -= received;

        /* If RX ring is not full we are out of memory. */
        if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                goto oom;

        /* we are happy/done, no more packets on ring; put us back
           to where we can start processing interrupts again */
        netif_rx_complete(dev);
        enable_rx_and_rxnobuf_ints();

        /* The last op happens after poll completion. Which means the following:
         * 1. it can race with disabling irqs in irq handler (which are done to
         * schedule polls)
         * 2. it can race with dis/enabling irqs in other poll threads
         * 3. if an irq is raised after the beginning of the outer loop
         * (marked in the code above), it will be immediately
         * triggered here.
         *
         * Summarizing: the logic may result in some redundant irqs both
         * due to races in masking and due to too late acking of already
         * processed irqs. The good news: no events are ever lost.
         */

        return 0;   /* done */

not_done:
        if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
            tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                refill_rx_ring(dev);

        if (!received) {
                printk("received==0\n");
                received = 1;
        }
        dev->quota -= received;
        *budget -= received;
        return 1;  /* not_done */

oom:
        /* Start timer, stop polling, but do not enable rx interrupts. */
        start_poll_timer(dev);
        return 0;  /* we'll take it from here so tell core "done" */

/************************ End note note *********************************/
}
-------------------------------------------------------------------

From above we note that:
0) rx_work_limit = dev->quota
1) refill_rx_ring() is in charge of clearing the bit for rxnobuff when
it does the work.
2) We have a done and not_done state.
3) Instead of netif_rx() we call netif_receive_skb() to pass the skb.
4) We have a new way of handling the oom condition.
5) A new outer do { } while loop has been added. This serves the purpose of
ensuring that if a new packet has come in after we are all set and done,
and we have not exceeded our quota, we continue sending packets up.


-----------------------------------------------------------
Poll timer code will need to do the following:

a)

        if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
            tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                refill_rx_ring(dev);

        /* If RX ring is not full we are still out of memory.
           Restart the timer again. Else we re-add ourselves
           to the master poll list.
        */

        if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                restart_timer();
        else
                netif_rx_schedule(dev);  /* we are back on the poll list */

5) dev->close() and dev->suspend() issues
==========================================
The driver writer needn't worry about this. The top net layer takes
care of it.

6) Adding new Stats to /proc
=============================
In order to debug some of the new features, we introduce new stats
that need to be collected.
TODO: Fill this later.

APPENDIX 1: discussion on using ethernet HW FC
==============================================
Most chips with flow control (FC) only send a pause packet when they run out
of Rx buffers.
Since packets are pulled off the DMA ring by a softirq in NAPI,
if the system is slow in grabbing them and we have a high input
rate (faster than the system's capacity to remove packets), then theoretically
there will only be one rx interrupt for all packets during a given packetstorm.
Under low load, we might have a single interrupt per packet.
FC should be programmed to apply in the case when the system can't pull out
packets fast enough, i.e. send a pause only when you run out of rx buffers.
Note FC in itself is a good solution but we have found it to not be
much of a commodity feature (both in NICs and switches) and hence it falls
under the same category as using NIC based mitigation. Also, experiments
indicate that it's much harder to resolve the resource allocation
issue (aka lazy receiving that NAPI offers) and hence quantifying its
usefulness proved harder. In any case, FC works even better with NAPI but is
not necessary.


APPENDIX 2: the "rotting packet" race-window avoidance scheme
=============================================================

There are two types of associations seen here

1) status/int which honors level triggered IRQ

If a status bit for receive or rxnobuff is set and the corresponding
interrupt-enable bit is not on, then no interrupts will be generated. However,
as soon as the "interrupt-enable" bit is unmasked, an immediate interrupt is
generated [assuming the status bit was not turned off].
Generally the concept of level triggered IRQs in association with a status and
interrupt-enable CSR register set is used to avoid the race.

If we take the example of the tulip:
"pending work" is indicated by the status bit (CSR5 in tulip).
The corresponding interrupt bit (CSR7 in tulip) might be turned off (but
CSR5 will continue to be turned on with new packet arrivals even if
we clear it the first time).
Very important is the fact that if we turn the interrupt bit on when
status is set, an immediate irq is triggered.

If we cleared the rx ring and proclaimed there was "no more work
to be done" and then went on to do a few other things; then when we enable
interrupts, there is a possibility that a new packet might sneak in during
this phase. It helps to look at the pseudo code for the tulip poll
routine:

--------------------------
        do {
                ACK;
                while (ring_is_not_empty()) {
                        work-work-work
                        if quota is exceeded: exit, no touching irq status/mask
                }
                /* No packets, but new can arrive while we are doing this */
                CSR5 := read
                if (CSR5 is not set) {
                        /* If something arrives in this narrow window here,
                         * where the comments are ;-> irq will be generated */
                        unmask irqs;
                        exit poll;
                }
        } while (rx_status_is_set);
------------------------

The CSR5 bit of interest is only the rx status.
If you look at the last if statement:
you just finished grabbing all the packets from the rx ring .. you check if
the status bit says there are more packets just in ... it says none; you then
enable rx interrupts again; if a new packet just came in during this check,
we are counting on CSR5 being set in that small window of opportunity
and on the re-enabling of interrupts actually triggering an interrupt
to register the new packet for processing.

[The above description may be very verbose; if you have better wording
that will make this more understandable, please suggest it.]

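The property this scheme relies on can be stated as a tiny model: with a level
triggered interrupt, the line is asserted whenever a set status bit is also
enabled. The sketch below is a userland toy; struct toy_nic, RX_BIT and the
helper names are hypothetical, and the two fields only play the roles that
CSR5 and CSR7 play in the tulip description above.

```c
#include <stdbool.h>
#include <stdint.h>

#define RX_BIT 0x1u   /* hypothetical bit position for "rx pending" */

/* Toy NIC with a status register and an interrupt-enable register. */
struct toy_nic {
    uint32_t status;   /* plays the role of CSR5 */
    uint32_t enable;   /* plays the role of CSR7 */
};

/* Level triggered: asserted whenever a set status bit is also enabled. */
static bool irq_asserted(const struct toy_nic *n)
{
    return (n->status & n->enable) != 0;
}

/* Hardware side: a new packet sets the rx status bit. */
static void packet_arrives(struct toy_nic *n) { n->status |= RX_BIT; }

/* Driver side: acknowledge, mask and unmask the rx event. */
static void ack_rx(struct toy_nic *n)    { n->status &= ~RX_BIT; }
static void mask_rx(struct toy_nic *n)   { n->enable &= ~RX_BIT; }
static void unmask_rx(struct toy_nic *n) { n->enable |= RX_BIT; }
```

If packet_arrives() runs after the final status check but before unmask_rx(),
irq_asserted() becomes true as soon as the rx bit is unmasked, so in this
model the late packet is picked up rather than left to rot.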
2) non-capable hardware

These do not generally respect level triggered IRQs. Normally,
irqs may be lost while being masked and the only way to leave poll is to do
a double check for new input after netif_rx_complete() is invoked
and re-enable polling (after seeing this new input).

Sample code:

---------
        .
        .
restart_poll:
        while (ring_is_not_empty()) {
                work-work-work
                if quota is exceeded: exit, not touching irq status/mask
        }
        .
        .
        .
        enable_rx_interrupts()
        netif_rx_complete(dev);
        if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) {
                disable_rx_and_rxnobufs()
                goto restart_poll
        }
---------

Basically netif_rx_complete() removes us from the poll list, but because a
new packet might come in which, due to the possibility of a race, would
never be caught, we attempt to re-add ourselves to the poll list.




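The double check can be exercised with a userland model of such hardware.
This is a sketch under stated assumptions: struct toy_nic, packet_arrives(),
toy_reschedule() and toy_poll_tail() are hypothetical, an event raised while
the rx interrupt is disabled is simply lost, and the quota check from the
sample code is left out to keep the model short.

```c
#include <stdbool.h>

/* Toy NIC without level triggered semantics; all fields are hypothetical. */
struct toy_nic {
    int  ring;          /* packets waiting in the rx ring */
    bool rx_enabled;    /* rx interrupt enabled */
    bool irq_pending;   /* an interrupt was raised for the CPU */
    bool scheduled;     /* device is on the poll list */
};

/* A packet lands in the ring; the interrupt is lost if rx is disabled. */
static void packet_arrives(struct toy_nic *n)
{
    n->ring++;
    if (n->rx_enabled)
        n->irq_pending = true;
}

/* Stand-in for netif_rx_reschedule(): succeeds if we were off the list. */
static bool toy_reschedule(struct toy_nic *n)
{
    if (n->scheduled)
        return false;
    n->scheduled = true;
    return true;
}

/* Tail of a poll run; 'late' packets arrive inside the race window.
 * Returns the number of packets handled. */
static int toy_poll_tail(struct toy_nic *n, int late)
{
    int handled = 0;

restart_poll:
    while (n->ring > 0) {         /* drain the ring */
        n->ring--;
        handled++;
    }
    while (late > 0) {            /* arrivals while interrupts are still off */
        packet_arrives(n);
        late--;
    }
    n->rx_enabled = true;         /* models enable_rx_interrupts() */
    n->scheduled = false;         /* models netif_rx_complete() */
    if (n->ring > 0 && toy_reschedule(n)) {
        n->rx_enabled = false;    /* models disable_rx_and_rxnobufs() */
        goto restart_poll;
    }
    return handled;
}
```

In this model the packets that arrived while interrupts were off raise no
interrupt at all; they are only handled because the ring is checked again
after the complete step and the poll is restarted.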
APPENDIX 3: Scheduling issues.
==============================
As seen, NAPI moves processing to softirq level. Linux uses ksoftirqd as the
general solution to schedule softirqs to run before the next interrupt and by
putting them under scheduler control. This also prevents consecutive softirqs
from monopolizing the CPU. It also has the effect that the priority of
ksoftirqd needs to be considered when running very CPU-intensive applications
together with networking, to get the proper softirq/user balance. Increasing
the ksoftirqd priority to 0 (eventually more) is reported to cure problems
with low network performance at high CPU load.

Most used processes in a GIGE router:
USER       PID  %CPU %MEM  SIZE   RSS TTY STAT START    TIME COMMAND
root         3   0.2  0.0     0     0  ?  RWN  Aug 15 602:00 (ksoftirqd_CPU0)
root       232   0.0  7.9 41400 40884  ?  S    Aug 15  74:12 gated

--------------------------------------------------------------------

relevant sites:
==================
ftp://robur.slu.se/pub/Linux/net-development/NAPI/


--------------------------------------------------------------------
TODO: Write net-skeleton.c driver.
-------------------------------------------------------------

Authors:
========
Alexey Kuznetsov
Jamal Hadi Salim
Robert Olsson

Acknowledgements:
================
People who made this document better:

Lennert Buytenhek
Andrew Morton
Manfred Spraul
Donald Becker
Jeff Garzik