HISTORY:
February 16/2002 -- revision 0.2.1:
COR typo corrected
February 10/2002 -- revision 0.2:
some spell checking ;->
January 12/2002 -- revision 0.1
This is still work in progress so it may change.
To keep up to date please watch this space.

Introduction to NAPI
====================

NAPI is a proven (www.cyberus.ca/~hadi/usenix-paper.tgz) technique
to improve network performance on Linux. For more details please
read that paper.
NAPI provides an "inherent mitigation" which is bound by system capacity,
as can be seen from the following data collected by Robert on Gigabit
ethernet (e1000):

 Psize    Ipps       Tput     Rxint     Txint    Done     Ndone
 ---------------------------------------------------------------
   60    890000     409362        17     27622        7     6823
  128    758150     464364        21      9301       10     7738
  256    445632     774646        42     15507       21    12906
  512    232666     994445    241292     19147   241192     1062
 1024    119061    1000003    872519     19258   872511        0
 1440     85193    1000003    946576     19505   946569        0


Legend:
"Psize" == packet size in bytes.
"Ipps" == input packets per second.
"Tput" == packets out of a total of 1M that made it out.
"Rxint" == receive interrupts seen.
"Txint" == transmit completion interrupts seen.
"Done" == the number of times that poll() managed to pull all
packets out of the rx ring. Note from this that the lower the
load, the more often we could clean up the rx ring.
"Ndone" == the converse of "Done". Note again that the higher
the load, the more times we couldn't clean up the rx ring.

Observe that:
when the NIC receives 890K packets/sec, only 17 rx interrupts are generated.
The system can't handle the processing at 1 interrupt/packet at that load
level. At lower rates, on the other hand, rx interrupts go up and therefore
the interrupt/packet ratio goes up (as observable from the table). So there
is a possibility that under low enough input, you get one poll call for each
input packet caused by a single interrupt each time. And if the system
can't handle an interrupt-per-packet ratio of 1, then it will just have to
chug along ....

0) Prerequisites:
==================
A driver MAY continue using the old 2.4 technique for interfacing
to the network stack and not benefit from the NAPI changes.
NAPI additions to the kernel do not break backward compatibility.
NAPI, however, requires the following features to be available:

A) A DMA ring, or enough RAM to store packets in software devices.

B) The ability to turn off interrupts, or at least the events that cause
packets to be sent up the stack.

NAPI processes packet events in what is known as the dev->poll() method.
Typically, only packet receive events are processed in dev->poll().
The rest of the events MAY be processed by the regular interrupt handler
to reduce processing latency (justified also because there are not that
many of them).
Note, however, that NAPI does not enforce that dev->poll() only processes
receive events.
Tests with the tulip driver indicated slightly increased latency if
all of the interrupt handler is moved to dev->poll(). Also MII handling
gets a little trickier.
The example used in this document is to move the receive processing only
to dev->poll(); this is shown with the patch for the tulip driver.
For an example of code that moves all of the interrupt handling to
dev->poll(), look at the ported e1000 code.

There are caveats that might force you to go with moving everything to
dev->poll(). Different NICs work differently depending on their status/event
acknowledgement setup.
There are two types of event register ACK mechanisms:
        I)  what is known as Clear-on-read (COR):
        when you read the status/event register, it clears everything!
        The natsemi and sunbmac NICs are known to do this.
        In this case your only choice is to move everything to dev->poll().

        II) Clear-on-write (COW):
         i) you clear the status by writing a 1 to the bit location you
                want cleared (a minimal sketch follows this list).
                These are the majority of the NICs and work best with NAPI.
                Put only receive events in dev->poll(); leave the rest in
                the old interrupt handler.
         ii) whatever you write in the status register clears everything ;->
                We can't seem to find any chip supported by Linux which does
                this. If someone knows of such a chip, please email us.
                Move everything to dev->poll().
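
To make the COW type i) case concrete, below is a minimal, hypothetical
sketch of a write-1-to-clear acknowledgement that deliberately leaves the
rx and rxnobuff bits unacknowledged so they can be handled from
dev->poll(). The register offset, bit masks and helper name are made-up
examples, not taken from any particular driver:

------------------
#include <linux/types.h>
#include <asm/io.h>

#define MY_INTR_STATUS   0x10      /* status register offset (example) */
#define MY_RX_INT        0x0001    /* packet received (example)        */
#define MY_RX_NOBUFF     0x0002    /* rx ring out of buffers (example) */

static void acknowledge_ints_ASAP(long ioaddr, u32 status)
{
        /* COW type i): writing a 1 clears only that bit.  Do NOT ack
         * the rx-related bits here -- they are cleared later from
         * dev->poll()/refill_rx_ring() where the real work is done. */
        outl(status & ~(MY_RX_INT | MY_RX_NOBUFF), ioaddr + MY_INTR_STATUS);
}
------------------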
96
 
97
C) Ability to detect new work correctly.
98
NAPI works by shutting down event interrupts when theres work and
99
turning them on when theres none.
100
New packets might show up in the small window while interrupts were being
101
re-enabled (refer to appendix 2).  A packet might sneak in during the period
102
we are enabling interrupts. We only get to know about such a packet when the
103
next new packet arrives and generates an interrupt.
104
Essentially, there is a small window of opportunity for a race condition
105
which for clarity we'll refer to as the "rotting packet".
106
 
107
This is a very important topic and appendix 2 is dedicated for more
108
discussion.
109
 
Locking rules and environmental guarantees
==========================================

- Guarantee: only one CPU at any time can call dev->poll(); this is because
only one CPU can pick up the initial interrupt and hence the initial
netif_rx_schedule(dev);
- The core layer invokes devices to send packets in a round-robin fashion.
This implies that receive processing is totally lockless because of the
guarantee that only one CPU is executing it.
- Contention can only be the result of some other CPU accessing the rx
ring. This happens only in close() and suspend() (when these methods
try to clean the rx ring);
****guarantee: driver authors need not worry about this; synchronization
is taken care of for them by the top net layer.
- Local interrupts are enabled (if you don't move everything to dev->poll()).
For example, link/MII and txcomplete continue functioning just the same old
way. This improves the latency of processing these events. It is also assumed
that the receive interrupt is the largest cause of noise. Note this might not
always be true.
[According to Manfred Spraul, the winbond chip insists on sending one
txmitcomplete interrupt for each packet (although this can be mitigated)].
For these broken drivers, move everything to dev->poll().

For the rest of this text, we'll assume that dev->poll() only
processes receive events.

New methods introduced by NAPI
==============================

a) netif_rx_schedule(dev)
Called by an IRQ handler to schedule a poll for the device.

b) netif_rx_schedule_prep(dev)
Puts the device in a state which allows it to be added to the
CPU polling list if it is up and running. You can look at this as
the first half of netif_rx_schedule(dev) above; the second half
being c) below.

c) __netif_rx_schedule(dev)
Adds the device to the poll list for this CPU, assuming that _prep above
has already been called and returned 1.

d) netif_rx_reschedule(dev, undo)
Called to reschedule polling for the device, specifically for some
deficient hardware. Read Appendix 2 for more details.

e) netif_rx_complete(dev)

Removes the interface from the CPU poll list: it must be in the poll list
on the current CPU. This primitive is called by dev->poll() when
it completes its work. The device cannot be out of the poll list at this
call; if it is, then clearly it is a BUG(). You'll know ;->

All of the above methods are used below, so keep reading for clarity; a
rough sketch of how a), b) and c) relate follows.
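
As an illustrative sketch modelled on the 2.4 inline helpers (not a
verbatim copy of include/linux/netdevice.h), netif_rx_schedule() is
essentially _prep followed by the raw __netif_rx_schedule():

------------------
static inline void netif_rx_schedule(struct net_device *dev)
{
        /* mark the device as scheduled if it is up and not already
         * being polled ... */
        if (netif_rx_schedule_prep(dev))
                /* ... then actually add it to this CPU's poll list
                 * and raise the rx softirq */
                __netif_rx_schedule(dev);
}
------------------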

Device driver changes to be made when porting NAPI
==================================================

Below we describe what kind of changes are required for NAPI to work.

1) introduction of dev->poll() method
=====================================

This is the method that is invoked by the network core when it requests
new packets from the driver. A driver is allowed to send up to
dev->quota packets on the current CPU before yielding back to the network
subsystem (so other devices can also get the opportunity to send to the
stack).

The dev->poll() prototype looks as follows:
int my_poll(struct net_device *dev, int *budget)

budget is the remaining number of packets the network subsystem on the
current CPU can send up the stack before yielding to other system tasks.
Each driver is responsible for decrementing *budget by the total number of
packets sent.
        The total number of packets cannot exceed dev->quota.

The dev->poll() method is invoked by the top layer; the driver just sends
up to the stack, if it can, the packet quantity requested.

More on dev->poll() below, after the interrupt changes are explained; a
minimal sketch of the accounting contract follows first.
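
The sketch below illustrates only the budget/quota bookkeeping described
above; it is not a complete poll routine. The receive loop is elided and
enable_rx_and_rxnobuf_ints() is the driver-specific helper used in the
full example in section 4:

------------------
/* Sketch of the dev->poll() bookkeeping only.  Return 0 when the rx
 * ring was emptied ("done"), 1 when we ran out of quota ("not done"). */
static int my_poll(struct net_device *dev, int *budget)
{
        int received = 0;
        int rx_work_limit = dev->quota;   /* never send more than this   */

        /* ... pull at most rx_work_limit packets off the rx ring,
         * handing each to netif_receive_skb() and counting it in
         * 'received' ... */

        dev->quota -= received;           /* account against our quota   */
        *budget -= received;              /* and the CPU's global budget  */

        if (received < rx_work_limit) {   /* ring emptied before quota    */
                netif_rx_complete(dev);   /* leave the poll list          */
                enable_rx_and_rxnobuf_ints();
                return 0;                 /* done                         */
        }
        return 1;                         /* not done; poll us again      */
}
------------------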

2) registering dev->poll() method
===================================

dev->poll should be set in the dev->probe() method,
e.g.:
dev->open = my_open;
.
.
/* two new additions */
/* first register my poll method */
dev->poll = my_poll;
/* next register my weight/quanta; can be overridden in /proc */
dev->weight = 16;
.
.
dev->stop = my_close;


3) scheduling dev->poll()
=============================
This involves modifying the interrupt handler and the code
path which takes the packets off the NIC and sends them to the
stack.

It is important at this point to introduce the classical D. Becker
interrupt processor:

------------------
static void
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{

        struct net_device *dev = (struct net_device *)dev_id;
        struct my_private *tp = (struct my_private *)dev->priv;

        int work_count = my_work_count;
        status = read_interrupt_status_reg();
        if (status == 0)
                return;         /* Shared IRQ: not us */
        if (status == 0xffff)
                return;         /* Hot unplug */
        if (status & error)
                do_some_error_handling();

        do {
                acknowledge_ints_ASAP();

                if (status & link_interrupt) {
                        spin_lock(&tp->link_lock);
                        do_some_link_stat_stuff();
                        spin_unlock(&tp->link_lock);
                }

                if (status & rx_interrupt) {
                        receive_packets(dev);
                }

                if (status & rx_nobufs) {
                        make_rx_buffs_avail();
                }

                if (status & tx_related) {
                        spin_lock(&tp->lock);
                        tx_ring_free(dev);
                        if (tx_died)
                                restart_tx();
                        spin_unlock(&tp->lock);
                }

                status = read_interrupt_status_reg();

        } while (!(status & error) || more_work_to_be_done);

}

----------------------------------------------------------------------

We now change this to what is shown below to NAPI-enable it:

----------------------------------------------------------------------
static void
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        struct net_device *dev = (struct net_device *)dev_id;
        struct my_private *tp = (struct my_private *)dev->priv;

        status = read_interrupt_status_reg();
        if (status == 0)
                return;         /* Shared IRQ: not us */
        if (status == 0xffff)
                return;         /* Hot unplug */
        if (status & error)
                do_some_error_handling();

        do {
/************************ start note *********************************/
                acknowledge_ints_ASAP();  // don't ack rx and rxnobuff here
/************************ end note ***********************************/

                if (status & link_interrupt) {
                        spin_lock(&tp->link_lock);
                        do_some_link_stat_stuff();
                        spin_unlock(&tp->link_lock);
                }
/************************ start note *********************************/
                if ((status & rx_interrupt) || (status & rx_nobuffs)) {
                        if (netif_rx_schedule_prep(dev)) {

                                /* disable interrupts caused
                                 *      by arriving packets */
                                disable_rx_and_rxnobuff_ints();
                                /* tell the system we have work to be done. */
                                __netif_rx_schedule(dev);
                        } else {
                                printk("driver bug! interrupt while in poll\n");
                                /* FIX by disabling interrupts  */
                                disable_rx_and_rxnobuff_ints();
                        }
                }
/************************ end note ***********************************/

                if (status & tx_related) {
                        spin_lock(&tp->lock);
                        tx_ring_free(dev);

                        if (tx_died)
                                restart_tx();
                        spin_unlock(&tp->lock);
                }

                status = read_interrupt_status_reg();

/************************ start note *********************************/
        } while (!(status & error) || more_work_to_be_done(status));
/************************ end note ***********************************/

}

---------------------------------------------------------------------


We note several things from the above:

I) Any interrupt source which is caused by arriving packets is now
turned off when it occurs. Depending on the hardware, there could be
several reasons that arriving packets would cause interrupts; these are the
interrupt sources we wish to avoid. The two common ones are a) a packet
arriving (rxint) and b) a packet arriving and finding no DMA buffers
available (rxnobuff).
This also means that acknowledge_ints_ASAP() will not clear the status
register for those two items above; clearing is done in the place where
the proper work is done within NAPI: in poll() and refill_rx_ring(),
discussed further below.
netif_rx_schedule_prep() returns 1 if the device is in a running state and
was successfully added to the core poll list. If we get a zero value
we can _almost_ assume we are already on the list (rather than not running;
this logic is based on the fact that you shouldn't get an interrupt if the
device is not running). We rectify this by disabling rx and rxnobuff
interrupts.

II) receive_packets(dev) and make_rx_buffs_avail() may have disappeared.
These functionalities are actually still around......

In fact, receive_packets(dev) is very close to my_poll(), and
make_rx_buffs_avail() is invoked from my_poll().

4) converting receive_packets() to dev->poll()
===============================================

We need to convert the classical D. Becker receive_packets(dev) to my_poll().

First, the typical receive_packets() is shown below:
-------------------------------------------------------------------

/* this is called by the interrupt handler */
static void receive_packets (struct net_device *dev)
{

        struct my_private *tp = (struct my_private *)dev->priv;
        rx_ring = tp->rx_ring;
        cur_rx = tp->cur_rx;
        int entry = cur_rx % RX_RING_SIZE;
        int received = 0;
        int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx;

        while (rx_ring_not_empty) {
                u32 rx_status;
                unsigned int rx_size;
                unsigned int pkt_size;
                struct sk_buff *skb;
                /* read size+status of next frame from DMA ring buffer */
                /* the numbers 16 and 4 are just examples */
                rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
                rx_size = rx_status >> 16;
                pkt_size = rx_size - 4;

                /* process errors */
                if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
                    (!(rx_status & RxStatusOK))) {
                        netdrv_rx_err (rx_status, dev, tp, ioaddr);
                        return;
                }

                if (--rx_work_limit < 0)
                        break;

                /* grab a skb */
                skb = dev_alloc_skb (pkt_size + 2);
                if (skb) {
                        .
                        .
                        netif_rx (skb);
                        .
                        .
                } else {  /* OOM */
                        /* seems very driver specific ... some just pass
                        whatever is on the ring already. */
                }

                /* move to the next skb on the ring */
                entry = (++tp->cur_rx) % RX_RING_SIZE;
                received++;

        }

        /* store current ring pointer state */
        tp->cur_rx = cur_rx;

        /* Refill the Rx ring buffers if they are needed */
        refill_rx_ring();
        .
        .

}
-------------------------------------------------------------------
We change it to the new one below; note the additional parameter in
the call.

-------------------------------------------------------------------

/* this is called by the network core */
static int my_poll (struct net_device *dev, int *budget)
{

        struct my_private *tp = (struct my_private *)dev->priv;
        rx_ring = tp->rx_ring;
        cur_rx = tp->cur_rx;
        int entry = cur_rx % RX_RING_SIZE;
        int received = 0;
        /* maximum packets to send to the stack */
/************************ note *********************************/
        int rx_work_limit = dev->quota;

/************************ end note *****************************/
    do {  // outer beginning loop starts here

        clear_rx_status_register_bit();

        while (rx_ring_not_empty) {
                u32 rx_status;
                unsigned int rx_size;
                unsigned int pkt_size;
                struct sk_buff *skb;
                /* read size+status of next frame from DMA ring buffer */
                /* the numbers 16 and 4 are just examples */
                rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
                rx_size = rx_status >> 16;
                pkt_size = rx_size - 4;

                /* process errors */
                if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
                    (!(rx_status & RxStatusOK))) {
                        netdrv_rx_err (rx_status, dev, tp, ioaddr);
                        return 1;
                }

/************************ note *********************************/
                if (--rx_work_limit < 0) { /* we got packets, but no quota */
                        /* store current ring pointer state */
                        tp->cur_rx = cur_rx;

                        /* Refill the Rx ring buffers if they are needed */
                        refill_rx_ring(dev);
                        goto not_done;
                }
/**********************  end note ******************************/

                /* grab a skb */
                skb = dev_alloc_skb (pkt_size + 2);
                if (skb) {
                        .
                        .
/************************ note *********************************/
                        netif_receive_skb (skb);
/**********************  end note ******************************/
                        .
                        .
                } else {  /* OOM */
                        /* seems very driver specific ... the common practice
                        is to just reuse whatever is on the ring already. */
                }

                /* move to the next skb on the ring */
                entry = (++tp->cur_rx) % RX_RING_SIZE;
                received++;

        }

        /* store current ring pointer state */
        tp->cur_rx = cur_rx;

        /* Refill the Rx ring buffers if they are needed */
        refill_rx_ring(dev);

        /* no packets on ring; but new ones can arrive since we last
           checked  */
        status = read_interrupt_status_reg();
        if (rx status is not set) {
                        /* If something arrives in this narrow window,
                        an interrupt will be generated */
                        goto done;
        }
        /* done! at least that's what it looks like ;->
        if new packets came in after our last check on the status bits
        they'll be caught by the while check and we go back and clear them,
        since we haven't exceeded our quota */
    } while (rx_status_is_set);

done:

/************************ note *********************************/
        dev->quota -= received;
        *budget -= received;

        /* If the RX ring is not full we are out of memory. */
        if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                goto oom;

        /* we are happy/done, no more packets on ring; put us back
        to where we can start processing interrupts again */
        netif_rx_complete(dev);
        enable_rx_and_rxnobuf_ints();

       /* The last op happens after poll completion. Which means the following:
        * 1. it can race with disabling irqs in the irq handler (which is done
        * to schedule polls)
        * 2. it can race with dis/enabling irqs in other poll threads
        * 3. if an irq is raised after the beginning of the outer loop
        * (marked in the code above), it will be immediately
        * triggered here.
        *
        * Summarizing: the logic may result in some redundant irqs both
        * due to races in masking and due to too-late acking of already
        * processed irqs. The good news: no events are ever lost.
        */

        return 0;   /* done */

not_done:
        if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
            tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                refill_rx_ring(dev);

        if (!received) {
                printk("received==0\n");
                received = 1;
        }
        dev->quota -= received;
        *budget -= received;
        return 1;  /* not_done */

oom:
        /* Start timer, stop polling, but do not enable rx interrupts. */
        start_poll_timer(dev);
        return 0;  /* we'll take it from here so tell core "done" */

/************************ End note *****************************/
}
-------------------------------------------------------------------

From the above we note that:
0) rx_work_limit = dev->quota
1) refill_rx_ring() is in charge of clearing the bit for rxnobuff when
it does the work.
2) We have a done and a not_done state.
3) Instead of netif_rx() we call netif_receive_skb() to pass the skb.
4) We have a new way of handling the oom condition.
5) A new outer do {} while() loop has been added. This serves the purpose of
ensuring that, if a new packet comes in after we are all set and done and
we have not exceeded our quota, we continue sending packets up.


-----------------------------------------------------------
The poll timer code will need to do the following (a sketch of arming the
timer itself follows the snippet):

a)

        if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
            tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                refill_rx_ring(dev);

        /* If the RX ring is still not full we are still out of memory.
           Restart the timer again. Else we re-add ourselves
           to the master poll list.
         */

        if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                restart_timer();

        else netif_rx_schedule(dev);  /* we are back on the poll list */
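
Arming the timer itself is not shown above. Below is a minimal, hypothetical
sketch using the 2.4 timer API; the names start_poll_timer(), my_poll_timer()
and the tp->oom_timer field are illustrative assumptions, not code from a
real driver:

------------------
#include <linux/timer.h>
#include <linux/sched.h>
#include <linux/netdevice.h>

/* Timer handler: runs body a) above outside interrupt context. */
static void my_poll_timer(unsigned long data)
{
        struct net_device *dev = (struct net_device *)data;

        /* ... body a) from above: try to refill the rx ring, then
         * either restart_timer() or netif_rx_schedule(dev) ... */
}

/* Called from the oom: path of my_poll() above. */
static void start_poll_timer(struct net_device *dev)
{
        struct my_private *tp = (struct my_private *)dev->priv;

        init_timer(&tp->oom_timer);               /* assumed field     */
        tp->oom_timer.function = my_poll_timer;
        tp->oom_timer.data = (unsigned long)dev;
        tp->oom_timer.expires = jiffies + HZ/10;  /* ~100ms, arbitrary */
        add_timer(&tp->oom_timer);
}
------------------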

5) dev->close() and dev->suspend() issues
==========================================
The driver writer needn't worry about these. The top net layer takes
care of them.

6) Adding new Stats to /proc
=============================
In order to debug some of the new features, we introduce new stats
that need to be collected.
TODO: Fill this later.

APPENDIX 1: discussion on using ethernet HW FC
==============================================
Most chips with FC only send a pause packet when they run out of Rx buffers.
Since packets are pulled off the DMA ring by a softirq in NAPI,
if the system is slow in grabbing them and we have a high input
rate (faster than the system's capacity to remove packets), then theoretically
there will only be one rx interrupt for all packets during a given packetstorm.
Under low load, we might have a single interrupt per packet.
FC should be programmed to apply in the case when the system can't pull out
packets fast enough, i.e. send a pause only when you run out of rx buffers.
Note that FC in itself is a good solution but we have found it to not be
much of a commodity feature (in both NICs and switches), and hence it falls
under the same category as using NIC-based mitigation. Also, experiments
indicate that it's much harder to resolve the resource allocation
issue (aka the lazy receiving that NAPI offers), and hence quantifying its
usefulness proved harder. In any case, FC works even better with NAPI but is
not necessary.

631
 
632
APPENDIX 2: the "rotting packet" race-window avoidance scheme
633
=============================================================
634
 
635
There are two types of associations seen here
636
 
637
1) status/int which honors level triggered IRQ
638
 
639
If a status bit for receive or rxnobuff is set and the corresponding
640
interrupt-enable bit is not on, then no interrupts will be generated. However,
641
as soon as the "interrupt-enable" bit is unmasked, an immediate interrupt is
642
generated.  [assuming the status bit was not turned off].
643
Generally the concept of level triggered IRQs in association with a status and
644
interrupt-enable CSR register set is used to avoid the race.
645
 
646
If we take the example of the tulip:
647
"pending work" is indicated by the status bit(CSR5 in tulip).
648
the corresponding interrupt bit (CSR7 in tulip) might be turned off (but
649
the CSR5 will continue to be turned on with new packet arrivals even if
650
we clear it the first time)
651
Very important is the fact that if we turn on the interrupt bit on when
652
status is set that an immediate irq is triggered.
653
 
654
If we cleared the rx ring and proclaimed there was "no more work
655
to be done" and then went on to do a few other things;  then when we enable
656
interrupts, there is a possibility that a new packet might sneak in during
657
this phase. It helps to look at the pseudo code for the tulip poll
658
routine:
659
 
--------------------------
        do {
                ACK;
                while (ring_is_not_empty()) {
                        work-work-work
                        if quota is exceeded: exit, no touching irq status/mask
                }
                /* No packets, but new ones can arrive while we are doing this */
                CSR5 := read
                if (CSR5 is not set) {
                        /* If something arrives in this narrow window here,
                        *  where the comments are ;-> an irq will be generated */
                        unmask irqs;
                        exit poll;
                }
        } while (rx_status_is_set);
------------------------

The CSR5 bit of interest is only the rx status.
Look at the last if statement:
you have just finished grabbing all the packets from the rx ring; you check
if the status bit says there are more packets just in; it says none; you then
enable rx interrupts again; if a new packet just came in during this check,
we are counting on CSR5 being set in that small window of opportunity,
so that by re-enabling interrupts we would actually trigger an interrupt
to register the new packet for processing.

[The above description may be very verbose. If you have better wording
that makes this more understandable, please suggest it.]

2) non-capable hardware

These do not generally respect level-triggered IRQs. Normally,
IRQs may be lost while being masked, and the only way to leave poll is to do
a double check for new input after netif_rx_complete() is invoked,
and to re-enable polling (after seeing this new input).

Sample code:

---------
        .
        .
restart_poll:
        while (ring_is_not_empty()) {
                work-work-work
                if quota is exceeded: exit, not touching irq status/mask
        }
        .
        .
        .
        enable_rx_interrupts()
        netif_rx_complete(dev);
        if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) {
                disable_rx_and_rxnobufs()
                goto restart_poll
        }
---------

718
Basically netif_rx_complete() removes us from the poll list, but because a
719
new packet which will never be caught due to the possibility of a race
720
might come in, we attempt to re-add ourselves to the poll list.
721
 
722
 
723
 
724
 
APPENDIX 3: Scheduling issues.
==============================
As seen, NAPI moves processing to softirq level. Linux uses ksoftirqd as the
general solution to schedule softirqs to run before the next interrupt and to
put them under scheduler control. This also prevents consecutive softirqs from
monopolizing the CPU. It also has the effect that the priority of ksoftirqd
needs to be considered when running very CPU-intensive applications and
networking together, to get the proper softirq/user balance. Increasing the
ksoftirqd priority to 0 (or possibly higher) is reported to cure problems
with low network performance at high CPU load.

Most used processes in a GigE router:
USER       PID %CPU %MEM  SIZE   RSS TTY STAT START   TIME COMMAND
root         3  0.2  0.0     0     0  ?  RWN Aug 15 602:00 (ksoftirqd_CPU0)
root       232  0.0  7.9 41400 40884  ?  S   Aug 15  74:12 gated

--------------------------------------------------------------------

relevant sites:
==================
ftp://robur.slu.se/pub/Linux/net-development/NAPI/


--------------------------------------------------------------------
TODO: Write net-skeleton.c driver.
-------------------------------------------------------------

Authors:
========
Alexey Kuznetsov
Jamal Hadi Salim
Robert Olsson

Acknowledgements:
================
People who made this document better:

Lennert Buytenhek
Andrew Morton
Manfred Spraul
Donald Becker
Jeff Garzik
