Unreliable Guide To Locking

Rusty Russell
rusty@rustcorp.com.au

Copyright 2003 Rusty Russell

This documentation is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option) any
later version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more
details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 59 Temple
Place, Suite 330, Boston, MA 02111-1307 USA

For more details see the file COPYING in the source distribution of Linux.
Introduction

Welcome, to Rusty's Remarkably Unreliable Guide to Kernel Locking issues.
This document describes the locking systems in the Linux Kernel in 2.6.

With the wide availability of HyperThreading, and preemption in the Linux
Kernel, everyone hacking on the kernel needs to know the fundamentals of
concurrency and locking for SMP.
The Problem With Concurrency

(Skip this if you know what a Race Condition is).

In a normal program, you can increment a counter like so:

        very_important_count++;

This is what they would expect to happen:

Expected Results

        Instance 1                          Instance 2
        read very_important_count (5)
        add 1 (6)
        write very_important_count (6)
                                            read very_important_count (6)
                                            add 1 (7)
                                            write very_important_count (7)

This is what might happen:

Possible Results

        Instance 1                          Instance 2
        read very_important_count (5)
                                            read very_important_count (5)
        add 1 (6)
                                            add 1 (6)
        write very_important_count (6)
                                            write very_important_count (6)
Race Conditions and Critical Regions

This overlap, where the result depends on the relative timing of multiple
tasks, is called a race condition.  The piece of code containing the
concurrency issue is called a critical region.  Especially since Linux
started running on SMP machines, race conditions have become one of the
major issues in kernel design and implementation.

Preemption can have the same effect, even if there is only one CPU: by
preempting one task during the critical region, we have exactly the same
race condition.  In this case the thread which preempts might run the
critical region itself.

The solution is to recognize when these simultaneous accesses occur, and use
locks to make sure that only one instance can enter the critical region at
any time.  There are many friendly primitives in the Linux kernel to help
you do this.  And then there are the unfriendly primitives, but I'll pretend
they don't exist.
Locking in the Linux Kernel

If I could give you one piece of advice: never sleep with anyone crazier
than yourself.  But if I had to give you advice on locking: keep it simple.

Be reluctant to introduce new locks.

Strangely enough, this last one is the exact reverse of my advice when you
have slept with someone crazier than yourself.  And you should think about
getting a big dog.
Three Main Types of Kernel Locks: Spinlocks, Mutexes and Semaphores

There are three main types of kernel locks.  The fundamental type is the
spinlock (include/asm/spinlock.h), which is a very simple single-holder
lock: if you can't get the spinlock, you keep trying (spinning) until you
can.  Spinlocks are very small and fast, and can be used anywhere.

The second type is a mutex (include/linux/mutex.h): it is like a spinlock,
but you may block holding a mutex.  If you can't lock a mutex, your task
will suspend itself, and be woken up when the mutex is released.  This means
the CPU can do something else while you are waiting.  There are many cases
when you simply can't sleep (see the section on which functions are safe to
call from interrupts, below), and so have to use a spinlock instead.

The third type is a semaphore (include/asm/semaphore.h): it can have more
than one holder at any time (the number decided at initialization time),
although it is most commonly used as a single-holder lock (a mutex).  If you
can't get a semaphore, your task will be suspended and later on woken up -
just like for mutexes.

Neither type of lock is recursive: see the section on deadlock below.
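As a quick orientation before the details below, here is a minimal sketch
(not from the original text) of the first two lock types side by side; the
names are made up for illustration, and the mutex calls assume a kernel
recent enough to have the mutex API (2.6.16 or later):

#include <linux/spinlock.h>
#include <linux/mutex.h>
#include <linux/string.h>

static spinlock_t counter_lock = SPIN_LOCK_UNLOCKED;   /* assumed name */
static DEFINE_MUTEX(config_mutex);                     /* assumed name */
static int counter;
static char config[64];

void bump_counter(void)
{
        /* A spinlock never sleeps: usable even where sleeping is forbidden. */
        spin_lock(&counter_lock);
        counter++;
        spin_unlock(&counter_lock);
}

void set_config(const char *s)
{
        /* A mutex may sleep while waiting, so only use it in user context. */
        mutex_lock(&config_mutex);
        strlcpy(config, s, sizeof(config));
        mutex_unlock(&config_mutex);
}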
Locks and Uniprocessor Kernels

For kernels compiled without CONFIG_SMP, and without CONFIG_PREEMPT,
spinlocks do not exist at all.  This is an excellent design decision: when
no-one else can run at the same time, there is no reason to have a lock.

If the kernel is compiled without CONFIG_SMP, but CONFIG_PREEMPT is set,
then spinlocks simply disable preemption, which is sufficient to prevent any
races.  For most purposes, we can think of preemption as equivalent to SMP,
and not worry about it separately.

You should always test your locking code with CONFIG_SMP and CONFIG_PREEMPT
enabled, even if you don't have an SMP test box, because it will still catch
some kinds of locking bugs.

Semaphores still exist, because they are required for synchronization
between user contexts, as we will see below.
Locking Only In User Context

If you have a data structure which is only ever accessed from user context,
then you can use a simple semaphore (linux/asm/semaphore.h) to protect it.
This is the most trivial case: you initialize the semaphore to the number of
resources available (usually 1), and call down_interruptible() to grab the
semaphore, and up() to release it.  There is also a down(), which should be
avoided, because it will not return if a signal is received.

Example: linux/net/core/netfilter.c allows registration of new setsockopt()
and getsockopt() calls, with nf_register_sockopt().  Registration and
de-registration are only done on module load and unload (and boot time,
where there is no concurrency), and the list of registrations is only
consulted for an unknown setsockopt() or getsockopt() system call.  The
nf_sockopt_mutex is perfect to protect this, especially since the setsockopt
and getsockopt calls may well sleep.
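The pattern looks like the following minimal sketch (not from the original
text; the list and function names are made up for illustration):

#include <linux/list.h>
#include <linux/errno.h>
#include <asm/semaphore.h>

static DECLARE_MUTEX(my_ops_sem);        /* a semaphore initialized to 1 */
static LIST_HEAD(my_ops);                /* only touched from user context */

int my_register(struct list_head *new)
{
        /* Grab the semaphore; bail out cleanly if a signal arrives. */
        if (down_interruptible(&my_ops_sem))
                return -ERESTARTSYS;
        list_add(new, &my_ops);
        up(&my_ops_sem);
        return 0;
}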
Locking Between User Context and Softirqs

If a softirq shares data with user context, you have two problems.  Firstly,
the current user context can be interrupted by a softirq, and secondly, the
critical region could be entered from another CPU.  This is where
spin_lock_bh() (include/linux/spinlock.h) is used.  It disables softirqs on
that CPU, then grabs the lock.  spin_unlock_bh() does the reverse.  (The
'_bh' suffix is a historical reference to "Bottom Halves", the old name for
software interrupts.  It should really be called spin_lock_softirq() in a
perfect world).

Note that you can also use spin_lock_irq() or spin_lock_irqsave() here,
which stop hardware interrupts as well: see the Hard IRQ Context section
below.

This works perfectly for UP as well: the spin lock vanishes, and this macro
simply becomes local_bh_disable() (include/linux/interrupt.h), which
protects you from the softirq being run.
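A minimal sketch of this situation (not from the original text; names made
up for illustration):

#include <linux/spinlock.h>
#include <linux/list.h>

static spinlock_t queue_lock = SPIN_LOCK_UNLOCKED;
static LIST_HEAD(queue);

/* Called from user context: block softirqs on this CPU while we hold the
 * lock, so the softirq below cannot interrupt us and spin on the lock. */
void user_context_add(struct list_head *new)
{
        spin_lock_bh(&queue_lock);
        list_add_tail(new, &queue);
        spin_unlock_bh(&queue_lock);
}

/* Called from softirq context (e.g. a tasklet): plain spin_lock() is enough
 * here, since user context cannot preempt a softirq on this CPU. */
void softirq_consume(void)
{
        spin_lock(&queue_lock);
        /* ... pull entries off the queue ... */
        spin_unlock(&queue_lock);
}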
Locking Between User Context and Tasklets

This is exactly the same as above, because tasklets are actually run from a
softirq.

Locking Between User Context and Timers

This, too, is exactly the same as above, because timers are actually run
from a softirq.  From a locking point of view, tasklets and timers are
identical.
Locking Between Tasklets/Timers

Sometimes a tasklet or timer might want to share data with another tasklet
or timer.

The Same Tasklet/Timer

Since a tasklet is never run on two CPUs at once, you don't need to worry
about your tasklet being reentrant (running twice at once), even on SMP.

Different Tasklets/Timers

If another tasklet/timer wants to share data with your tasklet or timer, you
will both need to use spin_lock() and spin_unlock() calls.  spin_lock_bh()
is unnecessary here, as you are already in a tasklet, and none will be run
on the same CPU.
Locking Between Softirqs

Often a softirq might want to share data with itself or a tasklet/timer.

The Same Softirq

The same softirq can run on the other CPUs: you can use a per-CPU array (see
the Per-CPU Data section below) for better performance.  If you're going so
far as to use a softirq, you probably care about scalable performance enough
to justify the extra complexity.

You'll need to use spin_lock() and spin_unlock() for shared data.

Different Softirqs

You'll need to use spin_lock() and spin_unlock() for shared data, whether it
be a timer, tasklet, different softirq or the same or another softirq: any
of them could be running on a different CPU.
Hard IRQ Context

Hardware interrupts usually communicate with a tasklet or softirq.
Frequently this involves putting work in a queue, which the softirq will
take out.

Locking Between Hard IRQ and Softirqs/Tasklets

If a hardware irq handler shares data with a softirq, you have two concerns.
Firstly, the softirq processing can be interrupted by a hardware interrupt,
and secondly, the critical region could be entered by a hardware interrupt
on another CPU.  This is where spin_lock_irq() is used.  It is defined to
disable interrupts on that cpu, then grab the lock.  spin_unlock_irq() does
the reverse.

The irq handler does not need to use spin_lock_irq(), because the softirq
cannot run while the irq handler is running: it can use spin_lock(), which
is slightly faster.  The only exception would be if a different hardware irq
handler uses the same lock: spin_lock_irq() will stop that from interrupting
us.

This works perfectly for UP as well: the spin lock vanishes, and this macro
simply becomes local_irq_disable() (include/asm/smp.h), which protects you
from the softirq/tasklet/BH being run.

spin_lock_irqsave() (include/linux/spinlock.h) is a variant which saves
whether interrupts were on or off in a flags word, which is passed to
spin_unlock_irqrestore().  This means that the same code can be used inside
a hard irq handler (where interrupts are already off) and in softirqs (where
the irq disabling is required).

Note that softirqs (and hence tasklets and timers) are run on return from
hardware interrupts, so spin_lock_irq() also stops these.  In that sense,
spin_lock_irqsave() is the most general and powerful locking function.

Locking Between Two Hard IRQ Handlers

It is rare to have to share data between two IRQ handlers, but if you do,
spin_lock_irqsave() should be used: it is architecture-specific whether all
interrupts are disabled inside irq handlers themselves.
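The following minimal sketch (not from the original text; names and handler
signature made up for illustration, assuming the 2.6.19+ style irq handler
prototype) shows the combination described above: the handler itself takes
the plain lock, everyone else disables interrupts.

#include <linux/spinlock.h>
#include <linux/interrupt.h>

static spinlock_t dev_lock = SPIN_LOCK_UNLOCKED;
static unsigned int pending_events;

irqreturn_t my_interrupt(int irq, void *dev_id)
{
        /* Interrupts are already off on this CPU, and the softirq cannot
         * run while we are running: plain spin_lock() is enough. */
        spin_lock(&dev_lock);
        pending_events++;
        spin_unlock(&dev_lock);
        return IRQ_HANDLED;
}

/* Called from user context (or a softirq): must keep the irq handler out. */
unsigned int read_and_clear_events(void)
{
        unsigned long flags;
        unsigned int n;

        spin_lock_irqsave(&dev_lock, flags);
        n = pending_events;
        pending_events = 0;
        spin_unlock_irqrestore(&dev_lock, flags);
        return n;
}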
Cheat Sheet For Locking

Pete Zaitcev gives the following summary:

- If you are in a process context (any syscall) and want to lock other
  processes out, use a semaphore.  You can take a semaphore and sleep
  (copy_from_user*() or kmalloc(x, GFP_KERNEL)).

- Otherwise (== data can be touched in an interrupt), use
  spin_lock_irqsave() and spin_unlock_irqrestore().

- Avoid holding a spinlock for more than 5 lines of code and across any
  function call (except accessors like readb).
Table of Minimum Requirements

The following table lists the minimum locking requirements between various
contexts.  In some cases, the same context can only be running on one CPU at
a time, so no locking is required for that context (eg. a particular thread
can only run on one CPU at a time, but if it needs to share data with
another thread, locking is required).

Remember the advice above: you can always use spin_lock_irqsave(), which is
a superset of all other spinlock primitives.
Table of Locking Requirements

(Columns are in the same order as the rows: IRQ Handler A, IRQ Handler B,
Softirq A, Softirq B, Tasklet A, Tasklet B, Timer A, Timer B, User Context
A, User Context B.  Each cell gives the minimum requirement for the row
context when sharing data with the column context.)

IRQ Handler A   None
IRQ Handler B   SLIS  None
Softirq A       SLI   SLI   SL
Softirq B       SLI   SLI   SL    SL
Tasklet A       SLI   SLI   SL    SL    None
Tasklet B       SLI   SLI   SL    SL    SL    None
Timer A         SLI   SLI   SL    SL    SL    SL    None
Timer B         SLI   SLI   SL    SL    SL    SL    SL    None
User Context A  SLI   SLI   SLBH  SLBH  SLBH  SLBH  SLBH  SLBH  None
User Context B  SLI   SLI   SLBH  SLBH  SLBH  SLBH  SLBH  SLBH  DI    None

Legend for Locking Requirements Table

SLIS    spin_lock_irqsave
SLI     spin_lock_irq
SL      spin_lock
SLBH    spin_lock_bh
DI      down_interruptible
Common Examples

Let's step through a simple example: a cache of number to name mappings.
The cache keeps a count of how often each of the objects is used, and when
it gets full, throws out the least used one.

All In User Context

For our first example, we assume that all operations are in user context
(ie. from system calls), so we can sleep.  This means we can use a semaphore
to protect the cache and all the objects within it.  Here's the code:
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <asm/semaphore.h>
#include <asm/errno.h>

struct object
{
        struct list_head list;
        int id;
        char name[32];
        int popularity;
};

/* Protects the cache, cache_num, and the objects within it */
static DECLARE_MUTEX(cache_lock);
static LIST_HEAD(cache);
static unsigned int cache_num = 0;
#define MAX_CACHE_SIZE 10

/* Must be holding cache_lock */
static struct object *__cache_find(int id)
{
        struct object *i;

        list_for_each_entry(i, &cache, list)
                if (i->id == id) {
                        i->popularity++;
                        return i;
                }
        return NULL;
}

/* Must be holding cache_lock */
static void __cache_delete(struct object *obj)
{
        BUG_ON(!obj);
        list_del(&obj->list);
        kfree(obj);
        cache_num--;
}

/* Must be holding cache_lock */
static void __cache_add(struct object *obj)
{
        list_add(&obj->list, &cache);
        if (++cache_num > MAX_CACHE_SIZE) {
                struct object *i, *outcast = NULL;
                list_for_each_entry(i, &cache, list) {
                        if (!outcast || i->popularity < outcast->popularity)
                                outcast = i;
                }
                __cache_delete(outcast);
        }
}

int cache_add(int id, const char *name)
{
        struct object *obj;

        if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL)
                return -ENOMEM;

        strlcpy(obj->name, name, sizeof(obj->name));
        obj->id = id;
        obj->popularity = 0;

        down(&cache_lock);
        __cache_add(obj);
        up(&cache_lock);
        return 0;
}

void cache_delete(int id)
{
        down(&cache_lock);
        __cache_delete(__cache_find(id));
        up(&cache_lock);
}

int cache_find(int id, char *name)
{
        struct object *obj;
        int ret = -ENOENT;

        down(&cache_lock);
        obj = __cache_find(id);
        if (obj) {
                ret = 0;
                strcpy(name, obj->name);
        }
        up(&cache_lock);
        return ret;
}
Note that we always make sure we have the cache_lock when we add, delete, or
look up the cache: both the cache infrastructure itself and the contents of
the objects are protected by the lock.  In this case it's easy, since we
copy the data for the user, and never let them access the objects directly.

There is a slight (and common) optimization here: in cache_add we set up the
fields of the object before grabbing the lock.  This is safe, as no-one else
can access it until we put it in the cache.
Accessing From Interrupt Context

Now consider the case where cache_find can be called from interrupt context:
either a hardware interrupt or a softirq.  An example would be a timer which
deletes objects from the cache.

The change is shown below, in standard patch format: lines prefixed with -
are lines which are taken away, and lines prefixed with + are lines which
are added.
--- cache.c.usercontext 2003-12-09 13:58:54.000000000 +1100
+++ cache.c.interrupt   2003-12-09 14:07:49.000000000 +1100
@@ -12,7 +12,7 @@
         int popularity;
 };
 
-static DECLARE_MUTEX(cache_lock);
+static spinlock_t cache_lock = SPIN_LOCK_UNLOCKED;
 static LIST_HEAD(cache);
 static unsigned int cache_num = 0;
 #define MAX_CACHE_SIZE 10
@@ -55,6 +55,7 @@
 int cache_add(int id, const char *name)
 {
         struct object *obj;
+        unsigned long flags;
 
         if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL)
                 return -ENOMEM;
@@ -63,30 +64,33 @@
         obj->id = id;
         obj->popularity = 0;
 
-        down(&cache_lock);
+        spin_lock_irqsave(&cache_lock, flags);
         __cache_add(obj);
-        up(&cache_lock);
+        spin_unlock_irqrestore(&cache_lock, flags);
         return 0;
 }
 
 void cache_delete(int id)
 {
-        down(&cache_lock);
+        unsigned long flags;
+
+        spin_lock_irqsave(&cache_lock, flags);
         __cache_delete(__cache_find(id));
-        up(&cache_lock);
+        spin_unlock_irqrestore(&cache_lock, flags);
 }
 
 int cache_find(int id, char *name)
 {
         struct object *obj;
         int ret = -ENOENT;
+        unsigned long flags;
 
-        down(&cache_lock);
+        spin_lock_irqsave(&cache_lock, flags);
         obj = __cache_find(id);
         if (obj) {
                 ret = 0;
                 strcpy(name, obj->name);
         }
-        up(&cache_lock);
+        spin_unlock_irqrestore(&cache_lock, flags);
         return ret;
 }
Note that spin_lock_irqsave() will turn off interrupts if they are on;
otherwise (if we are already in an interrupt handler) it does nothing.
Hence these functions are safe to call from any context.

Unfortunately, cache_add calls kmalloc with the GFP_KERNEL flag, which is
only legal in user context.  I have assumed that cache_add is still only
called in user context, otherwise this should become a parameter to
cache_add.
Exposing Objects Outside This File

If our objects contained more information, it might not be sufficient to
copy the information in and out: other parts of the code might want to keep
pointers to these objects, for example, rather than looking up the id every
time.  This produces two problems.

The first problem is that we use the cache_lock to protect objects: we'd
need to make this non-static so the rest of the code can use it.  This makes
locking trickier, as it is no longer all in one place.

The second problem is the lifetime problem: if another structure keeps a
pointer to an object, it presumably expects that pointer to remain valid.
Unfortunately, this is only guaranteed while you hold the lock, otherwise
someone might call cache_delete and, even worse, add another object,
re-using the same address.

As there is only one lock, you can't hold it forever: no-one else would get
any work done.

The solution to this problem is to use a reference count: everyone who has a
pointer to the object increases it when they first get the object, and drops
the reference count when they're finished with it.  Whoever drops it to zero
knows it is unused, and can actually delete it.

Here is the code:
--- cache.c.interrupt   2003-12-09 14:25:43.000000000 +1100
+++ cache.c.refcnt      2003-12-09 14:33:05.000000000 +1100
@@ -7,6 +7,7 @@
 struct object
 {
         struct list_head list;
+        unsigned int refcnt;
         int id;
         char name[32];
         int popularity;
@@ -17,6 +18,35 @@
 static unsigned int cache_num = 0;
 #define MAX_CACHE_SIZE 10
 
+static void __object_put(struct object *obj)
+{
+        if (--obj->refcnt == 0)
+                kfree(obj);
+}
+
+static void __object_get(struct object *obj)
+{
+        obj->refcnt++;
+}
+
+void object_put(struct object *obj)
+{
+        unsigned long flags;
+
+        spin_lock_irqsave(&cache_lock, flags);
+        __object_put(obj);
+        spin_unlock_irqrestore(&cache_lock, flags);
+}
+
+void object_get(struct object *obj)
+{
+        unsigned long flags;
+
+        spin_lock_irqsave(&cache_lock, flags);
+        __object_get(obj);
+        spin_unlock_irqrestore(&cache_lock, flags);
+}
+
 /* Must be holding cache_lock */
 static struct object *__cache_find(int id)
 {
@@ -35,6 +65,7 @@
 {
         BUG_ON(!obj);
         list_del(&obj->list);
+        __object_put(obj);
         cache_num--;
 }
 
@@ -63,6 +94,7 @@
         strlcpy(obj->name, name, sizeof(obj->name));
         obj->id = id;
         obj->popularity = 0;
+        obj->refcnt = 1; /* The cache holds a reference */
 
         spin_lock_irqsave(&cache_lock, flags);
         __cache_add(obj);
@@ -79,18 +111,15 @@
         spin_unlock_irqrestore(&cache_lock, flags);
 }
 
-int cache_find(int id, char *name)
+struct object *cache_find(int id)
 {
         struct object *obj;
-        int ret = -ENOENT;
         unsigned long flags;
 
         spin_lock_irqsave(&cache_lock, flags);
         obj = __cache_find(id);
-        if (obj) {
-                ret = 0;
-                strcpy(name, obj->name);
-        }
+        if (obj)
+                __object_get(obj);
         spin_unlock_irqrestore(&cache_lock, flags);
-        return ret;
+        return obj;
 }
We encapsulate the reference counting in the standard 'get' and 'put'
functions.  Now we can return the object itself from cache_find, which has
the advantage that the user can now sleep holding the object (eg. to
copy_to_user() the name to userspace).

The other point to note is that I said a reference should be held for every
pointer to the object: thus the reference count is 1 when first inserted
into the cache.  In some versions the framework does not hold a reference
count, but they are more complicated.

Using Atomic Operations For The Reference Count

In practice, atomic_t would usually be used for refcnt.  There are a number
of atomic operations defined in include/asm/atomic.h: these are guaranteed
to be seen atomically from all CPUs in the system, so no lock is required.
In this case, it is simpler than using spinlocks, although for anything
non-trivial using spinlocks is clearer.  atomic_inc and atomic_dec_and_test
are used instead of the standard increment and decrement operators, and the
lock is no longer used to protect the reference count itself.

--- cache.c.refcnt      2003-12-09 15:00:35.000000000 +1100
+++ cache.c.refcnt-atomic       2003-12-11 15:49:42.000000000 +1100
@@ -7,7 +7,7 @@
 struct object
 {
         struct list_head list;
-        unsigned int refcnt;
+        atomic_t refcnt;
         int id;
         char name[32];
         int popularity;
@@ -18,33 +18,15 @@
 static unsigned int cache_num = 0;
 #define MAX_CACHE_SIZE 10
 
-static void __object_put(struct object *obj)
-{
-        if (--obj->refcnt == 0)
-                kfree(obj);
-}
-
-static void __object_get(struct object *obj)
-{
-        obj->refcnt++;
-}
-
 void object_put(struct object *obj)
 {
-        unsigned long flags;
-
-        spin_lock_irqsave(&cache_lock, flags);
-        __object_put(obj);
-        spin_unlock_irqrestore(&cache_lock, flags);
+        if (atomic_dec_and_test(&obj->refcnt))
+                kfree(obj);
 }
 
 void object_get(struct object *obj)
 {
-        unsigned long flags;
-
-        spin_lock_irqsave(&cache_lock, flags);
-        __object_get(obj);
-        spin_unlock_irqrestore(&cache_lock, flags);
+        atomic_inc(&obj->refcnt);
 }
 
 /* Must be holding cache_lock */
@@ -65,7 +47,7 @@
 {
         BUG_ON(!obj);
         list_del(&obj->list);
-        __object_put(obj);
+        object_put(obj);
         cache_num--;
 }
 
@@ -94,7 +76,7 @@
         strlcpy(obj->name, name, sizeof(obj->name));
         obj->id = id;
         obj->popularity = 0;
-        obj->refcnt = 1; /* The cache holds a reference */
+        atomic_set(&obj->refcnt, 1); /* The cache holds a reference */
 
         spin_lock_irqsave(&cache_lock, flags);
         __cache_add(obj);
@@ -119,7 +101,7 @@
         spin_lock_irqsave(&cache_lock, flags);
         obj = __cache_find(id);
         if (obj)
-                __object_get(obj);
+                object_get(obj);
         spin_unlock_irqrestore(&cache_lock, flags);
         return obj;
 }
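For readability, here is what the get/put pair from the patch above boils
down to, shown stand-alone (same names and struct as in the example):

void object_get(struct object *obj)
{
        atomic_inc(&obj->refcnt);               /* one more pointer holder */
}

void object_put(struct object *obj)
{
        if (atomic_dec_and_test(&obj->refcnt))  /* was that the last reference? */
                kfree(obj);                     /* no-one can see obj any more */
}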
Protecting The Objects Themselves

In these examples, we assumed that the objects (except the reference counts)
never changed once they are created.  If we wanted to allow the name to
change, there are three possibilities:

- You can make cache_lock non-static, and tell people to grab that lock
  before changing the name in any object.

- You can provide a cache_obj_rename which grabs this lock and changes the
  name for the caller, and tell everyone to use that function.

- You can make the cache_lock protect only the cache itself, and use another
  lock to protect the name.

Theoretically, you can make the locks as fine-grained as one lock for every
field, for every object.  In practice, the most common variants are:

- One lock which protects the infrastructure (the cache list in this
  example) and all the objects.  This is what we have done so far.

- One lock which protects the infrastructure (including the list pointers
  inside the objects), and one lock inside the object which protects the
  rest of that object.

- Multiple locks to protect the infrastructure (eg. one lock per hash
  chain), possibly with a separate per-object lock.

Here is the "lock-per-object" implementation:

--- cache.c.refcnt-atomic       2003-12-11 15:50:54.000000000 +1100
+++ cache.c.perobjectlock       2003-12-11 17:15:03.000000000 +1100
@@ -6,11 +6,17 @@
 
 struct object
 {
+        /* These two protected by cache_lock. */
         struct list_head list;
+        int popularity;
+
         atomic_t refcnt;
+
+        /* Doesn't change once created. */
         int id;
+
+        spinlock_t lock; /* Protects the name */
         char name[32];
-        int popularity;
 };
 
 static spinlock_t cache_lock = SPIN_LOCK_UNLOCKED;
@@ -77,6 +84,7 @@
         obj->id = id;
         obj->popularity = 0;
         atomic_set(&obj->refcnt, 1); /* The cache holds a reference */
+        spin_lock_init(&obj->lock);
 
         spin_lock_irqsave(&cache_lock, flags);
         __cache_add(obj);

Note that I decided that the popularity count should be protected by the
cache_lock rather than the per-object lock: this is because it (like the
struct list_head inside the object) is logically part of the infrastructure.
This way, I don't need to grab the lock of every object in __cache_add when
seeking the least popular.

I also decided that the id member is unchangeable, so I don't need to grab
each object lock in __cache_find() to examine the id: the object lock is
only used by a caller who wants to read or write the name field.

Note also that I added a comment describing what data was protected by which
locks.  This is extremely important, as it describes the runtime behavior of
the code, and can be hard to gain from just reading.  And as Alan Cox says,
"Lock data, not code".
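In the spirit of the second option listed earlier (hide the locking inside a
helper), a rename function for this final per-object-lock design might look
like the following minimal sketch (a hypothetical helper, not part of the
original example); it assumes the caller already holds a reference to the
object:

void object_rename(struct object *obj, const char *name)
{
        unsigned long flags;

        /* obj->lock protects the name field, as noted in the listing above.
         * Use the irqsave variant in case callers run in interrupt context. */
        spin_lock_irqsave(&obj->lock, flags);
        strlcpy(obj->name, name, sizeof(obj->name));
        spin_unlock_irqrestore(&obj->lock, flags);
}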
Common Problems

Deadlock: Simple and Advanced

There is a coding bug where a piece of code tries to grab a spinlock twice:
it will spin forever, waiting for the lock to be released (spinlocks,
rwlocks and semaphores are not recursive in Linux).  This is trivial to
diagnose: not a stay-up-five-nights-talk-to-fluffy-code-bunnies kind of
problem.

For a slightly more complex case, imagine you have a region shared by a
softirq and user context.  If you use a spin_lock() call to protect it, it
is possible that the user context will be interrupted by the softirq while
it holds the lock, and the softirq will then spin forever trying to get the
same lock.

Both of these are called deadlock, and as shown above, it can occur even
with a single CPU (although not on UP compiles, since spinlocks vanish on
kernel compiles with CONFIG_SMP=n.  You'll still get data corruption in the
second example).

This complete lockup is easy to diagnose: on SMP boxes the watchdog timer or
compiling with DEBUG_SPINLOCKS set (include/linux/spinlock.h) will show this
up immediately when it happens.

A more complex problem is the so-called 'deadly embrace', involving two or
more locks.  Say you have a hash table: each entry in the table is a
spinlock, and a chain of hashed objects.  Inside a softirq handler, you
sometimes want to alter an object from one place in the hash to another: you
grab the spinlock of the old hash chain and the spinlock of the new hash
chain, and delete the object from the old one, and insert it in the new one.

There are two problems here.  First, if your code ever tries to move the
object to the same chain, it will deadlock with itself as it tries to lock
it twice.  Secondly, if the same softirq on another CPU is trying to move
another object in the reverse direction, the following could happen:

Consequences

        CPU 1                    CPU 2
        Grab lock A -> OK        Grab lock B -> OK
        Grab lock B -> spin      Grab lock A -> spin

The two CPUs will spin forever, waiting for the other to give up their lock.
It will look, smell, and feel like a crash.

Preventing Deadlock

Textbooks will tell you that if you always lock in the same order, you will
never get this kind of deadlock.  Practice will tell you that this approach
doesn't scale: when I create a new lock, I don't understand enough of the
kernel to figure out where in the 5000 lock hierarchy it will fit.

The best locks are encapsulated: they never get exposed in headers, and are
never held around calls to non-trivial functions outside the same file.  You
can read through this code and see that it will never deadlock, because it
never tries to grab another lock while it has that one.  People using your
code don't even need to know you are using a lock.

A classic problem here is when you provide callbacks or hooks: if you call
these with the lock held, you risk simple deadlock, or a deadly embrace (who
knows what the callback will do?).  Remember, the other programmers are out
to get you, so don't do this.

Overzealous Prevention Of Deadlocks

Deadlocks are problematic, but not as bad as data corruption.  Code which
grabs a read lock, searches a list, fails to find what it wants, drops the
read lock, grabs a write lock and inserts the object has a race condition.

If you don't see why, please stay the fuck away from my code.
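For the hash-chain example above, one conventional way to apply the "same
order" rule without deadlocking against yourself is to order the two chain
locks by address (a minimal sketch, not from the original text):

static void lock_two_chains(spinlock_t *a, spinlock_t *b)
{
        if (a == b) {
                /* Same chain: moving within it needs only one lock. */
                spin_lock(a);
        } else if (a < b) {
                spin_lock(a);
                spin_lock(b);
        } else {
                spin_lock(b);
                spin_lock(a);
        }
}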
Racing Timers: A Kernel Pastime

Timers can produce their own special problems with races.  Consider a
collection of objects (list, hash, etc) where each object has a timer which
is due to destroy it.

If you want to destroy the entire collection (say on module removal), you
might do the following:

        /* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE
           HUNGARIAN NOTATION */
        spin_lock_bh(&list_lock);

        while (list) {
                struct foo *next = list->next;
                del_timer(&list->timer);
                kfree(list);
                list = next;
        }

        spin_unlock_bh(&list_lock);

Sooner or later, this will crash on SMP, because a timer can have just gone
off before the spin_lock_bh(), and it will only get the lock after we
spin_unlock_bh(), and then try to free the element (which has already been
freed!).

This can be avoided by checking the result of del_timer(): if it returns 1,
the timer has been deleted.  If 0, it means (in this case) that it is
currently running, so we can do:

        retry:
                spin_lock_bh(&list_lock);

                while (list) {
                        struct foo *next = list->next;
                        if (!del_timer(&list->timer)) {
                                /* Give timer a chance to delete this */
                                spin_unlock_bh(&list_lock);
                                goto retry;
                        }
                        kfree(list);
                        list = next;
                }

                spin_unlock_bh(&list_lock);

Another common problem is deleting timers which restart themselves (by
calling add_timer() at the end of their timer function).  Because this is a
fairly common case which is prone to races, you should use del_timer_sync()
(include/linux/timer.h) to handle this case.  It returns the number of times
the timer had to be deleted before we finally stopped it from adding itself
back in.
Locking Speed

There are three main things to worry about when considering the speed of
some code which does locking.  First is concurrency: how many things are
going to be waiting while someone else is holding a lock.  Second is the
time taken to actually acquire and release an uncontended lock.  Third is
using fewer, or smarter locks.  I'm assuming that the lock is used fairly
often: otherwise, you wouldn't be concerned about efficiency.

Concurrency depends on how long the lock is usually held: you should hold
the lock for as long as needed, but no longer.  In the cache example, we
always create the object without the lock held, and then grab the lock only
when we are ready to insert it in the list.

Acquisition times depend on how much damage the lock operations do to the
pipeline (pipeline stalls) and how likely it is that this CPU was the last
one to grab the lock (ie. is the lock cache-hot for this CPU): on a machine
with more CPUs, this likelihood drops fast.  Consider a 700MHz Intel Pentium
III: an instruction takes about 0.7ns, an atomic increment takes about 58ns,
a lock which is cache-hot on this CPU takes 160ns, and a cacheline transfer
from another CPU takes an additional 170 to 360ns.  (These figures are from
Paul McKenney's Linux Journal RCU article).

These two aims conflict: holding a lock for a short time might be done by
splitting locks into parts (such as in our final per-object-lock example),
but this increases the number of lock acquisitions, and the results are
often slower than having a single lock.  This is another reason to advocate
locking simplicity.

The third concern is addressed below: there are some methods to reduce the
amount of locking which needs to be done.
Read/Write Lock Variants

Both spinlocks and semaphores have read/write variants: rwlock_t and struct
rw_semaphore.  These divide users into two classes: the readers and the
writers.  If you are only reading the data, you can get a read lock, but to
write to the data you need the write lock.  Many people can hold a read
lock, but a writer must be the sole holder.

If your code divides neatly along reader/writer lines (as our cache code
does), and the lock is held by readers for significant lengths of time,
using these locks can help.  They are slightly slower than the normal locks
though, so in practice rwlock_t is not usually worthwhile.
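A minimal sketch of the spinlock variant (not from the original text; names
made up for illustration):

static rwlock_t my_table_lock = RW_LOCK_UNLOCKED;

void lookup_something(void)
{
        read_lock(&my_table_lock);     /* many readers may hold this at once */
        /* ... read, but never modify, the shared data ... */
        read_unlock(&my_table_lock);
}

void update_something(void)
{
        write_lock(&my_table_lock);    /* excludes readers and other writers */
        /* ... modify the shared data ... */
        write_unlock(&my_table_lock);
}

The sleeping variant, struct rw_semaphore, is used the same way with
down_read()/up_read() and down_write()/up_write().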
Avoiding Locks: Read Copy Update

There is a special method of read/write locking called Read Copy Update.
Using RCU, the readers can avoid taking a lock altogether: as we expect our
cache to be read more often than updated (otherwise the cache is a waste of
time), it is a candidate for this optimization.

How do we get rid of read locks?  Getting rid of read locks means that
writers may be changing the list underneath the readers.  That is actually
quite simple: we can read a linked list while an element is being added if
the writer adds the element very carefully.  For example, adding new to a
single linked list called list:

        new->next = list->next;
        wmb();
        list->next = new;

The wmb() is a write memory barrier.  It ensures that the first operation
(setting the new element's next pointer) is complete and will be seen by all
CPUs before the second operation (putting the new element into the list)
happens.  This is important, since modern compilers and modern CPUs can both
reorder instructions unless told otherwise: we want a reader to either not
see the new element at all, or see the new element with the next pointer
correctly pointing at the rest of the list.

Fortunately, there is a function to do this for standard struct list_head
lists: list_add_rcu() (include/linux/list.h).

Removing an element from the list is even simpler: we replace the pointer to
the old element with a pointer to its successor, and readers will either see
it, or skip over it.

        list->next = old->next;

There is list_del_rcu() (include/linux/list.h) which does this (the normal
version poisons the old object, which we don't want).

The reader must also be careful: some CPUs can look through the next pointer
to start reading the contents of the next element early, but don't realize
that the pre-fetched contents are wrong when the next pointer changes
underneath them.  Once again, there is a list_for_each_entry_rcu()
(include/linux/list.h) to help you.  Of course, writers can just use
list_for_each_entry(), since there cannot be two simultaneous writers.

Our final dilemma is this: when can we actually destroy the removed element?
Remember, a reader might be stepping through this element in the list right
now: if we free this element and the next pointer changes, the reader will
jump off into garbage and crash.  We need to wait until we know that all the
readers who were traversing the list when we deleted the element are
finished.  We use call_rcu() to register a callback which will actually
destroy the object once the readers are finished.

But how does Read Copy Update know when the readers are finished?  The
method is this: firstly, the readers always traverse the list inside
rcu_read_lock()/rcu_read_unlock() pairs: these simply disable preemption so
the reader won't go to sleep while reading the list.

RCU then waits until every other CPU has slept at least once: since readers
cannot sleep, we know that any readers which were traversing the list during
the deletion are finished, and the callback is triggered.  The real Read
Copy Update code is a little more optimized than this, but this is the
fundamental idea.

--- cache.c.perobjectlock       2003-12-11 17:15:03.000000000 +1100
+++ cache.c.rcupdate    2003-12-11 17:55:14.000000000 +1100
@@ -1,15 +1,18 @@
 #include <linux/list.h>
 #include <linux/slab.h>
 #include <linux/string.h>
+#include <linux/rcupdate.h>
 #include <asm/semaphore.h>
 #include <asm/errno.h>
 
 struct object
 {
-        /* These two protected by cache_lock. */
+        /* This is protected by RCU */
         struct list_head list;
         int popularity;
 
+        struct rcu_head rcu;
+
         atomic_t refcnt;
 
         /* Doesn't change once created. */
@@ -40,7 +43,7 @@
 {
         struct object *i;
 
-        list_for_each_entry(i, &cache, list) {
+        list_for_each_entry_rcu(i, &cache, list) {
                 if (i->id == id) {
                         i->popularity++;
                         return i;
@@ -49,19 +52,25 @@
         return NULL;
 }
 
+/* Final discard done once we know no readers are looking. */
+static void cache_delete_rcu(void *arg)
+{
+        object_put(arg);
+}
+
 /* Must be holding cache_lock */
 static void __cache_delete(struct object *obj)
 {
         BUG_ON(!obj);
-        list_del(&obj->list);
-        object_put(obj);
+        list_del_rcu(&obj->list);
         cache_num--;
+        call_rcu(&obj->rcu, cache_delete_rcu, obj);
 }
 
 /* Must be holding cache_lock */
 static void __cache_add(struct object *obj)
 {
-        list_add(&obj->list, &cache);
+        list_add_rcu(&obj->list, &cache);
         if (++cache_num > MAX_CACHE_SIZE) {
                 struct object *i, *outcast = NULL;
                 list_for_each_entry(i, &cache, list) {
@@ -85,6 +94,7 @@
         obj->popularity = 0;
         atomic_set(&obj->refcnt, 1); /* The cache holds a reference */
         spin_lock_init(&obj->lock);
+        INIT_RCU_HEAD(&obj->rcu);
 
         spin_lock_irqsave(&cache_lock, flags);
         __cache_add(obj);
@@ -104,12 +114,11 @@
 struct object *cache_find(int id)
 {
         struct object *obj;
-        unsigned long flags;
 
-        spin_lock_irqsave(&cache_lock, flags);
+        rcu_read_lock();
         obj = __cache_find(id);
         if (obj)
                 object_get(obj);
-        spin_unlock_irqrestore(&cache_lock, flags);
+        rcu_read_unlock();
         return obj;
 }

Note that the reader will alter the popularity member in __cache_find(), and
now it doesn't hold a lock.  One solution would be to make it an atomic_t,
but for this usage, we don't really care about races: an approximate result
is good enough, so I didn't change it.

The result is that cache_find() requires no synchronization with any other
functions, so is almost as fast on SMP as it would be on UP.

There is a further optimization possible here: remember our original cache
code, where there were no reference counts and the caller simply held the
lock whenever using the object?  This is still possible: if you hold the
lock, no-one can delete the object, so you don't need to get and put the
reference count.

Now, because the 'read lock' in RCU is simply disabling preemption, a caller
which always has preemption disabled between calling cache_find() and
object_put() does not need to actually get and put the reference count: we
could expose __cache_find() by making it non-static, and such callers could
simply call that.

The benefit here is that the reference count is not written to: the object
is not altered in any way, which is much faster on SMP machines due to
caching.
Per-CPU Data

Another technique for avoiding locking which is used fairly widely is to
duplicate information for each CPU.  For example, if you wanted to keep a
count of a common condition, you could use a spin lock and a single counter.
Nice and simple.

If that was too slow (it's usually not, but if you've got a really big
machine to test on and can show that it is), you could instead use a counter
for each CPU, then none of them need an exclusive lock.  See
DEFINE_PER_CPU(), get_cpu_var() and put_cpu_var()
(include/linux/percpu.h).

Of particular use for simple per-cpu counters is the local_t type, and the
cpu_local_inc() and related functions, which are more efficient than simple
code on some architectures (include/asm/local.h).

Note that there is no simple, reliable way of getting an exact value of such
a counter, without introducing more locks.  This is not a problem for some
uses.
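A minimal sketch of such a counter (not from the original text; the variable
name is made up for illustration):

#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned long, my_hit_count);

static void count_hit(void)
{
        /* get_cpu_var() disables preemption while we touch our CPU's copy,
         * so no lock is needed. */
        get_cpu_var(my_hit_count)++;
        put_cpu_var(my_hit_count);
}

Summing per_cpu(my_hit_count, cpu) over the CPUs gives only an approximate
total, as the note above explains.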
Data Which Is Mostly Used By An IRQ Handler

If data is always accessed from within the same IRQ handler, you don't need
a lock at all: the kernel already guarantees that the irq handler will not
run simultaneously on multiple CPUs.

Manfred Spraul points out that you can still do this, even if the data is
very occasionally accessed in user context or softirqs/tasklets.  The irq
handler doesn't use a lock, and all other accesses are done as so:

        spin_lock(&lock);
        disable_irq(irq);
        ...
        enable_irq(irq);
        spin_unlock(&lock);

The disable_irq() prevents the irq handler from running (and waits for it to
finish if it's currently running on other CPUs).  The spinlock prevents any
other accesses happening at the same time.  Naturally, this is slower than
just a spin_lock_irq() call, so it only makes sense if this type of access
happens extremely rarely.
What Functions Are Safe To Call From Interrupts?

Many functions in the kernel sleep (ie. call schedule()) directly or
indirectly: you can never call them while holding a spinlock, or with
preemption disabled.  This also means you need to be in user context:
calling them from an interrupt is illegal.

Some Functions Which Sleep

The most common ones are listed below, but you usually have to read the code
to find out if other calls are safe.  If everyone else who calls it can
sleep, you probably need to be able to sleep, too.  In particular,
registration and deregistration functions usually expect to be called from
user context, and can sleep.

- Accesses to userspace:
        copy_from_user()
        copy_to_user()
        get_user()
        put_user()

- kmalloc(GFP_KERNEL)

- down_interruptible() and down()

  There is a down_trylock() which can be used inside interrupt context, as
  it will not sleep.  up() will also never sleep.
Some Functions Which Don't Sleep

Some functions are safe to call from any context, or holding almost any
lock.

- printk()

- kfree()

- add_timer() and del_timer()
Further reading

- Documentation/spinlocks.txt: Linus Torvalds' spinlocking tutorial in the
  kernel sources.

- Unix Systems for Modern Architectures: Symmetric Multiprocessing and
  Caching for Kernel Programmers.

  Curt Schimmel's very good introduction to kernel level locking (not
  written for Linux, but nearly everything applies).  The book is expensive,
  but really worth every penny to understand SMP locking. [ISBN: 0201633388]
Thanks

Thanks to Telsa Gwynne for DocBooking, neatening and adding style.

Thanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul Mackerras,
Ruedi Aschwanden, Alan Cox, Manfred Spraul, Tim Waugh, Pete Zaitcev, James
Morris, Robert Love, Paul McKenney, John Ashby for proofreading, correcting,
flaming, commenting.

Thanks to the cabal for having no influence on this document.
Glossary

preemption
        Prior to 2.5, or when CONFIG_PREEMPT is unset, processes in user
        context inside the kernel would not preempt each other (ie. you had
        that CPU until you gave it up, except for interrupts).  With the
        addition of CONFIG_PREEMPT in 2.5.4, this changed: when in user
        context, higher priority tasks can "cut in": spinlocks were changed
        to disable preemption, even on UP.

bh
        Bottom Half: for historical reasons, functions with '_bh' in them
        often now refer to any software interrupt, e.g. spin_lock_bh()
        blocks any software interrupt on the current CPU.  Bottom halves are
        deprecated, and will eventually be replaced by tasklets.  Only one
        bottom half will be running at any time.

Hardware Interrupt / Hardware IRQ
        Hardware interrupt request.  in_irq() returns true in a hardware
        interrupt handler.

Interrupt Context
        Not user context: processing a hardware irq or software irq.
        Indicated by the in_interrupt() macro returning true.

SMP
        Symmetric Multi-Processor: kernels compiled for multiple-CPU
        machines.  (CONFIG_SMP=y).

Software Interrupt / softirq
        Software interrupt handler.  in_irq() returns false; in_softirq()
        returns true.  Tasklets and softirqs both fall into the category of
        'software interrupts'.

        Strictly speaking a softirq is one of up to 32 enumerated software
        interrupts which can run on multiple CPUs at once.  Sometimes used
        to refer to tasklets as well (ie. all software interrupts).

tasklet
        A dynamically-registrable software interrupt, which is guaranteed to
        only run on one CPU at a time.

timer
        A dynamically-registrable software interrupt, which is run at (or
        close to) a given time.  When running, it is just like a tasklet (in
        fact, they are called from the TIMER_SOFTIRQ).

UP
        Uni-Processor: Non-SMP.  (CONFIG_SMP=n).

User Context
        The kernel executing on behalf of a particular process (ie. a system
        call or trap) or kernel thread.  You can tell which process with the
        current macro.  Not to be confused with userspace.  Can be
        interrupted by software or hardware interrupts.

Userspace
        A process executing its own code outside the kernel.