This is an NFS client for Linux that supports async RPC calls for
read-ahead (and hopefully soon, write-back) on regular files.

The implementation uses a straightforward nfsiod scheme. After
trying out a number of different concepts, I finally came back to
this one, because everything else either didn't work or gave me
headaches. It's not flashy, but it works without hacking into any
other regions of the kernel.


HOW TO USE

This stuff compiles as a loadable module (I developed it on 1.3.77).
Simply type mkmodule, and insmod nfs.o. This will start four nfsiod's
at the same time (which will show up under the pseudonym of insmod in
ps-style listings).

Alternatively, you can put it right into the kernel: remove everything
from fs/nfs, move the Makefile and all *.c files there, and
copy all *.h files to include/linux.

After mounting, you should be able to watch (with tcpdump) several
RPC READ calls being placed simultaneously.


HOW IT WORKS

When a process reads from a file on an NFS volume, the following
happens:

  *  nfs_file_read sets file->f_reada if more than 1K is
     read at once. It then calls generic_file_read.

  *  generic_file_read requests one or more pages via
     nfs_readpage.

  *  nfs_readpage allocates a request slot with an nfsiod
     daemon, fills in the READ request, sends out the
     RPC call, kicks the daemon, and returns.
     If there's no free nfsiod, nfs_readpage places the
     call directly, waiting for the reply (sync readpage).

  *  nfsiod calls nfs_rpc_doio to collect the reply. If the
     call was successful, it sets page->uptodate and
     wakes up all processes waiting on page->wait.

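The slot handling in the steps above can be sketched in user-space C.
This is only an illustration under stated assumptions, not the module's
code: struct page_sketch, struct nfsiod_slot, NR_NFSIOD, slot_alloc,
readpage_sketch and nfsiod_complete are all made-up names, no RPC is
sent, and the sync path is simply assumed to succeed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_NFSIOD 4	/* the README starts four nfsiods; the name is made up */

/* Hypothetical stand-in for the kernel's struct page. */
struct page_sketch {
	bool uptodate;
};

/* One request slot per daemon (assumed layout, not the real one). */
struct nfsiod_slot {
	bool busy;
	struct page_sketch *page;
};

enum read_mode { READ_ASYNC, READ_SYNC };

static struct nfsiod_slot slots[NR_NFSIOD];

/* Claim a free slot, or return NULL when every daemon is busy. */
static struct nfsiod_slot *slot_alloc(void)
{
	for (size_t i = 0; i < NR_NFSIOD; i++) {
		if (!slots[i].busy) {
			slots[i].busy = true;
			return &slots[i];
		}
	}
	return NULL;
}

/* Sketch of the nfs_readpage decision: async if a slot is free, else sync. */
static enum read_mode readpage_sketch(struct page_sketch *page)
{
	struct nfsiod_slot *slot = slot_alloc();

	if (slot == NULL) {
		/* Sync path: the caller waits for the reply itself
		 * (assumed to succeed in this sketch). */
		page->uptodate = true;
		return READ_SYNC;
	}
	/* Async path: queue the READ and return at once. */
	slot->page = page;
	return READ_ASYNC;
}

/* Sketch of nfsiod collecting the reply for one slot. */
static void nfsiod_complete(struct nfsiod_slot *slot, bool rpc_ok)
{
	assert(slot->busy && slot->page != NULL);
	if (rpc_ok)
		slot->page->uptodate = true;	/* real code also wakes page->wait */
	slot->page = NULL;
	slot->busy = false;
}
```

With four slots, the first four calls to readpage_sketch return
READ_ASYNC and a fifth one falls back to READ_SYNC until some slot
has been completed.
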
This is the rough outline only. There are a few things to note:

  *  Async RPC will not be tried when server->rsize < PAGE_SIZE.

  *  When an error occurs, nfsiod has no way of returning
     the error code to the user process. Therefore, it flags
     page->error and wakes up all processes waiting on that
     page (they usually wait from within generic_readpage).

     generic_readpage finds that the page is still not
     uptodate, and calls nfs_readpage again. This time around,
     nfs_readpage notices that page->error is set and
     unconditionally does a synchronous RPC call.

     This area needs a lot of improvement, since read errors
     are not that uncommon (e.g. we have to retransmit calls
     if the fsuid is different from the ruid in order to
     cope with root squashing and the like).

     Retransmits with fsuid/ruid change should be handled by
     nfsiod, but this doesn't come easily (a more general nfs_call
     routine that does all this may be useful...)

  *  To save some time on readaheads, we save one data copy
     by frobbing the page into the iovec passed to the
     RPC code so that the networking layer copies the
     data into the page directly.

     This needs to be adjustable (different authentication
     flavors; AUTH_NULL versus AUTH_SHORT verifiers).

  *  Currently, a fixed number of nfsiod's is spawned from
     within init_nfs_fs. This is problematic when running
     as a loadable module, because this will keep insmod's
     memory allocated. As a side-effect, you will see the
     nfsiod processes listed as several insmod's when doing
     a `ps'.

  *  This NFS client implements server congestion control via
     Van Jacobson slow start as implemented in 4.4BSD. I haven't
     checked how well this behaves, but since Rick Macklem did
     it this way, it should be okay :-)

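The error path described in the notes above (flag page->error, let the
reader call nfs_readpage again, retry synchronously) can be sketched
like this. Everything here is hypothetical: struct page_state,
struct rpc_stats, nfsiod_reply, readpage_with_retry and read_one_page
are illustration names, the RPC outcomes are passed in as parameters,
and clearing the error flag before the retry is my assumption, not
something the text states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two page flags mentioned above. */
struct page_state {
	bool uptodate;
	bool error;
};

/* Counts which path ran, so a caller can observe the fallback. */
struct rpc_stats {
	int async_calls;
	int sync_calls;
};

/* nfsiod side: it cannot return an error code, so it only sets flags. */
static void nfsiod_reply(struct page_state *page, bool rpc_ok)
{
	assert(page != NULL);
	if (rpc_ok)
		page->uptodate = true;
	else
		page->error = true;
	/* real code would wake up everybody sleeping on the page here */
}

/* readpage side: go async unless an earlier async attempt failed. */
static int readpage_with_retry(struct page_state *page,
			       struct rpc_stats *st, int sync_result)
{
	if (page->error) {
		page->error = false;	/* assumption: flag cleared on retry */
		st->sync_calls++;
		if (sync_result == 0)
			page->uptodate = true;
		return sync_result;	/* caller context: error can be returned */
	}
	st->async_calls++;
	return 0;			/* queued; outcome arrives via nfsiod_reply */
}

/* Reader side, loosely modelled on what generic_readpage is said to do. */
static int read_one_page(struct page_state *page, struct rpc_stats *st,
			 bool async_ok, int sync_result)
{
	int err = readpage_with_retry(page, st, sync_result);

	if (err != 0 || page->uptodate)
		return err;
	nfsiod_reply(page, async_ok);	/* stands in for sleeping on the page */
	if (page->uptodate)
		return 0;
	/* Still not uptodate: the second pass sees the error flag, goes sync. */
	return readpage_with_retry(page, st, sync_result);
}
```

The point of the sketch is that the error code only reaches the reader
on the second, synchronous pass, which is why a failed read-ahead costs
two RPC calls.
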
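The copy-saving trick from the iovec note can be illustrated in user
space with readv(2): the reply header goes into a scratch buffer and
the payload lands directly in the "page". This is a sketch under
assumptions, not the kernel code. A pipe stands in for the socket,
HDR_LEN and PAGE_LEN are made-up sizes, and recv_into_page and
demo_roundtrip are hypothetical names. The fixed HDR_LEN is exactly the
part that would have to become adjustable, since the reply header
length depends on the verifier.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define HDR_LEN  8	/* made-up fixed reply header length */
#define PAGE_LEN 16	/* tiny "page" to keep the example small */

/* One scatter read: header into hdr, payload straight into page. */
static ssize_t recv_into_page(int fd, char *hdr, char *page)
{
	struct iovec iov[2];

	iov[0].iov_base = hdr;
	iov[0].iov_len = HDR_LEN;
	iov[1].iov_base = page;		/* no intermediate buffer for the data */
	iov[1].iov_len = PAGE_LEN;
	return readv(fd, iov, 2);
}

/* Push one fake reply through a pipe; returns 0 if it arrived intact. */
static int demo_roundtrip(char *page)
{
	/* 8 header bytes, then 15 payload bytes plus the terminating NUL. */
	static const char reply[HDR_LEN + PAGE_LEN] = "HEADER..page-data-here!";
	char hdr[HDR_LEN];
	int fds[2];
	ssize_t n;

	if (pipe(fds) != 0)
		return -1;
	if (write(fds[1], reply, sizeof(reply)) != (ssize_t)sizeof(reply)) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	n = recv_into_page(fds[0], hdr, page);
	close(fds[0]);
	close(fds[1]);
	assert(n <= HDR_LEN + PAGE_LEN);
	return (n == HDR_LEN + PAGE_LEN && hdr[0] == 'H') ? 0 : -1;
}
```

After demo_roundtrip(page) returns 0, page holds the payload bytes
without the header ever having been copied through it.
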

WISH LIST

After giving this thing some testing, I'd like to add some more
features:

  *  Some sort of async write handling. True write-back doesn't
     work with the current kernel (I think), because invalidate_pages
     kills all pages, regardless of whether they're dirty or not.
     Besides, this may require special bdflush treatment because
     write caching on clients is really hairy.

     Alternatively, a write-through scheme might be useful where
     the client enqueues the request, but leaves collecting the
     results to nfsiod. Again, we need a way to pass RPC errors
     back to the application.

  *  Support for different authentication flavors.

  *  /proc/net/nfsclnt (for nfsstat, etc.).

March 29, 1996
Olaf Kirch