        Notes on the Generic Block Layer Rewrite in Linux 2.5
        =====================================================

Notes Written on Jan 15, 2002:
        Jens Axboe
        Suparna Bhattacharya

Last Updated May 2, 2002
September 2003: Updated I/O Scheduler portions
        Nick Piggin

Introduction:

These are some notes describing some aspects of the 2.5 block layer in the
context of the bio rewrite. The idea is to bring out some of the key
changes and a glimpse of the rationale behind those changes.

Please mail corrections & suggestions to suparna@in.ibm.com.

Credits:
---------

2.5 bio rewrite:
        Jens Axboe

Many aspects of the generic block layer redesign were driven by and evolved
over discussions, prior patches and the collective experience of several
people. See sections 8 and 9 for a list of some related references.

The following people helped with review comments and inputs for this
document:
        Christoph Hellwig
        Arjan van de Ven
        Randy Dunlap
        Andre Hedrick

The following people helped with fixes/contributions to the bio patches
while it was still work-in-progress:
        David S. Miller


Description of Contents:
------------------------

1. Scope for tuning of logic to various needs
  1.1 Tuning based on device or low level driver capabilities
        - Per-queue parameters
        - Highmem I/O support
        - I/O scheduler modularization
  1.2 Tuning based on high level requirements/capabilities
        1.2.1 I/O Barriers
        1.2.2 Request Priority/Latency
  1.3 Direct access/bypass to lower layers for diagnostics and special
      device operations
        1.3.1 Pre-built commands
2. New flexible and generic but minimalist i/o structure or descriptor
   (instead of using buffer heads at the i/o layer)
  2.1 Requirements/Goals addressed
  2.2 The bio struct in detail (multi-page io unit)
  2.3 Changes in the request structure
3. Using bios
  3.1 Setup/teardown (allocation, splitting)
  3.2 Generic bio helper routines
    3.2.1 Traversing segments and completion units in a request
    3.2.2 Setting up DMA scatterlists
    3.2.3 I/O completion
    3.2.4 Implications for drivers that do not interpret bios (don't handle
          multiple segments)
    3.2.5 Request command tagging
  3.3 I/O submission
4. The I/O scheduler
5. Scalability related changes
  5.1 Granular locking: Removal of io_request_lock
  5.2 Prepare for transition to 64 bit sector_t
6. Other Changes/Implications
  6.1 Partition re-mapping handled by the generic block layer
7. A few tips on migration of older drivers
8. A list of prior/related/impacted patches/ideas
9. Other References/Discussion Threads

---------------------------------------------------------------------------

Bio Notes
--------

Let us discuss the changes in the context of how some overall goals for the
block layer are addressed.

1. Scope for tuning the generic logic to satisfy various requirements

The block layer design supports adaptable abstractions to handle common
processing with the ability to tune the logic to an appropriate extent
depending on the nature of the device and the requirements of the caller.
One of the objectives of the rewrite was to increase the degree of tunability
and to enable higher level code to utilize underlying device/driver
capabilities to the maximum extent for better i/o performance. This is
important especially in the light of ever improving hardware capabilities
and application/middleware software designed to take advantage of these
capabilities.

1.1 Tuning based on low level device / driver capabilities

Sophisticated devices with large built-in caches, intelligent i/o scheduling
optimizations, high memory DMA support, etc. may find some of the
generic processing an overhead, while for less capable devices the
generic functionality is essential for performance or correctness reasons.
Knowledge of some of the capabilities or parameters of the device should be
used at the generic block layer to make the right decisions on
behalf of the driver.

How is this achieved?

Tuning at a per-queue level:

i. Per-queue limits/values exported to the generic layer by the driver

Various parameters that the generic i/o scheduler logic uses are set at
a per-queue level (e.g. maximum request size, maximum number of segments in
a scatter-gather list, hardsect size).

Some parameters that were earlier available as global arrays indexed by
major/minor are now directly associated with the queue. Some of these may
move into the block device structure in the future. Some characteristics
have been incorporated into a queue flags field rather than separate fields
in themselves. There are blk_queue_xxx functions to set the parameters,
rather than update the fields directly.

Some new queue property settings:

        blk_queue_bounce_limit(q, u64 dma_address)
                Enable I/O to highmem pages, dma_address being the
                limit. No highmem default.

        blk_queue_max_sectors(q, max_sectors)
                Sets two variables that limit the size of the request.

                - The request queue's max_sectors, which is a soft size in
                units of 512 byte sectors, and could be dynamically varied
                by the core kernel.

                - The request queue's max_hw_sectors, which is a hard limit
                and reflects the maximum size request a driver can handle
                in units of 512 byte sectors.

                The default for both max_sectors and max_hw_sectors is
                255. The upper limit of max_sectors is 1024.

        blk_queue_max_phys_segments(q, max_segments)
                Maximum physical segments you can handle in a request. 128
                default (driver limit). (See 3.2.2)

        blk_queue_max_hw_segments(q, max_segments)
                Maximum dma segments the hardware can handle in a request. 128
                default (host adapter limit, after dma remapping).
                (See 3.2.2)

        blk_queue_max_segment_size(q, max_seg_size)
                Maximum size of a clustered segment, 64kB default.

        blk_queue_hardsect_size(q, hardsect_size)
                Lowest possible sector size that the hardware can operate
                on, 512 bytes default.

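The interplay between the soft and hard request size limits described for
blk_queue_max_sectors can be pictured with a small userspace model. This is
only a sketch: the model_* names are hypothetical, and the clamping policy is
an assumption derived from the defaults quoted above (255 sectors, soft limit
capped at 1024), not the kernel's actual code.

```c
#include <assert.h>

/* Hypothetical userspace model; field names follow the text above. */
struct model_queue_limits {
	unsigned int max_sectors;    /* soft limit, in 512-byte sectors */
	unsigned int max_hw_sectors; /* hard limit reported by the driver */
};

#define MODEL_DEFAULT_SECTORS   255u  /* default quoted in the text */
#define MODEL_MAX_SOFT_SECTORS 1024u  /* upper bound on the soft limit */

static void model_limits_init(struct model_queue_limits *lim)
{
	lim->max_sectors = MODEL_DEFAULT_SECTORS;
	lim->max_hw_sectors = MODEL_DEFAULT_SECTORS;
}

/* Record the driver's hard limit and derive a capped soft limit
 * (assumed policy, see the lead-in). */
static void model_set_max_sectors(struct model_queue_limits *lim,
				  unsigned int max_sectors)
{
	assert(max_sectors > 0);
	lim->max_hw_sectors = max_sectors;
	lim->max_sectors = max_sectors < MODEL_MAX_SOFT_SECTORS ?
			   max_sectors : MODEL_MAX_SOFT_SECTORS;
}

/* Would a request of nr_sectors fit under the soft limit? */
static int model_request_fits(const struct model_queue_limits *lim,
			      unsigned int nr_sectors)
{
	return nr_sectors <= lim->max_sectors;
}
```

In this model a driver that can handle 4096-sector requests still ends up
with a soft limit of 1024, while a driver limited to 128 sectors gets 128
for both values.
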
New queue flags:

        QUEUE_FLAG_CLUSTER (see 3.2.2)
        QUEUE_FLAG_QUEUED (see 3.2.4)


ii. High-mem i/o capabilities are now considered the default

The generic bounce buffer logic, present in 2.4, where the block layer would
by default copyin/out i/o requests on high-memory buffers to low-memory buffers
assuming that the driver wouldn't be able to handle it directly, has been
changed in 2.5. The bounce logic is now applied only for memory ranges
for which the device cannot handle i/o. A driver can specify this by
setting the queue bounce limit for the request queue for the device
(blk_queue_bounce_limit()). This avoids the inefficiencies of the copyin/out
where a device is capable of handling high memory i/o.

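The per-page decision implied by the bounce limit can be sketched as a simple
address comparison. The code below is a userspace illustration under stated
assumptions: 4 KiB pages, a limit expressed as the highest reachable physical
byte address, and hypothetical model_* names. It is not the kernel's bounce
code.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

/* Hypothetical per-queue state: highest physical byte address the
 * device can reach with DMA. */
struct model_dma_queue {
	uint64_t bounce_limit;
};

/* A page must be bounced if any byte of it lies above the limit. */
static int model_page_needs_bounce(const struct model_dma_queue *q,
				   uint64_t pfn)
{
	uint64_t last_byte = ((pfn + 1) << MODEL_PAGE_SHIFT) - 1;

	return last_byte > q->bounce_limit;
}

/* Count how many pages of a request would need a bounce buffer. */
static unsigned int model_count_bounced(const struct model_dma_queue *q,
					const uint64_t *pfns, unsigned int n)
{
	unsigned int i, bounced = 0;

	assert(n == 0 || pfns);
	for (i = 0; i < n; i++)
		bounced += model_page_needs_bounce(q, pfns[i]);
	return bounced;
}
```

With a 32-bit limit (0xFFFFFFFF) only pages at or above the 4GB boundary are
bounced; with a 16MB limit (0xFFFFFF) nearly everything is.
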
In order to enable high-memory i/o where the device is capable of supporting
it, the pci dma mapping routines and associated data structures have now been
modified to accomplish a direct page -> bus translation, without requiring
a virtual address mapping (unlike the earlier scheme of virtual address
-> bus translation). So this works uniformly for high-memory pages (which
do not have a corresponding kernel virtual address space mapping) and
low-memory pages.

Note: Please refer to DMA-mapping.txt for a discussion on PCI high mem DMA
aspects and mapping of scatter gather lists, and support for 64 bit PCI.

Special handling is required only for cases where i/o needs to happen on
pages at physical memory addresses beyond what the device can support. In these
cases, a bounce bio representing a buffer from the supported memory range
is used for performing the i/o with copyin/copyout as needed depending on
the type of the operation. For example, in case of a read operation, the
data read has to be copied to the original buffer on i/o completion, so a
callback routine is set up to do this, while for write, the data is copied
from the original buffer to the bounce buffer prior to issuing the
operation. Since an original buffer may be in a high memory area that's not
mapped in kernel virtual addr, a kmap operation may be required for
performing the copy, and special care may be needed in the completion path
as it may not be in irq context. Special care is also required (by way of
GFP flags) when allocating bounce buffers, to avoid certain highmem
deadlock possibilities.

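The direction-dependent copy described above (copy before submission for a
write, copy in the completion callback for a read) can be sketched in a few
lines. This is a userspace model with hypothetical model_* names; it ignores
kmap, GFP flags and interrupt context entirely.

```c
#include <assert.h>
#include <string.h>

enum model_dir { MODEL_READ, MODEL_WRITE };

/* Hypothetical stand-in for a bounce bio: a device-reachable buffer
 * paired with the caller's original buffer. */
struct model_bounce {
	enum model_dir dir;
	char *orig;      /* caller's buffer (may be out of the device's reach) */
	char *bounce;    /* buffer inside the device-addressable range */
	size_t len;
};

/* Before submission: a write must carry the caller's data. */
static void model_bounce_prepare(struct model_bounce *b)
{
	assert(b->orig && b->bounce);
	if (b->dir == MODEL_WRITE)
		memcpy(b->bounce, b->orig, b->len);
}

/* Completion callback: a read must copy the device's data back. */
static void model_bounce_end_io(struct model_bounce *b)
{
	assert(b->orig && b->bounce);
	if (b->dir == MODEL_READ)
		memcpy(b->orig, b->bounce, b->len);
}
```

The device only ever sees b->bounce; the original buffer is touched once,
either before the i/o (write) or after it (read).
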
It is also possible that a bounce buffer may be allocated from high-memory
area that's not mapped in kernel virtual addr, but within the range that the
device can use directly; so the bounce page may need to be kmapped during
copy operations. [Note: This does not hold in the current implementation,
though]

There are some situations when pages from high memory may need to
be kmapped, even if bounce buffers are not necessary. For example a device
may need to abort DMA operations and revert to PIO for the transfer, in
which case a virtual mapping of the page is required. For SCSI it is also
done in some scenarios where the low level driver cannot be trusted to
handle a single sg entry correctly. The driver is expected to perform the
kmaps as needed on such occasions using the __bio_kmap_atomic and bio_kmap_irq
routines as appropriate. A driver could also use the blk_queue_bounce()
routine on its own to bounce highmem i/o to low memory for specific requests
if so desired.

iii. The i/o scheduler algorithm itself can be replaced/set as appropriate

As in 2.4, it is possible to plug in a brand new i/o scheduler for a particular
queue or pick from (copy) existing generic schedulers and replace/override
certain portions of it. The 2.5 rewrite provides improved modularization
of the i/o scheduler. There are more pluggable callbacks, e.g. for init,
add request, extract request, which makes it possible to abstract specific
i/o scheduling algorithm aspects and details outside of the generic loop.
It also makes it possible to completely hide the implementation details of
the i/o scheduler from block drivers.

I/O scheduler wrappers are to be used instead of accessing the queue directly.
See section 4. The I/O scheduler for details.

1.2 Tuning Based on High level code capabilities

i. Application capabilities for raw i/o

This comes from some of the high-performance database/middleware
requirements where an application prefers to make its own i/o scheduling
decisions based on an understanding of the access patterns and i/o
characteristics.

ii. High performance filesystems or other higher level kernel code's
capabilities

Kernel components like filesystems could also make their own i/o scheduling
decisions for optimizing performance. Journalling filesystems may need
some control over i/o ordering.

What kind of support exists at the generic block layer for this?

The flags and rw fields in the bio structure can be used for some tuning
from above, e.g. indicating that an i/o is just a readahead request, or for
marking barrier requests (discussed next), or priority settings (currently
unused). As far as user applications are concerned they would need an
additional mechanism either via open flags or ioctls, or some other upper
level mechanism to communicate such settings to block.

1.2.1 I/O Barriers

There is a way to enforce strict ordering for i/os through barriers.
All requests before a barrier point must be serviced before the barrier
request and any other requests arriving after the barrier will not be
serviced until after the barrier has completed. This is useful for higher
level control on write ordering, e.g. flushing a log of committed updates
to disk before the corresponding updates themselves.

A flag in the bio structure, BIO_BARRIER, is used to identify a barrier i/o.
The generic i/o scheduler would make sure that it places the barrier request and
all other requests coming after it after all the previous requests in the
queue. Barriers may be implemented in different ways depending on the
driver. For more details regarding I/O barriers, please read barrier.txt
in this directory.

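One way to picture the ordering rule is a sorted queue in which a barrier
closes off everything queued before it. The sketch below is a userspace model
only: the model_* names, the fixed-size array and the plain ascending-sector
sort are assumptions made for illustration, and real schedulers and drivers
may implement barriers quite differently.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_QUEUE_MAX 16

struct model_rq {
	unsigned long sector;
	int barrier;            /* non-zero for a barrier request */
};

struct model_elv_queue {
	struct model_rq rq[MODEL_QUEUE_MAX];
	size_t count;
	size_t sort_start;      /* first slot a new request may be sorted into */
};

/* Insert in ascending sector order, but never ahead of the last barrier. */
static int model_add_request(struct model_elv_queue *q, struct model_rq r)
{
	size_t pos;

	assert(q->sort_start <= q->count);
	if (q->count == MODEL_QUEUE_MAX)
		return -1;

	pos = q->count;
	if (!r.barrier) {
		/* Sort only within the window that follows the last barrier. */
		while (pos > q->sort_start && q->rq[pos - 1].sector > r.sector) {
			q->rq[pos] = q->rq[pos - 1];
			pos--;
		}
	}
	q->rq[pos] = r;
	q->count++;
	if (r.barrier)
		q->sort_start = q->count;   /* later requests stay behind it */
	return 0;
}
```

Queuing sectors 300, 100, then a barrier at 200, then 50 and 10 leaves the
queue as 100, 300, [200], 10, 50: the barrier sits after everything queued
before it, and later requests are sorted only among themselves.
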
1.2.2 Request Priority/Latency

Todo/Under discussion:
Arjan's proposed request priority scheme allows higher levels some broad
control (high/med/low) over the priority of an i/o request vs other pending
requests in the queue. For example it allows reads for bringing in an
executable page on demand to be given a higher priority over pending write
requests which haven't aged too much on the queue. Potentially this priority
could even be exposed to applications in some manner, providing higher level
tunability. Time based aging avoids starvation of lower priority
requests. Some bits in the bi_rw flags field in the bio structure are
intended to be used for this priority information.


1.3 Direct Access to Low level Device/Driver Capabilities (Bypass mode)
    (e.g. Diagnostics, Systems Management)

There are situations where high-level code needs to have direct access to
the low level device capabilities or requires the ability to issue commands
to the device bypassing some of the intermediate i/o layers.
These could, for example, be special control commands issued through ioctl
interfaces, or could be raw read/write commands that stress the drive's
capabilities for certain kinds of fitness tests. Having direct interfaces at
multiple levels without having to pass through upper layers makes
it possible to perform bottom up validation of the i/o path, layer by
layer, starting from the media.

The normal i/o submission interfaces, e.g. submit_bio, could be bypassed
for specially crafted requests which such ioctl or diagnostics
interfaces would typically use, and the elevator add_request routine
can instead be used to directly insert such requests in the queue or preferably
the blk_do_rq routine can be used to place the request on the queue and
wait for completion. Alternatively, sometimes the caller might just
invoke a lower level driver specific interface with the request as a
parameter.

If the request is a means for passing on special information associated with
the command, then such information is associated with the request->special
field (rather than misuse the request->buffer field which is meant for the
request data buffer's virtual mapping).

For passing request data, the caller must build up a bio descriptor
representing the concerned memory buffer if the underlying driver interprets
bio segments or uses the block layer end*request* functions for i/o
completion. Alternatively one could directly use the request->buffer field to
specify the virtual address of the buffer, if the driver expects buffer
addresses passed in this way and ignores bio entries for the request type
involved. In the latter case, the driver would modify and manage the
request->buffer, request->sector and request->nr_sectors or
request->current_nr_sectors fields itself rather than using the block layer
end_request or end_that_request_first completion interfaces.
(See 2.3 or Documentation/block/request.txt for a brief explanation of
the request structure fields)

[TBD: end_that_request_last should be usable even in this case;
Perhaps an end_that_direct_request_first routine could be implemented to make
handling direct requests easier for such drivers; Also for drivers that
expect bios, a helper function could be provided for setting up a bio
corresponding to a data buffer]

<JENS: I don't understand the above, why is end_that_request_first() not
usable? Or _last for that matter. I must be missing something>
<SUP: What I meant here was that if the request doesn't have a bio, then
 end_that_request_first doesn't modify nr_sectors or current_nr_sectors,
 and hence can't be used for advancing request state settings on the
 completion of partial transfers. The driver has to modify these fields
 directly by hand.
 This is because end_that_request_first only iterates over the bio list,
 and always returns 0 if there are none associated with the request.
 _last works OK in this case, and is not a problem, as I mentioned earlier
>

1.3.1 Pre-built Commands

A request can be created with a pre-built custom command to be sent directly
to the device. The cmd block in the request structure has room for filling
in the command bytes. (i.e. rq->cmd is now 16 bytes in size, and meant for
command pre-building, and the type of the request is now indicated
through rq->flags instead of via rq->cmd)

The request structure flags can be set up to indicate the type of request
in such cases (REQ_PC: direct packet command passed to driver, REQ_BLOCK_PC:
packet command issued via blk_do_rq, REQ_SPECIAL: special request).

It can help to pre-build device commands for requests in advance.
Drivers can now specify a request prepare function (q->prep_rq_fn) that the
block layer would invoke to pre-build device commands for a given request,
or perform other preparatory processing for the request. This routine is
called by elv_next_request(), i.e. typically just before servicing a request.
(The prepare function would not be called for requests that have REQ_DONTPREP
enabled)

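The prepare hook can be modelled as a function pointer that is consulted when
the next request is fetched. What follows is a userspace sketch: every model_*
name and the invented 2-byte command are hypothetical, and setting the
"don't prepare" flag inside the fetch routine is an assumption made here to
keep the example short (in practice a driver's prepare function may be the
one that marks the request).

```c
#include <assert.h>
#include <string.h>

#define MODEL_REQ_DONTPREP 0x1u   /* stands in for REQ_DONTPREP */

struct model_pc_request {
	unsigned long sector;
	unsigned int flags;
	unsigned char cmd[16];    /* pre-built command bytes */
};

typedef void (*model_prep_fn)(struct model_pc_request *rq);

struct model_prep_queue {
	struct model_pc_request *head;  /* single-slot "queue" for brevity */
	model_prep_fn prep;             /* plays the role of q->prep_rq_fn */
};

/* Example prepare hook: builds an invented 2-byte command. */
static void model_build_cmd(struct model_pc_request *rq)
{
	memset(rq->cmd, 0, sizeof(rq->cmd));
	rq->cmd[0] = 0x28;                              /* arbitrary opcode */
	rq->cmd[1] = (unsigned char)(rq->sector & 0xff);
}

/* Return the next request, preparing it at most once. */
static struct model_pc_request *model_next_request(struct model_prep_queue *q)
{
	struct model_pc_request *rq;

	assert(q);
	rq = q->head;
	if (rq && q->prep && !(rq->flags & MODEL_REQ_DONTPREP)) {
		q->prep(rq);
		rq->flags |= MODEL_REQ_DONTPREP;  /* assumed: avoid re-preparing */
	}
	return rq;
}
```

A request that already carries a hand-built command is queued with the flag
set, so the hook leaves its cmd bytes alone.
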
Aside:
  Pre-building could possibly even be done early, i.e. before placing the
  request on the queue, rather than construct the command on the fly in the
  driver while servicing the request queue when it may affect latencies in
  interrupt context or responsiveness in general. One way to add early
  pre-building would be to do it whenever we fail to merge on a request.
  Now REQ_NOMERGE is set in the request flags to skip this one in the future,
  which means that it will not change before we feed it to the device. So
  the pre-builder hook can be invoked there.


2. Flexible and generic but minimalist i/o structure/descriptor.

2.1 Reason for a new structure and requirements addressed

Prior to 2.5, buffer heads were used as the unit of i/o at the generic block
layer, and the low level request structure was associated with a chain of
buffer heads for a contiguous i/o request. This led to certain inefficiencies
when it came to large i/o requests and readv/writev style operations, as it
forced such requests to be broken up into small chunks before being passed
on to the generic block layer, only to be merged by the i/o scheduler
when the underlying device was capable of handling the i/o in one shot.
Also, using the buffer head as an i/o structure for i/os that didn't originate
from the buffer cache unnecessarily added to the weight of the descriptors
which were generated for each such chunk.

The following were some of the goals and expectations considered in the
redesign of the block i/o data structure in 2.5.

i.   Should be appropriate as a descriptor for both raw and buffered i/o -
     avoid cache related fields which are irrelevant in the direct/page i/o path,
     or filesystem block size alignment restrictions which may not be relevant
     for raw i/o.
ii.  Ability to represent high-memory buffers (which do not have a virtual
     address mapping in kernel address space).
iii. Ability to represent large i/os without unnecessarily breaking them up
     (i.e. greater than PAGE_SIZE chunks in one shot)
iv.  At the same time, ability to retain independent identity of i/os from
     different sources or i/o units requiring individual completion (e.g. for
     latency reasons)
v.   Ability to represent an i/o involving multiple physical memory segments
     (including non-page aligned page fragments, as specified via readv/writev)
     without unnecessarily breaking it up, if the underlying device is capable of
     handling it.
vi.  Preferably should be based on a memory descriptor structure that can be
     passed around different types of subsystems or layers, maybe even
     networking, without duplication or extra copies of data/descriptor fields
     themselves in the process
vii. Ability to handle the possibility of splits/merges as the structure passes
     through layered drivers (lvm, md, evms), with minimal overhead.

The solution was to define a new structure (bio) for the block layer,
instead of using the buffer head structure (bh) directly, the idea being
avoidance of some associated baggage and limitations. The bio structure
is uniformly used for all i/o at the block layer; it forms a part of the
bh structure for buffered i/o, and in the case of raw/direct i/o kiobufs are
mapped to bio structures.

2.2 The bio struct

The bio structure uses a vector representation pointing to an array of tuples
of <page, offset, len> to describe the i/o buffer, and has various other
fields describing i/o parameters and state that needs to be maintained for
performing the i/o.

Notice that this representation means that a bio has no virtual address
mapping at all (unlike buffer heads).

struct bio_vec {
       struct page     *bv_page;
       unsigned short  bv_len;
       unsigned short  bv_offset;
};

/*
 * main unit of I/O for the block layer and lower layers (ie drivers)
 */
struct bio {
       sector_t            bi_sector;
       struct bio          *bi_next;    /* request queue link */
       struct block_device *bi_bdev;    /* target device */
       unsigned long       bi_flags;    /* status, command, etc */
       unsigned long       bi_rw;       /* low bits: r/w, high: priority */

       unsigned int        bi_vcnt;     /* how many bio_vec's */
       unsigned int        bi_idx;      /* current index into bio_vec array */

       unsigned int        bi_size;     /* total size in bytes */
       unsigned short      bi_phys_segments; /* segments after physaddr coalesce*/
       unsigned short      bi_hw_segments; /* segments after DMA remapping */
       unsigned int        bi_max;      /* max bio_vecs we can hold
                                           used as index into pool */
       struct bio_vec      *bi_io_vec;  /* the actual vec list */
       bio_end_io_t        *bi_end_io;  /* bi_end_io (bio) */
       atomic_t            bi_cnt;      /* pin count: free when it hits zero */
       void                *bi_private;
       bio_destructor_t    *bi_destructor; /* bi_destructor (bio) */
};

With this multipage bio design:

- Large i/os can be sent down in one go using a bio_vec list consisting
  of an array of <page, offset, len> fragments (similar to the way fragments
  are represented in the zero-copy network code)
- Splitting of an i/o request across multiple devices (as in the case of
  lvm or raid) is achieved by cloning the bio (where the clone points to
  the same bi_io_vec array, but with the index and size accordingly modified)
- A linked list of bios is used as before for unrelated merges (*) - this
  avoids reallocs and makes independent completions easier to handle.
- Code that traverses the req list can find all the segments of a bio
  by using rq_for_each_segment. This handles the fact that a request
  has multiple bios, each of which can have multiple segments.
- Drivers which can't process a large bio in one shot can use the bi_idx
  field to keep track of the next bio_vec entry to process.
  (e.g. a 1MB bio_vec needs to be handled in max 128kB chunks for IDE)
  [TBD: Should preferably also have a bi_voffset and bi_vlen to avoid modifying
   bv_offset and bv_len fields]

(*) unrelated merges -- a request ends up containing two or more bios that
    didn't originate from the same place.

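Two of the points above, cloning that shares the vector array and chunked
processing driven by an index, can be illustrated with a small userspace
model. The model_* types are simplified stand-ins (pages are plain numbers)
and the helper names are hypothetical; this is a sketch of the idea rather
than the kernel's bio_clone or completion code.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for bio_vec and bio; pages are plain ids here. */
struct model_vec {
	unsigned long page;     /* page identifier */
	unsigned int len;       /* bytes used in this page */
	unsigned int offset;    /* byte offset into the page */
};

struct model_bio {
	const struct model_vec *vec;   /* shared, never copied */
	unsigned int vcnt;             /* one past the last usable entry */
	unsigned int idx;              /* next entry to process */
	unsigned int size;             /* bytes remaining */
};

/* Clone entries [first, first + count) of src without copying the array. */
static struct model_bio model_clone_range(const struct model_bio *src,
					  unsigned int first,
					  unsigned int count)
{
	struct model_bio clone = *src;
	unsigned int i;

	assert(first >= src->idx && first + count <= src->vcnt);
	clone.idx = first;
	clone.vcnt = first + count;
	clone.size = 0;
	for (i = first; i < first + count; i++)
		clone.size += src->vec[i].len;
	return clone;
}

/* Consume up to max_bytes worth of whole entries; returns bytes consumed. */
static unsigned int model_process_chunk(struct model_bio *bio,
					unsigned int max_bytes)
{
	unsigned int done = 0;

	while (bio->idx < bio->vcnt &&
	       done + bio->vec[bio->idx].len <= max_bytes) {
		done += bio->vec[bio->idx].len;
		bio->idx++;
	}
	bio->size -= done;
	return done;
}
```

Both clones point at the same vec array as the original; only idx, vcnt and
size differ. A driver with a per-transfer ceiling calls the chunk routine
repeatedly and resumes from idx each time.
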
bi_end_io() i/o callback gets called on i/o completion of the entire bio.

At a lower level, drivers build a scatter gather list from the merged bios.
The scatter gather list is in the form of an array of <page, offset, len>
entries with their corresponding dma address mappings filled in at the
appropriate time. As an optimization, contiguous physical pages can be
covered by a single entry where <page> refers to the first page and <len>
covers the range of pages (up to 16 contiguous pages could be covered this
way). There is a helper routine (blk_rq_map_sg) which drivers can use to build
the sg list.

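The clustering optimization can be sketched as a single pass that extends the
previous entry whenever the next fragment starts exactly where it ended.
This is a userspace illustration with hypothetical model_* names; it assumes
4 KiB pages, identifies pages by frame number, and takes the 16-page cap from
the text above. It is not the blk_rq_map_sg implementation.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u
#define MODEL_MAX_SEG   (16u * MODEL_PAGE_SIZE)  /* 16-page cap from the text */

struct model_seg {
	unsigned long pfn;      /* page frame number of the first page */
	unsigned int offset;
	unsigned int len;
};

/* Physical byte address one past the end of a segment. */
static unsigned long long model_seg_end(const struct model_seg *s)
{
	return (unsigned long long)s->pfn * MODEL_PAGE_SIZE + s->offset + s->len;
}

/* Build an sg list from per-page fragments, merging physically adjacent
 * ones. Returns the number of entries written, or -1 if sg is too small. */
static int model_map_sg(const struct model_seg *frag, size_t nfrag,
			struct model_seg *sg, size_t max_sg)
{
	size_t i, n = 0;

	assert(sg != NULL || max_sg == 0);
	for (i = 0; i < nfrag; i++) {
		unsigned long long start =
			(unsigned long long)frag[i].pfn * MODEL_PAGE_SIZE +
			frag[i].offset;

		if (n > 0 && model_seg_end(&sg[n - 1]) == start &&
		    sg[n - 1].len + frag[i].len <= MODEL_MAX_SEG) {
			sg[n - 1].len += frag[i].len;  /* extend previous entry */
			continue;
		}
		if (n == max_sg)
			return -1;
		sg[n++] = frag[i];
	}
	return (int)n;
}
```

Three fragments on consecutive page frames collapse into one entry whose len
spans all of them; a fragment elsewhere in memory starts a new entry.
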
Note: Right now the only user of bios with more than one page is ll_rw_kio,
which in turn means that only raw I/O uses it (direct i/o may not work
right now). The intent however is to enable clustering of pages etc to
become possible. The pagebuf abstraction layer from SGI also uses multi-page
bios, but that is currently not included in the stock development kernels.
The same is true of Andrew Morton's work-in-progress multipage bio writeout
and readahead patches.

2.3 Changes in the Request Structure

The request structure is the structure that gets passed down to low level
drivers. The block layer make_request function builds up a request structure,
places it on the queue and invokes the driver's request_fn. The driver makes
use of block layer helper routine elv_next_request to pull the next request
off the queue. Control or diagnostic functions might bypass block and directly
invoke underlying driver entry points passing in a specially constructed
request structure.

Only some relevant fields (mainly those which changed or may be referred
to in some of the discussion here) are listed below, not necessarily in
the order in which they occur in the structure (see include/linux/blkdev.h).
Refer to Documentation/block/request.txt for details about all the request
structure fields and a quick reference about the layers which are
supposed to use or modify those fields.

struct request {
        struct list_head queuelist;  /* Not meant to be directly accessed by
                                        the driver.
                                        Used by q->elv_next_request_fn
                                        rq->queue is gone
                                        */
        .
        .
        unsigned char cmd[16]; /* prebuilt command data block */
        unsigned long flags;   /* also includes earlier rq->cmd settings */
        .
        .
        sector_t sector; /* this field is now of type sector_t instead of int,
                            in preparation for 64 bit sectors */
        .
        .

        /* Number of scatter-gather DMA addr+len pairs after
         * physical address coalescing is performed.
         */
        unsigned short nr_phys_segments;

        /* Number of scatter-gather addr+len pairs after
         * physical and DMA remapping hardware coalescing is performed.
         * This is the number of scatter-gather entries the driver
         * will actually have to deal with after DMA mapping is done.
         */
        unsigned short nr_hw_segments;

        /* Various sector counts */
        unsigned long nr_sectors;  /* no. of sectors left: driver modifiable */
        unsigned long hard_nr_sectors;  /* block internal copy of above */
        unsigned int current_nr_sectors; /* no. of sectors left in the
                                            current segment: driver modifiable */
        unsigned long hard_cur_sectors; /* block internal copy of the above */
        .
        .
        int tag;        /* command tag associated with request */
        void *special;  /* same as before */
        char *buffer;   /* valid only for low memory buffers up to
                           current_nr_sectors */
        .
        .
        struct bio *bio, *biotail;  /* bio list instead of bh */
        struct request_list *rl;
}

575 |
|
|
See the rq_flag_bits definitions for an explanation of the various flags
|
576 |
|
|
available. Some bits are used by the block layer or i/o scheduler.
|
577 |
|
|
|
578 |
|
|
The behaviour of the various sector counts is almost the same as before,
except that since we have multi-segment bios, current_nr_sectors refers
to the number of sectors in the current segment being processed, which could
be one of the many segments in the current bio (i.e. i/o completion unit).
The nr_sectors value refers to the total number of sectors in the whole
request that remain to be transferred (no change). The purpose of the
hard_xxx values is for block to remember these counts every time it hands
over the request to the driver. These values are updated by block on
end_that_request_first, i.e. every time the driver completes a part of the
transfer and invokes block end*request helpers to mark this. The
driver should not modify these values. The block layer sets up the
nr_sectors and current_nr_sectors fields (based on the corresponding
hard_xxx values and the number of bytes transferred) and updates them on
every transfer that invokes end_that_request_first. It does the same for the
buffer, bio, bio->bi_idx fields too.
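
The accounting described above can be illustrated with a small userspace
model. This is only a sketch: struct model_request, model_init() and
model_end_first() are hypothetical names made up for this example, the
segment sizes are supplied as a plain array instead of a bio chain, and the
return convention (1 while sectors remain, 0 when the request is finished)
is an assumption modelled on end_that_request_first.

```c
#include <assert.h>

/* Hypothetical userspace model; field names mirror struct request. */
struct model_request {
	unsigned long nr_sectors;         /* sectors left in whole request */
	unsigned long hard_nr_sectors;    /* block layer's copy of the above */
	unsigned int current_nr_sectors;  /* sectors left in current segment */
	unsigned long hard_cur_sectors;   /* block layer's copy of the above */
	const unsigned int *seg_sectors;  /* size of each segment, in sectors */
	unsigned int nr_segs;
	unsigned int cur_seg;             /* index of the current segment */
};

static void model_init(struct model_request *rq,
		       const unsigned int *segs, unsigned int n)
{
	unsigned int i;

	rq->seg_sectors = segs;
	rq->nr_segs = n;
	rq->cur_seg = 0;
	rq->hard_nr_sectors = 0;
	for (i = 0; i < n; i++)
		rq->hard_nr_sectors += segs[i];
	rq->hard_cur_sectors = n ? segs[0] : 0;
	/* driver-visible copies are set up from the hard_xxx values */
	rq->nr_sectors = rq->hard_nr_sectors;
	rq->current_nr_sectors = (unsigned int)rq->hard_cur_sectors;
}

/* Account for 'done' completed sectors, crossing segment boundaries. */
static int model_end_first(struct model_request *rq, unsigned long done)
{
	assert(done <= rq->hard_nr_sectors);
	rq->hard_nr_sectors -= done;

	while (done > 0 && rq->cur_seg < rq->nr_segs) {
		if (done >= rq->hard_cur_sectors) {
			/* current segment finished: move to the next one */
			done -= rq->hard_cur_sectors;
			rq->cur_seg++;
			rq->hard_cur_sectors = rq->cur_seg < rq->nr_segs ?
				rq->seg_sectors[rq->cur_seg] : 0;
		} else {
			rq->hard_cur_sectors -= done;
			done = 0;
		}
	}

	/* refresh the driver-modifiable copies from the hard_xxx values */
	rq->nr_sectors = rq->hard_nr_sectors;
	rq->current_nr_sectors = (unsigned int)rq->hard_cur_sectors;
	return rq->hard_nr_sectors != 0;
}
```

In this model a driver may scribble on nr_sectors and current_nr_sectors
between calls; they are rebuilt from the hard_xxx copies on every partial
completion, which is the reason the block layer keeps those copies.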
|
593 |
|
|
|
594 |
|
|
The buffer field is just a virtual address mapping of the current segment
|
595 |
|
|
of the i/o buffer in cases where the buffer resides in low-memory. For high
|
596 |
|
|
memory i/o, this field is not valid and must not be used by drivers.
|
597 |
|
|
|
598 |
|
|
Code that sets up its own request structures and passes them down to
|
599 |
|
|
a driver needs to be careful about interoperation with the block layer helper
|
600 |
|
|
functions which the driver uses. (Section 1.3)
|
601 |
|
|
|
602 |
|
|
3. Using bios
|
603 |
|
|
|
604 |
|
|
3.1 Setup/Teardown
|
605 |
|
|
|
606 |
|
|
There are routines for managing the allocation, reference counting and
freeing of bios (bio_alloc, bio_get, bio_put).
|
608 |
|
|
|
609 |
|
|
This makes use of Ingo Molnar's mempool implementation, which enables
subsystems like bio to maintain their own reserve memory pools for guaranteed
deadlock-free allocations during extreme VM load. For example, the VM
subsystem makes use of the block layer to write out dirty pages in order to be
able to free up memory space, a case which needs careful handling. The
allocation logic draws from the preallocated emergency reserve in situations
where it cannot allocate through normal means. If the pool is empty and it
can wait, then it would trigger action that would help free up memory or
replenish the pool (without deadlocking) and wait for availability in the pool.
If it is in IRQ context, and hence not in a position to do this, allocation
could fail if the pool is empty. In general mempool always first tries to
perform allocation without having to wait, even if it means digging into the
pool, as long as it is not less than 50% full.
|
622 |
|
|
|
623 |
|
|
On a free, memory is released to the pool or directly freed depending on
|
624 |
|
|
the current availability in the pool. The mempool interface lets the
|
625 |
|
|
subsystem specify the routines to be used for normal alloc and free. In the
|
626 |
|
|
case of bio, these routines make use of the standard slab allocator.
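
The reserve-pool idea can be sketched in userspace as follows. This is a
simplified model and not the kernel mempool code: struct model_pool, the
pool_*() functions, flaky_alloc() and the under_pressure flag are all
hypothetical names for this example, the reserve size is arbitrary, and the
waiting path and the 50% heuristic mentioned above are left out.

```c
#include <assert.h>
#include <stdlib.h>

#define POOL_MIN 4	/* hypothetical size of the emergency reserve */

typedef void *(*alloc_fn)(size_t size);
typedef void (*free_fn)(void *elem);

struct model_pool {
	void *reserve[POOL_MIN];  /* preallocated emergency elements */
	int curr_nr;              /* elements currently in the reserve */
	size_t elem_size;
	alloc_fn alloc;           /* "normal" allocator chosen by subsystem */
	free_fn free;
};

/* Stand-in allocator that fails while memory pressure is simulated. */
static int under_pressure;
static void *flaky_alloc(size_t size)
{
	return under_pressure ? NULL : malloc(size);
}

static void pool_destroy(struct model_pool *p)
{
	while (p->curr_nr > 0)
		p->free(p->reserve[--p->curr_nr]);
}

static int pool_init(struct model_pool *p, size_t size,
		     alloc_fn a, free_fn f)
{
	p->curr_nr = 0;
	p->elem_size = size;
	p->alloc = a;
	p->free = f;
	while (p->curr_nr < POOL_MIN) {
		void *e = a(size);

		if (!e) {
			pool_destroy(p);
			return -1;
		}
		p->reserve[p->curr_nr++] = e;
	}
	return 0;
}

static void *pool_alloc(struct model_pool *p)
{
	void *e = p->alloc(p->elem_size);  /* normal means first */

	if (e)
		return e;
	if (p->curr_nr > 0)                /* fall back to the reserve */
		return p->reserve[--p->curr_nr];
	return NULL;  /* a caller that may sleep would wait here instead */
}

static void pool_free(struct model_pool *p, void *e)
{
	if (p->curr_nr < POOL_MIN)
		p->reserve[p->curr_nr++] = e;  /* replenish the reserve */
	else
		p->free(e);                    /* reserve full: really free */
}
```

The point of the model is the asymmetry: allocation touches the reserve only
when the normal allocator fails, while a free tops the reserve up before
anything goes back to the normal allocator.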
|
627 |
|
|
|
628 |
|
|
The caller of bio_alloc is expected to take certain steps to avoid
|
629 |
|
|
deadlocks, e.g. avoid trying to allocate more memory from the pool while
|
630 |
|
|
already holding memory obtained from the pool.
|
631 |
|
|
[TBD: This is a potential issue, though a rare possibility
|
632 |
|
|
in the bounce bio allocation that happens in the current code, since
|
633 |
|
|
it ends up allocating a second bio from the same pool while
|
634 |
|
|
holding the original bio ]
|
635 |
|
|
|
636 |
|
|
Memory allocated from the pool should be released back within a limited
|
637 |
|
|
amount of time (in the case of bio, that would be after the i/o is completed).
|
638 |
|
|
This ensures that if part of the pool has been used up, some work (in this
|
639 |
|
|
case i/o) must already be in progress and memory would be available when it
|
640 |
|
|
is over. If allocating from multiple pools in the same code path, the order
|
641 |
|
|
or hierarchy of allocation needs to be consistent, just the way one deals
|
642 |
|
|
with multiple locks.
|
643 |
|
|
|
644 |
|
|
The bio_alloc routine also needs to allocate the bio_vec_list (bvec_alloc())
for a non-clone bio. There are 6 pools set up for different size biovecs,
so bio_alloc(gfp_mask, nr_iovecs) will allocate a vec_list of the
given size from these slabs.
|
648 |
|
|
|
649 |
|
|
The bi_destructor() routine takes into account the possibility of the bio
|
650 |
|
|
having originated from a different source (see later discussions on
|
651 |
|
|
n/w to block transfers and kvec_cb)
|
652 |
|
|
|
653 |
|
|
The bio_get() routine may be used to hold an extra reference on a bio prior
|
654 |
|
|
to i/o submission, if the bio fields are likely to be accessed after the
|
655 |
|
|
i/o is issued (since the bio may otherwise get freed in case i/o completion
|
656 |
|
|
happens in the meantime).
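
Why the extra reference matters can be shown with a toy reference count.
This is a userspace sketch only: struct model_bio and the model_*()
functions are hypothetical, the counter is a plain int (an assumption made
for brevity; a real implementation would need an atomic counter), and
model_complete_io() merely stands in for the completion path dropping the
submitter's reference.

```c
#include <assert.h>
#include <stdlib.h>

struct model_bio {
	int refcnt;       /* plain int for brevity; not safe for SMP */
	int *freed_flag;  /* set to 1 on destruction, for demonstration */
};

static struct model_bio *model_bio_alloc(int *freed_flag)
{
	struct model_bio *bio = malloc(sizeof(*bio));

	if (bio) {
		bio->refcnt = 1;  /* reference owned by the submitter */
		bio->freed_flag = freed_flag;
	}
	return bio;
}

static void model_bio_get(struct model_bio *bio)
{
	assert(bio->refcnt > 0);
	bio->refcnt++;
}

static void model_bio_put(struct model_bio *bio)
{
	assert(bio->refcnt > 0);
	if (--bio->refcnt == 0) {
		*bio->freed_flag = 1;
		free(bio);
	}
}

/* Stand-in for i/o completion, which drops the submission reference. */
static void model_complete_io(struct model_bio *bio)
{
	model_bio_put(bio);
}
```

A caller that wants to look at the bio after submission takes a reference
with model_bio_get() first; completion then leaves the bio alive until the
caller's own model_bio_put().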
|
657 |
|
|
|
658 |
|
|
The bio_clone() routine may be used to duplicate a bio, where the clone
|
659 |
|
|
shares the bio_vec_list with the original bio (i.e. both point to the
|
660 |
|
|
same bio_vec_list). This would typically be used for splitting i/o requests
|
661 |
|
|
in lvm or md.
|
662 |
|
|
|
663 |
|
|
3.2 Generic bio helper Routines
|
664 |
|
|
|
665 |
|
|
3.2.1 Traversing segments and completion units in a request
|
666 |
|
|
|
667 |
|
|
The macro rq_for_each_segment() should be used for traversing the segments
of the bios in a request (drivers should avoid directly trying to do it
themselves). Using these helpers should also make it easier to cope
with block changes in the future.

	struct req_iterator iter;
	rq_for_each_segment(bio_vec, rq, iter)
		/* bio_vec is now current segment */
|
675 |
|
|
|
676 |
|
|
I/O completion callbacks are per-bio rather than per-segment, so drivers
|
677 |
|
|
that traverse bio chains on completion need to keep that in mind. Drivers
|
678 |
|
|
which don't make a distinction between segments and completion units would
|
679 |
|
|
need to be reorganized to support multi-segment bios.
|
680 |
|
|
|
681 |
|
|
3.2.2 Setting up DMA scatterlists
|
682 |
|
|
|
683 |
|
|
The blk_rq_map_sg() helper routine would be used for setting up scatter
|
684 |
|
|
gather lists from a request, so a driver need not do it on its own.
|
685 |
|
|
|
686 |
|
|
nr_segments = blk_rq_map_sg(q, rq, scatterlist);
|
687 |
|
|
|
688 |
|
|
The helper routine provides a level of abstraction which makes it easier
|
689 |
|
|
to modify the internals of request to scatterlist conversion down the line
|
690 |
|
|
without breaking drivers. The blk_rq_map_sg routine takes care of several
|
691 |
|
|
things like collapsing physically contiguous segments (if QUEUE_FLAG_CLUSTER
|
692 |
|
|
is set) and correct segment accounting to avoid exceeding the limits which
|
693 |
|
|
the i/o hardware can handle, based on various queue properties.
|
694 |
|
|
|
695 |
|
|
- Prevents a clustered segment from crossing a 4GB mem boundary
|
696 |
|
|
- Avoids building segments that would exceed the number of physical
|
697 |
|
|
memory segments that the driver can handle (phys_segments) and the
|
698 |
|
|
number that the underlying hardware can handle at once, accounting for
|
699 |
|
|
DMA remapping (hw_segments) (i.e. IOMMU aware limits).
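
The clustering rules listed above can be sketched as a standalone function.
This is an illustrative model, not the kernel's blk_rq_map_sg(): struct
model_seg and model_map_sg() are hypothetical names, the input is a flat
array of physical address/length pairs, the cluster argument plays the role
of QUEUE_FLAG_CLUSTER, and returning -1 when the segment limit would be
exceeded is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

struct model_seg {
	uint64_t addr;  /* physical start address */
	uint32_t len;   /* length in bytes */
};

/* True if [start, start + len) straddles a 4GB boundary. */
static int crosses_4g(uint64_t start, uint64_t len)
{
	return (start >> 32) != ((start + len - 1) >> 32);
}

/*
 * Copy in[0..n) to out[], merging physically contiguous neighbours when
 * 'cluster' is set. Returns the number of entries written, or -1 if
 * more than max_segs entries would be needed.
 */
static int model_map_sg(const struct model_seg *in, int n,
			struct model_seg *out, int max_segs,
			uint32_t max_seg_len, int cluster)
{
	int nsegs = 0;
	int i;

	assert(max_segs > 0);
	for (i = 0; i < n; i++) {
		if (cluster && nsegs > 0) {
			struct model_seg *last = &out[nsegs - 1];
			uint64_t merged = (uint64_t)last->len + in[i].len;

			if (last->addr + last->len == in[i].addr &&
			    merged <= max_seg_len &&
			    !crosses_4g(last->addr, merged)) {
				last->len = (uint32_t)merged;
				continue;  /* coalesced, no new entry */
			}
		}
		if (nsegs == max_segs)
			return -1;  /* hardware cannot take more entries */
		out[nsegs++] = in[i];
	}
	return nsegs;
}
```

Two neighbouring pages at 0x1000 and 0x2000 collapse into one entry, while
a pair that is contiguous across the 4GB line stays as two entries.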
|
700 |
|
|
|
701 |
|
|
Routines which the low level driver can use to set up the segment limits:
|
702 |
|
|
|
703 |
|
|
blk_queue_max_hw_segments() : Sets an upper limit of the maximum number of
|
704 |
|
|
hw data segments in a request (i.e. the maximum number of address/length
|
705 |
|
|
pairs the host adapter can actually hand to the device at once)
|
706 |
|
|
|
707 |
|
|
blk_queue_max_phys_segments() : Sets an upper limit on the maximum number
|
708 |
|
|
of physical data segments in a request (i.e. the largest sized scatter list
|
709 |
|
|
a driver could handle)
|
710 |
|
|
|
711 |
|
|
3.2.3 I/O completion
|
712 |
|
|
|
713 |
|
|
The existing generic block layer helper routines end_request,
|
714 |
|
|
end_that_request_first and end_that_request_last can be used for i/o
|
715 |
|
|
completion (and setting things up so the rest of the i/o or the next
|
716 |
|
|
request can be kicked off) as before. With the introduction of multi-page
|
717 |
|
|
bio support, end_that_request_first requires an additional argument indicating
|
718 |
|
|
the number of sectors completed.
|
719 |
|
|
|
720 |
|
|
3.2.4 Implications for drivers that do not interpret bios (don't handle
|
721 |
|
|
multiple segments)
|
722 |
|
|
|
723 |
|
|
Drivers that do not interpret bios, e.g. those which do not handle multiple
|
724 |
|
|
segments and do not support i/o into high memory addresses (require bounce
|
725 |
|
|
buffers) and expect only virtually mapped buffers, can access the rq->buffer
|
726 |
|
|
field. As before the driver should use current_nr_sectors to determine the
|
727 |
|
|
size of remaining data in the current segment (that is the maximum it can
|
728 |
|
|
transfer in one go unless it interprets segments), and rely on the block layer
|
729 |
|
|
end_request, or end_that_request_first/last to take care of all accounting
|
730 |
|
|
and transparent mapping of the next bio segment when a segment boundary
|
731 |
|
|
is crossed on completion of a transfer. (The end*request* functions should
|
732 |
|
|
be used only if the request has come down from the block/bio path, not for
direct access requests which only specify rq->buffer without a valid rq->bio.)
|
734 |
|
|
|
735 |
|
|
3.2.5 Generic request command tagging
|
736 |
|
|
|
737 |
|
|
3.2.5.1 Tag helpers
|
738 |
|
|
|
739 |
|
|
Block now offers some simple generic functionality to help support command
|
740 |
|
|
queueing (typically known as tagged command queueing), i.e. manage more than
|
741 |
|
|
one outstanding command on a queue at any given time.
|
742 |
|
|
|
743 |
|
|
blk_queue_init_tags(struct request_queue *q, int depth)
|
744 |
|
|
|
745 |
|
|
Initialize internal command tagging structures for a maximum
|
746 |
|
|
depth of 'depth'.
|
747 |
|
|
|
748 |
|
|
blk_queue_free_tags(struct request_queue *q)
|
749 |
|
|
|
750 |
|
|
Teardown tag info associated with the queue. This will be done
|
751 |
|
|
automatically by block if blk_queue_cleanup() is called on a queue
|
752 |
|
|
that is using tagging.
|
753 |
|
|
|
754 |
|
|
The above are initialization and exit management, the main helpers during
|
755 |
|
|
normal operations are:
|
756 |
|
|
|
757 |
|
|
blk_queue_start_tag(struct request_queue *q, struct request *rq)
|
758 |
|
|
|
759 |
|
|
Start tagged operation for this request. A free tag number between
0 and 'depth' is allocated and assigned to it (rq->tag holds this number),
and 'rq' is added to the internal tag management. If the maximum depth
for this queue is already achieved (or if the tag wasn't started for
some other reason), 1 is returned. Otherwise 0 is returned.
|
764 |
|
|
|
765 |
|
|
blk_queue_end_tag(struct request_queue *q, struct request *rq)
|
766 |
|
|
|
767 |
|
|
End tagged operation on this request. 'rq' is removed from the internal
|
768 |
|
|
book keeping structures.
|
769 |
|
|
|
770 |
|
|
To minimize struct request and queue overhead, the tag helpers utilize some
|
771 |
|
|
of the same request members that are used for normal request queue management.
|
772 |
|
|
This means that a request cannot both be an active tag and be on the queue
|
773 |
|
|
list at the same time. blk_queue_start_tag() will remove the request, but
|
774 |
|
|
the driver must remember to call blk_queue_end_tag() before signalling
|
775 |
|
|
completion of the request to the block layer. This means ending tag
|
776 |
|
|
operations before calling end_that_request_last()! For an example of a user
|
777 |
|
|
of these helpers, see the IDE tagged command queueing support.
|
778 |
|
|
|
779 |
|
|
Certain hardware conditions may dictate a need to invalidate the block tag
|
780 |
|
|
queue. For instance, on IDE any tagged request error needs to clear both
|
781 |
|
|
the hardware and software block queue and enable the driver to sanely restart
|
782 |
|
|
all the outstanding requests. There's a third helper to do that:
|
783 |
|
|
|
784 |
|
|
blk_queue_invalidate_tags(struct request_queue *q)
|
785 |
|
|
|
786 |
|
|
Clear the internal block tag queue and re-add all the pending requests
|
787 |
|
|
to the request queue. The driver will receive them again on the
|
788 |
|
|
next request_fn run, just like it did the first time it encountered
|
789 |
|
|
them.
|
790 |
|
|
|
791 |
|
|
3.2.5.2 Tag info
|
792 |
|
|
|
793 |
|
|
Some block functions exist to query current tag status or to go from a
|
794 |
|
|
tag number to the associated request. These are, in no particular order:
|
795 |
|
|
|
796 |
|
|
blk_queue_tagged(q)
|
797 |
|
|
|
798 |
|
|
Returns 1 if the queue 'q' is using tagging, 0 if not.
|
799 |
|
|
|
800 |
|
|
blk_queue_tag_request(q, tag)
|
801 |
|
|
|
802 |
|
|
Returns a pointer to the request associated with tag 'tag'.
|
803 |
|
|
|
804 |
|
|
blk_queue_tag_depth(q)
|
805 |
|
|
|
806 |
|
|
Return current queue depth.
|
807 |
|
|
|
808 |
|
|
blk_queue_tag_queue(q)
|
809 |
|
|
|
810 |
|
|
Returns 1 if the queue can accept a new queued command, 0 if we are
|
811 |
|
|
at the maximum depth already.
|
812 |
|
|
|
813 |
|
|
blk_rq_tagged(rq)
|
814 |
|
|
|
815 |
|
|
Returns 1 if the request 'rq' is tagged.
|
816 |
|
|
|
817 |
|
|
3.2.5.3 Internal structure
|
818 |
|
|
|
819 |
|
|
Internally, block manages tags in the blk_queue_tag structure:
|
820 |
|
|
|
821 |
|
|
	struct blk_queue_tag {
		struct request **tag_index;	/* array of pointers to rq */
		unsigned long *tag_map;		/* bitmap of free tags */
		struct list_head busy_list;	/* fifo list of busy tags */
		int busy;			/* queue depth */
		int max_depth;			/* max queue depth */
	};
|
828 |
|
|
|
829 |
|
|
Most of the above is simple and straightforward, however busy_list may need
a bit of explaining. Normally we don't care too much about request ordering,
but in the event of any barrier requests in the tag queue we need to ensure
that requests are restarted in the order they were queued. This may happen
if the driver needs to use blk_queue_invalidate_tags().
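
The interplay of the tag bitmap, the tag index and the depth counter can be
sketched in userspace like this. All names (struct tag_rq, struct
model_tags, the model_*() functions) are hypothetical; the bitmap is a
single unsigned long in which a set bit means "tag in use" (an assumption
of this sketch), the depth is capped at 32, and the busy_list ordering
discussed above is omitted. The return values follow the convention
documented for blk_queue_start_tag (0 on success, 1 when no tag could be
started).

```c
#include <assert.h>
#include <string.h>

#define MODEL_MAX_DEPTH 32

struct tag_rq {
	int tag;     /* assigned tag number, -1 when untagged */
	int queued;  /* plays the role of REQ_QUEUED */
};

struct model_tags {
	struct tag_rq *tag_index[MODEL_MAX_DEPTH];  /* tag -> request */
	unsigned long tag_map;  /* bit set = tag in use (sketch only) */
	int busy;               /* current queue depth */
	int max_depth;          /* max queue depth */
};

static void model_init_tags(struct model_tags *t, int depth)
{
	assert(depth > 0 && depth <= MODEL_MAX_DEPTH);
	memset(t, 0, sizeof(*t));
	t->max_depth = depth;
}

/* Returns 0 if a tag was assigned, 1 if the queue is at max depth. */
static int model_start_tag(struct model_tags *t, struct tag_rq *rq)
{
	int tag;

	if (t->busy >= t->max_depth)
		return 1;
	/* busy < max_depth guarantees a clear bit below max_depth */
	for (tag = 0; tag < t->max_depth; tag++)
		if (!(t->tag_map & (1UL << tag)))
			break;
	t->tag_map |= 1UL << tag;
	t->tag_index[tag] = rq;
	t->busy++;
	rq->tag = tag;
	rq->queued = 1;
	return 0;
}

static void model_end_tag(struct model_tags *t, struct tag_rq *rq)
{
	assert(rq->queued && t->tag_index[rq->tag] == rq);
	t->tag_map &= ~(1UL << rq->tag);
	t->tag_index[rq->tag] = NULL;
	t->busy--;
	rq->tag = -1;
	rq->queued = 0;
}

/* Go from a tag number to the associated request (NULL if unused). */
static struct tag_rq *model_find_tag(struct model_tags *t, int tag)
{
	return (tag >= 0 && tag < t->max_depth) ? t->tag_index[tag] : NULL;
}
```

With a depth of 2, a third model_start_tag() call fails until one of the
outstanding requests has been through model_end_tag().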
|
834 |
|
|
|
835 |
|
|
Tagging also defines a new request flag, REQ_QUEUED. This is set whenever
a request is currently tagged. You should not test this flag directly;
blk_rq_tagged(rq) is the portable way to check whether a request is tagged.
|
838 |
|
|
|
839 |
|
|
3.3 I/O Submission
|
840 |
|
|
|
841 |
|
|
The routine submit_bio() is used to submit a single io. Higher level i/o
|
842 |
|
|
routines make use of this:
|
843 |
|
|
|
844 |
|
|
(a) Buffered i/o:
|
845 |
|
|
The routine submit_bh() invokes submit_bio() on a bio corresponding to the
|
846 |
|
|
bh, allocating the bio if required. ll_rw_block() uses submit_bh() as before.
|
847 |
|
|
|
848 |
|
|
(b) Kiobuf i/o (for raw/direct i/o):
|
849 |
|
|
The ll_rw_kio() routine breaks up the kiobuf into page sized chunks and
|
850 |
|
|
maps the array to one or more multi-page bios, issuing submit_bio() to
|
851 |
|
|
perform the i/o on each of these.
|
852 |
|
|
|
853 |
|
|
The embedded bh array in the kiobuf structure has been removed and no
|
854 |
|
|
preallocation of bios is done for kiobufs. [The intent is to remove the
|
855 |
|
|
blocks array as well, but it's currently in there to kludge around direct i/o.]
|
856 |
|
|
Thus kiobuf allocation has switched back to using kmalloc rather than vmalloc.
|
857 |
|
|
|
858 |
|
|
Todo/Observation:
|
859 |
|
|
|
860 |
|
|
A single kiobuf structure is assumed to correspond to a contiguous range
|
861 |
|
|
of data, so brw_kiovec() invokes ll_rw_kio for each kiobuf in a kiovec.
|
862 |
|
|
So right now it wouldn't work for direct i/o on non-contiguous blocks.
|
863 |
|
|
This is to be resolved. The eventual direction is to replace kiobuf
|
864 |
|
|
by kvec's.
|
865 |
|
|
|
866 |
|
|
Badari Pulavarty has a patch to implement direct i/o correctly using
|
867 |
|
|
bio and kvec.
|
868 |
|
|
|
869 |
|
|
|
870 |
|
|
(c) Page i/o:
|
871 |
|
|
Todo/Under discussion:
|
872 |
|
|
|
873 |
|
|
Andrew Morton's multi-page bio patches attempt to issue multi-page
|
874 |
|
|
writeouts (and reads) from the page cache, by directly building up
|
875 |
|
|
large bios for submission completely bypassing the usage of buffer
|
876 |
|
|
heads. This work is still in progress.
|
877 |
|
|
|
878 |
|
|
Christoph Hellwig had some code that uses bios for page-io (rather than
|
879 |
|
|
bh). This isn't included in bio as yet. Christoph was also working on a
|
880 |
|
|
design for representing virtual/real extents as an entity and modifying
|
881 |
|
|
some of the address space ops interfaces to utilize this abstraction rather
|
882 |
|
|
than buffer_heads. (This is somewhat along the lines of the SGI XFS pagebuf
|
883 |
|
|
abstraction, but intended to be as lightweight as possible).
|
884 |
|
|
|
885 |
|
|
(d) Direct access i/o:
|
886 |
|
|
Direct access requests that do not contain bios would be submitted differently
|
887 |
|
|
as discussed earlier in section 1.3.
|
888 |
|
|
|
889 |
|
|
Aside:
|
890 |
|
|
|
891 |
|
|
Kvec i/o:
|
892 |
|
|
|
893 |
|
|
Ben LaHaise's aio code uses a slightly different structure instead
|
894 |
|
|
of kiobufs, called a kvec_cb. This contains an array of <page, offset, len>
tuples (very much like the networking code), together with a callback function
|
896 |
|
|
and data pointer. This is embedded into a brw_cb structure when passed
|
897 |
|
|
to brw_kvec_async().
|
898 |
|
|
|
899 |
|
|
Now it should be possible to directly map these kvecs to a bio. Just as while
|
900 |
|
|
cloning, in this case rather than PRE_BUILT bio_vecs, we set the bi_io_vec
|
901 |
|
|
array pointer to point to the veclet array in kvecs.
|
902 |
|
|
|
903 |
|
|
TBD: In order for this to work, some changes are needed in the way multi-page
|
904 |
|
|
bios are handled today. The values of the tuples in such a vector passed in
|
905 |
|
|
from higher level code should not be modified by the block layer in the course
|
906 |
|
|
of its request processing, since that would make it hard for the higher layer
|
907 |
|
|
to continue to use the vector descriptor (kvec) after i/o completes. Instead,
|
908 |
|
|
all such transient state should be maintained in the request structure
and passed on in some way to the endio completion routine.
|
910 |
|
|
|
911 |
|
|
|
912 |
|
|
4. The I/O scheduler
|
913 |
|
|
The I/O scheduler, a.k.a. elevator, is implemented in two layers: the generic
dispatch queue and the specific I/O schedulers. Unless stated otherwise,
elevator is used to refer to both parts and I/O scheduler to specific I/O
schedulers.
|
916 |
|
|
|
917 |
|
|
The block layer implements the generic dispatch queue in ll_rw_blk.c and
elevator.c.
|
918 |
|
|
The generic dispatch queue is responsible for properly ordering barrier
|
919 |
|
|
requests, requeueing, handling non-fs requests and all other subtleties.
|
920 |
|
|
|
921 |
|
|
Specific I/O schedulers are responsible for ordering normal filesystem
requests. They can also choose to delay certain requests to improve
throughput or for some other purpose. As the plural form indicates, there are
multiple I/O schedulers. They can be built as modules but at least one should
be built inside the kernel. Each queue can choose a different one and can also
change to another one dynamically.
|
927 |
|
|
|
928 |
|
|
A block layer call to the i/o scheduler follows the convention elv_xxx(). This
|
929 |
|
|
calls elevator_xxx_fn in the elevator switch (drivers/block/elevator.c). Oh,
|
930 |
|
|
xxx and xxx might not match exactly, but use your imagination. If an elevator
|
931 |
|
|
doesn't implement a function, the switch does nothing or some minimal house
|
932 |
|
|
keeping work.
|
933 |
|
|
|
934 |
|
|
4.1. I/O scheduler API
|
935 |
|
|
|
936 |
|
|
The functions an elevator may implement are: (* are mandatory)
elevator_merge_fn           called to query requests for merge with a bio

elevator_merge_req_fn       called when two requests get merged. The one
                            which gets merged into the other one will
                            never be seen by the I/O scheduler again. IOW,
                            after being merged, the request is gone.

elevator_merged_fn          called when a request in the scheduler has been
                            involved in a merge. It is used in the deadline
                            scheduler for example, to reposition the request
                            if its sorting order has changed.

elevator_allow_merge_fn     called whenever the block layer determines
                            that a bio can be merged into an existing
                            request safely. The io scheduler may still
                            want to stop a merge at this point if it
                            results in some sort of conflict internally;
                            this hook allows it to do that.

elevator_dispatch_fn        fills the dispatch queue with ready requests.
                            I/O schedulers are free to postpone requests by
                            not filling the dispatch queue unless @force
                            is non-zero. Once dispatched, I/O schedulers
                            are not allowed to manipulate the requests -
                            they belong to the generic dispatch queue.

elevator_add_req_fn         called to add a new request into the scheduler

elevator_queue_empty_fn     returns true if the merge queue is empty.
                            Drivers shouldn't use this, but rather check
                            if elv_next_request is NULL (without losing the
                            request if one exists!)

elevator_former_req_fn
elevator_latter_req_fn      These return the request before or after the
                            one specified in disk sort order. Used by the
                            block layer to find merge possibilities.

elevator_completed_req_fn   called when a request is completed.

elevator_may_queue_fn       returns true if the scheduler wants to allow the
                            current context to queue a new request even if
                            it is over the queue limit. This must be used
                            very carefully!!

elevator_set_req_fn
elevator_put_req_fn         Must be used to allocate and free any elevator
                            specific storage for a request.

elevator_activate_req_fn    Called when device driver first sees a request.
                            I/O schedulers can use this callback to
                            determine when actual execution of a request
                            starts.
elevator_deactivate_req_fn  Called when device driver decides to delay
                            a request by requeueing it.

elevator_init_fn
elevator_exit_fn            Allocate and free any elevator specific storage
                            for a queue.
|
996 |
|
|
|
997 |
|
|
4.2 Request flows seen by I/O schedulers
|
998 |
|
|
All requests seen by I/O schedulers strictly follow one of the following three
|
999 |
|
|
flows.
|
1000 |
|
|
|
1001 |
|
|
 set_req_fn ->

 i.   add_req_fn -> (merged_fn ->)* -> dispatch_fn -> activate_req_fn ->
      (deactivate_req_fn -> activate_req_fn ->)* -> completed_req_fn
 ii.  add_req_fn -> (merged_fn ->)* -> merge_req_fn
 iii. [none]

 -> put_req_fn
|
1009 |
|
|
|
1010 |
|
|
4.3 I/O scheduler implementation
|
1011 |
|
|
The generic i/o scheduler algorithm attempts to sort/merge/batch requests for
|
1012 |
|
|
optimal disk scan and request servicing performance (based on generic
|
1013 |
|
|
principles and device capabilities), optimized for:
|
1014 |
|
|
i. improved throughput
|
1015 |
|
|
ii. improved latency
|
1016 |
|
|
iii. better utilization of h/w & CPU time
|
1017 |
|
|
|
1018 |
|
|
Characteristics:
|
1019 |
|
|
|
1020 |
|
|
i. Binary tree
|
1021 |
|
|
AS and deadline i/o schedulers use red black binary trees for disk position
|
1022 |
|
|
sorting and searching, and a fifo linked list for time-based searching. This
|
1023 |
|
|
gives good scalability and good availability of information. Requests are
|
1024 |
|
|
almost always dispatched in disk sort order, so a cache is kept of the next
|
1025 |
|
|
request in sort order to prevent binary tree lookups.
|
1026 |
|
|
|
1027 |
|
|
This arrangement is not a generic block layer characteristic however, so
|
1028 |
|
|
elevators may implement queues as they please.
|
1029 |
|
|
|
1030 |
|
|
ii. Merge hash
|
1031 |
|
|
AS and deadline use a hash table indexed by the last sector of a request. This
|
1032 |
|
|
enables merging code to quickly look up "back merge" candidates, even when
|
1033 |
|
|
multiple I/O streams are being performed at once on one disk.
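
A back-merge lookup of this kind can be sketched as a small chained hash
table. This is an illustration only and not the AS or deadline code: struct
hash_rq and the helper names are hypothetical, the bucket count is
arbitrary, and the sketch keys each request on the sector just past its end
(sector + nr_sectors), which is an assumption about how "last sector" is
encoded.

```c
#include <assert.h>
#include <stddef.h>

#define HASH_BUCKETS 64

struct hash_rq {
	unsigned long sector;      /* first sector of the request */
	unsigned long nr_sectors;  /* length in sectors */
	struct hash_rq *hash_next; /* chain within a bucket */
};

/* Key: the sector at which a back-mergeable bio would have to start. */
static unsigned long rq_end(const struct hash_rq *rq)
{
	return rq->sector + rq->nr_sectors;
}

static unsigned int bucket_of(unsigned long key)
{
	return (unsigned int)(key % HASH_BUCKETS);
}

static void hash_add(struct hash_rq **table, struct hash_rq *rq)
{
	unsigned int b = bucket_of(rq_end(rq));

	rq->hash_next = table[b];
	table[b] = rq;
}

static void hash_del(struct hash_rq **table, struct hash_rq *rq)
{
	struct hash_rq **pp = &table[bucket_of(rq_end(rq))];

	while (*pp && *pp != rq)
		pp = &(*pp)->hash_next;
	if (*pp)
		*pp = rq->hash_next;
}

/* Find a request that ends exactly where the new bio begins. */
static struct hash_rq *find_back_merge(struct hash_rq **table,
				       unsigned long bio_sector)
{
	struct hash_rq *rq;

	for (rq = table[bucket_of(bio_sector)]; rq; rq = rq->hash_next)
		if (rq_end(rq) == bio_sector)
			return rq;
	return NULL;
}

/* Grow the request at its tail; its key changes, so rehash it. */
static void back_merge(struct hash_rq **table, struct hash_rq *rq,
		       unsigned long bio_sectors)
{
	hash_del(table, rq);
	rq->nr_sectors += bio_sectors;
	hash_add(table, rq);
}
```

Because the key depends on where the request ends, every back merge has to
remove and re-insert the request, which is what back_merge() does here.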
|
1034 |
|
|
|
1035 |
|
|
"Front merges", a new request being merged at the front of an existing request,
|
1036 |
|
|
are far less common than "back merges" due to the nature of most I/O patterns.
|
1037 |
|
|
Front merges are handled by the binary trees in AS and deadline schedulers.
|
1038 |
|
|
|
1039 |
|
|
iii. Plugging the queue to batch requests in anticipation of opportunities for
|
1040 |
|
|
merge/sort optimizations
|
1041 |
|
|
|
1042 |
|
|
This is just the same as in 2.4 so far, though per-device unplugging
|
1043 |
|
|
support is anticipated for 2.5. Also with a priority-based i/o scheduler,
|
1044 |
|
|
such decisions could be based on request priorities.
|
1045 |
|
|
|
1046 |
|
|
Plugging is an approach that the current i/o scheduling algorithm resorts to so
|
1047 |
|
|
that it collects up enough requests in the queue to be able to take
|
1048 |
|
|
advantage of the sorting/merging logic in the elevator. If the
|
1049 |
|
|
queue is empty when a request comes in, then it plugs the request queue
|
1050 |
|
|
(sort of like plugging the bottom of a vessel to get fluid to build up)
|
1051 |
|
|
till it fills up with a few more requests, before starting to service
|
1052 |
|
|
the requests. This provides an opportunity to merge/sort the requests before
|
1053 |
|
|
passing them down to the device. There are various conditions when the queue is
|
1054 |
|
|
unplugged (to open up the flow again), either through a scheduled task or
|
1055 |
|
|
on demand. For example, wait_on_buffer sets the unplugging going
|
1056 |
|
|
(by running tq_disk) so the read gets satisfied soon. So in the read case,
|
1057 |
|
|
the queue gets explicitly unplugged as part of waiting for completion,
|
1058 |
|
|
in fact all queues get unplugged as a side-effect.
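
The plug/unplug behaviour can be modelled with a few counters. This is a
deliberately crude sketch: struct plug_queue, plug_add_request() and
plug_unplug() are hypothetical names, and the fixed UNPLUG_THRESH standing
in for "a few more requests" is an assumption of this example, not a value
taken from the block layer.

```c
#include <assert.h>

#define UNPLUG_THRESH 4  /* hypothetical batch size */

struct plug_queue {
	int plugged;        /* 1 while requests are being held back */
	int nr_queued;      /* requests waiting in the queue */
	int nr_dispatched;  /* requests handed to the device so far */
};

/* Open up the flow: everything queued goes down to the device. */
static void plug_unplug(struct plug_queue *q)
{
	q->plugged = 0;
	q->nr_dispatched += q->nr_queued;
	q->nr_queued = 0;
}

static void plug_add_request(struct plug_queue *q)
{
	if (q->nr_queued == 0)
		q->plugged = 1;  /* queue was empty: plug it */
	q->nr_queued++;
	if (q->nr_queued >= UNPLUG_THRESH)
		plug_unplug(q);  /* enough has built up to sort/merge */
}
```

A waiter that needs its data immediately corresponds to calling
plug_unplug() directly, before the threshold is reached.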
|
1059 |
|
|
|
1060 |
|
|
Aside:
|
1061 |
|
|
This is kind of controversial territory, as it's not clear if plugging is
|
1062 |
|
|
always the right thing to do. Devices typically have their own queues,
|
1063 |
|
|
and allowing a big queue to build up in software, while letting the device be
|
1064 |
|
|
idle for a while may not always make sense. The trick is to handle the fine
|
1065 |
|
|
balance between when to plug and when to open up. Also now that we have
|
1066 |
|
|
multi-page bios being queued in one shot, we may not need to wait to merge
|
1067 |
|
|
a big request from the broken up pieces coming by.
|
1068 |
|
|
|
1069 |
|
|
Per-queue granularity unplugging (still a Todo) may help reduce some of the
|
1070 |
|
|
concerns with just a single tq_disk flush approach. Something like
|
1071 |
|
|
blk_kick_queue() to unplug a specific queue (right away ?)
|
1072 |
|
|
or optionally, all queues, is in the plan.
|
1073 |
|
|
|
1074 |
|
|
4.4 I/O contexts
|
1075 |
|
|
I/O contexts provide a dynamically allocated per process data area. They may
|
1076 |
|
|
be used in I/O schedulers, and in the block layer (could be used for I/O
statistics,
|
1077 |
|
|
priorities for example). See *io_context in block/ll_rw_blk.c, and as-iosched.c
|
1078 |
|
|
for an example of usage in an i/o scheduler.
|
1079 |
|
|
|
1080 |
|
|
|
1081 |
|
|
5. Scalability related changes
|
1082 |
|
|
|
1083 |
|
|
5.1 Granular Locking: io_request_lock replaced by a per-queue lock
|
1084 |
|
|
|
1085 |
|
|
The global io_request_lock has been removed as of 2.5, to avoid
|
1086 |
|
|
the scalability bottleneck it was causing, and has been replaced by more
|
1087 |
|
|
granular locking. The request queue structure has a pointer to the
|
1088 |
|
|
lock to be used for that queue. As a result, locking can now be
|
1089 |
|
|
per-queue, with a provision for sharing a lock across queues if
|
1090 |
|
|
necessary (e.g. the scsi layer sets the queue lock pointers to the
|
1091 |
|
|
corresponding adapter lock, which results in a per host locking
|
1092 |
|
|
granularity). The locking semantics are the same, i.e. locking is
|
1093 |
|
|
still imposed by the block layer, grabbing the lock before
|
1094 |
|
|
request_fn execution, which means that lots of older drivers
|
1095 |
|
|
should still be SMP safe. Drivers are free to drop the queue
|
1096 |
|
|
lock themselves, if required. Drivers that explicitly used the
|
1097 |
|
|
io_request_lock for serialization need to be modified accordingly.
|
1098 |
|
|
Usually it's as easy as adding a global lock:
|
1099 |
|
|
|
1100 |
|
|
static spinlock_t my_driver_lock = SPIN_LOCK_UNLOCKED;
|
1101 |
|
|
|
1102 |
|
|
and passing the address to that lock to blk_init_queue().
|
1103 |
|
|
|
1104 |
|
|
5.2 64 bit sector numbers (sector_t prepares for 64 bit support)
|
1105 |
|
|
|
1106 |
|
|
The sector number used in the bio structure has been changed to sector_t,
|
1107 |
|
|
which could be defined as 64 bit in preparation for 64 bit sector support.
|
1108 |
|
|
|
1109 |
|
|
6. Other Changes/Implications
|
1110 |
|
|
|
1111 |
|
|
6.1 Partition re-mapping handled by the generic block layer
|
1112 |
|
|
|
1113 |
|
|
In 2.5 some of the gendisk/partition related code has been reorganized.
|
1114 |
|
|
Now the generic block layer performs partition-remapping early and thus
|
1115 |
|
|
provides drivers with a sector number relative to whole device, rather than
|
1116 |
|
|
having to take partition number into account in order to arrive at the true
|
1117 |
|
|
sector number. The routine blk_partition_remap() is invoked by
|
1118 |
|
|
generic_make_request even before invoking the queue specific make_request_fn,
|
1119 |
|
|
so the i/o scheduler also gets to operate on whole disk sector numbers. This
|
1120 |
|
|
should typically not require changes to block drivers, it just never gets
|
1121 |
|
|
to invoke its own partition sector offset calculations since all bios
|
1122 |
|
|
sent are offset from the beginning of the device.
|
1123 |
|
|
|
1124 |
|
|
|
1125 |
|
|
7. A Few Tips on Migration of older drivers
|
1126 |
|
|
|
1127 |
|
|
Old-style drivers that just use CURRENT and ignore clustered requests
|
1128 |
|
|
may not need much change. The generic layer will automatically handle
|
1129 |
|
|
clustered requests, multi-page bios, etc for the driver.
|
1130 |
|
|
|
1131 |
|
|
For a low performance driver or hardware that is PIO driven or just doesn't
|
1132 |
|
|
support scatter-gather changes should be minimal too.
|
1133 |
|
|
|
1134 |
|
|
The following are some points to keep in mind when converting old drivers
|
1135 |
|
|
to bio.

Drivers should use elv_next_request to pick up requests and are no longer
supposed to handle looping directly over the request list.
(struct request->queue has been removed)

end_that_request_first now takes an additional number_of_sectors argument.
It used to handle only the first buffer_head in a request; now
it will loop and handle as many sectors (on a bio-segment granularity)
as specified.
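
The looping behaviour can be sketched in userspace as follows. This is a
simplified model and not the kernel implementation: struct model_request,
model_end_request_first and the fixed-size segment array are hypothetical,
and the return convention (1 while sectors remain, 0 once the request is
finished) is an assumption of the model.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_SEGS 8

/* Hypothetical request: a list of segments, each some sectors long. */
struct model_request {
	unsigned int seg_sectors[MODEL_MAX_SEGS];	/* sectors left per segment */
	size_t nr_segs;
	size_t cur_seg;		/* index of the first unfinished segment */
};

/*
 * Complete up to nr_sectors from the front of the request, walking
 * over as many segments as needed.
 * Returns 1 while sectors remain, 0 once the request is fully done.
 */
static int model_end_request_first(struct model_request *rq,
				   unsigned int nr_sectors)
{
	assert(rq != NULL && rq->nr_segs <= MODEL_MAX_SEGS);

	while (nr_sectors > 0 && rq->cur_seg < rq->nr_segs) {
		unsigned int *left = &rq->seg_sectors[rq->cur_seg];

		if (nr_sectors < *left) {
			*left -= nr_sectors;	/* segment only partly done */
			break;
		}
		nr_sectors -= *left;	/* segment finished, move on */
		*left = 0;
		rq->cur_seg++;
	}
	return rq->cur_seg < rq->nr_segs;
}
```

A driver that transfers a request in several chunks would call such a
function once per chunk and retire the request when it returns 0.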

Now bh->b_end_io is replaced by bio->bi_end_io, but most of the time the
right thing to use is bio_endio(bio, uptodate) instead.

If the driver is dropping the io_request_lock from its request_fn strategy,
then it just needs to replace that with q->queue_lock instead.

As described in Sec 1.1, drivers can set max sector size, max segment size
etc. per queue now. Drivers that used to define their own merge functions
to handle things like this can now just use the blk_queue_* functions at
blk_init_queue time.

Drivers no longer have to map a {partition, sector offset} into the
correct absolute location; this is now done by the block layer. So
where a driver previously received a request like this:

	rq->rq_dev = mk_kdev(3, 5);	/* /dev/hda5 */
	rq->sector = 0;			/* first sector on hda5 */

it will now see

	rq->rq_dev = mk_kdev(3, 0);	/* /dev/hda */
	rq->sector = 123128;		/* offset from start of disk */

As mentioned, there is no virtual mapping of a bio. For DMA, this is
not a problem as the driver probably never will need a virtual mapping.
Instead it needs a bus mapping (pci_map_page for a single segment, or
blk_rq_map_sg for scatter gather) to be able to ship it to the device. For
PIO drivers (or drivers that need to revert to PIO transfer once in a
while, IDE for example), where the CPU is doing the actual data
transfer, a virtual mapping is needed. If the driver supports highmem I/O
(Sec 1.1, (ii)), it needs to use __bio_kmap_atomic and bio_kmap_irq to
temporarily map a bio into the virtual address space.


8. Prior/Related/Impacted patches

8.1. Earlier kiobuf patches (sct/axboe/chait/hch/mkp)
- orig kiobuf & raw i/o patches (now in 2.4 tree)
- direct kiobuf based i/o to devices (no intermediate bh's)
- page i/o using kiobuf
- kiobuf splitting for lvm (mkp)
- elevator support for kiobuf request merging (axboe)
8.2. Zero-copy networking (Dave Miller)
8.3. SGI XFS - pagebuf patches - use of kiobufs
8.4. Multi-page pioent patch for bio (Christoph Hellwig)
8.5. Direct i/o implementation (Andrea Arcangeli) since 2.4.10-pre11
8.6. Async i/o implementation patch (Ben LaHaise)
8.7. EVMS layering design (IBM EVMS team)
8.8. Larger page cache size patch (Ben LaHaise) and
     Large page size (Daniel Phillips)
	=> larger contiguous physical memory buffers
8.9. VM reservations patch (Ben LaHaise)
8.10. Write clustering patches ? (Marcelo/Quintela/Riel ?)
8.11. Block device in page cache patch (Andrea Arcangeli) - now in 2.4.10+
8.12. Multiple block-size transfers for faster raw i/o (Shailabh Nagar,
      Badari)
8.13. Priority based i/o scheduler - prepatches (Arjan van de Ven)
8.14. IDE Taskfile i/o patch (Andre Hedrick)
8.15. Multi-page writeout and readahead patches (Andrew Morton)
8.16. Direct i/o patches for 2.5 using kvec and bio (Badari Pulavarthy)

9. Other References:

9.1 The Splice I/O Model - Larry McVoy (and subsequent discussions on lkml,
and Linus' comments - Jan 2001)
9.2 Discussions about kiobuf and bh design on lkml between sct, linus, alan
et al - Feb-March 2001 (many of the initial thoughts that led to bio were
brought up in this discussion thread)
9.3 Discussions on mempool on lkml - Dec 2001.