The Second Extended Filesystem
==============================

ext2 was originally released in January 1993. Written by Rémy Card,
Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the
Extended Filesystem. As of April 2001, it is still the predominant
filesystem in use by Linux. There are also implementations available
for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS.

Options
=======

When mounting an ext2 filesystem, the following options are accepted.
Defaults are marked with (*).

bsddf                   (*)     Makes `df' act like BSD.
minixdf                         Makes `df' act like Minix.

check=none, nocheck     (*)     Don't do extra checking of bitmaps on mount
                                (check=normal and check=strict options removed)

debug                           Extra debugging information is sent to the
                                kernel syslog. Useful for developers.

errors=continue         (*)     Keep going on a filesystem error.
errors=remount-ro               Remount the filesystem read-only on an error.
errors=panic                    Panic and halt the machine if an error occurs.

grpid, bsdgroups                Give objects the same group ID as their parent.
nogrpid, sysvgroups     (*)     New objects have the group ID of their creator.

resuid=n                        The user ID which may use the reserved blocks.
resgid=n                        The group ID which may use the reserved blocks.

sb=n                            Use alternate superblock at this location.

grpquota,noquota,quota,usrquota Quota options are silently ignored by ext2.

Specification
=============

ext2 shares many properties with traditional Unix filesystems. It has
the concepts of blocks, inodes and directories. It has space in the
specification for Access Control Lists (ACLs), fragments, undeletion and
compression, though these are not yet implemented (some are available as
separate patches). There is also a versioning mechanism to allow new
features (such as journaling) to be added in a maximally compatible
manner.

Blocks
------

The space in the device or file is split up into blocks. These are
a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems),
which is decided when the filesystem is created. Smaller blocks mean
less wasted space per file, but require slightly more accounting overhead,
and also impose other limits on the size of files and the filesystem.

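The space trade-off between block sizes can be made concrete with a
little arithmetic. The Python sketch below is only an illustration (the
function name is made up and is not part of any ext2 tool); it computes
how many bytes are left unused in the last block of a file.

```python
def slack_bytes(file_size, block_size):
    """Bytes left unused in the last block allocated to a file."""
    if block_size not in (1024, 2048, 4096, 8192):
        raise ValueError("unsupported ext2 block size")
    # A file occupies whole blocks; the tail of the last block is wasted.
    return (-file_size) % block_size
```

For a 5000-byte file this gives 120 wasted bytes with 1024-byte blocks
and 3192 wasted bytes with 4096-byte blocks.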

Block Groups
------------

Blocks are clustered into block groups in order to reduce fragmentation
and minimise the amount of head seeking when reading a large amount
of consecutive data. Information about each block group is kept in a
descriptor table stored in the block(s) immediately after the superblock.
Two blocks near the start of each group are reserved for the block usage
bitmap and the inode usage bitmap which show which blocks and inodes
are in use. Since each bitmap is limited to a single block and uses one
bit per block, a block group can contain at most 8 times as many blocks
as there are bytes in a block.

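The bitmap limit translates directly into a maximum group size. As a
rough Python sketch (the function name is illustrative only):

```python
def block_group_limits(block_size):
    """Return (max_blocks_per_group, max_group_bytes) for a block size in bytes."""
    # One bitmap block holds block_size * 8 bits, one bit per block.
    max_blocks = block_size * 8
    return max_blocks, max_blocks * block_size
```

With 1024-byte blocks a group holds at most 8192 blocks (8 MB); with
4096-byte blocks it holds at most 32768 blocks (128 MB).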

The block(s) following the bitmaps in each block group are designated
as the inode table for that block group and the remainder are the data
blocks. The block allocation algorithm attempts to allocate data blocks
in the same block group as the inode which contains them.

The Superblock
--------------

The superblock contains all the information about the configuration of
the filesystem. The primary copy of the superblock is stored at an
offset of 1024 bytes from the start of the device, and it is essential
to mounting the filesystem. Since it is so important, backup copies of
the superblock are stored in block groups throughout the filesystem.
The first version of ext2 (revision 0) stores a copy at the start of
every block group, along with backups of the group descriptor block(s).
Because this can consume a considerable amount of space for large
filesystems, later revisions can optionally reduce the number of backup
copies by only putting backups in specific groups (this is the sparse
superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7.

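The rule for choosing groups under the sparse superblock feature can be
sketched in Python as follows. The function name is illustrative only;
group 0 is included because it holds the primary copy.

```python
def has_superblock_copy(group):
    """True if a block group holds a superblock copy with sparse superblocks."""
    if group < 0:
        raise ValueError("group number must be non-negative")
    if group in (0, 1):
        return True
    for base in (3, 5, 7):
        power = base
        while power < group:
            power *= base
        if power == group:   # group is an exact power of 3, 5 or 7
            return True
    return False
```

Among the first 50 groups, only groups 0, 1, 3, 5, 7, 9, 25, 27 and 49
carry a copy under this rule.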

The information in the superblock contains fields such as the total
number of inodes and blocks in the filesystem and how many are free,
how many inodes and blocks are in each block group, when the filesystem
was mounted (and if it was cleanly unmounted), when it was modified,
what version of the filesystem it is (see the Revisions section below)
and which OS created it.

If the filesystem is revision 1 or higher, then there are extra fields,
such as a volume name, a unique identification number, the inode size,
and space for optional filesystem features to store configuration info.

All fields in the superblock (as in all other ext2 structures) are stored
on the disc in little endian format, so a filesystem is portable between
machines without having to know what machine it was created on.

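As an illustration of the fixed offset and the little endian encoding,
the following Python sketch decodes a handful of superblock fields from
a raw image. Only the 1024-byte offset and the byte order come from the
text above; the field offsets, the magic number 0xEF53 and the function
name are assumptions based on the commonly documented layout of the
kernel's superblock structure, and should be checked against the kernel
headers before being relied upon.

```python
import struct

SUPERBLOCK_OFFSET = 1024   # stated in the text above
EXT2_MAGIC = 0xEF53        # assumed value of the magic field

def parse_superblock(image):
    """Decode a few superblock fields from the raw bytes of a device image."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    if len(sb) < 1024:
        raise ValueError("image too short to contain a superblock")
    # "<" selects little endian, as used by all ext2 on-disk fields.
    (inodes, blocks, _reserved, free_blocks, free_inodes,
     _first_data_block, log_block_size) = struct.unpack_from("<7I", sb, 0)
    (magic,) = struct.unpack_from("<H", sb, 56)      # assumed offset
    (revision,) = struct.unpack_from("<I", sb, 76)   # assumed offset
    if magic != EXT2_MAGIC:
        raise ValueError("not an ext2 superblock")
    info = {
        "inodes": inodes, "blocks": blocks,
        "free_blocks": free_blocks, "free_inodes": free_inodes,
        "block_size": 1024 << log_block_size, "revision": revision,
    }
    if revision >= 1:
        # Revision 1 adds a volume name (assumed: 16 bytes at offset 120).
        name = sb[120:136].split(b"\0", 1)[0]
        info["volume_name"] = name.decode("latin-1")
    return info
```

The function works on any bytes object, so it can be tried on the first
few kilobytes read from a device or from a filesystem image file.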

Inodes
------

The inode (index node) is a fundamental concept in the ext2 filesystem.
Each object in the filesystem is represented by an inode. The inode
structure contains pointers to the filesystem blocks which contain the
data held in the object and all of the metadata about an object except
its name. The metadata about an object includes the permissions, owner,
group, flags, size, number of blocks used, access time, change time,
modification time, deletion time, number of links, fragments, version
(for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs).

There are some reserved fields which are currently unused in the inode
structure and several which are overloaded. One field is reserved for the
directory ACL if the inode is a directory and alternately for the top 32
bits of the file size if the inode is a regular file (allowing file sizes
larger than 2GB). The translator field is unused under Linux, but is used
by the HURD to reference the inode of a program which will be used to
interpret this object. Most of the remaining reserved fields have been
used up for both Linux and the HURD for larger owner and group fields.
The HURD also has a larger mode field so it uses another of the remaining
fields to store the extra mode bits.

There are pointers to the first 12 blocks which contain the file's data
in the inode. There is a pointer to an indirect block (which contains
pointers to the next set of blocks), a pointer to a doubly-indirect
block (which contains pointers to indirect blocks) and a pointer to a
trebly-indirect block (which contains pointers to doubly-indirect blocks).

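The pointer scheme puts an upper bound on how many data blocks one inode
can address. A small Python sketch of the arithmetic follows; it assumes
that each block pointer is 4 bytes wide, which the text above does not
state, and the function names are illustrative.

```python
POINTER_SIZE = 4       # assumption: 32-bit block numbers
DIRECT_POINTERS = 12   # stated in the text above

def max_mapped_blocks(block_size):
    """Data blocks reachable through the direct and indirect pointers."""
    per_block = block_size // POINTER_SIZE   # pointers held by one block
    return (DIRECT_POINTERS
            + per_block          # via the indirect block
            + per_block ** 2     # via the doubly-indirect block
            + per_block ** 3)    # via the trebly-indirect block

def max_mapped_bytes(block_size):
    """Largest file size the block pointers alone would allow."""
    return max_mapped_blocks(block_size) * block_size
```

For 1kB blocks this comes to 16843020 blocks, or just over 16GB, which
agrees with the file size limit given in the Limitations section. For
4kB and 8kB blocks the pointers could address more than the limits
listed there, so some other restriction applies first.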

The flags field contains some ext2-specific flags which aren't catered
for by the standard chmod flags. These flags can be listed with lsattr
and changed with the chattr command, and allow specific filesystem
behaviour on a per-file basis. There are flags for secure deletion,
undeletable, compression, synchronous updates, immutability, append-only,
no-dump, no-atime, indexed directories, and data-journaling. Not all
of these are supported yet.

Directories
-----------

A directory is a filesystem object and has an inode just like a file.
It is a specially formatted file containing records which associate
each name with an inode number. Later revisions of the filesystem also
encode the type of the object (file, directory, symlink, device, fifo,
socket) to avoid the need to check the inode itself for this information
(support for taking advantage of this feature does not yet exist in
Glibc 2.2).

The inode allocation code tries to assign inodes which are in the same
block group as the directory in which they are first created.

The current implementation of ext2 uses a singly-linked list to store
the filenames in the directory; a pending enhancement uses hashing of the
filenames to allow lookup without the need to scan the entire directory.

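A linear scan of one directory block might look like the Python sketch
below. The record layout used here (a 4-byte inode number, a 2-byte
record length, a 1-byte name length and a 1-byte type, followed by the
name) is an assumption based on the commonly documented directory entry
format and is not spelled out in the text above; the function name is
illustrative.

```python
import struct

def list_directory_block(block, with_file_type=True):
    """Yield (inode, name, file_type) for each used entry in a directory block."""
    offset = 0
    while offset + 8 <= len(block):
        inode, rec_len, name_len, file_type = struct.unpack_from(
            "<IHBB", block, offset)
        if rec_len < 8 or offset + rec_len > len(block):
            raise ValueError("corrupt directory entry at offset %d" % offset)
        if not with_file_type:
            # Older layout: the type byte is the high byte of the name length.
            name_len |= file_type << 8
            file_type = None
        if inode != 0:   # assumption: inode 0 marks an unused record
            name = block[offset + 8:offset + 8 + name_len]
            yield inode, name.decode("latin-1"), file_type
        offset += rec_len   # the record length acts as the link to the next entry
```

Every lookup has to walk the records in order, which is why very large
directories become slow with this scheme.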

The current implementation never removes empty directory blocks once they
have been allocated to hold more files.

Special files
-------------

Symbolic links are also filesystem objects with inodes. They deserve
special mention because the data for them is stored within the inode
itself if the symlink is less than 60 bytes long. It uses the fields
which would normally be used to store the pointers to data blocks.
This is a worthwhile optimisation as it avoids allocating a full
block for the symlink, and most symlinks are less than 60 characters long.

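The 60-byte figure matches the space taken by the block pointers in the
inode, as the following Python sketch shows. It assumes 4-byte block
pointers, and the names are illustrative.

```python
BLOCK_POINTERS = 12 + 3    # direct pointers plus the three indirect ones
POINTER_SIZE = 4           # assumption: 32-bit block numbers
INLINE_AREA = BLOCK_POINTERS * POINTER_SIZE   # bytes available in the inode

def symlink_fits_in_inode(target):
    """True if a symlink target (bytes) is short enough to be stored inline."""
    return len(target) < INLINE_AREA
```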

Character and block special devices never have data blocks assigned to
them. Instead, their device number is stored in the inode, again reusing
the fields which would be used to point to the data blocks.

Reserved Space
--------------

In ext2, there is a mechanism for reserving a certain number of blocks
for a particular user (normally the super-user). This is intended to
allow for the system to continue functioning even if non-privileged users
fill up all the space available to them (this is independent of filesystem
quotas). It also keeps the filesystem from filling up entirely which
helps combat fragmentation.

Filesystem check
----------------

At boot time, most systems run a consistency check (e2fsck) on their
filesystems. The superblock of the ext2 filesystem contains several
fields which indicate whether fsck should actually run (since checking
the filesystem at boot can take a long time if it is large). fsck will
run if the filesystem was not cleanly unmounted, if the maximum mount
count has been exceeded or if the maximum time between checks has been
exceeded.

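The three conditions can be expressed as a small decision function. The
Python sketch below is an illustration only: the parameter names are
invented, times are taken to be in seconds, and both the exact
comparison (reached versus exceeded) and the convention that a
non-positive limit disables a test are assumptions.

```python
def check_reason(clean, mount_count, max_mount_count,
                 last_check, check_interval, now):
    """Return why a full check is due, or None if it can be skipped."""
    if not clean:
        return "not cleanly unmounted"
    # Assumption: a non-positive maximum disables the mount count test.
    if max_mount_count > 0 and mount_count >= max_mount_count:
        return "maximum mount count reached"
    # Assumption: an interval of zero disables the time-based test.
    if check_interval > 0 and now - last_check >= check_interval:
        return "maximum time between checks reached"
    return None
```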

Feature Compatibility
---------------------

The compatibility feature mechanism used in ext2 is sophisticated.
It safely allows features to be added to the filesystem, without
unnecessarily sacrificing compatibility with older versions of the
filesystem code. The feature compatibility mechanism is not supported by
the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in
revision 1. There are three 32-bit fields, one for compatible features
(COMPAT), one for read-only compatible (RO_COMPAT) features and one for
incompatible (INCOMPAT) features.

These feature flags have specific meanings for the kernel as follows:

A COMPAT flag indicates that a feature is present in the filesystem,
but the on-disk format is 100% compatible with older on-disk formats, so
a kernel which didn't know anything about this feature could read/write
the filesystem without any chance of corrupting the filesystem (or even
making it inconsistent). This is essentially just a flag which says
"this filesystem has a (hidden) feature" that the kernel or e2fsck may
want to be aware of (more on e2fsck and feature flags later). The ext3
HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply
a regular file with data blocks in it so the kernel does not need to
take any special notice of it if it doesn't understand ext3 journaling.

An RO_COMPAT flag indicates that the on-disk format is 100% compatible
with older on-disk formats for reading (i.e. the feature does not change
the visible on-disk format). However, an old kernel writing to such a
filesystem would/could corrupt the filesystem, so this is prevented. The
most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because
sparse groups allow file data blocks where superblock/group descriptor
backups used to live, and ext2_free_blocks() refuses to free these blocks,
which would lead to inconsistent bitmaps. An old kernel would also
get an error if it tried to free a series of blocks which crossed a group
boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem.

An INCOMPAT flag indicates the on-disk format has changed in some
way that makes it unreadable by older kernels, or would otherwise
cause a problem if an old kernel tried to mount it. FILETYPE is an
INCOMPAT flag because older kernels would think a filename was longer
than 256 characters, which would lead to corrupt directory listings.
The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel
doesn't understand compression, you would just get garbage back from
read() instead of it automatically decompressing your data. The ext3
RECOVER flag is needed to prevent a kernel which does not understand the
ext3 journal from mounting the filesystem without replaying the journal.

e2fsck needs to be stricter than the kernel in its handling of these
flags. If it doesn't understand ANY of the COMPAT, RO_COMPAT, or
INCOMPAT flags it will refuse to check the filesystem, because it has
no way of verifying whether a given feature is valid or not. Allowing
e2fsck to succeed on a filesystem with an unknown feature would give
the user a false sense of security. Refusing to check a filesystem
with unknown features is a good incentive for the user to update to
the latest e2fsck. This also means that anyone adding feature flags
to ext2 also needs to update e2fsck to verify these features.

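The difference between the kernel's rules and those of e2fsck can be
summarised in a few lines of Python. This is a sketch of the policy
described in this section, not the actual kernel or e2fsck code; the
function and parameter names are invented, and each "known" argument is
the bit mask of flags that the running code understands.

```python
def kernel_mount_mode(incompat, ro_compat, known_incompat, known_ro_compat):
    """How a kernel may mount the filesystem: "rw", "ro" or None (refuse)."""
    if incompat & ~known_incompat:
        return None     # unknown INCOMPAT feature: do not mount at all
    if ro_compat & ~known_ro_compat:
        return "ro"     # unknown RO_COMPAT feature: only reading is safe
    return "rw"         # unknown COMPAT features can be ignored

def e2fsck_may_check(compat, incompat, ro_compat,
                     known_compat, known_incompat, known_ro_compat):
    """e2fsck is stricter: an unknown flag in any field means refusal."""
    unknown = ((compat & ~known_compat)
               | (incompat & ~known_incompat)
               | (ro_compat & ~known_ro_compat))
    return unknown == 0
```

Note that the COMPAT field does not appear in the kernel function at
all, while for e2fsck it is treated exactly like the other two fields.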

Metadata
--------

It is frequently claimed that the ext2 implementation of writing
asynchronous metadata is faster than the ffs synchronous metadata
scheme but less reliable. Both methods are equally resolvable by their
respective fsck programs.

If you're exceptionally paranoid, there are 3 ways of making metadata
writes synchronous on ext2:

per-file if you have the program source: use the O_SYNC flag to open()
per-file if you don't have the source: use "chattr +S" on the file
per-filesystem: add the "sync" option to mount (or in /etc/fstab)

The first and last are not ext2 specific but do force the metadata to
be written synchronously. See also Journaling below.

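The first of the three methods could look like this in a Python program.
The sketch assumes a Unix-like system where os.O_SYNC is available, and
the helper name is illustrative.

```python
import os

def write_synchronously(path, data):
    """Write data to path through a file descriptor opened with O_SYNC."""
    # With O_SYNC each write() returns only once the data, and the metadata
    # needed to retrieve it, have been written out by the filesystem.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)   # may write fewer bytes than asked
            view = view[written:]
    finally:
        os.close(fd)
```

Expect such writes to be considerably slower than ordinary buffered
writes, which is the price paid for the stronger guarantee.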

Limitations
-----------

There are various limits imposed by the on-disk layout of ext2. Other
limits are imposed by the current implementation of the kernel code.
Many of the limits are determined at the time the filesystem is first
created, and depend upon the block size chosen. The ratio of inodes to
data blocks is fixed at filesystem creation time, so the only way to
increase the number of inodes is to increase the size of the filesystem.
No tools currently exist which can change the ratio of inodes to blocks.

Most of these limits could be overcome with slight changes in the on-disk
format and using a compatibility flag to signal the format change (at
the expense of some compatibility).

Filesystem block size:        1kB        2kB        4kB        8kB

File size limit:             16GB      256GB     2048GB     2048GB
Filesystem size limit:     2047GB     8192GB    16384GB    32768GB

There is a 2.4 kernel limit of 2048GB for a single block device, so no
filesystem larger than that can be created at this time. There is also
an upper limit on the block size imposed by the page size of the kernel,
so 8kB blocks are only allowed on Alpha systems (and other architectures
which support larger pages).

There is an upper limit of 32768 subdirectories in a single directory.

There is a "soft" upper limit of about 10-15k files in a single directory
with the current linear linked-list directory implementation. This limit
stems from performance problems when creating and deleting (and also
finding) files in such large directories. Using a hashed directory index
(under development) allows 100k-1M+ files in a single directory without
performance problems (although RAM size becomes an issue at this point).

The (meaningless) absolute upper limit of files in a single directory
(imposed by the file size, the realistic limit is obviously much less)
is over 130 trillion files. It would be higher except there are not
enough 4-character names to make up unique directory entries, so they
have to be 8-character filenames; even then we are fairly close to
running out of unique filenames.

Journaling
----------

A journaling extension to the ext2 code has been developed by Stephen
Tweedie. It avoids the risks of metadata corruption and the need to
wait for e2fsck to complete after a crash, without requiring a change
to the on-disk ext2 layout. In a nutshell, the journal is a regular
file which stores whole metadata (and optionally data) blocks that have
been modified, prior to writing them into the filesystem. This means
it is possible to add a journal to an existing ext2 filesystem without
the need for data conversion.

When changes are made to the filesystem (e.g. a file is renamed), they
are stored in a transaction in the journal and can either be complete or
incomplete at the time of a crash. If a transaction is complete at the
time of a crash (or in the normal case where the system does not crash),
then any blocks in that transaction are guaranteed to represent a valid
filesystem state, and are copied into the filesystem. If a transaction
is incomplete at the time of the crash, then there is no guarantee of
consistency for the blocks in that transaction so they are discarded
(which means any filesystem changes they represent are also lost).

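The recovery rule can be illustrated with a toy model in Python. This is
not the ext3 journal format or its recovery code: the data structures
and names below are invented purely to show the complete/incomplete
distinction described above.

```python
def replay_journal(disk, transactions):
    """Apply complete transactions to disk (a dict: block number -> bytes).

    Each transaction is a dict with "blocks" (block number -> new contents)
    and "complete" (True once the whole transaction reached the journal).
    Returns the number of transactions that were discarded.
    """
    discarded = 0
    for txn in transactions:
        if txn["complete"]:
            # Blocks of a complete transaction describe a valid state.
            disk.update(txn["blocks"])
        else:
            # No consistency guarantee: drop it, losing the changes it held.
            discarded += 1
    return discarded
```

After replay the disk reflects every complete transaction and none of
the incomplete ones, so it is consistent without a full e2fsck run.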

The ext3 code is currently (Apr 2001) available for 2.2 kernels only,
and not yet available for 2.4 kernels.

References
==========

The kernel source       file:/usr/src/linux/fs/ext2/
e2fsprogs (e2fsck)      http://e2fsprogs.sourceforge.net/
Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html
Journaling (ext3)       ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/
Hashed Directories      http://kernelnewbies.org/~phillips/htree/
Filesystem Resizing     http://ext2resize.sourceforge.net/
Extended Attributes &
Access Control Lists    http://acl.bestbits.at/
Compression (*)         http://www.netspace.net.au/~reiter/e2compr/

Implementations for:
Windows 95/98/NT/2000   http://uranus.it.swin.edu.au/~jn/linux/Explore2fs.htm
Windows 95 (*)          http://www.yipton.demon.co.uk/content.html#FSDEXT2
DOS client (*)          ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
OS/2                    http://perso.wanadoo.fr/matthieu.willm/ext2-os2/
RISC OS client          ftp://ftp.barnet.ac.uk/pub/acorn/armlinux/iscafs/

(*) no longer actively developed/supported (as of Apr 2001)