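The Linux Kernel: The Virtual Filesystem

I. Overview

The Linux kernel supports many filesystems. Like other Unix variants, it does so through a virtual filesystem (VFS) layer. The first virtual filesystem appeared in 1986 in SunOS from Sun Microsystems; since then, most Unix kernels have included a VFS. The Linux VFS supports the widest range of filesystems.

II. The Concept of the Virtual Filesystem

The virtual filesystem (VFS) is the kernel software layer that handles all system calls related to a standard Unix filesystem. Through this layer the Linux kernel hides the differences between the underlying filesystems and presents a uniform interface to the layers above. The filesystems supported by the VFS fall into three main classes:

Disk-based filesystems:

Filesystems whose data is stored on a local disk or on another device that emulates a disk (such as USB flash). Examples are the Linux filesystems Ext2 and Ext3; filesystems of other Unix variants such as sysv, UFS, and VxFS; Microsoft filesystems such as VFAT and NTFS; CD-ROM filesystems such as UDF; and other proprietary filesystems such as IBM's HPFS and Apple's HFS.

Network filesystems:

Filesystems that give easy access to files residing on other hosts on the network, such as NFS, Coda, AFS, CIFS, and NCP.

Special filesystems:

Filesystems that do not manage disk space, either local or remote. /proc is a typical special filesystem.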

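III. The Common File Model and VFS Data Structures

Linux uses the file model provided by traditional Unix filesystems; every other filesystem must translate its physical organization into the VFS common file model.

The common file model can be described in object-oriented terms, where an object consists of a data structure plus the methods that operate on it. For efficiency reasons the Linux kernel is not written in an object-oriented language such as C++. Instead, in C an object is implemented as a data structure with some fields that point to functions, and these function pointers play the role of the object's methods. The common file model consists of the following object types: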

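The superblock object:

Stores information about a mounted filesystem. For disk-based filesystems, this object usually corresponds to a filesystem control block stored on disk.

The superblock object is defined in the kernel as follows: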

struct super_block {
	struct list_head	s_list;		/* Keep this first */
	dev_t			s_dev;		/* search index; _not_ kdev_t */
	unsigned char		s_blocksize_bits;
	unsigned long		s_blocksize;
	loff_t			s_maxbytes;	/* Max file size */
	struct file_system_type	*s_type;
	const struct super_operations	*s_op;
	const struct dquot_operations	*dq_op;
	const struct quotactl_ops	*s_qcop;
	const struct export_operations *s_export_op;
	unsigned long		s_flags;
	unsigned long		s_iflags;	/* internal SB_I_* flags */
	unsigned long		s_magic;
	struct dentry		*s_root;
	struct rw_semaphore	s_umount;
	int			s_count;
	atomic_t		s_active;
#ifdef CONFIG_SECURITY
	void                    *s_security;
#endif
	const struct xattr_handler **s_xattr;
#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
	const struct fscrypt_operations	*s_cop;
#endif
	struct hlist_bl_head	s_roots;	/* alternate root dentries for NFS */
	struct list_head	s_mounts;	/* list of mounts; _not_ for fs use */
	struct block_device	*s_bdev;
	struct backing_dev_info *s_bdi;
	struct mtd_info		*s_mtd;
	struct hlist_node	s_instances;
	unsigned int		s_quota_types;	/* Bitmask of supported quota types */
	struct quota_info	s_dquot;	/* Diskquota specific options */

	struct sb_writers	s_writers;

	/*
	 * Keep s_fs_info, s_time_gran, s_fsnotify_mask, and
	 * s_fsnotify_marks together for cache efficiency. They are frequently
	 * accessed and rarely modified.
	 */
	void			*s_fs_info;	/* Filesystem private info */

	/* Granularity of c/m/atime in ns (cannot be worse than a second) */
	u32			s_time_gran;
#ifdef CONFIG_FSNOTIFY
	__u32			s_fsnotify_mask;
	struct fsnotify_mark_connector __rcu	*s_fsnotify_marks;
#endif

	char			s_id[32];	/* Informational name */
	uuid_t			s_uuid;		/* UUID */

	unsigned int		s_max_links;
	fmode_t			s_mode;

	/*
	 * The next field is for VFS *only*. No filesystems have any business
	 * even looking at it. You had been warned.
	 */
	struct mutex s_vfs_rename_mutex;	/* Kludge */

	/*
	 * Filesystem subtype.  If non-empty the filesystem type field
	 * in /proc/mounts will be "type.subtype"
	 */
	char *s_subtype;

	const struct dentry_operations *s_d_op; /* default d_op for dentries */

	/*
	 * Saved pool identifier for cleancache (-1 means none)
	 */
	int cleancache_poolid;

	struct shrinker s_shrink;	/* per-sb shrinker handle */

	/* Number of inodes with nlink == 0 but still referenced */
	atomic_long_t s_remove_count;

	/* Pending fsnotify inode refs */
	atomic_long_t s_fsnotify_inode_refs;

	/* Being remounted read-only */
	int s_readonly_remount;

	/* AIO completions deferred from interrupt context */
	struct workqueue_struct *s_dio_done_wq;
	struct hlist_head s_pins;

	/*
	 * Owning user namespace and default context in which to
	 * interpret filesystem uids, gids, quotas, device nodes,
	 * xattrs and security labels.
	 */
	struct user_namespace *s_user_ns;

	/*
	 * The list_lru structure is essentially just a pointer to a table
	 * of per-node lru lists, each of which has its own spinlock.
	 * There is no need to put them into separate cachelines.
	 */
	struct list_lru		s_dentry_lru;
	struct list_lru		s_inode_lru;
	struct rcu_head		rcu;
	struct work_struct	destroy_work;

	struct mutex		s_sync_lock;	/* sync serialisation lock */

	/*
	 * Indicates how deep in a filesystem stack this SB is
	 */
	int s_stack_depth;

	/* s_inode_list_lock protects s_inodes */
	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
	struct list_head	s_inodes;	/* all inodes */

	spinlock_t		s_inode_wblist_lock;
	struct list_head	s_inodes_wb;	/* writeback inodes */
} __randomize_layout;

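The inode object:

Stores general information about a specific file. For disk-based filesystems, this object usually corresponds to a file control block stored on disk. Each inode has an inode number, which uniquely identifies the file within the filesystem.

All the information the filesystem needs to handle a file is contained in the inode structure. A filename is merely a label that can be changed, whereas the inode is unique to the file and remains so for the file's whole lifetime. The structure is defined in the kernel as follows: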

/*
 * Keep mostly read-only and often accessed (especially for
 * the RCU path lookup and 'stat' data) fields at the beginning
 * of the 'struct inode'
 */
struct inode {
	umode_t			i_mode;
	unsigned short		i_opflags;
	kuid_t			i_uid;
	kgid_t			i_gid;
	unsigned int		i_flags;

#ifdef CONFIG_FS_POSIX_ACL
	struct posix_acl	*i_acl;
	struct posix_acl	*i_default_acl;
#endif

	const struct inode_operations	*i_op;
	struct super_block	*i_sb;
	struct address_space	*i_mapping;

#ifdef CONFIG_SECURITY
	void			*i_security;
#endif

	/* Stat data, not accessed from path walking */
	unsigned long		i_ino;
	/*
	 * Filesystems may only read i_nlink directly.  They shall use the
	 * following functions for modification:
	 *
	 *    (set|clear|inc|drop)_nlink
	 *    inode_(inc|dec)_link_count
	 */
	union {
		const unsigned int i_nlink;
		unsigned int __i_nlink;
	};
	dev_t			i_rdev;
	loff_t			i_size;
	struct timespec64	i_atime;
	struct timespec64	i_mtime;
	struct timespec64	i_ctime;
	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
	unsigned short          i_bytes;
	u8			i_blkbits;
	u8			i_write_hint;
	blkcnt_t		i_blocks;

#ifdef __NEED_I_SIZE_ORDERED
	seqcount_t		i_size_seqcount;
#endif

	/* Misc */
	unsigned long		i_state;
	struct rw_semaphore	i_rwsem;

	unsigned long		dirtied_when;	/* jiffies of first dirtying */
	unsigned long		dirtied_time_when;

	struct hlist_node	i_hash;
	struct list_head	i_io_list;	/* backing dev IO list */
#ifdef CONFIG_CGROUP_WRITEBACK
	struct bdi_writeback	*i_wb;		/* the associated cgroup wb */

	/* foreign inode detection, see wbc_detach_inode() */
	int			i_wb_frn_winner;
	u16			i_wb_frn_avg_time;
	u16			i_wb_frn_history;
#endif
	struct list_head	i_lru;		/* inode LRU list */
	struct list_head	i_sb_list;
	struct list_head	i_wb_list;	/* backing dev writeback list */
	union {
		struct hlist_head	i_dentry;
		struct rcu_head		i_rcu;
	};
	atomic64_t		i_version;
	atomic_t		i_count;
	atomic_t		i_dio_count;
	atomic_t		i_writecount;
#ifdef CONFIG_IMA
	atomic_t		i_readcount; /* struct files open RO */
#endif
	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
	struct file_lock_context	*i_flctx;
	struct address_space	i_data;
	struct list_head	i_devices;
	union {
		struct pipe_inode_info	*i_pipe;
		struct block_device	*i_bdev;
		struct cdev		*i_cdev;
		char			*i_link;
		unsigned		i_dir_seq;
	};

	__u32			i_generation;

#ifdef CONFIG_FSNOTIFY
	__u32			i_fsnotify_mask; /* all events this inode cares about */
	struct fsnotify_mark_connector __rcu	*i_fsnotify_marks;
#endif

#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
	struct fscrypt_info	*i_crypt_info;
#endif

	void			*i_private; /* fs or device private pointer */
} __randomize_layout;

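The file object:

Stores information about the interaction between an open file and a process. This information exists in kernel memory only while a process has the file open. The file object is defined in the kernel as struct file: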

struct file {
	union {
		struct llist_node	fu_llist;
		struct rcu_head 	fu_rcuhead;
	} f_u;
	struct path		f_path;
	struct inode		*f_inode;	/* cached value */
	const struct file_operations	*f_op;

	/*
	 * Protects f_ep_links, f_flags.
	 * Must not be taken from IRQ context.
	 */
	spinlock_t		f_lock;
	enum rw_hint		f_write_hint;
	atomic_long_t		f_count;
	unsigned int 		f_flags;
	fmode_t			f_mode;
	struct mutex		f_pos_lock;
	loff_t			f_pos;
	struct fown_struct	f_owner;
	const struct cred	*f_cred;
	struct file_ra_state	f_ra;

	u64			f_version;
#ifdef CONFIG_SECURITY
	void			*f_security;
#endif
	/* needed for tty driver, and maybe others */
	void			*private_data;

#ifdef CONFIG_EPOLL
	/* Used by fs/eventpoll.c to link all the hooks to this file */
	struct list_head	f_ep_links;
	struct list_head	f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */
	struct address_space	*f_mapping;
	errseq_t		f_wb_err;
} __randomize_layout
  __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */

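The dentry object:

Stores information about the linking of a directory entry (a particular name of a file) with the corresponding file. Each disk-based filesystem stores this information on disk in its own way.

The most recently used dentry objects are kept in a disk cache called the dentry cache, which speeds up the translation from a file pathname to the inode of the last pathname component. Generally speaking, a disk cache is a software mechanism that lets the kernel keep in RAM some information that is normally stored on disk, so that later accesses to it are faster. A disk cache differs from a hardware cache and from a memory cache: a hardware cache is fast static RAM that speeds up requests directed to the slower dynamic RAM, while a memory cache is a software mechanism that takes load off the kernel memory allocator.

Besides the dentry cache and the inode cache, Linux uses other disk caches; the most important one is the page cache.

The dentry object is defined in the kernel as follows: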

struct dentry {
	/* RCU lookup touched fields */
	unsigned int d_flags;		/* protected by d_lock */
	seqcount_t d_seq;		/* per dentry seqlock */
	struct hlist_bl_node d_hash;	/* lookup hash list */
	struct dentry *d_parent;	/* parent directory */
	struct qstr d_name;
	struct inode *d_inode;		/* Where the name belongs to - NULL is
					 * negative */
	unsigned char d_iname[DNAME_INLINE_LEN];	/* small names */

	/* Ref lookup also touches following */
	struct lockref d_lockref;	/* per-dentry lock and refcount */
	const struct dentry_operations *d_op;
	struct super_block *d_sb;	/* The root of the dentry tree */
	unsigned long d_time;		/* used by d_revalidate */
	void *d_fsdata;			/* fs-specific data */

	union {
		struct list_head d_lru;		/* LRU list */
		wait_queue_head_t *d_wait;	/* in-lookup ones only */
	};
	struct list_head d_child;	/* child of parent list */
	struct list_head d_subdirs;	/* our children */
	/*
	 * d_alias and d_rcu can share memory
	 */
	union {
		struct hlist_node d_alias;	/* inode alias list */
		struct hlist_bl_node d_in_lookup_hash;	/* only for in-lookup ones */
	 	struct rcu_head d_rcu;
	} d_u;
} __randomize_layout;

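IV. System Calls Handled by the VFS

The VFS mainly handles the filesystem-related system calls (not listed one by one here). It sits between application programs and the specific filesystems: applications use the uniform interface provided by the VFS to reach the methods of a specific filesystem and operate on it. In some cases, however, the VFS can complete a file operation by itself without invoking a lower-level procedure. For instance, when a process closes an open file, the file on disk usually does not need to be touched, so the VFS simply releases the corresponding file object. Similarly, when the lseek() system call modifies a file pointer, the file pointer is only an attribute of the interaction between an open file and a process, so the VFS just modifies the file object without accessing the file on disk, and no filesystem-specific procedure is invoked. In a sense, the VFS can be regarded as a generic filesystem that relies on specific filesystems when necessary.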

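V. Files Associated with a Process

Every process has its own current working directory and its own root directory. The process descriptor contains a pointer fs of type struct fs_struct, which describes the working directory, the root directory, and related information, and a pointer files of type struct files_struct, which describes the files currently opened by the process. struct files_struct is defined in the kernel as follows: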

/*
 * Open file table structure
 */
struct files_struct {
  /*
   * read mostly part
   */
	atomic_t count;
	bool resize_in_progress;
	wait_queue_head_t resize_wait;

	struct fdtable __rcu *fdt;
	struct fdtable fdtab;
  /*
   * written part on a separate cache line in SMP
   */
	spinlock_t file_lock ____cacheline_aligned_in_smp;
	unsigned int next_fd;
	unsigned long close_on_exec_init[1];
	unsigned long open_fds_init[1];
	unsigned long full_fds_bits_init[1];
	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
};

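The fd_array field of this structure holds pointers to the file objects opened by the process. The array has NR_OPEN_DEFAULT entries, which is BITS_PER_LONG (32 on 32-bit systems, 64 on 64-bit systems); if the process opens more files than that, the kernel allocates a new, sufficiently large array. Usually the first element of the array (index 0) is the standard input of the process, the second is the standard output, and the third is the standard error. Two elements of the array may point to the same open file, for example when the user redirects standard error to standard output with a construct like 2>&1.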

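VI. Filesystem Type Registration

The VFS keeps track of all filesystems through filesystem type registration. Each registered filesystem type is represented by a file_system_type object, which is defined in the kernel as follows: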

struct file_system_type {
	const char *name;
	int fs_flags;
#define FS_REQUIRES_DEV		1 
#define FS_BINARY_MOUNTDATA	2
#define FS_HAS_SUBTYPE		4
#define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
#define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
	struct dentry *(*mount) (struct file_system_type *, int,
		       const char *, void *);
	void (*kill_sb) (struct super_block *);
	struct module *owner;
	struct file_system_type * next;
	struct hlist_head fs_supers;

	struct lock_class_key s_lock_key;
	struct lock_class_key s_umount_key;
	struct lock_class_key s_vfs_rename_key;
	struct lock_class_key s_writers_key[SB_FREEZE_LEVELS];

	struct lock_class_key i_lock_key;
	struct lock_class_key i_mutex_key;
	struct lock_class_key i_mutex_dir_key;
};

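The mount function pointer is called when a filesystem of this type is mounted; the kill_sb function pointer is called to destroy the superblock.

A filesystem type is registered with register_filesystem(), which takes a pointer to a file_system_type structure. It is unregistered with unregister_filesystem(), which takes the same argument. get_fs_type() looks up a file_system_type by filesystem name.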

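VII. Mounting a Filesystem

Unlike most Unix-like kernels, Linux allows a filesystem to be mounted several times. When a filesystem is mounted n times, its root directory can be reached through n mount points. Although the same filesystem can then be accessed through different paths, it is still a single filesystem: there is exactly one superblock for it, no matter how many times it is mounted. The mount point of a filesystem may itself be a directory of another filesystem that is in turn mounted on yet another filesystem.

Linux also allows several filesystems to be mounted on the same mount point. Each newly mounted filesystem hides the directory tree of the previously mounted one, even if some files of the earlier filesystem are in use by running processes (those processes can keep using the hidden files normally). When the topmost filesystem on a mount point is unmounted, the filesystem mounted before it (if any) becomes visible again.

The Linux kernel uses the vfsmount structure to describe a mounted filesystem: its root, its superblock, and its mount flags. In kernels where vfsmount has this short form, it is embedded in a larger struct mount, which records the mount point and the relationships with other mounted filesystems. The vfsmount structure is defined in the kernel source as follows: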

struct vfsmount {
	struct dentry *mnt_root;	/* root of the mounted tree */
	struct super_block *mnt_sb;	/* pointer to superblock */
	int mnt_flags;
};

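mnt_root: pointer to the dentry of the root directory of the mounted filesystem.

mnt_sb: pointer to the superblock of the mounted filesystem.

The mount procedure is as follows:

A filesystem is first mounted through the mount() system call, which is declared in the kernel as follows: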

SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name, char __user *, type, unsigned long, flags, void __user *, data)

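The parameters are, in order:

the pathname of the device file containing the filesystem to mount, or NULL if there is no device or none is required;

the pathname of the directory on which the filesystem will be mounted;

a string naming the filesystem type, which must be the name of a registered filesystem type;

the mount flags;

a pointer to filesystem-specific data, which may be NULL.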

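In the service routine of the mount system call, the arguments are first copied into temporary kernel buffers, then do_mount() is called to perform the mount. When do_mount() returns, the temporary buffers are released. do_mount() mounts the filesystem through the following steps:

1. Check the flags and convert some of the user-supplied flags into the flags used internally by the mount code.

2. Look up the mount-point directory with user_path().

3. Examine the mount flags to decide which mount action to perform. Depending on the options, one of the following is chosen:

do_remount: remount a filesystem that is already mounted

do_loopback: bind an already mounted filesystem to another directory as well

do_change_type: change the propagation type of a mounted filesystem

do_move_mount: move a mounted filesystem to another mount point

do_new_mount: mount a new filesystem

The usual case is mounting a new filesystem. do_new_mount() first calls get_fs_type() to obtain the registered file_system_type, then calls vfs_kern_mount() to perform the mount; vfs_kern_mount() returns the address of the mounted filesystem descriptor. Next it calls do_add_mount(), which rejects disallowed cases such as mounting the same filesystem again on the same directory or using a symbolic link as the mount point, and finally calls graft_tree() to insert the newly mounted filesystem into the relevant lists and hash table (the namespace, the parent filesystem, the hash table, and so on).

4. Release the mount-point path obtained by the lookup.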

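The vfs_kern_mount() function

This is the core function of filesystem mounting. Its main steps are:

1. Allocate a mounted-filesystem descriptor with alloc_vfsmnt() and store it in the local variable mnt.

2. Call mount_fs() to initialize the superblock of the filesystem. mount_fs() does this mainly through the mount method of the registered filesystem type, and returns a pointer to the dentry of the root directory.

3. Initialize the vfsmount structure embedded in mnt from the root dentry of the newly mounted filesystem and the superblock of that dentry.

4. Link mnt into the s_mounts list of the superblock of the root dentry.

5. Return a pointer to the vfsmount structure embedded in mnt.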

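VIII. Mounting the Root Filesystem

Mounting the root filesystem is a crucial step of system initialization. It is a fairly complex procedure, because the Linux kernel allows the root filesystem to be stored in many different places, such as a hard disk partition, a floppy disk, an NFS server, or a ramdisk. A hard disk is used as the example here. When the system boots, the kernel finds the major number of the disk that contains the root filesystem in the ROOT_DEV variable. The root filesystem can be specified as a device file in the /dev directory, either when the kernel is compiled or by passing the root option through the boot loader. Likewise, the mount flags of the root filesystem are stored in the root_mountflags variable. The user sets these flags either with the rdev program on a compiled kernel image or by passing them through the boot loader.

The Linux kernel mounts the root filesystem in two stages:

1. Mount a special root filesystem that only provides an empty directory serving as the initial mount point.

2. Mount the real root filesystem over the empty directory mounted in the first stage.

The first stage takes place in the kernel function mnt_init(), which calls init_rootfs() to register the rootfs filesystem type and then calls init_mount_tree(), where vfs_kern_mount() mounts the rootfs type just registered.

In the second stage, prepare_namespace() first sets the root_device_name variable, which names the root filesystem device and is taken from the "root" kernel boot parameter. It also sets ROOT_DEV (the major and minor numbers of the root filesystem device). It then calls mount_root(), which handles the different filesystem types and device types (determined from ROOT_DEV) differently; the common case is a block device. mount_root() first calls create_dev() to create the device node and then calls mount_block_root() to mount it. mount_block_root() calls do_mount_root(), which finally calls sys_mount() to mount the filesystem on the given device and sys_chdir() to change the current directory to the root directory.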