Linux Kernel File System [2] Code Analysis - generic_file_read_iter()
To look at the page-cache-related code first, we start the analysis with generic_file_read_iter().
mm/filemap.c
First, let's look at the comment above this function.
/**
* generic_file_read_iter - generic filesystem read routine
* @iocb: kernel I/O control block
* @iter: destination for the data read
*
* This is the "read_iter()" routine for all filesystems
* that can use the page cache directly.
*
* The IOCB_NOWAIT flag in iocb->ki_flags indicates that -EAGAIN shall
* be returned when no data can be read without waiting for I/O requests
* to complete; it doesn't prevent readahead.
--> So the IOCB_NOWAIT flag does not prevent prefetching (readahead).
--> It means -EAGAIN is returned "without waiting for I/O requests to complete".
--> In other words, it apparently kicks off the readahead (prefetch) and then returns.
*
* The IOCB_NOIO flag in iocb->ki_flags indicates that no new I/O
* requests shall be made for the read or for readahead. When no data
* can be read, -EAGAIN shall be returned. When readahead would be
* triggered, a partial, possibly empty read shall be returned.
--> The IOCB_NOIO flag means no new I/O requests will be made for the read or for readahead.
--> When no data can be read, -EAGAIN is returned.
--> Where readahead would have been triggered, a partial (possibly empty) read is returned instead.
*
* Return:
* * number of bytes copied, even for partial reads
* * negative error code (or 0 if IOCB_NOIO) if nothing was read
*/
ssize_t generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
{
size_t count = iov_iter_count(iter);
ssize_t retval = 0;
if (!count)
return 0; /* skip atime */
if (iocb->ki_flags & IOCB_DIRECT) {
--> If we got here from ext4, an IOCB_DIRECT read would have been handled by ext4_dio_read_iter() instead.
--> So this branch is not taken in our case.
...
}
return filemap_read(iocb, iter, retval);
}
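Before moving on, a quick aside on how IOCB_NOWAIT reaches this function: as far as I know it gets set when userspace passes RWF_NOWAIT to preadv2(). A minimal userspace sketch (my own example, not kernel code) to observe the -EAGAIN behavior described in the comment above:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd = open("/etc/hostname", O_RDONLY);   /* any regular file */
	ssize_t n;

	if (fd < 0)
		return 1;

	/* RWF_NOWAIT: do not block on storage I/O; fail with EAGAIN instead. */
	n = preadv2(fd, &iov, 1, 0, RWF_NOWAIT);
	if (n < 0 && errno == EAGAIN)
		printf("data not in the page cache yet, I/O would be needed\n");
	else
		printf("read %zd bytes straight from the page cache\n", n);

	close(fd);
	return 0;
}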
First, what is a folio? The definition of a folio given in include/linux/mm_types.h is as follows:
A folio is a physically, virtually and logically contiguous set of
bytes. It is a power-of-two in size, and it is aligned to that same
power-of-two.
A folio can be thought of as "a contiguous block of physical memory with a single memory ownership": one folio may consist of several pages, but it is the data structure that lets those pages be treated and managed as a single unit.
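To make that concrete, here is a rough kernel-side sketch (my own illustration; it assumes some code that already holds a folio) showing how a folio's order, size and constituent pages relate:

#include <linux/mm.h>
#include <linux/printk.h>

static void folio_layout_example(struct folio *folio)
{
	unsigned int order = folio_order(folio);  /* the folio spans 2^order pages */
	long nr = folio_nr_pages(folio);          /* == 1L << order */
	size_t size = folio_size(folio);          /* == PAGE_SIZE << order */

	/* The pages are physically contiguous; folio_page(folio, i) returns
	 * the i-th one, with i == 0 being the head page. */
	struct page *first = folio_page(folio, 0);
	struct page *last = folio_page(folio, nr - 1);

	pr_info("folio: order=%u, pages=%ld, bytes=%zu, span=[%p..%p]\n",
		order, nr, size, first, last);
}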
The main function called from filemap_read():
filemap_get_pages()
/**
* filemap_read - Read data from the page cache.
--> Reads data from the page cache.
* @iocb: The iocb to read.
* @iter: Destination for the data.
* @already_read: Number of bytes already read by the caller.
*
* Copies data from the page cache. If the data is not currently present,
* uses the readahead and read_folio address_space operations to fetch it.
--> Tries to read from the page cache first; if the data is not in the page cache, it is fetched from storage.
*
* Return: Total number of bytes copied, including those already read by
* the caller. If an error happens before any bytes are copied, returns
* a negative error number.
*/
ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
ssize_t already_read)
{
struct file *filp = iocb->ki_filp;
struct file_ra_state *ra = &filp->f_ra;
struct address_space *mapping = filp->f_mapping;
--> filp->f_mapping points to the page-cache management object associated with this file (struct file). struct address_space is the core data structure for managing the page cache; it tracks all of the file's cached pages. Through mapping, the pages we need can be looked up in the page cache.
struct inode *inode = mapping->host;
struct folio_batch fbatch;
int i, error = 0;
bool writably_mapped;
loff_t isize, end_offset;
loff_t last_pos = ra->prev_pos;
if (unlikely(iocb->ki_pos >= inode->i_sb->s_maxbytes))
return 0;
if (unlikely(!iov_iter_count(iter)))
return 0;
iov_iter_truncate(iter, inode->i_sb->s_maxbytes - iocb->ki_pos);
folio_batch_init(&fbatch);
do {
cond_resched();
/*
* If we've already successfully copied some data, then we
* can no longer safely return -EIOCBQUEUED. Hence mark
* an async read NOWAIT at that point.
*/
if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
iocb->ki_flags |= IOCB_NOWAIT;
if (unlikely(iocb->ki_pos >= i_size_read(inode)))
break;
error = filemap_get_pages(iocb, iter->count, &fbatch, false);
--> This is the key function.
if (error < 0)
break;
/*
* i_size must be checked after we know the pages are Uptodate.
*
* Checking i_size after the check allows us to calculate
* the correct value for "nr", which means the zero-filled
* part of the page is not copied back to userspace (unless
* another truncate extends the file - this is desired though).
*/
isize = i_size_read(inode);
if (unlikely(iocb->ki_pos >= isize))
goto put_folios;
end_offset = min_t(loff_t, isize, iocb->ki_pos + iter->count);
/*
* Once we start copying data, we don't want to be touching any
* cachelines that might be contended:
*/
writably_mapped = mapping_writably_mapped(mapping);
/*
* When a read accesses the same folio several times, only
* mark it as accessed the first time.
*/
if (!pos_same_folio(iocb->ki_pos, last_pos - 1,
fbatch.folios[0]))
folio_mark_accessed(fbatch.folios[0]);
for (i = 0; i < folio_batch_count(&fbatch); i++) {
struct folio *folio = fbatch.folios[i];
size_t fsize = folio_size(folio);
size_t offset = iocb->ki_pos & (fsize - 1);
size_t bytes = min_t(loff_t, end_offset - iocb->ki_pos,
fsize - offset);
size_t copied;
if (end_offset < folio_pos(folio))
break;
if (i > 0)
folio_mark_accessed(folio);
/*
* If users can be writing to this folio using arbitrary
* virtual addresses, take care of potential aliasing
* before reading the folio on the kernel side.
*/
if (writably_mapped)
flush_dcache_folio(folio);
copied = copy_folio_to_iter(folio, offset, bytes, iter);
--> Copies the data in the page cache to user space (the user buffer).
already_read += copied;
iocb->ki_pos += copied;
last_pos = iocb->ki_pos;
if (copied < bytes) {
error = -EFAULT;
break;
}
}
put_folios:
for (i = 0; i < folio_batch_count(&fbatch); i++)
folio_put(fbatch.folios[i]);
folio_batch_init(&fbatch);
} while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);
file_accessed(filp);
ra->prev_pos = last_pos;
return already_read ? already_read : error;
}
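A quick note on the per-folio arithmetic in the copy loop above: because a folio's size is a power of two and the folio is aligned to that size in the file, "iocb->ki_pos & (fsize - 1)" yields the byte offset inside the folio, and "bytes" is clamped by both the remaining read range and the end of the folio. A small standalone sketch with made-up numbers (my own, just to illustrate):

#include <stdio.h>

int main(void)
{
	unsigned long long fsize = 16384;       /* e.g. an order-2 folio: 4 x 4 KiB pages */
	unsigned long long ki_pos = 20580;      /* read position somewhere inside that folio */
	unsigned long long end_offset = 40960;  /* min(isize, ki_pos + count) */

	unsigned long long offset = ki_pos & (fsize - 1);   /* offset within the folio */
	unsigned long long left_in_range = end_offset - ki_pos;
	unsigned long long left_in_folio = fsize - offset;
	unsigned long long bytes = left_in_range < left_in_folio ?
				   left_in_range : left_in_folio;

	/* Prints: offset in folio = 4196, bytes to copy = 12188 */
	printf("offset in folio = %llu, bytes to copy = %llu\n", offset, bytes);
	return 0;
}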
The two main functions called from filemap_get_pages() are the following:
- page_cache_sync_readahead()
- filemap_readahead()
static int filemap_get_pages(struct kiocb *iocb, size_t count,
struct folio_batch *fbatch, bool need_uptodate)
{
struct file *filp = iocb->ki_filp;
struct address_space *mapping = filp->f_mapping;
struct file_ra_state *ra = &filp->f_ra;
pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;
pgoff_t last_index;
struct folio *folio;
unsigned int flags;
int err = 0;
/* "last_index" is the index of the page beyond the end of the read */
last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
retry:
if (fatal_signal_pending(current))
return -EINTR;
filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
--> Internally uses xas_load() to fetch the folios from the page cache (if they exist there); a simplified lookup sketch follows after this function.
if (!folio_batch_count(fbatch)) {
if (iocb->ki_flags & IOCB_NOIO)
return -EAGAIN;
if (iocb->ki_flags & IOCB_NOWAIT)
flags = memalloc_noio_save();
page_cache_sync_readahead(mapping, ra, filp, index,
last_index - index);
if (iocb->ki_flags & IOCB_NOWAIT)
memalloc_noio_restore(flags);
filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
}
if (!folio_batch_count(fbatch)) {
if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
return -EAGAIN;
err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
if (err == AOP_TRUNCATED_PAGE)
goto retry;
return err;
}
folio = fbatch->folios[folio_batch_count(fbatch) - 1];
if (folio_test_readahead(folio)) {
err = filemap_readahead(iocb, filp, mapping, folio, last_index);
if (err)
goto err;
}
if (!folio_test_uptodate(folio)) {
if ((iocb->ki_flags & IOCB_WAITQ) &&
folio_batch_count(fbatch) > 1)
iocb->ki_flags |= IOCB_NOWAIT;
err = filemap_update_page(iocb, mapping, count, folio,
need_uptodate);
if (err)
goto err;
}
trace_mm_filemap_get_pages(mapping, index, last_index - 1);
return 0;
err:
if (err < 0)
folio_put(folio);
if (likely(--fbatch->nr))
return 0;
if (err == AOP_TRUNCATED_PAGE)
goto retry;
return err;
}
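As mentioned at the filemap_get_read_batch() call above, the lookup boils down to reading the file's page-cache XArray. A simplified sketch (my own; the real filemap_get_read_batch() walks a whole range with xas_load()/xas_next() and takes folio references, which this ignores):

#include <linux/pagemap.h>
#include <linux/xarray.h>

/* Peek at whatever is cached at page offset "index" of the file. */
static struct folio *peek_cached_folio(struct address_space *mapping,
				       pgoff_t index)
{
	void *entry = xa_load(&mapping->i_pages, index);

	/* The XArray can also hold shadow entries left by reclaim; those are
	 * "value" entries, not folios, so filter them out. */
	if (!entry || xa_is_value(entry))
		return NULL;

	return entry;   /* no reference taken - illustration only */
}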
Let's now look at these two main functions:
- page_cache_sync_readahead() --> page_cache_sync_ra(): the path the kernel normally uses to bring data in from storage; it is called synchronously on a cache miss and submits the read (sync).
- filemap_readahead() --> page_cache_async_ra(): this one handles readahead asynchronously; it just submits the request and expects the data to arrive later (async).
/**
* page_cache_sync_readahead - generic file readahead
--> The Linux kernel usually brings data in from storage through this function: it is called synchronously on a cache miss and submits the read (the wait for the data itself happens later, e.g. in filemap_update_page()).
* @mapping: address_space which holds the pagecache and I/O vectors
* @ra: file_ra_state which holds the readahead state
* @file: Used by the filesystem for authentication.
* @index: Index of first page to be read.
* @req_count: Total number of pages being read by the caller.
*
* page_cache_sync_readahead() should be called when a cache miss happened:
* it will submit the read. The readahead logic may decide to piggyback more
* pages onto the read request if access patterns suggest it will improve
* performance.
*/
static inline
void page_cache_sync_readahead(struct address_space *mapping,
struct file_ra_state *ra, struct file *file, pgoff_t index,
unsigned long req_count)
{
DEFINE_READAHEAD(ractl, file, ra, mapping, index);
page_cache_sync_ra(&ractl, req_count);
}
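For reference, DEFINE_READAHEAD() used above just builds a struct readahead_control on the stack that bundles the file, its mapping, the readahead state and the start index. In recent kernels it is defined roughly as follows (see include/linux/pagemap.h for the exact definition in the version you are reading):

#define DEFINE_READAHEAD(ractl, f, r, m, i)				\
	struct readahead_control ractl = {				\
		.file = f,						\
		.mapping = m,						\
		.ra = r,						\
		._index = i,						\
	}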
static int filemap_readahead(struct kiocb *iocb, struct file *file,
struct address_space *mapping, struct folio *folio,
pgoff_t last_index)
{
DEFINE_READAHEAD(ractl, file, &file->f_ra, mapping, folio->index);
if (iocb->ki_flags & IOCB_NOIO)
return -EAGAIN;
page_cache_async_ra(&ractl, folio, last_index - folio->index);
--> This function reads asynchronously: it submits the request and expects the data to come in later (async).
return 0;
}
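As a side note, readahead can also be requested explicitly from userspace. A small sketch (my own example, not kernel code) using posix_fadvise(), which asks the kernel to start reading a range into the page cache so a later read() is more likely to be a cache hit:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/var/log/syslog", O_RDONLY);   /* any regular file */
	int ret;

	if (fd < 0)
		return 1;

	/* Hint that the first 1 MiB will be needed soon; the kernel may start
	 * readahead for that range in the background. */
	ret = posix_fadvise(fd, 0, 1 << 20, POSIX_FADV_WILLNEED);
	if (ret != 0)
		fprintf(stderr, "posix_fadvise failed: %d\n", ret);

	close(fd);
	return 0;
}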