File Management
File Structure
At the simplest level, a file’s contents can be broken down into a variety of logical components.
-
Field. A single (indivisible) data item.
-
Array.A collection of equivalent fields.
-
Record. A collection of different fields.
For normal txt file, we can regard them as a file consists of many characters. And we can treat every file in this way.
Four method to manage fields
-
Fixed length. Fix the length of each field to a constant value.
-
Length indicator. Begin each field with a numeric value defining its length.
-
Delimiter. Separate each field with a delimiter character.
-
Key–value pair. Use a “keyword=value” representation to identify each field and its contents. A delimiter is also needed to separate key–value pairs.
How to access data?
Sequential access
Basically it’s linary scan from start of the file to the end of the file. It’s slow. So we have three ways to improve the performance.
- Move to Front (LRU)
- Transpose: like LRU, moving the record with its preceding one instead of moving to the front
- Count: record the access count of each record and sort records by count.
Direct access
Jump directly to the location of a record. That means we have a way to convert a record’s key into its location.
- Binary Search: sort the keys and find the key and index by binary search
Here are some disadvantages of binary search:
- The file must be sorted, and maintain this property is very expensive (every time insert or delete)
- Records must be fixed length
- Binary search still requires one or two seeks to find a record
File Management
We need to add, update, delete our records. Add is simple, updates can regrad as a deletion followed by an addition. So we only need to consider about deletion.
-
Fiexed length deletion
When deletion, we always use a term ‘hole’ to represent a deleted record. And we usually do the lazy deletion, which is mark the record as deleted and delete it when necessary or replace it with new records.
-
Variable length deletion
When deletion, we can use an available list of holes to represent available space. We also need to merge small holes and split big hole into small holes.
For the available list, we have three common stretegies to find a hole.
- First Fit: find first one in the available list
- Best Fit: find a smallest hole bigger than the record size
- Worst Fit: find a biggest hole
Index Management
One simple way is to use key and offset as index. As the index file size grows, we can utilize other data structure like B tree or external hash tables.
Secondary Key Index
For the secondary key index table, we use secondary key as key and first key as value. And we sort first by secondary key then by first key for combination search.
Improvement for secondary index: Inverted list and linked list.
Reference
[1] Disk-Based Algorithms for Big Data By Christopher G. Healey