File Management

2 minute read

File Structure

At the simplest level, a file’s contents can be broken down into a variety of logical components.

  • Field. A single (indivisible) data item.

  • Array.A collection of equivalent fields.

  • Record. A collection of different fields.

For normal txt file, we can regard them as a file consists of many characters. And we can treat every file in this way.

file

Four method to manage fields

  1. Fixed length. Fix the length of each field to a constant value.

  2. Length indicator. Begin each field with a numeric value defining its length.

  3. Delimiter. Separate each field with a delimiter character.

  4. Key–value pair. Use a “keyword=value” representation to identify each field and its contents. A delimiter is also needed to separate key–value pairs.

file

How to access data?

Sequential access

Basically it’s linary scan from start of the file to the end of the file. It’s slow. So we have three ways to improve the performance.

  • Move to Front (LRU)
  • Transpose: like LRU, moving the record with its preceding one instead of moving to the front
  • Count: record the access count of each record and sort records by count.

Direct access

Jump directly to the location of a record. That means we have a way to convert a record’s key into its location.

  • Binary Search: sort the keys and find the key and index by binary search

Here are some disadvantages of binary search:

  • The file must be sorted, and maintain this property is very expensive (every time insert or delete)
  • Records must be fixed length
  • Binary search still requires one or two seeks to find a record

File Management

We need to add, update, delete our records. Add is simple, updates can regrad as a deletion followed by an addition. So we only need to consider about deletion.

  • Fiexed length deletion

    When deletion, we always use a term ‘hole’ to represent a deleted record. And we usually do the lazy deletion, which is mark the record as deleted and delete it when necessary or replace it with new records.

  • Variable length deletion

    When deletion, we can use an available list of holes to represent available space. We also need to merge small holes and split big hole into small holes.

For the available list, we have three common stretegies to find a hole.

  • First Fit: find first one in the available list
  • Best Fit: find a smallest hole bigger than the record size
  • Worst Fit: find a biggest hole

Index Management

One simple way is to use key and offset as index. As the index file size grows, we can utilize other data structure like B tree or external hash tables.

Secondary Key Index

For the secondary key index table, we use secondary key as key and first key as value. And we sort first by secondary key then by first key for combination search.

file

Improvement for secondary index: Inverted list and linked list.

Reference

[1] Disk-Based Algorithms for Big Data By Christopher G. Healey