File organisation may be defined as a method of storing records
in file. Also, the
subsequent implications on the way these
records can be accessed. The following are the factors
involved in selecting a particular file
organisation:
|
• Ease of retrieval
• Convenience of updates
• Economy of storage
• Reliability
• Security
• Integrity
|
Different file organisations accord
different weightages to the above factors. The choice must be made
depending upon the individual needs of the
particular application in question.
|
• Sequential Files
Data records are stored in some specific
sequence e.g., order of arrival value of key field etc. Records
of a sequential file cannot be accessed at
random i.e., to access the nth
record, one must
traverse the
preceding (n–1) records. Sequential files
will be dealt with at length in the next section.
|
• Relative Files
Each data record has a fixed place in a
relative file. Each record must have associated with it in
integer key value that will help identify
this slot. This key, therefore, will be used for insertion and
retrieval of the record. Random as well as
sequential access is possible. Relative files can exist only
on random access devices like disks.
|
• Direct Files
These are similar to relative files,
except that the key value need not be an integer. The user can
specify keys which make sense to his
application.
|
• Indexed Sequential Files
An index is added to the sequential file
to provide random access. An overflow area needs to be
maintained to permit insertion in
sequence.
|
• Indexed Flies
In this file organisation, no sequence is
imposed on the storage of records in the data file, therefore, no
overflow area is needed. The index,
however, is maintained in strict sequence. Multiple indexes are
allowed on a file to improve access.
|
SEQUENTIAL FILE ORGANISATION
|
Sequential files have data records stored in a specific
sequence. A sequentially organised file may be
stored on either a serial-access or a
direct-access storage medium.
|
Structure
|
To provide the ‘sequence’ required, a ‘key’
must be defined for the data records. Usually a field
whose values can uniquely identify data
records is selected as the key. If a single field cannot fulfil
this criterion, then a combination of
fields can serve as the key.
|
Operations
1. Insertion: Records must be inserted at the place
dictated by the sequence of the keys. As is
|
obvious, direct insertions into the main
data file would lead to frequent rebuilding of the file.
This problem could be mitigated by
reserving’ overflow areas’ in the file for insertions. But,
|
this leads to wastage of space. The common method is to use
transaction logging. This works
as follows:
• It collects records
for insertion in a transaction file in their order of their arrival.
• When population of
the transactions file has ceased, sort the transaction file in the
order of the key of the primary data file
• Merge
the two files on the basis of the key to get a new copy of the primary
sequential file. Such insertions are usually done in a batch mode when
the
activity/program which populates the transaction file have ceased. The
structure of
the transaction files records will be identical to that of the primary
file.
|
2. Deletion: Deletion is the
reverse process of insertion. The space occupied by the record
|
should be freed for use. Usually deletion
(like-insertion) is not done immediately. The
concerned record is written to a
transaction file. At the time of merging, the corresponding
data record will be dropped from the primary
data file.
|
3. Updation: Updation is a
combination of insertion and deletions. The record with the new
|
values is inserted and the earlier version
deleted. This is also done using transaction files.
|
4. Retrieval: User
programs will often retrieve data for viewing prior to making decisions.
|
Therefore, it is vital that this data
reflects the latest state of the data, if the merging activity
has not yet taken place. Retrieval is
usually done for a particular value of the key field. Before
return in to the user, the data record
should be merged with the transaction record (if any) for
that key value. The other two operations
‘creation’ and ‘deletion’ of files are achieved by
simple programming language statements.
|
Disadvantages
|
Following are some of the disadvantages of
sequential file organisation:
• Updates are not easily accommodated.
• By definition, random access is not
possible.
• All records must be structurally
identical. If a new field has to be added, then every record must be
rewritten to provide space for the new
field.
|
Areas of Use
|
Sequential files are most frequently used
in commercial batch oriented data processing where there is
the concept of a master file to which
details are added periodically. Example is payroll applications.
|
DIRECT FILE ORGANISATION
|
It offers an effective way to organise
data when there is a need to access individual records directly.
To access a record directly (or random
access) a relationship is used to translate the key value into a
physical address. This is called the
mapping function R.
|
R(key value) = Address
|
Direct files are stored on DASD (Direct
Access Storage Device). A calculation is performed on the
key value to get an address. This address
calculation technique is often termed as hashing. The
calculation applied is called a hash
function. Here, we discus a very commonly used hash function
called Division - Remainder.
|
Division-Remainder Hashing
|
According to this method, key value is
divided by an appropriate number, generally a prime number,
and the division of remainder is used as
the address for the record. The choice of appropriate divisor
may not be so simple. If it is known that
the file is to contain n records, then we must, assuming that
only one record can be stored at a given
address, have divisor n.
|
Also we may have a very large key space as compared to the
address space. Key space refers to all the
possible key values. The address space
possibly may not match actually for the key values in the file.
Hence, calculated address may not be
unique. It is called Collision, i.e.,
R(K1) = R(K2), where K1 = K2.
|
Two unequal keys have been calculated to
have the same address. The keys are called synonyms.
There are various approaches to handle the
problem of collisions. One of these is to hash to buckets.
A bucket is a space that can accommodate
multiple records. The student is advised to read some text
on bucket Addressing and related topics.
|
INDEXED SEQUENTIAL FILE ORGANISATION
|
When there is need to access records
sequentially by some key value and also to access records
directly by the same key value, the
collection of records may be organised in an effective manner
called Indexed Sequential Organisation.
You must be familiar with search process for a word in a
language dictionary. The data in the
dictionary is stored in sequential manner. However an index is
provided in terms of thumb tabs. To search
for a word we do not search sequentially. We access the
index. That is, locate an approximate
location for the word and then proceed to find the word
sequentially.
|
To implement the concept of indexed
sequential file organisations, we consider an approach in which
the index part and data part reside on a
separate file. The index file has a tree structure and data file
has a sequential structure. Since the data
file is sequenced, it is not necessary for the index to have an
entry for each record. Consider the
sequential file with a two-level index.
|
Level 1 of the index holds an entry for
each three-record section of the main file. The Level 2 indexes
Level 1 in the same way. When the new
records are inserted in the data file, the sequence of records
need to be preserved and also the index is
accordingly updated. Two approaches used to implement
indexes are static indexes and dynamic
indexes.
|
As the main data file changes due to
insertions and deletions, the contents of the static index may
change, but the structure does not change.
In case of dynamic indexing approach, insertions and
deletions in the main data file may lead
to changes in the index structure. Recall the change in height
of B-Tree as records are inserted and
deleted. Both dynamic and static indexing techniques are useful
depending on the type of application.
|
A directory is a component of file.
Consider a file which doesn’t have any keys for its records. When
a query is executed on such a file, the
time consumed to execute the query is more when compared to
another file which is having keys,
because, there may be arising necessity where in the file has to be
sorted on the field(s) on which the query
is based. So, for each query, the file has to be sorted on the
field on which the query is based which is
cumbersome. In the case of files which have keys, different
versions of the files which result due to
sorting on the keys are stored in the directory of that file. Such
files are called index files and the
number of index files vary from file to file. For example, consider
the file of Figure 12.1. If
we designate Enrolment Number (Enum) and Name as keys, then we may
have two index files based on the each
key. Of course, we can have more than two index files, to deal
with queries which use both the keys.
Different software store index files in a different manner so that
the operations on the records can be
performed as soon as possible after the query is submitted.
|
One of the prominent indexing techniques is Cylinder-Surface
indexing. Since, there exists a primary
key for each of the files, there will be
an index file based on the primary key. Cylinder-Surface
Indexing is useful for such index file. In
this type of indexing, the records of the file are stored one
after another in such a way that the
primary keys of the records are in increasing order. The index file
will have two fields. They are cylinder
index and corresponding surface indexes. There will be
multiple cylinders and there are multiple
surfaces corresponding to each cylinder.
|
Suppose that the file needs m cylinders,
then cylinder index will have m entries. Each cylinder will be
having one entry which corresponds to the
largest key value in that cylinder. Assume that the disk has
n surfaces which can be used. Then, each
surface index has n entries. The k-th entry in surface index
for cylinder lth cylinder if the value of the largest key on the lth track of the kth
surface. Hence, m.n
indicates the total number of surface
index entries.
|
Suppose that the need arises to search for
a record whose key value is B. Then, the first step is to load
the cylinder index of the file into
memory. Usually, each cylinder index occupies only one track as the
number of cylinders are only few. The
cylinder which holds the desired record is found by searching
the cylinder index. Usually, the search
takes O(log m) time. After the search of cylinder index, the
corresponding cylinder is determined.
Based on the cylinder, the corresponding surface index is
retrieved to look the record for which the
search has started. Whenever a search is initiated for the
surface index, usually sequential search
is used. Of course, it depends on the number of surfaces. But,
usually, the number of surfaces are less.
After finding the cylinder to be accessed and after finding the
surface to be accessed, the corresponding
track is loaded into memory and that track is searched for
the needed record.
|
No comments:
Post a Comment