Search This Blog

Sunday, October 9, 2011

MCS-021 Assignments Solution January-2012:-Question 4: Briefly, describe various file organizations.(20 marks)


 File organisation may be defined as a method of storing records in file. Also, the
subsequent implications on the way these records can be accessed. The following are the factors
involved in selecting a particular file organisation:
Ease of retrieval
Convenience of updates
Economy of storage

Reliability
Security
Integrity
Different file organisations accord different weightages to the above factors. The choice must be made
depending upon the individual needs of the particular application in question.
Sequential Files
Data records are stored in some specific sequence e.g., order of arrival value of key field etc. Records
of a sequential file cannot be accessed at random i.e., to access the nth record, one must traverse the
preceding (n–1) records. Sequential files will be dealt with at length in the next section.
Relative Files
Each data record has a fixed place in a relative file. Each record must have associated with it in
integer key value that will help identify this slot. This key, therefore, will be used for insertion and
retrieval of the record. Random as well as sequential access is possible. Relative files can exist only
on random access devices like disks.
Direct Files
These are similar to relative files, except that the key value need not be an integer. The user can
specify keys which make sense to his application.
Indexed Sequential Files
An index is added to the sequential file to provide random access. An overflow area needs to be
maintained to permit insertion in sequence.
Indexed Flies
In this file organisation, no sequence is imposed on the storage of records in the data file, therefore, no
overflow area is needed. The index, however, is maintained in strict sequence. Multiple indexes are
allowed on a file to improve access.
SEQUENTIAL FILE ORGANISATION
Sequential files have data records stored in a specific sequence. A sequentially organised file may be
stored on either a serial-access or a direct-access storage medium.
Structure
To provide the ‘sequence’ required, a ‘key’ must be defined for the data records. Usually a field
whose values can uniquely identify data records is selected as the key. If a single field cannot fulfil
this criterion, then a combination of fields can serve as the key.
Operations
  1. Insertion: Records must be inserted at the place dictated by the sequence of the keys. As is
obvious, direct insertions into the main data file would lead to frequent rebuilding of the file.
This problem could be mitigated by reserving’ overflow areas’ in the file for insertions. But,



this leads to wastage of space. The common method is to use transaction logging. This works
as follows:
     It collects records for insertion in a transaction file in their order of their arrival.
     When population of the transactions file has ceased, sort the transaction file in the
         order of the key of the primary data file
     Merge the two files on the basis of the key to get a new copy of the primary
         sequential file. Such insertions are usually done in a batch mode when the
         activity/program which populates the transaction file have ceased. The structure of
         the transaction files records will be identical to that of the primary file.
2. Deletion: Deletion is the reverse process of insertion. The space occupied by the record
should be freed for use. Usually deletion (like-insertion) is not done immediately. The
concerned record is written to a transaction file. At the time of merging, the corresponding
data record will be dropped from the primary data file.
3. Updation: Updation is a combination of insertion and deletions. The record with the new
values is inserted and the earlier version deleted. This is also done using transaction files.
4. Retrieval: User programs will often retrieve data for viewing prior to making decisions.
Therefore, it is vital that this data reflects the latest state of the data, if the merging activity
has not yet taken place. Retrieval is usually done for a particular value of the key field. Before
return in to the user, the data record should be merged with the transaction record (if any) for
that key value. The other two operations ‘creation’ and ‘deletion’ of files are achieved by
simple programming language statements.
Disadvantages
Following are some of the disadvantages of sequential file organisation:
• Updates are not easily accommodated.
• By definition, random access is not possible.
• All records must be structurally identical. If a new field has to be added, then every record must be
rewritten to provide space for the new field.
Areas of Use
Sequential files are most frequently used in commercial batch oriented data processing where there is
the concept of a master file to which details are added periodically. Example is payroll applications.
DIRECT FILE ORGANISATION
It offers an effective way to organise data when there is a need to access individual records directly.
To access a record directly (or random access) a relationship is used to translate the key value into a
physical address. This is called the mapping function R.
R(key value) = Address
Direct files are stored on DASD (Direct Access Storage Device). A calculation is performed on the
key value to get an address. This address calculation technique is often termed as hashing. The
calculation applied is called a hash function. Here, we discus a very commonly used hash function
called Division - Remainder.
Division-Remainder Hashing
According to this method, key value is divided by an appropriate number, generally a prime number,
and the division of remainder is used as the address for the record. The choice of appropriate divisor
may not be so simple. If it is known that the file is to contain n records, then we must, assuming that
only one record can be stored at a given address, have divisor n.



Also we may have a very large key space as compared to the address space. Key space refers to all the
possible key values. The address space possibly may not match actually for the key values in the file.
Hence, calculated address may not be unique. It is called Collision, i.e.,
R(K1) = R(K2), where K1 = K2.
Two unequal keys have been calculated to have the same address. The keys are called synonyms.
There are various approaches to handle the problem of collisions. One of these is to hash to buckets.
A bucket is a space that can accommodate multiple records. The student is advised to read some text
on bucket Addressing and related topics.
INDEXED SEQUENTIAL FILE ORGANISATION
When there is need to access records sequentially by some key value and also to access records
directly by the same key value, the collection of records may be organised in an effective manner
called Indexed Sequential Organisation. You must be familiar with search process for a word in a
language dictionary. The data in the dictionary is stored in sequential manner. However an index is
provided in terms of thumb tabs. To search for a word we do not search sequentially. We access the
index. That is, locate an approximate location for the word and then proceed to find the word
sequentially.
To implement the concept of indexed sequential file organisations, we consider an approach in which
the index part and data part reside on a separate file. The index file has a tree structure and data file
has a sequential structure. Since the data file is sequenced, it is not necessary for the index to have an
entry for each record. Consider the sequential file with a two-level index.
Level 1 of the index holds an entry for each three-record section of the main file. The Level 2 indexes
Level 1 in the same way. When the new records are inserted in the data file, the sequence of records
need to be preserved and also the index is accordingly updated. Two approaches used to implement
indexes are static indexes and dynamic indexes.
As the main data file changes due to insertions and deletions, the contents of the static index may
change, but the structure does not change. In case of dynamic indexing approach, insertions and
deletions in the main data file may lead to changes in the index structure. Recall the change in height
of B-Tree as records are inserted and deleted. Both dynamic and static indexing techniques are useful
depending on the type of application.
A directory is a component of file. Consider a file which doesn’t have any keys for its records. When
a query is executed on such a file, the time consumed to execute the query is more when compared to
another file which is having keys, because, there may be arising necessity where in the file has to be
sorted on the field(s) on which the query is based. So, for each query, the file has to be sorted on the
field on which the query is based which is cumbersome. In the case of files which have keys, different
versions of the files which result due to sorting on the keys are stored in the directory of that file. Such
files are called index files and the number of index files vary from file to file. For example, consider
the file of Figure 12.1. If we designate Enrolment Number (Enum) and Name as keys, then we may
have two index files based on the each key. Of course, we can have more than two index files, to deal
with queries which use both the keys. Different software store index files in a different manner so that
the operations on the records can be performed as soon as possible after the query is submitted.



One of the prominent indexing techniques is Cylinder-Surface indexing. Since, there exists a primary
key for each of the files, there will be an index file based on the primary key. Cylinder-Surface
Indexing is useful for such index file. In this type of indexing, the records of the file are stored one
after another in such a way that the primary keys of the records are in increasing order. The index file
will have two fields. They are cylinder index and corresponding surface indexes. There will be
multiple cylinders and there are multiple surfaces corresponding to each cylinder.
Suppose that the file needs m cylinders, then cylinder index will have m entries. Each cylinder will be
having one entry which corresponds to the largest key value in that cylinder. Assume that the disk has
n surfaces which can be used. Then, each surface index has n entries. The k-th entry in surface index
for cylinder lth cylinder if the value of the largest key on the lth track of the kth surface. Hence, m.n
indicates the total number of surface index entries.
Suppose that the need arises to search for a record whose key value is B. Then, the first step is to load
the cylinder index of the file into memory. Usually, each cylinder index occupies only one track as the
number of cylinders are only few. The cylinder which holds the desired record is found by searching
the cylinder index. Usually, the search takes O(log m) time. After the search of cylinder index, the
corresponding cylinder is determined. Based on the cylinder, the corresponding surface index is
retrieved to look the record for which the search has started. Whenever a search is initiated for the
surface index, usually sequential search is used. Of course, it depends on the number of surfaces. But,
usually, the number of surfaces are less. After finding the cylinder to be accessed and after finding the
surface to be accessed, the corresponding track is loaded into memory and that track is searched for
the needed record.


No comments:

Post a Comment