Data and file formats describes different types of data file. Files are the most commonly used packets of information in the world of the desktop computer. But when the volume of data becomes very large, or you need to allow many people to access the data at the same time, it becomes preferable to store the information in a database.
A database is a tool capable of storing large amounts of complex information in a structured way. Information in a database is organised into individual records that can be referenced, sorted, indexed, linked and queried. Most computer systems you interact with on a daily basis have some kind of database behind the scenes. These contain many different types of information, for example, an ATM showing your bank account details; on-screen flight information at an airport; products and prices at the supermarket checkout and so on. Databases are, in principle, more robust, secure and scalable than storing information in flat files.
In large industrial software systems there will usually be multiple databases operating together in a database management system (DBMS). There are two distinct major database types: relational (RDBMS) and object-oriented (OODBMS).
The storage of information in a relational database is fairly simple. The records of information are organised into rows and columns in a table, with a separate row for each entity and a column for each property stored about that set of entities. For example, a table could contain a list of cars with each entry containing values such as a registration number, colour, mileage, make or mode. This is essentially the same as a spreadsheet. Each column is formatted to store a particular type of data, be that text, numbers, dates or boolean. A structured query language (SQL) that is used by almost all RDBMS to allow interrogation of the data has been developed. For example, the SQL statement Select * from CARS where MAKE = "Ford" will generate a list of all the Fords in the original table.
For the computer to understand the data correctly there can be no duplicate records in a table. In a relational database each record in a table must be uniquely identified for queries to be meaningful. This usually means that one of the columns should contain unique values – the primary key. In our example you could assume that car registration numbers will never be duplicated so this column could be used as the primary key. Often the data does not inherently contain a primary key, so the database will generate its own set of unique ID numbers for that table.
The term relational derives from the fact that such databases use multiple tables to store the information, with data linked together by relationships between these tables. These tables contain information about like entities and make storage more efficient by avoiding duplication. For example, in the car table there is no point in storing both the make and model of each car. You know that every record having Vectra as the model will also have Vauxhall as the make. You can store the relationship between model and make in a small separate table and only record the model in the main table. This is a key concept of database design and is known as normalisation. Database normalisation saves storage space and makes the data easier to index and analyse. Querying highly normalised relational databases can become quite complex since a large number of tables may need to be linked together.
Crucial to the efficient querying of a database is the way in which the tables are indexed. You can create indexes on tables that enable the computer to sort through and answer a query quickly. One way in which this can work is for a column containing textual data to generate an index in which the values are sorted alphabetically. The index stores ranges of values so that to respond to a query the software only has to search through a small subset of the whole table to find the required records.
Object-oriented databases have their origins in the realm of object-oriented programming languages. This subject is notoriously difficult to explain in simple terms and it may be advisable to seek more detailed resources and devote some time to fully grasp the concepts of object-orientation.
OODBMS organise information very differently than RDBMS. Rather than spreading the information about an entity across a range of linked tables, it is stored together in discrete lumps called objects. Each object is defined within a hierarchy of object classes so that it inherits properties from a parent class. Additional attributes can be defined within the object and they are said to exhibit encapsulation because they can self-describe their own particular set of properties, and therefore the way in which they can be queried. Object-oriented databases can make it easier to model real-world phenomena in a logical form.
There is a particular problem for the GIS student learning about object databases: the use of the term object. The situation gets clouded by the fact that geospatial data corresponds to real-world objects. In GIS the shapes and locations of things are stored as coordinate geometry. GIS data is often stored in a database, either storing the coordinates as numbers or using special geometry data types. You will hear GIS practitioners refer to an object database to describe any database that can store the geometry of topographic objects in its tables. To a practitioner of pure computer science the distinction between relational and object databases has nothing to do with geography. To make things worse, there is a hybrid type of database called object-relational in which advanced data types can be stored in relational tables that reproduce some of the advantages of the object model. In theory, you can store object geometry in each of these database types: relational, object and object-relational.
As you can imagine, it is very important when using this jargon that you know exactly what you mean by an object! The term spatial database is better for specifying the storage of geographical features.
GIS and databases
As previously mentioned, many implementations of GIS use a database rather than simple files for the data storage. Geographical datasets can be extremely large and so the benefits of database storage are as applicable to GIS as to any other kind of system. The elements of data security, the ability to cope with large data volumes and the accessibility to multiple users are equally important. The explosion in the availability of cheap, high-volume disk space has fuelled the proliferation of large databases.
Many large organisations maintain vast enterprise-wide information systems that incorporate different types of geographical objects. For example, a utility company will use GIS to store information about its pipe networks, the location of its customers and the location of its maintenance teams. This information will need to be continually updated as it is much easier to lock the record for a single feature, perform edits and then perform the update if the features are stored in a database rather than in a file. The database also allows the various departments to view the information in different ways.
All major GIS software vendors provide tools to enable database storage. Database management can be a complicated and specialist task, so products are developed to provide a user interface similar to that of a regular desktop GIS but which also handle the database administration side. Such products are often referred to as middleware. Middleware is usually designed to operate across a range of the most popular database products such as MS-Access, SQL Server, Oracle®, Sybase®, Ingres® and IBM®. More recently, the database software companies themselves have been producing extensions to the standard functionality that allow for the storage of complex data types, for example, coordinate geometry, raster images and terrain models. This is a telling sign of how GIS has become recognised as a key component of information technology.
Advanced database technology
Databases are now used in sophisticated ways that go beyond the simple storage of information and its retrieval by structured queries. Two techniques that are often encountered in the study of GIS are data warehousing and data mining. Databases are designed as part of a particular system, only storing the information needed to make that system work. Organisations end up with many databases serving different functions and often this data can contain information that has value beyond the purpose of the individual systems for which they were designed. The centralised gathering together of diverse sets of information stored within an organisation is known as data warehousing.
Techniques have emerged that automatically scan the information held in a data warehouse to identify possible relationships between data items. This is known as data mining, which can reveal phenomena that may otherwise remain undetected. Terms you will hear often associated with data mining are regression, classification and clustering. Mostly, data mining concerns a statistical analysis of the contents of the data warehouse to identify commonalities and patterns. For example, regression refers to the mathematical analysis of numerical data to identify a formula that best fits the trends in the data. If successful this can enable successful prediction of future results.
Another feature of databases that is very important to their application in GIS is indexing. The way in which indexes speed up the response to queries has already been described. This becomes very important when performing geographical search queries as it is possible to generate spatial indexes that break down the space occupied by features in the table and sort them into a hierarchy similar to the alphabetical sorting of text values. In response to a request to find all objects that intersect a polygon, it can be quicker to find a subset near that polygon first and then analyse each object of this subset more accurately to find those that actually intersect.
Indexing is important because spatial queries can be very complicated and time consuming. If you are trying to select all features lying within a county boundary you could be checking from many thousands of records against the shape with thousands of vertices, a very convoluted geometric algorithm. Spatial databases that contain features in three dimensions are starting to be developed – for example, to store a building as a 3-D volumetric object, not just a planar polygon shape. The generation of spatial indexing for three-dimensional space presents interesting challenges!
Finally in this section, a mention of another key challenge – the storage of data in the fourth dimension. GIS databases designed to store information about real-world objects and how they change over time are called spatio-temporal. To truly reflect real-world changes in data form, a GIS needs to maintain historical records. The simplest way of achieving this is to keep copies of the data at intervals to create a series of time slices. More ideal, but harder to achieve, is to archive each feature every time a change is made to it; this means you can answer a query for any moment in time.