GI data compared to other data
In its most elemental form digital data is composed of bits: indicators that have a state of either 0 or 1. Information can be encoded in sequences of these binary digits. The way in which this code works varies between systems.
One of the most universal conventions is the organisation of bits into sets of eight, called bytes. A byte is therefore a sequence of eight 0s and 1s in any of 256 combinations, be it 00000000, 11111111 or 01101000. These combinations are the base-2 representations of the decimal numbers 0 to 255. Streams of bytes can be used to encode all kinds of information; the power of the computer comes from the volume in which these streams can be stored, and the speed with which they can be transmitted and manipulated.
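The relationship between a bit pattern and its decimal value can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are chosen for this example.

```python
# Interpret one eight-bit pattern from the text as a base-2 number.
bits = "01101000"

value = int(bits, 2)             # convert the base-2 string to a decimal integer
combinations = 2 ** len(bits)    # number of distinct patterns eight bits can hold
round_trip = format(value, "08b")  # convert back to an eight-character bit string

print(value)         # 104
print(combinations)  # 256
print(round_trip)    # 01101000
```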
A file is a collection of bytes that makes up a logical unit of information. In describing types of file, the terms ASCII and binary are commonly encountered. ASCII (American Standard Code for Information Interchange) files adopt a convention in which each byte corresponds to one of a set of 128 common characters and control codes; the code itself needs only seven bits, so the leading bit of each byte is zero. ASCII files are very simple and can be created using basic text editing tools like Notepad. If a file is described as binary it means that information is encoded in the bit sequence in some other way: only by knowing the code for that particular file can you interpret the information, be it text, graphics, mathematical formulæ, video or whatever. This is how different software products use different types of file, identified with their different file extensions (for example, .doc and .tiff). In a sense, all files are binary, but in common usage the term refers to a file that does not conform to the ASCII convention.
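The difference between the two conventions can be sketched in Python by storing the same number both ways. The coordinate value and the choice of an eight-byte little-endian floating-point layout are assumptions made for this example; real binary formats each define their own layout.

```python
import struct

easting = 432100.5  # hypothetical coordinate value, for illustration only

# ASCII convention: each character of the written number is stored as one byte.
as_text = str(easting).encode("ascii")

# One possible binary convention: an 8-byte little-endian IEEE 754 double.
as_binary = struct.pack("<d", easting)

# The ASCII bytes can be read by any text editor.
decoded_text = as_text.decode("ascii")

# The binary bytes are only meaningful if you know the code, here "<d".
decoded_binary = struct.unpack("<d", as_binary)[0]

print(as_text)
print(as_binary.hex())
print(decoded_text, decoded_binary)
```

Opening the binary bytes in a text editor would show unreadable characters, which is why the layout of a binary file has to be known before it can be interpreted.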
The data used in GIS is no different. It is also organised into files, with different software products using their own particular file types, binary coding and file extensions. The earlier chapters of The GIS files described how geographical data comes in many different forms. This fact is reflected in the files that are used to store this data. The range of different file types used in GIS can be very confusing! Word processing packages use file types in a simple way: with Microsoft® Word you store each document in a single .doc file. GIS can be much more complex. With geometry, attributes, indexes, topology, image and history information to store, most systems use more than one file (with multiple file types) to encode a particular data layer. Chapter 6.1 attempts to put these file formats into perspective.
Proprietary file types
Every software product is designed to work with a specific set of file types. In essence that is what the software does: it reads the particular binary code to extract the stored information and then does something with it, for example, displays it on screen, sends it to a printer or performs calculations. Commercial software products usually have their own specific binary formats.
It would be impossible to describe the full range of different file types used in GIS as there are so many. However, it is important to recognise that each product handles the storage of information in different ways and, to fully understand what is happening to the data, it is useful to consider the files involved for your own system – what's going on under the bonnet, so to speak. Here are a few examples:
- single file for each layer – in some systems all information for a given layer is stored in a single file (for example, .dxf files in AutoCAD®);
- multiple files for each layer – some systems use a series of files for each layer (for example, MapInfo® has a .tab file for each table of information, but this is just a pointer to a set of files containing the geometry, attributes, identifiers and indexes separately); and
- a folder of files for each layer – more complex systems can use a more complicated hierarchy of files held in a specific folder to store the information (for example, ArcInfo® coverages).
In these last two examples you only ever interact with the data through the GIS software interface. The individual files are not designed to be edited outside the GIS as this will almost certainly corrupt the data.
GIS software can read image data from standard graphics file formats but often needs an additional file to register the image in space. Examples of such files are MapInfo .tab files and ESRI® .tfw world files. These are in fact simple ASCII files. If you have any examples on your own computer try viewing the contents in a text editor; this can be useful to understand how they work. On the next page we'll see how ASCII files can be important in the transfer of data between systems.
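As an illustration of how such a registration file works, the sketch below reads the six numbers of a world file (the kind of file that carries the .tfw extension) and uses them to convert a pixel position into map coordinates. The sample values, class and function names are hypothetical and chosen for this example; the six-line layout and the transformation follow the usual world file convention.

```python
from typing import NamedTuple, Tuple


class WorldFile(NamedTuple):
    """The six values of a world file, in the order they appear in the file."""
    a: float  # line 1: pixel width in map units
    d: float  # line 2: rotation term
    b: float  # line 3: rotation term
    e: float  # line 4: pixel height in map units (usually negative)
    c: float  # line 5: map x of the centre of the upper-left pixel
    f: float  # line 6: map y of the centre of the upper-left pixel


def parse_world_file(text: str) -> WorldFile:
    """Parse the ASCII content of a world file into its six numbers."""
    values = [float(token) for token in text.split()]
    if len(values) != 6:
        raise ValueError("a world file should contain exactly six numbers")
    return WorldFile(*values)


def pixel_to_map(wf: WorldFile, col: float, row: float) -> Tuple[float, float]:
    """Convert a pixel column and row into map x and y coordinates."""
    x = wf.a * col + wf.b * row + wf.c
    y = wf.d * col + wf.e * row + wf.f
    return x, y


# Hypothetical content: 10 m pixels, no rotation, invented origin.
sample = "10.0\n0.0\n0.0\n-10.0\n400005.0\n199995.0\n"

wf = parse_world_file(sample)
print(pixel_to_map(wf, 0, 0))  # centre of the upper-left pixel
print(pixel_to_map(wf, 2, 3))  # two columns across, three rows down
```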
Translators and transfer formats
GIS software is designed to work with data stored in specific proprietary binary data formats. The skill of the software developer is to optimise the system’s performance to include as many functions as possible while remaining fast and robust.
The binary code used to store the data is critical to the performance of the software and GIS vendors guard their binary formats as part of their unique intellectual property.
This means, however, that many different file formats exist. With so many different software products available, each with its own data formats, users of one system may find it difficult to swap their information with users of another. In the early days of GIS this was serious; if you used data in one package you could not use the same data in a separate system from a different vendor. More recently it has become standard for GIS software to have import facilities that can open files from a diverse range of formats and store them in the preferred local format.
There has also been an explosion in the development of translator software. Products have been developed to convert geographical data between a whole array of formats. It would be unfair in these pages to highlight a specific translator product in comparison to any other. But it is now possible to convert between practically any of the possible data formats, of which there are over a hundred, in either direction. Type GIS translators into a search engine and see the results.
Most of these formats are binary because in this form data is more closely integrated with the software engineering of the products and can be manipulated more quickly. Code to work with such data can only be written if you know the binary format. A series of simpler ASCII file formats has been developed to enable easier transfer between systems. The human-readable nature of ASCII files makes it considerably more straightforward for other developers to write programs that can read these files. The MapInfo MID/MIF format is an example of an ASCII transfer format.
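To give a feel for why ASCII transfer formats are easy to read, here is a rough Python sketch that pulls point geometry and attributes out of a pair of text fragments written in the style of MIF and MID files. The fragments are simplified and invented for this example, the function names are hypothetical, and the code is not a complete reader for the real format, which carries more header and styling detail.

```python
import csv
import io

# Simplified, invented fragment in the style of a MIF (geometry) file.
mif_text = """Version 300
Delimiter ","
Columns 2
  NAME Char(30)
  HEIGHT Integer
Data

Point 432100.5 123456.0
Point 432250.0 123500.0
"""

# Matching invented fragment in the style of a MID (attribute) file.
mid_text = '"Trig point",120\n"Bench mark",95\n'


def read_points(mif: str):
    """Collect (x, y) pairs from lines of the form 'Point x y'."""
    points = []
    for line in mif.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0].lower() == "point":
            points.append((float(parts[1]), float(parts[2])))
    return points


def read_attributes(mid: str):
    """Read one comma-delimited attribute record per line."""
    return [row for row in csv.reader(io.StringIO(mid))]


# Pair each geometry with its attribute record by position in the files.
features = list(zip(read_points(mif_text), read_attributes(mid_text)))

for geometry, attributes in features:
    print(geometry, attributes)
```

Because everything is plain text, the whole reader fits in a few lines; doing the same for a binary format would first require a description of its byte layout.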
The previous pages in this section have discussed why so many different file types are used in today's GIS applications. Traditionally, data providers have supplied data in open ASCII formats with systems simultaneously loading and translating it into the proprietary binary format. Data can be exchanged between systems where an import option exists for the particular formats. There are also bespoke translation tools available to cover every possible option. Exchange between formats has advanced further in recent years and the term interoperability has become important. Within a single organisation there can be several different software products being used and it is imperative that information can be shared between them.
Moore's Law, the observation that the number of transistors on a chip doubles roughly every two years, suggests that the processing power of computer hardware can be expected to continue improving. It is therefore becoming less critical to the performance of GIS software for the data to be stored in the optimum format for that particular system. The recent trend for systems to use non-specific data formats means that data is read from, and written to, different native formats on the fly as the software performs its functions.
As a data provider, Ordnance Survey has to make careful decisions about the formats it uses to supply data. Data users want a choice of formats to avoid the need for translation, and although it is difficult to provide every possible format, excluding just one would be unfair to that software vendor. This explains the need for standard open data exchange formats that create a level playing field for the producers of software and translators.
The increasing significance of databases and the Internet is also playing a big role in the advance of interoperability. Increasingly, systems are being built around the use of databases to hold the information in each GIS layer, replacing the use of flat files. Proprietary binary formats are therefore becoming less important. A similar effect is seen in the way that systems can now read data in real time from central locations on a network, rather than reading from locally stored files. Although this section on data formats set out to highlight differences between file types, these issues will probably have a greater resonance for those who practised GIS in the late 1990s than for those practising today.