Data Files: Technical Notes

The Comparative Archaeology Database attempts to present datasets from different projects in a relatively consistent format, while recognizing that archaeology deals with an extremely wide variety of kinds of data and that archaeologists have traditionally employed many different ways of organizing their data. This variety often seems a nuisance, but it is one that cannot simply be legislated away. For one thing, the nature of the archaeological record varies from one part of the world to another to a degree that an archaeologist with experience in only one region cannot appreciate. This variation requires different field methods and different ways of organizing datasets. Even under identical conditions, archaeologists with different research objectives will collect and thus present data in different ways. What constitutes efficient data organization can also be dictated by the different structures required by different analytical software (which of course can change rapidly). If all that were not enough, archaeologists have idiosyncratic personal preferences to which we all cling stubbornly, resisting others' dicta on "best practices." To paraphrase Marshall Sahlins writing about the authority of chiefs, one word from a standards committee, and archaeologists will organize their data exactly as they please.

In this situation, the Comparative Archaeology Database does not attempt to impose much standardization on the data files in different datasets. A number of datasets include unique elements, which we have tried to present in inventive ways that take advantage of on-line digital presentation, with intuitive organization not requiring separate explanation. There are also at least a few recurrent kinds of elements that we attempt to treat consistently. We have chosen a few standard file formats largely because they are importable to an especially wide array of analysis software and because they seem likely to persist as widely used standard formats into the future. And we have tried to present metadata in a consistent (and perhaps, we fear, overly detailed) way.

Cases and Variables Data

Quantitative data often take the form of cases coded for a series of variables. Such data files are presented as tables in which rows are cases and columns are variables, as usually organized by software for statistical analysis. The comma-delimited ASCII text file is a lowest-common-denominator format for such files. The usual file extension is .txt. These files can be read by almost any program written for dealing with such data, and this format is used consistently in the datasets in the Comparative Archaeology Database. The first and last lines of these .txt files are displayed and explained to help users avoid their own mistakes in interpreting them and their software's mistakes in importing them. The central elements of the metadata for such files include an explicit indication of just what the cases (rows) are and a list of the variables (columns) with whatever description is needed to make sense of the content of the table. Variable names are not included on the first lines of the ASCII files, as is sometimes done, since the conventions for variable names vary so much across different data analysis software. Users will need to refer to the metadata and create their own variable names according to the requirements of the software they use. In many datasets such data are also downloadable in spreadsheet format (.xls), which is easier to browse (including with free open-source software such as OpenOffice).
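
Because the .txt files carry no header line, variable names have to be supplied at import time from the metadata. The following is a minimal sketch of one way to do this in Python, using only the standard library. The variable names and the two sample lines are invented for illustration and do not come from any dataset in the Comparative Archaeology Database; in practice the stream would be an opened .txt file rather than an in-memory string.

```python
import csv
import io

# Hypothetical variable names; real ones must be taken from a dataset's metadata.
VARIABLE_NAMES = ["site_id", "easting", "northing", "sherd_count"]


def read_cases(stream, variable_names):
    """Read a headerless comma-delimited file into a list of dicts (one per case)."""
    cases = []
    for line_number, fields in enumerate(csv.reader(stream), start=1):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(variable_names):
            raise ValueError(
                f"line {line_number}: expected {len(variable_names)} values, "
                f"found {len(fields)}"
            )
        cases.append(dict(zip(variable_names, (f.strip() for f in fields))))
    return cases


# Invented sample content standing in for a downloaded .txt file.
sample = io.StringIO("1,1250.5,3400.0,12\n2,1300.0,3385.5,7\n")
cases = read_cases(sample, VARIABLE_NAMES)

# Compare the first and last cases with those displayed in the metadata.
print(cases[0])
print(cases[-1])
```

Checking the number of values on every line, and comparing the first and last cases with those shown in the metadata, is a simple guard against a misaligned import. Values are kept as text here; converting them to numbers is left to the user, since coding conventions differ between datasets.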

Spatial Data and Georeferencing

Spatial data present a number of particular complexities. One of these is georeferencing. Much archaeological data collection occurred before global georeferencing was standardized and before the use of GPS equipment was widespread. At small scales (such as locations on a living surface or within a site) it is relative spatial position that matters, and the use of arbitrary non-georeferenced coordinate systems has been and continues to be common. The same is often true at larger (regional) scales. GPS-based field equipment, however, may collect data in georeferenced coordinate systems, and some GIS software automatically assumes that coordinate systems are georeferenced, so that extra effort is required to work with an arbitrary coordinate system when georeferencing is not useful.

The nature of available georeferencing information varies from one dataset to another, depending on when, where, and how the spatial information was collected. As widely available georeferenced environmental data increase in quantity and improve in resolution, so does the temptation to overlay regional-scale archaeological settlement distributions on newly available environmental information of various kinds. For the regional-scale datasets in the Comparative Archaeology Database, an effort has been made to be explicit about the coordinate systems used for spatial data, and to provide information that can be helpful for georeferencing. As is always the case in attempting to overlay spatial datasets from different sources, the user should take great care to check that the data from different sources are properly registered before proceeding with any form of analysis. One way to do this, of course, is to see whether readily visible features in data from different sources (streams, highways, sharply distinguishable topographic features, etc.) actually match in the overlay. Full-featured GIS software generally provides tools for registering data from different sources properly by matching such shared features, whether metadata about coordinate systems are available or not. These tools are likely to produce better results (in the form of more accurately and reliably registered datasets) than "automatic" registration of multiple datasets based on faulty or imprecise georeferencing metadata.
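
The idea of registering by matched features can be illustrated with a small sketch. It assumes a handful of features whose positions are known both in an arbitrary local grid and in a georeferenced coordinate system, and it fits a least-squares similarity transformation (uniform scale, rotation, and shift) between the two. This is a deliberate simplification: GIS packages typically offer affine or higher-order transformations, and the function names and coordinates below are invented for illustration.

```python
import cmath
import math


def fit_similarity(local_points, reference_points):
    """Least-squares fit of reference = a * local + b, with points as complex numbers.

    abs(a) is the scale factor and the phase of a is the rotation.
    """
    if len(local_points) != len(reference_points) or len(local_points) < 2:
        raise ValueError("need at least two matched pairs of points")
    z = [complex(x, y) for x, y in local_points]
    w = [complex(x, y) for x, y in reference_points]
    z_mean = sum(z) / len(z)
    w_mean = sum(w) / len(w)
    numerator = sum((wi - w_mean) * (zi - z_mean).conjugate() for zi, wi in zip(z, w))
    denominator = sum(abs(zi - z_mean) ** 2 for zi in z)
    if denominator == 0:
        raise ValueError("local points must not all coincide")
    a = numerator / denominator
    b = w_mean - a * z_mean
    return a, b


def rms_misfit(local_points, reference_points, a, b):
    """Root-mean-square distance between transformed local points and reference points."""
    squared = [
        abs(a * complex(*p) + b - complex(*q)) ** 2
        for p, q in zip(local_points, reference_points)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Invented matched features (e.g. stream confluences, road junctions).
local = [(0, 0), (100, 0), (0, 100), (100, 100)]
reference = [(500000, 4000000), (500000, 4000100), (499900, 4000000), (499900, 4000100)]

a, b = fit_similarity(local, reference)
rms = rms_misfit(local, reference, a, b)
print("scale:", abs(a))
print("rotation (degrees):", math.degrees(cmath.phase(a)))
print("shift:", (b.real, b.imag))
print("RMS misfit:", rms)
```

A large RMS misfit, or a scale factor far from what the metadata imply, is a sign that the two sources are not properly registered and that any overlay should not yet be trusted.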

Vector Format Spatial Data and AutoCAD Map .dxf Files

Spatial information in vector format in the Comparative Archaeology Database is usually provided as AutoCAD Release 12 .dxf files. This format is importable to a wide variety of graphics and GIS software. It is a native format for AutoCAD Map; thus "importation" is not necessary, and the data structure will be familiar to AutoCAD Map users. Data structures and vocabularies, however, differ sharply between GIS programs. Different programs vary substantially in their ability to detect and correctly utilize georeferencing information embedded in AutoCAD Map .dxf files. For this reason, most .dxf files in the Comparative Archaeology Database do not contain embedded georeferencing information, which is provided instead as part of the metadata. The user who needs georeferenced archaeological data is thus required to pay explicit attention to the substance of the georeferencing information available and to know how to use it correctly in the software to which a .dxf file is imported. Since, as of now, many archaeologists use ArcGIS, some tips on importing data to ArcGIS are provided. In some datasets, spatial information in vector form is also provided in the file formats originally used by the authors of the datasets, as well as in .dxf files.
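
Before importing a .dxf file into unfamiliar software, it can help to see what the file contains. An ASCII .dxf file is a sequence of two-line pairs: a numeric group code followed by a value. Group code 0 begins a new entity or marker, and group code 8 gives an entity's layer. The sketch below, which uses only the Python standard library, counts the entities on each layer. It is an illustration under these assumptions rather than a full DXF reader; the layer names and the tiny sample file are invented, and a real file would be opened from disk.

```python
import io
from collections import Counter


def read_dxf_pairs(stream):
    """Yield (group_code, value) pairs from an ASCII DXF stream."""
    while True:
        code_line = stream.readline()
        value_line = stream.readline()
        if not code_line or not value_line:
            break
        yield int(code_line.strip()), value_line.strip()


def summarize_entities(stream):
    """Count entities in the ENTITIES section by (entity type, layer)."""
    counts = Counter()
    section = None            # name of the section currently being read
    awaiting_name = False     # True just after a SECTION marker
    entity = None             # [type, layer] of the entity being read
    for code, value in read_dxf_pairs(stream):
        if code == 0:
            if entity is not None:
                counts[tuple(entity)] += 1  # previous entity is complete
                entity = None
            if value == "SECTION":
                awaiting_name = True
            elif value == "ENDSEC":
                section = None
            elif section == "ENTITIES":
                entity = [value, "0"]       # layer "0" unless code 8 says otherwise
        elif code == 2 and awaiting_name:
            section = value
            awaiting_name = False
        elif code == 8 and entity is not None:
            entity[1] = value
    return counts


# Invented miniature DXF content: two points and one line.
sample_lines = [
    "0", "SECTION", "2", "ENTITIES",
    "0", "POINT", "8", "SITES", "10", "100.0", "20", "200.0",
    "0", "POINT", "8", "SITES", "10", "150.0", "20", "260.0",
    "0", "LINE", "8", "STREAMS", "10", "0.0", "20", "0.0", "11", "50.0", "21", "75.0",
    "0", "ENDSEC", "0", "EOF",
]
entity_counts = summarize_entities(io.StringIO("\n".join(sample_lines) + "\n"))
for (entity_type, layer), n in sorted(entity_counts.items()):
    print(entity_type, layer, n)
```

Comparing such a summary with the layer descriptions in a dataset's metadata is one way to confirm that nothing was lost or renamed during importation.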

Raster Format Spatial Data and GeoTIFF Files

Spatial information in raster format in the Comparative Archaeology Database is usually provided as GeoTIFF files. Georeferencing information is always embedded in these .tif files, as well as in their accompanying but separate .tfw files. Some GIS software georeferences GeoTIFF files upon importation according to the embedded information and makes no use of the .tfw file. Other software relies upon the .tfw file and makes no use of the georeferencing information embedded in the .tif file. Yet other software offers the option of using either of these sources of georeferencing data (sometimes with different outcomes). Georeferencing information is also provided explicitly on-line as part of the metadata for the dataset. The GIS datasets in the Comparative Archaeology Database are internally consistent; if they are combined with spatial data from other sources, it is important to verify proper registration by confirming that the locations of spatial features in data from both sources actually do match (as discussed above).
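
For users whose software relies on the .tfw file, it may be useful to see what that file encodes. A world file contains six numbers, one per line, in the order A, D, B, E, C, F. They define the map position of a pixel as x = A*column + B*row + C and y = D*column + E*row + F, where C and F are the coordinates of the center of the upper-left pixel. The following is a minimal sketch using only the Python standard library; the numbers are invented and do not describe any raster in the Comparative Archaeology Database.

```python
def parse_world_file(text):
    """Return the six world-file coefficients in file order: A, D, B, E, C, F."""
    values = [float(token) for token in text.split()]
    if len(values) != 6:
        raise ValueError(f"expected 6 values in a world file, found {len(values)}")
    return tuple(values)


def pixel_to_map(coefficients, column, row):
    """Map coordinates of the center of the pixel at (column, row), counted from 0."""
    a, d, b, e, c, f = coefficients
    x = a * column + b * row + c
    y = d * column + e * row + f
    return x, y


# Invented world file: 30 m pixels, no rotation, north-up (E is negative).
world_text = "30.0\n0.0\n0.0\n-30.0\n500015.0\n4000985.0\n"
coefficients = parse_world_file(world_text)

upper_left = pixel_to_map(coefficients, 0, 0)
other = pixel_to_map(coefficients, 10, 20)
print(upper_left)
print(other)
```

Computing the map coordinates of a few recognizable pixels in this way, and comparing them with the positions reported by GIS software after importation, shows which source of georeferencing information the software actually used.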