The Comparative Archaeology Database attempts to present datasets from different projects in a relatively consistent format, while recognizing that archaeology deals with an extremely wide variety of kinds of data and that archaeologists have traditionally employed many different forms of organizing their data. As such, we do not attempt to impose much standardization on the data files of different datasets. A number of datasets include entirely unique elements, which we have tried to present in inventive ways that take advantage of on-line digital presentation, with intuitive organization that does not require separate explanation. There are also at least a few recurrent kinds of elements that we attempt to treat consistently. We have chosen a few standard file formats, largely because they are importable to an especially wide array of analysis software and because they seem likely to persist as widely used standard formats into the future. We have also tried to present metadata in a consistent way whenever possible.
Cases and Variables Data
Quantitative data often take the form of cases coded for a series of variables. Such data files are presented as tables in which rows are cases and columns are variables, as usually organized by software for statistical analysis. The comma-separated values file (which usually has a .csv extension, but may also use the .txt extension) is a lowest-common-denominator format for such files. These files can be read by almost any program written for dealing with such data, and this format is used consistently in the datasets in the Comparative Archaeology Database. The first lines of these files are displayed and explained in the metadata section of each data page to help users avoid both their own mistakes in interpreting the files and their software's mistakes in importing them. The central elements of the metadata for such files include an explicit indication of just what the cases (rows) are and a list of the variables (columns) with whatever description is needed to make sense of the content of the table. Variable names are not included on the first lines of the comma-separated values files, as is sometimes done, since the conventions for variable names vary so much across different data analysis software. Users will need to refer to the metadata and create their own variable names according to the requirements of the software they use. In many datasets such data are also downloadable in spreadsheet format, which is easier to browse (including with free open-source software such as OpenOffice).
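Because the files carry no header row, variable names have to be supplied at import time. The following is a minimal sketch in Python of one way to do that, using only the standard library. The variable names and the three sample rows are hypothetical stand-ins for what a dataset's metadata and data file would actually contain.

```python
import csv
import io

# Hypothetical variable names, as a user might copy them from a dataset's metadata.
VARIABLE_NAMES = ["site_id", "area_ha", "sherd_count"]

# Stand-in for a downloaded data file; note that there is no header row.
SAMPLE = "101,2.5,340\n102,0.8,57\n103,11.2,1290\n"


def read_cases(handle, names):
    """Return one dict per case (row), keyed by the supplied variable names."""
    cases = []
    for row_number, row in enumerate(csv.reader(handle), start=1):
        # A column-count mismatch usually means the metadata list and the
        # file are out of step, so stop rather than guess.
        if len(row) != len(names):
            raise ValueError(
                f"row {row_number}: expected {len(names)} values, got {len(row)}"
            )
        cases.append(dict(zip(names, row)))
    return cases


cases = read_cases(io.StringIO(SAMPLE), VARIABLE_NAMES)
print(cases[0])
```

All values are read as text here; converting particular variables to numbers is left to the user, guided by the variable descriptions in the metadata.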
Spatial Data and Georeferencing
Spatial data present a number of particular complexities. One of these is georeferencing. Much archaeological data collection occurred before global georeferencing was standardized and before the use of GPS equipment was widespread. At small scales (such as locations on a living surface or within a site) it is relative spatial position that matters, and the use of arbitrary non-georeferenced coordinate systems has been and continues to be common. The same is often true at larger (regional) scales. GPS-based field equipment, however, may collect data in georeferenced coordinate systems, and some GIS software automatically assumes that coordinate systems are georeferenced, so that extra effort is required to work with an arbitrary coordinate system when georeferencing is not useful.
The nature of available georeferencing information varies from one dataset to another, depending on when, where, and how the spatial information was collected. As widely available georeferenced environmental data increase in quantity and improve in resolution, the temptation to overlay regional-scale archaeological settlement distributions on newly available environmental information of various kinds increases. For the regional-scale datasets in the Comparative Archaeology Database, an effort has been made to be explicit about the coordinate systems used for spatial data, and to provide information that can be helpful for georeferencing. As is always the case in attempting to overlay spatial datasets from different sources, the user should take extreme care to check that the data from different sources are properly registered before proceeding with any form of analysis. One way to do this, of course, is to see whether readily visible features in data from different sources (streams, highways, sharply distinguishable topographic features, etc.) actually match in the overlay. Any full-featured GIS program provides tools for registering data from different sources properly by matching such shared features, whether metadata about coordinate systems are available or not. These tools are likely to produce better results (in the form of more accurately and reliably registered datasets) than "automatic" registration of multiple datasets based on faulty or imprecise georeferencing metadata.
Vector Drawings
Spatial information in vector format in the Comparative Archaeology Database is usually provided as AutoCAD (Release 12) .dxf files. This format is importable to a wide variety of graphics and GIS software. It is the native format for AutoCAD Map; thus "importation" to that application is not necessary, and the data structure will be familiar to AutoCAD Map users. Data structures and vocabularies, however, differ sharply between GIS programs. Different programs vary substantially in their ability to detect and correctly utilize georeferencing information embedded in AutoCAD Map .dxf files. For this reason, most .dxf files in the Comparative Archaeology Database do not contain embedded georeferencing information, which is provided instead as part of the metadata. The user who needs georeferenced archaeological data is thus required to pay explicit attention to the substance of the georeferencing information available and to know how to use it correctly in the software to which a .dxf file is imported. Since, as of now, many archaeologists use ArcGIS, some tips on importing data to ArcGIS are provided below. In some datasets, spatial information in vector form is also provided in the file formats originally used by the authors of those datasets in addition to the .dxf versions.
Raster Imagery
Spatial information in raster format in the Comparative Archaeology Database is usually provided as GeoTIFF files. Georeferencing information is always embedded in these .tif files, as well as in their accompanying but separate .tfw files. Some GIS software georeferences GeoTIFF files upon importation according to the embedded information and makes no use of the .tfw file. Other software relies upon the .tfw file and makes no use of the georeferencing information embedded in the .tif file. Yet other software offers the option of using either of these sources of georeferencing data (sometimes with different outcomes). Georeferencing information is also provided explicitly on-line as part of the metadata for the dataset. The GIS datasets in the Comparative Archaeology Database are internally consistent; if they are combined with spatial data from other sources, it is important to verify proper registration by confirming that the locations of spatial features in data from both sources actually do match (as discussed above).
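For users who want to see what a .tfw file actually says, the following Python sketch reads the six numbers of a world file and converts pixel positions to map coordinates. The six-line order (A, D, B, E, C, F) is the standard world file convention; the sample values are invented for illustration and do not come from any dataset in the database.

```python
def parse_world_file(text):
    """Parse the six numbers of a world file into affine coefficients."""
    values = [float(token) for token in text.split()]
    if len(values) != 6:
        raise ValueError("a world file must contain exactly six numbers")
    # The file lists the coefficients in the order A, D, B, E, C, F.
    a, d, b, e, c, f = values
    return a, b, c, d, e, f


def pixel_to_map(col, row, coeffs):
    """Map coordinates of the center of the pixel at (col, row)."""
    a, b, c, d, e, f = coeffs
    x = a * col + b * row + c
    y = d * col + e * row + f
    return x, y


# Invented example: 30-unit pixels, no rotation; C and F give the map
# coordinates of the center of the upper-left pixel.
SAMPLE_TFW = "30.0\n0.0\n0.0\n-30.0\n500015.0\n4199985.0\n"

coeffs = parse_world_file(SAMPLE_TFW)
print(pixel_to_map(0, 0, coeffs))    # center of the upper-left pixel
print(pixel_to_map(10, 20, coeffs))  # 10 columns right, 20 rows down
```

A world file records only the transformation, not the coordinate system, datum, or map units, so those still have to be taken from the metadata or from the tags embedded in the .tif file.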
Importing DXFs to ArcGIS
ArcGIS assumes that any map layer is georeferenced to a particular location on the earth's surface. As powerful and universal an approach as this may be for combining spatial data from different sources, many kinds of spatial analysis do not require georeferencing. Such analyses may be carried out with data from the Comparative Archaeology Database irrespective of the availability or precision of georeferencing information. Subjects of such analysis include spatial clustering or dispersion, centralization, settlement hierarchy, network relationships, similarity of multiple spatial distributions, and many others. Most of the regional datasets in the Comparative Archaeology Database also include information on such environmental elements as topography, hydrology, soils, and other resources. These different map layers within a dataset are properly referenced to each other, even for datasets collected before attention to precise georeferencing was common or even possible for some regions. These sets of geographic information are internally consistent and analyzable on their own terms without need of georeferencing. Some GIS programs (including AutoCAD Map, GRASS, Idrisi and others) make it easy to carry out analyses in whatever internally consistent coordinate system is convenient, irrespective of georeferencing. Queries from users of ArcGIS, however, reveal that many are stymied by ArcGIS's insistence that all map layers be georeferenced. This issue can be resolved in ArcGIS, as discussed below. Georeferencing can also be accomplished even when georeferencing metadata are not available, also discussed below. These notes are not intended to substitute for ArcGIS documentation but simply to call attention to relevant ArcGIS tools that some users may not be aware of.
A .dxf file, for example, can be imported to ArcGIS by dragging it from the Catalog pane or using the Add Data shortcut. Ignore the ArcGIS warning that spatial information is missing and proceed. The AutoCAD entities now appear under the file Group Layer in the ArcGIS Table of Contents. AutoCAD Text entities appear as ArcGIS Annotation features; AutoCAD Point entities appear as ArcGIS Point features; AutoCAD Polyline entities appear as ArcGIS Polyline features; and AutoCAD Closed Polyline entities appear as ArcGIS Polygon features. The Attribute Tables for the ArcGIS features contain variables corresponding to various properties of the AutoCAD entities, including, for example, Layer, Color, and others.
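Before importing, it can be useful to know which kinds of entities a .dxf file contains. The Python sketch below is a deliberately simplified reader for ASCII Release 12 .dxf text: it only walks the group code and value pairs of the ENTITIES section and tallies entities by the feature type described above. It is an illustration under those assumptions, not a full DXF parser, and the sample content is invented.

```python
def read_entities(dxf_text):
    """Return (entity_type, {group_code: value}) for each entity in ENTITIES."""
    lines = [line.strip() for line in dxf_text.splitlines()]
    pairs = list(zip(lines[0::2], lines[1::2]))  # (group code, value) pairs
    entities = []
    in_entities = False
    for index, (code, value) in enumerate(pairs):
        if code == "2" and index > 0 and pairs[index - 1] == ("0", "SECTION"):
            in_entities = value == "ENTITIES"
        elif code == "0" and value == "ENDSEC":
            in_entities = False
        elif in_entities and code == "0":
            entities.append((value, {}))  # group code 0 starts a new entity
        elif in_entities and entities:
            entities[-1][1].setdefault(code, value)
    return entities


def feature_type(entity_type, groups):
    """Feature type an entity would become, following the mapping in the text."""
    if entity_type == "TEXT":
        return "Annotation"
    if entity_type == "POINT":
        return "Point"
    if entity_type == "POLYLINE":
        # Bit 1 of group code 70 marks a closed polyline.
        return "Polygon" if int(groups.get("70", "0")) & 1 else "Polyline"
    return None  # VERTEX, SEQEND, and anything else are not counted


SAMPLE_DXF = "\n".join([
    "0", "SECTION", "2", "ENTITIES",
    "0", "TEXT", "8", "SITES", "10", "5.0", "20", "5.0", "1", "S-12",
    "0", "POLYLINE", "8", "SITES", "66", "1", "70", "1",
    "0", "VERTEX", "8", "SITES", "10", "0.0", "20", "0.0",
    "0", "VERTEX", "8", "SITES", "10", "10.0", "20", "0.0",
    "0", "VERTEX", "8", "SITES", "10", "10.0", "20", "10.0",
    "0", "SEQEND",
    "0", "POINT", "8", "FINDS", "10", "3.0", "20", "4.0",
    "0", "ENDSEC", "0", "EOF",
])

counts = {}
for entity_type, groups in read_entities(SAMPLE_DXF):
    kind = feature_type(entity_type, groups)
    if kind is not None:
        counts[kind] = counts.get(kind, 0) + 1
print(counts)
```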
If georeferencing metadata are available, this information can be used to set the properties of the ArcGIS Data Frame. This will likely require specifying the coordinate system, the base datum, and the map units. ArcGIS, for example, is one of several GIS programs that are quite persistent about using meters as the map units for the UTM projection system, whereas other software (including AutoCAD Map, GRASS, and Idrisi) allows more flexibility. UTM-based datasets may often use kilometers as the basic map unit.
ArcGIS provides excellent tools for georeferencing maps for which georeferencing metadata are not available or seem to be inaccurate. These tools are found on the Georeferencing and Spatial Adjustment toolbars. The most powerful and flexible tools allow establishing Control Points on the ungeoreferenced map and telling ArcGIS what the real-world coordinates of those control points are in a known coordinate system. Specific locations of sites, rivers, highways, modern towns, etc. are often included in a GIS dataset. The coordinates of these places can be established, in the UTM system or Lat/Long, based on the WGS84 datum, by finding them in a source such as Google Earth. To the extent that the coordinates of a number of control points can be established accurately in this way, georeferencing may have sufficient precision to permit overlays of georeferenced data from other sources. In any case of overlaying data from different sources, the wise analyst insists on visual verification of the accuracy of the spatial match. (Just how well do the locations of rivers, highways, or other identifiable features in both map layers match once they are georeferenced and overlaid?)
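The arithmetic behind control-point georeferencing can be sketched outside any GIS. The Python example below fits a least-squares affine transformation from an arbitrary local coordinate system to real-world coordinates and reports the root-mean-square residual at the control points. It is an illustration of the general technique, not a description of what ArcGIS does internally, and the control point coordinates are invented.

```python
def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for i in range(3):
        pivot = max(range(i, 3), key=lambda r: abs(a[r][i]))
        if abs(a[pivot][i]) < 1e-12:
            raise ValueError("control points are collinear or too few")
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(i + 1, 3):
            factor = a[r][i] / a[i][i]
            for k in range(i, 4):
                a[r][k] -= factor * a[i][k]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (a[i][3] - sum(a[i][k] * x[k] for k in range(i + 1, 3))) / a[i][i]
    return x


def fit_affine(local_points, world_points):
    """Least-squares fit of X = a*x + b*y + c and Y = d*x + e*y + f."""
    if len(local_points) < 3 or len(local_points) != len(world_points):
        raise ValueError("need at least three matched control points")
    sx = sum(x for x, y in local_points)
    sy = sum(y for x, y in local_points)
    normal = [
        [sum(x * x for x, y in local_points), sum(x * y for x, y in local_points), sx],
        [sum(x * y for x, y in local_points), sum(y * y for x, y in local_points), sy],
        [sx, sy, float(len(local_points))],
    ]
    coeffs = []
    for axis in (0, 1):  # solve once for X, once for Y
        targets = [w[axis] for w in world_points]
        rhs = [
            sum(x * t for (x, y), t in zip(local_points, targets)),
            sum(y * t for (x, y), t in zip(local_points, targets)),
            sum(targets),
        ]
        coeffs.extend(solve3(normal, rhs))
    return tuple(coeffs)


def apply_affine(point, coeffs):
    a, b, c, d, e, f = coeffs
    x, y = point
    return a * x + b * y + c, d * x + e * y + f


def rms_residual(local_points, world_points, coeffs):
    """Root-mean-square distance between fitted and stated control positions."""
    total = 0.0
    for point, (wx, wy) in zip(local_points, world_points):
        px, py = apply_affine(point, coeffs)
        total += (px - wx) ** 2 + (py - wy) ** 2
    return (total / len(local_points)) ** 0.5


# Invented control points: local grid coordinates and their UTM equivalents.
local = [(0.0, 0.0), (1000.0, 0.0), (0.0, 1000.0), (1000.0, 1000.0)]
world = [(500000.0, 4200000.0), (501000.0, 4200000.0),
         (500000.0, 4201000.0), (501000.0, 4201000.0)]

coeffs = fit_affine(local, world)
print(apply_affine((500.0, 500.0), coeffs))
print(rms_residual(local, world, coeffs))
```

A small residual shows only that the control points agree with each other; it does not replace the visual check of rivers, highways, and other shared features that the text recommends.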
If the intended analysis does not involve overlays of spatial data from other sources, but only requires internal consistency within a single dataset, then it may be convenient in ArcGIS to place the dataset quickly in its approximate location in the world, using the above approach. Great investment in achieving high precision is not necessary. As long as all imported elements of an internally consistent GIS dataset are treated in the same way, the internal consistency will be maintained.
Once spatial data are imported to ArcGIS, a Shapefile can be created with the Export Data option. If the Attribute Table containing the AutoCAD properties of the entities is no longer attached to them, it can be reattached, for example with the Feature to Point tool in the Data Management section of the ArcToolbox.
In a .dxf file, the identifiers of polygons often take the form of AutoCAD Text entities whose Insertion Points are located inside the Closed Polylines that delineate the polygons. When such a map is imported to ArcGIS, the connection between the polygons and their identifiers must be re-established. This can be done with a Spatial Join. The layer with the polygons is the Target Features layer, and the Annotation Features layer that the AutoCAD Text entities have become is the Join Features layer. The Join Operation is JOIN_ONE_TO_ONE. A new layer will be created with the polygons, and its Attribute Table will include the text-string identifiers.
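The logic of this join is a point-in-polygon test on each text insertion point. The Python sketch below shows that logic with a standard ray-casting test. It illustrates the idea only and is not ArcGIS code; the polygons and labels are invented.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point lies inside polygon (a list of (x, y))."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count edges that a horizontal ray from the point to the right crosses.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def join_labels(polygons, labels):
    """One-to-one join: give each polygon the first label found inside it."""
    joined = []
    for polygon in polygons:
        match = next(
            (text for text, insertion in labels if point_in_polygon(insertion, polygon)),
            None,  # no text entity falls inside this polygon
        )
        joined.append((match, polygon))
    return joined


# Invented data: three square polygons and two text insertion points.
polygons = [
    [(0, 0), (10, 0), (10, 10), (0, 10)],
    [(20, 0), (30, 0), (30, 10), (20, 10)],
    [(40, 0), (50, 0), (50, 10), (40, 10)],
]
labels = [("S-13", (25.0, 5.0)), ("S-12", (5.0, 5.0))]

for identifier, polygon in join_labels(polygons, labels):
    print(identifier, polygon[0])
```

A polygon that receives no identifier, like the third one here, is worth checking by eye, since it usually means a text entity was placed outside its outline.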
These identifiers can then be the basis of attaching other data tables (in the form, for example, of a spreadsheet) to the polygons. By default, identifiers in an Annotation layer made this way will be text fields. If the external data table has identifiers in a number field, one or the other must be changed, as ArcGIS will not recognize a match between a number field and a text field. To change the field type in ArcGIS, create a new column in the table with the desired field type, and use the Field Calculator to calculate its values based on the existing identifier field. The external table can then be linked to the polygons with the Join Attributes from a Table tool found in the Join Data dialog. The field containing the identifiers must be selected for both the Annotation Table and the external data table. The columns from the external table will be added to the Annotation Table. The entire structure can then be saved as a Shapefile with the Export Data tool.
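The field-type problem is easy to reproduce outside ArcGIS. In the Python sketch below, text identifiers fail to match numeric ones until a new numeric field is calculated from the text field, which is the same remedy described above. The field names (label_text, site_no, site, sherds) and all values are hypothetical.

```python
# Rows standing in for the polygon attribute table; identifiers are text.
polygon_rows = [
    {"label_text": "12", "area_ha": 1.4},
    {"label_text": "13", "area_ha": 0.6},
    {"label_text": "unnumbered", "area_ha": 0.2},
]

# Rows standing in for an external spreadsheet; identifiers are numbers.
external_rows = [
    {"site": 12, "sherds": 340},
    {"site": 13, "sherds": 57},
]


def add_number_field(rows, text_field, number_field):
    """Add a numeric copy of a text identifier (None if it is not a whole number)."""
    for row in rows:
        text = row[text_field].strip()
        row[number_field] = int(text) if text.isdigit() else None


def join_table(rows, key_field, other_rows, other_key):
    """Append the columns of other_rows to rows where the key values match."""
    lookup = {other[other_key]: other for other in other_rows}
    joined = []
    for row in rows:
        merged = dict(row)
        match = lookup.get(row[key_field], {})
        merged.update({k: v for k, v in match.items() if k != other_key})
        joined.append(merged)
    return joined


# Text "12" and number 12 are not equal, so a direct join finds nothing.
unmatched = join_table(polygon_rows, "label_text", external_rows, "site")

add_number_field(polygon_rows, "label_text", "site_no")
matched = join_table(polygon_rows, "site_no", external_rows, "site")

print(unmatched[0])
print(matched[0])
```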
Guide for Authors: ArcGIS to DXF
We are happy to carry out the conversion of all vector spatial data to .dxf format after you submit your files. For authors wishing to carry out the conversion themselves, however, we provide the following guide, adapted from this ESRI walkthrough for ArcGIS Pro. It provides step-by-step instructions on how to create georeferenced .dxf files that have attribute IDs as text strings whose insertion points fall within the polygons they represent: download guide.