Test data

Table of Contents

  • Discussion of classifications
    • Simple vs Composite
    • Rendered vs Non-rendered
    • Static vs Dynamic
    • Active vs Passive
    • Multiple-classifications
    • Summary
  • Classifications, summary of examples and testbeds
  • Detailed test data examples

Discussion of classifications

It is impossible to give an exhaustive list of types of digital objects, yet it is useful to remind ourselves of at least some of the great variety that we must be able to deal with. By types we mean not just different formats, but rather different classifications.

One reason for being interested in the variety of types is that unless one is aware of the distinctions it is very easy to assume that everything is the same and the same tools can be used. For example, if one normally deals with the preservation of documents such as Word or PDF files, then one might assume that all digitally encoded information can be preserved using the same tools. Unfortunately this is not true, as we will see. The next sections present a brief overview of some of the distinctions which can be made, without any claim of being exhaustive.

Simple vs Composite

One way to classify digital objects is by whether they are normally treated as a whole (for example an image) or as a collection of simpler parts (for example a FITS file which has several images and tables). The latter we will call Composite (or sometimes Complex) objects.

face.jpg

It is important to make this distinction because if we can break the preservation challenge of a composite object into smaller components then it will make the preservation task easier. On the other hand if we treat the composite object as if it were a simple one then we could run into a great deal of trouble in future.

However it is never so clear cut, because whether a digital object is simple or composite often depends upon the eye of the beholder. A Word document may normally be treated as a simple object. In actual fact it is, internally, very complex, containing information about styles, page layout and so on. However one normally disregards this because the software we use deals with the Word file as a whole. On the other hand some Word files have embedded spreadsheets and drawing objects which can be edited separately; in this case one might often treat such an object as a collection of parts.

The FITS file is a whole digital object but the analysis is normally done on a component by component basis, in other words Image 1 is displayed and processed, and the same with Image 2.

A particular format may allow many possibilities, and such formats may evolve and increase in complexity over time [209]. The original FITS format allowed simple images; the current definition allows much greater complexity – but can still contain a single image. Thus we need to be concerned with the particular digital object, not the format, when we look at whether it is simple or composite.

CompositeObject.jpg FITS file as a composite object

ContainerObject.png Composite object as a container

Rendered vs Non-rendered

Another way to divide the digital world is as follows.

There are digital objects which are usually processed by some software to produce a rendering which is presented to a human user who can then interpret what he/she sees/hears/feels/tastes. This can include documents, pictures, videos and sounds. These we will refer to as Rendered Digital Objects.

On the other hand one can have a digital object for which it is not enough to simply render it but for which one needs to know what the contents mean in order to be able to further process it. It is useful to make this distinction because it is easy to think that every digital object is simply rendered; that every digital object need only be displayed.

Indeed one could argue that the ultimate user of a digital object is a human who needs to see or hear (or perhaps in future to feel, taste or smell) the result. For example even a FITS image is (often) displayed.

However displaying a FITS image is rarely the ultimate aim. Instead an astronomer might want to make measurements which require an understanding of the units and coordinate systems. He/she might also reasonably want to combine this piece of data with another. In other words what is wanted is to do more than render it in one particular way; instead there is an enormous variety of ways users may want to deal with the object. When we are thinking about digital preservation one must look to the future, not in order to guess what it may be but rather to recognise that it may be different from today. Therefore we need to identify what someone (at least the Designated Community) needs in order to understand and use a non-rendered digital object in any number of different ways. Of course it is not always clear cut.

For example consider two text files. In one case one can have some English text, say a recipe for a cake in a file “recipe.txt”.

recipe.jpg

Using a Windows PC the file is easily readable because the “.txt” part of the name lets the machine try an application which can display an ASCII encoded file, which is what this is. Normally one would say that no special knowledge is needed to understand this; it simply needs to be read. However there is a requirement to be able to read English, to know what the various measures are (for example what size is “a cup”?) and to know what the ingredients are (for example what is “lemon zest”?); without such knowledge the recipe is neither understandable nor usable.

Consider now another text file (“table.txt”) which, as a simple “.txt” file, is easily readable on a PC. Again the “.txt” usually lets us guess, correctly in this case, that this is an ASCII encoded file. In this case we are more obviously in some trouble because although we can see something which we can reasonably assume are numbers, we do not know what the numbers mean. If we are told that the numbers under the headings “X”, “Y” and “Z” provide us with the sides of a rectangular cuboid, then we can calculate the volume of that shape using the formula “X*Y*Z” for each row, namely 14.742, 31.8 and 114.034.

table-text.jpg

On the other hand we might be told that “X” is the longitude on Earth, “Y” the latitude, both measured in degrees and “Z” is the concentration of a certain chemical in parts per billion.

We see that the format alone is insufficient; one needs to know what the contents (e.g. the numbers) mean.
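The point can be sketched in a few lines of Python. The same three columns are parsed identically in both cases; only the externally supplied meaning changes what can sensibly be done with them. The rows below are invented for illustration (the real contents of table.txt appear only as an image), and the function names are hypothetical.

```python
# Hypothetical contents for a file like "table.txt"; these numbers are
# invented for illustration and are not the values in the original figure.
TABLE_TXT = """\
X Y Z
1.5 2.0 3.0
4.0 0.5 2.5
"""

def read_table(text):
    """Parse whitespace-separated columns into a list of dicts of floats."""
    lines = text.strip().splitlines()
    names = lines[0].split()
    return [dict(zip(names, map(float, line.split()))) for line in lines[1:]]

def as_cuboid_volumes(rows):
    """Interpretation 1: X, Y and Z are the sides of a rectangular cuboid."""
    return [row["X"] * row["Y"] * row["Z"] for row in rows]

def as_concentrations(rows):
    """Interpretation 2: X = longitude (deg), Y = latitude (deg), Z = ppb."""
    return [
        {"longitude_deg": row["X"], "latitude_deg": row["Y"], "ppb": row["Z"]}
        for row in rows
    ]

rows = read_table(TABLE_TXT)
print(as_cuboid_volumes(rows))   # only meaningful under interpretation 1
print(as_concentrations(rows))   # only meaningful under interpretation 2
```

Nothing in the file itself says which of the two functions is the right one to call; that knowledge has to be preserved alongside the bits.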

By Non-Rendered Digital Object we mean things which, like table.txt, are not simply rendered but rather are to be processed to produce any number of possible outputs. For example table.txt could be plotted, displayed as a pie-chart or histogram. Alternatively the information in the columns of table.txt could be used to calculate the density of chlorophyll in the Amazon rain forest (if that is the sort of information there is in table.txt). As another example one can take a digital object from the GOME instrument [237], which might be as shown in Figure 14 - Figure 16.

GOME-binary.png GOME data - binary

GOME-numbers.png GOME data - as numbers/characters

GOME-derived.png GOME data - processed to show Ozone data with particular projection

We can also have two files of the same format, say a sound file such as MP3, the first of which (“music.mpg”) is indeed something that can be used to play music, but a second, also an MP3 file (“config.mpg”), which contains numbers which are configuration parameters for setting up some software. If we click on the first on a home computer then it will play some music because the “.mpg” makes the computer try to use a music application. Clicking on the second will cause the computer to try to use that same application but it may produce only a brief grating sound, or perhaps nothing audible at all. The important points are that we currently rely on many clues, such as having a file ending “.txt” or “.mpg” which many computers use to choose an application for displaying or playing the file. On the other hand, even now these clues are insufficient, as with “table.txt” (Figure 13).

Of course computers are not intelligent - in fact they have been instructed which applications to use for which file extensions, for example Notepad for files with names ending in “.txt”. Sometimes this does not do what is expected, as with “config.mpg”. In other cases we can do something with the file but not very much, as with “table.txt”.
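As a rough illustration of how little the computer actually knows, the following Python sketch chooses an application from the file name alone. The lookup table and application names are placeholders assumed for this example, not the configuration of any real operating system.

```python
import os

# Hypothetical extension-to-application table, similar in spirit to the
# associations a desktop operating system keeps. The application names
# are placeholders, not real program identifiers.
APPLICATION_FOR_EXTENSION = {
    ".txt": "text viewer",
    ".mpg": "music player",
    ".jpg": "image viewer",
}

def choose_application(filename):
    """Pick an application from the file name alone, ignoring the contents."""
    extension = os.path.splitext(filename)[1].lower()
    return APPLICATION_FOR_EXTENSION.get(extension, "unknown")

for name in ["recipe.txt", "table.txt", "music.mpg", "config.mpg"]:
    print(name, "->", choose_application(name))
```

Both “music.mpg” and “config.mpg” are sent to the same application, and both text files to the same viewer, whatever they contain.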

Some others mentioned in the introduction, such as family photographs (“face.jpg”, Figure 9) are very similar in that what one expects is to display or play contents of the file and then it is up to the viewer, or listener, to understand it. Of course one is not listening to the bits – what we mean is that there is an application which is used to convert the bits to an image or a sound. The application may also allow one to zoom in to part of an image or search for a piece of text or copy a piece of music and insert it in a separate file. But even without these extra functions, one can make use of the file, by which we mean we can look at or hear the output of the application and we would be quite happy if that was all we could do.

These types of files (let’s use the term Digital Object as a more general term instead of “file”) we will refer to as Rendered Digital Objects. For these types of objects it is (currently) normally regarded as sufficient if in future one can simply display it if it is an image or movie, or play it if it is a sound. These are the types of digital objects which one commonly deals with in everyday life: documents, images, web pages etc. There are many books which talk about the preservation of these kinds of objects:

  • word processor documents
  • financial files
  • spreadsheets
  • databases of various sorts
  • .....

Throughout this book we will also look at examples from a variety of disciplines including science, cultural heritage and contemporary performing arts.

Science:
  • Observations of the Earth from space, including multi-spectral images and synthetic aperture radar images
  • Measurements of the atmosphere, chemical or electrical composition
  • Software for processing raw data into data which is scientifically useful

Cultural Heritage:
  • Laser scans of buildings and artefacts
  • Plans of buildings
  • 3-D virtual reality models

Performing Arts:
  • “patch” files for processing what the performer plays
  • configuration files which map video capture of movement to musical performance

All the above are just some of the examples of “non-rendered” data which are of importance to society.

Static vs Dynamic

It may seem strange to consider the preservation of non-static files. However Buneman has argued (http://www.era.lib.ed.ac.uk/bitstream/1842/3203/1/IJDC_Iss3_Vol4_Buneman_et_al.pdf) that these are an important class of objects. The basic idea is that the various changes are themselves important and there is a desire to make queries about such changes.

It may be argued that such changes are more an issue of Provenance, or that a series of snapshots of the object between changes is what is being preserved. However Buneman argues that curated databases, which are the examples of objects he focuses on, are extremely important.

The CIA World Factbook is the example often used. Buneman says "The CIA World Factbook is a prime example of a curated database – a database that is constructed and maintained with a great deal of human effort in collecting, verifying, and annotating data. Preservation of old versions of the Factbook is important for verification of citations; it is also essential for anyone interested in the history of the data such as demographic change."

Digital objects do (usually) need software and hardware to extract information from the bits. Static objects are ones which, unless they are transformed, are unchanged as bit sequences. These we will refer to as Static Digital Objects.

On the other hand we can think about database files which naturally change over time as entries are changed. Alternatively we can consider a whole collection of files as the data object. Such a collection might change as additional files are added to the collection over time. Such digital objects we will refer to as Dynamic Digital Objects.

Of course at any particular time the Dynamic Digital Object is a particular Static Digital Object which we may preserve. On the other hand it may be of interest, in the case of a Dynamic Digital Object, to know what the state of the object was at any particular time. In fact some would say most datasets change over time and the state at each particular moment in time may be important. This is an important area requiring further research; however from the point of view in this book it may be useful to break the issue into separate parts. At each moment in time we could, in principle, take a snapshot and store it. That snapshot has its associated Representation Net. Efficient storage of a series of snapshots may lead one to store differences or include time tags in the data. Additional Representation Information would be needed which describes how to get to a particular time's snapshot from the efficiently encoded version.
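The idea of storing differences with time tags, plus a description of how to recover a given moment's snapshot, can be sketched as follows. This is only an illustration under simple assumptions: timestamps are plain integers, the object is a flat set of key/value entries, and all the data are invented.

```python
# Sketch: a dynamic object stored as one base snapshot plus time-tagged changes.
# Timestamps are plain integers and the record contents are invented.

def state_at(base, changes, when):
    """Rebuild the snapshot valid at time `when`.

    `base` is a dict holding the initial snapshot; `changes` is a list of
    (timestamp, key, new_value) tuples sorted by timestamp. A new_value of
    None means the entry was deleted.
    """
    snapshot = dict(base)          # copy, so the stored base is never altered
    for timestamp, key, value in changes:
        if timestamp > when:
            break                  # later changes are not yet visible
        if value is None:
            snapshot.pop(key, None)
        else:
            snapshot[key] = value
    return snapshot

base = {"population": 100, "capital": "A"}
changes = [
    (2, "population", 110),
    (5, "capital", "B"),
    (7, "population", None),
]

print(state_at(base, changes, 1))
print(state_at(base, changes, 5))
print(state_at(base, changes, 9))
```

In these terms, the rule implemented by `state_at` is the kind of thing the additional Representation Information would have to describe, since without it the stored differences cannot be turned back into a snapshot.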

Active vs Passive

One other useful distinction is between what may be called active and passive digital objects.

By Passive Digital Object we mean something which is used by other applications (software) to do something. For example a document file is used by a word processing programme to print the document or display it on the screen, or an astronomical image in a FITS file would be used by astronomical analysis software to do scientific research. Such digital objects are often referred to as “data” but since the term Data Object is already used by OAIS we prefer the term Passive Digital Object.

An Active Digital Object on the other hand does something. For example the word processing application or the astronomical analysis software mentioned in the previous paragraph might be the digital objects to be preserved.

Once again there will always be fuzzy boundaries, so one could consider an Access[TM] database as a Passive Digital Object - used by the Access software - but it could easily itself contain software (for example some form of BASIC) which would mean that it could be considered to be an Active Digital Object.

Multiple-classifications

The classifications are not mutually exclusive, and in fact one can think of a simple-rendered-static-passive object – the image “face.jpg” is an example of this. One can also have a composite-non-rendered-dynamic-active object such as a database with built in queries into which new rows are being inserted. The Word.exe executable file may be thought of as a composite-non-rendered-static-active object.

Multiple-classes.jpg a representation of multiple classification – although we are limited to drawing in 3-dimensions!
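One way to picture the four independent distinctions is as four yes/no flags attached to each object. The Python sketch below is purely illustrative; the class and method names are assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Classification:
    """Four independent yes/no distinctions for a digital object."""
    rendered: bool
    static: bool
    simple: bool
    passive: bool

    def label(self):
        parts = [
            "simple" if self.simple else "composite",
            "rendered" if self.rendered else "non-rendered",
            "static" if self.static else "dynamic",
            "passive" if self.passive else "active",
        ]
        return "-".join(parts)

face_jpg = Classification(rendered=True, static=True, simple=True, passive=True)
word_exe = Classification(rendered=False, static=True, simple=False, passive=False)
print(face_jpg.label())
print(word_exe.label())

# Four binary distinctions give sixteen possible combinations.
all_combinations = [
    Classification(*flags) for flags in product([True, False], repeat=4)
]
print(len(all_combinations))
```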

Summary

The purpose of this chapter has been to provide a partial view of the variety of types of digital objects which exist “in the wild” and which one might be required to preserve. The reason has been to ensure that the reader can at least recognise the possibilities when confronted with the challenge of preserving a digital object. Later chapters will discuss preservation techniques for some of this multitude of possibilities.

Classifications, summary of examples and testbeds

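|           |         |         |          | CDO changed?                                              | Y   | Y   | Y   | Y   | Y   | Y   | Y   | Y   | N   | N   | N   | N   | N   | N   | N   | N   |
|           |         |         |          | Access service changed?                                   | Y   | Y   | Y   | Y   | N   | N   | N   | N   | Y   | Y   | Y   | Y   | N   | N   | N   | N   |
| Rendered? | Static? | Simple? | Passive? | Example                                                   | SPT | TPT | DCT | MVT | SPT | TPT | DCT | MVT | SPT | TPT | DCT | MVT | SPT | TPT | DCT | MVT |
| Y         | Y       | Y       | Y        | simple JPEG image; Word 97 test file; Word docx test file | y   | y   | y   | ?   | y   | y   | y   | ?   | y   | y   | y   | ?   | ?   | y   | y   | ?   |
| Y         | Y       | Y       | N        | Probably no such objects.                                 | -   | y   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   |
| Y         | Y       | N       | Y        | Word file with macros                                     | y   | y   | y   | ?   | y   | y   | y   | ?   | y   | y   | y   | ?   | ?   | y   | y   | ?   |
| Y         | Y       | N       | N        | Probably no such objects.                                 | -   | y   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   |
| Y         | N       | Y       | Y        | Active Blog                                               | -   | y   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   |
| Y         | N       | Y       | N        | Probably no such objects.                                 | -   | y   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   |
| Y         | N       | N       | Y        | ?                                                         | -   | y   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   |
| Y         | N       | N       | N        | Probably no such objects.                                 | -   | y   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   |
| N         | Y       | Y       | Y        | Time series                                               | -   | y   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   |
| N         | Y       | Y       | N        | Executable file(?)                                        | -   | y   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   |
| N         | Y       | N       | Y        | FITS files test data; ESA data                            | -   | y   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   |
| N         | Y       | N       | N        | Linux "file" executable                                   | -   | y   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   |
| N         | N       | Y       | Y        | Non-static objects                                        | -   | y   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   |
| N         | N       | Y       | N        | Probably no such objects.                                 | -   | y   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   |
| N         | N       | N       | Y        | Java class files                                          | -   | y   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   |
| N         | N       | N       | N        | Transactional database with built-in procedures           | -   | y   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   | -   | -   | y   | -   |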

Detailed test data examples


Each entry below lists: Description, Data type name, Rendered?, Static?, Simple?, Passive?, Keywords and Preservation issues.
DataTypeExampleID10008
Description: ASCII text file
Data type name: Simple ASCII text file
Rendered? y, Static? y, Simple? y, Passive? y

Overview

The ASCII file is normally thought of as being made humanly readable by rendering, and hence is marked as such above. However such files may also be data, for example as a configuration file.

Designated Community

English speakers.

Preservation objectives

For the DC to be able to render, read and understand the text contained in the file.

Specific issues

The ASCII bit encoding is well understood. Some semantics may be needed to fully understand the text.

DataTypeExampleID10009
Description: Simple JPEG image
Data type name: Simple JPEG image
Rendered? y, Static? y, Simple? y, Passive? y
Keywords: image

Overview

This is a simple JPEG. As far as I know there are no hidden semantics associated with it. The Transformational Information Properties are those to do with the overall appearance (geometry, colours) so that the original and transformed appear similar to the human eye.

Designated Community

Everyone

Preservation objectives

To be able to render the image such that the picture can be recognised as a person, and which could be compared with other images of that person and recognised as the same. There are no special semantics associated with the picture. The Transformational Information Properties would be the colour scheme, e.g. skin tones.

Specific issues

None

DataTypeExampleID10010
Description: Various astronomical FITS files
Data type name: Science data
Rendered? n, Static? y, Simple? n, Passive? y

Overview

For tools see http://www.fileinfo.com/extension/fits

The NASA tool at http://heasarc.gsfc.nasa.gov/docs/software/ftools/fv/ is useful.
Some specific points about the data files:

  • swp05569slg.fits: IUE image, low resolution - note that the second image is the "Quality Flag" image, which gives information about the quality (reliability/issues/problems) of individual pixels in the first image
  • orion-16.fits: Simple FITS file - note however that "This image also tests the blank pixel convention. This image contains 144 blank pixels in the centers of the saturated images of the four Trapezium stars". (Blank pixel has value -32768 according to parameter BLANK). Note that there is no World Coordinate System
  • file003.fits: FITS file containing AIPS data. The second component is the "AIPS CC" extension which contains the "AIPS Clean Component" which can be used to remove confusing sources which are buried in the sidelobe structure of the phase calibrator
  • file002.fits: FITS ASCII table. Note that the Primary extension is empty and the table is in the "NoName" extension
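The role of a header parameter such as BLANK can be illustrated with a small Python sketch. It relies on the general FITS header layout (80-character records padded into 2880-byte blocks) but is deliberately simplified, and the header and pixel values are synthetic, built in memory for illustration rather than read from the test files listed above. In practice one would normally use an established FITS library or a tool such as fv.

```python
CARD = 80      # each FITS header record ("card") is 80 ASCII characters
BLOCK = 2880   # headers are padded with spaces to a multiple of 2880 bytes

def make_card(keyword, value):
    """Build one simplified 'KEYWORD = value' card (no comment field)."""
    return f"{keyword:<8}= {value:>20}".ljust(CARD)

def build_header(cards):
    """Join cards, append END, and pad to a whole number of blocks."""
    body = "".join(cards) + "END".ljust(CARD)
    blocks = -(-len(body) // BLOCK)  # ceiling division
    return body.ljust(blocks * BLOCK).encode("ascii")

def parse_header(raw):
    """Return {keyword: value string} for the 'KEYWORD = value' cards."""
    header = {}
    for offset in range(0, len(raw), CARD):
        card = raw[offset:offset + CARD].decode("ascii")
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] == "= ":
            # Simplification: ignores quoted strings that contain "/".
            header[keyword] = card[10:].split("/", 1)[0].strip()
    return header

# Synthetic header, invented for illustration.
raw = build_header([
    make_card("SIMPLE", "T"),
    make_card("BITPIX", 16),
    make_card("NAXIS", 2),
    make_card("NAXIS1", 2),
    make_card("NAXIS2", 2),
    make_card("BLANK", -32768),
])
header = parse_header(raw)
blank = int(header["BLANK"])

pixels = [120, -32768, 340, -32768]          # invented pixel values
valid = [p for p in pixels if p != blank]    # drop the blank pixels
print(header)
print(valid)
```

Software that did not know the meaning of BLANK would treat -32768 as a real measurement, which is exactly the kind of convention that has to be preserved with the data.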

Designated Community

Astronomers or people with a graduate level knowledge of astronomy.

Preservation objectives

The DC should be able to extract astronomically useful information from these kinds of images.

Specific issues

The FITS standards are currently readily available, however some of the data uses various dictionaries, and some use non-standard conventions (e.g. use of the blank pixel convention)

DataTypeExampleID10011
Description: Simple Word 97 file created for test purposes
Data type name: Word 97 test file
Rendered? y, Static? y, Simple? y, Passive? y

Overview

This is a simple Word file which contains a little joke about the semantics.

Designated Community

Essentially everyone.

Preservation objectives

The DC should be able to render the text and understand what it means. If the colour of some of the text is important then the semantics should be captured. The joke in the text should be explained.

Specific issues

None

DataTypeExampleID10012
Description: Word file in DOCX format created with Word 2010
Data type name: Simple docx file
Rendered? y, Static? y, Simple? y, Passive? y

This was created from the Word 97 test file.

The rendering should be the same as for the Word 97 file.

There is again the joke about semantics - that will be preserved with a simple rendering, but perhaps the semantics of the joke needs some Semantic Representation Information.

DataTypeExampleID10013
Description: Very simple Word file with macros created as a test
Data type name: Word file with macros
Rendered? y, Static? y, Simple? n, Passive? y

Overview

The macros are presumably some kind of code held within the Word file. I'm not sure whether the macros make this a non-passive object. On the whole I would say not, since the macro is used here as a simple formatting device. However it is probably fair to say it is not "simple" in that the macros are distinct from the material which is rendered.

Designated Community

Essentially everyone

Preservation objectives

Should be able to render the text. The macro should also be usable.

Specific issues

None

DataTypeExampleID10014
Description: ESA test data - various science files
Data type name: Science data
Rendered? n, Static? y, Simple? n, Passive? y

Overview

This is a collection of several ESA data files from the GOME satellite. As science data, in general various aspects may be visualised/rendered during its analysis, but there is no single preferred rendering. Moreover the important capability is to be able to understand the semantics associated with the various individual data elements so that they can be combined with other data sources.

Designated Community

The most specific would be: Science users - those with a knowledge of Earth Observation. The most general would be citizen scientists e.g. reasonably intelligent individuals with a level of knowledge up to but not including university/college education.

Preservation objectives

Designated Community should be able to extract and use scientific information about the Ozone concentration so that they can determine

Specific issues

GOME SAFE data file has a significant amount of documentation and also DRB detailed descriptions of the data down to the bit level. There may also be a significant amount of software available to examine the data. However any transformation of the datafile must be very carefully done in order to avoid corrupting the implicit semantic links between various data elements. Moreover the software is not generally useful for combining the data with other information.

DataTypeExampleID10015
Description: Explanation and example of why non-static data is of interest
Data type name: Non-static data
Rendered? n, Static? n, Simple? n, Passive? y

Overview

It may seem strange to consider the preservation of non-static files. However Buneman has argued (http://www.era.lib.ed.ac.uk/bitstream/1842/3203/1/IJDC_Iss3_Vol4_Buneman_et_al.pdf) that these are an important class of objects. The basic idea is that the various changes are themselves important and there is a desire to make queries about such changes.

It may be argued that such changes are more an issue of Provenance, or that a series of snapshots of the object between changes is what is being preserved. However Buneman argues that curated databases, which are the examples of objects he focuses on, are extremely important.

The CIA World Factbook is the example often used. Buneman says "The CIA World Factbook is a prime example of a curated database – a database that is constructed and maintained with a great deal of human effort in collecting, verifying, and annotating data. Preservation of old versions of the Factbook is important for verification of citations; it is also essential for anyone interested in the history of the data such as demographic change."

Designated Community

The CIA World Factbook is used by the English-speaking general public.

Preservation objectives

To ensure that the DC can understand the derivation and history of the various data entries, as well as to use the data contained therein.

Specific issues

DataTypeExampleID10016
Description: Java class file
Data type name: Software
Rendered? n, Static? y, Simple? n, Passive? n

Overview

Java class files are created (compiled) from Java source code. The class files run on a Java Virtual Machine (JVM) which in turn runs on other operating systems. Note that this particular application simply puts out "Hello World", usually on the user's screen. It does not need network access nor does it need to access data files.

Designated Community

The Designated Community could be computer scientists.

Preservation objectives

The DC would want to be able to run the class file on a JVM on whatever operating system is available at the time.
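Whether a given JVM can run a class file depends, among other things, on the class file format version recorded at the start of the file. The Python sketch below reads that version from the first eight bytes. The bytes used here are a synthetic prefix constructed for illustration, not the actual test file, and the function name is hypothetical.

```python
import struct

CLASS_MAGIC = 0xCAFEBABE  # first four bytes of every Java class file

def class_file_version(data):
    """Return (major, minor) from the first 8 bytes of a class file."""
    if len(data) < 8:
        raise ValueError("too short to be a class file")
    # Layout: 4-byte magic, 2-byte minor version, 2-byte major version,
    # all big-endian.
    magic, minor, major = struct.unpack(">IHH", data[:8])
    if magic != CLASS_MAGIC:
        raise ValueError("not a Java class file")
    return major, minor

# Synthetic 8-byte prefix, invented for illustration: major version 52
# corresponds to Java SE 8.
sample = struct.pack(">IHH", CLASS_MAGIC, 0, 52)
print(class_file_version(sample))
```

Recording this version alongside the preserved object gives a future user a starting point for choosing, or emulating, a suitable JVM.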

Specific issues

DataTypeExampleID10017
Description: Linux "file" executable
Data type name: Executable file
Rendered? n, Static? y, Simple? n, Passive? n

Overview

Executable files are an important type of digital object.

Designated Community

Essentially everyone

Preservation objectives

To be able to run this command

Specific issues

Needs the "magic" database and the required Linux libraries
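To indicate why the "magic" database matters, here is a toy Python sketch of identification by leading byte signatures. The table is a tiny hand-written stand-in assumed for this example; the real database used by the "file" command is far larger and uses much richer rules.

```python
# A tiny, hand-written "magic" table: leading byte signatures for a few
# formats. This is an illustrative stand-in, not the real magic database.
MAGIC_TABLE = [
    (b"\x7fELF", "ELF executable"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF", "PDF document"),
    (b"SIMPLE  =", "FITS file"),
]

def identify(data):
    """Return a description based on the first bytes, not the file name."""
    for signature, description in MAGIC_TABLE:
        if data.startswith(signature):
            return description
    return "unknown data"

print(identify(b"\x7fELF\x02\x01\x01"))
print(identify(b"SIMPLE  =                    T"))
print(identify(b"X Y Z\n1.5 2.0 3.0\n"))
```

Without the table the program can say nothing useful, which is why the database has to be preserved together with the executable and its libraries.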

Topic revision: r14 - 2012-06-09 - DavidGiaretta