Combined scenarios

STFC Scenarios

STFC Scientific Data - MSST

User Scenario ID S.1
Author David Giaretta STFC/APA
Background Scientific laboratories collect and archive data from various sources. It is important for other scientists to be able to use that data to, for example, reprocess it to confirm published results or, probably more frequently, to analyse it in new ways and/or combine it with data from other sources.
Type of digital information MSST radar data
Threat(s) to the data (1) Existing software libraries to access the data may be unusable, (2) the structure of the data, i.e. the format, may be forgotten, (3) the semantics, i.e. the meaning of the individual numbers, may not be understood, e.g. this number is a temperature measured in degrees C, measurement from (...) using a type of thermometer, and the raw values were turned into degrees C using this (...) calibration curve.
Designated Community Scientists involved in atmospheric physics
Preservation Technique Create a fairly complete Representation Information Network to analyse risks. Create additional Representation Information - Structure and Semantic. Save the BADC web site for information about the measurement instrument. Save the software and associated algorithms
Usage  
Success Criteria Ask members of the Designated Community and those from a closely related discipline if they would be able to sensibly use the data, given that additional Representation Information
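The risk analysis mentioned under Preservation Technique can be pictured as a walk over a dependency graph of Representation Information. The following is a minimal sketch only, not the tooling used for this scenario: the node names, the knowledge assumed for the Designated Community and the archive contents are all hypothetical.

```python
# Hypothetical Representation Information (RepInfo) network: each item lists
# the further RepInfo needed to understand it. All names are illustrative.
REPINFO_NETWORK = {
    "radar_data_file": ["format_description", "parameter_semantics"],
    "format_description": ["english_text"],
    "parameter_semantics": ["calibration_curve", "instrument_description"],
    "calibration_curve": ["english_text"],
    "instrument_description": ["english_text"],
    "english_text": [],
}


def missing_repinfo(item, network, known, archived):
    """Return RepInfo items needed for `item` that are neither known to the
    Designated Community nor held in the archive."""
    missing = set()
    seen = set()
    stack = [item]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in known:
            # The community already understands this; no need to go deeper.
            continue
        if node not in archived:
            missing.add(node)
        stack.extend(network.get(node, []))
    return missing


known = {"english_text"}  # assumed community knowledge
archived = {"radar_data_file", "format_description", "calibration_curve"}
gaps = missing_repinfo("radar_data_file", REPINFO_NETWORK, known, archived)
print(sorted(gaps))
```

Each item reported as missing is a candidate for the "additional Representation Information" that the scenario proposes to create.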

User Scenario ID S.1a
Author David Giaretta STFC/APA
Background The International Ultraviolet Explorer (IUE) was an astronomical satellite which obtained UV spectra of tens of thousands of astronomical objects. The data for one object consists of an image which has one or more spectral orders (each as a band across the image). The raw data is processed through several stages, first correcting photometrically and geometrically and then extracting the spectrum.
The original IUE processing software created what is called VICAR format; the VICAR files had binary (i.e. non text) header files followed by data with, in the processed files, various quality flags to show where pixels cannot be trusted.
Since the launch of IUE the FITS astronomical format has become the accepted astronomical format for data.
What actually happened was that it was decided to create the "IUE Final Archive" as a way to ensure that the scientific data collected by IUE would not be lost. In the process a new, more accurate way to process the raw data was developed; putting this to one side, however, there were a number of interesting considerations.
* The binary header data encoded temperatures and voltages of the instrument, whereas in FITS the headers are essentially text. Therefore the various temperatures and voltages were converted into physical units (degrees K or volts) and put into the header as numerical data in characters. The FITS headers take the form NAME = VALUE, where NAME is limited to 8 characters. Therefore a name, limited to 8 characters, must be created for each value.
* The quality flags needed to be converted into separate images within the FITS file. The meaning of each quality image pixel value needed to be defined (e.g. 1 means the pixel was saturated and so has no meaningful data, 2 means the calculated value is affected by reseaux marks and so should be regarded with suspicion...)
Type of digital information Astronomical data
Threat(s) to the data 1 Existing software libraries to access the data may be unusable,
2 the structure of the data i.e. the format, may be forgotten
3 the semantics i.e. the meaning of the individual numbers may not be understood e.g. this number is a temperature measured in degrees C, measurement from (...) using a type of thermometer and the raw values were turned into degrees C using this (...) calibration curve.
Designated Community Astronomers
Preservation Technique Transformation into a new format - FITS.
However, as noted above, even ignoring the processing algorithms, this is not a simple transformation. A great deal of new semantics must be passed on to the users, and the relationship between the various components within the new format file must be explained, in order for the digitally encoded information to be understood.
Usage Astronomers access the FITS file and use a variety of different suites of astronomical software to extract new astronomical information, perhaps combining with data from other sources.
Success Criteria The success of the preservation activity can be seen from the fact that the IUE data is still used by astronomers, 33 years after the launch of the satellite and 13 years since the satellite was closed down.
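The header conversion described in the Background of this scenario (binary telemetry turned into physical units and written as NAME = VALUE text with an 8-character name) might be sketched as follows. This is a simplified illustration, not the IUE Final Archive software: the keyword CAMTEMP, the linear calibration and its coefficients are hypothetical, and only a fixed-format numeric card is handled.

```python
def fits_keyword(name):
    """Validate a header keyword: at most 8 characters, drawn from upper-case
    letters, digits, hyphen and underscore."""
    allowed = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_")
    if not (1 <= len(name) <= 8) or not set(name) <= allowed:
        raise ValueError(f"invalid FITS keyword: {name!r}")
    return name


def header_card(name, value, comment=""):
    """Format one simplified 80-character 'NAME = VALUE / comment' card."""
    card = f"{fits_keyword(name):<8}= {value:>20.4f}"
    if comment:
        card += f" / {comment}"
    if len(card) > 80:
        raise ValueError("card longer than 80 characters")
    return card.ljust(80)


def raw_to_kelvin(raw_count, gain=0.05, offset=250.0):
    """Hypothetical linear calibration from a raw telemetry count to kelvin."""
    return gain * raw_count + offset


card = header_card("CAMTEMP", raw_to_kelvin(860), "camera temperature [K]")
print(card.rstrip())
```

The 8-character limit is why the meaning of each invented name has to be documented separately: the name alone cannot carry the unit, the instrument component and the calibration that was applied.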

User Scenario ID S.1b
Author David Giaretta STFC/APA
Background Astronomical data is often in the form of tables. These vary from simple text files, with a few lines of headers followed by columns of numbers and/or text, to components in FITS files, either as text or binary. The column headings often seem simple, e.g. "VsubJ" for Johnson visual magnitude. However, for accurate interpretation of the data values one needs to know the filter transmission curves. Moreover, in some cases the names can be misleading.
Type of digital information Tabular data containing data from various sources.
Threat(s) to the data (1) Existing software libraries to access the data may be unusable, (2) the structure of the data, i.e. the format, may be forgotten, (3) the semantics, i.e. the meaning of the individual numbers, may not be understood, e.g. this number is a temperature measured in degrees C, measurement from (...) using a type of thermometer, and the raw values were turned into degrees C using this (...) calibration curve.
Designated Community Mostly astronomers
Preservation Technique Several preservation techniques have been used.
In many cases the original data format has been Transformed to FITS. However since XML became popular it was decided to create a table format which would be better suited to exchange via Web Services. This new format was called VOTable. It was also believed that as XML it would have some advantages for preservation.
In many cases a Uniform Column Descriptor (UCD) has been created to capture the semantics of the column, to give some idea of which columns could sensibly be combined.
Usage The various encodings of data must be able to be understood and used. In particular, it must be possible to combine the various datasets sensibly. One way this is done is to virtualise the various encodings into the Java AbstractTableModel and a variety of specialisations which capture additional information and semantics. This allows data from multiple sources and in multiple formats to be combined.
Success Criteria Astronomers can use and understand the data that is encoded, and in particular can combine data from various sources.
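The virtualisation idea described under Usage can be illustrated with a small abstract table interface and two backing encodings. This is a sketch in Python rather than Java, so the class below only mimics the role that AbstractTableModel plays in the scenario; the class names, method names and sample values are hypothetical, and the UCD string is used purely as an example label.

```python
import csv
import io
from abc import ABC, abstractmethod


class TableModel(ABC):
    """Minimal read-only table interface (illustrative stand-in only)."""

    @abstractmethod
    def row_count(self): ...

    @abstractmethod
    def column_names(self): ...

    @abstractmethod
    def value_at(self, row, column): ...

    def ucd(self, column):
        """Return the semantic descriptor (UCD) for a column, if known."""
        return None


class CsvTable(TableModel):
    """Table backed by delimited text with one header line."""

    def __init__(self, text, ucds):
        rows = list(csv.reader(io.StringIO(text)))
        self._names, self._rows, self._ucds = rows[0], rows[1:], ucds

    def row_count(self):
        return len(self._rows)

    def column_names(self):
        return list(self._names)

    def value_at(self, row, column):
        return float(self._rows[row][self._names.index(column)])

    def ucd(self, column):
        return self._ucds.get(column)


class DictTable(TableModel):
    """Table backed by in-memory columns, e.g. decoded from a binary file."""

    def __init__(self, columns, ucds):
        self._columns, self._ucds = columns, ucds

    def row_count(self):
        return len(next(iter(self._columns.values())))

    def column_names(self):
        return list(self._columns)

    def value_at(self, row, column):
        return self._columns[column][row]

    def ucd(self, column):
        return self._ucds.get(column)


def collect_by_ucd(tables, ucd):
    """Gather all values whose column carries the given UCD, across tables."""
    values = []
    for table in tables:
        for name in table.column_names():
            if table.ucd(name) == ucd:
                values.extend(
                    table.value_at(r, name) for r in range(table.row_count()))
    return values


a = CsvTable("VsubJ,err\n12.1,0.1\n13.4,0.2\n", {"VsubJ": "phot.mag;em.opt.V"})
b = DictTable({"Vmag": [9.8]}, {"Vmag": "phot.mag;em.opt.V"})
print(collect_by_ucd([a, b], "phot.mag;em.opt.V"))
```

The point of the design is that the combining code matches columns by their semantic descriptor rather than by their differing names (VsubJ and Vmag), and never needs to know how either table is physically encoded.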

User Scenario ID S.1d
Author David Giaretta STFC/APA
Background The European Space Agency launched a number of satellites which captured data that, after processing, was stored in Common Data Format (CDF). CDF was a format which originated in NASA to encode data which was repeatedly measured on a grid. The internal format was very complex and was not described anywhere except in the access libraries, which were themselves very complex. However, NASA decided that it would no longer support the CDF access software at some point in the future.
ESA had what was, at that time, a huge amount of data in this format.
Moreover the CDF file format had certain limitations and so a number of "conventions" were imposed on the CDF files which meant that analysis software had some built-in semantics which the "standard" CDF software did not know about.
Type of digital information Solar Terrestrial Physics (STP) measurements obtained from a number of satellites.
Threat(s) to the data 1 Existing software libraries to access the data may be unusable,
2 the structure of the data i.e. the format, may be forgotten
3 the semantics i.e. the meaning of the individual numbers may not be understood e.g. this number is a temperature measured in degrees C, measurement from (...) using a type of thermometer and the raw values were turned into degrees C using this (...) calibration curve.
Designated Community STP scientists
Preservation Technique Transforming to another format was a possibility although the volume and the hidden semantics made this unattractive.
Instead it was decided to ensure the long-term usability of the data by describing it using the EAST language. This first required the CDF software team to write a fairly full description of the CDF internal structures which could then be described in EAST.
This gave ESA the confidence to continue to keep the data in CDF format for a considerable time. Eventually technology changes in storage and the emergence of new analysis tools and formats meant that at least some of the data was transformed, but the hidden semantics had to be exposed.
Usage The data was used in a variety of analysis tools.
Success Criteria The ability of scientists to use and combine data.
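The approach taken in this scenario, describing the internal structure formally so that the data stays usable without the original access library, can be illustrated with a toy example. The sketch below is not EAST syntax and not the real CDF layout: the record description, field names and units are hypothetical, and the only point is that a generic reader can be driven by a description held as data.

```python
import struct

# Hypothetical record layout, written as data rather than code.
RECORD_DESCRIPTION = {
    "byte_order": ">",  # big-endian
    "fields": [
        ("record_id", "I"),   # unsigned 32-bit integer
        ("timestamp", "d"),   # 64-bit float, seconds
        ("field_nT", "f"),    # 32-bit float, nanotesla
    ],
}


def read_records(data, description):
    """Decode fixed-length records from bytes using only the description."""
    fmt = description["byte_order"] + "".join(
        code for _, code in description["fields"])
    names = [name for name, _ in description["fields"]]
    size = struct.calcsize(fmt)
    records = []
    for start in range(0, len(data) - size + 1, size):
        values = struct.unpack(fmt, data[start:start + size])
        records.append(dict(zip(names, values)))
    return records


# Two sample records built in memory so the example is self-contained.
sample = struct.pack(">Idf", 1, 100.0, 42.5) + struct.pack(">Idf", 2, 101.0, 40.0)
for record in read_records(sample, RECORD_DESCRIPTION):
    print(record)
```

A description of this kind captures structure only. The "conventions" noted in the Background, which the analysis software knew about but the standard library did not, would still have to be documented as separate semantic Representation Information.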

Contemporary Performing Arts

User Scenario ID S.2
Author David Giaretta STFC/APA
Background New contemporary performing arts compositions must be able to be re-performed over time.
Type of digital information Consists of the musical composition (perhaps as PDF) plus software (known as patches) which changes the music, e.g. adding reverberation etc., in a complex workflow using proprietary software and hardware. The patches are essentially subroutines which run in the proprietary software.
Link to sample data Provide a link to the samples of data that are available for testing with the Testbeds
Threat(s) to the data The interaction and timing of the interactions between the music and computer effects must be maintained. The computer generated effects must be maintained even when the software patches can no longer be run and the hardware is no longer available.
Designated Community Performers of this type of music plus their musical assistants.
Usage The performer and musical assistant must be able to re-perform the music.
Success Criteria The music must be able to be re-performed to the satisfaction of the composer, if available, or to the performer.

UNESCO World Heritage Site data

User Scenario ID S.3
Author David Giaretta STFC/APA
Background World Heritage Sites (WHS) are documented using a variety of techniques including laser scans, satellite observations, etc., captured as a variety of data files. The state of the site at one point in time must be able to be compared with measurements made later on in order to determine whether the site has deteriorated.
Type of digital information ESRI shapefiles etc.
Link to sample data Provide a link to the samples of data that are available for testing with the Testbeds
Threat(s) to the data The ESRI software needed may become unavailable in future years.
Designated Community UNESCO WHS experts.
Usage The data from one time must be able to be compared with measurements taken with different instruments in order to see if the site has deteriorated.
Success Criteria The measurements are successfully compared and the older data does, in spot checks, agree with the data values extracted by the original s/w.

-- DavidGiaretta - 2011-08-03


FORTH

I wonder if we should also include more complex scenarios that involve converters and emulators:

Indicative scenario:

Suppose James has an old source file written in the Pascal programming language, say game.pas, and has found a converter from Pascal to C++, say p2c++. Further suppose that he has just bought a smartphone running Android OS and has found an emulator of WinOS over Android OS. It should follow that James can run game.pas on his mobile phone (by first converting it to C++, then compiling the outcome, and finally running the resulting executable over the emulator).

In other words, a sequence of conversions and emulations can be enough to close an intelligibility gap or to allow a task to be performed. Since there is a plethora of emulation and migration approaches that concern various layers of a computer system (from hardware to software) or various source/target formats, it is beneficial to use advanced knowledge management techniques to help exploit all the possibilities that existing and emerging emulators/converters enable, and to assist preservation planning.
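One simple way to picture this reasoning is as a path search over a graph whose edges are the available converters, compilers and emulators. The sketch below is only an illustration of the idea, not the technique FORTH is researching: the converter and emulator follow the scenario text, while the compiler entry, the pretty-printer and all the form labels are hypothetical.

```python
from collections import deque

# Each entry: (tool name, input form, output form).
TOOLS = [
    ("p2c++ converter", "Pascal source", "C++ source"),
    ("C++ compiler for WinOS", "C++ source", "WinOS executable"),
    ("WinOS emulator on Android", "WinOS executable", "runnable on Android"),
    ("Pascal pretty-printer", "Pascal source", "Pascal source"),
]


def find_path(tools, start, goal):
    """Breadth-first search for the shortest chain of tools from start to goal.
    Returns the list of tool names, or None if no chain exists."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        form, path = queue.popleft()
        if form == goal:
            return path
        for name, source, target in tools:
            if source == form and target not in visited:
                visited.add(target)
                queue.append((target, path + [name]))
    return None


print(find_path(TOOLS, "Pascal source", "runnable on Android"))
```

If no chain of tools reaches the goal, the function returns None, which corresponds to an intelligibility gap that the currently known converters and emulators cannot close.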

NOTE: FORTH is currently doing research for supporting the aforementioned scenario (also related to WP25 ApanWp25).

-- YannisTzitzikas - 2011-08-29

Just to keep our terminology in sync with our common glossary of terms, is a converter the same as a transformer i.e. a process or tool that transforms digital objects from one form to another? This would suggest that your usage scenario is actually a composite function of both a transformation followed by an emulation in order to provide access to the digital object of interest (the game).

-- AshHunter - 2011-10-13

Yes. What I call converter is a transformer in your vocabulary (feel free to use your own terminology).

Indeed, this is what this scenario wants to stress: composite functions.

Yannis


NDL - Evaluation of various migration paths

User Scenario ID 6.1
Author Pekka Mustonen, CSC
Background

The Finnish National Digital Library project - launched by the Finnish Ministry of Education and Culture in year 2008 - brings the achievements of culture and science to general public. The aims of the NDL project are improving availability and usability of the key national information resources of libraries, archives and museums in information networks, and the development of long-term preservation solutions for digital cultural heritage content data objects. The long-term preservation section of the NDL project has prepared a plan describing the model for centralized national long-term preservation solution for the digital objects of memory organisations responsible for the preservation of cultural heritage.

In the National Digital Library project, file formats are divided into the categories "acceptable for preservation" and "acceptable for transfer". For example, file formats used in the MS Office suite are considered "acceptable for transfer", but these files will be converted into a long-term preservation format before being archived.

We want to study various migration paths from "acceptable for transfer" to "acceptable for preservation" to be able to instruct depositors in preservation planning.

Type of digital information

According to a recent study, the current number of digital objects to be deposited to the NDL long-term preservation system is roughly 687 000 000 (2500TB), and the size of the collection is estimated to be 1 458 000 000 objects (5700TB) in 2015 when the system will be in production (obviously only a tiny share of this will be available for testing).

Possible migration paths can be any of the following (Note: also scenarios by DNB are very relevant to us):

  • "Text":
    • Acceptable for transfer: Microsoft Word for Windows Document
    • Acceptable for preservation: Open Document Format (ODF), PDF for long-term preservation (PDF/A)
  • Audio:
    • Acceptable for transfer: Audio Interchange File Format (AIFF), MPEG-1 Layer 3, MPEG-2 Layer 3 (MP3), MPEG-4 Advanced Audio Coding (AAC), Windows Media Audio
    • Acceptable for preservation: Broadcast Wave Format (BWF), Waveform Audio Format (WAV), AIFF (PCM-coded), AAC
  • Video:
    • Acceptable for transfer: Audio video interleave (AVI), Moving pictures expert group (MPEG-2), Moving pictures expert group (MPEG-4), Quicktime (MOV), Windows media video (WMV)
    • Acceptable for preservation: JPEG 2000 MXF or Motion JPEG 2000
  • Still images:
    • Acceptable for transfer: Encapsulated postscript (EPS), Graphics interchange format (GIF), Portable network graphics (PNG)
    • Acceptable for preservation: Joint photographic experts group (JPEG), Joint photographic experts group jpeg 2000 (JP2), Tagged image file format (TIFF)
Link to sample data Not yet available.
Threat(s) to the data Information loss during the conversion
Usage (Mainly) producers
Success Criteria Object properties are preserved with satisfying quality

-- HeikkiHelin - 2011-09-09


Tessella Scenarios (typical customer scenarios)

Scenario 26.1 Born-Digital Government Departmental Records

User Scenario ID 26.1
Author Ashley Hunter, Tessella plc
Background National Archives are founded on the basis that they must provide long term access to a wide variety of government records. Simple cases may include keeping the minutes of specific departmental meetings in PDF or DOC safe and available for access for a given period of time. More complex data types may be digital objects like databases, websites, CAD files etc. Each digital information object may have a defined closure period during which access permissions can only be granted to specific authorised individuals, but after this period, the record becomes open for wider or even public consumption. The digital records may well only reach the National Archive after a specific holding period (typically 10-20yrs) has expired within the issuing department's own Content Management System, setting the status of the material as "Archival". This further complicates the task of the National Archive as it strives to ensure that records do not become obsolete before they are even allowed to be transferred out of the issuing department to the Archival Repository. Records that are transferred to the Archive may then remain closed for significant periods (e.g. 50yrs, etc) or until such a reasonable time has expired that anyone referenced in the record is likely to be deceased.
Type of digital information MS Word documents(DOC), Excel Spreadsheets(XLS), Powerpoint presentations (PPT), Outlook PST files, Simple Text Files (TXT), Presentation copies of printed documents (PDF, PDF/A)
Link to sample data Tessella to provide a test set of typical files (not real, made up - likely the DROID test file corpora) for testing purposes
Threat(s) to the data The main purpose of the Archive is to provide long term search and access of records for the various approved user communities, and to this end the Archive must defend against passive preservation issues, relating to (1) keeping the digital objects safe from 'bit rot' or 'data decay' on the primary storage media. (2) Complex container & compression formats, or files with embedded digital certificates & signatures can also introduce further risk to the ability to access the information objects in the future, as access may be dependent on having a tool available that knows how to unpack or uncompress the objects, or be dependent on performing a verification process with a 3rd party entity that may not be available anymore. (3) The Archive may want to provide multiple versions of its digital objects, rendering them in different formats for different access purposes (free low grade manifestation, and commercially available full high-resolution manifestations), and to this end Archives can use Active Preservation methods including file format migration to provide these various manifestation types. (4) Representation Information about the record's provenance and authenticity is provided through arrangement of the digital objects into a hierarchy of collections, with Descriptive Metadata provided according to agreed metadata schemas. Commonly these are bespoke to each Archive, rather than using standard Schemas such as METS, MODS, EAD, PREMIS, etc.
Designated Community Initially the submitting Government Department and any other affiliated departments and organisations will have access to the material and over time this access will be relaxed for some of the records leading to wider or even public access to records for use by policy investigators, historians, and the general public.
Preservation Technique (1) The standard approach to defend against bit rot is to keep multiple copies of the AIP on different storage media, and to periodically test the integrity of these objects against their known fixity/checksum values at the time of ingest. Where corruptions are found, the system should notify the Archive staff so that the corrupted file can be replaced from one of the other AIP copies that has recently passed the integrity test. (2.a) During ingest, unpack and uncompress all digital objects and characterise these objects as individual digital objects in their own right, extracting descriptive, administrative, technical metadata and structural information in relation to any other extracted digital objects. (2.b) Remove digital certificates and signatures from objects where possible and re-assert provenance and authenticity from within the Archival system itself (i.e. make it too have a discernible and guaranteed provenance and authenticity) whilst maintaining references to the original certification system. (3) Use file migration techniques to provide digital information in alternative file formats. This process may in itself be a lossy process, so some process of quality assessment and validation needs to be applied to ensure that the appropriate level of quality is maintained following each format migration. Several migrations may be required over time to provide the Designated Community with the information that they want to access in a format suitable to their needs. (4) Provide methods for translating between metadata schema definitions via code or XSLT, to enable exchange of networks of representation information. Facilitate further integration with catalogue collection systems through the use of OAI-PMH exchange protocols.
Usage Archive staff, submitting government department staff via access requests, public access request information
Success Criteria Digital objects remain available to their designated community through 'Search' &/or 'Browse' functionality, and are directly accessible to these users in forms and ways that are meaningful to them and can facilitate their re-use if allowed (e.g. secure downloads, scheduled reader-room deliveries, public internet, etc).
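Step (1) of the Preservation Technique above, checking several copies against the fixity value recorded at ingest and replacing a corrupted copy from one that passed, could look roughly like the following. This is a minimal sketch under stated assumptions: SHA-256 is chosen here only as an example checksum, the storage locations are hypothetical in-memory stand-ins, and a real system would read from actual media and notify staff rather than print.

```python
import hashlib


def sha256_of(data):
    """Return the SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def audit_copies(copies, expected_digest):
    """Check every stored copy against the digest recorded at ingest.

    `copies` maps a storage location name to the bytes read from it.
    Returns (good_locations, corrupted_locations)."""
    good, corrupted = [], []
    for location, data in copies.items():
        if sha256_of(data) == expected_digest:
            good.append(location)
        else:
            corrupted.append(location)
    return good, corrupted


def repair(copies, expected_digest):
    """Overwrite corrupted copies from a copy that passed the check.
    Returns the corrupted locations so that staff can be notified."""
    good, corrupted = audit_copies(copies, expected_digest)
    if corrupted and not good:
        raise RuntimeError("no intact copy left; manual recovery needed")
    for location in corrupted:
        copies[location] = copies[good[0]]
    return corrupted


# In-memory stand-ins for three storage media (names are hypothetical).
original = b"minutes of departmental meeting"
digest_at_ingest = sha256_of(original)
copies = {
    "disk_a": original,
    "tape_b": b"minutes of departmentaX meeting",  # simulated bit rot
    "cloud_c": original,
}

print(repair(copies, digest_at_ingest))
print(audit_copies(copies, digest_at_ingest))
```

The repair only ever copies from a location that has just passed the check, which mirrors the requirement that the replacement come from a copy that "has recently passed the integrity test".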

Scenario 26.2 Digitised Presentation Manifestations / Hybrid Catalogues of Paper based Government Records

User Scenario ID 26.2
Author Ashley Hunter, Tessella plc
Background Collections of Government records may span several years of accessions, during which time the producers of the material moved from creating paper based records to digital records. To maintain accessibility across the 'divide' archives are digitising the paper manifestations in order to make these accessible along with the born-digital material. Typically, the paper based record will be scanned (or photographed) to create a high resolution Preservation Manifestation, and at the same time create a lower grade presentation copy along with other files including OCR'd text where available, and specific technical metadata extracted during the digitisation process (Page number, camera specification, creator, etc).
Type of digital information Preservation Manifestation formats include, but are not limited to, TIF, JP2 and RAW (where this is a format specific to the camera manufacturer); Presentation formats include JPG, JP2 and PDF; OCR output takes the form of TXT, RTF or CSV; and metadata is in XML formats.
Link to sample data Tessella to provide a test set of typical files (not real, made up - likely the DROID test file corpora) for testing purposes
Threat(s) to the data The main purpose of the Archive is to provide long term search and access of records for the various approved user communities, and to this end the Archive must defend against passive preservation issues, relating to (1) keeping the digital objects safe from 'bit rot' or 'data decay' on the primary storage media. (2) The Archive may want to provide multiple versions of its digital objects, rendering them in different formats for different access purposes (free low grade manifestation, and commercially available full high-resolution manifestations), and to this end Archives can use Active Preservation methods including file format migration to provide these various manifestation types. (3) Representation Information about the record's provenance and authenticity is provided through arrangement of the digital objects into a hierarchy of collections, with Descriptive Metadata provided according to agreed metadata schemas. Commonly these are bespoke to each Archive, rather than using standard Schemas such as METS, MODS, EAD, PREMIS, etc.
Designated Community Initially the submitting Government Department and any other affiliated departments and organisations will have access to the material and over time this access will be relaxed for some of the records leading to wider or even public access to records for use by policy investigators, historians, and the general public.
Preservation Technique (1) The standard approach to defend against bit rot is to keep multiple copies of the AIP on different storage media, and to periodically test the integrity of these objects against their known fixity/checksum values at the time of ingest. Where corruptions are found, the system should notify the Archive staff so that the corrupted file can be replaced from one of the other AIP copies that has recently passed the integrity test. (2) Use file migration techniques to provide digital information in alternative file formats. This process may in itself be a lossy process, so some process of quality assessment and validation needs to be applied to ensure that the appropriate level of quality is maintained following each format migration. Several migrations may be required over time to provide the Designated Community with the information that they want to access in a format suitable to their needs. (3) Provide methods for translating between metadata schema definitions via code or XSLT, to enable exchange of networks of representation information. Facilitate further integration with catalogue collection systems through the use of OAI-PMH exchange protocols.
Usage Archive staff, submitting government department staff via access requests, public access request information
Success Criteria Digital objects remain available to their designated community through 'Search' &/or 'Browse' functionality, and are directly accessible to these users in forms and ways that are meaningful to them and can facilitate their re-use if allowed (e.g. secure downloads, scheduled reader-room deliveries, public internet, etc).

Scenario 26.3 Scientific Datasets (ISIS Neutron and Muon Facility, STFC)

User Scenario ID 26.3
Author Ashley Hunter, Tessella plc
Background ISIS, the Neutron and Muon Source at Rutherford Appleton Laboratory in the UK, and part of STFC, wanted to preserve the instrument data that has been generated over its many years of operation. The instrument data has evolved over time as the instruments themselves have been updated and enhanced, resulting in various file formats specific to each instrument. A common data structure was developed for these known as ISIS RAW. Additional ad-hoc files may also be present to describe additional metadata about the instrument or its operation, and temporary files are generated during the cyclic operation of the instruments for backup purposes (SAV file formats). A separate catalogue system maintains representation information relating to the setup of each instrument and why it was used in a specific experiment and by whom. This remains confidential for a 2 year period so that results can be derived by the initiating investigator before the results are made publicly available for use by the wider research community.
Type of digital information Instrument files in SAV, ISIS RAW and Nexus (NXS). Ad-hoc formats include TXT.
Link to sample data Link to publicly available historic instrument data to be added here
Threat(s) to the data The main purpose of the Archive is to provide long term search and access of records for the various approved user communities, and to this end the Archive must defend against passive preservation issues, relating to (1) keeping the digital objects safe from 'bit rot' or 'data decay' on the primary storage media. (2) ISIS may want to provide multiple versions of its digital objects, like aggregating older ISIS RAW and other ad-hoc files into one combined NXS format. (3) Network of Representation Information about the instrument data is held in the iCAT cataloguing system.
Designated Community Initially the Research scientists that commission the work, but later this will become public access to all the instrument data after the 2 year closure period has passed.
Preservation Technique (1) The standard approach to defend against bit rot is to keep multiple copies of the AIP on different storage media, and to periodically test the integrity of these objects against their known fixity/checksum values at the time of ingest. Where corruptions are found, the system should notify the ISIS support staff so that the corrupted file can be replaced from one of the other AIP copies that has recently passed the integrity test. (2) Use file migration techniques to provide digital information in alternative file formats. The MANTID software has been used to provide this migration pathway from the older ISIS RAW data formats to the newer NXS formats. (3) Provide methods for translating between metadata schema definitions via code or XSLT, to enable exchange of networks of representation information. Facilitate further integration with catalogue collection systems through webservices.
Usage ISIS Research staff, Client sponsoring research staff, and wider academic community following release as public access
Success Criteria Research datasets remain available to their designated community through 'Search' &/or 'Browse' functionality, and are directly accessible to these users in forms / formats that are meaningful to them and can facilitate their re-use when allowed.

-- AshHunter - 2011-10-25


UKDA Scenarios

User Scenario ID 33
Author Sharon Bolton (UK Data Archive), Contact Hervé L’Hours (UK Data Archive)
Background The UK Data Archive is curator of the largest collection of digital data in the social sciences and humanities in the United Kingdom. The Archive holds several thousand datasets relating to society, both historical and contemporary. For conversion to standard preservation (and access) formats, a standard approach is required at ingest to ensure that processing staff and, subsequently, users have confidence in the conversion output.
Type of digital information A large proportion of our files are quantitative data files which are deposited primarily in SPSS but we also receive STATA, SAS and Excel format.
Link to sample data An ESDS Government sample dataset can be made available on request
Threat(s) to the data Various proprietary statistical software packages manage files in a variety of formats to support general and software-specific functions. Differences between packages and their formats and the large and complex nature of the data sets present a significant risk that some information will be lost (through error or truncation) or damaged on conversion with no record produced of the changes incurred. These are ongoing problems with ingest conversion but similar issues exist with obsolescence of a particular format version of a statistical software package. Threats include: Truncation of variables (to a reduced number of decimal places), Truncation of labels (to a reduced number of characters), Non-identical feature sets supported (Not all of format A included in format B or some elements of format B left blank because not present in format A), Different approaches to the application of weighting (and truncation of decimal places may occur with weighting variables)
Designated Community Social Scientists working with quantitative data
Usage Statistical information used for secondary analysis or replication of results
Success Criteria Processing users will either be presented with confirmation that the outcome of the format conversion is content-identical or with clearly flagged difference between the two files, preferably with verbose explanations. The end users will either be presented with content-identical statistics or detailed explanations of any variation from the originally deposited material sufficient to replicate any analysis made on that material
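The kind of check implied by these success criteria, confirming that a conversion is content-identical or else flagging each difference, can be sketched as below. This is an illustration only: the in-memory dataset layout, the variable names and the sample values are hypothetical, and a real comparison would first read both files with the relevant statistical packages.

```python
def compare_datasets(original, converted, tolerance=0.0):
    """Compare two datasets given as {variable: {"label": str, "values": [...]}}.
    Returns a list of human-readable difference reports (empty if identical)."""
    differences = []
    for name, source in original.items():
        target = converted.get(name)
        if target is None:
            differences.append(f"{name}: variable missing after conversion")
            continue
        if source["label"] != target["label"]:
            if source["label"].startswith(target["label"]):
                differences.append(
                    f"{name}: label truncated to {len(target['label'])} characters")
            else:
                differences.append(f"{name}: label changed")
        if len(source["values"]) != len(target["values"]):
            differences.append(f"{name}: number of cases differs")
            continue
        for row, (a, b) in enumerate(zip(source["values"], target["values"])):
            if abs(a - b) > tolerance:
                differences.append(f"{name}: case {row} changed from {a} to {b}")
    for name in converted:
        if name not in original:
            differences.append(f"{name}: variable added by conversion")
    return differences


original = {
    "weight": {"label": "Survey weighting factor (grossing)",
               "values": [1.23456, 0.98765]},
    "age": {"label": "Age of respondent", "values": [34, 51]},
}
converted = {
    "weight": {"label": "Survey weighting factor", "values": [1.23, 0.99]},
    "age": {"label": "Age of respondent", "values": [34, 51]},
}
for line in compare_datasets(original, converted):
    print(line)
```

An empty result corresponds to the "content-identical" confirmation; otherwise each report line is one flagged difference, here a truncated label and two weighting values that lost decimal places.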

-- HerveLH - 2011-10-25

Topic revision: r2 - 2011-10-26 - AshHunter
 