Monday, May 11, 2009

Data Integrity through DEFINE.XML

You can use the DEFINE.PDF/DEFINE.XML files that are created for electronic submission to review your own data. This review is usually performed by an independent reviewer outside of the development team. The reviewer's fresh perspective adds a redundancy that helps ensure the accuracy and integrity of your data, and it lets you catch discrepancies that might otherwise surface only during a review by regulatory agencies. The following steps help verify both that your domain documentation is accurate and that the data it describes is accurate.

Step 1: Verify that any hyperlinks such as the one to external transport (XPT) files link to the right files. This is to ensure that the domain document itself has accurate hyperlinks.
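When the documentation is in DEFINE.XML form, the link check in Step 1 can be partly automated. The following is a minimal Python sketch, not a definitive implementation. It assumes that links are stored as xlink:href attributes (as on the def:leaf elements of a Define-XML file) and that the linked files sit relative to the folder holding the define file. The function name find_broken_links is made up for this illustration.

```python
import os
import xml.etree.ElementTree as ET

# Attribute name for xlink:href in ElementTree's {namespace}name notation.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def find_broken_links(define_path):
    """Return the hrefs in a define file whose target file does not exist."""
    base_dir = os.path.dirname(os.path.abspath(define_path))
    root = ET.parse(define_path).getroot()
    broken = []
    for elem in root.iter():
        href = elem.get(XLINK_HREF)
        if href is None:
            continue
        # Drop any fragment such as "#page=3"; skip web URLs, which are
        # outside the scope of this file-based check.
        target = href.split("#", 1)[0]
        if not target or "://" in target:
            continue
        if not os.path.isfile(os.path.join(base_dir, target)):
            broken.append(href)
    return broken
```

A check like this only confirms that each link points to an existing file. Whether it is the right file (for example, that the link for the AE domain really opens the AE transport file) still needs a reviewer's eye.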

Step 2: In the list of datasets at the top of the document, verify the key fields. Ensure that the following criteria are met:

  1. The key field exists and is listed first in the list of variables.

  2. The dataset is sorted by the key fields.
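The two criteria above can be checked programmatically once the variable list, the key list and the records have been read in. The sketch below is in Python and makes several assumptions: records are held as dictionaries, key values are non-missing and comparable, and ordinary Python ordering is an acceptable stand-in for the sort order of your SAS session. The name check_keys is hypothetical.

```python
def check_keys(variables, keys, rows):
    """Return a list of problems found for one dataset.

    variables: variable names in the order listed in the documentation
    keys:      key variable names in sort order
    rows:      list of dicts, one per record, in dataset order
    """
    problems = []
    missing = [k for k in keys if k not in variables]
    if missing:
        problems.append("keys not in variable list: " + ", ".join(missing))
        return problems  # the remaining checks need every key to exist
    if list(variables[:len(keys)]) != list(keys):
        problems.append("keys are not listed first")
    # Each record's key values must be >= those of the record before it.
    key_values = [tuple(row[k] for k in keys) for row in rows]
    if any(a > b for a, b in zip(key_values, key_values[1:])):
        problems.append("dataset is not sorted by keys")
    return problems
```

An empty list means both criteria are met for that dataset.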

Step 3: Verify all decoded formats. Check that the decoded values match what was defined in the analysis plan or the original case report form. Review the data for any values that are not covered by the format codes and therefore were not properly decoded.
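Both halves of Step 3 can be sketched in a few lines of Python. This is an illustration under assumptions: the code list from the documentation and the expected decodes from the analysis plan or case report form have already been typed or read into dictionaries keyed by string codes. Both function names are made up for this example.

```python
from collections import Counter


def decode_mismatches(codelist, expected):
    """Return {code: (documented decode, expected decode)} where they differ."""
    return {
        code: (codelist.get(code), expected.get(code))
        for code in sorted(set(codelist) | set(expected))
        if codelist.get(code) != expected.get(code)
    }


def undecoded_values(values, codelist):
    """Count data values that have no entry in the code list."""
    return Counter(v for v in values if v not in codelist)
```

For example, with a code list of M and F, a column containing U would be reported by undecoded_values together with the number of records affected.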


Step 4: All derived variables need to be verified. You can choose to do some or all of the following recommended verification tasks to ensure the integrity of the derived variables:

Code Review

Systematic review of program code pertaining to the derivation according to a predetermined checklist of verification criteria.

Code Testing

Test the SAS programs pertaining to the derivation by supplying valid and invalid inputs and verifying the expected output.

Log Evaluation

Evaluate the SAS log for errors, warnings and other unexpected messages.

Output Review

Visual or programmatic review of report outputs related to the derivation as compared to expected results.

Data Review

Review attributes and contents of output data for accuracy and integrity.

Duplicate Programming

Independent programming to produce the same derivation and output for comparison.
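Two of the tasks above, Log Evaluation and Duplicate Programming, lend themselves to automation. The Python sketch below is one possible approach, not a prescribed one. ERROR and WARNING are the standard SAS log prefixes, but the list of suspicious NOTE messages is an assumed watch list that each team would tailor. The comparison assumes the two outputs have been read in as lists of dictionaries that share a set of key variables. In SAS itself, PROC COMPARE is the usual tool for the second task.

```python
import re

# ERROR and WARNING lines, plus an assumed watch list of NOTE messages that
# often point to a logic problem even though SAS does not flag them as errors.
_FLAGGED = re.compile(
    r"^(ERROR|WARNING)\b"
    r"|^NOTE: .*(uninitialized|repeated BY values"
    r"|Missing values were generated|Invalid)"
)


def scan_log(log_text):
    """Return (line_number, text) for each flagged line of a SAS log."""
    return [
        (n, line.rstrip())
        for n, line in enumerate(log_text.splitlines(), start=1)
        if _FLAGGED.search(line)
    ]


def compare_outputs(primary, independent, keys):
    """Compare two lists of dict records matched on `keys`.

    Returns (key values, description) for every record that is missing
    from one side or whose values differ. Values are compared exactly.
    """
    def index(rows):
        return {tuple(r[k] for k in keys): r for r in rows}

    a, b = index(primary), index(independent)
    diffs = []
    for key in sorted(set(a) | set(b)):
        if key not in a:
            diffs.append((key, "missing from primary"))
        elif key not in b:
            diffs.append((key, "missing from independent"))
        elif a[key] != b[key]:
            changed = sorted(
                name for name in set(a[key]) | set(b[key])
                if a[key].get(name) != b[key].get(name)
            )
            diffs.append((key, "differs in " + ", ".join(changed)))
    return diffs
```

Because values are compared exactly, derived numeric variables may need rounding or a tolerance before the comparison is meaningful.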



There are many tasks performed in the process of verifying and validating SAS programs to ensure the quality of your data. The significance of many of these tasks in maintaining the accuracy and integrity of the program logic and its output is often overlooked. Their repetitive nature gives them a reputation as unglamorous grunt work that must be done to meet departmental SOPs. However, verification is an essential step, and it can be performed and directed by what is documented in the domain documentation.


File Formats
The domain documentation was originally specified to be a DEFINE.PDF file. The PDF format is good in that it is not intended to be edited and can be viewed both on screen and on paper across many computing platforms. This makes it a good file format for the final electronic submission, but other formats may be more suitable for other purposes. PDF has the limitation that it is not extensible: you cannot add extra information. For example, if you wanted to record the user name, date and time of the last update to a particular variable, you could not easily do this within the current DEFINE.PDF. The XML file format, however, is extensible, which allows you to add more information. The new standard therefore calls for the documentation to be stored in the more vendor-neutral, universal and extensible structure of DEFINE.XML.
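To make the extensibility point concrete, here is a small Python sketch that records who last updated a variable, and when, directly on that variable's entry in the XML. The audit namespace URI and the updatedBy/updatedAt attribute names are hypothetical and are not part of any standard. The only assumption about the file is that variables are described by ItemDef elements carrying a Name attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical vendor namespace for the extra attributes (not a standard one).
AUDIT_NS = "http://example.com/ns/audit"
ET.register_namespace("audit", AUDIT_NS)


def stamp_variable(root, variable_name, user, timestamp):
    """Record who last updated a variable, and when, on its ItemDef element.

    Returns the number of ItemDef elements that were stamped.
    """
    stamped = 0
    for elem in root.iter():
        # Match on the local tag name so the ODM namespace version
        # used by the file does not matter.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "ItemDef" and elem.get("Name") == variable_name:
            elem.set("{%s}updatedBy" % AUDIT_NS, user)
            elem.set("{%s}updatedAt" % AUDIT_NS, timestamp)
            stamped += 1
    return stamped
```

Because the additions live in their own namespace, a tool that does not know about them can simply ignore them, which is what makes this kind of extension practical in XML and impractical in PDF.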


If you use the domain documentation for project management, the information can be stored in either an Excel spreadsheet or a Word document. For electronic submissions, DEFINE.XML or DEFINE.PDF is a better choice. There may be slight variations, but the core information is the same in all of these files.
Since the content of the data is similar, file format becomes less significant. The file can be converted from one format to the other while maintaining all the same information. There are tools that will make this a transparent process. The goal is to make use of the information stored within the domain documentation and not be restricted by being forced to use one particular file format.
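As a rough illustration of such a conversion, the Python sketch below pulls the variable-level attributes out of a define-style XML document and writes them as CSV text, which a spreadsheet can open. It is deliberately partial: it assumes ItemDef elements with OID, Name, DataType and Length attributes, and it ignores everything else in the file. The function name itemdefs_to_csv is made up for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Attributes copied from each ItemDef element, in column order.
FIELDS = ["OID", "Name", "DataType", "Length"]


def itemdefs_to_csv(define_xml_text):
    """Return CSV text with one row per ItemDef element in the XML."""
    root = ET.fromstring(define_xml_text)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(FIELDS)
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] == "ItemDef":
            # A missing attribute becomes an empty cell.
            writer.writerow([elem.get(field, "") for field in FIELDS])
    return buffer.getvalue()
```

A full converter would also carry over dataset-level information, code lists and comments so that nothing is lost when moving between formats.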

The complete paper can be found at "Data Integrity through DEFINE.PDF and DEFINE.XML", along with related DEFINE.XML software.


1 comment:

  1. Since as much as 80% of a programmer’s time is
    invested in testing and validation, it’s important to
    focus on tools that facilitate correction of syntax, data,
    and logic errors in SAS programs. The presentation
    focuses on a wide variety of SAS features, tips,
    techniques, tricks, and system tools that can become
    part of your routine testing methodology.
