Producer Quality Model
The producer quality model introduces elements to record qualitative and quantitative quality information, and to identify resources (i.e., datasets) so that metadata can be related hierarchically or in other ways. We have defined a clear conceptual model for producer quality data and for user feedback. Both models have gone through several iterations, are now stable at v3.0, and have been converted to XSD schemas.
The model extends ISO 19115, 19115-2 and 19157, adding means to report publications, discovered issues, reference datasets used for quality evaluation, traceability, and statistical summaries of quantified uncertainty.
Publications (e.g., journal articles, technical reports) may be attached to a number of quality elements within the metadata document. In each case, an existing DQ_ or MD_ element is extended to allow a ‘referenceDoc’ element to be added; the resulting new objects are GVQ_Lineage, GVQ_DataIdentification and GVQ_Usage. DQ_Evaluation already has a ‘referenceDoc’, and since GVQ_Publication is substitutable for CI_Citation, it may also be employed there.
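As an illustration, a usage record extended in this way might look like the following. This is a hand-written sketch: the `gvq` namespace prefix and all child elements other than ‘referenceDoc’ and the class names mentioned above are assumptions, not copied from the published schemas.

```xml
<!-- Hypothetical sketch: prefixes and children other than
     referenceDoc / GVQ_Usage / GVQ_Publication are assumed -->
<gvq:GVQ_Usage>
  <gmd:specificUsage>
    <gco:CharacterString>Land-cover change detection</gco:CharacterString>
  </gmd:specificUsage>
  <gvq:referenceDoc>
    <!-- GVQ_Publication is substitutable for CI_Citation -->
    <gvq:GVQ_Publication>
      <gmd:title>
        <gco:CharacterString>Technical report on dataset usage</gco:CharacterString>
      </gmd:title>
    </gvq:GVQ_Publication>
  </gvq:referenceDoc>
</gvq:GVQ_Usage>
```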
A fourth element has been added to the ‘metaquality’ concrete types, to allow the lineage of a data quality assessment to be recorded, along with its representativity and coverage. This element contains a full LI_Lineage or GVQ_Lineage element as its ‘trace’ component, allowing process steps to be recorded for the sampling and evaluation involved in generating the quality statement. If the substitutable GVQ_Lineage is used instead of LI_Lineage, then one or more reference documents may also be cited in support of the traceability statement.
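A traceability statement of this kind might be sketched as follows. Only ‘trace’, GVQ_Lineage and ‘referenceDoc’ come from the model description above; the enclosing element name and the ISO 19115 lineage children are assumptions for illustration.

```xml
<!-- Hypothetical sketch: the outer element name is assumed -->
<gvq:traceability>
  <gvq:trace>
    <gvq:GVQ_Lineage>
      <gmd:processStep>
        <gmd:LI_ProcessStep>
          <gmd:description>
            <gco:CharacterString>Stratified random sampling of check
            points for the accuracy evaluation</gco:CharacterString>
          </gmd:description>
        </gmd:LI_ProcessStep>
      </gmd:processStep>
      <!-- available because GVQ_Lineage substitutes for LI_Lineage -->
      <gvq:referenceDoc>
        <gvq:GVQ_Publication/>
      </gvq:referenceDoc>
    </gvq:GVQ_Lineage>
  </gvq:trace>
</gvq:traceability>
```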
Reference datasets used for evaluation
An important element of data quality assurance is the verification of selected values against an accepted independent calibration or validation dataset. Currently, however, the ISO metadata standards offer no systematic way to record the identity of this dataset, other than describing its origin in a free-text entry or in a non-standard document describing the cal/val process. We have therefore extended the ‘dataEvaluation’ section of the 19157 schema to allow any data evaluation record to report the identity of the reference dataset and the way in which it was used.
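An evaluation record extended in this way might be encoded roughly as below. The ‘referenceDataset’ and ‘evaluationProcedure’ element names are guesses at how the extension could be expressed, not names taken from the extended schema; CI_Citation and its title are standard ISO 19115 constructs.

```xml
<!-- Hypothetical sketch: referenceDataset and evaluationProcedure
     element names are assumed -->
<gvq:dataEvaluation>
  <gvq:referenceDataset>
    <gmd:CI_Citation>
      <gmd:title>
        <gco:CharacterString>National geodetic control network,
        2012 release</gco:CharacterString>
      </gmd:title>
    </gmd:CI_Citation>
  </gvq:referenceDataset>
  <gvq:evaluationProcedure>
    <gco:CharacterString>Horizontal positions compared at 120 control
    points; RMSE computed per tile</gco:CharacterString>
  </gvq:evaluationProcedure>
</gvq:dataEvaluation>
```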
Producer soft knowledge: Discovered issues
As well as legal and security constraints, it is now possible to add one or more discovered issues (i.e., problems which the producer has identified during generation of a dataset) to an extended version of the DQ_DataQuality element of a metadata document. The GVQ_DiscoveredIssue type is a standalone class which offers the option of supplying a reference to a corrected dataset, suggestions for workarounds, alternative datasets, and a free-text description of the identified problem. It also allows reference to be made to expected dates when a problem will be fixed, and to the location of datasets in which the problem has been fixed.
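A discovered issue might then be recorded along the following lines. All child element names here are guesses at how the options listed above (problem description, workaround, expected fix date, corrected dataset) could be encoded; only GVQ_DiscoveredIssue itself is taken from the model.

```xml
<!-- Hypothetical sketch: child element names are assumed -->
<gvq:GVQ_DiscoveredIssue>
  <gvq:knownProblem>
    <gco:CharacterString>Cloud shadow misclassified as water in
    coastal tiles</gco:CharacterString>
  </gvq:knownProblem>
  <gvq:workAround>
    <gco:CharacterString>Mask pixels flagged by the cloud-shadow
    quality band before analysis</gco:CharacterString>
  </gvq:workAround>
  <gvq:expectedFixDate>
    <gco:Date>2014-06-30</gco:Date>
  </gvq:expectedFixDate>
  <!-- pointer to a dataset in which the problem has been fixed -->
  <gvq:fixedResource xlink:href="http://example.org/dataset/v2"/>
</gvq:GVQ_DiscoveredIssue>
```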
Populating DQ_Result elements using UncertML
A data quality statement conforming to ISO 19115/19139 or the proposed 19157 standard contains one or more ‘report’ elements in which numerical information, such as quantified accuracy, may be recorded. These reports all extend the abstract DQ_Element, which can contain one or more DQ_Result ‘result’ elements; it is within these results that the actual outcomes of a specific quality assessment are embedded.
The usual means of recording thematic accuracy (e.g., for categorical labels in a dataset of vector polygons or classified raster pixels) is a confusion matrix with an associated Kappa statistic. Historically, this information has been recorded in free text or through hyperlinks to external documents recording the matrix values. Such free-text information can only be interpreted by a human reader, so no numerical values can be extracted automatically. The UncertML standard defines a ConfusionMatrix encoding, meaning that a client capable of parsing UncertML could extract the numerical information instead. This potentially allows the confusion counts to be used to generate likelihoods for specific misclassifications, and to carry out modelling and simulation in automated workflows.
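A machine-readable confusion matrix of this kind might look roughly like the fragment below. The element names follow my reading of the UncertML 2.0 encoding and should be checked against the specification; the category labels and counts are invented illustrative values.

```xml
<!-- Sketch only: verify element names against the UncertML 2.0 spec -->
<un:ConfusionMatrix xmlns:un="http://www.uncertml.org/2.0">
  <!-- row/column order: forest, grassland, water -->
  <un:sourceCategories>forest grassland water</un:sourceCategories>
  <un:targetCategories>forest grassland water</un:targetCategories>
  <!-- counts listed row by row: e.g. 85 forest pixels correctly
       labelled, 10 confused with grassland, 5 with water -->
  <un:counts>85 10 5  12 80 8  3 6 91</un:counts>
</un:ConfusionMatrix>
```

Because the counts are plain numbers rather than prose, a parser can normalise each row into misclassification likelihoods (e.g. P(labelled water | truly forest) = 5/100 in this sketch) without human intervention.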