Introduction

The purpose of this guide is to instruct the user in some advanced concepts within the CAS-Metadata project, including the ramifications of metadata extraction with regard to repository search, type checking, etc., and the use of MIME-type detection. For basic topics, including the basics of metadata, how to write metadata extractors, and explanations of existing metadata extractors, see our Basic Guide. The rest of this guide covers planning metadata for search, planning metadata for provenance, and MIME-type detection.

Planning Metadata for Search

As discussed in the Basic Guide, one of the primary uses of metadata is search. When you consider search, remember that neither the CAS-Metadata container nor the CAS-Filemgr knows how people will search your data. You do.

We recommend developing a data dictionary as a best practice. The IBM Dictionary of Computing describes a data dictionary as a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format." This concept is closely related to that of an ontology.

Identifying the attributes of, and the relationships between, the products in your system will help you develop appropriate product types. It will also tell you which metadata you need to extract from products to establish those attributes and relationships.

In the next subsections, we will discuss specific aspects of metadata extraction that impact downstream search.

Missing Elements

Because metadata extraction is a separate activity from ingestion (as discussed in the Basic Guide), there can be a mismatch between the metadata elements that an extractor produces and the metadata elements that the CAS-Filemgr associates with a product type. The CAS-Filemgr therefore ingests only the intersection of the metadata extracted from the product and the metadata associated with the product type in the CAS-Filemgr configuration. This means that missing elements are possible: an element that is configured for the product type but not produced by the extractor will be absent from the catalog.
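The intersection rule can be illustrated with plain Java collections. This is only a sketch of the idea, not the CAS-Filemgr implementation: the class ElementIntersection, its methods, and the element names (Filename, Checksum, StartDateTime) are hypothetical and chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, NOT part of the CAS-Filemgr API. It only mimics the
// "intersection" behavior described above using standard collections.
final class ElementIntersection {

    // Keeps only the extracted elements that the product type declares.
    static Map<String, List<String>> retained(Map<String, List<String>> extracted,
                                              Set<String> declaredElements) {
        Map<String, List<String>> kept = new LinkedHashMap<>(extracted);
        kept.keySet().retainAll(declaredElements);
        return kept;
    }

    // Declared elements that the extractor never produced: these go missing.
    static Set<String> missing(Map<String, List<String>> extracted,
                               Set<String> declaredElements) {
        Set<String> absent = new TreeSet<>(declaredElements);
        absent.removeAll(extracted.keySet());
        return absent;
    }

    public static void main(String[] args) {
        // Assumed example data: what an extractor produced for one product.
        Map<String, List<String>> extracted = new LinkedHashMap<>();
        extracted.put("Filename", List.of("granule-001.dat"));
        extracted.put("Checksum", List.of("abc123"));

        // Assumed example data: what the product type is configured to hold.
        Set<String> declared = Set.of("Filename", "StartDateTime");

        System.out.println("Ingested: " + retained(extracted, declared).keySet());
        System.out.println("Missing:  " + missing(extracted, declared));
    }
}
```

Running a check like missing(...) against your own element lists before ingest is one way to catch an extractor and a product type that have drifted apart.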

String-based Comparison

Metadata values are stored as strings in CAS-Metadata. While there are a number of good reasons for this, it is a design point that has important ramifications for search. Specifically, all metadata elements should be comparable. Strings are, of course, comparable, but without some forethought a string-based comparison can behave differently from a type-based comparison.
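To make the difference concrete, the following sketch sorts the same three values first as strings and then as integers. The class and method names are hypothetical and use only the Java standard library.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration: the same values ordered two different ways.
final class LexicographicSurprise {

    // Orders values character by character, as a string-based catalog would.
    static List<String> sortAsStrings(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    // Orders values by their numeric meaning (a type-based comparison).
    static List<String> sortAsIntegers(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.comparingInt((String s) -> Integer.parseInt(s)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> values = List.of("2", "10", "1");
        System.out.println(sortAsStrings(values));  // [1, 10, 2]
        System.out.println(sortAsIntegers(values)); // [1, 2, 10]
    }
}
```

A range query behaves the same way: a string comparison would place "10" between "1" and "2".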

This is where standards come into play. For various types, there are standard string-based representations that ensure a string comparison behaves identically to a type-based comparison. There is currently no plan to enforce these standards (and, depending on the particular type, the application domain, etc., there might be multiple competing standards - e.g., TAI vs. UTC formatted time strings).

Some example string representation standards by type:

Date-Time

With time, consistency is key. There are multiple formats, such as Julian Day Numbers or UTC date-time strings, and multiple time standards, such as UTC, local time, and TAI. You also need to be aware of leap-second observance and of local time conventions such as daylight saving time, depending on the representation and standard you select.

Additionally, it is important to remember that inter-product consistency can be just as important as intra-product consistency because there are many downstream use cases of the search features of CAS-Filemgr and other CAS components that involve cross-product comparisons.
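As one possible convention (an assumption for illustration, not something CAS-Metadata mandates), an extractor could normalize every timestamp to UTC and write it as a fixed-width string with the most significant field first, so that string order matches chronological order. The sketch below uses the standard java.time API; the class name SortableTimestamps and the chosen pattern are hypothetical. Note that java.time does not model leap seconds, so this sketch does not address leap-second observance.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: renders instants as fixed-width UTC strings.
final class SortableTimestamps {

    // Year first, seconds last, always UTC: for four-digit years, comparing
    // two of these strings gives the same result as comparing the instants.
    private static final DateTimeFormatter UTC_FIXED_WIDTH =
        DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss'Z'")
                         .withZone(ZoneOffset.UTC);

    static String toMetadataValue(Instant instant) {
        return UTC_FIXED_WIDTH.format(instant);
    }

    public static void main(String[] args) {
        // A local time with a -04:00 offset and a UTC time from two products.
        Instant local = OffsetDateTime.parse("2009-07-01T08:00:00-04:00").toInstant();
        Instant utc = Instant.parse("2009-07-01T09:30:00Z");

        String a = toMetadataValue(local); // 2009-07-01T12:00:00Z
        String b = toMetadataValue(utc);   // 2009-07-01T09:30:00Z

        // The raw local string "08:00" would sort before "09:30",
        // but after normalization the UTC product correctly sorts first.
        System.out.println(b.compareTo(a) < 0); // true
    }
}
```

Applying the same convention in every extractor is what makes cross-product comparisons meaningful.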

Integers

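Integers can be easily represented as comparable strings, but you must remember to pad correctly. Compared character by character, the string "1" is greater than "01234", but "0001" is less than "01234". Padding every value to the same width (for example, "00001" and "01234") makes the string ordering match the numeric ordering.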

As with Date-Time, appropriate numerical representation is the responsibility of the metadata extractor, though we have built some additional support for representational transformations into the ingest process of the CAS-Filemgr.
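A minimal sketch of zero-padding in an extractor follows. The class name PaddedIntegers and the width of 10 (the number of digits in Integer.MAX_VALUE) are assumptions for illustration. The sketch handles only non-negative values, since a leading minus sign would need its own convention.

```java
import java.util.Locale;

// Hypothetical helper: fixed-width decimal strings for non-negative ints.
final class PaddedIntegers {

    // Integer.MAX_VALUE (2147483647) has 10 digits, so 10 is always enough.
    static final int WIDTH = 10;

    static String pad(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("only non-negative values are supported");
        }
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", value);
    }

    public static void main(String[] args) {
        System.out.println(pad(1));    // 0000000001
        System.out.println(pad(1234)); // 0000001234

        // Unpadded, "1" sorts after "01234"; padded, 1 sorts before 1234.
        System.out.println("1".compareTo("01234") > 0);        // true
        System.out.println(pad(1).compareTo(pad(1234)) < 0);   // true
    }
}
```

Whatever width you choose, every extractor that writes the same element has to use it, or the comparison breaks again.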

More on Padding...

Floating Point Numbers

The most prevalent floating-point representation is IEEE 754-2008. This is a convenient representation because, among other reasons, it is the representation that Java uses. If you develop an extractor in Java, you can therefore use the static floatToIntBits(float value) method of the Float class to obtain a value's bit pattern. Remember that, as with integers, you will need padding.
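The sketch below shows one common way to turn the bits returned by Float.floatToIntBits into a padded, sortable string. The bit-flipping step is our assumption rather than something this guide prescribes: the raw bit patterns of negative floats order in reverse, so the sketch inverts them before formatting. The class name SortableFloats is hypothetical.

```java
import java.util.Locale;

// Hypothetical helper: fixed-width hex keys whose string order matches
// the numeric order of the floats they encode.
final class SortableFloats {

    static String toSortableHex(float value) {
        int bits = Float.floatToIntBits(value);
        // Negative floats have the sign bit set, and larger magnitudes have
        // larger raw bits, so invert every bit to reverse their order.
        // Non-negative floats only get the sign bit set, which places them
        // after all negative values.
        int key = (bits < 0) ? ~bits : (bits ^ 0x80000000);
        // %08x prints the int as unsigned hex, zero-padded to 8 characters.
        return String.format(Locale.ROOT, "%08x", key);
    }

    public static void main(String[] args) {
        float[] samples = {-2.5f, -1.0f, 0.0f, 1.5f, 10.0f};
        for (float f : samples) {
            // Keys are printed in ascending string order for ascending floats.
            System.out.println(toSortableHex(f) + "  " + f);
        }
    }
}
```

The keys are not human-readable, so you may want to store the original value in a second element if people need to read it back.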

Blobs, URLs, and Character Sets

Remember that text encoding is important. Depending on the catalog extension point used by the CAS-Filemgr, metadata values might not be formatted correctly for storage. For example, if a metadata object containing a blob of text is ingested into a CAS-Filemgr instance that is configured to run with a DataStore extension backed by an Oracle DBMS, then UTF-8 encoding is important.
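As a hedged sketch of how an extractor might guard against encoding problems, the following uses the standard java.nio.charset API to encode values as UTF-8 and to check whether raw bytes are valid UTF-8 before they become metadata values. The class name Utf8Values and the decision to reject invalid input are assumptions. What your catalog backend actually requires depends on its configuration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for checking text before it becomes a metadata value.
final class Utf8Values {

    // Encodes a Java string explicitly as UTF-8 instead of relying on the
    // platform default charset.
    static byte[] encode(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "\u00C5ngstr\u00F6m"; // contains non-ASCII characters

        byte[] utf8 = encode(text);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(latin1)); // false
    }
}
```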

Planning Metadata for Provenance

Coming Soon...

MIME-type Detection

Coming Soon...

Conclusion

This is intended to be a living document that discusses advanced topics within the CAS-Metadata project; it is not comprehensive. Our Basic Guide covers the basic topics within CAS-Metadata.