Monday, January 10, 2022

The Scanned Image Metadata Project

This is the first in a series of posts about putting metadata into scanned picture files, including why it's desirable, how I approach it, and how well it works. The series consists of: 

Part 1: The Scanned Image Metadata Project (this post)

Part 2: Standards, Guidelines, and ExifTool

Part 3: Dealing with Timestamps

Part 4: My Approach

Part 5: Viewing What I Wrote

Part 6: The Metadata Removal Problem

Part 7: Thoughts after 4000+ Scans


Not long ago, my wife asked if I could find a particular photograph. I dug up what turned out to be a slide from 1992. The exercise reminded me that the bulk of our photographic history exists only in non-digital form: slides, prints, and negatives. That puts it one disaster away from annihilation. A fire, a flood, a theft, and we lose everything. Not that a sudden catastrophe is necessary. Slides, negatives, and prints degrade over time. Colors shift. Details fade.

I've known for many years that I should have our pictures scanned into digital form. In 2008, I looked down that road, but I was stymied by the challenge of storing metadata. Getting images into files is easy. Capturing the metadata for the pictures--who's in them, when and where they were taken, etc.--is anything but. 

The image metadata problem is an old one. News photographers have long needed a way to electronically convey photos and associated information to their central offices. By 1991, there was a technical standard for it. Thirty-plus years later, you'd think we'd have a well-established, straightforward way to handle image metadata. You'd be wrong. As a comment at Stack Exchange Photography put it last month, "Image and video metadata is a complete hot mess."

There are two basic reasons for this. First, there are three overlapping standards for metadata storage. All are in broad use. Terminology and conventions within and among them are inconsistent and confusing. One standard's Description field is another standard's Caption_Abstract, for example, and that's sometimes referred to simply as Caption. It's different from the Title field, which is not to be confused with the UserComment field.

The second issue is that programs working with metadata layer on additional inconsistent and confusing names. It's not easy to remember that one standard's DateTimeOriginal field is called DateCreated in some programs, but DateCreated is completely different from CreateDate, which is the name some programs use for a field officially called DateTimeDigitized. Though the Title field is not the same as the Description field, File Explorer and Photo Viewer on Windows 10 sometimes show the value of the Description field with the label Title. Sometimes with the label Subject. Occasionally with both.
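
To make the name game concrete, here is a minimal sketch of a lookup table built only from the examples in the paragraph above. The names `PROGRAM_LABELS` and `official_names` are hypothetical, and the table is illustrative rather than complete. It shows how one on-screen label can point at more than one underlying field.

```python
# Official field name -> labels that, per the examples above, some programs
# display for that field. Illustrative only; real programs vary.
PROGRAM_LABELS = {
    "DateTimeOriginal": {"DateCreated"},
    "DateTimeDigitized": {"CreateDate"},
    "Description": {"Title", "Subject"},
    "Title": set(),
}


def official_names(label):
    """Return every official field that a displayed label might refer to."""
    return sorted(
        name
        for name, labels in PROGRAM_LABELS.items()
        if label == name or label in labels
    )


print(official_names("CreateDate"))  # ['DateTimeDigitized']
print(official_names("Title"))       # ['Description', 'Title']
```

The second call is the troublesome case: a value labeled Title might come from the Title field or from the Description field, and the label alone doesn't say which.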

Mastering the name game is one challenge. Dealing with redundancy is another. Each image file typically has three description fields, for example, one per standard. Do you write the same data into all three fields, thus ensuring consistency, but risking incoherence if one of the fields is edited, or do you write to only a single field and leave the other two blank? Sorry--trick question! Many programs automatically write to all three fields, even if you edit only one. At the same time, some programs that show descriptions read from only one of the fields, so if the one they look at is empty, you won't see anything, even if other description fields have information in them. Redundancy and potential inconsistency are, sadly, the only practical choice.
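
As a rough sketch of what living with redundancy can look like, the code below builds an ExifTool command line that writes one description into all three fields, and checks whether the values read back from a file still agree. Several things here are assumptions: that the three fields are Exif's ImageDescription, IPTC's Caption-Abstract, and XMP's Description (spelled the way ExifTool spells them), the helper names `exiftool_write_args` and `descriptions_agree`, and the sample file name. It is not necessarily how any particular program does it.

```python
# The three description tags in ExifTool's group:tag spelling.
# Identifying the standards as Exif, IPTC, and XMP is an assumption here.
DESCRIPTION_TAGS = (
    "EXIF:ImageDescription",
    "IPTC:Caption-Abstract",
    "XMP-dc:Description",
)


def exiftool_write_args(path, description):
    """Build an ExifTool command line that sets all three fields at once."""
    assignments = [f"-{tag}={description}" for tag in DESCRIPTION_TAGS]
    return ["exiftool", *assignments, path]


def descriptions_agree(values):
    """Return True if every non-empty description field holds the same text."""
    present = {text for text in values.values() if text}
    return len(present) <= 1


# Hypothetical file name and description, for illustration only.
args = exiftool_write_args("scan_0001.jpg", "Bob's retirement, Lincoln Beach")
print(args)

# One field edited on its own: the redundant copies no longer match.
edited = {
    "EXIF:ImageDescription": "Bob's retirement, Lincoln Beach",
    "IPTC:Caption-Abstract": "Bob's retirement party",
    "XMP-dc:Description": "Bob's retirement, Lincoln Beach",
}
print(descriptions_agree(edited))  # False
```

The argument list is only built, not executed. Passing it to `subprocess.run(args, check=True)` would perform the write, provided ExifTool is installed and on the path.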

Little wonder that some people throw up their hands and look for a solution not involving embedded metadata. One approach is to store the metadata separately from the image, often using the image file's name as a key to look up in a spreadsheet or text file. For me, this is a non-starter. It's too easy for the image and the metadata to get separated. Another approach is to use an image's metadata as its file name. This is clumsy even in concept ("Joe, Bob, Sue, Fred at Lincoln Beach celebrating Bob's retirement 1980-07-16.jpg"), but a bigger problem is that it doesn't address photos stored in the cloud (where file names may not be visible) and photos sent via text message (where the sender's file name is not provided). Image file metadata is a mess, to be sure, but it's still the best of a bad lot.

I want to store metadata about a scanned photo in its image file such that it will be easily accessible in any program that displays metadata. Unless expressly removed from the file, the metadata should stay with the image if it's copied, moved, emailed, texted, uploaded, or shared in the cloud. The comments written on the back of a physical photograph stay with the photo as it's moved about. Image metadata should do the same.

Achieving my goal requires figuring out the following:

  • What metadata should be stored.
  • Which metadata fields it should be stored in.
  • How to put metadata into those fields.
  • How to view metadata in an image file.
  • How to preserve metadata when an image is moved around (e.g., emailed, texted, or uploaded).

In recent weeks, I've spent a lot of time wrestling with these issues. In subsequent blog posts, I'll explain what I've learned and the conclusions I've come to. Links to the full series are at the top of this post.


Irfan Surdar said...

Wishing you the best of luck with your project with the hope that finally somebody will be able to devise a standard acceptable to all.

Avisenna said...

Looking forward to reading more about this. There are also some issues when you add metadata to images on one OS and transfer them to another OS. Not all data may be viewable.

Scott Meyers said...

@Avisenna: In my experience, the metadata you see is not dependent on the OS, but rather on the program you're using to view the metadata. Some programs show more than others. I'll address this issue in a later blog post.