I used a couple of new (to me) command line tools this week, so I thought I'd introduce them. Feel free to skip ahead to THE TOOLS section if you don't care about the whys or hows.
BACKGROUND STORY:
I have had a family server for many years. We, the whole family, have basically just been dumping digital photos onto it with not much regard to any sort of organization or duplication. My wife typically copies her photos to her laptop, which I back up to the server at regular intervals. Before each new outing or family event, I clear the cameras' photo cards onto the server. My sons occasionally add their stuff as well, and we often borrow memory cards from friends and family members to capture all their shots of events and get-togethers.
Last year I began the long sloooowwww process of scanning ancient family slides from 1960 to about 1977. There are thousands of them and it's tough to do more than a handful in a day.
Bottom line: I had built up 68GB of photos and videos, and viewing them in an enjoyable way was almost impossible. What's the point of even taking photos if you can't retrieve them conveniently for viewing when your Mom comes over?
FIRST ATTEMPT AT ORGANIZATION:
Initially, I believed separation was king. I divided all the photos into years and months by their file dates. Made sense at the time. However, during viewing sessions, I would think "Did we go to Paris in July or August of 2008? Was the Alaska cruise 2007 or 2009?" This also left several photos languishing in folders all their own. A single photo taken in May nested three-deep into the file structure - a needless state.
Clearly, year and month were insufficient. Add to that the fact that a couple of the cameras did not have their dates set properly, or the move and copy processes had changed the file dates. Slides have no date at all except the stamp on the cardboard frame showing the month and year they were developed, and scans carry the date of the scan, not the date of the source photo. Plus - many photos were not of the "family" variety, but served utilitarian purposes: photos of cars or house designs we like, objects for later use in decorating, etc.
In short: I had a big ol' mess.
THE ORGANIZATION REVISION:
I settled on keeping the year as the primary sort criterion. I then used major events as the secondary sort - vacations, weddings, etc. I figured the year we went to Alaska was easy enough to remember, but the month was not. Non-family type photos got their own primary folders: properties, cars, humorous, projects, etc. The remaining day-to-day family snaps were left all bundled together in the primary (year) folder. Additionally, I removed all the videos for later stitching together, editing, and re-sorting (this was 25GB of the total).
This left me with this basic directory tree:
/shared/Pictures
    /2007
        /Alaska
        /Thanksgiving
This folder arrangement seems much more useful. I can jump right to our Paris trip or the Alaska cruise with little delay and re-playing the entire year of 2004 when the baby first came home is easy too.
This re-alignment also revealed thousands of duplicates that had been double-stored - half of them in the wrong folder or left in a camera folder and not sorted at all - and it revealed the mis-dating of many photos. Those dupes were easily discovered if they hadn't been renamed or re-dated. Another issue is that each camera uses its own file naming system. A year might be spread out into six different groups of photos or more - not a very useful viewing order.
A re-naming scheme was also sorely needed. One that made sense and aided organization. The actual name of a photo isn't as important as having it in the right place and sequence. The photo slide-show software doesn't care what the file name is, just the order.
NEW NAMES:
First: the re-naming. The only naming scheme that made sense to me was a date/time label. This would serve to keep the photos in their proper year folder and in the proper sequence in that folder. Here's an example of what I wanted:
070812_132244.jpg
It looks totally random until you decipher it. The first segment is YEAR/MONTH/DAY and the second is HOUR/MINUTE/SECOND. So this photo was taken August 12th, 2007 at 1:22 and 44 seconds pm. I don't care that it's not a descriptive name. It is a very useful name for my needs. In theory, this sorts and orders all the photos exactly how I want regardless of source (assuming of course the clocks in the camera are reasonably correct).
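To make the format concrete, here is a minimal shell sketch that builds such a name from a timestamp. The photo_name helper is hypothetical (it is not part of any tool mentioned here), and it assumes GNU date, since the -d option is not available in BSD or macOS date.

```shell
#!/bin/sh
# Hypothetical helper: turn a timestamp into a YYMMDD_HHMMSS.jpg name.
# Assumption: GNU date is installed (its -d option parses the timestamp).
photo_name() {
    date -d "$1" +%y%m%d_%H%M%S.jpg
}

# The example photo from the text: August 12th, 2007 at 1:22:44 pm.
photo_name '2007-08-12 13:22:44'   # prints 070812_132244.jpg
```

Because every field is zero-padded and runs from largest unit to smallest, a plain alphabetical sort of these names is also a chronological sort.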
Here's the cool part: virtually all digital cameras include this information with the jpg in a data segment called exif data. But how to get that data to replace the file name without typing it in? I have thousands of photos!
CULL THE DUPLICATES:
The re-naming and changes to the folder scheme revealed about 2000 dupes! Could there be more? Probably.
THE TOOLS:
exiftool
This tool is extremely detailed and I won't go into anything here other than what I'm using it for. I recommend you do your own research to get an inkling of the hundreds of things this tool can do. Start here as I did. Credit for the info below goes to this webpage.
To rename my photos, the command I used was:
exiftool '-filename<CreateDate' -d %y%m%d_%H%M%S%%-c.%%le -r -ext jpg /shared/Pictures
Here's the breakdown of each element:
- '-filename<CreateDate' means rename the image file using the image's creation date and time.
- -d means "Set format for date/time values".
- %y%m%d_ means the first part of the new file name should be composed of the last two digits of the creation-date year, followed by the month and day, both represented by two digits. The underscore _ means put in an underscore after the date part of the file name.
- %H%M%S means add the hour, minute, and second of the creation time, all represented by two digits.
- %%-c means that if two images have the same file name up to this point in the naming process, it will automatically add an incremented number to the end to give each image a unique name. Note the doubled %% is required because the string passes through the date formatter first: a single % would be read as a date code, while %% yields the literal % that exiftool needs for its own %-c code. The "-" before the "c" isn't really necessary, but it puts a dash before the copy number.
- .%%le means keep the original file name extension, but make it lower-case if it was originally upper-case, a nice option when cameras insist on using "JPG" instead of "jpg". (If you prefer upper-case extensions, then use .%%ue. If you prefer to keep the original case intact, use .%%e.)
- -ext jpg means only rename files with the "jpg" extension. To rename all image files in the source folder, don't specify any extensions or you can add other extensions by adding more -ext switches followed by your desired extension, one -ext for each extension.
- -r means recurse through all sub-directories below the target folder.
- /shared/Pictures is the absolute path to the top folder holding all my images to be renamed. Use your own path, of course.
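The combined effect of %%-c and %%le can be pictured with a small shell sketch. This only imitates the behavior described in the list above; it is not how exiftool is implemented, and the unique_name helper is a hypothetical name invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper imitating the effect of %%-c and %%le.
# Prints a free file name in directory $1 for stem $2 and extension $3.
unique_name() {
    dir=$1
    stem=$2
    # Like %le: force the extension to lower case.
    ext=$(printf '%s' "$3" | tr '[:upper:]' '[:lower:]')
    candidate="$stem.$ext"
    n=0
    # Like %-c: add -1, -2, ... only while the name is already taken.
    while [ -e "$dir/$candidate" ]; do
        n=$((n + 1))
        candidate="$stem-$n.$ext"
    done
    printf '%s\n' "$candidate"
}

demo=$(mktemp -d)
unique_name "$demo" 070812_132244 JPG    # prints 070812_132244.jpg
touch "$demo/070812_132244.jpg"
unique_name "$demo" 070812_132244 JPG    # prints 070812_132244-1.jpg
rm -r "$demo"
```

Two files that end up with the same timestamp and differ only by a copy number are therefore strong candidates for being the same photo stored twice.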
Unless the jpg has missing or malformed exif data, this command will do the renaming. In my case, this renamed about 80% of all my jpgs, uncovered 2000ish dupes, and significantly reduced the remaining work ahead. Eventually, I will be renaming the remaining scans and photos manually. exiftool will also allow you to re-date/time the exif date or even incrementally add time to the exif data so the date/time stamps are correct.
findimagedupes
This command has only a tenth of the options exiftool has but performs an amazing amount of work. Basically, it resizes, blurs, re-sizes again, and then compares (all in tmp or memory - no files are harmed) all the given images to one another and then makes a guess as to which ones are duplicates. As you can imagine, this takes some time. I let it run on the entire Pictures folder while I was at work, dumping the results into a text file for later use. I have no idea how long this took, but when I returned 10 hours later it was done:
findimagedupes -R -- . >alldupes.txt
I ran this while in my /shared/Pictures folder so all my photos were compared to each other. I then dumped the result file alldupes.txt (1000+ lines) into a spreadsheet so I could sort the results easily. I am now manually comparing the list to remove every last duplicate. The downside of this command is that similar photos are matched as dupes. Based on my file naming scheme, most of these erroneous dupe reports are easily discarded and again I have saved myself days and weeks of manual file comparison.
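If you would rather pre-sort the list in the shell than in a spreadsheet, something like the sketch below may help. It assumes each line of alldupes.txt holds one set of matching files separated by spaces, that no path contains a space, and that the files already carry the YYMMDD_HHMMSS names described earlier. The classify_dupes function is a hypothetical helper, and it only labels lines; it deletes nothing.

```shell
#!/bin/sh
# Hypothetical helper: label each duplicate set read from stdin.
#   same  = every file shares one timestamp prefix (likely true dupes)
#   check = timestamps differ (possibly just similar-looking photos)
# Assumptions: one set per line, space-separated, no spaces in paths.
classify_dupes() {
    awk '{
        verdict = "same"
        first = ""
        for (i = 1; i <= NF; i++) {
            name = $i
            sub(/.*\//, "", name)          # strip the directory part
            stamp = substr(name, 1, 13)     # the YYMMDD_HHMMSS prefix
            if (i == 1) first = stamp
            else if (stamp != first) verdict = "check"
        }
        print verdict "\t" $0
    }'
}

# Small made-up sample in place of the real alldupes.txt:
printf '%s\n' \
    './2007/070812_132244.jpg ./2007/Alaska/070812_132244-1.jpg' \
    './2007/070812_132244.jpg ./2007/070812_132251.jpg' | classify_dupes
```

On the real file this would be run as classify_dupes < alldupes.txt, optionally piped through sort to group the "check" lines together for a manual look.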
THE RESULTS (SO FAR):
Now I have about 99% of the dupes deleted and about 80% of the photos sensibly sorted and in order. Already, we can sit at the living room TV and enjoy a slide show, and I recovered over 20GB of drive space. I still have more to do - many files had no or bad exif data - but it's on the order of days' worth of work rather than months' or years' worth.
Good luck with your family photo project! Please post solutions you used if you have had similar problems. I'd love to compare results.