XML

p5-XML-SAX-0.99 Install Error

Tuesday, May 8, 2012

If you’re trying to install p5-XML-SAX-0.99 and get a

Can't locate XML/SAX/Exception.pm in @INC

error then you may need to

cd /usr/ports/textproc/p5-XML-SAX-Base
make distclean
make deinstall
make install clean

Then you can

cd /usr/ports/textproc/p5-XML-SAX
make distclean
make install clean

And you should get an error-free install. Apparently p5-XML-SAX-Base is a prerequisite that isn’t being cleanly detected or updated in the make process for p5-XML-SAX.

Posted at 05:42:59 GMT-0700

Category: FreeBSD • Technology

Managing (lots of) Digital Pictures

Sunday, July 31, 2011

This is an older method; I now do a lot of this with DigiKam on Linux.

Over the decades, I’ve taken a lot of digital pictures. I was a bit haphazard in backing them up to CDs, to random hard disks, etc. – meaning several copies. Over the years, bit rot has corrupted some copies, CDs from 20 years ago have started to go blank, etc. Once I put together a ZFS 6 FreeNAS box, I thought it would be a good place to organize them, especially once I started playing with Picasa’s face recognition tool, which is awesome for reminding me who some of those people are in those old .jpgs staring back through the bit flip block defects of the ages.

I’ve tried a couple of face recognition tools – Microsoft’s, some other thing that really sucked, and Picasa, and Picasa’s is by far the best. Unfortunately Picasa suffers horribly from Google Hubris, that infuriating disease that renders otherwise excellent technologies almost unusable. An example many people have run into is Google’s idiotic threading model in gmail. They’ve decided that all messages are non-hierarchical blobs, that the meta information means nothing, and that we should trust the lucky feeling. If the messages Google chooses to show us aren’t what we were actually looking for, then we are doing it wrong.

Picasa is infected with the same disease, but has it even worse. Picasa has one uniquely good trick: it tags faces fairly well. It is not a particularly good tool, certainly not the best, for many other tasks people do with images. But touching your images with any other program corrupts Picasa’s database and, entertainingly, wipes out any work that you’ve done with Picasa. As reiterated over and over by Google’s reps in the Picasa forums, that is just proof that you’re doing it wrong: you have failed to recognize that the only right way to do any of these tasks is with Picasa, and failed to understand that anything anyone would legitimately ever actually want to do with a digital image falls into the set of features Picasa has (or it is not legitimate).

And, of course, Google and Picasa will be with us forever, just like every image management and editing application that I was using back in 1990 when I started taking digital photos.

My little image collection, once fully deduplicated, is 52,000+ images and 122 GB of data, which I think crosses most predictable size fail thresholds, so if these tools work here, they should be pretty reliable for most people. If you don’t get it yet, and still fail to adhere to the Google Way, the following utilities aided my heresy.

Face Tagging (Fix Picasa with AvPicFaceXmpTagger)
If it wasn’t for the face tagging feature, I’d never use Picasa. I can’t wait until somebody competent writes a face tagging application that is as well written, straight forward, and standards compliant as Friedemann Schmidt’s GeoSetter – a gold standard in image utilities matched only by Irfan Skiljan’s IrfanView. Until then, there is, alas, only Picasa.

With a large collection of images, especially those with crowd shots, one quickly discovers that even Picasa’s devs haven’t thought through the UI very well yet: there’s no way to reject large groups of pictures. It is also very tedious to work in manual mode: you can’t add faces in the “identify unknown faces” mode where you’d want to, for example. Another odd artifact is that to move a misidentified collection of faces to the right name, you have to select from a text-only popup list that quickly spans several 1200 pixel screens as you add names. If you type the first letter of the name, it jumps to it, but the scroll wheel doesn’t scroll the list, and if you start typing the second letter of the name thinking you’ll get to the one you want (a standard UI reflex) you instead jump around to names beginning with that letter. Bonus feature: if you have only one person in the list whose name begins with that letter, the reassignment executes automatically, which can make it hard to find where the pictures even went.

If it were me, I’d add an “indicate face” mode where I can indicate with just a click (not click, drag, name each time) where a face is and trigger a “look harder” iteration of the detection algo. It would also be useful to hint to the algo that a folder of images has more faces than already detected, try again. The algo should use meta information to aid in narrowing – for example certain faces tend to appear in different periods of one’s life. A good example might be taking a vacation with a friend: in that folder, everyone who kind of looks like the friend is more likely to be so. That is, look at frequency of appearance by metadata cluster and weight accordingly where metadata might be folder, file naming structure, GeoIP, date, time, etc.

But the huge problem with Picasa is that for reasons that could only make sense to a company that is absolutely, religiously certain they know the one and only true way to do anything correctly, Picasa writes the face ID information to a contacts.xml file instead of using standards-compliant XMP face tagging. This means that when your Picasa database gets corrupted (and it will, regularly) most of your face tagging efforts are lost unless you use a utility to write the face tag data into the image’s own metadata so it stays with the picture.

Fortunately, there is a tool to do just that: Andreas Vogel’s AvPicFaceXmpTagger. This utility will read the contacts.xml file and write the data into the image files as XMP compliant tags so the work will stay with your images. I ran it on my entire pre-deduplicated collection before deduplicating, and while it took about 20 hours, it did not barf.

What is particularly annoying is that the face detection algorithm is actually quite good; it is the database management that is beyond useless. Google has no excuse for being bad at information management. The meta information being attached to a picture couldn’t be simpler – a name and coordinates. The contacts.xml file is intolerably fragile and completely tied to Picasa.

GeoTagging (Use GeoSetter)

Picasa used to be my geotag program, but then I found GeoSetter, and I completely abandoned Picasa’s inferior geotagging features and never looked back. It is now just a face recognition tool. It pretty much sucks at managing the data, and while AvPicFaceXmpTagger fixes the inexcusable shortcoming of not writing XMP tags with the face data, as soon as there’s a GeoSetter-quality, XMP-compliant face tagging solution, Picasa is so voted off the hard disk.

GeoSetter uses map integration to make tagging pictures easy, but it also does The Right Thing: it writes hierarchical place and altitude information as tags. Oddly, Picasa reps argue that geotags don’t do that any more; that is, they only put the lat/lon into the picture, assuming that the user will always be connected to Google’s servers and can look up additional metadata from the lat/lon as needed, arrogant, self-centered morons that they are. Real world users that don’t live on the Google campus still interact with their image data when they’re not connected to the interwebs, as difficult as this would be for Google to understand and as contrary to their plans for world domination as it is.

But GeoSetter does it right, so don’t bother geotagging with Picasa. GeoSetter will also look up the additional place name metadata based on lat/lon data in the picture and write that to the appropriate EXIF fields. It is powerful, easy to use, and very reliable.

Folder Organization (Organize folders by date with AmoK Exif Sorter)

Organizing pictures is highly subjective and there’s no right way – well, except Picasa’s One True Way, but if you’ve read this far, you’re probably not drinking that Kool-Aid. I, personally, like YYYY/YYYY-MO/YYYY-MO-DY/Image name folder structures. I don’t end up with more than 300-400 images in any single folder that way (and that very rarely), so OSes don’t ever barf on a 20,000 image folder and it is fairly easy to find pictures. The tool I use to organize into year/month/day folders is AmoK Exif Sorter, which can read the EXIF create date and move images into my favorite folder structure automatically. It is a little slow on large folders of more than 2,000-3,000 images, but it didn’t fail on 20,000 images and sorted them all perfectly.
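
The year/month/day layout described above can be sketched in a few lines of Python. This is only an illustration of the path scheme, not how AmoK Exif Sorter works internally; the function name `dated_folder` is hypothetical, and it takes the capture time as an argument because reading the EXIF create date would need a third-party library.

```python
from datetime import datetime
from pathlib import PurePosixPath


def dated_folder(root: str, taken: datetime, filename: str) -> PurePosixPath:
    """Return root/YYYY/YYYY-MM/YYYY-MM-DD/filename for a capture time."""
    return (
        PurePosixPath(root)
        / taken.strftime("%Y")          # year folder
        / taken.strftime("%Y-%m")       # year-month folder
        / taken.strftime("%Y-%m-%d")    # year-month-day folder
        / filename
    )


if __name__ == "__main__":
    # prints /photos/2011/2011-07/2011-07-31/IMG_0001.jpg
    print(dated_folder("/photos", datetime(2011, 7, 31, 19, 27), "IMG_0001.jpg"))
```

Because the folder names sort lexically in date order, a plain directory listing is already chronological.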

This works well because I use the same image organization with my EyeFi card, which transmits images directly from my camera to my laptop via wifi and sorts them as it goes. Everything prior to getting the card was randomly sorted until Exif Sorter fixed it, but now it should stay in sync. I really like my EyeFi card, but if upload is enabled when I am not in range of a discoverable network, the card sometimes crashes and I lose the last couple of pictures taken. I’m not happy about that, but I usually remember to turn upload off from the camera interface, and it has only made me really sad a few times so far.

DeDuplication (AntiTwin and DupDetector)

If you’re as disorganized as I am then you’ll ultimately end up with quite a few extra copies of your images as the years go by. Some of my collections had more than 10 copies in the nearly two decades since I first took them. I actually use two tools for deduplication: AntiTwin and DupDetector; I tried Picasa’s deduplication tool but it sucks and it isn’t clear that it is actually removing duplicates, rather just faking you into doing work with it that will later be lost when you have to reinstall Picasa in a few days after the database gets corrupted again (see rotation, below).

I do first pass deduplication with AntiTwin and use the byte by byte comparison at 100% match to find bit-for-bit copies. This does not detect copies with different EXIF tags (which happens) or images that are scaled for email and cluttering up your disk along with their original resolution master images, but you can be confident you’re not going to lose anything. I directly delete the copies AntiTwin finds. AntiTwin also has an image compare function, but it is useless on a large image collection.
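
For anyone who wants to see what a 100% match pass amounts to, here is a minimal Python sketch. It is not AntiTwin’s method: it groups files by size and then by SHA-256 digest as a stand-in for a literal byte-by-byte comparison, and the names `file_digest` and `find_exact_duplicates` are hypothetical. It only reports groups; it deletes nothing.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root: str) -> list[list[Path]]:
    """Group files under root whose contents hash identically."""
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a bit-for-bit twin
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Checking sizes first means most files are never read at all, which matters on a collection of tens of thousands of images. As with the 100% match above, a copy with different EXIF tags will not be grouped with its original.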

To find scaled copies, copies with exif info, copies with minor bit rot, etc that AntiTwin won’t find, I use Prismatic Software’s DupDetector. I’ve found an odd mix of versions on download sites, and the author site is very slow, but it isn’t too huge and it works very well and has been recently updated. I use it to move, not delete copies, into a dead storage folder. If I make a mistake, the copies are still there, but I don’t need to have them in my primary search path. I am fairly confident that everything detected as a duplicate with match at 99.9% was actually a duplicate, but at 99.7%, it turned up some icon sized scaled pictures along with a lot of false matches in very dark pictures. I suggest first running at 100% in fully automatic mode, then cautiously at 99.9% in fully automatic mode; I only had 420 detected duplicates at 99.7%, and about half of those were true duplicates, so I ran at 99.7% in semi-auto mode.

Rotation (IrfanView)

One of the last steps for me is orienting all of my pictures upright using a JPEG Lossless rotation. In yet another facepalm move, Picasa fakes you out with rotations – it does not actually rotate the image, it just stores your rotation specification in a picasa.ini file in the folder, which only Picasa uses, and that’s only until that file gets hosed for some reason. So if you spent a couple of days scrolling through the giant list of all your images rotating them one by one in Picasa, you wasted your time. Sorry. Thank Google.

Fire up IrfanView, load a directory of images, or even all subdirectories, and you can autorotate a giant library according to EXIF information. If your pictures go back more than about 5 years, your camera probably didn’t have an orientation sensor, so auto-rotate won’t work. But Irfan’s thumbnail mode lets you select a few thousand images that need to be rotated the same way one by one (but quickly) and batch rotate them all losslessly.

If you do this, Picasa will still apply the picasa.ini rotation you created and it will be wrong, which is a good reminder not to use Picasa for anything any other program does better.

Posted at 19:27:57 GMT-0700

Category: photo • Positive • Reviews • Technology

Search Engine Enhancement

Wednesday, September 12, 2007

Getting timely search engine coverage of a site means people can find things soon after you change or post them.

Linked pages get searched by most search engines following external links or manual URL submissions every few days or so, but they won’t find unlinked pages or broken links, and it is likely that the ranking and efficiency of the search is suboptimal compared to a site that is indexed for easy searching using a sitemap.

There are three basic steps to having a page optimally indexed:

Generating a Sitemap
Creating an appropriate robots.txt file
Informing search engines of the site’s existence

Sitemaps
It seems like the world has settled on sitemaps for making search engines’ lives easier. There is no indication that a sitemap actually improves rank or search rate, but it seems likely that it does, or that it will soon. The format was created by Google, and is supported by Google, Yahoo, Ask, and IBM, at least. The reference is at sitemaps.org.

Google has created a Python script to generate a sitemap through a number of methods: walking the HTML path, walking the directory structure, parsing Apache-standard access logs, parsing external files, or direct entry. It seems to me that walking the server-side directory structure is the easiest, most accurate method. The script itself is on SourceForge. The directions are good, but if you’re only using directory structure, the config.xml file can be edited down to something like:

<?xml version="1.0" encoding="UTF-8"?>
<site
  base_url="http://www.your-site.com/"
  store_into="/www/data-dist/your_site_directory/sitemap.xml.gz"
  verbose="1"
  >

 <url href="http://www.your-site.com/" />
 <directory
    path="/www/data-dist/your_site_directory"
    url="http://www.your-site.com/"
    default_file="index.html"
 />
</site>

Note that this will index every file on the site, which can make for a large sitemap. If you use your site for media files or file transfer, you might not want to index every part of the site, in which case you can use filters to block the indexing of parts of the site or certain file types. The filters go inside the site element. If you only want to index web files you might insert the following:

 <filter  action="pass"  type="wildcard"  pattern="*.htm"           />
 <filter  action="pass"  type="wildcard"  pattern="*.html"          />
 <filter  action="pass"  type="wildcard"  pattern="*.php"           />
 <filter  action="drop"  type="wildcard"  pattern="*"               />
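
To see what those four filters do to a given URL, here is a small Python sketch. It assumes the filters are evaluated in order, that the first matching filter decides, and that a URL matching no filter is kept; that is my reading of the script’s behavior rather than a copy of its code, and `FILTERS` and `is_indexed` are hypothetical names.

```python
from fnmatch import fnmatchcase

# (action, wildcard pattern) pairs in the same order as the filters above.
FILTERS = [
    ("pass", "*.htm"),
    ("pass", "*.html"),
    ("pass", "*.php"),
    ("drop", "*"),
]


def is_indexed(url: str, filters=FILTERS) -> bool:
    """Return True if the first filter that matches the URL passes it."""
    for action, pattern in filters:
        if fnmatchcase(url, pattern):
            return action == "pass"
    return True  # assumed default: URLs that match no filter are kept


if __name__ == "__main__":
    for url in ("http://www.your-site.com/index.html",
                "http://www.your-site.com/photos/img_0001.jpg"):
        print(url, is_indexed(url))
```

Under those assumptions the order matters: moving the catch-all drop filter to the top would exclude everything.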

Running the script with

python sitemap_gen.py --config=config.xml

will generate the sitemap.xml.gz file and put it in the right place. If the uncompressed file size is over 10MB, you’ll need to pare down the files listed. This can happen if the filters are more inclusive than what I’ve given, particularly if you have large photo or media directories or something like that and index all the media and thumbnail files.
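
A quick way to check a generated file against that limit is to decompress it and count the entries. The following Python sketch uses only the standard library; `sitemap_stats` and `fits_limit` are hypothetical helper names, and the 10MB figure is simply the one quoted above.

```python
import gzip
import xml.etree.ElementTree as ET

MAX_UNCOMPRESSED_BYTES = 10 * 1024 * 1024  # the 10MB ceiling mentioned above


def sitemap_stats(path: str) -> tuple[int, int]:
    """Return (uncompressed size in bytes, number of <url> entries)."""
    with gzip.open(path, "rb") as handle:
        data = handle.read()
    root = ET.fromstring(data)
    # Compare local tag names so the count works whatever namespace is declared.
    url_count = sum(1 for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "url")
    return len(data), url_count


def fits_limit(path: str) -> bool:
    """True if the uncompressed sitemap is within the size ceiling."""
    size, _ = sitemap_stats(path)
    return size <= MAX_UNCOMPRESSED_BYTES


if __name__ == "__main__":
    import sys
    size, count = sitemap_stats(sys.argv[1])
    print(f"{count} URLs, {size} bytes uncompressed, ok={fits_limit(sys.argv[1])}")
```

If the count is far higher than the number of real pages, the filters are probably letting media or thumbnail files through.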

The sitemap will tend to get out of date. If you want to update it regularly, there are a few options. One is to use a WordPress sitemap generator (if that’s what you’re using and indexing), which does the right thing and generates a sitemap using relevant data available to WordPress and not to the file system (a good thing). You can also, or instead, add a cron entry to regenerate the sitemap regularly, for example

3  3  *  *  *  root python /path_to/sitemap_gen.py --config=/path_to/config.xml

will update the sitemap daily.

robots.txt

The robots.txt file can be used to exclude certain search engines, for example MSN if you don’t like Microsoft for some reason and are willing to sacrifice traffic to make a point. It also points search engines to your sitemap file. There’s kind of a cool tool here that generates a robots.txt file for you, but a simple one might look like:

User-agent: MSNBot                             # Agent I don't like for some reason
Disallow: /                                    # Path it isn't allowed to traverse

User-agent: *                                  # For everything else
Disallow: /cgi-bin/                            # Directory nobody can index
Sitemap: http://www.my_site.com/sitemap.xml.gz # Where my sitemap is.
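
Rules like these can be sanity-checked before uploading with the robots.txt parser in Python’s standard library. The sketch below feeds it a trimmed, comment-free version of the file above; `ROBOTS_TXT` and `load_rules` are hypothetical names, and the bot name `SomeOtherBot` is made up for the example. Keep in mind that real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: MSNBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Sitemap: http://www.my_site.com/sitemap.xml.gz
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    checks = [
        ("MSNBot", "http://www.my_site.com/index.html"),
        ("SomeOtherBot", "http://www.my_site.com/index.html"),
        ("SomeOtherBot", "http://www.my_site.com/cgi-bin/form.cgi"),
    ]
    for agent, url in checks:
        print(agent, url, rules.can_fetch(agent, url))
```

This parser uses the first rule that matches a path, so an empty `Disallow:` line placed ahead of a more specific rule would cancel that rule here.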

Telling the world

Search engines are supposed to do the work, that’s their job, and they should find your robots.txt file eventually and then read the sitemap and then parse your site without any further assistance. But to expedite the process and possibly enhance search results there are some submission tools at Yahoo, Ask, and particularly Google that generally allow you to add meta information.
Ask
Ask.com allows you to submit your sitemap via URL (and that seems to be all they do)
http://submissions.ask.com/ping?sitemap=http://www.your_site.com/sitemap.xml.gz
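
As I understand the sitemaps.org protocol, the sitemap location should be URL-encoded when it is passed as a query parameter like this. A minimal Python sketch for building such a ping URL follows; `build_ping_url` is a hypothetical helper, and it only builds the string without contacting any server.

```python
from urllib.parse import urlencode


def build_ping_url(endpoint: str, sitemap_url: str) -> str:
    """Append the sitemap location to a ping endpoint as an encoded parameter."""
    return f"{endpoint}?{urlencode({'sitemap': sitemap_url})}"


if __name__ == "__main__":
    # prints http://submissions.ask.com/ping?sitemap=http%3A%2F%2Fwww.your_site.com%2Fsitemap.xml.gz
    print(build_ping_url("http://submissions.ask.com/ping",
                         "http://www.your_site.com/sitemap.xml.gz"))
```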

Yahoo
Yahoo has some site submission tools and supports site authentication, which means putting a random string in a file they can find to prove you have write-access to the server. Their tools are at
https://siteexplorer.search.yahoo.com/mysites

with submissions at
https://siteexplorer.search.yahoo.com/submit.php

where you can submit sites and feeds. I usually use the file authentication, which means creating a file named with some random string (y_key_random_string.html) with another random string as the only contents. They authenticate within 24 hours.
It looks like submitting a feed also adds the site, though that isn’t entirely clear. If you don’t have a feed you may not need to authenticate the site for submission.
Google
Google has a lot of webmaster tools at
https://www.google.com/webmasters/tools/siteoverview?hl=en

The verification process is similar but you don’t have to put data inside the verification file so

touch googlerandomstring.html

is all you need to get the verification file up. You submit the URL to the sitemap directly.
Google also offers blog tools at
http://blogsearch.google.com/ping

where you can manually add the feed for the blog to Google’s blog search tool.

Posted at 13:25:13 GMT-0700

Category: FreeBSD • Technology
