Google’s Chrome is a useful tool to have around, but the security features have gotten out of hand and make it increasingly useless for real work without actually improving security.
After a brief rant about SSL, there’s a quick solution at the bottom of this post.
Chrome’s Idiotic SSL Handling Model
I don’t like Chrome nearly as much as Firefox, but it does do some things better (I have a persistent annoyance with pfSense certificates that cause slow loading of the pfSense management page in FF, for example). Lately I’ve found that the Google+ script seems to kill firefox, so I use Chrome for logged-in Google activities.
But Chrome’s handling of certificates is abhorrent. I’ve never seen anything so resolutely destructive to security and utility. It is the most ill-considered, poorly implemented, counter-productive failure in UI design and security policy I’ve ever encountered. It is hateful and obscene. A disaster. An abomination. The ill-conceived excrement of ignorant twits. I’d be happy to share my unrestrained feelings privately.
I’ve discussed the problem before, but the basic issues are that:
- The certificate authority is NOT INVALID, Chrome just doesn’t recognize it because it is self-signed. There is a difference, dimwits.
- This is a private network (10.x.x.x or 192.168.x.x) and if you pulled your head out for a second and thought about it, white-listing private networks is obvious. Why on earth would anyone pay the cert mafia for a private cert? Every web-interfaced appliance in existence automatically generates a self-signed cert, and Chrome flags every one of them as a security risk INCORRECTLY.
- A “valid” certificate merely means that one of the zillions of cert mafia organizations ripping people off by pretending to offer security has “verified” the “ownership” of a site before taking their money and issuing a certificate that placates browsers
- Or a compromised certificate is being used.
- Or a law enforcement certificate is being used.
- Or the site has been hacked by criminals or some country’s law enforcement.
A “valid” certificate doesn’t mean nothing at all, but close to it.
So one might think it is harmless security theater, like a TSA checkpoint: it does no real harm and may have some deterrent value. It is a necessary fiction to ensure people feel safe doing commerce on the internet. If a few percent of people are reassured by firm warnings and are thus seduced into consummating their shopping carts, improving ad traffic quality and thus ensuring Google’s ad revenue continues to flow, ensuring their servers continue sucking up our data, what’s the harm?
The harm is that it makes it hard to secure a website. SSL does two things: it pretends to verify that the website you connect to is the one you intended to connect to (but it does not do this) and it does actually serve to encrypt data between the browser and the server, making eavesdropping very difficult. The latter useful function does not require verifying who owns the server, which can only be done with a web of trust model like perspectives or with centralized, authoritarian certificate management.
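To make the distinction between encrypting and verifying concrete, here is a minimal Python sketch of the policy argued for above: always encrypt, but skip the certificate authority check when the host is an address on a private network. The helper name `context_for` is hypothetical, and this illustrates the idea only; it is not how Chrome or any other browser behaves.

```python
import ipaddress
import ssl


def context_for(host: str) -> ssl.SSLContext:
    """Return a TLS context; skip CA verification only for private IP literals.

    Hypothetical helper for illustration. Traffic is encrypted either way.
    """
    ctx = ssl.create_default_context()
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal (i.e. a hostname): keep normal CA verification.
        return ctx
    if addr.is_private:
        # Self-signed appliance on the LAN: encrypt, but don't demand a CA chain.
        ctx.check_hostname = False  # must be disabled before verify_mode
        ctx.verify_mode = ssl.CERT_NONE
    return ctx
```

A context built for 192.168.1.1 or 10.0.0.5 encrypts without asking who signed the certificate, while one built for a public address or a hostname still verifies as usual.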
How to fix Chrome:
The damage is done. Millions of websites that could be encrypted are not because idiots writing browsers have made it very difficult for users to override inane, inaccurate, misleading browser warnings. However, if you’re reading this, you can reduce the headache with a simple step (Thanks!):
Right click on the shortcut you use to launch Chrome and modify the launch command by adding the following “
Once you’ve done this, chrome will open with a warning:
YAY. Suffer my ass.
Java? What happened to Java?
Java sucks so bad. It is the second worst abomination loosed on the internet, yet lots of systems use it for useful features, or try to. There are endless compatibility problems with JVM versions and there’s the absolutely idiotic horror of the recent security requirement that disables setting “medium” security completely no matter how hard you want to override it, which means you can’t ever update past JVM 7. Ever. Because 8 is utterly useless because they broke it completely thinking they’d protect you from man in the middle attacks on your own LAN.
However, even if you have frozen with the last moderately usable version of Java, you’ll find that since Chrome 42 (yeah, the 42nd major release of chrome. That numbering scheme is another frustratingly stupid move, but anyway, get off my lawn) Java just doesn’t run in chrome. WTF?
Turns out Google, happy enough to push their own crappy products like Google+, won’t support Oracle’s crappy product any more. As of 42 Java is disabled by default. Apparently, after 45 it won’t ever work again. I’d be happy to see Java die, but I have a lot of infrastructure that requires Java for KVM connections, camera management, and other equipment that foolishly embraced that horrible standard. Anyhow, you can fix it until 45 comes along…
To enable Java in Chrome for a little while longer, you can follow these instructions to enable NPAPI (which enables Java). Type “chrome://flags/#enable-npapi” in the browser bar and click “
Occasionally you find the crankypants commentary about the “problems” with PGP. These commentaries are invariably written by people who fail to recognize the use modality that PGP is meant to address.
PGP is a cryptographic tool that is, genuinely, annoying to use in most current implementations (though I find the APG extension to the K9 mail app on the android as easy or easier to use than the current Enigmail implementation for Thunderbird.) The purpose of PGP is to encrypt the contents of mail messages sent between correspondents. Characteristics of these messages are that they have more than ephemeral value (you might need to reference them again in the future) and that the correspondents are not attempting to hide the fact that they correspond.
It is intrinsic to the capabilities of the tool that it does not serve to hide with whom you are communicating (there are tools for doing this, but they involve additional complexity) and all messages encrypted with a single key can be decrypted with that key. As such keys are typically protected by a password the user must remember, it is a sufficiently accurate simplification of the process to consider the messages themselves protected by a password that the owner of the messages must remember and might possibly be forced to divulge; that is the fundamental limit on the security of the messages so protected. (There are different tools for different purposes that exchange ephemeral keys that the user doesn’t ever know, that aren’t protected by a mnemonic password, and that the user therefore can never be forced to divulge.)
These rants against PGP annoy me because PGP is an excellent tool that is marred by minor usability problems. Energy expended on ignorantly dismissing the tool is energy that could be better spent improving it. By far the most important use cases for the vast majority of users that have any real reason to consider cryptography are only addressed by PGP. I make such a claim based on the following:
- Most business and important correspondence is conducted by email and despite the hyperventilation of some ignorant children, will remain so for the foreseeable future.
- Important correspondence, more or less by definition, has a useful shelf life of more than one read and generally serves as a durable (and legally admissible) record.
- There are people who have legitimate reasons to obfuscate their correspondents: email, even PGP encrypted email, is not a suitable tool for this task.
- There are people who have legitimate reason to communicate messages that must not be permanently recorded and for which either the value of the communication is ephemeral or the risk is so great that destroying the archive is a reasonable trade-off: email, even with PGP, is not a suitable tool for this task.
- There’s some noise in the rant about not being sufficient to protect against NSA targeted intercept or thwarting NSA data archiving, which makes an implicit claim that the author has some solution that might provide such protection to end users. I consider such claims tantamount to homicide. If someone is targeted by state-level surveillance, they can’t use a Turing-complete device (any digital device) to communicate information that puts them at risk; any suggestion to the contrary is dangerous misinformation.
Current implementations of PGP have flaws:
- For some reason, mail clients still don’t prompt for the import or generation of PGP keys whenever a new account is set up. That’s somewhat pathetic.
- For some reason, address books integrated into mail clients don’t have a field for the public key of the associate. This is a bizarre omission that necessitates add-on key management plug-ins that just make the process more complicated.
- It is somewhat complicated by IMAP, but no client stores encrypted messages locally in unencrypted form, which makes them difficult to search and reduces their value as an archival record. This has trivial security value: your storage device is, of course, encrypted or exposing your email should your device be lost is likely to be the least of your problems.
PGP is, despite these shortcomings, one of the most important cryptographic tools available.
Awesome properties of PGP keys no other cryptographic system can touch
PGP keys are (like all cryptographic keys in use by any system) long strings of seemingly random data. The more seemingly random, the better. They are, by that very nature, nonmnemonic. Public key cryptosystems, like PGP, have an awesome, incredibly useful characteristic that you can publish your public key (a long, random string of numbers) and someone you’ve never met can encrypt a message using that public key and only your private key can decrypt it.
Conversely, you can “sign” data with your “private key” and anyone can verify that you signed it by decrypting it with your public key (or more precisely a short mathematical summary of your message). This is so secure, it is a federally accepted signature mechanism.
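The sign-then-verify flow can be sketched with textbook RSA in a few lines of Python. This is a toy under loud assumptions: the key is the tiny classroom example (p=61, q=53), the digest is reduced to fit that modulus, and there is no padding, so it shows the shape of the operation and nothing about how PGP actually encodes signatures.

```python
import hashlib

# Textbook-sized RSA key (p=61, q=53). Far too small for real use.
N = 61 * 53      # modulus, 3233
E = 17           # public exponent
D = 2753         # private exponent: (E * D) % (60 * 52) == 1


def digest(message: bytes) -> int:
    """Short mathematical summary of the message, reduced to fit the toy modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """Apply the private exponent to the digest."""
    return pow(digest(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Apply the public exponent and compare against a freshly computed digest."""
    return pow(signature, E, N) == digest(message)
```

Anyone holding only `N` and `E` can run `verify`, but producing a signature that passes requires `D`.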
There’s a hypothesized attack called a Man In The Middle attack (often abbreviated “MITM”) that exploits the fact these keys aren’t really human readable (you can, but they’re so long you won’t) whereby an attacker (traditionally the much maligned Eve) intercepts messages between two parties (traditionally the secretive Alice and Bob), pretending to be Bob whilst communicating with Alice and pretending to be Alice whilst communicating with Bob. By substituting her keys for Alice’s and Bob’s, both Bob and Alice inadvertently send messages that Eve can decrypt and she “simply” forwards Bob’s to Alice using Alice’s public key and vice versa so they decrypt as expected, despite coming from the evil Eve.
Eve must, however, be able to intercept all of Alice and Bob’s communication or her attack may be discovered when the keys change, which is not practical in the real world on an ongoing basis (but, ironically, is easier with ephemeral keys). Pretending to be someone famous is easier and could be more valuable as people you don’t know might send you unsolicited private correspondence intended for the famous person: the cure is widely disseminating key “fingerprints” to make the discovery of false keys very hard to prevent. And if you expect people to blindly send you high-priority information with your public key, you have an obligation to mitigate the risk of a false recipient.
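A fingerprint is just a short hash of the long key, formatted so a human can compare it by eye or read it over the phone. The sketch below is a simplification: real OpenPGP fingerprints hash a specific packet encoding of the key, whereas this hypothetical `fingerprint` helper hashes whatever bytes it is given.

```python
import hashlib


def fingerprint(public_key: bytes) -> str:
    """Hash the key material and group the hex digits for reading aloud.

    Simplified illustration, not the OpenPGP fingerprint algorithm.
    """
    hexdigest = hashlib.sha256(public_key).hexdigest().upper()[:40]
    return " ".join(hexdigest[i:i + 4] for i in range(0, 40, 4))
```

The output is ten groups of four hex digits: short enough to print on a business card, long enough that Eve can’t practically produce a substitute key that matches.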
Occasionally it is hypothesized that this attack compromises the utility of PGP; it is a shortcoming of all cryptographic systems that the keys are not human readable if they are even marginally secure. It is intrinsic to a public key infrastructure that the public keys must be exchanged. It is therefore axiomatic that a PKI-based cryptographic system will be predicated on mechanisms to exchange nonmnemonic key information. And hidden key exchange, as implemented by OTR or other ephemeral key systems makes MITM attacks harder to detect.
While it is true that elliptic curve PKI algorithms achieve equivalent security with shorter keys, they are still far too long to be mnemonic. One might nominally equate a 4k RSA key with a 0.5k elliptic curve key, a non-trivial factor of 8 reduction with some significance to algorithmic efficiency, but no practical difference in human readability. Migrating to elliptic curves is on the roadmap for PGP (with GPG 2.1, now in beta) and should be expedited.
PGP Key management is a little annoying
Actually, it isn’t so much PGP that makes this true, but rather the fact that mail clients haven’t integrated PGP into the client. That Gmail and Yahoo mail will soon be integrating PGP into their mail clients is a huge step in the right direction even if integrating encryption into a webmail client is kind of pointless since the user is already clearly utterly unconcerned with privacy at all if they’re gifting Google or Yahoo their correspondence. Why people who should know better still use Gmail is a mystery to me. When people who care about data security use a gmail address it is like passing the temperance preacher passed out drunk in the gutter. With every single message sent. Even so, this is a step in the right direction by some good people at Google.
It is tragic that Mozilla has back-burnered Thunderbird, but on the plus side they don’t screw up the interface with pointless changes to justify otherwise irrelevant UX designers as does every idiotic change in Firefox with each release. Hopefully the remaining community will rally around full integration of PGP following the astonishingly ironic lead of the privacy exploiting industry.
If keys were integrated into address books in every client and every corporate LDAP server, it would go a long way toward solving the valid annoyances with PGP key management; however, in my experience key management is never the sticking point, it is either key generation or the hassle of trying to deal with data rendered opaque and nearly useless by residual encryption of the data once it has reached me.
Forward Secrecy has a place. It isn’t email.
A complaint levied against PGP that proves beyond any doubt that the complainant doesn’t understand the use case of PGP is that it doesn’t incorporate forward secrecy. Forward secrecy is a consequence of a cryptosystem that negotiates a new key for each message thread which is not shared with the users and which the system doesn’t store. By doing this, the correspondents cannot be forced to reveal the keys to decrypt the contents of stored or captured messages since they don’t know them. Which also means they can’t access the contents of their stored messages because they’re encrypted with keys they don’t know. You can’t read your own messages. There are messaging modalities where such a “feature” isn’t crippling, but email isn’t one of them; sexting perhaps, but not email.
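The mechanism is easy to sketch with a toy Diffie-Hellman exchange in Python. The parameters here are assumptions chosen for brevity, not security, and real protocols such as OTR add authentication and key ratcheting on top; the function names are hypothetical.

```python
import hashlib
import secrets

# Toy Diffie-Hellman parameters, chosen for brevity, not security.
P = 2**127 - 1   # a Mersenne prime
G = 3


def ephemeral_keypair() -> tuple[int, int]:
    """Fresh secret and public value, generated per conversation and never stored."""
    secret = secrets.randbelow(P - 3) + 2
    return secret, pow(G, secret, P)


def session_key(my_secret: int, their_public: int) -> bytes:
    """Both sides derive the same key without ever transmitting it."""
    shared = pow(their_public, my_secret, P)
    return hashlib.sha256(shared.to_bytes(16, "big")).digest()
```

Each side calls `ephemeral_keypair()`, swaps public values, and gets the same `session_key`. Once both secrets are thrown away at the end of the conversation, nobody can re-derive that key, the correspondents included, which is exactly why the stored messages become unreadable.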
Indeed, the biggest, most annoying, most discouraging problem with PGP is that clients do not insert the unencrypted message into the local message store after decrypting it. This forces the user to decrypt the message again each time they need to reference it, if they can ever find it again. One of the problems with this is you can’t search encrypted messages without decrypting them. No open source client I’m aware of has faced this debilitating failure of use awareness, though Symantec’s PGP desktop does (so it is solvable). Being naive about message use wouldn’t have been surprising for the first few months of GPG’s general use, but that this failure persists after decades is somewhat shocking and frustrating. It is my belief that the geekiness of most PGP interfaces has so limited use that most people (myself included) aren’t crippled by not being able to find PGP encrypted mail because we get so little of it. If even a small percentage of our mail was encrypted, not ever being able to find it again would be a disaster and we’d stop using encryption.
This is really annoying because messages have the frequently intolerable drawbacks of being ephemeral without the cryptographic value of forward secrecy.
Email is normally used as a messaging modality of record. It is the way in which we exchange contemplative comments and data that exceeds a sentence or so. This capability remains important to almost all collaborative efforts. The record thus created has archival value and is a fundamental requirement in many environments. Maximizing the availability, searchability, and ease of recall of this archive is essential. Indeed, even short form communication (“chat” in various forms), which is typically amenable to forward secrecy because of the generally low content value thus communicated, should have the option of PGP encryption instead of just OTR in order to create a secure but archival communications channel.
A modest proposal
I’ve been using PGP since the mid 1990s. I have a key from early correspondence on PGP from 1997 and mine is from 1998. Yet while I have about 2,967 contacts in my address book I have only 139 keys in my GPG keyring. An adoption rate of 4.7% for encrypted email isn’t exactly a wild success. I don’t think the problems are challenging and while I very much appreciate the emergence of cryptographically secure communications modalities such as OTR for chat and ZRTP for voice, I’ve been waiting for decades for easy-to-use secure email. And yet, when people ask me to help them set up encrypted email, I generally tell them it is complicated, I’m willing to help them out, but they probably won’t end up using it. Over the years, a few relatively easy to fix issues have retarded even my own use:
- The fact that users have to find and install a somewhat complex plugin to handle encryption is daunting to the vast majority of users. Enigmail is complicated enough that it is unusable without in-person walk-through support for most users. Even phone support doesn’t get most people through setup. Basic GPG key generation and management should be built into the mail client. Every time one sets up a new account, you should have to opt out of setting up a public key and there’s no reason for any options by default other than entering a password to protect the private key.
- Key fields should be built into the address book of every mail client by default. Any mail client that doesn’t support a public key field should be shamed and ridiculed. That’s all of them until Gmail releases end-to-end as a default feature, though that may never happen as that breaks Google’s advertising model. Remember, Google pays all their developers and buys them all lunch solely by selling your private data to advertisers. That is their entire business model. They do not consider this “evil,” but you might.
- I have no idea why my received encrypted mail is stored encrypted on my encrypted hard disk along with hundreds of thousands of unencrypted messages and tens of thousands of unencrypted documents. Like any sensible person who takes a digital device out of the house (or leaves it unprotected in the house), I encrypt my local storage to protect those messages and documents from theft and exploitation. My encrypted email messages are merely data cruft I can’t make much use of since I can’t search for them. That’s idiotic and cripples the most important use modality of email: the persistent record. Any mail client should permanently decrypt the local message store unless the user specifically requests a message be stored encrypted, an option that should be the same for a message that arrived encrypted or unencrypted.
- Once we solve the client storage failure and make encrypted email useful for something other than sending attachments (which you can save, ZOMG, in unencrypted form) and feeling clever for having gotten the magic decoder ring to work, then it would make sense to modify mail servers to encrypt all unencrypted incoming mail with the user’s public key, which mitigates a huge risk in having a mail server accessible on the internet: that the historical store of data there contained is remotely compromised. This protects data at rest (data which is often, but not assuredly, already protected in transit by encrypted transport protocols). End-to-End encryption using shared public keys is still optimal, but leaving the mail store unencrypted at rest is an easily solved security failure, and protection in transit is largely solved (and would be quickly if gmail bounced any SMTP connection not protected by TLS 1.2+).
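The “TLS 1.2 or nothing” policy in the last point is a few lines of configuration. As a rough client-side analogue, assuming Python’s standard `ssl` module, a context like the following refuses to negotiate anything older; the function name `strict_tls_context` is hypothetical.

```python
import ssl


def strict_tls_context() -> ssl.SSLContext:
    """TLS context that refuses to negotiate anything older than TLS 1.2."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

Passing this as the `context` argument to `smtplib.SMTP.starttls()` makes the handshake fail against a server that only offers older protocols, which is the sending-side version of bouncing the connection.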
Fixing the obvious usability flaws in encrypted email is fairly easy. Public key cryptography in the form of PGP/GPG is an incredibly powerful and tremendously useful tool that has been hindered in uptake by limitations of perception and by overly stringent use cases that have created onerous limitations. Adjusting the use model to match requirements would make PGP far more useful and far easier to convince people to use.
Phil Zimmermann’s essay “Why I Wrote PGP” applies today as much as it did in 1991:
What if everyone believed that law-abiding citizens should use postcards for their mail? If a nonconformist tried to assert his privacy by using an envelope for his mail, it would draw suspicion. Perhaps the authorities would open his mail to see what he’s hiding.
It has been almost 25 years and never has the need for universally encrypted mail been more obvious. It is time to integrate PGP into all mail clients.
As of Sept 30 2013, Xabber added Orbot support. This is a huge win for chat security. (Gibberbot has done this for a long time, but it isn’t as user-friendly or pretty as Xabber and it is hard to convince people to use it).
The combination of Xabber and Orbot solves the three most critical problems in chat privacy: obscuring what you say via message encryption, obscuring who you’re talking to via transport encryption, and obscuring what servers to subpoena for at least the last information by onion routing. OTR solves the first and Tor fixes the last two (SSL solves the middle one too, though Tor has a fairly secure SSL ciphersuite, who knows what that random SSL-enabled chat server uses – “none?”)
There’s a fly in the ointment of all this crypto: we’ve recently learned a few entirely predictable (and predicted) things about how communications are monitored:
1) All communications are captured and stored indefinitely. Nothing is ephemeral; neither a phone conversation nor an email, nor the web sites you visit. It is all stored and indexed; should somebody sometime in the future decide that your actions are immoral or illegal or insidious or insufficiently respectful, this record may be used to prove your guilt or otherwise tag you for punishment; who knows what clever future algorithms will be used in concert with big data and cloud services to identify and segregate the optimal scapegoat population for whatever political crisis is thus most expediently deflected. Therefore, when you encrypt a conversation it has to be safe not just against current cryptanalytic attacks, but against those that might emerge before the sins of the present are sufficiently in the past to exceed the limitations of whatever entity is enforcing whatever rules. A lifetime is probably a safe bet. YMMV.
2) Those that specialize in snooping at the national scale have tools that aren’t available to the academic community and there are cryptanalytic attacks of unknown efficacy against some or all of the current cryptographic protocols. I heard someone who should know better poo poo the idea that the NSA might have better cryptographers than the commercial world because the commercial world pays better, as if the obsessive brilliance that defines a world-class cryptographer is motivated by remuneration. Not.
But you can still do better than nothing while understanding that a vulnerability to the NSA isn’t likely to be an issue for many, though if PRISM access is already being disseminated downstream to the DEA, it is only a matter of time before politically affiliated hate groups are trolling emails looking for evidence of moral turpitude with which to tar the unfaithful. Any complacency that might be engendered by not being a terrorist may be short lived. Enjoy it while it lasts.
And thus (assuming you have an Android device) you can download Xabber and Orbot. Xabber supports real OTR, not the fake-we-stole-your-acronym-for-our-marketing-good-luck-suing-us “OTR” that Google hugger-muggers and carom–shotts you into believing your chats are ephemeral with (of course they and all their intelligence and commercial data mining partners store your chats, they just make it harder for your SO to read your flirty transgressions). Real OTR is a fairly strong, cryptographically secured protocol that transparently and securely negotiates a cryptographic key to secure each chat, which you never know and which is lost forever when the chat is over. There’s no open community way to recover your chat (that is, the NSA might be able to but we can’t). Sure, your chat partner can screen shot or copy-pasta the chat, but if you trust the person you’re chatting with and you aren’t a target of the NSA or DEA, your chat is probably secure.
But there’s still a flaw. You’re probably using Google. So anyone can just go to Google and ask them who you were chatting with, for how long, and about how many words you exchanged. The content is lost, but there’s a lot of meta-data there to play with.
So don’t use gchat if you care about that. It isn’t that hard to set up a chat server.
But maybe you’re a little concerned that your ISP not know who you’re chatting with. Given that your ISP (at the local or national level) might have a bluecoat device and could easily be man-in-the-middling every user on their network simultaneously, you might have reason to doubt Google’s SSL connection. While OTR still protects the content of your chat, an inexpensive bluecoat device renders the meta information visible to whoever along your coms path has bought one. This is where Tor comes in. While Google will still know (you’re still using Google even after they lied to you about PRISM and said, in court, that nobody using Gmail has any reasonable expectation of privacy?) your ISP (commercial or national) is going to have a very hard time figuring out that you’re even talking to Google, let alone with whom. Even the fact that you’re using chat is obscured.
Off-Site scripts are annoying.
To explain – I use noscript (as everyone should) with Firefox (it doesn’t work with Chrome: I might consider trusting Google’s browser for some mainstream websites when it does, but I don’t really like that Chrome logs every keystroke back to Google and I’m not sure why anyone would tolerate that). NoScript enables me to give per-site permission to execute scripts.
The best sites don’t need any scripts to give me the information I need. It is OK if the whizzy experience is degraded somewhat for security’s sake, as long as that is my choice. Offsite scripting can add useful functionality, but the visitor should be able to opt out.
Most sites use offsite scripting for privacy invasion – generally they have made a deal with some heinous data aggregator whose business model is to compile dossiers of every petty interest and quirk you might personally have and sell them to whoever can make money off them: advertisers, insurance companies, potential employers, national governments, anyone who can pay. In return for letting them scrounge your data off the site, they give the site operator some slick graphs (and who doesn’t love slick graphs). But you lose. Or you block google analytics with noscript. This was easy – block offsite scripts if you’re not using private browsing or switch to private browsing (and Chrome’s private browsing mode is probably fine) and enjoy the fully scripted experience.
But I’ve noticed recently a lot of sites are borrowing basic functionality from Google APIs. Simple things, for which there are plenty of open source scripts to use like uploading images – this basic functionality is being sold to them in an easy to integrate form in exchange for your personal information: in effect, you’re paying for their code with your privacy. And you either have to temporarily allow Google APIs to execute scripts in your browser and suck up your personal information or you can’t use the site.
Never trust your business, applications, or critical data to a cloud service because you are at the mercy of the provider both for security and availability, neither of which is terribly likely. Cloud services are the .coms of the 2nd decade of the 21st century, they come and go and with them so go your data and possibly your entire enterprise. Typically the argument is that larger brands are safer, that a company like Google would not wipe out a service leaving their customers or partners high and dry, that they would be safe.
That would be a false assumption.
It is necessary to understand the mathematics of serial risk to evaluate the risk-weighted cost of integrating a cloud-provisioned service into a business. It is important to note that this is entirely different from integrating third party code, which just as frequently becomes abandonware; while abandonware can result in substantial enterprise costs in engineering an internally developed replacement, a cloud service simply vanishes when the provisioning company “pivots” or craters, instantly breaking all dependent applications and even entire dependent enterprises: it is a zero day catastrophe.
Serial risks create an exponential risk of failure. When a business depends on critical partners, the business risk of failure is mathematically similar to RAID 0. If each of the N businesses in the chain (your own included) has a probability of failure of X%, the chances of the business failing is 1-(1-X/100)^N. If X is 30% and your startup is dependent on another startup providing, say, a novel authentication mechanism to validate your cloud service, then N is 2 and the chances of failure for your startup rise from 30% to 51%. Two such dependencies (N = 3) and chances of failure rise to about 66% (survival is a dismal 34%).
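The arithmetic is easy to check for yourself. A minimal Python sketch of the formula, where the function name is hypothetical and `n` counts every business in the chain, your own included:

```python
def chance_of_failure(x_percent: float, n: int) -> float:
    """Percent chance that at least one of n serially dependent businesses fails."""
    return 100 * (1 - (1 - x_percent / 100) ** n)
```

With X at 30%, a chain of one (just you) fails 30% of the time, a chain of two fails 51% of the time, and a chain of three (you plus two dependencies) fails about 65.7% of the time, since 0.7 cubed is 0.343.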
I was searching for something random on Google (no, not that, regular expression examples) and noticed that funny little bar they put up there a while back when Google+ had the world all a-flutter. My little box had a  in it. Hmmm.. A few people I’d never heard of had “circled” me. Nobody I knew. I think I last checked G+ a few weeks ago, maybe it was a month. Oh well, so much for that one. Facebook will eventually do a MySpace, taking everyone’s cleverly crafted content out with it, but G+ won’t be the Facebook that does it. Or something like that.
Typing of Google, anyone else notice that Google has become much more aggressive about implicit substitution? I’m used to it autocorrecting typing, which actually led to ever more lazy typing, at least on my part. But I thought it always let me know when it was making a presumptive change. Search for [Congres] (using square brackets to denote the text box since “” has meaning in this context) and it used to note “Do you mean “Congress”” Yeah yeah, just fat-fingered the last letter, NP. Now it just silently corrects unless you use quotes. Maybe you actually wanted to find the “Hotel Du Congres.”
OK, annoying, but not fatal. But what is actually quite tedious is when you search for something slightly esoteric like [“white screen of death” client certificate]. 122,000 results. Whee. Oh, wait, most have nothing to do with client certificates – how can that be? [“white screen of death” “client” “certificate”] yields 367 results, almost all relevant. So for about 121,000 results Google assumes I just accidentally typed “client” and/or “certificate”? Those do not seem like common typos for [ ] (blank). If I went to all the trouble of typing out the words “client” and “certificate” does it not generally undermine the utility of a search function if it arbitrarily decides to ignore any inconvenient terms?
I find myself quote-forcing most of my searches ([“white screen of death” +client +certificate] yields the same 367 results). Since when did my search terms become optional? WTF Google? Search is the one thing you do well. Well, that and advertising. Please don’t break it. Trust me, if you blow search you are not going to make up the difference with Social Networking.
Update: I recently searched for a scholarly article to back an assumption that document collections stored in structured databases can be accessed faster than document collections stored in file systems. I used the word “median” rather than “average” in my search, but clever Google knows the two are often synonyms and rather than limit my search to documents that use the typically academic “median,” I got almost entirely useless results referencing various colloquial “average” constructions.
Over the decades, I’ve taken a lot of digital pictures. I was a bit haphazard in backing them up to CDs to random hard disks etc – meaning several copies. Over the years, bit rot has corrupted some copies, CDs from 20 years ago have started to go blank etc. Once I put together a ZFS 6 FreeNAS box, I thought it would be a good place to organize them, especially once I started playing with Picasa’s face recognition tool, which is awesome for reminding me who some of those people are in those old .jpgs staring back through the bit flip block defects of the ages.
I’ve tried a couple of face recognition tools – Microsoft’s, some other thing that really sucked, and Picasa, and Picasa’s is by far the best. Unfortunately Picasa suffers horribly from Google Hubris, that infuriating disease that renders otherwise excellent technologies almost unusable. An example many people have run into is Google’s idiotic threading model in gmail. They’ve decided that all messages are non-hierarchical blobs, that the meta information means nothing, and that we should trust the lucky feeling. If the messages Google chooses to show us aren’t what we were actually looking for, then we are doing it wrong.
Picasa is infected with the same disease, but has it even worse. Picasa has one uniquely good trick, it tags faces fairly well. It is not a particularly good tool, certainly not the best, for many other tasks people do with images. But since failing to recognize that the only right way to do any of these tasks is with Picasa, and really failing to understand that anything anyone would legitimately ever actually want to do with a digital image falls into the set of features Picasa has (or it is not legitimate), the fact that touching your images with any other program corrupts Picasa’s database and, entertainingly, wipes out any work that you’ve done with Picasa is, as reiterated over and over by Google’s reps in the Picasa forums, just proof that you’re doing it wrong.
And, of course, Google and Picasa will be with us forever, just like every image management and editing application that I was using back in 1990 when I started taking digital photos.
My little image collection, once fully deduplicated, is 52,000+ images and 122 GB of data, which I think crosses most predictable size fail thresholds, so if these tools work here, they should be pretty reliable for most people. If you don’t get it yet, and still fail to adhere to the Google Way, the following utilities aided my heresy.
Face Tagging (Fix Picasa with AvPicFaceXmpTagger)
If it wasn’t for the face tagging feature, I’d never use Picasa. I can’t wait until somebody competent writes a face tagging application that is as well written, straight forward, and standards compliant as Friedemann Schmidt’s GeoSetter – a gold standard in image utilities matched only by Irfan Skiljan’s IrfanView. Until then, there is, alas, only Picasa.
With a large collection of images, especially those with crowd shots, one quickly discovers that even Picasa’s devs haven’t thought through the UI very well yet: there’s no way to reject large groups of pictures. It is also very tedious to work in manual mode: you can’t add faces in the “identify unknown faces” mode where you’d want to, for example. Another odd artifact is that to move a misidentified collection of faces to the right name, you have to select from a text-only popup list that quickly spans several 1200 pixel screens as you add names. If you type the first letter of the name, it jumps to it, but the scroll wheel doesn’t scroll the list and if you start typing the second letter of the name thinking you’ll get to the one you want (a standard UI reflex) you instead jump around to names beginning with that letter – but bonus feature – if you have only one person in the list whose name begins with that letter, the reassignment executes automatically, which can make it hard to find where the pictures even went.
If it were me, I’d add an “indicate face” mode where I can indicate with just a click (not click, drag, name each time) where a face is and trigger a “look harder” iteration of the detection algo. It would also be useful to hint to the algo that a folder of images has more faces than already detected, try again. The algo should use meta information to aid in narrowing – for example certain faces tend to appear in different periods of one’s life. A good example might be taking a vacation with a friend: in that folder, everyone who kind of looks like the friend is more likely to be so. That is, look at frequency of appearance by metadata cluster and weight accordingly where metadata might be folder, file naming structure, GeoIP, date, time, etc.
But the huge problem with Picasa is that for reasons that could only make sense to a company that is absolutely, religiously certain they know the one and only true way to do anything correctly, Picasa writes the face ID information to a contacts.xml file, not using standards-compliant XMP face tagging. This means that when your picasa database gets corrupted (and it will, regularly) most of your face tagging efforts are lost if you don’t use a utility to write the face tag data to the EXIF meta information so it stays with the picture.
Fortunately, there is a tool to do just that: Andreas Vogel’s AvPicFaceXmpTagger. This utility will read the contacts.xml file and write the data into the image files as XMP compliant tags so the work will stay with your images. I ran it on my entire pre-deduplicated collection before deduplicating, and while it took about 20 hours, it did not barf.
What is particularly annoying is that the face detection algorithm is actually quite good, it is the database management that is beyond useless. Google has no excuse for being bad at information management. The meta information being attached to a picture couldn’t be easier – a name and coordinates. The contacts.xml file is intolerably fragile and completely tied to Picasa.
GeoTagging (Use GeoSetter)
Picasa used to be my geotag program, but then I found GeoSetter, and I completely abandoned Picasa’s inferior geotagging features and never looked back. It is now just a face recognition tool. It pretty much sucks at managing the data, and while AvPicFaceXmpTagger fixes the inexcusable shortcoming of not writing XMP tags with the face data, as soon as there’s a GeoSetter-quality, XMP-compliant face tagging solution, Picasa is so voted off the hard disk.
GeoSetter uses map integration to make tagging pictures easy, but it does The Right Thing, that is it writes hierarchical place and altitude information as tags. Oddly, Picasa reps argue that geotags don’t do that any more, that is they only put the lat/lon into the picture assuming that the user will always be connected to Google’s servers and look up additional metadata from the lat/lon as needed, arrogant, self-centered morons that they are. Real world users that don’t live on the Google campus still interact with their image data when they’re not connected to the interwebs, as difficult as this would be for Google to understand and as contrary to their plans for world domination as it is.
But GeoSetter does it right, so don’t bother geotagging with Picasa. GeoSetter will also look up the additional place name metadata based on lat/lon data in the picture and write that to the appropriate EXIF fields. It is powerful, easy to use, and very reliable.
Folder Organization (Organize folders by date with AmoK Exif Sorter)
Organizing pictures is highly subjective and there’s no right way – well except Picasa’s One True Way, but if you’ve read this far, you’re probably not drinking that Kool-Aid. I, personally, like YYYY/YYYY-MO/YYYY-MO-DY/Image name folder structures. I don’t end up with more than 300-400 images in any single folder that way (and that very rarely) so OSes don’t ever barf on a 20,000 image folder and it is fairly easy to find pictures. The tool I use to organize into year/month/day folders is AmoK Exif Sorter, which can read the EXIF create date and move images into my favorite folder structure automatically. It is a little slow on large folders of more than 2,000-3,000 images, but it didn’t fail on 20,000 images and sorted them all perfectly.
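For anyone who’d rather script it, here is a rough sketch of the folder naming above. It is not how AmoK Exif Sorter works internally; it only assumes you already have the EXIF date as a string in the usual “YYYY:MM:DD HH:MM:SS” form (reading it out of the file is left to whatever EXIF library you prefer), and the function name and sample file name are made up for illustration.

```python
from datetime import datetime
from pathlib import PurePosixPath

# EXIF stores capture time as "YYYY:MM:DD HH:MM:SS".
EXIF_DATE_FORMAT = "%Y:%m:%d %H:%M:%S"


def dated_folder(exif_datetime: str, file_name: str) -> PurePosixPath:
    """Build a YYYY/YYYY-MO/YYYY-MO-DY/<file name> relative path."""
    taken = datetime.strptime(exif_datetime, EXIF_DATE_FORMAT)
    return PurePosixPath(
        f"{taken:%Y}",
        f"{taken:%Y-%m}",
        f"{taken:%Y-%m-%d}",
        file_name,
    )


if __name__ == "__main__":
    # Hypothetical image taken on 4 July 2011.
    print(dated_folder("2011:07:04 18:30:05", "IMG_0042.jpg"))
    # prints 2011/2011-07/2011-07-04/IMG_0042.jpg
```

The sketch only computes the destination; actually moving files (and deciding what to do on name collisions) is a separate step worth testing on a copy first.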
This works well because I use the same image organization with my EyeFi card, which transmits images directly from my camera to my laptop via wifi and sorts them as it goes. Everything prior to getting the card was randomly sorted until Exif Sorter fixed it, but now it should stay in sync. I really like my EyeFi card, but if upload is enabled when I am not in range of a discoverable network, the card sometimes crashes and I lose the last couple of pictures taken. I’m not happy about that, but I usually remember to turn upload off from the camera interface, and it has only made me really sad a few times so far.
If you’re as disorganized as I am then you’ll ultimately end up with quite a few extra copies of your images as the years go by. Some of my collections had more than 10 copies in the nearly two decades since I first took them. I actually use two tools for deduplication: AntiTwin and DupDetector; I tried Picasa’s deduplication tool but it sucks and it isn’t clear that it is actually removing duplicates, rather just faking you into doing work with it that will later be lost when you have to reinstall Picasa in a few days after the database gets corrupted again (see rotation, below).
I do first pass deduplication with AntiTwin and use the byte by byte comparison at 100% match to find bit-for-bit copies. This does not detect copies with different EXIF tags (which happens) or images that are scaled for email and cluttering up your disk along with their original resolution master images, but you can be confident you’re not going to lose anything. I directly delete the copies AntiTwin finds. AntiTwin also has an image compare function, but it is useless on a large image collection.
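The same first pass can be approximated in a few lines of script. This sketch is not AntiTwin’s method: instead of comparing files byte by byte, it groups files by size and then by SHA-256 hash, which in practice finds the same bit-for-bit copies. The function names are mine, and it only reports duplicates; it never deletes anything.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large images don't fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root: Path) -> list[list[Path]]:
    """Return groups of files under root whose contents are identical."""
    by_size = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have an exact copy
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups


if __name__ == "__main__":
    for group in find_exact_duplicates(Path(".")):
        print("identical:", ", ".join(str(p) for p in group))
```

Like the 100% byte match, this misses copies with different EXIF tags and scaled copies, since any changed byte changes the hash.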
To find scaled copies, copies with exif info, copies with minor bit rot, etc that AntiTwin won’t find, I use Prismatic Software’s DupDetector. I’ve found an odd mix of versions on download sites, and the author site is very slow, but it isn’t too huge and it works very well and has been recently updated. I use it to move, not delete copies, into a dead storage folder. If I make a mistake, the copies are still there, but I don’t need to have them in my primary search path. I am fairly confident that everything detected as a duplicate with match at 99.9% was actually a duplicate, but at 99.7%, it turned up some icon sized scaled pictures along with a lot of false matches in very dark pictures. I suggest first running at 100% in fully automatic mode, then cautiously at 99.9% in fully automatic mode; I only had 420 detected duplicates at 99.7%, and about half of those were true duplicates, so I ran at 99.7% in semi-auto mode.
One of the last steps for me is orienting all of my pictures upright using a JPEG Lossless rotation. In yet another facepalm move, Picasa fakes you out with rotations – it does not actually rotate the image, it just stores your rotation specification in picasa.ini file in the folder, which only Picasa uses, and that’s only until that file gets hosed for some reason. So if you spent a couple of days scrolling through the giant list of all your images rotating them one by one in Picasa, you wasted your time. Sorry. Thank Google.
Fire up IrfanView, load a directory of images, or even all subdirectories, and you can autorotate a giant library according to EXIF information. If your pictures go back more than about 5 years, your camera probably didn’t have an orientation sensor, so auto-rotate won’t work. But Irfan’s thumbnail mode lets you select a few thousand images that need to be rotated the same way one by one (but quickly) and batch rotate them all losslessly.
If you do this, Picasa will still apply the picasa.ini rotation you created and it will be wrong, which is a good reminder not to use Picasa for anything any other program does better.
Weird: I have yet to find a way to import an RSS feed into G+. This is one of those things that significantly undermines Google’s “your data” cred. Anyone know of a way to do it? I haven’t found an “import RSS feed into your feed” the way facebook kinda does and the wordpress/facebook plugin does.
I’m a very strong believer in “he who owns the hardware, owns the data,” so, for example, posting this on G+ means that this text is Google’s (note, this was originally published on G+, then I stole it back!). And since it didn’t originate on my personal wordpress installation (free as in speech, free as in beer) running on my server at home (free as in speech, not absurdly expensive as in cheap beer), it isn’t mine.
My server also runs my mail server, my file server, my web server etc. all from my garage meaning that’s my data and my hardware and fully protected by law, while any data on Google’s server is effectively shared with every good and bad government in the world and my only legal recourse if it gets hacked or stolen or sold or given away or simply deleted is to… write an angry post on my blog and swear never to trust a cloud service again.
This is, obviously, exactly the same at FaceBook and every other cloud service. I use Facebook as a syndication service: I post on my own servers and syndicate via RSS to FaceBook, which becomes, in effect, the most frequently used RSS reader, where people who haven’t gotten around to blocking me in their streams might find my posts and perhaps occasionally be amused. This means I still own my data and my data has no particular dependence on FaceBook’s survival.
This post is visible only as long as Google wants it to be. If Google changes the rules, I lose the data. OK, I can download it – as long as they choose to let me, but it isn’t my data. When I post on my server then give FaceBook permission to republish the data, I control my data and they get only what I decide to give them. When I post this on Google and then ask “please, sir, may I recover my post for another use?” the power relationship is reversed: Google owns and controls everything and my rights and usage are only what they deign to offer me.
That almost everyone trusts the billionaire playboys who put king sized beds in their 767 party plane as “do no evil” paragons of virtue is odd to me, but nothing better validates Erich Fromm’s thesis than the pseudo-religious idolatry of Google and Apple. Still, even the True Believers should realize that the founders of these Great Empires are not truly immortal and that even if Google is doing no evil now, it will change hands and those that inherit every search you’ve ever done, every web page you’ve ever visited, every email you’ve ever sent, every phone call you’ve ever made or received, the audio of every message ever left for you, the GPS traces of every step you’ve ever taken, every text and chat and tweet might think, say, that Doing Good means something different than you think it does. One should also remember the Socratic Paradox that renders tautological Google’s vaunted motto.
Unfortunately, at least so far, Google won’t let me use G+ to syndicate my data – they insist on owning it and dictating the terms by which I can access it. If I want to syndicate content through my G+ network, it seems I have to fully gift Google that content. I’m hoping there’s a tool to populate my “posts” from RSS so the canonical will remain on my server. Because it is the Right Thing To Do.
(Shhhh.. I’m going to copy and paste this into my own wordpress installation, even though I wrote it here on the G+ interface. They probably won’t send me a DMCA takedown, but I do run the risk that they’ll hit me with a “duplicate content penalty” and set my page rank to 0 thus ensuring nobody ever finds my site again. Ah, absolute power, so reassuring to remember that it is absolutely incorruptible.)
An interesting artifact of the FB vs. G+ debate is the justification by a lot of tech-savvy people in moving to G+ from FB because they believe Google to be less evil. It is an odd comparison to make, both companies are in essentially the same business: putting out honey pots of desirable web properties, attracting users, harvesting them, and selling their data.
Distinguishing between grades of evil in companies that harvest and sell user data seems a little arbitrary. I’d think it would make more sense to use each resource for what it does well rather than arbitrarily announce that you’re one or the other.
However, if one is making the choice as to what service to call home on the basis of least “evil” and assuming that metric is derived in some way from the degree to which the company in question harvests your data and sells it, then it is somewhat illuminating to look at real numbers. One can assume that the more deeply one probes each user captured by the honey pot, the more data extracted, the more aggressively sold, the more money one makes. The company that makes the most money per user is probing the deepest and selling the hardest.
From Technology Review May/June 2011, annual revenue per monthly unique US visitor:
Facebook: $ 12.10
Google squeezes out and sells more than 13.5x the data per user. Google wins. But Facebook is gathering $12.10 worth of user data, why should Google allow Facebook to have it? If Google wins that last morsel of data to take to market and takes out Facebook, Google can increase their gross revenue by 7%.
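The 7% figure follows from the ratio alone. A back-of-the-envelope sketch, assuming the multiple is exactly 13.5 (the text only says “more than”) and that both services reach the same users; the implied Google figure is derived from those assumptions, not quoted from Technology Review.

```python
FACEBOOK_REVENUE_PER_USER = 12.10  # dollars per monthly unique US visitor
GOOGLE_MULTIPLE = 13.5             # assumption: treated as exact, text says "more than"


def implied_google_revenue_per_user(facebook_per_user: float, multiple: float) -> float:
    """Per-user revenue implied for Google by the quoted multiple."""
    return facebook_per_user * multiple


def relative_gain(facebook_per_user: float, multiple: float) -> float:
    """Fractional revenue gain if Google captured Facebook's per-user revenue too."""
    google_per_user = implied_google_revenue_per_user(facebook_per_user, multiple)
    return facebook_per_user / google_per_user


if __name__ == "__main__":
    google = implied_google_revenue_per_user(FACEBOOK_REVENUE_PER_USER, GOOGLE_MULTIPLE)
    gain = relative_gain(FACEBOOK_REVENUE_PER_USER, GOOGLE_MULTIPLE)
    print(f"implied Google revenue per user: ${google:.2f}")
    print(f"gain from absorbing Facebook's share: {gain:.1%}")
```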
I’ve also heard people argue that Zuckerberg seems more personally avaricious, mean, or evil than Google’s founders, comparing Google’s marketing spin to “The Social Network”
Zuckerberg’s only newsworthy purchase was a $7m house in Palo Alto. Google co-founders were in the news over a lawsuit between them over whether their 767 “party plane” (Eric Schmidt) could house Brin’s California king bed. This is in addition to their 757 and two Gulfstream Vs they talked NASA into letting them park at Moffett under the pretense that the planes would be retrofit with instruments for NASA. When they couldn’t do that (FAA regs, who knew?), they bought a Dornier Alpha, but still get to park their jumbo jets and Gulfstreams inside NASA hangars for some reason. Suck on that, Ellison!
I was curious after posting some hints about how to protect your privacy to see how they worked.
I used EFF’s convenient Panopticlick browser fingerprinting site. Panopticlick doesn’t use all the tricks available, such as measuring the time delta between your machine and a reference time, but it does a pretty good job. Most of my machines test as “completely unique,” which I find complimentary but isn’t really all that good for not being tracked.
Personally I’m not too wound up about targeted marketing style uses of information. If I’m going to see ads I’d rather they be closer to my interests than not. But there are bad actors using the same information for more nefarious purposes and I’d rather see mistargeted ads than give the wrong person useful information.
Testing Panopticlick with scripts blocked (note TACO doesn’t help with browser fingerprinting, just cookie control), I cut my fingerprint to 12.32 bits from 20.29 bits; the additional data comes from fonts and plugins.
It is also interesting to note that fingerprint scanners (fingerprints as on the ends of fingers) have false reject rates of about 0.5% and false acceptance rates of about 0.001%. Obviously they’re tuned that way to be 500x more likely to reject a legitimate user than to accept the wrong person and the algorithms are intrinsically fallible in both directions, so this is a necessary trade-off. Actual entropy measures in fingerprints are the subject of much debate. An estimate based on Pankanti’s analysis computes a 1 in 5.5×10^59 chance of a collision, or about 198 bits of entropy, but manufacturer published false acceptance rates of 0.001% are equivalent to 16.6 bits, less accurate than browser fingerprinting.
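The conversions between “bits” and odds in this post are just base-2 logarithms. A small sketch for checking them, with helper names of my own choosing; it treats a false acceptance rate as a plain collision probability, which is a simplification.

```python
import math


def bits_from_probability(probability: float) -> float:
    """Bits of identifying information implied by a collision probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


def one_in_n_from_bits(bits: float) -> float:
    """How many equally likely alternatives a given number of bits distinguishes."""
    return 2.0 ** bits


if __name__ == "__main__":
    false_accept_rate = 0.001 / 100          # 0.001% expressed as a fraction
    pankanti_collision = 1.0 / 5.5e59        # 1 in 5.5x10^59
    print(f"scanner FAR: {bits_from_probability(false_accept_rate):.1f} bits")
    print(f"Pankanti estimate: {bits_from_probability(pankanti_collision):.1f} bits")
    for browser_bits in (20.29, 12.32):
        print(f"{browser_bits} bits is one browser in {one_in_n_from_bits(browser_bits):,.0f}")
```

The scanner line comes out at about 16.6 bits, matching the figure above, which is why a 0.001% false acceptance rate looks weak next to a 20-bit browser fingerprint.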