Privacy

Autodictating to self using Whisper to preserve privacy

Thursday, August 17, 2023 

Whisper is a very nice bit of code released by OpenAI, the kind people who brought us ChatGPT.  It’s a speech to text tool that can handle a huge array of languages and runs locally, as in on your hardware with your data.  There’s an API you can use on their servers, but only if you are sure the audio files and text can be released to the public.  Never put any data on anyone else’s hardware that you wouldn’t want to have leaked on pastebin or published in the New York Times; that goes for all services including gmail, Outlook, Office 365, etc.  Never, ever use someone else’s hardware to store proprietary or sensitive data.  It’s just mind-bogglingly stupid, and yet so many people fail to comprehend that “in the cloud” just means “on someone else’s computer.”

This is also true for most speech-to-text tools that (seemingly) kindly offer to translate your ramblings to text.  Lots of people use this feature on their phones without realizing that, like Alexa, any voice command tool is an audio monitoring device you stupidly paid for and installed yourself on behalf of corporate spies who are all too happy to listen to whatever you have to say.  If you have an Alexa, get a hammer right now and smash it.  Go on, I’ll wait.  Good job.  Privacy restored.  Oh, smart TV too? Unplug that stupid thing from the internet.  Same for all your “smart” devices.  You thought “smart” meant you were smart for buying it?  Noooo… you’re a moron for buying it, the company was smart for convincing you to install monitoring devices in your house at your own expense.  Congrats. Own goal.  When you’re finished destroying all your corporate spyware here’s a way to get speech to text capability on your own hardware without the spying thanks to a very nice bit of FOSS code from OpenAI.

The workflow is to record some audio (speech probably) on your phone, store & forward that to your server (no synchronous connection required, unlike most spyware), (optionally) store and forward that to your desktop computer with a GPU to run AI speech to text, pop the results into an email queue to store & forward it back to you and all your searchable text archives. Speech is converted to accessible, indexed text easily and robustly and fairly legibly.

For the recording step, I use an Open Source app called Audio Recorder (available on F-Droid and other reliable repositories; if you need an app, try F-Droid first and only use Play Store after deciding it is worth being spied on and having ads pushed to you).  Audio can be any length, seconds or hours.  I configured the settings to record to /storage/emulated/0/recordings and use 48 kHz, 16 bit, opus for speech; on my device the app supports up to 24 bit/192 kHz, which vastly exceeds the S:N ratio and bandwidth of any microphone I’ll connect to a phone, but nice to know for audiophiles.

I also run NextCloud on my phone which connects to a NextCloud instance on my own server.  NextCloud is like a free, open source version of dropbox and provides directory sharing, calendar, password, etc – almost all services you want a server for on your own hardware so you actually retain possession and ownership of your data – amazing!  You do not have to give away your data to people you don’t know to use the internet.

The NextCloud client on my phone tries to sync the recording folder to my server so after I make a recording and hit the ✅ button, when the aether makes it possible the audio is uploaded (and, optionally, deleted from the mobile device). Nextcloud then syncs down to other clients, specifically one of my Linux clients for processing. It is entirely possible to do everything server side and the same scripts will work, but I don’t have a GPU on my server and Whisper has some dependencies that are easier to meet on a more frequently updated client, at least for now.

I’ve installed whisper on a Linux box, along with a NextCloud client and there I have a fairly simple script running as a cron job. Every 10 minutes it scans all the files in the locally synced “Recordings” directory and if there’s an audio file without a matching text “TSV” file, it calls whisper to convert the audio to text and then emails me the converted text.  That text is also synced back up to the server and to any other synced device and indexed both on the server and locally to make it easily discoverable (on clients I use the very awesome Recoll for indexing).
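A crontab entry along these lines would run the scan every 10 minutes.  The script name and log file are placeholders (assumptions for illustration), so substitute wherever the script actually lives:

```
# m    h  dom mon dow  command        (edit with: crontab -e)
*/10   *  *   *   *    /home/{user}/bin/autodictate.sh >> /home/{user}/autodictate.log 2>&1
```

Redirecting output to a log file keeps cron from mailing every run and leaves something to read when a conversion fails.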

The whole process is very easy and any audio file like this:

is then automagically converted to text

test if we can record in Opus and then autoconvert the file back to text and
get that text as an email automatically this seems like quite a powerful tool
and should make it fairly easy to self take notes don’t we think yes

and then ends up in my inbox like this:

So what script does this good thing?  Just a few bash lines.  This version uses the time stamps in the TSV files to throw in fairly reasonable paragraph breaks. If the speaker pauses long enough that Whisper inserts a timing break, the script printfs in two newlines. There are a few other tricks below to try to infer or force reasonable paragraph breaks.
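The pause-to-paragraph idea can be tried on its own before wiring it into the full script.  This is a minimal sketch, not the exact logic of the script further down: it assumes the TSV has a start/end/text header with integer millisecond timings, and the function name and the minimum-gap argument are hypothetical additions.

```bash
#!/bin/bash
# Sketch: turn Whisper-style TSV (start<TAB>end<TAB>text) into flowed text,
# starting a new paragraph when a finished sentence is followed by a pause.
# min_gap_ms is a hypothetical tunable; 0 means "any gap at all".
tsv_to_paragraphs() {
    local min_gap_ms="${1:-0}" prev_stop="" prev_text="" start stop text
    while IFS=$'\t' read -r start stop text; do
        [ "$start" = "start" ] && continue      # skip the header row
        if [ -n "$prev_stop" ]; then
            # break only after a sentence end followed by a long enough pause
            if [[ "$prev_text" =~ [.?!]$ ]] && (( start - prev_stop > min_gap_ms )); then
                printf '\n\n'
            else
                printf ' '
            fi
        fi
        printf '%s' "$text"                      # text is data, never a format string
        prev_stop="$stop"
        prev_text="$text"
    done
    printf '\n'
}
```

Usage would be something like tsv_to_paragraphs 500 < recording.tsv > recording.txt, where 500 means a pause of more than half a second.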

The script also uses a slightly more robust construction to extract the subject of the email, which includes the first 80 characters of the text, minus any new lines (which make mailx barf).  The resulting text is flowed, pretty easy to copypasta into an email or document, and has moderately natural paragraph breaks.  It isn’t publication ready, but the accuracy seems quite good and it is hard to imagine an easier mechanism for making useful autodictations.  The process supports very long rambling diatribes; you should be able to talk for hours and get a book’s worth of text in your inbox. I mean, maybe you shouldn’t be able to do that, but you can.
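For anyone who wants to test subject extraction separately, here is a rough sketch of one way to do it.  It is a variant rather than the construction used in the script below, and make_subject is a hypothetical helper name.  Note that head -c counts bytes, so a multi-byte character at the cut point can be split.

```bash
#!/bin/bash
# Sketch: derive a one-line mail subject from the start of a text file.
# make_subject is a hypothetical helper; max_len defaults to 80 bytes.
make_subject() {
    local file="$1" max_len="${2:-80}"
    # take the first max_len bytes, flatten CR/LF to spaces, squeeze repeated
    # spaces, then trim leading and trailing blanks
    head -c "$max_len" "$file" | tr '\r\n' '  ' | tr -s ' ' | sed 's/^ *//; s/ *$//'
}
```

Called as make_subject "$txt_file", it prints a single line that is safe to hand to a mailer's subject option.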

I put in a feature request with the Audio Recorder devs to add some metainfo to the files; what I’d really like is location data.  I can script up extracting that and (optionally) converting it to a place name, but aside from Nominatim or Gisgraphy, there aren’t many options other than using big data APIs.  Anyway, seems like a reasonable bit of metadata to insert at the top or tail of the text: time+date+location the stream was recorded. If it is implemented, I’ll update the script to extract the metadata and create a dateline header.

Mailing flowed plain text

I found that mailx can’t handle long (flowed) text lines over ~1000 characters and inserts \n  at 998 or 997, which breaks up the pause to paragraphs code, so I switched the mailer to mpack (sudo apt install mpack) which simplifies the mail command and MIME encodes the text body and adds a checksum and a few other modern mail niceties and it now flows as desired without weird line breaks.

And then I found out that mpack thinks it is too good to send text files: it sets the MIME type to application/octet-stream, and using the -c text/plain option yields the somewhat prissy error “This program is not appropriate for encoding textual data.”  Oh my.  Thunderbird actually parses the attachment into a nicely flowed email, ignoring the quirks, but the best mobile client ever, FairEmail, does not and treats the attachment as something that it would prefer not to display inline (thanks for the details Marcel, you’re awesome!).  Given that mpack isn’t very active any more, changing that behavior is unlikely.  Next option: Mutt.  Mutt does something to a text attachment (using the -a option) that causes both TB and FairEmail to decline to display inline, but the body option -i yields a clean text-only email with the right flow, meaning no random line breaks inserted.  So don’t install mpack; instead sudo apt install mutt and create a /home/{user}/.muttrc file with at least the below (search engine around if you need to use a remote SMTP server to configure the server address, authentication, and encryption; mutt does the right things):

set realname = "{desired name}"
set from = "{your from email}"
set use_from = yes
set envelope_from = yes

And once that (and whisper) is working, the following script will convert your audio file to text and then mail it to you with paragraph breaks.

TextTiling

I didn’t plan to get into anything more complex, but long text conversions are kinda unreadable because Whisper doesn’t infer paragraphs.  There’s a whole science to inferring contextual shifts that should start new paragraphs using LSA/LDA/LSI that’s quite advanced mathematically and works sort of OK but is an awful lot of pipping modules and trying this or that.

I opted instead to go for a more brute force method, well three of them, really:

First: whisper has an experimental feature to compute word timings, which would normally be used to generate those unbelievably distracting and annoying and utterly horrible subtitles that are one word at a time or bouncing highlight word by word, but the feature can do more than create a miserable, distracting, utterly pretentious viewing experience: it seems to increase the frequency and possibly accuracy of gaps in the exported timing data. The first method of paragraph finding is detecting “long” gaps after a Whisper inferred sentence, effectively deriving speaker intent from cadence and AI content inference.  It works OK.

Second: I implemented a wake_word:command set that seds through the text and search-replaces the wake_word:command with the requested punctuation: .¶,:()…—?!“” There’s a whole theory behind wake words, but “insert” seems to be understood well and the command terms are ones that I tend to think of (e.g. “dots” not “ellipsis”), but that’s all obviously editable to preference.

Third: recommended paragraph length depends on the target and advice ranges from 3 sentences to 6.  I tend to be a bit long winded so I picked 5.  There’s an arbitrary script to look for any line that, after the timing inference and explicit breaks, still has more than 5 sentences and breaks it into multiple lines (meaning paragraph splits when the text is rendered). If that’s too long or too short, change the 5 in /usr/bin/sed -i "s/\([.?!]\) /\1\n\n/5;P;D" "$txt_file".
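The second and third methods are both plain sed passes, so they are easy to try on a string before running them on real transcripts.  A minimal sketch, assuming GNU sed (the I flag and \n in the replacement are GNU extensions); the two function names are hypothetical and only three of the commands are shown:

```bash
#!/bin/bash
# Sketch: the wake-word and sentence-count passes as stdin/stdout filters.
wake_word="insert"

# Replace spoken "<wake_word> <command>" phrases with the punctuation itself,
# swallowing any punctuation or spaces Whisper put around the phrase.
apply_wake_words() {
    sed -e "s/[?,. ]*$wake_word comma[?,. ]*/, /gI" \
        -e "s/[?,. ]*$wake_word period[?,. ]*/. /gI" \
        -e "s/[?,. ]*$wake_word question[?,. ]*/? /gI"
}

# Start a new paragraph after every Nth sentence end on a line (default 5).
# P prints the finished paragraph, D loops back over the remainder.
split_paragraphs() {
    local n="${1:-5}"
    sed "s/\([.?!]\) /\1\n\n/$n;P;D"
}
```

For example, echo "first insert comma then second insert period Done" | apply_wake_words gives "first, then second. Done", and piping a long line through split_paragraphs 3 breaks it after every third sentence.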

This all works fairly well, though there’s a known quirk with Whisper where it just randomly stops inserting punctuation after about 10 minutes and mechanisms 1 and 3 obviously also fail.  The way to deal with that is to break the audio into about 5 minute segments and then concatenate the results, but it’s a moderate chunk of code and debug and I’m assuming whisper will be updated.  If not and it gets annoying, I’ll work out that routine.
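A rough outline of that segment-and-concatenate routine might look like the following.  It is untested against real recordings and rests on assumptions: ffmpeg is installed (the script below does not use it), its segment muxer can stream-copy the format in question, and the chunk naming scheme and both function names are made up for illustration.

```bash
#!/bin/bash
# Sketch: split a long recording into ~5 minute chunks, then join per-chunk text.

# split_audio FILE OUTDIR [SECONDS]: stream-copy FILE into numbered chunks
# named BASE_000.ext, BASE_001.ext, ... without re-encoding.
split_audio() {
    local src="$1" outdir="$2" secs="${3:-300}"
    local ext="${src##*.}" base
    base="$(basename "${src%.*}")"
    mkdir -p "$outdir"
    ffmpeg -loglevel error -i "$src" -f segment -segment_time "$secs" \
        -c copy "$outdir/${base}_%03d.$ext"
}

# join_chunks OUTDIR BASE: concatenate BASE_000.txt, BASE_001.txt, ... in
# order, separated by single spaces so the result stays flowed.
join_chunks() {
    local outdir="$1" base="$2" f sep=""
    for f in "$outdir/${base}"_[0-9][0-9][0-9].txt; do
        [ -e "$f" ] || continue              # no chunks: the glob stayed literal
        printf '%s%s' "$sep" "$(cat "$f")"
        sep=" "
    done
    printf '\n'
}
```

Each chunk would go through whisper and the clean-up passes individually before join_chunks stitches the text back together.  Cutting on a fixed clock can split a word at a boundary, which is part of why this is only a sketch.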

The script

Replace {user} and {domain} as appropriate to your system.  You may also have a different layout for commands; which (for example, which sed) is your friend.  I find that full paths in cron execution provide more consistent reliability at the expense of portability.

#!/bin/bash 

watchdir="/home/username/Work/Recordings/"
to="email@domain.com"
stop_prev="0"
start=""
stop=""
text=""
wake_word="insert"

# Function to check if an audio file has a matching .tsv file; if not, convert to text and email it
convert_to_text() {
    audio_file="$1"
    txt_file="${audio_file%.*}.txt"
    tsv_file="${audio_file%.*}.tsv"
    dir="$(/usr/bin/dirname "${audio_file}")"
    base_ext="$(/usr/bin/basename "${audio_file}")"
    base="${base_ext%.*}"


    if [ ! -e "$tsv_file" ]; then
        # --prepend_punctuations/--append_punctuations take sets of punctuation characters, not booleans, so the defaults are used
        /home/gessel/.local/bin/whisper "$audio_file" -f tsv --model small.en -o "$dir" --word_timestamps True --initial_prompt "Hello."

        while IFS=$'\t' read -r start stop text; do
            # First line detection and skip checking it for gaps
            if [ "$start" == "start" ]; then
                /usr/bin/printf "" > "$txt_file"
                continue
            fi
            # Check if line ends in period or question mark for paragraph insertion
            if [[ $text =~ \.$|\?$ ]]; then
                # find natural pauses and insert paragraph breaks
                if [[ "$stop_prev" != "$start" ]]; then
                    /usr/bin/printf "\n\n" >> "$txt_file"
                fi
            fi
            # print the text as data so a stray % is not read as a format directive
            /usr/bin/printf '%s ' "$text" >> "$txt_file"
            stop_prev=$stop
        done  < "$tsv_file"

        stop_prev="0"
        # search for explicit formatting commands and in-line replace them.
        /usr/bin/sed -i "s/[?,. ]*$wake_word period[?,. ]*/. /gI" "$txt_file"
        /usr/bin/sed -i "s/[?,. ]*$wake_word paragraph[?,. ]*/.\n\n/gI" "$txt_file"
        /usr/bin/sed -i "s/[?,. ]*$wake_word comma[?,. ]*/, /gI" "$txt_file"
        /usr/bin/sed -i "s/[?,. ]*$wake_word colon[?,. ]*/: /gI" "$txt_file"
        /usr/bin/sed -i "s/[?,. ]*$wake_word open paren[?,. ]*/ (/gI" "$txt_file"
        /usr/bin/sed -i "s/[?,. ]*$wake_word close paren[?,. ]*/) /gI" "$txt_file"
        /usr/bin/sed -i "s/[?,. ]*$wake_word dots[?,. ]*/… /gI" "$txt_file"
        /usr/bin/sed -i "s/[?,. ]*$wake_word long dash[?,. ]*/—/gI" "$txt_file"
        /usr/bin/sed -i "s/[?,. ]*$wake_word question[?,. ]*/? /gI" "$txt_file"
        /usr/bin/sed -i "s/[?,. ]*$wake_word exclamation[?,. ]*/! /gI" "$txt_file"
        /usr/bin/sed -i "s/[?,. ]*$wake_word open quote[?,. ]*/ “/gI" "$txt_file"
        /usr/bin/sed -i "s/[?,. ]*$wake_word close quote[?,. ]*/” /gI" "$txt_file"
        # brute force paragraphing: 5 sentences is enough, adjust for audience
        /usr/bin/sed -i "s/\([.?!]\) /\1\n\n/5;P;D" "$txt_file"
        # fix any sentence start/finish errors induced by the above edits
        /usr/bin/sed -i "s/^[a-z]/\U&/g" "$txt_file" # start with uppercase
        /usr/bin/sed -i "s/: [A-Z]/\L&/g" "$txt_file" # no uppercase after colon
        /usr/bin/sed -i 's/\s\+$//g' "$txt_file" # don't end with whitespace
        /usr/bin/sed -i "s/[,]$/./g" "$txt_file" # don't end with a comma, use .
        /usr/bin/sed -i '/[.?!]$/! s/$/./' "$txt_file" # if not ending with punctuation at all, add .
        /usr/bin/sed -i 's/^\.$//'  "$txt_file" # oops, no lines with just periods 
        /usr/bin/sed -i "s/\([a-z]\) \./\1./g" "$txt_file" # remove any spaces before periods
        /usr/bin/sed -i "s/  / /g" "$txt_file" # no double spaces
        /usr/bin/sed -i 's/\([0-9]\+\) \([FC]\) /\1°\2 /g' "$txt_file" # write temp to AMA, Chicago, Nat Geo, NOT APA or NIST
        # generate subject line from first sentence no longer than 80 char and remove any newlines
        subject=$(/usr/bin/head -n 1 -c 80 "$txt_file" | /usr/bin/sed 's/\(.*\)\..*/\1/')
        subject=$(/usr/bin/echo $subject | /usr/bin/tr -d '\n')
        subject=$(/usr/bin/echo $subject | /usr/bin/tr -d '\r')
        # send the cleaned up file as email
        /usr/bin/echo "" | /usr/bin/mutt  -F /home/gessel/.muttrc -s "AudioText - $base - $subject" -i "$txt_file" "$to"
    fi
}

# Main script: scan the watch dir for unprocessed files (modified within the last 30 days)
/usr/bin/find "$watchdir" -mtime -30 -type f \( -iname \*.opus -o -iname \*.wav -o -iname \*.ogg -o -iname \*.mp3 \) | while IFS= read -r audio_file; do
    convert_to_text "$audio_file"
done

 

Note that Whisper has a lot of tricks not used here.  I’ve used it to add subtitles to lectures and it can do things like auto-translate one spoken language into another text language, and much more.

Posted at 10:53:58 GMT-0700

Category: Code, HowTo, Linux, Technology

Never put important data on anyone else’s hardware. Ever.

Friday, January 22, 2021 

In early January, 2021, two internet services provided unintentional and unequivocal demonstrations of the intrinsic trade-offs between running one’s own hardware and trusting “The Cloud.”  Parler and Gab, two “social network” services competing for the white supremacist demographic both came under fire in the wake of a violent insurrection against the US government when the plotters used their platforms (among other less explicitly extremist-friendly services) to organize the attack.

Parler had elected to take the expeditious route of deploying their service on AWS and discovered just how literally the cloud is metaphorically like atmospheric clouds—public and ephemeral—when first their entire data set was extracted and then their services were unilaterally terminated by AWS knocking them completely offline (except, of course, for the exfiltrated data, which is still online and being combed by law enforcement for evidence of sedition.)

Gab owns their own servers and while they had trouble with their domain registrar, such problems are relatively easy to resolve: Gab remains online.  Gab did face the challenge of rapid scaling as the entire right-wing extremist market searched for a safe haven away from the fragile Parler and from the timid and begrudging regulation of hate speech and calls for immediate violence by mainstream social networks in the fallout over their contributions to the insurrection and other acts of right-wing terrorism.

In general customers who engage cloud service providers rather than self-hosting do so to speed deployment, take advantage of easy scalability (up or down), and offload management of common denominator infrastructure to a large-scale provider, all superficially compelling arguments.  However convenient this may seem, it is rarely a good decision and fails to rationally consider some of the intrinsic shortcomings, as Parler discovered in rather dramatic fashion, including loss of legal ownership of the data on those services, complete abdication of control of that data and service, and an intrinsic and inescapable misalignment of business interests between supplier and customer.

Anyone considering engaging a cloud service provider for a service that results in proprietary data being stored on third party hardware, or that depends on the provision of a business critical service by a third party, should ensure that contractual obligations with well defined penalties explicitly match the implicit expectations of privacy, stewardship, suitability of service, and continuity, and that failures are actionable to a degree sufficient to make the client whole in the event of material breach.

Below is a list of questions I would have for any cloud provider of any critical service.  In general, if a provider is willing to even consider answering, the results will be shockingly unsatisfactory.  Every company that uses a cloud service, whether it is hosting on AWS or email provisioning by Google or Microsoft, is a Parler waiting to happen: all of your data exposed and then your business terminated.  Cloud services are acceptable only for insecure data and for services that are a convenience, not a core requirement.

Like clouds in the sky, The Cloud is public and ephemeral.


A: A first consideration is data protection and privacy:

What liability does The Company, and employees of The Company individually, have should they sell or lose control of The Customer’s data?   What compensation will The Customer receive if control of The Customer’s data is lost?  Please clarify The Company’s criminal and civil liabilities and contractual obligations under the following scenarios:

1) A third party exfiltrates The Customer’s data entrusted to The Company’s care in an unauthorized manner.

2) An employee of The Company willfully misuses The Customer’s data entrusted to The Company in any way.

3) The Company disposes of equipment in a manner which makes The Customer’s data entrusted to The Company accessible to third parties.

4) The Company receives a National Security Letter (NSL) requesting information pertaining to The Customer or to others who have data about The Customer on The Company’s service.

5) The Company receives a warrant requesting information pertaining to The Customer or to others who have data regarding The Customer on The Company’s service.

6) The Company receives a subpoena requesting information pertaining to The Customer or to others who have data regarding The Customer on The Company’s service that is opened or has been stored on their hardware for more than 180 days.

7) The Company receives a civil discovery request for information pertaining to The Customer or to others who have data regarding The Customer on The Company’s service.

8) The Company sells or provides access to The Customer’s data or meta information about The Customer or The Customer’s use of The Company’s system to a third party.

9) The Company changes their terms of service at some future date in a way that is inconsistent with the terms agreed to at the time of The Customer’s engagement of the services of The Company.

10) The Company fails to inform The Customer of a breach of control of The Customer’s data.

11) The Company fails to inform The Customer in a timely manner of a change in policy regarding third party access to The Customer’s data.

12) The Company erroneously exposes The Customer’s data to third party access due to negligence or incompetence.

B: A second consideration is a serial dependency on the reliability of The Company’s service to The Customer’s activity:

By relying on The Company’s service, The Customer typically will rely on the performance and availability of The Company’s products.  If The Company product fails or fails to provide service as expected, The Customer may incur losses, including direct financial losses, loss of reputation, loss of convenience, or other harms.  What warranty does The Company make in the performance of their services?  What recourse does The Customer have for recovery of losses should The Company fail to perform?

Please provide details on what compensation The Company will provide in the following scenarios:

1) The Company can no longer perform the agreed and expected services due to reasons beyond The Company’s control.

2) The Company’s service fails to meet expectations in a way that causes a material loss to The Customer.

3) The Company suffers an extended outage or compromise of service that exceeds a reasonable or agreed maximum accepted duration.

C: A third consideration is the alignment of interests between The Customer and The Company which may not be complete and may diverge in the future:

Engagement of the services of The Company requires an investment of time and resources on the part of The Customer in excess of any fees The Company may charge to adopt The Company’s products and services.  What compensation will be provided should The Company’s products fail to meet performance and utility expectations?  What compensation will be provided should expenditure of resources be required to compensate for The Company’s failure to meet service expectations?

Please provide details on what compensation The Company will provide in the following scenarios:

1) The Company elects to no longer perform the agreed and expected services due to business decisions made by The Company.

2) Ownership or control of The Company changes to an entity that is not aligned with the values of The Customer and which The Customer cannot support, directly or indirectly.

3) Control of The Company passes to a third party e.g. through an acquisition or change of control of the board and which results in use of The Customer’s data in a way that is unacceptable to The Customer.

4) The Company or employees of The Company are found to have engaged in behavior, speech, or conduct which is unacceptable to The Customer.

5) The Company’s products or services are found to be unacceptable to The Customer for any reason, not limited to security flaws, missing features, access failures, lack of performance, etc., and The Company is not able to or is unwilling to meet The Customer’s requirements in a timely manner.

If your company depends on third party provisioning of IT services, you’re just one viral tweet¹ away from being out of business.  Build an IT department that knows how to use a command line and run your critical services on your own hardware.

 

 

1) “Toot” now. Any company that relied on Twitter should review this post, but given the rumors around unpaid hosting bills, the chances of recovering any losses from Twitter are dim. At least those businesses that built models around Reddit APIs share your pain.

Posted at 16:01:48 GMT-0700

Category: FreeBSD, Linux, Security

Signal Desktop: Probably a good thing

Tuesday, December 8, 2015 

Signal is an easy to use chat tool that competes (effectively) with WhatsApp or Viber. They’ve just released a desktop version which is being “preview released/buzz generating released.”  It is developed by a guy with some cred in the open source and crypto movement, Moxie Marlinspike.  I use it, but do not entirely trust it.

I’m not completely on board with Signal.  It is open source, and so in theory we can verify the code.  But there’s some history I find disquieting.  So while I recommend it as the best, easiest to use, (probably) most secure messaging tool available, I do so with some reservations.

  • It originally handled encrypted SMS messages.  There is a long argument about why they broke SMS support on the mailing lists.  I find all of the arguments Whisper Systems made specious and unconvincing and cannot ignore the fact that the SMS tool sent messages through the local carrier (Asiacell, Korek, or Zain here).  Breaking that meant secure messages only go through Whisper Systems’ Google-managed servers where all metadata is captured and accessible to the USG. Since it was open source, that version has been forked and is still developed; I use the SMSSecure fork myself.
  • Signal has captured all the USG funding for messaging systems.  Alternatives are not getting funds.  This may make sense from a purely managerial point of view, but also creates a single point of infiltration.  It is far easier to compromise a single project if there aren’t competing projects.   Part of the strength of Open Source is only achieved when competing development teams are trying to one up each other and expose each other’s flaws (FreeBSD and OpenBSD for example).  In a monoculture, the checks and balances are weaker.
  • Signal has grown more intimate with Google over time.  The desktop version sign up uses your “google ID” to get you in the queue.  Google is the largest commercial spy agency in the world, collecting more data on more people than any other organization except probably the NSA.  They’re currently an advertising company and make their money selling your data to advertisers, something they’re quite disingenuous about, but the data trove they’ve built is regularly mined by organizations with more nefarious aims than merely fleecing you.

What to do?  Well, I use Signal.  I’m pretty confident the encryption is good, or at least as good as anything else available.  I know my metadata is being collected and shared, but until Jake convinces Moxie to use anonymous identifiers for accounts and message through Tor hidden nodes, you have to be very tech savvy to get around that and there are no Civil Society grants going to any other messaging services using, for example, an open standard like a Jabber server on a hidden node with OTR.

For now, take a half step up the security ladder and stop using commercial faux security (or unverifiable security, which is the same thing) and give Signal a try.

Maybe at some later date I’ll write up an easy to follow guide on setting up your own jabber server as a tor hidden service and federating it so you can message securely, anonymously, and keep your data (meta and otherwise) on your own hardware in your own house, where it still has at least a little legal protection.

 

Posted at 10:21:22 GMT-0700

Category: Positive, Privacy, Reviews, Security, Technology

Microsoft Spyware Now Being Installed On Win 7

Monday, August 24, 2015 

If you’re the sort of person who isn’t entirely happy about the idea of Microsoft claiming the right to copy your personal files, photos, emails, chat logs, diary entries, medical records, etc over to their own servers to sell to whoever they want for whatever they can get for your personal data – into markets that already exist for insurance companies to deny you insurance based on algorithmic analysis of your habits or your friends’ habits or for financial institutions to set your interest rates based on similar criteria, or perhaps even for law enforcement to investigate you without a warrant, then OBVIOUSLY you would never, ever install Windows 10 under any circumstances.

Well, Microsoft seems to have fully jumped on the Google/Facebook gravy train and is now completely invested in stealing your data and selling it to the highest bidder (Apple has been exfiltrating your data for a long time, but so far for internal use).  I’ve become more suspicious of Microsoft’s updates since they made the Windows 10 advertisement an important (not optional) update (important for what? their bottom line, obviously).  Turns out that the latest updates to Windows 7 are pushing Microsoft’s new business model of stealing your data for profit to Windows 7 and 8.

Staying safe is going to require ever more vigilance.  It may be possible to block windows components from reaching out to microsoft’s servers at the personal firewall level and certainly it can be done at the corporate firewall level (and should be), but blocking Microsoft is a somewhat complex issue.  You can’t run Windows safely without installing security patches because the underlying OS is so completely insecure that new, critical, exploitable flaws are discovered every single week.  If you don’t constantly patch these security failures, you will be hacked by people other than microsoft.  If you install the wrong microsoft patch, you will be hacked by microsoft.  Debian anyone? Also, software developers developing enterprise software, please, please, please stop developing for that horrible, insecure, performance hobbling abomination of a tarted-up single-user OS “Server” and focus on a secure, stable server OS like FreeBSD.  Please.  I hate, hate having to fork over $1k to microsoft for each box to run their horrible OS just so I can run your software.  Why do you support that extortion? Do you despise your customers that much? Stop.

If you care about corporate governance and data security or HIPAA compliance, you are probably violating some critical requirements by installing windows 10 or these new updates to your existing Win7/8 base if you do not block data exfiltration to Microsoft’s servers.  This is spyware.  These updates are stealing your data and sending it to Microsoft.  If your business is subject to data privacy laws, these updates put you in violation of those laws.  Microsoft is doing something that is extremely significant and extremely evil and completely wrong.  Take action or you may very well be facing personal or corporate consequences.  srsly.

I am a strong believer in data privacy and extremely suspicious of what I consider highly disingenuous business practices like Google’s but I recognize that there are reasonable people out there who think Google isn’t evil.  However, this windows 10 issue, now being pushed to windows 7, goes well beyond Google taking advantage of people’s historical assumptions about the security of email to offer them a free look-alike honey trap to gather their data.  Windows 10 and these Win 7 updates are intrusive, not merely misleading.  Do not update.  Srsly.  Do not update.  Block the spyware “hotfixes.”

Stop Gap Fixes

In researching these updates, I came across this article on techworm that has a nice summary of the Malware updates Microsoft is pushing out (with some additional amendments I found):

With a whiff of irony, this google search “telemetry site:https://support.microsoft.com/en-us/kb” shows these patches and many more…

Do not automatically install Microsoft updates.  You must turn that feature off or you will keep getting additional spyware installed.  Go to windows update and verify your settings.  I have mine set so windows downloads the updates (so the updates are waiting locally), but I don’t let windows install them automatically.  That gives me a chance to review the updates and look for spyware.

[Screenshot: Windows Update settings]

When you get updates, you now have to check each one of them to find out if it is spyware or not.  The list above is current as far as I know, but clicking on the “more information” link to the right of the updates list will get you microsoft’s marketing speak obfuscation of the true purpose.  Any update that “adds telemetry points” or something like that is spyware.  Uncheck the install and hide the update.  Note that some of these were moved from “optional” to “important.”  Microsoft is absolutely intent on stealing your data and is taking some pretty underhanded steps to make it difficult for you to avoid it.

[Screenshot: hiding a spyware update in the Windows Update list]

 

If updates get past you or it turns out later that a seemingly important or innocuous update was spyware (the fun part is that you now have to be vigilant and look all this stuff up), then you can uninstall them from the “installed updates” control panel.

[Screenshot: uninstalling an update from the “installed updates” control panel]

Work to be done

I’ll start looking into firewall settings to block communication to microsoft’s servers.  This is a standard anti-malware technique and should work here, except that microsoft has so many servers it is more challenging to block them than your typical malware botnet.

We need something like a variant of Peer Guardian to block microsoft’s servers using the standard P2P crowd-sourcing model to keep the list up to date. I’m not aware of anything like this yet, but I’m looking.  Microsoft has become more of an enemy to privacy than the RIAA ever was.

UPDATE:  this superuser answer includes a list of telemetry endpoints to block at your firewall or router.  Alternatively you can edit your hosts file and add these entries from DSL reports.
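A minimal sketch of the hosts-file approach, assuming you have already collected a list of hostnames you want to block. The hostnames below are placeholders (not real Microsoft endpoints) and `hosts_entries` is a hypothetical helper; it only formats lines, and you append the output to your hosts file yourself.

```python
def hosts_entries(hostnames, sink="0.0.0.0"):
    """Return hosts-file lines that point each hostname at a dead-end address."""
    seen = set()
    lines = []
    for raw in hostnames:
        name = raw.strip().lower()
        # Skip blanks, comment lines, and duplicates.
        if not name or name.startswith("#") or name in seen:
            continue
        seen.add(name)
        lines.append(f"{sink} {name}")
    return lines


if __name__ == "__main__":
    # Placeholder names only; substitute the list you actually trust.
    blocklist = [
        "telemetry.example.invalid",
        "Telemetry.Example.Invalid",  # duplicate after lowercasing
        "stats.example.invalid",
    ]
    print("\n".join(hosts_entries(blocklist)))
```

On Windows the hosts file lives at C:\Windows\System32\drivers\etc\hosts and needs administrator rights to edit; on Linux and FreeBSD it is /etc/hosts.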

Larger Significance

This shift in business focus by Microsoft from providing a product people are willing to pay for to stealing data from people to sell on the commercial market has some significant lessons for the entire software model.

It isn’t just that Microsoft is now adopting Google’s business model of giving away “free” goodies as traps to collect product (you) to sell to the highest bidder, but that the model of corporate trust that underpins most of the security assumptions the internet is built on is manifestly false and unsustainable.  If any hacker tried to create these spyware updates, locked-down computers that only install signed code would refuse to install them.  Ignoring for the moment that the signed code model is idiotically flawed as signing keys are stolen all the time, this microsoft spyware is properly signed with legitimate keys.  It will be installed on locked down computers without complaint and will not show up in commercial anti-virus software.  But it is spyware.  It contains keyloggers and extremely productive data exfiltration code that is currently copying wholesale data dumps from unfortunate victims to Microsoft’s servers in such volume that their data caps are being hit.

If a non-commercial third party (e.g. “hacker”) did this, they’d be prosecuted.  It makes no difference to you that your data is being stolen by Microsoft rather than by some clever teenager in a former Eastern Bloc country: your data is being stolen.  But the model that has been promoted, a model of centralized corporate trust to validate the “security” of your system, has been utterly and irrevocably shattered.  This isn’t an accident, isn’t something that better data management might have prevented, this is an intentional ex post facto rewrite of the usual, customary, and regular assumptions we have about the privacy of our computer systems and one that significantly impacts the security of almost everyone in the world: military, medical, legal, fiduciary, as well as personal.

And even if you trust Microsoft (for whatever bizarre, irrational reason), Microsoft is creating a whole series of security holes in their already crappy and insecure operating system that will be exploited by third parties.  By adding keyloggers and data exfiltration tools to the core OS, they’re making it even easier for non-corporate hackers to jump on the data theft gravy train. Everyone profits but you. You lose.

Posted at 04:19:18 GMT-0700

Category: Privacy, Technology

Windows 10 Privacy Annihilator

Tuesday, August 4, 2015 

Why would Microsoft, a company whose revenue comes entirely from sales of Windows and Office, start giving Windows 10 away – not just giving it away, but foisting it on users with unbelievably annoying integrated advertisements in the menu of Win 7/8 that pop up endlessly and are tedious to remove and reinstall themselves constantly?

Have they just gone altruistic?  Decided that while they won’t make software free like speech, they’ll make it free like beer? Or is there something more nefarious going on? Something truly horrible, something that will basically screw over the entire windows-using population and sell them off like chattel to any bidder without consent or knowledge?

Of course, it is the latter.

Microsoft is a for-profit company and while their star has been waning lately and they’ve basically ceded the evil empire mantle to Apple, they desperately want to get into the game of stealing your private information and selling it to whoever is willing to pay.

So that’s what Windows 10 does.  It enables Microsoft to steal all of your information, every email, photo, or document you have on your computer and exfiltrate it silently to Microsoft’s servers, and to make it legal they have reserved the right to give it to whoever they want.  This isn’t just the information you stupidly gifted to Google by being dumb enough to use Gmail or ignorantly gifted to Apple by being idiotic enough to load into the iButt, but the files you think are private, on your computer, the ones you don’t upload.  Microsoft gets those.

Finally, we will access, disclose and preserve personal data, including your content (such as the content of your emails, other private communications or files in private folders), when we have a good faith belief that doing so is necessary.

They’ll “access” your data and “disclose” it (meaning to a third party) whenever they have a good faith belief that doing so is necessary.  No warrant needed.  It is necessary for Microsoft to make a buck, so if a  buck is offered for your data, they’re gonna sell it.

If you install Windows 10, you lose. So don’t. If you need to upgrade your operating system, it is time to switch to something that preserves Free like speech: Linux Mint is probably the best choice.

If you’re forced to run Windows 10 for some reason and can’t upgrade to windows 7, then follow these instructions (and these) and remain vigilant, Microsoft’s new strategy is to steal your data and sell it via any backdoor they can sneak past you. Locking them down is going to be a lot of work and might not be possible so keep an eye out for your selfies showing up on pr0n sites: they pay for pix and once you install Windows 10, Microsoft has every right to sell yours.


 

Update: you can’t stop windows 10 from stealing your private data

That’s not quite true – if you never connect your computer to a network, it is very unlikely that Microsoft will be able to secretly exfiltrate your private data through the Windows 10 trojan.  However, it turns out that while the privacy settings do reduce the amount of data that gets sent back to Microsoft, they continue to steal your data even though you’ve told them not to.

Windows 10 is spyware.  It is not an operating system, it is Trojan malware masquerading as an operating system whose true purpose is to steal your data so Microsoft can sell it without your consent.  If you install Windows 10, you are installing spyware.

Win 10 has apparently been installed 65 million times.  That’s more than 3x as many users’ most intimate, most private data stolen as by the Ashley Madison attack.  If you value privacy, if the idea that you might be denied a loan or insurance because of secret data stolen from your computer without your consent bothers you, if the idea of having evidence of your potential crimes shared with law enforcement without your knowledge and without a warrant worries you then do not install windows 10.  Ever.

Posted at 11:00:30 GMT-0700

Category: Privacy, Technology

Sony-style Attacks and eMail Encryption

Friday, December 19, 2014 

Some of the summaries of the Sony attacks are a little despairing of the viability of internet security, for example Schneier:

This could be any of us. We have no choice but to entrust companies with our intimate conversations: on email, on Facebook, by text and so on. We have no choice but to entrust the retailers that we use with our financial details. And we have little choice but to use butt services such as iButt and Google Docs.

I respectfully disagree with some of the nihilism here: you do not need to put your data in the butt. Butt services are “free,” but only because you’re the product.  If you think you have nothing to hide and privacy is dead and irrelevant, you are both failing to keep up with the news and extremely unimaginative. You think you have no enemies?  Nobody would do you wrong for the lulz?  Nobody who would exploit information leaks for social engineering to rip you off?

Use butt services only when the function the service provides is predicated on a network effect (like Facebook) or simply can’t be replicated with individual scale resources (Google Search).  Individuals can reduce the risk of being a collateral target by setting up their own services like an email server, web server, chat server, file server, drop-box style server, etc. on their own hardware with minimal expertise (and the internet is actually full of really good and expert help if you make an honest attempt to try), or use a local ISP instead of relying on a global giant that is a global target.

Email Can be Both Secure AND Convenient:

But there’s something this Sony attack has made even more plain: eMail security is bad.  Not every company uses the least secure email system possible and basically invites hackers to a data smorgasbord like Sony did by using Outlook (I mean seriously, they can’t afford an IT guy whose expertise extends beyond point-n-click?  Though frankly the most disappointing deployment of Outlook is by MIT’s IT staff.  WTF?).

As lame as that is, email systems in general suffer from an easily remediated flaw: email is stored on the server in plain text which means that as soon as someone gets access to the email server, which is by necessity of function always globally network accessible, all historical mail is there for the taking.

Companies institute deletion policies where exposed correspondence is minimized by auto-deleting mail after a relatively short period, typically about as short as possible while still, more or less, enabling people to do their jobs.  This forced amnesia is a somewhat pathetic and destructive solution to what is otherwise an excellent historical resource: it is as useful to the employees as to hackers to have access to historical records and forced deletion is no more than self-mutilation to become a less attractive target.

It is trivial to create a much more secure environment with no meaningful loss of utility with just a few simple steps.

Proposal to Encrypt eMail at Rest:

I wrote in detail about this recently.  I realize it is a TLDR article, but as everyone’s wound up about Sony, a summary might serve as a lead-in for the more actively procrastinating. With a few very simple fixes to email clients (which could be implemented with a plug-in) and to email servers (which can be implemented via mail scripting like procmail or amavis), email servers can be genuinely secure against data theft.  These fixes don’t exist yet, but the two critical but trivial changes are:

Step One: Server Fix

  • Your mail server will have your public key on it (which is not a security risk) and use it to encrypt every message before delivering it to your mailbox if it didn’t come in already encrypted.

This means all the mail on the server is encrypted as soon as it arrives and if someone hacks in, the store of messages is unreadable.  Maybe a clever hacker can install a program to exfiltrate incoming messages before they get encrypted, but doing this without being detected is very difficult and time consuming.  Grabbing an .ost file off some lame Windows server is trivial. I don’t mean to engage in victim blaming, but seriously, if you don’t want to get hacked, don’t go out wearing Microsoft.
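Here is a rough sketch of what the server fix could look like as a delivery filter. It is an illustration, not a tested procmail or amavis recipe. The function names are hypothetical, and it assumes the `gpg` binary is installed with the recipient’s public key already imported. It also makes a design choice the proposal doesn’t spell out: the whole original message is encrypted and wrapped in a new message that keeps only the From, To, and Date headers.

```python
import subprocess
from email import message_from_bytes
from email.message import Message

PGP_MARKER = b"-----BEGIN PGP MESSAGE-----"


def is_pgp_encrypted(raw: bytes) -> bool:
    """Heuristic: PGP/MIME content type or an inline ASCII-armored block."""
    if message_from_bytes(raw).get_content_type() == "multipart/encrypted":
        return True
    return PGP_MARKER in raw


def gpg_encrypt(plaintext: bytes, recipient: str) -> bytes:
    """Encrypt to the recipient's public key with the local gpg binary."""
    result = subprocess.run(
        ["gpg", "--batch", "--armor", "--trust-model", "always",
         "--encrypt", "--recipient", recipient],
        input=plaintext, stdout=subprocess.PIPE, check=True,
    )
    return result.stdout


def encrypt_on_delivery(raw: bytes, recipient: str, encrypt=gpg_encrypt) -> bytes:
    """Pass already-encrypted mail through; encrypt everything else."""
    if is_pgp_encrypted(raw):
        return raw
    original = message_from_bytes(raw)
    wrapper = Message()
    # Assumption: keep only enough headers to list the message in a mailbox.
    for header in ("From", "To", "Date"):
        if original[header] is not None:
            wrapper[header] = " ".join(str(original[header]).split())
    # The entire original message (headers and body) becomes the ciphertext.
    wrapper.set_payload(encrypt(raw, recipient).decode("ascii"))
    return wrapper.as_bytes()
```

Only the public key is needed here, which is why the server never has anything worth stealing: the private key stays on your own devices.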

Encrypting all mail on arrival is great security, but it also means that your inbox is encrypted and as current email clients decrypt your mail for viewing, but then “forget” the decrypted contents, encrypted messages are slower to view than unencrypted ones and, most crippling of all, you can’t search your encrypted mail.  This makes encrypted mail unusable, which is why nobody uses it after decades. This unusability is a tragic and pointless design flaw that originated to mitigate what was then, apparently, a sore spot with one of Phil’s friends whose wife had read his correspondence with another woman and divorce ensued; protecting the contents of email from client-side snooping has ever since been perceived as critical.[1]

It was a well-intentioned design constraint and has become a core canon of the GPG community, but is wrong-headed on multiple counts:

  1. An intimate partner is unlikely to need the contents of the messages to reach sufficient confidence in distrust: the presence of encrypted messages from a suspected paramour would be more than sufficient cause for a confrontation.
  2. It breaks far more frequent use such as business correspondence where operational efficiency is entirely predicated on content search which doesn’t work when the contents are encrypted.
  3. Most email compromises happen at the server, not at the client.
  4. Everyone seems to trust butt companies to keep their affairs private, much to the never-ending lulz of such companies.
  5. Substantive classes of client compromises, particularly targeted ones, capture keystrokes from the client, meaning if the legitimate user has access to the content of the messages, so too does the hacker, so the inconvenience of locally encrypted mail stores gains almost nothing.
  6. Server attacks are invisible to most users and most users can’t do anything about them.  Users, like Sony’s employees, are passive victims of sysadmin failures. Client security failures are the user’s own damn fault and the user can do something about them like encrypting the local storage of their device which protects their email and all their other sensitive and critical selfies, sexts, purchase records, and business correspondence at the same time.
  7. If you’re personally targeted at the client side, that some of your messages are encrypted provides very little additional security: the attacker will merely force you to reveal the keys.

Step Two: Client Fix

  • Your mail clients will decrypt your mail automatically and create local stores of unencrypted messages on your local devices.

If you’ve used GPG, you probably can’t access any mail you got more than a few days ago; it is dead to you because it is encrypted.  I’ve said before this makes it as useless as an ephemeral key encrypted chat but without the security of an ephemeral key in the event somebody is willing to force you to reveal your key and is interested enough to go through your encrypted data looking for something.  They’ll get it if they want it that bad, but you won’t be bothered.

But by storing mail decrypted locally and by decrypting mail as it is downloaded from the server, the user gets the benefit of “end-to-end encryption” without any of the hassles.
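A toy sketch of the client-side idea: decrypt each message once, when it is downloaded, and keep the plaintext in a local store so search works. `LocalMailStore` is a hypothetical name, and the `decrypt` callable is an assumption standing in for whatever the mail client would really use (a call to `gpg --decrypt`, say). A real client would keep this store on an encrypted disk, not in memory.

```python
class LocalMailStore:
    """Decrypt each message once on download; keep plaintext for local search."""

    def __init__(self, decrypt):
        self._decrypt = decrypt   # callable: ciphertext bytes -> plaintext str
        self._messages = {}       # message id -> plaintext

    def download(self, message_id, ciphertext):
        """Decrypt on first sight; later reads never ask for the key again."""
        if message_id not in self._messages:
            self._messages[message_id] = self._decrypt(ciphertext)
        return self._messages[message_id]

    def search(self, term):
        """Case-insensitive substring search over the decrypted copies."""
        needle = term.lower()
        return sorted(
            mid for mid, text in self._messages.items() if needle in text.lower()
        )


if __name__ == "__main__":
    # Stand-in "cipher" (hex encoding) so the sketch runs without gpg.
    def toy_decrypt(ciphertext):
        return bytes.fromhex(ciphertext.decode("ascii")).decode("utf-8")

    store = LocalMailStore(toy_decrypt)
    store.download("msg-1", "Quarterly budget attached".encode().hex().encode())
    store.download("msg-2", "Lunch on Friday?".encode().hex().encode())
    print(store.search("budget"))
```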

GPG-encrypted mail would work a lot more like an OTR encrypted chat.  You don’t get a message from OTR that reads “This chat message is encrypted, do you want to decrypt it?  Enter your password” every time you get a new chat, nor does the thread get re-encrypted as soon as you type something, requiring you to reenter your key to review any previous chat message.  That’d be idiotic.  But that’s what email does now.

Adoption Matters

These two simple changes would mean that server-side mail stores are secure, but just as easy to use and as accessible to clients as they are now.  Your local device security, as it is now, would be up to you.  You should encrypt your hard disk and use strong passwords because sooner or later your personal device will be lost or stolen and you don’t want all that stuff published all over the internet, whether it comes from your mail folder or your DCIM folder.

It doesn’t solve a targeted attack against your local device, but you’ll always be vulnerable to that and pretending that storing your encrypted email on your encrypted device in an encrypted form adds security is false security that has the unfortunate side effect of reducing usability and thus retarding adoption of real security.

If we did this, all of our email would be encrypted, which means there’s no additional hassle to getting mail that was encrypted with your GPG key by the sender (rather than on the server).  The way it works now, GPG is annoying enough to warrant asking people not to send encrypted mail unless they have to, which tags that mail as worth encrypting to anyone who cares.  By eliminating the disincentive, universally end-to-end encrypted email would become possible.

A few other minor enhancements that would help to really make end-to-end, universally encrypted email the norm include:

  • Update mail clients to prompt for key generation along with any new account (the only required option would be a password, which should be different from the server-log-in password since a hash of that has to be on the server and a hash crack of the account password would then permit decryption of the mail there, so UX programmers take note!)
  • Update address books, vcard, and LDAP servers so they expect a public key for each correspondent and complain if one isn’t provided or can’t be found.  An email address without a corresponding key should be flagged as problematic.
  • Corporate and hierarchical organizations should use a certificate authority-based key certification system, everyone else should use web-of-trust/perspectives style key verification, which can be easily automated to significantly reduce the risk of MitM attacks.
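The address book enhancement is the simplest of these to picture. A minimal sketch, assuming the address book is just a mapping from email address to public key (or nothing); the function name and data shape are made up for illustration, not any real vcard or LDAP API.

```python
def contacts_missing_keys(address_book):
    """Return the addresses that have no public key on file, sorted."""
    return sorted(addr for addr, key in address_book.items() if not key)


if __name__ == "__main__":
    book = {
        "alice@example.com": "-----BEGIN PGP PUBLIC KEY BLOCK-----...",
        "bob@example.com": None,
    }
    for addr in contacts_missing_keys(book):
        print(f"WARNING: no public key for {addr}")
```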

This is easy. It should have been done a long time ago.

 

Footnotes
1 I remember this anecdote from an early 1990’s version of PGP.  I may be mis-remembering it as the closest reference I can find is this FAQ:
Posted at 16:21:29 GMT-0700

Category: FreeBSD, Privacy, Security, Technology

Moar Privacy

Thursday, December 9, 2010 

I’m using an Ubuntu VM for private browsing, and like many people, I’m stuck using a mainstream OS for much of my work (Win7) due to software availability constraints. But some software works much better in a linux environment and Ubuntu is as pretty as OSX, free, and installs easily on generic x86 hardware.

It is also pretty straightforward to install an isolated and secure browsing instance using VirtualBox. It takes about 20G of hard disk and will use up at least 512M (better 1G) of your system RAM. If you want to run this sort of config, your laptop should have more than enough disk space and RAM to support the extra load without bogging, but it is a very solid solution.

Installing Ubuntu is easy – even easier with an application like VirtualBox – just install virtualbox, download the latest ubuntu ISO, and install from there. If you’re on bare metal, the easiest thing to do is burn a CD and install off that.

Ubuntu desktop comes with Firefox in the tool bar. Customizing for private browsing is a bit more involved.

My first steps are to install:

NoScript is an easy win. It is a bit of a pain to set up at first, but soon you add exceptions for all your favorite sites and while that isn’t great security practice, it is essential for sane browsing. NoScript is particularly helpful when browsing the wacky parts of the net and not getting exotic browsing diseases: it is your default dental dam. Be careful of allowing domains you don’t recognize – Google them first and make sure you understand why they need to run a script on your computer and that it is safe. A lot of sites use partners for things like video feeds, so if some function seems broken, you probably need to allow that particular domain. On the other hand, most of the off-site scripts are tracking or stats and you really don’t need to play along with them.

BetterPrivacy is a new one for me. I am very impressed that it found approximately 1.3 zillion (OK 266) different company flash cookies AFTER I had installed TACO and noscript etc. You bastards. I’m sure I can enjoy hulu without making my play history shared-available to every flash site I might visit. Always Sunny in Philadelphia marks me as a miscreant. I flush the flash cookies on starting silently (preferences).

TACO is a bit intrusive, but it seems to work to selectively block tracking and advertising cookies. At least the pop up is comforting. For private browsing, I’d set it to reject all classes of tracking cookies (change the preferences from default).

User Agent Switcher is useful when you’re deviating from the mainstream. Running Ubuntu pretty much flags you as a trouble maker or at least a dissident. Firefox maybe a bit less so, but you are indicating to advertisers that you don’t respect the expertise of those people far smarter than you who pre-installed IE (or Safari) to make your life easier. Set your user agent to IE 8 because the nail that sticks up gets pounded down.
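For the curious, all the switcher really changes is one request header. A small sketch of the same idea outside the browser, using Python’s standard library; the IE 8 string below is a typical example of what that browser sent on Win7, and `request_as_ie8` is a hypothetical helper name.

```python
from urllib.request import Request

# A typical IE 8 on Windows 7 user agent string (illustrative).
IE8_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)"


def request_as_ie8(url):
    """Build a request that claims to come from IE 8 instead of the real client."""
    return Request(url, headers={"User-Agent": IE8_USER_AGENT})


if __name__ == "__main__":
    req = request_as_ie8("http://example.com/")
    # Nothing is sent here; this only shows the header that would go out.
    print(req.get_header("User-agent"))
```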

Torbutton needs Tor to work. Tor provides really good privacy, but is a bit involved. The Tor Button Plugin for firefox makes it seem easier than it really is: you install it and click “use tor” and it looks like it is working but the first site you visit you get a proxy error because Tor isn’t actually running (DOH!).

To get Tor to work, you will have to open a terminal and do some command line fu before it will actually let you browse. Tor is also easier to install on Ubuntu than on Windows (at least for me, but as my browser history indicates I’m a bit of a miscreant dissident, so your mileage may vary).

Starting with these fine instructions.

sudo gedit /etc/apt/sources.list
add
deb http://deb.torproject.org/torproject.org/ lucid main
deb-src http://deb.torproject.org/torproject.org/ lucid main

Then run
gpg --keyserver keys.gnupg.net --recv 886DDD89
gpg --export A3C4F0F979CAA22CDBA8F512EE8CBC9E886DDD89 | sudo apt-key add -
sudo apt-get update
sudo apt-get upgrade
sudo apt-get update
sudo apt-get install tor tor-geoipdb

Install vidalia with the graphical ubuntu software center or with
sudo apt-get install vidalia

Tor expects Polipo. And vidalia makes launching and checking on Tor easier, so remove the startup scripts. (If Tor is running and you try to start it from vidalia, you get an uninformative error, vidalia has a “launch at startup” option, so let it run things.) Vidalia appears under the Applications->Network.

sudo update-rc.d -f tor remove

Polipo was installed with Tor, so configure it:
sudo gedit /etc/polipo/config

Clear the file (ctrl-a, delete)
paste in the contents of this file:

UPDATE: paste in the contents of this file:

(if the link above fails, search for “polipo.conf” to find the latest version)

I added the binary for polipo in Vidalia’s control panel, but that may be redundant (it lives in /usr/bin/polipo).

I had to reboot to get everything started.
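If you hit the proxy error described above, a quick way to see whether anything is actually listening is to try connecting to the local ports. This is a sketch, not part of the Tor instructions: 9050 is Tor’s default SOCKS port, but the Polipo port depends on the config file you pasted in, so the 8118 below is an assumption to adjust.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Port numbers are assumptions; check your own torrc and polipo config.
    for name, port in (("tor", 9050), ("polipo", 8118)):
        state = "up" if is_listening("127.0.0.1", port) else "NOT running"
        print(f"{name} on port {port}: {state}")
```

This only proves something is listening on the port, not that your browser traffic is going through it.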

And for private chats, consider OTR!

Posted at 17:45:45 GMT-0700

Category: Politics, Technology

Opting Out for Privacy

Friday, December 3, 2010 

There’s a great story at the Wall Street Journal describing some of the techniques that are being used to track people online that I found informative (as are the other articles listed in the series in the box below).  EFF is doing some good work on this; your browser configuration probably uniquely identifies you and thus every site you’ve ever visited (via data exchanges).  Unique information about you is worth about $0.001.  Collecting a few hundred million 1/10ths of a cent starts to add up (300 million of them is $300,000) and may end up raising your insurance premiums.
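To make the “browser configuration uniquely identifies you” point concrete, here is a toy sketch of fingerprinting: hash together whatever the browser volunteers and you get a stable ID with no cookie required. This is not EFF’s method or any tracker’s actual code, and the attribute names are invented for illustration.

```python
import hashlib


def browser_fingerprint(attributes):
    """Hash a dict of browser attributes into a stable identifier."""
    # Sort the keys so the same configuration always yields the same hash.
    canonical = "\n".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    me = {
        "user_agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox",
        "screen": "1920x1200x24",
        "timezone": "-420",
        "fonts": "DejaVu Sans,Liberation Mono,Ubuntu",
    }
    print(browser_fingerprint(me))
```

The more unusual your setup (odd fonts, rare plugins, a minority OS), the fewer other people share your hash.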

One of the more entertaining/disturbing tricks is to use “click jacking” to remotely enable a person’s webcam or microphone.  Is your computer or network running slowly? Maybe it is the video you’re inadvertently streaming back (and maybe you just have way too many tabs open…)

A few things you can do to improve your privacy include:

  • Opt out of Rapleaf. Rapleaf collects user information about you and ties it to your email address.  You have to opt out with each email address individually, which almost certainly confirms to them that all your email addresses belong to the same person.  You might want to use unique Tor sessions for each opt out if you don’t want them to get more information than they already have via the process.
  • Opt out at NAI. This is a one stop shop for the basic cookie tracking companies that are attempting to be semi-compliant with privacy requests.  If you enable javascript for the site (which would be disabled by default if you’re using scriptblocker) then you can opt out of all of them at once.  Presumably you have to return and opt out again every time a new company comes along.
  • Use Tor for anything sensitive.  If you care about privacy, learn about Tor.  It does slow browsing so you have to be very committed to use it for everything.  But the browser plug in makes it pretty easy to turn it on for easy browsing.
  • Don’t use IE for anything personal or important.
  • Run SpyBot Search and Destroy regularly.  Spybot helps block BHOs and toolbars that seem to proliferate automagically and helps remove tracking cookies.  You’ll be amazed at how many are installed on your system.  I have run it both with and without TeaTimer; I’m less excited about having a lot of background tools, even helpful ones, than I used to be.  Spybot currently starts out looking for 1,359,854 different known spywares.  Yikes.
  • Check what people know about you:  Google will tell you, so will Yahoo.  Spooky.
  • Use firefox.  If for no other reason than the following plugins (personally, it is my favorite, but I know people who favor chrome or even rockmelt, but talk about tracking!)  Just don’t use IE.
  • Use the private browsing mode in your browser (CTRL-SHIFT-P in Firefox).  It’d be nice if you could enable non-private browsing on a whitelist basis for sites you either trust or have to trust.  We’ll get there eventually…
  • TACO should help block flash cookies.
  • Install noscript to block scripts by default.  You can add all your favorite sites as you go so things work.  It is a pain in the ass for a while, but security requires vigilance.
  • Install adblock plus.  It helps keep the cookies away.    It also reduces ad annoyance.  You can enable ads for your favorite sites so they can pay their colo fees.
  • Add HTTPS Everywhere from EFF. The more your connections to sites are encrypted, the less your ISP (and others) can see about what you’re doing while you’re there.  Your ISP still knows every site you visit, and probably sells that information, but if your sessions are encrypted they don’t see the actual text you type.  It also makes it harder for script kiddies to grab your passwords at the cafe.
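On the HTTPS point, a simplified sketch of what someone watching the wire (your ISP, the guy at the cafe) can see for a given URL. It assumes the site name still leaks through DNS lookups and the TLS handshake while the path and query string are encrypted; `visible_to_isp` is a made-up helper that models this, not a packet sniffer.

```python
from urllib.parse import urlsplit


def visible_to_isp(url):
    """Model which parts of a URL a network observer can read."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        # The host is still exposed; path and query travel inside TLS.
        return {"host": parts.hostname, "path": None, "query": None}
    # Plain HTTP: everything is readable on the wire.
    return {"host": parts.hostname, "path": parts.path, "query": parts.query}


if __name__ == "__main__":
    for url in ("http://example.com/search?q=secret",
                "https://example.com/search?q=secret"):
        print(url, "->", visible_to_isp(url))
```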
Posted at 02:44:43 GMT-0700

Category: Politics, Privacy, Security, Technology