
Smol bash script for finding oversize media files

Friday, September 2, 2022 

Sometimes you want to know if you have media files that are taking up more than their fair share of space. Maybe you compressed a file some time ago in an old, inefficient format, or you just need to archive the oversize stuff; this script can help you find those files. It differs from plain file-size detection in that it uses mediainfo to determine the media file length and a variety of other useful data bits, and wc -c to get the size (so the data rate includes any container overhead), and from those computes the total effective data rate. All math is done with bc, which is usually installed. Files are found recursively (descending into sub-directories) from the starting point (passed as the first argument) using find.
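
At its core the measurement is simple; a minimal sketch of the same calculation for a single file (assuming mediainfo is installed; example.mp4 is a hypothetical file):

size=$(wc -c < "example.mp4")                              # size in bytes, container overhead included
ms=$(mediainfo --Inform="Video;%Duration%" "example.mp4")  # video duration in milliseconds
bc -l <<<"scale=1; ($size/1024) / ($ms/1000)"              # effective data rate, as the script computes it

The full script below adds input validation, the bits-per-pixel calculation, and the recursive search.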

Basic usage would be:

./find-high-rate-media.sh /search/path/tostart/ [min bpp] [min data rate] [min size] > oversize.csv 2>&1

The script will then report media with a rate higher than the minimum and a size larger than the minimum as a tab-delimited list of filenames, calculated rate, and calculated size. Redirecting the output to a file, oversize.csv above, makes it easy to sort and otherwise manipulate in LibreOffice Calc as a tab-delimited file. The values are interpreted as minimums for suppression of output, so any file that exceeds all three minimum thresholds will be printed to the screen (or to the .csv file if so redirected).
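
For a quick look at the worst offenders without opening Calc, the output sorts cleanly on the command line; a sketch, assuming the oversize.csv redirect above (bpp is column 2):

sort -t$'\t' -k2,2 -nr oversize.csv | head -20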

The script takes four command-line arguments (an example invocation follows the list):

  • The starting directory [defaults to the current working directory]
  • The minimum bits per pixel (including audio, sorry); files with a higher bpp will be listed [defaults to 0.25 bpp]
  • The minimum data rate in kbps [defaults to 1 kbps, so files would by default only be excluded by bits per pixel rate]
  • The minimum file size in megabytes [defaults to 1 MB, so files would by default only be excluded by bits per pixel rate]
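
For example, a run over a hypothetical ~/Videos tree reporting only files over 50 MB that also exceed 0.15 bpp and 2000 kbps:

./find-high-rate-media.sh ~/Videos 0.15 2000 50 > oversize.csv 2>&1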

Save the file under a name you like (such as find-high-rate-media.sh), make it executable with chmod +x find-high-rate-media.sh, and run it to find your oversized media.

#!/usr/bin/bash
############################# USE #######################################################
# This creates a tab-delimited CSV file of recursive directories of media files enumerating
# key compression parameters.  Note bits per pixel includes audio, somewhat necessarily given
# the simplicity of the analysis. This can throw off the calculation.
# find-high-rate-media.sh /starting/path/ [min bits per pixel] [min data rate] [min file size mb]
# ./find-high-rate-media.sh /Media 0.2 400 0 > /recomp.csv 2>&1
# The "find" command will traverse the file system from the starting path down.
# if output isn't directed to a CSV file, it will be written to screen. If directed to CSV
# this will generate a tab delimited csv file with key information about all found media files
# the list of supported extensions can be extended if it isn't complete, but verify that the 
# format is parsable by the tools called for extracting media information - mostly mediainfo
# Typical bits per pixel range from 0.015 for an HEVC highly compressed file at the edge of obvious
# degradation to quite a bit higher.  Raw would be 24, or even 30 bits per pixel for 10-bit raw.
# Uncompressed YUV video is about 12 bpp. 
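# As a worked example of the math: a 1920x1080 file at 24 fps carries 1920*1080*24, about 49.8
# million pixels per second, so the default 0.25 bpp threshold corresponds to an effective data
# rate of roughly 12.4 Mbit/s; anything above that gets reported.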
# this can be useful for finding under and/or overcompressed video files
# the program will suppress output if the file's bits per pixel is below the supplied threshold
# to reverse this, invert the bpp test to " if (( $(bc -l <<<"$bps < $maxr") )); then..."
# if a min data rate is supplied, output will be suppressed for files with a lower data rate
# if a min file size is supplied, output will be suppressed for files smaller than this size
########################################################################################

# No argument given?
if [ -z "$1" ]; then
  printf "\nUsage:\n  starting by default in the current directory and searchign recusrively \n"
  dir="$(pwd)"
  else
        dir="$1"
        echo -e "starting in " $dir ""
fi

if [ -z "$2" ]; then
  printf "\nUsage:\n  returning files with bits per pixel greater than default max of .25 bpp \n" 
  maxr=0.25
  else
        maxr=$2
        echo -e "returning files with bits per pixel greater than " $maxr " bpp" 
fi

if [ -z "$3" ]; then
  printf "\nUsage:\n  returning files with data rate greater than default max of 1 kbps \n" 
  maxdr=1
  else
        maxdr=$3
        echo -e "returning files with data rate greater than " $maxdr " kbps" 
fi


if [ -z "$4" ]; then
  printf "\nUsage:\n  no min file size provided returning files larger than 1MB \n" 
  maxs=1
  else
        maxs=$4
        echo -e "returning files with file size greater than " $maxs " MB  \n\n" 
fi


msec="1000"
kilo="1024"
reint='^[0-9]+$'
refp='^[0-9]+([.][0-9]+)?$'

echo -e "file path \t rate bpp \t rate kbps \t V CODEC \t A CODEC \t Frame Size \t FPS \t Runtime \t size MB"

find "$dir" -type f \( -iname \*.avi -o -iname \*.mkv -o -iname \*.mp4 -o -iname \*.wmv -iname \*.m4v \) -print0 | while read -rd $'\0' file
do
  if [[ -f "$file" ]]; then
    bps="0.1"
    size="$(wc -c  "$file" |  awk '{print $1}')"
    duration="$(mediainfo --Inform="Video;%Duration%" "$file")"
    if ! [[ $duration =~ $refp ]] ; then
       duration=$msec
    fi
    seconds=$(bc -l <<<"${duration}/${msec}")
    sizek=$(bc -l <<<"scale=1; ${size}/${kilo}")
    sizem=$(bc -l <<<"scale=1; ${sizek}/${kilo}")
    rate=$(bc -l <<<"scale=1; ${sizek}/${seconds}")
    codec="$(mediainfo --Inform="Video;%Format%" "$file")"
    audio="$(mediainfo --Inform="Audio;%Format%" "$file")"
    framerate="$(mediainfo --Inform="General;%FrameRate%" "$file")"
    if ! [[ $framerate =~ $refp ]] ; then
       framerate=100
    fi
    rtime="$(mediainfo --Inform="General;%Duration/String3%" "$file")"
    width="$(mediainfo --Inform="Video;%Width%" "$file")"
    if ! [[ $width =~ $reint ]] ; then
       width=1
    fi
    height="$(mediainfo --Inform="Video;%Height%" "$file")"
    if ! [[ $height =~ $reint ]] ; then
       height=1
    fi
    pixels=$(bc -l <<<"scale=1; ${width}*${height}*${seconds}*${framerate}")
    bps=$(bc -l <<<"scale=4; ${size}*8/${pixels}")
    if (( $(bc -l <<<"$bps > $maxr") )); then
        if (( $(bc -l <<<"$sizem > $maxs") )); then
            if (( $(bc -l <<<"$rate > $maxdr") )); then
                echo -e "$file" "\t" $bps "\t" $rate "\t" $codec "\t" $audio "\t" $width"x"$height "\t" $framerate "\t" $rtime "\t" $sizem
            fi
        fi
    fi
  fi
done


Another common task is renaming video files with some key stats about their contents so they're easier to find and compare. Linux has limited desktop integration with media information (Dolphin is somewhat capable, but Thunar not so much). This little script also leans on the mediainfo command line tool to append the following to the file names of media files recursively found below a starting directory path:

  • WidthxHeight in pixels (e.g. 1920×1080)
  • Runtime in HH-MM-SS.msec (e.g. 02-38-15.111); colons aren't a good thing in filenames, though yes, it looks confusingly like a date
  • CODEC name (e.g. AVC)
  • Datarate (e.g. 1323kbps)

For example

kittyplay.mp4 -> kittyplay_1280x682_02-38-15.111_AVC_154.3kbps.mp4


#!/usr/bin/bash
PATH="/home/gessel/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

############################# USE #######################################################
# find_media.sh /starting/path/ (quote path names with spaces)
########################################################################################

# No argument given?
if [ -z "$1" ]; then
  printf "\nUsage:\n  pass a starting point like \"/Downloads/Media files/\" \n" 
  exit 1
fi

msec="1000"
kilo="1024"
s="_"
x="x"
kbps="kbps"
dot="."

find "$1" -type f \( -iname \*.avi -o -iname \*.mkv -o -iname \*.mp4 -o -iname \*.wmv \) -print0 | while read -rd $'\0' file
do
  if [[ -f "$file" ]]; then
    size="$(wc -c  "$file" |  awk '{print $1}')"
    duration="$(mediainfo --Inform="Video;%Duration%" "$file")"
    seconds=$(bc -l <<<"${duration}/${msec}")
    sizek=$(bc -l <<<"scale=1; ${size}/${kilo}")
    sizem=$(bc -l <<<"scale=1; ${sizek}/${kilo}")
    rate=$(bc -l <<<"scale=1; ${sizek}/${seconds}")
    codec="$(mediainfo --Inform="Video;%Format%" "$file")"
    framerate="$(mediainfo --Inform="General;%FrameRate%" "$file")"
    rtime="$(mediainfo --Inform="General;%Duration/String3%" "$file")"
    runtime="${rtime//:/-}"
    width="$(mediainfo --Inform="Video;%Width%" "$file")"
    height="$(mediainfo --Inform="Video;%Height%" "$file")"
    fname="${file%.*}"
    ext="${file##*.}"
    $(mv "$file" "$fname$s$width$x$height$s$runtime$s$codec$s$rate$kbps$dot$ext")
  fi
done
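
If you want to preview the renames before committing to them, one cautious option is to replace the mv line with an echo of the same command and inspect the proposed names first:

echo mv "$file" "$fname$s$width$x$height$s$runtime$s$codec$s$rate$kbps$dot$ext"

Once the output looks right, restore the mv.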

If you don’t have mediainfo installed:

sudo apt update
sudo apt install mediainfo
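
Once installed, confirm the binary is on your path with:

mediainfo --Version
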
Posted at 10:18:58 GMT-0700

Category: Audio, HowTo, Linux, video

Search Engine Enhancement

Wednesday, September 12, 2007 

Getting timely search engine coverage of a site means people can find things soon after you change or post them.

Linked pages get searched by most search engines, which follow external links or manual URL submissions every few days or so, but they won’t find unlinked pages or broken links, and the ranking and efficiency of the search are likely suboptimal compared to a site that is indexed for easy searching using a sitemap.

There are three basic steps to having a page optimally indexed:

  • Generating a Sitemap
  • Creating an appropriate robots.txt file
  • Informing search engines of the site’s existence

Sitemaps
It seems like the world has settled on sitemaps for making search engines’ lives easier. There is no indication that a sitemap actually improves rank or search rate, but it seems likely that it does, or that it will soon. The format was created by Google, and is supported by Google, Yahoo, Ask, and IBM, at least. The reference is at sitemaps.org.

Google has created a python script to generate a sitemap through a number of methods: walking the HTML path, walking the directory structure, parsing Apache-standard access logs, parsing external files, or direct entry. It seems to me that walking the server-side directory structure is the easiest, most accurate method. The script itself is on SourceForge. The directions are good, but if you’re only using directory structure, the config.xml file can be edited down to something like:

<?xml version="1.0" encoding="UTF-8"?>
<site
  base_url="http://www.your-site.com/"
  store_into="/www/data-dist/your_site_directory/sitemap.xml.gz"
  verbose="1"
  >

 <url href="http://www.your-site.com/" />
 <directory
    path="/www/data-dist/your_site_directory"
    url="http://www.your-site.com/"
    default_file="index.html"
 />

</site>

Note that this will index every file on the site, which can make for a large sitemap. If you use your site for media files or file transfer, you might not want to index every part of the site, in which case you can use filters to block the indexing of parts of the site or certain file types. If you only want to index web files you might insert the following:

 <filter  action="pass"  type="wildcard"  pattern="*.htm"           />
 <filter  action="pass"  type="wildcard"  pattern="*.html"          />
 <filter  action="pass"  type="wildcard"  pattern="*.php"           />
 <filter  action="drop"  type="wildcard"  pattern="*"               />

Running the script with

python sitemap_gen.py --config=config.xml

will generate the sitemap.xml.gz file and put it in the right place. If the uncompressed file size is over 10MB, you’ll need to pare down the files listed. This can happen if the filters are more inclusive than what I’ve given, particularly if you have large photo or media directories or something like that and index all the media and thumbnail files.
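
Since the limit applies to the uncompressed size, a quick check (a sketch, using the store_into path from the config above):

gunzip -c /www/data-dist/your_site_directory/sitemap.xml.gz | wc -c

should report comfortably under 10485760 bytes.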

The sitemap will tend to get out of date. If you want to update it regularly, there are a few options: one is to use a WordPress sitemap generator (if that’s what you’re using and indexing), which does the right thing and generates a sitemap using relevant data available to WordPress but not to the file system (a good thing); another is to add a cron job to regenerate the sitemap regularly, for example

3  3  *  *  *  root python /path_to/sitemap_gen.py --config=/path_to/config.xml

will update the sitemap daily.

robots.txt

The robots.txt file can be used to exclude certain search engines (for example MSN, if you don’t like Microsoft for some reason and are willing to sacrifice traffic to make a point), and it also points search engines to your sitemap file. There’s kind of a cool tool here that generates a robots.txt file for you, but a simple one might look like:

User-agent: MSNBot                             # Agent I don't like for some reason
Disallow: /                                    # path it isn't allowed to traverse
User-agent: *                                  # For everything else
Disallow:                                      # Nothing is disallowed
Disallow: /cgi-bin/                            # Directory nobody can index
Sitemap: http://www.my_site.com/sitemap.xml.gz # Where my sitemap is.
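
A quick sanity check that crawlers will see what you intend (using the example host above):

curl http://www.my_site.com/robots.txt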

Telling the world

Search engines are supposed to do the work, that’s their job, and they should find your robots.txt file eventually, read the sitemap, and parse your site without any further assistance. But to expedite the process and possibly enhance search results there are some submission tools at Yahoo, Ask, and particularly Google that generally allow you to add meta information.
Ask
Ask.com allows you to submit your sitemap via URL (and that seems to be all they do)
http://submissions.ask.com/ping?sitemap=http://www.your_site.com/sitemap.xml.gz


Yahoo
Yahoo has some site submission tools and supports site authentication, which means putting a random string in a file they can find to prove you have write-access to the server. Their tools are at
https://siteexplorer.search.yahoo.com/mysites


with submissions at
https://siteexplorer.search.yahoo.com/submit.php


you can submit sites and feeds. I usually use the file authentication, which means creating a file with a random name (y_key_random_string.html) containing another random string as its only contents. They authenticate within 24 hours.
It isn’t clear whether submitting a feed also adds the site, but it looks like it does. If you don’t have a feed you may not need to authenticate the site for submission.
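
Creating the authentication file is a one-liner; for example (both random strings are placeholders for the values Yahoo supplies):

echo "second_random_string" > y_key_random_string.html
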
Google
Google has a lot of webmaster tools at
https://www.google.com/webmasters/tools/siteoverview?hl=en


The verification process is similar, but you don’t have to put data inside the verification file, so

touch googlerandomstring.html

is all you need to get the verification file up. You submit the URL to the sitemap directly.
Google also offers blog tools at
http://blogsearch.google.com/ping


where you can manually add the feed for the blog to Google’s blog search tool.

Posted at 13:25:13 GMT-0700

Category: FreeBSD, Technology