Bash: Check for Duplicates

Large tubes in a concrete bunker.
Image courtesy of beeveephoto

I often find myself wanting to do a series of repetitive tasks for a small set of data. XKCD has a nice guide for when to write a script for repetitive tasks. Unfortunately, I tend to end up with one-off tasks that I would never use a script for again. Still, I will write a script to automate the task. Part of me thinks I am saving time, but I’m probably not.

Why would I write a script for a one-off task? Because it’s fun, but more importantly, I always learn a little something new when solving these kinds of tasks.

Recently I was faced with the task of finding logical duplicates in my music collection: albums that may exist twice under similar names. I used an automagic tool to move and organize most of my music files, but some were left in the previous location after the operation. I needed to check whether they were duplicates that could be deleted, or whether I needed to manually import the leftovers into my sorted music folder.

One of the challenges is that the files could have been renamed as they were automatically sorted and moved to the new music folder. I wanted a short command that would check part of the file name for a pattern. Originally I wrote this on the command line as a single line. I’ve broken it up into several lines to make it easier to decompose.

This is what I ended up with.

for i in *; do
    if [[ -d "${i}" ]]; then
        find ../audio/_music-sorted/ -iname "*`echo ${i:0:12} | cut -d '-' -f 1 | awk '{$1=$1};1'`*" -type d ;
    fi
done

Let’s take it line by line.

First we are going to loop over everything in the current directory

 for i in *; do

Then we test for directories only. My music is sorted by album, so I’m really just looking to see if the album name exists in the directory tree in the new location.

    if [[ -d "${i}" ]]; then
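To see the directory test in isolation, here is a minimal sketch that prints only the directories inside a given folder. The function name list_dirs and the sample file names are made up for illustration; they are not part of the original one-liner.

```shell
#!/usr/bin/env bash
# list_dirs is a hypothetical helper: print the names of the directories
# directly inside the folder given as $1, skipping regular files.
list_dirs() {
    local entry
    for entry in "$1"/*; do
        if [[ -d "${entry}" ]]; then
            # Strip the leading path so only the directory name is printed.
            printf '%s\n' "${entry##*/}"
        fi
    done
}

# Throwaway demo data: one album folder and one stray file.
demo=$(mktemp -d)
mkdir "$demo/Some Album"
touch "$demo/stray-track.mp3"
list_dirs "$demo"    # only the album folder passes the -d test
rm -rf "$demo"
```

A side effect of the test is that an empty folder is handled quietly: the unmatched pattern stays a literal string, which is not a directory, so nothing is printed.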

Now we run a find command, rooted at the new location for sorted music. We are looking for directories whose names contain the substring returned by the subshell command (the part in backticks). It’s also worth noting that the subshell is wrapped in a beginning "* and an ending *", which are the wildcards on either side of the album name substring.

        find ../audio/_music-sorted/ -iname "*`echo ${i:0:12} | cut -d '-' -f 1 | awk '{$1=$1};1'`*" -type d ;
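The wildcard matching can be tried on its own with a fixed search string in place of the subshell. This is only a sketch: match_dirs and the sample album folders are hypothetical names, and -iname is assumed to be available, as it is in GNU and BSD find.

```shell
#!/usr/bin/env bash
# match_dirs is a hypothetical helper: print directories under root ($1)
# whose name contains key ($2), ignoring case.
match_dirs() {
    local root=$1 key=$2
    # The double quotes stop the shell from expanding the asterisks itself,
    # so find receives the pattern and applies it to each directory's name.
    find "$root" -type d -iname "*${key}*"
}

# Throwaway demo data standing in for the sorted music folder.
demo=$(mktemp -d)
mkdir -p "$demo/Radiohead/KID A (Remaster)" "$demo/Radiohead/Amnesiac"
match_dirs "$demo" "kid a"    # matches despite the difference in case
rm -rf "$demo"
```

Because the pattern is matched against the directory name only, not the full path, a key such as "kid a" finds the album folder even though it sits below an artist folder.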

The value used for -iname is a series of commands in a subshell, so we will break it out to look at each command individually.

The aggregate output of this is a substring of the album name. First we use echo to pass the first 12 characters of the directory name, ${i:0:12}, to cut, which splits on - and keeps the first field, though I also used _ for a couple of runs. This ensures that I don’t get the artist’s name as part of my substring. The result is then passed to awk, which strips the leading and trailing whitespace (and collapses any runs of spaces in between).

`echo ${i:0:12} | \
    cut -d '-' -f 1 | \
    awk '{$1=$1};1'`
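To watch the pipeline work on concrete input, it can be wrapped in a small function and fed a couple of sample names. This is a sketch under assumptions: album_key is a hypothetical name, the sample directory names are invented, and printf stands in for echo so that unusual names are passed through unchanged.

```shell
#!/usr/bin/env bash
# album_key is a hypothetical helper: reduce a directory name to the
# search key used in the find command.
album_key() {
    local name=$1
    # 1. ${name:0:12} keeps the first 12 characters of the name.
    # 2. cut keeps everything before the first '-'.
    # 3. awk rebuilds the line, trimming leading and trailing whitespace.
    printf '%s\n' "${name:0:12}" | cut -d '-' -f 1 | awk '{$1=$1};1'
}

album_key "Dark Side - Pink Floyd"    # prints: Dark Side
album_key "  Kid A-Radiohead"         # prints: Kid A
```

If the first 12 characters contain no - at all, cut passes the line through unchanged, so the key is simply those 12 characters with the whitespace trimmed.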

The final output of the overall command is a list of paths to albums in the new sorted directory that match albums from the old location. It was then easy to delete the duplicates and to move the ones that were missed.
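For anyone who wants to reuse the idea, here is a slightly more defensive sketch of the same loop. It is a variation, not the command that was actually run: the function name find_album_matches and its two arguments are hypothetical, $(...) replaces the backticks, and every expansion is quoted so names with spaces survive intact.

```shell
#!/usr/bin/env bash
# find_album_matches is a hypothetical helper.
#   $1: folder holding the leftover album directories
#   $2: root of the sorted music folder to search
find_album_matches() {
    local leftovers=$1 sorted=$2 dir name key
    for dir in "$leftovers"/*/; do
        [[ -d "$dir" ]] || continue    # also skips the unmatched literal pattern
        name=$(basename "$dir")
        key=$(printf '%s\n' "${name:0:12}" | cut -d '-' -f 1 | awk '{$1=$1};1')
        [[ -n "$key" ]] || continue    # an empty key would match every directory
        find "$sorted" -type d -iname "*${key}*"
    done
}

# Throwaway demo data: one leftover with a sorted copy, one without.
demo=$(mktemp -d)
mkdir -p "$demo/old/Kid A - Radiohead" "$demo/old/Unsorted Demo"
mkdir -p "$demo/sorted/Radiohead/Kid A (2000)"
find_album_matches "$demo/old" "$demo/sorted"    # lists only the Kid A folder
rm -rf "$demo"
```

One caveat that applies to the original command as well: a key containing characters such as [ or * is still treated as part of the pattern by find, so unusual album names may need a manual check.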
