Large file listing diffs with postgres

1 minute read

I recently needed to know what files were missing from a backup (~2 million files). Directory structure aside, I simply needed to know what files were missing from the destination. Once I got a recursive file listing of both source and destination directories, I imported them into unique postgres tables and did a LEFT OUTER JOIN query to see what was present in the source but absent in the destination.

docker postgres

Spin up postgres in Docker leveraging env_file with credentials and an initial database named sample_db.

# postgres.env
POSTGRES_DB="sample_db" # default Postgres database
# docker-compose.yaml
version: '3'

    container_name: postgres
    image: postgres:14.1
    restart: always
      - 5432:5432
      - ./postgres.env # contains POSTGRES_USER, POSTGRES_PASSWORD, and POSTGRES_DB
      - ./postgresql_data:/var/lib/postgresql/data

docker-compose up -d

connect to postgres

Export the following environment variables for psql.

# .envrc
export PGDATABASE=sample_db
export PGUSER=post_user
export PGPASSWORD=post_pass

Connect with psql. You’ll be dropped into the default Postgres database.

➜ psql
psql (16.1, server 14.1 (Debian 14.1-1.pgdg110+1))
Type "help" for help.


create tables

From the sample_db=# prompt create two tables with rows named file_name.

CREATE TABLE destination_files (
  file_name TEXT
CREATE TABLE source_files (
  file_name TEXT

List the tables.

               List of relations
 Schema |       Name        | Type  |   Owner
 public | destination_files | table | post_user
 public | source_files      | table | post_user
(2 rows)

import data

sample data snippets

Below are snippets from destination_files.txt and source_files.txt.

# source_files.txt
# destination_files.txt

use psql to copy the data

Exit from the psql sample_db=# prompt. Use psql -c to copy the data into their respective tables.

➜ psql -c "\COPY source_files (file_name) FROM './source_files.txt'"
COPY 2131354
➜ psql -c "\COPY destination_files (file_name) FROM './destination_files.txt'"
COPY 2131350

diff the data

Connect with psql again. From the sample_db=# prompt run the following query.

FROM source_files
LEFT OUTER JOIN destination_files ON source_files.file_name = destination_files.file_name
WHERE destination_files.file_name IS NULL;

From the output we can see the four files present in the source but absent in the destination.

 file_name | file_name
-----------+----------- | | | |
(4 rows)