Large file listing diffs with postgres
I recently needed to know what files were missing from a backup (~2 million files). Directory structure aside, I simply needed to know what files were missing from the destination. Once I got a recursive file listing of both source and destination directories, I imported them into unique postgres tables and did a LEFT OUTER JOIN
query to see what was present in the source but absent in the destination.
docker postgres
Spin up postgres in Docker leveraging env_file with credentials and an initial database named sample_db
.
# postgres.env
POSTGRES_USER="post_user"
POSTGRES_PASSWORD="post_pass"
POSTGRES_DB="sample_db" # default Postgres database
# docker-compose.yaml
version: '3'
services:
postgres:
container_name: postgres
image: postgres:14.1
restart: always
ports:
- 5432:5432
env_file:
- ./postgres.env # contains POSTGRES_USER, POSTGRES_PASSWORD, and POSTGRES_DB
volumes:
- ./postgresql_data:/var/lib/postgresql/data
docker-compose up -d
connect to postgres
Export the following environment variables for psql
.
# .envrc
export PGHOSTADDR=127.0.0.1
export PGDATABASE=sample_db
export PGUSER=post_user
export PGPASSWORD=post_pass
Connect with psql
. You’ll be dropped into the default Postgres database.
➜ psql
psql (16.1, server 14.1 (Debian 14.1-1.pgdg110+1))
Type "help" for help.
sample_db=#
create tables
From the sample_db=#
prompt create two tables with rows named file_name
.
CREATE TABLE destination_files (
file_name TEXT
);
CREATE TABLE source_files (
file_name TEXT
);
List the tables.
\dt
List of relations
Schema | Name | Type | Owner
--------+-------------------+-------+-----------
public | destination_files | table | post_user
public | source_files | table | post_user
(2 rows)
import data
sample data snippets
Below are snippets from destination_files.txt
and source_files.txt
.
# source_files.txt
36719.zip
36720.zip
36721.zip
36722.zip
36723.zip
36724.zip
36725.zip
36726.zip
# destination_files.txt
893145.zip
893146.zip
893147.zip
893148.zip
893149.zip
893150.zip
893151.zip
893152.zip
use psql
to copy the data
Exit from the psql
sample_db=#
prompt. Use psql -c
to copy the data into their respective tables.
➜ psql -c "\COPY source_files (file_name) FROM './source_files.txt'"
COPY 2131354
➜ psql -c "\COPY destination_files (file_name) FROM './destination_files.txt'"
COPY 2131350
diff the data
Connect with psql
again. From the sample_db=#
prompt run the following query.
SELECT *
FROM source_files
LEFT OUTER JOIN destination_files ON source_files.file_name = destination_files.file_name
WHERE destination_files.file_name IS NULL;
From the output we can see the four files present in the source but absent in the destination.
file_name | file_name
-----------+-----------
1344167.zip |
1310270.zip |
1420005.zip |
2033203.zip |
(4 rows)