Iso-anonymizer.pl

From Cactus Howto
Revision as of 11:49, 15 May 2016 by Tim
#! /usr/bin/perl -w
# -------------------------------------------------------------------------------------------
# iso-anonymizer.pl
# run like this: 
# ./iso-anonymizer.pl -txt-subst-file=/var/tmp/strings.txt [-net="192.168.0.0/16"] <config-file1 config-file2 ...> 
# -------------------------------------------------------------------------------------------
require 5.006_000; # Needed for NetAddr::IP and file handler
require Exporter;
use strict;
use warnings;
use CGI qw(:standard);
use NetAddr::IP;
use Carp;
use Time::HiRes qw(time tv_interval); # for exact recording of script execution time

my ($cfg_file, $line);
our @ISA = qw(Exporter);
my $infile;
my $txt_subst_file;
my $net;
my $outfile;
my %anonymized_ip;	
my %anonymized_text;
my $ano_txt = "IsoAAAA";	# starting pattern - needs to be alpha chars only for incrementing to work
my $ano_suffix = '.anonymized';

sub create_string_subst_hash {
	my $txt_subst_file_local = shift;
	open( my $txt_file, '<', $txt_subst_file_local ) or croak "Unable to open $txt_subst_file_local: $!\n";
	while (my $line = <$txt_file>) {
		chomp ($line);
		next if $line eq '';	# skip empty lines: an empty pattern would match at every position
		$anonymized_text{$line} = $ano_txt;
		# adding separator chars (_-) contained in pattern again:
		if ($line =~ /.*?([\_\-])$/) { $anonymized_text{$line} .= $1; }	
		if ($line =~ /^([\_\-]).*?/) { $anonymized_text{$line} = $1 . $anonymized_text{$line}; }	
		++$ano_txt;
	}
	close ($txt_file);
	return;
}
sub _in_range { return 0 <= $_[0] && $_[0] <= 255; }

sub find_ipaddrs (\$&) {
    my($r_text, $callback) = @_;
    my $addrs_found = 0;
	my $regex = qr<(\d+)\.(\d+)\.(\d+)\.(\d+)(\/\d\d?)?>;

    $$r_text =~ s{$regex}{
        my $orig_match = join '.', $1, $2, $3, $4;
        if (defined($5) && $5 ne '') { $orig_match .= '/32'; }
        if ((my $num_matches = grep { _in_range($_) } $1, $2, $3, $4) == 4) {
            $addrs_found++;
            my $ipaddr = NetAddr::IP->new($orig_match);
            $callback->($ipaddr, $orig_match);
        } else {
            $orig_match;
        }
    }eg;
    return $addrs_found;
}

sub anonymize {
	my $infile = shift;
	my $net = shift;
	my $outfile = shift;
	my $ip = NetAddr::IP->new("$net");

	open( my $ifh, '<', $infile ) or croak "Unable to open $infile: $!\n";
	open( my $ofh, '>', $outfile ) or croak "Unable to open $outfile: $!\n" ;

	while (my $line = <$ifh>) {
		find_ipaddrs($line, sub {
			my($ipaddr, $orig) = @_;
			if ($orig =~ /^2[45][0258]\./) { # found netmask (assuming IPs starting with 24x.* and 25x.* are netmasks)
				return $anonymized_ip{$orig} if exists $anonymized_ip{$orig};
				$anonymized_ip{$orig} = "255.255.255.255"; # changing all netmasks to /32 to avoid invalid CIDRs
				return $anonymized_ip{$orig};
			} elsif ($orig eq '0.0.0.0') { 	# leave /0 netmask alone
				return $ipaddr->addr;
			} else {  
				my $netmask = '';
				if ($orig =~ /(.+?)\/32$/) {
					$orig = $1;
					$netmask = '/32';
				}
				return $anonymized_ip{$orig} . $netmask if exists $anonymized_ip{$orig};
				# no anonymized equivalent in the hash yet: take the next address from the replacement range
				++$ip;
				$anonymized_ip{$orig} = $ip->addr;
				return $anonymized_ip{$orig} . $netmask;
			}
		});
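		if (defined($txt_subst_file) && %anonymized_text) { # obfuscating text (skipped if the substitution list is empty)
			# longest patterns first, so a short pattern cannot pre-empt a longer one that starts with it
			my $regex_all_texts = join("|", map {quotemeta} sort { length($b) <=> length($a) } keys %anonymized_text);
			$line =~ s/($regex_all_texts)/$anonymized_text{$1}/go;
		}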
		print $ofh $line;
	}
	close ($ifh); close ($ofh); return;	
}

my $start_time = time();
my $query = CGI->new;
if (defined(param("-txt-subst-file"))) { $txt_subst_file = param("-txt-subst-file"); } else { print ("error: no -txt-subst-file param specified\n"); exit 1; }
if (defined(param("-net"))) { $net = param("-net"); } else { $net = "10.0.0.0/8"; print ("no net specified, using default net $net\n"); }
if (defined($txt_subst_file)) { &create_string_subst_hash($txt_subst_file); }

my $total_filesize = 0;
# treating all params not starting with - as files to anonymize
# do not re-anonymize files with .anonymized extension and do not anonymize binary files
foreach my $file ($query->param) { 
	if ($file !~ /^-/ && $file !~ /\Q$ano_suffix\E$/ && -T $file) {
		$total_filesize += -s $file;
		print ("anonymizing: $file ... "); 
		&anonymize($file, $net, $file . $ano_suffix);
		print ("result file = $file$ano_suffix\n"); 
	}
}

# Generating statistics
my @ki=keys(%anonymized_ip);
my @kt=keys(%anonymized_text);
my $duration = time() - $start_time;
print("Anonymized " . scalar(@ki) . " ip addresses and " . scalar(@kt) . " strings in " . sprintf("%.1f",$duration) . " seconds");
printf(" (total %.2f MB, %.2f Mbytes/second).\n", $total_filesize/1000000, $total_filesize/$duration/1000000);

=head1 NAME

iso-anonymizer.pl - replace IP addresses with anonymized IPs as well as text with anonymized text in plain text files

=head1 SYNOPSIS

  ./iso-anonymizer.pl -txt-subst-file=/var/tmp/strings.txt [-net="192.168.0.0/16"] <config-file1 config-file2 ...>

=head1 DESCRIPTION

This script

a) replaces IP addresses in plain text files with anonymized equivalents
taken from the network range supplied, and

b) replaces strings in those files with anonymized strings.

Input is a number of ASCII files (all parameters not starting with -).
IP addresses as well as strings are replaced one-for-one throughout
all text files, so once an IP address has an anonymized equivalent,
it keeps that equivalent in every file processed in the same run.
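
The one-for-one replacement described above can be sketched with a plain hash
and Perl's string increment. This is only an illustration, not the script's
exact code: the list @secrets and the hash %alias are hypothetical names, and
unlike the script the sketch skips strings it has already seen.

```perl
  use strict;
  use warnings;

  # Hypothetical sample data; the real script reads these from -txt-subst-file.
  my @secrets = ('acme-fw', 'hq_core', 'acme-fw');
  my %alias;                 # original string => anonymized string
  my $next    = 'IsoAAAA';   # same starting pattern as the script

  for my $s (@secrets) {
      next if exists $alias{$s};   # already mapped: keep the first assignment
      $alias{$s} = $next;
      ++$next;                     # string increment: IsoAAAA, IsoAAAB, ...
  }

  print "$_ => $alias{$_}\n" for sort keys %alias;
```

Because the starting pattern consists of letters only, ++ advances it
alphabetically, so every original string receives a distinct placeholder.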

This is useful if you need production configuration data (e.g. from
firewalls) for testing but do not want to expose that data on a
test system. It also protects the identity of the organization
the data comes from.

Caveats:
- Currently only implemented for IPv4.
- Beware of anonymizing common strings: "INT", for example, is part of the
  keyword CONSTRAINT in database dumps. Use slightly longer strings like "INT_" instead.
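
The caveat about short strings can be reproduced with two substitutions. The
SQL statement below is made up for illustration, and the placeholder IsoAAAA
is simply the script's starting pattern.

```perl
  use strict;
  use warnings;

  # Hypothetical line from a database dump.
  my $sql = 'ALTER TABLE INT_hosts ADD CONSTRAINT pk PRIMARY KEY (id);';

  (my $too_short = $sql) =~ s/INT/IsoAAAA/g;    # also hits the keyword CONSTRAINT
  (my $longer    = $sql) =~ s/INT_/IsoAAAA_/g;  # only hits the INT_ prefix

  print "$too_short\n$longer\n";
```

With the pattern "INT" the keyword turns into CONSTRAIsoAAAA and the dump no
longer loads; with "INT_" only the table name changes.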

Params:
- The network range used for replacement (-net) defaults to "10.0.0.0/8" if omitted.
- For each file <infile> supplied, an anonymized file called
  <infile>.anonymized is created.

The -net parameter is a network address, which should be given in
CIDR notation. It represents the range of IP addresses from which
the replacement addresses are drawn. Note that NetAddr::IP never
increments past the end of this range; instead the counter wraps
around to the start, so a range that is too small leads to the same
anonymized address being handed out for different original addresses.
Using an RFC1918 private address range is a good idea.
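
To judge whether a replacement range is large enough before it wraps around,
the size of a CIDR block can be computed as 2 ** (32 - prefix length). The
helper block_size below is a hypothetical name used only for this sketch; it
is not part of the script and does not use NetAddr::IP.

```perl
  use strict;
  use warnings;

  # Number of IPv4 addresses in a CIDR block: 2 ** (32 - prefix length).
  sub block_size {
      my ($cidr) = @_;
      my ($prefix) = $cidr =~ m{/(\d+)$} or die "no prefix length in $cidr\n";
      return 2 ** (32 - $prefix);
  }

  printf "%-16s %10d addresses\n", $_, block_size($_)
      for '10.0.0.0/8', '192.168.0.0/16', '172.20.0.0/21';
```

A /8 holds 16777216 addresses, a /16 holds 65536 and a /21 only 2048. The
sample run under EXAMPLES reports roughly 20000 anonymized addresses, which
would not fit into a /21 without wrapping.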

Note that the script does not try to keep network addresses and
netmasks consistent. Netmasks, whether given in dotted notation
(255.255.255.x) or as a prefix length (a.b.c.d/xy), are all set
to /32, so that every anonymized address/netmask pair stays valid.
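
The script decides by a simple pattern on the first octet whether a dotted
quad is a netmask. The sketch below applies that same pattern to a few sample
values chosen for illustration; the variable name $looks_like_mask is an
assumption of this example.

```perl
  use strict;
  use warnings;

  # Same pattern the script applies to a dotted quad to guess "this is a netmask".
  my $looks_like_mask = qr/^2[45][0258]\./;

  for my $quad ('255.255.255.0', '255.255.248.0', '240.0.0.0', '192.168.1.1', '224.0.0.0') {
      printf "%-15s %s\n", $quad,
          $quad =~ $looks_like_mask ? 'netmask -> 255.255.255.255' : 'address';
  }
```

The heuristic is deliberately coarse: short masks such as 224.0.0.0 are
treated as ordinary addresses, and an address that happens to start with
24x or 25x would be treated as a netmask.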

=head1 EXAMPLES

  ./iso-anonymizer.pl -net="172.20.0.0/21" -txt-subst-file=/var/tmp/strings.txt /var/tmp/firewall17.cfg /var/tmp/router9.cfg

 tim@lacantha:$ sudo perl iso-anonymizer.pl -txt-subst-file=strings.txt /var/tmp/netscreen1.cfg
 no net specified, using default net 10.0.0.0/8
 anonymizing: /var/tmp/netscreen1.cfg ... result file = /var/tmp/netscreen1.cfg.anonymized
 Anonymized 20197 ip addresses and 150 strings in 31.1 seconds (0.46 Mbytes/second).
 tim@lacantha:~$ 
 
Anonymizing a whole PostgreSQL database via an ASCII dump:
  # creating a dump of the database:
  pg_dump
  # turn a binary custom-format (-Fc) dump into ASCII (only necessary if you do not already have an ASCII dump):
  pg_restore >dbdump.sql
  sudo perl iso-anonymizer.pl -txt-subst-file=/var/tmp/strings.txt /var/tmp/files-to-anonymize/*
  psql --set ON_ERROR_STOP=on targetdb <dbdump.sql

=head1 TODO

- define test cases
- reliably replace network addresses with networks that have consistent netmasks;
  currently all networks are reduced to a /32 netmask
- optimize speed

=head1 AUTHOR

Tim Purschke E<lt>tmp@cactus.deE<gt>

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2016 by Cactus eSecurity GmbH

=head1 SEE ALSO

Behind the door

=cut