I've been able to reduce my dataset by 75%, but that still leaves me with a 47 GB file. I'm trying to find the frequency of each line using:

open(TEMP, "< $tempfile") or die "cannot open file $tempfile: $!";
foreach (<TEMP>) {
    $seen{$_}++;
}
close(TEMP) or die "cannot close file
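
For comparison, here is a minimal sketch of the same count done with a while loop, which reads the file one line at a time. A foreach over the filehandle evaluates it in list context, so Perl loads every line into memory before the loop body runs, and that is probably a problem at 47 GB. The helper name count_line_frequency and the lexical filehandle are assumptions for illustration, not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): count how often each
# line occurs in the file at $path, reading one line per iteration.
sub count_line_frequency {
    my ($path) = @_;
    my %seen;

    # Three-argument open with a lexical filehandle.
    open(my $fh, '<', $path) or die "cannot open file $path: $!";

    # while evaluates <$fh> in scalar context, so only the current line
    # is held in memory; foreach would slurp the whole file first.
    while (my $line = <$fh>) {
        $seen{$line}++;
    }

    close($fh) or die "close failed for $path: $!";

    # Keys keep their trailing newline, as in the original $seen{$_}++.
    return \%seen;
}
```

This only removes the cost of slurping the file. The hash still holds one entry per distinct line, so memory use depends on how many unique lines the file contains.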