---------------------------------------------------------------------- --- Knud van Eeden --- 03 January 2010 - 00:30 am -------------------- Awk: Record: Field: Unique: How to get the unique records based on the first field? --- The idea of using the frequency (e.g. of occurring only 1 time, which means non-duplicate) is certainly a very good one, and can certainly avoid sorting. The idea of using the frequency can indeed avoid sorting alltogether (and can be done in linear time (that is fast)). It is more a special case of hashing (that is you map each relevant field you find in each line onto an array. If the array value is empty (=0) then it has not be found before, and you set a counter there (to 1). --- Only field (e.g. social security numbers) occurrences with a frequency of 1 are then by definition non-duplicates. --- You can go linearly once through the data to extract the first field (=social security number), and put it in an array (an associative array here, that means you use the string itself as a counter) only when it has not been found yet (in other words data[ <for that field> ] = 0) --- And then at the end (after having gone through all the lines of the file) go linearly through all the found occurrences which have a frequency of 1. --- I created an AWK program, which does the job in 5 seconds using this approach. === Steps: Overview: 1. -E.g. create the following program: --- cut here: begin --------------------------------------------------# # Do this by design only once at the beginning # # set the field separator (=FS) to a space # BEGIN { # FS = " " # } # # This is by design done for every line # { # # if the associative array (called 'data') for the first field (which is the social security number) is empty # if ( data[ $1 ]++ == 0 ) # # then add the current whole line (called '$0') to an array (called 'lines') and increase a counter (called 'count') # lines[ ++count ] = $0 # } # # Do this by design *only once* at the end # END { # # for all values in the array (called 'lines') show its content # Because only the entries with a value of count equal to 1 are shown, you automatically see only the non-duplicates # (duplicates would have a value greater than 1 (e.g. 2, 3, ...), so these are not shown) # for ( i = 1; i <= count; i++ ) # print lines[ i ] # }--- cut here: end ---------------------------------------------------- 2. -Save this program as --- cut here: begin --------------------------------------------------ddd.awk--- cut here: end ---------------------------------------------------- 3. -The data (200 megabytes) is stored in the file --- cut here: begin --------------------------------------------------ddd.txt--- cut here: end ---------------------------------------------------- 4. -Run the program --- cut here: begin --------------------------------------------------c:\cygwin\bin\gawk.exe -f c:\temp\ddd.awk c:\temp\ddd.txt--- cut here: end ---------------------------------------------------- 5. -That will show a screen output similar to the following: --- cut here: begin --------------------------------------------------> dir c:\temp\ddd.txt 1/02/2010 0:07 193,462,272 ddd.txt > time /T & gawk -f c:\temp\ddd.awk c:\temp\ddd.txt & time /T 0:24:40 025-60-4044 joe average 004-16-4077 jane doe 014-27-9076 mike smith 098-43-2098 rodolfo pilas 073-15-6005 gustavo boksar 147-79-9074 bea busaniche 165-63-0189 pablo medrano 124-96-7092 jeff aaron 172-30-6069 michael peters 059-85-1062 leroy baker 0:24:45--- cut here: end ---------------------------------------------------- 6. -The AWK program takes thus about 5 seconds to go through the about 200 megabytes large file containing about 6 million records (almost all duplicates, except this 10 records) similar to the 2 column values shown here. 7. -Another test using an about 1 gigabyte large data file showed it took about 22 seconds on my system using the Awk program. Similar GNU sort.exe took about 5 minutes to process the same data file. --- cut here: begin --------------------------------------------------c:\cygwin\bin Sun Jan 3, 2010 1:46>time /T & gawk -f c:\temp\ddd.awk c:\temp\ddd1.txt & time /T 1:46:10 025-60-4044 joe average 004-16-4077 jane doe 014-27-9076 mike smith 098-43-2098 rodolfo pilas 073-15-6005 gustavo boksar 147-79-9074 bea busaniche 165-63-0189 pablo medrano 124-96-7092 jeff aaron 172-30-6069 michael peters 059-85-1062 leroy baker 1:46:32 C:\cygwin\bin Sun Jan 3, 2010 1:46>time /T & sort.exe -k1,11 -un c:\temp\ddd1.txt & time /T 1:47:19 004-16-4077 jane doe 014-27-9076 mike smith 025-60-4044 joe average 059-85-1062 leroy baker 073-15-6005 gustavo boksar 098-43-2098 rodolfo pilas 124-96-7092 jeff aaron 147-79-9074 bea busaniche 165-63-0189 pablo medrano 172-30-6069 michael peters 1:51:46 C:\cygwin\bin Sun Jan 3, 2010 1:51>dir c:\temp\ddd1.txt 1/03/2010 1:45 967,311,361 ddd1.txt 967,311,361 bytes in 1 file and 0 dirs 967,311,872 bytes allocated--- cut here: end ---------------------------------------------------- === Book: see also: === Diagram: see also: === File: see also: === Help: see also: === Image: see also: === Internet: see also: Computer: Editor: TSE: Line: Operation: Select: Unique: How to get only the unique values from a given column in a given text file? http://www.knudvaneeden.com/tinyurl.php?urlKey=url000385 === GNU Awk manual http://www.gnu.org/manual/gawk/gawk.html === Podcast: see also: === Screencast: see also: === Table: see also: === Video: see also: --- ----------------------------------------------------------------------