---------------------------------------------------------------------- --- Knud van Eeden --- 03 January 2010 - 00:30 am -------------------- Awk: Record: Field: Unique: How to get the unique records based on the first field? --- The idea of using the frequency (e.g. of occurring only 1 time, which means non-duplicate) is certainly a very good one, and can certainly avoid sorting. The idea of using the frequency can indeed avoid sorting alltogether (and can be done in linear time (that is fast)). It is more a special case of hashing (that is you map each relevant field you find in each line onto an array. If the array value is empty (=0) then it has not be found before, and you set a counter there (to 1). --- Only field (e.g. social security numbers) occurrences with a frequency of 1 are then by definition non-duplicates. --- You can go linearly once through the data to extract the first field (=social security number), and put it in an array (an associative array here, that means you use the string itself as a counter) only when it has not been found yet (in other words data[ <for that field> ] = 0) --- And then at the end (after having gone through all the lines of the file) go linearly through all the found occurrences which have a frequency of 1. --- I created an AWK program, which does the job in 5 seconds using this approach. === Steps: Overview: 1. -E.g. create the following program: --- cut here: begin --------------------------------------------------

#
# Do this by design only once at the beginning
#
# set the field separator (=FS) to a space
#
BEGIN {
 #
 FS = " "
 #
}
#
# This is by design done for every line
#
{
 #
 # if the associative array (called 'data') for the first field (which is the social security number) is empty
 #
 if ( data[ $1 ]++ == 0 )
  #
  # then add the current whole line (called '$0') to an array (called 'lines') and increase a counter (called 'count')
  #
  lines[ ++count ] = $0
  #
}
#
# Do this by design *only once* at the end
#
END {
 #
 # for all values in the array (called 'lines') show its content
 # Because only the entries with a value of count equal to 1 are shown, you automatically see only the non-duplicates
 # (duplicates would have a value greater than 1 (e.g. 2, 3, ...), so these are not shown)
 #
 for ( i = 1; i <= count; i++ )
  #
  print lines[ i ]
  #
}

--- cut here: end ---------------------------------------------------- 2. -Save this program as --- cut here: begin --------------------------------------------------

ddd.awk

--- cut here: end ---------------------------------------------------- 3. -The data (200 megabytes) is stored in the file --- cut here: begin --------------------------------------------------

ddd.txt

--- cut here: end ---------------------------------------------------- 4. -Run the program --- cut here: begin --------------------------------------------------

c:\cygwin\bin\gawk.exe -f c:\temp\ddd.awk c:\temp\ddd.txt

--- cut here: end ---------------------------------------------------- 5. -That will show a screen output similar to the following: --- cut here: begin --------------------------------------------------

> dir c:\temp\ddd.txt

 1/02/2010   0:07     193,462,272  ddd.txt


> time /T & gawk -f c:\temp\ddd.awk c:\temp\ddd.txt & time /T

 0:24:40

025-60-4044      joe average
004-16-4077      jane doe
014-27-9076      mike smith
098-43-2098      rodolfo pilas
073-15-6005      gustavo boksar
147-79-9074      bea busaniche
165-63-0189      pablo medrano
124-96-7092      jeff aaron
172-30-6069      michael peters
059-85-1062      leroy baker

 0:24:45

--- cut here: end ---------------------------------------------------- 6. -The AWK program takes thus about 5 seconds to go through the about 200 megabytes large file containing about 6 million records (almost all duplicates, except this 10 records) similar to the 2 column values shown here. 7. -Another test using an about 1 gigabyte large data file showed it took about 22 seconds on my system using the Awk program. Similar GNU sort.exe took about 5 minutes to process the same data file. --- cut here: begin --------------------------------------------------

c:\cygwin\bin Sun  Jan 3, 2010  1:46>time /T & gawk -f c:\temp\ddd.awk c:\temp\ddd1.txt & time /T

1:46:10

025-60-4044      joe average
004-16-4077      jane doe
014-27-9076      mike smith
098-43-2098      rodolfo pilas
073-15-6005      gustavo boksar
147-79-9074      bea busaniche
165-63-0189      pablo medrano
124-96-7092      jeff aaron
172-30-6069      michael peters
059-85-1062      leroy baker

1:46:32

C:\cygwin\bin Sun  Jan 3, 2010  1:46>time /T & sort.exe -k1,11 -un c:\temp\ddd1.txt & time /T

1:47:19

004-16-4077      jane doe
014-27-9076      mike smith
025-60-4044      joe average
059-85-1062      leroy baker
073-15-6005      gustavo boksar
098-43-2098      rodolfo pilas
124-96-7092      jeff aaron
147-79-9074      bea busaniche
165-63-0189      pablo medrano
172-30-6069      michael peters

1:51:46

C:\cygwin\bin Sun  Jan 3, 2010  1:51>dir c:\temp\ddd1.txt

 1/03/2010   1:45     967,311,361  ddd1.txt
    967,311,361 bytes in 1 file and 0 dirs    967,311,872 bytes allocated

--- cut here: end ---------------------------------------------------- === Book: see also: === Diagram: see also: === File: see also: === Help: see also: === Image: see also: === Internet: see also: Computer: Editor: TSE: Line: Operation: Select: Unique: How to get only the unique values from a given column in a given text file? http://www.knudvaneeden.com/tinyurl.php?urlKey=url000385 === GNU Awk manual http://www.gnu.org/manual/gawk/gawk.html === Podcast: see also: === Screencast: see also: === Table: see also: === Video: see also: --- ----------------------------------------------------------------------