----------------------------------------------------------------------
--- Knud van Eeden --- 03 January 2010 - 03:35 am --------------------

Computer: Editor: Text: TSE: Line: Operation: Select: Unique: How to get only unique values from given column in given text file? Frequency method [hash table / associative array]

===

Steps: Overview:

 1. -E.g. create the following program:

E.g. create the following program:

--- cut here: begin --------------------------------------------------
// filenamemacro=seleblfv.s
--- cut here: end ---------------------------------------------------- --- cut here: begin --------------------------------------------------
// PROC PROCBlockSelectBlockLineInputFrequencyUnique( INTEGER C0, INTEGER C1, STRING fileNameS )
--- cut here: end ---------------------------------------------------- --- cut here: begin --------------------------------------------------
FORWARD PROC Main() FORWARD PROC PROCBlockSelectBlockLineFrequencyUnique( INTEGER i1, INTEGER i2 ) FORWARD PROC PROCBlockSelectBlockLineInputFrequencyUnique( INTEGER i1, INTEGER i2, STRING s1 ) // --- MAIN --- // PROC Main() STRING s1[255] = "1" STRING s2[255] = "11" STRING s3[255] = "c:\temp\ddd.txt" IF NOT Ask( "block: select: block: line: input: frequency: unique: blockBeginI = ", s1, _FIND_HISTORY_ ) RETURN() ENDIF IF NOT Ask( "block: select: block: line: input: frequency: unique: blockEndI = ", s2, _FIND_HISTORY_ ) RETURN() ENDIF IF NOT AskFilename( "block: select: block: input: line: frequency: unique: fileNameS = ", s3, _DEFAULT_, _EDIT_HISTORY_ ) RETURN() ENDIF PROCBlockSelectBlockLineInputFrequencyUnique( Val( s1 ), Val( s2 ), s3 ) END <F12> Main() // --- LIBRARY --- // // library: block: select: block: line: input: frequency: unique <description>Idea based on Awk' solution and Larry's TSE macro</description> <version>1.0.0.0.14</version> <version control></version control> (filenamemacro=seleblfv.s) [<Program>] [<Research>] [kn, ri, su, 03-01-2010 03:20:33] PROC PROCBlockSelectBlockLineInputFrequencyUnique( INTEGER C0, INTEGER C1, STRING fileNameS ) // e.g. PROC Main() // e.g. STRING s1[255] = "1" // e.g. STRING s2[255] = "11" // e.g. STRING s3[255] = "c:\temp\ddd.txt" // e.g. IF NOT Ask( "block: select: block: line: input: frequency: unique: blockBeginI = ", s1, _FIND_HISTORY_ ) RETURN() ENDIF // e.g. IF NOT Ask( "block: select: block: line: input: frequency: unique: blockEndI = ", s2, _FIND_HISTORY_ ) RETURN() ENDIF // e.g. IF NOT AskFilename( "block: select: block: input: line: frequency: unique: fileNameS = ", s3, _DEFAULT_, _EDIT_HISTORY_ ) RETURN() ENDIF // e.g. PROCBlockSelectBlockLineInputFrequencyUnique( Val( s1 ), Val( s2 ), s3 ) // e.g. END // e.g. // e.g. <F12> Main() // // === // // get the row height of the block // INTEGER timeI = 0 // INTEGER R0 = 0 // INTEGER R1 = 0 // // first load the large file (which might take a long time), only then start measuring the elapsed time // IF NOT EditFile( fileNameS ) // Warn( Format( "could not find the data file", " ", fileNameS, ".", " ", "Please check this filename" ) ) // RETURN() // ENDIF // // Make sure the file is reloaded (it might already be loaded from a previous session, and so you are sure you are working with a fresh copy) // IF ( EditFile( fileNameS ) ) AbandonFile() EditFile( fileNameS ) ENDIF // // select a block equal to the given column begin and column end values, and selecting all lines from begin to the end of the file // R0 = 1 R1 = NumLines() // MarkColumn( R0, C0, R1, C1 ) // // set the counter for the elapsed time // timeI = GetClockTicks() // PROCBlockSelectBlockLineFrequencyUnique( C0, C1 ) // Warn( Format( ( GetClockTicks() - timeI ) / 18, " ", "seconds elapsed" ) ) // END // library: block: select: block: line: frequency: unique <description></description> <version control></version control> <version>1.0.0.0.36</version> (filenamemacro=seleblfu.s) [<Program>] [<Research>] [ // associative array is not already set, so set / kn, ri, su, 03-01-2010 02:35:41] PROC PROCBlockSelectBlockLineFrequencyUnique( INTEGER C0, INTEGER C1 ) // e.g. PROC Main() // e.g. PROCBlockSelectBlockLineFrequencyUnique( Query( BlockBegCol ), Query( BlockEndCol ) ) // e.g. END // e.g. // e.g. <F12> Main() // // === // // Method: // // Use the associative array technique (using the string value of the current column to set a flag (=counter equal to 0 or not)) // // If that flag is set, then the column entry was already encountered earlier, thus is a duplicate, and thus can that line be deleted // // === // // get the total column width of the block // INTEGER C4 = C1 - C0 + 1 // // get the linenumber of the last line in the block // INTEGER R1 = Query( blockEndLine ) // INTEGER downB = FALSE // IF ( NOT ( IsBlockInCurrFile() ) ) Warn( "Please mark a block first" ) RETURN() ENDIF // return from the current procedure if no block is marked // PushPosition() // PushBlock() // GotoBlockBegin() // REPEAT // // if associative array is already set then it is a duplicate, thus delete // IF ( ExistGlobalVar( GetText( C0, C4 ) ) ) // DelLine() // R1 = R1 - 1 // ELSE // // associative array is not already set, so set the flag now // SetGlobalInt( Gettext( C0, C4 ), 0 ) // downB = Down() // ENDIF // UNTIL ( ( CurrLine() > R1 ) OR ( NOT ( downB ) ) ) // GotoBlockBegin() // REPEAT // // delete all existing unique global variables in the associative array // IF ( ExistGlobalVar( GetText( C0, C4 ) ) ) // DelGlobalVar( GetText( C0, C4 ) ) // ENDIF // UNTIL ( ( CurrLine() > R1 ) OR ( NOT ( Down() ) ) ) // PopBlock() // PopPosition() // END
--- cut here: end ---------------------------------------------------- 2. -Run the program 3. -Supply the given begin field column (=begin of social security column) 1. -E.g. 1 4. -Supply the given end field column (=end of social security column) 1. -E.g. 11 5. -Supply the given data file name 1. -E.g. c:\temp\ddd.txt 6. -Run the program 7. -The program goes once through each line, takes out the text line in the highlighted block only, checks if an associative array value for that string has been set if yes, it has already been encountered, thus it is a duplicate and can be deleted. If not, it sets the associative array value 8. -When all lines have been traversed all duplicate lines have been deleted, and only the non-duplicate lines are left 9. -In another loop all declared global variables are deleted. (If you do not do that their values might influence the duplicate/non-duplicate testing in another run of this macro) 10. -Tested successfully on Microsoft Windows XP Professional (service pack 3), running TSE v4.x and a data file of 200 megabytes containing about 6 million records with 2 columns. Time used about 80 seconds === Book: see also: === Diagram: see also: === File: see also: === File: version: control: see also: === Help: see also: === Image: see also: === Internet: see also: --- Computer: Editor: TSE: Line: Operation: Select: Unique: How to get only the unique values from a given column in a given text file? http://www.knudvaneeden.com/tinyurl.php?urlKey=url000385 === Podcast: see also: === Record: see also: === Screencast: see also: === Table: see also: === Video: see also: === <version>1.0.0.0.2</version> ----------------------------------------------------------------------

Share |

This web page is created and maintained using the Semware TSE text editor