----------------------------------------------------------------------
--- Knud van Eeden --- 02 January 2010 - 00:31 am --------------------

Computer: Editor: TSE: Line: Operation: Select: Unique: How to get only the unique values from a given column in a given text file?

---

If you can load your complete database file in TSE (without (memory) problems), it should work
fine to do everything in TSE.

I tested it with a 200 megabytes file on my system (or thus about 6.5 million
records of about 30 bytes per record in average), that worked OK.

Holding 6.5 million records in memory with a bit larger average size
of 70 bytes per record would require about
( 6.5 . 10^6 ) . ( 7 . 10^1 ) = about 450 megabytes.
If handling this file requires keeping a copy in memory, it would require 2 times that
size for RAM. Thus about 900 megabytes or thus 0.9 gigabyte.
Thus it should be able to be done in memory in TSE only
(with say 1 gigabytes of RAM memory installed on that computer)

So everything was done in TSE, and e.g. no subfile splitting was
necessary, and the sort used was not even (slow) external TSE sort, but
using the (fast) internal TSE sort.

(see the screenshots below and the example file download which I used for testing).

---

Getting the unique values in general requires to

 1. -First sort the data

 2. -Then get the unique values from that sorted data.

===

Ad 1. Sort the data

If you want to select the unique column values if working with *really
huge* files (e.g. several gigabytes) which you want to sort, there are
if applicable (e.g. because the TSE (external) sort might be sometimes
(too) slow, and I have e.g. also seen cases where it could not handle
the (several gigabytes large (log) files)) several options outside of
TSE.

===

E.g.

using Cygwin 'sort.exe'

(http://www.cygwin.com, run setup.exe, and download at least the sort.exe,
 then run the command sort.exe)

The following gives a fast solution.
E.g. 200 megabytes in under 1 minute.

The one liner will sort the file depending on the values between begin
column and end column.

Then only show the unique values once.

===

E.g.

--- cut here: begin --------------------------------------------------

 sort.exe -k <begin column>,<end column number> -un <your filename you want to sort>

--- cut here: end ----------------------------------------------------

E.g.

--- cut here: begin --------------------------------------------------

 c:\cygwin\bin\sort.exe -k1,11 -un c:\temp\ddd.txt

--- cut here: end ----------------------------------------------------

===

E.g.

--- cut here: begin --------------------------------------------------

 c:\cygwin\bin\sort.exe -k 1,20 -un c:\temp\sowpods.txt

--- cut here: end ----------------------------------------------------

where in the example 1 is the starting column number of your data,
and 20 is the end column number of your data.

===

Adapt this values to the conditions on your system.

Type

--- cut here: begin --------------------------------------------------

 sort.exe --help

--- cut here: end ----------------------------------------------------

to see the options.

===

Also Microsoft Windows has a command line 'sort.exe' command.

Type

--- cut here: begin --------------------------------------------------

 sort.exe /?

--- cut here: end ----------------------------------------------------

to see the options.

E.g.

--- cut here: begin --------------------------------------------------

 sort.exe /+20 <your filename you want to sort>

--- cut here: end ----------------------------------------------------

will start the sort from the 20th character in each line

===

Ad 2. Get unique values

E.g. Cygwin has also a uniq.exe command.

===

You can e.g use the pipe | command to move the output from one command to another command
(e.g. from 'sort.exe' to 'uniq.exe')

--- cut here: begin --------------------------------------------------

 sort.exe -k <begin column>, <end column number> <your filename you want to sort> | uniq.exe

--- cut here: end ----------------------------------------------------

But the problem is as far as I know, that Cygwin uniq.exe does not have a parameter
to select the column begin and end where it should look.

Do e.g.

--- cut here: begin --------------------------------------------------

 uniq.exe --help

--- cut here: end ----------------------------------------------------

Cygwin also has an AWK version (called gawk.exe), which one could thus use.

Similar using Perl.exe, Ruby.exe, Python.exe, ...

Using e.g. the 'capture.s' macro in TSE you can even automatically load this
file results back in TSE for further handling.

===

But my tests show thus that one should be able to quickly do *all* actions
(sort + unique) in TSE itself (within a very reasonable time of about 5
to 10 minutes alltogether).

===

Steps: Overview:

 1. -Load your complete (e.g. database) text file in TSE

 2. -Select the column from which you want the unique values
     (by highlighting manually e.g. the social security number column)

 3. -Sort this highlighted column in TSE
     (menu 'Block' > 'Sort').

     Optionally choose menu 'Block' > 'Case sensitive sort' > 'Off'
     to ignore upper or lower case.

 4. -Run my unique lines in block TSE macro on the sorted column
     (see the URL link below)

     Note:

     The big difference between this macro and the built-in TSE unique
     macro in the TSE potpourri menu is that my macro by design really
     only checks for the text *inside the highlighted block*. Whereas
     the unique.s macro looks for differences in the *whole line*
     (thus and in e.g. social security number but also in the name).

 5. -The expected end result showing the unique lines should only be 10
     unique lines
     (because all the other about 6 million lines were duplicates of that
      10 lines in my manually created example file. See screenshots
      below and the download URL link to test it yourself with that file).

 6. -Tested successfully on
     Microsoft Windows XP Professional (service pack 3),
     with 3.7 gigabytes RAM memory and Dual Core 2 Intel processor
     running
     TSE v4.x (only) (no external tools were used)
     on a 200 megabytes text file of about 6 million lines with 2 columns
     (average record size about 30 bytes)
     Time used on my computer about 4 minutes to sort using TSE internal sort
     and about 1 minute to select the unique values using a custom made TSE macro.
     So about 5 minutes in total on my computer.

===

Book: see also:



===

Diagram: see also:



===

File: see also:


Used example text file with about 6 million lines and 2 columns


===

Help: see also:



===

Image: see also:















===

Internet: see also:

Computer: Editor: TSE: Line: Operation: Select: Unique: How to get only the unique (possibly greater than 255 characters) lines (in a highlighted block)?
http://www.knudvaneeden.com/tinyurl.php?urlKey=url000384

---

sed/awk sort help
http://www.linuxquestions.org/questions/programming-9/sedawk-sort-help-478461/

---

Awk: Record: Field: Unique: How to get the unique records based on the first field?
http://www.knudvaneeden.com/tinyurl.php?urlKey=url000386

===

Podcast: see also:



===

Screencast: see also:



===

Table: see also:



===

Video: see also:





---



----------------------------------------------------------------------