----------------------------------------------------------------------
--- Knud van Eeden --- 11 April 2008 - 12:19 am ----------------------
Computer: Editor: Text: TSE: How to extract all URLs from an Internet page? [urlmon.dll / The TSE Cookbook: Recipe]
===
The TSE Cookbook: Recipe:
===
How to get the URLs from an HTML page?
---
When you copy/paste an HTML page as text, its URLs are usually not
copied along with it.
So how do you include these URLs as well?
===
1. -If you run the method below, which extracts all the URLs in a given
web page, you will usually get a large and rather complex looking
block of text with very many URLs (e.g. from additional
advertisements, ...), far more than you need. You only want the URLs
that belong to your block of text, which you e.g. copy/pasted from
the original web page.
2. -What you could do to filter the extracted URLs and paste them
at the right position in your block of TSE flat text is:
1. -For the first to the last extracted URL in that block of text
1. -Get the URL ??? between ...
2. -Get the text ... between ...
3. -Search whether that text ... can be found in your block of text
--- cut here: begin --------------------------------------------------
ssfffasdfasdfaafdffd
adfadsfasdf...adfadf
afffafddddfasdfasdfa
--- cut here: end ----------------------------------------------------
4. -If yes, insert that URL at that position in the text
--- cut here: begin --------------------------------------------------
ssfffasdfasdfaafdffd
adfadsfasdf...
???
adfadf
afffafddddfasdfasdfa
--- cut here: end ----------------------------------------------------
3. -Another method, which is less automatic and needs more manual editing, is:
1. -You know that the URLs should only be added between the first
and last line of your block of text.
2. -So search the extracted URL text for the text of the first line of
your flat text TSE block
3. -Also search the extracted URL text for the text of the last line of
your flat text TSE block
4. -From all the extracted URLs, only extract and use the URLs between
the positions found for that first and last line text
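The first approach above (look up each link text in your block of text and
insert its URL below it) could be sketched in TSE as follows. This is only a
minimal sketch under assumptions: the helper name PROCUrlInsertBelowLinkText
and its two buffer id parameters are hypothetical, and the link buffer is
assumed to hold each link text on one line, directly followed by its URL on
the next line, with "---" separator lines in between (the layout produced by
the macro further below). Links without text, and link texts which occur more
than once in your text, are not handled.
--- cut here: begin --------------------------------------------------
// Hypothetical helper, not part of the original macro.
// textBufferI: buffer with your copy/pasted flat text
// linkBufferI: buffer with, per link, a text line followed by a URL line
PROC PROCUrlInsertBelowLinkText( INTEGER textBufferI, INTEGER linkBufferI )
 STRING linkTextS[255] = ""
 STRING urlS[255] = ""
 //
 PushPosition()
 GotoBufferId( linkBufferI )
 BegFile()
 REPEAT
  linkTextS = Trim( GetText( 1, CurrLineLen() ) )
  // skip empty lines and the "---" separator lines
  IF ( Length( linkTextS ) > 0 ) AND ( linkTextS <> "---" )
   // the line below the link text is assumed to contain its URL
   IF Down()
    urlS = Trim( GetText( 1, CurrLineLen() ) )
    GotoBufferId( textBufferI )
    BegFile()
    // plain text search (no "x" option), from the beginning, ignoring case
    IF LFind( linkTextS, "gi" )
     AddLine( urlS ) // insert the URL below the line where the text was found
    ENDIF
    GotoBufferId( linkBufferI )
   ENDIF
  ENDIF
 UNTIL NOT Down()
 PopPosition()
END
--- cut here: end ----------------------------------------------------
Because only the first occurrence of each link text is used, check the
result manually when the same link text appears several times on the page.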
===
To extract all URLs from a given web page there are several methods.
===
A. Method: Do it all manually
You can of course right click on each hyperlink on the
page, choose 'Copy shortcut' and copy/paste it. But if there are a lot
of these URLs that will usually take too much time. So automation might
be the better choice.
===
B. Method: Write a TSE macro which does it all:
1. -Supply the URL and download the source code of that page to local
disk (e.g. using the function URLDownloadToFileA in the Windows
urlmon.dll).
You call the Microsoft Windows API function
--- cut here: begin --------------------------------------------------
URLDownloadToFileA
--- cut here: end ----------------------------------------------------
located in the file
--- cut here: begin --------------------------------------------------
urlmon.dll
--- cut here: end ----------------------------------------------------
with parameters your URL
-E.g.
--- cut here: begin ----------------------------------------------------- cut here: end ----------------------------------------------------
and the local file in which to store the result
E.g.
--- cut here: begin --------------------------------------------------
c:\temp\index.htm
--- cut here: end ----------------------------------------------------
You then load that file in TSE
E.g.
--- cut here: begin --------------------------------------------------
EditFile( "c:\temp\index.htm" )
--- cut here: end ----------------------------------------------------
===
Get the source code (nice and simple routine, all in TSE, with no
downloading or installation of external programs)
===
E.g. create the following program:
--- cut here: begin --------------------------------------------------
// filenamemacro=viewinae.s
--- cut here: end ----------------------------------------------------
--- cut here: begin --------------------------------------------------
// PROC PROCInternetViewLinkUrlSourceProgramUrlApiDefault( STRING urlS, STRING fileNameS )
--- cut here: end ----------------------------------------------------
--- cut here: begin --------------------------------------------------
FORWARD INTEGER PROC FNHistoryCheckAskCentralB( STRING s1, VAR STRING s2, INTEGER i1 )
FORWARD INTEGER PROC FNKeyCheckPressEscapeB( STRING s1 )
FORWARD INTEGER PROC FNMathCheckGetLogicFalseB()
FORWARD INTEGER PROC FNMathCheckInitializeNewBooleanFalseB()
FORWARD INTEGER PROC FNMathCheckLogicNotB( INTEGER i1 )
FORWARD INTEGER PROC FNStringCheckEmptyB( STRING s1 )
FORWARD INTEGER PROC FNStringCheckEmptyNotB( STRING s1 )
FORWARD INTEGER PROC FNStringCheckEqualB( STRING s1, STRING s2 )
FORWARD PROC Main()
FORWARD PROC PROCInternetViewLinkUrlSourceProgramUrlApiDefault( STRING s1, STRING s2 )
FORWARD PROC PROCInternetViewUrlSource_ProgramUrlApi( STRING s1, STRING s2 )
FORWARD PROC PROCInternetView_UrlSourceProgramUrlApiDefault( STRING s1, STRING s2 )
FORWARD PROC PROCUrlGetSource( STRING s1, STRING s2 )
FORWARD STRING PROC FNStringGetEmptyS()
FORWARD STRING PROC FNStringGetEscapeS()
FORWARD STRING PROC FNStringGetHistoryInputS( STRING s1, STRING s2, INTEGER i1 )
FORWARD STRING PROC FNStringGetInitializeNewStringS()
FORWARD STRING PROC FNStringGetInputS( STRING s1, STRING s2 )
FORWARD STRING PROC FNStringGetSearchHistoryFindInputS( STRING s1, STRING s2 )
// --- MAIN --- //
DLL "<urlmon.dll>"
INTEGER PROC FNUrlGetSourceApiI(
INTEGER lpunknown,
STRING urlS : CSTRVAL,
STRING filenameS : CSTRVAL,
INTEGER dword,
INTEGER tlpbindstatuscallback
) : "URLDownloadToFileA"
END
PROC Main()
STRING s1[255] = FNStringGetInitializeNewStringS()
STRING s2[255] = FNStringGetInitializeNewStringS()
PushKey( <Shift Ins> ) // added [kn, ri, su, 16-09-2012 14:31:54]
s1 = FNStringGetInputS( "Microsoft: API: urlmon.dll: url of which you want to see the source code = ", "http://www.google.com/index.html" )
IF FNKeyCheckPressEscapeB( s1 ) RETURN() ENDIF
// s2 = FNStringGetInputS( "Microsoft: API: urlmon.dll: filename to store the url source code = ", MakeTempName( "." ) ) // old [kn, ri, su, 16-09-2012 14:33:24]
s2 = MakeTempName( "." ) // new [kn, ri, su, 16-09-2012 14:33:28]
// IF FNKeyCheckPressEscapeB( s2 ) RETURN() ENDIF // old [kn, ri, su, 16-09-2012 14:33:30]
PROCInternetViewLinkUrlSourceProgramUrlApiDefault( s1, s2 )
END
<F12> Main()
// --- LIBRARY --- //
// library: string: initialize [kn, ri, mo, 09-07-2001 12:00:07]
STRING PROC FNStringGetInitializeNewStringS()
RETURN( FNStringGetEmptyS() )
END
// library: string: get: input <description>input a string</description> <version>1.0.0.0.1</version> (filenamemacro=getstgiq.s) [kn, ni, mo, 03-08-1998 13:04:18]
STRING PROC FNStringGetInputS( STRING askS, STRING answerDefaultS )
// e.g. PROC Main()
// e.g. Message( FNStringGetConcat3S( "'", FNStringGetInputS( "Choose option (Y/n)", "Y" ), "'" ) ) // gives e.g. "Y"
// e.g. END
// e.g.
// e.g. <F12> Main()
//
RETURN( FNStringGetSearchHistoryFindInputS( askS, answerDefaultS ) )
//
END
// library: key: check: press: escape <description>input: escape: test if escape was pressed</description> <version>1.0.0.0.2</version> (filenamemacro=checkepe.s) [kn, ni, we, 05-08-1998 20:29:00]
INTEGER PROC FNKeyCheckPressEscapeB( STRING s ) // version with testing local variable
// e.g. PROC Main()
// e.g. Message( FNKeyCheckPressEscapeB( "" ) ) // version with testing local variable ) // gives e.g. FALSE
// e.g. END
// e.g.
// e.g. <F12> Main()
//
RETURN( FNStringCheckEqualB( s, FNStringGetEscapeS() ) )
//
END
// library: internet: get: link: url: source: program: url: api: default <description>extract the URLs from a given Internet web page URL</description> <version control></version control> <version>1.0.0.0.20</version> (filenamemacro=viewinae.s) [kn, ri, su, 13-04-2008 05:35:51]
PROC PROCInternetViewLinkUrlSourceProgramUrlApiDefault( STRING urlS, STRING fileNameS )
// e.g. DLL "<urlmon.dll>"
// e.g. INTEGER PROC FNUrlGetSourceApiI(
// e.g. INTEGER lpunknown,
// e.g. STRING urlS : CSTRVAL,
// e.g. STRING filenameS : CSTRVAL,
// e.g. INTEGER dword,
// e.g. INTEGER tlpbindstatuscallback
// e.g. ) : "URLDownloadToFileA"
// e.g. END
// e.g.
// e.g. PROC Main()
// e.g. STRING s1[255] = FNStringGetInitializeNewStringS()
// e.g. STRING s2[255] = FNStringGetInitializeNewStringS()
// e.g. PushKey( <Shift Ins> ) // added [kn, ri, su, 16-09-2012 14:31:54]
// e.g. s1 = FNStringGetInputS( "Microsoft: API: urlmon.dll: url of which you want to see the source code = ", "http://www.google.com/index.html" )
// e.g. IF FNKeyCheckPressEscapeB( s1 ) RETURN() ENDIF
// e.g. // s2 = FNStringGetInputS( "Microsoft: API: urlmon.dll: filename to store the url source code = ", MakeTempName( "." ) ) // old [kn, ri, su, 16-09-2012 14:33:24]
// e.g. s2 = MakeTempName( "." ) // new [kn, ri, su, 16-09-2012 14:33:28]
// e.g. // IF FNKeyCheckPressEscapeB( s2 ) RETURN() ENDIF // old [kn, ri, su, 16-09-2012 14:33:30]
// e.g. PROCInternetViewLinkUrlSourceProgramUrlApiDefault( s1, s2 )
// e.g. END
// e.g.
// e.g. <F12> Main()
//
STRING s1[255] = ""
//
STRING s2[255] = ""
//
STRING s[255] = Format( "\<a.*href[ ]*=[ ]*", '"', "?{.*}", '"', "?\>{.*}{{<\/a}|$}\c" )
//
INTEGER bufferI = 0
//
PushPosition()
bufferI = CreateTempBuffer()
PopPosition()
//
PROCInternetView_UrlSourceProgramUrlApiDefault( urlS, fileNameS )
//
BegFile()
//
WHILE ( LFind( s, "ix" ) )
//
s1 = GetFoundText( 1 )
//
s2 = GetFoundText( 2 )
//
AddLine( s2, bufferI )
AddLine( s1, bufferI )
AddLine( "", bufferI )
AddLine( "---", bufferI ) // added [kn, ri, sa, 15-09-2012 19:27:51]
AddLine( "", bufferI ) // added [kn, ri, sa, 15-09-2012 19:27:55]
//
ENDWHILE
//
IF ( GotoBufferId( bufferI ) )
//
// clean up in buffer
//
LReplace( 'target[ ]*=[ ]*["]?_blank', "", "ginx" )
LReplace( 'rel[ ]*=[ ]*["]?nofollow', "", "ginx" )
LReplace( '"', "", "ginx" ) // remove all double quotes
LReplace( "'", "", "ginx" ) // remove all single quotes
//
ENDIF
//
END
// library: string: get: empty (return an empty string) <description></description> <version control></version control> <version>1.0.0.0.0</version> (filenamemacro=getstgem.s) [kn, ri, sa, 20-05-2000 20:11:03]
STRING PROC FNStringGetEmptyS()
// e.g. PROC Main()
// e.g. Message( FNStringGetEmptyS() ) // gives e.g. ...""
// e.g. END
// e.g.
// e.g. <F12> Main()
RETURN( "" )
END
// library: string: get: search: history: find: input <description>input a string: history: find</description> <version>1.0.0.0.2</version> (filenamemacro=getstfir.s) [kn, ri, sa, 25-08-2001 21:00:25]
STRING PROC FNStringGetSearchHistoryFindInputS( STRING askS, STRING answerDefaultS )
// e.g. PROC Main()
// e.g. Message( FNStringGetSearchHistoryFindInputS( "Please input something", "test" ) ) // gives e.g. "test"
// e.g. END
// e.g.
// e.g. <F12> Main()
//
RETURN( FNStringGetHistoryInputS( askS, answerDefaultS, _FIND_HISTORY_ ) )
//
END
// library: string: equal: are two given strings equal? (stored in 'checstcf.s') [kn, zoe, we, 04-10-2000 18:23:27]
INTEGER PROC FNStringCheckEqualB( STRING s1, STRING s2 )
// e.g. PROC Main()
// e.g. STRING s1[255] = FNStringGetInitializeNewStringS()
// e.g. STRING s2[255] = FNStringGetInitializeNewStringS()
// e.g. s1 = FNStringGetInputS( "string: check: equal: first string = ", "a" )
// e.g. IF FNKeyCheckPressEscapeB( s1 ) RETURN() ENDIF
// e.g. s2 = FNStringGetInputS( "string: check: equal: second string = ", "a" )
// e.g. IF FNKeyCheckPressEscapeB( s2 ) RETURN() ENDIF
// e.g. Message( FNStringCheckEqualB( s1, s2 ) ) // gives e.g. TRUE when string1 is equal to string2
// e.g. END
// e.g.
// e.g. <F12> Main()
//
// // <F12> PROCMessage( FNStringCheckEqualB( "knud", "knud" ) ) // gives TRUE
// // <F12> PROCMessage( FNStringCheckEqualB( "knud", "van" ) ) // gives FALSE
RETURN( s1 == s2 )
END
// library: string: get: escape <description>general output string to recognize an escape (e.g. in another routine). Central routine, only one occurrence of this constant string</description> <version>1.0.0.0.2</version> (filenamemacro=getstges.s) [kn, ri, sa, 05-12-1998 18:52:24]
STRING PROC FNStringGetEscapeS()
// e.g. PROC Main()
// e.g. Message( FNStringGetEscapeS() ) // gives e.g. ...""
// e.g. END
// e.g.
// e.g. <F12> Main()
//
RETURN( "<ESCAPE>" )
//
END
// library: internet: view: url: source: program: url: api: default <description></description> <version control></version control> <version>1.0.0.0.0</version> (filenamemacro=viewinad.s) [kn, ri, su, 13-04-2008 05:19:22]
PROC PROCInternetView_UrlSourceProgramUrlApiDefault( STRING urlS, STRING fileNameS )
// e.g. DLL "<urlmon.dll>"
// e.g. INTEGER PROC FNUrlGetSourceApiI(
// e.g. INTEGER lpunknown,
// e.g. STRING urlS : CSTRVAL,
// e.g. STRING filenameS : CSTRVAL,
// e.g. INTEGER dword,
// e.g. INTEGER tlpbindstatuscallback
// e.g. ) : "URLDownloadToFileA"
// e.g. END
// e.g.
// e.g. PROC Main()
// e.g. STRING s1[255] = FNStringGetInitializeNewStringS()
// e.g. STRING s2[255] = FNStringGetInitializeNewStringS()
// e.g. s1 = FNStringGetInputS( "Microsoft: API: urlmon.dll: url of which you want to see the source code = ", "http://www.google.com/index.html" )
// e.g. IF FNKeyCheckPressEscapeB( s1 ) RETURN() ENDIF
// e.g. s2 = FNStringGetInputS( "Microsoft: API: urlmon.dll: filename to store the url source code = ", MakeTempName( "." ) )
// e.g. IF FNKeyCheckPressEscapeB( s2 ) RETURN() ENDIF
// e.g. PROCInternetView_UrlSourceProgramUrlApiDefault( s1, s2 )
// e.g. END
// e.g.
// e.g. <F12> Main()
PROCInternetViewUrlSource_ProgramUrlApi( urlS, fileNameS )
END
// library: string: get: history: input <description>input a string and store it in that specific history list</description> <version>1.0.0.0.2</version> (filenamemacro=getsthin.s) [kn, ni, mo, 03-08-1998 13:04:18]
STRING PROC FNStringGetHistoryInputS( STRING infoS, STRING answerDefaultS, INTEGER historyI )
// e.g. PROC Main()
// e.g. Message( FNStringGetHistoryInputS( "Please input something", "test", _FIND_HISTORY_ ) ) // gives e.g. "test"
// e.g. END
// e.g.
// e.g. <F12> Main()
//
STRING s[255] = answerDefaultS
//
INTEGER escapeB = FNMathCheckInitializeNewBooleanFalseB()
//
escapeB = FNMathCheckLogicNotB( FNHistoryCheckAskCentralB( infoS, s, historyI ) )
//
IF ( escapeB )
//
RETURN( FNStringGetEscapeS() )
//
ENDIF // <Escape> was pressed, in response
//
IF FNStringCheckEmptyB( s ) AND FNStringCheckEmptyNotB( answerDefaultS )
//
RETURN( FNStringGetEmptyS() ) // input of an empty string, user has removed the string to indicate that an empty string was wanted
//
ENDIF
//
IF FNStringCheckEmptyB( s )
//
RETURN( answerDefaultS )
//
ENDIF // <Enter> was pressed, in response (variation: IF FNMathCheckLogicNotB( MathGetStringLengthI( s ) ) ...) // removed FN because it gave problems compiling [kn, ri, sa, 16-02-2008 21:53:49]
//
RETURN( s ) // response was entered
//
END
// library: internet: view: url: source: program: url: api <description></description> <version control></version control> <version>1.0.0.0.1</version> (filenamemacro=viewinua.s) [kn, ri, su, 13-04-2008 05:24:46]
PROC PROCInternetViewUrlSource_ProgramUrlApi( STRING urlS, STRING fileNameS )
// e.g. DLL "<urlmon.dll>"
// e.g. INTEGER PROC FNUrlGetSourceApiI(
// e.g. INTEGER lpunknown,
// e.g. STRING urlS : CSTRVAL,
// e.g. STRING filenameS : CSTRVAL,
// e.g. INTEGER dword,
// e.g. INTEGER tlpbindstatuscallback
// e.g. ) : "URLDownloadToFileA"
// e.g. END
// e.g.
// e.g. PROC Main()
// e.g. STRING s1[255] = FNStringGetInitializeNewStringS()
// e.g. s1 = FNStringGetInputS( "internet: view: url: source: program: url: api: urlS = ", "http://www.google.com/index.html" )
// e.g. IF FNKeyCheckPressEscapeB( s1 ) RETURN() ENDIF
// e.g. PROCInternetViewUrlSource_ProgramUrlApi( s1, "c:\temp\ddd.txt" )
// e.g. END
// e.g.
// e.g. <F12> Main()
PROCUrlGetSource( urlS, filenameS )
EditFile( filenameS )
// EditFile( Format( "-b250", " ", filenameS ) ) // if you want to load the file in binary format // [kn, ri, su, 12-04-2009 19:46:28]
EraseDiskFile( fileNameS ) // to remove this temporary file from your disk.
END
// library: initialize: check: new: boolean: false <description></description> <version control></version control> <version>1.0.0.0.0</version> (filenamemacro=checinbf.s) [kn, ri, su, 22-07-2001 15:58:06]
INTEGER PROC FNMathCheckInitializeNewBooleanFalseB()
// e.g. PROC Main()
// e.g. Message( FNMathCheckInitializeNewBooleanFalseB() ) // gives e.g. FALSE
// e.g. END
// e.g.
// e.g. <F12> Main()
RETURN( FNMathCheckGetLogicFalseB() )
END
// library: math: check: logic: not <description></description> <version control></version control> <version>1.0.0.0.1</version> (filenamemacro=checmaln.s) [kn, ri, tu, 15-05-2001 16:54:21]
INTEGER PROC FNMathCheckLogicNotB( INTEGER B )
// e.g. PROC Main()
// e.g. STRING s[255] = FNStringGetInitializeNewStringS()
// e.g. s = FNStringGetInputS( "math: check: logic: not: number = ", "1" )
// e.g. IF FNKeyCheckPressEscapeB( s ) RETURN() ENDIF
// e.g. Message( FNMathCheckLogicNotB( FNStringGetToIntegerI( s ) ) )
// e.g. END
// e.g.
// e.g. <F12> Main()
//
RETURN( NOT B )
//
END
// library: history: check: ask: central <description>input: ask: find history</description> <version>1.0.0.0.1</version> (filenamemacro=chechiac.s) [kn, ri, sa, 25-08-2001 20:34:13]
INTEGER PROC FNHistoryCheckAskCentralB( STRING askS, VAR STRING answerDefaultS, INTEGER historyI )
// e.g. PROC Main()
// e.g. STRING s[255] = "test"
// e.g. Message( FNHistoryCheckAskCentralB( "Please input something", s, _FIND_HISTORY_ ) ) // gives e.g. "test"
// e.g. END
// e.g.
// e.g. <F12> Main()
//
RETURN( Ask( askS, answerDefaultS, historyI ) )
//
END
// library: string: empty: is given string empty? [kn, ri, sa, 20-05-2000 20:11:08]
INTEGER PROC FNStringCheckEmptyB( STRING s )
RETURN( FNStringCheckEqualB( s, FNStringGetEmptyS() ) )
END
// library: string: check: empty: not <description></description> <version control></version control> <version>1.0.0.0.0</version> (filenamemacro=checstep.s) [kn, ri, su, 21-05-2006 22:32:11]
INTEGER PROC FNStringCheckEmptyNotB( STRING s )
// e.g. PROC Main()
// e.g. Message( FNStringCheckEmptyNotB( FNStringGetEmptyS() ) ) // gives e.g. FALSE
// e.g. END
// e.g.
// e.g. <F12> Main()
RETURN( FNMathCheckLogicNotB( FNStringCheckEmptyB( s ) ) )
END
// library: url: get: source <description></description> <version control></version control> <version>1.0.0.0.3</version> (filenamemacro=geturgso.s) [kn, ri, su, 13-04-2008 05:12:53]
PROC PROCUrlGetSource( STRING urlS, STRING filenameS )
// e.g. DLL "<urlmon.dll>"
// e.g. INTEGER PROC FNUrlGetSourceApiI(
// e.g. INTEGER lpunknown,
// e.g. STRING urlS : CSTRVAL,
// e.g. STRING filenameS : CSTRVAL,
// e.g. INTEGER dword,
// e.g. INTEGER tlpbindstatuscallback
// e.g. ) : "URLDownloadToFileA"
// e.g. END
// e.g.
// e.g. PROC Main()
// e.g. STRING s1[255] = "http://www.google.com/index.html"
// e.g. STRING s2[255] = "c:\temp\ddd.txt"
// e.g. IF ( NOT ( Ask( "url: get: source: urlS = ", s1, _EDIT_HISTORY_ ) ) AND ( Length( s1 ) > 0 ) ) RETURN() ENDIF
// e.g. IF ( NOT ( AskFilename( "url: get: source: filenameS = ", s2, _DEFAULT_, _EDIT_HISTORY_ ) ) AND ( Length( s2 ) > 0 ) ) RETURN() ENDIF
// e.g. PROCUrlGetSource( s1, s2 )
// e.g. EditFile( s2 )
// e.g. END
// e.g.
// e.g. <F12> Main()
//
FNUrlGetSourceApiI( 0, urlS, filenameS, 0, 0 )
//
END
// library: math: check: get: logic: false: wrapper <description></description> <version control></version control> <version>1.0.0.0.0</version> (filenamemacro=checmalf.s) [kn, ri, su, 22-07-2001 15:43:08]
INTEGER PROC FNMathCheckGetLogicFalseB()
// e.g. PROC Main()
// e.g. Message( FNMathCheckGetLogicFalseB() ) // gives e.g. ...""
// e.g. END
// e.g.
// e.g. <F12> Main()
RETURN( FALSE )
END
--- cut here: end ----------------------------------------------------
===
Note:
[Thursday 17 April 2008]
There is something wrong with the parameters, or even with the routine.
I found examples where I did not get the same source code as when
using right click in Microsoft Internet Explorer, then 'View source'.
I will have to check these parameters further (e.g. dword, tlpbindstatuscallback, ...)
to see if that makes a difference.
I checked it in BBCBASIC using the same urlmon.dll and parameters,
and got the same unexpected result. So very probably the parameters are the problem.
Possible root cause:
Currently assumed to be caused by the use of HTML frames, that is,
two or more web pages nested inside each other. The URL then points to
another frame than the frame you are interested in.
===
--- cut here: begin --------------------------------------------------
DLL "<urlmon.dll>"
INTEGER PROC FNUrlGetSourceApiI(
INTEGER lpunknown,
STRING urlS : CSTRVAL,
STRING filenameS : CSTRVAL,
INTEGER dword,
INTEGER tlpbindstatuscallback
) : "URLDownloadToFileA"
END
PROC PROCUrlGetSource( STRING urlS, STRING fileNameS )
FNUrlGetSourceApiI( 0, urlS, filenameS, 0, 0 )
END
PROC Main()
PROCUrlGetSource( "http://www.google.com/index.html", "c:\ddd.txt" )
EditFile( "c:\ddd.txt" )
END
--- cut here: end ----------------------------------------------------
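When the downloaded source is not what you expect, it may help to at least
check the return value of URLDownloadToFileA, which is 0 (S_OK) when the
download succeeded. The following is a minimal sketch of that idea. The
function name FNUrlGetSourceCheckB is hypothetical, and a return value of 0
only tells you that a file was written, not that its content matches what
the browser shows.
--- cut here: begin --------------------------------------------------
DLL "<urlmon.dll>"
 INTEGER PROC FNUrlGetSourceApiI(
  INTEGER lpunknown,
  STRING urlS : CSTRVAL,
  STRING filenameS : CSTRVAL,
  INTEGER dword,
  INTEGER tlpbindstatuscallback
 ) : "URLDownloadToFileA"
END

// Hypothetical variation of PROCUrlGetSource() which reports a failed download.
// Returns TRUE when URLDownloadToFileA returned 0 (S_OK), otherwise FALSE.
INTEGER PROC FNUrlGetSourceCheckB( STRING urlS, STRING fileNameS )
 INTEGER resultI = 0
 //
 resultI = FNUrlGetSourceApiI( 0, urlS, fileNameS, 0, 0 )
 //
 IF resultI <> 0
  Warn( Format( "URLDownloadToFileA failed for ", urlS, " (result = ", resultI, ")" ) )
 ENDIF
 //
 RETURN( resultI == 0 )
END

PROC Main()
 STRING fileNameS[255] = MakeTempName( "." )
 //
 // only load the file when the download reported success
 IF FNUrlGetSourceCheckB( "http://www.google.com/index.html", fileNameS )
  EditFile( fileNameS )
  EraseDiskFile( fileNameS ) // to remove this temporary file from your disk.
 ENDIF
END
--- cut here: end ----------------------------------------------------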
===
Get the links in the source code
--- cut here: begin --------------------------------------------------
DLL "<urlmon.dll>"
INTEGER PROC FNUrlGetSourceApiI(
INTEGER lpunknown,
STRING urlS : CSTRVAL,
STRING filenameS : CSTRVAL,
INTEGER dword,
INTEGER tlpbindstatuscallback
) : "URLDownloadToFileA"
END
PROC PROCUrlGetSource( STRING urlS, STRING fileNameS )
FNUrlGetSourceApiI( 0, urlS, filenameS, 0, 0 )
END
PROC Main()
STRING urlS[255] = "http://www.google.com/index.html"
STRING fileNameS[255] = MakeTempName( "." )
//
IF NOT Ask( "get source code at which url = ", urlS, _FIND_HISTORY_ )
RETURN()
ENDIF
//
IF NOT Ask( "store this source code in which filename = ", fileNameS, _FIND_HISTORY_ )
RETURN()
ENDIF
//
PROCUrlGetSource( urlS, fileNameS )
EditFile( fileNameS )
EraseDiskFile( fileNameS ) // to remove this temporary file from your disk.
//
PushKey( <ALT E> )
//
IF NOT ( LFind( "{www\.}|{http}", "ngixv" ) ) // "\." matches a literal dot, a bare "." matches any character
Warn( "no 'www.' or 'http' links found in current page" )
ENDIF
//
END
--- cut here: end ----------------------------------------------------
===
Note
---
The example above is kept simple on purpose, to illustrate the principle.
The URL extraction can of course be improved with a more complex
regular expression.
===
Possibly analyse the source code
If you can get the source of a web page (e.g. on the Internet, or on
your own local network or intranet), you can automate all kinds of tasks
by extracting information from this web page (e.g. by searching
with LFind() and regular expressions).
---
This is really handy for automating the handling of web applications (I will
e.g. extract some (form) information from given web pages (e.g. the name and
address of a given user), and can then use this data in TSE to
automatically create some other information (e.g. store it in database
tables)).
===
E.g. here is an illustrative example of monitoring a change in a web page
--- cut here: begin --------------------------------------------------
DLL "<urlmon.dll>"
INTEGER PROC FNUrlGetSourceApiI(
INTEGER lpunknown,
STRING urlS : CSTRVAL,
STRING filenameS : CSTRVAL,
INTEGER dword,
INTEGER tlpbindstatuscallback
) : "URLDownloadToFileA"
END
PROC PROCUrlGetSource( STRING urlS, STRING fileNameS )
FNUrlGetSourceApiI( 0, urlS, filenameS, 0, 0 )
END
PROC Main()
STRING urlS[255] = "http://www.semware.com/index.php"
STRING fileNameS[255] = MakeTempName( "." )
//
IF NOT Ask( "get source code at which url = ", urlS, _FIND_HISTORY_ )
RETURN()
ENDIF
//
IF NOT Ask( "store this source code in which filename = ", fileNameS, _FIND_HISTORY_ )
RETURN()
ENDIF
//
PROCUrlGetSource( urlS, fileNameS )
EditFile( fileNameS )
EraseDiskFile( fileNameS ) // to remove this temporary file from your disk.
//
IF LFind( "last updated{.*}$", "gix" )
Warn( Format( "this page was last updated at", " ", Trim( GetFoundText( 1 ) ) ) )
ENDIF
//
END
--- cut here: end ----------------------------------------------------
===
2. Load that file in TSE
3. Search with regular expressions for 'http://www...' in the text
(e.g. something like
--- cut here: begin --------------------------------------------------
WHILE LFind( "{http:\/\/}?{www.@~[ ]}", "ix" )
s = GetFoundText( 2 )
AddLine( s, <your bufferid> )
ENDWHILE
--- cut here: end ----------------------------------------------------
or use the regular expression (debug this)
["]?{http:\/\/}?{www\..@}\c{["]|[ ]}
to add the output to some temporary file)
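The search loop of step 3 could e.g. be completed as in the following
sketch, which collects everything that looks like an 'http://' or 'https://'
URL in the current file into a temporary buffer. It is an untested sketch
under assumptions: the variable name linkBufferI and the regular expression
are illustrative only (the expression simply stops at the next space, double
quote or angle bracket), so adjust it to the pages you work with.
--- cut here: begin --------------------------------------------------
PROC Main()
 INTEGER linkBufferI = 0 // hypothetical name for the result buffer
 //
 PushPosition()
 linkBufferI = CreateTempBuffer()
 PopPosition()
 //
 IF linkBufferI == 0
  Warn( "could not create a temporary buffer" )
  RETURN()
 ENDIF
 //
 PushPosition()
 BegFile()
 // tag 1 = "http://" or "https://", followed by all characters up to the
 // next space, double quote or angle bracket. The \c puts the cursor after
 // the match, so that the next search continues from there
 WHILE LFind( '{https?:\/\/[~ "<>]+}\c', "ix" )
  AddLine( GetFoundText( 1 ), linkBufferI )
 ENDWHILE
 PopPosition()
 //
 GotoBufferId( linkBufferI ) // show the collected URLs
END
--- cut here: end ----------------------------------------------------
Placing the cursor after each match with \c is what lets the loop move
on to the next URL instead of finding the same one again.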
===
C. Method: Download that page manually, then search for the URLs
Similar to the above, but you only do steps 2 and 3. You get the source
as usual by going to the URL in your browser, right clicking on the page,
and choosing 'View source'. This will e.g. open Notepad (or TSE, if you
have set it as that editor via a registry setting).
===
D. Method: Use an external program (in combination with TSE)
1. -E.g. use a library in another programming language (like Perl, PHP, Ruby, ...) to extract and handle the URLs (e.g. combined with a TSE capture macro)
2. -E.g.
--- cut here: begin --------------------------------------------------
urlget.exe
--- cut here: end ----------------------------------------------------
3. -E.g.
--- cut here: begin --------------------------------------------------
wget.exe
--- cut here: end ----------------------------------------------------
4. -E.g.
--- cut here: begin --------------------------------------------------
Curl
--- cut here: end ----------------------------------------------------
5. -E.g.
--- cut here: begin --------------------------------------------------
XSite
--- cut here: end ----------------------------------------------------
---
This time I used XSite (just download and install, fill in the URL,
click on 'Query web site' icon, then click node 'All links', then menu
'File'->'Export node', and it will save all links as a .csv file, which
you can load in TSE).
===
E. To insert the URLs at the correct position in the text, after extracting these URLs
(e.g. of your copy/pasted HTML page, which usually does not include the URL
shortcuts), you could e.g. create a keyboard macro or a TSE
macro which each time extracts the first line of the block (other
implementations of how to insert the URLs at the correct position in
the text are of course possible):
1. -Copy the whole block of URL lines to the Microsoft Windows clipboard
2. -Paste that block in your text
3. -Insert a new line below the first line in the block
4. -Go down, highlight all lines of that block below the first line,
and cut them again to the Microsoft Windows clipboard. That leaves only
the first line, thus the topmost URL
5. -Repeat this process until there are no more URLs
When you start from the top of the page this is a rather linear way to
insert the URLs, because the extraction order is also from top to bottom.
===
Example:
I used this last method successfully to extract and insert the URLs of e.g. this webpage:
--- cut here: begin ----------------------------------------------------- cut here: end ----------------------------------------------------
If you only just copy/paste you will get something like
--- cut here: begin --------------------------------------------------
GNU/LINUX DISTROS
Linux Mint (F)
SimplyMEPIS (F)
--- cut here: end ----------------------------------------------------
===
After adding the missing URLs you get something like
--- cut here: begin ----------------------------------------------------- cut here: end ----------------------------------------------------
===
If you have other ideas or interesting implementations, please let them
be known.
===
[kn, ho, th, 27-03-2008 11:11:02]
Why not just copy-and-paste into KompoZer?
--- cut here: begin ----------------------------------------------------- cut here: end ----------------------------------------------------
===
Internet: see also:
---
Computer: Editor: Text: TSE: Internet: Url: Source code: Get: How to automatically get the source code of any URL and edit it in TSE? [Microsoft Windows API URLDownloadToFile]
http://goo.gl/ubFv6
===
<version>1.0.0.0.9</version>
----------------------------------------------------------------------