----------------------------------------------------------------------
--- Knud van Eeden --- 11 April 2008 - 12:19 am ----------------------

Computer: Editor: Text: TSE: How to extract all URLs from an Internet page? [urlmon.dll / The TSE Cookbook: Recipe]

===

The TSE Cookbook: Recipe:

===

How to get URLs from an HTML page?

---

When you copy/paste an HTML page as plain text, its URLs are often
not copied with it.
So how do you include these URLs as well?

===

1. -If you run the method below, which extracts all the URLs in a given
    web page, you will usually get a rather complex looking block of
    text with very many URLs (e.g. from additional advertisements, ...),
    many more than you need. You only want the URLs corresponding
    to your block of text, which you e.g. copy/pasted from the
    original web page.

2. -What you could do to further extract the URLs and paste them
    in the right position in your block of TSE flat text is:

    1. -For each extracted URL, from first to last:

        1. -Get the URL ??? between ...

        2. -Get the text ... between ...

        3. -Search whether that text ... can be found in your block of text

--- cut here: begin --------------------------------------------------
ssfffasdfasdfaafdffd adfadsfasdf...adfadf afffafddddfasdfasdfa
--- cut here: end ----------------------------------------------------

        4. -If yes, insert that URL at that position in the text

--- cut here: begin --------------------------------------------------
ssfffasdfasdfaafdffd adfadsfasdf... ??? adfadf afffafddddfasdfasdfa
--- cut here: end ----------------------------------------------------

    2. -Another method, which is less automatic and needs more manual
        editing, is:

        1. -You know that the URLs should only be added between the
            first and last line of your block of text.

        2. -So search in the extracted URL text for the text of the
            first line of your flat text TSE block

        3. -Search also in the extracted URL text for the text of the
            last line of your flat text TSE block

        4. -Only extract and use the URLs between that found first and
            last line text from all the extracted URLs

===

To extract all URLs from a given web page there are several methods.

===

A. Method: Do it all manually

You can of course right click on each URL shortcut hyperlink on the
page, choose 'Copy shortcut' and copy/paste it.
But if there are a lot of these URLs that will usually take too much
time. So automation might be the choice.

===

B. Method: Write a TSE macro which does it all:

1. -Supply the URL and download the source code of that page to local
    disk (e.g. in combination with the Windows urlmon.dll and the
    function URLDownloadToFileA).

    You call the Microsoft Windows API function

--- cut here: begin --------------------------------------------------
URLDownloadToFileA
--- cut here: end ----------------------------------------------------

    located in the file

--- cut here: begin --------------------------------------------------
urlmon.dll
--- cut here: end ----------------------------------------------------

    with as parameters your URL, e.g.

--- cut here: begin --------------------------------------------------
http://www.semware.com/index.php
--- cut here: end ----------------------------------------------------

    and the local file in which to store the result, e.g.

--- cut here: begin --------------------------------------------------
c:\temp\index.htm
--- cut here: end ----------------------------------------------------

    You then load that file in TSE, e.g.

--- cut here: begin --------------------------------------------------
EditFile( "c:\temp\index.htm" )
--- cut here: end ----------------------------------------------------

===

Get the source code (nice and simple routine, all in TSE, with no
downloading or installation of external programs)

===

E.g. create the following program:

--- cut here: begin --------------------------------------------------
// filenamemacro=viewinae.s

// PROC PROCInternetViewLinkUrlSourceProgramUrlApiDefault( STRING urlS, STRING fileNameS )

FORWARD INTEGER PROC FNHistoryCheckAskCentralB( STRING s1, VAR STRING s2, INTEGER i1 )
FORWARD INTEGER PROC FNKeyCheckPressEscapeB( STRING s1 )
FORWARD INTEGER PROC FNMathCheckGetLogicFalseB()
FORWARD INTEGER PROC FNMathCheckInitializeNewBooleanFalseB()
FORWARD INTEGER PROC FNMathCheckLogicNotB( INTEGER i1 )
FORWARD INTEGER PROC FNStringCheckEmptyB( STRING s1 )
FORWARD INTEGER PROC FNStringCheckEmptyNotB( STRING s1 )
FORWARD INTEGER PROC FNStringCheckEqualB( STRING s1, STRING s2 )
FORWARD PROC Main()
FORWARD PROC PROCInternetViewLinkUrlSourceProgramUrlApiDefault( STRING s1, STRING s2 )
FORWARD PROC PROCInternetViewUrlSource_ProgramUrlApi( STRING s1, STRING s2 )
FORWARD PROC PROCInternetView_UrlSourceProgramUrlApiDefault( STRING s1, STRING s2 )
FORWARD PROC PROCUrlGetSource( STRING s1, STRING s2 )
FORWARD STRING PROC FNStringGetEmptyS()
FORWARD STRING PROC FNStringGetEscapeS()
FORWARD STRING PROC FNStringGetHistoryInputS( STRING s1, STRING s2, INTEGER i1 )
FORWARD STRING PROC FNStringGetInitializeNewStringS()
FORWARD STRING PROC FNStringGetInputS( STRING s1, STRING s2 )
FORWARD STRING PROC FNStringGetSearchHistoryFindInputS( STRING s1, STRING s2 )

// --- MAIN --- //

DLL "<urlmon.dll>"
 INTEGER PROC FNUrlGetSourceApiI(
  INTEGER lpunknown,
  STRING urlS : CSTRVAL,
  STRING filenameS : CSTRVAL,
  INTEGER dword,
  INTEGER tlpbindstatuscallback
 ) : "URLDownloadToFileA"
END

PROC Main()
 STRING s1[255] = FNStringGetInitializeNewStringS()
 STRING s2[255] = FNStringGetInitializeNewStringS()
 PushKey( <Shift Ins> ) // added [kn, ri, su, 16-09-2012 14:31:54]
 s1 = FNStringGetInputS( "Microsoft: API: urlmon.dll: url of which you want to see the source code = ", "http://www.google.com/index.html" )
 IF FNKeyCheckPressEscapeB( s1 ) RETURN() ENDIF
 // s2 = FNStringGetInputS( "Microsoft: API: urlmon.dll: filename to store the url source code = ", MakeTempName( "." ) ) // old [kn, ri, su, 16-09-2012 14:33:24]
 s2 = MakeTempName( "." ) // new [kn, ri, su, 16-09-2012 14:33:28]
 // IF FNKeyCheckPressEscapeB( s2 ) RETURN() ENDIF // old [kn, ri, su, 16-09-2012 14:33:30]
 PROCInternetViewLinkUrlSourceProgramUrlApiDefault( s1, s2 )
END

<F12> Main()

// --- LIBRARY --- //

// library: string: initialize [kn, ri, mo, 09-07-2001 12:00:07]
STRING PROC FNStringGetInitializeNewStringS()
 RETURN( FNStringGetEmptyS() )
END

// library: string: get: input <description>input a string</description> <version>1.0.0.0.1</version> (filenamemacro=getstgiq.s) [kn, ni, mo, 03-08-1998 13:04:18]
STRING PROC FNStringGetInputS( STRING askS, STRING answerDefaultS )
 // e.g. PROC Main()
 // e.g.  Message( FNStringGetConcat3S( "'", FNStringGetInputS( "Choose option (Y/n)", "Y" ), "'" ) ) // gives e.g. "Y"
 // e.g. END
 // e.g.
 // e.g. <F12> Main()
 //
 RETURN( FNStringGetSearchHistoryFindInputS( askS, answerDefaultS ) )
 //
END

// library: key: check: press: escape <description>input: escape: test if escape was pressed</description> <version>1.0.0.0.2</version> (filenamemacro=checkepe.s) [kn, ni, we, 05-08-1998 20:29:00]
INTEGER PROC FNKeyCheckPressEscapeB( STRING s ) // version with testing local variable
 // e.g. PROC Main()
 // e.g.  Message( FNKeyCheckPressEscapeB( "" ) ) // version with testing local variable ) // gives e.g. FALSE
 // e.g. END
 // e.g.
 // e.g. <F12> Main()
 //
 RETURN( FNStringCheckEqualB( s, FNStringGetEscapeS() ) )
 //
END

// library: internet: get: link: url: source: program: url: api: default <description>extract the URLs from a given Internet web page URL</description> <version control></version control> <version>1.0.0.0.20</version> (filenamemacro=viewinae.s) [kn, ri, su, 13-04-2008 05:35:51]
PROC PROCInternetViewLinkUrlSourceProgramUrlApiDefault( STRING urlS, STRING fileNameS )
 // e.g. DLL "<urlmon.dll>"
 // e.g.  INTEGER PROC FNUrlGetSourceApiI(
 // e.g.   INTEGER lpunknown,
 // e.g.   STRING urlS : CSTRVAL,
 // e.g.   STRING filenameS : CSTRVAL,
 // e.g.   INTEGER dword,
 // e.g.   INTEGER tlpbindstatuscallback
 // e.g.  ) : "URLDownloadToFileA"
 // e.g. END
 // e.g.
 // e.g. PROC Main()
 // e.g.  STRING s1[255] = FNStringGetInitializeNewStringS()
 // e.g.  STRING s2[255] = FNStringGetInitializeNewStringS()
 // e.g.  PushKey( <Shift Ins> ) // added [kn, ri, su, 16-09-2012 14:31:54]
 // e.g.  s1 = FNStringGetInputS( "Microsoft: API: urlmon.dll: url of which you want to see the source code = ", "http://www.google.com/index.html" )
 // e.g.  IF FNKeyCheckPressEscapeB( s1 ) RETURN() ENDIF
 // e.g.  // s2 = FNStringGetInputS( "Microsoft: API: urlmon.dll: filename to store the url source code = ", MakeTempName( "." ) ) // old [kn, ri, su, 16-09-2012 14:33:24]
 // e.g.  s2 = MakeTempName( "." ) // new [kn, ri, su, 16-09-2012 14:33:28]
 // e.g.  // IF FNKeyCheckPressEscapeB( s2 ) RETURN() ENDIF // old [kn, ri, su, 16-09-2012 14:33:30]
 // e.g.  PROCInternetViewLinkUrlSourceProgramUrlApiDefault( s1, s2 )
 // e.g. END
 // e.g.
 // e.g. <F12> Main()
 //
 STRING s1[255] = ""
 //
 STRING s2[255] = ""
 //
 STRING s[255] = Format( "\<a.*href[ ]*=[ ]*", '"', "?{.*}", '"', "?\>{.*}{{<\/a}|$}\c" )
 //
 INTEGER bufferI = 0
 //
 PushPosition()
 bufferI = CreateTempBuffer()
 PopPosition()
 //
 PROCInternetView_UrlSourceProgramUrlApiDefault( urlS, fileNameS )
 //
 BegFile()
 //
 WHILE ( LFind( s, "ix" ) )
  //
  s1 = GetFoundText( 1 )
  //
  s2 = GetFoundText( 2 )
  //
  AddLine( s2, bufferI )
  AddLine( s1, bufferI )
  AddLine( "", bufferI )
  AddLine( "---", bufferI ) // added [kn, ri, sa, 15-09-2012 19:27:51]
  AddLine( "", bufferI ) // added [kn, ri, sa, 15-09-2012 19:27:55]
  //
 ENDWHILE
 //
 IF ( GotoBufferId( bufferI ) )
  //
  // clean up in buffer
  //
  LReplace( 'target[ ]*=[ ]*["]?_blank', "", "ginx" )
  LReplace( 'rel[ ]*=[ ]*["]?nofollow', "", "ginx" )
  LReplace( '"', "", "ginx" ) // remove all double quotes
  LReplace( "'", "", "ginx" ) // remove all single quotes
  //
 ENDIF
 //
END

// library: string: get: empty (return an empty string) <description></description> <version control></version control> <version>1.0.0.0.0</version> (filenamemacro=getstgem.s) [kn, ri, sa, 20-05-2000 20:11:03]
STRING PROC FNStringGetEmptyS()
 // e.g. PROC Main()
 // e.g.  Message( FNStringGetEmptyS() ) // gives e.g. ...""
 // e.g. END
 // e.g.
 // e.g. <F12> Main()
 RETURN( "" )
END

// library: string: get: search: history: find: input <description>input a string: history: find</description> <version>1.0.0.0.2</version> (filenamemacro=getstfir.s) [kn, ri, sa, 25-08-2001 21:00:25]
STRING PROC FNStringGetSearchHistoryFindInputS( STRING askS, STRING answerDefaultS )
 // e.g. PROC Main()
 // e.g.  Message( FNStringGetSearchHistoryFindInputS( "Please input something", "test" ) ) // gives e.g. "test"
 // e.g. END
 // e.g.
 // e.g. <F12> Main()
 //
 RETURN( FNStringGetHistoryInputS( askS, answerDefaultS, _FIND_HISTORY_ ) )
 //
END

// library: string: equal: are two given strings equal? (stored in 'checstcf.s') [kn, zoe, we, 04-10-2000 18:23:27]
INTEGER PROC FNStringCheckEqualB( STRING s1, STRING s2 )
 // e.g. PROC Main()
 // e.g.  STRING s1[255] = FNStringGetInitializeNewStringS()
 // e.g.  STRING s2[255] = FNStringGetInitializeNewStringS()
 // e.g.  s1 = FNStringGetInputS( "string: check: equal: first string = ", "a" )
 // e.g.  IF FNKeyCheckPressEscapeB( s1 ) RETURN() ENDIF
 // e.g.  s2 = FNStringGetInputS( "string: check: equal: second string = ", "a" )
 // e.g.  IF FNKeyCheckPressEscapeB( s2 ) RETURN() ENDIF
 // e.g.  Message( FNStringCheckEqualB( s1, s2 ) ) // gives e.g. TRUE when string1 is equal to string2
 // e.g. END
 // e.g.
 // e.g. <F12> Main()
 //
 // <F12> PROCMessage( FNStringCheckEqualB( "knud", "knud" ) ) // gives TRUE
 //
 // <F12> PROCMessage( FNStringCheckEqualB( "knud", "van" ) ) // gives FALSE
 RETURN( s1 == s2 )
END

// library: string: get: escape <description>general output string to recognize an escape (e.g. in another routine). Central routine, only one occurrence of this constant string</description> <version>1.0.0.0.2</version> (filenamemacro=getstges.s) [kn, ri, sa, 05-12-1998 18:52:24]
STRING PROC FNStringGetEscapeS()
 // e.g. PROC Main()
 // e.g.  Message( FNStringGetEscapeS() ) // gives e.g. ...""
 // e.g. END
 // e.g.
 // e.g. <F12> Main()
 //
 RETURN( "<ESCAPE>" )
 //
END

// library: internet: view: url: source: program: url: api: default <description></description> <version control></version control> <version>1.0.0.0.0</version> (filenamemacro=viewinad.s) [kn, ri, su, 13-04-2008 05:19:22]
PROC PROCInternetView_UrlSourceProgramUrlApiDefault( STRING urlS, STRING fileNameS )
 // e.g. DLL "<urlmon.dll>"
 // e.g.  INTEGER PROC FNUrlGetSourceApiI(
 // e.g.   INTEGER lpunknown,
 // e.g.   STRING urlS : CSTRVAL,
 // e.g.   STRING filenameS : CSTRVAL,
 // e.g.   INTEGER dword,
 // e.g.   INTEGER tlpbindstatuscallback
 // e.g.  ) : "URLDownloadToFileA"
 // e.g. END
 // e.g.
 // e.g. PROC Main()
 // e.g.  STRING s1[255] = FNStringGetInitializeNewStringS()
 // e.g.  STRING s2[255] = FNStringGetInitializeNewStringS()
 // e.g.  s1 = FNStringGetInputS( "Microsoft: API: urlmon.dll: url of which you want to see the source code = ", "http://www.google.com/index.html" )
 // e.g.  IF FNKeyCheckPressEscapeB( s1 ) RETURN() ENDIF
 // e.g.  s2 = FNStringGetInputS( "Microsoft: API: urlmon.dll: filename to store the url source code = ", MakeTempName( "." ) )
 // e.g.  IF FNKeyCheckPressEscapeB( s2 ) RETURN() ENDIF
 // e.g.  PROCInternetView_UrlSourceProgramUrlApiDefault( s1, s2 )
 // e.g. END
 // e.g.
 // e.g. <F12> Main()
 PROCInternetViewUrlSource_ProgramUrlApi( urlS, fileNameS )
END

// library: string: get: history: input <description>input a string and store it in that specific history list</description> <version>1.0.0.0.2</version> (filenamemacro=getsthin.s) [kn, ni, mo, 03-08-1998 13:04:18]
STRING PROC FNStringGetHistoryInputS( STRING infoS, STRING answerDefaultS, INTEGER historyI )
 // e.g. PROC Main()
 // e.g.  Message( FNStringGetHistoryInputS( "Please input something", "test", _FIND_HISTORY_ ) ) // gives e.g. "test"
 // e.g. END
 // e.g.
 // e.g. <F12> Main()
 //
 STRING s[255] = answerDefaultS
 //
 INTEGER escapeB = FNMathCheckInitializeNewBooleanFalseB()
 //
 escapeB = FNMathCheckLogicNotB( FNHistoryCheckAskCentralB( infoS, s, historyI ) )
 //
 IF ( escapeB )
  //
  RETURN( FNStringGetEscapeS() )
  //
 ENDIF // <Escape> was pressed, in response
 //
 IF FNStringCheckEmptyB( s ) AND FNStringCheckEmptyNotB( answerDefaultS )
  //
  RETURN( FNStringGetEmptyS() ) // input of an empty string, user has removed the string to indicate that an empty string was wanted
  //
 ENDIF
 //
 IF FNStringCheckEmptyB( s )
  //
  RETURN( answerDefaultS )
  //
 ENDIF // <Enter> was pressed, in response (variation: IF FNMathCheckLogicNotB( MathGetStringLengthI( s ) ) ...) // removed FN because it gave problems compiling [kn, ri, sa, 16-02-2008 21:53:49]
 //
 RETURN( s ) // response was entered
 //
END

// library: internet: view: url: source: program: url: api <description></description> <version control></version control> <version>1.0.0.0.1</version> (filenamemacro=viewinua.s) [kn, ri, su, 13-04-2008 05:24:46]
PROC PROCInternetViewUrlSource_ProgramUrlApi( STRING urlS, STRING fileNameS )
 // e.g. DLL "<urlmon.dll>"
 // e.g.  INTEGER PROC FNUrlGetSourceApiI(
 // e.g.   INTEGER lpunknown,
 // e.g.   STRING urlS : CSTRVAL,
 // e.g.   STRING filenameS : CSTRVAL,
 // e.g.   INTEGER dword,
 // e.g.   INTEGER tlpbindstatuscallback
 // e.g.  ) : "URLDownloadToFileA"
 // e.g. END
 // e.g.
 // e.g. PROC Main()
 // e.g.  STRING s1[255] = FNStringGetInitializeNewStringS()
 // e.g.  s1 = FNStringGetInputS( "internet: view: url: source: program: url: api: urlS = ", "http://www.google.com/index.html" )
 // e.g.  IF FNKeyCheckPressEscapeB( s1 ) RETURN() ENDIF
 // e.g.  PROCInternetViewUrlSource_ProgramUrlApi( s1, "c:\temp\ddd.txt" )
 // e.g. END
 // e.g.
 // e.g. <F12> Main()
 PROCUrlGetSource( urlS, filenameS )
 EditFile( filenameS )
 // EditFile( Format( "-b250", " ", filenameS ) ) // if you want to load the file in binary format // [kn, ri, su, 12-04-2009 19:46:28]
 EraseDiskFile( fileNameS ) // to remove this temporary file from your disk.
END

// library: initialize: check: new: boolean: false <description></description> <version control></version control> <version>1.0.0.0.0</version> (filenamemacro=checinbf.s) [kn, ri, su, 22-07-2001 15:58:06]
INTEGER PROC FNMathCheckInitializeNewBooleanFalseB()
 // e.g. PROC Main()
 // e.g.  Message( FNMathCheckInitializeNewBooleanFalseB() ) // gives e.g. FALSE
 // e.g. END
 // e.g.
 // e.g. <F12> Main()
 RETURN( FNMathCheckGetLogicFalseB() )
END

// library: math: check: logic: not <description></description> <version control></version control> <version>1.0.0.0.1</version> (filenamemacro=checmaln.s) [kn, ri, tu, 15-05-2001 16:54:21]
INTEGER PROC FNMathCheckLogicNotB( INTEGER B )
 // e.g. PROC Main()
 // e.g.  STRING s[255] = FNStringGetInitializeNewStringS()
 // e.g.  s = FNStringGetInputS( "math: check: logic: not: number = ", "1" )
 // e.g.  IF FNKeyCheckPressEscapeB( s ) RETURN() ENDIF
 // e.g.  Message( FNMathCheckLogicNotB( FNStringGetToIntegerI( s ) ) )
 // e.g. END
 // e.g.
 // e.g. <F12> Main()
 //
 RETURN( NOT B )
 //
END

// library: history: check: ask: central <description>input: ask: find history</description> <version>1.0.0.0.1</version> (filenamemacro=chechiac.s) [kn, ri, sa, 25-08-2001 20:34:13]
INTEGER PROC FNHistoryCheckAskCentralB( STRING askS, VAR STRING answerDefaultS, INTEGER historyI )
 // e.g. PROC Main()
 // e.g.  STRING s[255] = "test"
 // e.g.  Message( FNHistoryCheckAskCentralB( "Please input something", s, _FIND_HISTORY_ ) ) // gives e.g. "test"
 // e.g. END
 // e.g.
 // e.g. <F12> Main()
 //
 RETURN( Ask( askS, answerDefaultS, historyI ) )
 //
END

// library: string: empty: is given string empty? [kn, ri, sa, 20-05-2000 20:11:08]
INTEGER PROC FNStringCheckEmptyB( STRING s )
 RETURN( FNStringCheckEqualB( s, FNStringGetEmptyS() ) )
END

// library: string: check: empty: not <description></description> <version control></version control> <version>1.0.0.0.0</version> (filenamemacro=checstep.s) [kn, ri, su, 21-05-2006 22:32:11]
INTEGER PROC FNStringCheckEmptyNotB( STRING s )
 // e.g. PROC Main()
 // e.g.  Message( FNStringCheckEmptyNotB( FNStringGetEmptyS() ) ) // gives e.g. FALSE
 // e.g. END
 // e.g.
 // e.g. <F12> Main()
 RETURN( FNMathCheckLogicNotB( FNStringCheckEmptyB( s ) ) )
END

// library: url: get: source <description></description> <version control></version control> <version>1.0.0.0.3</version> (filenamemacro=geturgso.s) [kn, ri, su, 13-04-2008 05:12:53]
PROC PROCUrlGetSource( STRING urlS, STRING filenameS )
 // e.g. DLL "<urlmon.dll>"
 // e.g.  INTEGER PROC FNUrlGetSourceApiI(
 // e.g.   INTEGER lpunknown,
 // e.g.   STRING urlS : CSTRVAL,
 // e.g.   STRING filenameS : CSTRVAL,
 // e.g.   INTEGER dword,
 // e.g.   INTEGER tlpbindstatuscallback
 // e.g.  ) : "URLDownloadToFileA"
 // e.g. END
 // e.g.
 // e.g. PROC Main()
 // e.g.  STRING s1[255] = "http://www.google.com/index.html"
 // e.g.  STRING s2[255] = "c:\temp\ddd.txt"
 // e.g.  IF ( NOT ( Ask( "url: get: source: urlS = ", s1, _EDIT_HISTORY_ ) ) AND ( Length( s1 ) > 0 ) ) RETURN() ENDIF
 // e.g.  IF ( NOT ( AskFilename( "url: get: source: filenameS = ", s2, _DEFAULT_, _EDIT_HISTORY_ ) ) AND ( Length( s2 ) > 0 ) ) RETURN() ENDIF
 // e.g.  PROCUrlGetSource( s1, s2 )
 // e.g.  EditFile( s2 )
 // e.g. END
 // e.g.
 // e.g. <F12> Main()
 //
 FNUrlGetSourceApiI( 0, urlS, filenameS, 0, 0 )
 //
END

// library: math: check: get: logic: false: wrapper <description></description> <version control></version control> <version>1.0.0.0.0</version> (filenamemacro=checmalf.s) [kn, ri, su, 22-07-2001 15:43:08]
INTEGER PROC FNMathCheckGetLogicFalseB()
 // e.g. PROC Main()
 // e.g.  Message( FNMathCheckGetLogicFalseB() ) // gives e.g. ...""
 // e.g. END
 // e.g.
 // e.g. <F12> Main()
 RETURN( FALSE )
END
--- cut here: end ----------------------------------------------------

===

Note: [Thursday 17 April 2008]

There is something wrong with the parameters, or even with the routine.
I did find examples where I did not get the same source code as when
using right click in Microsoft Internet Explorer, then 'View source'.
I will have to check these parameters further (e.g. dword,
tlpbindstatuscallback, ...) to see if that makes a difference.

I checked it in BBCBASIC using the same urlmon.dll and parameters, and
got the same unexpected result.
So very probably the parameters are the problem.

Possible root cause:
Currently assumed to be caused by the use of HTML frames,
that is, two or more web pages nested inside each other.
The URL then points to another frame than the frame you are interested in.

===

--- cut here: begin --------------------------------------------------
DLL "<urlmon.dll>"
 INTEGER PROC FNUrlGetSourceApiI(
  INTEGER lpunknown,
  STRING urlS : CSTRVAL,
  STRING filenameS : CSTRVAL,
  INTEGER dword,
  INTEGER tlpbindstatuscallback
 ) : "URLDownloadToFileA"
END

PROC PROCUrlGetSource( STRING urlS, STRING fileNameS )
 FNUrlGetSourceApiI( 0, urlS, filenameS, 0, 0 )
END

PROC Main()
 PROCUrlGetSource( "http://www.google.com/index.html", "c:\ddd.txt" )
 EditFile( "c:\ddd.txt" )
END
--- cut here: end ----------------------------------------------------

===

Get the links in the source code

--- cut here: begin --------------------------------------------------
DLL "<urlmon.dll>"
 INTEGER PROC FNUrlGetSourceApiI(
  INTEGER lpunknown,
  STRING urlS : CSTRVAL,
  STRING filenameS : CSTRVAL,
  INTEGER dword,
  INTEGER tlpbindstatuscallback
 ) : "URLDownloadToFileA"
END

PROC PROCUrlGetSource( STRING urlS, STRING fileNameS )
 FNUrlGetSourceApiI( 0, urlS, filenameS, 0, 0 )
END

PROC Main()
 STRING urlS[255] = "http://www.google.com/index.html"
 STRING fileNameS[255] = MakeTempName( "." )
 //
 IF NOT Ask( "get source code at which url = ", urlS, _FIND_HISTORY_ ) RETURN() ENDIF
 //
 IF NOT Ask( "store this source code in which filename = ", fileNameS, _FIND_HISTORY_ ) RETURN() ENDIF
 //
 PROCUrlGetSource( urlS, fileNameS )
 EditFile( fileNameS )
 EraseDiskFile( fileNameS ) // to remove this temporary file from your disk.
 //
 PushKey( <ALT E> )
 //
 IF NOT ( LFind( "{www.}|{http}", "ngixv" ) ) Warn( "no 'www.' or 'http' links found in current page" ) ENDIF
 //
END
--- cut here: end ----------------------------------------------------

===

Note

---

Of course the URL link extraction can be done better with a more
complex regular expression.

===

Currently it just illustrates the principles, to keep it simple.

===

Possibly analyse the source code

If you can get the source of a web page (e.g. on the Internet, or on your
own local network or Intranet), you can automate all kinds of tasks by
extracting the information from this web page (e.g. by searching with
LFind() and regular expressions).

---

It will be really handy for automating the handling of web applications
(I will e.g. extract some (form) information out of given web pages
(e.g. name, address of a given user), and can then use this data to
automatically create some other information (e.g. store it in database
tables) in TSE).

===

E.g. here is an illustrative example of monitoring a change in a web page

--- cut here: begin --------------------------------------------------
DLL "<urlmon.dll>"
 INTEGER PROC FNUrlGetSourceApiI(
  INTEGER lpunknown,
  STRING urlS : CSTRVAL,
  STRING filenameS : CSTRVAL,
  INTEGER dword,
  INTEGER tlpbindstatuscallback
 ) : "URLDownloadToFileA"
END

PROC PROCUrlGetSource( STRING urlS, STRING fileNameS )
 FNUrlGetSourceApiI( 0, urlS, filenameS, 0, 0 )
END

PROC Main()
 STRING urlS[255] = "http://www.semware.com/index.php"
 STRING fileNameS[255] = MakeTempName( "." )
 //
 IF NOT Ask( "get source code at which url = ", urlS, _FIND_HISTORY_ ) RETURN() ENDIF
 //
 IF NOT Ask( "store this source code in which filename = ", fileNameS, _FIND_HISTORY_ ) RETURN() ENDIF
 //
 PROCUrlGetSource( urlS, fileNameS )
 EditFile( fileNameS )
 EraseDiskFile( fileNameS ) // to remove this temporary file from your disk.
 //
 IF LFind( "last updated{.*}$", "gix" ) Warn( Format( "this page is last updated at", " ", Trim( GetFoundText( 1 ) ) ) ) ENDIF
 //
END
--- cut here: end ----------------------------------------------------

===

2. -Load that file in TSE

3. -Search with regular expressions for 'http://www...' in the text
    (e.g. something like

--- cut here: begin --------------------------------------------------
WHILE LFind( "{http:\/\/}?{www.@~[ ]}", "ix" )
 s = GetFoundText( 2 )
 AddLine( s, <your bufferid> )
ENDWHILE
--- cut here: end ----------------------------------------------------

    or use the regular expression (debug this)

     ["]?{http:\/\/}?{www\..@}\c{["]|[ ]}

    to add the output to some temporary file)

===

C. Method: Download that page manually, then search for the URLs

Similar to the above, but you do only steps 2 and 3.

You get the source as usual by going to the URL in your browser,
right clicking on the page, and choosing 'View source'.
This will e.g. open Notepad (or TSE, if you have set it as that
editor via a registry setting).

===

D. Method: Use an external program (in combination with TSE)

1. -E.g. use a library in another computer language (like Perl, PHP,
    Ruby, ...) to extract and handle the URLs (e.g. combine it with a
    TSE capture macro)

2. -E.g.

--- cut here: begin --------------------------------------------------
urlget.exe
--- cut here: end ----------------------------------------------------

3. -E.g.

--- cut here: begin --------------------------------------------------
wget.exe
--- cut here: end ----------------------------------------------------

===

--- cut here: begin --------------------------------------------------
http://en.wikipedia.org/wiki/Wget
--- cut here: end ----------------------------------------------------

4. -E.g. Curl

--- cut here: begin --------------------------------------------------
http://curl.haxx.se/
--- cut here: end ----------------------------------------------------

5. -E.g. XSite

--- cut here: begin --------------------------------------------------
http://www.veign.com/application.php?appid=108
--- cut here: end ----------------------------------------------------

---

This time I used XSite (just download and install, fill in the URL,
click on the 'Query web site' icon, then click the node 'All links', then
menu 'File'->'Export node', and it will save all links as a .csv file,
which you can load in TSE).

===

E. To insert the URLs at the correct position in the text, after
   extracting these URLs (e.g. of your copy/pasted HTML page, which
   usually does not include the URL shortcuts), you could e.g. create a
   keyboard macro or a TSE macro which each time extracts the first
   line of the block (other implementations of how to insert the URLs
   at the correct position in the text are of course possible):

   (e.g. copy the whole block of URL lines to the Microsoft Windows
   clipboard, paste that block in your text, insert a new line below
   the first line in the block, go down, highlight all lines of that
   paragraph below except the first line, and cut this again to the
   Microsoft Windows clipboard. That will leave only the first line,
   thus the topmost URL. Repeat this process until there are no more
   URLs).

   When you start from the top of the page, because the extraction
   order is also from top to bottom, this is a rather linear way to
   insert the URLs.

===

Example:

I used this last method successfully to extract and insert the URLs of
e.g. this webpage:

--- cut here: begin --------------------------------------------------
http://www.anova.org/software/
--- cut here: end ----------------------------------------------------

If you just copy/paste you will get something like

--- cut here: begin --------------------------------------------------
GNU/LINUX DISTROS
Linux Mint (F)
SimplyMEPIS (F)
--- cut here: end ----------------------------------------------------

===

After adding the missing URLs you get something like

--- cut here: begin --------------------------------------------------
GNU/LINUX DISTROS
Linux Mint (F) http://linuxmint.com/
SimplyMEPIS (F) http://www.mepis.org/
--- cut here: end ----------------------------------------------------

===

If you have other ideas or interesting implementations, please let them be known.

===

[kn, ho, th, 27-03-2008 11:11:02]

Why not just copy-and-paste into KompoZer?

--- cut here: begin --------------------------------------------------
http://www.kompozer.net
--- cut here: end ----------------------------------------------------

===

Internet: see also:

---

Computer: Editor: Text: TSE: Internet: Url: Source code: Get: How to automatically get the source code of any URL and edit it in TSE? [Microsoft Windows API URLDownloadTofFile]
http://goo.gl/ubFv6

===

<version>1.0.0.0.9</version>

----------------------------------------------------------------------
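
===

Supplement: a sketch of the matching steps outside TSE

---

The matching idea from the start of this recipe (take each extracted
URL with its link text, look for that text in your pasted block, and
insert the URL there) can also be tried in an external language, in the
spirit of Method D. Below is a minimal sketch in Python using only its
standard html.parser module. It is not the TSE macro above and not a
definitive implementation. The names LinkCollector, extract_links and
insert_urls are made up for this illustration, and it assumes the page
source is already available as a string (e.g. from the downloaded file).
It searches from top to bottom only, as in Method E, so links whose text
is not found (advertisements, ...) are simply skipped.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (url, link text) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (url, text) pairs, in page order
        self._href = None   # href of the <a> that is currently open
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(html):
    """Return all (url, link text) pairs found in the HTML source."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def insert_urls(pasted_text, links):
    """Insert each URL on a new line below the line showing its link text.

    Links are handled first to last and the search only moves forward,
    so a link whose text is not found below the previous hit is skipped.
    """
    lines = pasted_text.splitlines()
    start = 0
    for url, text in links:
        if not text:
            continue
        for i in range(start, len(lines)):
            if text in lines[i]:
                lines.insert(i + 1, url)
                start = i + 2
                break
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample source, for illustration only.
    sample_html = (
        '<li><a href="http://linuxmint.com/">Linux Mint</a> (F)</li>'
        '<li><a href="http://www.mepis.org/">SimplyMEPIS</a> (F)</li>'
    )
    sample_text = "GNU/LINUX DISTROS\nLinux Mint (F)\nSimplyMEPIS (F)"
    print(insert_urls(sample_text, extract_links(sample_html)))
```

Here each URL is put on its own line below the matching text, which
keeps the pasted lines themselves unchanged. Appending the URL to the
matching line instead would be an equally valid choice.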

This web page is created and maintained using the Semware TSE text editor