----------------------------------------------------------------------
--- Knud van Eeden --- 30 May 2009 - 04:51 pm ------------------------

Computer: Editor: TSE: File: Unicode: UTF-8: Structure: Operation: Check: How to check if a file starts with the 3 UTF-8 bytes?

[Backus Naur form / syntax diagram]

---

Steps: Overview:

 1. -E.g. create the following program:

--- cut here: begin --------------------------------------------------
FORWARD INTEGER PROC FNFileCheckIsUtf8BeginB( STRING s1 )
FORWARD PROC Main()

// --- MAIN --- //

STRING fileNameGS[255] = "" // global filename

PROC Main()
 IF NOT( AskFileName( "file: check: is: utf8: fileNameS = ", fileNameGS, _DEFAULT_, _EDIT_HISTORY_ ) ) RETURN() ENDIF
 Message( FNFileCheckIsUtf8BeginB( fileNameGS ) ) // gives e.g. TRUE if it is a UTF-8 file (thus starting with those 3 bytes)
END

<F12> Main()

// --- LIBRARY --- //

// library: file: check: is: utf8 <description>A UTF-8 file saved with a byte-order mark starts with the 3 bytes 0xEF, 0xBB, 0xBF</description> <version>1.0.0.0.17</version> (filenamemacro=checfiiu.s) [kn, ri, sa, 30-05-2009 16:30:24]
INTEGER PROC FNFileCheckIsUtf8BeginB( STRING fileNameS )
 // e.g. STRING fileNameGS[255] = "" // global filename
 // e.g.
 // e.g. PROC Main()
 // e.g.
 // e.g.  IF NOT( AskFileName( "file: check: is: utf8: fileNameS = ", fileNameGS, _DEFAULT_, _EDIT_HISTORY_ ) ) RETURN() ENDIF
 // e.g.
 // e.g.  Message( FNFileCheckIsUtf8BeginB( fileNameGS ) ) // gives e.g. TRUE if it is a UTF-8 file (thus starting with those 3 bytes)
 // e.g. END
 // e.g.
 // e.g. <F12> Main()
 //
 // Method:
 //
 // 1. -To test this, create any file in Windows Notepad, and save it there as UTF-8
 //     ('File' > 'Save as' > 'Encoding' > 'UTF-8')
 //
 #DEFINE byteMaxI 1 // constant: number of bytes to read per call
 //
 STRING s[ byteMaxI ] = ""
 //
 // open that file
 //
 INTEGER fileI = fopen( fileNameS )
 //
 // check if the file could be opened
 //
 IF ( fileI == -1 )
  Warn( Format( "could not find file ", fileNameS ) )
  RETURN( FALSE )
 ENDIF
 //
 // put the file pointer at the beginning of that file
 //
 IF ( fSeek( fileI, 0, _SEEK_BEGIN_ ) <> 0 )
  Warn( "could not place file pointer at beginning of the file" )
  fClose( fileI ) // do not leave the file open on failure
  RETURN( FALSE )
 ENDIF
 //
 // read the first byte at the beginning of the file, and check if it is byte '0xEF'
 //
 fRead( fileI, s, byteMaxI )
 IF( NOT( Asc( s ) == 239 ) ) // byte 0xEF
  fClose( fileI )
  RETURN( FALSE )
 ENDIF
 //
 // read the second byte of the file, and check if it is byte '0xBB'
 //
 fRead( fileI, s, byteMaxI )
 IF( NOT( Asc( s ) == 187 ) ) // byte 0xBB
  fClose( fileI )
  RETURN( FALSE )
 ENDIF
 //
 // read the third byte of the file, and check if it is byte '0xBF'
 //
 fRead( fileI, s, byteMaxI )
 IF( NOT( Asc( s ) == 191 ) ) // byte 0xBF
  fClose( fileI )
  RETURN( FALSE )
 ENDIF
 //
 fClose( fileI )
 //
 RETURN( TRUE )
END
--- cut here: end ----------------------------------------------------

 2. -Run the program

 3. -If the file starts with the 3 bytes 0xEF, 0xBB, 0xBF, it will show TRUE (1);
     otherwise it will show FALSE (0).

 4. -Tested successfully on Microsoft Windows XP Professional (service pack 3), running TSE v4.x

===

Book: see also:

===

Diagram: see also:

===

File: see also:

===

Help: see also:

===

Image: see also:
===

Internet: see also:

---

Many Windows programs (including Windows Notepad) add the bytes 0xEF,0xBB,0xBF at the start of any document saved as UTF-8. This is the UTF-8 encoding of the Unicode byte-order mark, and is commonly referred to as a UTF-8 BOM even though it is not relevant to byte order. The BOM can also appear if another encoding with a BOM is translated to UTF-8 without stripping it.
http://en.wikipedia.org/wiki/UTF-8

---

Self-synchronizing
Ken Thompson of the Plan 9 operating system group at Bell Labs, then made a crucial modification to the encoding to allow it to be self-synchronizing, meaning that it was not necessary to read from the beginning of the string in order to find code point boundaries.
http://en.wikipedia.org/wiki/UTF-8

---

What is the structure of UTF-8? - RFC3629
http://tools.ietf.org/html/rfc3629

---

How to convert UTF-8 in Java, PHP, Python, MySql, .NET, Perl?
http://www.unicodetools.com

---

How to convert UTF-8 in BBCBASIC for Windows?
http://www.knudvaneeden.com/tinyurl.php?urlKey=url000206

===

Podcast: see also:

===

Screencast: see also:

===

Table: see also:

===

Video: see also:

---
----------------------------------------------------------------------
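
---

Python: sketch of the same check

For comparison outside TSE, here is a minimal Python sketch of the check described in this tip, together with the BOM stripping mentioned in the Wikipedia note. It is an illustration only, not part of the original macro: the function names `starts_with_utf8_bom` and `strip_utf8_bom` are hypothetical, and it assumes the data is already in memory as bytes. `codecs.BOM_UTF8` is the standard library constant for the bytes 0xEF, 0xBB, 0xBF.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if data begins with the 3 bytes 0xEF, 0xBB, 0xBF."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM; other data is returned unchanged."""
    if starts_with_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical sample: what Notepad would write for the text "hello" saved as UTF-8.
sample = codecs.BOM_UTF8 + "hello".encode("utf-8")
print(starts_with_utf8_bom(sample))
print(strip_utf8_bom(sample).decode("utf-8"))
```

As with the macro above, a result of False only means the file has no BOM. A file without these 3 bytes can still be valid UTF-8.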