@node Character Set Handling, Locales, String and Array Utilities, Top
@c %MENU% Support for extended character sets
@chapter Character Set Handling

@ifnottex
@macro cal{text}
\text\
@end macro
@end ifnottex

Character sets used in the early days of computing had only six, seven,
or eight bits for each character: there was never a case where more than
eight bits (one byte) were used to represent a single character. The
limitations of this approach became more apparent as more people
grappled with non-Roman character sets, where not all the characters
that make up a language's character set can be represented by @math{2^8}
choices. This chapter shows the functionality that was added to the C
library to support multiple character sets.

@menu
* Extended Char Intro::              Introduction to Extended Characters.
* Charset Function Overview::        Overview about Character Handling
                                      Functions.
* Restartable multibyte conversion:: Restartable multibyte conversion
                                      Functions.
* Non-reentrant Conversion::         Non-reentrant Conversion Function.
* Generic Charset Conversion::       Generic Charset Conversion.
@end menu


@node Extended Char Intro
@section Introduction to Extended Characters

A variety of solutions are available to overcome the differences between
character sets with a 1:1 relation between bytes and characters and
character sets with ratios of 2:1 or 4:1. The remainder of this
section gives a few examples to help understand the design decisions
made while developing the functionality of the @w{C library}.

@cindex internal representation
A distinction we have to make right away is between internal and
external representation. @dfn{Internal representation} means the
representation used by a program while keeping the text in memory.
External representations are used when text is stored or transmitted
through some communication channel.
Examples of external
representations include files waiting in a directory to be
read and parsed.

Traditionally there has been no difference between the two representations.
It was equally comfortable and useful to use the same single-byte
representation internally and externally. This comfort level decreases
with more and larger character sets.

One of the problems to overcome with the internal representation is
handling text that is externally encoded using different character
sets. Assume a program that reads two texts and compares them using
some metric. The comparison can be usefully done only if the texts are
internally kept in a common format.

@cindex wide character
For such a common format (@math{=} character set) eight bits are certainly
no longer enough. So the smallest entity will have to grow: @dfn{wide
characters} will now be used. Instead of one byte per character, two or
four will be used. (Three are not good to address in memory and
more than four bytes seem not to be necessary.)

@cindex Unicode
@cindex ISO 10646
As shown in some other part of this manual,
@c !!! Ahem, wide char string functions are not yet covered -- drepper
a completely new family of functions has been created that can handle wide
character texts in memory. The most commonly used character sets for such
internal wide character representations are Unicode and @w{ISO 10646}
(also known as UCS for Universal Character Set). Unicode was originally
planned as a 16-bit character set, whereas @w{ISO 10646} was designed to
be a 31-bit large code space. The two standards are practically identical.
They have the same character repertoire and code table, but Unicode specifies
added semantics.
At the moment, only characters in the first @code{0x10000}
code positions (the so-called Basic Multilingual Plane, BMP) have been
assigned, but the assignment of more specialized characters outside this
16-bit space is already in progress. A number of encodings have been
defined for Unicode and @w{ISO 10646} characters:
@cindex UCS-2
@cindex UCS-4
@cindex UTF-8
@cindex UTF-16
UCS-2 is a 16-bit word that can only represent characters
from the BMP, UCS-4 is a 32-bit word that can represent any Unicode
and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where
ASCII characters are represented by ASCII bytes and non-ASCII characters
by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension
of UCS-2 in which pairs of certain UCS-2 words can be used to encode
non-BMP characters up to @code{0x10ffff}.

To represent wide characters the @code{char} type is not suitable. For
this reason the @w{ISO C} standard introduces a new type that is
designed to keep one character of a wide character string. To maintain
the similarity there is also a type corresponding to @code{int} for
those functions that take a single wide character.

@deftp {Data type} wchar_t
@standards{ISO, stddef.h}
This data type is used as the base type for wide character strings.
In other words, arrays of objects of this type are the equivalent of
@code{char[]} for multibyte character strings. The type is defined in
@file{stddef.h}.

The @w{ISO C90} standard, where @code{wchar_t} was introduced, does not
say anything specific about the representation. It only requires that
this type is capable of storing all elements of the basic character set.
Therefore it would be legitimate to define @code{wchar_t} as @code{char},
which might make sense for embedded systems.

But in @theglibc{} @code{wchar_t} is always 32 bits wide and, therefore,
capable of representing all UCS-4 values and, therefore, covering all of
@w{ISO 10646}. Some Unix systems define @code{wchar_t} as a 16-bit type
and thereby follow Unicode very strictly. This definition is perfectly
fine with the standard, but it also means that to represent all
characters from Unicode and @w{ISO 10646} one has to use UTF-16 surrogate
characters, which is in fact a multi-wide-character encoding. But
resorting to multi-wide-character encoding contradicts the purpose of the
@code{wchar_t} type.
@end deftp

@deftp {Data type} wint_t
@standards{ISO, wchar.h}
@code{wint_t} is a data type used for parameters and variables that
contain a single wide character. As the name suggests this type is the
equivalent of @code{int} when using the normal @code{char} strings. The
types @code{wchar_t} and @code{wint_t} often have the same
representation if they are 32 bits wide, but if @code{wchar_t} is
defined as @code{char} the type @code{wint_t} must be defined as
@code{int} due to the parameter promotion.

@pindex wchar.h
This type is defined in @file{wchar.h} and was introduced in
@w{Amendment 1} to @w{ISO C90}.
@end deftp

As for the @code{char} data type, macros are available for
specifying the minimum and maximum value representable in an object of
type @code{wchar_t}.

@deftypevr Macro wint_t WCHAR_MIN
@standards{ISO, wchar.h}
The macro @code{WCHAR_MIN} evaluates to the minimum value representable
by an object of type @code{wchar_t}.

This macro was introduced in @w{Amendment 1} to @w{ISO C90}.
@end deftypevr

@deftypevr Macro wint_t WCHAR_MAX
@standards{ISO, wchar.h}
The macro @code{WCHAR_MAX} evaluates to the maximum value representable
by an object of type @code{wchar_t}.

This macro was introduced in @w{Amendment 1} to @w{ISO C90}.
@end deftypevr

Another special wide character value is the equivalent to @code{EOF}.

@deftypevr Macro wint_t WEOF
@standards{ISO, wchar.h}
The macro @code{WEOF} evaluates to a constant expression of type
@code{wint_t} whose value is different from any member of the extended
character set.

@code{WEOF} need not be the same value as @code{EOF} and unlike
@code{EOF} it also need @emph{not} be negative. In other words, sloppy
code like

@smallexample
@{
  int c;
  @dots{}
  while ((c = getc (fp)) >= 0)
    @dots{}
@}
@end smallexample

@noindent
has to be rewritten to use @code{WEOF} explicitly when wide characters
are used:

@smallexample
@{
  wint_t c;
  @dots{}
  while ((c = getwc (fp)) != WEOF)
    @dots{}
@}
@end smallexample

@pindex wchar.h
This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is
defined in @file{wchar.h}.
@end deftypevr


These internal representations present problems when it comes to storage
and transmittal. Because each single wide character consists of more
than one byte, they are affected by byte-ordering. Thus, machines with
different endiannesses would see different values when accessing the same
data. This byte ordering concern also applies to communication protocols
that are all byte-based and therefore require the sender to
decide how to split the wide character into bytes. A last (but not least
important) point is that wide characters often require more storage space
than a customized byte-oriented character set.

@cindex multibyte character
@cindex EBCDIC
For all the above reasons, an external encoding that is different from
the internal encoding is often used if the latter is UCS-2 or UCS-4.
The external encoding is byte-based and can be chosen appropriately for
the environment and for the texts to be handled.
A variety of different
character sets can be used for this external encoding (information that
will not be exhaustively presented here; instead, a description of the
major groups will suffice). All of the ASCII-based character sets
fulfill one requirement: they are ``filesystem safe''. This means that
the character @code{'/'} is used in the encoding @emph{only} to
represent itself. Things are a bit different for character sets like
EBCDIC (Extended Binary Coded Decimal Interchange Code, a character set
family used by IBM), but if the operating system does not understand
EBCDIC directly the parameters to system calls have to be converted
first anyhow.

@itemize @bullet
@item
The simplest character sets are single-byte character sets. There can
be only up to 256 characters (for @w{8 bit} character sets), which is
not sufficient to cover all languages but might be sufficient to handle
a specific text. Handling of @w{8 bit} character sets is simple. This
is not true for other kinds presented later, and therefore, the
application one uses might require the use of @w{8 bit} character sets.

@cindex ISO 2022
@item
The @w{ISO 2022} standard defines a mechanism for extended character
sets where one character @emph{can} be represented by more than one
byte. This is achieved by associating a state with the text.
Characters that can be used to change the state can be embedded in the
text. Each byte in the text might have a different interpretation in each
state. The state might even influence whether a given byte stands for a
character on its own or whether it has to be combined with some more
bytes.

@cindex EUC
@cindex Shift_JIS
@cindex SJIS
In most uses of @w{ISO 2022} the defined character sets do not allow
state changes that cover more than the next character.
This has the
big advantage that whenever one can identify the beginning of the byte
sequence of a character one can interpret a text correctly. Examples of
character sets using this policy are the various EUC character sets
(used by Sun's operating systems: EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
or Shift_JIS (SJIS, a Japanese encoding).

But there are also character sets using a state that is valid for more
than one character and has to be changed by another byte sequence.
Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.

@item
@cindex ISO 6937
Early attempts to fix 8 bit character sets for other languages using the
Roman alphabet led to character sets like @w{ISO 6937}. Here bytes
representing characters like the acute accent do not produce output
themselves: one has to combine them with other characters to get the
desired result. For example, the byte sequence @code{0xc2 0x61}
(non-spacing acute accent, followed by lower-case `a') is needed to get
the ``small a with acute'' character. To get the acute accent character
on its own, one has to write @code{0xc2 0x20} (the non-spacing acute
followed by a space).

Character sets like @w{ISO 6937} are used in some embedded systems such
as teletex.

@item
@cindex UTF-8
Instead of converting the Unicode or @w{ISO 10646} text used internally,
it is often also sufficient to simply use an encoding different from
UCS-2/UCS-4. The Unicode and @w{ISO 10646} standards even specify such an
encoding: UTF-8. This encoding is able to represent all of the 31-bit
@w{ISO 10646} code space in a byte string of length one to six.

@cindex UTF-7
There were a few other attempts to encode @w{ISO 10646} such as UTF-7,
but UTF-8 is today the only encoding that should be used. In fact, with
any luck UTF-8 will soon be the only external encoding that has to be
supported.
It proves to be universally usable and its only disadvantage
is that it favors Roman languages by making the byte string
representation of other scripts (Cyrillic, Greek, Asian scripts) longer
than necessary if using a specific character set for these scripts.
Methods like the Unicode compression scheme can alleviate these
problems.
@end itemize

The question remaining is: how to select the character set or encoding
to use. The answer: you cannot decide about it yourself; it is decided
by the developers of the system or the majority of the users. Since the
goal is interoperability one has to use whatever the other people one
works with use. If there are no constraints, the selection is based on
the requirements the expected circle of users will have. In other words,
if a project is expected to be used in only, say, Russia it is fine to use
KOI8-R or a similar character set. But if at the same time people from,
say, Greece are participating one should use a character set that allows
all people to collaborate.

The most widely useful solution seems to be: go with the most general
character set, namely @w{ISO 10646}. Use UTF-8 as the external encoding
and problems about users not being able to use their own language
adequately are a thing of the past.

One final comment about the choice of the wide character representation
is necessary at this point. We have said above that the natural choice
is using Unicode or @w{ISO 10646}. This is not required, but at least
encouraged, by the @w{ISO C} standard. The standard defines at least a
macro @code{__STDC_ISO_10646__} that is only defined on systems where
the @code{wchar_t} type encodes @w{ISO 10646} characters. If this
symbol is not defined one should avoid making assumptions about the wide
character representation.
If the programmer uses only the functions
provided by the C library to handle wide character strings there should
be no compatibility problems with other systems.

@node Charset Function Overview
@section Overview about Character Handling Functions

A Unix @w{C library} contains three different sets of functions in two
families to handle character set conversion. One of the function families
(the most commonly used) is specified in the @w{ISO C90} standard and,
therefore, is portable even beyond the Unix world. Unfortunately this
family is the least useful one. These functions should be avoided
whenever possible, especially when developing libraries (as opposed to
applications).

The second family of functions was introduced in the early Unix standards
(XPG2) and is still part of the latest and greatest Unix standard:
@w{Unix 98}. It is also the most powerful and useful set of functions.
But we will start with the functions defined in @w{Amendment 1} to
@w{ISO C90}.

@node Restartable multibyte conversion
@section Restartable Multibyte Conversion Functions

The @w{ISO C} standard defines functions to convert strings from a
multibyte representation to wide character strings. There are a number
of peculiarities:

@itemize @bullet
@item
The character set assumed for the multibyte encoding is not specified
as an argument to the functions. Instead the character set specified by
the @code{LC_CTYPE} category of the current locale is used; see
@ref{Locale Categories}.

@item
The functions handling more than one character at a time require NUL
terminated strings as the argument (i.e., converting blocks of text
does not work unless one can add a NUL byte at an appropriate place).
@Theglibc{} contains some extensions to the standard that allow
specifying a size, but basically they also expect terminated strings.
@end itemize

Despite these limitations the @w{ISO C} functions can be used in many
contexts. In graphical user interfaces, for instance, it is not
uncommon to have functions that require text to be displayed in a wide
character string if the text is not simple ASCII. The text itself might
come from a file with translations and the user should decide about the
current locale, which determines the translation and therefore also the
external encoding used. In such a situation (and many others) the
functions described here are perfect. If more freedom while performing
the conversion is necessary take a look at the @code{iconv} functions
(@pxref{Generic Charset Conversion}).

@menu
* Selecting the Conversion::     Selecting the conversion and its properties.
* Keeping the state::            Representing the state of the conversion.
* Converting a Character::       Converting Single Characters.
* Converting Strings::           Converting Multibyte and Wide Character
                                  Strings.
* Multibyte Conversion Example:: A Complete Multibyte Conversion Example.
@end menu

@node Selecting the Conversion
@subsection Selecting the conversion and its properties

We already said above that the currently selected locale for the
@code{LC_CTYPE} category decides the conversion that is performed
by the functions we are about to describe. Each locale uses its own
character set (given as an argument to @code{localedef}) and this is the
one assumed as the external multibyte encoding. The wide character
set is always UCS-4 in @theglibc{}.

A characteristic of each multibyte character set is the maximum number
of bytes that can be necessary to represent one character. This
information is quite important when writing code that uses the
conversion functions (as shown in the examples below).
The @w{ISO C} standard defines two macros that provide this information.
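As a rough sketch of how the two macros, @code{MB_LEN_MAX} and
@code{MB_CUR_MAX} (both described below), divide the work: the first can
size a buffer at compile time, while the second has to be consulted at
run time. The helper names @code{static_buffer_size} and
@code{fits_one_character} are hypothetical and exist only for this
illustration.

```c
#include <assert.h>
#include <limits.h>   /* MB_LEN_MAX */
#include <stddef.h>   /* size_t */
#include <stdlib.h>   /* MB_CUR_MAX */

/* Hypothetical helper: the size of a buffer that is dimensioned at
   compile time and is large enough for one multibyte character in
   any supported locale.  */
size_t
static_buffer_size (void)
{
  char buf[MB_LEN_MAX];   /* A constant expression, so valid in ISO C90.  */

  /* MB_CUR_MAX is never greater than MB_LEN_MAX.  */
  assert ((size_t) MB_CUR_MAX <= sizeof buf);
  return sizeof buf;
}

/* Hypothetical helper: nonzero if BUFSIZE bytes can hold the longest
   multibyte character of the current locale.  MB_CUR_MAX need not be
   a constant, so this check happens at run time.  */
int
fits_one_character (size_t bufsize)
{
  return bufsize >= (size_t) MB_CUR_MAX;
}
```

A caller would typically declare the buffer once with @code{MB_LEN_MAX}
and use @code{MB_CUR_MAX} only to decide how many bytes to read into it.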


@deftypevr Macro int MB_LEN_MAX
@standards{ISO, limits.h}
@code{MB_LEN_MAX} specifies the maximum number of bytes in the multibyte
sequence for a single character in any of the supported locales. It is
a compile-time constant and is defined in @file{limits.h}.
@pindex limits.h
@end deftypevr

@deftypevr Macro int MB_CUR_MAX
@standards{ISO, stdlib.h}
@code{MB_CUR_MAX} expands into a positive integer expression that is the
maximum number of bytes in a multibyte character in the current locale.
The value is never greater than @code{MB_LEN_MAX}. Unlike
@code{MB_LEN_MAX} this macro need not be a compile-time constant, and in
@theglibc{} it is not.

@pindex stdlib.h
@code{MB_CUR_MAX} is defined in @file{stdlib.h}.
@end deftypevr

Two different macros are necessary since strictly @w{ISO C90} compilers
do not allow variable length array definitions, but still it is desirable
to avoid dynamic allocation. This incomplete piece of code shows the
problem:

@smallexample
@{
  char buf[MB_LEN_MAX];
  ssize_t len = 0;

  while (! feof (fp))
    @{
      fread (&buf[len], 1, MB_CUR_MAX - len, fp);
      /* @r{@dots{} process} buf */
      len -= used;
    @}
@}
@end smallexample

The code in the inner loop is expected to have always enough bytes in
the array @var{buf} to convert one multibyte character. The array
@var{buf} has to be sized statically since many compilers do not allow a
variable size. The @code{fread} call makes sure that @code{MB_CUR_MAX}
bytes are always available in @var{buf}. Note that it isn't
a problem if @code{MB_CUR_MAX} is not a compile-time constant.


@node Keeping the state
@subsection Representing the state of the conversion

@cindex stateful
In the introduction of this chapter it was said that certain character
sets use a @dfn{stateful} encoding.
That is, the encoded values depend
in some way on the previous bytes in the text.

Since the conversion functions allow converting a text in more than one
step we must have a way to pass this information from one call of the
functions to another.

@deftp {Data type} mbstate_t
@standards{ISO, wchar.h}
@cindex shift state
A variable of type @code{mbstate_t} can contain all the information
about the @dfn{shift state} needed from one call to a conversion
function to another.

@pindex wchar.h
@code{mbstate_t} is defined in @file{wchar.h}. It was introduced in
@w{Amendment 1} to @w{ISO C90}.
@end deftp

To use objects of type @code{mbstate_t} the programmer has to define such
objects (normally as local variables on the stack) and pass a pointer to
the object to the conversion functions. This way the conversion function
can update the object if the current multibyte character set is stateful.

There is no specific function or initializer to put the state object in
any specific state. The rules are that the object should always
represent the initial state before the first use, and this is achieved by
clearing the whole variable with code such as the following:

@smallexample
@{
  mbstate_t state;
  memset (&state, '\0', sizeof (state));
  /* @r{from now on @var{state} can be used.} */
  @dots{}
@}
@end smallexample

When using the conversion functions to generate output it is often
necessary to test whether the current state corresponds to the initial
state. This is necessary, for example, to decide whether to emit
escape sequences to set the state to the initial state at certain
sequence points. Communication protocols often require this.

@deftypefun int mbsinit (const mbstate_t *@var{ps})
@standards{ISO, wchar.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c ps is dereferenced once, unguarded.
@c This would call for @mtsrace:ps,
@c but since a single word-sized field is (atomically) accessed, any
@c race here would be harmless. Other functions that take an optional
@c mbstate_t* argument named ps are marked with @mtasurace:<func>/!ps,
@c to indicate that the function uses a static buffer if ps is NULL.
@c These could also have been marked with @mtsrace:ps, but we'll omit
@c that for brevity, for it's somewhat redundant with the @mtasurace.
The @code{mbsinit} function determines whether the state object pointed
to by @var{ps} is in the initial state. If @var{ps} is a null pointer or
the object is in the initial state the return value is nonzero. Otherwise
it is zero.

@pindex wchar.h
@code{mbsinit} was introduced in @w{Amendment 1} to @w{ISO C90} and is
declared in @file{wchar.h}.
@end deftypefun

Code using @code{mbsinit} often looks similar to this:

@c Fix the example to explicitly say how to generate the escape sequence
@c to restore the initial state.
@smallexample
@{
  mbstate_t state;
  memset (&state, '\0', sizeof (state));
  /* @r{Use @var{state}.} */
  @dots{}
  if (! mbsinit (&state))
    @{
      /* @r{Emit code to return to initial state.} */
      const wchar_t empty[] = L"";
      const wchar_t *srcp = empty;
      wcsrtombs (outbuf, &srcp, outbuflen, &state);
    @}
  @dots{}
@}
@end smallexample

The code to emit the escape sequence to get back to the initial state is
interesting. The @code{wcsrtombs} function can be used to determine the
necessary output code (@pxref{Converting Strings}). Please note that with
@theglibc{} it is not necessary to perform this extra action for the
conversion from multibyte text to wide character text since the wide
character encoding is not stateful. But there is nothing mentioned in
any standard that prohibits making @code{wchar_t} use a stateful
encoding.
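The two steps discussed in this section, clearing an @code{mbstate_t}
object and emitting the bytes that lead back to the initial state, can
be packaged as small functions. The following is only a sketch: the
names @code{reset_state} and @code{return_to_initial_state} are
hypothetical, and the caller is assumed to supply an output buffer that
is large enough for the shift sequence and the terminating null byte.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: put *PS into the initial conversion state by
   clearing the whole object.  */
void
reset_state (mbstate_t *ps)
{
  assert (ps != NULL);
  memset (ps, '\0', sizeof *ps);
}

/* Hypothetical helper: if *PS is not in the initial state, convert an
   empty wide character string so that wcsrtombs stores the bytes needed
   to get back to the initial state (followed by a null byte) in OUTBUF.
   Returns the number of bytes stored, not counting the null byte, or
   (size_t) -1 if wcsrtombs reports an error.  */
size_t
return_to_initial_state (char *outbuf, size_t outbuflen, mbstate_t *ps)
{
  const wchar_t empty[] = L"";
  const wchar_t *srcp = empty;

  assert (outbuf != NULL && ps != NULL);
  if (mbsinit (ps))
    return 0;                   /* Nothing to emit.  */
  return wcsrtombs (outbuf, &srcp, outbuflen, ps);
}
```

With a stateless encoding the second function always returns zero, so
calling it unconditionally at the end of an output sequence costs very
little.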

@node Converting a Character
@subsection Converting Single Characters

The most fundamental of the conversion functions are those dealing with
single characters. Please note that this does not always mean single
bytes. But since there is very often a subset of the multibyte
character set that consists of single byte sequences, there are
functions to help with converting bytes. Frequently, ASCII is a subset
of the multibyte character set. In such a scenario, each ASCII character
stands for itself, and all other characters have at least a first byte
that is beyond the range @math{0} to @math{127}.

@deftypefun wint_t btowc (int @var{c})
@standards{ISO, wchar.h}
@safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
@c Calls btowc_fct or __fct; reads from locale, and from the
@c get_gconv_fcts result multiple times. get_gconv_fcts calls
@c __wcsmbs_load_conv to initialize the ctype if it's null.
@c wcsmbs_load_conv takes a non-recursive wrlock before allocating
@c memory for the fcts structure, initializing it, and then storing it
@c in the locale object. The initialization involves dlopening and a
@c lot more.
The @code{btowc} function (``byte to wide character'') converts a valid
single byte character @var{c} in the initial shift state into the wide
character equivalent using the conversion rules from the currently
selected locale of the @code{LC_CTYPE} category.

If @code{(unsigned char) @var{c}} is not a valid single byte multibyte
character or if @var{c} is @code{EOF}, the function returns @code{WEOF}.

Please note the restriction of @var{c} being tested for validity only in
the initial shift state. No @code{mbstate_t} object is used from
which the state information is taken, and the function also does not use
any static state.

@pindex wchar.h
The @code{btowc} function was introduced in @w{Amendment 1} to @w{ISO C90}
and is declared in @file{wchar.h}.
@end deftypefun

Despite the limitation that the single byte value is always interpreted
in the initial state, this function is actually useful most of the time.
Most character sets are either entirely single-byte character sets or they
are extensions to ASCII. But then it is possible to write code like this
(not that this specific example is very useful):

@smallexample
wchar_t *
itow (unsigned long int val)
@{
  static wchar_t buf[30];
  wchar_t *wcp = &buf[29];
  *wcp = L'\0';
  while (val != 0)
    @{
      *--wcp = btowc ('0' + val % 10);
      val /= 10;
    @}
  if (wcp == &buf[29])
    *--wcp = L'0';
  return wcp;
@}
@end smallexample

Why is it necessary to use such a complicated implementation and not
simply cast @code{'0' + val % 10} to a wide character? The answer is
that there is no guarantee that one can perform this kind of arithmetic
on the characters of the character set used for @code{wchar_t}
representation. In other situations the bytes are not constant at
compile time and so the compiler cannot do the work. In situations like
this, using @code{btowc} is required.

@noindent
There is also a function for the conversion in the other direction.

@deftypefun int wctob (wint_t @var{c})
@standards{ISO, wchar.h}
@safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
The @code{wctob} function (``wide character to byte'') takes as the
parameter a valid wide character. If the multibyte representation for
this character in the initial state is exactly one byte long, the return
value of this function is this character. Otherwise the return value is
@code{EOF}.
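As a small illustration of how @code{wctob} and @code{btowc} relate, the
following sketch tests whether a wide character has a single-byte
representation and whether a byte survives the trip to a wide character
and back. The helper names @code{is_single_byte_char} and
@code{byte_round_trips} are hypothetical, and the result naturally
depends on the locale selected for @code{LC_CTYPE}.

```c
#include <assert.h>
#include <limits.h>   /* UCHAR_MAX */
#include <stdio.h>    /* EOF */
#include <wchar.h>

/* Hypothetical helper: nonzero if WC is represented by exactly one
   byte in the initial shift state of the current locale.  */
int
is_single_byte_char (wint_t wc)
{
  return wctob (wc) != EOF;
}

/* Hypothetical helper: nonzero if the byte C is a complete character
   in the initial shift state and converting it to a wide character
   and back yields the same byte.  */
int
byte_round_trips (int c)
{
  wint_t wc;

  assert (c >= 0 && c <= UCHAR_MAX);
  wc = btowc (c);
  return wc != WEOF && wctob (wc) == c;
}
```

Both helpers only consider the initial shift state, which is the same
restriction that applies to the two library functions themselves.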

@pindex wchar.h
@code{wctob} was introduced in @w{Amendment 1} to @w{ISO C90} and
is declared in @file{wchar.h}.
@end deftypefun

There are more general functions to convert single characters from
multibyte representation to wide characters and vice versa. These
functions pose no limit on the length of the multibyte representation
and they also do not require it to be in the initial state.

@deftypefun size_t mbrtowc (wchar_t *restrict @var{pwc}, const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps})
@standards{ISO, wchar.h}
@safety{@prelim{}@mtunsafe{@mtasurace{:mbrtowc/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
@cindex stateful
The @code{mbrtowc} function (``multibyte restartable to wide
character'') converts the next multibyte character in the string pointed
to by @var{s} into a wide character and stores it in the location
pointed to by @var{pwc}. The conversion is performed according
to the locale currently selected for the @code{LC_CTYPE} category. If
the conversion for the character set used in the locale requires a state,
the multibyte string is interpreted in the state represented by the
object pointed to by @var{ps}. If @var{ps} is a null pointer, a static,
internal state variable used only by the @code{mbrtowc} function is
used.

If the next multibyte character corresponds to the null wide character,
the return value of the function is @math{0} and the state object is
afterwards in the initial state. If the next @var{n} or fewer bytes
form a correct multibyte character, the return value is the number of
bytes starting from @var{s} that form the multibyte character. The
conversion state is updated according to the bytes consumed in the
conversion.
In both cases the wide character (either the @code{L'\0'}
or the one found in the conversion) is stored in the string pointed to
by @var{pwc} if @var{pwc} is not null.

If the first @var{n} bytes of the multibyte string possibly form a valid
multibyte character but there are more than @var{n} bytes needed to
complete it, the return value of the function is @code{(size_t) -2} and
no value is stored in @code{*@var{pwc}}. The conversion state is
updated and all @var{n} input bytes are consumed and should not be
submitted again. Please note that this can happen even if @var{n} has a
value greater than or equal to @code{MB_CUR_MAX} since the input might
contain redundant shift sequences.

If the first @var{n} bytes of the multibyte string cannot possibly form
a valid multibyte character, no value is stored, the global variable
@code{errno} is set to the value @code{EILSEQ}, and the function returns
@code{(size_t) -1}. The conversion state is afterwards undefined.

As specified, the @code{mbrtowc} function could deal with multibyte
sequences which contain embedded null bytes (which happens in Unicode
encodings such as UTF-16), but @theglibc{} does not support such
multibyte encodings. When encountering a null input byte, the function
will either return zero, or return @code{(size_t) -1} and report an
@code{EILSEQ} error. The @code{iconv} function can be used for
converting between arbitrary encodings. @xref{Generic Conversion Interface}.

@pindex wchar.h
@code{mbrtowc} was introduced in @w{Amendment 1} to @w{ISO C90} and
is declared in @file{wchar.h}.
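
The four kinds of return value described above can be told apart as in
the following sketch. The enumeration @code{decode_status} and the
function @code{decode_one} are hypothetical names introduced only for
this illustration; the caller is assumed to pass a state object that it
initialized itself.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

enum decode_status
{
  DECODE_CHAR,        /* A complete, non-null character was converted.  */
  DECODE_NUL,         /* The null wide character was converted.  */
  DECODE_INCOMPLETE,  /* All N bytes were consumed; more input is needed.  */
  DECODE_INVALID      /* Invalid sequence; the state is now undefined.  */
};

/* Hypothetical wrapper: convert the next multibyte character in the
   N bytes starting at S, store it in *WC, and store in *CONSUMED the
   number of input bytes that must not be submitted again.  */
enum decode_status
decode_one (const char *s, size_t n, mbstate_t *ps,
            wchar_t *wc, size_t *consumed)
{
  size_t nbytes;

  assert (s != NULL && ps != NULL && wc != NULL && consumed != NULL);
  nbytes = mbrtowc (wc, s, n, ps);
  if (nbytes == (size_t) -1)
    {
      *consumed = 0;
      return DECODE_INVALID;
    }
  if (nbytes == (size_t) -2)
    {
      *consumed = n;
      return DECODE_INCOMPLETE;
    }
  *consumed = nbytes;   /* Zero when the null wide character was found.  */
  return nbytes == 0 ? DECODE_NUL : DECODE_CHAR;
}
```

The order of the tests matters: both special values are very large
@code{size_t} values, so they have to be checked before the result is
used as a byte count.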
691@end deftypefun 692 693A function that copies a multibyte string into a wide character string 694while at the same time converting all lowercase characters into 695uppercase could look like this: 696 697@smallexample 698@include mbstouwcs.c.texi 699@end smallexample 700 701In the inner loop, a single wide character is stored in @code{wc}, and 702the number of consumed bytes is stored in the variable @code{nbytes}. 703If the conversion is successful, the uppercase variant of the wide 704character is stored in the @code{result} array and the pointer to the 705input string and the number of available bytes is adjusted. If the 706@code{mbrtowc} function returns zero, the null input byte has not been 707converted, so it must be stored explicitly in the result. 708 709The above code uses the fact that there can never be more wide 710characters in the converted result than there are bytes in the multibyte 711input string. This method yields a pessimistic guess about the size of 712the result, and if many wide character strings have to be constructed 713this way or if the strings are long, the extra memory required to be 714allocated because the input string contains multibyte characters might 715be significant. The allocated memory block can be resized to the 716correct size before returning it, but a better solution might be to 717allocate just the right amount of space for the result right away. 718Unfortunately there is no function to compute the length of the wide 719character string directly from the multibyte string. There is, however, 720a function that does part of the work. 
721 722@deftypefun size_t mbrlen (const char *restrict @var{s}, size_t @var{n}, mbstate_t *@var{ps}) 723@standards{ISO, wchar.h} 724@safety{@prelim{}@mtunsafe{@mtasurace{:mbrlen/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} 725The @code{mbrlen} function (``multibyte restartable length'') computes 726the number of at most @var{n} bytes starting at @var{s}, which form the 727next valid and complete multibyte character. 728 729If the next multibyte character corresponds to the NUL wide character, 730the return value is @math{0}. If the next @var{n} bytes form a valid 731multibyte character, the number of bytes belonging to this multibyte 732character byte sequence is returned. 733 734If the first @var{n} bytes possibly form a valid multibyte 735character but the character is incomplete, the return value is 736@code{(size_t) -2}. Otherwise the multibyte character sequence is invalid 737and the return value is @code{(size_t) -1}. 738 739The multibyte sequence is interpreted in the state represented by the 740object pointed to by @var{ps}. If @var{ps} is a null pointer, a state 741object local to @code{mbrlen} is used. 742 743@pindex wchar.h 744@code{mbrlen} was introduced in @w{Amendment 1} to @w{ISO C90} and 745is declared in @file{wchar.h}. 746@end deftypefun 747 748The attentive reader now will note that @code{mbrlen} can be implemented 749as 750 751@smallexample 752mbrtowc (NULL, s, n, ps != NULL ? ps : &internal) 753@end smallexample 754 755This is true and in fact is mentioned in the official specification. 756How can this function be used to determine the length of the wide 757character string created from a multibyte character string? 
It is not 758directly usable, but we can define a function @code{mbslen} using it: 759 760@smallexample 761size_t 762mbslen (const char *s) 763@{ 764 mbstate_t state; 765 size_t result = 0; 766 size_t nbytes; 767 memset (&state, '\0', sizeof (state)); 768 while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0) 769 @{ 770 if (nbytes >= (size_t) -2) 771 /* @r{Something is wrong.} */ 772 return (size_t) -1; 773 s += nbytes; 774 ++result; 775 @} 776 return result; 777@} 778@end smallexample 779 780This function simply calls @code{mbrlen} for each multibyte character 781in the string and counts the number of function calls. Please note that 782we here use @code{MB_LEN_MAX} as the size argument in the @code{mbrlen} 783call. This is acceptable since a) this value is larger than the length of 784the longest multibyte character sequence and b) we know that the string 785@var{s} ends with a NUL byte, which cannot be part of any other multibyte 786character sequence but the one representing the NUL wide character. 787Therefore, the @code{mbrlen} function will never read invalid memory. 788 789Now that this function is available (just to make this clear, this 790function is @emph{not} part of @theglibc{}) we can compute the 791number of wide characters required to store the converted multibyte 792character string @var{s} using 793 794@smallexample 795wcs_bytes = (mbslen (s) + 1) * sizeof (wchar_t); 796@end smallexample 797 798Please note that the @code{mbslen} function is quite inefficient. The 799implementation of @code{mbstouwcs} with @code{mbslen} would have to 800perform the conversion of the multibyte character input string twice, and 801this conversion might be quite expensive. So it is necessary to think 802about the consequences of using the easier but imprecise method before 803doing the work twice. 
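
If only the count is needed, the standard @code{mbsrtowcs} function
(@pxref{Converting Strings}) can probably replace the hand-written loop,
since it accepts a null destination pointer.  The following is a minimal
sketch under that assumption; the name @code{mbs_count} is hypothetical
and, like @code{mbslen}, not part of @theglibc{}.

@smallexample
#include <string.h>
#include <wchar.h>

/* @r{Hypothetical helper: count the wide characters in @code{s}.} */
size_t
mbs_count (const char *s)
@{
  mbstate_t state;
  const char *p = s;
  memset (&state, '\0', sizeof (state));
  /* @r{With a null destination the length argument is not checked.} */
  return mbsrtowcs (NULL, &p, 0, &state);
@}
@end smallexample

The return value is @code{(size_t) -1} for an invalid input sequence.
This variant still converts the input once just for counting, so the
cost consideration above applies to it as well.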
804 805@deftypefun size_t wcrtomb (char *restrict @var{s}, wchar_t @var{wc}, mbstate_t *restrict @var{ps}) 806@standards{ISO, wchar.h} 807@safety{@prelim{}@mtunsafe{@mtasurace{:wcrtomb/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} 808@c wcrtomb uses a static, non-thread-local unguarded state variable when 809@c PS is NULL. When a state is passed in, and it's not used 810@c concurrently in other threads, this function behaves safely as long 811@c as gconv modules don't bring MT safety issues of their own. 812@c Attempting to load gconv modules or to build conversion chains in 813@c signal handlers may encounter gconv databases or caches in a 814@c partially-updated state, and asynchronous cancellation may leave them 815@c in such states, besides leaking the lock that guards them. 816@c get_gconv_fcts ok 817@c wcsmbs_load_conv ok 818@c norm_add_slashes ok 819@c wcsmbs_getfct ok 820@c gconv_find_transform ok 821@c gconv_read_conf (libc_once) 822@c gconv_lookup_cache ok 823@c find_module_idx ok 824@c find_module ok 825@c gconv_find_shlib (ok) 826@c ->init_fct (assumed ok) 827@c gconv_get_builtin_trans ok 828@c gconv_release_step ok 829@c do_lookup_alias ok 830@c find_derivation ok 831@c derivation_lookup ok 832@c increment_counter ok 833@c gconv_find_shlib ok 834@c step->init_fct (assumed ok) 835@c gen_steps ok 836@c gconv_find_shlib ok 837@c dlopen (presumed ok) 838@c dlsym (presumed ok) 839@c step->init_fct (assumed ok) 840@c step->end_fct (assumed ok) 841@c gconv_get_builtin_trans ok 842@c gconv_release_step ok 843@c add_derivation ok 844@c gconv_close_transform ok 845@c gconv_release_step ok 846@c step->end_fct (assumed ok) 847@c gconv_release_shlib ok 848@c dlclose (presumed ok) 849@c gconv_release_cache ok 850@c ->tomb->__fct (assumed ok) 851The @code{wcrtomb} function (``wide character restartable to 852multibyte'') converts a single wide character into a multibyte string 
corresponding to that wide character.

If @var{s} is a null pointer, the function resets the state stored in
the object pointed to by @var{ps} (or the internal @code{mbstate_t}
object) to the initial state.  This can also be achieved by a call like
this:

@smallexample
wcrtomb (temp_buf, L'\0', ps)
@end smallexample

@noindent
since, if @var{s} is a null pointer, @code{wcrtomb} performs as if it
writes into an internal buffer, which is guaranteed to be large enough.

If @var{wc} is the NUL wide character, @code{wcrtomb} emits, if
necessary, a shift sequence to get the state @var{ps} into the initial
state followed by a single NUL byte, which is stored in the string
@var{s}.

Otherwise a byte sequence (possibly including shift sequences) is written
into the string @var{s}.  This only happens if @var{wc} is a valid wide
character (i.e., it has a multibyte representation in the character set
selected by the locale for the @code{LC_CTYPE} category).  If @var{wc}
is not a valid wide character, nothing is stored in the string @var{s},
@code{errno} is set to @code{EILSEQ}, the conversion state in @var{ps}
is undefined and the return value is @code{(size_t) -1}.

If no error occurred the function returns the number of bytes stored in
the string @var{s}.  This includes all bytes representing shift
sequences.

One word about the interface of the function: there is no parameter
specifying the length of the array @var{s}.  Instead the function
assumes that there are at least @code{MB_CUR_MAX} bytes available since
this is the maximum length of any byte sequence representing a single
character.  So the caller has to make sure that there is enough space
available, otherwise buffer overruns can occur.

@pindex wchar.h
@code{wcrtomb} was introduced in @w{Amendment 1} to @w{ISO C90} and is
declared in @file{wchar.h}.
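
The buffer requirement mentioned above can be met with an array of
@code{MB_LEN_MAX} bytes, a compile-time constant that is never smaller
than @code{MB_CUR_MAX}.  The following is only a sketch; the helper name
@code{put_wc} is hypothetical and not part of the library.

@smallexample
#include <limits.h>
#include <stdio.h>
#include <wchar.h>

/* @r{Hypothetical helper: write one wide character to @code{fp}.} */
int
put_wc (wchar_t wc, FILE *fp, mbstate_t *ps)
@{
  char buf[MB_LEN_MAX];
  size_t n = wcrtomb (buf, wc, ps);

  /* @r{Nothing was stored if the character cannot be converted.} */
  if (n == (size_t) -1)
    return -1;
  return fwrite (buf, 1, n, fp) == n ? 0 : -1;
@}
@end smallexample

Because the array is sized for the worst case, no length check is
needed before the call.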
@end deftypefun

Using @code{wcrtomb} is as easy as using @code{mbrtowc}.  The following
example appends a wide character string to a multibyte character string.
Again, the code is not really useful (or correct); it is simply here to
demonstrate the use and some problems.

@smallexample
char *
mbscatwcs (char *s, size_t len, const wchar_t *ws)
@{
  mbstate_t state;
  /* @r{Find the end of the existing string.} */
  char *wp = strchr (s, '\0');
  len -= wp - s;
  memset (&state, '\0', sizeof (state));
  do
    @{
      size_t nbytes;
      if (len < MB_CUR_MAX)
        @{
          /* @r{We cannot guarantee that the next}
             @r{character fits into the buffer, so}
             @r{return an error.} */
          errno = E2BIG;
          return NULL;
        @}
      nbytes = wcrtomb (wp, *ws, &state);
      if (nbytes == (size_t) -1)
        /* @r{Error in the conversion.} */
        return NULL;
      len -= nbytes;
      wp += nbytes;
    @}
  while (*ws++ != L'\0');
  return s;
@}
@end smallexample

First the function has to find the end of the string currently in the
array @var{s}.  The @code{strchr} call does this very efficiently since a
requirement for multibyte character representations is that the NUL byte
is never used except to represent itself (and in this context, the end
of the string).

After initializing the state object the loop is entered where the first
task is to make sure there is enough room in the array @var{s}.  We
abort if there are not at least @code{MB_CUR_MAX} bytes available.  This
is not always optimal but we have no other choice.  We might have less
than @code{MB_CUR_MAX} bytes available but the next multibyte character
might also be only one byte long.  At the time the @code{wcrtomb} call
returns it is too late to decide whether the buffer was large enough.  If
this solution is unsuitable, there is a very slow but more accurate
solution.

@smallexample
  @dots{}
  if (len < MB_CUR_MAX)
    @{
      mbstate_t temp_state;
      char temp_buf[MB_LEN_MAX];
      size_t needed;
      memcpy (&temp_state, &state, sizeof (state));
      needed = wcrtomb (temp_buf, *ws, &temp_state);
      if (needed == (size_t) -1)
        /* @r{Error in the conversion.} */
        return NULL;
      if (needed > len)
        @{
          /* @r{The next character does not fit into the}
             @r{buffer, so return an error.} */
          errno = E2BIG;
          return NULL;
        @}
    @}
  @dots{}
@end smallexample

Here we first perform the conversion that might overflow the buffer into
a temporary array, so that we are afterwards in the position to make an
exact decision about the buffer size.  Please note that a null pointer
cannot be used as the destination for this purpose: as described above,
@code{wcrtomb} called with a null @var{s} ignores @var{wc} and only
resets the state.  The most unusual thing about this piece of code
certainly is the duplication of the conversion state object, but if a
change of the state is necessary to emit the next multibyte character,
we want to have the same shift state change performed in the real
conversion.  Therefore, we have to preserve the initial shift state
information.

There are certainly many more and even better solutions to this problem.
This example is only provided for educational purposes.

@node Converting Strings
@subsection Converting Multibyte and Wide Character Strings

The functions described in the previous section only convert a single
character at a time.  Most operations to be performed in real-world
programs involve strings and therefore the @w{ISO C} standard also
defines conversions on entire strings.  However, the defined set of
functions is quite limited; therefore, @theglibc{} contains a few
extensions that can help in some important situations.
991 992@deftypefun size_t mbsrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps}) 993@standards{ISO, wchar.h} 994@safety{@prelim{}@mtunsafe{@mtasurace{:mbsrtowcs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} 995The @code{mbsrtowcs} function (``multibyte string restartable to wide 996character string'') converts the NUL-terminated multibyte character 997string at @code{*@var{src}} into an equivalent wide character string, 998including the NUL wide character at the end. The conversion is started 999using the state information from the object pointed to by @var{ps} or 1000from an internal object of @code{mbsrtowcs} if @var{ps} is a null 1001pointer. Before returning, the state object is updated to match the state 1002after the last converted character. The state is the initial state if the 1003terminating NUL byte is reached and converted. 1004 1005If @var{dst} is not a null pointer, the result is stored in the array 1006pointed to by @var{dst}; otherwise, the conversion result is not 1007available since it is stored in an internal buffer. 1008 1009If @var{len} wide characters are stored in the array @var{dst} before 1010reaching the end of the input string, the conversion stops and @var{len} 1011is returned. If @var{dst} is a null pointer, @var{len} is never checked. 1012 1013Another reason for a premature return from the function call is if the 1014input string contains an invalid multibyte sequence. In this case the 1015global variable @code{errno} is set to @code{EILSEQ} and the function 1016returns @code{(size_t) -1}. 1017 1018@c XXX The ISO C9x draft seems to have a problem here. It says that PS 1019@c is not updated if DST is NULL. This is not said straightforward and 1020@c none of the other functions is described like this. 
@c It would make sense
@c to define the function this way but I don't think it is meant like this.

In all other cases the function returns the number of wide characters
converted during this call.  If @var{dst} is not null, @code{mbsrtowcs}
stores in the pointer pointed to by @var{src} either a null pointer (if
the NUL byte in the input string was reached) or the address of the byte
following the last converted multibyte character.

Like @code{mbstowcs} the @var{dst} parameter may be a null pointer and
the function can be used to count the number of wide characters that
would be required.

@pindex wchar.h
@code{mbsrtowcs} was introduced in @w{Amendment 1} to @w{ISO C90} and is
declared in @file{wchar.h}.
@end deftypefun

The definition of the @code{mbsrtowcs} function has one important
limitation.  The requirement that the source string @code{*@var{src}}
be NUL-terminated causes problems if one wants to convert buffers with
text.  A buffer is not normally a collection of NUL-terminated strings
but instead a continuous collection of lines, separated by newline
characters.  Now assume that a function to convert one line from a
buffer is needed.  Since the line is not NUL-terminated, the source
pointer cannot directly point into the unmodified text buffer.  This
means that either one inserts the NUL byte at the appropriate place for
the duration of the @code{mbsrtowcs} function call (which is not doable
for a read-only buffer or in a multi-threaded application) or one copies
the line into an extra buffer where it can be terminated by a NUL byte.
Note that it is not in general possible to limit the number of
characters to convert by setting the parameter @var{len} to any specific
value.  Since it is not known how many bytes each multibyte character
sequence is in length, one can only guess.
1054 1055@cindex stateful 1056There is still a problem with the method of NUL-terminating a line right 1057after the newline character, which could lead to very strange results. 1058As said in the description of the @code{mbsrtowcs} function above, the 1059conversion state is guaranteed to be in the initial shift state after 1060processing the NUL byte at the end of the input string. But this NUL 1061byte is not really part of the text (i.e., the conversion state after 1062the newline in the original text could be something different than the 1063initial shift state and therefore the first character of the next line 1064is encoded using this state). But the state in question is never 1065accessible to the user since the conversion stops after the NUL byte 1066(which resets the state). Most stateful character sets in use today 1067require that the shift state after a newline be the initial state--but 1068this is not a strict guarantee. Therefore, simply NUL-terminating a 1069piece of a running text is not always an adequate solution and, 1070therefore, should never be used in generally used code. 1071 1072The generic conversion interface (@pxref{Generic Charset Conversion}) 1073does not have this limitation (it simply works on buffers, not 1074strings), and @theglibc{} contains a set of functions that take 1075additional parameters specifying the maximal number of bytes that are 1076consumed from the input string. This way the problem of 1077@code{mbsrtowcs}'s example above could be solved by determining the line 1078length and passing this length to the function. 
1079 1080@deftypefun size_t wcsrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps}) 1081@standards{ISO, wchar.h} 1082@safety{@prelim{}@mtunsafe{@mtasurace{:wcsrtombs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} 1083The @code{wcsrtombs} function (``wide character string restartable to 1084multibyte string'') converts the NUL-terminated wide character string at 1085@code{*@var{src}} into an equivalent multibyte character string and 1086stores the result in the array pointed to by @var{dst}. The NUL wide 1087character is also converted. The conversion starts in the state 1088described in the object pointed to by @var{ps} or by a state object 1089local to @code{wcsrtombs} in case @var{ps} is a null pointer. If 1090@var{dst} is a null pointer, the conversion is performed as usual but the 1091result is not available. If all characters of the input string were 1092successfully converted and if @var{dst} is not a null pointer, the 1093pointer pointed to by @var{src} gets assigned a null pointer. 1094 1095If one of the wide characters in the input string has no valid multibyte 1096character equivalent, the conversion stops early, sets the global 1097variable @code{errno} to @code{EILSEQ}, and returns @code{(size_t) -1}. 1098 1099Another reason for a premature stop is if @var{dst} is not a null 1100pointer and the next converted character would require more than 1101@var{len} bytes in total to the array @var{dst}. In this case (and if 1102@var{dst} is not a null pointer) the pointer pointed to by @var{src} is 1103assigned a value pointing to the wide character right after the last one 1104successfully converted. 
1105 1106Except in the case of an encoding error the return value of the 1107@code{wcsrtombs} function is the number of bytes in all the multibyte 1108character sequences which were or would have been (if @var{dst} was 1109not a null) stored in @var{dst}. Before returning, the state in the 1110object pointed to by @var{ps} (or the internal object in case @var{ps} 1111is a null pointer) is updated to reflect the state after the last 1112conversion. The state is the initial shift state in case the 1113terminating NUL wide character was converted. 1114 1115@pindex wchar.h 1116The @code{wcsrtombs} function was introduced in @w{Amendment 1} to 1117@w{ISO C90} and is declared in @file{wchar.h}. 1118@end deftypefun 1119 1120The restriction mentioned above for the @code{mbsrtowcs} function applies 1121here also. There is no possibility of directly controlling the number of 1122input characters. One has to place the NUL wide character at the correct 1123place or control the consumed input indirectly via the available output 1124array size (the @var{len} parameter). 1125 1126@deftypefun size_t mbsnrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{nmc}, size_t @var{len}, mbstate_t *restrict @var{ps}) 1127@standards{GNU, wchar.h} 1128@safety{@prelim{}@mtunsafe{@mtasurace{:mbsnrtowcs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} 1129The @code{mbsnrtowcs} function is very similar to the @code{mbsrtowcs} 1130function. All the parameters are the same except for @var{nmc}, which is 1131new. The return value is the same as for @code{mbsrtowcs}. 1132 1133This new parameter specifies how many bytes at most can be used from the 1134multibyte character string. In other words, the multibyte character 1135string @code{*@var{src}} need not be NUL-terminated. But if a NUL byte 1136is found within the @var{nmc} first bytes of the string, the conversion 1137stops there. 

Like @code{mbstowcs} the @var{dst} parameter may be a null pointer and
the function can be used to count the number of wide characters that
would be required.

This function is a GNU extension.  It is meant to work around the
problems mentioned above.  Now it is possible to convert a buffer with
multibyte character text piece by piece without having to care about
inserting NUL bytes and the effect of NUL bytes on the conversion state.
@end deftypefun

A function to convert a multibyte string into a wide character string
and display it could be written like this (this is not a really useful
example):

@smallexample
void
showmbs (const char *src, FILE *fp)
@{
  mbstate_t state;
  int cnt = 0;
  memset (&state, '\0', sizeof (state));
  while (1)
    @{
      wchar_t linebuf[100];
      const char *endp = strchr (src, '\n');
      size_t n;

      /* @r{Exit if there is no more line.} */
      if (endp == NULL)
        break;

      n = mbsnrtowcs (linebuf, &src, endp - src, 99, &state);
      /* @r{Give up on an invalid multibyte sequence.} */
      if (n == (size_t) -1)
        break;
      linebuf[n] = L'\0';
      fprintf (fp, "line %d: \"%ls\"\n", ++cnt, linebuf);

      /* @r{Continue with the text after the newline.} */
      src = endp + 1;
    @}
@}
@end smallexample

There is no problem with the state after a call to @code{mbsnrtowcs}.
Since we don't insert characters in the strings that were not in there
right from the beginning and we use @var{state} only for the conversion
of the given buffer, there is no problem with altering the state.

@deftypefun size_t wcsnrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{nwc}, size_t @var{len}, mbstate_t *restrict @var{ps})
@standards{GNU, wchar.h}
@safety{@prelim{}@mtunsafe{@mtasurace{:wcsnrtombs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
The @code{wcsnrtombs} function implements the conversion from wide
character strings to multibyte character strings.
It is similar to
@code{wcsrtombs} but, just like @code{mbsnrtowcs}, it takes an extra
parameter, which specifies the length of the input string.

No more than @var{nwc} wide characters from the input string
@code{*@var{src}} are converted.  If the input string contains a NUL
wide character in the first @var{nwc} characters, the conversion stops at
this place.

The @code{wcsnrtombs} function is a GNU extension and just like
@code{mbsnrtowcs} helps in situations where no NUL-terminated input
strings are available.
@end deftypefun


@node Multibyte Conversion Example
@subsection A Complete Multibyte Conversion Example

The example programs given in the last sections are only brief and do
not contain all the error checking, etc.  Presented here is a complete
and documented example.  It features the @code{mbrtowc} function but it
should be easy to derive versions using the other functions.

@smallexample
int
file_mbsrtowcs (int input, int output)
@{
  char buffer[BUFSIZ];
  mbstate_t state;
  int eof = 0;

  /* @r{Initialize the state.} */
  memset (&state, '\0', sizeof (state));

  while (!eof)
    @{
      ssize_t nread;
      ssize_t nwrite;
      size_t filled;
      char *inp = buffer;
      wchar_t outbuf[BUFSIZ];
      wchar_t *outp = outbuf;

      /* @r{Fill up the buffer from the input file.} */
      nread = read (input, buffer, BUFSIZ);
      if (nread < 0)
        @{
          perror ("read");
          return 0;
        @}
      /* @r{If we reach end of file, make a note to read no more.} */
      if (nread == 0)
        eof = 1;

      /* @r{@code{filled} is now the number of bytes in @code{buffer}.} */
      filled = nread;

      /* @r{Convert those bytes to wide characters--as many as we can.} */
      while (filled > 0)
        @{
          size_t thislen = mbrtowc (outp, inp, filled, &state);
          /* @r{Stop at an invalid character.} */
          if (thislen == (size_t) -1)
            @{
              error (0, 0, "invalid multibyte character");
              return 0;
            @}
          /* @r{The buffer ends in the middle of a character.}
             @r{The bytes seen so far are recorded in @code{state},}
             @r{so the next read continues where we stopped.} */
          if (thislen == (size_t) -2)
            break;
          /* @r{We want to handle embedded NUL bytes}
             @r{but the return value is 0.  Correct this.} */
          if (thislen == 0)
            thislen = 1;
          /* @r{Advance past this character.} */
          inp += thislen;
          filled -= thislen;
          ++outp;
        @}

      /* @r{Write the wide characters we just made.} */
      nwrite = write (output, outbuf,
                      (outp - outbuf) * sizeof (wchar_t));
      if (nwrite < 0)
        @{
          perror ("write");
          return 0;
        @}

      /* @r{At end of file the state must be the initial state again;}
         @r{otherwise the input was cut short.} */
      if (eof && !mbsinit (&state))
        @{
          error (0, 0, "invalid multibyte character");
          return 0;
        @}
    @}

  return 1;
@}
@end smallexample


@node Non-reentrant Conversion
@section Non-reentrant Conversion Function

The functions described in the previous section are defined in
@w{Amendment 1} to @w{ISO C90}, but the original @w{ISO C90} standard
also contained functions for character set conversion.  The reason that
these original functions are not described first is that they are almost
entirely useless.

The problem is that all the conversion functions described in the
original @w{ISO C90} use a local state.  Using a local state implies that
multiple conversions at the same time (not only when using threads)
cannot be done, and that you cannot first convert single characters and
then strings since you cannot tell the conversion functions which state
to use.
1306 1307These original functions are therefore usable only in a very limited set 1308of situations. One must complete converting the entire string before 1309starting a new one, and each string/text must be converted with the same 1310function (there is no problem with the library itself; it is guaranteed 1311that no library function changes the state of any of these functions). 1312@strong{For the above reasons it is highly requested that the functions 1313described in the previous section be used in place of non-reentrant 1314conversion functions.} 1315 1316@menu 1317* Non-reentrant Character Conversion:: Non-reentrant Conversion of Single 1318 Characters. 1319* Non-reentrant String Conversion:: Non-reentrant Conversion of Strings. 1320* Shift State:: States in Non-reentrant Functions. 1321@end menu 1322 1323@node Non-reentrant Character Conversion 1324@subsection Non-reentrant Conversion of Single Characters 1325 1326@deftypefun int mbtowc (wchar_t *restrict @var{result}, const char *restrict @var{string}, size_t @var{size}) 1327@standards{ISO, stdlib.h} 1328@safety{@prelim{}@mtunsafe{@mtasurace{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} 1329The @code{mbtowc} (``multibyte to wide character'') function when called 1330with non-null @var{string} converts the first multibyte character 1331beginning at @var{string} to its corresponding wide character code. It 1332stores the result in @code{*@var{result}}. 1333 1334@code{mbtowc} never examines more than @var{size} bytes. (The idea is 1335to supply for @var{size} the number of bytes of data you have in hand.) 1336 1337@code{mbtowc} with non-null @var{string} distinguishes three 1338possibilities: the first @var{size} bytes at @var{string} start with 1339valid multibyte characters, they start with an invalid byte sequence or 1340just part of a character, or @var{string} points to an empty string (a 1341null character). 
1342 1343For a valid multibyte character, @code{mbtowc} converts it to a wide 1344character and stores that in @code{*@var{result}}, and returns the 1345number of bytes in that character (always at least @math{1} and never 1346more than @var{size}). 1347 1348For an invalid byte sequence, @code{mbtowc} returns @math{-1}. For an 1349empty string, it returns @math{0}, also storing @code{'\0'} in 1350@code{*@var{result}}. 1351 1352If the multibyte character code uses shift characters, then 1353@code{mbtowc} maintains and updates a shift state as it scans. If you 1354call @code{mbtowc} with a null pointer for @var{string}, that 1355initializes the shift state to its standard initial value. It also 1356returns nonzero if the multibyte character code in use actually has a 1357shift state. @xref{Shift State}. 1358@end deftypefun 1359 1360@deftypefun int wctomb (char *@var{string}, wchar_t @var{wchar}) 1361@standards{ISO, stdlib.h} 1362@safety{@prelim{}@mtunsafe{@mtasurace{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} 1363The @code{wctomb} (``wide character to multibyte'') function converts 1364the wide character code @var{wchar} to its corresponding multibyte 1365character sequence, and stores the result in bytes starting at 1366@var{string}. At most @code{MB_CUR_MAX} characters are stored. 1367 1368@code{wctomb} with non-null @var{string} distinguishes three 1369possibilities for @var{wchar}: a valid wide character code (one that can 1370be translated to a multibyte character), an invalid code, and 1371@code{L'\0'}. 1372 1373Given a valid code, @code{wctomb} converts it to a multibyte character, 1374storing the bytes starting at @var{string}. Then it returns the number 1375of bytes in that character (always at least @math{1} and never more 1376than @code{MB_CUR_MAX}). 1377 1378If @var{wchar} is an invalid wide character code, @code{wctomb} returns 1379@math{-1}. 
If @var{wchar} is @code{L'\0'}, it returns @code{0}, also 1380storing @code{'\0'} in @code{*@var{string}}. 1381 1382If the multibyte character code uses shift characters, then 1383@code{wctomb} maintains and updates a shift state as it scans. If you 1384call @code{wctomb} with a null pointer for @var{string}, that 1385initializes the shift state to its standard initial value. It also 1386returns nonzero if the multibyte character code in use actually has a 1387shift state. @xref{Shift State}. 1388 1389Calling this function with a @var{wchar} argument of zero when 1390@var{string} is not null has the side-effect of reinitializing the 1391stored shift state @emph{as well as} storing the multibyte character 1392@code{'\0'} and returning @math{0}. 1393@end deftypefun 1394 1395Similar to @code{mbrlen} there is also a non-reentrant function that 1396computes the length of a multibyte character. It can be defined in 1397terms of @code{mbtowc}. 1398 1399@deftypefun int mblen (const char *@var{string}, size_t @var{size}) 1400@standards{ISO, stdlib.h} 1401@safety{@prelim{}@mtunsafe{@mtasurace{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} 1402The @code{mblen} function with a non-null @var{string} argument returns 1403the number of bytes that make up the multibyte character beginning at 1404@var{string}, never examining more than @var{size} bytes. (The idea is 1405to supply for @var{size} the number of bytes of data you have in hand.) 1406 1407The return value of @code{mblen} distinguishes three possibilities: the 1408first @var{size} bytes at @var{string} start with valid multibyte 1409characters, they start with an invalid byte sequence or just part of a 1410character, or @var{string} points to an empty string (a null character). 1411 1412For a valid multibyte character, @code{mblen} returns the number of 1413bytes in that character (always at least @code{1} and never more than 1414@var{size}). 
For an invalid byte sequence, @code{mblen} returns
@math{-1}. For an empty string, it returns @math{0}.

If the multibyte character code uses shift characters, then @code{mblen}
maintains and updates a shift state as it scans. If you call
@code{mblen} with a null pointer for @var{string}, that initializes the
shift state to its standard initial value. It also returns a nonzero
value if the multibyte character code in use actually has a shift state.
@xref{Shift State}.

@pindex stdlib.h
The function @code{mblen} is declared in @file{stdlib.h}.
@end deftypefun


@node Non-reentrant String Conversion
@subsection Non-reentrant Conversion of Strings

For convenience the @w{ISO C90} standard also defines functions to
convert entire strings instead of single characters. These functions
suffer from the same problems as their reentrant counterparts from
@w{Amendment 1} to @w{ISO C90}; see @ref{Converting Strings}.

@deftypefun size_t mbstowcs (wchar_t *@var{wstring}, const char *@var{string}, size_t @var{size})
@standards{ISO, stdlib.h}
@safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
@c Odd... Although this was supposed to be non-reentrant, the internal
@c state is not a static buffer, but an automatic variable.
The @code{mbstowcs} (``multibyte string to wide character string'')
function converts the null-terminated string of multibyte characters
@var{string} to an array of wide character codes, storing not more than
@var{size} wide characters into the array beginning at @var{wstring}.
The terminating null character counts towards the size, so if @var{size}
is less than the actual number of wide characters resulting from
@var{string}, no terminating null character is stored.

The conversion of characters from @var{string} begins in the initial
shift state.

If an invalid multibyte character sequence is found, the @code{mbstowcs}
function returns a value of @math{-1}. Otherwise, it returns the number
of wide characters stored in the array @var{wstring}. This number does
not include the terminating null character, which is present if the
number is less than @var{size}.

Here is an example showing how to convert a string of multibyte
characters, allocating enough space for the result.

@smallexample
wchar_t *
mbstowcs_alloc (const char *string)
@{
  size_t size = strlen (string) + 1;
  wchar_t *buf = xmalloc (size * sizeof (wchar_t));

  size = mbstowcs (buf, string, size);
  if (size == (size_t) -1)
    @{
      /* @r{Do not leak the buffer on invalid input.} */
      free (buf);
      return NULL;
    @}
  buf = xreallocarray (buf, size + 1, sizeof *buf);
  return buf;
@}
@end smallexample

If @var{wstring} is a null pointer then no output is written and the
conversion proceeds as above, and the result is returned. In practice
such behavior is useful for calculating the exact number of wide
characters required to convert @var{string}. This behavior of
accepting a null pointer for @var{wstring} is an @w{XPG4.2} extension
that is not specified in @w{ISO C} and is optional in @w{POSIX}.
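
The length-only call can be combined with a second, converting call.
The following is only a sketch: the helper name
@code{mbs_to_wcs_exact} is hypothetical and not part of the library, it
uses plain @code{malloc} rather than @code{xmalloc}, and it assumes an
implementation (such as @theglibc{}) that accepts a null pointer for
@var{wstring}.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper, not a library function: convert STRING to a
   newly allocated wide character string of exactly the needed size.
   Returns NULL on invalid input or when memory is exhausted.  */
wchar_t *
mbs_to_wcs_exact (const char *string)
{
  /* First pass: a null destination only computes the length.  This
     relies on the XPG4.2 extension described above.  */
  size_t n = mbstowcs (NULL, string, 0);
  if (n == (size_t) -1)
    return NULL;

  wchar_t *buf = malloc ((n + 1) * sizeof *buf);
  if (buf == NULL)
    return NULL;

  /* Second pass: convert, leaving room for the terminating null.  */
  if (mbstowcs (buf, string, n + 1) == (size_t) -1)
    {
      free (buf);
      return NULL;
    }
  return buf;
}
```

The caller owns the returned buffer and releases it with @code{free}.
Because both passes start in the initial shift state, the two calls
see the same input in the same way.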
@end deftypefun

@deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size})
@standards{ISO, stdlib.h}
@safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
The @code{wcstombs} (``wide character string to multibyte string'')
function converts the null-terminated wide character array @var{wstring}
into a string containing multibyte characters, storing not more than
@var{size} bytes starting at @var{string}, followed by a terminating
null character if there is room. The conversion of characters begins in
the initial shift state.

The terminating null character counts towards the size, so if @var{size}
is less than or equal to the number of bytes needed in @var{wstring}, no
terminating null character is stored.

If a code that does not correspond to a valid multibyte character is
found, the @code{wcstombs} function returns a value of @math{-1}.
Otherwise, the return value is the number of bytes stored in the array
@var{string}. This number does not include the terminating null character,
which is present if the number is less than @var{size}.
@end deftypefun

@node Shift State
@subsection States in Non-reentrant Functions

In some multibyte character codes, the @emph{meaning} of any particular
byte sequence is not fixed; it depends on what other sequences have come
earlier in the same string. Typically there are just a few sequences that
can change the meaning of other sequences; these few are called
@dfn{shift sequences} and we say that they set the @dfn{shift state} for
other sequences that follow.

To illustrate shift state and shift sequences, suppose we decide that
the sequence @code{0200} (just one byte) enters Japanese mode, in which
pairs of bytes in the range from @code{0240} to @code{0377} are single
characters, while @code{0201} enters Latin-1 mode, in which single bytes
in the range from @code{0240} to @code{0377} are characters, and
interpreted according to the ISO Latin-1 character set. This is a
multibyte code that has two alternative shift states (``Japanese mode''
and ``Latin-1 mode''), and two shift sequences that specify particular
shift states.

When the multibyte character code in use has shift states, then
@code{mblen}, @code{mbtowc}, and @code{wctomb} must maintain and update
the current shift state as they scan the string. To make this work
properly, you must follow these rules:

@itemize @bullet
@item
Before starting to scan a string, call the function with a null pointer
for the multibyte character address---for example, @code{mblen (NULL,
0)}. This initializes the shift state to its standard initial value.

@item
Scan the string one character at a time, in order. Do not ``back up''
and rescan characters already scanned, and do not intersperse the
processing of different strings.
@end itemize

Here is an example of using @code{mblen} following these rules:

@smallexample
void
scan_string (char *s)
@{
  int length = strlen (s);

  /* @r{Initialize shift state.} */
  mblen (NULL, 0);

  while (1)
    @{
      int thischar = mblen (s, length);
      /* @r{Deal with end of string and invalid characters.} */
      if (thischar == 0)
        break;
      if (thischar == -1)
        @{
          error (0, 0, "invalid multibyte character");
          break;
        @}
      /* @r{Advance past this character.} */
      s += thischar;
      length -= thischar;
    @}
@}
@end smallexample

The functions @code{mblen}, @code{mbtowc} and @code{wctomb} are not
reentrant when using a multibyte code that uses a shift state. However,
no other library functions call these functions, so you don't have to
worry that the shift state will be changed mysteriously.


@node Generic Charset Conversion
@section Generic Charset Conversion

The conversion functions mentioned so far in this chapter all had in
common that they operate on character sets that are not directly
specified by the functions. The multibyte encoding used is specified by
the currently selected locale for the @code{LC_CTYPE} category. The
wide character set is fixed by the implementation (in the case of @theglibc{}
it is always UCS-4 encoded @w{ISO 10646}).

This of course causes several problems when it comes to general character
conversion:

@itemize @bullet
@item
For every conversion where neither the source nor the destination
character set is the character set of the locale for the @code{LC_CTYPE}
category, one has to change the @code{LC_CTYPE} locale using
@code{setlocale}.

Changing the @code{LC_CTYPE} locale introduces major problems for the rest
of the program since several more functions (e.g., the character
classification functions, @pxref{Classification of Characters}) use the
@code{LC_CTYPE} category.

@item
Parallel conversions to and from different character sets are not
possible since the @code{LC_CTYPE} selection is global and shared by all
threads.

@item
If neither the source nor the destination character set is the character
set used for @code{wchar_t} representation, there is at least a two-step
process necessary to convert a text using the functions above. One would
have to select the source character set as the multibyte encoding,
convert the text into a @code{wchar_t} text, select the destination
character set as the multibyte encoding, and convert the wide character
text to the multibyte (@math{=} destination) character set.

Even if this is possible (which is not guaranteed) it is very tedious
work. Plus it suffers from the other two raised points even more due to
the constant changing of the locale.
@end itemize

The XPG2 standard defines a completely new set of functions, which has
none of these limitations. They are not at all coupled to the selected
locales, and they have no constraints on the character sets selected for
source and destination. Only the set of available conversions limits
them. The standard does not specify that any conversion at all must be
available. Such availability is a measure of the quality of the
implementation.

In the following text, first the interface to @code{iconv} and then the
conversion function will be described. Comparisons with other
implementations will show what obstacles stand in the way of portable
applications.
Finally, the implementation is described in so far as might
interest the advanced user who wants to extend conversion capabilities.

@menu
* Generic Conversion Interface::    Generic Character Set Conversion Interface.
* iconv Examples::                  A complete @code{iconv} example.
* Other iconv Implementations::     Some Details about other @code{iconv}
                                     Implementations.
* glibc iconv Implementation::      The @code{iconv} Implementation in the GNU C
                                     library.
@end menu

@node Generic Conversion Interface
@subsection Generic Character Set Conversion Interface

This set of functions follows the traditional cycle of using a resource:
open--use--close. The interface consists of three functions, each of
which implements one step.

Before the interfaces are described it is necessary to introduce a
data type. Just like other open--use--close interfaces the functions
introduced here work using handles and the @file{iconv.h} header
defines a special type for the handles used.

@deftp {Data Type} iconv_t
@standards{XPG2, iconv.h}
This data type is an abstract type defined in @file{iconv.h}. The user
must not assume anything about the definition of this type; it must be
completely opaque.

Objects of this type can be assigned handles for the conversions using
the @code{iconv} functions. The objects themselves need not be freed, but
the conversions that the handles stand for must be.
@end deftp

@noindent
The first step is the function to create a handle.

@deftypefun iconv_t iconv_open (const char *@var{tocode}, const char *@var{fromcode})
@standards{XPG2, iconv.h}
@safety{@prelim{}@mtsafe{@mtslocale{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
@c Calls malloc if tocode and/or fromcode are too big for alloca.
@c Calls
@c strip and upstr on both, then gconv_open. strip and upstr call
@c isalnum_l and toupper_l with the C locale. gconv_open may MT-safely
@c tokenize toset, replace unspecified codesets with the current locale
@c (possibly two different accesses), and finally it calls
@c gconv_find_transform and initializes the gconv_t result with all the
@c steps in the conversion sequence, running each one's initializer,
@c destructing and releasing them all if anything fails.

The @code{iconv_open} function has to be used before starting a
conversion. The two parameters this function takes determine the
source and destination character set for the conversion, and if the
implementation has the possibility to perform such a conversion, the
function returns a handle.

If the wanted conversion is not available, the @code{iconv_open} function
returns @code{(iconv_t) -1}. In this case the global variable
@code{errno} can have the following values:

@table @code
@item EMFILE
The process already has @code{OPEN_MAX} file descriptors open.
@item ENFILE
The system limit of open files is reached.
@item ENOMEM
Not enough memory to carry out the operation.
@item EINVAL
The conversion from @var{fromcode} to @var{tocode} is not supported.
@end table

It is not possible to use the same descriptor in different threads to
perform independent conversions. The data structures associated
with the descriptor include information about the conversion state.
This must not be messed up by using it in different conversions.

An @code{iconv} descriptor is like a file descriptor as for every use a
new descriptor must be created. The descriptor does not stand for all
of the conversions from @var{fromcode} to @var{tocode}.

The @glibcadj{} implementation of @code{iconv_open} has one
significant extension compared to other implementations.
To ease the extension
of the set of available conversions, the implementation allows storing
the necessary files with data and code in an arbitrary number of
directories. How this extension must be written will be explained below
(@pxref{glibc iconv Implementation}). Here it is only important to say
that all directories mentioned in the @code{GCONV_PATH} environment
variable are considered only if they contain a file @file{gconv-modules}.
These directories need not necessarily be created by the system
administrator. In fact, this extension is introduced to help users
write and use their own, new conversions. Of course, this does not
work for security reasons in SUID binaries; in this case only the system
directory is considered and this normally is
@file{@var{prefix}/lib/gconv}. The @code{GCONV_PATH} environment
variable is examined exactly once at the first call of the
@code{iconv_open} function. Later modifications of the variable have no
effect.

@pindex iconv.h
The @code{iconv_open} function was introduced early in the X/Open
Portability Guide, @w{version 2}. It is supported by all commercial
Unices as it is required for the Unix branding. However, the quality and
completeness of the implementation varies widely. The @code{iconv_open}
function is declared in @file{iconv.h}.
@end deftypefun

The @code{iconv} implementation can associate large data structures with
the handle returned by @code{iconv_open}. Therefore, it is crucial to
free all the resources once all conversions are carried out and the
conversion is not needed anymore.
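
The open and close steps can be sketched as a small probe that reports
whether a conversion exists. This is an illustration only: the helper
name @code{conversion_available} and its return convention are
assumptions made for the example, not part of the @code{iconv}
interface.

```c
#include <errno.h>
#include <iconv.h>

/* Hypothetical helper, not a library function: return 1 if the
   conversion from FROMCODE to TOCODE can be opened, 0 if it is not
   supported, and -1 if some other error occurred.  */
int
conversion_available (const char *tocode, const char *fromcode)
{
  iconv_t cd = iconv_open (tocode, fromcode);
  if (cd == (iconv_t) -1)
    /* EINVAL means the conversion itself is unavailable; any other
       value (ENOMEM, EMFILE, ENFILE) is a resource problem.  */
    return errno == EINVAL ? 0 : -1;

  /* The handle is not needed any further, so release everything the
     implementation associated with it.  */
  if (iconv_close (cd) != 0)
    return -1;
  return 1;
}
```

Note that the destination name comes first in the @code{iconv_open}
call, and that every successful @code{iconv_open} is paired with
exactly one @code{iconv_close}.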

@deftypefun int iconv_close (iconv_t @var{cd})
@standards{XPG2, iconv.h}
@safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{}}}
@c Calls gconv_close to destruct and release each of the conversion
@c steps, release the gconv_t object, then call gconv_close_transform.
@c Access to the gconv_t object is not guarded, but calling iconv_close
@c concurrently with any other use is undefined.

The @code{iconv_close} function frees all resources associated with the
handle @var{cd}, which must have been returned by a successful call to
the @code{iconv_open} function.

If the function call was successful the return value is @math{0}.
Otherwise it is @math{-1} and @code{errno} is set appropriately.
Defined errors are:

@table @code
@item EBADF
The conversion descriptor is invalid.
@end table

@pindex iconv.h
The @code{iconv_close} function was introduced together with the rest
of the @code{iconv} functions in XPG2 and is declared in @file{iconv.h}.
@end deftypefun

The standard defines only one actual conversion function. This has,
therefore, the most general interface: it allows conversion from one
buffer to another. Conversion from a file to a buffer, vice versa, or
even file to file can be implemented on top of it.

@deftypefun size_t iconv (iconv_t @var{cd}, char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
@standards{XPG2, iconv.h}
@safety{@prelim{}@mtsafe{@mtsrace{:cd}}@assafe{}@acunsafe{@acucorrupt{}}}
@c Without guarding access to the iconv_t object pointed to by cd, call
@c the conversion function to convert inbuf or flush the internal
@c conversion state.
@cindex stateful
The @code{iconv} function converts the text in the input buffer
according to the rules associated with the descriptor @var{cd} and
stores the result in the output buffer. It is possible to call the
function for the same text several times in a row since for stateful
character sets the necessary state information is kept in the data
structures associated with the descriptor.

The input buffer is specified by @code{*@var{inbuf}} and it contains
@code{*@var{inbytesleft}} bytes. The extra indirection is necessary for
communicating the used input back to the caller (see below). It is
important to note that the buffer pointer is of type @code{char} and the
length is measured in bytes even if the input text is encoded in wide
characters.

The output buffer is specified in a similar way. @code{*@var{outbuf}}
points to the beginning of the buffer with at least
@code{*@var{outbytesleft}} bytes of room for the result. The buffer
pointer again is of type @code{char} and the length is measured in
bytes. If @var{outbuf} or @code{*@var{outbuf}} is a null pointer, the
conversion is performed but no output is available.

If @var{inbuf} is a null pointer, the @code{iconv} function performs the
necessary action to put the state of the conversion into the initial
state. This is obviously a no-op for non-stateful encodings, but if the
encoding has a state, such a function call might put some byte sequences
in the output buffer, which perform the necessary state changes. The
next call with @var{inbuf} not being a null pointer then simply goes on
from the initial state. It is important that the programmer never makes
any assumption as to whether the conversion has to deal with states.
Even if the input and output character sets are not stateful, the
implementation might still have to keep states.
This is due to the
implementation chosen for @theglibc{} as it is described below.
Therefore an @code{iconv} call to reset the state should always be
performed if some protocol requires this for the output text.

The conversion stops for one of three reasons. The first is that all
characters from the input buffer are converted. This actually can mean
two things: either all bytes from the input buffer are consumed or
there are some bytes at the end of the buffer that possibly can form a
complete character but the input is incomplete. The second reason for a
stop is that the output buffer is full. And the third reason is that
the input contains invalid characters.

In all of these cases the buffer pointers after the last successful
conversion, for the input and output buffers, are stored in
@code{*@var{inbuf}} and @code{*@var{outbuf}}, and the available room in
each buffer is stored in @code{*@var{inbytesleft}} and
@code{*@var{outbytesleft}}.

Since the character sets selected in the @code{iconv_open} call can be
almost arbitrary, there can be situations where the input buffer contains
valid characters, which have no identical representation in the output
character set. The behavior in this situation is undefined. The
@emph{current} behavior of @theglibc{} in this situation is to
return with an error immediately. This certainly is not the most
desirable solution; therefore, future versions will provide better ones,
but they are not yet finished.

If all input from the input buffer is successfully converted and stored
in the output buffer, the function returns the number of non-reversible
conversions performed. In all other cases the return value is
@code{(size_t) -1} and @code{errno} is set appropriately. In such cases
the value pointed to by @var{inbytesleft} is nonzero.

@table @code
@item EILSEQ
The conversion stopped because of an invalid byte sequence in the input.
After the call, @code{*@var{inbuf}} points at the first byte of the
invalid byte sequence.

@item E2BIG
The conversion stopped because it ran out of space in the output buffer.

@item EINVAL
The conversion stopped because of an incomplete byte sequence at the end
of the input buffer.

@item EBADF
The @var{cd} argument is invalid.
@end table

@pindex iconv.h
The @code{iconv} function was introduced in the XPG2 standard and is
declared in the @file{iconv.h} header.
@end deftypefun

The definition of the @code{iconv} function is quite good overall. It
provides quite flexible functionality. The only problems lie in the
boundary cases, which are incomplete byte sequences at the end of the
input buffer and invalid input. A third problem, which is not really
a design problem, is the way conversions are selected. The standard
does not say anything about the legitimate names or about a minimal set
of available conversions. We will see how this negatively impacts other
implementations, as demonstrated below.

@node iconv Examples
@subsection A complete @code{iconv} example

The example below features a solution for a common problem. Given that
one knows the internal encoding used by the system for @code{wchar_t}
strings, one often is in the position to read text from a file and store
it in wide character buffers. One can do this using @code{mbsrtowcs},
but then we run into the problems discussed above.

@smallexample
int
file2wcs (int fd, const char *charset, wchar_t *outbuf, size_t avail)
@{
  char inbuf[BUFSIZ];
  size_t insize = 0;
  char *wrptr = (char *) outbuf;
  int result = 0;
  iconv_t cd;

  cd = iconv_open ("WCHAR_T", charset);
  if (cd == (iconv_t) -1)
    @{
      /* @r{Something went wrong.} */
      if (errno == EINVAL)
        error (0, 0, "conversion from '%s' to wchar_t not available",
               charset);
      else
        perror ("iconv_open");

      /* @r{Terminate the output string.} */
      *outbuf = L'\0';

      return -1;
    @}

  while (avail > 0)
    @{
      ssize_t nread;
      size_t nconv;
      char *inptr = inbuf;

      /* @r{Read more input.} */
      nread = read (fd, inbuf + insize, sizeof (inbuf) - insize);
      if (nread == -1)
        @{
          /* @r{The read itself failed. Stop here.} */
          result = -1;
          break;
        @}
      if (nread == 0)
        @{
          /* @r{When we come here the file is completely read.}
             @r{This still could mean there are some unused}
             @r{characters in the @code{inbuf}. Put them back.} */
          if (lseek (fd, -(off_t) insize, SEEK_CUR) == -1)
            result = -1;

          /* @r{Now write out the byte sequence to get into the}
             @r{initial state if this is necessary.} */
          iconv (cd, NULL, NULL, &wrptr, &avail);

          break;
        @}
      insize += nread;

      /* @r{Do the conversion.} */
      nconv = iconv (cd, &inptr, &insize, &wrptr, &avail);
      if (nconv == (size_t) -1)
        @{
          /* @r{Not everything went right. It might only be}
             @r{an unfinished byte sequence at the end of the}
             @r{buffer. Or it is a real problem.} */
          if (errno == EINVAL)
            /* @r{This is harmless. Simply move the unused}
               @r{bytes to the beginning of the buffer so that}
               @r{they can be used in the next round.} */
            memmove (inbuf, inptr, insize);
          else
            @{
              /* @r{It is a real problem. Maybe we ran out of}
                 @r{space in the output buffer or we have invalid}
                 @r{input.
In any case move the file pointer back to}
                 @r{the position of the last processed byte.} */
              lseek (fd, -(off_t) insize, SEEK_CUR);
              result = -1;
              break;
            @}
        @}
    @}

  /* @r{Terminate the output string.} */
  if (avail >= sizeof (wchar_t))
    *((wchar_t *) wrptr) = L'\0';

  if (iconv_close (cd) != 0)
    perror ("iconv_close");

  return (wchar_t *) wrptr - outbuf;
@}
@end smallexample

@cindex stateful
This example shows the most important aspects of using the @code{iconv}
functions. It shows how successive calls to @code{iconv} can be used to
convert large amounts of text. The user does not have to care about
stateful encodings as the functions take care of everything.

An interesting point is the case where @code{iconv} returns an error and
@code{errno} is set to @code{EINVAL}. This is not really an error in the
transformation. It can happen whenever the input character set contains
byte sequences of more than one byte for some character and texts are not
processed in one piece. In this case there is a chance that a multibyte
sequence is cut. The caller can then simply read the remainder of the
text and feed the offending bytes together with new characters from the
input to @code{iconv} and continue the work. The internal state kept in
the descriptor is @emph{not} unspecified after such an event as is the
case with the conversion functions from the @w{ISO C} standard.

The example also shows the problem of using wide character strings with
@code{iconv}. As explained in the description of the @code{iconv}
function above, the function always takes a pointer to a @code{char}
array and the available space is measured in bytes. In the example, the
output buffer is a wide character buffer; therefore, we use a local
variable @var{wrptr} of type @code{char *}, which is used in the
@code{iconv} calls.

This looks rather innocent but can lead to problems on platforms that
have tight restrictions on alignment. Therefore the caller of @code{iconv}
has to make sure that the pointers passed are suitable for access of
characters from the appropriate character set. Since, in the
above case, the input parameter to the function is a @code{wchar_t}
pointer, this is the case (unless the user violates alignment when
computing the parameter). But in other situations, especially when
writing generic functions where one does not know what type of character
set one uses and, therefore, treats text as a sequence of bytes, it might
become tricky.

@node Other iconv Implementations
@subsection Some Details about other @code{iconv} Implementations

This is not really the place to discuss the @code{iconv} implementation
of other systems but it is necessary to know a bit about them to write
portable programs. The above mentioned problems with the specification
of the @code{iconv} functions can lead to portability issues.

The first thing to notice is that, due to the large number of character
sets in use, it is certainly not practical to encode the conversions
directly in the C library. Therefore, the conversion information must
come from files outside the C library. This is usually done in one or
both of the following ways:

@itemize @bullet
@item
The C library contains a set of generic conversion functions that can
read the needed conversion tables and other information from data files.
These files get loaded when necessary.

This solution is problematic as it requires a great deal of effort to
apply to all character sets (potentially an infinite set). The
differences in the structure of the different character sets are so large
that many different variants of the table-processing functions must be
developed.
In addition, the generic nature of these functions makes them
slower than specifically implemented functions.

@item
The C library only contains a framework that can dynamically load
object files and execute the conversion functions contained therein.

This solution provides much more flexibility. The C library itself
contains only very little code and therefore reduces the general memory
footprint. Also, with a documented interface between the C library and
the loadable modules it is possible for third parties to extend the set
of available conversion modules. A drawback of this solution is that
dynamic loading must be available.
@end itemize

Some implementations in commercial Unices implement a mixture of these
possibilities; the majority implement only the second solution. Using
loadable modules moves the code out of the library itself and keeps
the door open for extensions and improvements, but this design is also
limiting on some platforms since not many platforms support dynamic
loading in statically linked programs. On platforms without this
capability it is therefore not possible to use this interface in
statically linked programs. @Theglibc{} has, on ELF platforms, no
problems with dynamic loading in these situations; therefore, this
point is moot. The danger is that one gets acquainted with this
situation and forgets about the restrictions on other systems.

A second thing to know about other @code{iconv} implementations is that
the number of available conversions is often very limited. Some
implementations provide, in the standard release (not special
international or developer releases), at most 100 to 200 conversion
possibilities. This does not mean 200 different character sets are
supported; for example, conversions from one character set to a set of 10
others might count as 10 conversions.
Together with the other direction
this makes 20 conversion possibilities used up by one character set. One
can imagine the thin coverage these platforms provide. Some Unix vendors
even provide only a handful of conversions, which renders them useless for
almost all uses.

This directly leads to a third and probably the most problematic point.
With the way the @code{iconv} conversion functions are implemented on all
known Unix systems, the availability of the conversion functions from
character set @math{@cal{A}} to @math{@cal{B}} and the conversion from
@math{@cal{B}} to @math{@cal{C}} does @emph{not} imply that the
conversion from @math{@cal{A}} to @math{@cal{C}} is available.

This might not seem unreasonable and problematic at first, but it is
quite a big problem as one will notice shortly after hitting it. To show
the problem, assume we are writing a program that has to convert from
@math{@cal{A}} to @math{@cal{C}}. A call like

@smallexample
cd = iconv_open ("@math{@cal{C}}", "@math{@cal{A}}");
@end smallexample

@noindent
fails according to the assumption above. But what does the program
do now? The conversion is necessary; therefore, simply giving up is not
an option.

This is a nuisance. The @code{iconv} function should take care of this.
But how should the program proceed from here on? If it first converts
to character set @math{@cal{B}}, the two @code{iconv_open}
calls

@smallexample
cd1 = iconv_open ("@math{@cal{B}}", "@math{@cal{A}}");
@end smallexample

@noindent
and

@smallexample
cd2 = iconv_open ("@math{@cal{C}}", "@math{@cal{B}}");
@end smallexample

@noindent
will succeed, but how to find @math{@cal{B}}?

Unfortunately, the answer is: there is no general solution. On some
systems guessing might help.
On those systems most character sets can
be converted to and from UTF-8 encoded @w{ISO 10646} or Unicode text. Besides
this only some very system-specific methods can help. Since the
conversion functions come from loadable modules and these modules must
be stored somewhere in the filesystem, one @emph{could} try to find them
and determine from the available files which conversions are available
and whether there is an indirect route from @math{@cal{A}} to
@math{@cal{C}}.

This example shows one of the design errors of @code{iconv} mentioned
above. It should at least be possible to determine the list of available
conversions programmatically so that if @code{iconv_open} says there is no
such conversion, one could make sure this also is true for indirect
routes.

@node glibc iconv Implementation
@subsection The @code{iconv} Implementation in @theglibc{}

After reading about the problems of @code{iconv} implementations in the
last section it is certainly good to note that the implementation in
@theglibc{} has none of the problems mentioned above. What
follows is a step-by-step analysis of the points raised above. The
evaluation is based on the current state of the development (as of
January 1999). The development of the @code{iconv} functions is not
complete, but basic functionality has solidified.

@Theglibc{}'s @code{iconv} implementation uses shared loadable
modules to implement the conversions. A very small number of
conversions are built into the library itself but these are only rather
trivial conversions.

All the benefits of loadable modules are available in the @glibcadj{}
implementation. This is especially appealing since the interface is
well documented (see below), and it, therefore, is easy to write new
conversion modules. The drawback of using loadable objects is not a
problem in @theglibc{}, at least on ELF systems.
Since the
library is able to load shared objects even in statically linked
binaries, static linking need not be forbidden in case one wants to use
@code{iconv}.

The second mentioned problem is the number of supported conversions.
Currently, @theglibc{} supports more than 150 character sets. Because of
the way the implementation is designed, the number of supported conversions
is greater than 22350 (@math{150} times @math{149}). If any conversion
from or to a character set is missing, it can be added easily.

Impressive as it may be, this high number is due to the
fact that the @glibcadj{} implementation of @code{iconv} does not have
the third problem mentioned above (i.e., whenever there is a conversion
from a character set @math{@cal{A}} to @math{@cal{B}} and from
@math{@cal{B}} to @math{@cal{C}} it is always possible to convert from
@math{@cal{A}} to @math{@cal{C}} directly). If @code{iconv_open}
returns an error and sets @code{errno} to @code{EINVAL}, there is no
known way, directly or indirectly, to perform the wanted conversion.

@cindex triangulation
Triangulation is achieved by providing for each character set a
conversion from and to UCS-4 encoded @w{ISO 10646}. Using @w{ISO 10646}
as an intermediate representation it is possible to @dfn{triangulate}
(i.e., convert with an intermediate representation).

There is no inherent requirement to provide a conversion to
@w{ISO 10646} for a new character set, and it is also possible to provide other
conversions where neither source nor destination character set is
@w{ISO 10646}. The existing set of conversions is simply meant to cover all
conversions that might be of interest.

@cindex ISO-2022-JP
@cindex EUC-JP
All currently available conversions use the triangulation method above,
making conversion run unnecessarily slow.
If, for example, somebody
often needs the conversion from ISO-2022-JP to EUC-JP, a quicker solution
would involve direct conversion between the two character sets, skipping
the intermediate conversion to @w{ISO 10646}. The two character sets of interest
are much more similar to each other than to @w{ISO 10646}.

In such a situation one can easily write a new conversion and provide it
as a better alternative. The @glibcadj{} @code{iconv} implementation
would automatically use the module implementing the conversion if it is
specified to be more efficient.

@subsubsection Format of @file{gconv-modules} files

All information about the available conversions comes from a file named
@file{gconv-modules}, which can be found in any of the directories along
the @code{GCONV_PATH}. The @file{gconv-modules} files are line-oriented
text files, where each of the lines has one of the following formats:

@itemize @bullet
@item
If the first non-whitespace character is a @kbd{#} the line contains only
comments and is ignored.

@item
Lines starting with @code{alias} define an alias name for a character
set. Two more words are expected on the line. The first word
defines the alias name, and the second defines the original name of the
character set. The effect is that it is possible to use the alias name
in the @var{fromset} or @var{toset} parameters of @code{iconv_open} and
achieve the same result as when using the real character set name.

This is quite important as a character set often has many different
names. There is normally an official name but this need not correspond to
the most popular name. Besides this many character sets have special
names that are somehow constructed. For example, all character sets
specified by the ISO have an alias of the form @code{ISO-IR-@var{nnn}}
where @var{nnn} is the registration number.
This allows programs that
know about the registration number to construct character set names and
use them in @code{iconv_open} calls. More on the available names and
aliases follows below.

@item
Lines starting with @code{module} introduce an available conversion
module. These lines must contain three or four more words.

The first word specifies the source character set, the second word the
destination character set of the conversion implemented in this module, and
the third word is the name of the loadable module. The filename is
constructed by appending the usual shared object suffix (normally
@file{.so}) and this file is then supposed to be found in the same
directory the @file{gconv-modules} file is in. The last word on the line,
which is optional, is a numeric value representing the cost of the
conversion. If this word is missing, a cost of @math{1} is assumed. The
numeric value itself does not matter that much; what counts are the
relative values of the sums of costs for all possible conversion paths.
Below is a more precise description of the use of the cost value.
@end itemize

Let us return to the example above, where one has written a module to directly
convert from ISO-2022-JP to EUC-JP and back. All that has to be done is
to put the new module, let its name be @file{ISO2022JP-EUCJP.so}, in a directory
and add a file @file{gconv-modules} with the following content in the
same directory:

@smallexample
module  ISO-2022-JP//   EUC-JP//        ISO2022JP-EUCJP    1
module  EUC-JP//        ISO-2022-JP//   ISO2022JP-EUCJP    1
@end smallexample

To see why this is sufficient, it is necessary to understand how the
conversion used by @code{iconv} (and described in the descriptor) is
selected. The approach to this problem is quite simple.
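
The three line formats of a @file{gconv-modules} file described above can be
made concrete with a small parser. The following is only a minimal sketch
under stated assumptions, not the code used in the library: the names
@code{parse_gconv_line}, @code{struct parsed_line}, and @code{enum line_kind}
are hypothetical, words are assumed to be shorter than 64 bytes, and treating
an empty line like a comment is an assumption of the sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical types for this sketch; they are not part of glibc.  */
enum line_kind { LINE_IGNORED, LINE_ALIAS, LINE_MODULE, LINE_INVALID };

struct parsed_line
{
  enum line_kind kind;
  char first[64];   /* Alias name, or source character set.  */
  char second[64];  /* Original name, or destination character set.  */
  char module[64];  /* Module name without the shared object suffix.  */
  int cost;         /* Stays 1 when no cost word is present.  */
};

/* Classify one line of a gconv-modules file.  */
struct parsed_line
parse_gconv_line (const char *line)
{
  struct parsed_line r = { LINE_INVALID, "", "", "", 1 };
  char keyword[16];

  assert (line != NULL);

  /* Skip leading whitespace; a '#' as the first remaining character
     marks a comment.  Treating an empty line the same way is an
     assumption of this sketch.  */
  while (isspace ((unsigned char) *line))
    ++line;
  if (*line == '\0' || *line == '#')
    {
      r.kind = LINE_IGNORED;
      return r;
    }

  if (sscanf (line, "%15s", keyword) != 1)
    return r;

  if (strcmp (keyword, "alias") == 0)
    {
      /* Two more words: the alias name and the original name.  */
      if (sscanf (line, "%*s %63s %63s", r.first, r.second) == 2)
        r.kind = LINE_ALIAS;
    }
  else if (strcmp (keyword, "module") == 0)
    {
      /* Three or four more words; the optional fourth is the cost.
         If it is absent, sscanf leaves r.cost at its default of 1.  */
      if (sscanf (line, "%*s %63s %63s %63s %d",
                  r.first, r.second, r.module, &r.cost) >= 3)
        r.kind = LINE_MODULE;
    }
  return r;
}
```

Applied to the first line of the example file above, this sketch yields a
@code{LINE_MODULE} record naming the module @code{ISO2022JP-EUCJP} with a
cost of 1; the same result is obtained if the trailing cost word is left out.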

At the first call of the @code{iconv_open} function the program reads
all available @file{gconv-modules} files and builds up two tables: one
containing all the known aliases and another that contains the
information about the conversions and which shared object implements
them.

@subsubsection Finding the conversion path in @code{iconv}

The set of available conversions forms a directed graph with weighted
edges. The weights on the edges are the costs specified in the
@file{gconv-modules} files. The @code{iconv_open} function uses an
algorithm suitable for searching for the best path in such a graph and so
constructs a list of conversions that must be performed in succession
to get the transformation from the source to the destination character
set.

Explaining why the above @file{gconv-modules} file allows the
@code{iconv} implementation to select the specific ISO-2022-JP to
EUC-JP conversion module instead of the conversion coming with the
library itself is straightforward. Since the latter conversion takes two
steps (from ISO-2022-JP to @w{ISO 10646} and then from @w{ISO 10646} to
EUC-JP), the cost is @math{1+1 = 2}. The above @file{gconv-modules}
file, however, specifies that the new conversion modules can perform this
conversion with only the cost of @math{1}.

A mysterious aspect of the @file{gconv-modules} file above (and also
the file coming with @theglibc{}) is the names of the character
sets specified in the @code{module} lines. Why do almost all the names
end in @code{//}? And this is not all: the names can actually be
regular expressions. At this point in time this mystery should not be
revealed, unless you have the relevant spell-casting materials: ashes
from an original @w{DOS 6.2} boot disk burnt in effigy, a crucifix
blessed by St.@: Emacs, assorted herbal roots from Central America, sand
from Cebu, etc. Sorry!
@strong{The part of the implementation where
this is used is not yet finished. For now please simply follow the
existing examples. It'll become clearer once it is. --drepper}

A last remark about the @file{gconv-modules} file concerns the names not
ending with @code{//}. A character set named @code{INTERNAL} is often
mentioned. From the discussion above and the chosen name it should have
become clear that this is the name for the representation used in the
intermediate step of the triangulation. We have said that this is UCS-4
but actually that is not quite right. The UCS-4 specification also
includes the specification of the byte ordering used. Since a UCS-4 value
consists of four bytes, a stored value is affected by byte ordering. The
internal representation is @emph{not} the same as UCS-4 in case the byte
ordering of the processor (or at least the running process) is not the
same as the one required for UCS-4. This is done for performance reasons
as one does not want to perform unnecessary byte-swapping operations if
one is not interested in actually seeing the result in UCS-4. To avoid
trouble with endianness, the internal representation consistently is named
@code{INTERNAL} even on big-endian systems where the representations are
identical.

@subsubsection @code{iconv} module data structures

So far this section has described how modules are located and selected
for use. What remains to be described is the interface of the modules
so that one can write new ones. This section describes the interface as
it is in use in January 1999. The interface will change a bit in the
future but, with luck, only in an upwardly compatible way.

The definitions necessary to write new modules are publicly available
in the non-standard header @file{gconv.h}. The following text,
therefore, describes the definitions from this header file.
First,
however, it is necessary to get an overview.

From the perspective of the user of @code{iconv} the interface is quite
simple: the @code{iconv_open} function returns a handle that can be used
in calls to @code{iconv}, and finally the handle is freed with a call to
@code{iconv_close}. The problem is that the handle has to be able to
represent the possibly long sequences of conversion steps and also the
state of each conversion since the handle is all that is passed to the
@code{iconv} function. Therefore, the data structures are really the
elements necessary for understanding the implementation.

We need two different kinds of data structures. The first describes the
conversion and the second describes the state etc. There are really two
type definitions like this in @file{gconv.h}.
@pindex gconv.h

@deftp {Data type} {struct __gconv_step}
@standards{GNU, gconv.h}
This data structure describes one conversion a module can perform. For
each function in a loaded module with conversion functions there is
exactly one object of this type. This object is shared by all users of
the conversion (i.e., this object does not contain any information
corresponding to an actual conversion; it only describes the conversion
itself).

@table @code
@item struct __gconv_loaded_object *__shlib_handle
@itemx const char *__modname
@itemx int __counter
All these elements of the structure are used internally in the C library
to coordinate loading and unloading the shared object. One must not expect any
of the other elements to be available or initialized.

@item const char *__from_name
@itemx const char *__to_name
@code{__from_name} and @code{__to_name} contain the names of the source and
destination character sets.
They can be used to identify the actual
conversion to be carried out since one module might implement conversions
for more than one character set and/or direction.

@item __gconv_fct __fct
@itemx __gconv_init_fct __init_fct
@itemx __gconv_end_fct __end_fct
These elements contain pointers to the functions in the loadable module.
The interface will be explained below.

@item int __min_needed_from
@itemx int __max_needed_from
@itemx int __min_needed_to
@itemx int __max_needed_to
These values have to be supplied in the init function of the module. The
@code{__min_needed_from} value specifies how many bytes a character of
the source character set at least needs. The @code{__max_needed_from}
value specifies the maximum, which also includes possible shift sequences.

The @code{__min_needed_to} and @code{__max_needed_to} values serve the
same purpose as @code{__min_needed_from} and @code{__max_needed_from} but
this time for the destination character set.

It is crucial that these values be accurate since otherwise the
conversion functions will have problems or not work at all.

@item int __stateful
This element must also be initialized by the init function.
@code{__stateful} is nonzero if the source character set is stateful.
Otherwise it is zero.

@item void *__data
This element can be used freely by the conversion functions in the
module. @code{__data} can be used to communicate extra information
from one call to another. @code{__data} need not be initialized if
not needed at all. If the @code{__data} element is assigned a pointer
to dynamically allocated memory (presumably in the init function), the
end function must deallocate the memory. Otherwise
the application will leak memory.

It is important to be aware that this data structure is shared by all
users of this specific conversion and therefore the @code{__data}
element must not contain data specific to one particular use of the
conversion function.
@end table
@end deftp

@deftp {Data type} {struct __gconv_step_data}
@standards{GNU, gconv.h}
This is the data structure that contains the information specific to
each use of the conversion functions.


@table @code
@item char *__outbuf
@itemx char *__outbufend
These elements specify the output buffer for the conversion step. The
@code{__outbuf} element points to the beginning of the buffer, and
@code{__outbufend} points to the byte following the last byte in the
buffer. The conversion function must not assume anything about the size
of the buffer but it can be safely assumed there is room for at
least one complete character in the output buffer.

Once the conversion is finished, if the conversion is the last step, the
@code{__outbuf} element must be modified to point after the last byte
written into the buffer to signal how much output is available. If this
conversion step is not the last one, the element must not be modified.
The @code{__outbufend} element must not be modified.

@item int __is_last
This element is nonzero if this conversion step is the last one. This
information is necessary for the recursion. See the description of the
conversion function internals below. This element must never be
modified.

@item int __invocation_counter
The conversion function can use this element to see how many calls of
the conversion function already happened. Some character sets require a
certain prolog when generating output, and by comparing this value with
zero, one can find out whether it is the first call and whether,
therefore, the prolog should be emitted.
This element must never be
modified.

@item int __internal_use
This element is another one rarely used but needed in certain
situations. It is assigned a nonzero value in case the conversion
functions are used to implement @code{mbsrtowcs} et al.@: (i.e., the
function is not used directly through the @code{iconv} interface).

This sometimes makes a difference as it is expected that the
@code{iconv} functions are used to translate entire texts while the
@code{mbsrtowcs} functions are normally used only to convert single
strings and might be used multiple times to convert entire texts.

But in this situation we would have a problem complying with some rules of
the character set specification. Some character sets require a prolog,
which must appear exactly once for an entire text. If a number of
@code{mbsrtowcs} calls are used to convert the text, only the first call
must add the prolog. However, because there is no communication between the
different calls of @code{mbsrtowcs}, the conversion functions have no
possibility to find this out. The situation is different for sequences
of @code{iconv} calls since the handle allows access to the needed
information.

The @code{__internal_use} element is mostly used together with
@code{__invocation_counter} as follows:

@smallexample
if (!data->__internal_use
    && data->__invocation_counter == 0)
  /* @r{Emit prolog.} */
  @dots{}
@end smallexample

This element must never be modified.

@item mbstate_t *__statep
The @code{__statep} element points to an object of type @code{mbstate_t}
(@pxref{Keeping the state}). The conversion of a stateful character
set must use the object pointed to by @code{__statep} to store
information about the conversion state. The @code{__statep} element
itself must never be modified.

@item mbstate_t __state
This element must @emph{never} be used directly. It is only part of
this structure to have the needed space allocated.
@end table
@end deftp

@subsubsection @code{iconv} module interfaces

With the knowledge about the data structures we now can describe the
conversion function itself. To understand the interface a bit of
knowledge is necessary about the functionality in the C library that
loads the objects with the conversions.

It is often the case that one conversion is used more than once (i.e.,
there are several @code{iconv_open} calls for the same set of character
sets during one program run). The @code{mbsrtowcs} et al.@: functions in
@theglibc{} also use the @code{iconv} functionality, which
increases the number of uses of the same functions even more.

Because of this multiple use of conversions, the modules do not get
loaded exclusively for one conversion. Instead a module once loaded can
be used by an arbitrary number of @code{iconv} or @code{mbsrtowcs} calls
at the same time. The splitting of the information between
conversion-function-specific information and conversion data makes this possible.
The last section showed the two data structures used to do this.

This is of course also reflected in the interface and semantics of the
functions that the modules must provide. There are three functions that
must have the following names:

@table @code
@item gconv_init
The @code{gconv_init} function initializes the conversion function
specific data structure. This very same object is shared by all
conversions that use this conversion and, therefore, no state information
about the conversion itself must be stored in here. If a module
implements more than one conversion, the @code{gconv_init} function will
be called multiple times.

@item gconv_end
The @code{gconv_end} function is responsible for freeing all resources
allocated by the @code{gconv_init} function. If there is nothing to do,
this function can be missing. Special care must be taken if the module
implements more than one conversion and the @code{gconv_init} function
does not allocate the same resources for all conversions.

@item gconv
This is the actual conversion function. It is called to convert one
block of text. It gets passed the conversion step information
initialized by @code{gconv_init} and the conversion data, specific to
this use of the conversion functions.
@end table

There are three data types defined for the three module interface
functions and these define the interface.

@deftypevr {Data type} int {(*__gconv_init_fct)} (struct __gconv_step *)
@standards{GNU, gconv.h}
This specifies the interface of the initialization function of the
module. It is called exactly once for each conversion the module
implements.

As explained in the description of the @code{struct __gconv_step} data
structure above the initialization function has to initialize parts of
it.

@table @code
@item __min_needed_from
@itemx __max_needed_from
@itemx __min_needed_to
@itemx __max_needed_to
These elements must be initialized to the exact numbers of the minimum
and maximum number of bytes used by one character in the source and
destination character sets, respectively. If the characters all have the
same size, the minimum and maximum values are the same.

@item __stateful
This element must be initialized to a nonzero value if the source
character set is stateful. Otherwise it must be zero.
@end table

If the initialization function needs to communicate some information
to the conversion function, this communication can happen using the
@code{__data} element of the @code{__gconv_step} structure. But since
this data is shared by all the conversions, it must not be modified by
the conversion function. The example below shows how this can be used.

@smallexample
#define MIN_NEEDED_FROM         1
#define MAX_NEEDED_FROM         4
#define MIN_NEEDED_TO           4
#define MAX_NEEDED_TO           4

int
gconv_init (struct __gconv_step *step)
@{
  /* @r{Determine which direction.} */
  struct iso2022jp_data *new_data;
  enum direction dir = illegal_dir;
  enum variant var = illegal_var;
  int result;

  if (__strcasecmp (step->__from_name, "ISO-2022-JP//") == 0)
    @{
      dir = from_iso2022jp;
      var = iso2022jp;
    @}
  else if (__strcasecmp (step->__to_name, "ISO-2022-JP//") == 0)
    @{
      dir = to_iso2022jp;
      var = iso2022jp;
    @}
  else if (__strcasecmp (step->__from_name, "ISO-2022-JP-2//") == 0)
    @{
      dir = from_iso2022jp;
      var = iso2022jp2;
    @}
  else if (__strcasecmp (step->__to_name, "ISO-2022-JP-2//") == 0)
    @{
      dir = to_iso2022jp;
      var = iso2022jp2;
    @}

  result = __GCONV_NOCONV;
  if (dir != illegal_dir)
    @{
      new_data = (struct iso2022jp_data *)
        malloc (sizeof (struct iso2022jp_data));

      result = __GCONV_NOMEM;
      if (new_data != NULL)
        @{
          new_data->dir = dir;
          new_data->var = var;
          step->__data = new_data;

          if (dir == from_iso2022jp)
            @{
              step->__min_needed_from = MIN_NEEDED_FROM;
              step->__max_needed_from = MAX_NEEDED_FROM;
              step->__min_needed_to = MIN_NEEDED_TO;
              step->__max_needed_to = MAX_NEEDED_TO;
            @}
          else
            @{
              step->__min_needed_from = MIN_NEEDED_TO;
              step->__max_needed_from = MAX_NEEDED_TO;
              step->__min_needed_to = MIN_NEEDED_FROM;
              step->__max_needed_to = MAX_NEEDED_FROM + 2;
            @}

          /* @r{Yes, this is a stateful encoding.} */
          step->__stateful = 1;

          result = __GCONV_OK;
        @}
    @}

  return result;
@}
@end smallexample

The function first checks which conversion is wanted. The module from
which this function is taken implements four different conversions;
which one is selected can be determined by comparing the names. The
comparison should always be done without paying attention to the case.

Next, a data structure, which contains the necessary information about
which conversion is selected, is allocated. The data structure
@code{struct iso2022jp_data} is locally defined since, outside the
module, this data is not used at all. Please note that if all four
conversions this module supports are requested there are four data
blocks.

One interesting thing is the initialization of the @code{__min_} and
@code{__max_} elements of the step data object. A single ISO-2022-JP
character can consist of one to four bytes. Therefore the
@code{MIN_NEEDED_FROM} and @code{MAX_NEEDED_FROM} macros are defined
this way. The output is always the @code{INTERNAL} character set (aka
UCS-4) and therefore each character consists of exactly four bytes. For
the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into
account that escape sequences might be necessary to switch the character
sets. Therefore the @code{__max_needed_to} element for this direction
gets assigned @code{MAX_NEEDED_FROM + 2}. This takes into account the
two bytes needed for the escape sequences to signal the switching.
The
asymmetry in the maximum values for the two directions can be explained
easily: when reading ISO-2022-JP text, escape sequences can be handled
alone (i.e., it is not necessary to process a real character since the
effect of the escape sequence can be recorded in the state information).
The situation is different for the other direction. Since it is in
general not known which character comes next, one cannot emit escape
sequences to change the state in advance. This means the escape
sequences have to be emitted together with the next character.
Therefore one needs more room than only for the character itself.

The possible return values of the initialization function are:

@table @code
@item __GCONV_OK
The initialization succeeded.
@item __GCONV_NOCONV
The requested conversion is not supported in the module. This can
happen if the @file{gconv-modules} file has errors.
@item __GCONV_NOMEM
Memory required to store additional information could not be allocated.
@end table
@end deftypevr

The function called before the module is unloaded is significantly
simpler. It often has nothing at all to do, in which case it can be left
out completely.

@deftypevr {Data type} void {(*__gconv_end_fct)} (struct __gconv_step *)
@standards{GNU, gconv.h}
The task of this function is to free all resources allocated in the
initialization function. Therefore only the @code{__data} element of
the object pointed to by the argument is of interest. Continuing the
example from the initialization function, the finalization function
looks like this:

@smallexample
void
gconv_end (struct __gconv_step *data)
@{
  free (data->__data);
@}
@end smallexample
@end deftypevr

The most important function is the conversion function itself, which can
get quite complicated for complex character sets.
But since this is not
of interest here, we will only describe a possible skeleton for the
conversion function.

@deftypevr {Data type} int {(*__gconv_fct)} (struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int)
@standards{GNU, gconv.h}
The conversion function can be called for two basic reasons: to convert
text or to reset the state. From the description of the @code{iconv}
function it can be seen why the flushing mode is necessary. Which mode
is selected is determined by the sixth argument, an integer. This
argument being nonzero means that flushing is selected.

Common to both modes is where the output buffer can be found. The
information about this buffer is stored in the conversion step data. A
pointer to this information is passed as the second argument to this
function. The description of the @code{struct __gconv_step_data}
structure has more information on the conversion step data.

@cindex stateful
What has to be done for flushing depends on the source character set.
If the source character set is not stateful, nothing has to be done.
Otherwise the function has to emit a byte sequence to bring the state
object into the initial state. Once all this has happened the other
conversion modules in the chain of conversions have to get the same
chance. Whether another step follows can be determined from the
@code{__is_last} element of the step data structure to which the second
parameter points.

The more interesting mode is when actual text has to be converted. The
first step in this case is to convert as much text as possible from the
input buffer and store the result in the output buffer. The start of the
input buffer is determined by the third argument, which is a pointer to a
pointer variable referencing the beginning of the buffer.
The fourth
argument is a pointer to the byte right after the last byte in the buffer.

The conversion has to be performed according to the current state if the
character set is stateful. The state is stored in an object pointed to
by the @code{__statep} element of the step data (second argument). Once
either the input buffer is empty or the output buffer is full the
conversion stops. At this point, the pointer variable referenced by the
third parameter must point to the byte following the last processed
byte (i.e., if all of the input is consumed, this pointer and the fourth
parameter have the same value).

What now happens depends on whether this step is the last one. If it is
the last step, the only thing that has to be done is to update the
@code{__outbuf} element of the step data structure to point after the
last written byte. This update gives the caller the information on how
much text is available in the output buffer. In addition, the variable
pointed to by the fifth parameter, which is of type @code{size_t}, must
be incremented by the number of characters (@emph{not bytes}) that were
converted in a non-reversible way. Then, the function can return.

In case the step is not the last one, the later conversion functions have
to get a chance to do their work. Therefore, the appropriate conversion
function has to be called. The information about the functions is
stored in the conversion data structures, passed as the first parameter.
This information and the step data are stored in arrays, so the next
element in both cases can be found by simple pointer arithmetic:

@smallexample
int
gconv (struct __gconv_step *step, struct __gconv_step_data *data,
       const char **inbuf, const char *inbufend, size_t *written,
       int do_flush)
@{
  struct __gconv_step *next_step = step + 1;
  struct __gconv_step_data *next_data = data + 1;
  @dots{}
@end smallexample

The @code{next_step} pointer references the next step information and
@code{next_data} the next data record. The call of the next function
therefore will look similar to this:

@smallexample
  next_step->__fct (next_step, next_data, &outerr, outbuf,
                    written, 0)
@end smallexample

But this is not all. Once the function call returns, the conversion
function might have more to do. If the return value of the function
is @code{__GCONV_EMPTY_INPUT}, more room is available in the output
buffer. Unless the input buffer is empty, the conversion functions start
all over again and process the rest of the input buffer. If the return
value is not @code{__GCONV_EMPTY_INPUT}, something went wrong and we have
to recover from this.

A requirement for the conversion function is that the input buffer
pointer (the third argument) always point to the last character that
was put in converted form into the output buffer. This is trivially
true after the conversion performed in the current step, but if the
conversion functions deeper downstream stop prematurely, not all
characters from the output buffer are consumed and, therefore, the input
buffer pointers must be backed off to the right position.

Correcting the input buffers is easy to do if the input and output
character sets have a fixed width for all characters.
In this situation
we can compute how many characters are left in the output buffer and,
therefore, can correct the input buffer pointer appropriately with a
similar computation. Things get tricky if either character set
has characters represented with variable length byte sequences, and it
gets even more complicated if the conversion has to take care of the
state. In these cases the conversion has to be performed once again, from
the known state before the initial conversion (i.e., if necessary the
state of the conversion has to be reset and the conversion loop has to be
executed again). The difference now is that it is known how much output
must be produced, and the conversion can stop before converting the first
unused character. Once this is done the input buffer pointers must be
updated again and the function can return.

One final thing should be mentioned. If it is necessary for the
conversion to know whether it is the first invocation (in case a prolog
has to be emitted), the conversion function should increment the
@code{__invocation_counter} element of the step data structure just
before returning to the caller. See the description of the @code{struct
__gconv_step_data} structure above for more information on how this can
be used.

The return value must be one of the following values:

@table @code
@item __GCONV_EMPTY_INPUT
All input was consumed and there is room left in the output buffer.
@item __GCONV_FULL_OUTPUT
No more room in the output buffer. In case this is not the last step
this value is propagated down from the call of the next conversion
function in the chain.
@item __GCONV_INCOMPLETE_INPUT
The input buffer is not entirely empty since it contains an incomplete
character sequence.
@end table

The following example provides a framework for a conversion function.
To write a new conversion, only the holes in this
implementation have to be filled in.

@smallexample
int
gconv (struct __gconv_step *step, struct __gconv_step_data *data,
       const char **inbuf, const char *inbufend, size_t *written,
       int do_flush)
@{
  struct __gconv_step *next_step = step + 1;
  struct __gconv_step_data *next_data = data + 1;
  __gconv_fct fct = next_step->__fct;
  int status;

  /* @r{If the function is called with no input this means we have}
     @r{to reset to the initial state. The possibly partly}
     @r{converted input is dropped.} */
  if (do_flush)
    @{
      status = __GCONV_OK;

      /* @r{Possibly emit a byte sequence which puts the state object}
         @r{into the initial state.} */

      /* @r{Call the steps down the chain if there are any but only}
         @r{if we successfully emitted the escape sequence.} */
      if (status == __GCONV_OK && ! data->__is_last)
        status = fct (next_step, next_data, NULL, NULL,
                      written, 1);
    @}
  else
    @{
      /* @r{We preserve the initial values of the pointer variables.} */
      const char *inptr = *inbuf;
      char *outbuf = data->__outbuf;
      char *outend = data->__outbufend;
      char *outptr;

      do
        @{
          /* @r{Remember the start value for this round.} */
          inptr = *inbuf;
          /* @r{The outbuf buffer is empty.} */
          outptr = outbuf;

          /* @r{For stateful encodings the state must be saved here.} */

          /* @r{Run the conversion loop. @code{status} is set}
             @r{appropriately afterwards.} */

          /* @r{If this is the last step, leave the loop.
There is}
             @r{nothing we can do.} */
          if (data->__is_last)
            @{
              /* @r{Store information about how many bytes are}
                 @r{available.} */
              data->__outbuf = outbuf;

              /* @r{If any non-reversible conversions were performed,}
                 @r{add the number to @code{*written}.} */

              break;
            @}

          /* @r{Write out all output that was produced.} */
          if (outbuf > outptr)
            @{
              const char *outerr = data->__outbuf;
              int result;

              result = fct (next_step, next_data, &outerr,
                            outbuf, written, 0);

              if (result != __GCONV_EMPTY_INPUT)
                @{
                  if (outerr != outbuf)
                    @{
                      /* @r{Reset the input buffer pointer. We}
                         @r{document here the complex case.} */
                      size_t nstatus;

                      /* @r{Reload the pointers.} */
                      *inbuf = inptr;
                      outbuf = outptr;

                      /* @r{Possibly reset the state.} */

                      /* @r{Redo the conversion, but this time}
                         @r{the end of the output buffer is at}
                         @r{@code{outerr}.} */
                    @}

                  /* @r{Change the status.} */
                  status = result;
                @}
              else
                /* @r{All the output is consumed, we can make}
                   @r{another run if everything was ok.} */
                if (status == __GCONV_FULL_OUTPUT)
                  status = __GCONV_OK;
            @}
        @}
      while (status == __GCONV_OK);

      /* @r{We finished one use of this step.} */
      ++data->__invocation_counter;
    @}

  return status;
@}
@end smallexample
@end deftypevr

This information should be sufficient to write new modules. Anybody
doing so should also take a look at the available source code in the
@glibcadj{} sources. It contains many examples of working and optimized
modules.

@c File charset.texi edited October 2001 by Dennis Grace, IBM Corporation