LCOV - code coverage report
Current view: top level - include/llvm/Support - ConvertUTF.h (source / functions)
Test: llvm-toolchain.info
Date: 2018-10-20 13:21:21

             Hit   Total   Coverage
Lines:         5       5    100.0 %
Functions:     1       1    100.0 %

          Line data    Source code
       1             : /*===--- ConvertUTF.h - Universal Character Names conversions ---------------===
       2             :  *
       3             :  *                     The LLVM Compiler Infrastructure
       4             :  *
       5             :  * This file is distributed under the University of Illinois Open Source
       6             :  * License. See LICENSE.TXT for details.
       7             :  *
       8             :  *==------------------------------------------------------------------------==*/
       9             : /*
      10             :  * Copyright 2001-2004 Unicode, Inc.
      11             :  *
      12             :  * Disclaimer
      13             :  *
      14             :  * This source code is provided as is by Unicode, Inc. No claims are
      15             :  * made as to fitness for any particular purpose. No warranties of any
      16             :  * kind are expressed or implied. The recipient agrees to determine
      17             :  * applicability of information provided. If this file has been
      18             :  * purchased on magnetic or optical media from Unicode, Inc., the
      19             :  * sole remedy for any claim will be exchange of defective media
      20             :  * within 90 days of receipt.
      21             :  *
      22             :  * Limitations on Rights to Redistribute This Code
      23             :  *
      24             :  * Unicode, Inc. hereby grants the right to freely use the information
      25             :  * supplied in this file in the creation of products supporting the
      26             :  * Unicode Standard, and to make copies of this file in any form
      27             :  * for internal or external distribution as long as this notice
      28             :  * remains attached.
      29             :  */
      30             : 
      31             : /* ---------------------------------------------------------------------
      32             : 
      33             :     Conversions between UTF-32, UTF-16, and UTF-8.  Header file.
      34             : 
      35             :     Several functions are included here, forming a complete set of
      36             :     conversions between the three formats.  UTF-7 is not included
      37             :     here, but is handled in a separate source file.
      38             : 
      39             :     Each of these routines takes pointers to input buffers and output
      40             :     buffers.  The input buffers are const.
      41             : 
      42             :     Each routine converts the text between *sourceStart and sourceEnd,
      43             :     putting the result into the buffer between *targetStart and
      44             :     targetEnd. Note: the end pointers are *after* the last item: e.g.
      45             :     *(sourceEnd - 1) is the last item.
      46             : 
      47             :     The return result indicates whether the conversion was successful,
      48             :     and if not, whether the problem was in the source or target buffers.
      49             :     (Only the first encountered problem is indicated.)
      50             : 
      51             :     After the conversion, *sourceStart and *targetStart are both
      52             :     updated to point to the end of the last text successfully converted in
      53             :     the respective buffers.
      54             : 
      55             :     Input parameters:
      56             :         sourceStart - pointer to a pointer to the source buffer.
      57             :                 The contents of this are modified on return so that
      58             :                 it points at the next thing to be converted.
      59             :         targetStart - similarly, pointer to pointer to the target buffer.
      60             :         sourceEnd, targetEnd - respectively pointers to the ends of the
      61             :                 two buffers, for overflow checking only.
      62             : 
      63             :     These conversion functions take a ConversionFlags argument. When this
      64             :     flag is set to strict, both irregular sequences and isolated surrogates
      65             :     will cause an error.  When the flag is set to lenient, both irregular
      66             :     sequences and isolated surrogates are converted.
      67             : 
      68             :     Whether the flag is strict or lenient, all illegal sequences will cause
      69             :     an error return. This includes sequences such as: <F4 90 80 80>, <C0 80>,
      70             :     or <A0> in UTF-8, and values above 0x10FFFF in UTF-32. Conformant code
      71             :     must check for illegal sequences.
      72             : 
      73             :     When the flag is set to lenient, characters over 0x10FFFF are converted
      74             :     to the replacement character; otherwise (when the flag is set to strict)
      75             :     they constitute an error.
      76             : 
      77             :     Output parameters:
      78             :         The value "sourceIllegal" is returned from some routines if the input
      79             :         sequence is malformed.  When "sourceIllegal" is returned, the source
      80             :         value will point to the illegal value that caused the problem. E.g.,
      81             :         in UTF-8 when a sequence is malformed, it points to the start of the
      82             :         malformed sequence.
      83             : 
      84             :     Author: Mark E. Davis, 1994.
      85             :     Rev History: Rick McGowan, fixes & updates May 2001.
      86             :          Fixes & updates, Sept 2001.
      87             : 
      88             : ------------------------------------------------------------------------ */
      89             : 
      90             : #ifndef LLVM_SUPPORT_CONVERTUTF_H
      91             : #define LLVM_SUPPORT_CONVERTUTF_H
      92             : 
      93             : #include <cstddef>
      94             : #include <string>
      95             : #include <system_error>
      96             : 
      97             : // Wrap everything in namespace llvm so that programs can link with llvm and
      98             : // their own version of the unicode libraries.
      99             : 
     100             : namespace llvm {
     101             : 
     102             : /* ---------------------------------------------------------------------
     103             :     The following 4 definitions are compiler-specific.
     104             :     The C standard does not guarantee that wchar_t has at least
     105             :     16 bits, so wchar_t is no more portable than unsigned short!
     106             :     All should be unsigned values to avoid sign extension during
     107             :     bit mask & shift operations.
     108             : ------------------------------------------------------------------------ */
     109             : 
     110             : typedef unsigned int    UTF32;  /* at least 32 bits */
     111             : typedef unsigned short  UTF16;  /* at least 16 bits */
     112             : typedef unsigned char   UTF8;   /* typically 8 bits */
     113             : typedef unsigned char   Boolean; /* 0 or 1 */
     114             : 
     115             : /* Some fundamental constants */
     116             : #define UNI_REPLACEMENT_CHAR (UTF32)0x0000FFFD
     117             : #define UNI_MAX_BMP (UTF32)0x0000FFFF
     118             : #define UNI_MAX_UTF16 (UTF32)0x0010FFFF
     119             : #define UNI_MAX_UTF32 (UTF32)0x7FFFFFFF
     120             : #define UNI_MAX_LEGAL_UTF32 (UTF32)0x0010FFFF
     121             : 
     122             : #define UNI_MAX_UTF8_BYTES_PER_CODE_POINT 4
     123             : 
     124             : #define UNI_UTF16_BYTE_ORDER_MARK_NATIVE  0xFEFF
     125             : #define UNI_UTF16_BYTE_ORDER_MARK_SWAPPED 0xFFFE
     126             : 
     127             : typedef enum {
     128             :   conversionOK,           /* conversion successful */
     129             :   sourceExhausted,        /* partial character in source, but hit end */
     130             :   targetExhausted,        /* insuff. room in target for conversion */
     131             :   sourceIllegal           /* source sequence is illegal/malformed */
     132             : } ConversionResult;
     133             : 
     134             : typedef enum {
     135             :   strictConversion = 0,
     136             :   lenientConversion
     137             : } ConversionFlags;
     138             : 
     139             : ConversionResult ConvertUTF8toUTF16 (
     140             :   const UTF8** sourceStart, const UTF8* sourceEnd,
     141             :   UTF16** targetStart, UTF16* targetEnd, ConversionFlags flags);
     142             : 
     143             : /**
     144             :  * Convert a partial UTF8 sequence to UTF32.  If the sequence ends in an
     145             :  * incomplete code unit sequence, returns \c sourceExhausted.
     146             :  */
     147             : ConversionResult ConvertUTF8toUTF32Partial(
     148             :   const UTF8** sourceStart, const UTF8* sourceEnd,
     149             :   UTF32** targetStart, UTF32* targetEnd, ConversionFlags flags);
     150             : 
     151             : /**
     152             :  * Convert a UTF8 sequence to UTF32.  If the sequence ends in an
     153             :  * incomplete code unit sequence, returns \c sourceIllegal.
     154             :  */
     155             : ConversionResult ConvertUTF8toUTF32(
     156             :   const UTF8** sourceStart, const UTF8* sourceEnd,
     157             :   UTF32** targetStart, UTF32* targetEnd, ConversionFlags flags);
     158             : 
     159             : ConversionResult ConvertUTF16toUTF8 (
     160             :   const UTF16** sourceStart, const UTF16* sourceEnd,
     161             :   UTF8** targetStart, UTF8* targetEnd, ConversionFlags flags);
     162             : 
     163             : ConversionResult ConvertUTF32toUTF8 (
     164             :   const UTF32** sourceStart, const UTF32* sourceEnd,
     165             :   UTF8** targetStart, UTF8* targetEnd, ConversionFlags flags);
     166             : 
     167             : ConversionResult ConvertUTF16toUTF32 (
     168             :   const UTF16** sourceStart, const UTF16* sourceEnd,
     169             :   UTF32** targetStart, UTF32* targetEnd, ConversionFlags flags);
     170             : 
     171             : ConversionResult ConvertUTF32toUTF16 (
     172             :   const UTF32** sourceStart, const UTF32* sourceEnd,
     173             :   UTF16** targetStart, UTF16* targetEnd, ConversionFlags flags);
     174             : 
     175             : Boolean isLegalUTF8Sequence(const UTF8 *source, const UTF8 *sourceEnd);
     176             : 
     177             : Boolean isLegalUTF8String(const UTF8 **source, const UTF8 *sourceEnd);
     178             : 
     179             : unsigned getNumBytesForUTF8(UTF8 firstByte);
     180             : 
     181             : /*************************************************************************/
     182             : /* Below are LLVM-specific wrappers of the functions above. */
     183             : 
     184             : template <typename T> class ArrayRef;
     185             : template <typename T> class SmallVectorImpl;
     186             : class StringRef;
     187             : 
     188             : /**
     189             :  * Convert a UTF8 StringRef to UTF8, UTF16, or UTF32 depending on
     190             :  * WideCharWidth. The converted data is written to ResultPtr, which needs to
     191             :  * point to at least WideCharWidth * (Source.size() + 1) bytes. On success,
     192             :  * ResultPtr will point one after the end of the copied string. On failure,
     193             :  * ResultPtr will not be changed, and ErrorPtr will be set to the location of
     194             :  * the first character which could not be converted.
     195             :  * \return true on success.
     196             :  */
     197             : bool ConvertUTF8toWide(unsigned WideCharWidth, llvm::StringRef Source,
     198             :                        char *&ResultPtr, const UTF8 *&ErrorPtr);
     199             : 
     200             : /**
     201             : * Converts a UTF-8 StringRef to a std::wstring.
     202             : * \return true on success.
     203             : */
     204             : bool ConvertUTF8toWide(llvm::StringRef Source, std::wstring &Result);
     205             : 
     206             : /**
     207             : * Converts a UTF-8 C-string to a std::wstring.
     208             : * \return true on success.
     209             : */
     210             : bool ConvertUTF8toWide(const char *Source, std::wstring &Result);
     211             : 
     212             : /**
     213             : * Converts a std::wstring to a UTF-8 encoded std::string.
     214             : * \return true on success.
     215             : */
     216             : bool convertWideToUTF8(const std::wstring &Source, std::string &Result);
     217             : 
     218             : 
     219             : /**
     220             :  * Convert a Unicode code point to a UTF8 sequence.
     221             :  *
     222             :  * \param Source a Unicode code point.
     223             :  * \param [in,out] ResultPtr pointer to the output buffer, needs to be at least
     224             :  * \c UNI_MAX_UTF8_BYTES_PER_CODE_POINT bytes.  On success \c ResultPtr is
     225             :  * updated to point one past the end of the converted sequence.
     226             :  *
     227             :  * \returns true on success.
     228             :  */
     229             : bool ConvertCodePointToUTF8(unsigned Source, char *&ResultPtr);
     230             : 
     231             : /**
     232             :  * Convert the first UTF8 sequence in the given source buffer to a UTF32
     233             :  * code point.
     234             :  *
     235             :  * \param [in,out] source A pointer to the source buffer. If the conversion
     236             :  * succeeds, this pointer will be updated to point to the byte just past the
     237             :  * end of the converted sequence.
     238             :  * \param sourceEnd A pointer just past the end of the source buffer.
     239             :  * \param [out] target The converted code point.
     240             :  * \param flags Whether the conversion is strict or lenient.
     241             :  *
     242             :  * \returns conversionOK on success
     243             :  *
     244             :  * \sa ConvertUTF8toUTF32
     245             :  */
     246         374 : inline ConversionResult convertUTF8Sequence(const UTF8 **source,
     247             :                                             const UTF8 *sourceEnd,
     248             :                                             UTF32 *target,
     249             :                                             ConversionFlags flags) {
     250         374 :   if (*source == sourceEnd)
     251             :     return sourceExhausted;
     252         374 :   unsigned size = getNumBytesForUTF8(**source);
     253         374 :   if ((ptrdiff_t)size > sourceEnd - *source)
     254             :     return sourceExhausted;
     255         365 :   return ConvertUTF8toUTF32(source, *source + size, &target, target + 1, flags);
     256             : }
     257             : 
     258             : /**
     259             :  * Returns true if a blob of text starts with a UTF-16 big or little endian byte
     260             :  * order mark.
     261             :  */
     262             : bool hasUTF16ByteOrderMark(ArrayRef<char> SrcBytes);
     263             : 
     264             : /**
     265             :  * Converts a stream of raw bytes assumed to be UTF16 into a UTF8 std::string.
     266             :  *
     267             :  * \param [in] SrcBytes A buffer of what is assumed to be UTF-16 encoded text.
     268             :  * \param [out] Out Converted UTF-8 is stored here on success.
     269             :  * \returns true on success
     270             :  */
     271             : bool convertUTF16ToUTF8String(ArrayRef<char> SrcBytes, std::string &Out);
     272             : 
     273             : /**
     274             : * Converts a UTF16 string into a UTF8 std::string.
     275             : *
     276             : * \param [in] Src A buffer of UTF-16 encoded text.
     277             : * \param [out] Out Converted UTF-8 is stored here on success.
     278             : * \returns true on success
     279             : */
     280             : bool convertUTF16ToUTF8String(ArrayRef<UTF16> Src, std::string &Out);
     281             : 
     282             : /**
     283             :  * Converts a UTF-8 string into a UTF-16 string with native endianness.
     284             :  *
     285             :  * \returns true on success
     286             :  */
     287             : bool convertUTF8ToUTF16String(StringRef SrcUTF8,
     288             :                               SmallVectorImpl<UTF16> &DstUTF16);
     289             : 
     290             : #if defined(_WIN32)
     291             : namespace sys {
     292             : namespace windows {
     293             : std::error_code UTF8ToUTF16(StringRef utf8, SmallVectorImpl<wchar_t> &utf16);
     294             : /// Convert to UTF16 from the current code page used in the system
     295             : std::error_code CurCPToUTF16(StringRef utf8, SmallVectorImpl<wchar_t> &utf16);
     296             : std::error_code UTF16ToUTF8(const wchar_t *utf16, size_t utf16_len,
     297             :                             SmallVectorImpl<char> &utf8);
     298             : /// Convert from UTF16 to the current code page used in the system
     299             : std::error_code UTF16ToCurCP(const wchar_t *utf16, size_t utf16_len,
     300             :                              SmallVectorImpl<char> &utf8);
     301             : } // namespace windows
     302             : } // namespace sys
     303             : #endif
     304             : 
     305             : } /* end namespace llvm */
     306             : 
     307             : #endif
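
The calling convention described in the header comment (a pointer to the source pointer, an end pointer one past the last item, and a result code that separates a truncated input from a malformed one) can be illustrated with a small standalone decoder. This is only a sketch and not LLVM's implementation: the namespace utf8_decode_sketch, the names DecodeResult, sequenceLength and decodeOne, and the choice to always reject surrogates (as a strict conversion would) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace utf8_decode_sketch {  // hypothetical namespace, not part of LLVM

enum class DecodeResult { OK, SourceExhausted, SourceIllegal };

// Sequence length implied by the lead byte, or 0 if the byte cannot start a
// well-formed sequence (continuation byte, C0/C1, or anything above F4).
inline unsigned sequenceLength(unsigned char lead) {
  if (lead < 0x80) return 1;
  if (lead >= 0xC2 && lead <= 0xDF) return 2;
  if (lead >= 0xE0 && lead <= 0xEF) return 3;
  if (lead >= 0xF0 && lead <= 0xF4) return 4;
  return 0;
}

// Decodes one code point from [*source, sourceEnd). On success *source is
// advanced past the sequence. On failure *source is left unchanged, so it
// still points at the start of the offending sequence.
inline DecodeResult decodeOne(const unsigned char **source,
                              const unsigned char *sourceEnd,
                              std::uint32_t *target) {
  assert(source != nullptr && target != nullptr);
  const unsigned char *p = *source;
  if (p == sourceEnd) return DecodeResult::SourceExhausted;

  unsigned len = sequenceLength(*p);
  if (len == 0) return DecodeResult::SourceIllegal;
  if (static_cast<std::ptrdiff_t>(len) > sourceEnd - p)
    return DecodeResult::SourceExhausted;  // partial character at the end

  // Smallest code point that legitimately needs `len` bytes.
  static const std::uint32_t minValue[] = {0, 0, 0x80, 0x800, 0x10000};

  // Payload bits of the lead byte: all 8 for ASCII, else 5, 4 or 3 bits.
  std::uint32_t cp = (len == 1) ? p[0] : (p[0] & (0xFFu >> (len + 1)));
  for (unsigned i = 1; i < len; ++i) {
    if ((p[i] & 0xC0) != 0x80) return DecodeResult::SourceIllegal;
    cp = (cp << 6) | (p[i] & 0x3Fu);
  }
  if (cp < minValue[len]) return DecodeResult::SourceIllegal;  // overlong
  if (cp > 0x10FFFF) return DecodeResult::SourceIllegal;       // out of range
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return DecodeResult::SourceIllegal;  // surrogate (strict behaviour assumed)

  *target = cp;
  *source = p + len;
  return DecodeResult::OK;
}

}  // namespace utf8_decode_sketch
```

With this sketch, the three illegal UTF-8 examples named in the header comment (<F4 90 80 80>, <C0 80> and <A0>) are all reported as illegal, while a buffer that ends in the middle of a multi-byte character is reported as exhausted and leaves the source pointer where it was.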

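The contract documented for ConvertCodePointToUTF8 (the caller supplies at least UNI_MAX_UTF8_BYTES_PER_CODE_POINT bytes, and the result pointer is advanced only on success) can be sketched in the same standalone way. The code below is an illustrative assumption rather than the LLVM implementation: utf8_encode_sketch, kMaxUTF8BytesPerCodePoint and encodeCodePoint are hypothetical names, and rejecting surrogates and values above 0x10FFFF is assumed to match a strict conversion.

```cpp
#include <cstdint>

namespace utf8_encode_sketch {  // hypothetical namespace, not part of LLVM

// Mirrors the value of UNI_MAX_UTF8_BYTES_PER_CODE_POINT in the listing.
constexpr unsigned kMaxUTF8BytesPerCodePoint = 4;

// Writes the UTF-8 encoding of `cp` at `out` and advances `out` past it.
// The caller must provide at least kMaxUTF8BytesPerCodePoint writable bytes.
// Returns false, leaving `out` untouched, for surrogates and for values
// above 0x10FFFF.
inline bool encodeCodePoint(std::uint32_t cp, char *&out) {
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return false;

  if (cp < 0x80) {  // 1 byte: 0xxxxxxx
    *out++ = static_cast<char>(cp);
  } else if (cp < 0x800) {  // 2 bytes: 110xxxxx 10xxxxxx
    *out++ = static_cast<char>(0xC0 | (cp >> 6));
    *out++ = static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    *out++ = static_cast<char>(0xE0 | (cp >> 12));
    *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    *out++ = static_cast<char>(0x80 | (cp & 0x3F));
  } else {  // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    *out++ = static_cast<char>(0xF0 | (cp >> 18));
    *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    *out++ = static_cast<char>(0x80 | (cp & 0x3F));
  }
  return true;
}

}  // namespace utf8_encode_sketch
```

Taking the output pointer by reference, as the declaration in the listing does, lets a caller append several code points in a row without tracking the number of bytes written after each call.
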
Generated by: LCOV version 1.13