2
votes
2 answers
7256 views

Convert from hex string to unicode

How can I convert the 'dead' string to a unicode string u'\xde\xad'? Doing this: from binascii import unhexlify out = ''.join(x for x in [unhexlify('de'), unhexlify('ad')]) creates a <type 'str'> string '\xde\xad'. Trying to use the Unicode.join() like this: from binascii import unhex...
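
A minimal sketch of one way to get that result. It assumes Python 3 rather than the Python 2 used in the question, and it assumes that each byte should map to the code point of the same value, which is what decoding as Latin-1 does. `hex_to_text` is a hypothetical helper name.

```python
def hex_to_text(hex_string):
    """Turn a hex string such as 'dead' into text, one character per byte."""
    raw = bytes.fromhex(hex_string)   # 'dead' -> b'\xde\xad'
    # Latin-1 maps byte value N to code point U+00NN, so no byte can fail.
    return raw.decode("latin-1")


result = hex_to_text("dead")
```

Here `result` is a two-character `str`, the Python 3 counterpart of the u'\xde\xad' asked for.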

1
vote
2 answers
672 views

Rendering unicode characters correctly on textbox

I am working on a translation application in which users give English input, which I need to convert to a target language and display in a text box. I am facing problems displaying unicode characters: complex characters are not rendering correctly. I know Windows uses Uniscribe fo...

1
vote
6 answers
5972 views

Extract first valid line of string from byte array

I am writing a utility in Java that reads a stream which may contain both text and binary data. I want to avoid having I/O wait. To do that I create a thread to keep reading the data (and wait for it) putting it into a buffer, so the clients can check availability and terminate the waiting whenev...

12
votes
4 answers
11492 views

Java Can't Open a File with Surrogate Unicode Values in the Filename?

I'm dealing with code that does various IO operations with files, and I want to make it able to deal with international filenames. I'm working on a Mac with Java 1.5, and if a filename contains Unicode characters that require surrogates, the JVM can't seem to locate the file. For example, my test...

45
votes
6 answers
45609 views

UTF-8 In Python logging, how?

I'm trying to log a UTF-8 encoded string to a file using Python's logging package. As a toy example: import logging def logging_test(): handler = logging.FileHandler("/home/ted/logfile.txt", "w", encoding = "UTF-8") formatter = logging.Formatter("%(mes...

1
vote
4 answers
253 views

I need a string that won't properly convert to ANSI using several code pages

My .NET library has to marshal strings to a C library that expects text encoded using the system's default ANSI code page. Since .NET supports Unicode, this makes it possible for users to pass a string to the library that doesn't properly convert to ANSI. For example, on an English machine, "デス...

110
votes
3 answers
22911 views

How does UTF-8 "variable-width encoding" work?

The unicode standard has enough code-points in it that you need 4 bytes to store them all. That's what the UTF-32 encoding does. Yet the UTF-8 encoding somehow squeezes these into much smaller spaces by using something called "variable-width encoding". In fact, it manages to represent the fi...
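
A small sketch that makes the "variable width" visible: it encodes four sample characters and records how many bytes each takes, plus the bit pattern of the first byte. The sample characters and the helper names are illustrative choices, not part of the question.

```python
# Sample characters chosen to land in each of the four UTF-8 length classes.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_length(ch):
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


def leading_byte_bits(ch):
    """Bit pattern of the first encoded byte; its high bits signal the length."""
    return format(ch.encode("utf-8")[0], "08b")


lengths = [utf8_length(ch) for ch in samples]
leading = [leading_byte_bits(ch) for ch in samples]
```

The leading byte starts with `0` for a one-byte sequence, `110` for two bytes, `1110` for three and `11110` for four, which is how a decoder knows how many continuation bytes follow.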

4
votes
1 answer
392 views

Does python's print function handle unicode differently now than when Dive Into Python was written?

I'm trying to work my way through some frustrating encoding issues by going back to basics. In Dive Into Python example 9.14 (here) we have this: >>> s = u'La Pe\xf1a' >>> print s Traceback (innermost last): File "<interactive input>", line 1, in ? UnicodeError: ASCII en...

2
votes
1 answer
346 views

Please help me trace how charsets are handled every step of the way

We all know how easy character sets are on the web, yet every time you think you got it right, a foreign charset bites you in the butt. So I'd like to trace the steps of what happens in a fictional scenario I will describe below. I'm going to try and put down my understanding as well as possible ...

0
votes
1 answer
254 views

Fixing Unicode Oops

It seems that we have managed to insert into our database 2 unicode characters for each of the unicode characters we want. For example, for the unicode char 0x3CBC, we've inserted the unicode equivalents for each of its components (0xC383 and 0xC2BC). Can anyone think of a simple solution for fi...
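
A hedged sketch of one common repair. It assumes the damage came from UTF-8 bytes being read as Latin-1 and then encoded as UTF-8 a second time, which would be consistent with the byte pairs 0xC383 and 0xC2BC quoted above. The excerpt does not confirm that cause, and `undo_double_encoding` is a hypothetical helper name.

```python
def undo_double_encoding(text):
    """Reverse one round of 'UTF-8 bytes misread as Latin-1' damage.

    Raises UnicodeEncodeError or UnicodeDecodeError if the text was not
    damaged in the assumed way, so untouched rows can be left alone.
    """
    original_bytes = text.encode("latin-1")   # recover the original byte values
    return original_bytes.decode("utf-8")     # decode them the way they were meant


# The stored bytes C3 83 C2 BC decode to the two characters U+00C3 U+00BC.
stored = b"\xc3\x83\xc2\xbc".decode("utf-8")
repaired = undo_double_encoding(stored)
```

If the intermediate step used Windows-1252 rather than Latin-1, the same idea applies with `"cp1252"` in the `encode` call, but that is a separate assumption to verify against the data.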

0
votes
5 answers
197 views

Can someone help me figure this out? About unicode

hibyte lobyte makeunicode 250 65 57345 I got this table, and the hibyte and lobyte are some Chinese characters which may use big5 or GBK encoding; hibyte is the high byte, and lobyte is the low byte. And I think the unicode might be some encoding in unicode that corresponds to the big5...

1
vote
1 answer
70 views

Fixing older program: database text encoding, and incorrect field types

I'm currently again working on a program from when I was, umm... less capable. It has a number of problems: The database collation is latin1_swedish_ci. I would like to convert it to utf8. How would I do this? The database has some fields that are boolean values stored as 0 or 1. However, the f...

1
vote
5 answers
5037 views

convert function to delphi 2009/2010 (unicode)

I'm slowly converting my existing code into Delphi 2010 and have read several of the articles on the Embarcadero web site as well as Marco Cantú's whitepaper. There are still some things I haven't understood, so here are two functions to exemplify my question: function RemoveSpace(InStr: string): string; ...

-1
votes
4 answers
474 views

Java Unicode problem

My question is: what's wrong with the following code? I'm trying with j2ee to read some unicode from a database and some characters are returned as the famous question mark. try { Class.forName("com.mysql.jdbc.Driver"); String connectionUrl = "jdbc:mysql://localho...

0
votes
2 answers
548 views

Encoding/Decoding strange issue

This line of code, which decodes an encoded Chinese word: URLDecoder.decode("%E4%BB%BB%E4%BD%95%E8%BD%A6%E8%BE%86%E5%BA%94", "UTF-8").getBytes().length When I run it in a JSP page (on Jboss) it prints 5: <%= URLDecoder.decode("%E4%BB%BB%E4%BD%95%E8%BD%A6%E8%BE%86%E5%BA%94", "UT...
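
A sketch in Python (not the Java from the question) showing why a byte count taken after decoding depends on which charset is used to turn the text back into bytes. Java's no-argument `getBytes()` uses the platform default charset, so a default that cannot represent Chinese is one possible explanation for the 5, though the excerpt does not confirm it. The ASCII case below is only an illustration of such a charset.

```python
from urllib.parse import unquote

encoded = "%E4%BB%BB%E4%BD%95%E8%BD%A6%E8%BE%86%E5%BA%94"

# Percent-decoding as UTF-8 yields five Chinese characters.
text = unquote(encoded, encoding="utf-8")
char_count = len(text)

# Re-encoding as UTF-8 needs three bytes per character here.
utf8_byte_count = len(text.encode("utf-8"))

# A charset that cannot represent the characters substitutes one '?' each.
ascii_byte_count = len(text.encode("ascii", errors="replace"))
```

Passing an explicit charset when converting text to bytes removes the dependence on the environment.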

106
votes
4 answers
29681 views

How can I iterate through the unicode codepoints of a Java String?

So I know about String#codePointAt(int), but it's indexed by the char offset, not by the codepoint offset. I'm thinking about trying something like: using String#charAt(int) to get the char at an index testing whether the char is in the high-surrogates range if so, use String#codePointAt(i...
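
The approach outlined above can be sketched as follows. This is Python rather than Java, so it first produces the UTF-16 code units that a Java `String` would hold and then walks them, combining surrogate pairs. `utf16_units` and `code_points` are hypothetical helper names.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text, as a Java String would store them."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def code_points(units):
    """Yield code points from UTF-16 code units, joining surrogate pairs."""
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Combine the 10 payload bits of each surrogate half.
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2          # a pair consumes two code units
        else:
            yield unit      # BMP character, or an unpaired surrogate passed through
            i += 1


units = utf16_units("a\U0001D11E")
points = list(code_points(units))
```

The key point is that the index advances by two after a surrogate pair and by one otherwise, which is the step the question is working toward.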

3
votes
5 answers
3562 views

How to deal with query parameter's encoding?

I assumed that any data being sent to my parameter strings would be utf-8, since that is what my whole site uses throughout. Lo-and-behold I was wrong. For this example has the character ä in utf-8 in the document (from the query string) but proceeds to send a B\xe4ule (which is either ISO-8859-...

3
votes
2 answers
177 views

How does one allow a subset of UNICODE codepoints in input validation?

I am creating a service that could "go international" to non-English speaking markets. I do not want to restrict a username to the ASCII range of characters but would like to allow a user to specify their "natural" username. OK, use UNICODE (and say UTF-8 as my username text encoding). But! I...

0
votes
2 answers
1379 views

Unicode issue with freetype (C)

I am currently working on a library for the NekoVM to create a binding to Freetype 2. It is written in plain C and it all works really nicely, except that when the user enters some unicode chars like "ü", "Ä" or "ß" they are transformed into some ugly square-like letters. When I receive the data fro...

0
votes
2 answers
1983 views

Open file in TagLib with Unicode chars in filename

I am quite new to the C++ 'area' so I hope this will not be just another silly 'C++ strings' question. Here is my problem. I want to integrate TagLib (1.5, 1.6 as soon as I manage to build it for Windows) into an existing Windows MFC VS2005 project. I need it to read audio files metadata (not wr...

1
vote
4 answers
1224 views

Asc(Chr(254)) returns 116 in .Net 1.1 when language is Hungarian

I set the culture to Hungarian language, and Chr() seems to be broken. System.Threading.Thread.CurrentThread.CurrentCulture = "hu-US" System.Threading.Thread.CurrentThread.CurrentUICulture = "hu-US" Chr(254) This returns "ţ" when it should be "þ" However, Asc("ţ") returns 116. This: Asc(C...

1
vote
1 answer
798 views

VerQueryValue and multi codepage Unicode characters

In our application we use VerQueryValue() API call to fetch version info such as ProductName etc. For some applications running on a machine in Traditional Chinese (code page 950), the ProductName which has Unicode sequences that span multiple code pages, some characters are not translated proper...

8
votes
2 answers
6640 views

How should escaped unicode be handled by json parsers and encoders?

The json spec allows for escaped unicode in json strings (of the form \uXXXX). It specifically mentions a restricted codepoint (a noncharacter) as a valid escaped codepoint. Doesn't this imply parsers should generate illegal unicode from strings containing noncharacters and restricted codepoints?...

0
votes
1 answer
1318 views

Regex word-break with unicode diacritics

I am working on an application that searches text using regular expressions based on input from a user. One option the user has is to include a "Match 0 or more characters" wildcard using the asterisk. I need this to only match between word boundaries. My first attempt was to convert all asterisk...

0
votes
3 answers
804 views

Where can I find a list of .NET unicode (wide) functions?

I would like to get a list of the VB.net/C# "wide" functions for unicode - i.e. AscW, ChrW, MessageBoxW, etc. Is there a list of these somewhere?

18
votes
5 answers
8148 views

Why does wide file-stream in C++ narrow written data by default?

Honestly, I just don't get the following design decision in C++ Standard library. When writing wide characters to a file, the wofstream converts wchar_t into char characters: #include <fstream> #include <string> int main() { using namespace std; wstring someString = L"Hello...

0
votes
1 answer
83 views

How to extract characters of a particular language

How can I extract only the characters of a particular language from a file containing characters of that language, alphanumeric characters, and the English alphabet?

3
votes
1 answer
278 views

Unicode regular expression tutorial

Is there a good tutorial available for changing ASCII regular expressions to Unicode regular expressions? I need to convert an existing US English application to support internationalization.

4
votes
2 answers
9631 views

How can I detect japanese text in a Java string?

I need to be able to detect Japanese characters in a Java string. Currently I'm getting the UnicodeBlock and checking to see if it's equal to Character.UnicodeBlock.KATAKANA or Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS, but I'm not 100% sure that's going to cover everything. Any suggestio...
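
A rough sketch of the same block-based check in Python, using code point ranges instead of Java's `Character.UnicodeBlock`. The choice of blocks (Hiragana, Katakana, CJK Unified Ideographs, Halfwidth and Fullwidth Forms) is an assumption about what "Japanese" should cover, and `contains_japanese` is a hypothetical helper name.

```python
# (start, end) code point ranges of the Unicode blocks checked here.
JAPANESE_RANGES = [
    (0x3040, 0x309F),   # Hiragana
    (0x30A0, 0x30FF),   # Katakana
    (0x4E00, 0x9FFF),   # CJK Unified Ideographs (kanji, shared with Chinese)
    (0xFF00, 0xFFEF),   # Halfwidth and Fullwidth Forms
]


def contains_japanese(text):
    """True if any character falls inside one of the ranges above."""
    return any(
        start <= ord(ch) <= end
        for ch in text
        for start, end in JAPANESE_RANGES
    )


kana_found = contains_japanese("こんにちは")
latin_found = contains_japanese("hello")
```

Because the ideograph block is shared with Chinese, a hit in that range alone does not prove the text is Japanese; checking for kana is the more specific signal.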

4
votes
4 answers
446 views

What is the best practice for creating libraries that support both Unicode and ASCII in C++?

I'm working on writing some libraries that will be used both internally and by customers and was wondering what the best method of supporting both Unicode and ASCII is. It looks like Microsoft (in the MFC Libraries) writes both the Unicode and ASCII classes and does something similar to this in the...