Encoding, Unicode, Charset Safety
Encoding bugs are silent data corruption
Encoding issues rarely crash your service. They corrupt data silently. A user name becomes question marks. A payment reference loses characters. A JSON payload breaks only in production. These bugs are expensive because they often surface long after the root cause was introduced, when the corrupted data has already been stored or sent downstream.
Golden rule: always use UTF-8 explicitly
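Never rely on the system default encoding. Before JDK 18 the JVM derived its default charset from the operating system locale, so the same code may run with different defaults across environments. From JDK 18 onward the default is UTF-8, but it can still be overridden at startup.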
Bad example: platform default encoding
byte[] bytes = str.getBytes();
This uses the platform default charset. On one machine it may be UTF-8, on another ISO-8859-1.
Correct: explicit charset
import java.nio.charset.StandardCharsets;

byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
String decoded = new String(bytes, StandardCharsets.UTF_8);
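To make the failure mode concrete, here is a minimal sketch that encodes with one charset and decodes with another. The class and method names are illustrative only, and the strings use Unicode escapes so the example does not depend on the encoding of the source file itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    // Encodes text with one charset and decodes the bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // U+00E9 is two bytes in UTF-8 (C3 A9). Read as ISO-8859-1,
        // each byte becomes its own character: U+00C3 followed by U+00A9.
        String mojibake = roundTrip("\u00E9",
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(mojibake.equals("\u00C3\u00A9")); // true

        // U+011F does not exist in ISO-8859-1, so getBytes substitutes '?'.
        String lost = roundTrip("\u011F",
                StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);
        System.out.println(lost.equals("?")); // true
    }
}
```

The second case is the irreversible one: once a character has been replaced with a question mark, no later decoding step can recover it.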
file.encoding: do not trust it
The system property file.encoding can differ across environments. Never design code that depends on it implicitly.
System.getProperty("file.encoding");
In production, enforce UTF-8 explicitly at JVM startup where possible:
java -Dfile.encoding=UTF-8 -jar app.jar
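A startup check can make a misconfigured environment visible early instead of weeks later. The sketch below is one possible shape, not a standard facility: the class name and the choice to warn rather than abort are assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetCheck {
    // True when the given charset is UTF-8.
    static boolean isUtf8(Charset charset) {
        return StandardCharsets.UTF_8.equals(charset);
    }

    public static void main(String[] args) {
        // Charset.defaultCharset() reports what the JVM actually resolved at startup.
        Charset actual = Charset.defaultCharset();
        if (!isUtf8(actual)) {
            System.err.println("WARNING: default charset is " + actual
                    + ", expected UTF-8");
        }
    }
}
```

Whether to log a warning or refuse to start is a policy decision. The check only detects a wrong default; code paths should still pass the charset explicitly.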
Database encoding mismatch
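Even if your Java application uses UTF-8, your database may not. If the database, table, column, or connection charset differs, characters it cannot represent are stored as replacement characters or mojibake.
- Ensure database charset is UTF-8 (or utf8mb4 in MySQL-like systems, where the legacy utf8 charset stores at most 3 bytes per character and cannot hold emoji).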
- Ensure JDBC connection does not override charset unexpectedly.
- Test with non-ASCII characters in integration tests.
Example: JDBC and Unicode safety
// conn is an open java.sql.Connection
try (PreparedStatement ps = conn.prepareStatement(
        "INSERT INTO users(name) VALUES (?)")) {
    ps.setString(1, "Çağrı Öztürk");
    ps.executeUpdate();
}
If the database column or connection charset is wrong, the name may be stored incorrectly.
JSON and HTTP boundaries
HTTP bodies and JSON must declare and use UTF-8 consistently. Always set Content-Type with charset explicitly.
Example header
Content-Type: application/json; charset=UTF-8
When reading request bodies manually, decode with an explicit charset rather than the platform default:
new InputStreamReader(inputStream, StandardCharsets.UTF_8);
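As a slightly fuller sketch, the helper below reads an entire body and decodes it as UTF-8 in one step. The class and method names are hypothetical, and it assumes the body is small enough to hold in memory and that the sender really used UTF-8. It needs Java 9 or later for readAllBytes().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class RequestBodyReader {
    // Reads the whole stream, then decodes once with an explicit charset.
    static String readUtf8(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Simulated request body containing non-ASCII characters.
        byte[] body = "{\"name\":\"\u00C7a\u011Fr\u0131\"}"
                .getBytes(StandardCharsets.UTF_8);
        String json = readUtf8(new ByteArrayInputStream(body));
        System.out.println(json.length()); // 16
    }
}
```

Decoding the complete byte array at once avoids a related bug: decoding fixed-size chunks separately with new String(buffer, 0, n, charset) can split a multi-byte character across two chunks.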
Multi-byte character pitfalls
UTF-8 uses variable-length encoding: one code point takes one to four bytes. String.length() counts UTF-16 code units, not bytes, so string length and byte length are not the same.
Example
String s = "Ö";
System.out.println(s.length()); // 1 (UTF-16 code unit)
System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 2 bytes
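Production bug example: truncating a value at a fixed byte offset (for example, to fit a column size limit) can cut a multi-byte character in half, leaving an invalid byte sequence that decodes to a replacement character.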
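One way to truncate to a byte budget safely is to walk the string by code point and stop before the budget would be exceeded. The following is a minimal sketch under that assumption; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8Truncate {
    // Returns the longest prefix of s whose UTF-8 encoding fits in maxBytes,
    // never cutting through the middle of a code point.
    static String truncateUtf8(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // UTF-8 size of this code point: 1 to 4 bytes.
            int size = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + size > maxBytes) {
                break;
            }
            bytes += size;
            i += Character.charCount(cp); // 2 for surrogate pairs
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        String text = "a\u00D6"; // 'a' is 1 byte, U+00D6 is 2 bytes
        // Naive byte truncation cuts the two-byte sequence in half.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String naive = new String(utf8, 0, 2, StandardCharsets.UTF_8);
        System.out.println(naive.equals("a\uFFFD")); // true
        // Code-point-aware truncation drops the character that does not fit.
        System.out.println(truncateUtf8(text, 2).equals("a")); // true
    }
}
```

This protects code points only. A user-perceived character made of several code points can still be split, so a stricter version would also respect grapheme boundaries.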
Substring and grapheme clusters
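Some user-perceived characters (grapheme clusters) are composed of multiple code points (e.g., emoji sequences or a letter plus combining accents). substring() works on UTF-16 code units, so truncating blindly can split them.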
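The JDK's java.text.BreakIterator can locate user-perceived character boundaries. The sketch below truncates to a number of such characters; the class and method names are hypothetical. It handles combining accents reliably, while older JDKs may still split some multi-code-point emoji sequences, so test with your own runtime and data.

```java
import java.text.BreakIterator;
import java.util.Locale;

class GraphemeTruncate {
    // Keeps at most maxGraphemes user-perceived characters from the start of s.
    static String truncateGraphemes(String s, int maxGraphemes) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int end = 0;
        for (int count = 0; count < maxGraphemes; count++) {
            int next = it.next(); // index of the next character boundary
            if (next == BreakIterator.DONE) {
                break;
            }
            end = next;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String text = "e\u0301x"; // e + COMBINING ACUTE ACCENT, then x
        // substring(0, 1) keeps the base letter and silently drops the accent.
        System.out.println(text.substring(0, 1).equals("e")); // true
        // Boundary-aware truncation keeps the letter and its accent together.
        System.out.println(truncateGraphemes(text, 1).equals("e\u0301")); // true
    }
}
```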
Normalization issues
The same visual character can be encoded in multiple ways (composed vs decomposed forms). String equality may fail.
Example
import java.text.Normalizer;

String composed = "\u00E9";    // é as a single code point (composed form)
String decomposed = "e\u0301"; // e followed by U+0301 COMBINING ACUTE ACCENT
System.out.println(composed.equals(decomposed)); // false
String normalized = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
System.out.println(composed.equals(normalized)); // true
If your system compares user input strictly, normalization may be required at boundaries.
Case conversion and locale sensitivity
Case conversion depends on locale: the no-argument toUpperCase() and toLowerCase() use the JVM default locale. For identifiers or protocol keys, use Locale.ROOT.
Safe example
import java.util.Locale;

String key = "invoice";
String upper = key.toUpperCase(Locale.ROOT);
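The classic case is Turkish, where the lowercase letter i uppercases to a dotted capital İ (U+0130). The sketch below contrasts the two behaviors; the class and method names are illustrative.

```java
import java.util.Locale;

class LocaleCaseDemo {
    // Locale-independent conversion for identifiers and protocol keys.
    static String protocolKey(String raw) {
        return raw.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // Under Turkish rules both i letters become U+0130, not ASCII I.
        String localized = "invoice".toUpperCase(turkish);
        System.out.println(localized.equals("INVOICE")); // false
        System.out.println(localized.equals("\u0130NVO\u0130CE")); // true
        // Locale.ROOT gives the same result on every machine.
        System.out.println(protocolKey("invoice").equals("INVOICE")); // true
    }
}
```

A lookup such as map.get(key.toUpperCase()) can therefore work on a developer machine and fail on a server whose default locale is Turkish.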
File I/O encoding safety
When reading or writing files, always specify charset.
import java.nio.file.*;
import java.nio.charset.StandardCharsets;

// path is a java.nio.file.Path
String content = Files.readString(path, StandardCharsets.UTF_8);
Files.writeString(path, content, StandardCharsets.UTF_8);
Production failure scenario
A CSV export job writes using the default encoding. The production server default is ISO-8859-1. Characters outside ISO-8859-1 are written as question marks, and the remaining non-ASCII characters are misread by consumers that expect UTF-8. Customers complain weeks later.
Correct mitigation
- Explicitly use UTF-8 for all file I/O.
- Integration test with non-ASCII data.
- Validate DB charset configuration.
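For the export job itself, the first mitigation could look roughly like the sketch below. The class name, method name, and the export.csv path are placeholders, and the writer does no CSV quoting or escaping, so treat it as an illustration of the charset handling only.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class CsvExport {
    // Writes each line as UTF-8, regardless of the platform default charset.
    static void writeLines(OutputStream out, List<String> lines) {
        try {
            Writer w = new BufferedWriter(
                    new OutputStreamWriter(out, StandardCharsets.UTF_8));
            for (String line : lines) {
                w.write(line);
                w.write("\r\n");
            }
            w.flush(); // flush here; closing the stream is left to the caller
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // "export.csv" is a placeholder path for this sketch.
        try (OutputStream out = Files.newOutputStream(Path.of("export.csv"))) {
            writeLines(out, List.of("id,name",
                    "1,\u00C7a\u011Fr\u0131 \u00D6zt\u00FCrk"));
        }
    }
}
```

Taking an OutputStream rather than a file path keeps the method easy to cover in a test with non-ASCII data: write into a ByteArrayOutputStream and compare the bytes with the expected UTF-8 encoding.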
Checklist
- Always specify StandardCharsets.UTF_8 explicitly.
- Never rely on platform default encoding.
- Set Content-Type charset for HTTP responses.
- Validate DB charset configuration.
- Be careful with byte-based truncation.
- Use Locale.ROOT for protocol/identifier transformations.
- Normalize input if equality across systems matters.
Final principle
Encoding is not optional metadata. It is part of the data contract. If you do not define it explicitly, production will eventually define it for you.