Encoding, Unicode, Charset Safety

Standardize on UTF-8, never rely on platform default encoding, and treat charset boundaries explicitly to prevent data corruption and production-only Unicode bugs.

Encoding bugs are silent data corruption

Encoding issues rarely crash your service. They corrupt data silently. A user name becomes question marks. A payment reference loses characters. A JSON payload breaks only in production. These bugs are expensive because they often surface long after the root cause.

Golden rule: always use UTF-8 explicitly

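Never rely on the system default encoding. Before JDK 18 the default charset was derived from the operating system locale, so it could differ between a developer laptop, a CI runner, and a production container. From JDK 18 onward the default is UTF-8 unless it is overridden, but code that passes the charset explicitly behaves the same on every JVM version.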

Bad example: platform default encoding

byte[] bytes = str.getBytes();

This uses the platform default charset. On one machine it may be UTF-8, on another ISO-8859-1.

Correct: explicit charset

import java.nio.charset.StandardCharsets;

byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
String decoded = new String(bytes, StandardCharsets.UTF_8);
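
To make the failure mode concrete, here is a minimal sketch of what happens when the writer and the reader of the same bytes disagree about the charset. The class and method names are illustrative only, and the non-ASCII character is written as a Unicode escape so the example does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

final class CharsetMismatchDemo {

    // Encodes with UTF-8 but decodes with ISO-8859-1, simulating two
    // components that disagree about the charset of the same bytes.
    static String decodeWithWrongCharset(String original) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Same bytes, decoded with the charset that produced them.
    static String decodeWithRightCharset(String original) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00E9"; // "é" as a single code point
        // Each of the two UTF-8 bytes becomes its own char: prints 2
        System.out.println(decodeWithWrongCharset(original).length());
        // Matching charsets round-trip cleanly: prints true
        System.out.println(decodeWithRightCharset(original).equals(original));
    }
}
```

No exception is thrown in the mismatched case. The only symptom is that one character silently turns into two different ones.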

file.encoding: do not trust it

The system property file.encoding can differ across environments. Never design code that depends on it implicitly.

System.getProperty("file.encoding");

In production, set it explicitly at JVM startup where possible. Treat this as a safety net, not as a substitute for passing charsets in code:

java -Dfile.encoding=UTF-8 -jar app.jar
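
One way to notice a misconfigured environment early is to inspect the effective default at startup. The sketch below assumes a hypothetical startup hook; the class name and the log wording are illustrative. It uses Charset.defaultCharset(), which reports the charset the JVM actually applies, and prints the raw property alongside it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultCharsetCheck {

    // Returns true when the JVM-wide default charset is UTF-8.
    static boolean defaultIsUtf8() {
        return StandardCharsets.UTF_8.equals(Charset.defaultCharset());
    }

    // Describes the effective default so a mismatch is visible in logs
    // instead of surfacing later as corrupted data.
    static String describeDefault() {
        return "default charset=" + Charset.defaultCharset().name()
                + ", file.encoding=" + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describeDefault());
        if (!defaultIsUtf8()) {
            System.err.println("WARNING: default charset is not UTF-8");
        }
    }
}
```

Whether a non-UTF-8 default should only be logged or should abort startup is a deployment decision. The check itself does not replace explicit charsets in code.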

Database encoding mismatch

Even if your Java application uses UTF-8, your database may not. If the database, table, column, or connection charset differs, text can be stored corrupted.

Ensure the database charset is UTF-8 (utf8mb4 in MySQL-like systems, where the legacy utf8 charset is limited to three bytes per character).
Ensure the JDBC connection does not override the charset unexpectedly.
Test with non-ASCII characters in integration tests.

Example: JDBC and Unicode safety

// conn is an open java.sql.Connection; try-with-resources closes the statement
try (PreparedStatement ps = conn.prepareStatement(
    "INSERT INTO users(name) VALUES (?)")) {
  ps.setString(1, "Çağrı Öztürk");
  ps.executeUpdate();
}

If the database column or connection charset is wrong, the name may be stored incorrectly.
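
A related check that needs no database: MySQL's legacy utf8 charset stores at most three bytes per character, so any code point above U+FFFF (most emoji, for example) needs utf8mb4. The helper below is a small sketch for test fixtures or input validation; the class and method names are hypothetical.

```java
final class SupplementaryCheck {

    // True when the string contains a code point above U+FFFF, which
    // UTF-8 encodes as four bytes.
    static boolean hasSupplementaryCodePoint(String s) {
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        // Accented Latin letters fit in two UTF-8 bytes each: prints false
        System.out.println(hasSupplementaryCodePoint("\u00C7a\u011Fr\u0131"));
        // U+1F600 is written as a surrogate pair in Java source: prints true
        System.out.println(hasSupplementaryCodePoint("ok \uD83D\uDE00"));
    }
}
```

Including at least one such string in integration test data exercises the four-byte path as well as the accented-letter path.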

JSON and HTTP boundaries

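HTTP bodies and JSON must declare and use UTF-8 consistently. Set the charset parameter on Content-Type explicitly for text media types. For application/json, RFC 8259 already requires UTF-8 and defines no charset parameter, so adding one is harmless but does not change how compliant clients decode the body.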

Example header

Content-Type: application/json; charset=UTF-8

When reading request bodies manually, ensure correct charset:

new InputStreamReader(inputStream, StandardCharsets.UTF_8);
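
As a fuller sketch of reading a body by hand, the helper below decodes the whole stream as UTF-8 in one step. It assumes the body is small enough to buffer in memory and that the sender really used UTF-8; the class name is illustrative, and InputStream.readAllBytes() requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class BodyReader {

    // Reads the entire stream and decodes it as UTF-8, never falling
    // back to the platform default charset.
    static String readUtf8(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is the UTF-8 encoding of U+00E9
        byte[] body = {(byte) 0xC3, (byte) 0xA9};
        // One char comes back from two bytes: prints 1
        System.out.println(readUtf8(new ByteArrayInputStream(body)).length());
    }
}
```

Decoding the complete byte array at once also avoids splitting a multi-byte sequence across chunk boundaries, which can happen when each chunk is converted to a String separately.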

Multi-byte character pitfalls

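UTF-8 is a variable-length encoding that uses one to four bytes per code point, while String.length() counts UTF-16 code units. String length and byte length are therefore not the same.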

Example

String s = "Ö";
System.out.println(s.length()); // 1 (one UTF-16 code unit)
System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 2 (bytes in UTF-8)

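Production bug example: truncating a string to fit a byte limit (for example a fixed-size column or message field) by cutting the encoded byte array can split a multi-byte character, leaving an invalid sequence that decodes to a replacement character.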
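
A byte limit can be respected without cutting through a character by walking the string one code point at a time and stopping before the limit would be exceeded. This is a minimal sketch with hypothetical names; it assumes the input is well-formed UTF-16 (no unpaired surrogates).

```java
final class Utf8Truncator {

    // Returns the longest prefix of s whose UTF-8 encoding fits in
    // maxBytes, cutting only on code point boundaries.
    static String truncateToUtf8Bytes(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int codePoint = s.codePointAt(i);
            int size = utf8Length(codePoint);
            if (bytes + size > maxBytes) {
                break; // the next code point would not fit completely
            }
            bytes += size;
            i += Character.charCount(codePoint);
        }
        return s.substring(0, i);
    }

    // Number of bytes UTF-8 uses for a single code point.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    public static void main(String[] args) {
        String s = "a\u00D6b"; // "aÖb": 1 + 2 + 1 bytes in UTF-8
        System.out.println(truncateToUtf8Bytes(s, 2).length()); // prints 1: only "a" fits
        System.out.println(truncateToUtf8Bytes(s, 3).length()); // prints 2: "a" plus the two-byte letter
    }
}
```

This keeps code points intact but can still separate a base letter from a following combining mark, which is a separate concern from byte boundaries.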

Substring and grapheme clusters

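Some user-perceived characters are composed of multiple code points (e.g., emoji sequences or a base letter plus combining accents), and code points above U+FFFF occupy two char values in a Java String. Truncating blindly with substring can split either kind.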
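
The JDK's java.text.BreakIterator can locate user-perceived character boundaries. The sketch below truncates by those boundaries instead of by char index; the class and method names are illustrative. It handles combining accents and surrogate pairs, while support for complex emoji sequences depends on the JDK version, so treat it as a starting point and not as a complete solution.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class GraphemeTruncator {

    // Keeps at most maxGraphemes user-perceived characters, cutting
    // only at boundaries reported by BreakIterator.
    static String truncateGraphemes(String s, int maxGraphemes) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int end = 0;
        for (int count = 0; count < maxGraphemes; count++) {
            int next = it.next();
            if (next == BreakIterator.DONE) {
                break; // the text has fewer graphemes than requested
            }
            end = next;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "e\u0301x"; // "e" + combining acute accent, then "x"
        // The accent stays attached to its base letter: prints 2
        System.out.println(truncateGraphemes(s, 1).length());
    }
}
```

A plain s.substring(0, 1) on the same input would keep the bare "e" and drop its accent.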

Normalization issues

The same visual character can be encoded in multiple ways (composed vs decomposed forms). String equality may fail.

Example

import java.text.Normalizer;

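String composed = "\u00E9";    // "é" as a single code point (NFC form)
String decomposed = "e\u0301"; // "e" followed by a combining acute accent (NFD form)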

System.out.println(composed.equals(decomposed)); // false

String normalized = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
System.out.println(composed.equals(normalized)); // true

If your system compares user input strictly, normalization may be required at boundaries.

Case conversion and locale sensitivity

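Case conversion depends on locale: toUpperCase() and toLowerCase() without an argument use the JVM default locale, so the same code can produce different strings on differently configured hosts. For identifiers or protocol keys, use Locale.ROOT.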

Safe example

String key = "invoice";
String upper = key.toUpperCase(java.util.Locale.ROOT);
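
The well-known case is Turkish, where lowercase "i" uppercases to a dotted capital "İ" (U+0130) and not to "I". The sketch below contrasts the two conversions on the key from the example above; the class and method names are illustrative.

```java
import java.util.Locale;

final class LocaleCaseDemo {

    // Locale-independent conversion, suitable for protocol keys.
    static String upperRoot(String key) {
        return key.toUpperCase(Locale.ROOT);
    }

    // What toUpperCase() would do on a JVM whose default locale is Turkish.
    static String upperTurkish(String key) {
        return key.toUpperCase(Locale.forLanguageTag("tr"));
    }

    public static void main(String[] args) {
        String key = "invoice";
        System.out.println(upperRoot(key).equals("INVOICE"));    // prints true
        System.out.println(upperTurkish(key).equals("INVOICE")); // prints false
    }
}
```

A lookup such as map.get(key.toUpperCase()) would therefore miss on a Turkish-locale host even though it passes every test on an English-locale build machine.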

File I/O encoding safety

When reading or writing files, always specify charset.

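import java.nio.file.*;
import java.nio.charset.StandardCharsets;

// Files.readString, Files.writeString, and Path.of require Java 11 or later.
Path path = Path.of("users.csv"); // hypothetical file name
String content = Files.readString(path, StandardCharsets.UTF_8);
Files.writeString(path, content, StandardCharsets.UTF_8);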
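
For files too large to hold in one String, the same rule applies to streaming readers and writers. The sketch below writes and reads line by line with an explicit charset and includes a small round-trip helper of the kind an integration test might use. All names are hypothetical, and checked I/O exceptions are wrapped only to keep the example short.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class Utf8LineFiles {

    // Writes each line followed by a line separator, encoded as UTF-8.
    static void writeLines(Path path, List<String> lines) {
        try (BufferedWriter writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                writer.write(line);
                writer.newLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads all lines, decoding the file as UTF-8.
    static List<String> readLines(Path path) {
        List<String> result = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                result.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    // Writes to a temporary file, reads it back, and cleans up.
    static List<String> roundTrip(List<String> lines) {
        try {
            Path tmp = Files.createTempFile("utf8-demo", ".txt");
            try {
                writeLines(tmp, lines);
                return readLines(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Older constructors such as new FileReader(file) and new FileWriter(file) take no charset and fall back to the platform default, which is why they are worth auditing in existing code.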

Production failure scenario

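A CSV export job writes files using the default encoding. The production server's default is ISO-8859-1, so characters outside that charset are silently replaced with question marks, and the remaining accented characters are misread by consumers that expect UTF-8. Customers complain weeks later.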

Correct mitigation

Explicitly use UTF-8 for all file I/O.
Add integration tests that use non-ASCII data.
Validate DB charset configuration.
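
The data loss in this scenario can be reproduced in a few lines, and a legacy target charset can be checked before writing. The sketch below uses illustrative names. It relies on String.getBytes(Charset) replacing unmappable characters with the charset's replacement byte (a question mark for ISO-8859-1) and on CharsetEncoder.canEncode to detect the problem up front.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class LossyEncodingDemo {

    // Mimics the export job: encode as ISO-8859-1, then read the bytes back.
    static String roundTripLatin1(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // True if every character of s can be represented in the charset.
    static boolean canEncode(String s, Charset charset) {
        return charset.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String name = "\u00C7a\u011Fr\u0131"; // "Çağrı"
        // U+011F and U+0131 do not exist in ISO-8859-1: prints false
        System.out.println(roundTripLatin1(name).equals(name));
        System.out.println(canEncode(name, StandardCharsets.ISO_8859_1)); // prints false
        System.out.println(canEncode(name, StandardCharsets.UTF_8));      // prints true
    }
}
```

If an export must target a legacy charset for a downstream consumer, a check like canEncode lets the job fail loudly instead of shipping question marks.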

Checklist

Always specify StandardCharsets.UTF_8 explicitly.
Never rely on platform default encoding.
Set Content-Type charset for HTTP responses.
Validate DB charset configuration.
Be careful with byte-based truncation.
Use Locale.ROOT for protocol/identifier transformations.
Normalize input if equality across systems matters.

Final principle

Encoding is not optional metadata. It is part of the data contract. If you do not define it explicitly, production will eventually define it for you.
