Python's built-in file handling is one of the language's genuine strengths — no imports needed
for basic read/write operations, and the API is clean enough to learn in an afternoon. But there's a real
gap between the tutorial version and what you'd actually ship. The tutorial version opens a file, reads it,
and closes it. The production version deals with encoding mismatches that corrupt data silently, paths that
work on macOS but blow up on Windows, and log files that quietly eat all your memory if you call read() on a 2 GB file. This article covers the patterns that hold up — not just the happy path.
The with Statement — Always Use It
Every file handling example in Python should use a context manager — the with block that ensures the file is closed even if an exception is raised mid-read. A context manager is an object that defines what happens on entry and exit from a with block; for files, exit means close() gets called automatically. Here's why it matters in practice:
```python
# ❌ Manual close — works until it doesn't
f = open('app.log')
data = f.read()  # if this raises an exception...
f.close()        # ...this line never runs. File handle leaks.

# ✅ Context manager — close() is guaranteed
with open('app.log') as f:
    data = f.read()
# file is closed here, no matter what happened inside the block
```

On long-running servers this isn't academic — leaking file handles eventually causes OSError: [Errno 24] Too many open files. The with statement costs nothing and prevents that class of bug entirely. Use it everywhere.
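The with protocol isn't limited to files; any object that defines __enter__ and __exit__ works. A minimal sketch of a hypothetical Timer context manager (the class and its behaviour are illustrative, not from this article) shows the two hooks:

```python
import time

class Timer:
    """Context manager that records how long the with block took."""

    def __enter__(self):
        self.start = time.perf_counter()
        return self  # this is what gets bound after 'as'

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs even if the block raised; returning False propagates the exception
        self.elapsed = time.perf_counter() - self.start
        return False

with Timer() as t:
    sum(range(100_000))

print(f"block took {t.elapsed:.4f}s")
```

For files, __exit__ is where close() happens, which is why the guarantee holds even on exceptions.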
Reading Files — Four Ways, One Right Tool Each Time
Python gives you several methods on a file object, and picking the right one matters more than most tutorials admit:
- f.read() — reads the entire file into a single string. Fine for small config files, dangerous for large ones.
- f.readline() — reads one line at a time, advancing the internal pointer. Useful when you need manual control over iteration.
- f.readlines() — reads all lines into a list. Convenient, but still loads the whole file into memory.
- for line in f: — the iterator protocol. Reads one line at a time without loading the full file. This is the one to reach for by default.
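A quick way to see the difference is to run all four against the same small file (a throwaway temp file here, so the sketch is self-contained):

```python
import os
import tempfile

# Create a small throwaway file to read back four ways
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('alpha\nbeta\ngamma\n')

with open(path, encoding='utf-8') as f:
    whole = f.read()           # one string: 'alpha\nbeta\ngamma\n'

with open(path, encoding='utf-8') as f:
    first = f.readline()       # 'alpha\n' (pointer now sits at line 2)

with open(path, encoding='utf-8') as f:
    all_lines = f.readlines()  # ['alpha\n', 'beta\n', 'gamma\n']

with open(path, encoding='utf-8') as f:
    stripped = [line.rstrip('\n') for line in f]  # lazy, line by line
```

Note that the first three hold everything they return in memory at once; only the iterator form stays constant-size.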
Here's a realistic example: reading a .env-style config file and turning it
into a dictionary. This is the kind of thing you actually write, not a contrived "read hello.txt" demo:
```python
def load_config(path: str) -> dict:
    """Read a key=value config file, ignoring comments and blank lines."""
    config = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            if '=' not in line:
                continue
            key, _, value = line.partition('=')
            config[key.strip()] = value.strip()
    return config

# Usage
settings = load_config('config/app.conf')
db_host = settings.get('DB_HOST', 'localhost').strip()
```

A good habit: when reading lines, remember that every line except the last includes a trailing \n (and on Windows, \r\n). Call line.strip() to remove both. If you only want to strip the newline and not leading whitespace, use line.rstrip('\n') instead.
Writing and Appending — Know Which Mode Destroys Data
The second argument to open() is the mode. Two modes trip people up repeatedly:
- 'w' — write mode. Opens the file for writing. If the file already exists, it is truncated to zero bytes immediately — before you write a single character. This is silent data destruction if you open the wrong path.
- 'a' — append mode. Opens the file and moves the write pointer to the end. Existing content is never touched. New writes go after whatever was already there.
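The difference is easy to demonstrate with a throwaway file (temp path here so the sketch runs anywhere):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

with open(path, 'w', encoding='utf-8') as f:
    f.write('first run\n')

# 'a' preserves existing content and writes after it
with open(path, 'a', encoding='utf-8') as f:
    f.write('second run\n')

# 'w' truncates the moment the file is opened
with open(path, 'w', encoding='utf-8') as f:
    f.write('only this survives\n')

with open(path, encoding='utf-8') as f:
    content = f.read()
# content is now just 'only this survives\n'
```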
A good use case for append mode is writing a structured log file with timestamps. Here's a pattern that's useful in scripts and small services alike:
```python
import datetime
from pathlib import Path

LOG_FILE = Path('logs/pipeline.log')

def log_event(level: str, message: str) -> None:
    # Timezone-aware replacement for the deprecated datetime.utcnow()
    now = datetime.datetime.now(datetime.timezone.utc)
    timestamp = now.isoformat().replace('+00:00', 'Z')
    line = f"[{timestamp}] {level.upper()}: {message}\n"
    LOG_FILE.parent.mkdir(parents=True, exist_ok=True)  # make sure logs/ exists
    with open(LOG_FILE, 'a', encoding='utf-8') as f:
        f.write(line)

log_event('info', 'Pipeline started')
log_event('warning', 'Retrying connection to database')
log_event('error', 'Failed to parse row 4821 — skipping')
```

open(path, 'w') creates the file if it doesn't exist — which is convenient — but it also silently destroys the file if it does exist. A mistyped path can wipe a production file without any error message. If you're not sure the file should be overwritten, check first with Path(path).exists() or use 'x' mode, which raises FileExistsError instead of overwriting.
Encoding — The Bug That Bites Everyone Eventually
This is the single most common source of silent data corruption in Python file handling.
Python 3's default encoding when you call open() without specifying one is determined
by locale.getpreferredencoding() — which on Windows is typically cp1252,
and on Linux/macOS is usually UTF-8. That means code that works perfectly on your Mac
can silently mangle or crash on a Windows server when the file contains any character outside ASCII.
The fix is one extra argument:
```python
# ❌ Platform-dependent — works on Linux, corrupts on Windows
with open('customers.csv') as f:
    data = f.read()

# ✅ Explicit UTF-8 — same behavior on every platform
with open('customers.csv', encoding='utf-8') as f:
    data = f.read()

# For files exported from Excel on Windows — may have a BOM (byte order mark)
# utf-8-sig strips the BOM automatically on read
with open('export.csv', encoding='utf-8-sig') as f:
    data = f.read()
```

The BOM issue is particularly common with CSV files exported from Microsoft Excel — the file starts with a hidden \ufeff character that appears as ï»¿ if read with the wrong encoding, or causes the first column header to look like ï»¿name instead of name. Using encoding='utf-8-sig' handles it transparently. See the Python codecs documentation for the full list of encoding names.
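You can reproduce the failure mode in a few lines: write UTF-8 bytes, then decode them with cp1252. Note that the mismatch doesn't raise; it silently produces mojibake (a throwaway temp file keeps the sketch self-contained):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'names.txt')

with open(path, 'w', encoding='utf-8') as f:
    f.write('café')      # 'é' becomes two bytes in UTF-8: 0xC3 0xA9

# Decoding those bytes as cp1252 "succeeds" but corrupts the text
with open(path, encoding='cp1252') as f:
    mangled = f.read()   # 'cafÃ©' (no exception raised!)

with open(path, encoding='utf-8') as f:
    correct = f.read()   # 'café'
```

The silent success is exactly why this bug survives testing on the developer's machine and only surfaces on a server with a different locale.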
The takeaway: add encoding='utf-8' (or 'utf-8-sig' for Excel exports) to every open() call. Make it a habit — it costs nothing and eliminates an entire category of environment-specific bugs.
Working with Paths — Use pathlib
The old way to build file paths in Python was string concatenation or os.path.join(). The modern way is pathlib.Path, available since Python 3.4 and fully mature since 3.6. It handles path separators correctly on Windows and Unix without you thinking about it, and it replaces a handful of os.path calls with readable attribute access.
```python
from pathlib import Path

# Build a path relative to the current script — works on Windows and Unix
base_dir = Path(__file__).parent
data_dir = base_dir / 'data'
input_file = data_dir / 'records.csv'

# Check existence before opening
if not input_file.exists():
    raise FileNotFoundError(f"Input file not found: {input_file}")

# Create a directory (including parents) without error if it already exists
output_dir = base_dir / 'output' / 'reports'
output_dir.mkdir(parents=True, exist_ok=True)

# Iterate over all JSON files in a directory
for json_file in data_dir.glob('*.json'):
    print(json_file.name)    # just the filename: 'records.json'
    print(json_file.stem)    # filename without extension: 'records'
    print(json_file.suffix)  # extension: '.json'
    print(json_file.parent)  # parent directory as a Path

# The / operator builds paths — no os.path.join needed
report_path = output_dir / f"report_{input_file.stem}.txt"
```

The / operator is not division here — Path overrides it to mean path joining. This reads naturally and eliminates the quoting and separator issues that come with string-based path building. One more useful method: path.read_text(encoding='utf-8') is a shortcut for the open/read/close pattern when you just want the file's contents as a string.
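read_text and its counterpart write_text make one-shot I/O very compact; a short sketch using a temp directory so it runs anywhere:

```python
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())
note = tmp / 'note.txt'

# write_text opens, writes, and closes in one call
note.write_text('hello from pathlib\n', encoding='utf-8')

# read_text is the matching one-shot read
contents = note.read_text(encoding='utf-8')
```

These are fine for small files; for anything large, fall back to the line-by-line iteration covered next.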
Reading Large Files Without Blowing Up Memory
When a file is small — say, under a few megabytes — f.read() or f.readlines() is fine. When it's a 500 MB server log or a multi-gigabyte data export, loading the whole thing into memory is a fast path to a MemoryError or a process kill from the OS. The fix is line-by-line iteration:
```python
from collections import Counter

def count_error_levels(log_path: str) -> dict:
    """
    Process a large log file line by line.
    Memory usage stays roughly constant regardless of file size.
    """
    counts = Counter()
    with open(log_path, encoding='utf-8') as f:
        for line in f:
            # Each line is fetched from disk as needed — not loaded all at once
            if ' ERROR ' in line:
                counts['error'] += 1
            elif ' WARN ' in line:
                counts['warning'] += 1
            elif ' INFO ' in line:
                counts['info'] += 1
    return dict(counts)

results = count_error_levels('/var/log/app/server.log')
print(f"Errors: {results.get('error', 0)}, Warnings: {results.get('warning', 0)}")
```

The for line in f: pattern works because Python's file object implements the iterator protocol — it fetches lines from disk one at a time using an internal buffer, so memory usage is essentially constant regardless of file size. For truly massive files (tens of gigabytes) where even line-by-line iteration isn't fast enough, mmap lets you memory-map the file and search it with regular expressions without reading it at all — but for most use cases, the line iterator is all you need.
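For completeness, here is what the mmap approach can look like. This is a sketch with a small generated file standing in for the huge one (the file name and log format are illustrative):

```python
import mmap
import os
import re
import tempfile

# Generate a stand-in log file: every 100th line is an ERROR
path = os.path.join(tempfile.mkdtemp(), 'big.log')
with open(path, 'w', encoding='utf-8') as f:
    for i in range(1000):
        level = 'ERROR' if i % 100 == 0 else 'INFO'
        f.write(f"2024-01-01T00:00:00Z {level} event {i}\n")

with open(path, 'rb') as f:
    # Map the file into memory; the OS pages bytes in on demand
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Bytes regexes work directly on the mapped region
        error_count = len(re.findall(rb' ERROR ', mm))

print(error_count)  # 10 with the data generated above
```

The regex scans the mapped bytes without ever materialising the file as a Python string, which is the whole point for multi-gigabyte inputs.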
Reading and Writing JSON and CSV
Two formats come up constantly in real Python work, and both have dedicated stdlib modules that handle quoting, escaping, and structure correctly — don't parse them with string splits.
```python
import json
import csv

# --- JSON ---

# Reading
with open('config/settings.json', encoding='utf-8') as f:
    settings = json.load(f)  # parsed directly from the file object

# Writing (indent=2 gives readable output)
with open('output/results.json', 'w', encoding='utf-8') as f:
    json.dump(results, f, indent=2, ensure_ascii=False)

# --- CSV ---

# Reading
with open('data/customers.csv', encoding='utf-8-sig', newline='') as f:
    reader = csv.DictReader(f)  # each row is a dict keyed by header
    for row in reader:
        process_customer(row['email'], row['plan'])

# Writing
fieldnames = ['id', 'email', 'plan', 'created_at']
with open('output/export.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for record in records:
        writer.writerow(record)
```

A few things worth noting: pass newline='' when opening CSV files — the csv module handles its own line endings, and letting Python's universal newline mode interfere causes duplicate blank rows on Windows. For JSON, ensure_ascii=False lets non-ASCII characters (accented letters, CJK characters, etc.) write as-is rather than being escaped to \uXXXX sequences — much more readable output. If you're working with JSON or CSV data and want to inspect or transform it visually, the JSON Formatter and CSV Formatter on this site are good complements to the code approach.
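The ensure_ascii difference is easiest to see with dumps on data containing non-ASCII characters:

```python
import json

record = {'name': 'Müller', 'city': '東京'}

escaped = json.dumps(record)                      # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)

print(escaped)   # {"name": "M\u00fcller", "city": "\u6771\u4eac"}
print(readable)  # {"name": "Müller", "city": "東京"}
```

Both forms parse back to identical data; ensure_ascii=False only changes what ends up in the file, so choose it whenever humans will read the output.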
Error Handling — The Four Exceptions You Will See
File operations fail in predictable ways. Handling each case explicitly gives you error messages that are actually useful instead of a generic traceback:
```python
import json
from pathlib import Path

def load_json_config(path: str) -> dict:
    """
    Load a JSON config file with explicit error handling.
    Returns the parsed config or raises with a clear message.
    """
    config_path = Path(path)
    try:
        with open(config_path, encoding='utf-8') as f:
            return json.load(f)
    except FileNotFoundError:
        raise FileNotFoundError(
            f"Config file not found: {config_path.resolve()}\n"
            f"Create it or set CONFIG_PATH to the correct location."
        )
    except PermissionError:
        raise PermissionError(
            f"No read permission on {config_path.resolve()}\n"
            f"Check file ownership and mode (chmod 644 on Linux)."
        )
    except UnicodeDecodeError as e:
        raise ValueError(
            f"Encoding error reading {config_path}: {e}\n"
            f"The file may not be UTF-8; check the source encoding "
            f"(cp1252 is common for files from Windows tools)."
        )
    except json.JSONDecodeError as e:
        raise ValueError(
            f"Invalid JSON in {config_path} at line {e.lineno}, col {e.colno}: {e.msg}"
        )
```

The four exceptions cover almost every real failure mode: the file doesn't exist, you don't have permission, the encoding is wrong, or the content is malformed. Each message tells the next developer (or you at 2 AM) exactly what went wrong and where to look. Catching a bare Exception and printing "something went wrong" is not useful error handling — it just moves the confusion downstream.
For optional files, the opposite contract also works: catching FileNotFoundError and returning None or a default dict is fine. Pick one behaviour per function and document it.
Wrapping Up
The short version of everything above: always use with blocks, always
pass encoding='utf-8', use pathlib.Path for path construction, and
iterate lines instead of reading whole files when size is unknown. These four habits eliminate
the vast majority of file handling bugs before they reach production.
For deeper reading: the Python tutorial section on reading and writing files covers the basics thoroughly. The pathlib documentation is worth bookmarking — it's one of the most useful parts of the stdlib and most Python developers underuse it. The csv module docs and json module docs both have good examples for the edge cases (custom delimiters, streaming JSON, etc.) worth reading if you work with those formats regularly.