Python's built-in file handling is one of the language's genuine strengths — no imports needed
for basic read/write operations, and the API is clean enough to learn in an afternoon. But there's a real
gap between the tutorial version and what you'd actually ship. The tutorial version opens a file, reads it,
and closes it. The production version deals with encoding mismatches that corrupt data silently, paths that
work on macOS but blow up on Windows, and log files that quietly eat all your memory if you call read() on a 2 GB file. This article covers the patterns that hold up — not just the happy path.
The with Statement — Always Use It
Every file handling example in Python should use a context manager — the with block that ensures the file is closed even if an exception is raised mid-read. A context manager is an object that defines what happens on entry and exit from a with block; for files, exit means close() gets called automatically. Here's why it matters in practice:
```python
# ❌ Manual close — works until it doesn't
f = open('app.log')
data = f.read()  # if this raises an exception...
f.close()        # ...this line never runs. File handle leaks.

# ✅ Context manager — close() is guaranteed
with open('app.log') as f:
    data = f.read()
# file is closed here, no matter what happened inside the block
```

On long-running servers this isn't academic — leaking file handles eventually causes OSError: [Errno 24] Too many open files. The with statement costs nothing and prevents that class of bug entirely. Use it everywhere.
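The with protocol isn't limited to files; any object that defines __enter__ and __exit__ works. A minimal sketch of a hypothetical Timer context manager (the class and its behaviour are illustrative, not from this article) shows the two hooks:

```python
import time

class Timer:
    """Context manager that records how long the with block took."""

    def __enter__(self):
        self.start = time.perf_counter()
        return self  # this is what gets bound after 'as'

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs even if the block raised; returning False propagates the exception
        self.elapsed = time.perf_counter() - self.start
        return False

with Timer() as t:
    sum(range(100_000))

print(f"block took {t.elapsed:.4f}s")
```

For files, __exit__ is where close() happens, which is why the guarantee holds even on exceptions.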
Reading Files — Four Ways, One Right Tool Each Time
Python gives you several methods on a file object, and picking the right one matters more than most tutorials admit:
- f.read() — reads the entire file into a single string. Fine for small config files, dangerous for large ones.
- f.readline() — reads one line at a time, advancing the internal pointer. Useful when you need manual control over iteration.
- f.readlines() — reads all lines into a list. Convenient, but still loads the whole file into memory.
- for line in f: — the iterator protocol. Reads one line at a time without loading the full file. This is the one to reach for by default.
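A quick way to see the difference is to run all four against the same small file (a throwaway temp file here, so the sketch is self-contained):

```python
import os
import tempfile

# Create a small throwaway file to read back four ways
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('alpha\nbeta\ngamma\n')

with open(path, encoding='utf-8') as f:
    whole = f.read()           # one string: 'alpha\nbeta\ngamma\n'

with open(path, encoding='utf-8') as f:
    first = f.readline()       # 'alpha\n' (pointer now sits at line 2)

with open(path, encoding='utf-8') as f:
    all_lines = f.readlines()  # ['alpha\n', 'beta\n', 'gamma\n']

with open(path, encoding='utf-8') as f:
    stripped = [line.rstrip('\n') for line in f]  # lazy, line by line
```

Note that the first three hold everything they return in memory at once; only the iterator form stays constant-size.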
Here's a realistic example: reading a .env-style config file and turning it
into a dictionary. This is the kind of thing you actually write, not a contrived "read hello.txt" demo:
```python
def load_config(path: str) -> dict:
    """Read a key=value config file, ignoring comments and blank lines."""
    config = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            if '=' not in line:
                continue
            key, _, value = line.partition('=')
            config[key.strip()] = value.strip()
    return config

# Usage
settings = load_config('config/app.conf')
db_host = settings.get('DB_HOST', 'localhost').strip()
```

A good habit: when reading lines, remember that every line except the last includes a trailing \n (and on Windows, \r\n). Call line.strip() to remove both. If you only want to strip the newline and not leading whitespace, use line.rstrip('\n') instead.
Writing and Appending — Know Which Mode Destroys Data
The second argument to open() is the mode. Two modes trip people up repeatedly:
- 'w' — write mode. Opens the file for writing. If the file already exists, it is truncated to zero bytes immediately — before you write a single character. This is silent data destruction if you open the wrong path.
- 'a' — append mode. Opens the file and moves the write pointer to the end. Existing content is never touched. New writes go after whatever was already there.
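The difference is easy to demonstrate with a throwaway file (temp path here so the sketch runs anywhere):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

with open(path, 'w', encoding='utf-8') as f:
    f.write('first run\n')

# 'a' preserves existing content and writes after it
with open(path, 'a', encoding='utf-8') as f:
    f.write('second run\n')

# 'w' truncates the moment the file is opened
with open(path, 'w', encoding='utf-8') as f:
    f.write('only this survives\n')

with open(path, encoding='utf-8') as f:
    content = f.read()
# content is now just 'only this survives\n'
```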
A good use case for append mode is writing a structured log file with timestamps. Here's a pattern that's useful in scripts and small services alike:
```python
import datetime
from pathlib import Path

LOG_FILE = Path('logs/pipeline.log')

def log_event(level: str, message: str) -> None:
    # Timezone-aware replacement for the deprecated datetime.utcnow()
    now = datetime.datetime.now(datetime.timezone.utc)
    timestamp = now.isoformat().replace('+00:00', 'Z')
    line = f"[{timestamp}] {level.upper()}: {message}\n"
    LOG_FILE.parent.mkdir(parents=True, exist_ok=True)  # make sure logs/ exists
    with open(LOG_FILE, 'a', encoding='utf-8') as f:
        f.write(line)

log_event('info', 'Pipeline started')
log_event('warning', 'Retrying connection to database')
log_event('error', 'Failed to parse row 4821 — skipping')
```

open(path, 'w') creates the file if it doesn't exist — which is convenient — but it also silently destroys the file if it does exist. A mistyped path can wipe a production file without any error message. If you're not sure the file should be overwritten, check first with Path(path).exists() or use 'x' mode, which raises FileExistsError instead of overwriting.
Encoding — The Bug That Bites Everyone Eventually
This is the single most common source of silent data corruption in Python file handling.
Python 3's default encoding when you call open() without specifying one is determined
by locale.getpreferredencoding() — which on Windows is typically cp1252,
and on Linux/macOS is usually UTF-8. That means code that works perfectly on your Mac
can silently mangle or crash on a Windows server when the file contains any character outside ASCII.
The fix is one extra argument:
```python
# ❌ Platform-dependent — works on Linux, corrupts on Windows
with open('customers.csv') as f:
    data = f.read()

# ✅ Explicit UTF-8 — same behavior on every platform
with open('customers.csv', encoding='utf-8') as f:
    data = f.read()

# For files exported from Excel on Windows — may have a BOM (byte order mark)
# utf-8-sig strips the BOM automatically on read
with open('export.csv', encoding='utf-8-sig') as f:
    data = f.read()
```

The BOM issue is particularly common with CSV files exported from Microsoft Excel — the file starts with a hidden \ufeff character that appears as ï»¿ if read with the wrong encoding, or causes the first column header to look like ï»¿name instead of name. Using encoding='utf-8-sig' handles it transparently. See the Python codecs documentation for the full list of encoding names.
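You can reproduce the failure mode in a few lines: write UTF-8 bytes, then decode them with cp1252. Note that the mismatch doesn't raise; it silently produces mojibake (a throwaway temp file keeps the sketch self-contained):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'names.txt')

with open(path, 'w', encoding='utf-8') as f:
    f.write('café')      # 'é' becomes two bytes in UTF-8: 0xC3 0xA9

# Decoding those bytes as cp1252 "succeeds" but corrupts the text
with open(path, encoding='cp1252') as f:
    mangled = f.read()   # 'cafÃ©' (no exception raised!)

with open(path, encoding='utf-8') as f:
    correct = f.read()   # 'café'
```

The silent success is exactly why this bug survives testing on the developer's machine and only surfaces on a server with a different locale.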
The takeaway: add encoding='utf-8' (or 'utf-8-sig' for Excel exports) to every open() call. Make it a habit — it costs nothing and eliminates an entire category of environment-specific bugs.
Working with Paths — Use pathlib
The old way to build file paths in Python was string concatenation or os.path.join(). The modern way is pathlib.Path, available since Python 3.4 and fully mature since 3.6. It handles path separators correctly on Windows and Unix without you thinking about it, and it replaces a handful of os.path calls with readable attribute access.
```python
from pathlib import Path

# Build a path relative to the current script — works on Windows and Unix
base_dir = Path(__file__).parent
data_dir = base_dir / 'data'
input_file = data_dir / 'records.csv'

# Check existence before opening
if not input_file.exists():
    raise FileNotFoundError(f"Input file not found: {input_file}")

# Create a directory (including parents) without error if it already exists
output_dir = base_dir / 'output' / 'reports'
output_dir.mkdir(parents=True, exist_ok=True)

# Iterate over all JSON files in a directory
for json_file in data_dir.glob('*.json'):
    print(json_file.name)    # just the filename: 'records.json'
    print(json_file.stem)    # filename without extension: 'records'
    print(json_file.suffix)  # extension: '.json'
    print(json_file.parent)  # parent directory as a Path

# The / operator builds paths — no os.path.join needed
report_path = output_dir / f"report_{input_file.stem}.txt"
```

The / operator is not division here — Path overrides it to mean path joining. This reads naturally and eliminates the quoting and separator issues that come with string-based path building. One more useful method: path.read_text(encoding='utf-8') is a shortcut for the open/read/close pattern when you just want the file's contents as a string.
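read_text and its counterpart write_text make one-shot I/O very compact; a short sketch using a temp directory so it runs anywhere:

```python
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())
note = tmp / 'note.txt'

# write_text opens, writes, and closes in one call
note.write_text('hello from pathlib\n', encoding='utf-8')

# read_text is the matching one-shot read
contents = note.read_text(encoding='utf-8')
```

These are fine for small files; for anything large, fall back to the line-by-line iteration covered next.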
Reading Large Files Without Blowing Up Memory
When a file is small — say, under a few megabytes — f.read() or f.readlines() is fine. When it's a 500 MB server log or a multi-gigabyte data export, loading the whole thing into memory is a fast path to a MemoryError or a process kill from the OS. The fix is line-by-line iteration:
```python
from collections import Counter

def count_error_levels(log_path: str) -> dict:
    """
    Process a large log file line by line.
    Memory usage stays roughly constant regardless of file size.
    """
    counts = Counter()
    with open(log_path, encoding='utf-8') as f:
        for line in f:
            # Each line is fetched from disk as needed — not loaded all at once
            if ' ERROR ' in line:
                counts['error'] += 1
            elif ' WARN ' in line:
                counts['warning'] += 1
            elif ' INFO ' in line:
                counts['info'] += 1
    return dict(counts)

results = count_error_levels('/var/log/app/server.log')
print(f"Errors: {results.get('error', 0)}, Warnings: {results.get('warning', 0)}")
```

The for line in f: pattern works because Python's file object implements the iterator protocol — it fetches lines from disk one at a time using an internal buffer, so memory usage is essentially constant regardless of file size. For truly massive files (tens of gigabytes) where even line-by-line iteration isn't fast enough, mmap lets you memory-map the file and search it with regular expressions without reading it at all — but for most use cases, the line iterator is all you need.
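For completeness, here is what the mmap approach can look like. This is a sketch with a small generated file standing in for the huge one (the file name and log format are illustrative):

```python
import mmap
import os
import re
import tempfile

# Generate a stand-in log file: every 100th line is an ERROR
path = os.path.join(tempfile.mkdtemp(), 'big.log')
with open(path, 'w', encoding='utf-8') as f:
    for i in range(1000):
        level = 'ERROR' if i % 100 == 0 else 'INFO'
        f.write(f"2024-01-01T00:00:00Z {level} event {i}\n")

with open(path, 'rb') as f:
    # Map the file into memory; the OS pages bytes in on demand
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Bytes regexes work directly on the mapped region
        error_count = len(re.findall(rb' ERROR ', mm))

print(error_count)  # 10 with the data generated above
```

The regex scans the mapped bytes without ever materialising the file as a Python string, which is the whole point for multi-gigabyte inputs.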
Reading and Writing JSON and CSV
Two formats come up constantly in real Python work, and both have dedicated stdlib modules that handle quoting, escaping, and structure correctly — don't parse them with string splits.
```python
import json
import csv

# --- JSON ---

# Reading
with open('config/settings.json', encoding='utf-8') as f:
    settings = json.load(f)  # parsed directly from the file object

# Writing (indent=2 gives readable output)
with open('output/results.json', 'w', encoding='utf-8') as f:
    json.dump(results, f, indent=2, ensure_ascii=False)

# --- CSV ---

# Reading
with open('data/customers.csv', encoding='utf-8-sig', newline='') as f:
    reader = csv.DictReader(f)  # each row is a dict keyed by header
    for row in reader:
        process_customer(row['email'], row['plan'])

# Writing
fieldnames = ['id', 'email', 'plan', 'created_at']
with open('output/export.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for record in records:
        writer.writerow(record)
```

A few things worth noting: pass newline='' when opening CSV files — the csv module handles its own line endings, and letting Python's universal newline mode interfere causes duplicate blank rows on Windows. For JSON, ensure_ascii=False lets non-ASCII characters (accented letters, CJK characters, etc.) write as-is rather than being escaped to \uXXXX sequences — much more readable output. If you're working with JSON or CSV data and want to inspect or transform it visually, the JSON Formatter and CSV Formatter on this site are good complements to the code approach.
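The ensure_ascii difference is easiest to see with dumps on data containing non-ASCII characters:

```python
import json

record = {'name': 'Müller', 'city': '東京'}

escaped = json.dumps(record)                      # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)

print(escaped)   # {"name": "M\u00fcller", "city": "\u6771\u4eac"}
print(readable)  # {"name": "Müller", "city": "東京"}
```

Both forms parse back to identical data; ensure_ascii=False only changes what ends up in the file, so choose it whenever humans will read the output.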
Error Handling — The Four Exceptions You Will See
File operations fail in predictable ways. Handling each case explicitly gives you error messages that are actually useful instead of a generic traceback:
```python
import json
from pathlib import Path

def load_json_config(path: str) -> dict:
    """
    Load a JSON config file with explicit error handling.
    Returns the parsed config or raises with a clear message.
    """
    config_path = Path(path)
    try:
        with open(config_path, encoding='utf-8') as f:
            return json.load(f)
    except FileNotFoundError:
        raise FileNotFoundError(
            f"Config file not found: {config_path.resolve()}\n"
            f"Create it or set CONFIG_PATH to the correct location."
        )
    except PermissionError:
        raise PermissionError(
            f"No read permission on {config_path.resolve()}\n"
            f"Check file ownership and mode (chmod 644 on Linux)."
        )
    except UnicodeDecodeError as e:
        raise ValueError(
            f"Encoding error reading {config_path}: {e}\n"
            f"The file may not be UTF-8; check the source encoding "
            f"(cp1252 is common for files from Windows tools)."
        )
    except json.JSONDecodeError as e:
        raise ValueError(
            f"Invalid JSON in {config_path} at line {e.lineno}, col {e.colno}: {e.msg}"
        )
```

The four exceptions cover almost every real failure mode: the file doesn't exist, you don't have permission, the encoding is wrong, or the content is malformed. Each message tells the next developer (or you at 2 AM) exactly what went wrong and where to look. Catching a bare Exception and printing "something went wrong" is not useful error handling — it just moves the confusion downstream.
For optional files, the opposite contract also works: catching FileNotFoundError and returning None or a default dict is fine. Pick one behaviour per function and document it.
Wrapping Up
The short version of everything above: always use with blocks, always
pass encoding='utf-8', use pathlib.Path for path construction, and
iterate lines instead of reading whole files when size is unknown. These four habits eliminate
the vast majority of file handling bugs before they reach production.
For deeper reading: the Python tutorial section on reading and writing files covers the basics thoroughly. The pathlib documentation is worth bookmarking — it's one of the most useful parts of the stdlib and most Python developers underuse it. The csv module docs and json module docs both have good examples for the edge cases (custom delimiters, streaming JSON, etc.) worth reading if you work with those formats regularly.