Question 3: Log Triage
Scenario
This is a practical file processing problem. You’re an on-call engineer at 2 AM. The alerting system shows elevated error rates. Your task is to quickly parse the logs and identify which error codes are occurring most frequently so you can prioritize your investigation.
Problem Statement
Write a script that:
- Reads all .log files from the question-3-log-triage/logs/ directory
- Extracts all error codes matching the pattern E followed by 4 digits (e.g., E1234, E5678)
- Only counts ERROR level entries (not WARN or INFO)
- Counts occurrences of each error code across all files
- Prints a summary sorted by frequency (highest first)
Log Format
Each log file contains entries in this format (the code may be absent, as in the INFO example below):
[YYYY-MM-DD HH:MM] LEVEL CODE: message text
Example lines:
[2025-01-15 02:34] ERROR E1234: Database connection failed
[2025-01-15 02:34] WARN W0001: Retry attempt 1 of 3
[2025-01-15 02:35] ERROR E5678: Authentication timeout, user=42
[2025-01-15 02:36] INFO: Health check passed
Only count entries with level ERROR. Ignore WARN and INFO entries even if they contain error codes.
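One way to enforce the ERROR-only rule is to put the level into the regular expression itself, so WARN and INFO lines never match. The sketch below is a minimal illustration, not a reference solution: the names `ERROR_LINE` and `parse_error_line` are made up for this example, and it assumes every entry follows the bracketed-timestamp format shown above.

```python
import re

# Assumed line shape: "[YYYY-MM-DD HH:MM] LEVEL CODE: message text".
# Requiring the literal level ERROR means WARN and INFO lines never match,
# even if their message text mentions something that looks like a code.
ERROR_LINE = re.compile(
    r"^\[(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2})\]\s+"
    r"ERROR\s+(?P<code>E\d{4}):\s*(?P<message>.*)$"
)


def parse_error_line(line):
    """Return (time, code, message) for an ERROR entry, otherwise None."""
    match = ERROR_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("time"), match.group("code"), match.group("message")


# The four example lines from the Log Format section.
samples = [
    "[2025-01-15 02:34] ERROR E1234: Database connection failed",
    "[2025-01-15 02:34] WARN W0001: Retry attempt 1 of 3",
    "[2025-01-15 02:35] ERROR E5678: Authentication timeout, user=42",
    "[2025-01-15 02:36] INFO: Health check passed",
]
parsed = [parse_error_line(sample) for sample in samples]
```

With these samples, only the first and third lines produce a tuple; the WARN and INFO lines return None.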
Expected Output Format
Your script should output something immediately useful for someone at 2 AM trying to figure out what’s broken. Example:
================================================================================
LOG TRIAGE REPORT
================================================================================
Scanned: 3 files | Found: 12 errors | Time range: 10:15 - 11:02
TOP ISSUE (5 hits, 42% of errors)
E1234: Payment failed
Last seen: 10:39 in app-server-1.log
Likely cause: Payment processor issues
ALL ERRORS BY FREQUENCY
--------------------------------------------------------------------------------
E1234 ############ 5 hits (42%)
E5678 ######## 4 hits (33%)
E9012 ### 2 hits (17%)
E4040 # 1 hit (8%)
MOST RECENT ERRORS
--------------------------------------------------------------------------------
10:39 E1234 Payment failed user=201 reason=expired_card
10:32 E5678 DB timeout query=update_session duration=31s
ERRORS BY SOURCE
--------------------------------------------------------------------------------
app-server-1.log: 6 errors
app-server-2.log: 4 errors
api-gateway.log: 2 errors
================================================================================
The exact format can vary. Use your creativity and ask AI to help you make it scannable and clear. The sections shown above (TOP ISSUE, ALL ERRORS BY FREQUENCY, MOST RECENT ERRORS, ERRORS BY SOURCE) are recommendations, not requirements.
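If you want the frequency section to look roughly like the example, the rows can be generated from a dictionary of counts. This is only a sketch: `format_frequency_rows` and `max_bar` are hypothetical names, and the bars here scale linearly against the most frequent code, which is an assumption (the sample report does not say how its bars are scaled, so the lengths will not match it exactly).

```python
def format_frequency_rows(counts, max_bar=12):
    """Build one text row per error code, most frequent first."""
    total = sum(counts.values())
    if total == 0:
        return ["(no errors found)"]
    top = max(counts.values())
    rows = []
    # Sort by descending hit count, then by code so ties are stable.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    for code, hits in ordered:
        # Assumed scaling: bar length is proportional to the top code,
        # with at least one '#' so rare codes stay visible.
        bar = "#" * max(1, round(hits / top * max_bar))
        label = "hit" if hits == 1 else "hits"
        percent = round(hits / total * 100)
        rows.append(f"{code} {bar:<{max_bar}} {hits} {label} ({percent}%)")
    return rows


# Counts taken from the example report above.
rows = format_frequency_rows({"E1234": 5, "E5678": 4, "E9012": 2, "E4040": 1})
for row in rows:
    print(row)
```

Padding every bar to the same width keeps the hit counts in one column, which is easier to scan than ragged rows.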
Sample Files
The question-3-log-triage/logs/ directory contains sample log files:
- app-server-1.log - Application server logs
- app-server-2.log - Second application server
- api-gateway.log - API gateway logs
These files simulate a real incident with overlapping error codes across services.
Workshop Workflow
1. Clarifying Questions
Before diving in, consider:
- Should I handle nested directories or just the top level?
- What if a log line has multiple error codes?
- Should I include the source file in my output?
- What if there are no error codes found?
2. Plan Your Approach
Break this into steps:
- Find all .log files in the directory
- Read each file line by line
- Use regex to extract error codes
- Aggregate counts in a dictionary
- Sort and display results
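The steps above could be sketched in Python roughly as follows. Treat this as one possible starting point rather than the expected answer: the function names are invented for illustration, only the top level of the directory is scanned, and only the code in the CODE position of an ERROR line is counted (so extra codes mentioned inside a message are ignored). Those choices are assumptions that answer the clarifying questions one particular way.

```python
import re
from collections import Counter
from pathlib import Path

# Matches "[timestamp] ERROR E1234" at the start of a line and captures the code.
ERROR_CODE = re.compile(r"^\[[^\]]*\]\s+ERROR\s+(E\d{4})\b")


def count_error_codes(log_dir):
    """Count ERROR-level codes across every .log file directly in log_dir."""
    counts = Counter()
    # glob("*.log") looks at the top level only, not nested directories.
    for path in sorted(Path(log_dir).glob("*.log")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                match = ERROR_CODE.match(line)
                if match:
                    counts[match.group(1)] += 1
    return counts


def print_summary(counts):
    """Print codes sorted by frequency, highest first."""
    if not counts:
        print("No ERROR entries found.")
        return
    for code, hits in counts.most_common():
        print(f"{code}  {hits}")
```

Calling `print_summary(count_error_codes("question-3-log-triage/logs"))` from the repository root would print one line per code. `Counter.most_common()` already returns entries in descending order of count, so no separate sort step is needed.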
3. Test Your Solution
Run your script and verify:
- It finds all error codes across all files
- Counts are accurate (you can manually spot-check a few)
- Output is sorted correctly
Deliverables
Create a solution file in question-3-log-triage/ using your preferred language:
- Python: triage.py
- JavaScript: triage.js
- PowerShell: triage.ps1
Run your script and verify the output makes sense. Output verification is the test for this problem. Unit tests are optional.
Extension Challenges
If you finish early:
- Add timestamps - Show when each error code first and last appeared
- Add file source - Show which files contained each error code
- Make it a CLI - Accept the directory path as a command-line argument
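As a rough illustration of how the three extensions might fit together, the sketch below tracks first and last timestamps plus source files per code and takes the directory as an optional argument. The names `collect_details` and `main` are hypothetical, and the default path is simply the sample directory named in this document.

```python
import argparse
import re
from pathlib import Path

# Captures the full timestamp and the code from ERROR lines only.
ERROR_LINE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2})\]\s+ERROR\s+(E\d{4}):")


def collect_details(log_dir):
    """Map each code to its first/last timestamp and the files it appeared in."""
    details = {}
    for path in sorted(Path(log_dir).glob("*.log")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                match = ERROR_LINE.match(line)
                if not match:
                    continue
                stamp, code = match.groups()
                entry = details.setdefault(
                    code, {"first": stamp, "last": stamp, "files": set()}
                )
                # "YYYY-MM-DD HH:MM" strings sort chronologically as plain text.
                entry["first"] = min(entry["first"], stamp)
                entry["last"] = max(entry["last"], stamp)
                entry["files"].add(path.name)
    return details


def main(argv=None):
    parser = argparse.ArgumentParser(description="Summarize ERROR codes in .log files")
    parser.add_argument(
        "log_dir",
        nargs="?",
        default="question-3-log-triage/logs",
        help="directory containing .log files",
    )
    args = parser.parse_args(argv)
    for code, entry in sorted(collect_details(args.log_dir).items()):
        files = ", ".join(sorted(entry["files"]))
        print(f"{code}  first={entry['first']}  last={entry['last']}  files={files}")
```

To run it as a script, call `main()` under an `if __name__ == "__main__":` guard at the bottom of the file; passing `argv` explicitly, as in `main(["some/dir"])`, makes the same function easy to exercise from a test.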
Hints
Only look at hints if you’re stuck!