Writing Tests With the Same Tool That Wrote the Code
If the model generated the transformation, can it generate the tests? That's the natural question after a few months of using ChatGPT and GPT-4 for pipeline scaffolding. The answer is yes — with a workflow caveat that turns out to matter a lot.
The naive approach: ask for the transformation and the tests in the same prompt. The model generates both. The transformation is usually right. The tests are usually shallow. You get a happy-path test, a null check on the most obvious column, and a type assertion that doesn't test anything meaningful. The test suite is green, which tells you nothing.
The workflow that actually produces useful test coverage splits the task into two separate conversations.
Step One: Edge Cases First
Before asking for any code, ask the model what could go wrong with the transformation. Give it the transformation description and the schema, and ask it to enumerate failure modes, edge cases, and boundary conditions.
"This transformation assigns session IDs to events by grouping consecutive events from the same user where the gap between events is less than 30 minutes. The input is sorted by event_ts within each user_id partition. What are the edge cases and failure modes I should test for?"
A good response from GPT-4 surfaces things like:
- Single-event sessions (gap logic doesn't apply — should produce a valid session_id)
- Users with exactly one event in the partition (same as above)
- Events where the gap is exactly 30 minutes (boundary condition — is it inclusive or exclusive?)
- Two events from the same user with null event_ts values
- A user whose first event in the partition falls within 30 minutes of their last event in the previous partition (a session that spans the partition boundary)
- user_id partitions with a very large number of events (skewed users) that might cause memory pressure in the window function
Review this list. Add cases from your knowledge of the actual data that the model doesn't have context for. Remove cases that aren't applicable to your specific implementation. The list you end up with is your test specification.
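One way to keep that specification honest is to write it down as data before any Spark code exists. The sketch below is a minimal pure-Python illustration, not the pipeline's actual transformation: the `sessionize` reference function, the `Case` tuple, and the `<user_id>_<n>` ID format are all hypothetical, timestamps are assumed to be epoch milliseconds, and the choice to drop null timestamps and to treat a gap of exactly 30 minutes as a new session is an assumption read off the "less than 30 minutes" wording in the prompt.

```python
from typing import List, NamedTuple, Optional, Tuple

GAP_MS = 30 * 60 * 1000  # 30 minutes in milliseconds (assumed unit)

Event = Tuple[str, Optional[int]]  # (user_id, event_ts)


def sessionize(events: List[Event], gap_ms: int = GAP_MS) -> List[Tuple[str, int, str]]:
    """Hypothetical reference rule: a gap strictly less than gap_ms
    continues the current session; anything else opens a new one.
    Rows with a null timestamp are dropped (an assumption)."""
    valid = sorted((user, ts) for user, ts in events if ts is not None)
    out = []
    prev_user, prev_ts, n = None, None, 0
    for user, ts in valid:
        if user != prev_user:
            n = 0  # first event for this user opens session 0
        elif ts - prev_ts >= gap_ms:
            n += 1  # gap of exactly gap_ms or more starts a new session
        out.append((user, ts, f"{user}_{n}"))
        prev_user, prev_ts = user, ts
    return out


class Case(NamedTuple):
    name: str
    events: List[Event]
    expected_sessions: int  # number of distinct session IDs expected


# The reviewed edge-case list, written down as data.
SPEC = [
    Case("single event", [("u1", 0)], 1),
    Case("gap just under 30 min", [("u1", 0), ("u1", GAP_MS - 1)], 1),
    Case("gap exactly 30 min", [("u1", 0), ("u1", GAP_MS)], 2),
    Case("gap just over 30 min", [("u1", 0), ("u1", GAP_MS + 1)], 2),
    Case("null timestamps only", [("u1", None), ("u1", None)], 0),
]


def run_spec(spec: List[Case] = SPEC) -> List[str]:
    """Return the names of cases whose session count differs from the spec."""
    failures = []
    for case in spec:
        got = len({sid for _, _, sid in sessionize(case.events)})
        if got != case.expected_sessions:
            failures.append(case.name)
    return failures
```

Writing the cases this way forces the inclusive-or-exclusive decision to be made explicitly, and the same table can later drive the Spark tests.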
Step Two: Write the Tests Against the Spec
Now ask for the tests — against the agreed edge case list, not against the model's notion of what's important.

    from pyspark.sql import Row
    from pyspark.sql.types import LongType, StringType, StructField, StructType
    from sparktestingbase.sqltestcase import SQLTestCase

    from myteam.transforms import assign_session_ids

    GAP_MS = 30 * 60 * 1000  # 30 minutes in ms; event_ts values are in ms


    class TestSessionAssignment(SQLTestCase):

        def test_single_event_gets_valid_session_id(self):
            input_df = self.sqlCtx.createDataFrame([
                Row(user_id="han_solo", event_ts=1000000)
            ])
            result = assign_session_ids(input_df, gap_minutes=30)
            row = result.first()
            self.assertIsNotNone(row["session_id"])
            self.assertTrue(row["session_id"].startswith("han_solo_"))

        def test_events_within_gap_share_session(self):
            input_df = self.sqlCtx.createDataFrame([
                Row(user_id="leia_organa", event_ts=1000000),
                Row(user_id="leia_organa", event_ts=1000000 + GAP_MS - 1),  # just under
            ])
            result = assign_session_ids(input_df, gap_minutes=30)
            session_ids = [r["session_id"] for r in result.collect()]
            self.assertEqual(session_ids[0], session_ids[1])

        def test_events_exceeding_gap_get_different_sessions(self):
            input_df = self.sqlCtx.createDataFrame([
                Row(user_id="leia_organa", event_ts=1000000),
                Row(user_id="leia_organa", event_ts=1000000 + GAP_MS + 1),  # just over
            ])
            result = assign_session_ids(input_df, gap_minutes=30)
            session_ids = [r["session_id"] for r in result.collect()]
            self.assertNotEqual(session_ids[0], session_ids[1])

        def test_null_event_ts_does_not_crash(self):
            # An all-null column cannot be type-inferred, so pass the schema explicitly.
            schema = StructType([
                StructField("user_id", StringType()),
                StructField("event_ts", LongType()),
            ])
            input_df = self.sqlCtx.createDataFrame([("chewbacca", None)], schema)
            # Should not raise; nulls are filtered at ingest, but test the boundary
            result = assign_session_ids(input_df, gap_minutes=30)
            self.assertEqual(result.count(), 0)

Why the Split Matters
When you ask for the transformation and the tests in one shot, the model writes tests that prove the code it just wrote does what it just wrote. That's circular. The tests confirm the implementation, not the specification.
When you derive the test specification independently — from the model's edge case analysis, augmented by your domain knowledge — and then write tests against that specification, the tests can catch errors in the implementation. That's what tests are for.
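The difference is easy to see on a toy example. The sketch below is pure Python with a deliberately planted off-by-one bug; `count_sessions_buggy` is a hypothetical stand-in for the real transformation, and the expected value of 2 in the spec-derived test assumes the "less than 30 minutes" rule from the prompt, so a gap of exactly 30 minutes starts a new session.

```python
GAP_MS = 30 * 60 * 1000  # 30 minutes in milliseconds


def count_sessions_buggy(timestamps, gap_ms=GAP_MS):
    """Hypothetical implementation for one user's sorted timestamps.

    Planted bug: uses <= so a gap of exactly gap_ms stays in the same
    session, although the spec says only gaps less than gap_ms do.
    """
    if not timestamps:
        return 0
    sessions = 1
    for prev, cur in zip(timestamps, timestamps[1:]):
        if not (cur - prev <= gap_ms):
            sessions += 1
    return sessions


def circular_test():
    """Expected value is copied from the implementation's own behaviour."""
    boundary = [0, GAP_MS]
    expected = count_sessions_buggy(boundary)  # "whatever the code does"
    return count_sessions_buggy(boundary) == expected


def spec_test():
    """Expected value comes from the written rule: exactly 30 min splits."""
    boundary = [0, GAP_MS]
    return count_sessions_buggy(boundary) == 2


print("circular test passes:", circular_test())
print("spec test passes:", spec_test())
```

On the same input, the circular test is green and the spec-derived test is red. Only the second one carries information about the bug.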
Separate the thinking from the writing. The model is good at both. They're different tasks and they produce better results as separate conversations.