What I've been quietly building: a tour of SC-Datamining

A tour of the Python toolkit I've been quietly maintaining for the last few months: what it does, who it's actually for, why I had to rebuild the unpacking layer when CIG shipped DataCore V6, and the long list of things I still want to add.

Star Citizen ships a new build to the public test server every couple of months. By the time most of the org has logged in for the morning, there's a hundred-odd gigabytes of fresh game data sitting on my SSD waiting to be torn apart.

This is the post about the thing that does the tearing. Is that a good word? Maybe. Fuck knows. Let's go.

I started writing it a few months ago because I wanted to know what changed between patches and build my own tools with the resulting data. CIG's own patch notes are good for highlights ("New ship added: the Anvil Whatever, with a 4SCU cargo hold") and useless for everything else. They never tell you that a weapon's per-pellet damage went from 28 to 22, or that a mining laser's optimal window shifted six metres, or that a ship now has a 12% energy resistance on its rear hull instead of 8%. You only find that out by extracting the previous patch's data, extracting the new patch's data, and diffing them by hand in a spreadsheet.

PAAAAAAAAAAAAAAAIN

What the thing is (I think it's a cool thing)

SC-Datamining is a Python toolkit (Python 3.14, src-layout under scdm, uv-managed, you can stop reading the boilerplate now) that automates the whole flow. You point it at your Star Citizen install, give it two version numbers, and a few minutes later you have:

  • Per-version JSON and CSV dumps of every weapon, ship, missile, mining laser, mining location, refining process, faction, mission type, blueprint, shop inventory, FPS armour piece, and reputation standing the game knows about.
  • A 23-category structured diff between the two versions, with added / removed / changed entries for each.
  • Google-Sheets-ready CSVs for anyone who wants to make a copy and start filtering.
  • Two large xlsx workbooks (a datamine summary and a blueprint reference) with formulas, filtering, conditional formatting, and per-category sheets, dark-themed because I have to look at them.
  • The compiled JS modules that power ten standalone tools on the Freehold Stellar Works website: a penetration matrix, a fabrication simulator, a mining-location planner, a mining-quality optimiser, a reputation planner, a shop browser, an armour library, and a few others.

It is, depending on which patch I'm in the middle of grinding through, somewhere between a pipeline, a one-person ETL platform, and a coping mechanism.
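For the curious, the added / removed / changed shape of a single diff category can be sketched in a few lines. This is a minimal illustration, not the real scdm code: `diff_category`, the record IDs, and the field names are all made up for the example, and it assumes each category is already loaded as a flat mapping of record ID to fields.

```python
from typing import Any

Record = dict[str, Any]


def diff_category(old: dict[str, Record], new: dict[str, Record]) -> dict[str, Any]:
    """Compare two {record_id: fields} mappings for one category."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed: dict[str, dict[str, tuple[Any, Any]]] = {}
    for key in sorted(old.keys() & new.keys()):
        before, after = old[key], new[key]
        # Keep only the fields whose value differs, as (old, new) pairs.
        fields = {
            name: (before.get(name), after.get(name))
            for name in sorted(before.keys() | after.keys())
            if before.get(name) != after.get(name)
        }
        if fields:
            changed[key] = fields
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical data for two patches of one category.
old_weapons = {
    "shotgun_a": {"pellet_damage": 28, "pellets": 8},
    "laser_b": {"damage": 40},
}
new_weapons = {
    "shotgun_a": {"pellet_damage": 22, "pellets": 8},
    "cannon_c": {"damage": 95},
}
result = diff_category(old_weapons, new_weapons)
```

Here `result["changed"]` carries the 28 to 22 pellet damage change as an (old, new) pair, which is exactly the kind of thing the patch notes never mention.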

Who it's actually for

In rough order:

  • The org. The Exiles uses the outputs to brief members on what's changed in a patch before it goes live, and the FSW tools to plan loadouts and mining runs. Some of the spreadsheet work I do never makes it to the public site because it's commercial-sensitive (mining yields, reputation grind paths, the better trade routes), but the rest does.
  • The wider Star Citizen datamining community. There are maybe a dozen of us who do this seriously, and we mostly know each other's handles. The bigger downstream audience is the second-order one: YouTubers (like you, Rapskilian) and wiki contributors who pull from the published spreadsheets and tools without necessarily knowing or caring how they're built.
  • Future me. By far the largest single use case is "I need to know whether the weapon X numbers are different between 4.7 and 4.8 and I need to know in the next ten minutes." Having the diff produced automatically saves a real amount of cognitive load on patch days.

If you're none of those people, the technical content below might still be interesting if you've ever wondered how to extract structured data out of a game that wasn't designed to give it up.

Or you're not and you'll be done reading here. That's cool. Thanks for stopping by.

Jerk.

The unpacking problem

Star Citizen ships its game data in a single archive called Data.p4k. It's somewhere between 80 and 100 gigabytes depending on the patch. Inside it are tens of thousands of XML records describing every entity in the game, plus a localisation table mapping internal identifiers to in-game display names. The localisation table for 4.8 has 88,000+ entries. You don't realise how many things in a space sim need a name until you try to enumerate them.

The XMLs use a binary schema format CIG calls DataCore. To turn the on-disk bytes into something parseable, you need a decoder that knows the current DataCore version.

For months there were two community tools that did this: unp4k (unzipped the archive) and unforge (decoded the DataCore records inside it). Both written in C#, both maintained by community devs in their spare time. They worked. I wrapped them, and I treated the wrapping as a solved problem.

Then... CIG shipped DataCore V6. That's cool, I guess. Thanks for that, Chris Roberts.

unforge v3 reads V5 structures. When it hit V6 bytes, it produced output that looked plausibly like data but was internally garbage: half-empty rows, fields that should have been integers showing up as nonsense strings, ship armour values that didn't match anything I could verify in-game. No clean error. Just wrong numbers, downstream of which all of my comparison logic was now telling me confidently incorrect things.
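The general defence against this failure mode is a sanity check that turns plausible-looking garbage into a loud error before anything downstream touches it. A minimal sketch, assuming decoded records arrive as plain dicts; the field names, types, and ranges in `EXPECTED` are invented for illustration and are not real DataCore fields.

```python
from typing import Any

# Hypothetical expectations: field name -> (type, minimum, maximum).
EXPECTED: dict[str, tuple[type, float, float]] = {
    "armour_hp": (int, 0, 10_000_000),
    "energy_resistance": (float, 0.0, 1.0),
}


class CorruptRecordError(ValueError):
    """Raised when a decoded record fails a basic sanity check."""


def validate_record(record_id: str, record: dict[str, Any]) -> None:
    for field, (kind, low, high) in EXPECTED.items():
        value = record.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, kind):
            raise CorruptRecordError(
                f"{record_id}: {field}={value!r} is not {kind.__name__}"
            )
        if not low <= value <= high:
            raise CorruptRecordError(
                f"{record_id}: {field}={value!r} outside [{low}, {high}]"
            )
```

A check like this would not catch every bad decode, but a nonsense string where an integer belongs fails on the first record instead of three stages later.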

The maintainer started a v4 rewrite for V6. It wasn't shipping fast enough for me to keep up with the PTU cycle, and I'd already burned a weekend chasing what turned out to be silently corrupt input.

I had three real options. scdatatools is a mature Python project with different ergonomics and a bunch of features I didn't need. unp4k_rs is a Rust port of the original chain, in progress but not yet handling V6 either. And then there's StarBreaker, a single Rust binary by diogotr7, which handled both p4k extraction and DataCore V6 decoding in one step and emitted records as JSON instead of XML.

I picked StarBreaker. PTU 4.8 unpack went from about four and a half minutes to about eighty seconds. The legacy unforge path is still wired up behind a --decoder unforge flag in case StarBreaker ever regresses on a specific build, but it's been the default for several patches and hasn't given me a reason to flip back, so that's also cool.
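A fallback switch like that is cheap to wire up with argparse. The sketch below is an assumption about the shape, not the real CLI: only the `--decoder unforge` flag comes from the description above, while the program name `scdm-unpack` and the `starbreaker` choice string are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # "scdm-unpack" is a made-up program name for this example.
    parser = argparse.ArgumentParser(prog="scdm-unpack")
    parser.add_argument(
        "--decoder",
        choices=["starbreaker", "unforge"],
        default="starbreaker",  # the new path is the default
        help="which DataCore decoder to run",
    )
    return parser


# Explicitly flipping back to the legacy path:
args = build_parser().parse_args(["--decoder", "unforge"])
```

Restricting the flag with `choices` means a typo fails at argument parsing rather than halfway through an unpack.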

The hidden complexity

The unpacking is the visible work. Most of the pipeline is the invisible work that comes after.

Take... name resolution, for example. Every record in the DataCore identifies things by GUID or internal codename. A ship is AEGS_Avenger_Titan. A weapon is klwe_attritionalpha_3. A star system is Pyro_System. Useful for a developer, useless for anything you'd want to publish. Every script in the pipeline has to resolve these to in-game display names, which means cross-referencing the localisation table, which means knowing where in the records tree the name keys live for that record type, which is different for ships and weapons and missions and factions and FPS armour and mineable elements.
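A rough sketch of what that lookup involves, under loud assumptions: the per-type paths in `NAME_KEY_PATHS`, the record layout, and the localisation key format are all hypothetical stand-ins, not the real DataCore structure.

```python
from typing import Any

# Hypothetical: where the localisation key lives for each record type.
NAME_KEY_PATHS: dict[str, tuple[str, ...]] = {
    "ship": ("vehicle", "name_key"),
    "weapon": ("attachable", "localization", "name_key"),
}


def dig(record: dict[str, Any], path: tuple[str, ...]) -> Any:
    """Walk nested dicts along path; return None if any step is missing."""
    node: Any = record
    for step in path:
        if not isinstance(node, dict) or step not in node:
            return None
        node = node[step]
    return node


def resolve_name(
    record_type: str,
    codename: str,
    record: dict[str, Any],
    localisation: dict[str, str],
) -> str:
    key = dig(record, NAME_KEY_PATHS[record_type])
    if isinstance(key, str):
        # Assumed convention: records reference keys with a leading "@".
        return localisation.get(key.removeprefix("@"), codename)
    return codename  # fall back to the internal codename


# Made-up localisation entry and record for illustration.
localisation = {"vehicle_NameAEGS_Avenger_Titan": "Avenger Titan"}
ship = {"vehicle": {"name_key": "@vehicle_NameAEGS_Avenger_Titan"}}
display = resolve_name("ship", "AEGS_Avenger_Titan", ship, localisation)
```

Falling back to the codename keeps the output usable when a key is missing, at the cost of an ugly row that someone has to notice.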

I have a function called fix_pyro_names whose entire job is to replace internal CIG codenames for the Pyro system planets ("Pyro II", "Pyro III", "Pyro VI") with their in-game display names ("Monox", "Bloom", "Terminus"). It exists because CIG keeps the display names somewhere else, in another file entirely, so tooling that works off the codenames alone.. kinda.. has a problem.

Tiny piece of code. Saves about thirty minutes of confusion every patch. It's the kind of thing that doesn't make it into the README and that you only write after the third time you've squinted at a row labelled "Pyro VI" trying to work out which planet that actually is now.
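To give a sense of scale, a function like fix_pyro_names can be sketched as below. This is an illustration rather than the real implementation; it assumes the three pairs listed above match up in order, and it uses word boundaries so that "Pyro II" doesn't clobber the front of "Pyro III".

```python
import re

# Assumed mapping, taken in order from the pairs listed above.
PYRO_NAMES = {
    "Pyro II": "Monox",
    "Pyro III": "Bloom",
    "Pyro VI": "Terminus",
}

# Longest codenames first, with \b on both sides so partial numerals don't match.
_PYRO_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(name) for name in sorted(PYRO_NAMES, key=len, reverse=True))
    + r")\b"
)


def fix_pyro_names(text: str) -> str:
    """Replace Pyro planet codenames with their display names."""
    return _PYRO_PATTERN.sub(lambda match: PYRO_NAMES[match.group(1)], text)
```

Codenames that aren't in the mapping, such as "Pyro IV", pass through untouched.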

Or... y'know, take the assembly stage. The FSW tools site is vanilla JS. No bundler, no npm, no build step beyond a small node script that ZIPs the lot. Each tool ships as an immediately-invoked function expression with its data baked in as a JS object literal next to a _logic.js file that handles the UI. Updating a tool means regenerating the JS module from fresh extracted data, then rebuilding the website to repackage. The pipeline does all ten of these in sequence, in dependency order, with the right localisation context loaded for each. None of it is glamorous. All of it is necessary.. for.. better or for worse, I guess. RIP your browser tab memory. Sorry.
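The "data baked in as a JS object literal" step is roughly this on the Python side. A minimal sketch with assumptions flagged: `render_tool_module` and the global name `FSW_SHOP_DATA` are hypothetical, and attaching the data to `window` is one plausible way for a companion _logic.js file to reach it, not necessarily how the FSW tools do it.

```python
import json
from typing import Any


def render_tool_module(global_name: str, data: Any) -> str:
    """Wrap extracted data in an IIFE that exposes it as one global."""
    # json.dumps escapes non-ASCII by default, so the payload is also a
    # valid JavaScript literal.
    payload = json.dumps(data, indent=2, sort_keys=True)
    return (
        "(function () {\n"
        '  "use strict";\n'
        f"  window.{global_name} = {payload};\n"
        "})();\n"
    )


# Hypothetical shop data for illustration.
shop_data = {"shops": [{"name": "Example Shop", "items": 3}]}
module_source = render_tool_module("FSW_SHOP_DATA", shop_data)
```

Sorting keys keeps the generated file stable between runs, so a version-control diff of the module only shows real data changes.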

Or the penetration matrix, which is one of the FSW tools and probably the heaviest single output of the whole pipeline. It's a cross product of every ship-mountable weapon against every ship hull, computed with each weapon's projectile damage profile applied against each hull's per-section armour resistances, deflection thresholds, and section HP. Roughly thirty-two thousand rows. The CSV is large enough that Excel struggles with it; the web tool handles it because the data is pre-aggregated into a fast lookup, but generating the matrix takes a minute or so and getting the underlying maths right took several patches of iteration. There's still an edge case with multi-pellet weapons against angled armour that I don't think I've fixed.
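The cross-product shape is easy to sketch even if the real maths isn't. Everything below is a simplification under stated assumptions: the weapon and hull data are invented, damage is reduced by a flat `(1 - resistance)` per damage type, and deflection thresholds, pellets, and armour angle are ignored entirely.

```python
import math
from itertools import product
from typing import Any

# Invented sample data: damage per shot by type, and per-section armour.
weapons = {
    "laser_s3": {"energy": 40.0},
    "ballistic_s3": {"physical": 55.0},
}
hulls = {
    "fighter": {
        "nose": {"hp": 1000.0, "resist": {"energy": 0.12, "physical": 0.30}},
        "rear": {"hp": 800.0, "resist": {"energy": 0.08, "physical": 0.25}},
    },
}


def effective_damage(profile: dict[str, float], resist: dict[str, float]) -> float:
    # Assumed model: each damage type is scaled by (1 - resistance).
    return sum(dmg * (1.0 - resist.get(kind, 0.0)) for kind, dmg in profile.items())


def build_matrix(weapons: dict[str, Any], hulls: dict[str, Any]) -> list[dict[str, Any]]:
    rows = []
    for (w_name, profile), (h_name, sections) in product(weapons.items(), hulls.items()):
        for s_name, section in sections.items():
            dmg = effective_damage(profile, section["resist"])
            rows.append({
                "weapon": w_name,
                "hull": h_name,
                "section": s_name,
                "damage_per_shot": dmg,
                # None means the weapon can never break this section.
                "shots_to_break": math.ceil(section["hp"] / dmg) if dmg > 0 else None,
            })
    return rows


matrix = build_matrix(weapons, hulls)
# Pre-aggregated lookup keyed by (weapon, hull, section), as a web tool might use.
lookup = {(r["weapon"], r["hull"], r["section"]): r for r in matrix}
```

Two weapons against one hull with two sections gives four rows; scale the inputs up and the row count multiplies the same way.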

The two XLSX report generators are the other place a lot of work has accumulated. They use openpyxl, conditional formatting on numeric columns, frozen header rows, auto-filter, and.. you really don't care about this bit, do you? Continuing on.

What's still missing

The DataCore has 216 top-level record categories. I extract from about 25 of them. There's a long roadmap.

The next-up tier:

  • Crime definitions (247 of them in 4.8). Every infraction in the game with its merit value, fine multiplier, grace period, cooloff time, and lifetime hours. Currently I capture the names but not the rich fields. Useful for PvP planning, stupidly useful for working out how to do crime efficiently.
  • Faction reputation (132 factions, including all the AI-only ones). GEIDs, hostility settings, allied parameters, manufacturer logo references. I've already got a mission planner that does this, but I haven't done diffs yet.
  • Consumable types, structured the same way as the commodity type database I already walk.

A second tier covers melee combat, explosive ordnance damage profiles, IFCS flight tunings per ship, and the 2,196 tint palettes that would let the FPS armour gallery render actual colour swatches instead of names.

There's also an enrichment track that I keep promising myself I'll get to. The DataCore points at DDS texture assets for manufacturer logos, mission-type icons, and commodity art. StarBreaker can decode those to PNG. Once that's wired in, the XLSX reports stop being spreadsheets and start being the kind of shareable documents you'd actually want to drop in a Discord channel.

The plan is one new extractor per patch cycle. Each one follows a checklist. Each one ships in a single PR. The CI gates (ruff, mypy, pytest, run on Ubuntu and Windows on every push) are deliberately strict because the cost of shipping a parser bug that produces silently-wrong numbers is much higher than the cost of CI being annoying. If I let one of those through, the credibility hit on the published spreadsheets is real and recovery means manually verifying every number that came out of the broken module against the in-game UI.

Slow growth. It's the only way I've found to keep something this fiddly maintainable across the months of patches I've run this pipeline against. It has become an absolute fucking monster.

Why I bother

I don't know.

Maybe it's that I find this kind of work satisfying in a way that day-job work rarely is. The problem is well-defined, the dataset is finite (really fucking large, but finite), and every patch produces a fresh challenge that's just slightly different from the last one. I can sit down on a Sunday with a fresh PTU build and a coffee and have a working diff in a couple of hours. There's a clean feedback loop: extract, compare, verify against the in-game UI, ship. Day-job work has none of that. The systems are too big, the feedback loops are too long, and "done" is like eating my own tail.

.. You know, like the ouroboros. The snake. The one that.. whatever.

The other half is that the people who use the outputs are the people I play with. Org members planning loadouts, YouTubers quoting the spreadsheets (and being incredibly annoying QA testers), wiki contributors pulling from the JSON. If I stopped, somebody else would eventually rebuild some of this, but the specific shape of it (six stages, 23 diff categories, the FSW tools, the XLSX workbooks, the small choices like the dark theme and the deliberate IIFE-shaped output and the fix_pyro_names hack) is mine and I like that it exists in the world.

There are many like this, but this one is mine. That's a throwback for you.

If you're a fellow SC datamining nerd and you want a closer look, the repo lives at github.com/Iverik/SC-Datamining. It's helpfully private at the moment because.. well. It's not done yet. I'm not happy with it. Issues and PRs will be welcome when I open it. Fair warning that I have opinions about extraction script structure and the new-extractor checklist isn't optional.

If you're not, the next post will probably be about something fucking stupid. Like birdwatching.