Migrating PMC Literature Pipelines from FTP to AWS-hosted PMC Cloud Service

This post is for bioinformaticians and data engineers running text-mining or literature pipelines against PubMed Central — particularly anyone who has relied on the legacy FTP bulk downloads and needs to understand what the upcoming S3 migration will require. If you process PMC at scale (think millions of articles, automated ingestion, commercial-use filtering), the change is more than a URL swap. This walkthrough covers the new data model, the responsibilities it shifts onto the consumer, and our concrete approach to handling it in production.

In August 2026, the NCBI will permanently delete the legacy PubMed Central FTP files that biomedical text-mining pipelines relied on for many years. The replacement is already available: a public S3 bucket hosted by AWS — pmc-oa-opendata — that mirrors the same corpus with a daily CSV inventory and per-article metadata.

The change is good news if you grab and read documents individually from S3, as each document now has its own prefix. But if your workflow involves processing millions of documents from PMC, the migration hands you a pile of new responsibilities, such as versioning, license classification and retraction tracking.

In this article, we'll walk through what changed in the PMC distribution model, the trade-offs between single-document and bulk access, the responsibilities that now sit with the data consumer, and the approach we took migrating Sable's literature pipeline.

What’s changed?

Before, working with the PMC Open Access corpus meant pulling compressed archives from an FTP server. For commercial-use articles, the path was https://ftp.ncbi.nlm.nih.gov/pub/pmc/deprecated/oa_bulk/oa_comm/xml/ and the files were .tar.gz archives sliced by PMCID range: oa_comm_xml.PMC003XXXXXX.baseline.<date>.tar.gz for the full snapshot, and oa_comm_xml.incr.<date>.tar.gz for incremental updates. License classification was implicit in the folder you chose. Versioning and "what's new since last time" were handled on NCBI's side through the baseline + incremental file structure.

Now, the same corpus lives in a public S3 bucket, s3://pmc-oa-opendata, with a different layout:

Per-article-version prefixes (e.g. PMC10009416.1/), each holding the XML, JSON metadata, plain-text extraction, PDF, and any media files for that single version. There are no license-tier folders in the new layout.
A metadata/ prefix containing every article version's JSON metadata at metadata/PMC<id>.<version>.json. Fields such as license_code, xml_url, and the retraction flag live inside these JSON objects, not in the path.
An inventory-reports/ prefix with a CSV regenerated once a day, enumerating every object currently in the bucket.
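
As a small illustration of the layout above, the keys for one article version can be derived from the PMCID and version number alone. This is a minimal sketch: the helper names are hypothetical, and it assumes the key patterns are exactly the ones described in this post.

```python
def article_prefix(pmcid: str, version: int) -> str:
    """Prefix holding every file for one article version, e.g. 'PMC10009416.1/'."""
    return f"{pmcid}.{version}/"


def metadata_key(pmcid: str, version: int) -> str:
    """Key of the JSON metadata object for that version under metadata/."""
    return f"metadata/{pmcid}.{version}.json"


print(article_prefix("PMC10009416", 1))
print(metadata_key("PMC10009416", 1))
```

Note that nothing in either key says whether the article is commercially usable or retracted; that information only becomes available once the JSON object itself is read.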

What you need to handle now

The new layout exposes more raw structure than the old FTP service did, so a few responsibilities that used to be invisible now present explicit work for the data consumer:

Picking the right version: Multiple versions of the same article can coexist in the bucket (e.g. PMC10009416.1/ and PMC10009416.2/). The consumer decides which one is the latest.
Commercial vs. non-commercial classification: There is no oa_comm folder anymore. The license code lives inside each article's JSON metadata and has to be mapped explicitly.
Retraction tracking: Retracted articles still appear in the bucket. The retraction flag is a field inside the metadata JSON, and the consumer is responsible for reading and handling it.
Incremental sync: The FTP service shipped baseline + incremental tar files, so consumers just pulled the latest delta. The new layout only has the daily CSV inventory. Working out what changed since your last run is left to the consumer.

In short, everything the old FTP layout encoded in folder names and baseline files is now logic that must live inside your pipeline. In practice, that can mean it is your responsibility (or ours, if you are our customer) to process millions of metadata files, extract the relevant fields, and select the appropriate documents for download.

Our approach

Our tech stack

The dataset is large and the work is naturally parallel, so we use Spark to process it across a cluster rather than on a single machine. We store the state of the dataset in Delta tables (an open table format that adds transactions and versioning on top of Parquet — see the Delta Lake docs), which give us a few useful properties:

Upserts. We can insert new articles and update changed ones in a single operation, keyed on the article ID, instead of rewriting the whole dataset each run.
Version history. Delta keeps a history of changes to the table, so we can see what was added or updated and when.
Transactional writes. A run that fails partway through leaves the table in its previous state rather than half-written, which makes a failed run safe to re-run.

These choices aren't specific to PMC. They suit any pipeline that incrementally syncs a large, slowly-changing dataset into a queryable table.

Mirroring the inventory locally first

Before the Spark job runs, a task mirrors the latest NCBI inventory into our own S3 on a weekly schedule. The procedure follows the pattern documented by NCBI:

List s3://pmc-oa-opendata/inventory-reports/pmc-oa-opendata/metadata/, which contains one timestamped folder per daily snapshot (e.g. 2026-02-23T01-00Z/). The latest folder is whichever sorts last.
Download that folder's manifest.json — it lists the exact .csv.gz files that make up the inventory.
Pull only those files from the bucket's data/ subdirectory, then sync them into our S3.

Because the manifest points at exactly the right files, no listing or filtering of the data/ subdirectory is needed. The downstream Spark job always reads against this local mirror.
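
The selection logic in the steps above can be sketched without touching S3 at all. The function names below are hypothetical, and the manifest shape is an assumption: a top-level "files" array whose entries carry a "key" field, as in the standard Amazon S3 Inventory manifest. The sketch also filters the listing to timestamp-shaped names first, in case it returns non-snapshot folders such as data/.

```python
import json
import re

# Snapshot folders look like '2026-02-23T01-00Z/'.
SNAPSHOT_NAME = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}-\d{2}Z/?")


def latest_snapshot(folders: list[str]) -> str:
    """Return the timestamped folder that sorts last.

    These names sort chronologically as plain strings, so max() is enough.
    Raises ValueError if no folder looks like a snapshot.
    """
    snapshots = [name for name in folders if SNAPSHOT_NAME.fullmatch(name)]
    return max(snapshots)


def manifest_data_keys(manifest_text: str) -> list[str]:
    """Extract the inventory file keys listed in a manifest.json body.

    Assumption: entries live under "files" and each has a "key" field.
    """
    manifest = json.loads(manifest_text)
    return [entry["key"] for entry in manifest["files"]]


# Hypothetical sample values, for illustration only.
folders = ["2026-02-21T01-00Z/", "data/", "2026-02-23T01-00Z/"]
sample_manifest = json.dumps({"files": [{"key": "example/data/part-0001.csv.gz"}]})

print(latest_snapshot(folders))
print(manifest_data_keys(sample_manifest))
```

Keeping these two functions pure makes them easy to unit-test; the actual listing, download and sync calls can then be thin wrappers around whichever S3 client you already use.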

With the inventory mirrored, the Spark job does four things on each run:

1. Read the daily inventory CSV

The Spark job reads the daily inventory CSV, filters down to metadata/ keys, and pulls the PMCID and version out of each path with a regex:

# Keep only metadata/ keys, then extract the PMCID and version from each path.
(df.filter(fn.col("metadata_path").startswith("metadata/PMC"))
   .withColumn("pmcid", fn.regexp_extract(..., r"metadata/(PMC\d+)\.\d+\.json", 1))
   .withColumn("version", fn.regexp_extract(..., r"metadata/PMC\d+\.(\d+)\.json", 1).cast("long")))

It then keeps only the latest version per PMC ID with a window function:

window = Window.partitionBy("pmcid").orderBy(fn.desc("version"))
df.withColumn("rn", fn.row_number().over(window)).filter(fn.col("rn") == 1)

This avoids any S3 LIST traversal of the bucket, which can be slow, and resolves the version-selection problem before any JSON file is opened.
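
To make the logic concrete without a Spark cluster, here is a plain-Python sketch of the same two steps: parse the PMCID and version out of each key, then keep the highest version per PMCID. It is not the Spark job itself; the function name and sample keys are hypothetical.

```python
import re

# Matches keys like 'metadata/PMC10009416.2.json', capturing PMCID and version.
METADATA_KEY = re.compile(r"^metadata/(PMC\d+)\.(\d+)\.json$")


def latest_versions(keys):
    """Map each PMCID to its highest version number among the given keys."""
    latest = {}
    for key in keys:
        match = METADATA_KEY.match(key)
        if match is None:
            continue  # not a per-article metadata object
        pmcid, version = match.group(1), int(match.group(2))
        # Compare as integers so that version 10 beats version 2.
        if pmcid not in latest or version > latest[pmcid]:
            latest[pmcid] = version
    return latest


keys = [
    "metadata/PMC10009416.1.json",
    "metadata/PMC10009416.2.json",
    "metadata/PMC123.1.json",
    "PMC123.1/example.xml",  # hypothetical key, ignored: not under metadata/
]
print(latest_versions(keys))
```

The integer conversion mirrors the cast to long in the Spark version: compared as strings, "10" would sort before "2".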

2. Find the most recent changes

To find changed articles we run an anti-join against the Delta tracking table:

df_inventory.join(df_tracking, on=["pmcid", "version"], how="left_anti")

Only articles whose (pmcid, version) pair is missing from the tracking table proceed to the next step — reading the per-article JSON metadata. On a steady-state day, this reduces millions of inventory rows to the few hundred or few thousand that actually changed since the last run.

An alternative is to use the inventory's last modified field, which records when each JSON metadata object was created or last updated. Filtering the CSV to rows newer than the previous run's timestamp would achieve the same narrowing without joining against the tracking table.
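
Both narrowing strategies can be sketched in plain Python. This is an illustration under assumptions rather than our production code: the function names are hypothetical, and the timestamp format (an ISO 8601 string, possibly ending in "Z") is assumed, not taken from the inventory schema.

```python
from datetime import datetime, timezone


def new_pairs(inventory, tracked):
    """Anti-join: (pmcid, version) pairs in the inventory but not yet tracked."""
    return sorted(set(inventory) - set(tracked))


def modified_since(rows, last_run):
    """Alternative: keep paths whose last-modified timestamp is after last_run.

    Assumption: each row is (metadata_path, timestamp) and the timestamp is an
    ISO 8601 string. A trailing 'Z' is rewritten so older Pythons can parse it.
    """
    kept = []
    for path, stamp in rows:
        modified = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        if modified > last_run:
            kept.append(path)
    return kept


inventory = [("PMC1", 1), ("PMC1", 2), ("PMC2", 1)]
tracked = [("PMC1", 1), ("PMC2", 1)]
print(new_pairs(inventory, tracked))

last_run = datetime(2026, 2, 20, tzinfo=timezone.utc)
rows = [("metadata/PMC1.2.json", "2026-02-21T00:00:00Z")]
print(modified_since(rows, last_run))
```

The anti-join depends only on state you control, while the timestamp filter depends on recording the previous run time reliably; that trade-off is worth weighing before picking one.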

3. Fetch metadata efficiently

To fetch the metadata efficiently, the per-article JSON reads are run in parallel on Spark executors via mapPartitions, using an anonymous fsspec client:

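import json
import fsspec

def process_partition(rows):
    # Build the filesystem client once per partition and reuse it for every row.
    fs = fsspec.filesystem(protocol, **fs_kwargs)  # anon=True for S3
    for row in rows:
        with fs.open(f"{base_path}/{row.metadata_path}", "rb") as f:
            meta = json.load(f)
        ...

# Run the reads on executors so the paths are never collected on the driver.
rdd = df_delta.select("metadata_path").rdd.mapPartitions(process_partition)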

Each JSON yields the license_code, a commercial flag derived from a small set of permissive license codes (CC BY, CC0, CC BY-SA, CC BY-ND), the xml_url, and is_retracted. The filesystem client is constructed once per partition and reused for every row in that partition.
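
The extraction described above might look roughly like the following. Treat it as a sketch under assumptions: the exact spelling of the license codes in the JSON and the name of the retraction field ("is_retracted" here) are not confirmed by this post, and the function name is hypothetical.

```python
import json

# Assumption: license codes appear in the JSON exactly as spelled here.
COMMERCIAL_LICENSES = frozenset({"CC BY", "CC0", "CC BY-SA", "CC BY-ND"})


def summarise_metadata(raw: str) -> dict:
    """Reduce one metadata JSON document to the fields the pipeline tracks."""
    meta = json.loads(raw)
    license_code = meta.get("license_code")
    return {
        "license_code": license_code,
        # Anything outside the permissive set, including a missing code,
        # is treated as non-commercial.
        "is_commercial": license_code in COMMERCIAL_LICENSES,
        "xml_url": meta.get("xml_url"),
        # Assumption: the retraction flag is a boolean named "is_retracted".
        "is_retracted": bool(meta.get("is_retracted", False)),
    }


sample = '{"license_code": "CC BY", "xml_url": "s3://example/a.xml", "is_retracted": false}'
print(summarise_metadata(sample))
```

Defaulting an unknown or missing license to non-commercial is the conservative choice: a misclassified article is skipped rather than ingested by mistake.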

Two alternatives we considered and rejected:

spark.read.json("s3://...") would work for a small dataset, but it relies on an S3 LIST to discover the files. On the first run — when the delta is in the millions — that LIST is slow and unnecessary, since the inventory already gives us the exact paths.
Collecting the metadata_path column into a Python list on the driver would put millions of entries into a single process on the first run, risking an out-of-memory error. mapPartitions keeps the iteration distributed across executors and streams the rows within each partition.
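
The once-per-partition, streaming pattern can be shown in isolation with a stand-in for the filesystem client. Everything below is hypothetical scaffolding (the in-memory "bucket", the reader factory and its counter); it only illustrates that the client is built once and that rows are yielded one at a time rather than accumulated.

```python
import json
from typing import Callable, Iterable, Iterator


def process_partition(
    paths: Iterable[str],
    make_reader: Callable[[], Callable[[str], str]],
) -> Iterator[dict]:
    """Yield one parsed record per path, creating the reader only once."""
    read = make_reader()  # built once per partition, reused for every row
    for path in paths:
        yield json.loads(read(path))  # yielding avoids holding all rows in memory


# Hypothetical in-memory "bucket" standing in for S3.
FAKE_BUCKET = {
    "metadata/PMC1.1.json": '{"license_code": "CC BY"}',
    "metadata/PMC2.1.json": '{"license_code": "CC0"}',
}
readers_created = 0


def make_fake_reader() -> Callable[[str], str]:
    """Count how many times a 'client' is constructed."""
    global readers_created
    readers_created += 1
    return FAKE_BUCKET.__getitem__


records = list(process_partition(FAKE_BUCKET, make_fake_reader))
print(records, readers_created)
```

Passing the factory in as an argument also makes the partition function testable without network access, which is convenient when the real client is an anonymous S3 filesystem.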

4. Idempotent upsert via Delta MERGE INTO

The final step is a Delta MERGE keyed on pmcid:

tracking_table.alias("tracking").merge(
    source=df_metadata.alias("delta"),
    condition="tracking.pmcid = delta.pmcid",
).whenMatchedUpdate(set={...}).whenNotMatchedInsert(values={...}).execute()

If a PMCID is new, it is inserted; if it already exists with an older version, the row is overwritten. Re-running the job on the same inventory produces the same tracking table, so a failure mid-run can be recovered by running the job again.
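
The idempotency property is easy to see in a toy model where the tracking table is a dictionary keyed on pmcid. This is a plain-Python analogy with hypothetical names, not the Delta API.

```python
def upsert(tracking: dict, batch: list) -> dict:
    """Return a new tracking table with the batch merged in, keyed on pmcid."""
    merged = dict(tracking)
    for row in batch:
        merged[row["pmcid"]] = row  # insert if new, overwrite if already present
    return merged


tracking = {"PMC1": {"pmcid": "PMC1", "version": 1}}
batch = [
    {"pmcid": "PMC1", "version": 2},  # existing article, newer version
    {"pmcid": "PMC2", "version": 1},  # new article
]

once = upsert(tracking, batch)
twice = upsert(once, batch)  # simulate re-running after a failure
print(once == twice)
```

Because the key is the PMCID alone, applying the same batch twice changes nothing, which is what makes a retry after a failed run safe.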

5. Fetch XML content for eligible articles

With the tracking table updated, a separate Spark job fetches the actual XML. It selects only the rows worth fetching — commercial, not retracted, and not already fetched:

df_tracking.filter(
    fn.col("is_commercial") & ~fn.col("is_retracted") & fn.col("xml_content").isNull()
)

For each of those rows it reads the article's xml_url (again on executors via mapPartitions), merges the XML back into the tracking table, and writes the articles that now have content to the output. Because the filter excludes rows already fetched, each run only retrieves XML that hasn't been downloaded before.
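
A toy version of this fetch loop shows why a second run does no repeat work. The row layout mirrors the column names used in the filter; the function names, the sample URLs and the stand-in fetch function are hypothetical.

```python
def rows_to_fetch(rows):
    """Select rows that are commercial, not retracted, and not yet fetched."""
    return [
        row for row in rows
        if row["is_commercial"] and not row["is_retracted"] and row["xml_content"] is None
    ]


def fetch_run(rows, fetch):
    """Fill xml_content for eligible rows and return how many were fetched."""
    todo = rows_to_fetch(rows)
    for row in todo:
        row["xml_content"] = fetch(row["xml_url"])
    return len(todo)


rows = [
    {"xml_url": "s3://example/PMC1.1.xml", "is_commercial": True, "is_retracted": False, "xml_content": None},
    {"xml_url": "s3://example/PMC2.1.xml", "is_commercial": True, "is_retracted": True, "xml_content": None},
    {"xml_url": "s3://example/PMC3.1.xml", "is_commercial": False, "is_retracted": False, "xml_content": None},
]

first = fetch_run(rows, lambda url: "<article/>")   # stand-in for the real S3 read
second = fetch_run(rows, lambda url: "<article/>")  # nothing left to fetch
print(first, second)
```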

Results

We run these jobs on AWS EMR Serverless, with a Spark cluster of 20 executors (16 vCPUs and 104 GB each).

The initial run is the expensive one. Processing the full corpus — around 9M metadata files and extracting 5M commercial XMLs — took about 6 hours and cost roughly $250. After that, each run only handles what changed: a typical weekly run processes around 20K metadata files and extracts 10K commercial XMLs in about 10 minutes, for under $10.

Optimisation tips

Notes for anyone building something similar:

Use the daily inventory CSV instead of S3 LIST. The CSV is a single read; a LIST over millions of keys is paginated, slow, and costs money.
Resolve the latest version before fetching JSON. A window function over (pmcid, version) collapses the inventory first, so older versions don't need to be read at all.
Create the filesystem client once per partition, not per row. Inside mapPartitions, the fsspec client is constructed once and reused for every row in that partition.
Use Delta MERGE INTO for idempotency. A failed run can be retried safely, because the same inventory produces the same tracking table.

This is the pipeline we built and run at Sable to keep our literature corpus current and commercially licensed. If it sounds like a lot of work, that's because it is! At Sable, we absorb all of it so that pharma and biotech teams don't have to. If your organisation needs reliable access to the PMC corpus (and others) as part of a target safety or literature intelligence workflow, get in touch. This is infrastructure we've already built, and we'd rather you spent your time on the science.
