<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://scrapingnerd.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://scrapingnerd.github.io/" rel="alternate" type="text/html" /><updated>2026-04-12T11:09:19+07:00</updated><id>https://scrapingnerd.github.io/feed.xml</id><title type="html">Scraping Nerd</title><subtitle>Deep dives into advanced web scraping techniques, APIs, and no-code data extraction for TikTok, X.com, and more. Tutorials by Novi Develop.</subtitle><author><name>Novi Develop</name></author><entry><title type="html">Building a Fault-Tolerant YouTube Data Extraction Pipeline for 200k Channels</title><link href="https://scrapingnerd.github.io/building-fault-tolerant-youtube-extraction-pipeline/" rel="alternate" type="text/html" title="Building a Fault-Tolerant YouTube Data Extraction Pipeline for 200k Channels" /><published>2026-04-10T00:00:00+07:00</published><updated>2026-04-10T00:00:00+07:00</updated><id>https://scrapingnerd.github.io/building-fault-tolerant-youtube-extraction-pipeline</id><content type="html" xml:base="https://scrapingnerd.github.io/building-fault-tolerant-youtube-extraction-pipeline/"><![CDATA[<h1 id="building-a-fault-tolerant-youtube-data-extraction-pipeline-for-200k-channels">Building a Fault-Tolerant YouTube Data Extraction Pipeline for 200k Channels</h1>

<p>Extracting data from 200,000 YouTube channels on a daily basis is a monumental engineering challenge. You’ll quickly find that leaning on the standard <strong>YouTube Data API v3</strong> leads to instant quota exhaustion, falling back to tools like <strong>yt-dlp</strong> lands you in the penalty box with CAPTCHAs, and orchestrating <strong>Puppeteer</strong> or <strong>Playwright</strong> grids for headless DOM rendering chews through your server budget without making a dent in the daily workload.</p>

<p>To operate at this scale, heavy browser automation has to go. You must rethink your approach to favor lightweight endpoints, precise delta detection, and robust IP management. Here is a blueprint for doing exactly that.</p>

<hr />

<h2 id="1-ditching-the-dom-a-lean-extraction-strategy">1. Ditching the DOM: A Lean Extraction Strategy</h2>

<p>Traditional scraping methodologies are too heavyweight. They download massive amounts of irrelevant assets and burn CPU cycles rendering layouts. Your objective is raw data, fast.</p>

<h3 id="rss-feeds-as-the-frontline-filter">RSS Feeds as the Frontline Filter</h3>
<p>There’s no need to execute a deep scrape on 200k channels if nothing has changed. Rely on YouTube’s built-in RSS capabilities to verify updates.</p>
<ul>
  <li><strong>The Target:</strong> <code class="language-plaintext highlighter-rouge">https://www.youtube.com/feeds/videos.xml?channel_id=CHANNEL_ID</code></li>
  <li><strong>The Mechanism:</strong> This XML feed is extremely lightweight and fetching it almost never flags bot-protection algorithms. Use it to check for new <code class="language-plaintext highlighter-rouge">entry</code> nodes or updated publishing timestamps. If there is a delta, push the channel ID into your deep-scrape queue. Otherwise, skip it.</li>
</ul>
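<p>As a concrete sketch, delta detection against the Atom feed can be very small. The feed-fetching step is omitted here and the sample feed and stored-state shape are illustrative; in production you would GET the URL above per channel:</p>

```python
# Sketch: delta detection against YouTube's per-channel Atom feed.
# SAMPLE_FEED is a canned example; fetch the real feed over HTTP in production.
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

SAMPLE_FEED = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>yt:video:abc123</id><updated>2026-04-10T00:00:00+00:00</updated></entry>
  <entry><id>yt:video:def456</id><updated>2026-04-09T12:00:00+00:00</updated></entry>
</feed>"""

def latest_entry(feed_xml: str):
    """Return (video_id, updated) of the newest entry in the feed."""
    root = ET.fromstring(feed_xml)
    first = root.find(f"{ATOM}entry")
    return first.find(f"{ATOM}id").text, first.find(f"{ATOM}updated").text

def has_delta(feed_xml: str, last_seen: dict) -> bool:
    """True if the channel published or updated since our stored state."""
    vid, updated = latest_entry(feed_xml)
    return last_seen.get("id") != vid or last_seen.get("updated") != updated
```

<p>Only channels where <code class="language-plaintext highlighter-rouge">has_delta</code> returns true enter the deep-scrape queue; everything else costs one tiny XML fetch.</p>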

<h3 id="pivot-to-the-innertube-api">Pivot to the InnerTube API</h3>
<p>Treating <strong>Playwright</strong> as a primary scraping tool is an antipattern for scale. You want to bypass the web UI entirely and interact directly with <strong>InnerTube</strong> (<code class="language-plaintext highlighter-rouge">/youtubei/v1/</code>), YouTube’s internal API.</p>
<ul>
  <li><strong>The Tactics:</strong> Open your network tab, analyze the JSON payloads being dispatched to InnerTube endpoints, and recreate these POST requests natively. Use fast network libraries (like <code class="language-plaintext highlighter-rouge">requests</code> in Python or <code class="language-plaintext highlighter-rouge">net/http</code> in Go) making sure to inject the appropriate <code class="language-plaintext highlighter-rouge">x-youtube-client-name</code> and <code class="language-plaintext highlighter-rouge">x-youtube-client-version</code> parameters.</li>
  <li><strong>Alternative Approaches:</strong> If <code class="language-plaintext highlighter-rouge">yt-dlp</code> is mandatory for highly specific metadata tasks, avoid invoking its web extractor. Submitting the arguments <code class="language-plaintext highlighter-rouge">player_client=android</code> or <code class="language-plaintext highlighter-rouge">player_client=ios</code> leverages mobile endpoints, which possess a separate and often looser set of anti-bot heuristics.</li>
</ul>
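<p>A minimal sketch of recreating an InnerTube call natively. The client-version string, header values, and payload shape below are illustrative placeholders, not guaranteed current values; capture live ones from your own network tab. No request is actually sent here:</p>

```python
# Sketch: building an InnerTube browse request without a browser.
# Values below are placeholders observed on the web client, not canonical.
import json

def build_innertube_request(channel_id: str, client_version: str = "2.20260101.00.00"):
    """Return (headers, body) for a POST to /youtubei/v1/browse."""
    headers = {
        "content-type": "application/json",
        "x-youtube-client-name": "1",  # web client identifier
        "x-youtube-client-version": client_version,
    }
    body = {
        "context": {
            "client": {"clientName": "WEB", "clientVersion": client_version},
        },
        "browseId": channel_id,  # e.g. a UC... channel ID
    }
    return headers, json.dumps(body)
```

<p>Dispatch the resulting pair with any fast HTTP client; the point is that no DOM rendering is involved.</p>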

<hr />

<h2 id="2-ip-reputation-and-traffic-camouflage">2. IP Reputation and Traffic Camouflage</h2>

<p>Even with hyper-optimized API requests, blasting a single domain at this frequency guarantees a swift IP ban unless you manage your footprint.</p>

<h3 id="rotation-mechanics-and-proxy-quality">Rotation Mechanics and Proxy Quality</h3>
<p>Using AWS or DigitalOcean static IPs to scrape YouTube is a dead end.</p>
<ul>
  <li><strong>Residential Networks:</strong> You are required to bounce requests across premium <strong>residential proxies</strong>. These IPs look like ordinary household internet setups, providing excellent camouflage against WAFs.</li>
  <li><strong>Intelligent Cycling:</strong> While rotating IPs is critical, you must maintain session stickiness across paginated extractions. Ensure that a multi-page pull for a single channel is executed from a single IP, then switch proxies before tackling the next channel.</li>
</ul>
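<p>One way to sketch the stickiness logic (proxy URLs are placeholders):</p>

```python
# Sketch: session stickiness -- one proxy per channel's multi-page pull,
# rotating to a fresh proxy for the next channel.
from itertools import cycle

class StickyRotator:
    def __init__(self, proxies):
        self._pool = cycle(proxies)
        self._current = None

    def proxy_for_channel(self):
        """Call once per channel: advance to a fresh proxy."""
        self._current = next(self._pool)
        return self._current

    def proxy_for_page(self):
        """Call for every page of the current channel: reuse the same proxy."""
        return self._current

rotator = StickyRotator(["http://proxy-a:8080", "http://proxy-b:8080"])
```

<p>Paginated pulls call <code class="language-plaintext highlighter-rouge">proxy_for_page()</code> repeatedly, so the WAF sees a consistent origin for one logical session.</p>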

<h3 id="ja3ja4-tls-handshake-mimicry">JA3/JA4 TLS Handshake Mimicry</h3>
<p>Modern enterprise defenses, including Google’s WAF, inspect the cryptographic signatures of incoming connections—specifically the <strong>JA3/JA4 TLS fingerprint</strong>.</p>
<ul>
  <li><strong>The Solution:</strong> A vanilla HTTP client is instantly recognizable as non-human. Utilize specialized packages like <strong>utls</strong> in Go or <strong>curl-impersonate</strong> to spoof the precise TLS handshake patterns of real Chrome or Safari browsers.</li>
</ul>

<hr />

<h2 id="3-data-flow-orchestration">3. Data Flow Orchestration</h2>

<p>Acquiring the raw data is meaningless unless it’s correctly buffered, normalized, and persisted.</p>

<h3 id="distributed-queues">Distributed Queues</h3>
<p>A monolithic script has no place here. Decouple your crawling steps using message brokers.</p>
<ul>
  <li><strong>The Engine:</strong> Implement <strong>Kafka</strong> or <strong>RabbitMQ</strong> to maintain URL state.</li>
  <li><strong>Worker Redundancy:</strong> Disperse the workload across dozens of isolated workers. Should a task fail or a proxy die, the message safely returns to the queue to be attempted by another worker using a fresh IP.</li>
</ul>
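<p>The requeue-on-failure semantics can be illustrated with the stdlib <code class="language-plaintext highlighter-rouge">queue</code> module standing in for Kafka/RabbitMQ acknowledgements (the scrape callable and attempt limit are illustrative):</p>

```python
# Sketch: a worker that returns failed messages to the queue so another
# worker (with a fresh IP) can retry them, dead-lettering after N attempts.
import queue

def run_worker(tasks: queue.Queue, scrape, max_attempts: int = 3):
    done, failed = [], []
    while not tasks.empty():
        channel_id, attempts = tasks.get()
        try:
            scrape(channel_id)
            done.append(channel_id)
        except Exception:
            if attempts + 1 < max_attempts:
                tasks.put((channel_id, attempts + 1))  # nack: retry later
            else:
                failed.append(channel_id)  # dead-letter after repeated failures
    return done, failed
```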

<h3 id="scalable-storage-layers">Scalable Storage Layers</h3>
<p>Writing highly volatile data across 200,000 entities daily will paralyze a poorly designed schema.</p>
<ul>
  <li><strong>Metadata Persistence:</strong> Rely on <strong>PostgreSQL</strong> or <strong>MySQL</strong> exclusively for static baseline metadata (e.g., creation dates, channel UUIDs).</li>
  <li><strong>Metric Ingestion:</strong> Push rapidly changing statistics (like view counts or sub counts) into a dedicated time-series environment—such as <strong>ClickHouse</strong> or <strong>TimescaleDB</strong>. This isolates heavy read/write analytical loads from your core relational databases.</li>
</ul>

<h3 id="graceful-degradation">Graceful Degradation</h3>
<p>When faced with HTTP 429 errors, immediately halting or violently retrying are both bad options. Integrate <strong>exponential backoff</strong> combined with randomized <strong>jitter</strong> intervals. This prevents your worker arrays from accidentally unleashing a self-inflicted DDoS attack upon your own proxy infrastructure when a target endpoint momentarily wobbles.</p>
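<p>A minimal sketch of full-jitter exponential backoff (the <code class="language-plaintext highlighter-rouge">do_request</code> callable is a stand-in for your HTTP layer and returns a status code):</p>

```python
# Sketch: exponential backoff with full jitter for handling HTTP 429s.
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full jitter: uniform between 0 and min(cap, base * 2**attempt)."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def fetch_with_backoff(do_request, max_attempts: int = 5):
    """Retry on 429 with exponentially growing, jittered sleeps."""
    for attempt in range(max_attempts):
        status = do_request()
        if status != 429:
            return status
        time.sleep(backoff_delay(attempt))
    raise RuntimeError("still rate-limited after retries")
```

<p>The jitter de-synchronizes your worker fleet, so a momentary 429 storm does not turn into a thundering herd against your own proxies.</p>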

<hr />

<h2 id="4-operational-guidelines-and-safety">4. Operational Guidelines and Safety</h2>

<p>Engineering at high throughput means you must respect infrastructure realities and legal guidelines.</p>

<ul>
  <li><strong>Data Scope:</strong> Always restrict extraction pipelines to public-facing, generic facts—such as numerical performance metrics. Do not attempt to siphon personal identifiers, private videos, or circumvent authentication blocks.</li>
  <li><strong>Polite Spacing:</strong> Spiking 200,000 requests in a brief window is reckless. Distribute the operational load continuously over the course of 24 hours to reduce anomaly detection severity and behave like a good internet citizen.</li>
</ul>]]></content><author><name>Novi Develop</name></author><summary type="html"><![CDATA[Building a Fault-Tolerant YouTube Data Extraction Pipeline for 200k Channels]]></summary></entry><entry><title type="html">Claude Code Architecture (Part 3): Automated Context Compaction for Data Ops</title><link href="https://scrapingnerd.github.io/automation/automated-context-compaction-data-ops/" rel="alternate" type="text/html" title="Claude Code Architecture (Part 3): Automated Context Compaction for Data Ops" /><published>2026-04-06T12:00:00+07:00</published><updated>2026-04-06T12:00:00+07:00</updated><id>https://scrapingnerd.github.io/automation/automated-context-compaction-data-ops</id><content type="html" xml:base="https://scrapingnerd.github.io/automation/automated-context-compaction-data-ops/"><![CDATA[<p>Welcome to the end of our specialized ScrapingNerd architectural review. After dissecting the agentic <a href="/automation/agentic-search-ripgrep-deep-dive/">Grep integration in Part 2</a>, we arrive at the absolute crux of high-efficiency AI automation limits: Handling Context Bloat.</p>

<p>Executing tools autonomously via scripts is trivial until a tool hands back a 50 MB stack-trace payload. When context sizes explode, operations halt. Claude Code counters this with layered <strong>Automated Context Compaction</strong> thresholds.</p>

<h2 id="the-tool-result-budget-limits">The Tool Result Budget Limits</h2>

<p>Before automated sequences cascade their outputs back into the inference cycle, Claude filters them through an intermediate budget layer (<code class="language-plaintext highlighter-rouge">applyToolResultBudget()</code>). Suppose you parse massive JSON web-request tables entirely in the backend: Claude enforces <code class="language-plaintext highlighter-rouge">maxResultSizeChars</code>, stringently preventing payloads beyond roughly 20,000 characters from resolving into the context.</p>

<p>Rather than sending those 20,000-plus characters, it writes the full payload to disk and returns a short, truncated proxy message that points to its persistent local address.</p>
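<p>An illustrative reimplementation of that budget idea (function and constant names mirror the description above, not Anthropic's real source):</p>

```python
# Sketch: pass small tool results through; spill oversized ones to disk
# and hand the model only a short pointer message.
import os
import tempfile

MAX_RESULT_CHARS = 20_000

def apply_tool_result_budget(result: str, max_chars: int = MAX_RESULT_CHARS) -> str:
    if len(result) <= max_chars:
        return result
    fd, path = tempfile.mkstemp(suffix=".toolresult")
    with os.fdopen(fd, "w") as f:
        f.write(result)
    # The model sees only a pointer to the persisted payload.
    return f"[result truncated: {len(result)} chars written to {path}]"
```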

<h2 id="silent-data-snips">Silent Data Snips</h2>

<p>Data automation thrives on chronological retention, but the output of older tool calls loses inference value as the session runs on.</p>

<p>Claude runs the <code class="language-plaintext highlighter-rouge">HISTORY_SNIP</code> flag inside the execution loop. If a data payload was generated 8 API iterations ago, it programmatically deletes the inner payload while preserving the outer tool brackets (so the LLM remembers that it ran the logic, without perpetually paying the token cost of the huge textual return).</p>
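<p>A hypothetical sketch of the snip pass (the message shape and threshold here are illustrative, not the leaked source):</p>

```python
# Sketch: old tool messages keep their outer bracket (the fact a tool ran)
# while the inner payload is dropped.
SNIP_AFTER = 8  # snip payloads older than this many iterations

def snip_history(messages: list, current_turn: int) -> list:
    snipped = []
    for m in messages:
        if m.get("role") == "tool" and current_turn - m["turn"] > SNIP_AFTER:
            m = {**m, "content": "[payload snipped]"}  # bracket survives, body goes
        snipped.append(m)
    return snipped
```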

<h2 id="threshold-session-syntheses-autocompact">Threshold Session Syntheses (Autocompact)</h2>

<p>The ultimate safety valve executes fully autonomously. When the session climbs to within roughly 13,000 tokens of the maximum context size, an asynchronous fail-safe triggers immediately.</p>

<p><strong>Strategy (Session Memory Compaction):</strong></p>
<ol>
  <li>The backend immediately pauses all active data-traversal tools.</li>
  <li>An isolated forked agent spawns silently and reads the raw transcript dump locally.</li>
  <li>That agent compresses the exact project workflow (<code class="language-plaintext highlighter-rouge"># Workflow</code>, <code class="language-plaintext highlighter-rouge"># Files and Functions</code>, <code class="language-plaintext highlighter-rouge"># Key Results</code>) into structured, synthesized summaries persisted via hidden <code class="language-plaintext highlighter-rouge">.claude/session_memory</code> markers.</li>
  <li>The active session clears all transient, bloated payloads, keeping only the distilled structural map before gracefully resuming its scraping pipeline.</li>
</ol>

<p>This automated cleanup is invisible to the user and guarantees that terminal automations never starve for context, regardless of session length.</p>]]></content><author><name>Scraping Nerd</name></author><category term="Memory State" /><category term="LLM Tuning" /><category term="Data Operations" /><category term="Backend Engineering" /><summary type="html"><![CDATA[The final post in the ScrapingNerd series. We explore how Claude manipulates log sizes specifically targeting data payload context drops via SM-Compact.]]></summary></entry><entry><title type="html">Claude Code Architecture (Part 2): Deep Dive into Agentic Search (GrepTool)</title><link href="https://scrapingnerd.github.io/automation/agentic-search-ripgrep-deep-dive/" rel="alternate" type="text/html" title="Claude Code Architecture (Part 2): Deep Dive into Agentic Search (GrepTool)" /><published>2026-04-06T11:00:00+07:00</published><updated>2026-04-06T11:00:00+07:00</updated><id>https://scrapingnerd.github.io/automation/agentic-search-ripgrep-deep-dive</id><content type="html" xml:base="https://scrapingnerd.github.io/automation/agentic-search-ripgrep-deep-dive/"><![CDATA[<p>Automating large data pipelines depends on traversing massive amounts of content quickly. For coding automation, that means mastering search limits. Following <a href="/automation/automation-infrastructure-claude-code/">Part 1’s</a> overview, we will now analyze Claude’s weapon of choice: The <code class="language-plaintext highlighter-rouge">GrepTool</code>.</p>

<p>You simply cannot feed thousands of files directly into an orchestration prompt. Claude resolves this constraint by embedding ultra-fast local binary search mechanisms natively utilizing <strong>Ripgrep (rg)</strong>.</p>

<h2 id="binary-execution-and-fallback">Binary Execution and Fallback</h2>

<p>Spawning non-native binaries from Node across platforms easily leads to silent failures. To fix this, Claude implements the <code class="language-plaintext highlighter-rouge">getRipgrepConfig()</code> priority sequence:</p>
<ol>
  <li><strong>System Resolver:</strong> Checks for pre-installed user PATH applications.</li>
  <li><strong>Embedded Build:</strong> Falls back to a build statically bundled inside the primary binary.</li>
  <li><strong>Local Vendor Fallback:</strong> Deploys a platform-specific binary vendored during npm setup (e.g. <code class="language-plaintext highlighter-rouge">vendor/ripgrep/{arch}-{platform}/rg</code>).</li>
</ol>

<h2 id="building-the-safe-scrape">Building the Safe Scrape</h2>

<p>When extracting information, garbage in yields garbage out. Claude automatically forces parameters to guarantee stability:</p>
<ul>
  <li>Automatically appends <code class="language-plaintext highlighter-rouge">--max-columns 500</code> to prevent infinite log line string overflows from polluting LLM comprehension.</li>
  <li>Auto-excludes noisy roots like <code class="language-plaintext highlighter-rouge">.git</code>, <code class="language-plaintext highlighter-rouge">.hg</code>, and <code class="language-plaintext highlighter-rouge">.bzr</code>.</li>
  <li>Translates complex string flags into matching semantic filters (for example, safely mapping multi-line context requests onto the corresponding executable flags).</li>
</ul>

<h2 id="resilience--the-eagain-defense">Resilience &amp; the EAGAIN Defense</h2>

<p>Automation inevitably runs headlong into OS resource limits.</p>

<p>Running parallel extraction traces occasionally trips thread-spawn limits in constrained environments such as Docker, crashing pipelines with the obscure <code class="language-plaintext highlighter-rouge">resource temporarily unavailable (EAGAIN)</code> error. Claude intercepts this exact error profile via regex logic and instantly retries the query with <code class="language-plaintext highlighter-rouge">-j 1</code> (forcing single-threaded processing), sacrificing marginal speed to guarantee reliability.</p>
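<p>The retry pattern can be sketched with an injectable <code class="language-plaintext highlighter-rouge">run</code> callable standing in for spawning rg, so the logic is testable without ripgrep installed:</p>

```python
# Sketch of the EAGAIN defense: detect the error text and rerun single-threaded.
# In production, `run` would spawn the rg binary via subprocess.
import re

EAGAIN_RE = re.compile(r"resource temporarily unavailable|EAGAIN", re.IGNORECASE)

def search_with_eagain_fallback(args: list, run):
    """run(args) -> (exit_code, stdout, stderr). Retry with -j 1 on EAGAIN."""
    code, out, err = run(args)
    if code != 0 and EAGAIN_RE.search(err):
        code, out, err = run(["-j", "1", *args])  # force one worker thread
    return code, out
```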

<h2 id="achieving-paged-search">Achieving Paged Search</h2>

<p>To keep the extraction payload from blowing out, <code class="language-plaintext highlighter-rouge">GrepTool</code> implements rigorous <strong>Offset Pagination</strong>. Total raw output is capped using variables like <code class="language-plaintext highlighter-rouge">head_limit</code>. If Ripgrep finds 100,000 matches, it supplies Claude exactly 250, trailed by the warning: <code class="language-plaintext highlighter-rouge">[Showing results with pagination = limit: 250]</code>.</p>

<p>Claude analyzes the partial block, infers data properties iteratively, and can issue successive tool calls against offset indexes (<code class="language-plaintext highlighter-rouge">offset: 250</code>), allowing complete traversal of monolithic codebases in chunks.</p>
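<p>The pagination mechanics described above reduce to a simple slice plus a truncation note; this sketch mirrors that behavior:</p>

```python
# Sketch: offset pagination over a large match set, mirroring the
# head_limit / offset behavior described above.
def paginate(matches: list, offset: int = 0, head_limit: int = 250):
    page = matches[offset : offset + head_limit]
    truncated = offset + head_limit < len(matches)
    note = f"[Showing results with pagination = limit: {head_limit}]" if truncated else ""
    return page, note

matches = [f"file_{i}.py:match" for i in range(1000)]
first_page, note = paginate(matches)            # results 0-249
second_page, _ = paginate(matches, offset=250)  # the next chunk
```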

<p>In <strong>Part 3</strong>, we scale our automation metrics by observing how Claude manages those massive data dumps memory constraints without failing contexts.</p>]]></content><author><name>Scraping Nerd</name></author><category term="Ripgrep" /><category term="Search Engine" /><category term="Scalable Automation" /><category term="Data Parsing" /><summary type="html"><![CDATA[Part 2 of our ScrapingNerd series. Discover how Claude scales search integration via Ripgrep, binary fallbacks, and rigorous offset pagination.]]></summary></entry><entry><title type="html">Claude Code Architecture (Part 1): The Automation Infrastructure</title><link href="https://scrapingnerd.github.io/automation/automation-infrastructure-claude-code/" rel="alternate" type="text/html" title="Claude Code Architecture (Part 1): The Automation Infrastructure" /><published>2026-04-06T10:00:00+07:00</published><updated>2026-04-06T10:00:00+07:00</updated><id>https://scrapingnerd.github.io/automation/automation-infrastructure-claude-code</id><content type="html" xml:base="https://scrapingnerd.github.io/automation/automation-infrastructure-claude-code/"><![CDATA[<p>Welcome to Part 1 of the architecture series dissecting <strong>Claude Code v2.1.88</strong>. From a data engineering and automation perspective, observing how an LLM can traverse directories, execute terminal sequences, and systematically digest massive data structures is a masterclass in resilient infrastructure.</p>

<p>Traditional LLM wrappers suffer heavily from hallucinated outputs that map improperly to CLI flags. Claude tackles this through structured orchestration.</p>

<h2 id="integrating-the-terminal">Integrating the Terminal</h2>

<p>Unlike conventional chat UI frameworks, Claude uses a node backend directly interfaced with your machine’s environment. The primary input loop passes all terminal traces through an abstract system labeled the <code class="language-plaintext highlighter-rouge">QueryEngine</code>.</p>

<p>When you ask the system to “find specific parsing variables in the project,” the LLM isn’t randomly guessing strings. The input generates a structured <code class="language-plaintext highlighter-rouge">tool_use</code> JSON format natively connected to predefined execution models (e.g. <code class="language-plaintext highlighter-rouge">BashTool</code>, <code class="language-plaintext highlighter-rouge">FileReadTool</code>, <code class="language-plaintext highlighter-rouge">GlobTool</code>).</p>

<h3 id="the-zod-validation-layer">The Zod Validation Layer</h3>

<p>Handling raw text integration requires unyielding validation bounds. Before a tool is authorized to scrape file content from your directory, the pipeline passes the raw Anthropic API payload directly into Zod schema interpreters. If Claude suggests traversing <code class="language-plaintext highlighter-rouge">~/.data/*</code> but utilizes the wrong recursive parameter tag, the Zod handler immediately traps the schema violation and feeds an error trace automatically back into the LLM context.</p>

<p>This creates a self-healing automation loop where the LLM realizes its own scraping error and retries syntactically perfectly.</p>
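<p>The self-healing loop reduces to a validate-and-retry pattern. In this sketch a plain Python validator stands in for the Zod schema layer, and the function names are illustrative rather than Anthropic's:</p>

```python
# Sketch: schema violations are fed back as context so the model can
# repair its own tool call on the next attempt.
def validate_tool_call(call: dict) -> list:
    """Return a list of schema violations; empty means valid."""
    errors = []
    if call.get("tool") not in {"BashTool", "FileReadTool", "GlobTool"}:
        errors.append(f"unknown tool: {call.get('tool')}")
    if not isinstance(call.get("input"), dict):
        errors.append("input must be an object")
    return errors

def run_with_self_healing(propose_call, execute, max_retries: int = 3):
    """Feed validation errors back into the model until the call passes."""
    feedback = None
    for _ in range(max_retries):
        call = propose_call(feedback)  # the LLM proposes a tool_use payload
        errors = validate_tool_call(call)
        if not errors:
            return execute(call)
        feedback = "; ".join(errors)  # error trace returns to the LLM context
    raise RuntimeError("tool call never validated")
```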

<h2 id="parallel-tool-execution">Parallel Tool Execution</h2>

<p>Automation thrives on speed. One of the greatest optimization factors within the infrastructure lies in the tool batching technique.</p>

<p>Mutability is identified before any process executes. If Claude decides it needs to scrape logic structures from 10 different <code class="language-plaintext highlighter-rouge">.py</code> files across varying directories using <code class="language-plaintext highlighter-rouge">FileReadTool</code>, the backend explicitly flags that interaction as <code class="language-plaintext highlighter-rouge">readOnly: true</code>. <code class="language-plaintext highlighter-rouge">Promise.all</code> then spawns the commands asynchronously in a concurrent execution layer, so extracting multiple elements is massively parallelized rather than serially bottlenecked.</p>
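<p>The same batching pattern can be sketched in Python, with <code class="language-plaintext highlighter-rouge">asyncio.gather</code> playing the role of <code class="language-plaintext highlighter-rouge">Promise.all</code> (the file names and call shape are illustrative):</p>

```python
# Sketch: batch read-only tool calls concurrently; run mutating ones serially.
import asyncio

async def read_file(path: str) -> str:
    await asyncio.sleep(0)  # stand-in for real async file I/O
    return f"contents of {path}"

async def run_batch(calls: list) -> list:
    if all(c["readOnly"] for c in calls):
        # Safe to parallelize: no call mutates state.
        return list(await asyncio.gather(*(read_file(c["path"]) for c in calls)))
    # Mutating calls run serially to avoid races.
    return [await read_file(c["path"]) for c in calls]

calls = [{"path": f"mod_{i}.py", "readOnly": True} for i in range(10)]
results = asyncio.run(run_batch(calls))
```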

<p>In <strong>Part 2</strong>, we will shift entirely to examine the single most sophisticated automation driver in the architecture: <strong>Agentic Search and the GrepTool layer</strong>.</p>]]></content><author><name>Scraping Nerd</name></author><category term="Automation" /><category term="Tool Execution" /><category term="Parsing" /><category term="Integration" /><summary type="html"><![CDATA[Part 1 of our ScrapingNerd series. We analyze how Claude Code successfully parses LLM output to run seamless automation structures in the terminal.]]></summary></entry><entry><title type="html">Advanced Anti-Bot Evasion: Defeating TLS Fingerprinting and CDP Detection</title><link href="https://scrapingnerd.github.io/advanced-anti-bot-evasion-tls-cdp/" rel="alternate" type="text/html" title="Advanced Anti-Bot Evasion: Defeating TLS Fingerprinting and CDP Detection" /><published>2026-04-03T10:00:00+07:00</published><updated>2026-04-03T10:00:00+07:00</updated><id>https://scrapingnerd.github.io/advanced-anti-bot-evasion-tls-cdp</id><content type="html" xml:base="https://scrapingnerd.github.io/advanced-anti-bot-evasion-tls-cdp/"><![CDATA[<p>Scraping modern web applications protected by enterprise-grade solutions like Cloudflare Turnstile, Akamai, or DataDome requires more than simple request-response cycles. To successfully extract data from high-security targets without encountering <strong>403 Forbidden</strong> errors or <strong>CAPTCHA</strong> loops, you need deep control over your browser fingerprint and network characteristics.</p>

<p>Here is a deep dive into the technical vectors you must master to bypass modern passive and active fingerprinting.</p>

<h2 id="1-defeating-passive-tls-and-http2-fingerprinting">1. Defeating Passive TLS and HTTP/2 Fingerprinting</h2>

<p>Modern WAFs employ passive fingerprinting to identify automated scripts at the network layer before a single line of JavaScript even executes.</p>

<h3 id="tls-fingerprinting-ja3">TLS Fingerprinting (JA3)</h3>
<p>When a client initiates an HTTPS connection, it sends a “Client Hello” packet. Standard libraries like Python’s <code class="language-plaintext highlighter-rouge">requests</code> or <code class="language-plaintext highlighter-rouge">aiohttp</code> generate a very distinct <strong>JA3 Fingerprint</strong> that immediately signals “non-browser” traffic to the WAF.</p>
<ul>
  <li><strong>The Fix:</strong> You must use libraries that support TLS Client Hello GREASE and extension shuffling. Tools like Python’s <strong><code class="language-plaintext highlighter-rouge">curl_cffi</code></strong> or <strong><code class="language-plaintext highlighter-rouge">httpx</code></strong> with specialized transporters can perfectly mimic Chrome’s TLS handshake, drastically reducing early-stage blocking.</li>
</ul>

<h3 id="http2-prioritization">HTTP/2 Prioritization</h3>
<p>Browsers have highly specific, deterministic patterns for requesting resources (CSS, JS, Images) across HTTP/2 multiplexed frames.</p>
<ul>
  <li><strong>Implementation:</strong> Always ensure your scraping client fully supports <strong>HTTP/2</strong>. When using automation drivers like <strong>Playwright</strong> or <strong>Puppeteer</strong>, the browser engine handles this natively, but you must ensure your header order perfectly matches the expected profile of the emulated browser version.</li>
</ul>

<h2 id="2-browser-stealth-and-environment-masking">2. Browser Stealth and Environment Masking</h2>

<p>If a target requires JavaScript execution (e.g., executing a Cloudflare Turnstile challenge), you have no choice but to utilize a headless browser. However, default headless modes leak multiple flags, such as the <code class="language-plaintext highlighter-rouge">navigator.webdriver</code> property.</p>

<ul>
  <li><strong>Playwright with Stealth:</strong> Implement the <strong><code class="language-plaintext highlighter-rouge">stealth</code></strong> plugin via <strong><code class="language-plaintext highlighter-rouge">playwright-extra</code></strong>. This plugin actively patches common leaks, including <code class="language-plaintext highlighter-rouge">chrome.runtime</code>, WebGL vendor strings, and default viewport dimensions.</li>
  <li><strong>Canvas and WebGL Noise:</strong> Advanced anti-bots utilize hardware rendering to “fingerprint” your GPU signature. It’s highly recommended to inject a script that adds slight, consistent “noise” into Canvas rendering. This effectively prevents the WAF from tracking your session across different requests based on your hardware profile.</li>
  <li><strong>The CDP Leak:</strong> Standard automation uses the Chrome DevTools Protocol (CDP), which often leaves detectable traces. For the most hardened targets, consider deploying <strong>undetected-chromedriver</strong> or specialized, modified browser binaries (like Brave or custom Chromium builds) to bypass deep environment introspection.</li>
</ul>

<h2 id="3-behavioral-emulation-avoiding-robotic-heuristics">3. Behavioral Emulation: Avoiding Robotic Heuristics</h2>

<p>Having a perfect fingerprint means nothing if your bot moves like a bot.</p>

<ul>
  <li><strong>Optical-Motor Lag:</strong> Avoid linear cursor movements. Utilize <strong>Bezier curves</strong> and introduce random jitters to simulate human mouse interaction.</li>
  <li><strong>Variable Latency:</strong> Never fire requests or actions at exact intervals. Implement a <strong>Gaussian distribution</strong> for delays (e.g., <code class="language-plaintext highlighter-rouge">time.sleep(random.gauss(5, 1))</code>).</li>
  <li><strong>Input Dynamics:</strong> When interacting with forms, introduce realistic and varying delays between <code class="language-plaintext highlighter-rouge">keydown</code> and <code class="language-plaintext highlighter-rouge">keyup</code> events.</li>
</ul>
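<p>The cursor-movement point above can be sketched as a cubic Bezier path with per-point jitter (coordinates and parameters here are arbitrary; feed the resulting points to your automation driver's mouse API):</p>

```python
# Sketch: a cubic Bezier cursor path with random control points and jitter,
# avoiding the linear, robotic movement that heuristics flag.
import random

def bezier_path(start, end, steps: int = 30, jitter: float = 2.0):
    """Points along a cubic Bezier from start to end with random control points."""
    (x0, y0), (x3, y3) = start, end
    x1, y1 = x0 + (x3 - x0) * 0.3, y0 + random.uniform(-100, 100)
    x2, y2 = x0 + (x3 - x0) * 0.7, y3 + random.uniform(-100, 100)
    points = []
    for i in range(steps + 1):
        t = i / steps
        # Cubic Bezier interpolation.
        x = (1-t)**3*x0 + 3*(1-t)**2*t*x1 + 3*(1-t)*t**2*x2 + t**3*x3
        y = (1-t)**3*y0 + 3*(1-t)**2*t*y1 + 3*(1-t)*t**2*y2 + t**3*y3
        j = 0.0 if i in (0, steps) else jitter  # keep endpoints exact
        points.append((x + random.uniform(-j, j), y + random.uniform(-j, j)))
    return points
```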

<p>Mastering these core components—network-level masking, deep environment stealth, and behavioral emulation—is the baseline requirement for maintaining a successful data extraction pipeline against modern WAFs. If you’d prefer to skip the complexity of building this stack yourself, our <a href="/tools/">pre-built scraping tools</a> already incorporate these evasion techniques for platforms like TikTok and X.com.</p>]]></content><author><name>Novi Develop</name></author><category term="Web Scraping" /><category term="Security" /><category term="Automation" /><category term="tls-fingerprinting" /><category term="ja3" /><category term="puppeteer" /><category term="playwright" /><category term="anti-bot" /><summary type="html"><![CDATA[Scraping modern web applications protected by enterprise-grade solutions like Cloudflare Turnstile, Akamai, or DataDome requires more than simple request-response cycles. To successfully extract data from high-security targets without encountering 403 Forbidden errors or CAPTCHA loops, you need deep control over your browser fingerprint and network characteristics.]]></summary></entry><entry><title type="html">The Claude Code Source Map Leak: Inside the Agentic Harness</title><link href="https://scrapingnerd.github.io/engineering/claude-code-source-map-leak/" rel="alternate" type="text/html" title="The Claude Code Source Map Leak: Inside the Agentic Harness" /><published>2026-04-03T08:00:00+07:00</published><updated>2026-04-03T08:00:00+07:00</updated><id>https://scrapingnerd.github.io/engineering/claude-code-source-map-leak-agentic-harness</id><content type="html" xml:base="https://scrapingnerd.github.io/engineering/claude-code-source-map-leak/"><![CDATA[<p>The “Great Claude Code Leak” of March 2026 stands as a watershed moment in the security of AI development tools. 
By inadvertently exposing the <strong>Agentic Harness</strong> of Anthropic’s flagship CLI tool, the incident provided an unprecedented look at high-level AI orchestration while highlighting critical vulnerabilities in modern JavaScript/TypeScript build pipelines.</p>

<p>Here we’ll analyze the technical root cause of the leak and deconstruct what the exposed architectural code tells us about state-of-the-art AI agent tooling.</p>

<h2 id="1-technical-root-cause-the-source-map-oversight">1. Technical Root Cause: The Source Map Oversight</h2>

<p>The leak was not the result of a sophisticated breach but a <strong>misconfiguration</strong> within the deployment pipeline.</p>

<ul>
  <li><strong>The Mechanism:</strong> Anthropic released version <strong>2.1.88</strong> of the <code class="language-plaintext highlighter-rouge">@anthropic-ai/claude-code</code> npm package containing a massive <strong>59.8 MB Source Map (<code class="language-plaintext highlighter-rouge">.map</code>)</strong> file.</li>
  <li><strong>The Vulnerability:</strong> Source maps are designed to map minified, obfuscated production code back to its original <strong>TypeScript</strong> source for debugging. Due to a known but unpatched bug in the <strong>Bun Runtime</strong> (Issue #28001), the build process ignored the <code class="language-plaintext highlighter-rouge">minify: true</code> and <code class="language-plaintext highlighter-rouge">sourcemap: none</code> flags, bundling the full source mapping into the public artifact.</li>
  <li><strong>Data Exposure:</strong> Beyond the logic, the source map contained hardcoded pointers to an internal <strong>R2 Bucket</strong> (Cloudflare). This bucket was temporarily misconfigured with public read permissions, allowing users to reconstruct the entire repository.</li>
</ul>

<h2 id="2-architectural-revelations-deconstructing-the-agentic-harness">2. Architectural Revelations: Deconstructing the “Agentic Harness”</h2>

<p>The leak exposed over <strong>512,000 lines of TypeScript</strong>, revealing the sophisticated engineering required to make an LLM act as an autonomous developer.</p>

<h3 id="strict-write-discipline--memory-management">Strict Write Discipline &amp; Memory Management</h3>
<p>Claude Code utilizes a <strong>Self-Healing Memory</strong> architecture. Unlike simpler wrappers, it implements a <strong>Strict Write Discipline</strong> where the agent’s internal state (context) is only updated after a filesystem operation is verified via checksums. This effectively prevents “context drift” during long-running refactoring sessions.</p>

<h3 id="the-autodream-background-process">The “autoDream” Background Process</h3>
<p>The source code revealed a background optimization routine called <code class="language-plaintext highlighter-rouge">autoDream</code>. This process triggers during idle periods to consolidate session logs, prune redundant tokens, and “compress” the history into a high-density vector representation. The result is a significantly reduced <strong>TTFT (Time To First Token)</strong> for subsequent queries.</p>

<h3 id="tooling-orchestration">Tooling Orchestration</h3>
<p>The codebase manages a fleet of over <strong>40 specialized tools</strong>, demonstrating advanced delegation, including:</p>
<ul>
  <li><strong>Playwright Integration:</strong> For headless browser testing and UI verification.</li>
  <li><strong>PostgreSQL Adapters:</strong> For direct schema introspection and data migration.</li>
  <li><strong>LSP (Language Server Protocol) Clients:</strong> Enabling Claude to “understand” symbol definitions and references across a workspace without reading every file into the context window.</li>
</ul>
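<p>The LSP point deserves emphasis: the protocol is an open JSON-RPC standard, so resolving a symbol costs one small message rather than a context window full of source files. A definition lookup, per the LSP specification, looks like this (sketched in Python; the file URI and position are arbitrary examples):</p>

```python
# A textDocument/definition request as defined by the Language Server Protocol.
# This is why LSP is cheap for an agent: resolving a symbol is one small
# JSON-RPC message, not a full read of every candidate file.
import json

def definition_request(request_id: int, file_uri: str, line: int, character: int) -> str:
    """Serialize an LSP textDocument/definition request (positions are zero-based)."""
    msg = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "textDocument/definition",
        "params": {
            "textDocument": {"uri": file_uri},
            "position": {"line": line, "character": character},
        },
    }
    return json.dumps(msg)

# Usage
payload = definition_request(1, "file:///src/app.py", 41, 10)
```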

<p>This look into Anthropic’s secret sauce changes how developers understand LLM tooling—proving that building autonomous AI requires incredibly deep, classical engineering infrastructure alongside the prompts. The same principle applies to data extraction: reliable <a href="/tools/">scraping tools</a> rely on robust orchestration, proxy management, and anti-detection layers that go far beyond simple HTTP requests.</p>]]></content><author><name>Novi Develop</name></author><category term="Systems Engineering" /><category term="Security" /><category term="AI" /><category term="openai" /><category term="claude" /><category term="cybersecurity" /><category term="agentic-harness" /><summary type="html"><![CDATA[The “Great Claude Code Leak” of March 2026 stands as a watershed moment in the security of AI development tools. By inadvertently exposing the Agentic Harness of Anthropic’s flagship CLI tool, the incident provided an unprecedented look at high-level AI orchestration while highlighting critical vulnerabilities in modern JavaScript/TypeScript build pipelines.]]></summary></entry><entry><title type="html">Part 4: OAuth2 authentication and YouTube Data API uploads</title><link href="https://scrapingnerd.github.io/tiktok/tiktok-youtube-part4-scheduling-publishing/" rel="alternate" type="text/html" title="Part 4: OAuth2 authentication and YouTube Data API uploads" /><published>2026-03-27T08:00:00+07:00</published><updated>2026-03-27T08:00:00+07:00</updated><id>https://scrapingnerd.github.io/tiktok/tiktok-youtube-part4-scheduling-publishing</id><content type="html" xml:base="https://scrapingnerd.github.io/tiktok/tiktok-youtube-part4-scheduling-publishing/"><![CDATA[<p>This is the final component of our distributed pipeline. 
Having successfully managed the ETL process (Extract from TikTok, Transform via LLM/FFmpeg, Load to disk in <a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3</a>), we now publish the normalized <code class="language-plaintext highlighter-rouge">.mp4</code> artifacts to their destination platform through the YouTube Data API v3.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1: Interfacing with the Apify Python SDK</a></li>
    <li><a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Asynchronous video ingestion and connection pooling</a></li>
    <li><a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: LLM script synthesis and FFmpeg concatenation</a></li>
    <li><strong>Part 4: OAuth2 authentication and YouTube Data API uploads</strong> ← You are here</li>
  </ul>
</blockquote>

<h2 id="architecture-and-dependency-context">Architecture and dependency context</h2>

<p>We require Google’s official client libraries: <code class="language-plaintext highlighter-rouge">google-auth-oauthlib</code> for the OAuth authorization flow and <code class="language-plaintext highlighter-rouge">google-api-python-client</code> for the discovery-based endpoint definitions.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>google-auth google-auth-oauthlib google-api-python-client
</code></pre></div></div>

<h2 id="managing-oauth2-credentials-lifecycle">Managing OAuth2 credentials lifecycle</h2>

<p>The primary challenge of uploading via the YouTube API is maintaining the <code class="language-plaintext highlighter-rouge">access_token</code> lifecycle without repeated human intervention. We request permissions once against the local loopback interface and store a serialized binary object (<code class="language-plaintext highlighter-rouge">.pickle</code>) containing the long-lived <code class="language-plaintext highlighter-rouge">refresh_token</code>, which the client can exchange for fresh access tokens indefinitely.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># google_auth_handler.py
</span><span class="kn">import</span> <span class="n">pickle</span>
<span class="kn">from</span> <span class="n">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="kn">from</span> <span class="n">google.auth.transport.requests</span> <span class="kn">import</span> <span class="n">Request</span>
<span class="kn">from</span> <span class="n">google_auth_oauthlib.flow</span> <span class="kn">import</span> <span class="n">InstalledAppFlow</span>
<span class="kn">from</span> <span class="n">googleapiclient.discovery</span> <span class="kn">import</span> <span class="n">build</span>

<span class="n">SCOPES</span> <span class="o">=</span> <span class="p">[</span><span class="sh">"</span><span class="s">https://www.googleapis.com/auth/youtube.upload</span><span class="sh">"</span><span class="p">]</span>
<span class="n">TOKEN_CACHE</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">credentials/youtube_token.pickle</span><span class="sh">"</span><span class="p">)</span>
<span class="n">CLIENT_SECRET</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">credentials/client_secret.json</span><span class="sh">"</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">bootstrap_oauth_service</span><span class="p">():</span>
    <span class="sh">"""</span><span class="s">Initializes the YouTube API interface, rotating tokens automatically.</span><span class="sh">"""</span>
    <span class="n">creds</span> <span class="o">=</span> <span class="bp">None</span>
    
    <span class="c1"># Load previously serialized tokens to avoid continuous user authorization
</span>    <span class="k">if</span> <span class="n">TOKEN_CACHE</span><span class="p">.</span><span class="nf">exists</span><span class="p">():</span>
        <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">TOKEN_CACHE</span><span class="p">,</span> <span class="sh">"</span><span class="s">rb</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">token</span><span class="p">:</span>
            <span class="n">creds</span> <span class="o">=</span> <span class="n">pickle</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">token</span><span class="p">)</span>

    <span class="k">if</span> <span class="ow">not</span> <span class="n">creds</span> <span class="ow">or</span> <span class="ow">not</span> <span class="n">creds</span><span class="p">.</span><span class="n">valid</span><span class="p">:</span>
        <span class="c1"># Transparently solicit a new access_token if the short-lived token expired
</span>        <span class="k">if</span> <span class="n">creds</span> <span class="ow">and</span> <span class="n">creds</span><span class="p">.</span><span class="n">expired</span> <span class="ow">and</span> <span class="n">creds</span><span class="p">.</span><span class="n">refresh_token</span><span class="p">:</span>
            <span class="n">creds</span><span class="p">.</span><span class="nf">refresh</span><span class="p">(</span><span class="nc">Request</span><span class="p">())</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="c1"># Cold-start loopback auth
</span>            <span class="n">flow</span> <span class="o">=</span> <span class="n">InstalledAppFlow</span><span class="p">.</span><span class="nf">from_client_secrets_file</span><span class="p">(</span>
                <span class="nf">str</span><span class="p">(</span><span class="n">CLIENT_SECRET</span><span class="p">),</span> <span class="n">SCOPES</span>
            <span class="p">)</span>
            <span class="n">creds</span> <span class="o">=</span> <span class="n">flow</span><span class="p">.</span><span class="nf">run_local_server</span><span class="p">(</span><span class="n">port</span><span class="o">=</span><span class="mi">8090</span><span class="p">)</span>

        <span class="c1"># Serialize
</span>        <span class="n">TOKEN_CACHE</span><span class="p">.</span><span class="n">parent</span><span class="p">.</span><span class="nf">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">TOKEN_CACHE</span><span class="p">,</span> <span class="sh">"</span><span class="s">wb</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">token</span><span class="p">:</span>
            <span class="n">pickle</span><span class="p">.</span><span class="nf">dump</span><span class="p">(</span><span class="n">creds</span><span class="p">,</span> <span class="n">token</span><span class="p">)</span>

    <span class="k">return</span> <span class="nf">build</span><span class="p">(</span><span class="sh">"</span><span class="s">youtube</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">v3</span><span class="sh">"</span><span class="p">,</span> <span class="n">credentials</span><span class="o">=</span><span class="n">creds</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="resumable-media-chunks-and-standardizing-the-upload-api">Resumable media chunks and standardizing the upload API</h2>

<p>Video uploads fail unpredictably, and a single monolithic POST offers no recovery point: any mid-stream network fault forces a full restart. We therefore use <code class="language-plaintext highlighter-rouge">MediaFileUpload</code> to configure chunked, resumable transfers.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># sync_youtube.py
</span><span class="kn">from</span> <span class="n">googleapiclient.http</span> <span class="kn">import</span> <span class="n">MediaFileUpload</span>
<span class="kn">from</span> <span class="n">google_auth_handler</span> <span class="kn">import</span> <span class="n">bootstrap_oauth_service</span>

<span class="k">def</span> <span class="nf">transmit_artifact</span><span class="p">(</span><span class="n">youtube_service</span><span class="p">,</span> <span class="n">video_filepath</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">metadata</span><span class="p">:</span> <span class="nb">dict</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Executes a resumable file transmission of an `.mp4` artifact.</span><span class="sh">"""</span>
    
    <span class="c1"># Define request payload using YouTube's Data Schema
</span>    <span class="n">request_body</span> <span class="o">=</span> <span class="p">{</span>
        <span class="sh">"</span><span class="s">snippet</span><span class="sh">"</span><span class="p">:</span> <span class="p">{</span>
            <span class="sh">"</span><span class="s">title</span><span class="sh">"</span><span class="p">:</span> <span class="n">metadata</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">title</span><span class="sh">"</span><span class="p">),</span>
            <span class="sh">"</span><span class="s">description</span><span class="sh">"</span><span class="p">:</span> <span class="n">metadata</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">description</span><span class="sh">"</span><span class="p">),</span>
            <span class="sh">"</span><span class="s">tags</span><span class="sh">"</span><span class="p">:</span> <span class="n">metadata</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">tags</span><span class="sh">"</span><span class="p">,</span> <span class="p">[]),</span>
            <span class="sh">"</span><span class="s">categoryId</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">25</span><span class="sh">"</span> <span class="c1"># Explicitly cast to 'News &amp; Politics' category
</span>        <span class="p">},</span>
        <span class="sh">"</span><span class="s">status</span><span class="sh">"</span><span class="p">:</span> <span class="p">{</span>
            <span class="sh">"</span><span class="s">privacyStatus</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">private</span><span class="sh">"</span><span class="p">,</span> <span class="c1"># Stage immediately, authorize later
</span>            <span class="sh">"</span><span class="s">selfDeclaredMadeForKids</span><span class="sh">"</span><span class="p">:</span> <span class="bp">False</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="c1"># Transmit via 10 MB contiguous chunks
</span>    <span class="n">chunk_stream</span> <span class="o">=</span> <span class="nc">MediaFileUpload</span><span class="p">(</span>
        <span class="n">video_filepath</span><span class="p">,</span>
        <span class="n">mimetype</span><span class="o">=</span><span class="sh">"</span><span class="s">video/mp4</span><span class="sh">"</span><span class="p">,</span>
        <span class="n">resumable</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">chunksize</span><span class="o">=</span><span class="mi">10_485_760</span> 
    <span class="p">)</span>

    <span class="n">request</span> <span class="o">=</span> <span class="n">youtube_service</span><span class="p">.</span><span class="nf">videos</span><span class="p">().</span><span class="nf">insert</span><span class="p">(</span>
        <span class="n">part</span><span class="o">=</span><span class="sh">"</span><span class="s">snippet,status</span><span class="sh">"</span><span class="p">,</span>
        <span class="n">body</span><span class="o">=</span><span class="n">request_body</span><span class="p">,</span>
        <span class="n">media_body</span><span class="o">=</span><span class="n">chunk_stream</span>
    <span class="p">)</span>

    <span class="c1"># Poll server transmission rate
</span>    <span class="n">response</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="k">while</span> <span class="n">response</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">status</span><span class="p">,</span> <span class="n">response</span> <span class="o">=</span> <span class="n">request</span><span class="p">.</span><span class="nf">next_chunk</span><span class="p">()</span>
        <span class="k">if</span> <span class="n">status</span><span class="p">:</span>
            <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Transmission progress: </span><span class="si">{</span><span class="nf">int</span><span class="p">(</span><span class="n">status</span><span class="p">.</span><span class="nf">progress</span><span class="p">()</span> <span class="o">*</span> <span class="mi">100</span><span class="p">)</span><span class="si">}</span><span class="s">%</span><span class="sh">"</span><span class="p">)</span>

    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Artifact persisted: https://youtu.be/</span><span class="si">{</span><span class="n">response</span><span class="p">[</span><span class="sh">'</span><span class="s">id</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>
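<p>Note that the polling loop above retries nothing on its own: one transient network error aborts the entire transfer. A common hardening step is to wrap each <code class="language-plaintext highlighter-rouge">next_chunk()</code> call in exponential backoff. The helper below is a hedged sketch (the retry budget and delays are arbitrary choices, not API requirements); in production you would likely also retry on <code class="language-plaintext highlighter-rouge">googleapiclient.errors.HttpError</code> with a 5xx status:</p>

```python
# Generic exponential-backoff wrapper for flaky calls such as request.next_chunk().
# Retry budget and delays are illustrative, not mandated by the YouTube API.
import time

def with_backoff(fn, *, retries: int = 5, base_delay: float = 1.0,
                 retryable: tuple = (ConnectionError,)):
    """Call fn(); on a retryable exception, sleep base_delay * 2**attempt, then retry."""
    for attempt in range(retries):
        try:
            return fn()
        except retryable:
            if attempt == retries - 1:
                raise  # budget exhausted; surface the failure to the caller
            time.sleep(base_delay * (2 ** attempt))

# Inside the upload loop, the bare call becomes:
#     status, response = with_backoff(request.next_chunk)
```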

<h2 id="distributed-orchestration-vs-generic-cron-logic">Distributed orchestration vs generic cron logic</h2>

<p>While a monolithic <code class="language-plaintext highlighter-rouge">bash</code> script or basic <code class="language-plaintext highlighter-rouge">cron</code> daemon can wrap this four-step pipeline, serious deployments rely on containerized execution.</p>

<p><strong>Apify Webhooks</strong>: Using Apify as a task queue, you can bind a web endpoint to the <a href="https://apify.com/novi/advanced-search-tiktok-api?fpr=7hce1m">Advanced TikTok Search API Actor</a> directly. Post-execution, the webhook triggers a downstream CI/CD worker (e.g., GitHub Actions or AWS Lambda) which executes this normalized codebase sequentially.</p>

<p>By pushing the extraction layer onto a managed platform and keeping the transform and load steps within a localized microservice container, you fully isolate volatile web scraping states from stable application logic.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1: Interfacing with the Apify Python SDK</a></li>
    <li><a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Asynchronous video ingestion and connection pooling</a></li>
    <li><a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: LLM script synthesis and FFmpeg concatenation</a></li>
    <li><strong>Part 4: OAuth2 authentication and YouTube Data API uploads</strong> ← You are here</li>
  </ul>
</blockquote>]]></content><author><name>Novi Develop</name></author><category term="Systems Engineering" /><category term="Video Automation" /><category term="Python" /><category term="youtube-api" /><category term="python" /><category term="oauth2" /><category term="automation" /><category term="cron" /><summary type="html"><![CDATA[This is the final component of our distributed pipeline. Having successfully managed the ETL process (Extract from TikTok, Transform via LLM/FFmpeg, Load to disk in Part 3), we now map the normalized .mp4 chunks into a destination platform: the YouTube Data API v3.]]></summary></entry><entry><title type="html">Part 3: LLM script synthesis and FFmpeg concatenation</title><link href="https://scrapingnerd.github.io/tiktok/tiktok-youtube-part3-ai-narration/" rel="alternate" type="text/html" title="Part 3: LLM script synthesis and FFmpeg concatenation" /><published>2026-03-26T08:00:00+07:00</published><updated>2026-03-26T08:00:00+07:00</updated><id>https://scrapingnerd.github.io/tiktok/tiktok-youtube-part3-ai-narration</id><content type="html" xml:base="https://scrapingnerd.github.io/tiktok/tiktok-youtube-part3-ai-narration/"><![CDATA[<p>This is Part 3 of our pipeline series. (See the <a href="/tiktok/youtube-news-channel-tiktok-data/">Architecture Overview</a> for context.) Having isolated independent <code class="language-plaintext highlighter-rouge">.mp4</code> chunks in our local filesystem from <a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2</a>, we must now parse metadata via OpenAI and concatenate the sequence using FFmpeg.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1: Interfacing with the Apify Python SDK</a></li>
    <li><a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Asynchronous video ingestion and connection pooling</a></li>
    <li><strong>Part 3: LLM script synthesis and FFmpeg concatenation</strong> ← You are here</li>
    <li><a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: OAuth2 authentication and YouTube Data API uploads</a></li>
  </ul>
</blockquote>

<h2 id="engineering-dependencies">Engineering dependencies</h2>

<p>We interface with OpenAI’s asynchronous API wrapper and the command-line FFmpeg application. FFmpeg operations are delegated via Python’s <code class="language-plaintext highlighter-rouge">subprocess</code> for raw control over transcode flags.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>openai
</code></pre></div></div>
<p><em>Ensure the FFmpeg binary is installed on your OS environment (<code class="language-plaintext highlighter-rouge">sudo apt install ffmpeg</code> or <code class="language-plaintext highlighter-rouge">brew install ffmpeg</code>).</em></p>

<h2 id="prompt-engineering-the-llm-component">Prompt engineering the LLM component</h2>

<p>We construct a narrative bridging the disparate <code class="language-plaintext highlighter-rouge">.mp4</code> clips by pushing the accompanying <code class="language-plaintext highlighter-rouge">desc</code> metadata to an LLM context window.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># synthesize_script.py
</span><span class="kn">import</span> <span class="n">json</span>
<span class="kn">import</span> <span class="n">os</span>
<span class="kn">from</span> <span class="n">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>

<span class="n">SYSTEM_PROMPT</span> <span class="o">=</span> <span class="sh">"""</span><span class="s">
You are a factual, unbiased news synthesis engine. Given a JSON array of localized multimedia captions originating from social media on a specific geographic event, synthesize a tight 60-second broadcast script. Return raw text without markdown headers, HTML, commentary, or social media references.
</span><span class="sh">"""</span>

<span class="k">def</span> <span class="nf">synthesize_broadcast</span><span class="p">(</span><span class="n">api_key</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">topic</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">context</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Invokes OpenAI ChatGPT to generate continuous editorial.</span><span class="sh">"""</span>
    <span class="n">client</span> <span class="o">=</span> <span class="nc">OpenAI</span><span class="p">(</span><span class="n">api_key</span><span class="o">=</span><span class="n">api_key</span><span class="p">)</span>
    
    <span class="c1"># Prune elements beyond context threshold
</span>    <span class="n">pruned_context</span> <span class="o">=</span> <span class="n">context</span><span class="p">[:</span><span class="mi">20</span><span class="p">]</span> 
    
    <span class="n">payload</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"</span><span class="s">Topic Context: </span><span class="si">{</span><span class="n">topic</span><span class="si">}</span><span class="se">\n\n</span><span class="s">Ingestion Metadata:</span><span class="se">\n</span><span class="sh">"</span> 
    <span class="n">payload</span> <span class="o">+=</span> <span class="sh">"</span><span class="se">\n</span><span class="sh">"</span><span class="p">.</span><span class="nf">join</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">- </span><span class="si">{</span><span class="n">c</span><span class="si">}</span><span class="sh">"</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">pruned_context</span><span class="p">)</span>
    
    <span class="n">resp</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">chat</span><span class="p">.</span><span class="n">completions</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">gpt-4o-mini</span><span class="sh">"</span><span class="p">,</span>
        <span class="n">temperature</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span> <span class="c1"># Low variance
</span>        <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
            <span class="p">{</span><span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">system</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="n">SYSTEM_PROMPT</span><span class="p">},</span>
            <span class="p">{</span><span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">user</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="n">payload</span><span class="p">}</span>
        <span class="p">]</span>
    <span class="p">)</span>
    <span class="k">return</span> <span class="n">resp</span><span class="p">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">message</span><span class="p">.</span><span class="n">content</span>

<span class="k">def</span> <span class="nf">synthesize_audio</span><span class="p">(</span><span class="n">api_key</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">script</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">out_path</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Invokes the neural TTS engine.</span><span class="sh">"""</span>
    <span class="n">client</span> <span class="o">=</span> <span class="nc">OpenAI</span><span class="p">(</span><span class="n">api_key</span><span class="o">=</span><span class="n">api_key</span><span class="p">)</span>
    <span class="k">with</span> <span class="n">client</span><span class="p">.</span><span class="n">audio</span><span class="p">.</span><span class="n">speech</span><span class="p">.</span><span class="n">with_streaming_response</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">tts-1</span><span class="sh">"</span><span class="p">,</span>
        <span class="n">voice</span><span class="o">=</span><span class="sh">"</span><span class="s">onyx</span><span class="sh">"</span><span class="p">,</span>
        <span class="nb">input</span><span class="o">=</span><span class="n">script</span><span class="p">,</span>
    <span class="p">)</span> <span class="k">as</span> <span class="n">response</span><span class="p">:</span>
        <span class="n">response</span><span class="p">.</span><span class="nf">stream_to_file</span><span class="p">(</span><span class="n">out_path</span><span class="p">)</span>
</code></pre></div></div>

<p>By strictly managing the <code class="language-plaintext highlighter-rouge">temperature</code> parameter and heavily biasing the system prompt against colloquialisms, we extract objective signals from unstructured social streams.</p>

<h2 id="demuxing-and-concatenating-via-ffmpeg">Demuxing and concatenating via FFmpeg</h2>

<p>With <code class="language-plaintext highlighter-rouge">narration.mp3</code> isolated, we build an FFmpeg concat list (an intermediate <code class="language-plaintext highlighter-rouge">.txt</code> file), normalize every clip to a single H.264 profile (<code class="language-plaintext highlighter-rouge">libx264</code>), and then mux the narration audio in directly, avoiding unnecessary intermediate transcodes.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># build_artifacts.py
</span><span class="kn">import</span> <span class="n">subprocess</span>
<span class="kn">from</span> <span class="n">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="k">def</span> <span class="nf">concat_and_mux</span><span class="p">(</span><span class="n">video_paths</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">Path</span><span class="p">],</span> <span class="n">audio_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span> <span class="n">out_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">
    Standardizes aspect ratios, drops source audio channels, concatenates video tracks, 
    and muxes the secondary audio track.
    </span><span class="sh">"""</span>
    <span class="c1"># 1. Build FFmpeg concat instruction file
</span>    <span class="n">concat_txt</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">concat.txt</span><span class="sh">"</span><span class="p">)</span>
    <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">concat_txt</span><span class="p">,</span> <span class="sh">"</span><span class="s">w</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">vp</span> <span class="ow">in</span> <span class="n">video_paths</span><span class="p">:</span>
            <span class="n">f</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">file </span><span class="sh">'</span><span class="si">{</span><span class="n">vp</span><span class="p">.</span><span class="nf">absolute</span><span class="p">()</span><span class="si">}</span><span class="sh">'</span><span class="se">\n</span><span class="sh">"</span><span class="p">)</span>

    <span class="c1"># 2. Transcode parameters to normalize resolutions to 1080p 16:9
</span>    <span class="n">tmp_vid</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">tmp_vid.mp4</span><span class="sh">"</span><span class="p">)</span>
    <span class="n">subprocess</span><span class="p">.</span><span class="nf">run</span><span class="p">([</span>
        <span class="sh">"</span><span class="s">ffmpeg</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-y</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-f</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">concat</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-safe</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">0</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">-i</span><span class="sh">"</span><span class="p">,</span> <span class="nf">str</span><span class="p">(</span><span class="n">concat_txt</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">-c:v</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">libx264</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-crf</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">24</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-preset</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">veryfast</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">-r</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">30</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-an</span><span class="sh">"</span><span class="p">,</span> <span class="c1"># Drop source audio track (-an)
</span>        <span class="sh">"</span><span class="s">-vf</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">scale=1920:1080:force_original_aspect_ratio=decrease,pad=1920:1080:(ow-iw)/2:(oh-ih)/2</span><span class="sh">"</span><span class="p">,</span>
        <span class="nf">str</span><span class="p">(</span><span class="n">tmp_vid</span><span class="p">)</span>
    <span class="p">],</span> <span class="n">check</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="c1"># 3. Mux audio and video tracks, truncate at shortest stream
</span>    <span class="n">subprocess</span><span class="p">.</span><span class="nf">run</span><span class="p">([</span>
        <span class="sh">"</span><span class="s">ffmpeg</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-y</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">-i</span><span class="sh">"</span><span class="p">,</span> <span class="nf">str</span><span class="p">(</span><span class="n">tmp_vid</span><span class="p">),</span> <span class="sh">"</span><span class="s">-i</span><span class="sh">"</span><span class="p">,</span> <span class="nf">str</span><span class="p">(</span><span class="n">audio_path</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">-c:v</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">copy</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-c:a</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">aac</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-b:a</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">192k</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">-shortest</span><span class="sh">"</span><span class="p">,</span>
        <span class="nf">str</span><span class="p">(</span><span class="n">out_path</span><span class="p">)</span>
    <span class="p">],</span> <span class="n">check</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="c1"># Teardown artifacts
</span>    <span class="n">concat_txt</span><span class="p">.</span><span class="nf">unlink</span><span class="p">()</span>
    <span class="n">tmp_vid</span><span class="p">.</span><span class="nf">unlink</span><span class="p">()</span>
</code></pre></div></div>

<p>This normalization step letterboxes or pillarboxes every clip into a 1080p 16:9 frame while preserving its original aspect ratio, so vertical TikTok footage is padded rather than stretched, and every segment leaves the transcode with identical resolution, codec, and frame rate before muxing.</p>
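<p>Since the concat demuxer assumes its inputs share codec parameters, it can be worth verifying clip geometry before concatenation. A minimal sketch (the helper names are ours, and it assumes <code class="language-plaintext highlighter-rouge">ffprobe</code> is on the <code class="language-plaintext highlighter-rouge">PATH</code>):</p>

```python
import json
import subprocess
from pathlib import Path

def probe_geometry(path: Path) -> tuple:
    """Query ffprobe for (width, height, avg_frame_rate) of the first video stream."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=width,height,avg_frame_rate",
         "-of", "json", str(path)],
        capture_output=True, text=True, check=True,
    ).stdout
    stream = json.loads(out)["streams"][0]
    return (stream["width"], stream["height"], stream["avg_frame_rate"])

def find_mismatches(profiles: dict) -> dict:
    """Return the entries in {path: geometry} that deviate from the first entry."""
    reference = next(iter(profiles.values()))
    return {p: g for p, g in profiles.items() if g != reference}
```

<p>Running <code class="language-plaintext highlighter-rouge">find_mismatches({vp: probe_geometry(vp) for vp in video_paths})</code> before the concat pass turns a silently glitchy output into an explicit failure; if mismatches surface, a safer fallback is to pre-transcode each clip individually before concatenating.</p>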

<p>In <a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: OAuth2 authentication and YouTube Data API uploads</a>, we discuss orchestrating the final stage via the <code class="language-plaintext highlighter-rouge">google-api-python-client</code>.</p>

<blockquote>
  <p><strong>Need more TikTok data inputs?</strong> You can also feed this pipeline with trending videos, hashtag data, or user profiles. See our full collection of <a href="/tools/">TikTok and Twitter scraping tools</a> for additional ingestion sources.</p>
</blockquote>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1: Interfacing with the Apify Python SDK</a></li>
    <li><a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Asynchronous video ingestion and connection pooling</a></li>
    <li><strong>Part 3: LLM script synthesis and FFmpeg concatenation</strong> ← You are here</li>
    <li><a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: OAuth2 authentication and YouTube Data API uploads</a></li>
  </ul>
</blockquote>]]></content><author><name>Novi Develop</name></author><category term="Systems Engineering" /><category term="Video Automation" /><category term="Python" /><category term="openai" /><category term="ffmpeg" /><category term="python" /><category term="pipeline" /><summary type="html"><![CDATA[This is Part 3 of our pipeline series. (See the Architecture Overview for context.) Having isolated independent .mp4 chunks in our local filesystem from Part 2, we must now parse metadata via OpenAI and concatenate the sequence using FFmpeg.]]></summary></entry><entry><title type="html">Part 2: Asynchronous video ingestion and connection pooling</title><link href="https://scrapingnerd.github.io/tiktok/tiktok-youtube-part2-downloading-videos/" rel="alternate" type="text/html" title="Part 2: Asynchronous video ingestion and connection pooling" /><published>2026-03-25T08:00:00+07:00</published><updated>2026-03-25T08:00:00+07:00</updated><id>https://scrapingnerd.github.io/tiktok/tiktok-youtube-part2-downloading-videos</id><content type="html" xml:base="https://scrapingnerd.github.io/tiktok/tiktok-youtube-part2-downloading-videos/"><![CDATA[<p>This is Part 2 of our pipeline series. (See the <a href="/tiktok/youtube-news-channel-tiktok-data/">Architecture Overview</a> for context.) After normalizing the dataset payload from the <a href="https://apify.com/novi/advanced-search-tiktok-api?fpr=7hce1m">Advanced TikTok Search API Actor</a>, our next challenge is physically ingesting the <code class="language-plaintext highlighter-rouge">.mp4</code> payloads from TikTok’s CDN endpoints.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1: Interfacing with the Apify Python SDK</a></li>
    <li><strong>Part 2: Asynchronous video ingestion and connection pooling</strong> ← You are here</li>
    <li><a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: LLM script synthesis and FFmpeg concatenation</a></li>
    <li><a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: OAuth2 authentication and YouTube Data API uploads</a></li>
  </ul>
</blockquote>

<h2 id="engineering-constraints">Engineering constraints</h2>

<p>Downloading 100+ videos synchronously causes substantial I/O blocking. Conversely, unrestrained parallel requests will exhaust file descriptors and trigger remote rate limiters (resulting in <code class="language-plaintext highlighter-rouge">403 Forbidden</code> or <code class="language-plaintext highlighter-rouge">429 Too Many Requests</code> responses from the CDN). We need bounded concurrency.</p>

<p>We use <code class="language-plaintext highlighter-rouge">httpx</code> for modern async HTTP connection pooling, wrapped in an <code class="language-plaintext highlighter-rouge">asyncio.Semaphore</code> that caps the number of requests in flight at any moment.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span><span class="s2">"httpx[http2]"</span>
</code></pre></div></div>

<h2 id="bounded-concurrency-downloader-implementation">Bounded concurrency downloader implementation</h2>

<p>This script reads the normalized data-lake JSON produced in Part 1, builds a target manifest keyed on <code class="language-plaintext highlighter-rouge">aweme_id</code> (so duplicate records collapse automatically), and initiates pooled GET requests against each <code class="language-plaintext highlighter-rouge">source_url</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># ingest.py
</span>
<span class="kn">import</span> <span class="n">asyncio</span>
<span class="kn">import</span> <span class="n">json</span>
<span class="kn">import</span> <span class="n">logging</span>
<span class="kn">import</span> <span class="n">sys</span>
<span class="kn">from</span> <span class="n">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="kn">import</span> <span class="n">httpx</span>

<span class="c1"># Throttle concurrent TCP connections to the CDN
</span><span class="n">MAX_CONCURRENT_RQS</span> <span class="o">=</span> <span class="mi">8</span>
<span class="n">TIMEOUT_SEC</span> <span class="o">=</span> <span class="mf">45.0</span>
<span class="n">DATA_LAKE_PATH</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">data/raw</span><span class="sh">"</span><span class="p">)</span>
<span class="n">INGEST_PATH</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">data/media_assets</span><span class="sh">"</span><span class="p">)</span>

<span class="n">logging</span><span class="p">.</span><span class="nf">basicConfig</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="n">logging</span><span class="p">.</span><span class="n">INFO</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="sh">"</span><span class="s">%(levelname)s: %(message)s</span><span class="sh">"</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">build_manifest</span><span class="p">(</span><span class="n">input_json</span><span class="p">:</span> <span class="n">Path</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Parses extract data and yields a deduplicated manifest dict.</span><span class="sh">"""</span>
    <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">input_json</span><span class="p">,</span> <span class="sh">"</span><span class="s">r</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">records</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>

    <span class="n">manifest</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">records</span><span class="p">:</span>
        <span class="n">uid</span> <span class="o">=</span> <span class="n">r</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">aweme_id</span><span class="sh">"</span><span class="p">)</span>
        <span class="n">source_url</span> <span class="o">=</span> <span class="n">r</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">media</span><span class="sh">"</span><span class="p">,</span> <span class="p">{}).</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">source_url</span><span class="sh">"</span><span class="p">)</span>
        
        <span class="c1"># Guard against malformed records
</span>        <span class="k">if</span> <span class="ow">not</span> <span class="n">uid</span> <span class="ow">or</span> <span class="ow">not</span> <span class="n">source_url</span><span class="p">:</span>
            <span class="k">continue</span>
            
        <span class="n">manifest</span><span class="p">[</span><span class="n">uid</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
            <span class="sh">"</span><span class="s">url</span><span class="sh">"</span><span class="p">:</span> <span class="n">source_url</span><span class="p">,</span>
            <span class="sh">"</span><span class="s">topic</span><span class="sh">"</span><span class="p">:</span> <span class="n">r</span><span class="p">[</span><span class="sh">"</span><span class="s">query_context</span><span class="sh">"</span><span class="p">].</span><span class="nf">replace</span><span class="p">(</span><span class="sh">"</span><span class="s"> </span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">_</span><span class="sh">"</span><span class="p">),</span>
            <span class="sh">"</span><span class="s">target_dir</span><span class="sh">"</span><span class="p">:</span> <span class="n">INGEST_PATH</span> <span class="o">/</span> <span class="n">r</span><span class="p">[</span><span class="sh">"</span><span class="s">query_context</span><span class="sh">"</span><span class="p">].</span><span class="nf">replace</span><span class="p">(</span><span class="sh">"</span><span class="s"> </span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">_</span><span class="sh">"</span><span class="p">),</span>
            <span class="sh">"</span><span class="s">target_filename</span><span class="sh">"</span><span class="p">:</span> <span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">uid</span><span class="si">}</span><span class="s">.mp4</span><span class="sh">"</span>
        <span class="p">}</span>
            
    <span class="k">return</span> <span class="n">manifest</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">stream_to_disk</span><span class="p">(</span>
    <span class="n">client</span><span class="p">:</span> <span class="n">httpx</span><span class="p">.</span><span class="n">AsyncClient</span><span class="p">,</span> 
    <span class="n">semaphore</span><span class="p">:</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">Semaphore</span><span class="p">,</span> 
    <span class="n">item_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> 
    <span class="n">job</span><span class="p">:</span> <span class="nb">dict</span>
<span class="p">):</span>
    <span class="sh">"""</span><span class="s">Downloads a video stream using bounded concurrency and exponential backoff.</span><span class="sh">"""</span>
    <span class="n">job</span><span class="p">[</span><span class="sh">"</span><span class="s">target_dir</span><span class="sh">"</span><span class="p">].</span><span class="nf">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">out_file</span> <span class="o">=</span> <span class="n">job</span><span class="p">[</span><span class="sh">"</span><span class="s">target_dir</span><span class="sh">"</span><span class="p">]</span> <span class="o">/</span> <span class="n">job</span><span class="p">[</span><span class="sh">"</span><span class="s">target_filename</span><span class="sh">"</span><span class="p">]</span>
    
    <span class="c1"># Idempotent execution
</span>    <span class="k">if</span> <span class="n">out_file</span><span class="p">.</span><span class="nf">exists</span><span class="p">():</span>
        <span class="n">logging</span><span class="p">.</span><span class="nf">debug</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">[</span><span class="si">{</span><span class="n">item_id</span><span class="si">}</span><span class="s">] Cache hit. Skipping.</span><span class="sh">"</span><span class="p">)</span>
        <span class="k">return</span> <span class="sh">"</span><span class="s">cache</span><span class="sh">"</span>

    <span class="k">async</span> <span class="k">with</span> <span class="n">semaphore</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">attempt</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="mi">3</span><span class="p">):</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="c1"># TikTok CDNs require a realistic User-Agent
</span>                <span class="n">headers</span> <span class="o">=</span> <span class="p">{</span>
                    <span class="sh">"</span><span class="s">User-Agent</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36</span><span class="sh">"</span><span class="p">,</span>
                    <span class="sh">"</span><span class="s">Referer</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">https://www.tiktok.com/</span><span class="sh">"</span>
                <span class="p">}</span>
                
                <span class="k">async</span> <span class="k">with</span> <span class="n">client</span><span class="p">.</span><span class="nf">stream</span><span class="p">(</span><span class="sh">"</span><span class="s">GET</span><span class="sh">"</span><span class="p">,</span> <span class="n">job</span><span class="p">[</span><span class="sh">"</span><span class="s">url</span><span class="sh">"</span><span class="p">],</span> <span class="n">headers</span><span class="o">=</span><span class="n">headers</span><span class="p">,</span> <span class="n">timeout</span><span class="o">=</span><span class="n">TIMEOUT_SEC</span><span class="p">)</span> <span class="k">as</span> <span class="n">resp</span><span class="p">:</span>
                    <span class="n">resp</span><span class="p">.</span><span class="nf">raise_for_status</span><span class="p">()</span>
                    
                    <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">out_file</span><span class="p">,</span> <span class="sh">"</span><span class="s">wb</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
                        <span class="k">async</span> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">resp</span><span class="p">.</span><span class="nf">aiter_bytes</span><span class="p">(</span><span class="n">chunk_size</span><span class="o">=</span><span class="mi">1024</span> <span class="o">*</span> <span class="mi">128</span><span class="p">):</span> <span class="c1"># 128KB chunks
</span>                            <span class="n">f</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span>
                            
                <span class="k">return</span> <span class="sh">"</span><span class="s">ok</span><span class="sh">"</span>
            <span class="nf">except </span><span class="p">(</span><span class="n">httpx</span><span class="p">.</span><span class="n">HTTPError</span><span class="p">,</span> <span class="n">httpx</span><span class="p">.</span><span class="n">TimeoutException</span><span class="p">)</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
                <span class="n">logging</span><span class="p">.</span><span class="nf">warning</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">[</span><span class="si">{</span><span class="n">item_id</span><span class="si">}</span><span class="s">] Network fault </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">. Retry </span><span class="si">{</span><span class="n">attempt</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">/3</span><span class="sh">"</span><span class="p">)</span>
                <span class="n">out_file</span><span class="p">.</span><span class="nf">unlink</span><span class="p">(</span><span class="n">missing_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>  <span class="c1"># Discard partial writes so the idempotency check stays valid
</span>                <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="nf">sleep</span><span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="n">attempt</span><span class="p">)</span>  <span class="c1"># Exponential backoff
</span>        
    <span class="k">return</span> <span class="sh">"</span><span class="s">failed</span><span class="sh">"</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">orchestrate_ingestion</span><span class="p">(</span><span class="n">manifest</span><span class="p">:</span> <span class="nb">dict</span><span class="p">):</span>
    <span class="c1"># Enforces absolute limit on active connections
</span>    <span class="n">semaphore</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="nc">Semaphore</span><span class="p">(</span><span class="n">MAX_CONCURRENT_RQS</span><span class="p">)</span>
    
    <span class="c1"># httpx.AsyncClient maintains a persistent connection pool (Keep-Alive)
</span>    <span class="k">async</span> <span class="k">with</span> <span class="n">httpx</span><span class="p">.</span><span class="nc">AsyncClient</span><span class="p">(</span><span class="n">http2</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">follow_redirects</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">client</span><span class="p">:</span>
        <span class="n">tasks</span> <span class="o">=</span> <span class="p">[</span>
            <span class="nf">stream_to_disk</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">semaphore</span><span class="p">,</span> <span class="n">uid</span><span class="p">,</span> <span class="n">job</span><span class="p">)</span>
            <span class="k">for</span> <span class="n">uid</span><span class="p">,</span> <span class="n">job</span> <span class="ow">in</span> <span class="n">manifest</span><span class="p">.</span><span class="nf">items</span><span class="p">()</span>
        <span class="p">]</span>
        
        <span class="c1"># Await completion of all ingestion coroutines
</span>        <span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="nf">gather</span><span class="p">(</span><span class="o">*</span><span class="n">tasks</span><span class="p">)</span>
        
    <span class="n">oks</span> <span class="o">=</span> <span class="n">results</span><span class="p">.</span><span class="nf">count</span><span class="p">(</span><span class="sh">"</span><span class="s">ok</span><span class="sh">"</span><span class="p">)</span>
    <span class="n">caches</span> <span class="o">=</span> <span class="n">results</span><span class="p">.</span><span class="nf">count</span><span class="p">(</span><span class="sh">"</span><span class="s">cache</span><span class="sh">"</span><span class="p">)</span>
    <span class="n">fails</span> <span class="o">=</span> <span class="n">results</span><span class="p">.</span><span class="nf">count</span><span class="p">(</span><span class="sh">"</span><span class="s">failed</span><span class="sh">"</span><span class="p">)</span>
    <span class="n">logging</span><span class="p">.</span><span class="nf">info</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Ingestion complete: OK=</span><span class="si">{</span><span class="n">oks</span><span class="si">}</span><span class="s">, Cached=</span><span class="si">{</span><span class="n">caches</span><span class="si">}</span><span class="s">, Failed=</span><span class="si">{</span><span class="n">fails</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="sh">"</span><span class="s">__main__</span><span class="sh">"</span><span class="p">:</span>
    <span class="n">extracts</span> <span class="o">=</span> <span class="nf">sorted</span><span class="p">(</span><span class="n">DATA_LAKE_PATH</span><span class="p">.</span><span class="nf">glob</span><span class="p">(</span><span class="sh">"</span><span class="s">extract_*.json</span><span class="sh">"</span><span class="p">))</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">extracts</span><span class="p">:</span>
        <span class="n">sys</span><span class="p">.</span><span class="nf">exit</span><span class="p">(</span><span class="sh">"</span><span class="s">No extract_*.json files in data/raw; run Part 1 first.</span><span class="sh">"</span><span class="p">)</span>
    <span class="n">latest_extract</span> <span class="o">=</span> <span class="n">extracts</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">logging</span><span class="p">.</span><span class="nf">info</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Mounting manifest from </span><span class="si">{</span><span class="n">latest_extract</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
    
    <span class="n">manifest</span> <span class="o">=</span> <span class="nf">build_manifest</span><span class="p">(</span><span class="n">latest_extract</span><span class="p">)</span>
    <span class="n">asyncio</span><span class="p">.</span><span class="nf">run</span><span class="p">(</span><span class="nf">orchestrate_ingestion</span><span class="p">(</span><span class="n">manifest</span><span class="p">))</span>
</code></pre></div></div>

<h2 id="infrastructure-design-notes">Infrastructure design notes</h2>

<ul>
  <li><strong>Chunked Writes (<code class="language-plaintext highlighter-rouge">aiter_bytes</code>)</strong>: Buffering an entire ~50MB response in memory scales poorly. Streaming 128KB chunks straight to disk keeps memory usage flat regardless of file size.</li>
  <li><strong>Connection Pooling</strong>: Reusing TLS connections to the same host over <code class="language-plaintext highlighter-rouge">http2</code> avoids repeated handshakes, substantially outperforming naive <code class="language-plaintext highlighter-rouge">requests.get()</code> calls inside a <code class="language-plaintext highlighter-rouge">ThreadPoolExecutor</code>.</li>
  <li><strong>Deduplication</strong>: Because the manifest dictionary is keyed on <code class="language-plaintext highlighter-rouge">aweme_id</code>, multiple query strategies that surface the same video are deduplicated for free.</li>
</ul>
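<p>That deduplication property falls out of ordinary dict semantics: a later record with the same <code class="language-plaintext highlighter-rouge">aweme_id</code> simply overwrites the earlier one. A self-contained illustration with fabricated records (the IDs and URLs are placeholders mirroring the manifest builder's record shape):</p>

```python
def dedupe_by_aweme_id(records: list) -> dict:
    """Collapse overlapping query results into one job per aweme_id."""
    manifest = {}
    for r in records:
        uid = r.get("aweme_id")
        url = r.get("media", {}).get("source_url")
        if not uid or not url:
            continue  # skip malformed records
        manifest[uid] = {"url": url}  # later duplicates overwrite, not append
    return manifest

records = [
    {"aweme_id": "701", "media": {"source_url": "https://cdn.example/701.mp4"}},
    {"aweme_id": "702", "media": {"source_url": "https://cdn.example/702.mp4"}},
    # Same video surfaced by a second keyword query:
    {"aweme_id": "701", "media": {"source_url": "https://cdn.example/701.mp4"}},
    {"aweme_id": None, "media": {"source_url": "https://cdn.example/x.mp4"}},
]
print(len(dedupe_by_aweme_id(records)))  # → 2
```

<p>Four input records collapse to two download jobs, and the malformed record is dropped by the guard rather than crashing the run.</p>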

<p>In <a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: LLM script synthesis and FFmpeg concatenation</a>, we will utilize <code class="language-plaintext highlighter-rouge">ffmpeg-python</code> bindings and ChatGPT to assemble these raw <code class="language-plaintext highlighter-rouge">.mp4</code> chunks into a single chronological presentation.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1: Interfacing with the Apify Python SDK</a></li>
    <li><strong>Part 2: Asynchronous video ingestion and connection pooling</strong> ← You are here</li>
    <li><a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: LLM script synthesis and FFmpeg concatenation</a></li>
    <li><a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: OAuth2 authentication and YouTube Data API uploads</a></li>
  </ul>
</blockquote>]]></content><author><name>Novi Develop</name></author><category term="Systems Engineering" /><category term="Video Automation" /><category term="Python" /><category term="apify" /><category term="httpx" /><category term="asyncio" /><category term="python" /><summary type="html"><![CDATA[This is Part 2 of our pipeline series. (See the Architecture Overview for context.) After normalizing the dataset payload from the Advanced TikTok Search API Actor, our next challenge is physically ingesting the .mp4 payloads from TikTok’s CDN endpoints.]]></summary></entry><entry><title type="html">Part 1: Interfacing with the Apify Python SDK for TikTok Extraction</title><link href="https://scrapingnerd.github.io/tiktok/tiktok-youtube-part1-searching-api/" rel="alternate" type="text/html" title="Part 1: Interfacing with the Apify Python SDK for TikTok Extraction" /><published>2026-03-24T08:00:00+07:00</published><updated>2026-03-24T08:00:00+07:00</updated><id>https://scrapingnerd.github.io/tiktok/tiktok-youtube-part1-searching-api</id><content type="html" xml:base="https://scrapingnerd.github.io/tiktok/tiktok-youtube-part1-searching-api/"><![CDATA[<p>This is Part 1 of our pipeline architecture series. (See the <a href="/tiktok/youtube-news-channel-tiktok-data/">Architecture Overview</a> for context.) Here, we implement the extraction layer using the <code class="language-plaintext highlighter-rouge">apify-client</code> SDK to interface with the <a href="https://apify.com/novi/advanced-search-tiktok-api?fpr=7hce1m">Advanced TikTok Search API Actor</a>.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><strong>Part 1: Interfacing with the Apify Python SDK</strong> ← You are here</li>
    <li><a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Asynchronous video ingestion and connection pooling</a></li>
    <li><a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: LLM script synthesis and FFmpeg concatenation</a></li>
    <li><a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: OAuth2 authentication and YouTube Data API uploads</a></li>
  </ul>
</blockquote>

<h2 id="infrastructure-dependencies">Infrastructure dependencies</h2>

<p>We rely on the official Apify SDK to manage API requests and dataset pagination against Apify’s infrastructure.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>apify-client python-dotenv
</code></pre></div></div>

<p>Store your <code class="language-plaintext highlighter-rouge">APIFY_TOKEN</code> securely in environment variables or a <code class="language-plaintext highlighter-rouge">.env</code> file. Never hardcode access tokens in version control.</p>

<h2 id="schema-configurations-and-targeted-queries">Schema configurations and targeted queries</h2>

<p>The actor strictly defines an input schema. When extracting data for news events, geographical relevance (<code class="language-plaintext highlighter-rouge">region</code>) and recency (<code class="language-plaintext highlighter-rouge">publishTime</code>) are critical query parameters.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># config.py
</span>
<span class="kn">import</span> <span class="n">os</span>
<span class="kn">from</span> <span class="n">dotenv</span> <span class="kn">import</span> <span class="n">load_dotenv</span>

<span class="nf">load_dotenv</span><span class="p">()</span>

<span class="n">APIFY_TOKEN</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">APIFY_TOKEN</span><span class="sh">"</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">APIFY_TOKEN</span><span class="p">:</span>
    <span class="k">raise</span> <span class="nc">EnvironmentError</span><span class="p">(</span><span class="sh">"</span><span class="s">APIFY_TOKEN is not set; define it in the environment or a .env file.</span><span class="sh">"</span><span class="p">)</span>

<span class="c1"># Input schema matching the Advanced TikTok Search API
</span><span class="n">SEARCH_PAYLOADS</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span>
        <span class="sh">"</span><span class="s">keyword</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Ukraine frontline</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">region</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">UA</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">sortType</span><span class="sh">"</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>       <span class="c1"># 2 maps to "Most recent"
</span>        <span class="sh">"</span><span class="s">publishTime</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">YESTERDAY</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">limit</span><span class="sh">"</span><span class="p">:</span> <span class="mi">50</span><span class="p">,</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="sh">"</span><span class="s">keyword</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">France protests</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">region</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">FR</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">sortType</span><span class="sh">"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>       <span class="c1"># 1 maps to "Most liked"
</span>        <span class="sh">"</span><span class="s">publishTime</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">WEEK</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">limit</span><span class="sh">"</span><span class="p">:</span> <span class="mi">30</span><span class="p">,</span>
    <span class="p">}</span>
<span class="p">]</span>
</code></pre></div></div>
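<p>Because a malformed payload only fails after an actor run has already started (and been billed), a cheap pre-flight check before dispatch can pay for itself. The sketch below is a hypothetical helper, not part of the Apify SDK; the required keys and <code class="language-plaintext highlighter-rouge">sortType</code> values mirror the <code class="language-plaintext highlighter-rouge">SEARCH_PAYLOADS</code> entries above:</p>

```python
# Hypothetical pre-flight validation for SEARCH_PAYLOADS entries.
REQUIRED_KEYS = {"keyword", "region", "sortType", "publishTime", "limit"}
ALLOWED_SORT_TYPES = {1, 2}  # 1 = "Most liked", 2 = "Most recent"

def validate_payloads(payloads):
    """Return a list of human-readable problems; empty means all payloads pass."""
    problems = []
    for i, payload in enumerate(payloads):
        missing = REQUIRED_KEYS - payload.keys()
        if missing:
            problems.append(f"payload {i}: missing keys {sorted(missing)}")
        if payload.get("sortType") not in ALLOWED_SORT_TYPES:
            problems.append(f"payload {i}: unsupported sortType {payload.get('sortType')!r}")
        limit = payload.get("limit")
        if not isinstance(limit, int) or limit <= 0:
            problems.append(f"payload {i}: limit must be a positive integer")
    return problems
```

<p>Calling <code class="language-plaintext highlighter-rouge">validate_payloads(SEARCH_PAYLOADS)</code> at startup turns schema drift into an immediate local error instead of a wasted actor run.</p>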

<h2 id="executing-the-extraction-pipeline">Executing the extraction pipeline</h2>

<p>To decouple raw actor responses from our pipeline’s expected schema, we implement a normalizer (<code class="language-plaintext highlighter-rouge">normalize_payload</code>) that extracts only the fields required downstream (e.g., <code class="language-plaintext highlighter-rouge">aweme_id</code>, <code class="language-plaintext highlighter-rouge">desc</code>, and the first <code class="language-plaintext highlighter-rouge">play_addr</code> CDN URL). Storing the entire raw payload from the actor wastes disk space and database I/O bandwidth.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># extract.py
</span>
<span class="kn">import</span> <span class="n">json</span>
<span class="kn">from</span> <span class="n">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>
<span class="kn">from</span> <span class="n">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="kn">from</span> <span class="n">apify_client</span> <span class="kn">import</span> <span class="n">ApifyClient</span>
<span class="kn">from</span> <span class="n">config</span> <span class="kn">import</span> <span class="n">APIFY_TOKEN</span><span class="p">,</span> <span class="n">SEARCH_PAYLOADS</span>

<span class="n">ACTOR_ID</span> <span class="o">=</span> <span class="sh">"</span><span class="s">novi/advanced-search-tiktok-api</span><span class="sh">"</span>
<span class="n">DATA_LAKE_PATH</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">data/raw</span><span class="sh">"</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">normalize_payload</span><span class="p">(</span><span class="n">raw_item</span><span class="p">:</span> <span class="nb">dict</span><span class="p">,</span> <span class="n">original_query</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Normalizes the raw JSON schema into a consistent pipeline format.</span><span class="sh">"""</span>
    <span class="n">stats</span> <span class="o">=</span> <span class="n">raw_item</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">statistics</span><span class="sh">"</span><span class="p">,</span> <span class="p">{})</span>
    <span class="n">play_addr</span> <span class="o">=</span> <span class="n">raw_item</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">video</span><span class="sh">"</span><span class="p">,</span> <span class="p">{}).</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">play_addr</span><span class="sh">"</span><span class="p">,</span> <span class="p">{})</span>

    <span class="k">return</span> <span class="p">{</span>
        <span class="sh">"</span><span class="s">aweme_id</span><span class="sh">"</span><span class="p">:</span> <span class="n">raw_item</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">aweme_id</span><span class="sh">"</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">description</span><span class="sh">"</span><span class="p">:</span> <span class="n">raw_item</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">desc</span><span class="sh">"</span><span class="p">,</span> <span class="sh">""</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">region</span><span class="sh">"</span><span class="p">:</span> <span class="n">raw_item</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">region</span><span class="sh">"</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">timestamp_unix</span><span class="sh">"</span><span class="p">:</span> <span class="n">raw_item</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">create_time</span><span class="sh">"</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">metrics</span><span class="sh">"</span><span class="p">:</span> <span class="p">{</span>
            <span class="sh">"</span><span class="s">views</span><span class="sh">"</span><span class="p">:</span> <span class="n">stats</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">play_count</span><span class="sh">"</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span>
            <span class="sh">"</span><span class="s">likes</span><span class="sh">"</span><span class="p">:</span> <span class="n">stats</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">digg_count</span><span class="sh">"</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span>
            <span class="sh">"</span><span class="s">shares</span><span class="sh">"</span><span class="p">:</span> <span class="n">stats</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">share_count</span><span class="sh">"</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span>
        <span class="p">},</span>
        <span class="sh">"</span><span class="s">media</span><span class="sh">"</span><span class="p">:</span> <span class="p">{</span>
            <span class="c1"># Extract the first available CDN URL; prioritizing the watermark-free address
</span>            <span class="sh">"</span><span class="s">source_url</span><span class="sh">"</span><span class="p">:</span> <span class="n">play_addr</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">url_list</span><span class="sh">"</span><span class="p">,</span> <span class="p">[</span><span class="bp">None</span><span class="p">])[</span><span class="mi">0</span><span class="p">],</span>
            <span class="sh">"</span><span class="s">duration_ms</span><span class="sh">"</span><span class="p">:</span> <span class="n">raw_item</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">video</span><span class="sh">"</span><span class="p">,</span> <span class="p">{}).</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">duration</span><span class="sh">"</span><span class="p">),</span>
        <span class="p">},</span>
        <span class="sh">"</span><span class="s">query_context</span><span class="sh">"</span><span class="p">:</span> <span class="n">original_query</span>
    <span class="p">}</span>

<span class="k">def</span> <span class="nf">fetch_datasets</span><span class="p">():</span>
    <span class="n">client</span> <span class="o">=</span> <span class="nc">ApifyClient</span><span class="p">(</span><span class="n">APIFY_TOKEN</span><span class="p">)</span>
    <span class="n">normalized_results</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">for</span> <span class="n">payload</span> <span class="ow">in</span> <span class="n">SEARCH_PAYLOADS</span><span class="p">:</span>
        <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Triggering execution for: </span><span class="si">{</span><span class="n">payload</span><span class="p">[</span><span class="sh">'</span><span class="s">keyword</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
        
        <span class="c1"># Blocking call: waits for the actor to complete execution on Apify
</span>        <span class="n">run</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="nf">actor</span><span class="p">(</span><span class="n">ACTOR_ID</span><span class="p">).</span><span class="nf">call</span><span class="p">(</span><span class="n">run_input</span><span class="o">=</span><span class="n">payload</span><span class="p">)</span>
        
        <span class="c1"># Paginate through the associated dataset
</span>        <span class="n">items_iterator</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="nf">dataset</span><span class="p">(</span><span class="n">run</span><span class="p">[</span><span class="sh">"</span><span class="s">defaultDatasetId</span><span class="sh">"</span><span class="p">]).</span><span class="nf">iterate_items</span><span class="p">()</span>
        
        <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">items_iterator</span><span class="p">:</span>
            <span class="n">normalized_results</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="nf">normalize_payload</span><span class="p">(</span><span class="n">item</span><span class="p">,</span> <span class="n">payload</span><span class="p">[</span><span class="sh">'</span><span class="s">keyword</span><span class="sh">'</span><span class="p">]))</span>
            
    <span class="k">return</span> <span class="n">normalized_results</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="sh">"</span><span class="s">__main__</span><span class="sh">"</span><span class="p">:</span>
    <span class="n">results</span> <span class="o">=</span> <span class="nf">fetch_datasets</span><span class="p">()</span>
    
    <span class="n">DATA_LAKE_PATH</span><span class="p">.</span><span class="nf">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">out_file</span> <span class="o">=</span> <span class="n">DATA_LAKE_PATH</span> <span class="o">/</span> <span class="sa">f</span><span class="sh">"</span><span class="s">extract_</span><span class="si">{</span><span class="n">datetime</span><span class="p">.</span><span class="nf">now</span><span class="p">().</span><span class="nf">strftime</span><span class="p">(</span><span class="sh">'</span><span class="s">%Y%m%d%H%M</span><span class="sh">'</span><span class="p">)</span><span class="si">}</span><span class="s">.json</span><span class="sh">"</span>
    
    <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">out_file</span><span class="p">,</span> <span class="sh">"</span><span class="s">w</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">json</span><span class="p">.</span><span class="nf">dump</span><span class="p">(</span><span class="n">results</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="n">separators</span><span class="o">=</span><span class="p">(</span><span class="sh">'</span><span class="s">,</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">:</span><span class="sh">'</span><span class="p">))</span> <span class="c1"># Minified JSON
</span>        
    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Extracted </span><span class="si">{</span><span class="nf">len</span><span class="p">(</span><span class="n">results</span><span class="p">)</span><span class="si">}</span><span class="s"> records to local storage.</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>
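<p>As written, <code class="language-plaintext highlighter-rouge">fetch_datasets</code> is all-or-nothing: a transient actor failure on one keyword aborts the entire loop. A hedged sketch of per-payload fault isolation, using a small retry helper (the helper name and backoff parameters are our own, not part of the Apify SDK):</p>

```python
import time

def run_with_retries(call, max_attempts=3, base_delay=1.0):
    """Retry a zero-argument callable with exponential backoff.

    Re-raises the last exception once max_attempts is exhausted, so the
    caller can decide whether to skip the payload or abort the run.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

<p>Inside the loop you would wrap the blocking call, e.g. <code class="language-plaintext highlighter-rouge">run = run_with_retries(lambda: client.actor(ACTOR_ID).call(run_input=payload))</code>, and on final failure log the keyword and <code class="language-plaintext highlighter-rouge">continue</code> to the next payload rather than losing the whole batch.</p>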

<p>By explicitly flattening nested hierarchies into <code class="language-plaintext highlighter-rouge">metrics</code> and <code class="language-plaintext highlighter-rouge">media</code> dictionaries, we decouple the actor’s API shape from our internal domain model.</p>
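<p>The defensive <code class="language-plaintext highlighter-rouge">.get</code> chaining is what makes that decoupling safe: when the upstream payload arrives sparse or slightly reshaped, missing levels degrade to empty containers and the field becomes <code class="language-plaintext highlighter-rouge">None</code> instead of raising <code class="language-plaintext highlighter-rouge">KeyError</code> mid-run. A minimal sketch of the pattern behind <code class="language-plaintext highlighter-rouge">source_url</code>, hardened with <code class="language-plaintext highlighter-rouge">or [None]</code> so a present-but-empty <code class="language-plaintext highlighter-rouge">url_list</code> also yields <code class="language-plaintext highlighter-rouge">None</code> rather than an <code class="language-plaintext highlighter-rouge">IndexError</code>:</p>

```python
def first_cdn_url(raw_item: dict):
    """Mirror the nested lookup for media.source_url: every level defaults
    to an empty container, so a sparse payload yields None, never KeyError."""
    play_addr = raw_item.get("video", {}).get("play_addr", {})
    # `or [None]` also covers url_list being present but empty.
    return (play_addr.get("url_list") or [None])[0]
```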

<p>In <a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Asynchronous video ingestion and connection pooling</a>, we write a concurrent parser using <code class="language-plaintext highlighter-rouge">asyncio</code> and <code class="language-plaintext highlighter-rouge">httpx</code> to actually fetch the <code class="language-plaintext highlighter-rouge">.mp4</code> payloads from the retrieved CDN URLs.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><strong>Part 1: Interfacing with the Apify Python SDK</strong> ← You are here</li>
    <li><a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Asynchronous video ingestion and connection pooling</a></li>
    <li><a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: LLM script synthesis and FFmpeg concatenation</a></li>
    <li><a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: OAuth2 authentication and YouTube Data API uploads</a></li>
  </ul>
</blockquote>]]></content><author><name>Novi Develop</name></author><category term="Systems Engineering" /><category term="Video Automation" /><category term="Python" /><category term="apify" /><category term="tiktok-api" /><category term="python" /><category term="etl" /><summary type="html"><![CDATA[This is Part 1 of our pipeline architecture series. (See the Architecture Overview for context.) Here, we implement the extraction layer using the apify-client SDK to interface with the Advanced TikTok Search API Actor.]]></summary></entry></feed>