Parsing

Pulling structure out of responses, mostly without BeautifulSoup: regex extraction, JSON traversal, PDF text, and zipping scraped lists.

Zip two scraped lists into a dict

Parallel result columns (e.g. item names + prices) are paired into one dict for clean output.

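When a page renders parallel columns (names in <h3>, prices in a sibling element), each column is scraped into its own list, then zip(names, prices) pairs them positionally and dict(...) turns the pairs into a lookup. The if t.startswith("$") filter keeps only the price text node from each card. This works because both queries walk the DOM top-to-bottom, so the lists come out in the same order; it assumes every card holds exactly one name and one price. If a card is missing either, later pairs shift by one and zip() silently stops at the shorter list. Duplicate names collapse to the last price, since dict keys are unique.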

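from bs4 import BeautifulSoup  # pip install beautifulsoup4

soup   = BeautifulSoup(r.text, "html.parser")   # r is the fetched response
names  = [t.get_text(strip=True) for t in soup.find_all("h3")]
# one text node per card survives the "$" filter: the price
prices = [t for d in soup.select("section.list > div") for t in d.stripped_strings if t.startswith("$")]
items  = dict(zip(names, prices))               # positional pairing -> lookup
for k, v in items.items():
    print(f"{k}: {v}")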

Sample markup

<section class="list">
  <div><h3>Laptop</h3><span>$1,299</span></div>
  <div><h3>Mouse</h3><span>$25</span></div>
</section>

Example output

Laptop: $1,299
Mouse: $25
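
Positional pairing only holds when both columns have the same length. A minimal sketch of a length check before zipping, using plain lists as stand-ins for the scraped columns; pair_columns and the sample values are hypothetical, not part of the original snippet:

```python
def pair_columns(names, prices):
    """Pair two scraped columns positionally, refusing mismatched lengths."""
    if len(names) != len(prices):
        # plain zip() would silently stop at the shorter list instead
        raise ValueError(
            f"column mismatch: {len(names)} names vs {len(prices)} prices"
        )
    return dict(zip(names, prices))


# hypothetical scraped columns
names = ["Laptop", "Mouse"]
prices = ["$1,299", "$25"]
items = pair_columns(names, prices)
print(items)

# a card without a price leaves the price column one short
try:
    pair_columns(names + ["Keyboard"], prices)
except ValueError as exc:
    print(exc)
```

On Python 3.10 and later, zip(names, prices, strict=True) raises ValueError on a length mismatch and could replace the explicit check.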

_{Find by: zip, dict, combine lists, pair, names and prices, mapping, dictionary · Source: WSA SQLi in-band}

Regex extraction from response

When the data sits between known markers, a single capture group beats a full DOM parse.

When the data sits between two fixed markers and a full DOM walk is unnecessary, one capture group is faster than parsing. ([^<]*) grabs everything up to the next <, so it stops cleanly at the closing tag. re.search returns None when nothing matches, so m must be checked before m.group(1), the captured text, is read. A negated class such as [^<] already matches newlines, so re.DOTALL is needed only if the pattern uses . to cross line breaks.

import re
m = re.search(r"Results:</b><br><br>([^<]*)</center>", r.text)
value = None                       # stays None on no match or an empty capture
if m and m.group(1).strip():       # m is None when the pattern is absent
    value = m.group(1).strip()

Sample response fragment

<center>Results:</b><br><br>HTB{f1a9b2...c2}</center>

Example output

value -> "HTB{f1a9b2...c2}"
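
The None guard and the newline behaviour can be checked offline. A small sketch that wraps the same pattern in a helper and runs it against two strings; extract_result and both sample bodies are hypothetical illustrations:

```python
import re

# same pattern as the entry; [^<]* matches newlines too, so no flag is needed
PATTERN = re.compile(r"Results:</b><br><br>([^<]*)</center>")


def extract_result(body):
    """Return the stripped capture, or None when it is absent or empty."""
    m = PATTERN.search(body)
    if m is None:  # no match: m.group(1) would raise AttributeError
        return None
    return m.group(1).strip() or None


# hypothetical response bodies
hit = "<center>Results:</b><br><br>\n  some value\n</center>"
miss = "<center>No results</center>"

print(extract_result(hit))
print(extract_result(miss))
```

Returning None for both "no match" and "empty capture" keeps the caller down to a single check.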

_{Find by: regex, re.search, extract, group, pattern, match, capture, parse response, scrape without bs4 · Source: CWEE/XPath in-band}

Extract text from a PDF response (PyMuPDF/fitz)

A generated PDF is read straight from response bytes; slicing between headings isolates the table.

The PDF is opened straight from the response bytes (r.content, not r.text) so nothing is written to disk. page.get_text() returns the page as plain text; slicing between two known headings isolates just the relevant table. Each find() returns the heading's index, or -1 when the heading is absent, and a -1 fed into the slice silently yields the wrong span. Iterating doc covers multi-page invoices. This is the read side of the SSRF-to-PDF chain.

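import fitz  # pip install pymupdf
doc = fitz.open(stream=r.content, filetype="pdf")  # open from memory, no temp file
for page in doc:
    txt = page.get_text()
    start = txt.find("Order System")
    end   = txt.find("SEARCH ORDER", start + 1)    # end heading must follow the start
    if start != -1 and end != -1:                  # find() returns -1 on a miss
        print(txt[start:end])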
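
The slicing step can be exercised without a PDF. A minimal sketch of a helper that returns None when either heading is missing; slice_between is a hypothetical name, and the sample string stands in for whatever page.get_text() returns:

```python
def slice_between(text, start_marker, end_marker):
    """Return text from start_marker up to end_marker, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    # search for the end marker only after the start marker
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        return None
    return text[start:end]


# hypothetical page text, standing in for page.get_text()
page_text = "Header\nOrder System\nid  item\n1   Laptop\nSEARCH ORDER\nFooter"

print(slice_between(page_text, "Order System", "SEARCH ORDER"))
print(slice_between("unrelated page", "Order System", "SEARCH ORDER"))
```

Pages that lack the headings return None and can be skipped instead of printing a stray fragment.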

_{Find by: pdf, fitz, pymupdf, extract text, invoice, document, parse pdf, ssrf pdf, get_text, binary response · Source: CWEE/PDF SSRF}

Parse JSON responses — r.json() + safe .get traversal

r.json() then chained .get calls with defaults; the list is guarded before indexing [0] to pull a value buried in a list-of-objects.

r.json() decodes the response body into a Python dict/list. It raises requests.exceptions.JSONDecodeError (requests 2.27 and later) if the body is not valid JSON, so it should be wrapped when the endpoint can return HTML errors. data.get('key') reads a key without the KeyError that data['key'] throws on a miss; a second argument supplies a default. Chaining .get('a', {}).get('b') handles nested objects: each missing level yields {} so the next .get is still safe. The default applies only when the key is absent; a key present with JSON null returns None and breaks the chain, so (data.get('a') or {}).get('b') is the safer form when null is possible. List indexing is not safe the same way: reqs[0] raises IndexError on an empty list, so it is gated with reqs[0] if reqs else {} before reading. next((x for x in reqs if pred), None) returns the first element matching a predicate (or None). json.loads(text) parses a JSON string already held; json.dumps(obj, indent=2) re-serialises any object pretty-printed, which is the fastest way to inspect an unfamiliar response shape.

import json

r = s.get(url=POLL_URL, verify=False, proxies=PROXIES, timeout=10)
data = r.json()                              # JSON body -> dict/list (raises on non-JSON body)

token  = data.get('uuid')                    # safe key read: None if absent (no KeyError)
status = data.get('status', 'unknown')       # ...with a fallback default
host   = data.get('meta', {}).get('host')    # nested: the {} default keeps the chain alive

# value buried in a list-of-objects -- guard the empty list BEFORE indexing [0]:
reqs  = data.get('data', [])
first = reqs[0] if reqs else {}
out   = first.get('query', {}).get('data')   # e.g. the exfiltrated value

# first list item matching a predicate (None if none match):
hit = next((x for x in reqs if x.get('method') == 'GET'), None)

# parse a JSON *string*, and pretty-print any object to eyeball its shape:
obj = json.loads(r.text)
print(json.dumps(obj, indent=2))

Sample JSON body the calls target

{
  "uuid": "1b2c3d4e",
  "status": "ok",
  "data": [
    { "method": "GET", "query": { "data": "726f6f74" } }
  ]
}

What each variable holds

token  -> '1b2c3d4e'
status -> 'ok'
host   -> None                  (no 'meta' key, default {} then .get -> None)
out    -> '726f6f74'            (data[0]['query']['data'])
hit    -> {'method': 'GET', 'query': {...}}
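
The guards above can be folded into two small helpers. This is a sketch using only the standard library, with json.loads on a string standing in for r.json(); parse_json, dig, and the sample body (which adds a hypothetical "meta": null to show the null case) are illustrative assumptions, not part of the original workflow:

```python
import json


def parse_json(text):
    """Decode a JSON string; return None instead of raising on a non-JSON body."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return None


def dig(obj, *path, default=None):
    """Follow dict keys and list indexes in order; return default on any miss."""
    for step in path:
        if isinstance(obj, dict) and step in obj:
            obj = obj[step]
        elif (isinstance(obj, list) and isinstance(step, int)
              and -len(obj) <= step < len(obj)):
            obj = obj[step]
        else:
            return default
    return obj


# hypothetical body: "meta" is present but null
body = '{"uuid": "1b2c3d4e", "meta": null, "data": [{"query": {"data": "726f6f74"}}]}'
data = parse_json(body)

print(dig(data, "data", 0, "query", "data"))   # value buried in a list-of-objects
print(dig(data, "meta", "host"))               # null level: default, no AttributeError
print(dig(data, "data", 5, "query"))           # out-of-range index: default, no IndexError
print(parse_json("<html>502 Bad Gateway</html>"))
```

A single path-walking helper trades the readability of chained .get calls for uniform handling of missing keys, null levels, and short lists.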

_{Find by: json, parse json, r.json, response json, api response, dict, get key, default value, keyerror, indexerror, jsondecodeerror, nested, traversal, list of objects, index guard, first match, find in list, predicate, json.loads, json.dumps, pretty print, navigate, extract field, webhook, poll, oob exfil · Source: HTB/VoidWhispers}

BeautifulSoup Encodings