yiff.party downloader

redturtle2 · Jul 15, 2018
A while back I made a Python script for downloading content from yiff.party, and I thought it might be useful for others as well. I'm attaching the file as .txt since the .py extension is not allowed.

Requirements:
  • Python 3
  • beautifulsoup4
  • requests
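
Install the dependencies with pip:
pip install beautifulsoup4 requests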

usage: yiff-dl.py [-h] [-n CONNECTIONS] [-p PAGE] [-t POST] [-i] [-d {pre,post}] [-o OUTPUT_DIR]
                  [--downloader DOWNLOADER] [--downloader-args DOWNLOADER_ARGS]
                  url

positional arguments:
  url

optional arguments:
  -h, --help            show this help message and exit
  -n CONNECTIONS, --connections CONNECTIONS
                        the maximum number of files to download concurrently
  -p PAGE, --page PAGE  download only specified pages (n for just page n, :n for
                        up to page n, n: for n to end, n1:n2 for n1 to n2)
  -t POST, --post POST  download only specified posts (n for just post n, :n for
                        up to post n, n: for n to end, n1:n2 for n1 to n2)
  -i, --write_info      write info htmls
  -d {pre,post}, --id_format {pre,post}
                        prefix or postfix id
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        output directory
  --downloader DOWNLOADER
                        specify external downloader
  --downloader-args DOWNLOADER_ARGS
                        specify external downloader arguments

-n, --connections
This essentially creates n processes using Python's multiprocessing library and starts downloading one file with each process. Don't set it too high; 4-6 is usually enough.
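
For example, to download up to 4 files concurrently:
python yiff-dl.py -n 4 "<url>"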

To download a single page or a range of pages, use -p, --page. For example, if a creator has 10 pages on yiff.party, then -p 3 downloads only page 3, -p 2:5 downloads pages 2,3,4,5, -p :5 downloads pages 1,2,3,4,5, and -p 6: downloads pages 6,7,8,9,10.

Use -t, --post to download specific posts from a single page. Best used along with -p n.
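
For example, to grab only posts 2 through 4 on page 3:
python yiff-dl.py -p 3 -t 2:4 "<url>"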

Use -i to write out local HTML files so that you have a local working copy that can be viewed in a browser.
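
Together with the download folders, this gives a self-contained per-creator directory, roughly:
<output_dir>/<creator title>/
    info1.html, info2.html, ...  (one per page, cross-linked)
    thumbs/
    posts/
    inlines/
    attachments/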

By default, downloaded files are prefixed with the post id. Use -d post, --id_format post to append it instead.
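For a hypothetical post with id 1234567 and an attachment image.png, -d pre saves it as 1234567_image.png, while -d post saves it as image_1234567.png (the id is inserted before the extension).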

You can use an external downloader like aria2 or axel to fetch the files; otherwise Python's requests library is used. An external downloader is preferable since it can resume and skip already-downloaded files. Currently only downloaders that support a -o output flag can be used, so wget cannot be used, but curl, axel, and aria2 can.
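
When an external downloader is set, the script simply shells out with the output flag appended, i.e. it runs
<downloader> <downloader-args> -o <local file> <file url>
so with the axel settings below this comes out to something like axel -n 10 -o "Creator/posts/1234567_file.zip" "<file url>" (the file name here is just illustrative).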

My preferred usage is:
python yiff-dl.py -n 5 -d post -i --downloader axel --downloader-args " -n 10" "<url>"
OR, with aria2:
python yiff-dl.py -n 5 -d post -i --downloader aria2c --downloader-args " -x 10" "<url>"
(The leading space inside the quoted --downloader-args value keeps argparse from treating it as a flag of the script itself.)

Python:
import argparse
import os
import subprocess
import time
from multiprocessing import Pool, Value

import requests
from bs4 import BeautifulSoup

# Global counter for downloads
counter = None


def init(c):
    """Initialize global counter.

    """

    global counter
    counter = c


def write_info(info, page, pages, output_dir, patreon_title):
    """Write html info file.

    Parameters
    ----------
    info : dict
        Dict with all the necessary info.
    page : int
        Page number for which html is to be written.
    pages : int
        Total number of pages.
    output_dir : str
        The output directory.
    patreon_title : str
        The patreon title.

    """

    # Header
    html = '<!DOCTYPE html><html><head><meta http-equiv="content-type" content="text/html; charset=UTF-8"><meta http-equiv="content-language" content="en"></head><body style="margin:auto;width:50%;background-color:#eeeeee;color:#424242;font-family:\'Segoe UI\', Tahoma, Geneva, Verdana, sans-serif;text-align: justify"><br>'

    # Top page links
    if pages > 1:
        html += '<div style="word-spacing: 15px">'
        for p in range(1, pages+1):
            html += f'<a href="info{p}.html">{p}</a> ' if p != page else f'{p} '
        html += '</div><hr style="border: dashed 2px"/>'

    # Write info
    for i in info:
        html += f'<h3>{i["title"]}</h3>'
        if 'thumb' in i:
            html += f'<img src="{i["thumb"]}" height=200><br><br>'
        html += f'<b>id</b>: <i>{i["id"]}</i>&nbsp;&nbsp;<b>{i["time"]}</b>'
        if 'post' in i:
            html += f'&nbsp;&nbsp;&nbsp;&nbsp;<a href="{i["post"]}"><b>Post File</b></a>'
        if 'incomplete' in i:
            html += '&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#c62828"><b>INCOMPLETE</b></span>'
        html += '<br>'
        if i['body'] != '':
            html += f'<br><div>{i["body"]}</div>'
        if 'embed' in i:
            html += f'<br><a href="{i["embed"]}"><b>EMBED URL</b></a><br>'
        if 'attachments' in i and len(i['attachments']) > 0:
            html += '<br><b>Attachments</b><br>'
            for att in i['attachments']:
                html += f'<a href="{att[0]}">{att[1]}</a><br>'
        html += '<br><hr style="border: dashed 2px"/>'

    # Bottom page links
    if pages > 1:
        html += '<div style="word-spacing: 15px">'
        for p in range(1, pages+1):
            html += f'<a href="info{p}.html">{p}</a> ' if p != page else f'{p} '

    html += '</div><br></body></html>'

    # Write out to file
    with open(os.path.join(output_dir, patreon_title, f'info{page}.html'), 'w', encoding='utf8') as f:
        f.write(html)


def downloadfile(all_args):
    """Download a single file using whichever downloader is specified.

    Parameters
    ----------
    all_args : tuple
        all arguments as a tuple (script_args, url, file_name).

    """

    global counter

    # Extract arguments
    args, url, file_name = all_args

    # Fix url
    if not url.startswith('http'):
        url = 'https://yiff.party' + url

    # Download file
    try:
        # Use external downloader if specified else use requests lib
        if args.downloader is None:
            file = requests.get(url)
            with open(file_name, 'wb') as f:
                f.write(file.content)
        else:
            call_list = [args.downloader]
            if args.downloader_args is not None:
                call_list += args.downloader_args.split()
            call_list += ['-o', file_name, url]

            subprocess.call(call_list, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

        # Update counter; read the new value while holding the lock to avoid racing other workers
        with counter.get_lock():
            counter.value += 1
            done = counter.value
        print(f'\r{done} files downloaded', end='')
    except (requests.RequestException, FileNotFoundError, OSError):
        pass


def downloadpost(args, post, patreon_title):
    """Download a single post.

    Parameters
    ----------
    args : argparse.Namespace
        Script arguments.
    post : dict
        Post to download.
    patreon_title : str
        Patreon title.

    Returns
    -------
    info : dict
        info about the post.

    """

    downloadables = []
    types = {}

    # Extract id, time and title
    id_ = post['id'][1:]
    time = post.select('small.post-time')[0].text[:10]
    title = post.select('span.card-title')[0].text
    # Trim the trailing icon/widget text that the page markup appends to the title
    if title.endswith('more_vert'):
        title = title[:-9]
    else:
        title = title[:-25]
    info = {'id': id_, 'time': time, 'title': title}
    print(title)

    # Extract thumbnail
    if len(post.select('div.card-image')) > 0:
        assert len(post.select('div.card-image')) == 1
        card_image = post.select('div.card-image')[0].select('img')[0]['data-src']
        bname = os.path.basename(card_image)
        if args.id_format == 'pre':
            card_image_local = os.path.join(args.output_dir, patreon_title, 'thumbs', f'{id_}_{bname}')
        else:
            card_image_local = os.path.join(args.output_dir, patreon_title, 'thumbs',
                                            f'{bname[:bname.rfind(".")]}_{id_}{bname[bname.rfind("."):]}' if bname.rfind('.') != -1 else f'{bname}_{id_}')
        downloadables.append((args, card_image, card_image_local))
        types['card image'] = 1
        info['thumb'] = os.path.join('thumbs', os.path.basename(card_image_local))

    # Extract post file
    if len(post.select('div.card-action')) > 0:
        assert len(post.select('div.card-action')) == 1
        post_file = post.select('div.card-action')[0].select('a')[0]['href']
        bname = os.path.basename(post_file)
        if args.id_format == 'pre':
            post_file_local = os.path.join(args.output_dir, patreon_title, 'posts', f'{id_}_{bname}')
        else:
            post_file_local = os.path.join(args.output_dir, patreon_title, 'posts',
                                           f'{bname[:bname.rfind(".")]}_{id_}{bname[bname.rfind("."):]}' if bname.rfind('.') != -1 else f'{bname}_{id_}')
        downloadables.append((args, post_file, post_file_local))
        types['post file'] = 1
        info['post'] = os.path.join('posts', os.path.basename(post_file_local))

    # Extract body
    assert len(post.select('div.post-body')) == 1
    post_body = post.select('div.post-body')[0]

    # Extract inlines
    for img in post_body.select('img.post-img-inline'):
        bname_i = os.path.basename(img.parent['href'])
        bname_t = os.path.basename(img['data-src'])
        if args.id_format == 'pre':
            img_local = os.path.join(args.output_dir, patreon_title, 'inlines', f'{id_}_{bname_i}')
            thumb_local = os.path.join(args.output_dir, patreon_title, 'thumbs', f'{id_}_inline_{bname_t}')
        else:
            img_local = os.path.join(args.output_dir, patreon_title, 'inlines', f'{bname_i[:bname_i.rfind(".")]}_{id_}{bname_i[bname_i.rfind("."):]}' if bname_i.rfind(
                '.') != -1 else f'{bname_i}_{id_}')
            thumb_local = os.path.join(args.output_dir, patreon_title, 'thumbs', f'{bname_t[:bname_t.rfind(".")]}_{id_}{bname_t[bname_t.rfind("."):]}' if bname_t.rfind(
                '.') != -1 else f'{bname_t}_{id_}_inline')
        downloadables.append((args, img.parent['href'], img_local))
        downloadables.append((args, img['data-src'], thumb_local))
        if 'inlines' not in types:
            types['inlines'] = 0
        types['inlines'] += 1
        if 'inline thumbs' not in types:
            types['inline thumbs'] = 0
        types['inline thumbs'] += 1
        img['src'] = os.path.join('thumbs', os.path.basename(thumb_local))
        img['height'] = 200
        img.parent['href'] = os.path.join('inlines', os.path.basename(img_local))
    info['body'] = post_body.decode_contents()
    if info['body'].startswith('<p>') and info['body'].endswith('</p>'):
        info['body'] = info['body'][3:-4]

    # Extract embedded link
    if len(post.select('div.card-embed')) > 0:
        assert len(post.select('div.card-embed')) == 1
        info['embed'] = post.select('div.card-embed')[0].select('a')[0]['href']

    # Extract attachments
    info['attachments'] = []
    for card_attachment in post.select('div.card-attachments'):
        attachments = card_attachment.select('a')
        for attachment in attachments:
            att_id = attachment['href'].split('/')[-2]
            bname = attachment.text[:attachment.text.rfind('.')] + '_' + att_id + attachment.text[attachment.text.rfind(
                '.'):] if attachment.text.rfind('.') != -1 else attachment.text + '_' + att_id
            if args.id_format == 'pre':
                attachment_local = os.path.join(args.output_dir, patreon_title, 'attachments', f'{id_}_{bname}')
            else:
                attachment_local = os.path.join(args.output_dir, patreon_title, 'attachments',
                                                f'{bname[:bname.rfind(".")]}_{id_}{bname[bname.rfind("."):]}' if bname.rfind('.') != -1 else f'{bname}_{id_}')
            downloadables.append((args, attachment['href'], attachment_local))
            if 'attachments' not in types:
                types['attachments'] = 0
            types['attachments'] += 1
            info['attachments'].append(
                (os.path.join('attachments', os.path.basename(attachment_local)), attachment.text))

    # Download files
    print(f'{len(downloadables)} files to download.', end='')
    print(f' ({", ".join([t+":"+str(types[t]) for t in types])})' if len(downloadables) > 0 else '')
    if len(downloadables) > 0:
        counter = Value('i', 0)
        with Pool(args.connections, initializer=init, initargs=(counter,)) as p:
            p.map(downloadfile, downloadables)

        if counter.value < len(downloadables):
            info['incomplete'] = True

    # Return info object
    return info


if __name__ == '__main__':

    # Define and parse arguments
    parser = argparse.ArgumentParser()
    parser.add_argument('url')
    parser.add_argument('-n', '--connections',
                        help='the maximum number of files to download concurrently', type=int, default=1)
    parser.add_argument(
        '-p', '--page', help='download only specified pages (n for just page n, :n for up to page n, n: for n to end, n1:n2 for n1 to n2)')
    parser.add_argument(
        '-t', '--post', help='download only specified posts (n for just post n, :n for up to post n, n: for n to end, n1:n2 for n1 to n2)')
    parser.add_argument('-i', '--write_info', help='write info htmls', action='store_true')
    parser.add_argument('-d', '--id_format', help='prefix or postfix id', choices=['pre', 'post'], default='pre')
    parser.add_argument('-o', '--output_dir', help='output directory', default='./')
    parser.add_argument('--downloader', help='specify external downloader', default=None)
    parser.add_argument('--downloader-args', dest='downloader_args',
                        help='specify external downloader arguments', default=None)
    args = parser.parse_args()

    # Get first page
    print('Getting webpage')
    while True:
        page = requests.get(args.url)
        if page.status_code == 200:
            break
        print('Retrying')
        time.sleep(1)  # brief pause before retrying so we don't hammer the server
    soup = BeautifulSoup(page.text, 'html.parser')

    # Get pages
    if len(soup.select('p.paginate-count')) > 0:
        pages = soup.select('p.paginate-count')[0].text
        pages = int(pages[pages.find('/')+2:])
        print(f'{pages} pages')
    else:
        pages = 1

    patreon_title = soup.select('title')[0].text[:-13]  # page <title> minus the trailing site suffix

    # Create the output directory tree (creator folder plus a subfolder per file type)
    for sub in ('', 'thumbs', 'posts', 'attachments', 'inlines'):
        os.makedirs(os.path.join(args.output_dir, patreon_title, sub), exist_ok=True)

    p1 = 1
    p2 = pages+1
    if args.page:
        if args.page.startswith(':'):
            p2 = int(args.page[1:])+1
        elif args.page.endswith(':'):
            p1 = int(args.page[:-1])
        elif ':' in args.page:
            p1 = int(args.page[:args.page.index(':')])
            p2 = int(args.page[args.page.index(':')+1:])+1
        else:
            p1 = int(args.page)
            p2 = p1+1

    # Download page
    for p in range(p1, p2):
        if p != 1:
            page = requests.get(args.url+f'?p={p}')
            soup = BeautifulSoup(page.text, 'html.parser')

        posts = soup.select('div.card.yp-post')
        posts_start = 0

        if args.post and p == p1:
            # Post numbers are 1-indexed; convert to inclusive slice bounds
            if args.post.startswith(':'):
                posts = posts[:int(args.post[1:])]
            elif args.post.endswith(':'):
                posts = posts[int(args.post[:-1])-1:]
                posts_start = int(args.post[:-1])-1
            elif ':' in args.post:
                posts = posts[int(args.post[:args.post.index(':')])-1:int(args.post[args.post.index(':')+1:])]
                posts_start = int(args.post[:args.post.index(':')])-1
            else:
                posts = [posts[int(args.post)-1]]
                posts_start = int(args.post)-1

        info = []
        for i, post in enumerate(posts):
            print(f'[Page {p}/{pages}, Post {posts_start+i+1}/{posts_start+len(posts)}] ', end='')
            info.append(downloadpost(args, post, patreon_title))
            print('')

        if args.write_info:
            write_info(info, p, pages, args.output_dir, patreon_title)
 