
The #jp twitter scraper

Name: eels is a horse fucker 2021-03-25 1:28

#!/bin/sh

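# Fetch the profile once; skip if it is already saved.
# (break is only valid inside a loop, so the test is negated instead.)
if [ ! -f "./$1-profile" ]; then
twint --user-full -u "$1" > "./$1-profile"
fi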

# Test for presence of $@.json file; continue if present
if [ -f ./$@.json ]; then
twint --json --since $(jq --raw-output -M .date ./$@.json|tail -n1) -u $@ -o $@.json.tmp && tac $@.json.tmp >>$@.json
else
twint --json -u $@ -o $@.json.tmp && tac $@.json.tmp >$@.json && mkdir video && mkdir images
fi

# Download all images
tr '" ' '\n\n' <"$1.json"|grep -o -w 'https://pbs\.twimg\.com/media/.*g'|sed -e 's/$/:orig/'|wget -nc -i - -P ./images/

# Download all video media
#cat $@.json|grep "\"video\"\: 1"|tr '\" ' '\n'|grep $@/status|youtube-dl -o ./"video/%(id)s.%(ext)s" -i -a -
for video in $(cat $@.json|grep "video_thumb"|tr '\" ' '\n'|grep $(echo $@|tr '[:upper:]' '[:lower:]')/status); do ls $(echo ./video/$(basename $video*)) || youtube-dl -o ./"video/%(id)s.%(ext)s" "$video";done
#video_thumb is grepped as this is more indicative. sometimes twint sets video to 1 even when there's no video attached. haven't determined exact cause.

# Delete temporary file
rm $@.json.tmp
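
The --since lookup in the script above reads only the .date field, so a resumed scrape starts over from midnight of the last saved day. Below is a minimal sketch of pulling date and time together with jq. It assumes each JSON line has separate "date" and "time" fields, and that your twint version accepts a quoted "YYYY-MM-DD HH:MM:SS" value for --since; check twint --help before relying on either. The helper name last_timestamp and the sample data are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print "DATE TIME" taken from the last line of a
# line-delimited JSON file (one tweet object per line, oldest first).
last_timestamp() {
    tail -n 1 "$1" | jq --raw-output '"\(.date) \(.time)"'
}

# Sample data standing in for a real <user>.json file.
sample=$(mktemp)
printf '%s\n' \
    '{"id": 1, "date": "2021-03-24", "time": "23:59:01"}' \
    '{"id": 2, "date": "2021-03-25", "time": "01:28:00"}' >"$sample"

since=$(last_timestamp "$sample")
echo "$since"
rm -f "$sample"

# Assumed usage, not run here:
#   twint --json --since "$since" -u "$1" -o "./$1.json.tmp"
```

Quoting "$since" matters because the value contains a space.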

Name: eels is a horse fucker 2021-03-25 1:29

01:23:39 < Anonymous> it is very awkwardly made but it works
01:24:22 < Anonymous> it uses twint, grep, wget, youtube-dl
01:24:30 < Anonymous> and a lot of posix nonsense duct tape
01:24:46 < Anonymous> >bash
01:24:51 < Anonymous> he uses bash
01:24:56 < Anonymous> look at retard
01:25:01 < Anonymous> he uses bash
01:25:03 < Anonymous> I said bash but it should work with any posix shell
01:25:09 < Anonymous> I THINK
01:25:35 < Anonymous> for downloading posts just use curl
01:25:45 < Anonymous> I call it "twitscrape", there's no usage instructions built into it
01:25:51 < Anonymous> https://0x0.st/-qWU.sh
01:26:04 < Anonymous> it takes one argument, the profile name of the user you want to scrape
01:26:28 < Anonymous> you should use it in a directory named after the user
01:26:39 < Anonymous> woah

Name: Anonymous 2021-03-25 4:55

It's like you combined the worst part of IRC into the toxicity of Twitter. I love it.

Name: Anonymous 2021-03-25 5:45

I didn't give this person permission to upload, improved version coming within a couple of hours.
>>3 It's called "twit" scrape for a reason. Twitter is also filled with nice things like videos of asian models feeling themselves up (with sound!) though, so I just can't dismiss it entirely.

Name: Anonymous 2021-03-25 6:04

Fuck you namefag

Name: Anonymous 2021-03-25 7:10

https://0x0.st/-qJF.sh

--------------------------------------------------------------------------------

#!/bin/sh

# ©︎ Gay Nigger Association of America
# Version 3 I think.
# - No longer writes to pwd
# - Can pick up from an interruption
# - --ignore-config added to youtube-dl arguments. Please customise it manually.
# - Documentation
# - Usage instructions
# TODO: Continue previous scrapes from both date and time, rather than just date

if [ -z "$1" ]; then
echo "Please provide one (1) Twitter username. There are no sanity checks so please be
exact. This script will create a directory named after this and populate it with
\"video\" (video is plural) and \"images\" directories, and it will fill them
with video and images respectively. It will also create a chronological database
of the user's tweets in JSON format. If there are new files to download, or if
you got ratelimited in the middle of a scrape, don't fret, simply wait and
invoke the command again. Don't manually delete the file ending with .tmp." >&2
exit 1
fi

mkdir -p ./$@
mkdir -p ./$@/video
mkdir -p ./$@/images

# Sometimes this command fails to actually write anything, leaving a file that
# holds only a single LF byte. find prints the path only when the file is
# larger than one byte, so an empty result means the profile has to be fetched
# (again). find's exit status is no use here: it is 0 whenever the path exists.
# Could make this loop until successful instead, but I feel like that has the
# potential to annoy.
if [ -z "$(find "./$1/$1-profile" -type f -size +1c 2>/dev/null)" ]; then
twint --user-full -u "$1" >"./$1/$1-profile" # Should be safe to overwrite.
fi

# The only reason this file should still exist is because of a failed scrape.
# The "until" argument I give here only checks date, rather than time. Definite
# area to improve.
if [ -f ./$@/$@.json.tmp ]; then
echo "Detected interrupted scrape. Recovering." ;
twint --json --until $(jq --raw-output -M .date ./$@/$@.json.tmp|tail -n1) -u $@ -o ./$@/$@.json.tmp || exit ;
echo "Recovery complete. Moving on." ;
tac ./$@/$@.json.tmp >>./$@/$@.json &&
rm ./$@/$@.json.tmp
fi

# Have twint work on a normal file, sort from reverse-chron to chronological,
# then pipe to new, permanent file.
# Completely arbitrary and scientifically incorrect preference on my part.
if [ -f ./$@/$@.json ]; then
twint --json --since $(jq --raw-output -M .date ./$@/$@.json|tail -n1) -u $@ -o ./$@/$@.json.tmp &&
tac ./$@/$@.json.tmp >>./$@/$@.json &&
rm ./$@/$@.json.tmp
else
twint --json -u $@ -o ./$@/$@.json.tmp &&
tac ./$@/$@.json.tmp >>./$@/$@.json && # File doesn't exist yet, so >> behaves the same as > here.
rm ./$@/$@.json.tmp
fi

# Old. I did a weird grep thing instead of interpreting the JSON with jq.
# Still works.
tr '" ' '\n\n' <"./$1/$1.json" |
grep -o -w 'https://pbs\.twimg\.com/media/.*g' |
sed -e 's/$/:orig/' | wget -nc -i - -P "./$1/images/"

# Original: grep "\"video\"\: 1"
# Twitter or Twint bug sets "video" json to 1 incorrectly. Unsure of cause.
# Youtube-DL doesn't appear to have clobber detection, so I did it myself.
# Twitter usernames are case-insensitive, and the status links in the JSON use
# the lower-case form, so the argument is lower-cased before grepping.
for video in $(cat ./$@/$@.json|grep "video_thumb"|tr '\" ' '\n'|grep $(echo $@|tr '[:upper:]' '[:lower:]')/status) ;
do ls $(echo ./$@/video/$(basename $video*)) >/dev/null 2>&1 || # LOL. &> is a bashism; plain sh would background ls.
youtube-dl --ignore-config -o ./$@/"video/%(id)s.%(ext)s" "$video" ;
done

# unix makes computers so easy to use that even a black person could write this
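
The image step in the script above still splits the raw JSON with tr and grep. Here is a rough sketch of doing the same job with jq. It assumes each tweet object has a "photos" array of image URLs, which is how twint's JSON appears to be laid out; verify that against your own file. The helper name image_urls and the sample data are invented for the example.

```shell
#!/bin/sh
# Hypothetical helper: list full-size image URLs from a line-delimited
# JSON file. Assumes each tweet object may carry a "photos" array of URLs.
# Tweets with a missing or empty "photos" field produce no output.
image_urls() {
    jq --raw-output '.photos[]? + ":orig"' "$1" | sort -u
}

# Sample data standing in for a real <user>.json file.
sample=$(mktemp)
printf '%s\n' \
    '{"id": 1, "photos": ["https://pbs.twimg.com/media/AAA.jpg"]}' \
    '{"id": 2, "photos": []}' \
    '{"id": 3, "photos": ["https://pbs.twimg.com/media/BBB.png", "https://pbs.twimg.com/media/AAA.jpg"]}' >"$sample"

urls=$(image_urls "$sample")
echo "$urls"
rm -f "$sample"

# Assumed usage, not run here:
#   image_urls "./$1/$1.json" | wget -nc -i - -P "./$1/images/"
```

sort -u drops URLs that show up in more than one tweet, so wget sees each one once.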

Name: eels is a horse fucker 2021-03-25 9:48

>>6
good boy
mama's proud

scraped and archived
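
On the profile check in >>6: the comment there says there is no obvious way to test whether the file is bigger than a single LF byte. One option is wc -c, sketched below. The helper name has_content is made up, and nothing here touches twint.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if $1 is a regular file larger than
# one byte, i.e. more than a lone newline.
has_content() {
    # The command substitution is left unquoted on purpose: some wc
    # implementations pad the count with spaces, and word splitting
    # strips that padding before the numeric comparison.
    [ -f "$1" ] && [ $(wc -c <"$1") -gt 1 ]
}

# Try it on an empty file, a newline-only file, a real file and a missing one.
tmp=$(mktemp -d)
: >"$tmp/empty"
printf '\n' >"$tmp/newline"
printf 'name: example\n' >"$tmp/profile"
for f in empty newline profile missing; do
    if has_content "$tmp/$f"; then
        echo "$f: keep"
    else
        echo "$f: fetch again"
    fi
done
rm -rf "$tmp"
```

Only the "profile" case should report "keep".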
