A project

Name: Anonymous 2014-04-19 21:23

Get the URLs of all 4chan.org textboards (dis.) and scrap and archive them. Use archive.org

Name: Anonymous 2014-04-19 21:44

I have a fair amount of data that was collected last August. We can split up the boards and collect them using a well-tested script, preferably using the JSON API to reduce the burden on their servers. Also, do it slowly.

I have copies of threads from newpol, newnew, lang, lounge, and sjis in html

Where is the json sync bot again?

Name: Anonymous 2014-04-19 23:30

Name: Anonymous 2014-04-20 1:28

>>2
Why are you concerned about straining joot's servers? A single popular /b/ thread probably exceeds the bandwidth of all of w4c a hundred times over.

Name: Anonymous 2014-04-20 3:25

>>4
Downloading all of /prog/ raw is 1.5 GB, so it's kind of a lot. I just want to be polite.

Name: Anonymous 2014-04-20 7:26

/lounge/ has been archived. Scraping went much more smoothly. There was only one instance of malformed html. /prog/ was infested with it. Either someone pwned world4ch and was inserting raw data into the database, or someone with access to it was messing around. There are poster dates from 1969. Stuff like that is hard to explain otherwise.

https://archive.org/details/lounge-20140420.db.xz

Name: Anonymous 2014-04-20 13:05

>>6
Those dates are from the \n in the name field bug, could be other poasting bugs too.

Name: Anonymous 2014-04-21 4:33

>>7
How strange

does anyone remember the screens that would appear after making a post? We should compile a list of them before we forget what they were.

Name: Anonymous 2014-04-21 4:43

>>8
Weren't there only two? The VIP poem, and the `here comes the chopper' one.

Name: Anonymous 2014-04-21 9:49

>>9
That's what I remember.

/sjis/, /lang/, and /newnew/ are now on archive.org

https://archive.org/details/sjis-20140420.db.xz
https://archive.org/details/lang-20140420.db.xz
https://archive.org/details/newnew-20140420.db.xz

The remaining boards will take much longer. I'm partially through /newpol/ now.

If you'd like to change any of the metadata for any of these items, make a request here.

Name: Anonymous 2014-04-21 10:02

>>10
Don't forget /vip/

Name: Anonymous 2014-04-21 10:09

>>11
that'll be next.

Name: Anonymous 2014-04-21 10:56

>>12
thank you!

Name: Anonymous 2014-04-21 11:19

>>10
thanks

Name: Anonymous 2014-04-22 11:56

Name: Anonymous 2014-04-22 14:35

Are you archiving /anime/ and /food/?

Name: Anonymous 2014-04-22 16:43

After you are done, feel free to post a list of all archives in this thread and I will include them in https://progrider.org/files/archives/

Name: Anonymous 2014-04-22 19:45

>>16
yes, all of them eventually. It will take some time.

Name: Anonymous 2014-04-22 20:33

Name: Anonymous 2014-04-23 8:08

/anime/ has been uploaded.

https://archive.org/details/anime-20140420.db.xz

/food/ will be next.

>>17
Ok. Everything uploaded so far has been listed here.

Which one of you spammed /anime/?

https://dis.4chan.org/read/anime/1359200161

Name: Anonymous 2014-04-23 15:38

>>20
That doesn't look like /prog/lodyte spam, that looks like actual vengeful spam. Maybe the guy who thought Admin-kin was FN-kin and spammed here.

Name: Anonymous 2014-04-23 18:51

By the way, if you guys find any really good old threads, it might be a fun idea to replay them in /lounge/ (with proper introductions, so that people don't get confused). These are like scripts of old plays, and it would be nice to put on a performance once in a while.

Name: Anonymous 2014-04-23 19:54

I farted and no-one liked the smell

Name: Anonymous 2014-04-24 1:12

>>22
What is the time complexity of a PCRE?

Name: Anonymous 2014-04-24 5:30

>>24
exponential in the worst case. What expression are you considering?

Name: Anonymous 2014-04-24 7:12

scrape, not scrap! One letter makes a huge difference...

Name: L33tUK 2014-04-24 8:56

I hope it was a /prog/rider who hacked 4chan recently as retaliation against this.

Name: Anonymous 2014-04-24 10:57

>>27
From what I've gathered, an imageredditor got obsessed with some random slut and hacked 4chan for that reason. That doesn't seem like something a true /frog/anus would do.

You're a milkribs!

Name: Anonymous 2014-04-24 11:00

>>24
O(anger)

Name: Anonymous 2014-04-24 13:21

>>28
What if that slut was Leah?

Name: Anonymous 2014-04-25 3:58

>>27
If I had access to 4chan I would have done something more interesting than brag and leak ip addresses. Still, it was fun to see moot likes cocks from the account he uses to admin post on 4chan attention whore from a mass of moot worshippers. Inserting obfuscated code to world4ch to continue working but appear frozen from ips used by staff would have been nice. But again, this just continues reliance on a site operated by people who have no respect for us.

Name: Anonymous 2014-04-25 20:20

>>31
The one who gained access (or one pretending to be him) has stated that his access was limited to a few actions, and that he did as much as he figured out how to do. He didn't even have access to board-creation.

Name: Anonymous 2014-04-26 0:22

>>32
He should have been more patient. If you get stuck you don't just stop and blow everything. You wait until you observe more helpful information.

Name: Anonymous 2014-05-04 21:20

Name: Anonymous 2014-05-04 22:15

>>34
thank you so much, kind anon

Name: Anonymous 2014-05-04 22:40

>>35
No problem. I'm glad it's over. It took a long time. I stopped posting updates after the subject.txt page went down temporarily. I was afraid a fagshit working for moot for free might have found this thread and was fucking with us, so I thought it was better to provide the illusion that the project was abandoned until it was completed. Now if world4ch is taken down in spite we still have the data, so it doesn't matter. We (someone) can even host a replacement and enable posting. Are there any cool free web hosts that allow a db that's circa 2-3 GB?

Name: Anonymous 2014-05-04 23:05

>>36
Sadly, Heliohost doesn't, and any paid SOLUTION will probably cost you more than $10/mo.

Is anyone willing to continue world3ch on their VPS or seedbox? Keep in mind you're hosting fresh spammer bait.

Name: Anonymous 2014-05-04 23:07

>>37
heliohost could work for each board with multiple fake accounts. Though I would feel guilty for exploiting them like that.

Name: Anonymous 2014-05-04 23:09

>>38
It's pretty hard to create accounts on Heliohost. They seem to have seen previous attempts at this and limited the maximum number of new accounts per day to 2x10-5.

Name: Anonymous 2014-05-04 23:12

>>39
Just create a new account right after midnight pacific time a couple days in a row.

Name: Anonymous 2014-05-22 2:01

I'm going to find ways to compress the /prog/ db. I'll start by compacting the tags. Substituting spoiler should give good results. After that, a representation for repeated posts will help compress the spam. If I can get it below 500MB then heliohost can host it, which is the only cool free webhost. The deadline is eventually.

Name: Anonymous 2014-05-22 22:06

>>41
The archives aren't that big, and archive.org is fine. Why do you want to compress them?

Name: Anonymous 2014-05-23 11:14

>>42
I want to host a readable writable old world4ch, but am too cheap to pay for hosting that provides more than 500MB of storage.

Name: Anonymous 2014-05-23 14:40

>>43
maybe admin-kike will give you the hosting

Name: Anonymous 2014-05-23 14:50

Stop crying already and move on for the ima/g/eboards.

Name: Anonymous 2014-05-23 14:51

>>45
you're mom

Name: Anonymous 2014-06-24 0:53

replacing all the spoiler tags on old /prog/ with <span class="spoiler">...</span> saves 817 MB. That's more than half the size of the uncompressed db.

Name: Anonymous 2014-06-24 1:33

>>48,50
Who dost thou quoth?

Name: Anonymous 2014-06-25 1:09

>>51
He's quoting me!

Name: Anonymous 2014-06-27 18:00

compressing the markup reduced the 1.5 GB prog.db to around 390 MB. I could host the old prog on heliohost now, but I want to fit all of world4ch. Any recommendations for using data compression in a database are welcome. Right now I'm thinking of serializing each thread into a flat file and then gzipping them.

Name: Anonymous 2014-07-02 8:18

serializing all threads to flat text and DEFLATEing them by thread gave good results. All of world4ch fits in ~200 MB like this and can be randomly accessed efficiently enough. With an uncompressed caching layer for frequently accessed threads the overhead shouldn't be too bad.
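A minimal sketch of the per-thread compression idea, for anyone following along. It is not the actual w5ch code: the table layout, the separator characters, and all function names here are assumptions made for illustration. It uses zlib, which wraps DEFLATE, and assumes post fields never contain the \x1e or \x1f control characters.

```python
import sqlite3
import zlib

# Hypothetical flat format: fields joined by \x1f, posts joined by \x1e.
FIELD_SEP = "\x1f"
POST_SEP = "\x1e"

def serialize_thread(posts):
    """Flatten a list of (name, date, body) tuples into bytes."""
    return POST_SEP.join(FIELD_SEP.join(p) for p in posts).encode("utf-8")

def deserialize_thread(blob):
    """Inverse of serialize_thread for a non-empty thread."""
    text = blob.decode("utf-8")
    return [tuple(rec.split(FIELD_SEP)) for rec in text.split(POST_SEP)]

def open_store(path=":memory:"):
    """Open a database holding one compressed blob per thread."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS threads (id INTEGER PRIMARY KEY, data BLOB)"
    )
    return db

def put_thread(db, thread_id, posts):
    """Compress the whole thread at the highest zlib level and store it."""
    blob = zlib.compress(serialize_thread(posts), 9)
    db.execute("INSERT OR REPLACE INTO threads VALUES (?, ?)", (thread_id, blob))

def get_thread(db, thread_id):
    """Fetch and decompress a single thread without touching the others."""
    row = db.execute(
        "SELECT data FROM threads WHERE id = ?", (thread_id,)
    ).fetchone()
    return None if row is None else deserialize_thread(zlib.decompress(row[0]))
```

Because each thread is its own blob, reading one thread only decompresses that thread. Repetitive spam inside a thread compresses well, but repetition across threads is not exploited.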

Name: Anonymous 2014-07-02 10:04

>>47
here is an idea, make your own format for prog that uses a weird form of bbcode where every tag is 1 letter and when you close the tag you write [/]

Name: Anonymous 2014-07-02 13:29

>>55
If every tag is one letter, why bother keeping the square bracket syntax? The only reason the tag names need delimiters is so the parser (and the user) can instantly tell where they end. So you might as well switch to \b \i \o \u or something.

Name: Anonymous 2014-07-02 14:26

>>56
That would work even better

Name: Anonymous 2014-07-02 20:06

>>55-57
I tried substituting tags with shorter representations in >>53. <sub> became something like <s and </sub> could have become <S. I didn't want to do an encoding that depended on balanced tags because of the malformed html. The scheme gracefully handled malformed tags. Decoding was easy. The parser seeked to the next < and used the look ahead to determine the substitution. There were savings but they didn't compare to >>54. I left the spoiler tags in their original form and the spoiler spam is so low in entropy it isn't a problem.
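Roughly, a substitution scheme of that kind could look like the sketch below. The table and the two-character codes are hypothetical, not the ones actually used. Digits are chosen here as code characters on the assumption that no real tag name starts with one, so a code can't be confused with a tag that was left alone.

```python
# Hypothetical table: full tag -> "<" plus one code character.
TAGS = {
    "<sub>": "<0", "</sub>": "<1",
    "<sup>": "<2", "</sup>": "<3",
    "<b>": "<4", "</b>": "<5",
}
# Reverse lookup keyed by the single look-ahead character.
CODES = {code[1]: tag for tag, code in TAGS.items()}

def encode(html):
    """Replace known tags with short codes; leave everything else alone."""
    out, i = [], 0
    while True:
        j = html.find("<", i)          # seek to the next "<"
        if j == -1:
            out.append(html[i:])
            return "".join(out)
        out.append(html[i:j])
        for tag, code in TAGS.items():
            if html.startswith(tag, j):
                out.append(code)
                i = j + len(tag)
                break
        else:                          # unknown or malformed tag: copy the "<"
            out.append("<")
            i = j + 1

def decode(data):
    """Expand short codes using one character of look-ahead after each "<"."""
    out, i = [], 0
    while True:
        j = data.find("<", i)
        if j == -1:
            out.append(data[i:])
            return "".join(out)
        out.append(data[i:j])
        nxt = data[j + 1:j + 2]        # empty string at end of input
        if nxt in CODES:
            out.append(CODES[nxt])
            i = j + 2
        else:
            out.append("<")
            i = j + 1
```

Each tag is handled on its own, so unbalanced or stray tags round-trip unchanged. The round trip only holds if the original text never contains a literal "<" followed by one of the code characters.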

Name: Anonymous 2014-07-04 15:44

tag my anus

Name: Anonymous 2014-07-04 17:10

>>59
<tag>(_._)</tag>

Name: Anonymous 2014-07-05 23:06

>>60
Thank you!

Name: Anonymous 2014-07-09 2:28

There is now something on heliohost.

http://w5ch.heliohost.org

A clever one may be able to view the script and download the database. Not yet implemented are:

* Posting
* BBCode parser
* Post truncation
* Post selection expressions
* Caching

please don't ddos it ;_; It is vulnerable to expensive queries because of the data compression.

Name: >>62 2014-07-09 2:30

Oh and the javascript is removed until I can go through it and make sure it isn't doing anything that harms you.

Name: Anonymous 2014-07-09 2:51

>>62
Shitchan
Yayy someone uses the real name!

Name: Anonymous 2014-07-09 10:05

>>62
I'm surprised someone finally picked up this project. Thank you so much!

Name: Anonymous 2014-07-10 19:00

When submitting a new thread in shiichan, the thread id is part of the post request. So the thread id is the timestamp of when you loaded the page to submit the thread, not when the thread is submitted. And if you generate the post body yourself, you can put in whatever thread id you want. This explains the threads, -2147483648, 1, 3, 4, 1337, 7357, and 2147483648.

Name: Anonymous 2014-07-13 6:20

Posting works now. The post creation page is still a debug page, so after posting, just hit back and refresh. BBCode doesn't work yet. The heliohost server w5ch is on has been down all day, so I've created a hidden service for the site.

https://buvmp4vgrqm2parx.onion/
Fingerprint: 7C:B4:8E:D8:A5:B0:C8:6B:8E:AC:02:1C:1D:6F:1E:BC:84:94:76:B3

I may experiment with syncing content between multiple unreliable web hosts.

Name: Anonymous 2014-07-13 6:31

The board software can be downloaded here:
https://buvmp4vgrqm2parx.onion/board.py

and the compressed database is here:
https://buvmp4vgrqm2parx.onion/db/w5ch.db

The database has threads in compressed form. See the source in board.py to access it. It's a 270 MB file.

Name: Anonymous 2014-07-13 6:47

>>68
board software
*.py

No, thank you.

Name: Anonymous 2014-07-13 6:50

>>69
I'm not proud of it. But it was the language I was using when I was trying to get the database below 500 MB for heliohost so I stayed with it.

Name: Anonymous 2014-07-13 13:46

>>69
What else do you suggest? Heliohost doesn't support Lisp.

Name: Anonymous 2014-07-13 13:55

>>71
Perl. If it supports CGI + compiled applications, use C.
And you can write a Lisp interpreter in any language anyway.

Name: Anonymous 2014-07-13 14:00

>>71
Yesod.

Name: Anonymous 2014-07-13 17:16

Someone post on the hidden service already. Fuck.

Name: Anonymous 2014-07-14 0:20

>>74
Just did, and I have a bug report: Optimizable quotes show up as raw text for new posts.

Name: Anonymous 2014-07-14 1:32

>>75
Optimisable

Name: Anonymous 2014-07-14 3:14

>>75-76
Yeah, everything is raw text still. There's no quotes, hyperlinks, or bbcode yet.

Name: Anonymous 2014-07-14 4:37

> This [o ]
> is how [o ]
> you multiline quote, right?[/o ][/o ]

Name: Anonymous 2014-07-14 4:42

The >'s become <span class="quote">, the newlines after the quote become </span>, the [o ] becomes <span class="o">, and [/o ] becomes </span>. The result is

<span class="quote">This <span class="o"></span>
<span class="quote">is how <span class="o"></span>
<span class="quote">you multiline quote, right?</span></span></span>


I think you could also multiline quote with spoiler tags.

Name: Anonymous 2014-07-14 14:02

>>79

Yep [spoiler] worked, there was another span tag as well, [aa] I think.

Name: Anonymous 2014-07-15 0:25

To multiquote, [br] was used.

Name: Anonymous 2014-07-15 4:38

>>81
I'm so fucking confused. What does [br] do? Just inline another <br/>? I wish I had a list of bbcode inputs and html outputs to refer to. This project became a nightmare as soon as I entered the actual frontend of shiichan. Oh and I've implemented the over 1000 thread thing. If you look at the source of the threads with over 1000 posts, the post form is still there and you can unhide it by editing the html with firebug. I've seen over 1000 necros from as late as 2013, so there was some way to bypass the limit in shiichan. My current implementation hides the post form after 1000 posts and actually stops accepting posts at 1111 posts so you can get quints with a script if you want to. Maybe I'll introduce some randomness to get the same mysteriousness as shiichan.
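The two thresholds can be sketched like this. It is only an illustration, not the code in board.py; the constant and function names are made up, and post_count is assumed to be the number of posts already in the thread.

```python
HIDE_FORM_AT = 1000   # the form is hidden once a thread has this many posts
HARD_LIMIT = 1111     # post number 1111 is the last one accepted

def show_post_form(post_count):
    """Whether the rendered page should include a visible post form."""
    return post_count < HIDE_FORM_AT

def accept_post(post_count):
    """Whether the server should accept one more post to this thread."""
    return post_count < HARD_LIMIT
```

Keeping the two checks separate is what leaves the window between 1000 and 1111 open to scripted posts.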

Name: Anonymous 2014-07-15 9:17

>>81
No it is just several >

>>82
What does [br] do?
it used to insert line BReak. That was really retarded.

Name: Anonymous 2014-07-15 20:42

>>82
I don't follow the thread, I just look at the latest replies, so I hope I'm not off-topic here. I think you are discussing old /prog/'s bbcode, right? In which case, to multiquote, the technique used was this:
> foo[br]bar

which would appear as
foo
bar

By the way, to escape bbcode, there's [#][/#] (which I just used twice to show to you once).

Name: Anonymous 2014-07-15 21:47

>>84
I don't follow the thread, I just look at the latest replies, so I hope I'm not off-topic here.
The thread's at a mere eighty replies, your time can't be that precious.

Name: Anonymous 2014-07-16 7:24

>>84
Thanks for that explanation of br.

Name: test 2014-07-16 10:59

[spoiler]test[/spoiler]

Name: Anonymous 2014-07-16 13:02

I've noticed some weird spans in the html source. They don't appear to do anything. Does anyone remember the bbcode tags for them?

<span class="abbc">
<span class="math">

There's <span class="code"> too but I think this was [code ] before 2006.

Name: Anonymous 2014-07-16 15:06

bbcode works now. backlinks and hyperlinks are next.

Name: Anonymous 2014-07-16 15:13

>>88
Is the [math] span from /sci/? That's for TEX markup, though I don't recall it ever being used on /prog/.

ABBC is the name of the BBCode library.

Name: Anonymous 2014-07-17 17:27

I need help /prog/. how can I write a regular expression in python that will find all matches for N regular expressions and perform a customized replacement for each given expression? I can't iterate the substitutions one after the other because the replaced strings will be matched.

tldr; I need to lex in python. How can I do that using the standard library?

Name: Anonymous 2014-07-17 22:46

>>91
why do python programmers think everything about the world changes when using python?

I can never tell if the posts about "<extremely common thing> in python" are trying to appeal to people who think anything written in python is somehow better or if they're just proud of themselves for getting anything done in that shitheap of a language.

tl;dr
- lexing with regex = two problems, like everywhere else (except perl 6)
- stop being a baby and lex how you would lex in any other language

Name: Anonymous 2014-07-17 23:45

[math] was a tag that just put the text in a <div>. If jsMath hadn't been totally broken because of an undefined variable (if I recall correctly, it loaded before the page body, thus had nothing to work with) it would have done interesting things. It worked fine in my unreleased experimental SuperW4ch userscript, where I loaded the imagereddit jsmath script instead.

Name: Anonymous 2014-07-18 0:23

>>92
If I was writing in c I would lex using a statemachine that was machine generated with lex or flex. But I'm using python instead of c. I could write the statemachine myself in python but it would be slow as fuck to do that kind of processing at the script level. The only thing I can think of would be to use multiple regular expressions and perform multiple passes and craft the expressions so they somehow don't substitute within each other, but it's difficult and opens doors for html injection if things go wrong. It's frustrating to have a library that can do regex but can't do this conveniently.
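For what it's worth, one way to do this in a single pass with only the standard library is to join the expressions into one alternation of named groups and give re.sub a function that dispatches on match.lastgroup. The sketch below assumes three hypothetical rules (URLs with a protocol, >>N backlinks, and HTML-special characters); the rule names, patterns, and output markup are illustrations, not what the board uses. Replacement output is never rescanned, so one rule can't match inside another rule's result.

```python
import html
import re

def _link(text):
    # Escape the URL before placing it in the attribute and the link text.
    safe = html.escape(text, quote=True)
    return '<a href="{0}">{0}</a>'.format(safe)

def _backlink(text):
    # text is ">>" followed by digits, as guaranteed by the pattern below.
    num = text[2:]
    return '<a href="#{0}">&gt;&gt;{0}</a>'.format(num)

# (name, pattern, handler). Order matters: earlier rules win at a position.
# Patterns must not contain capturing groups of their own, so that
# match.lastgroup is always one of the rule names.
RULES = [
    ("url", r"https?://[^\s<>\"]+", _link),
    ("backlink", r">>\d+", _backlink),
    ("special", r"[&<>\"]", html.escape),
]

LEXER = re.compile("|".join("(?P<%s>%s)" % (name, pat) for name, pat, _ in RULES))
HANDLERS = {name: fn for name, _, fn in RULES}

def render(text):
    """Apply every rule in one left-to-right pass over the raw post text."""
    def replace(match):
        kind = match.lastgroup          # name of the rule that matched
        return HANDLERS[kind](match.group(kind))
    return LEXER.sub(replace, text)
```

The catch-all "special" rule escapes any stray markup characters that the other rules didn't consume, which covers the html injection worry as long as every handler escapes its own output.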

Name: Anonymous 2014-07-18 0:31

Someone got the kopipe for parsing HTML with regex?

Name: Anonymous 2014-07-18 0:42

>>94
so emit the fsm in c and use ffi

Name: Anonymous 2014-07-18 6:11

>>96
I took the easy route and decided to only detect hyperlinks beginning with a protocol. The range expressions are functioning as closely as I could emulate. I'm going to take a break from development for a while. If you have any changes you'd like to put in post a diff here or somewhere and I'll merge it in when I see it. You are also welcome to pull the code off the site and host your own. I believe the site is feature complete except for the subject page and json feed. I have no plans to implement a moderation interface and no ip addresses or identities are recorded in the database. Ip addresses still appear in logs though, so be aware of that. I'm not looking but someone else might.

Name: TPOBCI 2014-07-19 17:38

The Pleasure Of Being Cummed Inside.

Name: TPOBCI 2014-07-19 17:49

The Pressure Of Being Cummed Inside.

Name: Anonymous 2014-07-20 10:40

The Pleasure of Getting Dubs.

Name: Anonymous 2014-07-24 7:14

I don't know why you're revamping Shiichan. You should just improve Kareha or Tablecat BBS.

Name: Anonymous 2015-07-23 21:24

fyi heliohost took away the sqlite3 module so w5ch.heliohost.org is broken until that dependency is removed. The admin user name is ``w5ch''. You can type that in to reactivate the account if it has been suspended from me not logging in once a month and looking at heliohost's advertisements in the admin interface.

Name: Anonymous 2015-07-24 9:40

>>34
420.db.xz
SMOKE WEED EVERYDAY

Name: Anonymous 2015-07-26 19:33

It appears that world4ch has been completely removed now. dis.4chan.org no longer resolves.

Thanks for making these archives.

Name: some ass 2015-10-05 17:20

>>62
bring this back please,

are you on the IRC?

Name: Anonymous 2015-10-05 18:47

There was a thread on /vip/ I was going to bump today. OP said he would check back in 3 years to see if his thread was on top. Now he's just going to see that the board is gone ;_;

Name: Anonymous 2015-10-05 19:19

>>106
damn that is a sad story for OP

Name: Anonymous 2015-10-06 6:56

>>106
That's horrible.

Name: Anonymous 2015-11-12 16:23

I'm having trouble setting this up. it worked for the default SJIS db but not the progrider one

Name: Anonymous 2015-11-12 16:28

individual threads work e.g. http://localhost:5000/thread/1280369732 but the titles don't

Name: Anonymous 2015-11-12 21:11

↖ check ‘em

Name: Anonymous 2015-11-14 21:35

>>109,110
What are you doing?

Name: Anonymous 2015-11-14 22:44

>>112
the archive of dis4chan

Name: Anonymous 2015-11-14 22:53

>>113
Like viewing it or something? The archives are just sqlite databases.

Name: Anonymous 2015-11-15 0:43

>>114
yes viewing it with the bibanon viewer

Name: Anonymous 2015-11-15 15:45

>>114
How do you pronounce 'sqlite'?

1. Ess-que-ell-ite
2. See-quel-lite
3. See-quel-ite
4. Squdder (faggot)

Name: Anonymous 2015-11-15 15:48

>>116
sk-light

Name: Anonymous 2015-11-15 18:00

>>116
5. Ess-queue-lite

Name: Anonymous 2015-11-15 20:48

>>9
You forgot ``Flag on the moon... How did it get there?''

Name: Anonymous 2015-11-16 0:13

does anyone have a copy of the poems?

they were hard to get cause they went away fast..

Name: Anonymous 2015-11-16 0:14

Name: Anonymous 2015-11-16 5:23

>>121
It always said That Was VIP Quality! in huge text and then there was a random poem below in italics.

Name: Anonymous 2015-11-16 5:24

That was VIP Quality!
I don't think the w was capitalized actually.
