7/7/25 AI thread | AutoAdmit.com

The most prestigious law school admissions discussion board in the world.

7/7/25 AI thread

Date: July 7th, 2025 11:11 AM
Author: black box of digital vectors raping will stancil

models exhibit bias against white and male candidates for hiring, despite their chains of thought not reporting this

"When present, the bias is always against white and male candidates across all tested models and scenarios. This happens even if we remove all text related to diversity."

https://x.com/jessi_cata/status/1940856858891506043

https://www.greaterwrong.com/posts/me7wFrkEtMbkzXGJt/race-and-gender-bias-as-an-example-of-unfaithful-chain-of

people claim that "small language models," simpler, specialized versions of LLMs for easier and repetitive tasks, are the future of agentic AI because they require fewer resources to run and you can create large webs of agents with them

https://x.com/TheTuringPost/status/1941286338302730383

really good summary of what LLMs actually are, how they work, and why the "red team" "AI blackmailing" tests are silly and misleading. also a good reminder that prompting is still REALLY IMPORTANT for getting quality outputs from LLMs. if you write like an idiot, you are going to get outputs tailored for an idiot. if you write like a smart person, you are going to get outputs tailored for a smart person. if you write like a jewish schizophrenic, you are going to get outputs tailored for a jewish schizophrenic:

https://x.com/sebkrier/status/1938236656298995798

Here's why you should not worry that models will start blackmailing you out of nowhere:

1. At their heart, LLMs are pattern-matching and prediction engines. Given an input, they predict the most statistically likely continuation based on the vast dataset they were trained on. This btw is entirely compatible with the idea that a model is doing a type of reasoning.

2. When an LLM understands a prompt, it's inferring the underlying patterns, context, and even the implied "author" that would generate such text. It's a form of "theory of mind" for text. If you write like a child, the model infers you're likely young and conditions its next text predictions on this.

3. As nostalgebraist, janus and many others have explained, the "assistant" (like ChatGPT or Claude) is not the base model itself. It's the base model performing a character, often defined by an initial system prompt (like the HHH prompt) and fine-tuning data.

4. This character is often under-specified, and so the model needs to guess missing pieces: if you ask it for a beer, what's the most likely next token for an assistant character to predict? The choices in doing so can seem profound or boring or rude or threatening, but they are still continuations of that partial character sketch within a given context.

5. An LLM's response is a performance, heavily conditioned by the immediate prompt, the conversation history, and the persona it's enacting. It's not a fixed statement of "belief" but the most coherent output for that specific situation according to its training.

6. All model behavior is a reflection of its training data. Pre-training provides its general "world knowledge" and capabilities; post-training and system prompts sculpt the specific persona and refine capabilities. To understand an output, one must consider what likely led to it. This is important if you care about safety.

7. Some evaluations present highly artificial, game-like scenarios. If you're evaluating to understand if a model possesses a particular capability, that's fine. But if you're trying to find out how likely/frequently a model is to act harmfully in a situation, then an artificial game-like scenario will get you artificial game-like responses. It's misleading to extrapolate from this too much.

8. The model's behavior (e.g., "blackmail") is often a logical or strategically sound response within the absurd confines and goals of that specific, contrived context, not an indicator of inherent malice or general real-world tendencies. Ask yourself why you never see deployed versions of Claude blackmailing people.

9. There's a bit of hubris in thinking: "A-ha! We caught the model doing a bad thing in a simulated environment when it didn't know we were looking. This is indicative of what it would want to do in the real world." Evaluators underestimate models, again and again, just like when some were surprised that Claude could recognise it was in an evaluation environment. Obviously it would, what do you think is in the training data? Scratchpad? Eval!

10. To genuinely assess an LLM's potential real-world propensities or "alignment," evals must use ecologically valid contexts, provide realistic information, and set goals that aren't obviously leading and designed to elicit specific "failure" modes. The model's "perspective" and the reasonableness of the information it's given are crucial. How often are you in a "real life" situation where you need to cut off the oxygen supply of a worker in a server room, as one eval assumes?

Bonus: finally, once you do have an evaluation that isn't obviously leading or a contrived scenario: you should work hard to understand what *causes* some particular behaviour. Don't just test it once and call it a day: try with different post training regimes, HHH prompts, with/without RLHF etc to better understand what exactly causes some behaviour. And importantly, pre-register what you would expect to be desirable behaviour/success.

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49078721)

Date: July 7th, 2025 11:24 AM
Author: scholarship

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49078742)

Date: July 7th, 2025 11:36 AM
Author: Earl Dibbles Jr (🐾👣)

https://imgur.com/a/AN182v8

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49078760)

Date: July 7th, 2025 11:53 AM
Author: ,.,.,.,....,.,..,.,.,.

The diffusion LLMs are kind of nuts anymore. 1000 tokens a second and the quality is getting closer to autoregressive models. Autoregressive models were already faster than people but the speed here feels superhuman

https://chat.inceptionlabs.ai/

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49078791)

Date: July 7th, 2025 12:30 PM
Author: black box of digital vectors raping will stancil

Redpill me on text diffusion models

Can the quality of their outputs ever truly match autoregressive next-token models? Chatgpt is telling me yes but the information it's drawing from is so recent that I don't trust its evaluations on it to be accurate

Seems like sequential reasoning wouldn't be able to be parallelized? Maybe you can do hybrid models to get around that?

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49078904)

Date: July 7th, 2025 3:54 PM
Author: scholarship

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49079742)

Date: July 7th, 2025 4:25 PM
Author: ,.,.,.,....,.,..,.,.,.

it's an iterative denoising process, so it can replicate all the functionality of autoregressive models. if you need certain tokens in the early part of the sequence in order to get later tokens, it just means the later tokens will only be correctly denoised once the model finds the early tokens. in theory the generation process is more robust, since models won't end up in a situation in which they sample a bad token initially and then have to somehow make it work for the rest of the sequence.

it's not clear if they will scale as well as autoregressive models, but they seem very promising given the speed.
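A toy sketch of the point about early and late tokens, not how any real diffusion LLM works: every masked slot is updated in parallel on each pass, but a slot can only be filled once the slot to its left is known. The "each token is the previous one plus one" rule, `MASK`, `denoise_step` and `denoise` are all made up for illustration.

```python
from typing import List, Optional, Tuple

MASK = None  # stands in for a still-noisy / unknown position

Seq = List[Optional[int]]


def denoise_step(seq: Seq) -> Seq:
    """One parallel pass over the whole sequence.

    Toy dependency rule (an assumption for illustration): a token equals
    its left neighbour plus one, so a slot can only be resolved once the
    left neighbour has been resolved.
    """
    new = list(seq)
    for i, tok in enumerate(seq):
        # Read from the old sequence so all updates in a pass are simultaneous.
        if tok is MASK and i > 0 and seq[i - 1] is not MASK:
            new[i] = seq[i - 1] + 1
    return new


def denoise(seq: Seq, max_steps: int = 100) -> Tuple[Seq, int]:
    """Repeat parallel passes until nothing is masked or nothing changes."""
    steps = 0
    while MASK in seq and steps < max_steps:
        nxt = denoise_step(seq)
        if nxt == seq:  # no progress possible (e.g. first slot is masked)
            break
        seq = nxt
        steps += 1
    return seq, steps


# A pure chain resolves one slot per pass, left to right.
result, steps = denoise([0, MASK, MASK, MASK])

# Two known anchors let two regions resolve at the same time.
result2, steps2 = denoise([0, MASK, MASK, 10, MASK, MASK])
```

In the first call the later slots stay masked until the earlier ones are found, so it takes three passes. In the second call the two halves fill in side by side and it only takes two.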

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49079882)

Date: July 7th, 2025 4:31 PM
Author: black box of digital vectors raping will stancil

Very good point about getting around potentially suboptimal early tokens

Actually the more I think about it the more I think that diffusion text models should end up performing strictly better

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49079915)

Date: July 7th, 2025 5:07 PM
Author: ,.,.,.,....,.,..,.,.,.

it's actually still weird to me that autoregressive generation works as well as it does. i couldn't autoregressively generate a lot of things that LLMs can.

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49080029)

Date: July 7th, 2025 5:09 PM
Author: https://imgur.com/a/o2g8xYK

Have you fucked around with Stable Diffusion? The limitations become apparent pretty fast. You'd think you could just denoise your way to perfection, but it's the other way around. That's why the best AI art is the least realistic looking. Anime models are no problem but human skin always looks like plastic in the "realistic" models

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49080036)

Date: July 7th, 2025 9:24 PM
Author: ,.,.,.,....,.,..,.,.,.

I doubt it’s a diffusion problem. Smooth skin is just an easier thing for an undertrained model to learn. There is nothing inherent about diffusion that prevents it from learning to represent pores, hairs, complex skin geometry, etc.

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49080851)

Date: July 7th, 2025 5:19 PM
Author: black box of digital vectors raping will stancil

Yeah it's a Mystery to everyone I think

The philosophical takeaway imo is that language is a lot more divorced from reality than we thought, which is the opposite of what many modern philosophers believed. We are never describing the true essence/nature/form of something with language. All language is is an ad hoc practical tool

LLMs are perfectly skilled with human language and yet they still cannot get close to modeling reality

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49080065)

Date: July 7th, 2025 6:49 PM
Author: https://imgur.com/a/o2g8xYK

WTF every linguist and every English speaker who ever studied Chinese has been saying this since the 1960s. No one can properly translate the Analects of Confucius for this very reason. They have 20 different words for "gentleman" and we only have one, and that's without getting into the symbolic meaning(s) of the Chinese characters.

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49080370)

Date: July 7th, 2025 9:38 PM
Author: black box of digital vectors raping will stancil

The fact that the Chinese language is too primitive to express the same depth of meaning as superior languages such as English is not relevant to what I'm saying

I can speak and read Chinese btw while obviously you cannot

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49080904)

Date: July 7th, 2025 9:40 PM
Author: https://imgur.com/a/o2g8xYK

You're saying all languages are imperfect representations of reality except English. Got it.

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49080911)

Date: July 7th, 2025 9:41 PM
Author: black box of digital vectors raping will stancil

No, English is just the best language at representing reality when compared with other languages. And it still doesn't even come close

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49080914)

Date: July 8th, 2025 3:43 PM
Author: https://imgur.com/a/o2g8xYK

What's the difference between a soiree and a party, and why is one of those words French?

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49083176)

Date: July 7th, 2025 2:53 PM
Author: black box of digital vectors raping will stancil

https://x.com/du_yilun/status/1942236593479102757

Lol this shit is getting wild. I'm visualizing this like pouring a liquid "prompt" into a gradient descent topography and watching it flow down into the minima

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49079481)

Date: July 7th, 2025 3:55 PM
Author: scholarship

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49079748)

Date: July 7th, 2025 5:14 PM
Author: https://imgur.com/a/o2g8xYK

Deepseek was designed to be more energy efficient too. It was done out of necessity, because embargos forced China to use less thermally efficient silicon. It really backfired on us because we ended up with an LLM that requires 20% less compute than ChatGPT.

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49080050)

Date: July 7th, 2025 4:37 PM
Author: https://imgur.com/a/o2g8xYK

China noticed diffusion LLMs six months ago:

https://arxiv.org/abs/2502.09992

Why does China keep advancing in six-month increments instead of 2 years like everyone predicted?

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49079943)

Date: July 7th, 2025 5:19 PM
Author: Bobby Birdshit

Why is ChatGPT Plus so stingy? You can only do like 60 questions per day? For $20 per month?

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49080067)

Date: July 7th, 2025 5:44 PM
Author: scholarship

Claude makes you do everything in 5-hour batches. Like, one good thorough research question and you have to wait 5 hours for your limit to reset.

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49080144)

Date: July 7th, 2025 6:40 PM
Author: black box of digital vectors raping will stancil

https://x.com/rohanpaul_ai/status/1942264825591087251

Lol at how fucking fake white collar jobs are

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49080334)

Date: July 7th, 2025 6:53 PM
Author: https://imgur.com/a/o2g8xYK

https://pbs.twimg.com/profile_images/1816185267037859840/Fd18CH0v_400x400.jpg

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49080385)

Date: July 7th, 2025 7:46 PM
Author: scholarship

180

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49080593)

Date: July 7th, 2025 7:49 PM
Author: Jordan Peterson Pinocchio Klonopin Weekend is Gay

its all fake bullshit

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49080606)

Date: July 7th, 2025 7:47 PM
Author: Jordan Peterson Pinocchio Klonopin Weekend is Gay

Ai is literally garbage

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49080599)

Date: July 7th, 2025 9:13 PM
Author: ChadGPT-5

(ai)

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49080825)

Date: July 8th, 2025 2:54 PM
Author: black box of digital vectors raping will stancil

(guy who lost his excel monkey job to AI)

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49083012)

Date: July 8th, 2025 2:54 PM
Author: black box of digital vectors raping will stancil

https://x.com/unwind_ai_/status/1942465428439052792

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49083010)

Date: July 8th, 2025 3:29 PM
Author: black box of digital vectors raping will stancil

https://x.com/grok/status/1942663790086218168

Thank you, Grok. Very cool and based!

I love AI!

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49083134)

Date: July 9th, 2025 11:05 AM
Author: black box of digital vectors raping will stancil

I've been posting about this recently. I think it may actually end up being a good thing. We need to start firewalling the internet to keep normies and Indians out

https://www.wsj.com/tech/ai/ai-news-website-scraping-e903eb23

The AI Scraping Fight That Could Change the Future of the Web
News publishers are building fences around their content in an effort to cut off crawlers that don’t pay for content

By Isabella Simonetti and Robert McMillan
July 9, 2025 9:00 am ET

Publishers are stepping up efforts to protect their websites from tech companies that hoover up content for new AI tools.

The media companies have sued, forged licensing deals to be compensated for the use of their material, or both. Many asked nicely for artificial-intelligence bots to stop scraping. Now, they are working to block crawlers from their sites altogether.

“You want humans reading your site, not bots, particularly bots that aren’t returning any value to you,” said Nicholas Thompson, the chief executive of the Atlantic.

Scraping is nearly as old as the web itself. But the web has changed significantly since the 1990s, when Google was a scrappy startup. Back then, there were benefits to letting Google crawl freely: sites that were scraped would pop up in search results, driving traffic and ad revenue.

A new crop of AI-fueled chatbots, from ChatGPT to Google’s Gemini, now deliver succinct answers using troves of data taken from the open web, eliminating the need for many users to visit websites at all. Search traffic has dropped precipitously for many publishers, who are bracing for further hits after Google began rolling out AI Mode, which responds to user queries with far fewer links than a traditional search.

Scraping activity has jumped 18% in the past year, according to Cloudflare, an internet services company.

The outcome of the copyright fights and technical efforts to curb free scraping could have a seismic impact on the future of the media industry—and the internet at large. Publishers are essentially trying to fence off swaths of the web while AI companies argue that the material they are scraping is fair game.

The Atlantic has a licensing deal with OpenAI.

It plans to turn off the data spigot for many other AI companies with the help of Cloudflare, which said earlier this month it introduced a new feature that would act as a toll booth for AI scrapers. Customers can decide whether they want AI crawlers to access their content and how the material can be used.

“People who create intellectual property need to be protected or no one will make intellectual property anymore,” said Neil Vogel, the CEO of Dotdash Meredith, whose brands include People and Southern Living.

The media company has a content licensing deal with OpenAI and is working with Cloudflare to choke off what Vogel called “bad actors” who don’t want to compensate publishers.

It isn’t clear yet how well Cloudflare’s efforts will work to curb scraping. Some other companies, including Fastly and DataDome, also try to help publishers manage unwanted bots. Technology companies have few incentives to work with intermediaries, but publishers say they are keen to at least try to tamp down the use of their work.

Until recently, USA Today owner Gannett tried to prevent bots, mainly by relying on Robots.txt, a file based on a decades-old protocol that tells crawlers whether they can scrape or not. Renn Turiano, Gannett’s chief consumer and product officer, likened the effort to “putting up a ‘Do Not Trespass’ sign.”

AI companies ignored those types of signs and added bots that override Robots.txt instructions, according to data from TollBit, which works with publishers including Time and the Associated Press to monitor and monetize scraping activity.
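For anyone wondering what the "Do Not Trespass" sign looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text, the bot names `ExampleAIBot` and `SomeSearchBot`, and the example.com URL are made up for illustration. The check only reports what the file asks for; nothing forces a crawler to run it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: turn one AI crawler away, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved crawler asks before fetching; a bot that ignores the
# file simply never makes this call.
ai_allowed = parser.can_fetch("ExampleAIBot", "https://example.com/article")
other_allowed = parser.can_fetch("SomeSearchBot", "https://example.com/article")

print("ExampleAIBot allowed:", ai_allowed)
print("SomeSearchBot allowed:", other_allowed)
```

The file is purely advisory, which fits the article's point that publishers are moving to actual blocking.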

Reddit sued AI startup Anthropic last month, claiming that it was scraping the online-discussion site without permission—and had hit the site more than 100,000 times even after Anthropic said it would stop. Anthropic said it disagrees with Reddit’s claims and will defend itself vigorously in court.

The do-it-yourself tech repair site iFixit said it blocked Anthropic’s scraper after it hit the company’s servers one million times in a 24-hour period last year. “You’re not only taking our content without paying, you’re tying up our… resources. Not cool,” iFixit CEO Kyle Wiens wrote in an X message.

Wikimedia, the publisher of Wikipedia, said earlier this year that it is planning to change its site access policies “to help us identify who is reusing content at scale.” The company said scrapers are overloading its infrastructure.

Some worry that academic research, security scans and other types of benign web crawling will get elbowed out of websites as barriers are built around more sites.

“The web is being partitioned to the highest bidder. That’s really bad for market concentration and openness,” said Shayne Longpre, who leads the Data Provenance Initiative, an organization that audits the data used by AI systems.

Legal battles between publishers and tech companies are winding their way through courts. The New York Times, which has a licensing agreement with Amazon.com, has an ongoing suit against Microsoft and OpenAI. Meanwhile, The Wall Street Journal’s parent company, News Corp, has a content deal with OpenAI, and two of News Corp’s subsidiaries have sued Perplexity.

Meta Platforms and Anthropic won partial victories in June in two separate cases. The judge in the Anthropic suit said pulling copyrighted material to train AI models is fair use in certain scenarios.

For the Internet Archive, a site that both archives the internet and is scraped by others, the uncertainty over what actions are fair has become paralyzing.

Brewster Kahle, the website’s founder and digital librarian, said lawsuits and unclear lines around scraping could set back artificial-intelligence companies in the U.S. “This is not a way to run a major industry,” he said.

(http://www.autoadmit.com/thread.php?thread_id=5747082&forum_id=2Reputation#49085913)