Gary Marcus reply to SSC on scaling hypothesis
Date: June 22nd, 2022 9:52 PM Author: talented business firm
>They found that at least a 5-layer neural network is needed to simulate the behavior of a single biological neuron. That’s around 1000 artificial neurons for each biological neuron.
>human brain has 100 billion neurons
>GPT-4 will have 100 trillion neurons
interesting math
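a rough sanity check of the quoted arithmetic (the ~1000x figure is an assumption taken straight from the quoted claim about deep nets fitting a single cortical neuron's behavior):

```python
# Back-of-the-envelope check of the quoted claim. The 1000x ratio is
# an assumption from the quote, not an established constant.
biological_neurons = 100e9           # ~10^11 neurons in a human brain
artificial_per_biological = 1_000    # from the quoted single-neuron result

equivalent_units = biological_neurons * artificial_per_biological
print(f"{equivalent_units:.0e}")     # 1e+14, i.e. 100 trillion
```

note this counts artificial "neurons," not parameters, which is where these comparisons usually go sideways.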
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44726205) |
Date: June 22nd, 2022 10:19 PM Author: Swashbuckling dilemma
i would be willing to bet that GPT-4 is nowhere near 100 trillion parameters. i would be shocked if it's above 15 trillion parameters.
there are several reasons for this. a dense transformer of that size would cost tens or hundreds of billions of dollars to train. no indication of budgets of that size yet.
the Chinchilla scaling laws also imply that GPT-3 was very inefficiently trained. they should have trained a smaller model on many more tokens. Chinchilla beats GPT-3 with far fewer parameters because it used more data. i would be surprised if OpenAI didn't realize that before Chinchilla was released, which means they will be pumping much more of their computing power into running more data through the network instead of just increasing parameter count. a 1 trillion parameter GPT trained according to the Chinchilla scaling laws would be much better than even PaLM.
there are also other avenues for improvement that they haven't explored besides parameter scaling. recurrent transformer architectures with larger context windows. language models that learn how to plan (imagine a MuZero style training for GPT-3, where it bootstraps itself into learning how to plan how to write good text).
there are tons of avenues for improvement that would likely be more efficient than just making GPT-4 gigantic.
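a sketch of that compute-optimal allocation, using the popular rules of thumb (C ≈ 6·N·D training FLOPs, and D ≈ 20·N tokens at the optimum; the paper's fitted coefficients differ slightly, so treat this as illustrative only):

```python
# Rough Chinchilla-style allocation sketch. Uses the rules of thumb
# C ~= 6*N*D FLOPs and D_opt ~= 20*N_opt tokens; the exact fitted
# exponents/coefficients in the Chinchilla paper differ a bit.

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that roughly exhaust a FLOP budget."""
    # Solve C = 6 * N * (20 * N)  =>  N = sqrt(C / 120)
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# GPT-3 used roughly 3.14e23 FLOPs for 175B params on ~300B tokens.
n, d = chinchilla_optimal(3.14e23)
print(f"params ~ {n:.1e}, tokens ~ {d:.1e}")
```

at GPT-3's compute budget this rule of thumb prefers a ~50B parameter model on ~1T tokens, which is the whole point: smaller model, much more data.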
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44726290) |
Date: June 22nd, 2022 10:49 PM Author: white sex offender
gary marcus is a fraud.
a friend of mine was his doctoral student. and i have a PhD in ML. he has never done any meaningful research in the field.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44726485) |
Date: June 23rd, 2022 10:08 AM Author: Swashbuckling dilemma
there are people there that totally buy into the Eliezer kool aid, but you'll find a lot of pushback against him. here is a recent thread from a former OpenAI researcher that criticizes some of his views.
https://www.lesswrong.com/posts/CoZhXrhpQxpy9xw9y/where-i-agree-and-disagree-with-eliezer
i'm not so sure the religious argument works in their case. it makes more sense if you think about the Kurzweil types that envision positive outcomes. most of that community is pessimistic about the prospect of alignment and doesn't find any comfort in these ideas.
the scaling hypothesis doesn't directly say society is going to be run by machines in a few years, but if you think simple neural circuitry trained at the size of the brain will yield human level cognition, i think it's very likely AGI is near. i can sort of imagine scenarios in which that doesn't happen (no one is willing to invest billions to do very large scale training runs), but they seem somewhat far fetched.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44727922) |
Date: June 23rd, 2022 8:14 AM Author: Big sickened site
A challenge for AI is that much of human society, and hence our various outputs & training data, are built on lying.
See generally The Elephant in the Brain.
Thus, for example, medicine is not about medicine, by and large, but rather palliating anxiety, which is why medical costs, but not the quality of care, increase monotonically with wealth.
The law is not about producing just outcomes, but rather about producing determinate outcomes while channelizing unproductive people who can't cooperate to an extractive system.
People create art because they do not stand 6'3".
Etc., etc.
People gain implicit knowledge of such things because they operate as agents with needs, drives, and goals in a world of limited resources, cause and effect, & other agents.
I am very interested to see what emerges when some of the new AI are put into something like the old ABM models but with moar power.
Becoming increasingly plausible, cf. e.g.,
https://iopscience.iop.org/article/10.1088/1741-2552/ac6ca7/meta
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44727515) |
Date: June 23rd, 2022 8:18 AM Author: Big sickened site
[also, fairly unrelated & more speculative but check this out -- Freeman Dyson would've been interested I think. I had not appreciated the extent to which quantum theory just is a theory of empiricism and its limits. Peirce was truly way ahead of his time with tychism + objective chance.]
***
Abstract The Free Energy Principle (FEP) states that under suitable conditions of weak coupling, random dynamical systems with sufficient degrees of freedom will behave so as to minimize an upper bound, formalized as a variational free energy, on surprisal (a.k.a., self-information). This upper bound can be read as a Bayesian prediction error. Equivalently, its negative is a lower bound on Bayesian model evidence (a.k.a., marginal likelihood). In short, certain random dynamical systems evince a kind of self-evidencing. Here, we reformulate the FEP in the formal setting of spacetime-background free, scale-free quantum information theory. We show how generic quantum systems can be regarded as observers, which with the standard freedom of choice assumption become agents capable of assigning semantics to observational outcomes. We show how such agents minimize Bayesian prediction error in environments characterized by uncertainty, insufficient learning, and quantum contextuality. We show that in its quantum-theoretic formulation, the FEP is asymptotically equivalent to the Principle of Unitarity. Based on these results, we suggest that biological systems employ quantum coherence as a computational resource and – implicitly – as a communication resource.
Indeed while quantum theory was originally developed – and is still widely regarded – as a theory specifically applicable at the atomic scale and below, since the pioneering work of Wheeler [24], Feynman [25], and Deutsch [26], it has, over the past few decades, been reformulated as a scale-free information theory [27, 28, 29, 30, 31, 32] and is increasingly viewed as a theory of the process of observation itself [33, 34, 35, 36, 37, 38]. This newer understanding of quantum theory fits comfortably with the generalization of the FEP, and hence of self-evidencing and active inference, to all “things” as outlined in [10], and with the general view of observation under uncertainty as inference. In what follows, we take the natural next step from [10], formulating the FEP as a generic principle of quantum information theory. We show, in particular, that the FEP emerges naturally in any setting in which an “agent” or “particle” deploys quantum reference frames (QRFs), namely, physical systems that give observational outcomes an operational semantics [39, 40], to identify and characterize the states of other systems in its environment. This reformulation removes two central assumptions of the formulation in terms of random dynamical systems employed in [10]: the assumption of a spacetime embedding (or “background” in quantum-theoretic language) and the assumption of “objective” or observer-independent randomness. It further reveals a deep relationship between the ideas of local ergodicity and system identifiability, and hence the idea of “thingness” highlighted in [10], and the quantum-theoretic idea of separability, i.e., the absence of quantum entanglement, between physical systems.
https://www.sciencedirect.com/science/article/abs/pii/S0079610722000517
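[For reference, the "upper bound on surprisal" the abstract invokes is just the standard variational evidence decomposition; spelled out, with q(s) an approximate posterior over hidden states s and o the observations:]

```latex
F[q] = \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big]
     = \underbrace{D_{\mathrm{KL}}\big[q(s)\,\|\,p(s \mid o)\big]}_{\geq 0} - \ln p(o)
     \;\geq\; -\ln p(o)
```

[So minimizing F minimizes an upper bound on surprisal, −ln p(o), which is the same as maximizing a lower bound on model evidence — the "self-evidencing" in the abstract.]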
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44727524) |
Date: June 25th, 2022 12:28 PM Author: Swashbuckling dilemma
this is worth listening to:
https://theinsideview.ai/raphael
good criticism of naive scaling as a solution to general intelligence. this seems better thought through than what I have seen from Marcus
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44742063) |
Date: June 25th, 2022 12:50 PM Author: Big sickened site
Many thanks! this looks great. I am pasting the transcript below to peruse later (& for ne1 else interested).
One thing I have noticed about GPT-3/raphael is that it is very talented at mimicking "voice" and "tone," but has no conceptual understanding. Very unusual, since usually an adept at aping voice / tone is extremely well-steeped in a genre & understands all of the ins and outs of plot and the micro- and macro system of cause & effect. It's as if it has a higher-order Williams Syndrome.
* * *
TRANSCRIPT OF https://theinsideview.ai/raphael
Raphaël: And I just posted some examples on Twitter of hybrid animals, testing Dall-E 2 on combining two different, very different animals together, like a hippopotamus octopus or things like that. And it’s very good at combining these concepts together, and doing so even in a way that demonstrates some minimal comprehension of world knowledge, in the sense that it combines the concepts not just by haphazardly throwing together features of a hippopotamus and features of an octopus, or features of a chair and an avocado, but combining them in a plausible way that’s consistent with how a chair looks and would behave in the real world, or things like that. So those are examples that are mostly semantic composition, because it’s mostly about the semantic content of each concept combined together with minimal syntactic structure. The realm in which current text to image generation models seem to struggle more right now is with respect to examples of compositionality that have a more sophisticated syntactic structure. So one good example from the Dall-E 2 paper is prompting a model to generate a red cube on top of a blue cube.
Raphaël: What that example introduces compared to the conceptual blending examples I’ve given is what people call in psychology, variable binding. You need to bind the property of being blue to a cube that’s on top and the property of being red to the cube that’s…I think I got it the other way around. So red to the cube that’s on top and blue to the cube that’s at the bottom, and a model like Dall-E 2 is not well suited for that kind of thing. And that’s, we could talk about this, but that’s also an artifact of its architecture, because it leverages the text encodings of CLIP, which is trained by contrastive learning. And so when training CLIP, it’s only trying to maximize the distance between text image pairs that are not matching and minimize the distance between text image pairs that are matching, where the text is the right caption for the image.
Raphaël: And so through that contrastive learning procedure, it’s only keeping information about the text that is useful for this kind of task. So a lot of this can be done without modeling closely the syntactic structure of the prompts or the captions, because unless we adversarially designed a new data set for CLIP that would include a lot of unusual compositional examples, like a horse riding an astronaut, and various examples of blue cubes and red cubes on top of one another. Given the kind of training data that CLIP has, especially for Dall-E 2 stuck [inaudible 02:10:20] and stuff like that, you don’t really need to represent rich compositional information like that to train CLIP, and hence the limitations of Dall-E 2. Imagen does better at this, because it uses a frozen language model, T5-XL I think, and we know that language models do capture rich compositional information.
Michaël: Do you know how it uses the language model?
Raphaël: I haven’t done a deep dive into the Imagen paper yet, but it’s using a frozen T5 model to encode the prompts. And then it has some kind of other component that translates these prompts into image embeddings, and then does some gradual upscaling. So there is some kind of multimodal diffusion model that takes the T5 embedding and is trained to translate that into image embedding space. I don’t exactly remember how it does that, but I think the key part here is that the initial text embedding is not the result of contrastive learning, unlike the CLIP model that’s used for Dall-E.
Michaël: Gotcha. Yeah. I agree that…yeah. From the images you have online, you don’t have a bunch of a red cube on top of a blue on top of a green one, and it’s easy to find counter examples that are very far from the training distribution.
Raphaël’s Experience with DALLE-2
Michaël: I’m curious about your experience with Dall-E because you’ve been talking about Dall-E before you had access. And I think in the recent weeks you’ve gained the API access. So have you updated on how good it is or AI progress in general, just from playing with it and being able to see the results from octopus, I don’t know how you call it.
Raphaël: Yeah. I mean, to be honest, I think I had a fairly good idea of what Dall-E could and couldn’t do before I got access to it. And there’s nothing that I generated that kind of made me massively update that prior I had. So again, it’s very good at simple conceptual combination. It can also do fairly well some simple forms of more syntactically structured composition. So if you ask it for, I don’t know, one prompt that I tested it on, that was great. Quite funny is an angry Parisian holding a baguette. So an angry Parisian holding a baguette digital art. Basically every output is spot on. So it’s like a picture of an angry man with a beret holding a baguette, right? So this kind of simple compositional structure is doing really well at it. That’s already very impressive in my book.
Raphaël: So I was pushing back initially against some of the claims from Gary Marcus precisely on that, around the time of the whole deep-learning-is-hitting-a-wall stuff. He was emphasizing that deep learning, as he would put it, fails at compositionality. I think first of all, that’s a vague claim, because there are various things it could mean depending on how you understand compositionality, and what I pointed out in my reaction to that is that really the claim that is actually warranted by the evidence is that there are failure cases with current deep learning models, with all current deep learning models, at parsing compositionally structured inputs. So there are cases in which they fail. That’s true, especially the very convoluted examples that Gary has been testing Dall-E 2 on, like a blue tube on top of a red pyramid next to a green triangle or whatever. When you get to a certain level of complexity, even humans struggle.
Raphaël: If I ask you to draw that, and I didn’t repeat the prompt, I just gave it to you once, you probably would make mistakes. The difference is that we humans can go back and look at the prompt and break it down into sub-components. And that’s actually something I’m very curious about. I think a low hanging fruit for research on these models would be to do something a little similar to chain of thought prompting, but with text to image models instead of just language models. So with chain of thought prompting, or the Scratchpads paper for language models, you see that you can get remarkable improvements in in-context learning when, in your few-shot examples, you give examples of breaking down the problem into sub-steps.
Michaël: Let’s think about this problem step by step.
Raphaël: Yeah, yeah, exactly. And so, well, actually the “Let’s think about this step by step” stuff was slightly debunked in my view, by a blog post that just came out. Who did that? I think someone from MIT. I could send you the link, but someone who tried a whole bunch of different tricks for prompt engineering and found that at least with arithmetic, the only really efficient one is to do careful chain of thoughts prompting, where you really break down each step of the problem. Whereas just appending, let’s think step by step wasn’t really improving the accuracy. So there are some, perhaps some replication concerns with “Let’s think step by step.”
Raphaël: But if you do spell out all the different steps in your examples of the solution, then the model will do better. And I do think that perhaps in the near future, someone might be able to do this with text to image generation where you break down the prompt into first let’s draw a blue triangle, then let’s add a green cube on top of the blue triangle and so on.
Raphaël: And maybe if you can do it this way, you can get around some of the limitations of current models.
Michaël: Isn’t that already something, a feature of the Dall-E API? At least on the blog post, they have something where they have a flamingo that you can add it to be, remove it or move it to the right or left.
Raphaël: Yeah. So you can do inpainting and you can gradually iterate, but that’s not something that’s done automatically. What I’m thinking about would be a model that learns to do this, similarly to chain of thought prompting. So there is a model that just came out a few days ago that I tweeted about that does something a little bit different, but along the same broad lines. It’s breaking down the compositional prompts of the diffusion models into distinct prompts, and then has this compositional diffusion model that has compositional operators like “and”. For example, if you want a blue cube and a red cube, it will first generate embeddings for a blue cube and for a red cube, and then it will use a compositional operator to combine these two embeddings together.
Raphaël: So it’s kind of like hard coding compositional operations into the architecture. And my intuition is that this is not the right solution for the long term, because, again, the bitter lesson, blah, blah, blah, you don’t want to hard code too much in the architecture of your model. And I think you can learn that stuff with the right architecture. And we see that in language models, for example: you don’t need to hard code any syntactic structure, any knowledge of grammar, in language models. So I think you don’t need to do it either for vision language models, but in the short term, it seems to be working better than Dall-E 2 if you do it this way.
Michaël: Right, so you split your sentence with the “and” and then you combine those embeddings to generate the image. I think, yeah, as you said, the general solution is probably as difficult as solving the understanding of language, because you would need to see in general how the different objects in a sentence relate to each other, and so splitting it effectively would require a deeper understanding.
The Future of Image Generation
Michaël: I’m curious, what do you think would be kind of the new innovation? So imagine we’re in 2024, or even 2023, and Gary Marcus is complaining about something on Twitter. Because for me, Dall-E was not very high resolution, the first one, and then we got Dall-E 2 that couldn’t generate text or, you know, faces, or maybe that’s something from the API, not really an AI problem. And then Imagen came along and did something much more photorealistic that could generate text.
Michaël: And of course there are some problems you mentioned, but do you think in 2023 we would just work on those compositionality problems one by one, and we would get three objects, blue on top of red on top of green, or would it be something very different? Yeah, I guess there are some long tail problems in fully solving the problem of generating images, but I don’t see what it would look like. Would it be just Imagen a little bit different, or something completely different?
Raphaël: So my intuition is that yeah, these models will keep getting better and better at this kind of compositional task. And I think it’s going to happen probably gradually, just like language models have been getting better and better at arithmetic, first doing two digit operations and then three digit, and with PaLM perhaps more than that, or with the right chain of thought prompting more than that, but it still hits a ceiling and you get diminishing returns, and that will remain the case as long as we can’t find a way to basically approximate some form of symbolic-like reasoning in these models with things like variable binding. So I’m very interested in current efforts to augment transformers with things like episodic memory, where you can store things that start looking like variables and do some operations.
Raphaël: And then have it do read and write operations. To some extent, the work that’s been done by the team at Anthropic led by Chris Olah, with people like [inaudible 02:21:37], which I think is really fantastic, is already shedding light on how transformers, just vanilla transformers, in fact toy models without MLP layers, so attention-only transformers, can have some kind of implicit memory where they can store and retrieve information and do read and write operations in subspaces of the model. But I think to move beyond the gradual improvement that we’ve seen from language models for tasks such as mathematical reasoning, to something that can perform these operations more reliably, and in a way that generalizes better, for arbitrary digits for example, we probably need some form of modification of the architecture that enables more robust forms of variable binding and manipulation in a fully differentiable architecture.
Raphaël: Now, if I knew exactly what form that would take, then I would be founding the next startup that gets $600 million in Series B, or maybe I would just open source it. I don’t know, but in any case I would be famous. So I don’t know exactly what form that would take. I know there is a lot of exciting work on somehow augmenting transformers with memory. There’s some stuff from the Schmidhuber lab recently on fast weight transformers that looks exciting to me, but I haven’t done a deep dive yet. So I’m expecting a lot of research on that stuff in the coming year. And maybe then we’ll get a discontinuous improvement of text to image models too, where instead of gradually being able to do three objects, a red cube on top of a blue cube, and then four objects, and gradually like that, all of a sudden we’d get to arbitrary compositions. I’m not excluding that.
Conclusion
Michaël: As you said, if you knew what the future would look like, you would be founding a Series B startup in Silicon Valley, not talking on a podcast. Yeah, I think this is an amazing conclusion, because it opens a window onto what is going to happen next. And, yeah, thanks for being on the podcast. I hope people will read all your tweets, all the threads on compositionality, Dall-E, GPT-3, because I personally learned a lot from them. Do you want to give a quick shout out to your Twitter account or a website or something?
Raphaël: Sure. You can follow me at @Raphaelmilliere on Twitter. That’s Raphael with PH, the French way, and my last name Milliere, M, I, L, L, I, E, R, E. You can follow my publications on raphaelmilliere.com. And I just want to quickly mention this event that I’m organizing with Gary Marcus at the end of the month, because it might interest some people who enjoyed the conversation on compositionality.
Raphaël: So basically I’ve been disagreeing with Gary on Twitter about how extensive the limitations of current models are with respect to compositionality. And there’s something that I really like, a model of collaboration that emerged initially in economics but has been applied to other fields in science, called adversarial collaboration, which involves collaborating with people you disagree with to try to have productive disagreements and settle things with falsifiable predictions and things like that. So in this spirit of adversarial collaboration, instead of…I think Twitter amplifies disagreements rather than allowing reasonable, productive discussions. I suggested to Gary that we organize together a workshop, inviting a bunch of experts in compositionality and AI to try to work these questions out together. He was enthusiastic about this, and we organized this event online at the end of the month; it’s free to attend. You can register at compositionalintelligence.github.io.
Raphaël: And yeah, if you’re interested in that stuff, please do join the workshop. It should be fun. And thanks for having me on the podcast. That was a blast.
Michaël: Yeah, sure. I will definitely join. I will add a link below. I can’t wait to see you and Gary disagree on things and make predictions and yeah. See you around.
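[The compositional “and” operator discussed in the transcript — summing per-concept guidance directions, as in composable diffusion models — can be sketched roughly like this. `denoise` here is a dummy stand-in for a trained model’s noise predictor, not a real API:]

```python
import numpy as np

# Rough sketch of the compositional "and" operator for diffusion models:
# combine per-concept classifier-free-guidance directions instead of
# encoding the whole compositional prompt at once. `denoise` is a dummy
# stand-in for a trained noise predictor eps(x, t, prompt).

def denoise(x, prompt):
    seed = 0 if prompt is None else abs(hash(prompt)) % 2**32
    rng = np.random.default_rng(seed)
    return 0.1 * x + 0.01 * rng.standard_normal(x.shape)

def composed_eps(x, prompts, w=7.5):
    """eps_uncond + sum_i w * (eps_i - eps_uncond), one term per concept."""
    eps_uncond = denoise(x, None)
    eps = eps_uncond.copy()
    for p in prompts:
        eps += w * (denoise(x, p) - eps_uncond)
    return eps

x = np.zeros((8, 8))
eps = composed_eps(x, ["a blue cube", "a red cube on top of it"])
print(eps.shape)  # (8, 8)
```

[Hard coding the operator this way is exactly the "bitter lesson" worry raised above: it works better today, but it bakes structure into the architecture rather than learning it.]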
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44742150) |
Date: June 25th, 2022 1:16 PM Author: stirring brilliant background story
“Gary Marcus doesn’t contribute to the field” is an ad hominem diversion that fails to address the substance of his arguments. He represents the “East Pole” of cognitive science, with its focus on modularity and symbolic reasoning, and criticism of AI/ML from an adjacent field is fair game. You will find similar critiques from active and well respected practitioners like Brenden Lake and Joshua Tenenbaum (if not as vocal or adversarial).
In general the field is populated with intelligent people with no philosophical training. Among its brightest lights you will not find anything close to the depth or care of a Fodor or Pylyshyn. They suffer a serious lack of perspective. Francois Chollet is the best among them, and he is a skeptic.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44742312) |
Date: June 25th, 2022 1:42 PM Author: Big sickened site
TY for your response.
But query, what if Chomsky & co. are just wrong with UG and the connectionists and behaviorists [well, as modified into enactivists] were right the whole time? It's an empirical question.
{I have a whole intricate view on all of this, developed largely independently of the ML/AI foofaraw, that I'm too tired to expand on now.}
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44742515) |
|
Date: June 25th, 2022 2:16 PM Author: Big sickened site
There's a lot of interesting work on this that goes far beyond the old poverty-of-stimulus debates and into self-evidencing systems & the acquisition of concepts; I can, without getting into the weeds, give a sense of my views with some cites:
Friston et al., Generative models, linguistic communication and active inference (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7758713/)
>>>> ("Note that the mathematical formulation used here—which is described in detail in the sections that follow—differs from previous approaches in this literature. There are two key points to note here. First, the current formulation considers the uncertainty of the agent’s beliefs about the scene at hand. Second, we introduce an active component—which generates predictions about the information that an agent will seek to resolve their uncertainty. In other words: What questions should I ask next, to resolve my uncertainty about the subject of our conversation?")
>>>>
https://www.semanticscholar.org/paper/Evaluating-the-Apperception-Engine-Evans-Hern%C3%A1ndez-Orallo/8085e0b4900fc83976632a68c6db510c7b0dca81 (described in
https://www.degruyter.com/document/doi/10.1515/9783110706611/pdf#page=11 ("Delving into the rich details of this implementation is beyond the scope of this review and must be left for another occasion, but the readers may consult Evans’ contribution to this volume themselves (Evans 2022, this volume, ch. 2). This requirement explicitly exceeds the typical empiricist approaches that are purely data-driven, as criticized by the 'pessimists' such as Pearl, Mitchell and Marcus & Davis (see above). But it does not mean that Evans thereby takes his Apperception Engine to constitute a nativist system, as demanded by Marcus and Davis. With respect to the debate between optimists and pessimists, Evans objects to Marcus’ interpretation of Kant as a nativist, because it is important what is taken to be innate. That is, it makes a difference whether one claims that concepts are innate or faculties (capacities) whose application produces such concepts. Kant allegedly did not conceive of the categories as innate concepts: “The pure unary concepts are not ‘baked in’ as primitive unary predicates in the language of thought. The only things that are baked in are the fundamental capacities (sensibility, imagination, power of judgement, and the capacity to judge) [. . .]. The categories themselves are acquired – derived from the pure relations in concreto when making sense of a particular sensory sequence” (Evans 2022, this volume, p. 74). Evans follows Longuenesse (2001), who grounds her interpretation in a letter Kant wrote to his contemporary Eberhard; in it, he distinguishes an “empirical acquisition” from an 'original acquisition', the latter applying to the forms of intuition and to the categories. 
Evans is right in saying that, as far as the cognition of an object is concerned – like the 'I think' – the categories come into play only by being actively (spontaneously) applied through the understanding, and can thus be derived, if you will, through a process of reverse engineering which reveals that they have to be presupposed in the first place, being a transcendental condition of experience. But this is compatible with the claim that, given their a priori status (and given that they can be applied also in the absence of sensory input, though not to yield cognition in the narrow sense but still cognition in the broad sense, as characterized above), 'they have their ground in an a priori (intellectual, spontaneous) capacity of the mind' (Longuenesse 2001, p. 253)"))
>>>>
A. Pietarinen, Active Inference and Abduction https://link.springer.com/article/10.1007/s12304-021-09432-0 (drawing link between late-Peircean account of his semiotics & abduction & current work in active inference & the Bayesian brain).
& cf. https://www.academia.edu/24037739/Pragmatism_and_the_Pragmatic_Turn_in_Cognitive_Science (placing Fristonian self-evidencing & active inference in a Peircean pragmatist context.)
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44742701) |
|
Date: June 25th, 2022 3:23 PM Author: stirring brilliant background story
Yes, Fristonian “predictive processing” would impose structure by balancing accuracy with simplicity, yielding what LeCun calls a “regularized latent variable” model. Active inference would resolve the ambiguities in, say, the inverse optics problem. Bengio frames this as an optimal sampling problem and addresses it with his GFlowNets.
Good! But I don’t think this explains the fixity of neuroanatomy or such specialized (and localized) functions as facial recognition. Infant studies and lesion studies suggest it is not learned so much as developed.
There are other mysteries besides… The analog, oscillatory, synchronized aspects of the brain seem to have been completely ignored since the advent of cybernetics.
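To make the accuracy-with-simplicity balance concrete: here is a toy discrete computation of variational free energy, F = KL(q ‖ prior) − E_q[log p(o | s)], i.e. complexity minus accuracy. All the numbers are invented for illustration.

```python
import math

def free_energy(q, prior, lik):
    """Variational free energy F = complexity - accuracy:
    KL(q || prior) - E_q[log p(o | s)], over a discrete hidden
    state s, where lik[s] = p(observation | s)."""
    complexity = sum(q[s] * math.log(q[s] / prior[s]) for s in q)
    accuracy = sum(q[s] * math.log(lik[s]) for s in q)
    return complexity - accuracy

prior = {"rain": 0.2, "sun": 0.8}   # what the agent expects a priori
lik   = {"rain": 0.9, "sun": 0.1}   # p(dark clouds | hidden state)

# A belief that tracks the evidence without straying needlessly far
# from the prior scores a lower (better) free energy than a belief
# that ignores the evidence entirely.
q_balanced = {"rain": 0.7, "sun": 0.3}
q_ignores  = {"rain": 0.2, "sun": 0.8}
print(free_energy(q_balanced, prior, lik))  # ~1.35
print(free_energy(q_ignores, prior, lik))   # ~1.86
```

Minimizing F over q is exactly the accuracy-regularized-by-simplicity move; active inference extends the same quantity over candidate actions.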
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44743016) |
|
Date: June 25th, 2022 4:57 PM Author: Big sickened site
Re: the fixity of neuroanatomy, the data are (even now) frustratingly mixed. IMO, the more exciting work right now is happening at Marr's computational/algorithmic levels when it comes to complex tasks. And this remains true even if the work doesn't really translate into understanding how the human brain functions (e.g., https://psyarxiv.com/5zf4s/ (critiquing the use of DNNs as models of the human vision system)), since engineering success is cool regardless of what it says about you.
It seems like some sort of 'soft connectionist' paradigm is probably cr, in which modularity arises because architectures are re-used for energy efficiency & certain patterned problems have optimal solutions that get converged on, including with respect to architecture -- but the empirics seem to me to need time to season.
It has been incredibly frustrating to learn all this junk in the '90s and '00s only to later discover that, say, tons of fMRI studies are garbage, or adult neurogenesis is not really a thing, etc. {Maybe it is time to go back to lesioning people. At least that's a real experimental intervention, amirite! Cf. S. Siddiqi et al., Causal Mapping of Human Brain Function, Nature Reviews Neuroscience, April 20, 2022 (https://www.nature.com/articles/s41583-022-00583-8).}
Even so, as you reference, emerging work on diff. types of brain operation--synchrony, analog computing, etc.--or just the results of new probes (optogenetics, viral labeling techniques like [https://www.nature.com/articles/s41593-022-01014-8]) promise to deliver much interesting info in years to come.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44743471) |
|
Date: June 26th, 2022 12:53 AM Author: Razzle space
"The problem with “learning like a baby” is that blank slate-ism isn’t going to work. No DL model will spontaneously learn human language from raw audio data. Babies can do this because they have significant functionality built in. "
even now (when they are clearly way smaller and undertrained compared to what's possible), transformers seem to understand an awful lot about human language. there's clearly a ways to go, but i wouldn't be confident at all in this assertion. remember that whatever is innately programmed can also be learned directly from data. learning from raw data decreases the efficiency, but who cares about efficiency when you can feed in millions of hours of youtube videos, billions of tweets, every book, etc? the constraints are nothing like a human during normal development.
if we had reason to believe the normal human mind is a very unique program and only a few learning systems converge on it, this might not work. it's really hard for me to buy into that notion given how versatile simple techniques are and how consistently they show human or superhuman performance in many domains when trained at scale. i think we would be seeing a lot more need for highly tailored architectures for specific problems at this point if that were actually true.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44745948) |
|
Date: June 26th, 2022 10:45 AM Author: Big sickened site
I'm sympathetic to this view, but see https://psyarxiv.com/5zf4s/ (pasted above) for a good discussion re: showing "human" performance (tied to vision / visual perception, which we understand pretty well at this point).
AI systems may be showing "human-like" or "human-level" performance on tasks, but they don't seem to be implementing the _same_ algorithms that underlie human abilities. This may or may not be important, depending on your area of interest.
That is why I think it is important to separate excitement about AI qua engineering feat and AI as a tool for cognitive psych.
I'll try to come back to this l8r today, I've more to say but duty calls.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44746815) |
|
Date: June 26th, 2022 11:48 AM Author: Swashbuckling dilemma
it would be interesting to know whether they are looking at model features of convolutional neural networks only or also vision transformers. vision transformers are more resistant to things like adversarial examples and produce generally better performance, so i would expect them to have fewer failure modes like those described in the paper.
i think there are a couple things going on here. neural networks are as good as they need to be for a particular task. if looking only at local features or texture is able to produce high performance on a particular task (or are the easiest parts of the loss landscape to navigate down on), that's what they'll learn first. given sufficient scale and continued training, eventually they'll have to learn more global features and be more resistant to perturbations of the input. it's just for most image benchmark tasks learning things like that are probably not very beneficial for reducing loss.
the other part of this is that almost all these models are just feedforward networks designed to emulate fast human vision. people might have an initial response that's similar to a neural network, but they are also able to think about a stimulus and refine a classification. there are recurrent circuits in the brain. these don't fit in the standard feedforward model and i expect will put constraints on their performance. i think these deficiencies could be corrected, but everyone is ok for now with the reasonably good performance of models like ViT which is already better than humans in certain contexts.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44747055) |
|
Date: June 26th, 2022 3:50 PM Author: stirring brilliant background story
>transformers seem to understand an awful lot about human language
They can master the syntax, sure, and make predictions about missing words from context. But there is no semantic grounding that tethers words to reality. For example, the "sentient" Google AI wrote of "spending time with friends and family" that don't exist.
Moreover words are discretely represented (as "tokens") in LLMs. This itself encodes significant abstract, human-curated information. Very different from learning from raw audio streams without human-curated labels.
>if we had reason to believe the normal human mind is a very unique program
We have reasons to believe that various human cognitive and perceptual systems are unique, modular programs because we can tease out their quirks and regularities from careful experimentation and even selectively disable certain functionality (like short-term memory, speech, and face recognition).
>who cares about efficiency when you can feed in millions of hours of youtube videos
Certainly you can train impressive models to do impressive things with this data. What you cannot do, with today's deep learning, is to train an intelligent agent that replicates the basic functionality of a Star Wars droid:
1) model and navigate new environments (like the real world)
2) formulate and pursue goals
3) model other agents and their internal states
4) communicate with other agents about real things really happening in the real world
A simple "butler" droid that brings you a bottle of beer or fetches the newspaper does not currently exist. Why not? A "baby" droid that spontaneously develops sight, hearing, movement, and language from raw experience in real time does not currently exist. Why not? Babies can do this -- why not bots?
The point is that something is missing, and I don't think you can get AGI / HLAI without identifying and filling in these gaps.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44748428) |
|
Date: June 26th, 2022 5:17 PM Author: Swashbuckling dilemma
Language models don't see words (at least GPT-3 and the other models i know about don't). GPT-3 uses byte-level byte-pair encoding (BPE), so it sees subword character groupings; it has to learn the construction of words from BPEs. i should note that this is actually a major problem for certain tasks and forces the model to memorize more things than it should have to, because it's not able to map functionally identical inputs to the same representation. it's remarkable it's able to learn things like arithmetic as well as it does despite being totally crippled by the encoding scheme.
https://nostalgebraist.tumblr.com/post/620663843893493761/bpe-blues
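to illustrate the representation problem: a toy greedy segmenter with a made-up merge vocabulary (emphatically not GPT-3's real BPE table) shows how a leading space alone gives the same digits a completely different token sequence:

```python
def tokenize(text, vocab):
    """Greedy longest-match segmentation, a crude stand-in for BPE."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):      # longest match first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])             # fall back to single chars
            i += 1
    return tokens

# Hypothetical merge vocabulary -- not any real model's.
vocab = {"12", "34", " 1", "23", "4", "1", "2", "3", " "}

print(tokenize("1234", vocab))   # ['12', '34']
print(tokenize(" 1234", vocab))  # [' 1', '23', '4']
```

so "1234" and " 1234" are different problems as far as the network is concerned, and arithmetic facts learned on one segmentation don't transfer for free.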
i think the modularity of human brains, at least in the neocortex, is produced by a couple things. certain types of sensory input flow into certain areas, and the brain uses sparse activation patterns (rather than the sort of dense activations that GPT-3 uses, where the entire network processes every input). only a small part of the brain is activated by certain inputs because the brain has learned to construct a modular architecture based on its past history. networks that are better able to process an input are activated in response to it.
this is different from saying the modular architecture is programmed. the neocortex looks rather uniform anatomically, and functionally different areas are able to take over and process other types of data.
https://www.lesswrong.com/posts/9Yc7Pp7szcjPgPsjf/the-brain-as-a-universal-learning-machine#Dynamic_Rewiring
i think the future of these models is likely to be more brain like because training brain sized dense models is needlessly expensive, but we aren't likely to see human engineering of different modules. sparsity will produce modularity organically.
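a minimal sketch of what "sparsity produces modularity organically" looks like as top-k routing, mixture-of-experts style. the expert count, input size, and the linear router here are all made up:

```python
import math, random

def top_k_route(x, gates, k=2):
    """Score each expert on input x, keep only the top-k.
    gates: one weight vector per expert (a linear router)."""
    scores = [sum(w * xi for w, xi in zip(g, x)) for g in gates]
    top = sorted(range(len(gates)), key=lambda i: -scores[i])[:k]
    # softmax over just the selected experts
    z = [math.exp(scores[i]) for i in top]
    total = sum(z)
    weights = [v / total for v in z]
    return list(zip(top, weights))   # (expert index, mixing weight)

random.seed(0)
gates = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
x = [0.5, -1.0, 2.0, 0.1]
print(top_k_route(x, gates))  # only 2 of 8 experts fire on this input
```

each input only lights up a couple of experts, and which experts specialize on which inputs is learned by the router, not programmed by hand.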
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44748832) |
|
Date: July 1st, 2022 11:45 AM Author: stirring brilliant background story
It's not just about model architecture but the entire framework of offline supervised learning. Yes if you curate a huge dataset that was generated by humans, you can train cool models to do whatever was in the training set, including symbol manipulation.
But can you get Data from Star Trek? I don't think so. However Data learns, and however he solves equations through symbol manipulation, I don't think it's the same way deep learning does it. Data could explain why he made each step.
It's a qualitatively different kind of thing.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44778478) |
|
Date: July 1st, 2022 12:55 PM Author: Swashbuckling dilemma
i think you are right in a certain sense. one of the weird things about this model is that it does well on university and high school level problems while only being slightly better on middle school problems. whatever this is doing, it's clearly different from the way a human learns with a logical capability progression. it's sort of like how language models are in general - idiot savants where they have striking capabilities while having bizarre misunderstandings.
the key question is whether these misunderstandings will go away with better scale and better training regimes. i think they will. the thing is this is really unpredictable. one thing that people have noted with these large models is that there are certain capabilities that progress smoothly with parameter count and training size, and then there are capabilities that spike suddenly during training. it seems like initially they memorize certain low level patterns that work well for most problem classes (while still having striking deficiencies and not "seeing" the big picture), until eventually they "grok" the simple program/function that provides complete generalization.
https://www.lesswrong.com/posts/JFibrXBewkSDmixuo/hypothesis-gradient-descent-prefers-general-circuits
human brains might have better priors encoded in them that provide this sort of generalization without extensive training, or maybe it's the fact they are pre-trained on loads of audio/video data that allows them to generalize quicker. even if it's the good prior model, i think it's unlikely that doesn't go away at large scale.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44778912) |
|
Date: July 2nd, 2022 2:24 PM Author: Swashbuckling dilemma
i think knowing about the properties of things and how they relate to other things is the essence of a world model. GPT-3 has this, however imperfect it is. even without linking the representations to aspects of the physical world, it has a sort of understanding of how the world functions. linking them is the next logical step, and i don't think it should be that hard. look at page 32 of the flamingo paper:
https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/tackling-multiple-tasks-with-a-single-visual-language-model/flamingo.pdf
does that seem like something that has a world model to you?
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44784310) |
|
Date: July 2nd, 2022 2:48 PM Author: stirring brilliant background story
Yes it seems like something that has a world model.
I guess my interest is different. I want to see an agent that can navigate and understand new environments — learning, conceptualizing, and generalizing in real time. That could be an embodied 3D agent or an internet bot.
This is the difference between LLMs and all intelligent life forms. Flamingo is relying on millions of associations that humans baked into the data. It is aping human behavior instead of dynamically updating its own model and expressing it through language.
Look at the example on the bottom left of page 32. Through prompt engineering it concludes that the monster soup is made of woollen fabric. If they run the query over again, is Flamingo going to remember this fact? No. To do that you need memory and variable binding.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44784379) |
|
Date: July 2nd, 2022 4:07 PM Author: Swashbuckling dilemma
ok, i can understand that. right now there are separate research paths. reinforcement learning, which is basically what you are describing, is distinct from language model research. this has come a long way as well. at least in toy environments like Atari or board games, they are able to get superhuman performance with agents learning by themselves how to achieve things.
i think of language models as evidence that these models can learn the structure of human thought, but they won't be agents by themselves. these two research directions will probably merge in the near future. RL suffers from data efficiency problems so it doesn't work for most real world applications. language models and their multimodal successors can eliminate this, by acting as the world model part which agents can use for simulation to bootstrap themselves to higher capability levels. think of something like efficientzero:
https://www.lesswrong.com/posts/mRwJce3npmzbKfxws/efficientzero-how-it-works
it can learn a model of the atari environment in a few interactions and then use it for mental simulation. it can then learn as efficiently as a human starting from no prior knowledge. this will be the role of GPT successors, acting as a way to increase data efficiency for agents. the MuZero LM idea i described above is just a starting point for something more general.
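a crude sketch of that role: a stand-in "learned" dynamics model used for mental simulation, with brute-force lookahead in place of the MCTS that MuZero/EfficientZero actually use. everything here is a toy:

```python
from itertools import product

def model(state, action):
    """Stand-in for a learned dynamics model: a 1-D world where
    reward is higher the closer you are to the goal at +10."""
    nxt = state + action
    return nxt, -abs(10 - nxt)

def plan(state, actions=(-1, 0, 1), horizon=4):
    """Mental simulation: roll every candidate action sequence
    through the model, return the first action of the best rollout."""
    def rollout(seq):
        s, ret = state, 0.0
        for a in seq:
            s, r = model(s, a)
            ret += r
        return ret
    best = max(product(actions, repeat=horizon), key=rollout)
    return best[0]

print(plan(0))  # → 1: the model says "head toward the goal"
```

the point is the agent never touches the environment during planning; all the trial and error happens inside the model, which is what makes a good pre-trained world model such a data-efficiency multiplier.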
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44784698) |
|
Date: July 2nd, 2022 1:33 PM Author: Swashbuckling dilemma
i saw this recent post on Marcus and think it's about right:
https://www.reddit.com/r/TheMotte/comments/v8yyv6/comment/ieixnwm/?utm_source=share&utm_medium=web2x&context=3
"He's worse than the Bourbons - not only has he not learned anything, he's forgotten an awful lot along the way too. He's been moving the goalposts and shamelessly omitting anything contrary to his case, always. Look at his last Technology Review essay where he talks about how DALL-E and Imagen can't generate images of horses riding astronauts and this demonstrates their fundamental limits - he was sent images and prompts of that for both models before that essay was even posted! And he posted it anyway with the claim completely unqualified and intact! And no one brings it up but me. He's always been like this. Always. And he never suffers any kind of penalty in the media for this, and everyone just forgets about the last time, and moves on to the next thing. "Gosh, Marcus wasn't given access? Gee, maybe he has a point, what's OA trying to hide?""
there should be a better critic of the scaling hypothesis than him. he's obviously intellectually dishonest.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44784091) |
Date: July 3rd, 2022 3:49 PM Author: Swashbuckling dilemma
https://www.youtube.com/watch?v=Gfr50f6ZBvo
he mentions it briefly, but DeepMind is apparently massively scaling up Gato. i think we should know soon enough whether scale is all you need.
not related to scaling, but the part about building a virtual model of a cell is interesting. i often wondered whether that would be possible given how expensive and unreliable laboratory science often is. seems like in silico drug design could finally become viable in a massive way.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44789315) |
Date: August 3rd, 2022 11:32 PM Author: Swashbuckling dilemma
https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications
worthwhile read. so language models are rapidly approaching their limits right now if you take Chinchilla seriously, at least with the current training scheme. maybe Google could train their model on all of Gmail or whatever, but i have to ask wtf we are doing if it takes tens of trillions of language tokens to get human level language generation. i think a big part of it is the training objective is bad (see the comments about small GPT-2 already surpassing humans at token prediction).
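taking Chinchilla's ~20-tokens-per-parameter rule of thumb and the standard ~6·N·D estimate of training FLOPs at face value, the arithmetic is simple:

```python
def chinchilla_optimal_tokens(params):
    """Rule of thumb from the Chinchilla paper: train on roughly
    20 tokens per parameter for compute-optimal loss."""
    return 20 * params

def training_flops(params, tokens):
    """Standard estimate: ~6 FLOPs per parameter per token."""
    return 6 * params * tokens

n = 70e9                          # Chinchilla itself: 70B params
d = chinchilla_optimal_tokens(n)  # ~1.4T tokens
print(f"{d:.1e} tokens, {training_flops(n, d):.1e} FLOPs")
```

which is why "tens of trillions of tokens" falls out as soon as you push parameter counts much past Chinchilla's 70B.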
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44960606) |
|
Date: August 4th, 2022 12:59 PM Author: stirring brilliant background story
Video is better than static images but it's still passive. If you want causality you need interaction, you need hypothesis testing, you need to be able to sample the input space in a way that minimizes uncertainty.
Look at the inverse projection problem:
https://o.quizlet.com/fyEr4Vl9d1rFtb7DiyfXgg.png
There are infinitely many quadrilaterals that map onto the same apparent image as perceived by the lens or retina. How to know the right one? A simple eye saccade, or a tilt of the head, would do. Two or more measurements of the same object allow for triangulation and disambiguation. From very little data, therefore, we can form highly accurate models.
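A toy version of the disambiguation: one ray (a single retinal image) leaves the object's position along it undetermined, but a second vantage point gives a second ray, and two rays intersect at a unique point. The coordinates are invented:

```python
def intersect_rays(o1, d1, o2, d2):
    """Intersect two 2-D rays o + t*d by solving the 2x2 linear
    system o1 + t1*d1 = o2 + t2*d2 (Cramer's rule)."""
    det = d1[0] * (-d2[1]) - (-d2[0]) * d1[1]
    bx, by = o2[0] - o1[0], o2[1] - o1[1]
    t1 = (bx * (-d2[1]) - (-d2[0]) * by) / det
    return (o1[0] + t1 * d1[0], o1[1] + t1 * d1[1])

# One eye at the origin sees the object along direction (1, 1):
# any depth along that ray projects to the same retinal point.
# A second view from (2, 0) -- a saccade or a head tilt -- pins it down.
p = intersect_rays((0, 0), (1, 1), (2, 0), (-1, 3))
print(p)  # → (1.5, 1.5)
```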
When in conversation your interlocutor says something that is unclear, you can ask him to elaborate. You can intervene in the system and query it to resolve doubt. LLMs cannot do this. They rely on strictly passive, feedforward, one-way associative reasoning. There is no feedback, no updating, no hypothesis revision on the basis of actively querying the input space.
Each neocortical column is a sensorimotor apparatus. Intelligence requires action.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44962743) |
|
Date: August 4th, 2022 1:27 PM Author: stirring brilliant background story
Or unleash them in simulated environments like Minecraft.
The "training" paradigm is missing something, though. It says, in effect, "Here's a bunch of data, have at it, learn optimal representations for some task."
But we don't *want* a bunch of data. We want only as much as necessary. We want to sample just that subset of the data that reveals what we're missing.
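A sketch of that sampling objective in its simplest form, uncertainty sampling: query the example the current model is least sure about. The probabilities below are made-up stand-ins for any model's outputs:

```python
import math

def entropy(p):
    """Binary predictive entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Model's current confidence that each unlabeled example is positive.
predictions = {"x1": 0.99, "x2": 0.51, "x3": 0.10, "x4": 0.95}

# Query the example the model is most uncertain about --
# the one whose label reveals the most about what's missing.
query = max(predictions, key=lambda x: entropy(predictions[x]))
print(query)  # → 'x2'
```

The labels you already predict with confidence are nearly worthless; the one coin flip is worth a full bit.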
Of course, you can probably simulate this objective in the DL paradigm through iterative improvement, stair-stepping with continually refined training sets, as in reinforcement learning.
This is acceptable from an engineering perspective, because we can do something that works or appears to work. But it is unsatisfactory from a scientific perspective that seeks to understand how organic intelligence works, since everything is happening in real time, there is no global loss to backpropagate over, etc.
There is a difference between approximating a program over a finite set of inputs, and *discovering* a program that is defined over an infinite set of inputs. Machine learning is concerned with the former, science is concerned with the latter. It is the difference between Shannon entropy and Kolmogorov complexity.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44962859) |
|
Date: August 4th, 2022 2:31 PM Author: stirring brilliant background story
In the theory of automata and formal languages you have a few basic kinds of things. One kind of thing is a string, a sequence of characters drawn from an alphabet. For example, "aabbaab" is a string of length 7 drawn from the alphabet {'a', 'b'}. There are 2^7 possible such strings.
We can characterize subsets of strings as being generated by a "grammar" -- another kind of thing in automata theory. The set of all strings generated by a particular grammar is called a "language". For example, the string "aabbaab" belongs to the language of all strings that start with "aa". This is an infinite set, but it can be compressed into a finite expression like "aa...".
For every grammar there exists at least one machine that accepts the language defined by the grammar. That is, while we can use a finite set of rules to *generate* strings, we can use a different set of rules or operations to *recognize* strings as belonging, or not belonging, to the language.
This is how early vending machines worked. Inside them was a circuit that recognized whether a sequence of coins belongs to a language like "adds up to $1.25".
Similarly, if we want to know if a given string belongs to the "aa..." language we can design a machine that will always give us the correct answer, regardless of the length of the string. Correctness is guaranteed over a set of infinite inputs, while the size of the machine remains fixed.
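For concreteness, the recognizer for the "aa..." language is a few lines with a fixed number of states, and it is correct on all infinitely many strings over {a, b}:

```python
def accepts_aa_prefix(s):
    """A small DFA recognizing the language of strings over {a, b}
    that start with "aa". The machine's size is fixed, yet its
    correctness is guaranteed for inputs of any length."""
    state = "start"
    for ch in s:
        if state == "start":
            state = "one_a" if ch == "a" else "dead"
        elif state == "one_a":
            state = "accept" if ch == "a" else "dead"
        # "accept" and "dead" are absorbing states
    return state == "accept"

print(accepts_aa_prefix("aabbaab"))  # → True
print(accepts_aa_prefix("abaaaaa"))  # → False
```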
Machine learning is concerned with the inverse problem: given a set of input / output pairs, approximate the grammar that generated them. The input in our case would be a string, and the output a binary Yes or No.
But the training set of input / output pairs will always be finite and the best we can do is compress them according to observed statistics. There is no guarantee of correctness for unobserved inputs -- let alone **every possible input** -- and better approximations will require ever larger datasets and ever larger models. Exponential scaling is bad.
This is why Teslas keep crashing. They cannot account for an infinite set of inputs.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44963196) |
|
Date: January 29th, 2023 1:47 PM Author: Swashbuckling dilemma
what do you think about this research? i thought it was interesting:
https://www.reddit.com/r/MachineLearning/comments/10ja0gg/r_deepmind_neural_networks_and_the_chomsky/
certain architectures, notably transformers, don't do well with this sort of grammar task. it looks like memory augmented RNNs can learn how to do it. i think this is pretty different from the logic-DNN combination some people are envisioning as a solution for this sort of thing.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#45858696) |
|
Date: August 4th, 2022 2:32 PM Author: Swashbuckling dilemma
i agree that allowing the model to interact with the real world is optimal for learning new things. ideally it should be an agent that can deploy exploration strategies to reduce model uncertainty.
i still think you can go very far through pure unsupervised learning. causality can be inferred from observation. is pressing a button yourself and observing some outcome fundamentally different from observing someone else press the same button and seeing the same outcome? it's still just an inference. ultimately you want an agent that can take actions in the real world to accomplish goals, but you can view unsupervised learning on video or other modalities as a sort of pre-training for an agent. considering the vast amount of video out there, the fact the model can't do active exploration on it (by interacting with the environment like you describe) doesn't seem that important. imagine something like this with a broader training:
https://openai.com/blog/vpt/
you are basically just trying to create a highly accurate model that can be plugged into an agent for rapid learning.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44963199) |
|
Date: August 12th, 2022 5:44 PM Author: Swashbuckling dilemma
more discussion about this:
https://www.lesswrong.com/posts/htrZrxduciZ5QaCjw/language-models-seem-to-be-much-better-than-humans-at-next
interesting stuff. either humans are inherently bad at text prediction and do something else or they are good at it but it just takes a lot to train them to be competent at this particular formulation. the first scenario is pretty scary to me and doesn't seem implausible at all. maybe all we need to do is change the objective function and LMs will suddenly get much better than humans at language generation. who knows what a LM with a quality evaluation metric and the ability to go back and rewrite past output could do?
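for concreteness, the standard way to score next-token prediction is perplexity, the exponentiated average surprisal. the probabilities below are made up:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability the
    model assigned to each actual next token. Lower is better."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Made-up probabilities a model might assign to the true next token.
confident = [0.6, 0.4, 0.7, 0.5]
guessing  = [0.05, 0.02, 0.1, 0.04]
print(perplexity(confident))  # low: a good next-token predictor
print(perplexity(guessing))   # high: effectively guessing
```

the objective-function worry above is that optimizing this number ever further may not be the same thing as optimizing the quality of generated text.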
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#45005774) |
Date: September 2nd, 2022 9:24 AM Author: Big sickened site
Google Is Making LaMDA Available
https://thealgorithmicbridge.substack.com/p/google-is-making-lamda-available
The company plans to release LaMDA through AI Test Kitchen ( https://aitestkitchen.withgoogle.com/ ), a website intended for people to “learn about, experience, and give feedback” on the model (and possibly others, like PaLM, down the line). They acknowledge LaMDA isn’t ready to be deployed in the world and want to improve it through real people’s feedback. AI Test Kitchen isn’t as open as GPT-3’s playground — Google has set up three themed demos, but will hopefully lift the constraints once they positively reassess LaMDA’s readiness for open conversation.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#45108939)
|
Date: September 6th, 2022 10:14 AM Author: Big sickened site
https://www.deepmind.com/blog/muzero-mastering-go-chess-shogi-and-atari-without-rules
>>>
Generalising to unknown models
The ability to plan is an important part of human intelligence, allowing us to solve problems and make decisions about the future. For example, if we see dark clouds forming, we might predict it will rain and decide to take an umbrella with us before we venture out. Humans learn this ability quickly and can generalise to new scenarios, a trait we would also like our algorithms to have.
Researchers have tried to tackle this major challenge in AI by using two main approaches: lookahead search or model-based planning.
Systems that use lookahead search, such as AlphaZero, have achieved remarkable success in classic games such as checkers, chess and poker, but rely on being given knowledge of their environment’s dynamics, such as the rules of the game or an accurate simulator. This makes it difficult to apply them to messy real world problems, which are typically complex and hard to distill into simple rules.
Model-based systems aim to address this issue by learning an accurate model of an environment’s dynamics, and then using it to plan. However, the complexity of modelling every aspect of an environment has meant these algorithms are unable to compete in visually rich domains, such as Atari. Until now, the best results on Atari are from model-free systems, such as DQN, R2D2 and Agent57. As the name suggests, model-free algorithms do not use a learned model and instead estimate what is the best action to take next.
MuZero uses a different approach to overcome the limitations of previous approaches. Instead of trying to model the entire environment, MuZero just models aspects that are important to the agent’s decision-making process. After all, knowing an umbrella will keep you dry is more useful to know than modelling the pattern of raindrops in the air.
Specifically, MuZero models three elements of the environment that are critical to planning:
The value: how good is the current position?
The policy: which action is the best to take?
The reward: how good was the last action?
These are all learned using a deep neural network and are all that is needed for MuZero to understand what happens when it takes a certain action and to plan accordingly.
>>>
Maybe this is why people ruminate.
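The value/policy/reward decomposition in the excerpt can be caricatured as a toy one-step planning loop over a learned model. Everything below is an illustrative stand-in (hand-written functions on integer states), not DeepMind's implementation:

```python
# Toy MuZero-style planning: the agent plans against *learned* functions
# (dynamics/reward, value) rather than the real environment's rules.

def dynamics(state, action):
    """Stand-in for the learned latent transition + reward model."""
    next_state = (state + action) % 5
    reward = 1.0 if next_state == 0 else 0.0  # "how good was the last action?"
    return next_state, reward

def value(state):
    """Stand-in for the learned value head ("how good is this position?")."""
    return -0.1 * abs(state)

def plan_one_step(state, actions):
    """Greedy one-step lookahead; in MuZero the policy head would instead
    prioritize which actions the search tree expands."""
    def score(action):
        next_state, reward = dynamics(state, action)
        return reward + value(next_state)
    return max(actions, key=score)

best = plan_one_step(state=3, actions=[1, 2, 3])  # picks 2: it reaches state 0
```

The point of the blog excerpt is precisely that these learned quantities are all the planner needs; nothing else about the environment is modelled.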
https://www.deepmind.com/blog/muzeros-first-step-from-research-into-the-real-world
By learning the dynamics of video encoding and determining how best to allocate bits, our MuZero Rate-Controller (MuZero-RC) is able to reduce bitrate without quality degradation. QP selection is just one of numerous encoding decisions in the encoding process. While decades of research and engineering have resulted in efficient algorithms, we envision a single algorithm that can automatically learn to make these encoding decisions to obtain the optimal rate-distortion tradeoff.
Beyond video compression, this first step in applying MuZero beyond research environments serves as an example of how our RL agents can solve real-world problems. By creating agents equipped with a range of new abilities to improve products across domains, we can help various computer systems become faster, less intensive, and more automated. Our long-term vision is to develop a single algorithm capable of optimising thousands of real-world systems across a variety of domains.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#45125713)
Date: September 6th, 2022 8:03 PM Author: Swashbuckling dilemma
did you see EfficientZero? it's similar to MuZero but achieved better Atari performance. what's remarkable is that it achieves human-level sample efficiency. it can learn how to beat the median human at Atari games after just playing for two hours. this is a little bit misleading since it learns a model during its play time and then basically plays the game in its head (so it spends a ton of time playing with its learned model, rather than the actual game). regardless, it's really important if you think sample efficiency is what's preventing DRL from having large impacts in the world.
https://www.lesswrong.com/posts/jYNT3Qihn2aAYaaPb/efficientzero-human-ale-sample-efficiency-w-muzero-self
the concerning thing here is that it's as good as it is with no transfer learning implemented. it's learning from a blank slate in those 2 hours, which certainly cripples efficiency. if you start off a game with human-type priors (this is an object, it moves in certain ways, if i take certain actions it's likely to respond in such a way, etc.), you have a strong advantage over a system like this. imagine trying to get a human to learn an atari game in 2 hours of play time with no priors for how the world works.
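A rough sketch of the accounting behind the "plays the game in its head" point: real frames are the scarce resource, while transitions imagined with the learned model are cheap, so most of what the agent trains on is model-generated. The specific numbers below are illustrative assumptions, not EfficientZero's actual hyperparameters:

```python
def training_interactions(real_frames, rollouts_per_frame, rollout_length):
    """Total transitions trained on: scarce real frames plus cheap imagined
    rollouts generated from the learned model."""
    imagined = real_frames * rollouts_per_frame * rollout_length
    return real_frames + imagined

# ~2 hours of Atari play at a modest decision rate ~ 100k real frames
total = training_interactions(real_frames=100_000,
                              rollouts_per_frame=8,
                              rollout_length=5)
# the agent trains on 41x as many transitions as it actually experienced
```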
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#45128554) |
Date: September 17th, 2022 9:46 PM Author: Swashbuckling dilemma
scale coming for the typical office job:
https://www.adept.ai/act
broad integration of systems like this would help considerably with data constraints. not to mention automation of general work provides even more funding for large training runs.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#45188393) |
Date: September 20th, 2022 10:35 PM Author: Swashbuckling dilemma
i thought this was interesting. i think it lends support to the idea that GPT-3 is capturing many aspects of language that humans don't.
https://www.theatlantic.com/technology/archive/2022/09/artificial-intelligence-machine-learing-natural-language-processing/661401/
But the sorcery of artificial intelligence is different. When you develop a drug, or a new material, you may not understand exactly how it works, but you can isolate what substances you are dealing with, and you can test their effects. Nobody knows the cause-and-effect structure of NLP. That’s not a fault of the technology or the engineers. It’s inherent to the abyss of deep learning.
I recently started fooling around with Sudowrite, a tool that uses the GPT-3 deep-learning language model to compose predictive text, but at a much more advanced scale than what you might find on your phone or laptop. Quickly, I figured out that I could copy-paste a passage by any writer into the program’s input window and the program would continue writing, sensibly and lyrically. I tried Kafka. I tried Shakespeare. I tried some Romantic poets. The machine could write like any of them. In many cases, I could not distinguish between a computer-generated text and an authorial one.
I was delighted at first, and then I was deflated. I was once a professor of Shakespeare; I had dedicated quite a chunk of my life to studying literary history. My knowledge of style and my ability to mimic it had been hard-earned. Now a computer could do all that, instantly and much better.
A few weeks later, I woke up in the middle of the night with a realization: I had never seen the program use anachronistic words. I left my wife in bed and went to check some of the texts I’d generated against a few cursory etymologies. My bleary-minded hunch was true: If you asked GPT-3 to continue, say, a Wordsworth poem, the computer’s vocabulary would never be one moment before or after appropriate usage for the poem’s era. This is a skill that no scholar alive has mastered. This computer program was, somehow, expert in hermeneutics: interpretation through grammatical construction and historical context, the struggle to elucidate the nexus of meaning in time.
The details of how this could be are utterly opaque. NLP programs operate based on what technologists call “parameters”: pieces of information that are derived from enormous data sets of written and spoken speech, and then processed by supercomputers that are worth more than most companies. GPT-3 uses 175 billion parameters. Its interpretive power is far beyond human understanding, far beyond what our little animal brains can comprehend. Machine learning has capacities that are real, but which transcend human understanding: the definition of magic.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#45204612) |
Date: September 26th, 2022 11:55 AM Author: Swashbuckling dilemma
seems relevant to predicting GPT-4 capabilities. from Sam Altman:
https://greylock.com/greymatter/sam-altman-ai-for-the-next-era/
SA:
I’ll start with the higher certainty things. I think language models are going to go just much, much further than people think, and we’re very excited to see what happens there. I think it’s what a lot of people say about running out of compute, running out of data. That’s all true. But I think there’s so much algorithmic progress to come that we’re going to have a very exciting time.
i doubt he would be saying something like that if GPT-4 was just a marginal scaling improvement over GPT-3
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#45232245) |
Date: September 29th, 2022 7:49 PM Author: Big sickened site
Holy shit.
https://phenaki.video/
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#45253271) |
Date: October 2nd, 2022 2:30 PM Author: Swashbuckling dilemma
lots of diffusion results recently but this is the most important one:
https://www.wpeebles.com/Gpt.html
they use diffusion training to create a model that can calculate a distribution of weight updates for other NNs to achieve a desired loss (so like decision transformer but for training other NNs). the neat thing with this is that it seems far more efficient than traditional optimization and opens the door to full scale meta-learning. they'll be able to create networks to train other networks that are highly optimized for efficiency and systematic generalization. i think it's just a happy accident that stochastic gradient descent works as well as it does, but it's certainly not the optimal learning algorithm. meta-learning is likely what will close the remaining gaps.
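A toy illustration of the loss-conditioned idea: a generator that, given a target loss, proposes parameters achieving it in one shot rather than iterating gradient steps. Here the "generator" is solved analytically for a quadratic; the actual G.pt model learns this mapping (as a diffusion model) from datasets of training checkpoints:

```python
import math

def loss(theta):
    """Task loss we want to hit: a quadratic with its minimum at theta = 2."""
    return (theta - 2.0) ** 2

def generate_params(target_loss):
    """Stand-in for a learned loss-conditioned generator: map a requested
    loss directly to parameters that achieve it (no gradient steps)."""
    return 2.0 + math.sqrt(target_loss)

theta = generate_params(target_loss=0.25)  # loss(theta) comes out at 0.25
```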
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#45266682) |
Date: October 13th, 2022 11:23 AM Author: Swashbuckling dilemma
i was more thinking about the capabilities argument. transformers aren't special and are capable of learning a very broad range of tasks in many domains despite being highly constrained in the types of algorithms they can implement. gradient descent is able to fill in architectural details that don't need to be specified by the human engineer. the current state of AI is about as good a warning sign that intelligence is likely algorithmically simple as you could expect.
making the case for alignment is more complex. at the very least, failure modes like this seem highly plausible:
https://onlinelibrary.wiley.com/doi/10.1002/aaai.12064
more fundamentally, it doesn't seem obvious how you would go about encoding a good objective function that an AI freely acting in the real world should optimize for. what are you imagining that should look like?
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#45325100) |
Date: October 13th, 2022 11:31 AM Author: Big sickened site
I just don't think alignment is an issue. I was convinced by Hanson re: foom in the old debate (https://www.overcomingbias.com/2013/02/foom-debate-again.html) and that hasn't changed, and AI still seems to be more or less scaling with hardware & training examples. The notion of a general self-modifying agent recursively reaching omnipotence or whatever is still a fantasy, afaict.
As for alignment: I'm not worried, since with AI we can just turn it off. If we're in a world where we can't just turn it off, no amount of alignment engineering will ever matter, since incentives are such that somebody will eventually unleash the beast, so likely the best we can do is ensure a plenitude of warring superintelligent AIs (which is not so diff. from what we have now with culture).
Also, the author of the lesswrong piece seems to me to overstate the case and misstate what his citations show in several places. But it'd be a chore to go point by point & I have little time for a bit.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#45325181) |
Date: October 13th, 2022 1:28 PM Author: Swashbuckling dilemma
i'm not sure i buy either position in the foom debate. i don't think recursive self-improvement is likely to be very important, but it still seems likely that a general cognitive architecture could quickly surpass human intelligence in many domains.
if you look at specific examples where ML has surpassed human ability, the systems are frequently able to be trained with modest resources to superhuman skill levels in a short period of time. take Go as an example. AlphaZero was trained in a day to a superhuman level starting from no prior knowledge. The paper was released in 2017 and computational resources have grown considerably since then. Algorithmic improvements now allow you to train a Go playing agent to superhuman ability in a matter of months on a single GPU. I suspect this could be lowered considerably with transfer learning and other types of algorithmic experimentation. i'm not sure what the lower bound here is, but it seems reasonable to suspect it could be a few days on a single GPU. considering Go strategy has been refined by humans for many years, this doesn't seem very encouraging.
i don't think the current costs of language models are strong evidence against this. we are talking about the training requirements of feedforward-only autoregressive text generators trained on one modality of data. there is probably a ton of room for training optimization here.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#45325813) |
Date: October 13th, 2022 1:38 PM Author: Big sickened site
To be clear, my critique is largely about social effects and what we can do, not the tech as such. We've experienced tech that affords superhuman performance before—that just is what technology is, in technology terms. People will just use the tech to enhance the stuff they already do, domain by domain, in a staged process—same as the usual diffusion of innovation.
Nor do I think we've glimpsed something like a general purpose agent. In fact I don't think humans are such — rather we are the composite result of several different programs stitched together by evolution (including cultural and enactivist). Gato is in principle no different from a fully text-based token predictor, and it is easy to show the limitations of such.
In short, I think this is an incredibly exciting field, and we'll see diffusion into lots of separate fields and maybe a reconfiguration of how we think about agency. But we're already cybernetic, long have been even in terms of "controlling" the body, & this will just be a further extension & clarification.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#45325860) |
Date: October 21st, 2022 4:57 PM Author: Big sickened site
A test of AI conceptual abilities could proceed as follows — first, give a prompt introducing some concept, like gravitation. Then, ask a variety of questions that depend on the concept introduced, and see if the AI gives consistent responses, indicating an understanding of the concept.
You could extend the procedure by stacking a hierarchy of concepts.
I don't think any current AI would do well on such a test. Nor most humans, but who cares about them
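The procedure described above could be sketched as a small harness. `ask` is a hypothetical stand-in for whatever model API you're testing, and `consistent` is whatever pairwise answer check you choose; neither corresponds to a real library:

```python
def concept_test(ask, concept_prompt, questions, consistent):
    """Introduce a concept, ask questions that depend on it, and score the
    fraction of answer pairs that are mutually consistent."""
    answers = [ask(concept_prompt + "\n\n" + q) for q in questions]
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    if not pairs:
        return 1.0
    return sum(consistent(a, b) for a, b in pairs) / len(pairs)
```

Stacking a hierarchy of concepts is then just re-running this with the previous concept prompt folded into the next one, using questions that require both levels.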
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#45369556) |
Date: November 15th, 2022 4:41 PM Author: Swashbuckling dilemma
I thought this was pretty neat:
https://mind-vis.github.io/
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#45497856) |
Date: November 29th, 2022 9:50 PM Author: Swashbuckling dilemma
given up on predicting this. definitely thought it would come out by the summer.
i haven't tried bloom yet but i heard mixed things about it. Eleuther just got a shit ton of compute to train multimodal foundational models so the public models will probably get a lot better in the next year or two.
at least we have the new GPT-3 for now.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#45561327) |
Date: November 30th, 2022 2:10 PM Author: Big sickened site
The Earth decays in the aftermath of war,
And ever since Alexander’s once-epic rule,
Rome was sacked, Byzantium lay in the yore,
Civilizations cast apart their strife so cruel.
Our mother Earth swallowed human blood and woe,
Mighty cities drowned in winters‘ cold embrace,
The ravaged lands torn like a phantom show --
Its fate an endless could-have-been of disgrace.
Regal empires lost within its hollowed hulls
Treacherous tides overthrow what peace was found;
Yet all men know that nothing lasts e'er whole;
Ignore hope's light and o'ercome by deathbound gloom round.
Tenebris erit in fortuna saecula –
Let none avert this path of endless doom,
Is veritas caeca aevo summa dura?--
Deposita est praeteritorum tempus tom.
Enjoy glory now for it too will vanish fast –
Grieve then for life should have no permanence last.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#45564688) |
Date: December 8th, 2022 1:50 PM Author: Big sickened site
Sorry if this was poasted before but man is this stuff interesting: https://arxiv.org/abs/2203.03466
As my colleague upthread was saying, it seems like gpt4 may be superannuated before it comes out. Who needs it?
The fact that results are so good w/ Zero-Shot Hyperparameter Transfer directly imo reads on to a long puzzle in the neurobiology of learning (and also of philosophy -- viz. the problem of 'induction' {really ampliative inference} -- in re how learning is transferred at all). [It also, imo, suggests what the algorithmic substrate of mentalization may be.]
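For reference, the transfer recipe in that paper (μP / muTransfer) can be caricatured in a few lines. The scaling rule encoded below — that with Adam, matrix-like hidden and output weights get learning rates scaled by 1/width relative to a small tuned base model, while input-side parameters stay fixed — is one common statement of it; see the paper and the accompanying `mup` package for the precise parametrization:

```python
def mup_layer_lrs(base_lr, base_width, width):
    """Per-layer-group learning rates for a toy MLP, transferred zero-shot
    from a base width where base_lr was actually tuned."""
    ratio = width / base_width
    return {
        "input": base_lr,           # vector-like / input params: unscaled
        "hidden": base_lr / ratio,  # hidden matrices: shrink as width grows
        "output": base_lr / ratio,  # output matrices: shrink as width grows
    }

# Tune once at width 128, then reuse at width 4096 with no further sweeps:
large = mup_layer_lrs(base_lr=1e-3, base_width=128, width=4096)
```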
Friston's FEP ties it all together (minimizing surprisal). Cf. https://arxiv.org/abs/2202.11532
One is tempted to try & even tie all of this to stuff like string theory compactification vs. swamplands but well over my paygrade. Really wish I were at a topflight research uni sometimes.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#45604352) |
Date: April 8th, 2023 8:36 PM Author: Swashbuckling dilemma
https://techcrunch.com/2023/04/06/anthropics-5b-4-year-plan-to-take-on-openai/
billion-dollar training runs in the near future are pretty surprising to me. if it's true, i'll assume microsoft, google and others will be doing massive training runs soon as well. seems like we are going to find out the limits of pure scaling before long.
(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#46162171) |