Is ChatGPT smoking something on the bolded section, or is that really a thing?
Huh, do they really analyze «random chunks of the Internet» (what, trying to connect to random IPs? That's going to work worse and worse with IPv6...), and not a curated list of websites, or maybe even just one already curated with more or less success by the search engines?
----
And most of this metadata already exists, and has existed for years if not decades, hasn't it?? Search engines have the same issue, yet they do somehow manage to figure out the date a webpage was created/modified. Or do they get it wrong too, and it's just less noticeable?
----
Was the answer in this case susceptible to the same kind of memory-holing as your Penance Brand example?
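On the date question: one common source is the page's own meta tags. A minimal sketch of how a crawler might pick those up, using only Python's standard library; the tag names checked here (e.g. `article:published_time`) are widespread conventions, not a standard every site follows, and the sample HTML is made up:

```python
from html.parser import HTMLParser

class MetaDateParser(HTMLParser):
    # Common (but not universal) meta keys that carry publication dates.
    DATE_KEYS = {"article:published_time", "og:updated_time", "date", "dc.date"}

    def __init__(self):
        super().__init__()
        self.dates = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        key = (a.get("property") or a.get("name") or "").lower()
        if key in self.DATE_KEYS and a.get("content"):
            self.dates.append((key, a["content"]))

page = """<html><head>
<meta property="article:published_time" content="2024-05-14T09:00:00Z">
<meta name="description" content="Some article">
</head></html>"""

p = MetaDateParser()
p.feed(page)
print(p.dates)  # [('article:published_time', '2024-05-14T09:00:00Z')]
```

When a page omits these tags (or lies in them), the date has to be guessed from crawl history or page content, which is presumably where the noise comes from.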
" As always, the answer is "it depends"; there are many different models. When designing your model, you have to try to find data as close as possible to the input data you are expecting. So if I were to train an AI to classify documents for company XYZ (e.g. differentiate between purchase order, inquiry, insurance, etc.), I'd give it a data set it can learn from and tell it where it should expect the sender address, order number, etc. Other models do other things, and thus the data will be different. If I want a language model to write short stories in the style of Stephen King, I'd chuck some of his books in there. It's fascinating, really. With news sites you can expect every article to have a date and sources listed (if applicable). Other websites may not be as diligent, though.

The opposite of knowledge is not illiteracy, but the illusion of knowledge.
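To make the document-classification idea concrete, here is a toy sketch. The categories come from the post; the keywords and weights are made up for illustration, since a real model would learn them from the labeled training set rather than having them hand-coded:

```python
# Toy keyword scorer, NOT a trained model: a real classifier would learn
# these associations from labeled example documents.
KEYWORDS = {
    "purchase_order": {"order", "quantity", "units", "deliver"},
    "inquiry": {"question", "information", "interested", "quote"},
    "insurance": {"policy", "claim", "coverage", "premium"},
}

def classify(text):
    tokens = set(text.lower().split())
    # Score each category by how many of its keywords appear in the text.
    scores = {label: len(tokens & kws) for label, kws in KEYWORDS.items()}
    return max(scores, key=scores.get)

print(classify("We would like to order 50 units, deliver by June"))
# purchase_order
```

Training replaces the hand-picked sets above with statistics over thousands of labeled documents, which is why the quality of that data set matters so much.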
" Datasets can be bought, or at least used, depending on the license. There are many of them, and the biggest ones are preprocessed internet websites. This is what I meant by random internet data. Just querying data from IPs etc. would have serious copyright issues ;) And those chunks are a little bit curated; meta tags etc. are there, but not completely and not consistently. ChatGPT used almost everything, from wiki to reddit to all available public-domain books and many more. Facebook trains their own shit with its websites etc. "

Yes, metadata is there and sometimes can be used, but LLMs work differently: they don't look up any metadata in some kind of database. In training sets you can of course set a target label, but you can't really save that, only build a bias for choosing it. The point is, an LLM is built to answer questions like a human would: not in the sense of being full of correct information, but in the sense of syntax and semantics.

Knowledge about a specific domain must be introduced either with a (vector) database (here tagging etc. works) plus hidden query engineering, or with huge datasets (let's say reddit/wiki). The former approach is extremely expensive and needs data scientists and domain experts; it's mostly used by companies to build their own chatbots for support etc. They also need a base LLM for sentence construction. The latter approach is ChatGPT/Mistral/etc. Those systems have general knowledge about almost anything, but not much in the sense of quality assurance or domain specifics. Their only option for improving results is query engineering: this is actually exactly what OpenAI introduced in their new GPT-4o version. After getting a query, an agent-based system queries the LLM and refines the answer via reinforcement learning until some level of positive feedback is reached. As you can imagine, it's not fast :D

Search engines work more like classic NLP (natural language processing) approaches.
Websites have tags and other info in their metadata. The search provider also has insanely large databases with even more tags and annotations for websites. Queries are simply dismantled into the important words (subject, (ad)verb, object, adjective) and then matched against the database to find the websites with the most hits. To improve results, many combinations are tried. There are of course other metrics behind the scenes, like page rank, paid promotions, search history, etc.

" You get those problems even with web search results. Try googling a problem in Linux after a fresh kernel release. You will find the bug easily, but most likely you will find entries from years ago, as those sites have a much larger page rank than a new entry might have.

Current Build: Penance Brand
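The (vector)database approach mentioned above can be sketched in a few lines. This is only an illustration of the retrieval step: real systems use learned embeddings from a neural model, while word counts stand in for them here, and the documents and query are invented:

```python
import math
from collections import Counter

def embed(text):
    # Stand-in "embedding": a bag-of-words count vector.
    # Real retrieval systems use dense vectors from a trained model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "the refund policy allows returns within 30 days",
    "our support line is open monday to friday",
]
query = "how many days do i have for a refund"

# Retrieve the most similar document; a RAG system would prepend it
# to the prompt so the base LLM can phrase the answer.
best = max(docs, key=lambda d: cosine(embed(query), embed(d)))
print(best)  # the refund policy allows returns within 30 days
```

The expensive part in practice is not this lookup but curating the document store and the hidden prompt engineering around it, which is why it needs domain experts.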
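The classic search-engine flow described above (dismantle the query into important words, count hits per page) can be sketched with an inverted index. The pages, stopword list, and ids are made up; real engines add page rank and the other metrics mentioned:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "in", "on", "to", "how", "do", "i"}

pages = {
    "kernel-bug-2016": "old kernel bug report from years ago",
    "kernel-bug-new": "fresh kernel release bug in network driver",
}

# Inverted index: word -> set of page ids containing it.
index = {}
for pid, text in pages.items():
    for word in text.lower().split():
        index.setdefault(word, set()).add(pid)

def search(query):
    # Dismantle the query: keep only the important words.
    terms = [w for w in query.lower().split() if w not in STOPWORDS]
    hits = Counter()
    for t in terms:
        for pid in index.get(t, ()):
            hits[pid] += 1
    # Rank pages by number of matching terms.
    return [pid for pid, _ in hits.most_common()]

print(search("bug in fresh kernel release"))
# ['kernel-bug-new', 'kernel-bug-2016']
```

Note that pure hit-counting already ranks the fresh page first here; it's the extra signals like page rank, weighted on top of this, that push years-old entries above new ones in the scenario described.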
God build?! https://pobb.in/bO32dZtLjji5