LESSWRONG
LW

Comment Permalink

I think in some cases an embedding approach produces better results than either a LLM or a simple keyword search, but I'm not sure how often. For a keyword search you have to know the "relevant" keywords in advance, whereas embeddings are a bit more forgiving. Though not as forgiving as LLMs. Which on the other hand can't give you the sources and they may make things up, especially on information that doesn't occur very often in the source data.

See in context

xpostah's Shortform

by samuelshadrach

1st Jan 2025

1 min read

2

This is a special post for quick takes by samuelshadrach. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

43 comments, sorted by

top scoring

Click to highlight new comments since: Today at 4:25 PM

[-]samuelshadrach12d20

Search engine for books

http://booksearch.samuelshadrach.com

Aimed at researchers

Technical details (you can skip this if you want):

Dataset size: libgen 65 TB, (of which) unique english epubs 6 TB, (of which) plaintext 300 GB, (from which) embeddings 2 TB, (hosted on) 256+32 GB CPU RAM

Did not do LLM inference after embedding search step because human researchers are still smarter than LLMs as of 2025-03. This tool is meant for increasing quality for deep research, not for saving research time.

Main difficulty faced during project - disk throughput is a bottleneck, and popular languages like nodejs and python tend to have memory leak when dealing with large datasets. Most of my repo is in bash and perl. Scaling up this project further will require a way to increase disk throughput beyond what mdadm on a single machine allows. Having increased funds would've also helped me completed this project sooner. It took maybe 6 months part-time, could've been less.

[-]cubefox12d42

Error: TypeError: NetworkError when attempting to fetch resource.

[-]samuelshadrach12d10

use http not https

[-]cubefox12d20

Okay, that works in Firefox if I change it manually. Though the server seems to be configured to automatically redirect to HTTPS. Chrome doesn't let me switch to HTTP.

[-]samuelshadrach11d10

Thanks for your patience. I'd be happy to receive any feedback. Negative feedback especially.

[-]cubefox11d*20

I see you fixed the https issue. I think the resulting text snippets are reasonably related to the input question, though not overly so. Google search often answers questions more directly with quotes (from websites, not from books), though that may be too ambitious to match for a small project. Other than that, the first column could be improved with relevant metadata such as the source title. Perhaps the snippets in the second column could be trimmed to whole sentences if it doesn't impact the snippet length too much. In general, I believe snippets currently do not show line breaks present in the source.

[-]samuelshadrach10d10

Thanks for feedback.

I’ll probably do the title and trim the snippets.

One way of getting a quote would to be to do LLM inference and generate it from the text chunk. Would this help?

[-]cubefox10d20

I think not, because in my test the snippet didn't really contain such a quote that would have answered the question directly.

[-]samuelshadrach6d10

Can you send the query? Also can you try typing the query twice into the textbox? I'm using openai text-embedding-3-small, which seems to sometimes work better if you type the query twice. Another thing you can try is retry the query every 30 minutes. I'm cycling subsets of the data every 30 minutes as I can't afford to host the entire data at once.

[-]cubefox4d20

I think my previous questions were just too hard, it does work okay on simpler questions. Though then another question is whether text embeddings improve over keyword search or just an LLMs. They seem to be some middle ground between Google and ChatGPT.

Regarding data subsets: Recently there were some announcements of more efficient embedding models. Though I don't know what the relevant parameters here are vs that OpenAI embedding model.

[-]samuelshadrach4d30

Cool!

Useful information that you’d still prefer using ChatGPT over this. Is that true even when you’re looking for book recommendations specifically? If so yeah that means I failed at my goal tbh. Just wanna know.

Since Im spending my personal funds I can’t afford to use the best embeddings on this dataset. For example text-embedding-3-large is ~7x more expensive for generating embeddings and is slightly better quality.

The other cost is hosting cost, for which I don’t see major differences between the models. OpenAI gives 1536 float32 dims per 1000 char chunk so around 6 KB embeddings per 1 KB plaintext. All the other models are roughly the same. I could put in some effort and quantise the embeddings, will update if I do it.

[-]cubefox4d30

[-]samuelshadrach1d10

Got it. As of today a common setup is to let the LLM query an embedding database multiple times (or let it do Google searches, which probably has an embedding database as a significant component).

Self-learning seems like a missing piece. Once the LLM gets some content from the embedding database, performs some reasoning and reaches a novel conclusion, there’s no way to preserve this novel conclusion longterm.

When smart humans use Google we also keep updating our own beliefs in response to our searches.

P.S. I chose not to build the whole LLM + embedding search setup because I intended this tool for deep research rather than quick queries. For deep research I’m assuming it’s still better for the human researcher to go read all the original sources and spend time thinking about them. Am I right?

[-]samuelshadrach11d10

Update: HTTPS should work now

[-]samuelshadrach26d*21

Human genetic engineering targetting IQ as proposed by GeneSmith is likely to lead to an arms race between competing individuals and groups (such as nation states).

- Arms races can destabilise existing power balances such as nuclear MAD

- Which traits people choose to genetically engineer in offspring may depend on what's good for winning the race rather than what's long-term optimal in any sense.

- If maintaining lead time against your opponent matters, there are incentives to bribe, persuade or even coerce people to bring genetically edited offspring to term.

- It may (or may not) be possible to engineer traits that are politically important, such as superhuman ability to tell lies, superhuman ability to detect lies, superhuman ability to persuade others, superhuman ability to detect others true intentions, etc.

- It may (or may not) be possible to engineer cognitive enhancements adjacent to IQ such as working memory, executive function, curiosity, truth-seeking, ability to experience love or trust, etc.

- It may (or may not) be possible engineer cognitive traits that have implications on which political values you will find appealing. For instance affective empathy, respect for authority, introversion versus extroversion, inclination towards people versus inclination towards things, etc.

I'm spitballing here, I haven't yet studied genomic literature on which of these we know versus don't know the edits for. But also, we might end up investing money (trillions of dollars?) to find edits we don't know about today.

Has anyone written about this?

I know people such as Robin Hanson have written about arms races between digital minds. Automated R&D using AI is already likely to be used in an arms race manner.

I haven't seen as much writing on arms races between genetically edited human brains though. Hence I'm asking.

[-]cubefox25d8-1

Standard objection: Genetic engineering takes a lot of time till it has any effect. A baby doesn't develop into an adult over night. So it will almost certainly not matter relative to the rapid pace of AI development.

[-]samuelshadrach25d10

I agree my point is less important if we get ASI by 2030, compared to if we don’t get ASI.

That being said, the arms race can develop over the timespan of years not decades. 6-year superhumans will prompt people to create the next generation of superhumans, and within 10-15 years we will have children from multiple generations where the younger generation have edits with stronger effect sizes. Once we can see the effects on these multiple generations, people might go at max pace.

[-]samuelshadrach26d40

PSA

Popularising human genetic engineering is also by default going to popularise lots of neighbouring ideas, not just the idea itself. If you are attracting attention to this idea, it may be useful for you to be aware of this.

The example of this that has already played out is popularising "ASI is dangerous" also popularises "ASI is powerful hence we should build it".

[-]Viliam19d2-2

If you convince your enemies that IQ is a myth, they won't be concerned about your genetically engineered high IQ babies.

[-]samuelshadrach18d10

Superhumans that are actually better than you at making money will eventually be obvious. Yes, there may be some lead time obtainable before everyone understands, but I expect it will only be a few years at maximum.

[-]samuelshadrach26d10

P.S. Also we don't know the end state of this race. +5 SD humans aren't necessarily the peak, it's possible these humans further do research on more edits.

This is unlikely to be careful controlled experiment and is more likely to be nation states moving at maximum pace to produce more babies so that they control more of the world when a new equilibrium is reached. And we don't know when if ever this equilibrium will be hit.

[-]samuelshadrach7h10

Forum devs including lesswrong devs can consider implementing an "ACK" button on any comment, indicating I've read a comment. This is distinct from

a) Not replying - other person doesn't know if I've read their comment or not

b) Replying something trivial like "okay thanks" - other person gets a notification though I have nothing of value to say

[-]samuelshadrach11d10

Update: HTTPS issue fixed. Should work now.

booksearch.samuelshadrach.com

Books Search for Researchers

[-]samuelshadrach1mo10

Project idea for you

Figure out why don't we build one city with one billion population
- Bigger cities will probably accelerate tech progress, and other types of progress, as people are not forced to choose between their existing relationships and the place best for their career
- Assume end-to-end travel time must be below 2 hours for people to get benefits of living in the same city. Seems achievable via intra-city (not inter-city) bullet-train network. Max population = (200 km/h * 2h)^2 * (10000 people/km^2) = 1.6 billion people
- Is there any engineering challenge such as water supply that prevents this from happening? Or is it just lack of any political elites with willingness + engg knowledge + governing sufficient funds?
- If a govt builds the bullet train network, can market incentives be sufficient to drive everyone else (real estate developers, corporate leaders, etc) to build the city or will some elites within govt need to necessarily hand-hold other parts of this process?

[-]Purplehermann1mo20

Vr might be cheaper

[-]samuelshadrach1mo2-1

I agree VR might be one-day be able to do this (make online meetings as good as in-person ones). As of 2025, bullet trains are more proven tech than VR. I'd be happy if both were investigated in more depth.

[-]Purplehermann1mo10

A few notes on massive cities:

Cities of 10Ms exist, there is always some difficulty in scaling, but scaling 1.5-2 OOMs doesn't seem like it would be impossible to figure out if particularly motivated.

China and other countries have built large cities and then failed to populate them

The max population you wrote (1.6B) is bigger than china, bigger than Africa, similar to both American Continents plus Europe .

Which is part of why no one really wants to build something so big, especially not at once.

Everything is opportunity cost, and the question of alternate routes matters alot in deciding to pursue something. Throwing everything and the kitchen sink at something costs a lot of resources.

Given that VR development is currently underway regardless, starting this resource intense project which may be made obsolete by the time it's done is an expected waste of resources. If VR hit a real wall that might change things (though see above).

If this giga-city would be expected to 1000x tech progress or something crazy then sure, waste some resources to make extra sure it happens sooner rather than later.

Tl;dr:

Probably wouldn't work, there's no demand, very expensive, VR is being developed and would actually be able to say what you're hoping but even better

[-]samuelshadrach1mo10

especially not at once.

It could be built in stages. Like, build a certain number of bullet train stations at a time and wait to see if immigrants + real estate developers + corporations start building the city further, or do the stations end up unused?

I agree there is opportunity cost. It will help if I figure out the approx costs of train networks, water and sewage plumbing etc.

I agree there are higher risk higher reward opportunities out there, including VR. In my mind this proposal seemed relatively low risk so I figured it’s worth thinking through anyway.

no demand

This is demonstrably false. Honestly the very fact that city rents in many 1st world countries are much higher than rural rents proves that if you reduced the rents more people would migrate to the cities.

[-]Purplehermann1mo10

Lower/Higher risk and reward is the wrong frame.

Your proposal is high cost.

Building infrastructure is expensive. It may or may not be used, and even if used it may not be worthwhile.

R&D for VR is happening regardless, so 0 extra cost or risk.

Would you invest your own money into such a project?

"This is demonstrably false. Honestly the very fact that city rents in many 1st world countries are much higher than rural rents proves that if you reduced the rents more people would migrate to the cities."

Sure, there is marginal demand for living in cities in general. You could even argue that there is marginal demand to live in bigger vs smaller cities.

This doesn't change the equation: where are you getting one billion residents - all of Africa? There is no demand for a city of that size.

[-]samuelshadrach1mo10

Would you invest your own money in such a project?

If I were a billionaire I might.

I also have (maybe minor, maybe not minor) differences of opinion with standard EA decision-making procedures of assigning capital across opportunities. I think this is where our crux actually is, not on whether giant cities can be built with reasonable amounts of funding.

And sorry I won’t be able to discuss that topic in detail further as it’s a different topic and will take a bunch of time and effort.

[-]Purplehermann1mo10

Our cruxes is whether the amount of investment to build one has a positive expected return on investment, breaking down into

If you could populate such a city
Whether this is a "try everything regardless of cost" issue, given that a replacent is being developed for other reasons.

I suggest focusing on 1, as it's pretty fundamental to your idea and easier to get traction on

[-]samuelshadrach1mo10

1 is going to take a bunch of guesswork to estimate. Assuming it were possible to migrate to the US and live at $200/mo for example, how many people worldwide will be willing to accept that trade? You can run a survey or small scale experiment at best.

What can be done is expand cities to the point where no more new residents want to come in. You can expand the city in stages.

[-]Purplehermann26d10

Definitely an interesting survey to run.

I don't think the US wants to triple the population with immigrants, and $200/month would require a massive subsidy. (Internet says $1557/month average rent in US)

How many people would you have to get in your city to justify the progress?

100 Million would only be half an order of magnitude larger than Tokyo, and you're unlikely to get enough people to fill it in the US (at nearly a third of the population, you'd need to take a lot of population from other cities)

How much do you have to subsidize living costs, and how much are you willing to subsidize?

[-]samuelshadrach26d0-1

If I understand correctly it is possible to find $300/mo/bedroom accommodation in rural US today, and a large enough city will compress city rents down to rural rents. A govt willing to pursue a plan as interesting as this one may also be able to increase immigrant labour to build the houses and relax housing regulations. US residential rents are artificially high compared to global average. (In some parts of the world, a few steel sheets (4 walls + roof) is sufficient to count as a house, even water and sewage piping in every house is not mandatory as long as residents can access toilets and water supply within walking distance.)

(A gigacity could also increase rents because it'll increase the incomes of even its lowest income members. But yeah in general now you need to track median incomes of 1B people to find out new equilibrium.)

[-]ProgramCrafter1mo10

Is there any engineering challenge such as water supply that prevents this from happening? Or is it just lack of any political elites with willingness + engg knowledge + governing sufficient funds?

That dichotomy is not exhaustive, and I believe going through with the proposal will necesarily make the city inhabitants worse off.

Humans' social machinery is not suited to live in such large cities, as of the current generations. Who to get acquainted with, in the first place? Isn't there lots of opportunity cost to any event?
Humans' biomachinery is not suited to live in such large cities. Being around lots and lots of people might be regulating hormones and behaviour to settings we have not totally explored (I remember reading something that claims this a large factor to lower fertility).
Centralization is dangerous because of possibly-handmade mass weapons.
Assuming random housing and examining some quirk/polar position, we'll get a noisy texture. It will almost certainly have a large group of people supporting one position right next to group thinking otherwise. Depending on sizes and civil law enforcement, that may not end well.

After a couple hundred years, 1) and 2) will most probably get solved by natural selection so the proposal will be much more feasible.

[-]samuelshadrach1mo10

Sorry I didn’t understand your comment at all. Why are 1, 2 and 4 bigger problems in 1 billion population city versus say a 20 million population city?

[-]ProgramCrafter1mo10

I'd maintain that those problems already exist in 20M-people cities and will not necessarily become much worse. However, by increasing city population you bring in more people into the problems, which doesn't seem good.

[-]samuelshadrach1mo24

Got it. I understood what you're trying to say. I agree living in cities has some downsides compared to living in smaller towns, and if you could find a way to get the best of both instead it could be better than either.

[-]samuelshadrach2mo*-20

http://tokensfortokens.samuelshadrach.com

Pay for OpenAI API usage using cryptocurrency.

Currently supported: OpenAI o1 model, USDC on Optimism Rollup on ethereum.

Why use this?

- You want anonymity

- You want to use AI for cheaper than the rate OpenAI charges

How to use this?

- You have to purchase a few dollars of USDC and ETH on Optimism Rollup, and install Metamask browser extension. Then you can visit the website.

More info:

- o1 by OpenAI is the best AI model in the world as of Jan 2025. It is good for reasoning especially on problems involving math and code. OpenAI is partially owned by Microsoft and is currently valued above $100 billion.

- Optimism is the second largest rollup on top of ethereum blockchain. Ethereum is the second largest blockchain in terms of market capitalisation. (Bitcoin is the largest. Bitcoin has very limited functionality, and it is difficult to build apps using it.) People use rollups to avoid the large transaction fees charged by blockchains, while still getting similar level of security. As of 2025 users have trusted Optimism with around $7 billion in assets. Optimism is funded by Paradigm, one of top VCs in the cryptocurrency space.

- USDC is a stablecoin issued by Circle, a registered financial company in the US. A stablecoin is a cryptocurrency token issued by a financial company where the company holds one dollar (or euro etc) in their bank account for every token they issue. This ensures the value of the token remains $1. As of 2025, USDC is the world's second largest stablecoin with $45 billion in reserves.

[-]samuelshadrach3mo-30

Im selling $1000 tier5 OpenAI credits at discount. DM me if interested.

You can video call me and all my friends to reduce the probably I end up scamming you. Or vice versa I can video call your friends. We can do the transaction in tranches if we still can’t establish trust.

[+]samuelshadrach2mo-170

Moderation Log

Curated and popular this week

142On the Rationality of Deterring ASI

Dan H

138Levels of Friction

Zvi

186METR: Measuring AI Ability to Complete Long Tasks

Zach Stein-Perlman

43Comments