Midjourney, “Fourth Industrial Revolution Digital Transformation”

This is a little rant I like to give, because it’s something I learned on the job that I’ve never seen written up explicitly.

There are a bunch of buzzwords floating around regarding computer technology in an industrial or manufacturing context: “digital transformation”, “the Fourth Industrial Revolution”, “Industrial Internet of Things”.

What do those things really mean?

Do they mean anything at all?

The answer is yes, and what they mean is the process of putting all of a company’s data on computers so it can be analyzed.

This is the prerequisite to any kind of “AI” or even basic statistical analysis of that data; before you can start applying your fancy algorithms, you need to get that data in one place, in a tabular format.

Wait, They Haven’t Done That Yet?

Each of these machines in a semiconductor fab probably stores its data locally. The team that operates one machine might not be able to see the data from the next one over.

In a manufacturing context, a lot of important data is not on computers.

Some data is not digitized at all, but literally on paper: lab notebooks, QA reports, work orders, etc.

Other data is “barely digitized”, in the form of scanned PDFs of those documents. Fine for keeping records, but impossible to search or analyze statistically. (A major aerospace manufacturer, from what I heard, kept all of the results of airplane quality tests in the form of scanned handwritten PDFs of filled-out forms. Imagine trying to compile trends in quality performance!)

Still other data is siloed inside machines on the factory floor. Modern, automated machinery can generate lots of data — sensor measurements, logs of actuator movements and changes in process settings — but that data is literally stored in that machine, and only that machine.

Manufacturing process engineers, for nearly a hundred years, have been using data to inform how a factory operates, generally using a framework known as statistical process control. However, in practice, much more data is generated and collected than is actually used. Only a few process variables get tracked, optimized, and/or used as inputs to adjust production processes; the rest are “data exhaust”, to be ignored and maybe deleted. In principle the “excess” data may be relevant to the facility’s performance, but nobody knows how, and they’re not equipped to find out.
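
For readers who haven't seen it, here is a minimal sketch of the kind of calculation SPC is built around: estimate 3-sigma control limits from a baseline run of one tracked variable, then flag readings that fall outside them. The variable and the numbers are made up; the point is that only a handful of variables typically get even this much treatment.

```python
import statistics

# Hypothetical example: control limits estimated from a baseline run of a
# single tracked process variable (say, a furnace temperature in deg C).
baseline = [350.1, 349.8, 350.4, 350.0, 349.6, 350.3, 350.2, 349.9, 350.5, 349.7]
center = statistics.mean(baseline)
sigma = statistics.stdev(baseline)
upper, lower = center + 3 * sigma, center - 3 * sigma  # classic 3-sigma limits

# New readings from production get checked against those limits.
new_readings = [350.2, 349.9, 351.6, 350.1]
for i, x in enumerate(new_readings):
    if not (lower <= x <= upper):
        print(f"reading {i}: {x} is outside control limits ({lower:.2f}, {upper:.2f})")
```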

This is why manufacturing/industrial companies will often be skeptical about proposals to “use AI” to optimize their operations. To “use AI”, you need to build a model around a big dataset. And they don’t have that dataset.

You cannot, in general, assume it is possible to go into a factory and find a single dataset that is “all the process logs from all the machines, end to end”.

Moreover, even when that dataset does exist, there often won’t be even the most basic built-in tools to analyze it. In an unusually modern manufacturing startup, the M.O. might be “export the dataset as .csv and use Excel to run basic statistics on it.”
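
For concreteness, the pandas equivalent of that Excel workflow might look like the sketch below; the file and column names are hypothetical.

```python
import pandas as pd

df = pd.read_csv("process_log.csv")              # the exported dataset

print(df["temperature"].describe())              # count, mean, std, min, quartiles, max
print(df.groupby("recipe")["yield_pct"].mean())  # average yield per recipe
```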

Why Data Integration Is Hard

In order to get a nice standardized dataset that you can “do AI to” (or even “do basic statistics/data analysis to”) you need to:

  1. obtain the data
  2. digitize the data (if relevant)
  3. standardize/ “clean” the data
  4. set up computational infrastructure to store, query, and serve the data (see the sketch after this list)
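
To make step 4 concrete at toy scale, here is a minimal sketch using Python's built-in sqlite3 module; the table and column names are made up, and real deployments involve proper databases, ETL pipelines, and access controls.

```python
import sqlite3

conn = sqlite3.connect("factory.db")  # hypothetical local database file
conn.execute("""
    CREATE TABLE IF NOT EXISTS sensor_readings (
        machine_id TEXT,
        ts         TEXT,   -- ISO-8601 timestamp
        variable   TEXT,   -- e.g. 'chamber_pressure'
        value      REAL
    )
""")
conn.execute(
    "INSERT INTO sensor_readings VALUES (?, ?, ?, ?)",
    ("etcher_03", "2024-05-01T12:00:00", "chamber_pressure", 2.31),
)
conn.commit()

# "Serve" a simple query: the average of one variable, per machine.
for machine_id, avg_value in conn.execute(
    "SELECT machine_id, AVG(value) FROM sensor_readings "
    "WHERE variable = ? GROUP BY machine_id",
    ("chamber_pressure",),
):
    print(machine_id, avg_value)
conn.close()
```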

     

Data Access Negotiation, AKA Please Let Me Do The Work You Paid Me For

Obtaining the data is a hard human problem.

That is, people don’t want to give it to you.

When you’re a software vendor to a large company, it’s not at all unusual for it to be easier to make a multi-million dollar sale than to get the data access necessary to actually deliver the finished software tool.

Why?

Partly, this is due to security concerns. There will typically be strict IT policies about what data can be shared with outsiders, and what types of network permissions are kosher.

For instance, in the semiconductor industry, everyone is justifiably paranoid about industrial espionage. They are not putting their factory data “on the cloud.” They may have fully airgapped facilities where nothing is connected to the open internet. They do not want images of in-progress chips, or details of their production processes, getting into “the wrong hands.”

Other industries have analogous concerns, about leaking trade secrets or customer details or other sensitive information. You will have to meet a lot of security requirements to get sensitive data out of the “blessed zone”[1] to put it on your computers, or to get your computers approved to install into the “blessed zone.”

Sometimes complying with security requirements for data sharing is simple; but the larger the company, the more likely you are to encounter multiple IT policies from different departments, and the more likely it is that some of them contradict each other, or are fundamentally incompatible with any kind of large-scale data integration.[2]

I have worked with a company that requires even potential vendors to spend over $1M of their own money building a special, ultra-secure room to store their data in…before the sale is even closed. Security requirements can be intense.

Then there are more “political” or “personal” reasons it can be hard to get lots of a company’s data in one place.

Sometimes people are worried that a “big data” tool will replace their jobs, or make their performance look bad, and they’re trying to obstruct the process.

Sometimes there are inter-departmental rivalries, such that people in one department don’t want to share data with another.

Sometimes people are just busy, and taking time out of their workday to do setup for a vendor is an inconvenience.

Dealing with the “human problem” of negotiating for data access is a huge, labor-intensive headache, and the effort scales pretty much linearly in the amount of data you’re trying to collect.

Palantir Technologies is now embracing the “AI” label, but back when I worked there, in 2016-2017, they billed themselves as a “data integration” company, because this is fundamentally what they do. Palantir builds its own software tools for managing and analyzing data — databases, ETL pipelines, analytics dashboards — and those tools work fine, but they are not, as far as I know, unique or exceptional in the tech industry. What is remarkable is that they have invested in a large number of people — at least a third of the company by headcount — to descend en masse to the customer’s facilities, set up and customize their database and associated tools, teach the users to work the UI, and, crucially, negotiate for data access.

The Palantir playbook, at least on the commercial side,[3] is:

  • only sell to companies facing an “existential threat”, i.e. where something has gone so drastically wrong they might go out of business altogether
  • get buy-in from the C-suite executives, who care about company-wide objectives (“not going out of business”, “making more profit”) and whose individual reputations gain from leading major company-wide initiatives (like a “digital transformation” push).
  • win the hearts of front-line workers who’ll use your software by being super helpful and making their jobs easier, and hanging out with them on-site and building personal connections
  • use your allies on the bottom (front-line workers) and the top (executives) to squeeze out your opponents in the middle (managers, often in IT or data science departments, who would rather not have an external vendor encroach on their turf).
    • The “higher-ups” can demand that you be given speedy assistance and data access;
    • the “front-line” guys can pressure their bosses to do what’s necessary to get the Cool New Palantir Tools up and running.

The Palantir Way is labor-intensive and virtually impossible to systematize, let alone automate away. This is why there aren’t a hundred Palantirs. You have to throw humans at the persuasion problem — well-paid, cognitively flexible, emotionally intelligent humans, who can cope with corporate dysfunction.

In that way, they’re a lot like management consultants…and in fact, data integration at large scale is inherently a little like management consulting.

Every company has something fucked-up and dumb going on somewhere, no matter how admirable they are in other respects, and if they’re facing an existential crisis there’s definitely something going badly wrong that somebody doesn’t want to face. If you ever want to get all your data in one place, you need to figure out some of the shape of the Badness, in an environment where most of the people you meet are presenting as “reasonable and highly competent professionals” and everybody’s got a different story about the Badness and why it’s unavoidable or someone else’s fault.

For instance, at the customer I was embedded with, there was an entire data science department, of maybe ten people, whose job it was to create a single number — a risk score — which would be sent to another department to deal with. The head of this second department didn’t trust statistics[4] so he threw the number in the trash. The entire work output of those ten data scientists was irrelevant to the company.

I am not, myself, exceptionally good at this “corporate detective work.” I tend to ask questions pretty bluntly, which sometimes gets good results and sometimes ruffles feathers I didn’t intend… and then I tend to back off when I get negative reactions, which is itself sometimes the wrong move in retrospect. I have just enough experience with this process to be aware it is a thing, and to be humbled by how hard it is to see the whole picture of what’s going on in an organization.

What it’s like trying to figure out wtf is going on with your customer

Data Cleaning, AKA I Can’t Use This Junk

Every working data scientist will tell you they spend more time on “data cleaning” than actually running any statistical or machine-learning models.

What does that mean?

Removing commas. Formatting adjustments generally. Normalization. Imputing missing data in some consistent way.

And, understanding what your data sources actually refer to, so that when you make judgment calls they make practical sense.
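
The mechanical side of that list can be sketched in a few lines of pandas; the file and column names below are made up, and every step encodes a judgment call that someone had to sign off on.

```python
import pandas as pd

df = pd.read_csv("qa_report_export.csv")  # hypothetical messy export

# Remove thousands separators and coerce to numeric ("1,234.5" -> 1234.5).
df["measurement"] = pd.to_numeric(
    df["measurement"].astype(str).str.replace(",", "", regex=False),
    errors="coerce",
)

# Standardize inconsistently formatted dates into one representation.
df["test_date"] = pd.to_datetime(df["test_date"], errors="coerce")

# Impute missing values in some consistent, documented way (here: column median).
df["measurement"] = df["measurement"].fillna(df["measurement"].median())

# Normalize to zero mean and unit variance so different runs are comparable.
df["measurement_z"] = (df["measurement"] - df["measurement"].mean()) / df["measurement"].std()
```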

Data cleaning doesn’t seem intellectually challenging, but it is surprisingly difficult to automate, in a way that I think would make David Chapman smirk — there are unenumerable ways you might need to “clean” data to get it into a standard format appropriate for training a model, and it empirically never seems to be possible to write a program that just does that for you, though of course individual parts of the process can be automated.

Part of the issue is that the “reasonable” thing to do can depend on the “real-world” meaning of the data, which you need to consult a human expert on. For instance, are these two columns identical because they are literal duplicates of the same sensor output (and hence one can safely be deleted), or do they refer to two different sensors which happened to give the same readings in this run because the setting that would allow them to differ was switched off this time? The answer can’t be derived from the dataset, because the question pertains to the physical machine the data refers to; the ambiguity is inherently impossible to automate away using software alone.
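
The mechanical half of that check is trivial to automate; the decision is not. A sketch, with hypothetical column names:

```python
import pandas as pd

df = pd.read_csv("run_37_export.csv")  # hypothetical single-run machine export

# Detecting that two columns are identical is easy...
if df["pressure_a"].equals(df["pressure_b"]):
    print("pressure_a and pressure_b are identical in this run")
    # ...but whether one can safely be dropped is a question about the
    # physical machine, not the dataset: a literal duplicate of one sensor,
    # or two sensors that only coincided because a setting was switched off
    # this run? That answer has to come from a process engineer.
```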

I would expect that LLMs could make substantial, if not total, improvements in automating data cleaning, but my preliminary experiments with commercial LLMs (like ChatGPT & Claude) have generally been disappointing; it takes me longer to ask the LLM repeatedly to edit my file to the appropriate format than to just use regular expressions or other scripting methods myself. I may be missing something simple here in terms of prompting, though, or maybe LLMs need more surrounding “software scaffolding” or specialized fine-tuning before they can make a dent in data cleaning tasks.
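
For comparison, the “just script it” baseline I have in mind is usually only a few lines. A hypothetical example, using only the standard library, that strips currency symbols and thousands separators out of one column of a CSV:

```python
import csv
import re

# File name and column position are made up for illustration.
with open("orders_raw.csv", newline="") as src, \
     open("orders_clean.csv", "w", newline="") as dst:
    reader, writer = csv.reader(src), csv.writer(dst)
    writer.writerow(next(reader))  # copy the header row unchanged
    for row in reader:
        row[2] = re.sub(r"[^\d.\-]", "", row[2])  # "$1,234.50 " -> "1234.50"
        writer.writerow(row)
```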

Data cleaning, like data negotiation, is labor-intensive, because it’s high-paid work that scales linearly with dataset size. The more different data sources you need to integrate, the more consultant-hours you need to spend negotiating for it, and the more engineer-hours you need to spend cleaning it.

Every big enterprise software company (think SAP, Salesforce, AWS, etc.) that promises to be “the” singular place a large company can keep a large portion of its digital records devotes a ton of labor to getting each client set up with the software, because that job involves data access, data standardization, and data transfer, all of which are hard.

There are entire industries of third-party “partner” companies that do nothing all day but help their clients set up Salesforce or AWS or whatever. And it’s not unusual for setup to take a year or more.

In general, it’s my observation that some kinds of work are harder than they look — tedious and seemingly routine, but often requiring judgment calls that are surprisingly hard to outsource to an intern or otherwise “junior” staff member, let alone automate away with software. Data cleaning is one of them. So are data labeling, data entry, summarizing and organizing “key” facts and numbers from large piles of documents, etc. Apart from the basic programming involved in data cleaning, none of this requires any special education or training; it’s “just” reading, writing, arithmetic, and “common sense”…and yet, somehow, not just anyone can do it right.[5]

AI is Gated On Data Integration

This is why I disagree with a lot of people who imagine an “AI transformation” in the economic productivity sense happening instantaneously once the models are sufficiently advanced.

For AI to make really serious economic impact, after we’ve exploited the low-hanging fruit around public Internet data, it needs to start learning from business data and making substantial improvements in the productivity of large companies.

If you’re imagining an “AI R&D researcher” inventing lots of new technologies, for instance, that means integrating it into corporate R&D, which primarily means big manufacturing firms with heavy investment into science/engineering innovation (semiconductors, pharmaceuticals, medical devices and scientific instruments, petrochemicals, automotive, aerospace, etc). You’d need to get enough access to private R&D data to train the AI, and build enough credibility through pilot programs to gradually convince companies to give the AI free rein, and you’d need to start virtually from scratch with each new client. This takes time, trial-and-error, gradual demonstration of capabilities, and lots and lots of high-paid labor, and it is barely being done yet at all.

I’m not saying “AI is overrated”, at all — all of this work can be done and ultimately can be extremely high ROI. But it moves at the speed of human adaptation.

Think of the adoption of computers into the business world, decades ago. There was nothing fake about that “transformation”. It happened; it mattered; many fortunes were made. But it happened at the speed of human negotiation and learning — corporate executives had to decide to buy computers, and many individuals within each company had to figure out what they were going to do with the new machines, and this had to happen independently for each individual use case. It wasn’t instantaneous. And it didn’t hit everywhere at the same speed; some applications would be doing things manually long after other applications would be fully digitized.

So too with AI. You can’t skip over the work of convincing customers that the tool will be worth the setup costs, overcoming internal friction to actually change the way companies operate, and testing and adapting the tool for different contexts.

  1. ^

    not a technical term; the actual terminology will differ by industry and context. Sometimes they call it “OT” or “high side”.

  2. ^

    Pharma trade secrets, for instance, are often kept secret from other departments within the company. This can make it tricky to build datasets that span multiple departments.

  3. ^

    I don’t know as much about the government side, where they’re selling to military, security agency, and law enforcement customers, and where the constraints and incentives are significantly different.

  4. ^

    literally, he did not believe in probabilities between zero and one. yes, such people exist. he would say things like “either it is, or it isn’t” and didn’t buy it when we tried to explain that a 90% chance and a 10% chance are both uncertain but you should treat them differently.

  5. ^

    the “conventional wisdom” in image processing is that “data labeling” is a trivial task you can farm out to mechanical-Turk gig workers, and that smart, well-paid people write machine-learning models. But actually, I’m convinced that staring at the images is skilled work that can be done much better by some people than others. For instance, notice Casey Handmer’s discovery of the “crackle pattern” characteristic of ink on the Herculaneum scrolls; before any algorithms got written, he spent hours staring at the pictures. Do you, in principle, need to be a JPL engineer and successful startup founder to see a pattern in pictures? No. In principle, anyone could do it. In practice, it was someone “overqualified” for the job who did. This is not, anecdotally, unusual.

Comments

I've done a lot of this kind of data cleaning, proof of concept, then scaling to deployed reliable product in my career. I agree it's as you describe.

The thing about using this as a way to predict the economic/technological effects of a very general AI that's competent enough to do R&D is... I don't think that change has to route through existing companies.

If the AI is sufficiently capable, and can exploit existing openly available research papers to learn about the world.... 

If this AI can use existing publicly available datasets to build specialized prediction tools for itself, along the lines of AlphaFold...

If the AI can skillfully control generalized robotic actuators (e.g. robotic arms, humanoid robotic bodies), and interpret visual data intelligently, and make skillful use of tools built for humans....

Such an AI can leapfrog over a ton of existing tech. At small scale anyway. So then you must look at special cases where a small scale prototype of a new technology could be transformative for the world.

There's not a lot of contexts I can imagine this being the case for. Most of them come down to situations where the small scale prototype is in some way self-replicating, so that it can scale. The others involve using a high-leverage small scale prototype to influence decision-making processes of society at large (e.g. super-persuasion or brain-control chips inserted into key political and corporate leaders).

Self-replicating technology is totally a viable and dangerous thing though. The most obvious case within the reach of current tech is bioweapons. Self-replicating nanotech remains a possibility on the table, theoretically allowed by physics but not yet demonstrated. What about industrial uses of self-replicating technology? Strange synthetic biology? 

These things all seem quite unlikely in our current technological regime, but imagine a million AIs working together who are much smarter, faster, unsleeping, super-knowledgeable, super well-coordinated, obsessively work-focused, fast learners, good at crafting specialized tools for handling specific datasets or physical world problems. Things might change surprisingly quickly!

Here's some categories I think could be pretty world changing, even without buy-in from existing industry:

  • Cheap production of novel drugs with powerful and specific effects
  • Engineered organisms (animals, plants, or fungi) or collectives thereof which can make use of hydrocarbons in plastic and can successfully reproduce rapidly using just the material in a landfill plus sunlight and water, while also doing some kind of economically useful activity.
  • Same, but you feed them on industrial agricultural wastes (easier)
  • Same, but you can directly feed them fossil fuels (harder) 
    • Crude oil consumption, plus the ability to live deep under the ocean, plus the ability to export products in such a way that they are easy to harvest by AI-controlled submarines or ships above, would enable huge, rapid growth of an industrial base.
    • https://youtube.com/shorts/c1ZhujO0Fqo?si=yMYIh_cGEP20oulb
  • Organism which can thrive in hard-vacuum, harness solar power and heat gradients, feed on ice, carbon, oxygen, etc. could be 'planted' by a small mission to a large asteroid or comet. It would be hard for this to be directly economically productive back on Earth, but could make for a big headstart for the beginnings of a race to colonize the local solar system.
  • Organic or cyborg computing systems. Scientists have already discussed the possibility of wiring up a bunch of rodent brains together through a computer network using brain-computer-interfaces. But what if you could skip using fragile inconvenient animal brains? What if you could get fungal mycelium to work like neurons for computation? You can grow a whole lot of engineered fungal mycelium really fast, and feed it quite easily. Even if the result was much lower average utility per cubic millimeter than a rodent brain, the ability to manufacture and use it on an industrial scale would likely make up for that.
  • Plants capable of producing plastic. This could do a lot, actually. Imagine if there were plants which could grow on the interface between land and sea, and produce little plastic bubbles for themselves which functioned as solar stills. They could refine fresh water from salt water using sunlight. What about plants forming highly robust and lengthy lateral stems which could pump this water inland and receive sugars back? Suddenly you have plants which can colonize deserts which border oceans. Plants are limited in how far they can move water vertically because of various difficulties including the vaporization pressure of water and limitations of capillary action. Horizontally though, plants can form enormous root-stem-root-stem complexes and the resulting colonies can cover many square miles. 

I could go on about possibilities for hours. I'm not a super-intelligent corporation of a million R&D AIs, so of course my ideas are far more limited in variety and plausibility than what they could come up with. The point is, once things get that far, we can't predict them using the trends and behavior patterns of traditional human corporations.

This seems right, though I'd interpreted the context of Sarah's post to be more about what we expect in a pre-superintelligence economy.

Yes, I agree that that's what the post was talking about. I do think my comment is still relevant since the transition time from pre-superintelligence human-level-AGI economy to superintelligent-AGI-economy may be just a few months. Indeed, that is exactly what I expect due to the high likelihood I place on the rapid effects of recursive self-improvement enabled by human-level-AGI.

I would expect that the company developing the human-level-AGI may even observe the beginnings of a successful RSI process and choose to pursue that in secret without ever releasing the human-level-AGI. After all, if it's useful for rapid, potent RSI, then releasing it would give their competitors a chance to catch up or overtake them, since the competitors would also have access to the human-level-AGI.

Thus, from the point of view of outside observers, it may seem that we jump straight from no-AGI to a world affected by technology developed by superintelligent-AGI, without ever seeing either the human-level-AGI or the superintelligent-AGI deployed.

[Edit: FWIW I think that Tom Davidson's report errs slightly in the other direction, of forecasting that things in the physical world might move somewhat faster than I expect. Maybe 1.2x to 5x faster than I expect. So that puts me somewhere in-between world-as-normal and world-goes-crazy in terms of physical infrastructure and industry. On the other hand, I think Tom Davidson's report somewhat underestimates the rate at which algorithm / intelligence improvement could occur. Algorithmic improvements which 'piggyback' on existing models, and thus start warm and improve cheaply from there, could have quite fast effects with no wait time for additional compute to become available or need to retrain from scratch. If that sort of thing snowballed, which I think it might, the result could get quite capable quite fast. And that would all be happening in software changes, and behind closed doors, so the public wouldn't necessarily know anything about it.]

This seems like a useful and accurate overview of the general state of data utilization in many organizations.

In my work as a software engineer at a clinical research company, I'm frequently able to watch as my coworkers struggle to convince our clients (companies running clinical trials) that yes, it is critical to make sure all of the available data entry options are locked to industry-standardized terms FROM THE BEGINNING, or else they will be adding thousands of hours of data cleaning on the tail end of the study.

An example of an obstacle to this: Clinicians running/designing the trials are sometimes adamant that we include an option in the field for "Reason for treatment discontinuation" called "Investigator Decision", when that is not an available term in the standard list and the correct standardized code item is "Physician Decision". But they are convinced that the difference matters, even though on the back end the people doing the data cleaning are required to match it with the acceptable coded terms, and it'll get mapped to "Physician Decision" either way, because the FDA only accepts applications that adhere to the standards.
In my opinion, a common cause of this disconnect is that those running trials are usually quite ignorant of what the process of data cleaning and analysis looks like, and they have never been recipients of their own data.
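
Mechanically, the cleanup described above amounts to a mapping pass; here is a minimal sketch with made-up file and column names (the hard part is agreeing on the standard and the mapping, not writing the loop):

```python
import pandas as pd

df = pd.read_csv("trial_export.csv")  # hypothetical study export

# Site-entered free text -> the standardized term the submission actually requires.
STANDARD_TERMS = {
    "Investigator Decision": "Physician Decision",
    "Physician Decision": "Physician Decision",
    "Pt withdrew consent": "Withdrawal by Subject",
}

df["reason_std"] = df["reason_for_discontinuation"].map(STANDARD_TERMS)
unmapped = df.loc[df["reason_std"].isna(), "reason_for_discontinuation"].unique()
print("values still needing a human decision:", unmapped)
```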

As a pipe dream I would be in favor of mandatory data science courses for all medical professionals before letting them participate in any sort of research, but realistically that would only add regulatory burden while accomplishing little good as there's no practical way to guarantee they actually retain or make use of that knowledge.
 

literally, he did not believe in probabilities between zero and one. yes, such people exist. he would say things like “either it is, or it isn’t” and didn’t buy it when we tried to explain that a 90% chance and a 10% chance are both uncertain but you should treat them differently.

...How does someone this idiotic ever stay in a position of authority? I would get their statements on statistics and probability in writing and show it to the nearest person-with-ability-to-fire-them-who-is-not-also-a-moron.

...How does someone this idiotic ever stay in a position of authority? I would get their statements on statistics and probability in writing and show it to the nearest person-with-ability-to-fire-them-who-is-not-also-a-moron.

Maybe the nearest person-with-ability-to-fire-them-who-is-not-also-a-moron could give them one last chance:

"I have a red die and a blue die, each with 20 sides. If I roll the red one then you only keep your job if it rolls a 20. For the blue one you only get fired if it comes up 1.

"I'm going to roll the red one unless you can explain to me why you should want me to roll the blue one instead."

But probably not.

I spent eighteen months working for a quantitative hedge fund. So we were using financial data -- that is, accounts, stock prices, things that are inherently numerical. (Not like, say, defining employee satisfaction.) And we got the data from dedicated financial data vendors, the majority from a single large company, who had already spent lots of effort to standardise it and make it usable. We still spent a lot of time on data cleaning.


I completely agree with your post in almost all senses, and this is coming from someone who has also worked out in the real world, with real problems, trying to collect and analyze real data (K-12 education, specifically--talk about a hard environment in which to do data collection and analysis; the data is inherently very messy, and the analysis is very high stakes).

But this part

For AI to make really serious economic impact, after we’ve exploited the low-hanging fruit around public Internet data, it needs to start learning from business data and making substantial improvements in the productivity of large companies.

If you’re imagining an “AI R&D researcher” inventing lots of new technologies, for instance, that means integrating it into corporate R&D, which primarily means big manufacturing firms with heavy investment into science/engineering innovation (semiconductors, pharmaceuticals, medical devices and scientific instruments, petrochemicals, automotive, aerospace, etc). You’d need to get enough access to private R&D data to train the AI, and build enough credibility through pilot programs to gradually convince companies to give the AI free rein, and you’d need to start virtually from scratch with each new client. This takes time, trial-and-error, gradual demonstration of capabilities, and lots and lots of high-paid labor, and it is barely being done yet at all.

I think undersells the extent to which

A) the big companies have already started to understand that their data is everything and that collecting, tracking, and analyzing every piece of business data they have is the most strategic move they can make, regardless of AI

B) even current levels of AI will begin speeding up data integration efforts by orders of magnitude (automating the low-hanging fruit for data cleaning alone could save thousands of person-hours for a company)

Between those two things, I think it's a few years at most before the conduits for sharing and analyzing this core business data are set up at scale. I work in the big tech software industry and know for a fact that this is already happening in a big way. And more and more, businesses of all sizes are getting used to the SaaS infrastructure where you pay for a company to have access to specific (or all) parts of your business such that they provide a blanket service for you that you know will help you. Think of all of the cloud security companies and how quickly that got stood up, or all the new POS platforms. I think those are more correct analogies than the massive hardware scaling that had to happen during the microchip and then PC booms. (Of course, there's datacenter scaling that must happen, but that's a manifestly different, more centralized concern.)

TL;DR: I think you offer a lot of valuable insights about how organizations actually work with data under the current paradigms. But I don't think this data integration dynamic will slow down takeoff as much as you imply.

Curated. This post gave me a lot of concrete gears for understanding and predicting how AI will affect the economy in the near future. This wasn't quite "virtue of scholarship" (my impression is Sarah is more reporting firsthand experience than doing research) but I appreciated the rich details. 

I'm generally interested in curating posts where someone with a lot of industry experience writes up details about that industry.

Some particular notes:

  • I'm not surprised by "companies store their data really badly and siloed", but I appreciated the gears of several incentives that make this not trivial to fix by just saying "c'mon guys" (i.e. legitimate fear of losing trade secrets), as well as dumb screwups.
  • Correspondingly, understanding how human-social-labor-intensive Palantir's business model is, and why that's hard to replicate. (This fits into John Wentworth's Coordination as Scarce Resource)
  • Generally appreciating how long it takes technological changes to propagate.

This is why I disagree with a lot of people who imagine an “AI transformation” in the economic productivity sense happening instantaneously once the models are sufficiently advanced.

For AI to make really serious economic impact, after we’ve exploited the low-hanging fruit around public Internet data, it needs to start learning from business data and making substantial improvements in the productivity of large companies.

Definitely agree that private business data could advance capabilities if it were made available/accessible. Unsupervised Learning over all private CAD/CAM data would massively improve visuo-spatial reasoning which current models are bad at. Real problems to solve would be similarly useful as ground truth for reinforcement learning. Not having that will slow things down.

Once long(er) time-horizon tasks can be solved, though, I expect rapid capabilities improvement. Likely a tipping point where AIs become able to do self-directed learning.

  • Find a technological thing: software/firmware/hardware.
  • Connect the AI to it robustly.
    • For hardware, the AI is going to brick it; either have lots of spares or be able to re-flash firmware at will.
    • For software this is especially easy. Current AI companies are likely doing a LOT of RL on programming tasks in sandboxed environments.
  • The AI plays with the thing and tries to get control of it.
    • Can you rewrite the software/firmware?
    • Can you get it to do other cool stuff?
    • Can the artefact interact with other things?
      • Send packets between wifi chips (how low can round-trip times be pushed?)
      • Make sound with anything that has a motor.
  • Some of this is commercially useful and can be sold as a service.

Hard drives are a good illustrative example. Here's a hardware hacker reverse engineering and messing with the firmware to do something cool.

There is ... so much hardware out there that can be bought cheaply and then connected to with basic soldering skills. In some cases, if soft-unbricking is possible, just buy and connect to ethernet/usb/power.

Revenue?

There's a long tail (as measured by commercial value) of real world problems that are more accessible. On one end you have the subject of your article, software/devices/data at big companies. On the other, obsolete hardware whose mastery has zero value, like old hard disks. The distribution is somewhat continuous. Transaction costs for very low value stuff will set a floor on commercial viability but $1K+ opportunities are everywhere in my experience.

Not all companies will be as paranoid/obstructive. A small business will be happy using AI to write interface software for some piece of equipment to skip the usual pencil/paper --> excel-spreadsheet step. Many OEMs charge ridiculous prices for basic functionality and nickel and dime you for small bits of functionality since only their proprietary software can interface with their hardware. Reverse engineering software/firmware/hardware can be worth thousands of dollars. So much of it is terrible. AI competent at software/firmware/communication reverse engineering could unlock a lot of value from existing industrial equipment. OEMs can and are building new equipment to make this harder but industrial equipment already sold to customers isn't so hardened.

IoT and home automation is another big pool of solvable problems. There's some overlap between home automation and industrial automation. Industrial firmware/software complexity is often higher, but AI that learns how to reverse engineer IoT wireless microcontroller firmware could probably do the same for a PLC. Controlling a lightbulb is certainly easier than controlling a CNC lathe but similar software reverse engineering principles apply and the underlying plumbing is often similar.

Following on this:

Moreover, even when that dataset does exist, there often won’t be even the most basic built-in tools to analyze it. In an unusually modern manufacturing startup, the M.O. might be “export the dataset as .csv and use Excel to run basic statistics on it.”

I wonder how feasible it would be to build a manufacturing/parts/etc company whose value proposition is solving this problem from the jump. That is to say, redesigning parts with the sensors built in, with accompanying analysis tools, preferably as drop-in replacements where possible. In this way companies could undergo a "digital transformation" gradually, at pretty much their regular operations speed.

It occurs to me we can approach the problem from a completely data-centric perspective: if we want tool AIs to be able to control manufacturing more closely, we can steal a page from the data-center-as-computer people and think of the job of the machines themselves as being production sensors in a production center.

Wrangling the trade-offs would be tricky though. How much better will things be with all this additional data as compared to, say, a fixed throughput or efficiency increase? If we shift further and think in terms of how the additional data can add additional value, are we talking about redesigning machines such that they have more degrees of freedom, i.e. every measurement corresponds to an adjustable variable on the machine?


I think a ground-up educational solution could be groups of kids coming up with a classification scheme for a bunch of objects, then comparing schemes between groups. I'm curious what youngest age could work. I first encountered the problem in designing urban research surveys in college, but I think it might grab even kindergartners.

Given all the data to be systematized, and the apparent danger of everything suddenly going 'too fast', I propose that the highest-priority data to start with would be personal. Individuals resolving their own contradictions internally might even be a necessity to switch our evolution off the war-based track that has pushed us so far, so fast. 
