Just wanted to jump in here and say that Nancy, Jeremy, and the whole team are both exceptionally rational, and exceptionally capable. In my ~10 years of programming professionally, Apptimize is easily the best working environment I've experienced.
Absolutely worth checking out.
The karma has spoken. I've registered proveitforreal.com. Thank you!
I think a trademarked "proved" image will do nicely for use on labels :)
"Did you kill yourself at any point during the last 24 hours?" is not likely to produce anything useful at all.
I see. Right now the system doesn't have any defined questions. I believe that suitable questions will be found, so I'm focusing on the areas I have a solid background in.
If a project is unsafe in a literal way, shipping the product to consumers (or offering it for sale) is of course illegal. However, when considering a sous vide cooker in the past I have always worried about the dangers of potentially eating undercooked food (e.g. diar...
The link I gave to the data collection webapp describes the data collection in more depth, which I believe is what you are asking about between 6 and 7.
From that url:
Core function:
Potential changes to this story:
I described an overview in a different thread, but that was before a lot of discussion happened.
I'll use this as an opportunity to update the script based on what has been discussed. This is of course still non-final.
Awesome, great link. Example study here.
I think the needs for this project are still substantially different. Genomera trusts the scientists, which is usually a fine thing to do. I've applied for a beta invite, but don't have access. Based on the example study I've linked it seems like they are more focused on assisting in data gathering (which based on my recent experience seems like the easiest thing we are considering).
Okay, sorry I've been away from the thread for a while. I spent the last half day hacking together a rough version of the data collection webapp. This seemed reasonable because I haven't heard any disagreement on that part of the project, and I hope that having some working code will excite us :)
The models are quite good and well tested at this point, but the interface is still a proof of concept. I'll have some more time tomorrow evening, which will hopefully be enough time to finish off the question rendering and SMS sending. I think with those two featu...
Ahh, okay. That one goes on the scrap heap.
I think if you change the price of something by an order of magnitude you get a fundamental change in what it's used for. The examples that jump to mind are letters -> email, hand copied parchment -> printing press -> blogs, and SpaceX. If you increase the quality at the same time you (at least sometimes) get a mini-revolution.
I think a better example might be online courses. It can be annoying that you can't ask the professor any questions (customize the experience), but they are still vastly better than nothing.
Hmm. I'm confused. Let's look at something slightly more extreme than what we're talking about and see if that helps.
Level 0: Imagine we make a product study as good as possible, then allow anyone to perform the same study with a different product. Some products "shouldn't" be tested that way, but I don't see how a protocol like that will produce garbage (they will merely establish "no effect").
Level 1: We broaden to support more companies, and allow anyone to perform those studies as well.
Level 2: After a sufficient number of companie...
I did a poor job with the introduction. I'm assuming the studies exist, because if they don't, that's full-on false advertising.
Not to pick on anyone in particular, here are some I recently encountered:
The probiotics section at wholefoods (and my interactions with customer...
I am hopeful that at minimum we can create guidelines for selecting questions.
I also think that some standardized health & safety questions by product category would be good (for nutritional supplements I would personally be interested in seeing data for nausea, diarrhea, weight change, stress/mood, and changes in sleep quality).
For productivity solutions I'd be curious about effects on social relationships, and other changes in relaxation activities.
Within a given product category, I'm also hopeful we can reuse a lot of questions. Soylent's test and Mealsquares' test shouldn't require significantly different questions.
Yeah, this is a brutal point. I wish I knew a good answer here.
Is there a gold standard approach? Last I checked even the state of the art wasn't particularly good.
Facebook / Google / StumbleUpon ads sound promising in that they can be trivially automated, and if only ad respondents could sign up for the study, then the friend issue is moot. Facebook is the most interesting of those, because of the demographic control it gives.
How bad is the bias? I performed a couple google scholar searches but didn't find anything satisfying.
To make things more complicat...
Thanks, that's a great point.
I'm worried that a statistical calculator will throw off founders who would otherwise test their products with us (specifically YC founders, an abnormally influential group), so as much as possible I'd like to keep sample sizes in the "Advanced Menu" section. (This is not to say sample size is an unimportant issue -- I'm saying the defaults are the more important issue, because many people won't be customizing the default values.)
I also think there are three unique features of product studies that can help simplify defining good default ...
Thank you for posting this!
I feel like in this situation I can safely say: "I love standards, there are so many to choose from."
Getting a list of LessWrong approved questions would be awesome. Both because I think the LW list will be higher quality than a lot of what's out there, and because I feel question choice is one of the free variables we shouldn't leave in the hands of the corporation performing the test.
If anybody doubts that the products are real, the trusted organisation has copies that they can analyse.
This is a great point. Maybe community members could bet karma on the outcome of a tox screening? This could create a prioritized list.
One problem with my earlier suggestion is that some companies will want narrowly selected participant pools. These will necessarily differ from the population at large, and might create data that looks like a poison placebo is being used. I see two possible solutions to this problem:
Can you provide an example of what you'd like to see pass muster?
I'm glad you're here. My background is in backend web software, and stats once the data has been collected. I read "Measuring the Weight of Smoke" in college, but that's not really a sufficient background to design the general protocol. That's a lot of my motivation behind posting this to LW - there seem to be protocol experts here, with great critiques of the existing ones.
My hope is we can create a "getting started testing" document that gets honest companies on the right track. Searching around the web I'm finding things like this ra...
This is awesomely paranoid. Thank you for pointing this out.
I'm a little worried a solution here will call for whoever controls the webapp to also be an expert at creating placebos for every product type. (If we trust contract manufacturers to be honest, then the issue of adding poisons to a placebo can be handled by having them ship directly to the third party for mailing... but I think that's already the default case.)
Perhaps poisons can be discovered by looking at other products which performed the same protocol? "This experiment has to be re-done becaus...
The participant has no way to decide between the two packages or know which one is the placebo and which one is the real thing, so he doesn't need to go through the process of flipping a real coin.
I am worried that standardized shipping will come with standardized package layout, and I'm guessing "preference of left vs right identical thing" correlates with something the system will eventually test. Having thought about it more, this is the real issue with allowing customers to choose which product they'll use: that decision has to be purely ra...
Yes. I think if we can manage it, requiring data-analysis to be pre-declared is just better. I don't think science as a whole can do this, because not all data is as cheap to produce as product testing data.
Now that I've heard your reply to question #8, I need to consider this again. Perhaps we could have some basic claims done by software, while allowing for additional claims such as "those over 50 show twice the results" to be verified by grad students. I will think about this.
The FTC is so much better than a lawsuit. I don't know a single advertiser who isn't afraid of the FTC. It looks like enforcement is tied to complaint numbers, so the press release should include information about how to personally complain (and go out to a mailing list as well).
I would love assistance with the agreements. It sounds like you would be more suited to the Business <> Non-Profit agreements than the Participant <> Business agreements. How do I maximize the value of your contribution? Are you more suited to the high-level term sheet, or the final wording?
I'm very glad I asked for more clarification. I'm going to call this system The Reviewer's Dilemma; it's a very interesting solution for allowing non-software analysis to occur in a trusted manner. I am somewhat worried about a laziness bias (it's much easier to agree than to disprove), but I imagine that if there is a similar bounty for overturning previous results, this might be handled.
I'll do a little customer development with some friends, but the possibility of reviewers being added as co-authors might also act as a nice incentive (both to reduce laziness, and as additional compensation).
Thank you. I had not seen the reproducibility initiative. Link very much appreciated, I'll start the conversation tonight. PLoS hosting the application would be ideal.
Oh, interesting.
I had been assuming that participants needed to be drawn from the general population. If we don't think there's too much hazard there, I agree a points system would work. Some portion of the population would likely just enjoy the idea of receiving free product to test.
Your approach to blinding makes sense, and works. I thought we were trying for a zero third party approach though?
I was giving more thought to a distributed solution during dinner, and I think I see how to solve the physical shipments problem in a scalable way. I'm still not 100% sold on it, but consider these two options:
Thank you. I had forgotten about that.
So let's say the two groups were, as you suggest:
Do you have any thoughts on what questions we should be asking about this product? Somehow the data collection and analysis once we have the timeseries data doesn't seem so hard... but the protocol and question design seems very difficult to me.
Thank you. Help considering the methodology and project growth prospects is very much appreciated.
I agree that compatibility with traditional papers is important. It was not stated explicitly, but I do want the results to be publishable in traditional journals. I plan on publishing the results for my company's product. It seemed to me like being overly rigorous might be a selling point initially -- "sure we did the study cheap / didn't use a university, but look how insanely rigorous we were"
Going after professional researchers seems much harder....
Thanks, fixed.
Interesting. I'm hoping that by getting a trustworthy non-profit to host the site (and paying for a security audit) we can largely side step the issues.
I spent a long time trying to create a way not to need the trusted third party, but I kept hitting dead ends. The specific dead end that hurt the most was blinding of physical product shipments.
If we can figure out a way to ship both products and placebos to people without knowing who's getting what, I think we can do this :)
This is not at all self-evident to me. How, for example, would you demonstrate product safety (for a diverse variety of products) via a standard template?
Templates, not template. I think if you know roughly which bodily systems a product is likely to affect, the questions are not so diverse.
My background is not in question selection (it's ML and webapp programming), but here goes some general question ideas for edible products:
8 - Have you considered some form of reputation system, allowing commenters to build a reputation for debunking badly supported claims and affirming well-supported claims? (Or perhaps some other goodie?) I can imagine it becoming a pastime for grad students, which would be a Good Thing (TM).
I hadn't. I like the idea, but am less able to visualize it than the rest of this stuff. Grad students cleaning up marketing claims does indeed sound like a Good Thing...
7 - Can sponsors do a private mini-trial to test its trial design before going full bore (presumably, with their promise not to publicize the results)?
This is an awesome idea. I had not considered this until you posted it. This sounds great.
6 - Does a sponsor have any recourse if it designed the trial badly, leading to misleading results? Or is its remedy really to design a better trial and publicize that one?
This is a hard one. I anticipate that at least initially only Good People will be using this protocol. These are people who spent a lot of time creating something to (hopefully) make the world better. Not cool to screw them if they make a mistake, or if v1 isn't as awesome as anticipated.
A related question is: what can we do to help a company that has demonstrated its effectiveness?
5 - What do the participants get? Is that simply up to the sponsor? If so, who reviews it to assure that the incentive does not distort the data? If no one, will you at least require that the incentive be reported as part of the trial?
We need to design rules governing participant compensation.
At a minimum I think all compensation should be reported (it's part of what's needed for replication), and of course not related to the results a participant reports. Ideally we create a couple defined protocols for locating participants, and people largely choose to go with a known good solution.
4 - What action does the organization behind the app take if a sponsor publicly misrepresents the data or, more likely, its meaning? If the organization would take action, does it take the same action if the statement is merely misleading, rather than factually incorrect?
I imagined similar actions to those the Free Software Foundation takes when a company violates the GPL: basically a lawsuit and a press release warning people. For template studies, ideally what claims can be made would be specified by the template (i.e. "Our users lost XY more pounds over Z time").
3 - If the sponsoring company does its own math and stats, must it publicly post its working papers before making claims based on the data? Does anyone review that to make sure it passes some light smell test, and isn't just pictures of cats?
At minimum the code used should be posted publicly and open-source licensed (otherwise there can be no scrutiny or replication). I also think paying to have a third party review the code isn't unreasonable.
2 - Is the data (presumably anonymized) made publicly available, so that others can dispute the meaning?
That was the initial plan, yes! Beltran (my co-founder at GB) is worried that will result in either HIPAA issues or something like this, so I'm ultimately unsure. Putting structures in place so the science is right the first time seems better.
1 - For more complicated propositions, who does the math and statistics? The application apparently gathers the data, but it is still subject to interpretation.
This problem can be reduced in size by having the webapp give out blinded data, and only reveal group names after the analysis has been publicly committed to. If participating companies are unhappy with the existing modules, they could perhaps hire "statistical consultants" to add a module, permanently improving the site for everyone.
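To make the blinding idea concrete, here is a minimal sketch of how the webapp might hand out blinded data: real group names are swapped for opaque labels, the analyst publicly commits to their analysis, and only afterwards is the mapping revealed. The field names, labels, and helper function here are entirely my own invention, not an existing design.

```python
import random

def blind_groups(records, seed=None):
    """Replace real group names with opaque labels ('group_A', 'group_B', ...)
    so an analysis can be committed to before anyone knows which arm is which.
    The returned mapping stays secret until the analysis is published."""
    rng = random.Random(seed)
    names = sorted({r["group"] for r in records})
    labels = [f"group_{chr(ord('A') + i)}" for i in range(len(names))]
    rng.shuffle(labels)
    mapping = dict(zip(names, labels))
    blinded = [{**r, "group": mapping[r["group"]]} for r in records]
    return blinded, mapping

# Hypothetical records -- only the blinded list is handed to the analyst.
records = [{"id": 1, "group": "treatment"}, {"id": 2, "group": "control"}]
blinded, mapping = blind_groups(records, seed=0)
```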
This could be related to your #8 as well :)
Thank you! This is exactly the kind of discussion I was hoping for.
The general answer to your questions is: I want to build whatever LessWrong wants me to build. If it's debated in the open, and agreed as the least-worst option, that's the plan.
I'll post answers to each question in a separate thread, since they raise a lot of questions I was hoping for feedback on.
In my experience, startups want to demonstrate efficacy about basic things: weight loss, increased revenue, personal productivity, product safety, etc.
This kind of research lends itself extremely well to protocol templates: a standardized sequence of steps to locate the participants, collect the data, and decide the results. These steps could be performed by a website. I've posted a story of how that might work here.
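To make "protocol template" concrete, one possible shape for such a template as data (every field name below is illustrative, not a spec):

```python
from dataclasses import dataclass, field

@dataclass
class ProtocolTemplate:
    """A standardized, reusable study design a website could execute end to end.
    Field names are illustrative assumptions, not an existing schema."""
    name: str
    recruitment: str   # how participants are located, e.g. "facebook_ads"
    sample_size: int
    arms: list = field(default_factory=lambda: ["treatment", "control"])
    questions: list = field(default_factory=list)  # daily SMS/web questions
    analysis: str = "pre-declared"  # analysis committed before unblinding

# A hypothetical weight-loss template another company could reuse as-is.
weight_loss = ProtocolTemplate(
    name="weight_loss_v1",
    recruitment="facebook_ads",
    sample_size=200,
    questions=["What is your weight this morning (lbs)?"],
)
```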
Without such a project, founders have two options:
Perform the study themselves. The scientific background and time required to design a s
Thanks for pointing this out.
Let's use Beeminder as an example. When I emailed Daniel he said this: "we've talked with the CFAR founders in the past about setting up RCTs for measuring the effectiveness of beeminder itself and would love to have that see the light of day".
Which is a little open ended, so I'm going to arbitrarily decide that we'll study Beeminder for weight loss effectiveness.
Story* as follows:
Daniel goes to (our thing).com and registers a new study. He agrees to the terms, and tells us that this is a study which can impact health...
Those participants are randomly assigned to two groups: (1) act normal, and (2) use Beeminder to track exercise and food intake.
These kinds of studies suffer from the Hawthorne effect. It is better to assign the control group to do virtually anything instead of nothing. In this case I'd suggest having them simply monitor their exercise and food intake without any magical line and/or punishment.
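The randomization step itself is simple enough for the webapp to do; a minimal sketch with an active control arm along these lines (participant IDs, arm names, and the seed are placeholders):

```python
import random

def assign_arms(participant_ids, seed=None):
    """Randomly split participants into a treatment arm and an active control
    arm. The control arm still tracks exercise and food intake -- just without
    Beeminder's commitment line -- so the Hawthorne effect is comparable."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {
        "beeminder_tracking": ids[:half],
        "plain_tracking": ids[half:],  # active control: monitor only
    }

# Seeded for reproducibility of the assignment record.
arms = assign_arms(range(100), seed=42)
```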
Thanks for the example. It leads me to questions:
Apptimize (YC S13) is also interested. (disclosure: the CEO and CTO, Nancy and Jeremy, are friends of mine, and I worked there as a programmer).
If anyone else would like to be included, please reply here.
Replying to prevent the edit mark (given the sha1) -- last sentence was supposed to be "split the overlap difference and bet :)". Currency needs to be USD or BTC, let's have a reputable Berkeley area LWer doing the escrow.
Sure. The SHA1 of my odds in the following format:
"xx% to win +-y% spread {password}"
(meaning I'd sell you "me to win" for more than xx% + y%, and buy "me to win" from you at xx% - y%).
is:
897a40baca33e2a53a49bdddc00abede82713a7c (sha1-online.com)
Give me your odds and desired size. If there's overlap, let's split the overlap difference bet :)
Okay, fair enough. I don't see why the scoring changes for N=1 vs N=2 on the number of bots per player (edit: if anything it's more interesting since then you have an incentive to make your bots cooperate), but I'll just hold back the other design for now -- on the off chance we do a tournament like this again :p
I have two bots I'd like to submit -- one of which will likely win, and one of which is interesting.
Any chance we can allow two submissions per programmer?
I won't let one person submit two programs to compete in the tournament, but I came up with a possible compromise. If you want, you can send me two programs and tell me which one you want to compete. I'll run the non-competitor against the other programs as well, and tell you the results, but the non-competitor will not be ranked, and rounds involving the non-competitor will not be counted towards the scores of other submissions.
Hey guys, just wanted to let you know I'm still hacking away. Looking like Sunday is likely, with invites going out in the order the requests were received.
I'll post back here then :)
I'm not sure how comfortable you are working with Terminal, but this works:
curl -T MyFileToUpload.txt ftp://myusername:mypassword@ftp.myhost.com/directory/
(and can be repeated with two keystrokes: "Up, Enter")
Okay, so I like everyone else's comments, but they feel complicated compared with what I came up with:
Harry convinces himself of #2 enough to say it in parselmouth.
Harry says "I think I understand the prophecy you're trying to avoid, and I believe killing me makes it happen. I would say more, but you'd probably use it to kill me" in parselmouth.
Harry stays silent.