gjm comments on Open Thread March 21 - March 27, 2016 - Less Wrong Discussion

3 Post author: Gunnar_Zarncke 20 March 2016 07:54PM

Comment author: gjm 21 March 2016 01:22:18PM 8 points

Finding comments on LW is more painful than it should be because sometimes this happens:

  • You remember that X replied to Y saying something with words Z in.
  • You put something like <<X Y Z site:lesswrong.com>> into Google (directly or via the "Google custom search" box in the right sidebar).
  • You get back a whole lot of pages, but
    • they all contain X and Y because of the top-contributors or recent-comments sections of the right sidebar;
    • they all contain Z because of the recent-comments section of the right sidebar.
  • By the time you visit them, none of those pages contains either the comment in question or a link to it.
  • Using the "cached" link from the search results doesn't help, because the right sidebar is generated dynamically and is simply absent from the cached pages.
    • So how come they're found by the search? Beats me.

Here's a typical example; it happens to use only Z (I picked one of my comments from a couple of weeks ago) but including X and Y seldom helps.

I just tried the equivalent search in Bing and the results were more satisfactory, but only because the comment in question happened to appear fairly near the top of the overview page for the user I was replying to. I would guess that Bing isn't actually systematically better for these searches, but I haven't tested.

Does anyone know a good workaround for this problem?

Is there a way to make the dynamically-generated sidebar stuff on LW pages invisible to Google's crawler? It looks like there is. Should I file an issue on GitHub?

Comment author: Vaniver 21 March 2016 03:40:00PM 5 points

Is there a way to make the dynamically-generated sidebar stuff on LW pages invisible to Google's crawler? It looks like there is. Should I file an issue on GitHub?

Yes, you should do this.

Comment author: gjm 21 March 2016 06:44:23PM 11 points
Comment author: Viliam 22 March 2016 08:38:24AM 0 points

Unfortunately, there is no standard way to make parts of a page disappear from search engines' indexes. This is super annoying, because almost every page contains navigational parts that do not contribute to its actual content.

HTML5 defines a semantic <nav> tag for the navigational links in a document. I think a smart search engine should exclude those parts, but I have no idea whether any engine actually does. Maybe converting LW pages to HTML5 and adding this tag would help.
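As a sketch, the change might look something like this (the class names and inner structure are hypothetical; the <nav> wrapper is the point):

```html
<!-- Hypothetical: wrap LW's dynamically-generated sidebar sections in an HTML5 <nav> element -->
<nav class="sidebar">
  <div class="top-contributors">…</div>
  <div class="recent-comments">…</div>
</nav>
```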

Some search engines use engine-specific syntax to exclude parts of a page, and sometimes it even violates the HTML standard. For example, Google uses the HTML comments <!--googleoff: all--> ... <!--googleon: all-->, Yahoo uses the HTML attribute class="robots-nocontent", and Yandex introduces a new <noindex> tag. (I like the Yahoo way best.)
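Side by side, those three look roughly like this (the sidebar markup itself is illustrative, not LW's actual markup):

```html
<!-- Google: comment directives bracketing the excluded region -->
<!--googleoff: all-->
<div class="sidebar">…</div>
<!--googleon: all-->

<!-- Yahoo: a class attribute on the excluded element -->
<div class="sidebar robots-nocontent">…</div>

<!-- Yandex: a non-standard wrapper tag -->
<noindex><div class="sidebar">…</div></noindex>
```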

The most standards-compliant way seems to be moving the offending parts of the page into separate HTML pages included via <iframe>, and using the standard robots.txt mechanism to block those pages. I think the disadvantage is that the included frames have fixed dimensions, instead of resizing dynamically with their content. Another option would be to insert those texts with JavaScript, which means that users with JavaScript disabled would not see them.
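A minimal sketch of the iframe approach, assuming the sidebar were served from a hypothetical /static/sidebar.html:

```html
<!-- Main page: include the sidebar as a separate document with fixed dimensions -->
<iframe src="/static/sidebar.html" width="300" height="600"></iframe>
```

with a matching robots.txt entry keeping crawlers away from that document:

```
User-agent: *
Disallow: /static/sidebar.html
```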

Comment author: Vaniver 22 March 2016 01:18:26PM 3 points

Since our local search is powered by Google, I'm content with a solution that only works for Google.

Comment author: philh 22 March 2016 03:48:17PM 2 points

Another solution would be to insert those texts by JavaScript, which means that users with JavaScript disabled would not see them.

They're already inserted by JavaScript. E.g. the 'recent comments' section works by fetching http://lesswrong.com/api/side_comments and inserting its contents directly into the page.

Editing robots.txt might exclude those parts from the Google index, but I don't know.

Comment author: Douglas_Knight 22 March 2016 05:33:47PM 1 point

I think robots.txt would work.
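If the sidebar content really does come from /api/side_comments as philh says, a minimal robots.txt sketch would block crawlers from fetching that endpoint:

```
User-agent: *
Disallow: /api/side_comments
```

Whether the inserted text then drops out of the index depends on whether Google's crawler executes the page's JavaScript; blocking the endpoint prevents the fetch during rendering, but I haven't verified the effect.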

Comment author: TheAltar 21 March 2016 02:35:12PM 1 point

I've run into this problem several times before. It would be very helpful if the search feature ignored the text in the sidebar.