Alexandros comments on 96 Bad Links in the Sequences - Less Wrong

34 Post author: Alexandros 07 April 2011 10:39AM

Comments (27)

Comment author: Alexandros 08 April 2011 06:14:20AM 0 points

Yeah, lxml parses all the HTML into a tree and gives you an API so you can access it as you like. It takes a lot of the grunt work out of extracting data from HTML.
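
For anyone who wants to see what that looks like, here is a minimal sketch of the parse-then-query pattern. It is not the script used for the post: the PAGE fragment, the example.com URLs, and the extract_links helper are all made up for illustration, and it assumes lxml is installed (pip install lxml).

```python
from lxml import html  # third-party package, assumed installed

# Hypothetical page fragment, invented for this example.
PAGE = """
<html><body>
  <h1>Sequences</h1>
  <p>See <a href="http://example.com/a">post A</a> and
     <a href="http://example.com/b">post B</a>.</p>
</body></html>
"""

def extract_links(source):
    """Return (href, link text) pairs for every <a> element that has an href."""
    tree = html.fromstring(source)  # parse the markup into an element tree
    # XPath selects every anchor with an href attribute, wherever it sits in the tree.
    return [(a.get("href"), a.text_content()) for a in tree.xpath("//a[@href]")]

links = extract_links(PAGE)
for href, text in links:
    print(href, text)
```

Once the page is a tree, pulling out every link is one XPath query rather than a pile of string matching.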

Comment author: jwhendy 08 April 2011 01:23:11PM 0 points

Which is awesome, as I just felt the pain of hand-pruning a heckuva lot of HTML tags out of something I wanted to transform to a different format. Even with find-and-replace, a tag split across line breaks wouldn't get matched fully, and I had to do a lot of tedious cleanup :)
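
A rough sketch of why a parser sidesteps the line-break problem: it reads the markup as a token stream, so a tag that wraps onto several lines is still one tag. This uses Python's standard-library html.parser as a stand-in rather than lxml, so it runs without extra installs. The TextOnly class, the strip_tags helper, and the SAMPLE string are hypothetical names made up for this example.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect only the text between tags, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text content only; tags never reach this method.
        self.chunks.append(data)

def strip_tags(source):
    """Return the text of an HTML fragment with tags removed and whitespace collapsed."""
    parser = TextOnly()
    parser.feed(source)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.chunks).split())

# The opening <a ...> tag is deliberately split over three lines.
SAMPLE = '<p>Some <a\n   href="http://example.com/x"\n   class="ext">linked</a> text</p>'

plain = strip_tags(SAMPLE)
print(plain)
```

A line-by-line find-and-replace would see three fragments of that anchor tag, none of which looks like a complete tag on its own.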