If you write an article in Word, Writer, Scrivener, Google Docs, or another rich text editor, and then copy+paste that rich text into an online WYSIWYG editor like the one on Less Wrong or WordPress, the HTML generated by LW or WordPress is incredibly messy and does tons of weird stuff to your text.

Because of this, I've taken to composing all my posts in Markdown, which is plain text (like HTML) but easier to read, and can be easily converted to clean HTML.

Ideally, though, authors would be able to compose articles in whatever editor they want, and then paste their rich text into a simple web tool that strips all formatting from the HTML except the formatting they want to keep.

HTML Purifier, TIDY, and HTML Tidy aren't quite what we need. Word2CleanHTML, Word HTML Cleaner and WordOff, along with CKEditor's and TinyMCE's 'Paste from Word' features, kinda work, but not really: they still make mistakes pretty often when I try them.

What I was hoping to find was something like Word2CleanHTML but with these changes:

  1. Does a good job when pasting from just about any rich text editor, not just Word.
  2. Allows the user to choose which formatting to keep, using a list of checkboxes for bold, italic, strikethrough, headings, text coloring, blockquotes, etc.

Does this exist and I just couldn't find it? Or is it relatively easy for a coder to create?
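
To give a rough sense of what the core of such a tool might involve, here is a minimal sketch in Python using only the standard library. It is not how Word2CleanHTML or any of the tools above work; the names `FORMATTING_TAGS`, `FormattingFilter`, and `clean_html` are made up for illustration, and the "checkboxes" are just a list of strings. It keeps only the tags belonging to the chosen formatting options, drops every attribute, and discards `<style>` and `<script>` content. Text coloring is deliberately left out: editors store it in `style` attributes and `<span>` wrappers, which is exactly the messy part a real tool would have to handle.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical mapping from a "checkbox" name to the HTML tags it keeps.
FORMATTING_TAGS = {
    "bold": {"b", "strong"},
    "italic": {"i", "em"},
    "strikethrough": {"s", "strike", "del"},
    "headings": {"h1", "h2", "h3", "h4", "h5", "h6"},
    "blockquotes": {"blockquote"},
    "paragraphs": {"p", "br"},
}

# Elements whose text content is dropped entirely, not just their tags.
SKIPPED_CONTENT = {"style", "script"}
# Kept tags that never take a closing tag.
VOID_TAGS = {"br"}


class FormattingFilter(HTMLParser):
    """Re-emits only allowed tags (without attributes) plus escaped text."""

    def __init__(self, keep):
        super().__init__(convert_charrefs=True)
        self.allowed = set()
        for name in keep:
            self.allowed |= FORMATTING_TAGS[name]
        self.out = []
        self.skip_depth = 0  # > 0 while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_CONTENT:
            self.skip_depth += 1
        elif tag in self.allowed and not self.skip_depth:
            self.out.append(f"<{tag}>")  # attributes are intentionally dropped

    def handle_endtag(self, tag):
        if tag in SKIPPED_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.allowed and tag not in VOID_TAGS and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Entities were decoded by the parser, so re-escape the text.
            self.out.append(escape(data, quote=False))


def clean_html(html, keep):
    """Return html with all markup removed except the chosen formatting."""
    parser = FormattingFilter(keep)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    messy = ('<p class="MsoNormal" style="margin:0"><span style="color:red">'
             'Hello</span> <b>bold</b> and <i>italic</i> &amp; more</p>')
    print(clean_html(messy, ["bold", "paragraphs"]))
    # prints: <p>Hello <b>bold</b> and italic &amp; more</p>
```

An allowlist is used here rather than a blocklist because the junk that different editors emit is open-ended, while the set of formatting worth keeping is small and known in advance. The sketch does not repair badly nested tags or map styled spans back to bold or italic, so it falls well short of the tool described above.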

 

8 comments

A bit beside the question, but personally I would like to scrap the WYSIWYG editor in LW completely and be able to create articles by typing the Markdown markup directly in LW (or, more accurately, copy-pasting it from my Emacs ;) ). Would that be doable?

Edit: I mean as a per-user option (in preferences, for example), leaving the WYSIWYG editor for those who like that kind of thing.

Seconded. Markdown is so much more convenient.

For example, pasting from Scrivener into Word2CleanHTML, WordOff, or TinyMCE removes italics in all three.

Edit: Exporting from Scrivener to .rtf, then pasting into Word2CleanHTML preserved italics.

But exporting from Scrivener to .rtf and then pasting into Word2CleanHTML removes text coloring and some other stuff, right?

I see this is a pretty old post and some links are broken. WordOff has been discontinued and WordHTML.com has taken its place. I think HTML-Cleaner.com, HTML-Online.com or the HTML G Editor are the best options nowadays.

A quick Google search didn't turn up anything free and useful (probably for a good reason), but I'll try my hand at building one. I'm working on building my web development skills anyway, and that looks like a good puzzle.

Edit: There is a reason there is no free one. Have a look here to see why. It's doable, but would be a considerable amount of work to actually do. I suspected as much, but trying was low-cost enough that it was worth a shot.

I hate WYSIWYG editors (also because WYG may differ from what readers using different systems will get). I seriously can't even remember the last time I used an MS Word-like word processor for something non-trivial. I use LaTeX, code HTML by hand (which, with gedit's plugins, is much less of a PITA than it sounds), or stick to plaintext/Markdown, depending on what I need.

html2text "$URL" | pandoc -t markdown

where html2text and pandoc can be found on GitHub.