Automatic Correction of Mangled Hyperlinks and Other Document Corruptions, with Optional Prompting

IP.com Disclosure Number: IPCOM000016023D
Original Publication Date: 2002-Sep-15
Included in the Prior Art Database: 2003-Jun-21
Document File: 5 page(s) / 86K

Publishing Venue

IBM

Abstract

This invention pertains to automated correction of the most common forms of corruption in markup language documents, to improve the user’s online experience. More particularly, it provides automated correction, with optional user prompting, of mangled URLs, block text with insertion markers, formatted columnar plain text, and so forth, in HTML that got corrupted in the conversion from rich text to plain text and/or from plain text to rich text.

This is the abbreviated version, containing approximately 24% of the total text.

A URL passed from one user to another in the data payload portion of an email, a quoted reply to an email, a newsgroup posting, a web forum, and so forth can become mangled in a variety of ways. A mangled URL is unusable without an exacting series of manual operations too tedious for most users.

The problem is first illustrated with respect to email.

Many email users use plain-text email programs (such as Netscape) that remove or alter formatting when text is cut and pasted from a rich-text source, such as a web page. Such programs frequently insert line breaks into the text in an attempt to regulate line length. The algorithm employed evidently scans for "word" breaks such as a space or punctuation mark to select a line-break insertion point that will not disrupt the text too badly. Stuffing a line break into a URL, however, breaks up the URL. (Similarly, stuffing a line break into columnar text such as an unformatted table makes the table unreadable; there are myriad other examples.)
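The line-break behavior described above can be illustrated with a minimal Python sketch. It is not the algorithm of any particular email program; the function name naive_wrap, the 40-character width, the set of break characters, and the example.com URL are all assumptions made for illustration.

```python
import re

# Hypothetical set of "word break" characters: a space or common punctuation.
BREAK_CHARS = re.compile(r"[ /?&=.,;:-]")

def naive_wrap(text: str, width: int = 40) -> str:
    """Wrap long lines by breaking just after the last space or
    punctuation mark found within the first `width` characters."""
    out = []
    for line in text.split("\n"):
        while len(line) > width:
            window = line[:width]
            matches = list(BREAK_CHARS.finditer(window))
            # Fall back to a hard cut when no break character is found.
            cut = matches[-1].end() if matches else width
            out.append(line[:cut].rstrip())
            line = line[cut:]
        out.append(line)
    return "\n".join(out)

original = "See http://www.example.com/catalog/items/view?id=12345&lang=en for details"
wrapped = naive_wrap(original)
print(wrapped)
```

Because the slash inside the URL counts as a break point, the wrapper splits the URL across two lines even though the surrounding sentence remains readable.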

The reformatted plain-text email may be forwarded to an email client that supports rich text (such as Outlook Express). Outlook Express attempts to convert a string beginning with "http://" or "www." into a URL to make it clickable within the body of the email. For a mangled URL, it picks up only the part that precedes the first inserted line break and omits the rest, so the resulting link is incomplete.
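The truncation can be sketched in Python as follows. The regular expression is an assumed stand-in for the client's link detection, not the actual logic of Outlook Express or any other program, and the mangled text is a made-up example.

```python
import re

# Assumed link detector: a string beginning "http://" or "www."
# that runs until the next whitespace character.
LINK = re.compile(r"(?:http://|www\.)\S+")

def find_links(text: str) -> list:
    """Return every substring the assumed detector would make clickable."""
    return LINK.findall(text)

# Hypothetical text in which a line break was inserted inside the URL.
mangled = "See http://www.example.com/catalog/\nitems/view?id=12345&lang=en for details"
links = find_links(mangled)
print(links)
```

The detector stops at the inserted line break, so the clickable link ends at "catalog/" and the remainder of the URL is treated as ordinary text.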

The same is true of any client that attempts to interpret a mangled URL string. One example is a text-only client such as the news reader Xnews. Xnews attempts to make things that look like URLs clickable and send the URL to one's favorite browser. The problem also occurs in web-based forums or web mail where users typically share information by typing or pasting a block of text into a web form, or by emailing a post from their rich-text or plain-text email client to a list server such as listproc or mailman. The web server application program that converts the block of text into HTML and appends it to a forum for browsing typically inserts line breaks at awkward places, often breaking up URLs or chopping up tables.

URLs can also become mangled and unparsable when a user pastes a URL into a document and places a punctuation mark directly before or after the URL.
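As a hedged illustration of this case, the Python sketch below trims punctuation that abuts a URL. The function name trim_url and the two character sets are assumptions chosen for the example; they are not taken from the disclosure.

```python
# Hypothetical sets of punctuation that commonly abut a pasted URL.
LEADING = "([{<\"'"
TRAILING = ".,;:!?)]}>\"'"

def trim_url(token: str) -> str:
    """Strip assumed leading and trailing punctuation from a URL-like token."""
    return token.lstrip(LEADING).rstrip(TRAILING)

cleaned = trim_url("(http://www.example.com/page.html).")
print(cleaned)
```

A simple trim like this would also remove a closing parenthesis or period that legitimately ends a URL, which is one reason a correction of this kind might prompt the user before it is applied.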

URLs can also be mangled during email composition. For example, an email client's "Reply" function may copy the message to which one is replying into the body of the new email, set off with a special str...