32. The World Wide Web

Next to email, the most useful service on the Internet is the World Wide Web (often written "WWW" or "Web"). It is a giant network of hypertext documents and services, and it keeps growing by the instant--anyone with an Internet-connected computer can read anything on the Web, and anyone can publish to the Web. It could well be the world's largest public repository of information.

This chapter describes tools for accessing and using the Web. It also describes tools for writing text files in HTML ("HyperText Markup Language"), the native document format of the Web.

32.1 Browsing the Web    Netscape's famous Web browser.

32.2 Viewing an Image from the Web    Viewing images from the Web.

32.3 Reading Text from the Web    Reading text from the Web.

32.4 Browsing the Web in Emacs

32.5 Getting Files from the Web    Getting files from the Web.

32.6 Writing HTML

32.7 More Web Browsers and Tools    More Web tools to try.

32.1 Browsing the Web

Debian: `mozilla'
Debian: `skipstone'
WWW: http://www.mozilla.org/
WWW: http://galeon.sourceforge.net/
WWW: http://www.muhri.net/skipstone/

When most people think of browsing or surfing the Web, they think of doing it graphically--and the mental image they conjure is usually that of the famous Netscape Web browser. Most Web sites today make heavy use of graphic images; furthermore, commercial Web sites are usually optimized for Netscape-compatible browsers, and many of them are not even accessible with alternative browsers. That means you'll want to use this application for browsing this kind of Web site.

The version of Netscape's browser that was released as free, open source software (see section What's Open Source?) in 1998, to much fanfare, is called Mozilla.(40) When first released, the Mozilla application was a "developers only" release, but as of this writing it is finally reaching a state where it is ready for general use.

Once the Mozilla browser has been installed, run it in X either by typing mozilla in a shell or by selecting it from a menu in the usual fashion, as dictated by your window manager.

Like most graphical Web browsers, its use is fairly self-explanatory; type a URL in the Location dialog box to open that URL, and left-click on a link to follow it, replacing the contents of the browser's main window with the contents of that link. One nice feature for Emacs fans is that you can use Emacs-style keystrokes for cursor movement in Mozilla's dialog boxes (see section Basic Emacs Editing Keys).

[Figure: a typical Mozilla window, with the URL http://slashdot.org/ loaded.]

A criticism of the earlier Netscape Navigator programs is that the browser was a bloated application: it contained its own email client, its own Usenet newsreader, and other functions that are not necessary when one simply wants to browse the Web. Since Mozilla is free software, anyone can take out these excess parts to make a slimmer, faster, smaller application--and that is what some have done. Two of these projects, Galeon and Skipstone, show some promise; see the above URLs for their home pages.(41)

The following recipes will help you get the most out of using a graphical Web browser in Linux.

NOTE: Mozilla development is moving very rapidly these days, and while Mozilla is continually improving at a fantastic rate, some of these recipes may not work as described with the version you have.

Another way to browse the Web is to use Emacs (see section Browsing the Web in Emacs); more alternative browsers are listed in More Web Browsers and Tools.

32.1.1 Maintaining a List of Visited Web Sites    Keeping a history of your browsing.

32.1.2 Opening a URL from a Script    Running Mozilla from a script.

32.1.3 Mozilla Browsing Tips    Tips for using Mozilla.

32.1.1 Maintaining a List of Visited Web Sites

Debian: `browser-history'
WWW: http://www.inria.fr/koala/colas/browser-history/

Use the browser-history tool to maintain a history log of all the Web sites you visit.

You start it in the background, and each time you visit a URL in a Web browser (as of this writing, it works with the Netscape, Arena, and Amaya browsers), it writes the name and URL to its current history log, which you can view at any time.

To start browser-history every time you start X, put the following line in your `.xsession' file:
browser-history &

The browser history logs are kept in a hidden directory called `.browser-history' in your home directory. The current history log is always called `history-log.html'; it's an HTML file that you can view in a Web browser.

To view the current history log with lynx, type:

$ lynx ~/.browser-history/history-log.html RET

Past history logs have the year, month, and week appended to their name, and they are compressed (see section Compressed Files). After uncompressing them, you can view them just as you would view the current log (if you are viewing them in Mozilla, you don't even need to uncompress them--it handles this automagically).

You can also use zgrep to search through your old browser history logs. The logs keep the URL and title of each site you visit, so you can search for either--then when someone asks, "Remember that good article about such-and-such?" you can do a zgrep on the files in your `~/.browser-history' directory to find it.

To find any URLs from the list of those you visited in the year 2000 whose titles contain the word `Confessions', type:
$ zgrep Confessions ~/.browser-history/history-log-2000* RET

This command searches all your logs from the year 2000 for the text `Confessions' and outputs every line that contains it.

NOTE: For more about zgrep, see Matching Lines in Compressed Files.
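If you search your logs often, the recipe above can be wrapped in a small shell function so that you give only the year and the text to find. The following is a minimal sketch and not part of browser-history itself: the function name `histgrep' and the `HISTORY_DIR' variable are hypothetical, and it assumes that your log file names follow the `history-log-YEAR' pattern used in the example above.

```shell
# Hypothetical helper: search one year's browser-history logs for some text.
# HISTORY_DIR is an assumed variable; it defaults to the directory named above.
HISTORY_DIR="${HISTORY_DIR:-$HOME/.browser-history}"

histgrep() {
    # $1 is the year to search; $2 is the text to look for.
    year="$1"
    text="$2"
    # -i ignores case, so `confessions' also finds `Confessions'.
    zgrep -i "$text" "$HISTORY_DIR"/history-log-"$year"*
}
```

With this function defined in your shell, typing `histgrep 2000 confessions' should output the same kind of lines as the zgrep recipe above.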

32.1.2 Opening a URL from a Script

To open a Web page in Mozilla from a shell script, use the `-remote' option followed by the text `'openURL(URL)'', where URL is the URL to open.

To open the URL http://www.drudgereport.com/ in Mozilla from a shell script, use the following line:
mozilla -remote 'openURL(http://www.drudgereport.com/)'
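A script that opens several URLs can build the `-remote' argument for each one in a loop. This is only a sketch: the function names `openurl_arg' and `open_urls' and the `MOZILLA' variable are made up for the illustration, and it assumes that a Mozilla window is already running to receive the request.

```shell
#!/bin/sh
# Hypothetical helper: build the text that the `-remote' option expects.
openurl_arg() {
    printf 'openURL(%s)\n' "$1"
}

# Hypothetical helper: ask Mozilla to open each URL given as an argument.
open_urls() {
    for url in "$@"; do
        # MOZILLA is an assumed variable, so that another command name
        # can be substituted; it defaults to `mozilla'.
        "${MOZILLA:-mozilla}" -remote "$(openurl_arg "$url")"
    done
}
```

A script that includes these functions could then end with a line such as `open_urls http://www.drudgereport.com/ http://slashdot.org/'.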

32.1.3 Mozilla Browsing Tips

The following tips make Web browsing with Mozilla easier and more efficient.

Many users disable Java and JavaScript altogether; most Web sites don't require their use, and they often introduce security problems or have other pernicious effects on your browsing. Just say no.
Disabling the automatic loading of images can help if you are on a slow connection; the broken-image icons take some getting used to, but you'll be surprised at how much more quickly pages will load! If you need to see a page's images, just left-click on the Load Images button. You can also right-click on the broken-image icon of the image you want to load and select Open this Image.
Right-click on an image to save it to a file; you will be given a choice to either open the image in the browser window or save it to a file.
To open a link in another browser window, middle-click on the link. Opening multiple links in their own windows saves time when you are doing a lot of "power browsing."
If a site forces links to open in a new window, and you don't want to do that, right-click on the link you want to open, and choose Open this Link; the link will open in the current browser window.
To go back to the last URL you visited, type ALT-←, and to go forward to the next URL in your history, type ALT-→. (These keys may not have the desired effect in some window managers; if they don't work for you, try using the CTRL key instead of the ALT key.)
If your visited-URL history on the Go menu is very large, and earlier URLs are truncated, you can still visit them by doing this: left-click one of the lowest entries on the menu, and visit that; then, left-click on the Home button. This eliminates all the URLs in the history list that are more recent than the page you just visited, but all of the old pages will be back in the list.
To open your bookmarks file in a new window, type ALT-b.
To open a new Mozilla window, type ALT-n (it's often useful to have several windows open at once).

32.2 Viewing an Image from the Web

Debian: `imagemagick'
WWW: ftp://ftp.wizards.dupont.com/pub/ImageMagick/

If you just want to view an image file from the Web, you don't have to use a Web browser at all--instead, you can use display, giving the URL you want to view as an argument. This is especially nice for viewing your favorite webcam image, or for viewing images on ftp sites--you don't have to log in or type any other commands at all.

To view the image at ftp://garbo.uwasa.fi/garbo-gifs/garbo01.gif, type:
$ display ftp://garbo.uwasa.fi/garbo-gifs/garbo01.gif RET

NOTE: When viewing the image, you can use all of the image manipulation commands that display supports, including resizing and changing the magnification of the image. For more information about display, see Viewing an Image in X.

32.3 Reading Text from the Web

Debian: `lynx'
WWW: http://lynx.browser.org/

As of this writing, the venerable lynx is still the standard Web browser for use on Debian systems; it was also one of the first Web browsers available for general use.(42) It can't display graphics at all, but it's a good interface for reading hypertext.

Type lynx to start it--if a "start page" is defined, it will load. The start page is defined in `/etc/lynx.cfg', and can be a URL pointing to a file on the local system or to an address on the Web; you need superuser privileges to edit this file. On Debian systems, the start page comes defined as the Debian home page, http://www.debian.org/ (but you can change this, of course; many experienced users write their own start page, containing links to frequently-visited URLs, and save it as a local file in their home directory tree).

To open a URL, give the URL as an argument.

To view the URL http://lycaeum.org/, type:
$ lynx http://lycaeum.org/ RET

When in lynx, the following keyboard commands work:

COMMAND	DESCRIPTION
`↑` and `↓`	Move backward and forward through links in the current document.
`→` or `RET`	Follow the hyperlink currently selected by the cursor.
`←`	Go back to the previously displayed URL.
`DEL`	View a history of all URLs visited during this session.
`PgDn` or `SPC`	Scroll down to the next page in the current document.
`PgUp`	Scroll up to the previous page in the current document.
`=`	Display information about the current document (as with all pages in `lynx`, type `←` to go back to the previous document).
`g`	Go to a URL; `lynx` will prompt you for the URL to go to. Type `↑` to insert on this line the last URL that was visited; once inserted, you can edit it.
`h`	Display the `lynx` help files.
`q`	Quit browsing and exit the program; `lynx` will ask to verify this action.

The following are some recipes for using lynx.

NOTE: Emacs users might want to use the `-emacskeys' option when starting lynx; it enables you to use Emacs-style keystrokes for cursor movement (see section Basic Emacs Editing Keys).

32.3.1 Perusing Text from the Web    Perusing text from the Web.

32.3.2 Viewing a Site That Requires Authorization    Browsing sites which require logging in.

32.3.3 Options Available while Browsing Text    Lynx startup options.

32.3.1 Perusing Text from the Web

To peruse just the text of an article that's on the Web, output the text of the URL using lynx with the `-dump' option. This dumps the text of the given URL to the standard output, and you can pipe this to less for perusal, or use redirection to save it to a file.

To peruse the text of http://www.sc.edu/fitzgerald/winterd/winter.html, type (all on one line):

$ lynx -dump
http://www.sc.edu/fitzgerald/winterd/winter.html | less RET

It's an old net convention for italicized words to be displayed in an etext inside underscores like `_this_'; use the `-underscore' option to output any italicized text in this manner.

By default, lynx annotates all the hyperlinks and produces a list of footnoted links at the bottom of the screen. If you don't want them, add the `-nolist' option and just the "pure text" will be returned.

To output the pure text, with underscores, of the previous URL, and save it to the file `winter_dreams', type (all on one line):

$ lynx -dump -nolist -underscore
http://www.sc.edu/fitzgerald/winterd/winter.html > winter_dreams RET

You can do other things with the pure text, like pipe it to enscript for setting it in a font for printing.

To print the pure text, with underscores, of the previous URL in a Times Roman font, type (all on one line):

$ lynx -dump -nolist -underscore
http://www.sc.edu/fitzgerald/winterd/winter.html | enscript -B
-f "Times-Roman10" RET

NOTE: To peruse the plain text of a URL with its HTML tags removed and no formatting done to the text, see Converting HTML to Another Format.
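To save the pure text of several articles at once, the recipe for saving to a file can be put in a loop. The following is a sketch under stated assumptions: the function name `dump_urls' and the `LYNX' variable are hypothetical, and each output file is simply named after the last part of its URL, with any `.html' extension replaced by `.txt'.

```shell
# Hypothetical helper: save the pure text of each URL to its own file.
dump_urls() {
    for url in "$@"; do
        # Take the last part of the URL and drop a trailing `.html', so
        # that `.../winterd/winter.html' is saved as `winter.txt'.
        name=$(basename "$url" .html)
        # LYNX is an assumed variable naming the command to run.
        "${LYNX:-lynx}" -dump -nolist -underscore "$url" > "$name.txt"
    done
}
```

For example, `dump_urls http://www.sc.edu/fitzgerald/winterd/winter.html' would write its output to a file called `winter.txt' in the current directory. Two URLs that end in the same name would overwrite each other's output, so this naming scheme suits only small batches.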

32.3.2 Viewing a Site That Requires Authorization

To view a site or Web page that requires registration, use lynx with the `-auth' option, giving as arguments the username and password to use for authorization, separating them by a colon (`:') character.

To view the URL http://www.nytimes.com/archive/ with a username and password of `cypherpunks', type (all on one line):
$ lynx -auth=cypherpunks:cypherpunks http://www.nytimes.com/archive/ RET

It's common to combine this with the options for saving to a file, so that you can retrieve an annotated text copy of a file from a site that normally requires registration.

To save the URL http://www.nytimes.com/archive/ as an annotated text file, `mynews', type (all on one line):

$ lynx -dump -number_links -auth=cypherpunks:cypherpunks
http://www.nytimes.com/archive/ > mynews RET

NOTE: The username and password argument you give on the command line will be recorded in your shell history log (see section Command History), and it will be visible to other users on the system should they look to see what processes you're running (see section Listing All of a User's Processes).

32.3.3 Options Available while Browsing Text

The following table describes some of the command-line options lynx takes.

OPTION	DESCRIPTION
`-anonymous`	Use the "anonymous ftp" account when retrieving ftp URLs.
`-auth=user:pass`	Use a username of `user` and password of `pass` for protected documents.
`-cache=integer`	Keep `integer` documents in memory.
`-case`	Make searches case-sensitive.
`-dump`	Dump the text contents of the URL to the standard output, and then exit.
`-emacskeys`	Enable Emacs-style key bindings for movement.
`-force_html`	Force rendering of HTML when the URL does not have a `.html' file name extension.
`-help`	Output a help message showing all available options, and then exit.
`-localhost`	Disable URLs that point to remote hosts--useful for using `lynx` to read HTML- or text-format documentation in `/usr/doc' and other local documents while not connected to the Internet.
`-nolist`	Disable the annotated link list in dumps.
`-number_links`	Number links both in dumps and normal browse mode.
`-partial`	Display partial pages while downloading.
`-pauth=user:pass`	Use a username of `user` and password of `pass` for protected proxy servers.
`-underscore`	Output italicized text like _this_ in dumps.
`-use_mouse`	Use mouse in an `xterm`.
`-version`	Output `lynx` version and exit.
`-vikeys`	Enable `vi`-style key bindings for movement.
`-width=integer`	Format dumps to a width of `integer` columns (default 80).

32.4 Browsing the Web in Emacs

Debian: `w3-el-e20'
WWW: ftp://ftp.cs.indiana.edu/pub/elisp/w3/

Bill Perry's Emacs/W3, as its name implies, is a Web browser for Emacs (giving you, as Bill says, one less reason to leave the editor). Its features are many--just about the only things it lacks that you may miss are SSL support (although this is coming) and JavaScript and Java support (well, you may not miss them, but their absence makes sites that require them a bit hard to use). It can handle frames, tables, stylesheets, and many other HTML features.

To start W3 in Emacs, type:
M-x w3 RET

To open a URL in a new buffer, type C-o and, in the minibuffer, give the URL to open (leaving this blank visits the Emacs/W3 home page). Middle-click a link to follow it, opening the URL in a new buffer.

To open the URL http://gnuscape.org/, type:
C-o http://gnuscape.org/ RET
To open the URL of the Emacs/W3 home page, type:
C-o RET

The preceding example opens the Emacs/W3 home page in a buffer of its own.

The following table describes some of the various special W3 commands.

COMMAND	DESCRIPTION
`RET`	Follow the link at point.
`SPC`	Scroll down in the current buffer.
`BKSP`	Scroll up in the current buffer.
`M-TAB`	Insert the URL of the current document into another buffer.
`M-s`	Save a document to the local disk (you can choose HTML Source, Formatted Text, LaTeX Source, or Binary).
`C-o`	Open a URL.
`B`	Move backward in the history stack of visited URLs.
`F`	Move forward in the history stack of visited URLs.
`i`	View information about the document in current buffer (opens in new buffer called `Document Information').
`I`	View information about the link at point in current buffer (opens in new buffer called `Document Information').
`k`	Put the URL of the document in the current buffer in the kill ring, and make it the X selection (useful for copying and pasting the URL into another buffer or to another application; see section Selecting Text).
`K`	Put the URL of the link at point in the kill ring and make it the X selection (useful for copying and pasting the URL into another buffer or to another application; see section Selecting Text).
`l`	Move to the last visited buffer.
`o`	Open a local file.
`q`	Quit W3 mode, kill the current buffer, and go to the last visited buffer.
`r`	Reload the current document.
`s`	View HTML source of the document in the current buffer (opens in new buffer with the URL as its name).
`S`	View HTML source of the link at point in the current buffer (opens in new buffer with the URL as its name).
`v`	Show the URL of the current document (URL is shown in the minibuffer).
`V`	Show URL of the link under point in the current buffer (URL is shown in the minibuffer).

NOTE: If you get serious about using Emacs/W3, you'll almost certainly want to run the XEmacs flavor of Emacs--as of this writing, GNU Emacs cannot display images.

32.5 Getting Files from the Web

Debian: `wget'
WWW: http://www.wget.org/

Use wget, "Web get," to download files from the World Wide Web. It can retrieve files from URLs that begin with either `http' or `ftp'. It keeps the file's original timestamp, it's smaller and faster to use than a browser, and it shows a visual display of the download progress.

The following subsections contain recipes for using wget to retrieve information from the Web. See Info file `wget.info', node `Examples', for more examples of things you can do with wget.

NOTE: To retrieve an HTML file from the Web and save it as formatted text, use lynx instead--see Perusing Text from the Web.

32.5.1 Saving a URL to a File    Getting a URL.

32.5.2 Archiving an Entire Web Site    Archiving a site.

32.5.3 Archiving Part of a Web Site    Archiving part of a site.

32.5.4 Reading the Headers of a Web Page    Looking at Web server headers.

32.5.1 Saving a URL to a File

To download a single file from the Web, give the URL of the file as an argument to wget.

For example, to download ftp://ftp.neuron.net/pub/spiral/septembr.mp3 to a file, type:
$ wget ftp://ftp.neuron.net/pub/spiral/septembr.mp3 RET

This command reads a given URL, writing its contents to a file with the same name as the original, `septembr.mp3', in the current working directory.

If you interrupt a download before it's finished, the file you were retrieving will contain only the portion that wget retrieved before it was interrupted. Use wget with the `-c' option to resume the download from the point where it left off.

To resume download of the URL from the previous example, type:

$ wget -c ftp://ftp.neuron.net/pub/spiral/septembr.mp3 RET

NOTE: In order for the `-c' option to have the desired effect, you should run wget from the same directory as it was run previously, where that partially-retrieved file should still exist.
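When a connection is so unreliable that wget itself gives up and exits, a script can keep rerunning it with `-c' until the download completes. This is a minimal sketch: the function name `fetch_resume', the `WGET' and `RETRY_DELAY' variables, and the default of five attempts are all assumptions chosen for the illustration.

```shell
# Hypothetical helper: rerun `wget -c' until the download succeeds.
fetch_resume() {
    url="$1"
    tries="${2:-5}"      # assumed default: give up after five attempts
    n=1
    while [ "$n" -le "$tries" ]; do
        # -c continues a partial file left behind by an earlier attempt.
        if "${WGET:-wget}" -c "$url"; then
            return 0
        fi
        n=$((n + 1))
        # Pause before the next attempt (assumed default: ten seconds).
        sleep "${RETRY_DELAY:-10}"
    done
    return 1
}
```

Run from the directory that holds the partial file, a line such as `fetch_resume ftp://ftp.neuron.net/pub/spiral/septembr.mp3' would make up to five attempts and return a non-zero exit status if none of them succeeds.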

32.5.2 Archiving an Entire Web Site

To archive a single Web site, use the `-m' ("mirror") option, which saves files with the exact timestamp of the original, if possible, and sets the "recursive retrieval" option to download everything. To specify the number of retries to use when an error occurs in retrieval, use the `-t' option with a numeric argument; `-t3' is usually good for safely retrieving across the net, while `-t0' specifies an infinite number of retries, which is good for when a network connection is really bad but you really want to archive something, regardless of how long it takes. Finally, use the `-o' option with a file name as an argument to write a progress log to that file; examining it can be useful in the event that something goes wrong during the archiving. Once the archival process is complete and you've determined that it was successful, you can delete the log file.

To mirror the Web site at http://www.bloofga.org/, giving up to three retries for retrieving files and putting error messages in a log file called `mirror.log', type:
$ wget -m -t3 http://www.bloofga.org/ -o mirror.log RET

This command makes an archive of the Web site at `www.bloofga.org' in a subdirectory called `www.bloofga.org' in the current directory. Log messages are written to a file in the current directory called `mirror.log'.
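If you mirror several sites, it helps to give each one its own log file. The sketch below derives the host name from the URL and uses it to name the log; the function names `url_host' and `mirror_site' and the `WGET' variable are hypothetical, and the `.log' naming scheme is only one possible choice.

```shell
# Hypothetical helper: print the host name part of a URL.
url_host() {
    host="${1#*://}"      # drop the scheme, such as "http://"
    host="${host%%/*}"    # drop everything after the host name
    printf '%s\n' "$host"
}

# Hypothetical helper: mirror a site, logging to a file named after the host.
mirror_site() {
    url="$1"
    # WGET is an assumed variable naming the command to run.
    "${WGET:-wget}" -m -t3 "$url" -o "$(url_host "$url").log"
}
```

With these functions defined, `mirror_site http://www.bloofga.org/' would run the same mirror as the recipe above, but write its log messages to `www.bloofga.org.log'.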

To continue an archive that you've left off, use the `-nc' ("no clobber") option; it doesn't retrieve files that have already been downloaded. For this option to work the way you want it to, be sure that you are in the same directory that you were in when you originally began archiving the site.

To continue an interrupted mirror of the Web site at http://www.bloofga.org/ and make sure that existing files are not downloaded, giving up to three retries for retrieval of files and putting error messages in a log file called `mirror.log', type:
$ wget -nc -m -t3 http://www.bloofga.org/ -o mirror.log RET

32.5.3 Archiving Part of a Web Site

To archive only part of a Web site--such as, say, a user's home page--use the `-I' option followed by a list of the absolute path names of the directories to archive; all other directories on the site are ignored.

To archive the Web site at http://dougal.bris.ac.uk/~mbt/, only archiving the `/~mbt' directory, and writing log messages to a file called `uk.log', type:
$ wget -m -t3 -I /~mbt http://dougal.bris.ac.uk/~mbt/ -o uk.log RET

This command archives all files on the http://dougal.bris.ac.uk/~mbt/ Web site whose directory names begin with `/~mbt'.

To get only the files in a given directory, use the `-r' and `-l1' options (the `-l' option specifies the number of levels to descend from the given level), and combine them with the `--no-parent' option, which specifies not to ascend to the parent directory.

Use the `-A' option to specify the exact file name extensions to accept--for example, use `-A txt,text,tex' to only download files whose names end with a `.txt', `.text', or `.tex' extension. The `-R' option works similarly, but specifies the file extensions to reject and not download.

To download only the files ending in a `.gz' extension and only in the given directory `/~rjh/indiepop-l/download/' at `monash.edu.au', type:
$ wget -m -r -l1 --no-parent -A.gz http://monash.edu.au/~rjh/indiepop-l/download/ RET
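The comma-separated list that `-A' takes can be built from separate arguments in a script. This sketch assumes nothing beyond the options described above; the function names `accept_list' and `get_by_ext' and the `WGET' variable are hypothetical.

```shell
# Hypothetical helper: join the arguments with commas, so that
# "txt text tex" becomes "txt,text,tex".
accept_list() {
    # In a subshell, so the change to IFS does not leak out.
    ( IFS=,; printf '%s\n' "$*" )
}

# Hypothetical helper: get files with the given extensions from one directory.
get_by_ext() {
    url="$1"
    shift                 # the remaining arguments are the extensions
    "${WGET:-wget}" -r -l1 --no-parent -A "$(accept_list "$@")" "$url"
}
```

For example, `get_by_ext http://monash.edu.au/~rjh/indiepop-l/download/ gz tgz' would pass `-A gz,tgz' to wget.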

32.5.4 Reading the Headers of a Web Page

All Web servers output special headers at the beginning of page requests, but you normally don't see them when you retrieve a URL with a Web browser. These headers contain information such as the current system date of the Web server host and the name and version of the Web server and operating system software.

Use the `-S' option with wget to output these headers when retrieving files; along with wget's other messages, the headers are written to standard error (and so normally appear on your terminal), or to the log file, if one is used.

To retrieve the file at http://slashdot.org/ and output the headers, type:
$ wget -S http://slashdot.org/ RET

This command writes the server response headers to standard error and saves the contents of http://slashdot.org/ to a file in the current directory; because this URL ends in a slash and names no file, wget calls the saved file `index.html'.
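When the headers have been written to a log file with the `-o' option, a short function can pick out the value of a single header. This is only a sketch: the function name `header_value' is hypothetical, and it assumes that each header appears in the log on its own line in the form `Name: value', possibly indented.

```shell
# Hypothetical helper: print the value of one header from a saved log.
# $1 is the header name (case is ignored); $2 is the log file.
header_value() {
    awk -v name="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)       # strip leading blanks
            i = index(line, ":")           # the name ends at the first colon
            if (i > 0 && tolower(substr(line, 1, i - 1)) == tolower(name)) {
                value = substr(line, i + 1)
                sub(/^[ \t]+/, "", value)  # strip blanks before the value
                print value
            }
        }' "$2"
}
```

For example, after running `wget -S http://slashdot.org/ -o headers.log', typing `header_value Server headers.log' would output the name and version that the Web server reports for itself.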

32.6 Writing HTML

Debian: `bluefish'
WWW: http://bluefish.openoffice.nl/

Hypertext Markup Language (HTML) is the markup language of the Web; HTML files are just plain text files written in this markup language. You can write HTML files in any text editor; then, open the file in a Web browser to see the HTML markup rendered in its resulting hypertext appearance.
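Because an HTML file is only plain text, you can even create one from the shell. The following sketch writes a very small page to a file; the function name `make_page', the page title, and the choice of links are placeholders made up for this illustration, and a real page would normally be written in an editor.

```shell
# Hypothetical helper: write a minimal HTML page to the file named by $1.
make_page() {
    cat > "$1" <<'EOF'
<html>
<head>
<title>My Start Page</title>
</head>
<body>
<h1>My Start Page</h1>
<p>Some frequently visited sites:</p>
<ul>
<li><a href="http://www.debian.org/">Debian</a></li>
<li><a href="http://slashdot.org/">Slashdot</a></li>
</ul>
</body>
</html>
EOF
}
```

After `make_page start.html', opening `start.html' in any Web browser shows the heading and the list of links rendered as hypertext.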

Many people swear by Bluefish, a full-featured, user-friendly HTML editor for X.

Emacs (see section Emacs) has a major mode to facilitate the editing of HTML files; to start this mode in a buffer, type:

M-x html-mode RET

The features of HTML mode include the insertion of "skeleton" constructs.

The help text for the HTML mode function includes a very short HTML authoring tutorial--view the documentation on this function to display the tutorial.

To read a short HTML tutorial in Emacs, type:
C-h f html-mode RET

NOTE: When you're editing an HTML file in an Emacs buffer, you can open the same file in a Web browser in another window--Web browsers only read and don't write the HTML files they open, so you can view the rendered document in the browser as you create it in Emacs. When you make and save a change in the Emacs buffer, reload the file in the browser to see your changes take effect immediately.

32.6.1 Adding Parameters to Image Tags    Adding params to image tags.

32.6.2 Converting HTML to Another Format    Converting HTML to text.

32.6.3 Validating an HTML File    Validating HTML.

32.6.1 Adding Parameters to Image Tags

Debian: `imgsizer'
WWW: http://www.tuxedo.org/~esr/software.html#imgsizer

For usability, HTML image source tags should have `HEIGHT' and `WIDTH' parameters, which specify the dimensions of the image the tag describes. When these parameters are specified in all the image tags on a page, the browser can display the text of that page before the images are loaded. Without them, the browser must load all images before any of the text on the page is displayed.

Use imgsizer to automatically determine the proper values and insert them into an HTML file. Give the name of the HTML file to fix as an argument.

To add `HEIGHT' and `WIDTH' parameters to the file `index.html', type:
$ imgsizer index.html RET

32.6.2 Converting HTML to Another Format

Debian: `unhtml'
Debian: `html2ps'
WWW: http://dragon.acadiau.ca/~013639s/
WWW: http://www.tdb.uu.se/~jan/html2ps.html

There are several ways to convert HTML files to other formats. You can convert the HTML to plain text for reading, processing, or conversion to still other formats; you can also convert the HTML to PostScript, which you can view, print, or also convert to other formats, such as PDF.

To simply remove the HTML formatting from text, use unhtml. It reads from the standard input (or a specified file name), and it writes its output to standard output.

To peruse the file `index.html' with its HTML tags removed, type:
$ unhtml index.html | less RET
To remove the HTML tags from the file `index.html' and write the output to a file called `index.txt', type:
$ unhtml index.html > index.txt RET

When you remove the HTML tags from a file with unhtml, no further formatting is done to the text. Furthermore, it only works on files, and not on URLs themselves.

Use lynx to save an HTML file or a URL as a formatted text file, so that the resultant text looks like the original HTML when viewed in lynx. It can also preserve italics and hyperlink information in the original HTML. See section Perusing Text from the Web.

One thing you can do with this lynx output is pipe it to tools for spacing text, and then send that to enscript for setting in a font. This is useful for printing a Web page in typescript "manuscript" form, with images and graphics removed and text set double-spaced in a Courier font.

To print a copy of the URL http://example.com/essay/ in typescript manuscript form, type (all on one line):

$ lynx -dump -underscore -nolist http://example.com/essay/ | pr -d
| enscript -B RET

NOTE: In some cases, you might want to edit the file before you print it, such as when a Web page contains text navigation bars or other text that you'd want to remove before you turn it into a manuscript. In such a case, you'd pipe the lynx output to a file, edit the file, and then use pr on the file and pipe that output to enscript for printing.

Finally, you can use html2ps to convert an HTML file to PostScript; this is useful when you want to print a Web page with all its graphics and images, or when you want to convert all or part of a Web site into PDF. Give the URLs or file names of the HTML files to convert as arguments. Use the `-u' option to underline the anchor text of hypertext links, and specify a file name to write to as an argument to the `-o' option. The defaults are to not underline links, and to write to the standard output.

To print a PostScript copy of the document at the URL http://example.com/essay/ to the default printer, type:
$ html2ps http://example.com/essay/ | lpr RET
To write a copy of the document at the URL http://example.com/essay/ to a PostScript file `submission.ps' with all hypertext links underlined, type:
$ html2ps -u -o submission.ps http://example.com/essay/ RET

32.6.3 Validating an HTML File

Debian: `weblint'
WWW: http://www.weblint.org/

Use weblint to validate the basic structure and syntax of an HTML file. Give the name of the file to be checked as an argument, and weblint outputs any complaints it has with the file to standard output, such as whether or not IMG elements are missing ALT descriptions, or whether nested elements overlap.

To validate the HTML in the file `index.html', type:
$ weblint index.html RET
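To check a whole directory of pages, the command can be run in a loop that reports only the files with complaints. This is a sketch under stated assumptions: the function name `lint_all' and the `WEBLINT' variable are hypothetical, and it assumes that the checker outputs nothing at all for a file it has no complaints about.

```shell
# Hypothetical helper: validate each file, reporting only those with complaints.
lint_all() {
    status=0
    for file in "$@"; do
        # WEBLINT is an assumed variable naming the command to run.
        out=$("${WEBLINT:-weblint}" "$file")
        if [ -n "$out" ]; then
            # Print a heading, then the complaints for this file.
            printf '== %s ==\n%s\n' "$file" "$out"
            status=1
        fi
    done
    # The exit status is non-zero if any file drew a complaint.
    return "$status"
}
```

With this function defined, `lint_all *.html' checks every HTML file in the current directory.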

32.7 More Web Browsers and Tools

Surprisingly, there are not nearly as many Web browsers for Linux as there are text editors--or even text viewers. This remains true for any operating system, and I have often pondered why this is; perhaps "browsing the Web," a fairly recent activity in itself, may soon be made obsolete by Web readers and other tools. In any event, the following table lists other browsers that are currently available for Linux systems.

WEB BROWSER	DESCRIPTION
`amaya`	Developed by the World Wide Web Consortium; both a graphical Web browser and a WYSIWYG editor for writing HTML. Debian: `amaya' WWW: http://www.w3.org/amaya/
`arena`	Developed by the World Wide Web Consortium; a very compact, HTML 3.0-compliant Web browser for X. Debian: `arena' WWW: http://www.w3.org/arena/
`dillo`	A very fast, small graphical Web browser. Debian: `dillo' WWW: http://dillo.sourceforge.net/
`express`	A small browser that works in X with GNOME installed. Debian: `express' WWW: http://www.ca.us.vergenet.net/~conrad/express/
`links`	A relatively new text-only browser. WWW: http://artax.karlin.mff.cuni.cz/~mikulas/links/
`gzilla`	A graphical browser for X, currently in an early stage of development. Debian: `gzilla' WWW: http://www.levien.com/gzilla/
`w3m`	Another new text-only browser whose features include table support and an interesting free-form cursor control; some people swear by this one. Debian: `w3m' WWW: http://ei5nazha.yz.yamagata-u.ac.jp/
