Get External Hyperlinks From A Web Page

Share this

May 20, 2018

Get External Hyperlinks From A Webpage


Introduction


Some years ago, I published a VBA code that proved to be very popular in blog users: the code and the accompanying workbook could be used for downloading files from the internet. One common request that I get since I published that code, is how to retrieve the hyperlinks from a web page, to download the files afterward.

I decided to help all these blog users by creating a workbook that does exactly what they requested. So, the workbook that you will find in the downloads section can be used for grabbing all the hyperlinks from a given URL. The procedure is quite straightforward: just enter the URL on a specific cell and hit the Get Links button. After a few seconds (depending on your internet speed) you will have all the external hyperlinks from the particular URL. Moreover, apart from the hyperlinks, you will also get the displayed text for each hyperlink. In this way, you can match the hyperlink with its position on the web page and filter out those links that are not relevant.

NOTE: the code ignores internal hyperlinks and anchor tags. In other words, the hyperlinks that are retrieved should all start with “http”.

 


How to get external hyperlinks from a web page – demonstration video


The short video below demonstrates the usage of the spreadsheet using as an example the URL of this blog. In the particular case, 43 hyperlinks retrieved.

 


Downloads


Download

The file can be opened with Excel 2007 or newer. Please enable macros before using it. The VBA code was protected using the Unviewable+.

 


Read also


Excel & VBA: Download Internet Files Automatically
Get Image Size From A URL

Page last updated: 14/06/2019

Page last modified: February 20, 2020

Christos Samaras

Hi, I am Christos, a Mechanical Engineer by profession (Ph.D.) and a Software Developer by obsession (10+ years of experience)! I founded this site back in 2011 intending to provide solutions to various engineering and programming problems.

Christos E. Samaras

  • audeser servicios says:

    I’d run on MsOffice2k3 and MsOffice 2k7-SP3 with no luck. As you try to activate the macros, Excel warns that VBA project is corrupted and rips entirely apart, leaving the file useless.
    Can you please check it is not happening with other Office versions?

  • audeser servicios says:

    I’ll go a little further on this as I can not run the file posted here.

    Some time ago I developed (for my purposes of webscrapping) an HTML parser. I must confess it’s a daunting job well over 5.000 lines of code… but that nightmare came because I coded with no reference to any library (so it’s no iEFrame.dll dependant). I’m happy with it as it served my purpose to parse HTML, and in the end comes handy to also parse C++, JS, JSON,… whatever that has parser rules. But I must confess that dealing with it it’s horrendous (as I should inspect the target site before webscrapping nothing to get only the interesting… and leaving ads and other things outside of it… or it will go on anything it finds until the last linked item of the Net is downloaded). The worst thing is that as is recalls in some APIs to download files, Windows Defender marks its as Infected, and from time to time it hiddenly deletes the file }:|>

    I suspect (from looking on what is not cyphered inside the “code”) that that is not your case, and that you have arrived to a shorter code using these two:
    • Microsoft HTML Object Library (mshtml.tlb): This library is required to access all HTML controls which can be present on your HTML page.
    • Microsoft Internet Controls (ieframe.dll): This reference is required to do operations on Internet Explorer because to open an HTML page we need to access Internet Explorer.

    Regardless any one is interested, I think that this post could enlight a little more on this topic: https://audeser.wordpress.com/2018/10/30/vba-interact-with-html/ , as it explains how to go through a login system, it barely shows how to interact with HTML items, and output links to an Excel Sheet.

    The initial code is not mine (anyway it’s referred on the post), but has been heavily modified to get to a functional webBrowser on Excel… to bypass Gmail two steps login and lets download from there on.

    The needed code is something like this. Throw a “Microsoft Internet Controls” object inside an UserForm (or use oBrowser object as suggested in the commented code), with a textbox called txtURL, and a CommandButton. Paste this is the CommandButton1_Click event:

    On Error GoTo Err

    ‘Dim oBrowser As InternetExplorer
    Dim HTMLDoc As HTMLDocument ‘Reference to Microsoft HTML Object Library
    Dim oHTML_Element As IHTMLElement
    Dim oHTML_TagCol As IHTMLElementCollection
    Dim lgItem As Long

    ‘Set oBrowser = Me.WebBrowser1 ‘New InternetExplorer
    ‘With oBrowser
    With Me.WebBrowser1
    .Silent = False
    .Navigate Me.txtURL.Text
    .Visible = True ‘False

    ‘Call Browser_Complete

    Set HTMLDoc = .Document
    Set oHTML_TagCol = HTMLDoc.getElementsByTagName(“a”)
    ‘N# items: oHTML_TagCol.length
    For Each oHTML_Element In oHTML_TagCol
    lgItem = lgItem + 1
    ActiveSheet.Cells(lgItem, 1).Value = oHTML_Element.href
    ActiveSheet.Cells(lgItem, 2).Value = oHTML_Element.innerText
    Next

    ‘.Visible = True
    End With

    ExitProc:
    Exit Sub

    Err:
    MsgBox (“Error Occured”)

  • From the tool’s page:

    Question: In which Office versions
    actively supported by Microsoft can the file converted by the
    Unviewable+ application compatible be opened with:
    Reply: Office 2007 (32 bit), 2010, 2013, 2016 (32/64 bit)

    The workbook was developed in Office 2013 (32 bit) and is working fine on my computer. In general, Office 2007 is known to have several issues.

  • audeser servicios says:

    … I think Unviewable+ is incompatible with Office2K7. Will try later on a machine with Office2k13.

  • {"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}
    Add Content Block
    >