
Get url for pdf file from AxSHDocVw.AxWebBrowser


LN Mike

Apr 29, 2009, 5:27:02 PM
VB.Net 2003
My "Collect PDF" app collects PDF files from several web sites. Most of the
web sites provide a src tag in the web page before the PDF is displayed in my
web control (AxSHDocVw.AxWebBrowser). Once my app has the URL, it downloads
the file and stores it.

The problem: some web sites don't provide a web page with a src tag that
has the URL of the PDF file. So how do I get the URL of the PDF file if the
web site doesn't give me a web page with the src tag? Also, I run 9
"Collect PDF" apps on each PC, so reading the cache is not a good idea.

Cor Ligthert[MVP]

Apr 30, 2009, 1:14:47 AM
This is not easy. A web page is nothing other than a DOM document, which you
access through the MSHTML classes.

It needs a lot of casting (it is easier to start with Option Strict Off and
then set it On afterward to correct that).

http://msdn.microsoft.com/en-us/library/aa741317.aspx

You should not add an Imports statement for it, but write the fully qualified
name every time, because everything becomes terribly slow in VB.NET 2003 when
the Imports statement is used.

Cor

Jie Wang [MSFT]

Apr 30, 2009, 6:36:24 AM
Hello Mike,

When you say "some web sites don't provide the web page with a src tag", do
you mean those "web pages" actually just let the Acrobat reader take over
the entire browser control (the PDF document fills the whole
control)? Or do those web pages use other means to embed a PDF file as a
portion of the HTML page?

Could you give me a specific sample of the problem you met so I can better
understand the situation and see if we can come up with a solution to it?

Thanks,

Jie Wang (jie...@online.microsoft.com, remove 'online.')

Microsoft Online Community Support

Delighting our customers is our #1 priority. We welcome your comments and
suggestions about how we can improve the support we provide to you. Please
feel free to let my manager know what you think of the level of service
provided. You can send feedback directly to my manager at:
msd...@microsoft.com.

==================================================
Get notification to my posts through email? Please refer to
http://msdn.microsoft.com/en-us/subscriptions/aa948868.aspx#notifications.

Note: MSDN Managed Newsgroup support offering is for non-urgent issues
where an initial response from the community or a Microsoft Support
Engineer within 2 business days is acceptable. Please note that each follow
up response may take approximately 2 business days as the support
professional working with you may need further investigation to reach the
most efficient resolution. The offering is not appropriate for situations
that require urgent, real-time or phone-based interactions. Issues of this
nature are best handled working with a dedicated Microsoft Support Engineer
by contacting Microsoft Customer Support Services (CSS) at
http://msdn.microsoft.com/en-us/subscriptions/aa948874.aspx
==================================================
This posting is provided "AS IS" with no warranties, and confers no rights.

LN Mike

Apr 30, 2009, 5:12:01 PM
Some of the "web pages" actually just let the Acrobat reader take over the
entire browser control. I put a breakpoint on my browser control's
"DocumentComplete" event, so I'm able to see each and every page that comes
through. However, with the web sites where I can't get a src tag, I can't open
the document at all. It seems DocumentComplete sends me a PDF file, not an HTML
file, so I can't see outerhtml.

WebDisplay.Document.All(0).outerhtml gives me the following error message:

Run-time exception thrown : System.MissingMemberException - Public member
'All' on type 'IAcroAXDocShim' not found.

But with the web sites that do work, I'm able to see the HTML code that has
the src tag.

LN Mike

May 1, 2009, 10:13:01 AM
I'm not sure, but maybe when WebDisplay.Document.All(0).outerhtml doesn't
work, the document is the PDF file. So how do I copy the object (which I think
is the PDF file) to a local folder?

Jie Wang [MSFT]

May 4, 2009, 10:47:32 AM
If the browser control navigates directly to a PDF file, there will not be
a src tag to extract the file name from. Instead, we can use the
DocumentComplete event to trap the PDF URL:

Private Sub Form1_Load(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles MyBase.Load
    AxWebBrowser1.Navigate("http://research.microsoft.com/en-us/um/people/cmbishop/prml/bishop-prml-sample.pdf")
End Sub

Private Sub AxWebBrowser1_DocumentComplete(ByVal sender As System.Object, _
        ByVal e As AxSHDocVw.DWebBrowserEvents2_DocumentCompleteEvent) _
        Handles AxWebBrowser1.DocumentComplete
    ' e.uRL is the URL of the document that has just finished loading
    MessageBox.Show(e.uRL.ToString())
End Sub

With the code above, we'll see the MessageBox pop up with the URL of the
PDF file.

Will this then help you to download the PDF file?

Regards,

Nikos

May 5, 2009, 2:32:57 PM
Hi,

I would like to ask you how you download the file.

I tried to do that by upgrading an existing vb6 app
using the internet transfer control but it didn't seem to
work.

Will you please tell me?

Thanks in advance.

Jie Wang [MSFT]

May 11, 2009, 4:47:45 AM
Hi Mike,

Any updates on this issue?

If you have any further questions regarding this issue, please kindly let
me know.

Thanks,

LN Mike

May 12, 2009, 12:08:01 PM
No, it didn't work. The DocumentComplete URL is the URL of the web page, not
the URL of the PDF. For the web sites that work, I can see that the PDF file
has a URL ending in "...
cgi-bin/show_temp.pl?file=pdf18445142694797&type=application/pdf", but the web
sites that don't work only show web page URLs like "... doc1/038110817935".

Due to company policy I can't list the full URLs, but I did list the endings
of the URLs, which is where they differ.

LN Mike

May 12, 2009, 12:14:01 PM
I use URLDownloadToFile from urlmon.

Here is a good link on the topic, as of 5/12/09:
http://www.vbforums.com/showthread.php?p=2686017

LN Mike

May 12, 2009, 12:17:03 PM
It still doesn't work; see my other reply above for more detail.

Jie Wang [MSFT]

May 14, 2009, 9:27:40 AM
Hi,

From the form of the URLs we can't really see why some work while others do not.

However, I can think of two ways a server can send a PDF file to the client
so that it fills the entire browser control:

1. Write the PDF stream directly to the response stream, while setting the
content type to application/pdf.

or, it can

2. Send an HTTP redirect code to the client, so the client is responsible
for re-sending the request to the new URL to get the *real* PDF file.

I suspect the "not working" scenario could be caused by the second way.

I'll try to set up an environment to test these two scenarios and see what
exactly happens.
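
One rough way to tell the two cases apart from outside the browser control is to request the URL with automatic redirects turned off and look at the status code and content type. This is only a sketch under assumptions: ProbeUrl and the module name are made up for illustration, and because the request does not share the browser control's login session, a protected site will most likely answer with its login page instead of the document.

```vbnet
Imports System
Imports System.Net

Module RedirectProbe
    ' Hypothetical helper: reports how the server answers a GET for the URL.
    Function ProbeUrl(ByVal url As String) As String
        Dim request As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
        ' Do not follow 3xx responses automatically, so they stay visible.
        request.AllowAutoRedirect = False

        Dim response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        Try
            Dim code As Integer = CInt(response.StatusCode)
            If code >= 300 AndAlso code < 400 Then
                ' Way 2: the server redirects the client to another URL.
                Return "Redirect to " & response.Headers("Location")
            End If
            ' Way 1 would show up here with a content type of application/pdf.
            Return "Direct response, Content-Type: " & response.ContentType
        Finally
            response.Close()
        End Try
    End Function

    Sub Main(ByVal args As String())
        If args.Length = 1 Then
            Console.WriteLine(ProbeUrl(args(0)))
        Else
            Console.WriteLine("Usage: RedirectProbe <url>")
        End If
    End Sub
End Module
```

A "Redirect to ..." result would point at the second way; a direct response with application/pdf would point at the first.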

Could you let me know what happens if you manually navigate to one of the
non-working URLs, like "... doc1/038110817935", in an IE
browser? Do you get the PDF file?

Thanks,

LN Mike

May 14, 2009, 12:27:01 PM
If I manually navigate to that URL in an IE browser, I get the login screen,
because a new session has been created.

If I redirect the web control to the URL "... doc1/038110817935", it
shows the web page, not the PDF file. This occurs for both the web sites that
work and the ones that don't.

To clarify: for the web sites that work, DocumentComplete captures
/doc1/038110817935
and cgi-bin/show_temp.pl?file=pdf18445142694797&type=application/pdf
For the web sites that don't work, DocumentComplete captures only
/doc1/038110817935
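
A small filter along these lines could separate the two URL shapes listed above inside a DocumentComplete handler. It is only a sketch: LooksLikePdfUrl is a made-up helper name, and its rules are guessed from the two sample endings, so other sites may need different checks.

```vbnet
Imports System

Module PdfUrlCheck
    ' Hypothetical helper: guesses whether a URL points at a PDF.
    Function LooksLikePdfUrl(ByVal url As String) As Boolean
        If url Is Nothing Then Return False
        Dim lower As String = url.ToLower()

        ' Case 1: the query string declares the content type.
        If lower.IndexOf("type=application/pdf") >= 0 Then Return True

        ' Case 2: the path (ignoring any query string) ends in .pdf.
        Dim queryStart As Integer = lower.IndexOf("?"c)
        If queryStart >= 0 Then lower = lower.Substring(0, queryStart)
        Return lower.EndsWith(".pdf")
    End Function

    Sub Main()
        ' Prints True for the show_temp.pl ending and False for the doc1 ending.
        Console.WriteLine(LooksLikePdfUrl("cgi-bin/show_temp.pl?file=pdf18445142694797&type=application/pdf"))
        Console.WriteLine(LooksLikePdfUrl("/doc1/038110817935"))
    End Sub
End Module
```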

Jie Wang [MSFT]

May 20, 2009, 8:10:51 AM
Hi,

I've found another way to get the URL of the PDF file currently being
displayed "full screen" in the web browser control.

When the PDF document is displayed "full screen" (or shall we say
"full control") in the web browser control, the Document property
returns an IAcroAXDocShim interface instead of an HTML DOM interface. This
matches the error description in your second post.

So all we need to do is check the type of the Document property. If
it is IAcroAXDocShim, we can read the src property on that interface to get
the URL of the PDF file.

The code looks like this (suppose I have a WebBrowser control named
AxWebBrowser1):

If (TypeOf AxWebBrowser1.Document Is AcroPDFLib.IAcroAXDocShim) Then
    ' this is a full screen PDF file
    Dim pdfSrc As String
    pdfSrc = CType(AxWebBrowser1.Document, AcroPDFLib.IAcroAXDocShim).src
Else
    ' this is a normal HTML page, process the page.
End If

To access the IAcroAXDocShim interface, you need to add a COM reference to
your project, named "Adobe Acrobat Browser Control Type Library 1.0".

What I'm not sure about is whether we can download the file even once we have
the URL. In some cases, for example if the page request requires an
authenticated session, this approach may still fail. I tried to save the PDF
file via the IAcroAXDocShim interface, but failed to find a way to do so. This
control is made by Adobe and I was not able to find documentation on how to
use it on their website.

Anyway, please let me know if IAcroAXDocShim can help you get the URL
first. Then we'll think of a way to get the file via the URL. You can also
try Adobe's online forum to ask questions related to the IAcroAXDocShim
interface to get more information.

Best regards,

LN Mike

May 20, 2009, 1:31:01 PM
I put your code in, and it returns the URL of the web page, not the URL of the
PDF file.

Returns "… /doc1/038110817935"
I want "… /cgi-bin/show_temp.pl?file=pdf44974838787197&type=application/pdf"

This is a challenging problem, to say the least. I searched the Adobe site,
the Adobe forums and Google, and didn't find anything relevant in any of those
oceans. I posted a new question on the Adobe forums; I will keep this thread
updated with any helpful info. It might be a few days.

Jie Wang [MSFT]

May 22, 2009, 6:19:01 AM
I was hoping to find a way to get the PDF document object from the
IAcroAXDocShim interface we have in hand, so that we could save the document
using its object model. However, there is no apparent way of doing that;
I'm still trying to figure it out.

Another notable method on the IAcroAXDocShim interface is execCommand. I
don't know if there is some command that can be executed through it to save
the PDF to local disk (there should be one, because the PDF reader control
itself has a save button on the UI, so I guess there should be a corresponding
object model way to do that). But the lack of documentation from Adobe makes
it hard to figure out, too.

Hope these clues help in some way.

Regards,

LN Mike

May 22, 2009, 1:07:03 PM
I hate Adobe.
After reading over 320 pages of the Adobe Interapplication Communication API
Reference, plus posts from the Adobe forums and code samples, I've come up
with this.


Dim PDDoc As Object
Dim AVDoc As Object
Dim AcroExchApp As Object
Dim AVDocTarget As Object

AcroExchApp = CreateObject("AcroExch.App")
AVDocTarget = CreateObject("AcroExch.AVDoc")
AVDoc = AcroExchApp.GetActiveDoc
PDDoc = AVDocTarget.GetPDDoc

PDDoc.Save(1 Or 4 Or 32, "C:\IMAGE\test.pdf")

PDDoc.Close()
PDDoc = Nothing
AVDoc.Close(True)
AVDoc = Nothing

AcroExchApp.Exit()
AcroExchApp = Nothing
AVDocTarget.Exit()
AVDocTarget = Nothing


But AVDoc and PDDoc are coming back as Nothing, so PDDoc.Save will fail
because PDDoc is Nothing.

I might be on the wrong track, and need to get back to just using the
IAcroAXDocShim interface.
I researched execCommand. Like you said, there is no documentation; in fact
the white paper "Adobe Interapplication Communication API Reference" doesn't
even list it as one of the methods.

Some of the code I got from
http://support.adobe.com/devsup/devsup.nsf/docs/51415.htm

I need to vent some more: I hate Adobe. 99% of what I read was crap, but I
had to read through it to find the 1% that I did need. I also talked
with Adobe tech support; they were no help.

Let's keep looking into IAcroAXDocShim execCommand, unless you know why
AVDoc and PDDoc are coming back as Nothing, as described above.

LN Mike

May 22, 2009, 3:34:02 PM
What about IDownloadManager::Download Method ?

Jie Wang [MSFT]

May 25, 2009, 9:35:59 AM
> What about IDownloadManager::Download Method ?

I'm not sure how I could cast the document to the IDownloadManager interface.

I was looking at the interfaces implemented by the Adobe PDF Reader object.
It implements the IPersistFile interface, and I tried to call the Save method
of that interface. However, I got a Not Implemented exception.

I'll try checking the ROT (Running Object Table) to see what other COM
interfaces I can get.

I don't know why Adobe implemented a LoadFile method on the IAcroAXDocShim
interface, but didn't put a SaveAs method there.

Meanwhile, please keep trying to get some help from Adobe. I will keep
assisting on this issue and see if there are any other workarounds beyond
dealing with the Acrobat object model, but it just looks weird that an MSFT
engineer is supporting Adobe products. ;-)

LN Mike

May 26, 2009, 2:01:04 PM
On the Adobe scripts forum I posted "Save PDF in AxSHDocVw.AxWebBrowser". An
Adobe tech employee said it can't be done: the browser control doesn't provide
any functionality for saving PDFs.

I will continue to research.

Jie Wang [MSFT]

May 29, 2009, 7:23:01 AM
Since the Acrobat Reader ActiveX control route is a dead end, I need to
clarify one thing with you: do these pages providing PDF files require a
logon before you can get to the file? Or is any other authentication needed?

Thanks,

LN Mike

Jun 1, 2009, 6:18:02 PM
Yes, a login is required.

I would like to try other interfaces. Back on 5/25 you posted, "I was
looking at the interfaces implemented by the Adobe PDF Reader object,
it implemented the IPersistFile interface and I tried to call Save method
of that interface. However, I got a Not Implemented exception.

I'll try check the ROT to see what else COM interfaces I can get."

I too received a "Not implemented" error.

I want to try to just save the PDF file. How can I get the web browser
control AxSHDocVw.AxWebBrowser to save the PDF file it is displaying? I tried

CType(WebDisplay.Document, UCOMIPersistFile).Save("C:\IMAGE\test.pdf", True)

but error with

Jie Wang [MSFT]

Jun 2, 2009, 7:12:26 AM
If a login is required, that means the web page feeding the PDF file will
check the cookie/session on the server side before sending the file. So
how would you use a separate download class, which is not likely to have
access to the browser control's session, to request the file from the server
page? I don't think there is a regular way to do that.

And since Adobe said there is no way to get the PDF file from the PDF
Reader control, it looks like our only hope is to try extracting the PDF
file from the IE cache.

I'll check the possibility of the cache approach and get back here.

Regards,

Jie Wang

LN Mike

Jun 2, 2009, 4:09:01 PM
I almost solved this.

WebDisplay.ExecWB(SHDocVw.OLECMDID.OLECMDID_SAVEAS, _
    SHDocVw.OLECMDEXECOPT.OLECMDEXECOPT_DONTPROMPTUSER, _
    "C:\IMAGE\test.pdf", "C:\IMAGE\test.pdf")

This saves the PDF that is displayed in the web browser. However, a prompt
comes up even though the "dontpromptuser" parameter is used. There are lots of
posts on the net about this; one post stated that after IE4, MS plugged the
security hole and now requires the prompt regardless of "dontpromptuser".

Do you know how to make ExecWB not prompt the user on Save As?

I tried OLECMDID_SAVE, but nothing happened; no file was saved to the HD.

Jie Wang [MSFT]

Jun 4, 2009, 12:04:17 AM
First I have to say the ExecWB idea is brilliant. I thought that only works
with HTML documents and ignored it as a possible solution, what a mistake!

Regarding the save dialog, there is no way to suppress it. However, we can
use another thread to automate the dialog:

' Requires Imports System.Runtime.InteropServices and Imports System.Threading
' at the top of the file.
' Note: UInteger and Thread.Start(Object) need VB 2005 (.NET 2.0) or later.

<DllImport("user32.dll", SetLastError:=True, CharSet:=CharSet.Ansi)> _
Private Shared Function FindWindowEx(ByVal parentHandle As IntPtr, _
        ByVal childAfter As IntPtr, _
        ByVal lclassName As String, _
        ByVal windowTitle As String) As IntPtr
End Function

<DllImport("user32.dll", SetLastError:=True, CharSet:=CharSet.Auto)> _
Private Shared Function SendMessage( _
        ByVal hWnd As IntPtr, _
        ByVal Msg As UInteger, _
        ByVal wParam As IntPtr, _
        ByVal lParam As IntPtr) As IntPtr
End Function

Private Const WM_SETTEXT As UInteger = &HC
Private Const BM_CLICK As UInteger = &HF5
Private Const MutexName As String = "SavePDFMutex"

Private Sub SavePDF(ByVal param As Object)
    Dim fileName As String = CType(param, String)
    Dim timeOut As Integer = 5
    Dim hWndSaveAs As IntPtr

    Thread.Sleep(500)

    Do While True
        ' Get the Adobe Reader "Save a Copy..." dialog window handle
        hWndSaveAs = FindWindowEx(IntPtr.Zero, IntPtr.Zero, "#32770", _
            "Save a Copy...")

        If hWndSaveAs = IntPtr.Zero Then
            Thread.Sleep(1000)
            timeOut = timeOut - 1

            If timeOut = 0 Then
                ' 5 seconds timeout, still can't find the dialog
                Throw New ApplicationException( _
                    "Unable to find the Save dialog window")
            End If
        Else
            ' Dialog found, proceed.
            Exit Do
        End If
    Loop

    Dim lpFileName As IntPtr = Marshal.StringToHGlobalAuto(fileName)

    Try
        Dim hWndCboEx As IntPtr = FindWindowEx(hWndSaveAs, IntPtr.Zero, _
            "ComboBoxEx32", Nothing)
        Dim hWndCbo As IntPtr = FindWindowEx(hWndCboEx, IntPtr.Zero, _
            "ComboBox", Nothing)
        Dim hWndTxt As IntPtr = FindWindowEx(hWndCbo, IntPtr.Zero, "Edit", _
            Nothing)
        Dim hWndSave As IntPtr = FindWindowEx(hWndSaveAs, IntPtr.Zero, _
            "Button", "Save")

        ' Set the filename
        SendMessage(hWndTxt, WM_SETTEXT, IntPtr.Zero, lpFileName)
        ' Click on the button
        SendMessage(hWndSave, BM_CLICK, IntPtr.Zero, IntPtr.Zero)
    Finally
        Marshal.FreeHGlobal(lpFileName)
    End Try
End Sub

Now at the time you want to save the PDF, use the following code:

' Since you're going to have more than one instance of the application
' running, the mutex will make sure there will be only one save dialog
' at a time.
Dim mu As New Mutex(False, MutexName)
Dim t As New Thread(AddressOf SavePDF)

mu.WaitOne()

' Start the save PDF thread, passing the filename to be saved.
t.Start("D:\SavedPDF" & Guid.NewGuid().ToString("N") & ".pdf")

AxWebBrowser1.ExecWB(SHDocVw.OLECMDID.OLECMDID_SAVEAS, _
    SHDocVw.OLECMDEXECOPT.OLECMDEXECOPT_DONTPROMPTUSER)

' Wait until the thread exits
t.Join()
mu.ReleaseMutex()
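
Because the mutex is named, every running instance that opens "SavePDFMutex" shares the same system-wide lock. One possible refinement, sketched below, is to release it in a Finally block so that an exception during the save cannot leave the other instances blocked. RunExclusive, DoSave and the module name are hypothetical names used only for this illustration, and DoSave merely stands in for the ExecWB call plus the dialog thread.

```vbnet
Imports System
Imports System.Threading

Module SaveGate
    ' Same name as in the code above; any process using it shares the mutex.
    Private Const MutexName As String = "SavePDFMutex"

    ' Hypothetical wrapper: runs the save action while holding the mutex.
    Sub RunExclusive(ByVal saveAction As ThreadStart)
        Dim mu As New Mutex(False, MutexName)
        mu.WaitOne()
        Try
            saveAction.Invoke()
        Finally
            ' Released even if the save throws, so other instances can proceed.
            mu.ReleaseMutex()
            mu.Close()
        End Try
    End Sub

    ' Placeholder for the real save logic.
    Sub DoSave()
        Console.WriteLine("Saving while holding the mutex")
    End Sub

    Sub Main()
        RunExclusive(AddressOf DoSave)
    End Sub
End Module
```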

**************************

Another possible alternative to the ExecWB is the URLDownloadToFile
function.

<DllImport("urlmon.dll", CharSet:=CharSet.Auto, PreserveSig:=False)> _
Private Shared Sub URLDownloadToFile( _
        <MarshalAs(UnmanagedType.IUnknown)> ByVal pCaller As Object, _
        ByVal szURL As String, _
        ByVal szFileName As String, _
        ByVal dwReserved As Integer, _
        ByVal lpfnCB As IntPtr)
End Sub

Now at the time you want to save the PDF, use the following code:

If (TypeOf AxWebBrowser1.Document Is AcroPDFLib.IAcroAXDocShim) Then
    ' Get the PDF source URL
    Dim url As String = CType(AxWebBrowser1.Document, _
        AcroPDFLib.IAcroAXDocShim).src
    ' Download the PDF file in the web browser control's context
    URLDownloadToFile(AxWebBrowser1.GetOcx(), url, "D:\test.pdf", 0, _
        IntPtr.Zero)
End If

Please let me know how these two solutions work.

Thanks,

LN Mike

Jun 9, 2009, 1:47:01 PM
The second solution won't work because src returns the URL of the web page,
not the URL of the PDF. If the PDF is local, src returns the URL of
the PDF, but if the PDF is online on someone's server, src returns
the web page URL.

The first solution might work. I'm waiting to hear back from the webmaster to
see if the remaining sites can provide me the temp page that has the PDF URL.
I'll keep you updated.

Jie Wang [MSFT]

Jun 9, 2009, 11:45:05 PM
Hi,

The URLDownloadToFile approach works if the "web page" on the server side
actually writes a stream of the PDF file to the response.

In my test, if that is the case, URLDownloadToFile can actually get the
PDF file saved to the disk:

URLDownloadToFile(AxWebBrowser1.GetOcx(), _
    "http://testSever/getPDF.aspx?file=test.pdf", "D:\test.pdf", 0, IntPtr.Zero)

So why not have a try if you got a minute or two? Just one line of code. :)
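
When trying this, it may help to see the HRESULT and to confirm that what landed on disk really is a PDF and not, say, a login page. The sketch below is an assumption-laden variant rather than the code from my test: it declares the same API as a Function so the HRESULT comes back as a return value, and TryDownload and StartsWithPdfHeader are made-up helper names.

```vbnet
Imports System
Imports System.IO
Imports System.Runtime.InteropServices
Imports System.Text

Module DownloadCheck
    ' Declared as a Function so the HRESULT is returned instead of being
    ' converted into an exception (PreserveSig defaults to True).
    <DllImport("urlmon.dll", CharSet:=CharSet.Auto)> _
    Private Function URLDownloadToFile( _
            <MarshalAs(UnmanagedType.IUnknown)> ByVal pCaller As Object, _
            ByVal szURL As String, _
            ByVal szFileName As String, _
            ByVal dwReserved As Integer, _
            ByVal lpfnCB As IntPtr) As Integer
    End Function

    ' Hypothetical helper: True when the download reported S_OK (0).
    Function TryDownload(ByVal caller As Object, ByVal url As String, _
            ByVal path As String) As Boolean
        Dim hr As Integer = URLDownloadToFile(caller, url, path, 0, IntPtr.Zero)
        If hr <> 0 Then
            Console.WriteLine("URLDownloadToFile failed, HRESULT = 0x" & hr.ToString("X8"))
        End If
        Return hr = 0
    End Function

    ' Hypothetical helper: PDF files begin with the ASCII bytes "%PDF".
    Function StartsWithPdfHeader(ByVal path As String) As Boolean
        Dim fs As New FileStream(path, FileMode.Open, FileAccess.Read)
        Try
            Dim buf(3) As Byte
            Dim n As Integer = fs.Read(buf, 0, 4)
            Return n = 4 AndAlso Encoding.ASCII.GetString(buf) = "%PDF"
        Finally
            fs.Close()
        End Try
    End Function
End Module
```

A call such as TryDownload(AxWebBrowser1.GetOcx(), url, "D:\test.pdf") followed by StartsWithPdfHeader("D:\test.pdf") would then show whether the server sent the PDF itself or an HTML page.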

Anyway, I'll keep watching this post for your update.

Best regards,
