differing look of response


Malik Rumi

May 27, 2017, 2:58:52 PM
to scrapy-users

This is the output when I run my spider from the command line with scrapy crawl:


' <p class="indent">The petitioner brought this suit for '

'damages under the Jones Act,<a class="footnote" href="#fn1" '

'id="fn1_ref">1</a> alleging that her husband while employed by '

'the respondent railroad as a tug fireman was drowned because of '

'the negligent failure of respondent to provide him with a safe '

'place to work. The District Judge directed the jury to return a '


Here is the same text in the shell with response.text:


In [2]: response.text

\n <p class="date">Argued March 27, 28, 1956.</p>\n <p class="date">Decided April 9, 1956.</p>\n <div class="prelims">\n <p class="indent">Mr. Nathan Baker, New York City, for petitioner.</p>\n <p class="indent">Mr. Joseph P. Allen, New York City, for respondent.</p>\n <p class="indent">Mr. Justice BLACK delivered the opinion of the Court.</p>\n </div>\n <div class="num" id="p1">\n <span class="num">1</span>\n <p class="indent">The petitioner brought this suit for damages under the Jones Act,<a class="footnote" href="#fn1" id="fn1_ref">1</a> alleging that her husband while employed by the respondent railroad as a tug fireman was drowned because of the negligent failure of respondent to provide him with a safe place to work. The District Judge directed the jury to


And the same text again using fetch:


<p class="date">Argued March 27, 28, 1956.</p>

<p class="date">Decided April 9, 1956.</p>

<div class="prelims">

<p class="indent">Mr. Nathan Baker, New York City, for petitioner.</p>

<p class="indent">Mr. Joseph P. Allen, New York City, for respondent.</p>

<p class="indent">Mr. Justice BLACK delivered the opinion of the Court.</p>

</div>

<div class="num" id="p1">

<span class="num">1</span>

<p class="indent">The petitioner brought this suit for damages under the Jones Act,<a class="footnote" href="#fn1" id="fn1_ref">1</a> alleging that her husband while employed by the respondent railroad as a tug fireman was drowned because of the negligent failure of respondent to provide him with a safe place to work. The District Judge directed the jury to


<<>>


The first result comes from my spider, which omits all the preliminary material. Note how it is squeezed in on both sides, as if it were set in a narrow column (the left-hand indentation didn't copy over). It has newlines only at the end of a paragraph, but each displayed line appears to be its own string, wrapped in single quotes. The spider also uses items, if that makes a difference.


The second is response.text in the shell. It shows \n newlines but appears to be a single unformatted string. Note how it runs all the way to both margins.


The third is using fetch. Here the output "respects" the HTML tags and layout, and shows no literal \n sequences. It also runs margin to margin.
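The three looks described above can be reproduced without Scrapy at all. This is a minimal sketch, assuming the spider output was copied from the crawl log, where items are (as far as I can tell) rendered through Python's pprint, and assuming the shell output is simply the interactive echo of a string. The text and item values below are made-up stand-ins, not the real response.

```python
import ast
from pprint import pformat

# Hypothetical sample standing in for one scraped field.
text = (
    '\n <p class="indent">The petitioner brought this suit for damages '
    'under the Jones Act, alleging that her husband while employed by '
    'the respondent railroad as a tug fireman was drowned.</p>\n'
)
item = {"body": text}

# 1. pprint wraps a long string into adjacent quoted pieces so that each
#    displayed line fits its default width of 80 characters.
wrapped = pformat(item)
print(wrapped)

# 2. repr() is what an interactive prompt echoes: one long line, with
#    newlines shown as the two characters \ and n.
echoed = repr(text)
print(echoed)

# 3. print() writes the characters themselves, so each newline becomes a
#    real line break on screen.
print(text)

# The wrapped form is only a display: it evaluates back to the same data.
restored = ast.literal_eval(wrapped)
```

If that assumption holds, the quoted fragments in the first output are a display artifact and the stored string is the same in all three cases.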


Neither the second nor the third example goes through my spider, even when I tried the --spider=myspider flag, which is why all the preliminary material is still in there.


And here is a copy-paste of the same text from the original, which also goes margin to margin:


The petitioner brought this suit for damages under the Jones Act,1 alleging that her husband while employed by the respondent railroad as a tug fireman was drowned because of the negligent failure of respondent to provide him with a safe place to work. The District Judge directed the jury to return a


So my question is: why am I getting these differences, and how do I take control of the output?
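On the "take control" part, one option is to normalize the whitespace in the extracted string yourself and wrap it only when displaying it. This is a minimal sketch using the standard library alone; normalize and rewrap are hypothetical helper names, and the 72-character width is an arbitrary choice.

```python
import textwrap


def normalize(text):
    """Collapse every run of whitespace (newlines included) to one space."""
    return " ".join(text.split())


def rewrap(text, width=72):
    """Return the normalized text wrapped at a width chosen by the caller."""
    return textwrap.fill(normalize(text), width=width)


# Hypothetical sample with the ragged whitespace typical of scraped HTML text.
sample = (
    "\n   The petitioner brought this suit for\n damages under the "
    "Jones Act, alleging that her husband   was drowned.\n"
)

clean = normalize(sample)          # single line, single spaces
print(clean)
print(rewrap(sample, width=40))    # same words, wrapped at 40 columns
```

Storing the normalized string in the item and wrapping only at display time keeps the data independent of whichever tool prints it.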
