Rename collate_chars -> get_text
collate_chars still available, to avoid breaking scripts that use it.
jsvine committed Mar 7, 2016
1 parent bc9d541 commit 1b861b9
Showing 4 changed files with 7 additions and 6 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -81,7 +81,7 @@ The `pdfplumber.Page` class is at the core of `pdfplumber`. Most things you'll d
- By default, the cropped page retains objects that fall at least partly within the bounding box. If an object falls only partly within the box, its dimensions are sliced to fit the bounding box.
- Calling `.crop` with `strict=True`, however, retains only objects that fall *entirely* within the bounding box.

-- `.collate_chars(x_tolerance=0, y_tolerance=0)`: Collates all of the page's character objects into a single string. Adds spaces where the difference between the `x1` of one character and the `x0` of the next is greater than `x_tolerance`. Adds newline characters where the difference between the `doctop` of one character and the `doctop` of the next is greater than `y_tolerance`.
+- `.get_text(x_tolerance=0, y_tolerance=0)`: Collates all of the page's character objects into a single string. Adds spaces where the difference between the `x1` of one character and the `x0` of the next is greater than `x_tolerance`. Adds newline characters where the difference between the `doctop` of one character and the `doctop` of the next is greater than `y_tolerance`.

- `.extract_table(...)`: Extracts tabular data from the page. For more details see "[Extracting tables](#extracting-tables)" below.

4 changes: 2 additions & 2 deletions examples/notebooks/extract-table-nics.ipynb
@@ -294,7 +294,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"### Use `collate_chars` to extract the report month\n",
+"### Use `get_text` to extract the report month\n",
"\n",
"It looks like the month of the report is listed in an area 35px to 65px from the top of the page. But there's also some other text directly above and below it. So when we crop for that area, we'll use `strict=True` to select only characters (and other objects) that are fully within the crop-box."
]
@@ -329,7 +329,7 @@
}
],
"source": [
-"month_chars = month_crop.collate_chars(x_tolerance=2, y_tolerance=2)\n",
+"month_chars = month_crop.get_text(x_tolerance=2, y_tolerance=2)\n",
"month_chars"
]
},
4 changes: 2 additions & 2 deletions pdfplumber/page.py
@@ -171,8 +171,8 @@ def use_strategy(param, name):

return table

-    def collate_chars(self, x_tolerance=0, y_tolerance=0):
-        return utils.collate_chars(self.chars,
+    def get_text(self, x_tolerance=0, y_tolerance=0):
+        return utils.get_text(self.chars,
             x_tolerance=x_tolerance,
             y_tolerance=y_tolerance)

3 changes: 2 additions & 1 deletion pdfplumber/utils.py
@@ -47,7 +47,7 @@ def collate_line(line_chars, tolerance=0):
coll += char["text"]
return coll

-def collate_chars(chars, x_tolerance=0, y_tolerance=0):
+def get_text(chars, x_tolerance=0, y_tolerance=0):
     if len(chars) == 0:
         raise Exception("List of chars is empty.")

@@ -69,6 +69,7 @@ def collate_chars(chars, x_tolerance=0, y_tolerance=0):
     coll = "\n".join(lines)
     return coll
 
+collate_chars = get_text

def find_gutters(chars, orientation, min_size=5):
"""
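
The backward-compatibility line `collate_chars = get_text` in the `utils.py` hunk is a plain module-level alias. The sketch below shows what that pattern does, using a stand-in `get_text` body since the real one is elided in this diff. The `collate_chars_with_warning` wrapper is a hypothetical alternative that is not part of this commit; it is included only to contrast a silent alias with one that notifies callers.

```python
import warnings


def get_text(chars, x_tolerance=0, y_tolerance=0):
    # Stand-in body for illustration only: concatenates each char's text.
    return "".join(c["text"] for c in chars)


# Plain alias, as in the commit: both names refer to the same function object,
# so scripts that still call collate_chars keep working unchanged.
collate_chars = get_text


def collate_chars_with_warning(*args, **kwargs):
    # Hypothetical alternative (NOT in the commit): warn, then delegate.
    warnings.warn(
        "collate_chars is deprecated; use get_text",
        DeprecationWarning,
        stacklevel=2,
    )
    return get_text(*args, **kwargs)


demo_chars = [{"text": "a"}, {"text": "b"}]
print(collate_chars is get_text)
print(collate_chars(demo_chars))
```

A plain alias costs nothing and cannot drift out of sync with the new function, but it gives old callers no signal that the name has changed. Note also that the alias shown here lives in `utils.py`; the `page.py` hunk in this commit renames the method without showing an equivalent alias.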
