public class SourceText
extends java.lang.Object
FilterHtml class that it uses to do the parsing.

| Constructor and Description |
|---|
| `SourceText()` Create a new instance of SourceText. |

| Modifier and Type | Method and Description |
|---|---|
| `java.util.ArrayList<java.net.URL>` | `getHyperLinks()` Return the list of parsed hyperlinks. |
| `java.lang.String` | `parseHTML(java.lang.String htmlSource, SourceConfig parseConfig)` Parse the HTML source as defined by `parseConfig`. |
| `java.lang.String` | `parseHTML(java.net.URL htmlUri, SourceConfig parseConfig)` Read the source URL and parse the HTML source as defined by `parseConfig`. |
| `java.util.ArrayList<java.net.URL>` | `parseHyperLinks(java.lang.String theText)` Parse all hyperlinks in the text if any exist and keep all found. |
| `java.util.ArrayList<java.net.URL>` | `parseHyperLinks(java.lang.String theText, java.util.ArrayList<java.lang.String> toKeep, java.lang.String formatType, boolean resetHyperList)` Parse all hyperlinks in the text if any exist, but only keep certain types of link. |
| `java.lang.String` | `readableContent(java.lang.String htmlSource)` Get only the readable content of the HTML document. |
| `java.lang.String` | `readableContent(java.net.URL htmlUri)` Get only the readable content of the HTML document. |
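
To make "readable content" concrete, the following is a minimal, self-contained sketch of one way markup could be stripped from an HTML string. It is an illustration only and not SourceText's implementation: the class name `ReadableContentSketch` is hypothetical, the regex approach is an assumption, and only a handful of entities are decoded.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not part of the documented library.
public class ReadableContentSketch {
    // Matches <script> and <style> elements together with their bodies.
    private static final Pattern NON_TEXT_BLOCKS =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Matches HTML comments and any remaining tag.
    private static final Pattern TAGS =
            Pattern.compile("(?s)<!--.*?-->|<[^>]+>");

    /** Returns htmlSource with markup removed and whitespace collapsed. */
    public static String readableContent(String htmlSource) {
        String text = NON_TEXT_BLOCKS.matcher(htmlSource).replaceAll(" ");
        text = TAGS.matcher(text).replaceAll(" ");
        // Decode a few common entities; "&amp;" goes last so that
        // "&amp;lt;" becomes "&lt;" rather than "<".
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&amp;", "&");
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head>"
                + "<body><h1>Title</h1><p>Fish &amp; chips</p></body></html>";
        System.out.println(readableContent(html));
    }
}
```

A regex-based approach like this is fragile on malformed markup; a real HTML parser is the safer choice when input quality is unknown.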

public java.lang.String parseHTML(java.net.URL htmlUri, SourceConfig parseConfig) throws java.lang.Exception

Read the source URL and parse the HTML source as defined by parseConfig. To automatically remove all of the markup, you can try readableContent instead.

Parameters:
htmlUri - URI of the HTML file.
parseConfig - defines what parsing should be done. Can be null when no text removal is performed.
Throws:
java.lang.Exception - any error.

public java.lang.String parseHTML(java.lang.String htmlSource, SourceConfig parseConfig) throws java.lang.Exception

Parse the HTML source as defined by parseConfig. To automatically remove all of the markup, you can try readableContent instead.

Parameters:
htmlSource - the source text in HTML format.
parseConfig - defines what parsing should be done. Can be null for default settings.
Throws:
java.lang.Exception - any error.

public java.lang.String readableContent(java.net.URL htmlUri) throws java.lang.Exception

Get only the readable content of the HTML document.

Parameters:
htmlUri - URI of the HTML file.
Throws:
java.lang.Exception - any error.

public java.lang.String readableContent(java.lang.String htmlSource) throws java.lang.Exception

Get only the readable content of the HTML document.

Parameters:
htmlSource - the source text in HTML format.
Throws:
java.lang.Exception - any error.

public java.util.ArrayList<java.net.URL> parseHyperLinks(java.lang.String theText) throws java.net.MalformedURLException, java.lang.Exception

Parse all hyperlinks in the text if any exist and keep all found. The links are kept in the hyperLinks list, which is reset first, and HTML format is assumed.

Parameters:
theText - the text with the hyperlink descriptions.
Throws:
java.net.MalformedURLException - a hyperlink cannot be created.
java.lang.Exception - any error.

public java.util.ArrayList<java.net.URL> parseHyperLinks(java.lang.String theText, java.util.ArrayList<java.lang.String> toKeep, java.lang.String formatType, boolean resetHyperList) throws java.net.MalformedURLException, java.lang.Exception

Parse all hyperlinks in the text if any exist, but only keep certain types of link. The links are kept in the hyperLinks list.

Parameters:
theText - the text with the hyperlink descriptions.
toKeep - a list of descriptions of link types to keep. If null or empty, all link types are kept. The supported types are currently HtmlConst.HTTPLINKTYPE or IMAGELINKTYPE.
formatType - the format of the text. Can be VendorEngine.HTML or JSON.
resetHyperList - true to reset the hyperlink list first.
Throws:
java.net.MalformedURLException - a hyperlink cannot be created.
java.lang.Exception - any error.

public java.util.ArrayList<java.net.URL> getHyperLinks()

Return the list of parsed hyperlinks.
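
The interplay of toKeep, resetHyperList and getHyperLinks() described for parseHyperLinks can be sketched with a small stand-alone class. This is an assumption-laden illustration, not the library's code: `HyperLinkSketch`, `HTTP_LINK` and `IMAGE_LINK` are hypothetical names (the actual HtmlConst values are not shown here), only HTML input is handled so the formatType parameter is omitted, `<a href>` is treated as an HTTP link and `<img src>` as an image link, and relative links are not resolved.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of the documented library.
public class HyperLinkSketch {
    // Stand-ins for the link-type descriptions (assumed values).
    public static final String HTTP_LINK = "http";
    public static final String IMAGE_LINK = "image";

    // Group 1: tag name (a or img); group 2: the quoted href/src value.
    private static final Pattern LINK = Pattern.compile(
            "(?i)<(a|img)\\b[^>]*?\\b(?:href|src)\\s*=\\s*[\"']([^\"']+)[\"']");

    // Accumulates links across calls unless a reset is requested.
    private final ArrayList<URL> hyperLinks = new ArrayList<>();

    public ArrayList<URL> parseHyperLinks(String theText, List<String> toKeep,
                                          boolean resetHyperList)
            throws MalformedURLException {
        if (resetHyperList) {
            hyperLinks.clear();
        }
        // A null or empty filter keeps every link type.
        boolean keepAll = (toKeep == null || toKeep.isEmpty());
        Matcher m = LINK.matcher(theText);
        while (m.find()) {
            String type = m.group(1).equalsIgnoreCase("img") ? IMAGE_LINK : HTTP_LINK;
            if (keepAll || toKeep.contains(type)) {
                // Throws MalformedURLException for relative or invalid links.
                hyperLinks.add(new URL(m.group(2)));
            }
        }
        return hyperLinks;
    }

    public ArrayList<URL> getHyperLinks() {
        return hyperLinks;
    }

    public static void main(String[] args) throws MalformedURLException {
        String text = "<a href=\"https://example.com/page\">page</a> "
                + "<img src=\"https://example.com/logo.png\">";
        HyperLinkSketch sketch = new HyperLinkSketch();
        // Keep only image links, starting from an empty list.
        sketch.parseHyperLinks(text, Arrays.asList(IMAGE_LINK), true);
        // Add the HTTP links without resetting, so both kinds accumulate.
        sketch.parseHyperLinks(text, Arrays.asList(HTTP_LINK), false);
        System.out.println(sketch.getHyperLinks());
    }
}
```

Passing false for the reset flag lets several documents contribute to one list, which is then read back through getHyperLinks().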